Priscables Electronics | Data Prep

Marketing Analytics

Author

Uzomah Teslim

Published

March 30, 2025

1. Introduction

Prisca runs a thriving online electronics store, Priscables, but she’s sitting on a goldmine of untapped insights: a full year of transaction data (186,850 rows). Her challenge?

  • She wants to understand her customers better (Who are they? Where are they?).
  • She needs to identify the best times to run ads or offer discounts to boost sales.
  • She’s unsure how to optimize her product lineup (What’s selling? What’s not?).

Without analyzing this data, she’s missing opportunities to grow strategically.

2. Project Goal

Transform raw transaction data into actionable insights to:

  • Understand customer behavior (peak shopping times, popular locations).
  • Improve marketing decisions (when/where to run ads, offer promotions).
  • Optimize product offerings (highlight top sellers, phase out underperformers).

Download Dataset

Load libraries

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff

import warnings
warnings.filterwarnings('ignore')
# List of CSV files to be merged
sales_list = [
    "Sales_January.csv", 
    "Sales_February.csv", 
    "Sales_March.csv",
    "Sales_April.csv", 
    "Sales_May.csv", 
    "Sales_June.csv",
    "Sales_July.csv",
    "Sales_August.csv", 
    "Sales_September.csv",
    "Sales_October.csv",
    "Sales_November.csv", 
    "Sales_December.csv"
]

# Create an empty list to store dataframes
df_list = []

# Loop through the list of CSV files and read each file into a dataframe
for sales_file in sales_list:
    df = pd.read_csv(sales_file)
    df_list.append(df)

# Merge the dataframes in the list
merged_sales_df = pd.concat(df_list)

# Write the merged dataframe to a CSV file
merged_sales_df.to_csv('sales_data.csv', index=False)
df = pd.read_csv('sales_data.csv')
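
As a side note, hard-coding the twelve file names isn’t strictly necessary. Here is a minimal glob-based sketch, assuming the monthly files sit in the working directory and match the Sales_*.csv pattern:

# Hypothetical alternative: discover the monthly files with glob
import glob

sales_list = sorted(glob.glob('Sales_*.csv'))

# sorted() orders the files alphabetically (April, August, ...), not
# chronologically, but the order doesn't matter once they are concatenated
merged_sales_df = pd.concat([pd.read_csv(f) for f in sales_list])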

3. Data Understanding

# Short Info of data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186850 entries, 0 to 186849
Data columns (total 6 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Order ID          186305 non-null  object
 1   Product           186305 non-null  object
 2   Quantity Ordered  186305 non-null  object
 3   Price Each        186305 non-null  object
 4   Order Date        186305 non-null  object
 5   Purchase Address  186305 non-null  object
dtypes: object(6)
memory usage: 8.6+ MB
# Statistical analysis of data
df.describe()
        Order ID               Product Quantity Ordered Price Each  Order Date Purchase Address
count     186305                186305           186305     186305      186305           186305
unique    178438                    20               10         24      142396           140788
top     Order ID  USB-C Charging Cable                1      11.95  Order Date Purchase Address
freq         355                 21903           168552      21903         355              355
df.head()
  Order ID                   Product Quantity Ordered Price Each      Order Date                       Purchase Address
0   141234                    iPhone                1        700  01/22/19 21:25        944 Walnut St, Boston, MA 02215
1   141235  Lightning Charging Cable                1      14.95  01/28/19 14:15       185 Maple St, Portland, OR 97035
2   141236          Wired Headphones                2      11.99  01/17/19 13:33  538 Adams St, San Francisco, CA 94016
3   141237          27in FHD Monitor                1     149.99  01/05/19 20:33     738 10th St, Los Angeles, CA 90001
4   141238          Wired Headphones                1      11.99  01/25/19 11:59          387 10th St, Austin, TX 73301
df.columns
Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')
df.dtypes
Order ID            object
Product             object
Quantity Ordered    object
Price Each          object
Order Date          object
Purchase Address    object
dtype: object
def missing_values_table(df):
    # Calculate the number of missing values per column
    mis_val = df.isnull().sum()

    # Calculate the percentage of missing values per column
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Create a table combining the missing values and their percentages
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns of the table
    mis_val_table_ren_columns = mis_val_table.rename(columns={0: 'No. Missing Values', 1: '%'})

    # Filter out columns with no missing values, sort in descending order by percentage, and round to 1 decimal place
    mis_val_table_ren_columns = mis_val_table_ren_columns[mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values('%', ascending=False).round(1)

    # Print the number of features with missing values
    print(f"There are {mis_val_table_ren_columns.shape[0]} features with missing values.")

    return mis_val_table_ren_columns

# Call the function to display the missing values table for the DataFrame 'df'
missing_values_table(df)
There are 6 features with missing values.
                  No. Missing Values    %
Order ID                         545  0.3
Product                          545  0.3
Quantity Ordered                 545  0.3
Price Each                       545  0.3
Order Date                       545  0.3
Purchase Address                 545  0.3

Insights from Data Understanding

I swiftly noticed a few key issues with the dataset:

  1. Data Type Discrepancies: The dataset consists of 186,850 rows and 6 columns. One notable problem is that the ‘Price Each’ column is stored as an object, even though it should be a float data type.

  2. Date Column Mislabeling: The ‘Order Date’ column is likewise stored as an object rather than being recognized as a datetime data type.

  3. Quantity Ordered Data Type: Similarly, the ‘Quantity Ordered’ column is marked as an object instead of a more suitable integer data type.

  4. Missing Values: Furthermore, the dataset contains 186,850 entries in total, but each column has only 186,305 non-null values. Subtracting the two shows there are 545 missing values in every column, as verified with the quick check below.
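
A quick sanity check of that arithmetic, assuming df is still the merged DataFrame loaded above:

# Total rows minus non-null rows per column: 186850 - 186305 = 545
print(len(df) - df.count())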

4. Data Cleaning & Preprocessing

Renaming Columns

# Print column names
print(df.columns)
Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
       'Purchase Address'],
      dtype='object')
# Create a copy of the DataFrame
data = df.copy()
# Renaming the columns
data = data.rename(columns={'Quantity Ordered': 'Quantity_Ordered',
                       'Price Each': 'Product_Price',
                        'Order ID' : 'Order_ID',
                       'Order Date': 'Order_Date',
                       'Purchase Address': 'Purchase_Address'})
# Print column names
print(data.columns)
Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Product_Price',
       'Order_Date', 'Purchase_Address'],
      dtype='object')

Handling Missing Values

# Count missing values in each column
data.isna().sum()
Order_ID            545
Product             545
Quantity_Ordered    545
Product_Price       545
Order_Date          545
Purchase_Address    545
dtype: int64
# Remove missing values 
data = data.dropna()
# Count missing values in each column again to see update
data.isna().sum()
Order_ID            0
Product             0
Quantity_Ordered    0
Product_Price       0
Order_Date          0
Purchase_Address    0
dtype: int64

Handling Duplicates

# Select duplicated rows in the DataFrame
data.loc[data.duplicated()]
          Order_ID                Product  Quantity_Ordered  Product_Price      Order_Date                          Purchase_Address
875         142071  AA Batteries (4-pack)                 1           3.84  01/17/19 23:02              131 2nd St, Boston, MA 02215
1102      Order ID                Product  Quantity Ordered     Price Each      Order Date                          Purchase Address
1194      Order ID                Product  Quantity Ordered     Price Each      Order Date                          Purchase Address
1897      Order ID                Product  Quantity Ordered     Price Each      Order Date                          Purchase Address
2463      Order ID                Product  Quantity Ordered     Price Each      Order Date                          Purchase Address
...            ...                    ...               ...            ...             ...                                       ...
185070    Order ID                Product  Quantity Ordered     Price Each      Order Date                          Purchase Address
185085      317971  AA Batteries (4-pack)                 1           3.84  12/17/19 18:39  250 Chestnut St, San Francisco, CA 94016
185481    Order ID                Product  Quantity Ordered     Price Each      Order Date                          Purchase Address
185925    Order ID                Product  Quantity Ordered     Price Each      Order Date                          Purchase Address
185955    Order ID                Product  Quantity Ordered     Price Each      Order Date                          Purchase Address

618 rows × 6 columns

# Inspecting a product that appears among the duplicated rows
data.query('Product == "AA Batteries (4-pack)"')
        Order_ID                Product  Quantity_Ordered  Product_Price      Order_Date                          Purchase_Address
19        141253  AA Batteries (4-pack)                 1           3.84  01/17/19 00:09            385 11th St, Atlanta, GA 30301
24        141258  AA Batteries (4-pack)                 1           3.84  01/19/19 21:47       820 1st St, San Francisco, CA 94016
34        141268  AA Batteries (4-pack)                 1           3.84  01/14/19 10:13       611 Elm St, New York City, NY 10001
58        141290  AA Batteries (4-pack)                 3           3.84  01/02/19 08:25           4 1st St, Los Angeles, CA 90001
59        141291  AA Batteries (4-pack)                 1           3.84  01/26/19 18:11        632 13th St, Los Angeles, CA 90001
...          ...                    ...               ...            ...             ...                                       ...
186822    319643  AA Batteries (4-pack)                 2           3.84  12/25/19 12:38      811 Hill St, San Francisco, CA 94016
186823    319644  AA Batteries (4-pack)                 1           3.84  12/30/19 18:09   420 Madison St, New York City, NY 10001
186826    319647  AA Batteries (4-pack)                 2           3.84  12/28/19 17:43    119 Spruce St, San Francisco, CA 94016
186833    319654  AA Batteries (4-pack)                 1           3.84  12/23/19 14:08         245 Dogwood St, Atlanta, GA 30301
186846    319667  AA Batteries (4-pack)                 2           3.84  12/01/19 12:01      549 Willow St, Los Angeles, CA 90001

20577 rows × 6 columns


# Remove duplicated rows from the DataFrame based on specified columns
data = data.loc[~data.duplicated(subset=['Order_ID','Order_Date','Purchase_Address'])]

# The '~' operator negates the result of 'duplicated' function, selecting the non-duplicated rows
# The 'subset' parameter specifies the columns to consider when determining duplication

# The resulting DataFrame will only contain unique combinations of values across the specified columns
# Resetting the index and creating a copy of the data
data = data.reset_index(drop=True).copy()
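
To make the keep-first behavior concrete, here is a minimal sketch on toy data (the values below are made up purely for illustration):

# Toy example: duplicated() flags every occurrence after the first
toy = pd.DataFrame({
    'Order_ID': ['100', '100', '101'],
    'Order_Date': ['01/01/19 10:00', '01/01/19 10:00', '01/02/19 12:30'],
    'Purchase_Address': ['1 A St', '1 A St', '2 B St'],
})
print(toy.duplicated(subset=['Order_ID', 'Order_Date', 'Purchase_Address']))
# 0    False   <- first occurrence, kept by ~duplicated()
# 1     True   <- repeat, dropped
# 2    False
# dtype: bool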

Data Type Conversion

# Find unique values in the 'Product_Price' column
unique_values = data['Product_Price'].unique()
print(unique_values)
['700' '14.95' '11.99' '149.99' '2.99' '389.99' '11.95' '99.99' '150'
 '1700' '300' '400' '3.84' '600' '109.99' '379.99' '999.99' '600.0'
 'Price Each' '700.0' '150.0' '300.0' '1700.0' '400.0']
Important

Notice the stray ‘Price Each’ string among the values: these are repeated header rows from the merged CSVs, and they make a direct conversion to float impossible.

# Find unique values in the 'Product' column
unique_values = data['Product'].unique()
print(unique_values)
['iPhone' 'Lightning Charging Cable' 'Wired Headphones' '27in FHD Monitor'
 'AAA Batteries (4-pack)' '27in 4K Gaming Monitor' 'USB-C Charging Cable'
 'Bose SoundSport Headphones' 'Apple Airpods Headphones'
 'Macbook Pro Laptop' 'Flatscreen TV' 'Vareebadd Phone'
 'AA Batteries (4-pack)' 'Google Phone' '20in Monitor'
 '34in Ultrawide Monitor' 'ThinkPad Laptop' 'LG Dryer'
 'LG Washing Machine' 'Product']
# Find unique values in the 'Quantity_Ordered' column
unique_values = data['Quantity_Ordered'].unique()
print(unique_values)
['1' '2' '3' '5' '4' '7' 'Quantity Ordered' '6' '9' '8']
Important

Notice the stray ‘Quantity Ordered’ string here as well: the same repeated header rows prevent a direct conversion to an integer type.

# Taking care of the data quality issues: inspect the stray header rows, then filter them out
data[data['Product_Price'] == 'Price Each']
data = data[data['Product_Price'] != 'Price Each']

data[data['Quantity_Ordered'] == 'Quantity Ordered']
data = data[data['Quantity_Ordered'] != 'Quantity Ordered']
# Data conversion
data['Quantity_Ordered'] = data['Quantity_Ordered'].astype(int)
data['Product'] = data['Product'].astype('category')
data['Order_Date'] = pd.to_datetime(data['Order_Date'])
data['Product_Price'] = data['Product_Price'].astype(float)
# Confirming the conversion
data.dtypes
Order_ID                    object
Product                   category
Quantity_Ordered             int64
Product_Price              float64
Order_Date          datetime64[ns]
Purchase_Address            object
dtype: object
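
For reference, an equivalent approach is to coerce bad entries to NaN instead of filtering by exact string. A sketch, assuming (as the unique-value checks above showed) that the stray header rows are the only non-numeric values:

# Alternative: coerce non-numeric strings to NaN, then drop those rows
data['Product_Price'] = pd.to_numeric(data['Product_Price'], errors='coerce')
data['Quantity_Ordered'] = pd.to_numeric(data['Quantity_Ordered'], errors='coerce')
data = data.dropna(subset=['Product_Price', 'Quantity_Ordered'])
data['Quantity_Ordered'] = data['Quantity_Ordered'].astype(int)

# An explicit format also makes the datetime parse faster and stricter
data['Order_Date'] = pd.to_datetime(data['Order_Date'], format='%m/%d/%y %H:%M')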

5. Creating Columns from Existing Ones

Crafting the Month Column

# Extract and create a 'Month' column from the 'Order_Date'
data['Month'] = data['Order_Date'].dt.month

# Find unique months in the 'Month' column
unique_months = data['Month'].unique()
print(unique_months)
[ 1  2  3  4  5  6  7  8  9 10 11 12]
# Import the calendar module
import calendar

# Map numerical month values to month names
data['Month'] = data['Month'].apply(lambda x: calendar.month_abbr[x])

# Display the updated 'Month' column with unique month names
unique_months = data['Month'].unique()
print(unique_months)
['Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov' 'Dec']
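
For reference, pandas can produce the same three-letter abbreviations directly from the datetime column, without the calendar module:

# Equivalent one-liner using the datetime accessor
data['Month'] = data['Order_Date'].dt.strftime('%b')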
# Extract city information from Purchase Address using regular expressions
data['City'] = data['Purchase_Address'] \
    .str.extract(r',\s*([^\d,]+,\s*[A-Z]{2})', expand=False)

print(data['City'].unique())
['Boston, MA' 'Portland, OR' 'San Francisco, CA' 'Los Angeles, CA'
 'Austin, TX' 'Atlanta, GA' 'Seattle, WA' 'New York City, NY' 'Dallas, TX'
 'Portland, ME']
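
Since every address follows the “street, city, ST zip” pattern, a split-based sketch gives the same result and may be easier to read than the regex:

# Equivalent split-based approach (assumes the consistent three-part address format)
parts = data['Purchase_Address'].str.split(', ', expand=True)
data['City'] = parts[1] + ', ' + parts[2].str[:2]  # e.g. 'Boston' + ', ' + 'MA'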

Crafting Sales Column

# Calculate sales by multiplying quantity ordered and product price
data['Sales'] = data['Quantity_Ordered'] * data['Product_Price']

Crafting Hour Column

# Extract the hour and store it in a new 'Hour' column
data['Hour'] = data['Order_Date'].dt.strftime('%H:%M:%S').str[:2]
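
Note that the strftime approach keeps the hour as a zero-padded string; if an integer is more convenient for sorting or plotting, the datetime accessor gives it directly:

# Alternative: integer hour via the datetime accessor
data['Hour'] = data['Order_Date'].dt.hour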

# Display the DataFrame with the 'Hour' column
data.head()
  Order_ID                   Product  Quantity_Ordered  Product_Price          Order_Date                       Purchase_Address Month               City   Sales Hour
0   141234                    iPhone                 1         700.00 2019-01-22 21:25:00        944 Walnut St, Boston, MA 02215   Jan         Boston, MA  700.00   21
1   141235  Lightning Charging Cable                 1          14.95 2019-01-28 14:15:00       185 Maple St, Portland, OR 97035   Jan       Portland, OR   14.95   14
2   141236          Wired Headphones                 2          11.99 2019-01-17 13:33:00  538 Adams St, San Francisco, CA 94016   Jan  San Francisco, CA   23.98   13
3   141237          27in FHD Monitor                 1         149.99 2019-01-05 20:33:00     738 10th St, Los Angeles, CA 90001   Jan    Los Angeles, CA  149.99   20
4   141238          Wired Headphones                 1          11.99 2019-01-25 11:59:00          387 10th St, Austin, TX 73301   Jan         Austin, TX   11.99   11
Insights from Data Cleaning & Preprocessing

To improve the analytical capabilities of my dataset, I performed data type conversions on three columns: Product_Price, Order_Date, and Quantity_Ordered. Equipping each column with an accurate, appropriate data type ensures compatibility with analytical techniques and lets me derive meaningful insights from the data.

Next, I created four key columns: Month, City, Sales, and Hour. These features provide valuable insights for my analysis.

Creation of Month Column: I began by using the Order_Date column to extract the Month information. This allows me to analyze sales patterns and trends on a monthly basis. The conversion process involves extracting the month component from the date and assigning it to a new column.

Creation of City Column: Next, I created a City column by extracting the city information from the existing Purchase_Address column. This enables me to analyze sales performance and customer preferences across different cities. The city information is extracted from the address and stored in a new column for further analysis.

Calculation of Sales: I generated a Sales column by multiplying the Quantity_Ordered and Product_Price columns. This calculation provides me with the total revenue generated from each individual order. The resulting values are stored in a new column, which can be used to examine sales performance at a granular level.

Creation of Hour Column: Finally, I used the Order_Date column to extract the Hour information, which lets me pinpoint the times of day when customers are most active.

By performing these data type conversions and creating the necessary columns, I have enhanced the analytical capabilities of the dataset, enabling me to effectively address my research questions.

6. Save Data

data.to_csv('priscables_clean.csv', index=False)

7. Next Steps

Moving on…

Now you can move on to the next notebook, where I discuss Exploratory Data Analysis.

Click here to continue