import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import warnings
warnings.filterwarnings('ignore')
Priscables Electronics | Data Prep
Marketing Analytics
1. Introduction
Prisca runs a thriving online electronics store, Priscables, but she’s sitting on a goldmine of untapped insights: a year of transaction data (over 185,000 rows). Her challenge?
- She wants to understand her customers better (Who are they? Where are they?).
- She needs to identify the best times to run ads or offer discounts to boost sales.
- She’s unsure how to optimize her product lineup (What’s selling? What’s not?).
Without analyzing this data, she’s missing opportunities to grow strategically.
2. Project Goal
Transform raw transaction data into actionable insights to:
- Understand customer behavior (peak shopping times, popular locations).
- Improve marketing decisions (when/where to run ads, offer promotions).
- Optimize product offerings (highlight top sellers, phase out underperformers).
Loading & Merging the Data
# List of CSV files to be merged
sales_list = [
    "Sales_January.csv",
    "Sales_February.csv",
    "Sales_March.csv",
    "Sales_April.csv",
    "Sales_May.csv",
    "Sales_June.csv",
    "Sales_July.csv",
    "Sales_August.csv",
    "Sales_September.csv",
    "Sales_October.csv",
    "Sales_November.csv",
    "Sales_December.csv"
]

# Create an empty list to store dataframes
df_list = []

# Loop through the list of CSV files and read each file into a dataframe
for sales_file in sales_list:
    df = pd.read_csv(sales_file)
    df_list.append(df)

# Merge the dataframes in the list
merged_sales_df = pd.concat(df_list)

# Write the merged dataframe to a CSV file
merged_sales_df.to_csv('sales_data.csv', index=False)

# Read the merged data back in for analysis
df = pd.read_csv('sales_data.csv')
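Since the monthly files all follow the same naming pattern, the file list can also be built programmatically instead of by hand. A minimal sketch, assuming the CSVs sit in the working directory:

from pathlib import Path

# Collect every monthly file matching the naming pattern; the alphabetical
# order does not matter here since we concatenate row-wise
sales_files = sorted(Path('.').glob('Sales_*.csv'))

# Read and concatenate in one pass; ignore_index avoids repeated row labels
merged_sales_df = pd.concat((pd.read_csv(f) for f in sales_files),
                            ignore_index=True)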
3. Data Understanding
# Short Info of data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186850 entries, 0 to 186849
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Order ID 186305 non-null object
1 Product 186305 non-null object
2 Quantity Ordered 186305 non-null object
3 Price Each 186305 non-null object
4 Order Date 186305 non-null object
5 Purchase Address 186305 non-null object
dtypes: object(6)
memory usage: 8.6+ MB
# Statistical analysis of data
df.describe()
| | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
---|---|---|---|---|---|---|
count | 186305 | 186305 | 186305 | 186305 | 186305 | 186305 |
unique | 178438 | 20 | 10 | 24 | 142396 | 140788 |
top | Order ID | USB-C Charging Cable | 1 | 11.95 | Order Date | Purchase Address |
freq | 355 | 21903 | 168552 | 21903 | 355 | 355 |
df.head()
| | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
---|---|---|---|---|---|---|
0 | 141234 | iPhone | 1 | 700 | 01/22/19 21:25 | 944 Walnut St, Boston, MA 02215 |
1 | 141235 | Lightning Charging Cable | 1 | 14.95 | 01/28/19 14:15 | 185 Maple St, Portland, OR 97035 |
2 | 141236 | Wired Headphones | 2 | 11.99 | 01/17/19 13:33 | 538 Adams St, San Francisco, CA 94016 |
3 | 141237 | 27in FHD Monitor | 1 | 149.99 | 01/05/19 20:33 | 738 10th St, Los Angeles, CA 90001 |
4 | 141238 | Wired Headphones | 1 | 11.99 | 01/25/19 11:59 | 387 10th St, Austin, TX 73301 |
df.columns
Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
'Purchase Address'],
dtype='object')
df.dtypes
Order ID object
Product object
Quantity Ordered object
Price Each object
Order Date object
Purchase Address object
dtype: object
def missing_values_table(df):
    # Calculate the number of missing values per column
    mis_val = df.isnull().sum()

    # Calculate the percentage of missing values per column
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    # Create a table combining the missing values and their percentages
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns of the table
    mis_val_table_ren_columns = mis_val_table.rename(columns={0: 'No. Missing Values', 1: '%'})

    # Filter out columns with no missing values, sort in descending order by percentage, and round to 1 decimal place
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0
    ].sort_values('%', ascending=False).round(1)

    # Print the number of features with missing values
    print(f"There are {mis_val_table_ren_columns.shape[0]} features with missing values.")

    return mis_val_table_ren_columns
# Call the function to display the missing values table for the DataFrame 'df'
missing_values_table(df)
There are 6 features with missing values.
| | No. Missing Values | % |
---|---|---|
Order ID | 545 | 0.3 |
Product | 545 | 0.3 |
Quantity Ordered | 545 | 0.3 |
Price Each | 545 | 0.3 |
Order Date | 545 | 0.3 |
Purchase Address | 545 | 0.3 |
Insights from Data Understanding
I quickly noticed a few key issues with the dataset:
- Data type discrepancies: The dataset consists of 186,850 rows and 6 columns, and every column is stored as an object. The ‘Price Each’ column, for instance, should be a float.
- Date column mislabeling: The ‘Order Date’ column is likewise stored as an object rather than being recognized as a datetime.
- Quantity Ordered data type: Similarly, the ‘Quantity Ordered’ column is stored as an object instead of an integer type.
- Missing values: The dataset contains 186,850 entries, but each column has only 186,305 non-null values; the difference tells us there are 545 missing values in every feature (see the quick check below).
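As a quick sanity check on that arithmetic (a minimal sketch; df is the merged frame loaded above):

# Each column reports 186,305 non-null values out of 186,850 rows,
# so every column is missing the same 545 entries
print(len(df) - df['Order ID'].count())   # 545
print(df.isna().sum().unique())           # [545]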
4. Data Cleaning & Preprocessing
Renaming Columns
# Print column names
print(df.columns)
Index(['Order ID', 'Product', 'Quantity Ordered', 'Price Each', 'Order Date',
'Purchase Address'],
dtype='object')
# Create a copy of the DataFrame
data = df.copy()
# Renaming the columns
data = data.rename(columns={'Quantity Ordered': 'Quantity_Ordered',
                            'Price Each': 'Product_Price',
                            'Order ID': 'Order_ID',
                            'Order Date': 'Order_Date',
                            'Purchase Address': 'Purchase_Address'})
# Print column names
print(data.columns)
Index(['Order_ID', 'Product', 'Quantity_Ordered', 'Product_Price',
'Order_Date', 'Purchase_Address'],
dtype='object')
Handling Missing Values
# Count missing values in each column
data.isna().sum()
Order_ID 545
Product 545
Quantity_Ordered 545
Product_Price 545
Order_Date 545
Purchase_Address 545
dtype: int64
# Remove missing values
data = data.dropna()
# Count missing values in each column again to see update
data.isna().sum()
Order_ID 0
Product 0
Quantity_Ordered 0
Product_Price 0
Order_Date 0
Purchase_Address 0
dtype: int64
Handling Duplicates
# Select duplicated rows in the DataFrame
data.loc[data.duplicated()]
| | Order_ID | Product | Quantity_Ordered | Product_Price | Order_Date | Purchase_Address |
---|---|---|---|---|---|---|
875 | 142071 | AA Batteries (4-pack) | 1 | 3.84 | 01/17/19 23:02 | 131 2nd St, Boston, MA 02215 |
1102 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
1194 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
1897 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
2463 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
... | ... | ... | ... | ... | ... | ... |
185070 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
185085 | 317971 | AA Batteries (4-pack) | 1 | 3.84 | 12/17/19 18:39 | 250 Chestnut St, San Francisco, CA 94016 |
185481 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
185925 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
185955 | Order ID | Product | Quantity Ordered | Price Each | Order Date | Purchase Address |
618 rows × 6 columns
# Checking an example duplicate
'Product == "AA Batteries (4-pack)"') data.query(
| | Order_ID | Product | Quantity_Ordered | Product_Price | Order_Date | Purchase_Address |
---|---|---|---|---|---|---|
19 | 141253 | AA Batteries (4-pack) | 1 | 3.84 | 01/17/19 00:09 | 385 11th St, Atlanta, GA 30301 |
24 | 141258 | AA Batteries (4-pack) | 1 | 3.84 | 01/19/19 21:47 | 820 1st St, San Francisco, CA 94016 |
34 | 141268 | AA Batteries (4-pack) | 1 | 3.84 | 01/14/19 10:13 | 611 Elm St, New York City, NY 10001 |
58 | 141290 | AA Batteries (4-pack) | 3 | 3.84 | 01/02/19 08:25 | 4 1st St, Los Angeles, CA 90001 |
59 | 141291 | AA Batteries (4-pack) | 1 | 3.84 | 01/26/19 18:11 | 632 13th St, Los Angeles, CA 90001 |
... | ... | ... | ... | ... | ... | ... |
186822 | 319643 | AA Batteries (4-pack) | 2 | 3.84 | 12/25/19 12:38 | 811 Hill St, San Francisco, CA 94016 |
186823 | 319644 | AA Batteries (4-pack) | 1 | 3.84 | 12/30/19 18:09 | 420 Madison St, New York City, NY 10001 |
186826 | 319647 | AA Batteries (4-pack) | 2 | 3.84 | 12/28/19 17:43 | 119 Spruce St, San Francisco, CA 94016 |
186833 | 319654 | AA Batteries (4-pack) | 1 | 3.84 | 12/23/19 14:08 | 245 Dogwood St, Atlanta, GA 30301 |
186846 | 319667 | AA Batteries (4-pack) | 2 | 3.84 | 12/01/19 12:01 | 549 Willow St, Los Angeles, CA 90001 |
20577 rows × 6 columns
# Remove duplicated rows from the DataFrame based on specified columns
data = data.loc[~data.duplicated(subset=['Order_ID', 'Order_Date', 'Purchase_Address'])]

# The '~' operator negates the result of the 'duplicated' function, selecting the non-duplicated rows
# The 'subset' parameter specifies the columns to consider when determining duplication
# The resulting DataFrame will only contain unique combinations of values across the specified columns
# Resetting the index and creating a copy of the data
data = data.reset_index(drop=True).copy()
Data Type Conversion
# Find unique values in the 'Product_Price' column
unique_values = data['Product_Price'].unique()
print(unique_values)
['700' '14.95' '11.99' '149.99' '2.99' '389.99' '11.95' '99.99' '150'
'1700' '300' '400' '3.84' '600' '109.99' '379.99' '999.99' '600.0'
'Price Each' '700.0' '150.0' '300.0' '1700.0' '400.0']
Notice the string ‘Price Each’ among the values: these look like repeated header rows carried over from the source CSVs, and a string like that makes conversion to float impossible until it is removed.
# Find unique values in the 'Product' column
unique_values = data['Product'].unique()
print(unique_values)
['iPhone' 'Lightning Charging Cable' 'Wired Headphones' '27in FHD Monitor'
'AAA Batteries (4-pack)' '27in 4K Gaming Monitor' 'USB-C Charging Cable'
'Bose SoundSport Headphones' 'Apple Airpods Headphones'
'Macbook Pro Laptop' 'Flatscreen TV' 'Vareebadd Phone'
'AA Batteries (4-pack)' 'Google Phone' '20in Monitor'
'34in Ultrawide Monitor' 'ThinkPad Laptop' 'LG Dryer'
'LG Washing Machine' 'Product']
# Find unique values in the 'Quantity_Ordered' column
unique_values = data['Quantity_Ordered'].unique()
print(unique_values)
['1' '2' '3' '5' '4' '7' 'Quantity Ordered' '6' '9' '8']
The same problem appears here: the stray string ‘Quantity Ordered’ prevents conversion to an integer type.
# Taking care of the data quality issues
# Inspect, then filter out, the stray header rows
data[data['Product_Price'] == 'Price Each']
data = data[data['Product_Price'] != 'Price Each']

data[data['Quantity_Ordered'] == 'Quantity Ordered']
data = data[data['Quantity_Ordered'] != 'Quantity Ordered']
# Data conversion
data['Quantity_Ordered'] = data['Quantity_Ordered'].astype(int)
data['Product'] = data['Product'].astype('category')
data['Order_Date'] = pd.to_datetime(data['Order_Date'])
data['Product_Price'] = data['Product_Price'].astype(float)
# Confirming the conversion
data.dtypes
Order_ID object
Product category
Quantity_Ordered int64
Product_Price float64
Order_Date datetime64[ns]
Purchase_Address object
dtype: object
5. Creating Columns from Existing Ones
Crafting the Month Column
# Extract and create a 'Month' column from the 'Order_Date'
data['Month'] = data['Order_Date'].dt.month

# Find unique months in the 'Month' column
unique_months = data['Month'].unique()
print(unique_months)
[ 1 2 3 4 5 6 7 8 9 10 11 12]
# Import the calendar module
import calendar
# Map numerical month values to month names
data['Month'] = data['Month'].apply(lambda x: calendar.month_abbr[x])
# Display the updated 'Month' column with unique month names
unique_months = data['Month'].unique()
print(unique_months)
['Jan' 'Feb' 'Mar' 'Apr' 'May' 'Jun' 'Jul' 'Aug' 'Sep' 'Oct' 'Nov' 'Dec']
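The same result can be read straight off the datetime accessor, skipping the intermediate numeric column; a minimal sketch:

# Abbreviated month name ('Jan', 'Feb', ...) directly from the datetime column
data['Month'] = data['Order_Date'].dt.strftime('%b')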
Crafting City Column
# Extract city information from Purchase Address using regular expressions
data['City'] = data['Purchase_Address'].str.extract(r',\s*([^\d,]+,\s*[A-Z]{2})', expand=False)

print(data['City'].unique())
['Boston, MA' 'Portland, OR' 'San Francisco, CA' 'Los Angeles, CA'
'Austin, TX' 'Atlanta, GA' 'Seattle, WA' 'New York City, NY' 'Dallas, TX'
'Portland, ME']
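For this address format ('street, city, STATE zip'), a simpler split-based extraction also works. A minimal sketch, assuming every address has exactly two commas:

# Split into [street, city, 'STATE ZIP'], then keep city plus the state code
parts = data['Purchase_Address'].str.split(', ', expand=True)
data['City'] = parts[1] + ', ' + parts[2].str[:2]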
Crafting Sales Column
# Calculate sales by multiplying quantity ordered and product price
data['Sales'] = data['Quantity_Ordered'] * data['Product_Price']
Crafting Hour Column
# Extract the hour and store it in a new 'Hour' column
data['Hour'] = data['Order_Date'].dt.strftime('%H:%M:%S').str[:2]
# Display the DataFrame with the 'Hour' column
data.head()
| | Order_ID | Product | Quantity_Ordered | Product_Price | Order_Date | Purchase_Address | Month | City | Sales | Hour |
---|---|---|---|---|---|---|---|---|---|---|
0 | 141234 | iPhone | 1 | 700.00 | 2019-01-22 21:25:00 | 944 Walnut St, Boston, MA 02215 | Jan | Boston, MA | 700.00 | 21 |
1 | 141235 | Lightning Charging Cable | 1 | 14.95 | 2019-01-28 14:15:00 | 185 Maple St, Portland, OR 97035 | Jan | Portland, OR | 14.95 | 14 |
2 | 141236 | Wired Headphones | 2 | 11.99 | 2019-01-17 13:33:00 | 538 Adams St, San Francisco, CA 94016 | Jan | San Francisco, CA | 23.98 | 13 |
3 | 141237 | 27in FHD Monitor | 1 | 149.99 | 2019-01-05 20:33:00 | 738 10th St, Los Angeles, CA 90001 | Jan | Los Angeles, CA | 149.99 | 20 |
4 | 141238 | Wired Headphones | 1 | 11.99 | 2019-01-25 11:59:00 | 387 10th St, Austin, TX 73301 | Jan | Austin, TX | 11.99 | 11 |
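Note that slicing the formatted string leaves 'Hour' as text ('21' rather than 21). If an integer is preferable for sorting and plotting, the datetime accessor provides it directly; a minimal sketch:

# Integer hour of day (0-23) straight from the datetime accessor
data['Hour'] = data['Order_Date'].dt.hour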
To improve the analytical capabilities of the dataset, I performed data type conversions on three columns: Product_Price, Order_Date, and Quantity_Ordered. These conversions give each column an accurate, appropriate data type, which is essential for reliable analysis and interpretation.
Next, I created four key columns that provide valuable angles for the analysis:
- Month: Extracted from Order_Date, this lets me analyze sales patterns and trends on a monthly basis.
- City: Extracted from the existing Purchase_Address column, this enables comparison of sales performance and customer preferences across cities.
- Sales: Calculated by multiplying Quantity_Ordered and Product_Price, this gives the total revenue from each individual order, so sales performance can be examined at a granular level.
- Hour: Extracted from Order_Date, this pinpoints the times of day when customers shop, which feeds directly into the question of when to run ads.
With the conversions done and these columns in place, the dataset is ready to address my research questions. For reuse on future exports, the whole flow can be wrapped in a single helper, sketched below.
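A minimal sketch that mirrors the steps in this notebook (the function name is my own, not part of the original code):

def prepare_sales_data(df):
    """Clean raw Priscables transactions and add analysis columns."""
    out = df.rename(columns={'Order ID': 'Order_ID',
                             'Quantity Ordered': 'Quantity_Ordered',
                             'Price Each': 'Product_Price',
                             'Order Date': 'Order_Date',
                             'Purchase Address': 'Purchase_Address'})
    out = out.dropna()  # drop the 545 all-NaN rows
    # Drop the repeated header rows (they carry 'Price Each' in the price cell)
    out = out[out['Product_Price'] != 'Price Each']
    # Drop order-level duplicates
    out = out.loc[~out.duplicated(subset=['Order_ID', 'Order_Date', 'Purchase_Address'])]
    # Type conversions
    out['Quantity_Ordered'] = out['Quantity_Ordered'].astype(int)
    out['Product_Price'] = out['Product_Price'].astype(float)
    out['Product'] = out['Product'].astype('category')
    out['Order_Date'] = pd.to_datetime(out['Order_Date'])
    # Derived columns
    out['Month'] = out['Order_Date'].dt.strftime('%b')
    out['City'] = out['Purchase_Address'].str.extract(r',\s*([^\d,]+,\s*[A-Z]{2})', expand=False)
    out['Sales'] = out['Quantity_Ordered'] * out['Product_Price']
    out['Hour'] = out['Order_Date'].dt.hour
    return out.reset_index(drop=True)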
6. Save Data
data.to_csv('priscables_clean.csv', index=False)
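One caveat with CSV: the category and datetime dtypes are lost on the round trip and will need re-converting in the next notebook. If a Parquet engine such as pyarrow is installed, Parquet preserves them; a minimal sketch:

# Parquet keeps dtypes (category, datetime64) intact across save/load
data.to_parquet('priscables_clean.parquet', index=False)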
7. Next Steps
Now you can move on to the next notebook, where I discuss Exploratory Data Analysis.