import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
'display.max_columns', None)
pd.set_option(import warnings
'ignore') warnings.filterwarnings(
TessX | Data Preparation
Pricing Analytics
1. Introduction
TessX is a car dealership with a clear goal: to sell cars quickly and make a fair profit. But there’s a big challenge they face every day: pricing cars is not easy.
Think about it this way:
- If a car is priced too high, buyers might not be interested, and the car could sit unsold for months. This costs the dealership money in storage and maintenance.
- If a car is priced too low, it might sell quickly, but the dealership loses out on potential profit.
Right now, the team at TessX relies on their experience, intuition, and basic calculations to set prices. It’s a manual process that takes a lot of time and effort. And with thousands of cars to price, mistakes can happen. These mistakes can cost the dealership a lot of money.
The goal
What if there was a better way to price cars? What if TessX could use data to make smarter decisions?
This is where the project comes in. The goal is to build a tool that takes the guesswork out of pricing.
Here’s how it works:
1. The team enters details about a car, such as its make, horsepower, etc., into the app.
2. The app uses a machine learning model to analyze this data and compare it to thousands of past sales.
3. Within seconds, the team gets a data-driven price recommendation – a price that’s fair for buyers and profitable for TessX.
The bigger vision:
- Help TessX sell cars faster without losing money.
- Make the pricing process easier and less stressful for the team.
- Build trust with buyers by offering fair and transparent prices.
Dataset overview (17 Columns)
- Index - A unique number assigned to each row for identification.
- Make - The manufacturer or brand of the vehicle (e.g., Toyota, Ford, BMW).
- Model - The specific name of the vehicle under the manufacturer (e.g., Camry, Mustang, X5).
- Year - The year the car was manufactured or released.
- Trim - A specific version or configuration of the car model, which includes different features and options (e.g., Sport, LX, Limited).
- MSRP - Manufacturer’s Suggested Retail Price - The price recommended by the manufacturer before dealer discounts or negotiations.
- Invoice Price - The price the dealer pays the manufacturer for the car, usually lower than the MSRP.
- Used/New Price - The price of the car in the market, depending on whether it is new or used.
- Body Size - The general size category of the car (e.g., compact, midsize, full-size, SUV).
- Body Style - The design or shape of the car (e.g., sedan, coupe, hatchback, SUV, truck).
- Cylinders - The number of cylinders in the engine, affecting power and fuel efficiency (e.g., 4-cylinder, 6-cylinder, 8-cylinder).
- Engine Aspiration - How the engine takes in air for combustion (e.g., naturally aspirated, turbocharged, supercharged).
- Drivetrain - How power is delivered to the wheels (e.g., FWD – Front-Wheel Drive, RWD – Rear-Wheel Drive, AWD – All-Wheel Drive, 4WD – Four-Wheel Drive).
- Transmission - The system that changes gears in the car (e.g., automatic, manual, CVT).
- Horsepower - The power output of the engine, measured in HP (higher HP means a more powerful engine).
- Torque - The twisting force of the engine, measured in lb-ft (important for acceleration and towing power).
- Highway Fuel Economy - The fuel efficiency of the car when driving on highways, typically measured in miles per gallon (MPG) or liters per 100 km (L/100 km).
Load libraries
= pd.read_csv('car_data.csv') df
2. Data Understanding
# Short Info of data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1610 entries, 0 to 1609
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 1610 non-null int64
1 Make 1610 non-null object
2 Model 1610 non-null object
3 Year 1610 non-null int64
4 Trim 1610 non-null object
5 MSRP 1610 non-null object
6 Invoice Price 1058 non-null object
7 Used/New Price 1610 non-null object
8 Body Size 1610 non-null object
9 Body Style 1610 non-null object
10 Cylinders 1445 non-null object
11 Engine Aspiration 1610 non-null object
12 Drivetrain 1610 non-null object
13 Transmission 1610 non-null object
14 Horsepower 1605 non-null object
15 Torque 1583 non-null object
16 Highway Fuel Economy 1186 non-null object
dtypes: int64(2), object(15)
memory usage: 214.0+ KB
# Statistical analysis of data
df.describe()
index | Year | |
---|---|---|
count | 1610.000000 | 1610.000000 |
mean | 2932.104969 | 2023.450932 |
std | 1857.612482 | 0.497741 |
min | 0.000000 | 2023.000000 |
25% | 1602.250000 | 2023.000000 |
50% | 3207.500000 | 2023.000000 |
75% | 4809.750000 | 2024.000000 |
max | 6414.000000 | 2024.000000 |
df.head()
index | Make | Model | Year | Trim | MSRP | Invoice Price | Used/New Price | Body Size | Body Style | Cylinders | Engine Aspiration | Drivetrain | Transmission | Horsepower | Torque | Highway Fuel Economy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Aston Martin | DBX707 | 2024 | Base | $242,000 | NaN | $242,000 | Large | SUV | V8 | Twin-Turbo | AWD | automatic | 697 hp @ 6000 rpm | 663 ft-lbs. @ 2750 rpm | 20 mpg |
1 | 1 | Audi | A3 | 2024 | Premium w/40 TFSI | $35,800 | $33,653 | $35,800 | Compact | Sedan | I4 | Turbocharged | FWD | automatic | 201 hp @ 4800 rpm | 221 ft-lbs. @ 4100 rpm | 37 mpg |
2 | 2 | Audi | A3 | 2024 | Premium w/40 TFSI | $37,800 | $35,533 | $37,800 | Compact | Sedan | I4 | Turbocharged | AWD | automatic | 201 hp @ 5000 rpm | 221 ft-lbs. @ 4000 rpm | 34 mpg |
3 | 3 | Audi | A3 | 2024 | Premium Plus w/40 TFSI | $41,400 | $38,917 | $41,400 | Compact | Sedan | I4 | Turbocharged | AWD | automatic | 201 hp @ 5000 rpm | 221 ft-lbs. @ 4000 rpm | 34 mpg |
4 | 4 | Audi | A3 | 2024 | Premium Plus w/40 TFSI | $39,400 | $37,037 | $39,400 | Compact | Sedan | I4 | Turbocharged | FWD | automatic | 201 hp @ 4800 rpm | 221 ft-lbs. @ 4100 rpm | 37 mpg |
df.columns
Index(['index', 'Make', 'Model', 'Year', 'Trim', 'MSRP', 'Invoice Price',
'Used/New Price', 'Body Size', 'Body Style', 'Cylinders',
'Engine Aspiration', 'Drivetrain', 'Transmission', 'Horsepower',
'Torque', 'Highway Fuel Economy'],
dtype='object')
df.dtypes
index int64
Make object
Model object
Year int64
Trim object
MSRP object
Invoice Price object
Used/New Price object
Body Size object
Body Style object
Cylinders object
Engine Aspiration object
Drivetrain object
Transmission object
Horsepower object
Torque object
Highway Fuel Economy object
dtype: object
= [
numerical_columns "Year", "MSRP", "Invoice Price", "Used/New Price",
"Cylinders", "Horsepower", "Torque", "Highway Fuel Economy"
]
= [
categorical_columns "Make", "Model", "Trim", "Body Size", "Body Style",
"Engine Aspiration", "Drivetrain", "Transmission"
]
Categorical data looks
# Print value counts for each categorical column in a better format
for col in categorical_columns:
= df[col].value_counts().index.tolist()
unique_values
if len(unique_values) > 15:
= "More than 15"
values_str else:
= ", ".join(map(str, unique_values))
values_str
print(f"\n{col.upper()} -> {values_str}")
print("-" * 50) # Separator for better readability
MAKE -> Ford, Mercedes-Benz, Audi, Nissan, BMW, Bentley, Aston Martin
--------------------------------------------------
MODEL -> More than 15
--------------------------------------------------
TRIM -> More than 15
--------------------------------------------------
BODY SIZE -> Large, Midsize, Compact
--------------------------------------------------
BODY STYLE -> SUV, Pickup Truck, Sedan, Cargo Van, Coupe, Passenger Van, Convertible, Hatchback, Convertible SUV, Wagon, Cargo Minivan, Passenger Minivan
--------------------------------------------------
ENGINE ASPIRATION -> Turbocharged, Naturally Aspirated, Twin-Turbo, Electric Motor, Twincharged, Supercharged
--------------------------------------------------
DRIVETRAIN -> AWD, RWD, 4WD, FWD
--------------------------------------------------
TRANSMISSION -> automatic, manual
--------------------------------------------------
3. Data Cleaning
# Convert categorical columns to category data type
= df[categorical_columns].astype('category') df[categorical_columns]
# Columns to clean (remove $ and convert to numeric)
= ["MSRP", "Invoice Price", "Used/New Price"]
price_columns
# Remove '$' and convert to integer
for col in price_columns:
= df[col].replace('[\$,]', '', regex=True).astype(float) # Convert to float (or use int if no decimals)
df[col]
# Convert 'Cylinders' to categorical type
"Cylinders"] = df["Cylinders"].astype("category")
df[
# Verify the changes
print(df.dtypes)
index int64
Make category
Model category
Year int64
Trim category
MSRP float64
Invoice Price float64
Used/New Price float64
Body Size category
Body Style category
Cylinders category
Engine Aspiration category
Drivetrain category
Transmission category
Horsepower object
Torque object
Highway Fuel Economy object
dtype: object
# Extract only the horsepower value
"Horsepower"] = df["Horsepower"].str.extract(r'(\d+)').astype(float)
df[
# Verify the changes
print(df["Horsepower"].head())
print(df.dtypes) # Check data types
0 697.0
1 201.0
2 201.0
3 201.0
4 201.0
Name: Horsepower, dtype: float64
index int64
Make category
Model category
Year int64
Trim category
MSRP float64
Invoice Price float64
Used/New Price float64
Body Size category
Body Style category
Cylinders category
Engine Aspiration category
Drivetrain category
Transmission category
Horsepower float64
Torque object
Highway Fuel Economy object
dtype: object
# Extract only the torque value (first number)
"Torque"] = df["Torque"].str.extract(r'(\d+)').astype(float)
df[
# Verify the changes
print(df["Torque"].head())
print(df.dtypes) # Check data types
0 663.0
1 221.0
2 221.0
3 221.0
4 221.0
Name: Torque, dtype: float64
index int64
Make category
Model category
Year int64
Trim category
MSRP float64
Invoice Price float64
Used/New Price float64
Body Size category
Body Style category
Cylinders category
Engine Aspiration category
Drivetrain category
Transmission category
Horsepower float64
Torque float64
Highway Fuel Economy object
dtype: object
# Extract only the numeric value and convert to float
"Highway Fuel Economy"] = df["Highway Fuel Economy"].str.extract(r'(\d+)').astype(float)
df[
# Verify the changes
print(df["Highway Fuel Economy"].head())
print(df.dtypes) # Check data types
0 20.0
1 37.0
2 34.0
3 34.0
4 37.0
Name: Highway Fuel Economy, dtype: float64
index int64
Make category
Model category
Year int64
Trim category
MSRP float64
Invoice Price float64
Used/New Price float64
Body Size category
Body Style category
Cylinders category
Engine Aspiration category
Drivetrain category
Transmission category
Horsepower float64
Torque float64
Highway Fuel Economy float64
dtype: object
Check for missing values in numerical columns
# Check for missing values in numerical columns
= df[numerical_columns].isnull().sum()
missing_values
# Calculate percentage of missing values
= len(df)
total_rows = (missing_values / total_rows) * 100
missing_percentage
# Combine into a DataFrame for better visualization
= pd.DataFrame({
missing_df 'Missing Values': missing_values,
'Missing Percentage (%)': missing_percentage
})
# Print results
print("Missing values in numerical features:\n")
print(missing_df)
Missing values in numerical features:
Missing Values Missing Percentage (%)
Year 0 0.000000
MSRP 0 0.000000
Invoice Price 552 34.285714
Used/New Price 0 0.000000
Cylinders 165 10.248447
Horsepower 5 0.310559
Torque 27 1.677019
Highway Fuel Economy 424 26.335404
Since the ‘Year’ column contains only two distinct values (2023 and 2024), it does not form a continuous numerical trend. Instead, it represents distinct model categories. Treating it as a categorical variable ensures better handling in analysis and modeling, particularly when comparing features across different model years.
Check for missing values in categorical columns
# Check for missing values in categorical columns
= df[categorical_columns].isnull().sum()
missing_values_cat
# Calculate percentage of missing values
= (missing_values_cat / len(df)) * 100
missing_percentage_cat
# Combine into a DataFrame for better visualization
= pd.DataFrame({
missing_cat_df 'Missing Values': missing_values_cat,
'Missing Percentage (%)': missing_percentage_cat
})
# Print results
print("Missing values in categorical features:\n")
print(missing_cat_df)
Missing values in categorical features:
Missing Values Missing Percentage (%)
Make 0 0.0
Model 0 0.0
Trim 0 0.0
Body Size 0 0.0
Body Style 0 0.0
Engine Aspiration 0 0.0
Drivetrain 0 0.0
Transmission 0 0.0
"Year"] = df["Year"].astype("category") df[
Checking whether my data is now in the correct data type
df.dtypes
index int64
Make category
Model category
Year category
Trim category
MSRP float64
Invoice Price float64
Used/New Price float64
Body Size category
Body Style category
Cylinders category
Engine Aspiration category
Drivetrain category
Transmission category
Horsepower float64
Torque float64
Highway Fuel Economy float64
dtype: object
# Identify numerical and categorical columns
= df.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_cols = df.select_dtypes(include=['category']).columns.tolist()
categorical_cols
# Print results
print("Numerical Columns:", numerical_cols)
print("\nCategorical Columns:", categorical_cols)
Numerical Columns: ['index', 'MSRP', 'Invoice Price', 'Used/New Price', 'Horsepower', 'Torque', 'Highway Fuel Economy']
Categorical Columns: ['Make', 'Model', 'Year', 'Trim', 'Body Size', 'Body Style', 'Cylinders', 'Engine Aspiration', 'Drivetrain', 'Transmission']
- Since my categorical columns, Trim (373) and Model (150), have high cardinality, I will remove them as they won’t be useful for the model.
- I removed the numerical column Index because it is redundant.
= ['MSRP', 'Invoice Price', 'Used/New Price',
numerical_cols 'Horsepower', 'Torque', 'Highway Fuel Economy']
= ['Make','Year', 'Body Size', 'Body Style',
categorical_cols 'Cylinders', 'Engine Aspiration', 'Drivetrain',
'Transmission']
Correlation
# Correlation matrix for numerical columns
= df[numerical_cols].corr()
correlation_matrix
# Display the correlation matrix
correlation_matrix
MSRP | Invoice Price | Used/New Price | Horsepower | Torque | Highway Fuel Economy | |
---|---|---|---|---|---|---|
MSRP | 1.000000 | 0.999362 | 1.000000 | 0.703264 | 0.528913 | -0.127039 |
Invoice Price | 0.999362 | 1.000000 | 0.999362 | 0.771598 | 0.527983 | 0.029318 |
Used/New Price | 1.000000 | 0.999362 | 1.000000 | 0.703264 | 0.528913 | -0.127039 |
Horsepower | 0.703264 | 0.771598 | 0.703264 | 1.000000 | 0.794974 | -0.125567 |
Torque | 0.528913 | 0.527983 | 0.528913 | 0.794974 | 1.000000 | 0.034291 |
Highway Fuel Economy | -0.127039 | 0.029318 | -0.127039 | -0.125567 | 0.034291 | 1.000000 |
="ticks") # Change "darkgrid" to any of the styles above sns.set_theme(style
# Plot the heatmap
=(8, 6))
plt.figure(figsize=True, cmap='Reds', fmt='.2f', linewidths=0.5)
sns.heatmap(correlation_matrix, annot'Correlation Matrix for Numerical Features')
plt.title( plt.show()
I am taking out correlated columns whose correlation to MSRP is exactly 1.
After Handling Correlation: Columns are…
= ['MSRP', 'Horsepower', 'Torque', 'Highway Fuel Economy']
numerical_cols
= ['Make','Year', 'Body Size', 'Body Style',
categorical_cols 'Cylinders', 'Engine Aspiration', 'Drivetrain',
'Transmission']
4. Remark
Data cleaning is an ongoing process. As we conduct further analysis, such as data visualization and storytelling, additional issues, including outliers, may emerge. However, at this stage, we have completed the initial phase of data cleaning.
Columns that made it
= ['MSRP', 'Horsepower', 'Torque', 'Highway Fuel Economy']
numerical_cols
= ['Make','Year', 'Body Size', 'Body Style',
categorical_cols 'Cylinders', 'Engine Aspiration', 'Drivetrain',
'Transmission']
5. Save Data
# Define numerical and categorical columns
= ['MSRP', 'Horsepower', 'Torque', 'Highway Fuel Economy']
numerical_cols = ['Make', 'Year', 'Body Size', 'Body Style',
categorical_cols 'Cylinders', 'Engine Aspiration', 'Drivetrain',
'Transmission']
= numerical_cols + categorical_cols
selected_cols = df[selected_cols]
cleaned_df
# Save the cleaned dataset to a CSV file
"cleaned_data.csv", index=False)
cleaned_df.to_csv(
print("Cleaned data saved as 'cleaned_data.csv'.")
Cleaned data saved as 'cleaned_data.csv'.
Now, you can move to the next notebook, where I perform exploratory data analysis (EDA) to gain insights.