import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('cleaned_data.csv')
TessX | Exploring the Data Through Questions
Pricing Analytics
Introduction
After completing Data Preparation, the next step is Exploratory Data Analysis (EDA). In this notebook, I explore the data further by asking key questions to uncover patterns, trends, and insights. This helps in understanding the data’s structure and guiding the next steps in the analysis.
If you’d like to review the previous notebook on Data Preparation, click the link below:
View Data Preparation Notebook
Load libraries
What’s the Most Common Car Price?
# Generate KDE using Plotly
msrp_values = df['MSRP'].dropna().values  # Drop NaNs if any
x_range = np.linspace(msrp_values.min(), msrp_values.max(), 100)

# Create KDE curve
density = ff.create_distplot([msrp_values], ['MSRP'], show_hist=False, show_rug=False)

# Extract KDE curve values
kde_x = density.data[0]['x']
kde_y = density.data[0]['y']

# Create figure
fig = go.Figure()

# Add KDE curve
fig.add_trace(go.Scatter(
    x=kde_x,
    y=kde_y,
    fill='tozeroy',
    mode='lines',
    line=dict(color='#E3120B', width=2),  # Line color
    fillcolor='#E3120B',
    hoverinfo='x+y'
))

# Customize layout
fig.update_layout(
    title="<span style='color:black'>Car Price Distribution</span>",  # Title in black
    xaxis=dict(
        title="",  # Add x-axis label
        range=[0, 400000],  # Set x-axis range
        dtick=100000,  # Tick marks every 100,000
        tickformat=",d",  # Full number format (e.g., 100,000 instead of 100k)
        tickprefix="$",  # Add dollar sign ($100,000)
        showgrid=False,
        zeroline=True,
        zerolinewidth=2,
        zerolinecolor='#f7f7f7'
    ),
    yaxis=dict(
        title="",  # Add y-axis label
        showticklabels=False,
        showgrid=False
    ),
    plot_bgcolor='#F7F7F7',
    paper_bgcolor='#F7F7F7',
    font=dict(family="Hiragino Kaku Gothic Pro, sans-serif", size=14, color='#000000'),
    margin=dict(l=50, r=50, t=100, b=50)
)

# Show plot
fig.show()
Most cars are affordable, and only a few are luxury models. In this dataset, 84.5% of cars are priced under $91K, so luxury is the exception, not the norm.
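The share quoted above can be checked directly from the price column. The snippet below is a minimal sketch, assuming the `MSRP` column used throughout this notebook; the helper name `share_below` and the small `sample` frame are hypothetical stand-ins, so the printed value will not match the 84.5% figure.

```python
import pandas as pd

def share_below(prices: pd.Series, threshold: float) -> float:
    """Return the percentage of non-missing prices strictly below `threshold`."""
    prices = prices.dropna()
    return float((prices < threshold).mean() * 100)

# Hypothetical stand-in for df['MSRP']; swap in the real column to reproduce the figure.
sample = pd.DataFrame({"MSRP": [22000, 35000, 48000, 91000, 250000, None]})

pct = share_below(sample["MSRP"], 91_000)
print(f"{pct:.1f}% of cars are priced under $91K")
```

On the real data, `share_below(df['MSRP'], 91_000)` would be the call to compare against the percentage in the text.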
What’s the Relationship Between Horsepower and MSRP?
# Create scatter plot with red points
fig = px.scatter(
    df,
    x='Horsepower',
    y='MSRP',
    opacity=0.5,  # Adjust transparency
    color_discrete_sequence=['#E3120B'],  # Set all points to red
    title="<span style='color:black'>MSRP vs. Horsepower</span>",  # Title in black
    labels={'Horsepower': '', 'MSRP': ''}  # Remove axis labels
)

# Customize layout to remove x-axis line and gridlines
fig.update_layout(
    plot_bgcolor='#F5F4EF',  # Set plot background color
    paper_bgcolor='#F5F4EF',  # Set page background color
    xaxis=dict(
        showline=False,  # Remove x-axis line
        showgrid=False,  # Remove x-axis gridlines
        tickmode='auto',
        tickcolor='black',
        tickwidth=2,
        ticks="outside",
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black')
    ),
    yaxis=dict(
        side="right",  # Move y-axis to the right
        showgrid=False,  # Remove y-axis gridlines
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif"),
        zeroline=False  # Remove the thick zero line
    ),
    title=dict(
        font=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black')
    ),
    font=dict(family="Hiragino Kaku Gothic Pro, sans-serif"),
    margin=dict(l=50, r=50, t=100, b=50)
)

# Show plot
fig.show()
More Horsepower Means Higher Prices. There is a clear trend—cars with more horsepower tend to be more expensive. High-performance engines require advanced engineering, premium materials, and specialized components, driving up the cost. This makes sense, as luxury and sports cars prioritize power and speed.
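To put a number on the trend, one option is a rank correlation, which measures how consistently price rises with horsepower without assuming a straight-line relationship. This is a sketch and not part of the original analysis: the helper `rank_correlation` and the `sample` values are hypothetical, and only the column names `Horsepower` and `MSRP` come from the notebook.

```python
import pandas as pd

def rank_correlation(x: pd.Series, y: pd.Series) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    pair = pd.concat([x, y], axis=1).dropna()  # keep rows where both values exist
    ranks = pair.rank()
    return float(ranks.iloc[:, 0].corr(ranks.iloc[:, 1]))

# Hypothetical values standing in for df[['Horsepower', 'MSRP']].
sample = pd.DataFrame({
    "Horsepower": [120, 150, 200, 300, 450, 600],
    "MSRP": [18000, 24000, 31000, 52000, 95000, 210000],
})

rho = rank_correlation(sample["Horsepower"], sample["MSRP"])
print(f"Spearman rho: {rho:.2f}")
```

A value near 1 would indicate that more powerful cars are almost always more expensive; values closer to 0 would mean the scatter is too noisy to support that reading.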
How Do Car Prices Vary by Brand?
# Group by 'Make' and calculate median MSRP, then sort in ascending order
median_msrp_by_make = df.groupby('Make')['MSRP'].median().sort_values(ascending=True).reset_index()

# Create bar plot using Plotly
fig = px.bar(
    median_msrp_by_make,
    x='Make',
    y='MSRP',
    title="<span style='color:black'>MSRP by Car Make</span>",  # Title in black
    labels={'Make': '', 'MSRP': ''},
    color_discrete_sequence=['#E3120B']  # Set all bars to red
)

# Update layout with custom size, backgrounds, tick styles, font, and gridlines
fig.update_layout(
    plot_bgcolor='#F5F4EF',  # Set plot background color
    paper_bgcolor='#F5F4EF',  # Set page background color
    xaxis=dict(
        showline=True,  # Show the x-axis line
        linecolor='black',  # Set x-axis line color to black
        tickmode='auto',  # Ensure ticks are displayed
        tickcolor='black',  # Set tick color to black
        tickwidth=2,  # Set tick width to 2
        ticks="outside",  # Make ticks appear outside
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black'),  # Set x-axis text color
        range=[-0.7, len(median_msrp_by_make["Make"]) - 0]  # Properly shift left without hiding any bars
    ),
    yaxis=dict(
        side="right",  # Move y-axis to the right
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif"),  # Set y-axis tick font
        gridcolor='lightgray',  # Light gray gridlines
        showticklabels=True,  # Show y-axis values
        gridwidth=0.5,  # Subtle grid width
        griddash='solid',  # Solid gridlines
        zeroline=False  # Hide the thick zero line for a cleaner look
    ),
    title=dict(
        font=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black')  # Set title font and color
    ),
    font=dict(family="Hiragino Kaku Gothic Pro, sans-serif"),  # Set general font
    margin=dict(l=50, r=50, t=100, b=50)
)

# Show plot
fig.show()
Luxury brands like BMW and Mercedes-Benz command higher prices, but ultra-luxury brands such as Bentley and Aston Martin stand in a league of their own. Their exclusivity, craftsmanship, and brand prestige push their prices significantly higher compared to mainstream luxury cars.
Which Body Types Have the Highest and Lowest Prices?
# Calculate median MSRP for each Body Size category and sort in ascending order
median_values = df.groupby('Body Size')['MSRP'].median().sort_values(ascending=True)

# Convert 'Body Size' into a categorical variable with this custom order
df['Body Size'] = pd.Categorical(df['Body Size'], categories=median_values.index, ordered=True)

# Create the vertical box plot with a single color
fig = px.box(
    df,
    x='Body Size',
    y='MSRP',
    title="<span style='color:black'>MSRP by Body Size</span>",  # Title in black
    category_orders={'Body Size': median_values.index},  # Order from lowest to highest
    color_discrete_sequence=['#E3120B']  # Set all box plots to red
)

# Customize layout
fig.update_layout(
    showlegend=False,  # Remove legend
    paper_bgcolor="#F5F4EF",  # Set page background color
    plot_bgcolor="#F5F4EF",  # Set plot background color
    font=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color="black"),  # Set font style and color
    margin=dict(l=50, r=50, t=100, b=50),
    yaxis=dict(
        showgrid=True,  # Keep y-axis grid lines
        gridcolor="lightgray",  # Light gray grid lines
        gridwidth=1,  # Gridline thickness
        showline=False,  # Remove y-axis line
        zeroline=False,  # Remove only the zero line
        showticklabels=True,  # Keep y-axis tick labels
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif", size=12, color="black"),  # Set tick font
        title=None  # Remove y-axis label
    ),
    xaxis=dict(
        showgrid=False,  # Remove x-axis grid lines
        showline=False,  # Remove x-axis line
        ticks="outside",  # Keep x-axis ticks
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif", size=14, color="black"),  # Set tick font
        categoryorder='array',  # Ensure order from lowest to highest
        categoryarray=median_values.index,  # Apply custom category order
        title=None  # Remove x-axis label
    )
)

# Show the plot
fig.show()
Smaller vehicles, such as compact sedans and hatchbacks, are typically the most affordable. Midsize cars show more variability in pricing, often depending on features and trim levels. Large vehicles, including SUVs and luxury sedans, sit at the higher end due to their size, materials, and additional features.
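The variability mentioned above can be summarized numerically with the median and interquartile range (IQR) per body size, which are the same quantities a box plot draws. A minimal sketch, assuming the `Body Size` and `MSRP` columns; the category labels and prices in `sample` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df[['Body Size', 'MSRP']].
sample = pd.DataFrame({
    "Body Size": ["Compact"] * 3 + ["Midsize"] * 3 + ["Large"] * 3,
    "MSRP": [20000, 22000, 24000, 30000, 40000, 50000, 60000, 70000, 80000],
})

grouped = sample.groupby("Body Size")["MSRP"]
summary = pd.DataFrame({
    "median": grouped.median(),
    # IQR = 75th percentile minus 25th percentile (the height of each box)
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
}).sort_values("median")

print(summary)
```

A larger `iqr` for one category than for its neighbors would support the claim that its prices vary more.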
Which Drivetrain Offers the Best Value for Its Price?
# Calculate the median MSRP for each drivetrain and sort in ascending order
median_msrp = df.groupby("Drivetrain")["MSRP"].median().reset_index()
median_msrp = median_msrp.sort_values("MSRP", ascending=True)

# Create bar plot using Plotly
fig = px.bar(
    median_msrp,
    x="Drivetrain",
    y="MSRP",
    title="<span style='color:black'>MSRP by Drivetrain</span>",  # Title in black
    labels={'Drivetrain': '', 'MSRP': ''},
    color_discrete_sequence=['#E3120B']  # Set all bars to red
)

# Update layout with custom size, backgrounds, tick styles, font, and Economist-style gridlines
fig.update_layout(
    plot_bgcolor='#F5F4EF',  # Set plot background color
    paper_bgcolor='#F5F4EF',  # Set page background color
    xaxis=dict(
        showline=True,  # Show the x-axis line
        linecolor='black',  # Set x-axis line color to black
        tickmode='auto',  # Ensure ticks are displayed
        tickcolor='black',  # Set tick color to black
        tickwidth=2,  # Set tick width to 2
        ticks="outside",  # Make ticks appear outside
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black'),  # Set x-axis text to black
        range=[-0.7, len(median_msrp["Drivetrain"]) - 0]  # Properly shift left without hiding any bars
    ),
    yaxis=dict(
        side="right",  # Move y-axis to the right
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black'),  # Set y-axis tick font to black
        gridcolor='lightgray',  # Closer to The Economist’s gridline color
        showticklabels=True,  # Show y-axis values
        gridwidth=0.5,  # Set grid width to 0.5 for a subtle look
        griddash='solid',  # Make gridlines solid
        zeroline=False  # Hide the thick zero line for a cleaner look
    ),
    title=dict(
        font=dict(family="Hiragino Kaku Gothic Pro, sans-serif")  # Set title font
    ),
    font=dict(family="Hiragino Kaku Gothic Pro, sans-serif"),  # Set general font
    margin=dict(l=50, r=50, t=100, b=50)
)

# Show plot
fig.show()
All-wheel drive (AWD) and four-wheel drive (4WD) systems increase vehicle costs due to added components and engineering. In contrast, front-wheel drive (FWD) is more affordable and common in budget-friendly vehicles. The trade-off often involves performance and capability in different road conditions.
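One way to express the cost difference is as a premium over the front-wheel-drive median. This is a sketch under assumptions: the labels `FWD`, `AWD`, and `4WD` are assumed to match how the `Drivetrain` column is coded, and the prices in `sample` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df[['Drivetrain', 'MSRP']].
sample = pd.DataFrame({
    "Drivetrain": ["FWD", "FWD", "AWD", "AWD", "4WD", "4WD"],
    "MSRP": [20000, 30000, 45000, 55000, 35000, 40000],
})

medians = sample.groupby("Drivetrain")["MSRP"].median()

# Percentage by which each drivetrain's median price exceeds the FWD median.
premium_pct = (medians / medians["FWD"] - 1) * 100
print(premium_pct.sort_values().round(1))
```

Note that a price premium alone does not settle the "best value" question in the heading; that would also require a measure of what each drivetrain delivers.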
Is There a Link Between Transmission and Engine Power?
# Create the box plot
fig = px.box(
    df,
    x='Transmission',
    y='Horsepower',
    color='Transmission',
    color_discrete_map={'manual': '#E3120B', 'automatic': '#0057B8'},
    title="<span style='color:black'>Horsepower by Transmission Type</span>"  # Title in black
)

# Update layout to remove unwanted elements
fig.update_layout(
    paper_bgcolor="#F5F4EF",  # Background color
    plot_bgcolor="#F5F4EF",  # Background color
    font=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color="black"),  # Font style
    xaxis=dict(
        title=None,  # Remove x-axis label
        showline=False,  # Remove x-axis line
        ticks="outside",
        tickfont=dict(size=14, color="black")
    ),
    yaxis=dict(
        title=None,  # Remove y-axis label
        showline=False,  # Remove y-axis line
        zeroline=False,  # Remove zero line
        showticklabels=True,  # Show y-axis numbers
        tickfont=dict(size=14, color="black"),  # Adjust font size and color
        gridcolor="lightgray",  # Light gray grid lines
        gridwidth=1
    ),
    showlegend=False,  # Remove legend
    margin=dict(l=50, r=50, t=100, b=50)
)

# Show the plot
fig.show()
Automatic transmissions dominate the market, providing a broad spectrum of horsepower levels. Manual transmissions, often favored by enthusiasts, tend to be paired with mid-to-high horsepower engines, offering a more engaging driving experience.
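Both claims, market share and horsepower spread, can be read off two short aggregations. A minimal sketch, assuming the lowercase labels `automatic` and `manual` used in the color map of the plot code; the rows in `sample` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for df[['Transmission', 'Horsepower']].
sample = pd.DataFrame({
    "Transmission": ["automatic"] * 4 + ["manual"],
    "Horsepower": [150, 200, 300, 450, 280],
})

# Share of each transmission type, in percent of all rows.
share = sample["Transmission"].value_counts(normalize=True) * 100

# Horsepower spread per transmission type.
hp_range = sample.groupby("Transmission")["Horsepower"].agg(["min", "median", "max"])

print(share.round(1))
print(hp_range)
```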
Which Cylinder Type Delivers the Most Torque?
# Define the natural order of cylinders
natural_order = ['I3', 'I4', 'I5', 'I6', 'V6', 'V8', 'V10', 'V12', 'W12']

# Convert 'Cylinders' to a categorical type with this order
df['Cylinders'] = pd.Categorical(df['Cylinders'], categories=natural_order, ordered=True)

# Aggregate median torque per cylinder type
median_torque_by_cylinders = df.groupby('Cylinders')['Torque'].median().reset_index()

# Define custom hex color codes for each cylinder type
custom_colors = {
    'I3': '#EB6E64',
    'I4': '#EB6E64',
    'I5': '#EB6E64',
    'I6': '#EB6E64',
    'V6': '#EB6E64',
    'V8': '#EB6E64',
    'V10': '#EB6E64',
    'V12': '#EB6E64',
    'W12': '#E3120B'
}

# Create bar chart with Plotly
fig = px.bar(
    median_torque_by_cylinders,
    x='Cylinders',
    y='Torque',  # Corrected y-axis column
    title="Torque by Cylinder Type",
    labels={'Cylinders': 'Cylinders', 'Torque': 'Median Torque (Nm)'},
    color='Cylinders',  # Map colors to cylinder types
    color_discrete_map=custom_colors  # Use custom hex color codes
)

# Update layout to match previous style
fig.update_layout(
    plot_bgcolor='#F5F4EF',
    paper_bgcolor='#F5F4EF',
    xaxis=dict(
        title="Cylinders",
        showline=True,
        linecolor='black',
        tickmode='array',
        tickvals=list(range(len(natural_order))),  # Ensure correct ticks
        ticktext=natural_order,  # Ensure correct labels
        tickcolor='black',
        tickwidth=2,
        ticks="outside",
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black'),
        range=[-0.5, len(natural_order) - 0.5]  # Adjusted range
    ),
    yaxis=dict(
        title="",
        side="right",
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black'),
        gridcolor='lightgray',
        showticklabels=True,  # Now shows torque values
        gridwidth=0.5,
        griddash='solid',
        zeroline=False
    ),
    title=dict(
        font=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black')
    ),
    font=dict(family="Hiragino Kaku Gothic Pro, sans-serif"),
    margin=dict(l=50, r=50, t=100, b=50),
    showlegend=False  # Remove the legend
)

# Show plot
fig.show()
While larger engines generally produce more torque, the increase is not uniform across engine types. Median torque starts at 225 Nm for the I3 and rises gradually to 369 Nm for the I6. The jump is more pronounced among the larger engines: the V8 reaches 485 Nm, the V12 reaches 609.5 Nm, and the W12 tops both at 664 Nm. This suggests that engine design and configuration, such as cylinder arrangement, play a crucial role in determining torque, and not the number of cylinders alone.
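The size of each jump can be made explicit by differencing the medians. The sketch below uses only the five median values quoted in the paragraph above; the cylinder types not mentioned there (I4, I5, V6, V10) are omitted, so each step spans any skipped types and is not a comparison of adjacent categories.

```python
import pandas as pd

# Median torque values quoted in the text; the remaining cylinder types are not listed there.
median_torque = pd.Series(
    {"I3": 225.0, "I6": 369.0, "V8": 485.0, "V12": 609.5, "W12": 664.0}
)

# Absolute and relative change from the previous listed type.
step = median_torque.diff()
pct = step / median_torque.shift(1) * 100

print(pd.DataFrame({"median": median_torque, "step": step, "pct": pct.round(1)}))
```

Running the same two lines on the full `median_torque_by_cylinders` table from the plot code would give the step between every pair of adjacent cylinder types.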
Is Engine Aspiration the Key to Better Fuel Economy?
# Calculate the median for each Engine Aspiration category
median_values = df.groupby('Engine Aspiration')['Highway Fuel Economy'].median().reset_index()

# Sort the median values in ascending order
median_values = median_values.sort_values('Highway Fuel Economy', ascending=True)

# Create the bar chart with Plotly
fig = px.bar(
    median_values,
    x='Engine Aspiration',
    y='Highway Fuel Economy',
    title="<span style='color:black'>Highway Fuel Economy by Engine Aspiration</span>",  # Title in black
    labels={'Engine Aspiration': '', 'Highway Fuel Economy': ''},
    color_discrete_sequence=['#E3120B']  # Set all bars to red
)

# Update layout to match your previous style
fig.update_layout(
    plot_bgcolor='#F5F4EF',
    paper_bgcolor='#F5F4EF',
    xaxis=dict(
        showline=True,
        linecolor='black',
        tickmode='auto',
        tickcolor='black',
        tickwidth=2,
        ticks="outside",
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black'),
        range=[-0.7, len(median_values["Engine Aspiration"]) - 0]
    ),
    yaxis=dict(
        side="right",
        tickfont=dict(family="Hiragino Kaku Gothic Pro, sans-serif", color='black'),  # Ensure tick labels are readable
        gridcolor='lightgray',
        showticklabels=True,  # Show y-axis numbers
        gridwidth=0.5,
        griddash='solid',
        zeroline=False
    ),
    title=dict(
        font=dict(family="Hiragino Kaku Gothic Pro, sans-serif")
    ),
    font=dict(family="Hiragino Kaku Gothic Pro, sans-serif"),
    margin=dict(l=50, r=50, t=100, b=50)
)

# Show plot
fig.show()
Turbocharged and naturally aspirated engines show differences in fuel economy, but one insight stands out—electric vehicles outperform all traditional engines in efficiency. Their ability to maximize energy use makes them the best choice for reducing fuel costs and emissions.
Next Steps
Now, you can move to the next notebook, where I discuss Feature Engineering by exploring the data further, examining the methods I will use in my data pipeline, and building a baseline model.