import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import morethemes as mt
from scipy import stats
# Set WSJ theme
"wsj")
mt.set_theme(
= pd.read_csv('cleaned_data.csv') data
TessX | Baseline Model
Pricing Analytics
Feature Engineering & Baseline Model
Now that we’ve explored the data and gained insights, it’s time to prepare the dataset for modeling. In this section, we will focus on transforming the raw data into a format that improves model performance.
If you want to see the insights we discovered while exploring the data through questions, click here.
What We’ll Do:
- Univariate Analysis – Examine individual features to understand their distributions and spot potential issues.
- Missing Value Analysis – Identify and handle missing values to ensure a clean dataset.
- Target Variable Analysis – Analyze the target variable to detect patterns and ensure it’s well-defined for modeling.
- Categorical Encoding – Convert categorical variables into numerical format using One-Hot Encoding.
How This Fits into Our Pipeline
Every insight gained here will be incorporated into our preprocessing pipeline.
Load libraries & data
Univariate analysis
Code
# Define numerical columns
= ['Horsepower', 'Torque', 'Highway Fuel Economy']
numerical_cols
# Set figure size (Reduce height)
= plt.subplots(len(numerical_cols), 3, figsize=(8, 3 * len(numerical_cols)))
fig, axes
# Add General Title (Reduce font size)
"Univariate Analysis", fontsize=20, fontweight="bold")
fig.suptitle(
# Loop through each numerical column
for i, col in enumerate(numerical_cols):
# Histogram
=20, kde=True, ax=axes[i, 0])
sns.histplot(data[col].dropna(), bins0].set_title(f'Histogram of {col}', fontsize=10)
axes[i,
# QQ Plot
= stats.probplot(data[col].dropna(), dist="norm", plot=axes[i, 1])
res 1].get_lines()[0].set_color("#855c75")
axes[i, 1].get_lines()[1].set_color("red")
axes[i, 1].set_title(f'QQ Plot of {col}', fontsize=10)
axes[i,
# Box Plot
=data[col], ax=axes[i, 2])
sns.boxplot(x2].set_title(f'Box Plot of {col}', fontsize=10)
axes[i,
"Univariate Analysis", fontsize=20, fontweight="bold")
fig.suptitle(
# Adjust layout (More space for title)
=[0, 0, 1, 0.92]) # Reduce the top margin
plt.tight_layout(rect=0.3, hspace=0.6, top=0.85) # Adjust subplot spacing
plt.subplots_adjust(wspace
plt.show()
Missing value analysis
= data.isnull().sum() # Count missing values
missing_values = len(data) # Total number of rows
total_values = (missing_values / total_values) * 100 # Calculate percentage
missing_percentage
# Combine into a DataFrame for better visualization
= pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
missing_data = missing_data[missing_data['Missing Values'] > 0] # Show only columns with missing values
missing_data
print(missing_data)
Missing Values Percentage
Horsepower 5 0.310559
Torque 27 1.677019
Highway Fuel Economy 424 26.335404
Cylinders 165 10.248447
Code
# Assuming you have a DataFrame named 'data'
= data.isnull().sum() # Count missing values
missing_values = len(data) # Total number of rows
total_values = (missing_values / total_values) * 100 # Calculate percentage
missing_percentage
# Combine into a DataFrame for better visualization
= pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
missing_data = missing_data[missing_data['Missing Values'] > 0] # Show only columns with missing values
missing_data
# Plot the missing data
=(9, 4))
plt.figure(figsize'Percentage'])
plt.bar(missing_data.index, missing_data['Percentage of Missing Values')
plt.ylabel('Missing Values Percentage Per Column')
plt.title(
# Show plot
plt.show()
I will use median imputation for numerical columns because my univariate analysis shows that the numerical features are skewed. For categorical features, I will impute missing values using the most frequent category.
Target variable analysis
Code
# Set figure size (Reduced for better fit)
= plt.subplots(1, 3, figsize=(8, 4))
fig, axes
# Add General Title (Smaller font for better fit)
"Target Variable Analysis", fontsize=20, fontweight="bold")
fig.suptitle(
# Histogram
'MSRP'].dropna(), bins=30, kde=True, ax=axes[0])
sns.histplot(data[0].set_title("Histogram of MSRP", fontsize=10)
axes[
# QQ Plot
= stats.probplot(data['MSRP'].dropna(), dist="norm", plot=axes[1])
res 1].get_lines()[0].set_color("#855c75")
axes[1].get_lines()[1].set_color("red")
axes[1].set_title("QQ Plot of MSRP", fontsize=10)
axes[
# Box Plot
=data['MSRP'], ax=axes[2])
sns.boxplot(x2].set_title("Box Plot of MSRP", fontsize=10)
axes[
# Adjust layout (Reduce spacing)
=[0, 0, 1, 0.93])
plt.tight_layout(rect=0.4) # Adjust space between subplots
plt.subplots_adjust(wspace
plt.show()
Categorical encoding
Converting Categorical Variables to Numerical Format
To use categorical variables in our model, we need to convert them into a numerical format. We will apply One-Hot Encoding, which creates separate binary columns (0 or 1) for each category.
Example: One-Hot Encoding
Consider a categorical feature called “Car_Type” with three unique values: "Sedan"
, "SUV"
, and "Truck"
.
Car_Type | Sedan | SUV | Truck |
---|---|---|---|
Sedan | 1 | 0 | 0 |
SUV | 0 | 1 | 0 |
Truck | 0 | 0 | 1 |
SUV | 0 | 1 | 0 |
Each category is converted into its own column, where 1
indicates the presence of that category, and 0
means it’s absent.
This encoding method ensures that the model treats categorical variables properly without assuming any ordinal relationship between them.
I will try a log transformation for the target variable.
For categorical columns, I will try one-hot encoding.
I will add a standard scaler to the pipeline.
There are outliers in the independent variables, but I will not take action until I build and diagnose the model. However, I need to watch out for high fuel economy.
How log transformation will appear in the pipeline
Code
from feature_engine.transformation import LogTransformer
# Copy data to avoid modification
= data.copy()
data_transformed
# Apply Log Transformation
= LogTransformer(variables=["MSRP"])
lt_target = lt_target.fit_transform(data_transformed)
data_transformed
# Create figure with two rows (Before & After)
= plt.subplots(2, 3, figsize=(8, 5))
fig, axes
# Titles
"MSRP: Before and After Log Transformation", fontsize=20, fontweight='bold')
fig.suptitle(
# BEFORE Log Transformation
'MSRP'].dropna(), bins=30, kde=True, ax=axes[0, 0])
sns.histplot(data[0, 0].set_title("Histogram (Before)", fontsize=10)
axes[
# QQ Plot
= stats.probplot(data['MSRP'].dropna(), dist="norm", plot=axes[0, 1])
res 0, 1].get_lines()[0].set_color("#855c75")
axes[0, 1].get_lines()[1].set_color("red")
axes[0, 1].set_title("QQ Plot (Before)", fontsize=10)
axes[
=data['MSRP'], ax=axes[0, 2])
sns.boxplot(x0, 2].set_title("Box Plot (Before)", fontsize=10)
axes[
# AFTER Log Transformation
'MSRP'].dropna(), bins=30, kde=True, ax=axes[1, 0])
sns.histplot(data_transformed[1, 0].set_title("Histogram (After Log)", fontsize=10)
axes[
# QQ Plot (After Log Transformation)
= stats.probplot(data_transformed['MSRP'].dropna(), dist="norm", plot=axes[1, 1])
res 1, 1].get_lines()[0].set_color("#855c75")
axes[1, 1].get_lines()[1].set_color("red")
axes[1, 1].set_title("QQ Plot (After Log)", fontsize=10)
axes[
=data_transformed['MSRP'], ax=axes[1, 2])
sns.boxplot(x1, 2].set_title("Box Plot (After Log)", fontsize=10)
axes[
# Adjust layout
=[0, 0, 1, 0.95]) # Adjust title spacing
plt.tight_layout(rect=0.4, hspace=0.6) # Reduce gaps between plots
plt.subplots_adjust(wspace
plt.show()
Log Transformation: Handling Skewed Data
Log transformation is a technique used to reduce skewness in numerical data, making distributions more normal-like and improving model performance. It is especially useful when a feature has a long tail with extreme values.
Why Use Log Transformation?
- Reduces skewness in right-skewed data.
- Stabilizes variance, making patterns easier to detect.
- Improves model performance by handling large differences in scale.
Example: Before & After Log Transformation
Consider a feature called “Price”, which has highly skewed values:
Original Price ($) | Log-Transformed Price (log₁₀) |
---|---|
1,000 | 3.00 |
10,000 | 4.00 |
50,000 | 4.70 |
100,000 | 5.00 |
500,000 | 5.70 |
Here, applying log₁₀(x) compresses large values, making the distribution more balanced.
When to Use Log Transformation
- When a numerical feature is right-skewed.
- When the data has large variations in magnitude.
- When the feature contains positive values only (since log of negative values is undefined).
Formula:
For natural log (base e):
\[
x' = \log(x)
\]
For log base 10:
\[
x' = \log_{10}(x)
\]
Log transformation helps models learn better by making relationships more linear and reducing the impact of outliers.
Modeling - Baseline Model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
= pd.read_csv('cleaned_data.csv') data
# Define numerical and categorical columns
= ['Horsepower', 'Torque', 'Highway Fuel Economy']
numerical_columns = ['Make', 'Year', 'Body Size', 'Body Style',
categorical_columns 'Engine Aspiration', 'Drivetrain', 'Transmission','Cylinders']
= data.drop("MSRP", axis=1)
X = data["MSRP"] y
# Split into training and test sets (80% train, 20% test)
= train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test
# Log-transform targets (after splitting the data)
= np.log1p(y_train)
y_train_log = np.log1p(y_test)
y_test_log
# Define preprocessing pipeline
= Pipeline([
numerical_pipeline 'imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
(
])
= Pipeline([
categorical_pipeline 'imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
(
])
# Combine both pipelines
= ColumnTransformer([
preprocessor 'num', numerical_pipeline, numerical_columns),
('cat', categorical_pipeline, categorical_columns)
(
])
# Create the final pipeline with Linear Regression
= Pipeline([
model_pipeline 'preprocessor', preprocessor),
('model', LinearRegression())
(
])
# Train the model using the log-transformed targets
model_pipeline.fit(X_train, y_train_log)
# Predictions using the log-transformed targets
= model_pipeline.predict(X_train)
y_train_pred_log = model_pipeline.predict(X_test)
y_test_pred_log
# Convert log-transformed predictions back to original scale
= np.expm1(y_train_pred_log)
y_train_pred = np.expm1(y_test_pred_log)
y_test_pred
# Evaluate the model on both training and test sets
def evaluate_model(y_true, y_pred, dataset_type="Test"):
= r2_score(y_true, y_pred)
r2 = mean_squared_error(y_true, y_pred)
mse = np.sqrt(mse)
rmse = mean_absolute_error(y_true, y_pred)
mae
print(f"\n📊 {dataset_type} Set Performance:")
print(f"R² Score: {r2:.4f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
return r2, mse, rmse, mae
# Print metrics for both sets
= evaluate_model(y_train, y_train_pred, "Training")
train_metrics = evaluate_model(y_test, y_test_pred, "Test")
test_metrics
# --- Plot results (on test set) ---
=(8, 6))
plt.figure(figsize
# 1️⃣ Residual Plot
2, 2, 1)
plt.subplot(=y_test, y=y_test_pred, lowess=True, line_kws={"color": "red"})
sns.residplot(x"Actual MSRP", fontsize=9)
plt.xlabel("Residuals", fontsize=9)
plt.ylabel("Residual Plot", fontsize=10)
plt.title(
# 2️⃣ Predicted vs Actual
2, 2, 2)
plt.subplot(=y_test, y=y_test_pred, alpha=0.6)
sns.scatterplot(xmin(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.plot([y_test."Actual MSRP", fontsize=9)
plt.xlabel("Predicted MSRP", fontsize=9)
plt.ylabel("Predicted vs Actual", fontsize=10)
plt.title(
# 3️⃣ Distribution of Errors
2, 2, 3)
plt.subplot(- y_test_pred, bins=25, kde=True)
sns.histplot(y_test "Prediction Error", fontsize=9)
plt.xlabel("Error Distribution", fontsize=10)
plt.title(
# 4️⃣ QQ Plot
2, 2, 4)
plt.subplot(= stats.probplot(y_test - y_test_pred, dist="norm", plot=plt)
res 0].set_color("#855c75")
plt.gca().get_lines()[1].set_color("red")
plt.gca().get_lines()["QQ Plot of Errors", fontsize=10)
plt.title("Theoretical Quantiles", fontsize=9)
plt.xlabel("Sample Quantiles", fontsize=9)
plt.ylabel(
# Adjust layout
plt.tight_layout()=0.4, wspace=0.3) # Adjust spacing between plots
plt.subplots_adjust(hspace
plt.show()
📊 Training Set Performance:
R² Score: 0.9227
MSE: 236503194.57
RMSE: 15378.66
MAE: 8257.33
📊 Test Set Performance:
R² Score: 0.9339
MSE: 186273348.91
RMSE: 13648.20
MAE: 8072.74
Code
# --- Show All Features ---
= numerical_columns
numerical_features
= model_pipeline.named_steps['preprocessor'].named_transformers_['cat']
categorical_transformer = categorical_transformer.named_steps['encoder'].get_feature_names_out(categorical_columns)
categorical_features
= np.concatenate([numerical_features, categorical_features])
all_feature_names
# --- Feature Importance ---
= model_pipeline.named_steps['model']
model = model.coef_
feature_importance
# Combine feature names with their importance
= dict(zip(all_feature_names, feature_importance))
feature_importance_dict
# Sort features by absolute importance (highest to lowest)
= sorted(feature_importance_dict.items(), key=lambda x: abs(x[1]), reverse=True)
sorted_features
# Extract feature names and importance values
= zip(*sorted_features)
features, importance_values
# Create bar plot
=(6.9, 9))
plt.figure(figsize
plt.barh(features, importance_values)"Feature Importance")
plt.xlabel("Features")
plt.ylabel("Feature Importance (Sorted)")
plt.title(# Invert y-axis to show highest importance at the top
plt.gca().invert_yaxis()
plt.show()
Model Performance
The model’s performance on both the training and test sets is as follows:
Training Set:
- R² Score: 0.9227
- MSE: 236,503,182.49
- RMSE: 15,378.66
- MAE: 8,257.33
Test Set:
- R² Score: 0.9339
- MSE: 186,273,410.20
- RMSE: 13,648.20
- MAE: 8,072.74
At first glance, the model appears to generalize well. However, it is unusual for the test set to perform better than the training set.
Investigating the unexpected test performance
Typically, a model performs slightly worse on the test set because it encounters new, unseen data. When the test set outperforms the training set, potential causes include:
- Small Dataset Bias: If the dataset is small, the train-test split may not be fully representative, and a fortunate split could make the test set easier to predict.
- Solution: Use k-fold cross-validation to obtain a more reliable estimate of model performance.
This discrepancy is not necessarily a major issue, but it presents an opportunity to refine the model and ensure its robustness.
Keeping the model efficient
Currently, the model uses 48 features to predict car prices. However, not all features contribute equally to predictive accuracy. Some have a strong impact on pricing, while others provide little to no value.
Features with high impact:
- Make_Bentley (0.7065): Luxury brands significantly influence pricing.
- Cylinders_V10 (0.6165): A higher number of cylinders generally leads to higher prices.
Features with minimal impact:
- Drivetrain_AWD (-0.0028): Little to no effect on pricing.
- Body Size_Midsize (0.0021): Negligible influence.
Including unnecessary features can overcomplicate the model and reduce its stability. To address this, we will apply Recursive Feature Elimination (RFE).
Next Steps
Rather than manually selecting features, RFE will:
- Systematically remove the least useful features.
- Retain only features that significantly improve predictions.
- Enhance model accuracy while reducing complexity.
At this stage, the model serves as a baseline. The next step is to optimize the feature set, ensuring the pricing tool remains both powerful and efficient.
Go to the next notebook to continue: Click here