TessX | Feature Selection - RFE
Pricing Analytics
1. Introduction
After engineering features and building a baseline model, the next step is feature selection. Not all features contribute equally to model performance. Some add valuable predictive power, while others introduce noise or redundancy. In this notebook, we will refine our dataset by selecting the most important features to improve accuracy and efficiency.
What We’ll Do in This Notebook
In this section, we will:
- Apply Recursive Feature Elimination (RFE) to systematically remove less important features.
- Identify the most predictive variables for our model.
- Improve model accuracy while reducing complexity.
Why Feature Selection is Important
Feature selection is a crucial step in machine learning. It helps:
- Improve model efficiency by reducing the number of input variables.
- Enhance generalization to new data by removing noisy or redundant inputs.
- Speed up training time and reduce computational costs.
By selecting only the most useful features, we ensure that our pricing model remains both powerful and interpretable.
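Before moving on, it helps to see the idea behind RFE in miniature. The sketch below is a conceptual illustration only (not scikit-learn's implementation, which we use later): fit an estimator, rank features by the magnitude of their coefficients, drop the weakest, and repeat until the desired number remains.

# Conceptual sketch of RFE (illustration only; assumes X is a NumPy feature matrix)
import numpy as np
from sklearn.linear_model import LinearRegression

def rfe_sketch(X, y, n_keep):
    remaining = list(range(X.shape[1]))                # indices of surviving features
    while len(remaining) > n_keep:
        model = LinearRegression().fit(X[:, remaining], y)
        weakest = int(np.argmin(np.abs(model.coef_)))  # least influential by |coefficient|
        remaining.pop(weakest)                         # eliminate it and refit on the rest
    return remaining                                   # indices of the selected features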
📌 Want to See Our Previous Notebook?
If you’d like to review how we handled missing values, performed univariate analysis, engineered categorical features, and built our baseline model, check out the previous notebook:
2. Feature Selection - RFE
import pandas as pd
import numpy as np
import morethemes as mt
from scipy import stats

# Set WSJ theme
mt.set_theme("wsj")

import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('cleaned_data.csv')

# Define numerical and categorical columns
numerical_columns = ['Horsepower', 'Torque', 'Highway Fuel Economy']
categorical_columns = ['Make', 'Year', 'Body Size', 'Body Style',
                       'Engine Aspiration', 'Drivetrain', 'Transmission', 'Cylinders']

X = data.drop("MSRP", axis=1)
y = data["MSRP"]
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Log-transform targets (after splitting the data)
y_train_log = np.log1p(y_train)
y_test_log = np.log1p(y_test)
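A quick aside on the log transform: MSRP is heavily right-skewed, so training on log1p(MSRP) keeps a handful of very expensive cars from dominating the fit, and expm1 inverts it exactly, which is why predictions can be converted back to dollars later. A tiny sanity check with made-up prices (illustrative only):

# Sanity check: expm1 exactly undoes log1p (example values, not from the dataset)
example_prices = np.array([20_000, 55_000, 250_000])
assert np.allclose(np.expm1(np.log1p(example_prices)), example_prices)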
# Define preprocessing pipelines
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Fill missing values with median
    ('scaler', StandardScaler())                    # Standardize numerical features
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing values before encoding
    ('encoder', OneHotEncoder(handle_unknown='ignore'))    # One-hot encode categorical features
])

# Combine both pipelines
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_columns),
    ('cat', categorical_pipeline, categorical_columns)
])
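Before wiring the preprocessor into the modelling pipeline, it can be worth checking how many columns it produces once the categorical variables are one-hot encoded; that count is the pool RFE will prune from. An optional check (the pipeline below refits the preprocessor anyway, so this has no side effects):

# Optional: inspect the number of features produced by preprocessing
X_train_prepared = preprocessor.fit_transform(X_train)
print("Features after preprocessing:", X_train_prepared.shape[1])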
# Create full pipeline with RFE and Linear Regression
# Step 1: Preprocess the data
# Step 2: Apply RFE to select the top N features
# Step 3: Fit the Linear Regression model on the selected features

# Define the number of features to select
n_features_to_select = 10  # You can adjust this based on your dataset

# Create the pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),  # Preprocess the data
    ('feature_selection', RFE(
        estimator=LinearRegression(),                # Base estimator for RFE
        n_features_to_select=n_features_to_select    # Number of features to select
    )),
    ('model', LinearRegression())  # Final model
])
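The value of 10 above is a manual choice. If you would rather let cross-validation pick the feature count, scikit-learn's RFECV runs the same elimination procedure but scores every candidate count and keeps the best one. A sketch of that variant, not used for the results in this notebook:

from sklearn.feature_selection import RFECV

# Alternative: cross-validated choice of the number of features
cv_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('feature_selection', RFECV(
        estimator=LinearRegression(),
        step=1,        # eliminate one feature per iteration
        cv=5,          # 5-fold cross-validation
        scoring='r2'
    )),
    ('model', LinearRegression())
])
# cv_pipeline.fit(X_train, y_train_log)
# print(cv_pipeline.named_steps['feature_selection'].n_features_)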
# Train the model using the log-transformed targets
model_pipeline.fit(X_train, y_train_log)

# Predictions (still on the log scale)
y_train_pred_log = model_pipeline.predict(X_train)
y_test_pred_log = model_pipeline.predict(X_test)

# Convert log-transformed predictions back to original scale
y_train_pred = np.expm1(y_train_pred_log)  # Inverse of log1p
y_test_pred = np.expm1(y_test_pred_log)    # Inverse of log1p
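Note that the metrics below are computed on the original dollar scale, which is what matters for pricing. If you also want to check the fit on the log scale the model was actually trained on, an optional one-line check is:

# Optional: goodness of fit on the log scale used for training
print("Log-scale R² (test):", r2_score(y_test_log, y_test_pred_log))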
# Evaluate the model on both training and test sets
def evaluate_model(y_true, y_pred, dataset_type="Test"):
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)

    print(f"\n📊 {dataset_type} Set Performance:")
    print(f"R² Score: {r2:.4f}")
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"MAE: {mae:.2f}")

    return r2, mse, rmse, mae

# Print metrics for both sets
train_metrics = evaluate_model(y_train, y_train_pred, "Training")
test_metrics = evaluate_model(y_test, y_test_pred, "Test")
# --- Plot results (on test set) ---
plt.figure(figsize=(8, 6))

# 1️⃣ Residual Plot
plt.subplot(2, 2, 1)
sns.residplot(x=y_test, y=y_test_pred, lowess=True, line_kws={"color": "red"})
plt.xlabel("Actual MSRP", fontsize=9)
plt.ylabel("Residuals", fontsize=9)
plt.title("Residual Plot", fontsize=10)

# 2️⃣ Predicted vs Actual
plt.subplot(2, 2, 2)
sns.scatterplot(x=y_test, y=y_test_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # 45-degree reference line
plt.xlabel("Actual MSRP", fontsize=9)
plt.ylabel("Predicted MSRP", fontsize=9)
plt.title("Predicted vs Actual", fontsize=10)

# 3️⃣ Distribution of Errors
plt.subplot(2, 2, 3)
sns.histplot(y_test - y_test_pred, bins=25, kde=True)
plt.xlabel("Prediction Error", fontsize=9)
plt.title("Error Distribution", fontsize=10)

# 4️⃣ QQ Plot
plt.subplot(2, 2, 4)
res = stats.probplot(y_test - y_test_pred, dist="norm", plot=plt)

# Recolor the QQ plot
plt.gca().get_lines()[0].set_color("#855c75")  # QQ plot points
plt.gca().get_lines()[1].set_color("red")      # Reference line

plt.title("QQ Plot of Errors", fontsize=10)
plt.xlabel("Theoretical Quantiles", fontsize=9)
plt.ylabel("Sample Quantiles", fontsize=9)

# Adjust layout
plt.tight_layout()
plt.subplots_adjust(hspace=0.4, wspace=0.3)  # Adjust spacing between plots
plt.show()
# --- Inspect Selected Features ---
# Get the feature names after preprocessing

# For numerical features
numerical_features = numerical_columns  # Numerical columns keep their original names

# For categorical features (after one-hot encoding)
categorical_transformer = model_pipeline.named_steps['preprocessor'].named_transformers_['cat']
categorical_features = categorical_transformer.named_steps['encoder'].get_feature_names_out(categorical_columns)

# Combine all feature names
all_feature_names = np.concatenate([numerical_features, categorical_features])

# Get the selected features from RFE
rfe_selector = model_pipeline.named_steps['feature_selection']
selected_features = rfe_selector.support_  # Boolean mask of selected features

# Map selected features to their names
selected_feature_names = all_feature_names[selected_features]

# Display the selected features
print("\n🔍 Selected Features by RFE:")
print(selected_feature_names)
📊 Training Set Performance:
R² Score: 0.9004
MSE: 304699841.70
RMSE: 17455.65
MAE: 9834.18
📊 Test Set Performance:
R² Score: 0.9035
MSE: 272072310.43
RMSE: 16494.61
MAE: 10156.11
🔍 Selected Features by RFE:
['Horsepower' 'Make_Aston Martin' 'Make_Bentley' 'Make_Ford' 'Make_Nissan'
'Body Style_Cargo Van' 'Body Style_Passenger Van' 'Transmission_manual'
'Cylinders_V10' 'Cylinders_V12']
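The boolean mask tells us which features survived, but the fitted RFE object also exposes a ranking_ attribute (1 = selected; larger numbers were eliminated earlier), which is useful for spotting features that just missed the cut. A short follow-up sketch using the variables defined above:

# Rank all features: 1 = selected by RFE, higher = eliminated earlier
feature_ranking = pd.DataFrame({
    'feature': all_feature_names,
    'ranking': rfe_selector.ranking_
}).sort_values('ranking')
print(feature_ranking.head(15))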
3. Summary
Feature Selection with RFE
To improve model simplicity and interpretability, Recursive Feature Elimination (RFE) was used to identify the most relevant predictors.
Model Performance After RFE (10 Features Selected)
Training Set
- R² Score: 0.9004
- MSE: 304,699,841.70
- RMSE: 17,455.65
- MAE: 9,834.18
Test Set
- R² Score: 0.9035
- MSE: 272,072,310.43
- RMSE: 16,494.61
- MAE: 10,156.11
Selected Features
- Horsepower
- Make (Aston Martin, Bentley, Ford, Nissan)
- Body Style (Cargo Van, Passenger Van)
- Transmission (Manual)
- Cylinders (V10, V12)
By reducing the number of features from 48 to 10, the model maintained strong predictive power, with only a slight decrease in R² from 0.93 to 0.90.
Challenges with the RFE Model
While the RFE-selected model is more streamlined, certain issues remain:
a. Limited Representation of Popular Car Brands
- Key brands such as BMW, Toyota, and Mercedes were not selected.
- This omission may lead to gaps in model predictions and impact user experience.
b. Imbalanced Transmission Data
- The model identified Manual Transmission as a significant predictor.
- However, 97% of the dataset consists of automatic cars, while only 3% are manual.
- This imbalance may introduce bias in transmission-related predictions.
c. Disproportionate Body Style Representation
- SUVs account for 29.8% of the dataset, whereas some body styles, such as Cargo Minivans (0.6%) and Wagons (0.9%), have very few samples.
- Features with limited representation may not generalize well in real-world predictions (the class proportions above can be verified with the quick check below).
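These proportions are easy to verify directly from the cleaned dataset; a quick check using the column names defined earlier:

# Quick check of the imbalances noted above
print(data['Transmission'].value_counts(normalize=True).round(3))
print(data['Body Style'].value_counts(normalize=True).round(3))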
4. What's Next
To refine the model, further feature selection will be conducted:
- Transmission will be removed to mitigate class imbalance.
- Body Style will be re-evaluated to assess its impact on performance.
Optimizing for Practical Use
Instead of using all RFE-selected features, the model will be retrained with only the most meaningful predictors that enhance accuracy while maintaining usability.
Final Feature Selection
✅ Features to Keep
- Make
- Horsepower
- Cylinders
❌ Features to Remove
- Transmission (Highly imbalanced)
- Body Style (Under evaluation)
5. Next Steps
Now that we have completed feature selection and gained key insights, it's time to refine our model further. I will retrain the model using Make, Horsepower, and Cylinders, and test whether accuracy holds up once Body Style is removed.
This approach ensures that the model remains accurate, interpretable, and user-friendly, while allowing users to find their car brand easily in the app.
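As a preview of that retraining step, the change is mostly a matter of narrowing the column lists fed into the same preprocessing pipeline. A rough sketch under those assumptions; the actual training and evaluation happens in the final notebook:

# Sketch only: reduced feature set planned for the final model
kept_numerical = ['Horsepower']
kept_categorical = ['Make', 'Cylinders']

reduced_preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, kept_numerical),
    ('cat', categorical_pipeline, kept_categorical)
])

reduced_pipeline = Pipeline([
    ('preprocessor', reduced_preprocessor),
    ('model', LinearRegression())
])
# reduced_pipeline.fit(X_train, y_train_log)   # trained and evaluated in the final notebook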
Moving to the Final Notebook
In the next and final notebook, we will train the optimized model with the selected features.
Now, let’s move on to training the final model!