TessX | Feature Selection - RFE

Pricing Analytics

Author

Uzomah Teslim

Published

March 17, 2025

1. Introduction

After engineering features and building a baseline model, the next step is feature selection. Not all features contribute equally to model performance. Some add valuable predictive power, while others introduce noise or redundancy. In this notebook, we will refine our dataset by selecting the most important features to improve accuracy and efficiency.


What We’ll Do in This Notebook

In this section, we will:

  • Apply Recursive Feature Elimination (RFE) to systematically remove less important features (a short illustrative sketch follows this list).
  • Identify the most predictive variables for our model.
  • Improve model accuracy while reducing complexity.
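
Before applying RFE to the pricing data, here is a minimal, self-contained sketch of how RFE behaves, using a synthetic dataset from sklearn.datasets.make_regression. The feature counts and settings below are purely illustrative: the selector fits the estimator, drops the weakest feature, and repeats until the requested number of features remains.

from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, only 4 of which carry signal (illustrative only)
X_toy, y_toy = make_regression(n_samples=200, n_features=8, n_informative=4,
                               noise=10, random_state=0)

# Keep the 4 strongest features: RFE refits the linear model repeatedly,
# removing the weakest remaining feature on each pass
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4, step=1)
rfe.fit(X_toy, y_toy)

print("Selected mask:", rfe.support_)   # True for the features that survive
print("Ranking:", rfe.ranking_)         # 1 = selected; higher = eliminated earlier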

Why Feature Selection is Important

Feature selection is a crucial step in machine learning. It helps:

  • Improve model efficiency by reducing the number of input variables.
  • Enhance generalization to new data by removing noisy or redundant inputs the model could otherwise overfit to.
  • Speed up training time and reduce computational costs.

By selecting only the most useful features, we ensure that our pricing model remains both powerful and interpretable.


📌 Want to See Our Previous Notebook?

If you’d like to review how we handled missing values, performed univariate analysis, engineered categorical features, and built our baseline model, check out the previous notebook:

➡️ Feature Engineering & Baseline Model

2. Feature Selection - RFE

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import pandas as pd
import numpy as np
import morethemes as mt
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set WSJ theme
mt.set_theme("wsj")

import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv('cleaned_data.csv')
# Define numerical and categorical columns
numerical_columns = ['Horsepower', 'Torque', 'Highway Fuel Economy']
categorical_columns = ['Make', 'Year', 'Body Size', 'Body Style', 
                        'Engine Aspiration', 'Drivetrain', 'Transmission','Cylinders']
X = data.drop("MSRP", axis=1)
y = data["MSRP"]

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Log-transform targets (after splitting the data)
y_train_log = np.log1p(y_train)
y_test_log = np.log1p(y_test)
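# Note: log1p(x) = log(1 + x) tames the right-skewed MSRP target so that very
# expensive cars do not dominate the fit; np.expm1 reverses it further down,
# so the error metrics are reported back in dollars.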

# Define preprocessing pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Fill missing values with median
    ('scaler', StandardScaler())  # Standardize numerical features
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fill missing values before encoding
    ('encoder', OneHotEncoder(handle_unknown='ignore'))  # One-hot encode categorical features
])

# Combine both pipelines
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_columns),
    ('cat', categorical_pipeline, categorical_columns)
])
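# The ColumnTransformer concatenates its outputs in the order listed above:
# scaled numerical columns first, then one-hot encoded categorical columns.
# That ordering is what lets us map RFE's boolean mask back to feature names later.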

# Create full pipeline with RFE and Linear Regression
# Step 1: Preprocess the data
# Step 2: Apply RFE to select the top N features
# Step 3: Fit the Linear Regression model on the selected features

# Define the number of features to select
n_features_to_select = 10  # You can adjust this based on your dataset

# Create the pipeline
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),  # Preprocess the data
    ('feature_selection', RFE(
        estimator=LinearRegression(),  # Base estimator for RFE
        n_features_to_select=n_features_to_select  # Number of features to select
    )),
    ('model', LinearRegression())  # Final model
])
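# Inside the pipeline, RFE repeatedly fits the base LinearRegression, drops the
# feature with the smallest-magnitude coefficient (one per iteration by default),
# and stops once only n_features_to_select features remain; the final 'model'
# step is then trained on that reduced feature set.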

# Train the model using the log-transformed targets
model_pipeline.fit(X_train, y_train_log)

# Predictions using the log-transformed targets
y_train_pred_log = model_pipeline.predict(X_train)
y_test_pred_log = model_pipeline.predict(X_test)

# Convert log-transformed predictions back to original scale
y_train_pred = np.expm1(y_train_pred_log)  # Inverse of log1p
y_test_pred = np.expm1(y_test_pred_log)  # Inverse of log1p

# Evaluate the model on both training and test sets
def evaluate_model(y_true, y_pred, dataset_type="Test"):
    r2 = r2_score(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    
    print(f"\n📊 {dataset_type} Set Performance:")
    print(f"R² Score: {r2:.4f}")
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"MAE: {mae:.2f}")
    return r2, mse, rmse, mae

# Print metrics for both sets
train_metrics = evaluate_model(y_train, y_train_pred, "Training")
test_metrics = evaluate_model(y_test, y_test_pred, "Test")




# --- Plot results (on test set) ---
plt.figure(figsize=(8, 6))
# 1️⃣ Residual Plot
plt.subplot(2, 2, 1)
sns.residplot(x=y_test, y=y_test_pred, lowess=True, line_kws={"color": "red"})
plt.xlabel("Actual MSRP", fontsize=9)
plt.ylabel("Residuals", fontsize=9)
plt.title("Residual Plot", fontsize=10)

# 2️⃣ Predicted vs Actual
plt.subplot(2, 2, 2)
sns.scatterplot(x=y_test, y=y_test_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')  # 45-degree reference line
plt.xlabel("Actual MSRP", fontsize=9)
plt.ylabel("Predicted MSRP", fontsize=9)
plt.title("Predicted vs Actual", fontsize=10)

# 3️⃣ Distribution of Errors
plt.subplot(2, 2, 3)
sns.histplot(y_test - y_test_pred, bins=25, kde=True)
plt.xlabel("Prediction Error", fontsize=9)
plt.title("Error Distribution", fontsize=10)

# 4️⃣ QQ Plot
plt.subplot(2, 2, 4)
res = stats.probplot(y_test - y_test_pred, dist="norm", plot=plt)

# Recolor the QQ plot points and the reference line
plt.gca().get_lines()[0].set_color("#855c75")  # QQ plot points
plt.gca().get_lines()[1].set_color("red")  # Reference line

plt.title("QQ Plot of Errors", fontsize=10)
plt.xlabel("Theoretical Quantiles", fontsize=9)
plt.ylabel("Sample Quantiles", fontsize=9)


# Adjust layout
plt.tight_layout()
plt.subplots_adjust(hspace=0.4, wspace=0.3)  # Adjust spacing between plots
plt.show()


# --- Inspect Selected Features ---
# Get the feature names after preprocessing
# For numerical features
numerical_features = numerical_columns  # Scaling does not rename the numerical columns

# For categorical features (after one-hot encoding)
categorical_transformer = model_pipeline.named_steps['preprocessor'].named_transformers_['cat']
categorical_features = categorical_transformer.named_steps['encoder'].get_feature_names_out(categorical_columns)

# Combine all feature names
all_feature_names = np.concatenate([numerical_features, categorical_features])

# Get the selected features from RFE
rfe_selector = model_pipeline.named_steps['feature_selection']
selected_features = rfe_selector.support_  # Boolean mask of selected features

# Map selected features to their names
selected_feature_names = all_feature_names[selected_features]

# Display the selected features
print("\n🔍 Selected Features by RFE:")
print(selected_feature_names)

📊 Training Set Performance:
R² Score: 0.9004
MSE: 304699841.70
RMSE: 17455.65
MAE: 9834.18

📊 Test Set Performance:
R² Score: 0.9035
MSE: 272072310.43
RMSE: 16494.61
MAE: 10156.11


🔍 Selected Features by RFE:
['Horsepower' 'Make_Aston Martin' 'Make_Bentley' 'Make_Ford' 'Make_Nissan'
 'Body Style_Cargo Van' 'Body Style_Passenger Van' 'Transmission_manual'
 'Cylinders_V10' 'Cylinders_V12']
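
One open question above is the choice of n_features_to_select = 10, which was set by hand. Below is a hedged sketch of how that number could instead be chosen by cross-validation with scikit-learn's RFECV, reusing the same preprocessor and log-transformed target. The cv, scoring, and minimum-features settings are illustrative assumptions, not the configuration used in this project.

from sklearn.feature_selection import RFECV

# Sketch: let cross-validation choose how many features to keep (illustrative settings)
rfecv_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('rfecv', RFECV(
        estimator=LinearRegression(),
        step=1,                     # drop one feature per iteration
        min_features_to_select=5,   # assumed lower bound
        cv=5,                       # assumed 5-fold cross-validation
        scoring='r2'                # score candidate subsets by R²
    ))
])

rfecv_pipeline.fit(X_train, y_train_log)
print("CV-selected number of features:", rfecv_pipeline.named_steps['rfecv'].n_features_)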

3. Summary

Feature Selection with RFE

To improve model simplicity and interpretability, Recursive Feature Elimination (RFE) was used to identify the most relevant predictors.

Model Performance After RFE (10 Features Selected)

Training Set

  • R² Score: 0.9004
  • MSE: 304,699,841.70
  • RMSE: 17,455.65
  • MAE: 9,834.18

Test Set

  • R² Score: 0.9035
  • MSE: 272,072,310.43
  • RMSE: 16,494.61
  • MAE: 10,156.11

Selected Features

  • Horsepower
  • Make (Aston Martin, Bentley, Ford, Nissan)
  • Body Style (Cargo Van, Passenger Van)
  • Transmission (Manual)
  • Cylinders (V10, V12)

By reducing the number of features from 48 to 10, the model maintained strong predictive power, with only a slight decrease in R² from 0.93 to 0.90.

Challenges with the RFE Model

While the RFE-selected model is more streamlined, certain issues remain:

a. Imbalanced Transmission Data

  • The model identified Manual Transmission as a significant predictor.
  • However, 97% of the dataset consists of automatic cars, while only 3% are manual.
  • This imbalance may introduce bias in transmission-related predictions.

b. Disproportionate Body Style Representation

  • SUVs account for 29.8% of the dataset, whereas some body styles, such as Cargo Minivans (0.6%) and Wagons (0.9%), have very few samples.
  • Features with limited representation may not generalize well in real-world predictions (a quick way to verify these category shares is sketched below).
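
As a quick sanity check on the shares mentioned above, the category proportions can be inspected directly on the cleaned dataset (assuming the data DataFrame and column names loaded earlier in this notebook):

# Share of each category, as proportions of the dataset
print(data['Transmission'].value_counts(normalize=True).round(3))
print(data['Body Style'].value_counts(normalize=True).round(3))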

4. What Next

To refine the model, further feature selection will be conducted:

  • Transmission will be removed to mitigate class imbalance.
  • Body Style will be re-evaluated to assess its impact on performance.

Optimizing for Practical Use

Instead of using all RFE-selected features, the model will be retrained with only the most meaningful predictors that enhance accuracy while maintaining usability (a sketch of this reduced setup follows the feature lists below).

Final Feature Selection

Features to Keep
- Make
- Horsepower
- Cylinders

Features to Remove
- Transmission (Highly imbalanced)
- Body Style (Under evaluation)
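
For reference, here is a sketch of how the reduced feature set could be plugged back into the same preprocessing and modeling pipeline defined earlier; the actual retraining happens in the next notebook, and its final configuration may differ.

# Sketch: rebuild the pipeline with only the retained predictors
reduced_numerical = ['Horsepower']
reduced_categorical = ['Make', 'Cylinders']

reduced_preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, reduced_numerical),
    ('cat', categorical_pipeline, reduced_categorical)
])

reduced_model = Pipeline([
    ('preprocessor', reduced_preprocessor),
    ('model', LinearRegression())
])

# Fit on the log-transformed target, as before, and check test R² on the log scale
reduced_model.fit(X_train, y_train_log)
print(f"Reduced-model test R² (log scale): {reduced_model.score(X_test, y_test_log):.4f}")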

5. Next Steps

Now that we have completed feature selection and gained key insights, it’s time to refine the model further. I will retrain it using Make, Horsepower, and Cylinders, and test whether accuracy holds up once Body Style is removed.

This approach ensures that the model remains accurate, interpretable, and user-friendly, while allowing users to find their car brand easily in the app.

Moving to the Final Notebook

In the next and final notebook, we will train the optimized model with the selected features.


Now, let’s move on to training the final model!

Click here to continue