ML in Python Part 3 - Advanced Regression with Scikit-learn
In this third part of our series, we’ll explore more sophisticated machine learning techniques using Scikit-learn. While Parts 1 and 2 focused on classification, we’ll now tackle regression problems and learn how to build complex preprocessing pipelines. We’ll use the Ames Housing dataset (the 'house_prices' dataset on OpenML) to demonstrate these concepts.
Note: The purpose of this post is to highlight the flexibility and capabilities of scikit-learn’s advanced features. Therefore, this tutorial focuses on introducing you to those advanced routines rather than creating the optimal regression model.
Why Advanced Preprocessing?
Real-world data rarely comes in a clean, ready-to-use format. Data scientists often spend more time preparing data than training models. Common challenges include:
- Missing values that need imputation
- Categorical features requiring encoding
- Features with different scales
- Redundant or irrelevant features
- Non-linear relationships between variables
Scikit-learn provides powerful tools to handle these challenges systematically. Let’s see how to combine them effectively into a preprocessing pipeline that can handle all these issues automatically.
As always, first, let’s import the scientific Python packages we need.
# Standard scientific Python imports
import numpy as np
import matplotlib.pyplot as plt
1. Load Dataset
The Ames Housing dataset describes residential homes in Ames, Iowa. It’s a perfect dataset for demonstrating advanced preprocessing because it includes:
- Both numerical and categorical features
- Missing values that need handling
- Features on different scales
- Complex relationships between variables
The dataset itself contains information about the houses, including features like total area, lot shape, neighborhood, overall quality, year built, etc. The target feature that we would like to predict is the SalePrice.
Let’s load the data and take a look:
# Load the Ames Housing dataset ('house_prices' on OpenML)
from sklearn import datasets
housing = datasets.fetch_openml(name='house_prices', as_frame=True)
# Extract feature matrix X and target vector y
X = housing['data'].drop(columns='Id')
y = housing['target']
print(f"Dimension of X: {X.shape}\nDimension of y: {y.shape}")
Dimension of X: (1460, 79)
Dimension of y: (1460,)
As you can see, we have 1460 samples (houses), each containing 79 features (i.e. characteristics). Let’s examine the first few entries to better understand our data:
# Show first few entries and columns of the dataset
X.iloc[:5, :5]
|   | MSSubClass | MSZoning | LotFrontage | LotArea | Street |
|---|---|---|---|---|---|
| 0 | 60.0 | RL | 65.0 | 8450.0 | Pave |
| 1 | 20.0 | RL | 80.0 | 9600.0 | Pave |
| 2 | 60.0 | RL | 68.0 | 11250.0 | Pave |
| 3 | 70.0 | RL | 60.0 | 9550.0 | Pave |
| 4 | 60.0 | RL | 84.0 | 14260.0 | Pave |
If we look closer at the feature matrix X, we can see that of those 79 features, 36 are numerical (int or float) and 43 are of type ‘object’ (i.e. categorical features), and that some entries are missing. Plus, the target feature SalePrice has a right-skewed value distribution.
Therefore, if possible, our pipeline should be able to handle all of these peculiarities. Even better, let’s try to set up a pipeline that helps us find the optimal way to preprocess this dataset.
2. Feature Analysis
Before building our pipeline, let’s understand what we’re working with:
# Quick overview of feature types
print("Feature types:")
print(X.dtypes.value_counts())
# Check for missing values
missing_values = X.isnull().sum()
print("\nFeatures with missing values:")
print(missing_values[missing_values > 0].sort_values(ascending=False))
Feature types:
object 43
int64 33
float64 3
Name: count, dtype: int64
Features with missing values:
PoolQC 1453
MiscFeature 1406
Alley 1369
Fence 1179
FireplaceQu 690
LotFrontage 259
GarageType 81
GarageYrBlt 81
GarageFinish 81
GarageQual 81
GarageCond 81
BsmtExposure 38
BsmtFinType2 38
BsmtFinType1 37
BsmtCond 37
BsmtQual 37
MasVnrArea 8
MasVnrType 8
Electrical 1
dtype: int64
Let’s also visualize the distribution of the target variable:
# Analyze target variable distribution
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8, 4))
sns.histplot(y, bins=50)
plt.title("Distribution of House Prices")
plt.xlabel("Price")
plt.show()
This analysis reveals several important preprocessing needs:
- We have both numerical (float) and categorical (object) features
- Several features have missing values
- Our target variable (house prices) is right-skewed, as the quick check below confirms
- Features are on very different scales (e.g., year built vs. lot area)
These insights will guide our pipeline design.
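To make the skew claim concrete, here is a quick optional check (a small sketch, assuming y is the pandas Series returned by fetch_openml). It prints the skewness of the raw prices and of their log1p transform, which we will exploit later in the pipeline.
# Optional: quantify the right skew of the target and the effect of a log transform
print(f"Skewness of SalePrice: {y.skew():.2f}")
print(f"Skewness of log1p(SalePrice): {np.log1p(y).skew():.2f}")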
3. Split data into train and test set
As always, let’s first go ahead and split the dataset into train and test set.
from sklearn.model_selection import train_test_split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
4. Building the Pipeline
One of Scikit-learn’s most powerful features is its Pipeline API. We’ll create a pipeline that:
- Handles missing values differently for numerical and categorical features
- Applies appropriate scaling to numerical features
- Properly encodes categorical features
- Optionally reduces dimensionality
- Fits our chosen regression model
So let’s setup a pipeline that performs these different pre-processing routines: Transformation of categorical data to numerical data, data imputer for missing values, data scaling, potential dimensionality reduction, etc.
Handling categorical data
First, let’s create a small pipeline that takes categorical data, fills missing values with 'missing'
and
than applies one-hot encoding on these categorical features.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
categorical_preprocessor = Pipeline(
[
('imputer_cat', SimpleImputer(fill_value='missing', strategy='constant')),
('onehot', OneHotEncoder(handle_unknown='ignore')),
]
)
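As a quick optional sanity check (not part of the original workflow), we can run this small pipeline on the categorical columns of the training set and inspect the shape of the encoded output:
# Illustrative check only: impute and one-hot encode the categorical training
# columns, then look at how many binary columns the encoding produces
X_tr_cat = X_tr.select_dtypes(exclude='number')
X_tr_cat_encoded = categorical_preprocessor.fit_transform(X_tr_cat)
print(X_tr_cat_encoded.shape)  # (n_train_samples, n_one_hot_columns)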
Handling numerical data
To handle numerical data we will use a slightly more advanced processing pipeline (to showcase some scikit-learn features, not because it’s the best thing to do). So let’s first fill missing values with e.g. the mean of the feature, potentially apply a polynomial expansion to model non-linear relationships, apply a scaler, and then potentially apply dimensionality reduction via PCA and/or by selecting only the “most relevant” features.
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression, mutual_info_regression
from sklearn.pipeline import FeatureUnion
# FeatureUnion performs both processing steps in parallel and combines the features
dim_reduction = FeatureUnion(
[
('pca', PCA()),
('feat_selecter', SelectKBest()),
]
)
from sklearn.preprocessing import (PolynomialFeatures, StandardScaler,
RobustScaler, PowerTransformer)
# Package all relevant preprocessing routines for numerical data into one pipeline
numeric_preprocessor = Pipeline(
[
('imputer_numeric', SimpleImputer(missing_values=np.nan, strategy='mean')),
('polytrans', PolynomialFeatures()),
('scaler', StandardScaler()),
('dim_reduction', dim_reduction),
]
)
Combining preprocessing pipelines
Now that we have a preprocessing pipeline for the categorical and numerical features, let’s combine them into one preprocessing pipeline.
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
[
('numerical', numeric_preprocessor, X.select_dtypes('number').columns),
('categorical', categorical_preprocessor, X.select_dtypes(exclude='number').columns),
],
remainder='passthrough',
)
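Before adding a model, we can optionally verify that the combined preprocessor runs end to end. Note that the target has to be passed along, because the SelectKBest step scores features against y (this is just a sketch and not required for the rest of the tutorial):
# Illustrative end-to-end check of the preprocessing alone; y is needed here
# because SelectKBest inside the numerical branch requires the target
X_tr_prep = preprocessor.fit_transform(X_tr, y_tr)
print(X_tr_prep.shape)  # (n_train_samples, n_preprocessed_features)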
Add regression model
After the data is preprocessed we want to hand it over to a regression estimator. For this purpose, let’s choose a ridge regression.
from sklearn.linear_model import Ridge
ridge = Ridge()
pipe = Pipeline(steps=[
('preprocessor', preprocessor),
('ridge', ridge),
])
As such, the pipeline would be finished. But because we know that our target feature SalePrice is right-skewed, we should ideally apply a log transformation before fitting the model. Instead of doing this transformation manually (and reverting it at the end), we can use scikit-learn’s TransformedTargetRegressor to do it on the fly.
from sklearn.compose import TransformedTargetRegressor
regressor = TransformedTargetRegressor(
regressor=pipe, func=np.log1p, inverse_func=np.expm1
)
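To make the effect of this target transformation concrete, here is a tiny illustration with made-up prices (these values are not taken from the dataset):
# log1p compresses the long right tail before fitting, expm1 maps the model's
# predictions back to the original price scale (made-up example values)
example_prices = np.array([50_000.0, 200_000.0, 750_000.0])
log_prices = np.log1p(example_prices)    # what the ridge model is fitted on
back_transformed = np.expm1(log_prices)  # equal to example_prices again
print(log_prices, back_transformed)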
5. Parameter grid
Before training our model, we should also define a parameter grid that allows us to fine-tune the preprocessing and model parameters. Given our complex routine, we actually have a lot of parameters that we can play around with.
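If you are unsure which parameter names the composed estimator exposes, you can list them with get_params(). This optional check is a handy way to find the exact keys used in the grid below:
# Print a few of the nested parameter names of the composed estimator; every
# key in the parameter grid has to match one of these names exactly
all_param_names = sorted(regressor.get_params().keys())
print([p for p in all_param_names if p.endswith('alpha') or 'strategy' in p])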
from sklearn.model_selection import ParameterGrid
# Shorten key identifier by separating common prefix
prefix = 'regressor__preprocessor__'
# Create parameter grid
param_grid = {
# Explore imputers
f'{prefix}numerical__imputer_numeric__add_indicator': [True, False],
f'{prefix}numerical__imputer_numeric__strategy': [
'mean', 'median', 'most_frequent', 'constant'],
f'{prefix}categorical__imputer_cat__add_indicator': [True, False],
f'{prefix}categorical__imputer_cat__strategy': ['most_frequent', 'constant'],
# Explore numerical preprocessors
f'{prefix}numerical__polytrans__degree': [1, 2],
f'{prefix}numerical__polytrans__interaction_only': [False, True],
f'{prefix}numerical__dim_reduction__pca': ['drop', PCA(0.9), PCA(0.99)],
f'{prefix}numerical__dim_reduction__feat_selecter__k': [5, 25, 100, 'all'],
f'{prefix}numerical__dim_reduction__feat_selecter__score_func': [
f_regression, mutual_info_regression],
# Explore scalers
f'{prefix}numerical__scaler': [StandardScaler(), RobustScaler(), PowerTransformer()],
# Explore regressor
'regressor__ridge__alpha': np.logspace(-5, 5, 11),
}
print(len(ParameterGrid(param_grid)))
101376
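This number is simply the product of the number of options per parameter, which we can verify with a quick sanity check on the param_grid defined above:
# The size of a full grid is the product of the number of values per parameter
n_combinations = np.prod([len(values) for values in param_grid.values()])
print(n_combinations)  # matches len(ParameterGrid(param_grid))
101376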
As you can see, we have more than 100,000 different parameter combinations that we could explore. Using a GridSearchCV routine and checking all of them individually would take far too much time. Luckily, scikit-learn also provides a RandomizedSearchCV routine, with which you can randomly explore a subset of these parameter combinations.
Furthermore, both GridSearchCV and RandomizedSearchCV allow you to change the performance metric with which the model is scored. So let’s take 'neg_mean_absolute_percentage_error' (see scikit-learn’s documentation on scoring metrics for more details).
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(
regressor,
param_grid,
n_iter=250,
refit=True,
cv=2,
return_train_score=True,
n_jobs=-1,
verbose=1,
scoring='neg_mean_absolute_percentage_error',
)
6. Train model
Everything is ready, so let’s go ahead and train the model.
res = random_search.fit(X_tr, y_tr)
Fitting 2 folds for each of 250 candidates, totalling 500 fits
7. Performance investigation after RandomizedSearchCV
Once the search has evaluated the fixed number of sampled grid points, we can look at their performance. The easiest way is to put everything into a pandas DataFrame and sort the entries by the best test score.
import pandas as pd
df_res = pd.DataFrame(res.cv_results_)
df_res = df_res.iloc[:, ~df_res.columns.str.contains('time|split[0-9]*|rank|params')]
new_columns = [
c.split('param_regressor__')[1] if 'param_regressor' in c else c
for c in df_res.columns
]
new_columns = [
c.split('preprocessor__')[1] if 'preprocessor__' in c else c for c in new_columns
]
df_res.columns = new_columns
df_res = df_res.sort_values('mean_test_score', ascending=False)
df_res.head(10)
ridge alpha | numerical scaler | numerical polytrans interaction_only | numerical polytrans degree | numerical imputer_numeric strategy | numerical imputer_numeric add_indicator | numerical dim_reduction pca | numerical dim_reduction feat_selecter k | categorical imputer_cat strategy | categorical imputer_cat add_indicator | mean_test_score | std_test_score | mean_train_score | std_train_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | PowerTransformer() | True | 1 | constant | False | PCA(n_components=0.99) | 5 | most_frequent | True | -0.0910812 | 0.00726406 | -0.0690628 | 0.00763033 |
10 | RobustScaler() | True | 2 | mean | False | drop | 100 | most_frequent | False | -0.0914982 | 0.0101425 | -0.0658095 | 0.00502802 |
10 | PowerTransformer() | False | 1 | most_frequent | False | PCA(n_components=0.99) | 25 | most_frequent | True | -0.0920796 | 0.00689614 | -0.0685709 | 0.008111 |
10 | PowerTransformer() | True | 1 | median | False | PCA(n_components=0.9) | 25 | constant | False | -0.0922996 | 0.00720934 | -0.0685542 | 0.0080486 |
10 | RobustScaler() | False | 2 | constant | True | PCA(n_components=0.99) | 100 | most_frequent | True | -0.0942421 | 0.0122098 | -0.0646248 | 0.00574666 |
1 | RobustScaler() | True | 2 | most_frequent | False | drop | 100 | most_frequent | False | -0.0947194 | 0.00546636 | -0.0560936 | 0.00559328 |
10 | StandardScaler() | False | 2 | most_frequent | True | PCA(n_components=0.9) | 100 | constant | False | -0.0961898 | 0.0107717 | -0.0603564 | 0.00414943 |
1 | PowerTransformer() | False | 1 | constant | True | PCA(n_components=0.9) | 25 | constant | True | -0.0968886 | 0.0075747 | -0.0598735 | 0.00662794 |
1 | PowerTransformer() | False | 1 | most_frequent | True | PCA(n_components=0.99) | 25 | constant | False | -0.0971127 | 0.00728891 | -0.0597802 | 0.00659613 |
1 | PowerTransformer() | True | 1 | median | True | PCA(n_components=0.99) | 5 | most_frequent | False | -0.0971796 | 0.00667626 | -0.0604481 | 0.00637354 |
If you explore this table a bit, you can better judge which parameter variations in your grid search are actually useful and which ones aren’t. One quick way to do this is sketched below; after that, we continue with computing the model performance on the training and test set.
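For example, to see whether the choice of scaler matters, we can average the cross-validated test score per scaler over all sampled candidates (an illustrative sketch, assuming the column was renamed to numerical__scaler by the code above):
# Average test score per scaler over all sampled candidates; the scaler objects
# are converted to strings so that they can be grouped
scaler_scores = df_res.groupby(df_res['numerical__scaler'].astype(str))['mean_test_score'].mean()
print(scaler_scores.sort_values(ascending=False))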
# Evaluate the mean absolute percentage error (MAPE) on training and test set
score_tr = -random_search.score(X_tr, y_tr)
score_te = -random_search.score(X_te, y_te)
print(
    f"Mean absolute percentage error on train data: {score_tr*100:.2f}%\n"
    f"Mean absolute percentage error on test data: {score_te*100:.2f}%"
)
Mean absolute percentage error on train data: 7.10%
Mean absolute percentage error on test data: 8.38%
Great, a mean absolute percentage error of about 8% on the test data seems reasonably good! But now that we know which preprocessing routine works best (thanks to RandomizedSearchCV), let’s go ahead and further fine-tune the ridge model.
8. Fine tune best preprocessing pipeline
To further fine-tune the best preprocessing pipeline, we can simply take the ‘best estimator’ from the RandomizedSearchCV exploration and specify a new parameter grid to explore - this time with the GridSearchCV routine (so that all grid points are evaluated).
# Select best estimator
best_estimator = random_search.best_estimator_
# Specify new parameter grid to explore
param_grid = {'regressor__ridge__alpha': np.logspace(-5, 5, 51)}
To showcase one more thing, let’s use a nested cross-validation routine to get a more robust estimate of our model’s generalization performance. In other words, in contrast to the previous approach, where we separated the test set from the train set only once, we will now also apply cross-validation to that split. Together with the cross-validation inside the grid search, we therefore use cross-validation twice, hence the name “nested”.
# Establish the two cross validations
from sklearn.model_selection import KFold
inner_cv = KFold(n_splits=3, shuffle=True)
outer_cv = KFold(n_splits=3, shuffle=True)
And now, let’s combine all of this with and run the model.
from sklearn.model_selection import GridSearchCV, cross_validate
# Create grid search object with parameter grid and inner cross validation
grid_search = GridSearchCV(
best_estimator,
param_grid,
refit=True,
cv=inner_cv,
n_jobs=-1,
scoring='neg_mean_absolute_percentage_error',
)
# Train model with outer cross validation (and return estimators for post-model investigation)
cv_results = cross_validate(
grid_search,
X=X,
y=y,
cv=outer_cv,
n_jobs=1,
return_estimator=True,
return_train_score=True,
)
Once the model has finished training, we can extract the scores from the outermost loop and print their average, as well as their standard deviation over the folds. The same can be done for the optimal ridge model parameter ‘alpha’. This information gives us some insight into how well the model generalizes.
df_nested = pd.DataFrame(cv_results)
cv_train_scores = -df_nested['train_score']
cv_test_scores = -df_nested['test_score']
cv_alphas = [c.best_params_['regressor__ridge__alpha'] for c in df_nested['estimator']]
print(
"Generalization score with hyperparameters tuning:\n"
f" Train Score: {cv_train_scores.mean()*100:.1f}% +/- {cv_train_scores.std()*100:.1f}%\n"
f" Test Score: {cv_test_scores.mean()*100:.1f}% +/- {cv_test_scores.std()*100:.1f}%\n"
f" Optimal Alpha: {np.mean(cv_alphas):.1f} +/- {np.std(cv_alphas):.1f}\n"
)
Generalization score with hyperparameters tuning:
Train Score: 7.6% +/- 0.9%
Test Score: 9.0% +/- 0.2%
Optimal Alpha: 29.6 +/- 23.8
9. Feature importance investigation with permutation testing
Some models provide insights about feature importance (i.e. which features the model relies on most for its predictions). However, these built-in importances can be misleading. A better approach is permutation importance: it takes the fitted model (in this case the best model with the best hyperparameters), randomly shuffles one feature at a time in the evaluation data, and measures how much this perturbation degrades the final score.
# Select the best estimator with the best hyper parameter
final_estimator = grid_search.set_params(
estimator__regressor__ridge__alpha=np.mean(cv_alphas)
)
# Fit this estimator to the initial training set
_ = final_estimator.fit(X_tr, y_tr)
Now that the model is ready and trained, we can go ahead and perform the feature importance investigation via permutation testing. To showcase one additional feature, let’s perform this routine twice, once scoring with r2 and once with neg_mean_absolute_percentage_error.
from sklearn.inspection import permutation_importance
scoring = ['r2', 'neg_mean_absolute_percentage_error']
result = permutation_importance(
final_estimator,
X_te,
y_te,
n_repeats=50,
random_state=0,
n_jobs=-1,
scoring=scoring,
)
Once everything is computed, we can go ahead and plot the feature importance for each feature, separated by the two different scoring metrics.
fig, axs = plt.subplots(1, 2, figsize=(16, 16))
for i, s in enumerate(scoring):
sorted_idx = result[s].importances_mean.argsort()
axs[i].boxplot(
result[s].importances[sorted_idx].T, vert=False, labels=X_te.columns[sorted_idx]
)
axs[i].set_title("Permutation Importances (test set) | %s" % s)
fig.tight_layout()
plt.show()
Summary and Next Steps
In this tutorial, we’ve covered advanced scikit-learn concepts:
- Building complex preprocessing pipelines
- Handling mixed data types
- Automated feature selection and engineering
- Implementing grid search with cross-validation
- Model comparison and evaluation
- Analyzing feature importance
Key takeaways:
- Preprocessing pipelines make complex workflows manageable
- Grid search helps find optimal parameters systematically
- Feature selection can improve model performance
- Understanding feature importance aids model interpretation
- Cross-validation provides robust performance estimates
In Part 4, we’ll explore advanced neural network architectures with TensorFlow, building on both the neural network concepts from Part 2 and the preprocessing techniques we’ve learned here.