import seaborn as sns

sns.scatterplot(data=data, x="x1", y="x2", hue="y", palette="Dark2")
Choosing hyperparameters
After the exercise in the last chapter, you’re hopefully thinking “why am I spending my time trying different values of n_neighbors, can’t it do this automatically?” If so, you’re in luck!
It is very common to need to try out a range of different values for your hyperparameters, so scikit-learn provides some tools to help.
Let’s do a similar thing to the last chapter, but this time load a different file, one with four different classes, and follow the usual steps:
import pandas as pd

data = pd.read_csv("https://bristol-training.github.io/applied-data-analysis-in-python/data/blobs.csv")
X = data[["x1", "x2"]]
y = data["y"]

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y)
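Before reaching for any tooling, it is worth sketching what doing this search by hand would look like. Something like the following (the scores dictionary, best_n and the loop bounds are purely illustrative):

from sklearn.neighbors import KNeighborsClassifier

# Fit one model per candidate value of n_neighbors and record its score.
# Note: scoring against the test set inside the loop like this is exactly the
# mistake GridSearchCV avoids; it uses internal validation splits instead,
# keeping test_X and test_y untouched for the final evaluation.
scores = {}
for n in range(1, 175):
    knn = KNeighborsClassifier(n_neighbors=n)
    knn.fit(train_X, train_y)
    scores[n] = knn.score(test_X, test_y)
best_n = max(scores, key=scores.get)  # the value with the highest score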
The tool that allows us to do this hyperparameter searching is called GridSearchCV, which will rerun the model training for every hyperparameter value that we pass it.
The GridSearchCV constructor takes two things: 1. the model that we want to explore, and 2. a dictionary containing the hyperparameter values we want to test.
In this case, we are asking it to try every value of n_neighbors from 1 to 174 and it will use the training data to choose the best value.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

hyperparameters = {
    "n_neighbors": range(1, 175),
}
model = GridSearchCV(KNeighborsClassifier(), hyperparameters)
model.fit(train_X, train_y)
GridSearchCV(estimator=KNeighborsClassifier(), param_grid={'n_neighbors': range(1, 175)})
KNeighborsClassifier(n_neighbors=33)
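The “grid” in the name comes from the fact that the dictionary can hold more than one hyperparameter, in which case every combination is tried. As a sketch (weights is a real KNeighborsClassifier parameter, though we don’t use it elsewhere in this chapter, and bigger_grid is just an illustrative name):

# Every combination is trained and cross-validated: 174 values × 2 weightings
# = 348 fits, so grids get expensive quickly as you add parameters
bigger_grid = {
    "n_neighbors": range(1, 175),
    "weights": ["uniform", "distance"],
}
bigger_model = GridSearchCV(KNeighborsClassifier(), bigger_grid)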
The best way to understand the search results is to plot them. We can do this by grabbing the cv_results_ attribute of the GridSearchCV object and plotting the mean_test_score against the value of n_neighbors. GridSearchCV runs each experiment multiple times with different splits of training and validation data, which provides some measure of the uncertainty of each score:
cv_results = pd.DataFrame(model.cv_results_)
cv_results.plot.scatter("param_n_neighbors", "mean_test_score", yerr="std_test_score", figsize=(10, 8))
One thing that GridSearchCV does, once it has scanned through all the parameters, is a final fit over the whole training data set using the best hyperparameters from the search. This allows you to use the GridSearchCV object, model, as if it were a KNeighborsClassifier object.
from sklearn.inspection import DecisionBoundaryDisplay

DecisionBoundaryDisplay.from_estimator(model, X, cmap="Pastel2")
sns.scatterplot(data=X, x="x1", y="x2", hue=y, palette="Dark2")
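If you ever need the underlying refit classifier itself, rather than the search wrapper, it is available as the best_estimator_ attribute (a short sketch; best_knn is just our name for it):

best_knn = model.best_estimator_  # the KNeighborsClassifier refit on all the training data
print(best_knn.n_neighbors)       # the winning hyperparameter value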
or use it directly with predict:
new_X = pd.DataFrame({
    "x1": [0, -10, 5, -5],
    "x2": [10, 5, 0, -10],
})
model.predict(new_X)
array([0, 3, 1, 2])
or measure its performance against the test data set:
model.score(test_X, test_y)
0.928
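The wrapper also passes through any other methods of the refit estimator. For example, since KNeighborsClassifier provides predict_proba, we can ask for per-class probabilities (a sketch reusing the new_X defined above):

# For k-NN the "probability" of each class is the fraction of the nearest
# neighbours that belong to it, so each row sums to 1
model.predict_proba(new_X)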
Using something like GridSearchCV allows you to find the best hyperparameters for your models while making sure they still generalise well.