Logistic regression - breast cancer classification
Let us explore another regression model and apply it to the breast cancer dataset from sklearn. But first, let us import the necessary libraries we need for this exercise.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Breast Cancer Dataset
The dataset contains 569 samples, each described by 30 features derived from digitised images of breast mass cells. Each sample is labeled as either malignant (0) or benign (1).
For each of the ten underlying cell-nucleus measurements, the dataset provides the mean, standard error, and “worst” (mean of the three largest values), resulting in 30 total features.
Logistic Regression
Logistic regression is a supervised machine learning algorithm designed for classification problems, most commonly with a binary outcome (true/false, yes/no, 0/1). Its primary purpose is to estimate the probability that an instance belongs to a particular class.
Logistic Function
The core function used in logistic regression is the sigmoid function:
\[\sigma(z)=\frac{1}{1+e^{-z}}\]
The logistic regression model calculates a linear combination of the \(n\) input features: \(z = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n\), where \(w_0\) is the intercept and \(w_1 \dots w_n\) are regression coefficients multiplied by some predictor values. It applies the sigmoid function to map this value to a probability: \(P(y=1|x) = \sigma(z)\) and classifies an instance as positive (1) if the probability is greater than 0.5, and negative (0) otherwise.
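To make this concrete, here is a minimal sketch of the sigmoid and the resulting probability for a single instance; the weights and feature values below are made up purely for illustration:

import numpy as np

def sigmoid(z):
    # maps any real number to the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# hypothetical intercept, coefficients, and features for a 3-feature model
w0 = -1.5                          # intercept w0
w = np.array([0.8, -0.4, 1.2])     # coefficients w1..w3
x = np.array([1.0, 2.0, 0.5])      # one instance's feature values

z = w0 + w @ x                     # linear combination z
p = sigmoid(z)                     # P(y=1 | x)
print(f"z = {z:.3f}, P(y=1|x) = {p:.3f}, predicted class = {int(p > 0.5)}")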
Advantages of Logistic Regression for Medical Diagnostics
Logistic regression is particularly useful for medical applications because:
- Interpretability: The coefficients tell us how each feature influences the prediction.
- Probabilistic output: Rather than just providing a binary prediction, it gives a probability score that can be used to assess confidence in the diagnosis.
- Efficiency: It works well with limited data and is computationally efficient.
- Regularization: It can be regularized to prevent overfitting, which is especially important when dealing with many features (see the sketch after this list).
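In scikit-learn, regularization strength is controlled through the C parameter of LogisticRegression (the inverse of the regularization strength, so smaller C means stronger regularization). A minimal sketch of how one might vary it:

from sklearn.linear_model import LogisticRegression

# smaller C = stronger L2 regularization; the default is C=1.0 with an L2 penalty
strong_reg = LogisticRegression(C=0.01, penalty='l2', max_iter=1000)
default_reg = LogisticRegression(C=1.0, penalty='l2', max_iter=1000)
weak_reg = LogisticRegression(C=100.0, penalty='l2', max_iter=1000)

We use the default settings below; tuning C (for example with GridSearchCV, imported in the next section) is a common refinement.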
Load dataset
Let us now load the dataset from sklearn along with some important in-built functions for analysing and training our model.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, precision_recall_curve
breast_cancer = load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
feature_names = breast_cancer.feature_names
target_names = breast_cancer.target_names
# print to check the overall structure of our dataset
# and also to find how many classes we have
print(f"Dataset dimensions: {X.shape}")
print(f"Target classes: {target_names}")
Dataset dimensions: (569, 30)
Target classes: ['malignant' 'benign']
This shows that our dataset contains \(569\) samples and \(30\) features (rows and columns, respectively).
The target classes are malignant and benign, so our goal is to predict whether a given sample is malignant or benign.
Data Visualisation
Now that we have loaded our dataset, let us inspect what it looks like.
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y
df['diagnosis'] = df['target'].map({0: 'malignant', 1: 'benign'})
print("\nData overview:")
print(df.head())
print("\nTarget distribution:")
print(df['diagnosis'].value_counts())
Data overview:
mean radius mean texture mean perimeter mean area mean smoothness \
0 17.99 10.38 122.80 1001.0 0.11840
1 20.57 17.77 132.90 1326.0 0.08474
2 19.69 21.25 130.00 1203.0 0.10960
3 11.42 20.38 77.58 386.1 0.14250
4 20.29 14.34 135.10 1297.0 0.10030
mean compactness mean concavity mean concave points mean symmetry \
0 0.27760 0.3001 0.14710 0.2419
1 0.07864 0.0869 0.07017 0.1812
2 0.15990 0.1974 0.12790 0.2069
3 0.28390 0.2414 0.10520 0.2597
4 0.13280 0.1980 0.10430 0.1809
mean fractal dimension ... worst perimeter worst area worst smoothness \
0 0.07871 ... 184.60 2019.0 0.1622
1 0.05667 ... 158.80 1956.0 0.1238
2 0.05999 ... 152.50 1709.0 0.1444
3 0.09744 ... 98.87 567.7 0.2098
4 0.05883 ... 152.20 1575.0 0.1374
worst compactness worst concavity worst concave points worst symmetry \
0 0.6656 0.7119 0.2654 0.4601
1 0.1866 0.2416 0.1860 0.2750
2 0.4245 0.4504 0.2430 0.3613
3 0.8663 0.6869 0.2575 0.6638
4 0.2050 0.4000 0.1625 0.2364
worst fractal dimension target diagnosis
0 0.11890 0 malignant
1 0.08902 0 malignant
2 0.08758 0 malignant
3 0.17300 0 malignant
4 0.07678 0 malignant
[5 rows x 32 columns]
Target distribution:
diagnosis
benign 357
malignant 212
Name: count, dtype: int64
plt.figure(figsize=(10, 6))
sns.countplot(x='diagnosis', data=df)
plt.title('Distribution of Diagnosis Classes')
plt.show()
Splitting dataset for training
To train the model, we first need to split the dataset into two parts: a training set and a testing/validation set. We can do so using train_test_split, defining the size of the test set as well as the random state used to shuffle the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
# check feature distributions by class
plt.figure(figsize=(15, 10))
features_to_plot = ['mean radius', 'mean texture', 'mean perimeter']
for i, feature in enumerate(features_to_plot):
    plt.subplot(2, 3, i+1)
    sns.boxplot(x='target', y=feature, data=df)
    plt.title(f'{feature} by Class')
    plt.xlabel('Class (0: Malignant, 1: Benign)')
plt.tight_layout()
plt.show()
Training set shape: (426, 30)
Testing set shape: (143, 30)
To verify that the training and testing data were split correctly (a \(75:25\) ratio), you can check that \(143\) is roughly \(25\%\) of \(569\).
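As a quick sanity check, a one-line sketch using the arrays defined above:

# 143 / 569 ≈ 0.2513, i.e. about a 75:25 split
print(f"Test fraction: {X_test.shape[0] / X.shape[0]:.4f}")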
Another approach to understand the relationship between features and tumour classifications (malignant vs. benign) is to visualise the means of each feature across both classes.
If you’d like to explore additional features, you can refer to the complete feature set available in the Wisconsin Breast Cancer Diagnostic dataset.
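One possible sketch of such a comparison, using the DataFrame built above (min-max scaling each feature first so that all 30 fit on a single axis; this is one illustrative approach, not the only one):

# min-max scale each feature, then compare per-class means
features = list(feature_names)
scaled = (df[features] - df[features].min()) / (df[features].max() - df[features].min())
scaled['diagnosis'] = df['diagnosis']

scaled.groupby('diagnosis').mean().T.plot(kind='bar', figsize=(15, 6))
plt.title('Min-max scaled feature means by diagnosis')
plt.ylabel('Scaled mean')
plt.tight_layout()
plt.show()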
Scaling the dataset and fitting the model
Now that we have our dataset ready for training, let us fit the logistic regression model.
# initialise the scaler
# this is basically pre-processing of the data to standardise the features
scaler = StandardScaler()

# fit the scaler on training data
scaler.fit(X_train)

# transform both training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# initialise and train a basic logistic regression model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

# evaluate the basic model
y_pred = model.predict(X_test_scaled)
print("\nLogistic Regression Results:")
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))
Logistic Regression Results:
precision recall f1-score support
Malignant 0.96 0.98 0.97 54
Benign 0.99 0.98 0.98 89
accuracy 0.98 143
macro avg 0.98 0.98 0.98 143
weighted avg 0.98 0.98 0.98 143
Here, we see the four columns of the classification report: precision, recall, f1-score, and support.
Precision: Measures the accuracy of positive predictions. It is the ratio of true positives to all predicted positives. Here, a precision of \(0.96\) for malignant means \(96\%\) of tumors predicted as malignant were actually malignant.
Recall: Measures the model's ability to find all positive cases. It is the ratio of true positives to all actual positives. A recall of \(0.98\) for malignant means the model identified \(98\%\) of all malignant tumors.
F1-Score: The harmonic mean of precision and recall provides a balance between these two sometimes competing metrics. An F1-score of \(0.97\) for malignant indicates an excellent balance between identifying positive cases and avoiding false alarms.
Support: The actual number of samples in each class within the dataset being evaluated. The support values (\(54\) malignant, \(89\) benign) provide context for interpreting the other metrics and indicate the relative frequency of each class in the test data.
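To make these definitions concrete, the same numbers can be recomputed by hand from the raw prediction counts; a small sketch using the variables above, treating malignant (label 0) as the positive class:

import numpy as np

# counts with malignant (label 0) treated as the positive class
tp = np.sum((y_pred == 0) & (y_test == 0))  # malignant correctly identified
fp = np.sum((y_pred == 0) & (y_test == 1))  # benign flagged as malignant
fn = np.sum((y_pred == 1) & (y_test == 0))  # malignant missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")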
Cross Validation
Cross-validation helps us estimate how well our model will generalise to new data by:
- Dividing the training data into multiple subsets (folds)
- Training and testing/validating the model on different combinations of these folds
- Averaging the results to get a more reliable performance estimate
cv_scores = cross_val_score(LogisticRegression(random_state=42, max_iter=1000),
                            X_train_scaled, y_train, cv=5, scoring='accuracy')

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
Cross-validation scores: [0.98837209 0.96470588 1. 0.96470588 0.95294118]
Mean CV accuracy: 0.9741 ± 0.0173
As we used 5-fold cross-validation, we see \(5\) different scores, one per fold, and the average accuracy is around \(97\%\).
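One caveat: the scaler above was fit on the entire training set before cross-validation, so each fold's validation split has leaked slightly into the scaling statistics. A common remedy, sketched below, is to wrap the scaler and classifier in a Pipeline so the scaler is refit inside each fold:

from sklearn.pipeline import make_pipeline

# the pipeline refits the scaler on each fold's training portion,
# avoiding leakage from the validation fold into the scaling statistics
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(random_state=42, max_iter=1000))
cv_scores_pipe = cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy')
print(f"Pipeline CV accuracy: {cv_scores_pipe.mean():.4f} ± {cv_scores_pipe.std():.4f}")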
Visualise Model Performance
A confusion matrix is a performance evaluation tool that provides a detailed breakdown of correct and incorrect predictions for each class. The rows represent the actual classes, while the columns represent the predicted classes. Using this table, it is easy to see which predictions are wrong.
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Malignant', 'Benign'],
            yticklabels=['Malignant', 'Benign'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
From this confusion matrix, we learn how the binary classification model handles malignant and benign cases.
Of the \(54\) actual malignant samples, \(53\) were correctly classified while \(1\) was misclassified as benign; of the \(89\) actual benign samples, \(87\) were correctly classified while \(2\) were misclassified as malignant. These counts are consistent with the precision and recall values in the classification report above.
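The individual counts can also be read programmatically from the matrix; a small sketch using ravel, which flattens the 2×2 matrix row by row:

# rows are actual classes, columns are predictions; for labels [0, 1] the
# flattened order is: [mal correct, mal→benign, benign→mal, benign correct]
mal_correct, mal_as_benign, benign_as_mal, benign_correct = cm.ravel()
print(f"malignant correct: {mal_correct}, missed: {mal_as_benign}")
print(f"benign correct: {benign_correct}, flagged as malignant: {benign_as_mal}")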