import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
Logistic Regression - House Pricing
What is logistic regression?
Logistic regression is a tool that can be used for solving classification problems, in particular, predicting the probability of a binary outcome. For example:
- Predicting whether “it will rain or not” (yes/no)
- Predicting whether “a house is worth buying” (worth/not worth)
When is logistic regression suitable?
Logistic regression is useful when you need to make a “two-choice” decision (though it can also be extended to more categories).
It’s suitable in scenarios like:
- Classification problems: Like predicting “will a student pass or fail?” or “does a patient have a disease?”
- Quantifiable features: When you have numerical inputs (like house size, income) to base your prediction on.
- Need probabilities: When you want not just a “yes or no” answer, but also “how likely is it?”
Some real-life examples are:
- Predicting “should I bring an umbrella today?”: Using temperature, humidity, and wind speed, logistic regression might tell you “70% chance of rain”.
- Predicting “is this house worth buying?”: Using size, location, and price, it might say “80% worth buying”.
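Under the hood, logistic regression takes a weighted sum of the inputs and squashes it through the logistic (sigmoid) function, which turns any number into a probability between 0 and 1. Here is a minimal sketch; the weights are invented purely for illustration:

import numpy as np

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

# hypothetical model: score = 0.8 * income - 2.0 (both numbers are made up)
income = 6.0
score = 0.8 * income - 2.0
print(f"P(high price) = {sigmoid(score):.2f}")  # prints about 0.94

Training simply means finding weights that make these probabilities match the observed yes/no labels as closely as possible.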
In this notebook, we’ll use the California Housing dataset to try predicting whether a house in a certain area has a “high” or “low” price using logistic regression!
Pre-requisites
Before we dive into building our logistic regression model, we need to gather some tools—like setting up our workbench before starting a project! These tools (called “libraries” in programming) help us load data, do math, draw pictures, and build the model. Each library has a special job, and together they make our work easier.
Step 1: Loading and Exploring the California Housing Dataset
What Are We Doing Here?
We’re starting by loading the California Housing dataset, which tells us about small areas in California (not individual houses, but regions called “block groups”). We’ll peek at the data, set up our prediction task (high-priced or low-priced areas), and draw a picture to see how income relates to house prices.
Why This Step?
We need to understand our data before predicting anything—it’s like checking your ingredients before cooking! Here, we’ll turn house prices into a “yes or no” question: is this area high-priced?
# load the California Housing dataset
california = fetch_california_housing()
data = pd.DataFrame(california.data, columns=california.feature_names)
data['MedHouseVal'] = california.target  # the original target: median house value of the block group
# Let us look at the head of the data
print("head of housing dataset:")
print(data.head())
# create a new binary target 'HIGH_PRICE': 1 if the block group's MedHouseVal is above the median across all block groups, otherwise 0
median_price = data['MedHouseVal'].median()
data['HIGH_PRICE'] = np.where(data['MedHouseVal'] > median_price, 1, 0)
# plot the relationship between MedInc and MedHouseVal
plt.figure(figsize=(10, 6))
plt.scatter(data['MedInc'], data['MedHouseVal'], c=data['HIGH_PRICE'], cmap='coolwarm', alpha=0.5)
plt.xlabel('MedInc')
plt.ylabel('MedHouseVal')
plt.title('Relationship between MedInc and MedHouseVal')
plt.show()
print("Red points are high-price areas, blue points are low-price areas; we want to predict which group an area falls into.")
head of housing dataset:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
Longitude MedHouseVal
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
Red points are high-price areas, blue points are low-price areas; we want to predict which group an area falls into.
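By the way, because HIGH_PRICE comes from a median split, the two classes should be roughly balanced, about half 1s and half 0s. A quick sanity check you can run after the cell above:

# sanity check: a median split should give roughly 50% in each class
print(data['HIGH_PRICE'].value_counts(normalize=True))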
Step 2: A First Look at Logistic Regression with a Simple Example
What Are We Doing Here?
Before using all the data, let’s try logistic regression with a tiny example. We’ll use just one piece of info—income—to predict if an area has high-priced houses. It’s a quick way to see how the model learns and makes guesses.
Why This Step?
Starting small helps us understand how logistic regression works—like practicing with a toy car before driving a real one! We’ll see how it predicts “yes” or “no” based on income.
print("Logistic regression is a tool that help us predict true or false")
print("eg: we want to predict whether a house is high price or low price according to the income")
# toy data: MedInc and HIGH_PRICE
simple_data = pd.DataFrame({
    'MedInc': [2, 4, 6, 8],
    'HIGH_PRICE': [0, 0, 1, 1]
})
# train a simple model on the toy data
simple_model = LogisticRegression()
simple_model.fit(simple_data[['MedInc']], simple_data['HIGH_PRICE'])
# predict for two new income values (a DataFrame keeps the feature name, avoiding sklearn's feature-name warning)
predictions = simple_model.predict(pd.DataFrame({'MedInc': [3, 7]}))
print(f"prediction for MedInc=3: {predictions[0]} (0 = low price, 1 = high price)")
print(f"prediction for MedInc=7: {predictions[1]}")
print("The model learned that the higher the income, the more likely the area is high-priced!")
Logistic regression is a tool that helps us predict true or false
e.g. we want to predict whether an area is high price or low price from its income
prediction for MedInc=3: 0 (0 = low price, 1 = high price)
prediction for MedInc=7: 1
The model learned that the higher the income, the more likely the area is high-priced!
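Logistic regression can also report how confident it is, not just the 0/1 label. An optional follow-up sketch, reusing simple_model from the cell above:

# ask the toy model for probabilities instead of hard labels
points = pd.DataFrame({'MedInc': [3, 5, 7]})
probs = simple_model.predict_proba(points)[:, 1]  # column 1 holds P(high price)
for income, p in zip(points['MedInc'], probs):
    print(f"MedInc={income}: P(high price) = {p:.2f}")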
Step 3: Preparing Data and Training the Logistic Regression Model
What Are We Doing Here?
Now we’ll get our data ready and teach our logistic regression model to make predictions. We’ll pick a few key features (like income and number of rooms), clean things up, and split the data into practice and test sets—like preparing flashcards for studying and saving some for a quiz.
Why This Step?
Data needs a bit of prep to work well with our model, just like washing veggies before cooking. Then we’ll train the model to spot patterns, like “high-income areas often have pricier houses,” and test it to see what it learned.
# select a few key features: MedInc (median income), AveRooms (average rooms), Population
X = data[['MedInc', 'AveRooms', 'Population']]
y = data['HIGH_PRICE']
y
# standardize features (allow different ranges of numbers to be compared fairly)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled
# split into training and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
# train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)
# predict on the test set
y_pred = model.predict(X_test)
print("Predictions for the first 5 test samples:", y_pred[:5])
print("Actual labels for the first 5 test samples:", y_test.values[:5])
Predictions for the first 5 test samples: [0 0 1 1 0]
Actual labels for the first 5 test samples: [0 0 1 1 1]
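A side note on the scaling above: we fit the StandardScaler on the whole dataset before splitting, so the test rows influence the scaling statistics, which is a mild form of data leakage. The effect here is likely tiny, but the cleaner habit is to split first and fit the scaler on the training rows only, as in this sketch (same variables as above):

# split the raw features first, then fit the scaler on the training rows only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_raw)  # learn mean/std from training data only
X_test = scaler.transform(X_test_raw)        # reuse the training statistics on the test data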
Step 4: Checking How Well Our Model Did
What Are We Doing Here?
Let’s see how good our logistic regression model is! We’ll check how many predictions it got right, draw a “confusion matrix” to see where it messed up, and figure out which features (like income or rooms) mattered most to its decisions.
Why This Step?
It’s like grading a test — we want to know if our model learned well or if it needs more practice. This helps us understand what it’s good at and where it struggles.
# calculate model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.2f} (proportion of correct predictions)")
# draw the confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['low price', 'high price'])
disp.plot(cmap='Blues')
plt.title('Confusion matrix')
plt.show()
# see which features are important (the model's learned weights)
feature_importance = pd.DataFrame({
    'features': ['MedInc', 'AveRooms', 'Population'],
    'weights': model.coef_[0]
})
print("importance of features:")
print(feature_importance)
Model accuracy: 0.76 (proportion of correct predictions)
importance of features:
     features   weights
0      MedInc  2.338169
1    AveRooms -0.902615
2  Population -0.091441
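Accuracy is just one number. If you also want precision and recall per class (of the areas we predicted as high price, how many really were? and of the truly high-priced areas, how many did we catch?), scikit-learn's classification_report prints both, reusing y_test and y_pred from above:

from sklearn.metrics import classification_report

# per-class precision, recall, and F1 for the test predictions
print(classification_report(y_test, y_pred, target_names=['low price', 'high price']))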
Step 5: Ideas for Improving Our Model’s Accuracy
What Are We Doing Here?
Our model achieved an accuracy of 0.76, which means it got 76% of its predictions right. Not bad for a start! We also saw the importance of each feature: MedInc (income) has a big positive weight (2.34), while AveRooms and Population have smaller or negative weights. But can we do better than 0.76? In this step, we'll suggest some ideas for improving the model and let you try them out.
Why This Step?
Building a model is just the beginning. Improving it is like tuning a guitar—you try different things to get the best sound. Here are some ideas to explore, and we’ll give you a starting point to experiment with!
Suggestions for Improvement
Here are a few ways you can try to boost the model’s accuracy. Pick one (or more!) and see what happens:
- Add More Features: We only used MedInc, AveRooms, and Population. What if we include other features like HouseAge, Latitude, or Longitude? Maybe they'll help the model learn more patterns.
- Tune the Model's Settings: Logistic regression has a setting called C (it controls the strength of regularization; smaller C means a stricter penalty on the weights). The default is 1.0, but you could try smaller (like 0.1) or larger (like 10) values to see if it improves accuracy.
- Adjust the Decision Threshold: Right now, we predict "high price" if the probability is above 0.5 (the LogisticRegression default). What if we change it to 0.7 or 0.3? This might catch more high-priced areas (or produce fewer false positives).
Try It Out!
Below is a basic code framework to get you started. Pick one idea, tweak the code, and see if the accuracy improves. Don’t be afraid to experiment!
# Start experimenting with improving the model
print("Try one of the suggestions to improve the model!")
# Example 1: Add more features (uncomment and modify)
# X = data[['MedInc', 'AveRooms', 'Population', 'HouseAge', 'Latitude', 'Longitude']]
# X_scaled = scaler.fit_transform(X)
# X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
# model = LogisticRegression()
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy with more features: {accuracy:.2f}")
# Example 2: Tune the C parameter (uncomment and modify)
# C = 0.1
# model = LogisticRegression(C=C) # Try different values like 0.1, 10, etc.
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Accuracy with new C={C}: {accuracy:.2f}")
# Example 3: Adjust the decision threshold (uncomment and modify)
# y_prob = model.predict_proba(X_test)[:, 1] # Get probabilities for class 1
# threshold = 0.7 # Try 0.7 or 0.3
# y_pred_custom = (y_prob > threshold).astype(int)
# accuracy_custom = accuracy_score(y_test, y_pred_custom)
# print(f"Accuracy with threshold {threshold}: {accuracy_custom:.2f}")
# Add more experiments as you like!
print("Pick an idea, tweak the code, and see if you can beat the original accuracy of 0.76!")
Try one of the suggestions to improve the model!
Pick an idea, tweak the code, and see if you can beat the original accuracy of 0.76!
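If you would rather search the C values from suggestion 2 systematically, cross-validation can pick one for you. A minimal sketch using scikit-learn's GridSearchCV, reusing X_train, X_test, y_train, y_test from Step 3 (the grid of C values is just an example):

from sklearn.model_selection import GridSearchCV

# try several C values with 5-fold cross-validation on the training set
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(f"best C: {search.best_params_['C']}")
print(f"cross-validated accuracy: {search.best_score_:.2f}")
print(f"test accuracy with the best C: {search.score(X_test, y_test):.2f}")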