Exercises: Logistic Regression

These exercises progress from conceptual understanding through interpretation, Python implementation, model evaluation, and applied analysis. Estimated completion time: 3.5 hours.

Difficulty Guide:

- :star: Foundational (5-10 min each)
- :star::star: Intermediate (10-20 min each)
- :star::star::star: Challenging (20-40 min each)
- :star::star::star::star: Advanced/Research (40+ min each)


Part A: Conceptual Understanding :star:

A.1. In your own words, explain why we can't use simple linear regression when the response variable is binary (0 or 1). Give a specific example of a nonsensical prediction linear regression might produce.

A.2. True or false (explain each):

(a) Logistic regression predicts probabilities directly using a straight line.

(b) The sigmoid function can output values less than 0 or greater than 1.

(c) A logistic regression coefficient of $b_1 = 0$ means the predictor has no effect on the outcome.

(d) If the odds ratio for a predictor is 1.0, the predictor has no effect.

(e) Logistic regression requires the response variable to be exactly 0 or 1.

(f) An accuracy of 95% always means the model is excellent.

A.3. Convert the following probabilities to odds and log-odds. Show your work.

(a) P = 0.80

(b) P = 0.50

(c) P = 0.20

(d) P = 0.95

(e) P = 0.01

A.4. Convert the following odds back to probabilities. Show your work.

(a) Odds = 4

(b) Odds = 1

(c) Odds = 0.25

(d) Odds = 19
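
The conversions in A.3 and A.4 can be sanity-checked with a few small helper functions. This sketch uses an illustrative value (P = 0.75) that does not appear in the exercises, so it shows the mechanics without doing the work for you:

```python
import math

def prob_to_odds(p):
    """Odds = P / (1 - P)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """P = odds / (1 + odds)."""
    return odds / (1 + odds)

def prob_to_log_odds(p):
    """Log-odds = ln(P / (1 - P))."""
    return math.log(prob_to_odds(p))

p = 0.75
odds = prob_to_odds(p)
print(odds)                            # 3.0 -- "3 to 1"
print(round(prob_to_log_odds(p), 3))   # ln(3) = 1.099
print(odds_to_prob(odds))              # back to 0.75
```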

A.5. Explain the difference between:

(a) Probability and odds

(b) Odds and odds ratio

(c) Log-odds and odds ratio

A.6. A news article says "people who exercise regularly have 2.5 times the odds of living past 85 compared to sedentary individuals." Translate this into a sentence a non-statistician would understand. Is this the same as saying regular exercisers are 2.5 times more likely to live past 85? Why or why not?


Part B: Interpreting Logistic Regression Output :star::star:

B.1. A researcher predicts whether a student passes an exam (1 = pass, 0 = fail) from hours studied. The logistic regression output is:

$$\ln\left(\frac{P(\text{pass})}{1-P(\text{pass})}\right) = -4.0 + 0.80 \times \text{hours}$$

(a) What is the predicted probability of passing for a student who studies 5 hours?

(b) What is the predicted probability of passing for a student who studies 8 hours?

(c) What is the odds ratio for hours studied? Interpret it in context.

(d) How many hours of study would give a 50% probability of passing? (Hint: set the log-odds to 0 and solve for hours.)
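
To check your arithmetic in parts (a)-(d), you can evaluate the fitted equation directly and pass the result through the sigmoid. The value hours = 3 below is only an illustration, not one of the cases asked about:

```python
import math

def pass_probability(hours):
    """Apply the fitted log-odds equation, then the sigmoid."""
    log_odds = -4.0 + 0.80 * hours
    return 1 / (1 + math.exp(-log_odds))

print(round(pass_probability(3), 3))  # log-odds = -1.6, so P = 0.168
```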

B.2. A logistic regression predicts loan default (1 = default, 0 = repaid) from three predictors:

                    coef    std err       z      P>|z|
------------------------------------------------------
Intercept          -3.20      0.45   -7.111     0.000
credit_score       -0.02      0.003  -6.667     0.000
debt_ratio          2.85      0.38    7.500     0.000
income_1000s       -0.05      0.01   -5.000     0.000

(a) Compute the odds ratio for each predictor.

(b) Interpret each odds ratio in context, using the phrase "holding other variables constant."

(c) Which predictor has the largest effect on the odds of default? Justify your answer carefully (hint: consider the scale of each variable).

(d) A borrower has a credit score of 680, a debt ratio of 0.40, and an income of $55,000. What is the predicted probability of default?
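
Odds ratios come from exponentiating coefficients. The sketch below uses a made-up coefficient (b = 0.7), not one from the table above, so it demonstrates the pattern without answering part (a) for you:

```python
import math

b = 0.7                      # hypothetical coefficient, not from the output table
odds_ratio = math.exp(b)
print(round(odds_ratio, 2))  # 2.01: each 1-unit increase multiplies the odds by about 2

# For part (d): plug the borrower's values into the full linear predictor
# (intercept plus coefficient-times-value for each predictor) to get the
# log-odds, then convert with the sigmoid: p = 1 / (1 + math.exp(-log_odds))
```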

B.3. A medical researcher predicts 30-day hospital readmission (1 = readmitted, 0 = not) and reports these odds ratios:

Predictor                    Odds Ratio   95% CI         p-value
----------------------------------------------------------------
Age (per year)                     1.03   (1.01, 1.05)     0.003
Prior admissions                   1.45   (1.22, 1.72)   < 0.001
Charlson index                     1.18   (1.08, 1.29)   < 0.001
Has primary care physician         0.62   (0.44, 0.87)     0.006
Discharged on Friday               1.08   (0.79, 1.48)     0.625

(a) Which predictors are statistically significant at $\alpha = 0.05$?

(b) Interpret the odds ratio for "Prior admissions" in plain language.

(c) Interpret the odds ratio for "Has primary care physician." Why is it less than 1?

(d) A colleague says "being discharged on Friday increases readmission risk by 8%." Is this a valid conclusion? Why or why not?

(e) A hospital administrator asks: "So if a patient has 3 prior admissions, their probability of readmission is 1.45 $\times$ 3 = 4.35?" Explain the error.


Part C: Probability, Odds, and the Sigmoid :star::star:

C.1. For the logistic regression model $\ln(P/(1-P)) = 1.5 - 0.3x$:

(a) Calculate the predicted probability when $x = 0, 2, 5, 8, 10, 15$.

(b) Plot these probabilities (by hand or in Python). What shape does the curve have?

(c) At what value of $x$ does the predicted probability equal 0.50?

(d) Is the probability ever exactly 0 or exactly 1? Why or why not?

C.2. A gambler says "the team has a 60% chance of winning, so the odds are 60 to 40, which is 1.5 to 1." Then they say "each additional win in their streak adds 0.3 to their log-odds of winning the next game."

(a) Is the gambler's initial odds calculation correct?

(b) If the team currently has a 60% chance of winning, what are their log-odds?

(c) After one streak win (adding 0.3 to log-odds), what is the new probability of winning?

(d) After three streak wins (adding 0.9 total), what is the new probability?

(e) Does the same increase in log-odds (0.3) always correspond to the same increase in probability? Why or why not?
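
You can explore parts (c)-(e) numerically. The sketch below uses generic starting log-odds (0 and 2.0) rather than the gambler's numbers, to show how the same log-odds step behaves at different baselines:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for base in (0.0, 2.0):
    p0 = sigmoid(base)
    p1 = sigmoid(base + 0.3)   # the same +0.3 log-odds step in both cases
    print(round(p0, 3), round(p1, 3), round(p1 - p0, 3))
```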

C.3. Connecting to Bayes' theorem (Ch. 9). A disease screening model has:

- Sensitivity = 0.90 (90% of sick patients are correctly identified)
- Specificity = 0.85 (85% of healthy patients are correctly identified)
- Base rate (prevalence) = 0.02 (2% of the population has the disease)

(a) Using Bayes' theorem from Chapter 9, calculate the positive predictive value (PPV): if the model says "positive," what's the probability the person actually has the disease?

(b) Now construct the confusion matrix for a population of 10,000 people tested.

(c) Calculate the same PPV from the confusion matrix. Verify it matches your Bayes' theorem answer.

(d) Why is the PPV so much lower than the sensitivity? (Hint: base rate.)


Part D: Confusion Matrix and Metrics :star::star:

D.1. A churn prediction model produces the following confusion matrix:

                    Actually Churned   Actually Stayed
Predicted Churned                 80                40
Predicted Stayed                  20               360

(a) Calculate: accuracy, sensitivity, specificity, precision, and F1 score.

(b) The marketing team says "80% accuracy is pretty good." Respond to this claim, considering the base rate of churn in this dataset.

(c) If the cost of missing a churner (false negative) is $500 in lost revenue, and the cost of a false positive is $50 in wasted retention offer, what is the total cost of errors?

(d) Should the company lower the threshold to catch more churners? What would happen to each metric?
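
A helper like the one below is useful throughout Part D. The confusion-matrix counts here are made up (they are not the churn matrix above), so you can use the function without it answering part (a):

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)          # recall / true positive rate
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

# Illustrative counts only:
print([round(m, 3) for m in classification_metrics(tp=30, fp=20, fn=10, tn=140)])
# [0.85, 0.75, 0.875, 0.6, 0.667]
```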

D.2. A hiring algorithm predicts whether a candidate will be a "good hire" (1) or not (0). The confusion matrices for two demographic groups are:

Group A:

                  Actually Good   Actually Not
Predicted Good               60             20
Predicted Not                10            110

Group B:

                  Actually Good   Actually Not
Predicted Good               45             35
Predicted Not                 5            115

(a) Calculate the false positive rate for each group. (FPR = FP / (FP + TN))

(b) Calculate the sensitivity (true positive rate) for each group.

(c) Calculate the accuracy for each group.

(d) The company claims the algorithm is fair because accuracy is similar for both groups. Evaluate this claim.

(e) Connect to Theme 6: If Group B has a higher false positive rate, what does this mean in practical terms for people in Group B?

D.3. Explain why accuracy alone is a poor metric for evaluating a model that predicts a rare event (e.g., fraud, which occurs in only 1% of transactions). What metric or metrics would you use instead?


Part E: Python Implementation :star::star::star:

E.1. Using the Titanic survival dataset (or a similar dataset of your choice), fit a logistic regression predicting survival from passenger class, age, and sex.

# Starter code
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
import matplotlib.pyplot as plt

# Load data (adjust path as needed)
# df = pd.read_csv('titanic.csv')
# Or use seaborn: import seaborn as sns; df = sns.load_dataset('titanic')

# Your tasks:
# (a) Clean the data (handle missing values for age)
# (b) Create dummy variables for sex and passenger class
# (c) Fit a logistic regression using statsmodels
# (d) Report and interpret the odds ratios
# (e) Create a confusion matrix using a 0.5 threshold
# (f) Calculate accuracy, sensitivity, specificity, precision, F1
# (g) Plot the ROC curve and report the AUC
# (h) Discuss: does the model perform differently for men vs. women?
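
If you are unsure where to start with part (c), the pattern below shows a minimal statsmodels fit. It runs on simulated data with made-up variable names (`outcome`, `x`); swap in your cleaned Titanic columns for the real exercise:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Tiny simulated dataset, just to make the sketch self-contained
rng = np.random.default_rng(0)
n = 300
x = rng.normal(0, 1, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))
toy = pd.DataFrame({'outcome': y, 'x': x})

model = smf.logit('outcome ~ x', data=toy).fit(disp=False)
print(model.summary())
print(np.exp(model.params))      # coefficients -> odds ratios
print(np.exp(model.conf_int()))  # 95% CIs on the odds-ratio scale
```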

E.2. Fit a logistic regression to the following simulated medical data. A researcher wants to predict whether a patient develops diabetes (1 = yes, 0 = no) based on BMI and fasting glucose level.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix, roc_curve, auc
import matplotlib.pyplot as plt

np.random.seed(123)
n = 400
bmi = np.random.normal(27, 5, n).clip(18, 45)
glucose = np.random.normal(100, 20, n).clip(60, 200)

# True model
log_odds = -8.0 + 0.12 * bmi + 0.04 * glucose
prob = 1 / (1 + np.exp(-log_odds))
diabetes = np.random.binomial(1, prob)

df = pd.DataFrame({'bmi': bmi.round(1),
                   'glucose': glucose.round(1),
                   'diabetes': diabetes})

# (a) Explore the data: what is the diabetes rate? How do BMI and
#     glucose differ between groups?
# (b) Fit a logistic regression with statsmodels. Print the summary.
# (c) Calculate and interpret the odds ratios.
# (d) For a patient with BMI = 30 and glucose = 120, what is the
#     predicted probability of diabetes?
# (e) Create a confusion matrix and classification report.
# (f) Plot the ROC curve. What is the AUC?
# (g) Lower the threshold to 0.3. How do sensitivity and specificity
#     change? When might this lower threshold be appropriate?

E.3. Using Sam's basketball shot data from Section 24.11 (or generating your own), answer:

(a) What is the optimal threshold for Sam if missing a "good shot" opportunity (false negative) is more costly than attempting a bad shot (false positive)?

(b) Create two ROC curves on the same plot: one for a model with only distance, and one with distance + defender distance + quarter. Which model has a higher AUC?

(c) For the full model, calculate the predicted probability for the following shot scenarios and rank them from highest to lowest:

- 8 feet, defender at 4 feet, 1st quarter
- 15 feet, defender at 8 feet, 2nd quarter
- 20 feet, defender at 2 feet, 4th quarter
- 5 feet, defender at 3 feet, 3rd quarter


Part F: Applied Analysis :star::star::star:

F.1. Alex's A/B test, revisited. In earlier chapters, Alex tested whether a new recommendation algorithm increased watch time (a numerical outcome). Now Alex wants to predict which users will cancel their subscription.

(a) Why is logistic regression more appropriate than linear regression for this question?

(b) If Alex's model predicts churn with 75% accuracy but only 40% sensitivity, what does this mean in practical terms?

(c) Alex's manager says "we need to identify at least 80% of potential churners, even if it means some false alarms." What metric should Alex optimize? How would changing the threshold help?

(d) If the model's AUC is 0.72, is this a useful model? What would AUC = 0.50 imply?

F.2. James's algorithm audit, continued. Consider the following scenario:

A predictive policing algorithm has overall accuracy of 78% for both White and Black defendants. However:

Group   False Positive Rate   False Negative Rate
White                   18%                   24%
Black                   32%                   15%

(a) The algorithm developer says: "The accuracy is identical for both groups, so the algorithm is fair." Evaluate this claim.

(b) What does the difference in false positive rates mean in human terms?

(c) What does the difference in false negative rates mean? Who benefits from this?

(d) Is it possible for an algorithm to have equal accuracy for two groups but different error rates? Explain mathematically.

(e) Write a paragraph that Professor Washington might present to a judicial review board summarizing these findings.

F.3. Maya's readmission model in context. Maya's hospital is deciding whether to implement her readmission prediction model to allocate follow-up care resources.

(a) The model has sensitivity = 0.78 and specificity = 0.90. If 15% of patients are readmitted, what is the PPV? (Use Bayes' theorem or a confusion matrix.)

(b) The hospital can afford to provide follow-up care to at most 25% of patients. How should Maya set her threshold? What tradeoff is she making?

(c) A colleague suggests adding zip code as a predictor, which improves AUC from 0.82 to 0.87. Maya hesitates. Why might she be concerned? (Connect to Theme 6.)

(d) Should the hospital deploy the model even if it's imperfect? What are the risks of deploying it? What are the risks of not deploying it?


Part G: Conceptual Synthesis :star::star::star:

G.1. Fill in the following comparison table:

Feature                           Linear Regression (Ch. 22-23)   Logistic Regression (Ch. 24)
---------------------------------------------------------------------------------------------
Response variable type
Model equation
Coefficient interpretation
Fitting method                    Least squares
Residuals                         $y - \hat{y}$
Primary evaluation metric         $R^2$
Predicted output range            $(-\infty, +\infty)$
Assumption about relationship     Linear between $x$ and $y$
Can include multiple predictors

G.2. Explain how logistic regression connects to each of the following concepts from earlier chapters:

(a) Conditional probability (Ch. 9)

(b) The normal distribution as a model (Ch. 10)

(c) Confounding variables (Ch. 4, 23)

(d) Type I and Type II errors (Ch. 13)

(e) Effect sizes (Ch. 17)

G.3. A student says: "Logistic regression is just linear regression with a different function wrapped around it." Is this statement accurate? What does it capture well, and what does it miss?

G.4. Threshold concept check. Without looking back at the chapter, explain the relationship between probability, odds, and log-odds. Why does logistic regression model log-odds rather than probability directly?


Part H: Challenge Problems :star::star::star::star:

H.1. The fairness impossibility theorem (simplified). Consider an algorithm predicting recidivism where the base rate of reoffending differs between two groups: 30% for Group A and 15% for Group B.

(a) If the algorithm is calibrated (meaning: among people the algorithm gives a 40% risk score, exactly 40% actually reoffend in both groups), can the false positive rates be equal across groups? Explore this with a numerical example of 1,000 people from each group.

(b) This is related to a result by Chouldechova (2017): if base rates differ and the algorithm is calibrated, equal false positive rates are mathematically impossible (except in trivial cases). In one paragraph, explain the implications of this finding for James's work.

H.2. Comparing models. Fit three models to predict churn in Alex's dataset:

(a) Model 1: only monthly_hours as a predictor

(b) Model 2: monthly_hours + support_tickets

(c) Model 3: monthly_hours + support_tickets + months_subscribed

For each model, compute the AUC. Then answer: does adding more predictors always improve AUC? How does this relate to the adjusted $R^2$ concept from Chapter 23?

H.3. Logistic regression from scratch. The following questions walk you through the mechanics of maximum likelihood estimation (optional, for the mathematically curious).

(a) For a single observation with true label $y_i$ and predicted probability $\hat{p}_i$, the likelihood contribution is $\hat{p}_i^{y_i} (1-\hat{p}_i)^{1-y_i}$. Verify that this formula gives $\hat{p}_i$ when $y_i = 1$ and $(1-\hat{p}_i)$ when $y_i = 0$.

(b) The log-likelihood for $n$ observations is $\ell = \sum_{i=1}^n [y_i \ln(\hat{p}_i) + (1-y_i)\ln(1-\hat{p}_i)]$. This is called binary cross-entropy loss in machine learning. Why do you think machine learning uses the negative of this quantity as a loss function to minimize?
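
You can verify (a) and compute the log-likelihood in (b) numerically. The labels and predicted probabilities below are arbitrary illustrations:

```python
import numpy as np

y = np.array([1, 0, 1])
p_hat = np.array([0.9, 0.2, 0.6])

# Part (a): per-observation likelihood contribution
contrib = p_hat**y * (1 - p_hat)**(1 - y)
print(contrib)   # [0.9 0.8 0.6] -- p_hat where y=1, (1 - p_hat) where y=0

# Part (b): log-likelihood; its negative (averaged) is binary cross-entropy loss
log_lik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
print(round(log_lik, 3))  # -0.839
```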

(c) Unlike linear regression, there is no closed-form solution for the logistic regression coefficients. The computer finds them using iterative optimization (Newton-Raphson or gradient descent). Why does this matter in practice? (Hint: think about convergence and computing time.)

H.4. Regression to the mean and logistic regression. In Chapter 22, you learned about regression to the mean: extreme observations tend to be followed by less extreme ones. Does a similar phenomenon occur in logistic regression? Consider a patient who was readmitted after their first hospital stay. If the model predicts a high probability of readmission for this patient on a subsequent visit, is that prediction more or less reliable than for a patient with no prior readmissions? Discuss in terms of base rates and the predictive value of prior outcomes.


Part I: Metacognition :star:

I.1. Rate your confidence (1-5) on each of the following skills:

  • Converting between probability, odds, and log-odds
  • Interpreting odds ratios from logistic regression output
  • Fitting a logistic regression model in Python
  • Computing and interpreting a confusion matrix
  • Explaining the ROC curve and AUC
  • Identifying potential bias in classification algorithms

For any skill rated 3 or below, write one specific action you'll take to improve (re-read a section, rework a problem, ask a specific question).

I.2. The threshold concept for this chapter is "thinking in odds." On a scale of 1-5, how comfortable are you thinking in odds rather than probabilities? What specific example or analogy helped it click for you (or what's still confusing)?

I.3. Before this chapter, did you have an intuition for how AI classification works? Has that changed? In 2-3 sentences, describe what you now understand about classification that you didn't before.