Case Study 2 — The Sigmoid: Turning a Score into a Probability

Field: Data science / machine learning (Data Science track)
Calculus used: Exponential function (§2.2), composition (§2.5), inverses (§2.5), the modeling viewpoint (§2.7)
Forward references: Chapter 6 (gradient descent first appears), Chapter 7 (the sigmoid's derivative, via the chain rule), Chapter 30 (sigmoid inside multivariable gradient descent and neural networks)


The setup

A binary classifier decides yes-or-no questions: is this email spam, is this tumor malignant, should this loan be approved, will this ad be clicked? Underneath, almost every such model first computes a single real number — a score — and then must convert it into a probability in $[0,1]$. The score can be any real number at all: large and positive when the model is confident the answer is "yes," large and negative when it is confident of "no," near zero when it is unsure. The conversion from "any real number" to "a probability" is a modeling problem, and its standard solution is one beautifully chosen function.

That function is the sigmoid (or logistic function):

$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$

It is built entirely from the exponential of §2.2, it composes with a linear score to give logistic regression, and its inverse is one of the most-used quantities in statistics. This case study follows the sigmoid from its shape, to why it is the natural squashing function, to a preview of the calculus that makes it train efficiently — all using only Chapter 2 tools, with explicit forward pointers to where calculus takes over.

What the sigmoid does

Plot it first; the §2.6 discipline says never present a formula without its picture.

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1 / (1 + np.exp(-z))

z = np.linspace(-8, 8, 300)
plt.figure(figsize=(8, 4.5))
plt.plot(z, sigmoid(z), 'b-', lw=2, label=r'$\sigma(z)=1/(1+e^{-z})$')
plt.axhline(0, color='gray', lw=0.5); plt.axhline(1, color='gray', lw=0.5)
plt.axvline(0, color='gray', lw=0.5)
plt.scatter([0], [0.5], color='red', zorder=5, label=r'$\sigma(0)=\tfrac12$')
plt.xlabel('$z$'); plt.ylabel(r'$\sigma(z)$')
plt.title('The sigmoid (logistic) function')
plt.legend(); plt.grid(True, alpha=0.3); plt.show()
# Output: the classic S-curve, flat near 0 on the left, near 1 on the right,
#         rising through (0, 1/2).

The plot is the famous "S": near $0$ for very negative $z$, rising steeply through $\tfrac12$ at $z=0$, leveling toward $1$ for very positive $z$. Reading the properties straight off the formula:

  • Domain $\mathbb{R}$; range $(0,1)$, open at both ends: since $e^{-z} > 0$, the denominator always exceeds $1$, so $0 < \sigma(z) < 1$ and the output is always a strict probability.
  • $\sigma(0) = \dfrac{1}{1+1} = \dfrac12$.
  • $\displaystyle\lim_{z\to\infty}\sigma(z) = 1$ and $\displaystyle\lim_{z\to-\infty}\sigma(z) = 0$ (because $e^{-z}\to 0$ as $z\to\infty$, while $e^{-z}\to\infty$ as $z\to-\infty$).
  • Symmetry $\sigma(-z) = 1 - \sigma(z)$. Check it directly: $$\sigma(-z) = \frac{1}{1+e^{z}} = \frac{e^{-z}}{e^{-z}+1} = 1 - \frac{1}{1+e^{-z}} = 1 - \sigma(z).$$ This is the statement that the curve is point-symmetric about $(0,\tfrac12)$, and it is exactly why "$z$ favoring yes" and "$-z$ favoring no" are mirror images.

The sigmoid is, in short, the natural way to squash an unbounded score into a probability that never quite reaches certainty.
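The properties listed above can also be spot-checked numerically. The sketch below is a minimal illustration using only the standard library's `math` module; the sample points and the names `test_points`, `far_left`, and `far_right` are arbitrary choices made for this example, not part of the text.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z}) for a single real score."""
    return 1.0 / (1.0 + math.exp(-z))

# sigma(0) should be exactly one half.
value_at_zero = sigmoid(0.0)

# Symmetry: sigma(-z) + sigma(z) should equal 1 at every sample point.
test_points = [-5.0, -1.3, 0.0, 0.4, 2.0, 6.0]   # arbitrary sample scores
symmetry_gaps = [abs(sigmoid(-z) + sigmoid(z) - 1.0) for z in test_points]

# Limits: far to the left the value is near 0, far to the right near 1,
# yet both stay strictly inside the open interval (0, 1).
far_left, far_right = sigmoid(-20.0), sigmoid(20.0)

print("sigma(0)              =", value_at_zero)
print("largest symmetry gap  =", max(symmetry_gaps))
print("sigma(-20), sigma(20) =", far_left, far_right)
```

A numerical check at a handful of points is evidence, not proof; the algebra in the bullet list is what establishes the properties for every $z$.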

Why a score becomes a probability

In logistic regression, the score is a linear combination of the input features:

$$z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b,$$

where $x_i$ are features (tumor dimensions, word counts, financial ratios) and $w_i, b$ are learned weights. This raw $z$ ranges over all of $\mathbb{R}$ and is not a probability. Feeding it through the sigmoid gives

$$P(\text{yes} \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}} \in (0,1).$$

Notice the structure: $P(\text{yes}\mid x) = \sigma(\text{linear in } x)$. That is a composition in the exact sense of §2.5 — an outer function $\sigma$ wrapped around an inner linear function. The classifier is a composed function, and recognizing the inner/outer split now is precisely the skill that the chain rule (Chapter 7) will reward when we differentiate it.
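The inner/outer split can be made concrete in a few lines. The following is a sketch only: the function names `linear_score` and `predict_probability`, and the example weights and inputs, are hypothetical values invented for illustration rather than anything fitted to data.

```python
import math
from typing import Sequence

def linear_score(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Inner function: z = w_1 x_1 + ... + w_n x_n + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z: float) -> float:
    """Outer function: squashes any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Composition: sigma applied to the linear score."""
    return sigmoid(linear_score(x, w, b))

# Hypothetical weights and one hypothetical input point.
w_example = [1.5, -2.0]
b_example = 0.25
x_example = [2.0, 0.5]

z_example = linear_score(x_example, w_example, b_example)   # 3.0 - 1.0 + 0.25
p_example = predict_probability(x_example, w_example, b_example)

print("score z =", z_example)
print("P(yes | x) =", round(p_example, 3))
```

Here the score is $2.25$ and the resulting probability is roughly $0.905$. Keeping the two stages as separate functions mirrors the decomposition that the chain rule will later act on.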

Real-World Application. Every time a bank scores a credit application, a medical model flags an image for review, or a moderation system rates a post, somewhere in the pipeline a sigmoid (or a close cousin) is turning a raw score into a probability. It is among the most heavily executed functions in modern data infrastructure.

The inverse: the logit

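Because $\sigma$ is one-to-one on $\mathbb{R}$, it has an inverse. (As $z$ increases, $e^{-z}$ decreases, so the denominator $1+e^{-z}$ shrinks and $\sigma(z)$ strictly increases; a strictly increasing function passes the horizontal line test of §2.5.) Find the inverse by the swap-and-solve recipe: set $p = \dfrac{1}{1+e^{-z}}$ and solve for $z$.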

$$p(1+e^{-z}) = 1 \;\Longrightarrow\; e^{-z} = \frac{1-p}{p} \;\Longrightarrow\; -z = \ln\frac{1-p}{p} \;\Longrightarrow\; z = \ln\frac{p}{1-p}.$$

So the inverse of the sigmoid is the logit:

$$\operatorname{logit}(p) = \ln\frac{p}{1-p}, \qquad p \in (0,1).$$

It maps probabilities back to real scores, $\operatorname{logit}:(0,1)\to\mathbb{R}$, with domain and range swapped from $\sigma$, exactly as inverses must (§2.5). The ratio $p/(1-p)$ is the odds, and $\ln$ of the odds is the log-odds, so the logit is the log-odds function. This is what makes logistic regression so interpretable: each weight $w_i$ is the amount by which a one-unit increase in feature $x_i$, with the other features held fixed, shifts the log-odds of "yes." Equivalently, that increase multiplies the odds by $e^{w_i}$. The whole model reads as "log-odds is linear in the features," with the sigmoid translating log-odds back into a probability.
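A short numerical sketch can confirm that the logit really undoes the sigmoid and show what a shift in the score does to the odds. The helper `odds` and the values `weight` and `base_score` are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Log-odds ln(p / (1 - p)), defined only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is only defined for 0 < p < 1")
    return math.log(p / (1.0 - p))

def odds(p: float) -> float:
    """Odds p / (1 - p) in favor of 'yes'."""
    return p / (1.0 - p)

# Round trip: logit(sigmoid(z)) should return z, up to rounding error.
scores = [-3.0, -0.5, 0.0, 1.2, 4.0]
round_trip_errors = [abs(logit(sigmoid(z)) - z) for z in scores]

# Worked value: p = 0.8 means odds of 4 to 1, so the log-odds is ln 4.
log_odds_at_080 = logit(0.8)

# Adding a (hypothetical) weight to the score multiplies the odds by e^weight.
weight, base_score = 0.7, 0.3
odds_ratio = odds(sigmoid(base_score + weight)) / odds(sigmoid(base_score))

print("largest round-trip error:", max(round_trip_errors))
print("logit(0.8) =", log_odds_at_080, " ln 4 =", math.log(4))
print("odds ratio =", odds_ratio, " e^0.7 =", math.exp(weight))
```

The last comparison follows from the derivation above: since $\operatorname{logit}(\sigma(z)) = z$, the odds at score $z$ are $e^{z}$, so raising the score by $0.7$ multiplies the odds by $e^{0.7}$.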

A preview of the calculus (Chapter 7)

We will not differentiate until Chapter 6, and we will not have the chain rule until Chapter 7 — but the punchline is too elegant to withhold. The derivative of the sigmoid is

$$\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr).$$

The derivative of the sigmoid is expressible in terms of the sigmoid itself. Once you have computed $\sigma(z)$ for a data point, you get $\sigma'(z)$ for free, with no new exponentials. That is one reason sigmoid-based models were cheap to train in the early neural-network era: the slope you need for learning is already sitting in the value you just computed. Slopes like this one, chained together, are what gradient descent (the optimization anchor introduced in Chapter 6 and developed for several variables in Chapter 30) uses to nudge the weights toward a better fit.

We can already sanity-check the identity numerically with the difference quotient of §2.6, even before we can prove it:

import numpy as np
def sigmoid(z): return 1/(1+np.exp(-z))

z0, h = 0.7, 1e-6
numeric = (sigmoid(z0+h) - sigmoid(z0-h)) / (2*h)   # symmetric difference
identity = sigmoid(z0) * (1 - sigmoid(z0))           # σ(1-σ)
print(numeric, identity)   # ≈ 0.2217  0.2217 (they agree)

The numerical slope and the $\sigma(1-\sigma)$ identity agree to many digits — a foreshadowing of the proof we owe you in Chapter 7.

A note on modern networks

Sigmoid was the dominant neural-network activation for decades. Around 2010, researchers found that a simpler piecewise function (§2.4 in spirit), the rectified linear unit

$$\operatorname{ReLU}(z) = \max(0, z) = \begin{cases} z & z \ge 0 \\ 0 & z < 0,\end{cases}$$

trains faster in deep networks; it is now the default for hidden layers. But the sigmoid never left: it still sits at the output of binary classifiers, where a genuine probability in $(0,1)$ is exactly what is wanted. Both functions make the same point: machine learning is, at bottom, the disciplined choice of function families to model input-output relationships, the §2.7 viewpoint applied at industrial scale.
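To see the two activations side by side, here is a minimal sketch that evaluates both at a few arbitrary sample scores. It illustrates only the shapes of the functions (ReLU is unbounded above and exactly zero for negative inputs, while the sigmoid stays inside $(0,1)$); it says nothing about training speed.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def relu(z: float) -> float:
    """Rectified linear unit: z for z >= 0, and 0 for z < 0."""
    return max(0.0, z)

sample_scores = [-4.0, -1.0, 0.0, 1.0, 4.0]   # arbitrary sample points
relu_values = [relu(z) for z in sample_scores]
sigmoid_values = [sigmoid(z) for z in sample_scores]

# Print a small comparison table, one row per score.
for z, r, s in zip(sample_scores, relu_values, sigmoid_values):
    print(f"z = {z:5.1f}   ReLU = {r:4.1f}   sigmoid = {s:.4f}")
```

Only the sigmoid column can be read as a probability, which is why it keeps its place at the output of a binary classifier.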

Your turn — mini-project

Fit a logistic regression and see the sigmoid at work.

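  1. Generate $200$ points in $\mathbb{R}^2$; label a point $1$ if $x_1 + x_2 > 0$ and $0$ otherwise. The true boundary is the line $x_1 + x_2 = 0$, which any weights with $w_1 = w_2 > 0$ and $b = 0$ reproduce; $w_1 = w_2 = 1$, $b = 0$ is the simplest choice.
  2. Fit with scikit-learn's LogisticRegression.
  3. Inspect the learned weights: are $w_1$ and $w_2$ roughly equal, with $b$ near $0$? These labels pin down only the direction of the weight vector, not its length, so do not expect the values to be close to $1$.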
  4. Plot the data and the decision boundary $\sigma(w_1 x_1 + w_2 x_2 + b) = \tfrac12$, which (since $\sigma = \tfrac12 \iff$ its input is $0$) is the line $w_1 x_1 + w_2 x_2 + b = 0$.

import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

np.random.seed(42)
X = np.random.randn(200, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
print("weights:", model.coef_, " intercept:", model.intercept_)

w, b = model.coef_[0], model.intercept_[0]
x1 = np.linspace(-3, 3, 100)
x2 = -(w[0]*x1 + b) / w[1]            # boundary: w·x + b = 0
plt.scatter(X[y==0,0], X[y==0,1], c='red',  label='class 0')
plt.scatter(X[y==1,0], X[y==1,1], c='blue', label='class 1')
plt.plot(x1, x2, 'k--', label='decision boundary')
plt.legend(); plt.show()
# Output: two clouds split by a line near x1 + x2 = 0; learned weights ≈ equal.

In Chapter 30 you will compute the gradient of the logistic loss — multivariable calculus — and watch the weights converge from random initial values, the sigmoid's derivative identity doing the work at every step.

Discussion questions

  1. Other functions also map $\mathbb{R}$ into a bounded interval — e.g. a rescaled $\arctan z$, or the cumulative distribution function of any continuous random variable. Why might the sigmoid be preferred over these for a probability model? (Consider its clean derivative and its log-odds inverse.)
  2. The S-shape encodes a specific kind of response: insensitive at the extremes, sharp near $z = 0$. Name two real processes (drug dose-response, voter persuasion, a sensor threshold...) that match this shape, and one that does not.
  3. The range is the open interval $(0,1)$: the sigmoid never outputs exactly $0$ or $1$. Is that a feature or a bug for a probability model, and why?
  4. Suppose you replaced the sigmoid with the clipped linear map $f(z) = z$ on $[0,1]$, flat outside. Where would this behave acceptably, and where would it fail compared with $\sigma$? Tie your answer to the four model criteria of §2.7.
  5. Name three ML systems you used this week and guess where a sigmoid (or relative) sits in each pipeline. The aim is intuition for where these functions live, not a literature search.

Further reading (annotated)

  • Cox, D. R. (1958). "The regression analysis of binary sequences." J. Royal Statistical Society B 20(2), 215–242. The original logistic-regression paper; historically the source of the sigmoid-as-probability idea.
  • Hastie, Tibshirani & Friedman (2009). The Elements of Statistical Learning (2nd ed.). Springer. Free at https://hastie.su.domains/ElemStatLearn/. Chapter 4 treats logistic regression with full statistical care.
  • Goodfellow, Bengio & Courville (2016). Deep Learning. MIT Press. Free at https://www.deeplearningbook.org/. Chapter 6 compares activation functions (sigmoid, ReLU) and explains why deep networks moved on from sigmoid hidden layers.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 4 derives logistic regression and the role of the logit link in clean, readable form.

The sigmoid exists because it is mathematically convenient, not because nature demanded $1/(1+e^{-z})$. It earned its place by pulling off three properties at once: it maps $\mathbb{R}$ smoothly onto $(0,1)$, it has a derivative expressible through itself, and its inverse is the interpretable log-odds. Modeling at its best is the search for a function that achieves exactly the right bundle of properties.