Prerequisites
- chapter-05-rates-of-change
Learning Objectives
- Compute f'(x) from the limit definition for any elementary function
- Sketch f' given the graph of f
- Compute and interpret higher-order derivatives
- Identify all points where a function fails to be differentiable
- Recognize gradient descent as derivative-driven minimization
In This Chapter
- 6.1 From a Slope at a Point to a Slope Everywhere
- 6.2 The Derivative as a Function
- 6.3 Differentiability Implies Continuity
- 6.4 Graphing the Derivative
- 6.5 The Tangent Line and the Best Linear Approximation
- 6.6 Higher-Order Derivatives
- 6.7 When Derivatives Fail to Exist
- 6.8 Gradient Descent: The First Anchor Example
- 6.9 Notation, and a Note for the Curious
- 6.10 Summary and Looking Ahead
The Derivative: Definition, Notation, and What It Means
6.1 From a Slope at a Point to a Slope Everywhere
Chapter 5 introduced the derivative at a point: $f'(a)$ is the slope of $f$ at $x = a$, defined as the limit of a difference quotient. That number answered a single question — how steep is the curve right here? This chapter takes the decisive next step. Instead of computing the slope at one chosen point, we compute it at every point at once, and we collect all those slopes into a brand-new function.
That new function is the derivative function $f'$, the central character of all of Part II. Everything that follows — the differentiation rules of Chapter 7, the optimization of Chapter 10, the differential equations of Chapter 19 — is really a study of how to build $f'$ and what it tells you.
And it tells you nearly everything about $f$:
- Where $f$ is increasing (where $f' > 0$) or decreasing (where $f' < 0$).
- Where $f$ has maxima and minima (among the points where $f' = 0$ or $f'$ fails to exist).
- How fast $f$ is changing at each point (the magnitude $|f'|$).
- The local behavior of $f$ near any point, through the tangent-line approximation of Chapter 11.
This is one of the great moves in mathematics: turn a process you apply point-by-point into an object you can study, plot, and differentiate again. By the end of the chapter you will also meet gradient descent, the algorithm that trains essentially every modern AI system. It is, as you will see, nothing more than the derivative used as a compass.
The Key Insight. The derivative is not a number attached to a point — it is a function. From the graph of $f$ you can read off the graph of $f'$; from the sign of $f'$ you can read off where $f$ rises and falls; from the zeros of $f'$ you can find $f$'s peaks and valleys. To understand $f$ deeply is to understand $f'$.
6.2 The Derivative as a Function
Recall the pointwise definition from Chapter 5:
$$f'(a) = \lim_{h \to 0} \frac{f(a + h) - f(a)}{h}.$$
Now let the point itself become a variable. If this limit exists for every $x$ in some set $D$, we define a new function $f' : D \to \mathbb{R}$ by
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.$$
This $f'$ is the derivative function of $f$ (also called simply the derivative, or the rate of change of $f$). The set $D$ on which the limit exists is the domain of differentiability of $f$.
For most "nice" functions — polynomials, exponentials, sine, cosine — $D$ is the entire natural domain of $f$. For others, $D$ is a proper subset: the function $f(x) = |x|$ is defined for all real $x$, but its derivative is missing the single point $x = 0$. Keeping these two domains distinct is the first habit of careful differential calculus.
The procedure for finding $f'$ from the definition is always the same four-step ritual:
- Write the difference quotient $\dfrac{f(x+h) - f(x)}{h}$.
- Expand and simplify the numerator until a factor of $h$ appears.
- Cancel that $h$ to kill the indeterminate $0/0$ form.
- Take the limit as $h \to 0$.
Let us run the ritual on two functions of genuinely different character.
Worked Example 1: $f(x) = x^3$
$$f'(x) = \lim_{h \to 0} \frac{(x + h)^3 - x^3}{h}.$$
Expand $(x+h)^3 = x^3 + 3x^2 h + 3x h^2 + h^3$, so the numerator is $3x^2 h + 3x h^2 + h^3$. Every term carries a factor of $h$:
$$f'(x) = \lim_{h \to 0} \frac{3x^2 h + 3x h^2 + h^3}{h} = \lim_{h \to 0} \left(3x^2 + 3x h + h^2\right) = 3x^2.$$
So $\dfrac{d}{dx}(x^3) = 3x^2$, valid for every real $x$. This is one instance of the power rule, which Chapter 7 will prove in general so you never have to expand a binomial again.
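As a numerical sanity check on Worked Example 1, the sketch below evaluates the difference quotient of $x^3$ for shrinking $h$ and compares it with the simplified form $3x^2 + 3xh + h^2$ from the derivation. The helper name `difference_quotient` and the sample point $x = 2$ are illustrative choices, not part of the text.

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

def cube(x):
    return x ** 3

x = 2.0  # sample point (an arbitrary choice); the limit should be 3 * x**2 = 12
rows = []
for h in (1e-1, 1e-2, 1e-3, 1e-4):
    dq = difference_quotient(cube, x, h)
    simplified = 3 * x**2 + 3 * x * h + h**2  # the quotient after cancelling h
    rows.append((h, dq, simplified))
    print(f"h = {h:<7g} quotient = {dq:.8f}   3x^2 + 3xh + h^2 = {simplified:.8f}")
```

The two columns should agree up to rounding error, and both drift toward $12$ as $h$ shrinks. The leftover terms $3xh + h^2$ are exactly what the limit discards.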
Worked Example 2: $f(x) = \sqrt{x}$ (for $x > 0$)
Here the difference quotient is $\dfrac{\sqrt{x+h} - \sqrt{x}}{h}$, and there is no $h$ to cancel — yet. The trick is to rationalize by multiplying by the conjugate:
$$f'(x) = \lim_{h \to 0} \frac{\sqrt{x + h} - \sqrt{x}}{h} \cdot \frac{\sqrt{x + h} + \sqrt{x}}{\sqrt{x + h} + \sqrt{x}} = \lim_{h \to 0} \frac{(x + h) - x}{h\left(\sqrt{x + h} + \sqrt{x}\right)}.$$
The numerator collapses to $h$, which now cancels:
$$f'(x) = \lim_{h \to 0} \frac{1}{\sqrt{x + h} + \sqrt{x}} = \frac{1}{2\sqrt{x}}.$$
So $\dfrac{d}{dx}(\sqrt{x}) = \dfrac{1}{2\sqrt{x}}$, defined for $x > 0$. Notice the domain mismatch: $f$ itself is defined and continuous at $x = 0$ (with $f(0) = 0$), but $f'(0)$ does not exist, because $\frac{1}{2\sqrt{x}} \to +\infty$ as $x \to 0^+$. Geometrically, the graph of $\sqrt{x}$ has a vertical tangent at the origin — it rises infinitely steeply there. We return to this kind of failure in §6.7.
Common Pitfall. Many students substitute $h = 0$ before the indeterminate form is resolved: they write $\frac{\sqrt{x+h}-\sqrt{x}}{h}$, "let $h \to 0$" to get $\frac{0}{0}$, and quit. But $0/0$ is not an answer; it is a signal that algebra is still required. Every difference quotient looks like $0/0$ at $h = 0$; the entire art is to rewrite it (factor, expand, rationalize, use a known limit) until the offending $h$ cancels. Never plug in $h = 0$ until that cancellation is done.
Check Your Understanding. Use the limit definition to compute $f'(x)$ for $f(x) = x^2 + x$.
Answer
The difference quotient is $\dfrac{(x+h)^2 + (x+h) - (x^2 + x)}{h} = \dfrac{2xh + h^2 + h}{h} = 2x + h + 1$. Letting $h \to 0$ gives $f'(x) = 2x + 1$. Note the derivative of the sum is the sum of the derivatives — a foreshadowing of the linearity rule in Chapter 7.
A small catalog from first principles
The same ritual produces a starter set of derivatives. You should be able to reproduce each one; they are the raw material Chapter 7 will package into rules.
| $f(x)$ | difference-quotient trick | $f'(x)$ |
|---|---|---|
| $c$ (constant) | numerator is $c - c = 0$ | $0$ |
| $x$ | numerator is $(x+h) - x = h$ | $1$ |
| $x^2$ | expand $(x+h)^2$, cancel $h$ | $2x$ |
| $x^3$ | expand $(x+h)^3$, cancel $h$ | $3x^2$ |
| $1/x$ | common denominator, cancel $h$ | $-1/x^2$ |
| $\sqrt{x}$ | rationalize with conjugate | $1/(2\sqrt{x})$ |
| $e^x$ | factor $e^x$, use $\lim_{h\to0}\frac{e^h-1}{h}=1$ | $e^x$ |
The last row deserves a comment: it is the property that makes $e$ special. Of all exponential functions $b^x$, only $b = e$ has a derivative exactly equal to itself, because $e$ is defined so that $\lim_{h \to 0}\frac{e^h - 1}{h} = 1$. We will lean on this fact constantly.
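To see numerically why $e$ is singled out, note that factoring gives $\frac{b^{x+h} - b^x}{h} = b^x \cdot \frac{b^h - 1}{h}$ for any base $b$. The sketch below evaluates the second factor at a small $h$ for a few bases. It compares the result with $\ln b$, which is the standard value of this limit (a fact assumed here, not proved in this chapter). The function name `exp_limit_quotient` and the choice $h = 10^{-6}$ are illustrative.

```python
import math

def exp_limit_quotient(b, h):
    """The quotient (b**h - 1) / h, whose limit as h -> 0 scales the derivative of b**x."""
    return (b ** h - 1) / h

h = 1e-6  # small, but not so small that rounding error dominates
for b in (2.0, math.e, 3.0):
    q = exp_limit_quotient(b, h)
    print(f"b = {b:.5f}: (b^h - 1)/h = {q:.6f}   ln b = {math.log(b):.6f}")
```

Only for $b = e$ does the quotient settle at $1$, which is why $e^x$ is its own derivative while $2^x$ and $3^x$ pick up a constant factor.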
Computational Note. The limit definition suggests an obvious way to approximate a derivative on a computer: pick a small $h$ and evaluate $\frac{f(x+h) - f(x)}{h}$. This forward difference works, but a better choice is the symmetric (central) difference $\frac{f(x+h) - f(x-h)}{2h}$, which uses points on both sides and is far more accurate for the same $h$. The reason, which Chapter 23 makes precise with Taylor series, is that the symmetric formula cancels the leading error term, so its error shrinks like $h^2$ rather than $h$.
```python
import numpy as np

f = np.sin                      # f = sin, whose exact derivative is cos
x, h = 1.0, 1e-5

forward = (f(x + h) - f(x)) / h
symmetric = (f(x + h) - f(x - h)) / (2*h)
exact = np.cos(x)

print(f"forward  : {forward:.10f}")    # 0.5402980985 (error ~4e-6)
print(f"symmetric: {symmetric:.10f}")  # 0.5403023059 (error ~1e-11)
print(f"exact cos: {exact:.10f}")      # 0.5403023059
```

The exact value is $\cos(1) \approx 0.5403023059$. Both finite differences land near it, but the symmetric formula agrees to many more digits for the same $h$. This is the everyday method behind `numpy`'s `np.gradient`, and it is one way gradient descent (§6.8) can estimate slopes when no formula for $f'$ is available.
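The claim that the forward error shrinks like $h$ while the symmetric error shrinks like $h^2$ can be tested directly: halve $h$ and watch how each error responds. The following is a minimal sketch using only the standard library; the helper names and the two step sizes are illustrative choices.

```python
import math

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
exact = math.cos(x)  # exact derivative of sin at x

errors = {}
for h in (1e-2, 5e-3):
    errors[h] = (abs(forward_diff(math.sin, x, h) - exact),
                 abs(central_diff(math.sin, x, h) - exact))
    print(f"h = {h:g}: forward error = {errors[h][0]:.3e}, "
          f"central error = {errors[h][1]:.3e}")

# Halving h should roughly halve the forward error and quarter the central error.
forward_ratio = errors[1e-2][0] / errors[5e-3][0]
central_ratio = errors[1e-2][1] / errors[5e-3][1]
print(f"forward ratio ~ {forward_ratio:.2f}, central ratio ~ {central_ratio:.2f}")
```

Ratios near $2$ and $4$ are the signature of first-order and second-order accuracy. For much smaller $h$ the pattern eventually breaks down, because floating-point rounding in the numerator starts to dominate.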
6.3 Differentiability Implies Continuity
Before plotting derivatives, we settle one structural fact that organizes everything: a function that has a derivative at a point must already be continuous there.
Theorem (Differentiable $\Rightarrow$ Continuous). If $f'(a)$ exists, then $f$ is continuous at $a$.
Why we care. This tells you immediately that any break in a graph — a jump, a hole, an asymptote — is automatically a place with no derivative. You never need to test differentiability at a discontinuity; it fails for free.
The proof. We must show $\lim_{x \to a} f(x) = f(a)$, equivalently $\lim_{x\to a}\big(f(x) - f(a)\big) = 0$. For $x \neq a$, write the difference as the difference quotient times the gap:
$$f(x) - f(a) = \frac{f(x) - f(a)}{x - a}\cdot (x - a).$$
As $x \to a$, the first factor tends to $f'(a)$ (a finite number, by hypothesis) and the second factor tends to $0$. By the product rule for limits, the product tends to $f'(a) \cdot 0 = 0$, so $f(x) - f(a) \to 0$, which is exactly continuity. $\blacksquare$
What this means. The implication runs in one direction only. Differentiability is strictly stronger than continuity: every differentiable function is continuous, but not every continuous function is differentiable. The standard counterexample is $f(x) = |x|$, which is continuous at $0$ (no break) yet has a corner there, so $f'(0)$ does not exist. Keep the logic straight:
| Continuous? | Differentiable? | Example |
|---|---|---|
| No | No (automatically) | a step function at its jump |
| Yes | No | $\lvert x\rvert$ at $0$ (corner) |
| Yes | Yes | polynomials, $\sin x$, $e^x$ everywhere |
Warning. The arrow points only from differentiable to continuous, and reversing it is a genuine trap, not a mere slip. In §6.9 we exhibit the Weierstrass function, which is continuous at every point of the line and differentiable at none of them. Continuity is about having no gaps; differentiability is about having a well-defined direction. They are different demands, and the second is much harder to meet.
6.4 Graphing the Derivative
Here is a skill worth more than a dozen formulas: given only the graph of $f$ — no equation at all — you can sketch the graph of $f'$. The translation dictionary is short.
- Where $f$ rises steeply, $f'$ is large and positive.
- Where $f$ rises gently, $f'$ is small and positive.
- Where $f$ has a horizontal tangent, $f'$ is zero.
- Where $f$ falls, $f'$ is negative (steeper fall $\Rightarrow$ more negative).
Geometric Intuition. Picture sliding a tiny ruler along the curve of $f$, keeping it tangent at each point. The slope of that ruler is the height of the point you plot on the graph of $f'$. As you slide left to right, watch the ruler tilt: when it tips uphill, the $f'$-graph is above the axis; when it momentarily goes flat at a peak or valley, the $f'$-graph crosses the axis; when it tips downhill, the $f'$-graph dips below. The graph of $f'$ is the running transcript of how the tangent ruler tilts.
Let us make this concrete with $f(x) = x^3 - 3x$. By the catalog above (and the rules of Chapter 7), $f'(x) = 3x^2 - 3 = 3(x-1)(x+1)$.
```python
import numpy as np
import matplotlib.pyplot as plt

# Plot f(x) = x^3 - 3x alongside its derivative f'(x) = 3x^2 - 3
x = np.linspace(-2.5, 2.5, 300)
f = x**3 - 3*x
fprime = 3*x**2 - 3

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(x, f, 'b-', linewidth=2); ax1.set_title('$f(x) = x^3 - 3x$')
ax1.axhline(0, color='gray', lw=0.5); ax1.axvline(0, color='gray', lw=0.5)
ax1.grid(True, alpha=0.3)
ax2.plot(x, fprime, 'r-', linewidth=2); ax2.set_title("$f'(x) = 3x^2 - 3$")
ax2.axhline(0, color='gray', lw=0.5); ax2.axvline(0, color='gray', lw=0.5)
ax2.grid(True, alpha=0.3)

# Mark the critical points where f' = 0 (at x = ±1)
ax2.plot([-1, 1], [0, 0], 'ko', markersize=8)
plt.tight_layout(); plt.show()
```
Figure 6.1 — The function $f(x) = x^3 - 3x$ (left) and its derivative $f'(x) = 3x^2 - 3$ (right). The left graph has horizontal tangents at $x = -1$ (a local maximum) and $x = +1$ (a local minimum); the right graph crosses zero at exactly those two $x$-values. Between them, $f$ is falling, and indeed $f' < 0$ on $(-1, 1)$.
Read the two panels together. Where the left curve peaks ($x=-1$) and bottoms out ($x=1$), the right curve crosses the axis. Where the left curve plunges (the middle), the right curve is below the axis. Where the left curve climbs steeply (the outer arms), the right curve shoots up. The derivative graph is a faithful report on the original.
The Key Insight. The places where $f' = 0$ or $f'$ fails to exist are called the critical points of $f$, and they are where all the interesting behavior lives — local maxima, local minima, and other turning points. This single observation is the engine of optimization. We will exploit it relentlessly beginning in Chapter 10.
Sketching $f'$ with no formula at all
When you are handed only a sketch, follow this routine to produce $f'$:
- Mark every point where $f$ has a horizontal tangent — these become the zeros of $f'$.
- Shade the intervals where $f$ is increasing — there $f'$ lies above the axis.
- Shade the intervals where $f$ is decreasing — there $f'$ lies below the axis.
- Gauge the steepness: a steeper part of $f$ pushes $f'$ farther from the axis.
- Connect your marks with a smooth curve consistent with steps 1–4.
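The routine above can be mimicked numerically when the "sketch" is a table of sampled points instead of a drawing. The following is a minimal sketch under stated assumptions: the samples come from $f(x) = x^3 - 3x$ (used only to generate data), slopes are estimated with central differences, and the helper name `central_slopes` is hypothetical.

```python
def central_slopes(xs, ys):
    """Estimate f' at interior sample points using central differences."""
    return [(ys[i + 1] - ys[i - 1]) / (xs[i + 1] - xs[i - 1])
            for i in range(1, len(xs) - 1)]

# Sampled "graph" of f(x) = x^3 - 3x on [-2.5, 2.5]; only the samples are used below.
n = 501
xs = [-2.5 + 5.0 * i / (n - 1) for i in range(n)]
ys = [x**3 - 3 * x for x in xs]

slopes = central_slopes(xs, ys)
inner_xs = xs[1:-1]  # slopes[i] is the estimated slope at inner_xs[i]

# A sign change in the estimated slope marks a horizontal tangent of f (a zero of f').
crossings = []
for i in range(len(slopes) - 1):
    if slopes[i] * slopes[i + 1] < 0:
        crossings.append(0.5 * (inner_xs[i] + inner_xs[i + 1]))

print("estimated zeros of f':", crossings)
print("slope sign at left end :", "+" if slopes[0] > 0 else "-")
print("slope sign in the middle:", "+" if slopes[len(slopes) // 2] > 0 else "-")
```

The detected crossings sit near $x = \pm 1$, and the slope signs reproduce the rising, falling, rising pattern you would shade by hand.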
Check Your Understanding. The graph of $f$ is a smooth hill: it rises from the left, peaks at $x = 2$, then falls. Where is $f'(x)$ positive, where is it zero, and where is it negative?
Answer
$f' > 0$ for $x < 2$ (the hill is rising), $f'(2) = 0$ (the horizontal tangent at the summit), and $f' < 0$ for $x > 2$ (the hill is falling). So the graph of $f'$ starts above the axis, crosses down through zero at $x = 2$, and continues below. Its exact shape depends on how the steepness of the hill varies, but it must pass from positive to negative through the point $(2, 0)$.
6.5 The Tangent Line and the Best Linear Approximation
The number $f'(a)$ has a clean geometric job: it is the slope of the tangent line to $y = f(x)$ at the point $(a, f(a))$. Point-slope form gives the tangent line directly:
$$y = f(a) + f'(a)\,(x - a).$$
This line is not just a line through the point; it is the best linear approximation to $f$ near $a$. Among all straight lines through $(a, f(a))$, the tangent is the unique one that hugs the curve to first order — the error between $f(x)$ and the tangent line shrinks faster than $x - a$ as $x \to a$. That fact is the seed of linear approximation and Newton's method, which Chapter 11 develops in full, and it is also why gradient descent (§6.8) works: stepping along the local slope is stepping along the curve's own best linear model of itself.
For a quick example, take $f(x) = \sqrt{x}$ at $a = 4$. We have $f(4) = 2$ and $f'(4) = \frac{1}{2\sqrt 4} = \frac14$, so the tangent line is $y = 2 + \frac14(x - 4)$. Near $x = 4$ this line approximates $\sqrt{x}$ well: at $x = 4.2$ it predicts $2.05$, while the true value is $\sqrt{4.2} \approx 2.0494$ — a tiny error of about $0.0006$.
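The phrase "shrinks faster than $x - a$" can be checked with a few lines of code. The sketch below rebuilds the tangent line to $\sqrt{x}$ at $a = 4$ and tracks the ratio of the error to the distance from $a$. The helper name `tangent_line` and the sample offsets are illustrative.

```python
import math

def tangent_line(f, fprime, a):
    """Return the function L(x) = f(a) + f'(a) * (x - a)."""
    fa, slope = f(a), fprime(a)
    return lambda x: fa + slope * (x - a)

def sqrt_prime(x):
    return 1 / (2 * math.sqrt(x))  # derivative found in Worked Example 2

L = tangent_line(math.sqrt, sqrt_prime, 4.0)

ratios = []
for dx in (0.2, 0.02, 0.002):
    x = 4.0 + dx
    error = abs(math.sqrt(x) - L(x))
    ratios.append(error / dx)
    # "Best linear approximation" means error / dx itself tends to 0.
    print(f"x = {x:.3f}: tangent = {L(x):.6f}, sqrt = {math.sqrt(x):.6f}, "
          f"error = {error:.2e}, error/dx = {error / dx:.2e}")
```

Each tenfold cut in the offset cuts the error by roughly a hundredfold, so the ratio error$/(x - a)$ itself heads to zero. No other line through $(4, 2)$ has that property.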
Real-World Application — Marginal cost and marginal revenue (economics). Economists call the derivative of a total-cost function $C(q)$ the marginal cost $C'(q)$: the approximate cost of producing one more unit when output is $q$. Likewise $R'(q)$ is marginal revenue. The entire theory of profit maximization is the statement "produce until marginal revenue equals marginal cost," i.e. until $R'(q) = C'(q)$ — a derivative equation. A firm with cost $C(q) = 0.01q^2 + 5q + 200$ has marginal cost $C'(q) = 0.02q + 5$, so the $101$st unit costs about $C'(100) = \$7$ to make. The derivative turns a vague managerial question ("is one more unit worth it?") into an exact, computable comparison. We pursue marginal analysis throughout Part II and again with integrals in Chapter 14's Net Change Theorem.
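To see how good the word "about" is in the marginal-cost example, we can compare the derivative-based estimate with the exact cost of the $101$st unit. This is a small sketch using the cost function given above; the function names are illustrative.

```python
def cost(q):
    """Total cost C(q) = 0.01 q^2 + 5 q + 200 from the example above."""
    return 0.01 * q**2 + 5 * q + 200

def marginal_cost(q):
    """C'(q) = 0.02 q + 5."""
    return 0.02 * q + 5

q = 100
actual_extra = cost(q + 1) - cost(q)   # exact cost of the 101st unit
estimate = marginal_cost(q)            # derivative-based estimate

print(f"C({q + 1}) - C({q}) = {actual_extra:.2f}")
print(f"C'({q})          = {estimate:.2f}")
```

The two numbers differ by one cent. For this quadratic cost the gap is exactly the $q^2$ coefficient, since $C(q+1) - C(q) = C'(q) + 0.01$.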
6.6 Higher-Order Derivatives
The derivative $f'$ is a function, so we may ask for its derivative. Differentiating $f'$ produces the second derivative $f''$:
$$f''(x) = \frac{d}{dx}\big[f'(x)\big].$$
If $f''$ is itself differentiable, differentiate again to get the third derivative $f'''(x) = f^{(3)}(x)$, and so on through $f^{(4)}(x), f^{(5)}(x), \ldots$. Each successive derivative reports the rate of change of the one before it.
Worked Example: $f(x) = x^4$
$$f(x) = x^4, \quad f'(x) = 4x^3, \quad f''(x) = 12x^2, \quad f'''(x) = 24x, \quad f^{(4)}(x) = 24, \quad f^{(5)}(x) = 0,$$
and every higher derivative is zero thereafter. This illustrates a general fact: a polynomial of degree $n$ has $f^{(n+1)}(x) = 0$ identically — each differentiation lowers the degree by one until nothing is left.
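The degree-lowering pattern is easy to mechanize. The sketch below represents a polynomial by its list of coefficients and differentiates term by term, using the power-rule pattern visible in the worked example. The coefficient-list representation, the convention that an empty list stands for the zero polynomial, and the name `poly_derivative` are assumptions made for illustration.

```python
def poly_derivative(coeffs):
    """Differentiate c0 + c1*x + c2*x^2 + ... given as the list [c0, c1, c2, ...]."""
    return [k * c for k, c in enumerate(coeffs)][1:]

p = [0, 0, 0, 0, 1]  # x^4
derivatives = []
while p:
    p = poly_derivative(p)
    derivatives.append(p)
    print(p)
```

Reading the printed lists as coefficients recovers $4x^3$, $12x^2$, $24x$, $24$, and finally the empty list for the zero polynomial: five differentiations exhaust a degree-four polynomial.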
Interpretation 1: Acceleration, jerk, and snap
The higher derivatives have crisp physical meanings when the underlying function is position $s(t)$:
| Order | Name | Meaning | Notation |
|---|---|---|---|
| $1$ | velocity | rate of change of position | $s'(t) = \dot s$ |
| $2$ | acceleration | rate of change of velocity | $s''(t) = \ddot s$ |
| $3$ | jerk | rate of change of acceleration | $s'''(t) = \dddot s$ |
| $4$ | snap | rate of change of jerk | $s^{(4)}(t)$ |
Newton's second law lives in the second row: force equals mass times acceleration, $F = m\,s''(t)$. The third row — jerk — is what your body registers as a sudden jolt; it is large precisely when a force changes abruptly.
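The chain position, velocity, acceleration, jerk can be sketched in code by applying a numerical derivative repeatedly. The position function $s(t) = t^3$ below is a hypothetical example chosen so the exact answers ($3t^2$, $6t$, and $6$) are easy to check. The step size $h = 10^{-2}$ is an assumed compromise, since nesting finite differences amplifies rounding error.

```python
def derivative(f, h=1e-2):
    """Return a function that approximates f' with a central difference of step h."""
    return lambda t: (f(t + h) - f(t - h)) / (2 * h)

def position(t):
    return t ** 3  # hypothetical position s(t) = t^3, chosen for illustration

velocity = derivative(position)        # approximates s'(t)   = 3 t^2
acceleration = derivative(velocity)    # approximates s''(t)  = 6 t
jerk = derivative(acceleration)        # approximates s'''(t) = 6

t = 2.0
print(f"velocity     ~ {velocity(t):.4f}")
print(f"acceleration ~ {acceleration(t):.4f}")
print(f"jerk         ~ {jerk(t):.4f}")
```

Each line of the table corresponds to one more application of `derivative`, which is the computational meaning of "the rate of change of the one before it."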
Real-World Application — Elevator and rail ride comfort (engineering). A high-quality elevator or high-speed train does not merely limit acceleration; it limits jerk, the third derivative of position. A constant acceleration feels like a steady push, but a change in acceleration is felt as a lurch. Control engineers prescribe smooth "S-curve" velocity profiles whose acceleration ramps up and down gradually, keeping jerk below roughly $1\text{–}2\,\text{m/s}^3$ so passengers feel pressed, never thrown. The same jerk-minimization logic governs roller-coaster transitions, robotic arm trajectories, and camera-crane motion in film. The third derivative is, quite literally, engineered into your comfort.
Interpretation 2: Concavity
The second derivative also has a purely geometric reading: it measures concavity, the way a curve bends.
- If $f''(x) > 0$ on an interval, the slope is increasing, so the graph bends upward — it is concave up (cup-shaped, holding water).
- If $f''(x) < 0$, the slope is decreasing and the graph bends downward — concave down (cap-shaped, spilling water).
- Where $f''$ changes sign, the curve switches its bending; such a point is an inflection point.
This concavity reading is the second pillar of curve sketching, which we build in Chapter 9, and it is why the second derivative shows up in the second-derivative test for classifying maxima and minima.
Geometric Intuition. First derivative answers "which way am I tilted?"; second derivative answers "which way am I curving?" On a hill, $f' > 0$ means you are walking uphill; $f'' < 0$ means the slope is easing off, so you are nearing a crest. The pairing of sign-of-$f'$ with sign-of-$f''$ pins down the shape of any graph: rising-and-curving-up, rising-and-curving-down, and so on, four cases in all.
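Concavity can also be read off numerically. A standard estimate of $f''(x)$ is the second difference $\frac{f(x+h) - 2f(x) + f(x-h)}{h^2}$, which is not derived in this chapter and is assumed here. The sketch applies it to $f(x) = x^3 - 3x$ at its two critical points; the function name and step size are illustrative.

```python
def second_difference(f, x, h=1e-3):
    """Approximate f''(x) by (f(x + h) - 2 f(x) + f(x - h)) / h^2."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

def f(x):
    return x**3 - 3 * x  # the function from Figure 6.1; its exact f'' is 6x

labels = {}
for x in (-1.0, 1.0):
    d2 = second_difference(f, x)
    labels[x] = "concave up" if d2 > 0 else "concave down"
    print(f"x = {x:+.1f}: f'' ~ {d2:+.3f} -> {labels[x]}")
```

The curve is concave down at $x = -1$, where Figure 6.1 shows a local maximum, and concave up at $x = 1$, where it shows a local minimum. This is the pattern the second-derivative test formalizes.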
6.7 When Derivatives Fail to Exist
We saw with $\sqrt{x}$ that a function can be perfectly defined at a point yet have no derivative there. Following Chapter 5's analysis, there are four standard ways differentiability can fail at a point $a$, and recognizing them on sight is essential. (Rarer failures exist, such as wild oscillation near $a$, but these four cover almost every case you will meet.)
- Corner. The one-sided slopes both exist but disagree. The graph has a sharp kink. Example: $f(x) = |x|$ at $x = 0$, where the left slope is $-1$ and the right slope is $+1$.
- Cusp. The one-sided slopes run off to $+\infty$ and $-\infty$, pinching the graph to a point. Example: $f(x) = x^{2/3}$ at $x = 0$.
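- Vertical tangent. The slope tends to $+\infty$ (or $-\infty$) from both sides; the tangent line is vertical and has no finite slope. Example: $f(x) = x^{1/3}$ at $x = 0$. The function $\sqrt{x}$ at $0$ is the one-sided version: as we saw, its slope tends to $+\infty$ from the right, the only side on which it is defined.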
- Discontinuity. The function is not even continuous at $a$. By the theorem of §6.3, the derivative cannot exist — no continuity, no derivative.
```python
import numpy as np
import matplotlib.pyplot as plt

# The three "continuous but not differentiable" archetypes at x = 0
x = np.linspace(-1, 1, 400)
fig, axes = plt.subplots(1, 3, figsize=(13, 3.5))
for ax, (y, title) in zip(axes, [
        (np.abs(x), 'corner: $|x|$'),
        (np.cbrt(x**2), 'cusp: $x^{2/3}$'),
        (np.cbrt(x), 'vertical tangent: $x^{1/3}$')]):
    ax.plot(x, y, 'b-', lw=2); ax.set_title(title)
    ax.axhline(0, color='gray', lw=0.5); ax.axvline(0, color='gray', lw=0.5)
    ax.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()
```
Figure 6.2 — Three ways differentiability fails while continuity survives: a corner (slopes disagree), a cusp (slopes diverge to $\pm\infty$ in opposite senses), and a vertical tangent (slope diverges to a single signed infinity). In all three the graph is unbroken, yet no finite tangent slope exists at the origin.
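The three archetypes can be told apart without a picture by computing one-sided difference quotients at $0$ for shrinking $h$. The following is a minimal sketch; the helper names `one_sided_slopes` and `cbrt` are illustrative, and the cube root is written out by hand so that it handles negative inputs.

```python
import math

def one_sided_slopes(f, a, h):
    """Return (left, right) difference quotients of f at a with step h > 0."""
    left = (f(a) - f(a - h)) / h
    right = (f(a + h) - f(a)) / h
    return left, right

def cbrt(x):
    """Real cube root, valid for negative x as well."""
    return math.copysign(abs(x) ** (1 / 3), x)

examples = {
    "corner |x|": abs,
    "cusp x^(2/3)": lambda x: cbrt(x) ** 2,
    "vertical tangent x^(1/3)": cbrt,
}

for name, f in examples.items():
    for h in (1e-2, 1e-4, 1e-6):
        left, right = one_sided_slopes(f, 0.0, h)
        print(f"{name:26s} h = {h:g}: left = {left:+12.2f}, right = {right:+12.2f}")
```

For the corner the two columns stay fixed at $-1$ and $+1$. For the cusp they grow without bound with opposite signs. For the vertical tangent they grow without bound with the same sign.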
A function that is differentiable at every point of an interval is called differentiable on that interval; if it has derivatives of every order there, it is called smooth (we sharpen the language in §6.9). Polynomials, $e^x$, $\sin x$, and $\cos x$ are smooth on all of $\mathbb{R}$. Most functions you write down are smooth except possibly at a few isolated trouble spots, and finding those trouble spots is the point of this section.
Common Pitfall. Students often assume "$f$ is defined and continuous at $a$, therefore $f'(a)$ exists." The functions $|x|$, $x^{2/3}$, and $x^{1/3}$ are all defined and continuous at $0$, yet none is differentiable there. Continuity buys you no derivative on its own. Before claiming $f'(a)$ exists, check for corners, cusps, and vertical tangents — not just for breaks in the graph.
6.8 Gradient Descent: The First Anchor Example
We now meet the first of the book's four threaded anchor examples — and the most distinctly modern. Gradient descent is the algorithm that trains nearly every neural network on Earth, from image classifiers to large language models. Most calculus textbooks never mention it. Yet at its core it is just the derivative, used as a direction-finder. We introduce it conceptually here; the full multivariable version, with its applications to machine learning, arrives in Chapter 30.
The problem. Suppose you want the minimum of a function $f(x)$. The textbook approach (Chapter 10) is to set $f'(x) = 0$ and solve. But for many real functions, $f'(x) = 0$ is a transcendental equation with no closed-form solution — there is simply no algebra that isolates $x$. What then?
The idea. Gradient descent answers: do not solve; walk downhill. The derivative already tells you which way downhill is. If $f'(x) > 0$, the function rises to the right, so the way down is to the left. If $f'(x) < 0$, the function falls to the right, so the way down is to the right. In both cases, stepping in the direction of $-f'(x)$ decreases $f$.
The Key Insight. The derivative is a compass that points uphill. To minimize, step in the opposite direction — along $-f'(x)$. That single sentence is the whole conceptual content of gradient descent; everything else is bookkeeping about step size and when to stop.
The algorithm.
- Start at some point $x_0$.
- Compute the slope $f'(x_n)$ at the current point.
- Step against it: $x_{n+1} = x_n - \alpha\, f'(x_n)$, where $\alpha > 0$ is a small step size, called the learning rate.
- Repeat until $f'(x_n)$ is close to zero, i.e. you are near a critical point.
The minus sign is the entire trick. When $f' > 0$ the update subtracts a positive number and moves left (downhill); when $f' < 0$ it subtracts a negative number and moves right (also downhill). The procedure is self-correcting.
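The four steps translate almost line for line into code. The sketch below is one possible rendering, not a canonical implementation. The stopping tolerance `tol`, the safety cap `max_steps`, and the test function $f(x) = x^2 - 4x$ are assumptions introduced for illustration.

```python
def gradient_descent(fprime, x0, learning_rate, tol=1e-8, max_steps=10_000):
    """Repeat x <- x - learning_rate * f'(x) until |f'(x)| < tol or max_steps is hit."""
    x = x0
    for step in range(max_steps):
        slope = fprime(x)
        if abs(slope) < tol:           # near a critical point: stop
            return x, step
        x = x - learning_rate * slope  # step against the slope
    return x, max_steps

# Hypothetical test function f(x) = x^2 - 4x, with f'(x) = 2x - 4 and minimum at x = 2.
x_min, steps_used = gradient_descent(lambda x: 2 * x - 4, x0=10.0, learning_rate=0.1)
print(f"stopped at x = {x_min:.6f} after {steps_used} steps")
```

Note that the function needs only $f'$, never $f$ itself: the algorithm navigates entirely by slope. Starting to the left of the minimum instead (say `x0=-5.0`) flips the sign of the slope, and the same update moves right.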
Worked Example: minimize $f(x) = (x - 3)^2 + 1$
The minimum is obviously at $x = 3$, where $f(3) = 1$ — which lets us check the algorithm against a known answer. Its derivative is $f'(x) = 2(x - 3)$. Start at $x_0 = 0$ with learning rate $\alpha = 0.2$:
| step $n$ | $x_n$ | $f'(x_n) = 2(x_n - 3)$ | update $x_{n+1} = x_n - 0.2\,f'(x_n)$ |
|---|---|---|---|
| $0$ | $0$ | $-6$ | $0 - 0.2(-6) = 1.2$ |
| $1$ | $1.2$ | $-3.6$ | $1.2 - 0.2(-3.6) = 1.92$ |
| $2$ | $1.92$ | $-2.16$ | $2.352$ |
| $3$ | $2.352$ | $-1.296$ | $2.6112$ |
| $4$ | $2.6112$ | $-0.7776$ | $2.7667$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
| $30$ | $\approx 3$ | $\approx 0$ | $\approx 3$ |
Each step shrinks the gap to $x = 3$ by a factor of $1 - 2\alpha = 0.6$, so the sequence marches geometrically toward the minimum. After thirty iterations the remaining gap is $3 \cdot 0.6^{30} \approx 6.6 \times 10^{-7}$, so $x_{30}$ lies within $10^{-6}$ of $3$.
```python
import numpy as np
import matplotlib.pyplot as plt

# Gradient descent on f(x) = (x - 3)^2 + 1, whose true minimum is at x = 3
def f(x): return (x - 3)**2 + 1
def fprime(x): return 2 * (x - 3)

x = 0.0
learning_rate = 0.2
history = [x]
for _ in range(30):
    x = x - learning_rate * fprime(x)   # step against the slope
    history.append(x)

xs = np.linspace(-1, 7, 200)
plt.plot(xs, f(xs), 'b-', linewidth=2, label='$f(x)$')
plt.plot(history, [f(h) for h in history], 'ro-', markersize=5,
         label='gradient-descent path')
plt.xlabel('$x$'); plt.ylabel('$f(x)$')
plt.title('Gradient descent on $(x-3)^2 + 1$')
plt.legend(); plt.grid(True, alpha=0.3); plt.show()

print(f"Final x: {history[-1]:.6f}, final f(x): {f(history[-1]):.6f}")
# Final x: 2.999999, final f(x): 1.000000
```
Warning — the learning rate matters. Choose $\alpha$ too large and gradient descent overshoots: with $\alpha = 1.1$ on this same $f$, each step lands farther from $3$ than the last and the iterates diverge. Choose $\alpha$ too small and convergence crawls. The update multiplies the gap by $1 - 2\alpha$; convergence requires $|1 - 2\alpha| < 1$, i.e. $0 < \alpha < 1$ for this function. Tuning the learning rate is one of the central practical arts of machine learning, and it is governed entirely by the local behavior of the derivative.
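The effect of the learning rate is easiest to appreciate by running the same descent with several values of $\alpha$ and comparing how far each run ends from the minimum. The sketch below does this for $f(x) = (x-3)^2 + 1$. The particular values of $\alpha$, the twenty-step budget, and the helper name `run` are illustrative choices.

```python
def run(learning_rate, steps=20, x0=0.0):
    """Gradient descent on f(x) = (x - 3)^2 + 1; returns the final distance |x - 3|."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * 2 * (x - 3)
    return abs(x - 3)

gaps = {}
for alpha in (0.01, 0.2, 0.5, 0.9, 1.1):
    gaps[alpha] = run(alpha)
    print(f"alpha = {alpha:<4}: factor 1 - 2*alpha = {1 - 2 * alpha:+.2f}, "
          f"|x_20 - 3| = {gaps[alpha]:.3e}")
```

A factor close to $1$ (tiny $\alpha$) crawls, a factor of $0$ ($\alpha = 0.5$) lands on the minimum in one step for this particular quadratic, a negative factor of magnitude below $1$ overshoots but still converges, and a magnitude above $1$ diverges.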
Why this matters
A neural network has thousands to billions of parameters, and its loss function depends on all of them at once. Setting "the gradient equals zero" and solving is utterly hopeless — the equations are far too tangled to admit any formula. But you can always compute the gradient (the multivariable derivative of Chapter 30) at the current parameter values, then nudge every parameter a little against it. Iterate millions of times and the network settles into a low-loss configuration. That loop — compute slope, step downhill, repeat — is gradient descent, and it is, in a precise sense, the foundational algorithm of modern AI.
This is also the first stirring of a recurring theme of the book: hand computation builds understanding, but machine computation builds power. You can differentiate $(x-3)^2 + 1$ by hand and find its minimum instantly; no human can hand-minimize a billion-parameter loss. The derivative is the bridge between the two worlds.
We will meet gradient descent again:
- Chapter 11, as a cousin of Newton's method for root-finding and optimization.
- Chapter 30, in full multivariable generality, where the single slope $f'$ becomes the gradient vector $\nabla f$ pointing in the direction of steepest ascent, and gradient descent becomes the workhorse that trains neural networks.
Add to Your Modeling Portfolio. Most models contain a parameter that must be fit to data, and the best parameter is usually the one that minimizes an error function $L(\text{param})$. Gradient descent on $L$ is how you find it. Write down the error function you would minimize for your track, and identify its one free parameter; we will run gradient descent on it for real in Chapter 30.
- Biology: the growth rate $r$ in $P(t) = P_0 e^{rt}$; minimize the squared error between $P(t)$ and observed population counts.
- Economics: a coefficient in a demand curve; minimize the gap between predicted and observed quantities sold.
- Physics: a friction or damping coefficient; minimize the mismatch between modeled and measured trajectories.
- Data Science: a single regression coefficient $\beta$ in $\hat y = \beta x$; minimize the mean squared residual $L(\beta) = \frac{1}{n}\sum (y_i - \beta x_i)^2$.
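As a preview of what the Data Science track will look like, here is a minimal sketch of one-parameter gradient descent on $L(\beta) = \frac{1}{n}\sum (y_i - \beta x_i)^2$. Differentiating with respect to $\beta$ gives $L'(\beta) = -\frac{2}{n}\sum x_i (y_i - \beta x_i)$. The four data points are made up purely for illustration, and the learning rate and step count are assumed values that happen to suit this data.

```python
# Made-up illustrative data (not from any real data set), roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
n = len(xs)

def loss(beta):
    """Mean squared residual L(beta) = (1/n) * sum (y_i - beta * x_i)^2."""
    return sum((y - beta * x) ** 2 for x, y in zip(xs, ys)) / n

def loss_prime(beta):
    """dL/dbeta = -(2/n) * sum x_i * (y_i - beta * x_i)."""
    return -2 * sum(x * (y - beta * x) for x, y in zip(xs, ys)) / n

beta = 0.0
learning_rate = 0.05  # assumed value; must be small enough for this data
for _ in range(200):
    beta = beta - learning_rate * loss_prime(beta)

# For this simple model, setting L'(beta) = 0 can be solved by hand for comparison.
closed_form = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print(f"gradient descent: beta = {beta:.6f}")
print(f"closed form     : beta = {closed_form:.6f}")
```

Here a closed form exists, so gradient descent is overkill; the point is that the loop would run unchanged on a loss whose derivative equation cannot be solved by hand.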
6.9 Notation, and a Note for the Curious
Three notations, all in active use
The derivative has accumulated three standard notations over three centuries, and you must be fluent in all of them because different fields prefer different ones.
| Notation | First derivative | Higher derivatives | Tradition |
|---|---|---|---|
| Lagrange | $f'(x)$ | $f''(x),\ f'''(x),\ f^{(n)}(x)$ | general statements |
| Leibniz | $\dfrac{df}{dx}$ or $\dfrac{dy}{dx}$ | $\dfrac{d^2 f}{dx^2},\ \dfrac{d^n f}{dx^n}$ | substitution, integration |
| Newton | $\dot{x}$ | $\ddot{x},\ \dddot{x}$ | time derivatives in physics |
A fourth, the operator notation $Df$ (read "$D$ applied to $f$"), treats differentiation itself as a machine $D = \frac{d}{dx}$ that eats a function and returns its derivative. It is convenient in differential equations and linear algebra, where one studies $D$ as an object in its own right. The operator is linear, $D(af + bg) = a\,Df + b\,Dg$, which is exactly the linearity rule of Chapter 7.
When the variable is clear we abbreviate to $f'$, $f''$. To stress evaluation at a specific point we write $f'(a)$ or $\left.\dfrac{df}{dx}\right|_{x=a}$. All notations mean the same thing; choose by convenience:
- Lagrange $f'$ is compact and ideal for stating theorems.
- Leibniz $\frac{df}{dx}$ shines for the chain rule (Chapter 7), is indispensable for implicit differentiation (Chapter 8), and its very form anticipates integration (Part III), where $dx$ reappears.
- Newton's dot is universal in physics for derivatives with respect to time.
Common Pitfall. The Leibniz symbol $\frac{dy}{dx}$ looks like a fraction, and students often treat it as one — cancelling the $dx$'s freely. In single-variable calculus it is a single, indivisible symbol denoting a limit, not a quotient of two numbers. You can sometimes manipulate it as if it were a fraction (separable differential equations, the chain rule, $u$-substitution), and there are good reasons those manipulations work, but they are theorems to be justified, not algebra you are entitled to assume. Treat $\frac{dy}{dx}$ as one object until you have earned the right to split it.
A function that is differentiable but not continuously differentiable
Differentiability has degrees of refinement that matter in later chapters. Consider
$$f(x) = \begin{cases} x^2 \sin(1/x) & x \neq 0, \\ 0 & x = 0. \end{cases}$$
This function is differentiable everywhere, including at $0$: the squeeze theorem applied to the difference quotient $\frac{f(h) - f(0)}{h} = h\sin(1/h)$ gives $f'(0) = 0$. Yet the derivative function $f'$ is discontinuous at $0$ — away from the origin it contains a $\cos(1/x)$ term that oscillates wildly and never settles down as $x \to 0$. So a function can have a derivative everywhere whose derivative is nonetheless not continuous.
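Both halves of this claim can be probed numerically. Away from the origin the product and chain rules of Chapter 7 give $f'(x) = 2x\sin(1/x) - \cos(1/x)$; that formula is taken as given in the sketch below. The sample points $x_k = 1/(k\pi)$ are chosen because $\cos(1/x_k) = (-1)^k$ there, and the range of $k$ is an arbitrary illustrative choice.

```python
import math

def f(x):
    return x * x * math.sin(1 / x) if x != 0 else 0.0

def fprime(x):
    """Derivative away from the origin: 2x sin(1/x) - cos(1/x)."""
    return 2 * x * math.sin(1 / x) - math.cos(1 / x)

# The difference quotient at 0 is h * sin(1/h), squeezed between -|h| and |h|.
steps = (1e-2, 1e-4, 1e-6)
quotients = [(f(h) - f(0.0)) / h for h in steps]
print("difference quotients at 0:", quotients)

# Along x_k = 1/(k*pi), cos(1/x_k) = (-1)^k, so f'(x_k) flips between about -1 and +1.
samples = [fprime(1 / (k * math.pi)) for k in range(100, 106)]
print("f' near 0:", [round(s, 3) for s in samples])
```

The quotients at $0$ collapse toward $0$, confirming $f'(0) = 0$, while the nearby values of $f'$ keep jumping between roughly $-1$ and $+1$ no matter how close to the origin you sample. That is what discontinuity of $f'$ at $0$ looks like.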
This motivates a hierarchy. A function whose first $n$ derivatives all exist and are continuous is said to be of class $C^n$. If derivatives of every order exist and are continuous, the function is $C^\infty$ (smooth in the strict sense). Polynomials, $e^x$, $\sin x$, $\cos x$ are all $C^\infty$. The oscillating example above is $C^0$ and differentiable, but not $C^1$, because $f'$ exists yet is not continuous.
Math Major Sidebar. The gap between merely-differentiable and continuously-differentiable ($C^1$) is one of the first real subtleties of analysis, and it has teeth in later theory: it governs the hypotheses of the Mean Value Theorem (Chapter 9), the remainder term in Taylor's theorem (Chapter 23), and the existence-and-uniqueness theorems for differential equations (Chapter 19), which typically demand $C^1$ data. For routine calculus you may safely assume your functions are $C^\infty$. For real analysis, you must track exactly how many continuous derivatives you have — the difference between $C^0$, $C^1$, and $C^\infty$ is where many theorems live or die.
Historical Note. In 1872 Karl Weierstrass exhibited a function $W(x) = \sum_{n=0}^{\infty} b^n \cos(a^n \pi x)$ (with $0 < b < 1$, $a$ an odd positive integer, and $ab > 1 + \tfrac{3\pi}{2}$) that is continuous everywhere but differentiable nowhere: an unbroken curve so jagged at every scale that it has no tangent line at any point. The result scandalized the mathematical world, which had assumed continuous curves must be "mostly smooth," and it forced the careful, limit-based definitions of continuity and differentiability we use today. Pathology, it turned out, was the rule, not the exception; the smooth functions of a first calculus course are the rare, well-behaved minority.
6.10 Summary and Looking Ahead
The derivative is not a number but a function. The derivative function $f'$ records the slope of $f$ at every point, and from it you can read off where $f$ rises and falls, where its peaks and valleys lie, and how steeply it moves. You can build $f'$ algebraically from the limit definition, and you can sketch it geometrically straight from the graph of $f$ — two skills that reinforce the book's running theme that geometry and algebra are inseparable.
Higher derivatives extract finer information: the second derivative is acceleration in kinematics and concavity in geometry; the third is jerk, engineered into the comfort of your elevator ride. Differentiability is strictly stronger than continuity: every break kills the derivative, but corners, cusps, and vertical tangents on an unbroken graph kill it too, and Weierstrass's monster kills it everywhere while staying perfectly continuous.
Finally, the derivative is not a museum piece from the seventeenth century. Gradient descent turns the single idea "step against the slope" into the algorithm that trains nearly every modern neural network, a payoff we previewed here and will complete in Chapter 30. The derivative measures change, and calculus is the mathematics of change; this chapter is where that theme first becomes a tool.
In Chapter 7 we trade the limit-definition ritual for a toolkit of rules — power, product, quotient, and chain — that let you differentiate any elementary function in seconds, often in your head. Everything you computed laboriously here will become a one-line move.