Stand far enough back from a curve and you see its global shape — its peaks, its valleys, the way it bends. Walk right up to a single point and zoom in, and something remarkable happens: the curve flattens out. Zoom in far enough on the graph of any...
Prerequisites
- Chapter 10: Optimization
Learning Objectives
- Use the linear approximation $L(x) = f(a) + f'(a)(x-a)$ to estimate function values.
- Apply differentials $dy = f'(x)\,dx$ to estimate change and propagate measurement error.
- Derive and apply Newton's method to solve $f(x) = 0$.
- Explain quadratic convergence and recognize when Newton's method fails.
- Connect linear approximation forward to Taylor series and the tangent plane.
In This Chapter
- 11.1 The Big Idea: Up Close, Everything Is Straight
- 11.2 Linearization: The Tangent Line as a Formula
- 11.3 Worked Examples: Approximating by Hand
- 11.4 Differentials: A Language for Small Change
- 11.5 Error Propagation: From Measurement to Result
- 11.6 Newton's Method: Solving Equations With Tangent Lines
- 11.7 Newton in Action: How a Calculator Finds a Square Root
- 11.8 Implementing and Analyzing Newton's Method in Python
- 11.9 When Newton's Method Fails
- 11.10 More Applications: Where Roots and Linearizations Live
- 11.11 The Error of Linear Approximation — and the Road to Taylor Series
- 11.12 Two Reliable Cousins: Bisection and the Secant Method
- 11.13 Why All of This Matters
- Looking Ahead
- Reflection
Chapter 11 — Linear Approximation, Differentials, and Newton's Method
11.1 The Big Idea: Up Close, Everything Is Straight
Stand far enough back from a curve and you see its global shape — its peaks, its valleys, the way it bends. Walk right up to a single point and zoom in, and something remarkable happens: the curve flattens out. Zoom in far enough on the graph of any differentiable function and you cannot tell it apart from a straight line. That line is the tangent line, and this disappearing-curvature phenomenon is the engine of this entire chapter.
We made this precise back in Chapter 6: a function $f$ is differentiable at $a$ exactly when it has a well-defined tangent line there, with slope $f'(a)$. In this chapter we put that tangent line to work. Instead of asking "what is the slope?" we ask "what is the value?" — and we use the tangent line to compute (approximately) values of $f$ that would otherwise be hard to reach.
This is one of the great recurring moves of the subject: approximation is the soul of calculus. Limits approximate, then take a limit. Integrals approximate with rectangles, then take a limit. Here we approximate a curve with a line and simply stop — keeping the error small but nonzero, and learning to control how small. Newton's method, the chapter's payoff, turns this same approximation into an algorithm that hunts down the solution of any equation.
The Key Insight. Calculus is, at heart, the art of replacing complicated functions with simple linear ones near a point of interest. Linear approximation is the first and simplest such substitution, and nearly everything in numerical computing — root-finding, optimization, error analysis, machine learning — descends from it.
11.2 Linearization: The Tangent Line as a Formula
Near a point $x = a$, a differentiable function $f$ is well-approximated by its tangent line. We give that line a name. The linear approximation (or linearization, or tangent line approximation) of $f$ at $a$ is
$$L(x) = f(a) + f'(a)(x - a),$$
and the central claim is
$$f(x) \approx L(x) \quad \text{for } x \text{ near } a.$$
Read the formula geometrically. The line passes through the point $(a, f(a))$ — plug in $x = a$ and you get $L(a) = f(a)$ — and it has slope $f'(a)$, the slope of $f$ at $a$. So $L$ is precisely the tangent line, now written as a function you can evaluate. The term $f'(a)(x-a)$ is "rise = slope × run": starting from the known height $f(a)$, you walk a horizontal distance $x - a$ and climb at the tangent's slope.
Geometric Intuition. Picture the graph of $f$ and its tangent line at $a$ kissing the curve at one point. For $x$ right next to $a$, the curve and its tangent are nearly on top of each other, so reading a height off the line gives almost the same answer as reading it off the curve. As $x$ wanders away from $a$, the curve peels away from the line — upward if $f$ is concave up, downward if concave down — and the approximation degrades. The whole game is to stay close to $a$, where "close enough" is genuinely close.
Geometry + algebra are inseparable. The formula $L(x) = f(a) + f'(a)(x-a)$ and the picture of a tangent line grazing a curve are two descriptions of the same object. Whenever you write the formula, see the line; whenever you draw the line, the formula is its equation.
11.3 Worked Examples: Approximating by Hand
The recipe never changes: pick a nearby anchor $a$ where $f(a)$ and $f'(a)$ are easy, then evaluate $L$ at your target.
Example 1 — A square root. Approximate $\sqrt{4.1}$.
Take $f(x) = \sqrt{x}$ and anchor at $a = 4$, where everything is clean: $f(4) = 2$ and $f'(x) = \frac{1}{2\sqrt{x}}$ gives $f'(4) = \frac{1}{4}$. Then
$$L(x) = 2 + \tfrac{1}{4}(x - 4), \qquad L(4.1) = 2 + \tfrac{1}{4}(0.1) = 2.025.$$
The true value is $\sqrt{4.1} = 2.024846\ldots$, so the error is about $0.00015$ — three correct decimal places from one line of arithmetic, no calculator's square-root key required.
Example 2 — A logarithm. Approximate $\ln(1.05)$.
Take $f(x) = \ln x$ and anchor at $a = 1$: $f(1) = 0$ and $f'(x) = 1/x$ gives $f'(1) = 1$. Then
$$L(x) = 0 + 1\cdot(x - 1) = x - 1, \qquad L(1.05) = 0.05.$$
The true value is $\ln(1.05) = 0.048790\ldots$, an error of about $0.0012$. This is the famous rule "$\ln(1+x) \approx x$ for small $x$" — it is just the linearization of $\ln$ at $1$.
Example 3 — The small-angle approximation. Approximate $\sin(0.1)$ (radians).
Take $f(x) = \sin x$ at $a = 0$: $f(0) = 0$, $f'(x) = \cos x$, $f'(0) = 1$. Then $L(x) = x$, so $L(0.1) = 0.1$. The true value is $\sin(0.1) = 0.099833\ldots$ — an error near $0.00017$. This is why physicists write $\sin\theta \approx \theta$ for small angles: it is the linear approximation of $\sin$ at $0$, and we will see in §11.10 that it underlies the entire theory of the pendulum.
Check Your Understanding. Use linear approximation to estimate $(1.02)^{10}$ without a calculator. (Hint: anchor at $a = 1$.)
Answer
Let $f(x) = x^{10}$, $a = 1$. Then $f(1) = 1$ and $f'(x) = 10x^9$, so $f'(1) = 10$. The linearization is $L(x) = 1 + 10(x - 1)$, and $L(1.02) = 1 + 10(0.02) = 1.2$. The true value is $1.02^{10} = 1.21899\ldots$, so we are within about 2% — and we did it in our heads. This is the seed of the financial rule of thumb that a 2% periodic return over 10 periods is "roughly 20%."Common Pitfall. Linear approximation gives a worse estimate the farther you stray from the anchor, and students forget to check that their target is actually close to $a$. Approximating $\sqrt{4.1}$ from $a = 4$ is excellent; approximating $\sqrt{7}$ from $a = 4$ gives $L(7) = 2.75$ against a true value of $2.6458$ — an error of more than $0.1$. When the target is far, choose a closer anchor (here $a = 9$, where $\sqrt 9 = 3$ and $L(7) = 2.667$ is far better) or use a higher-order approximation (§11.11).
11.4 Differentials: A Language for Small Change
There is a second, equivalent way to package the linear approximation — one that is built for talking about change, and that prepares the notation we will need for integration in Chapter 12.
Given $y = f(x)$, define the differential $dy$ by
$$dy = f'(x)\,dx,$$
where $dx$ is an independent variable representing a (small) change in the input. The differential $dy$ is the corresponding change in $y$ predicted by the tangent line — the linear, idealized change.
Contrast this with the actual change, written with $\Delta$:
$$\Delta y = f(x + \Delta x) - f(x).$$
The actual change $\Delta y$ rides along the curve; the differential $dy$ rides along the tangent line. They are the curve-versus-line distinction of §11.2 dressed in change-language. The fundamental approximation is simply
$$\Delta y \approx dy = f'(x)\,\Delta x \qquad (\Delta x \text{ small}).$$
Geometric Intuition. Stand at $(x, f(x))$ and step right by $dx = \Delta x$. The curve rises by the true amount $\Delta y$; the tangent line rises by $dy$. The gap between them, $\Delta y - dy$, is the little wedge between curve and line — and as you shrink the step, that wedge shrinks quadratically (twice as fast as the step), which is exactly why the linear approximation is so good for small steps. We make this rate precise in §11.11.
Computational Note. Differentials are not just heuristic. In
sympy,f.diff(x)returns $f'(x)$, and multiplying by a symbolicdxgives the differential directly; numerically,dy = fprime(x) * dxis the one-line workhorse behind every error bar a scientist ever computes. We use exactly this below.
Worked example — estimating a change. A metal cube has side length $s = 5.00$ cm and is heated so that each side expands by $ds = 0.02$ cm. By approximately how much does its volume grow? The volume is $V = s^3$, so $dV = 3s^2\,ds$. Evaluating at $s = 5$,
$$dV = 3(5)^2(0.02) = 3 \cdot 25 \cdot 0.02 = 1.5 \text{ cm}^3.$$
The exact change is $\Delta V = 5.02^3 - 5^3 = 126.506 - 125 = 1.506$ cm³, so the differential $dV = 1.5$ captures it to within $0.006$ cm³ — a fraction of a percent. This is the differential doing what it does best: turning a messy difference of cubes into a single multiplication. Notice, too, the relative-error preview: the side grew by $0.02/5 = 0.4\%$, and the volume by $1.5/125 = 1.2\%$ — three times as much, the cube law of §11.5 already visible.
11.5 Error Propagation: From Measurement to Result
Here is where differentials earn their keep. Suppose you measure a quantity $x$ with some uncertainty $\Delta x$ — a ruler reads $10.0$ cm but could be off by $\pm 0.1$ cm — and you then compute $y = f(x)$. How much uncertainty does that produce in $y$? The differential answers immediately:
$$\Delta y \approx |f'(x)|\,\Delta x.$$
The derivative is the amplification factor: it tells you how strongly an input wobble is magnified into an output wobble. A steep function ($|f'|$ large) amplifies error; a flat one suppresses it.
Two derived quantities matter even more in practice. The relative error is the error as a fraction of the value, $\Delta y / y$, and the percentage error is that fraction times 100%. Relative error is usually what a scientist or engineer cares about, because it is dimensionless and comparable across quantities of wildly different sizes.
Worked example — the volume of a sphere. A ball bearing's radius is measured as $r = 10$ cm with possible error $\pm 0.1$ cm. Its volume is $V = \tfrac{4}{3}\pi r^3$. How uncertain is the volume?
Differentiate: $\dfrac{dV}{dr} = 4\pi r^2$, which at $r = 10$ equals $400\pi$. The absolute error in volume is
$$\Delta V \approx |V'(r)|\,\Delta r = 400\pi \cdot 0.1 = 40\pi \approx 125.7 \text{ cm}^3.$$
Against the volume itself, $V = \tfrac{4}{3}\pi(1000) \approx 4188.8$ cm³, this is a relative error of
$$\frac{\Delta V}{V} \approx \frac{40\pi}{\tfrac{4}{3}\pi \cdot 1000} = \frac{40}{\tfrac{4}{3}\cdot 1000} = 0.03 = 3\%.$$
Now notice something elegant. The radius was measured to $0.1/10 = 1\%$ relative error, but the volume comes out to $3\%$. The error tripled. That is not a coincidence: because $V \propto r^3$, the cube law forces the relative errors into a fixed ratio.
The Key Insight. For a power law $y = x^n$, relative errors obey a clean rule: $\dfrac{\Delta y}{y} \approx n\,\dfrac{\Delta x}{x}$. Powers magnify relative error by their exponent. A cube triples it; a square root ($n = 1/2$) halves it. This single fact — derivable in one line by differentiating $\ln y = n \ln x$ — governs error analysis throughout the experimental sciences.
Real-World Application — Experimental physics. When a physicist reports a measured quantity as $g = 9.78 \pm 0.05\ \mathrm{m/s^2}$, that $\pm 0.05$ was almost certainly produced by error propagation through differentials. A pendulum experiment measures period $T$ and length $\ell$, then computes $g = 4\pi^2 \ell / T^2$. The relative error rule gives $\Delta g/g \approx \Delta\ell/\ell + 2\,\Delta T/T$: the period enters squared, so timing error counts double. Knowing this tells the experimenter to spend their effort timing many swings rather than measuring the length to absurd precision.
11.6 Newton's Method: Solving Equations With Tangent Lines
We now turn the linear approximation into an algorithm. The problem: solve $f(x) = 0$ — find a root of $f$. Most equations that arise in practice ($x = \cos x$, Kepler's equation, a bond's yield) have no formula for their solution. We need a numerical method, and the best one is built from the tangent line.
Deriving the iteration. Start with a guess $x_n$ that is somewhere near a root. Near $x_n$, the function looks like its tangent line:
$$f(x) \approx f(x_n) + f'(x_n)(x - x_n).$$
We cannot easily solve $f(x) = 0$, but we can solve the linear equation: find where the tangent line hits the axis. Set the right side to zero and call the solution $x_{n+1}$:
$$0 = f(x_n) + f'(x_n)(x_{n+1} - x_n).$$
Solve for $x_{n+1}$ (assuming $f'(x_n) \neq 0$):
$$x_{n+1} - x_n = -\frac{f(x_n)}{f'(x_n)} \quad\Longrightarrow\quad \boxed{\,x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}\,.}$$
This is Newton's method (or the Newton–Raphson method). Each step replaces the curve by its tangent line, solves the resulting linear equation exactly, and uses that solution as the next guess. Then it repeats.
Geometric Intuition. Stand at the point $(x_n, f(x_n))$ on the curve. Draw the tangent line and slide down it until you hit the $x$-axis: that landing point is $x_{n+1}$. Because the tangent line hugs the curve, its $x$-intercept is usually much closer to the true root than $x_n$ was. Now climb back up to the curve at $x_{n+1}$, draw a new tangent, slide down again — and watch the guesses race toward the root. Each tangent is a fresh local straightening of the curve.
Historical Note. Isaac Newton described a version of this method around 1669 for solving polynomial equations, though his formulation looked nothing like the modern one — it had no derivatives and no iteration formula. Joseph Raphson published a cleaner, genuinely iterative form in 1690, and Thomas Simpson in 1740 gave essentially the version with $f'(x)$ that we use today and extended it to systems of equations. The name "Newton–Raphson" honors the two who got there first; Simpson, as often happens in mathematics, got the formula and not the credit.
11.7 Newton in Action: How a Calculator Finds a Square Root
Newton's method is not a curiosity — it is, quite literally, how your calculator and your computer's math library compute square roots. Here is the derivation and a hand-trace.
To compute $\sqrt{2}$, find the positive root of $f(x) = x^2 - 2$. Then $f'(x) = 2x$, and the Newton iteration is
$$x_{n+1} = x_n - \frac{x_n^2 - 2}{2x_n} = x_n - \frac{x_n}{2} + \frac{1}{x_n} = \frac{x_n}{2} + \frac{1}{x_n} = \frac{1}{2}\!\left(x_n + \frac{2}{x_n}\right).$$
That last form — average your guess with (2 over your guess) — is so clean that it was known to the Babylonians nearly four thousand years ago, long before anyone had a derivative. Newton's method rediscovers their algorithm as a special case. Start from a deliberately crude $x_0 = 1$:
| $n$ | $x_n$ | correct digits |
|---|---|---|
| 0 | $1.000000000$ | 0 |
| 1 | $\tfrac12(1 + 2) = 1.500000000$ | 0 |
| 2 | $\tfrac12(1.5 + 1.3\overline{3}) = 1.416666667$ | 2 |
| 3 | $1.414215686$ | 5 |
| 4 | $1.414213562$ | 11 |
The true value is $\sqrt 2 = 1.41421356237\ldots$. Look at the rightmost column: $0, 0, 2, 5, 11$. Once the method "locks on," the number of correct digits roughly doubles every step. That explosive doubling is called quadratic convergence, and it is what makes Newton's method the workhorse of numerical computing.
Real-World Application — Inside every CPU. When you type
math.sqrt(2)or press the √ key, the hardware does not consult a table. It runs a couple of Newton (or closely related Newton–Raphson) iterations, starting from a fast rough estimate baked into the chip. Because each iteration doubles the precision, three or four steps reach the full 53-bit accuracy of a double-precision float. Division on many processors works the same way: to compute $1/d$, hardware applies Newton's method to $f(x) = 1/x - d$, whose iteration $x_{n+1} = x_n(2 - d\,x_n)$ uses only multiplication and subtraction — no division at all. Newton's method lets a chip divide without knowing how to divide.
11.8 Implementing and Analyzing Newton's Method in Python
Now we make it real. The implementation is short, and the convergence behavior is worth watching directly.
# Newton's method: solve f(x) = 0 starting from x0.
# Returns the root, the iteration count, and the full history of guesses.
from typing import Callable
def newton(f: Callable[[float], float],
fprime: Callable[[float], float],
x0: float, tol: float = 1e-12,
max_iter: int = 50) -> tuple[float, int, list[float]]:
x = x0
history = [x]
for n in range(1, max_iter + 1):
fx = f(x)
if abs(fx) < tol: # close enough to a root
return x, n - 1, history
x = x - fx / fprime(x) # the Newton step
history.append(x)
return x, max_iter, history
# Compute sqrt(2) as the root of x^2 - 2.
root, iters, hist = newton(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0)
print(f"sqrt(2) ~ {root:.12f} in {iters} iterations")
# Output: sqrt(2) ~ 1.414213562373 in 5 iterations
for k, xk in enumerate(hist):
print(f" x{k} = {xk:.15f} error = {abs(xk - 2**0.5):.2e}")
# Output (note the error SQUARING each line once it locks on):
# x0 = 1.000000000000000 error = 4.14e-01
# x1 = 1.500000000000000 error = 8.58e-02
# x2 = 1.416666666666667 error = 2.45e-03
# x3 = 1.414215686274510 error = 2.12e-06
# x4 = 1.414213562374690 error = 1.59e-12
# x5 = 1.414213562373095 error = 2.22e-16
Read the error column down the page: $4\times10^{-1}$, $9\times10^{-2}$, $2\times10^{-3}$, $2\times10^{-6}$, $2\times10^{-12}$. Each error is roughly the square of the one above it. That is the signature of quadratic convergence, and we can say precisely why it happens.
Why convergence is quadratic. Let $r$ be the true root and $e_n = x_n - r$ the error at step $n$. A short Taylor argument (anticipating Chapter 23) shows that, near a simple root where $f'(r) \neq 0$,
$$e_{n+1} \approx \frac{f''(r)}{2 f'(r)}\, e_n^2.$$
The new error is proportional to the square of the old error. If $e_n$ is, say, $10^{-3}$, then $e_{n+1}$ is on the order of $10^{-6}$, then $10^{-12}$, then machine precision — the doubling of correct digits we watched above. This is dramatically faster than methods whose error merely shrinks by a constant factor each step.
Hand computation builds understanding; machine computation builds power. You traced $\sqrt 2$ by hand in §11.7 to feel the doubling; the Python loop lets you watch the error square itself in real numbers and then turn the same five lines loose on equations no hand could solve. Neither replaces the other.
Check Your Understanding. You want to compute $\sqrt[3]{5}$ with Newton's method. What function $f$ should you find a root of, and what is the resulting iteration formula?
Answer
Use $f(x) = x^3 - 5$, so $f'(x) = 3x^2$. The iteration is $x_{n+1} = x_n - \dfrac{x_n^3 - 5}{3x_n^2} = \dfrac{1}{3}\!\left(2x_n + \dfrac{5}{x_n^2}\right)$. Starting from $x_0 = 2$ gives $x_1 = \tfrac13(4 + 1.25) = 1.75$, then $x_2 \approx 1.7108$, converging to $\sqrt[3]{5} = 1.709975\ldots$ This is the general "compute an $n$th root" recipe that calculators use.
11.9 When Newton's Method Fails
Quadratic convergence is a local promise: it holds once you are close enough to a simple root. Far from that regime, Newton's method can stall, oscillate, or fly off to infinity. Knowing the failure modes is as important as knowing the method.
1. A bad starting guess. If $x_0$ is far from any root, the tangent line at $x_0$ can point the next guess away from the root rather than toward it. Newton's method has no global sense of direction; it only ever trusts the local tangent.
2. A near-zero derivative. The step divides by $f'(x_n)$. If $f'(x_n) \approx 0$ — for instance if an iterate lands near a critical point of $f$ — the step $f(x_n)/f'(x_n)$ becomes enormous and hurls the guess far away. A flat spot is a trap.
3. Cycling. For some functions and starting points, the iterates fall into a repeating loop and never converge. The classic example is $f(x) = x^3 - 2x + 2$ started at $x_0 = 0$: the method oscillates between $0$ and $1$ forever, never approaching the real root near $-1.77$.
4. A multiple (repeated) root. If $f$ has a root where $f' = 0$ too (a double root, like $f(x) = (x-1)^2$), the quadratic convergence is lost. The method still converges, but only linearly — the error shrinks by a constant factor each step instead of squaring — because the derivation in §11.8 divided by $f'(r) = 0$.
Warning. Newton's method gives you no automatic warning when it fails. It will happily return a number, oscillate without complaint, or overflow. Always (a) sanity-check the output by plugging it back into $f$, (b) cap the iterations, and (c) prefer a bracketing method when reliability matters more than speed. The cubic-cycling example below diverges silently; only the iteration cap saves you.
# A cycling failure: f(x) = x^3 - 2x + 2 starting at x0 = 0 never converges.
_, iters, hist = newton(lambda x: x**3 - 2*x + 2,
lambda x: 3*x**2 - 2, x0=0.0, max_iter=8)
print([round(h, 4) for h in hist])
# Output: [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
# The iterates lock into a 0 -> 1 -> 0 cycle and never find the real root near -1.7693.
11.10 More Applications: Where Roots and Linearizations Live
Newton's method and linear approximation appear across every quantitative field. Here are four readings of the same two ideas — reinforcing that calculus appears in every quantitative field.
Astronomy — Kepler's equation. To locate a planet or satellite in its elliptical orbit, you must solve Kepler's equation
$$M = E - e\sin E$$
for the eccentric anomaly $E$, given the mean anomaly $M$ (which grows linearly with time) and the eccentricity $e$. There is no closed-form solution, but Newton's method on $g(E) = E - e\sin E - M$ converges in two or three iterations:
$$E_{n+1} = E_n - \frac{E_n - e\sin E_n - M}{1 - e\cos E_n}.$$
This computation runs millions of times a day inside GPS receivers, satellite-tracking software, and planetarium apps.
Pure math — the Dottie number. The equation $x = \cos x$ has exactly one real solution but no formula for it. Solve $f(x) = x - \cos x$, with $f'(x) = 1 + \sin x$. Starting from $x_0 = 0.5$, Newton converges to $x \approx 0.739085$ in four steps. Punch any number into a calculator (in radians) and hit cosine over and over: it always drifts to this same value, the Dottie number — the unique fixed point of cosine. Newton's method gets there in a handful of steps instead of dozens.
Finance — internal rate of return. A bond's price is the present value of its future cash flows, a high-degree polynomial in the discount rate $r$. Inverting it — finding the yield $r$ that makes the price match the market — is a root-finding problem with no algebraic solution. Every bond-pricing system on Wall Street runs Newton's method (or a robust hybrid) to back out yields and to compute the internal rate of return that equates an investment's cash inflows and outflows.
Concretely, suppose $\$1000$ is invested and must grow to $\$2000$ in five years with monthly compounding. The rate $r$ satisfies $1000(1 + r/12)^{60} = 2000$, i.e. $(1 + r/12)^{60} = 2$. This particular equation happens to invert cleanly: $1 + r/12 = 2^{1/60} = e^{(\ln 2)/60}$, giving $r/12 = e^{0.0115525} - 1 \approx 0.011619$ and $r \approx 0.1394 = 13.94\%$. But change the cash-flow schedule even slightly — uneven deposits, a final balloon payment — and the equation becomes a genuine polynomial root-finding problem with no formula, and Newton's method is exactly what the spreadsheet's IRR function runs under the hood.
Optimization — finding critical points. To optimize a function $g$, you look for where $g'(x) = 0$ (Chapter 10). That is a root-finding problem for $g'$, so apply Newton's method to $g'$ itself:
$$x_{n+1} = x_n - \frac{g'(x_n)}{g''(x_n)}.$$
This "Newton's method for optimization" is the seed of the algorithms that train statistical models and that we generalize to many variables in Chapter 31. It is a cousin of gradient descent, the anchor example we have been developing since Chapter 6: gradient descent steps proportionally to the slope, while Newton's method also divides by the curvature $g''$, using second-derivative information to take a smarter step.
Real-World Application — Physics linearizes the world. A huge fraction of solvable physics is linearized physics. The pendulum equation $\ddot\theta = -\frac{g}{\ell}\sin\theta$ has no elementary solution — until you replace $\sin\theta$ by its linear approximation $\theta$ (Example 3 of §11.3), turning it into $\ddot\theta = -\frac{g}{\ell}\theta$, the equation of simple harmonic motion with period $2\pi\sqrt{\ell/g}$. Hooke's law $F = -kx$ is the linearization of any restoring force near equilibrium; special relativity's low-speed limit $\gamma \approx 1 + \tfrac12 v^2/c^2$ is a linearization that recovers Newtonian kinetic energy $\tfrac12 mv^2$. The strategy is always the same: write the exact law, linearize around the operating point, and solve the easy linear problem.
11.11 The Error of Linear Approximation — and the Road to Taylor Series
How wrong is the linear approximation? The answer ties this chapter to the rest of the book.
For a function with a continuous second derivative, the error of the linearization at $a$ is controlled by the curvature:
$$|f(x) - L(x)| \le \frac{M}{2}(x - a)^2, \qquad M = \max |f''| \text{ near } a.$$
Two lessons live in this bound. First, the error is quadratic in the distance $x - a$ — halve your distance from the anchor and the error drops by a factor of four. That is exactly the curve-peels-away-quadratically picture from §11.4. Second, the error is proportional to the curvature $M$: a nearly straight function is approximated almost perfectly, while a sharply bending one is not.
The natural question is whether we can do better than linear. We can — by keeping more terms. The quadratic approximation adds a curvature term:
$$f(x) \approx f(a) + f'(a)(x - a) + \frac{f''(a)}{2}(x - a)^2,$$
with error now of order $(x-a)^3$. Keep going — adding $\frac{f'''(a)}{6}(x-a)^3$, then $\frac{f^{(4)}(a)}{24}(x-a)^4$, and so on — and you build the Taylor series, the higher-order generalization of everything in this chapter. Linear approximation is just the first-order Taylor polynomial; the full machinery, including when the infinite series reproduces $f$ exactly, is the subject of Chapter 23. (That chapter also delivers the famous identity $e^{i\pi} + 1 = 0$, whose first hint we plant here: the linearizations $\sin\theta\approx\theta$ and $\cos\theta \approx 1$ are the opening terms of the series that link exponentials and trigonometry.)
Math Major Sidebar. The error bound is a corollary of Taylor's theorem with Lagrange remainder: if $f \in C^2$ near $a$, then for each $x$ there is a point $\xi$ between $a$ and $x$ with $f(x) - L(x) = \tfrac12 f''(\xi)(x-a)^2$ exactly. Bounding $|f''(\xi)|$ by $M$ gives the inequality above. The deeper statement, $\lim_{x\to a}\frac{f(x)-L(x)}{x-a} = 0$, characterizes the tangent line as the unique linear function whose error vanishes faster than first order — the rigorous meaning of "best linear approximation," and the seed of the definition of differentiability in higher dimensions.
Looking forward to several variables. Everything here has a multivariable twin. For a function $f(x, y)$ of two variables, the tangent line becomes a tangent plane, and the linearization reads $L(x, y) = f(a,b) + f_x(a,b)(x-a) + f_y(a,b)(y-b)$. Differentials become $dz = f_x\,dx + f_y\,dy$, and error propagation generalizes to $\sigma_z^2 \approx f_x^2\sigma_x^2 + f_y^2\sigma_y^2$ for independent uncertainties. That is the story of Chapter 29, where the single idea of this chapter — up close, a smooth thing looks flat — graduates from a line to a plane to a hyperplane.
11.12 Two Reliable Cousins: Bisection and the Secant Method
Newton's method is fast but fragile (§11.9). Two relatives trade some speed for robustness, and it is worth knowing when to reach for each.
Bisection. If $f$ is merely continuous and you have a bracket — two points $a$ and $b$ with $f(a)$ and $f(b)$ of opposite sign — the Intermediate Value Theorem (Chapter 4) guarantees a root between them. Bisection repeatedly halves the bracket: evaluate $f$ at the midpoint, keep whichever half still straddles a sign change, and repeat. It cannot fail and needs no derivative, but it converges only linearly — gaining one bit of accuracy per step, far slower than Newton's doubling-of-digits. It is the tortoise: slow, but it always finishes.
The secant method. Newton needs $f'(x)$, which is sometimes unavailable or expensive. The secant method replaces the true derivative with the slope of the line through the two most recent points:
$$x_{n+1} = x_n - f(x_n)\cdot\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}.$$
This is Newton's method with a finite-difference stand-in for $f'$. It converges superlinearly, at order $\approx 1.618$ (the golden ratio, fittingly) — slower than Newton's order-2 but with no derivative required. In production, libraries such as scipy.optimize.brentq blend bisection's reliability with secant/inverse-quadratic speed to get the best of both worlds, which is why robust root-finders almost never use raw Newton alone.
11.13 Why All of This Matters
Pull back and look at the chapter as a whole. A single picture — up close, a smooth curve looks like its tangent line — generated everything:
- Linear approximation estimates function values from a nearby anchor.
- Differentials repackage that estimate as predicted change and, multiplied by a measurement uncertainty, give error propagation in one line.
- Newton's method turns the same tangent line into a root-finding algorithm with quadratic convergence — the engine inside calculators, CPUs, orbital mechanics, and optimization software.
- Taylor series (Chapter 23) extend the line to a polynomial of any order, and the tangent plane (Chapter 29) extends it to many variables.
That one idea — replace by the tangent — quietly powers an enormous fraction of all numerical computation. Differential equations are first attacked by linearizing them (Chapter 19); statistical models are fit by Newton-type iteration; machine-learning optimizers descend along local linear models of a loss surface. You have learned not one trick but the prototype of a thousand.
Add to Your Modeling Portfolio. Add a linearization-and-root-finding tool to your model. Pick the track you have been developing. Track A — Biology: linearize your population model near an equilibrium (set $P' = 0$) to classify its stability, and use Newton's method to locate the equilibrium itself when the growth law is nonlinear. Track B — Economics: linearize a demand or cost curve about an operating price, and use Newton's method to solve a marginal condition ($R'(p) = 0$ or supply $=$ demand) that has no closed form. Track C — Physics: apply the small-angle / small-displacement linearization to reduce your nonlinear force law to simple harmonic motion, and report the resulting period. Track D — Data Science: implement Newton's method for optimization, $x_{n+1} = x_n - g'(x_n)/g''(x_n)$, on a 1-D loss function, and compare its convergence to plain gradient descent — the anchor example you will scale to many dimensions in Chapter 30.
Looking Ahead
This chapter closes Part II. You now command the full toolkit of differentiation: limits and continuity (Part I), the derivative and its rules (Chapters 6–8), and the applications — optimization, related rates, and now linear approximation and Newton's method (Chapters 9–11).
Chapter 12 introduces antiderivatives — running differentiation in reverse — and the differential notation $dy$ you met in §11.4 becomes central: the antiderivative $\int f(x)\,dx$ is built from exactly the $dx$ you have been carrying. That chapter opens Part III on integration, whose climax is the Fundamental Theorem of Calculus in Chapter 14 — the result that welds the two halves of calculus into one. The straight line you learned to trust here is about to meet its inverse.
Reflection
For four thousand years, the Babylonians averaged a guess with two-over-the-guess to extract square roots, never knowing why it worked. Newton's method reveals the reason: each step straightens the curve into its tangent line, solves the easy linear problem, and lands closer to the truth — so much closer that the correct digits double. The same straight-line idea estimates a logarithm, propagates a ruler's error into a volume, and locates a planet in its orbit. Calculus does not solve hard problems by being clever about curves. It solves them by noticing that, up close, there are no curves at all — only lines, and lines we can always solve. Hold onto that. In Chapter 23 the line becomes a polynomial, in Chapter 29 it becomes a plane, and the idea you learned today will carry you the rest of the way through the book.