Case Study 1 — How a Processor Computes a Square Root
Field: Computer arithmetic and numerical computing Calculus used: Newton's method (§11.6–11.8), quadratic convergence (§11.8), linear approximation as a starting "seed" (§11.2)
When you write math.sqrt(2.0) in any language, or press the √ key on a calculator, the answer comes back in a few nanoseconds. There is no giant lookup table inside the chip, and no magic. There is a tangent line — applied two or three times. This case study follows the computation all the way down to the silicon, and shows that the entire trick is the chapter's central idea: replace the curve by its tangent and iterate.
The problem the hardware actually solves
A floating-point number is stored as $x = m \cdot 2^{e}$, with the mantissa $m$ in $[1, 2)$ and an integer exponent $e$. Taking a square root splits cleanly across this representation: $$\sqrt{x} = \sqrt{m}\cdot 2^{e/2}.$$ Halving the exponent is trivial (when $e$ is odd, borrow a factor of $2$ into the mantissa). So the hard part reduces to a single bounded problem: compute $\sqrt{m}$ for $m$ in a small interval near $1$. This is where Newton's method earns its place.
Curiously, hardware designers usually do not solve $\sqrt{m}$ directly. They compute the reciprocal square root $1/\sqrt{m}$ instead, because that iteration uses only multiplication — and a CPU multiplies far faster than it divides. Once $1/\sqrt{m}$ is known, one extra multiply gives $\sqrt{m} = m \cdot (1/\sqrt{m})$. So the real target is the root of $$f(y) = \frac{1}{y^2} - S, \qquad S = m,$$ whose positive solution is exactly $y = 1/\sqrt{S}$.
Deriving the division-free iteration
Apply Newton's method, $y_{n+1} = y_n - f(y_n)/f'(y_n)$. Here $f'(y) = -2/y^3$, so $$y_{n+1} = y_n - \frac{\tfrac{1}{y_n^2} - S}{-\,\tfrac{2}{y_n^3}} = y_n + \frac{y_n^3}{2}\!\left(\frac{1}{y_n^2} - S\right) = y_n + \frac{y_n}{2} - \frac{S\,y_n^3}{2}.$$ Collecting terms gives the celebrated form $$\boxed{\,y_{n+1} = y_n\!\left(\frac{3}{2} - \frac{S}{2}\,y_n^2\right) = \frac{y_n}{2}\bigl(3 - S\,y_n^2\bigr).\,}$$ Read what survives: there is no division anywhere on the right-hand side, only multiplications, a subtraction, and one halving (a single-bit exponent shift, essentially free). This is precisely why the reciprocal-square-root route is chosen — Newton's method has converted a square root and a division into a handful of multiplies.
This exact line of code became internet-famous as the "fast inverse square root" used in 1990s 3-D graphics engines to normalize lighting vectors millions of times per frame. The mysterious constant 0x5f3759df in that code does just one thing: it manufactures a good $y_0$ by reinterpreting the float's bits, which is itself a crude linear approximation to $\log_2$. After that bit hack, the line y = y * (1.5f - 0.5f * S * y * y); is exactly one step of the boxed iteration above.
Why two or three steps suffice: quadratic convergence
The reason the hardware can stop after so few iterations is the chapter's headline result. Near the simple root $r = 1/\sqrt{S}$, the error $e_n = y_n - r$ obeys $$e_{n+1} \approx \frac{f''(r)}{2 f'(r)}\,e_n^2,$$ the quadratic-convergence law of §11.8. The number of correct bits roughly doubles every step. If the seed $y_0$ is already good to, say, 8 bits, then one step gives ~16, the next ~32, the next ~53 — the full precision of a double. Three Newton steps from a decent seed is enough for IEEE double precision; for single precision, often just one or two.
Let us watch it concretely for $S = 2$, so the target is $1/\sqrt{2} = 0.70710678\ldots$ Start from a deliberately rough seed $y_0 = 0.75$:
# Reciprocal square root of S via the division-free Newton iteration.
S = 2.0
y = 0.75 # crude seed (a hardware lookup gives a better one)
for n in range(4):
print(f" y{n} = {y:.9f} error = {abs(y - 2**-0.5):.2e}")
y = 0.5 * y * (3.0 - S * y * y)
# Hand-computed output:
# y0 = 0.750000000 error = 4.29e-02
# y1 = 0.703125000 error = 3.98e-03
# y2 = 0.707133770 error = 2.70e-05
# y3 = 0.707106782 error = 1.26e-09
Trace the first step by hand to confirm. With $y_0 = 0.75$: $y_0^2 = 0.5625$, so $S\,y_0^2 = 1.125$ and $3 - 1.125 = 1.875$; then $y_1 = 0.5 \cdot 0.75 \cdot 1.875 = 0.375 \cdot 1.875 = 0.703125$. One more step: $y_1^2 = 0.494385$, $S\,y_1^2 = 0.98877$, $3 - 0.98877 = 2.01123$, and $y_2 = 0.5 \cdot 0.703125 \cdot 2.01123 = 0.707134$. Now read the error column: $4\times10^{-2}$, $4\times10^{-3}$, $3\times10^{-5}$, $1\times10^{-9}$. Once the iteration "locks on," each error is roughly the square of the one above — the doubling of correct digits that lets the chip stop after only a few steps.
The role of the seed: a linear approximation in disguise
Quadratic convergence is a local promise (§11.9): it only kicks in once you are reasonably close. That is why every hardware square-root unit begins with a fast, rough estimate — a small lookup table, or a bit manipulation, that produces a seed good to a handful of bits. That seed is itself an approximation problem, and the cheapest good approximations are linear: a piecewise-linear fit to $1/\sqrt{m}$ over $[1, 2)$, exactly the linearization idea of §11.2 applied in pieces. The architecture is a two-stage cascade: a linear approximation gets you into the neighborhood, then Newton's quadratic convergence sprints to full precision.
Why not just divide?
A natural objection: why use the reciprocal-square-root form at all, when $f(y) = y^2 - S$ gives the clean Babylonian iteration $y_{n+1} = \tfrac12(y_n + S/y_n)$ from §11.7? Because that iteration contains the division $S/y_n$. On real hardware, division is several times slower than multiplication and is often itself implemented by Newton's method (applied to $f(x) = 1/x - d$, giving $x_{n+1} = x_n(2 - d\,x_n)$ — see §11.7). The reciprocal-square-root route avoids opening that second can of worms: it computes both the root and the needed reciprocal with multiplies alone. Engineering, like calculus, prefers the operation it already does well.
What this case study shows
A single picture — a curve is locally its tangent line — generated an algorithm that the most heavily used instructions on Earth rely on. Newton's method turned "find $\sqrt S$" into "iterate a polynomial," quadratic convergence made three iterations enough, and linear approximation supplied the seed that makes those iterations start in the right place. Every time a video game renders a frame or a physics engine normalizes a vector, this chapter is running, billions of times a second.
Discussion Questions
- The boxed iteration $y_{n+1} = \tfrac12 y_n(3 - S y_n^2)$ has no division. Trace where the division would have appeared in the raw Newton formula and show algebraically how it cancelled. Why does that cancellation matter for hardware?
- Quadratic convergence roughly doubles correct bits per step. Starting from an 8-bit seed, how many steps are needed for 24-bit (single) and 53-bit (double) precision? Does this match the "two or three steps" claim?
- The cycling and divergence failures of §11.9 are real risks for general Newton iterations. Explain why the square-root iteration is safe — i.e., why a good seed in $[1,2)$ cannot trigger those failures here. (Hint: examine the sign and size of $f'$ over the interval.)
- The Babylonian iteration needs a division; the reciprocal form does not. If a future chip made division as cheap as multiplication, which method would you choose, and why?
Short Annotated Reading
- Stewart, Calculus: Early Transcendentals, §4.8 (Newton's Method). The standard worked treatment, including the square-root example and the failure cases — the textbook backbone of this study.
- Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (ACM Computing Surveys, 1991). The classic on how reals are actually stored; explains the mantissa/exponent split this case study exploits.
- Muller et al., Handbook of Floating-Point Arithmetic (2nd ed., Birkhäuser). Chapter on division and square root details the seed-plus-Newton hardware cascade with full error analysis.
- OpenStax Calculus Volume 1, §4.9 (Newton's Method). A free, gentler parallel to Stewart, good for re-deriving the iteration.