Case Study 1 — How a Processor Computes a Square Root

DataField.Dev

Case Study 1 — How a Processor Computes a Square Root

Field: Computer arithmetic and numerical computing Calculus used: Newton's method (§11.6–11.8), quadratic convergence (§11.8), linear approximation as a starting "seed" (§11.2)

When you write math.sqrt(2.0) in any language, or press the √ key on a calculator, the answer comes back in a few nanoseconds. There is no giant lookup table inside the chip, and no magic. There is a tangent line — applied two or three times. This case study follows the computation all the way down to the silicon, and shows that the entire trick is the chapter's central idea: replace the curve by its tangent and iterate.

The problem the hardware actually solves

A floating-point number is stored as $x = m \cdot 2^{e}$, with the mantissa $m$ in $[1, 2)$ and an integer exponent $e$. Taking a square root splits cleanly across this representation: $$\sqrt{x} = \sqrt{m}\cdot 2^{e/2}.$$ Halving the exponent is trivial (when $e$ is odd, borrow a factor of $2$ into the mantissa). So the hard part reduces to a single bounded problem: compute $\sqrt{m}$ for $m$ in a small interval near $1$. This is where Newton's method earns its place.

Curiously, hardware designers usually do not solve $\sqrt{m}$ directly. They compute the reciprocal square root $1/\sqrt{m}$ instead, because that iteration uses only multiplication — and a CPU multiplies far faster than it divides. Once $1/\sqrt{m}$ is known, one extra multiply gives $\sqrt{m} = m \cdot (1/\sqrt{m})$. So the real target is the root of $$f(y) = \frac{1}{y^2} - S, \qquad S = m,$$ whose positive solution is exactly $y = 1/\sqrt{S}$.

Deriving the division-free iteration

Apply Newton's method, $y_{n+1} = y_n - f(y_n)/f'(y_n)$. Here $f'(y) = -2/y^3$, so $$y_{n+1} = y_n - \frac{\tfrac{1}{y_n^2} - S}{-\,\tfrac{2}{y_n^3}} = y_n + \frac{y_n^3}{2}\!\left(\frac{1}{y_n^2} - S\right) = y_n + \frac{y_n}{2} - \frac{S\,y_n^3}{2}.$$ Collecting terms gives the celebrated form $$\boxed{\,y_{n+1} = y_n\!\left(\frac{3}{2} - \frac{S}{2}\,y_n^2\right) = \frac{y_n}{2}\bigl(3 - S\,y_n^2\bigr).\,}$$ Read what survives: there is no division anywhere on the right-hand side, only multiplications, a subtraction, and one halving (a single-bit exponent shift, essentially free). This is precisely why the reciprocal-square-root route is chosen — Newton's method has converted a square root and a division into a handful of multiplies.

This exact line of code became internet-famous as the "fast inverse square root" used in 1990s 3-D graphics engines to normalize lighting vectors millions of times per frame. The mysterious constant 0x5f3759df in that code does just one thing: it manufactures a good $y_0$ by reinterpreting the float's bits, which is itself a crude linear approximation to $\log_2$. After that bit hack, the line y = y * (1.5f - 0.5f * S * y * y); is exactly one step of the boxed iteration above.

Why two or three steps suffice: quadratic convergence

The reason the hardware can stop after so few iterations is the chapter's headline result. Near the simple root $r = 1/\sqrt{S}$, the error $e_n = y_n - r$ obeys $$e_{n+1} \approx \frac{f''(r)}{2 f'(r)}\,e_n^2,$$ the quadratic-convergence law of §11.8. The number of correct bits roughly doubles every step. If the seed $y_0$ is already good to, say, 8 bits, then one step gives ~16, the next ~32, the next ~53 — the full precision of a double. Three Newton steps from a decent seed is enough for IEEE double precision; for single precision, often just one or two.

Let us watch it concretely for $S = 2$, so the target is $1/\sqrt{2} = 0.70710678\ldots$ Start from a deliberately rough seed $y_0 = 0.75$:

# Reciprocal square root of S via the division-free Newton iteration.
S = 2.0
y = 0.75                      # crude seed (a hardware lookup gives a better one)
for n in range(4):
    print(f"  y{n} = {y:.9f}   error = {abs(y - 2**-0.5):.2e}")
    y = 0.5 * y * (3.0 - S * y * y)
# Hand-computed output:
#   y0 = 0.750000000   error = 4.29e-02
#   y1 = 0.703125000   error = 3.98e-03
#   y2 = 0.707133770   error = 2.70e-05
#   y3 = 0.707106782   error = 1.26e-09

Trace the first step by hand to confirm. With $y_0 = 0.75$: $y_0^2 = 0.5625$, so $S\,y_0^2 = 1.125$ and $3 - 1.125 = 1.875$; then $y_1 = 0.5 \cdot 0.75 \cdot 1.875 = 0.375 \cdot 1.875 = 0.703125$. One more step: $y_1^2 = 0.494385$, $S\,y_1^2 = 0.98877$, $3 - 0.98877 = 2.01123$, and $y_2 = 0.5 \cdot 0.703125 \cdot 2.01123 = 0.707134$. Now read the error column: $4\times10^{-2}$, $4\times10^{-3}$, $3\times10^{-5}$, $1\times10^{-9}$. Once the iteration "locks on," each error is roughly the square of the one above — the doubling of correct digits that lets the chip stop after only a few steps.

The role of the seed: a linear approximation in disguise

Quadratic convergence is a local promise (§11.9): it only kicks in once you are reasonably close. That is why every hardware square-root unit begins with a fast, rough estimate — a small lookup table, or a bit manipulation, that produces a seed good to a handful of bits. That seed is itself an approximation problem, and the cheapest good approximations are linear: a piecewise-linear fit to $1/\sqrt{m}$ over $[1, 2)$, exactly the linearization idea of §11.2 applied in pieces. The architecture is a two-stage cascade: a linear approximation gets you into the neighborhood, then Newton's quadratic convergence sprints to full precision.

Why not just divide?

A natural objection: why use the reciprocal-square-root form at all, when $f(y) = y^2 - S$ gives the clean Babylonian iteration $y_{n+1} = \tfrac12(y_n + S/y_n)$ from §11.7? Because that iteration contains the division $S/y_n$. On real hardware, division is several times slower than multiplication and is often itself implemented by Newton's method (applied to $f(x) = 1/x - d$, giving $x_{n+1} = x_n(2 - d\,x_n)$ — see §11.7). The reciprocal-square-root route avoids opening that second can of worms: it computes both the root and the needed reciprocal with multiplies alone. Engineering, like calculus, prefers the operation it already does well.

What this case study shows

A single picture — a curve is locally its tangent line — generated an algorithm that the most heavily used instructions on Earth rely on. Newton's method turned "find $\sqrt S$" into "iterate a polynomial," quadratic convergence made three iterations enough, and linear approximation supplied the seed that makes those iterations start in the right place. Every time a video game renders a frame or a physics engine normalizes a vector, this chapter is running, billions of times a second.

Discussion Questions

The boxed iteration $y_{n+1} = \tfrac12 y_n(3 - S y_n^2)$ has no division. Trace where the division would have appeared in the raw Newton formula and show algebraically how it cancelled. Why does that cancellation matter for hardware?
Quadratic convergence roughly doubles correct bits per step. Starting from an 8-bit seed, how many steps are needed for 24-bit (single) and 53-bit (double) precision? Does this match the "two or three steps" claim?
The cycling and divergence failures of §11.9 are real risks for general Newton iterations. Explain why the square-root iteration is safe — i.e., why a good seed in $[1,2)$ cannot trigger those failures here. (Hint: examine the sign and size of $f'$ over the interval.)
The Babylonian iteration needs a division; the reciprocal form does not. If a future chip made division as cheap as multiplication, which method would you choose, and why?

Short Annotated Reading

Stewart, Calculus: Early Transcendentals, §4.8 (Newton's Method). The standard worked treatment, including the square-root example and the failure cases — the textbook backbone of this study.
Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (ACM Computing Surveys, 1991). The classic on how reals are actually stored; explains the mantissa/exponent split this case study exploits.
Muller et al., Handbook of Floating-Point Arithmetic (2nd ed., Birkhäuser). Chapter on division and square root details the seed-plus-Newton hardware cascade with full error analysis.
OpenStax Calculus Volume 1, §4.9 (Newton's Method). A free, gentler parallel to Stewart, good for re-deriving the iteration.