LU and PLU Decomposition: Efficient Solving at Scale

DataField.Dev

46 min read

> Learning paths. Math majors — read everything; pay special attention to the existence/uniqueness conditions for $A=LU$ in §10.4, the proof that the multipliers assemble into $L$ in §10.3, and the permutation-matrix algebra of §10.7. CS / Data...

Prerequisites

chapter-04-gaussian-elimination

Learning Objectives

Explain LU factorization as 'Gaussian elimination, remembered': L stores the elimination multipliers, U is the echelon form.
Compute the LU factorization of a small matrix by hand and verify it against scipy.linalg.lu.
Solve Ax=b using a precomputed LU by forward-substituting Ly=b then back-substituting Ux=y.
Explain why factor-once-reuse-many beats re-eliminating for each new right-hand side, and quantify the savings.
State when pivoting is required, build the permutation matrix P, and read PA=LU (the PLU factorization).
Compare the operation counts of LU (~1/3 n^3), solving via LU, and computing an inverse, and justify 'never invert to solve.'

In This Chapter

10.1 What is LU decomposition, and why factor a matrix at all?
10.2 Why are triangular systems easy to solve?
10.3 How does Gaussian elimination secretly produce L and U?
10.4 When does a matrix have an LU factorization (without pivoting)?
10.5 A full worked LU, verified against numpy and scipy
10.6 How do you solve Ax = b once you have A = LU?
10.7 What is the permutation matrix P, and why does PLU exist when LU doesn't?
10.8 How expensive is LU, and why is it better than computing the inverse?
10.9 What does the from-scratch LU look like in code?
10.10 Putting it together: factorization as a reusable tool

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

LU and PLU Decomposition: Efficient Solving at Scale

Learning paths. Math majors — read everything; pay special attention to the existence/uniqueness conditions for $A=LU$ in §10.4, the proof that the multipliers assemble into $L$ in §10.3, and the permutation-matrix algebra of §10.7. CS / Data Science — focus on the factor-once-reuse-many argument of §10.6, the operation counts of §10.8, and the from-scratch toolkit; this is the chapter that explains why scipy.linalg.lu_factor exists and why you should never invert a matrix to solve a system. Physics / Engineering — focus on the geometric reading of $L$ and $U$ as a sequence of shears, the circuit-simulation case study, and the worked hand factorizations. In this chapter everyone learns the same core move: how to remember an elimination so you never have to repeat it.

10.1 What is LU decomposition, and why factor a matrix at all?

In Chapter 4 you learned Gaussian elimination — the workhorse algorithm that grinds any system $A\mathbf{x}=\mathbf{b}$ down to a triangle and back-substitutes to the answer. It is complete, mechanical, and trustworthy. But it has a quiet inefficiency that becomes glaring at scale, and fixing it is one of the most important ideas in computational linear algebra. This chapter is about that fix.

Picture a working engineer with a single matrix $A$ — say it encodes the structure of a bridge, or the wiring of a circuit, or the heat flow through a metal plate. The engineer does not solve $A\mathbf{x}=\mathbf{b}$ once. They solve it hundreds of times: same $A$, but a different right-hand side $\mathbf{b}$ for each load on the bridge, each input voltage, each moment in a simulation. If you run full Gaussian elimination from scratch for every one of those $\mathbf{b}$'s, you are redoing the exact same work — eliminating the exact same $A$ — over and over, throwing away the result each time. That is like re-reading a recipe from the first word every time you cook the same dish.

The waste is not subtle. Recall from Chapter 4 that eliminating an $n\times n$ matrix costs on the order of $n^3$ arithmetic operations, while the back-substitution at the end costs only about $n^2$. The expensive part — the elimination — is identical no matter what $\mathbf{b}$ you are solving against, because the row operations that clear out $A$ never look at the right-hand side until the very end. So if you re-eliminate for each of a thousand right-hand sides, you pay that $n^3$ cost a thousand times over, when you only ever needed to pay it once. For a modest $1000\times1000$ system that is the difference between a calculation that finishes in a blink and one that crawls — purely because the work was repeated needlessly. The fix is not a cleverer elimination; it is to stop repeating the one you already did.

The Key Insight — The expensive part of solving $A\mathbf{x}=\mathbf{b}$ is eliminating $A$, and that part does not depend on $\mathbf{b}$ at all. So do it once, record the result, and reuse it for every right-hand side. The record is a factorization: $A = LU$.

Here is the central idea in one sentence — the whole of LU factorization explained before we touch a formula: LU decomposition is Gaussian elimination, remembered. When you eliminate $A$ to its echelon form, you produce an upper-triangular matrix — call it $U$. Along the way you use a sequence of multipliers ("subtract $2$ times row 1 from row 2," "subtract $\tfrac{1}{3}$ times row 1 from row 3," and so on). Ordinarily those multipliers are scratch work you discard. LU factorization keeps them. It collects every multiplier into a lower-triangular matrix $L$, and the remarkable fact — which we will prove — is that

$$A = LU,$$

where $L$ is lower-triangular with $1$'s on its diagonal and $U$ is the upper-triangular echelon form you reached. You have not thrown the elimination away; you have packaged it as two triangular factors. And triangular systems, as Chapter 4 showed, are trivial to solve. That packaging is the whole game.

There is a useful way to picture what "remembering" buys you. Ordinary Gaussian elimination is like solving a maze by walking it: you find the exit, but the moment you leave, the path is gone, and the next person (the next $\mathbf{b}$) has to walk the whole maze again. LU factorization is like drawing a map of the maze as you walk it the first time. The map costs a little extra effort to produce, but once you have it, every future traveler reaches the exit by simply reading it — no re-walking. $L$ and $U$ together are that map: a compact record of the route elimination took, reusable forever. The first solve pays to draw the map; every solve after that is free by comparison.

This is your first serious encounter with a recurring big idea of the book, the one Part II's introduction promised: factoring a matrix reveals structure you can reuse. We will meet it again and again — the determinant falls out of LU (Chapter 11), the eigendecomposition $A=PDP^{-1}$ factors a matrix into "change coordinates, scale, change back" (Chapter 25), and the singular value decomposition $A=U\Sigma V^{\mathsf{T}}$ factors every matrix into rotate–stretch–rotate (Chapter 30). LU is the simplest and most practical member of that family, and it is where the habit of factoring to understand and to reuse begins. Each of those later factorizations splits a matrix into pieces that are individually simple — triangular, diagonal, orthogonal — and whose product rebuilds the original; LU's pieces are the two triangles, and its simplicity is the reason it is the one you will compute most often in practice.

Let's lay out the plan. We first recall what triangular systems are and why they are easy (§10.2). We watch elimination secretly build $L$ and $U$, and we factor a $3\times3$ matrix completely by hand and check it against scipy (§10.3–10.5). We then cash in the payoff: solving $A\mathbf{x}=\mathbf{b}$ as two substitutions, and the factor-once-reuse-many argument that makes LU indispensable (§10.6). We confront the catch — sometimes a zero pivot forces a row swap, and the bookkeeping for that is a permutation matrix $P$, giving the $PA=LU$ factorization that real software actually uses (§10.7). Finally we count operations carefully, settle the "never invert to solve" rule once and for all, and build the factorization from scratch (§10.8–10.9). As always, we begin with the simplest piece and the picture behind it.

10.2 Why are triangular systems easy to solve?

Before factoring anything, recall why we would ever want triangular pieces. The answer is that triangular systems are the easiest systems there are — you met this in Chapter 4 as back-substitution, but it is worth isolating, because LU's entire value rests on it.

A matrix is upper-triangular if every entry below the main diagonal is zero, and lower-triangular if every entry above the diagonal is zero. Here are a $3\times3$ example of each:

$$U = \begin{bmatrix} 2 & 1 & -1 \\ 0 & 3 & 4 \\ 0 & 0 & 5 \end{bmatrix}, \qquad L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & 4 & 1 \end{bmatrix}.$$

The $L$ above is special in a way we will use constantly: its diagonal entries are all $1$. We call such a matrix lower-triangular with unit diagonal (or unit lower-triangular), and it is exactly the shape that records elimination multipliers, as you will see.

Now suppose you must solve $U\mathbf{x}=\mathbf{c}$ for the $U$ above, with $\mathbf{c}=(0,11,15)$. Write out the equations:

$$\begin{aligned} 2x_1 + x_2 - x_3 &= 0,\\ 3x_2 + 4x_3 &= 11,\\ 5x_3 &= 15. \end{aligned}$$

The bottom equation has one unknown: $x_3 = 3$. Substitute upward into the middle equation: $3x_2 + 12 = 11 \Rightarrow x_2 = -\tfrac13$. Substitute both into the top: $2x_1 - \tfrac13 - 3 = 0 \Rightarrow x_1 = \tfrac{5}{3}$. That bottom-up unraveling is back-substitution, and it is cheap: each equation reveals one new unknown using only the ones already found.

A lower-triangular system is solved the same way but top-down — the first equation has one unknown, the second has two (one known), and so on. That mirror process is called forward substitution. To feel it, solve $L\mathbf{y}=\mathbf{b}$ for the $L$ above with $\mathbf{b}=(3,4,7)$. The first equation is $y_1=3$ (the unit diagonal makes it a freebie). The second, $2y_1+y_2=4$, gives $y_2=4-2(3)=-2$. The third, $-y_1+4y_2+y_3=7$, gives $y_3=7-(-1)(3)-4(-2)=7+3+8=18$. Each step uses only the $y$'s already found, exactly mirroring back-substitution but marching down instead of up.

Both substitution directions cost on the order of $n^2$ arithmetic operations for an $n\times n$ system — vastly cheaper than the $\sim n^3$ of full elimination. The reason the count is $n^2$ and not $n^3$ is worth seeing: solving for the $i$-th unknown requires subtracting off the $\sim i$ terms already known and one division, so the total work is $1+2+\dots+n\approx n^2/2$. Compare that to elimination, where clearing each of $n$ columns touches an $n\times n$ block of entries — the extra factor of $n$ is precisely what separates the cheap solve from the expensive factorization. Hold that contrast: $n^2$ versus $n^3$ is the entire economic argument for LU. If we can replace solving the original messy system with solving two triangular systems, we trade one $n^3$ job for two $n^2$ jobs every time a new right-hand side arrives — and the $n^3$ job, done once, can be shared across all of them.

Geometric Intuition — A triangular matrix transforms space in a staged way. An upper-triangular $U$ acts so that the last coordinate's image depends only on the last input coordinate, the second-to-last on the last two, and so on — information flows in one direction along the axes, never circling back. That one-directional flow is exactly why you can untangle the system one variable at a time. A general matrix mixes all coordinates together at once, which is why it is harder; LU's job is to unmix that into two clean one-directional stages.

# Triangular solves are cheap: forward (lower) and back (upper) substitution.
import numpy as np
from scipy.linalg import solve_triangular
U = np.array([[2., 1., -1.],
              [0., 3.,  4.],
              [0., 0.,  5.]])
c = np.array([0., 11., 15.])
x = solve_triangular(U, c, lower=False)   # back-substitution on upper-triangular U
print(x)          # [ 1.66666667 -0.33333333  3.        ]
print(U @ x)      # [ 0. 11. 15.]  -> reproduces c

solve_triangular returns $(\tfrac53, -\tfrac13, 3)$ — exactly our hand result — and multiplying $U$ by it reproduces $\mathbf{c}=(0,11,15)$. The point is not that scipy can do it; the point is that it can do it fast, because triangular structure is the easiest structure to exploit. Now we build that structure on purpose.

10.3 How does Gaussian elimination secretly produce L and U?

Let's watch the factorization appear out of ordinary elimination. We will eliminate a matrix to its triangular form $U$ — that part is pure Chapter 4 — and pay attention to the multipliers we use, because those become $L$.

Take the matrix

$$A = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{bmatrix}.$$

We run forward elimination with no row swaps (this matrix is friendly; §10.7 handles the case where we cannot avoid a swap). The first pivot is $a_{11}=2$.

Clear column 1. To kill the $4$ in row 2 we subtract $\tfrac{4}{2}=2$ times row 1 from row 2 — the multiplier is $\ell_{21}=2$. To kill the $8$ in row 3 we subtract $\tfrac{8}{2}=4$ times row 1 — the multiplier is $\ell_{31}=4$:

$$\begin{aligned} R_2 &\leftarrow R_2 - 2R_1\\ R_3 &\leftarrow R_3 - 4R_1\end{aligned} \quad\Longrightarrow\quad \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 3 & 5 \end{bmatrix}.$$

(Row 2 became $(4,3,3)-2(2,1,1)=(0,1,1)$; row 3 became $(8,7,9)-4(2,1,1)=(0,3,5)$.)

Clear column 2. The new pivot is the $1$ in row 2. To kill the $3$ in row 3 we subtract $\tfrac{3}{1}=3$ times row 2 — the multiplier is $\ell_{32}=3$:

$$R_3 \leftarrow R_3 - 3R_2 \quad\Longrightarrow\quad \begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{bmatrix} = U.$$

We have reached the upper-triangular echelon form $U$. Now here is the magic. Collect the three multipliers we used, placed in the positions of the entries they eliminated, and put $1$'s on the diagonal:

$$L = \begin{bmatrix} 1 & 0 & 0 \\ \ell_{21} & 1 & 0 \\ \ell_{31} & \ell_{32} & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 3 & 1 \end{bmatrix}.$$

The multiplier $\ell_{21}=2$ that cleared position $(2,1)$ sits in position $(2,1)$ of $L$; $\ell_{31}=4$ sits at $(3,1)$; $\ell_{32}=3$ sits at $(3,2)$. The diagonal is all $1$'s. I claim that $A = LU$. Let's check by hand:

$$LU = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 3 & 1 \end{bmatrix}\begin{bmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{bmatrix} = A. \checkmark$$

(Row 2 of $LU$: $2\cdot(2,1,1) + 1\cdot(0,1,1) = (4,3,3)$. Row 3: $4\cdot(2,1,1)+3\cdot(0,1,1)+1\cdot(0,0,2)=(8,7,9)$. Both match $A$.) The factorization is exact. We did nothing but ordinary elimination and wrote down the multipliers we would otherwise have thrown away.

The Key Insight — $U$ is where elimination ends; $L$ is how it got there. The diagonal $1$'s of $L$ say "start from the original rows"; the entry $\ell_{ij}$ below the diagonal says "row $i$ had $\ell_{ij}$ copies of row $j$ subtracted from it." Multiplying $L$ by $U$ replays the elimination in reverse, rebuilding $A$ from its echelon form.

Take a moment to read $L$ as a story, because this reading is what makes LU memorable rather than mechanical. Row $i$ of $A$, the original matrix, equals row $i$ of $U$ (its echelon version) plus the combination of higher pivot-rows that elimination removed from it — and the coefficients of that combination are exactly the entries $\ell_{i1},\ell_{i2},\dots$ in row $i$ of $L$. For our example, $L$'s last row $(4,3,1)$ says: "the original row 3 was $4$ times pivot-row 1, plus $3$ times pivot-row 2, plus $1$ times the final row 3 of $U$." Check it against $A$: $4(2,1,1)+3(0,1,1)+1(0,0,2)=(8,7,9)$, which is row 3 of $A$. So $L$ is a complete recipe for reconstituting each original row from the triangular pieces. That is why no information is lost in the factorization: $U$ records what survived elimination, $L$ records what was taken away, and together they hold everything $A$ ever contained.

Why does this work? The clean way to see it uses the elementary-matrix idea from Chapter 4. Each elimination step "$R_i \leftarrow R_i - \ell_{ij}R_j$" is left-multiplication by an elementary matrix $E_{ij}$ — the identity with $-\ell_{ij}$ in position $(i,j)$. Our three steps mean

$$E_{32}\,E_{31}\,E_{21}\,A = U,$$

where, for instance, $E_{21}=\begin{bsmallmatrix}1&0&0\\-2&1&0\\0&0&1\end{bsmallmatrix}$. Each $E_{ij}$ is a shear (Chapter 4, §4.3) and is invertible; solving for $A$ gives

$$A = E_{21}^{-1}E_{31}^{-1}E_{32}^{-1}\,U.$$

Now two small miracles combine. First, the inverse of $E_{ij}$ (which subtracts $\ell_{ij}R_j$ from $R_i$) is simply the matrix that adds it back: $E_{ij}^{-1}$ is the identity with $+\ell_{ij}$ in position $(i,j)$. Second — and this is the part worth pausing on — when you multiply those inverse shears together in this order, the multipliers do not interfere; each $+\ell_{ij}$ simply drops into its own slot below the diagonal. The product is exactly

$$L = E_{21}^{-1}E_{31}^{-1}E_{32}^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ \ell_{21} & 1 & 0 \\ \ell_{31} & \ell_{32} & 1 \end{bmatrix}.$$

So $A = LU$ with $L$ unit lower-triangular holding the multipliers, and $U$ the echelon form. The "no interference" claim is what makes $L$ so easy to read off — and it is precisely why you can build $L$ by just recording multipliers in place as you eliminate, with no extra matrix multiplication at all.

Math-Major Sidebar (optional) — Why don't the multipliers interfere? The subtlety is the order of multiplication. We have $L = E_{21}^{-1}E_{31}^{-1}E_{32}^{-1}$, and one checks directly that for these particular elementary matrices the product just superimposes the off-diagonal entries: $\begin{bsmallmatrix}1&0&0\\ \ell_{21}&1&0\\0&0&1\end{bsmallmatrix}\begin{bsmallmatrix}1&0&0\\0&1&0\\ \ell_{31}&0&1\end{bsmallmatrix}\begin{bsmallmatrix}1&0&0\\0&1&0\\0&\ell_{32}&1\end{bsmallmatrix} = \begin{bsmallmatrix}1&0&0\\ \ell_{21}&1&0\\ \ell_{31}&\ell_{32}&1\end{bsmallmatrix}$. The cross-terms that could appear (e.g. $\ell_{21}\ell_{32}$ landing in position $(3,1)$) all vanish because the column being filled in each factor sits strictly below the row already filled in the previous one. Reverse the multiplication order and this fails — yet another reminder that matrix multiplication does not commute (Chapter 8). The result generalizes: for an $n\times n$ matrix factored without pivoting, the multiplier $\ell_{ij}$ (the factor that cleared entry $(i,j)$) lands in position $(i,j)$ of $L$, full stop. $\blacksquare$

10.4 When does a matrix have an LU factorization (without pivoting)?

We should be honest, in the spirit of the style bible's rule to always state a theorem's conditions, about when the clean story above holds. The factorization $A=LU$ — with no row swaps — exists exactly when elimination never hits a zero in a pivot position.

Theorem (LU factorization without pivoting). Let $A$ be an $n\times n$ matrix. If Gaussian elimination can reduce $A$ to upper-triangular form without any row swaps — equivalently, if every pivot encountered is nonzero — then $A$ factors as $A=LU$, where $L$ is unit lower-triangular (1's on the diagonal, multipliers below) and $U$ is upper-triangular. When such a factorization exists with $L$ unit-diagonal, it is unique.

The condition has a crisp equivalent that math majors will appreciate: $A=LU$ exists without pivoting if and only if every leading principal minor of $A$ — the determinant of the top-left $k\times k$ block, for each $k=1,\dots,n$ — is nonzero. The reason is that the $k$-th pivot, times the product of the earlier pivots, equals that $k\times k$ leading minor; a zero leading minor forces a zero pivot, which stalls plain elimination. (We will make the pivots-determinant connection precise in Chapter 11, where the product of the pivots is the determinant up to a sign.)

It helps to see why a zero pivot stalls the no-swap algorithm, rather than just accepting that it does. The next elimination step divides by the current pivot — that is how the multiplier $\ell_{ij}=M_{ij}/M_{jj}$ is formed. If the pivot $M_{jj}$ is zero, the division is undefined and there is no multiplier to subtract; the algorithm simply has no legal move that uses that row in that position. The only escape is to bring up a different row with a nonzero entry in the column — a swap — which is exactly the operation the plain $A=LU$ story forbids itself. So "every leading minor nonzero" is precisely the guarantee that we never reach a position where the only way forward is a swap. When that guarantee fails, we are not stuck for good; we just need the more general $PA=LU$ of §10.7, which builds the swaps into the bookkeeping. The condition is not a defect of LU — it is an honest statement of exactly when the simplest version applies, and the chapter's job is to handle the rest.

The uniqueness is worth a word, because it is why "the LU factorization" is well-defined. Suppose $A=L_1U_1=L_2U_2$ with both $L$'s unit lower-triangular and both $U$'s upper-triangular and invertible. Then $L_2^{-1}L_1 = U_2U_1^{-1}$. The left side is unit lower-triangular (a product of unit lower-triangulars); the right side is upper-triangular. A matrix that is both lower- and upper-triangular is diagonal — and a unit lower-triangular diagonal matrix is the identity. So $L_2^{-1}L_1=I$ and $U_2U_1^{-1}=I$, giving $L_1=L_2$ and $U_1=U_2$. The unit-diagonal convention on $L$ is exactly what removes the ambiguity (otherwise you could push a scalar from $L$ into $U$).

Warning

— The plain $A=LU$ factorization does not exist for every matrix, even for some perfectly invertible ones. The classic example is $A=\begin{bsmallmatrix}0&1\\1&0\end{bsmallmatrix}$: its top-left entry is $0$, so the very first pivot is zero and elimination stalls before it starts. This matrix is invertible (it just swaps coordinates), yet it has no $LU$ with the unit-diagonal $L$ — you are forced to swap rows. The fix is pivoting and the permutation matrix $P$ of §10.7, which is exactly why production software computes $PA=LU$, not $A=LU$. State the condition, respect the trap: no pivoting works only when every pivot is nonzero.

Check Your Understanding — Without computing the full factorization, can the matrix $A=\begin{bsmallmatrix}1&2\\3&6\end{bsmallmatrix}$ be reduced to triangular form without a row swap? And does that guarantee it has a usable $LU$ for solving a system?

Answer

Yes, reduced without a swap: the first pivot is $a_{11}=1\neq0$, and one step $R_2\leftarrow R_2-3R_1$ gives $\begin{bsmallmatrix}1&2\\0&0\end{bsmallmatrix}$, which is upper-triangular. So $A=LU$ with $L=\begin{bsmallmatrix}1&0\\3&1\end{bsmallmatrix}$, $U=\begin{bsmallmatrix}1&2\\0&0\end{bsmallmatrix}$. But the factorization is not useful for solving $A\mathbf{x}=\mathbf{b}$ uniquely, because the second pivot of $U$ is $0$ — this $A$ is singular ($\det A = 6-6=0$, its rows are dependent). LU happily factors it, but $U$ has a zero on the diagonal, so the back-substitution step $U\mathbf{x}=\mathbf{y}$ has no unique solution. Existence of $A=LU$ does not imply invertibility; you still need the pivots of $U$ all nonzero to solve.

10.5 A full worked LU, verified against numpy and scipy

Let's do one more complete factorization, slowly, and confirm every number against numpy and scipy — honoring the book's non-negotiable that every numeric result shown must match the code. Take

$$A = \begin{bmatrix} 1 & 2 & 4 \\ 3 & 8 & 14 \\ 2 & 6 & 13 \end{bmatrix}.$$

Column 1. Pivot $a_{11}=1$. Multipliers: $\ell_{21}=3/1=3$, $\ell_{31}=2/1=2$.

$$\begin{aligned} R_2 &\leftarrow R_2 - 3R_1\\ R_3 &\leftarrow R_3 - 2R_1\end{aligned}\;\Longrightarrow\; \begin{bmatrix} 1 & 2 & 4 \\ 0 & 2 & 2 \\ 0 & 2 & 5 \end{bmatrix}.$$

(Row 2: $(3,8,14)-3(1,2,4)=(0,2,2)$. Row 3: $(2,6,13)-2(1,2,4)=(0,2,5)$.)

Column 2. Pivot is the $2$ in row 2. Multiplier $\ell_{32}=2/2=1$:

$$R_3 \leftarrow R_3 - 1\cdot R_2 \;\Longrightarrow\; \begin{bmatrix} 1 & 2 & 4 \\ 0 & 2 & 2 \\ 0 & 0 & 3 \end{bmatrix} = U.$$

Assemble $L$ from the multipliers $\ell_{21}=3,\ \ell_{31}=2,\ \ell_{32}=1$:

$$L = \begin{bmatrix} 1 & 0 & 0 \\ 3 & 1 & 0 \\ 2 & 1 & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} 1 & 2 & 4 \\ 0 & 2 & 2 \\ 0 & 0 & 3 \end{bmatrix}.$$

Check $LU=A$ on row 3: $2\cdot(1,2,4) + 1\cdot(0,2,2) + 1\cdot(0,0,3) = (2,6,13)$. ✓ Now numpy and scipy:

# Full LU of a 3x3, verified two ways.
import numpy as np
from scipy.linalg import lu
A = np.array([[1., 2., 4.],
              [3., 8., 14.],
              [2., 6., 13.]])
L = np.array([[1., 0., 0.],
              [3., 1., 0.],
              [2., 1., 1.]])
U = np.array([[1., 2., 4.],
              [0., 2., 2.],
              [0., 0., 3.]])
print(np.allclose(L @ U, A))     # True  -> our hand factorization is exact
P_s, L_s, U_s = lu(A)            # scipy's PLU (it pivots by default)
print(np.round(P_s, 3))          # NOT identity: [[0 0 1],[1 0 0],[0 1 0]] -> scipy swapped rows
print(np.round(L_s, 4))          # [[1 0 0],[0.6667 1 0],[0.3333 -1 1]] -> differs from our L
print(np.round(U_s, 4))          # [[3 8 14],[0 0.6667 3.6667],[0 0 3]] -> differs from our U

Two things happen, and the second is instructive. First, np.allclose(L @ U, A) is True: our hand factors multiply back to $A$ exactly. Second, scipy.linalg.lu returns a factorization too — but it returns a PLU factorization $A=PLU$ with partial pivoting (scipy's lu always pivots for numerical safety), so its $L$ and $U$ need not equal ours. For this $A$, partial pivoting picks the largest first-column entry ($3$, in row 2) as the pivot, so scipy does permute the rows, and its $P$ is not the identity. The numbers differ from our no-pivot factorization, yet both are valid: ours satisfies $LU=A$, scipy's satisfies $PLU=A$. Let me show the reconciliation explicitly, because the mismatch surprises people:

# scipy pivots; reconcile its PLU with A, and verify it really reconstructs A.
import numpy as np
from scipy.linalg import lu
A = np.array([[1., 2., 4.],
              [3., 8., 14.],
              [2., 6., 13.]])
P, L, U = lu(A)                       # scipy convention: A = P @ L @ U
print(np.allclose(P @ L @ U, A))      # True
print(np.allclose(L @ U, P.T @ A))    # True -> P^T A = L U (rows of A permuted)
print(np.diag(L))                     # [1. 1. 1.] -> L still has unit diagonal

Both checks print True. scipy's factors obey $A=PLU$, equivalently $P^{\mathsf{T}}A=LU$, and its $L$ keeps the unit diagonal. The lesson is not that our hand answer was wrong — it satisfies $A=LU$ perfectly — but that the LU factorization is only unique once you fix the pivoting strategy. "No pivoting" and "partial pivoting" give different (both correct) factorizations of the same $A$. We took the no-pivot path by hand because $A$ allowed it; scipy took the safer pivoted path because it always does. We make peace with pivoting properly in §10.7.

This is a good moment to absorb a habit that will save you confusion for the rest of the book: when a library's factorization disagrees with your hand computation, suspect a convention difference before you suspect an error. Factorizations are rarely unique without extra constraints, and different tools impose different constraints — scipy always pivots, some references write $PA=LU$ while others write $A=PLU$, and the choice of which entry to call "the pivot" can differ. The reliable check is never "do the factors match?" but "do the factors reconstruct the original matrix under the library's stated convention, and do both routes solve $A\mathbf{x}=\mathbf{b}$ to the same $\mathbf{x}$?" The answer $\mathbf{x}$ to a well-posed system is unique (when $A$ is invertible), even when the road to it is not. Verify the invariant — the reconstruction $LU=P^{\mathsf{T}}A$ or the residual $A\mathbf{x}-\mathbf{b}\approx\mathbf{0}$ — not the intermediate factors.

Computational Note — scipy.linalg.lu(A) returns (P, L, U) with the convention $A = P\,L\,U$ (note: $P$ on the left, and some references instead write $PA=LU$ — watch which one your library uses). For solving, prefer the pair lu_factor/lu_solve, which pack $L$ and $U$ into one array plus a compact pivot vector — that is the production interface. And remember the index shift from Chapter 4: the math says "the multiplier in row 2, column 1 is $\ell_{21}$"; scipy stores it as L[1,0]. Mathematics indexes from 1; numpy and scipy index from 0.

10.6 How do you solve Ax = b once you have A = LU?

Now the payoff. You have spent effort to factor $A=LU$. How does that help you solve $A\mathbf{x}=\mathbf{b}$? Substitute the factorization in:

$$A\mathbf{x}=\mathbf{b} \quad\Longrightarrow\quad LU\mathbf{x}=\mathbf{b}.$$

The trick is to give the product $U\mathbf{x}$ a name, $\mathbf{y}=U\mathbf{x}$, which splits the one hard solve into two easy ones. This little renaming is the conceptual hinge of the whole method, so it is worth saying slowly. The equation $LU\mathbf{x}=\mathbf{b}$ has the unknown $\mathbf{x}$ buried two factors deep. Rather than try to undo both factors at once, we peel them off one at a time. Define $\mathbf{y}=U\mathbf{x}$ — the vector that $U$ produces from $\mathbf{x}$, whatever it turns out to be. Then $L\mathbf{y}=\mathbf{b}$ involves only $L$ and the new unknown $\mathbf{y}$, so we solve it first (it is triangular, hence cheap). Once we know $\mathbf{y}$, the defining relation $U\mathbf{x}=\mathbf{y}$ is also triangular in the original unknown $\mathbf{x}$, so we solve that next. Two triangular solves, chained — that is the entire algorithm, and the only idea is "name the intermediate vector and solve in two stages":

$$\boxed{\;L\mathbf{y}=\mathbf{b}\;}\quad\text{(solve for }\mathbf{y}\text{ by forward substitution)},\qquad \boxed{\;U\mathbf{x}=\mathbf{y}\;}\quad\text{(solve for }\mathbf{x}\text{ by back substitution)}.$$

Both systems are triangular — exactly the easy kind from §10.2. First forward-substitute through the lower-triangular $L$ to get $\mathbf{y}$; then back-substitute through the upper-triangular $U$ to get $\mathbf{x}$. Two cheap $n^2$ solves replace one expensive $n^3$ elimination.

Let's do it concretely with the factorization from §10.5 — $L=\begin{bsmallmatrix}1&0&0\\3&1&0\\2&1&1\end{bsmallmatrix}$, $U=\begin{bsmallmatrix}1&2&4\\0&2&2\\0&0&3\end{bsmallmatrix}$ — and right-hand side $\mathbf{b}=(7,29,29)$.

Step 1 — forward-substitute $L\mathbf{y}=\mathbf{b}$. The equations (reading $L$ row by row) are:

$$\begin{aligned} y_1 &= 7,\\ 3y_1 + y_2 &= 29 \;\Rightarrow\; y_2 = 29 - 21 = 8,\\ 2y_1 + y_2 + y_3 &= 29 \;\Rightarrow\; y_3 = 29 - 14 - 8 = 7. \end{aligned}$$

So $\mathbf{y}=(7,8,7)$. Notice how the unit diagonal made each step a single subtraction with no division — that is a small bonus of the unit-diagonal convention.

Step 2 — back-substitute $U\mathbf{x}=\mathbf{y}$. The equations (reading $U$ bottom-up):

$$\begin{aligned} 3x_3 &= 7 \;\Rightarrow\; x_3 = \tfrac{7}{3},\\ 2x_2 + 2x_3 &= 8 \;\Rightarrow\; x_2 = \tfrac{8 - 2\cdot\frac73}{2} = \tfrac{8 - \frac{14}{3}}{2} = \tfrac{10/3}{2} = \tfrac{5}{3},\\ x_1 + 2x_2 + 4x_3 &= 7 \;\Rightarrow\; x_1 = 7 - 2\cdot\tfrac53 - 4\cdot\tfrac73 = 7 - \tfrac{10}{3} - \tfrac{28}{3} = 7 - \tfrac{38}{3} = -\tfrac{17}{3}. \end{aligned}$$

So $\mathbf{x}=\left(-\tfrac{17}{3},\ \tfrac{5}{3},\ \tfrac{7}{3}\right)$. Let's verify the whole pipeline against numpy — both the direct solve and the two-stage LU solve must agree:

# Solve Ax=b via LU: forward-solve Ly=b, then back-solve Ux=y. Compare to np.linalg.solve.
import numpy as np
from scipy.linalg import solve_triangular
A = np.array([[1., 2., 4.], [3., 8., 14.], [2., 6., 13.]])
L = np.array([[1., 0., 0.], [3., 1., 0.], [2., 1., 1.]])
U = np.array([[1., 2., 4.], [0., 2., 2.], [0., 0., 3.]])
b = np.array([7., 29., 29.])
y = solve_triangular(L, b, lower=True)       # forward substitution
x = solve_triangular(U, y, lower=False)      # back substitution
print(np.round(y, 4))                        # [7. 8. 7.]
print(np.round(x, 4))                        # [-5.6667  1.6667  2.3333]
print(np.allclose(x, np.linalg.solve(A, b))) # True -> matches the direct solver

The intermediate $\mathbf{y}=(7,8,7)$ matches our hand forward-substitution, and $\mathbf{x}\approx(-5.667,\,1.667,\,2.333) = (-\tfrac{17}{3},\tfrac53,\tfrac73)$ matches our back-substitution to the digit. The final np.allclose against np.linalg.solve is True: the two-stage LU solve and the direct solver land on the same answer.

Why this is efficient: factor once, reuse for many right-hand sides

Here is the argument that makes LU worth the trouble — the chapter's anchor idea. Suppose you must solve $A\mathbf{x}=\mathbf{b}$ for the same $A$ but for $m$ different right-hand sides $\mathbf{b}_1,\dots,\mathbf{b}_m$. Compare two strategies:

Re-eliminate each time (naive). Run full Gaussian elimination from scratch for every $\mathbf{b}_k$. Each solve costs $\sim\tfrac23 n^3$. Total: $\sim m\cdot\tfrac23 n^3$. You redo the identical elimination of $A$, $m$ times.
Factor once, reuse (LU). Compute $A=LU$ one time at cost $\sim\tfrac23 n^3$. Then each right-hand side costs only two triangular solves, $\sim 2n^2$. Total: $\sim\tfrac23 n^3 + m\cdot 2n^2$.

For large $n$ and many right-hand sides, the difference is enormous. The naive cost grows like $m n^3$; the LU cost is dominated by a single $n^3$ plus a cheap $mn^2$ tail. If $n=1000$ and you have $m=500$ right-hand sides, re-eliminating does roughly $500$ jobs of size $n^3$, while LU does one job of size $n^3$ plus $500$ jobs of size $n^2$ — and $n^2$ is a thousand times smaller than $n^3$. The factorization, computed once, pays for itself after the second right-hand side and saves more with every solve thereafter.

Real-World Application — Structural engineering (the same stiffness matrix, many loads). When engineers analyze a structure — a bridge, a building frame, an aircraft wing — they assemble a single large stiffness matrix $K$ relating displacements to forces, and solve $K\mathbf{u}=\mathbf{f}$ for the displacements $\mathbf{u}$ under a load $\mathbf{f}$. But a design must survive many load cases: dead weight, wind from each direction, snow, traffic, seismic shaking, and combinations thereof — often hundreds of right-hand sides $\mathbf{f}$, all against the same $K$. Factoring $K$ once via LU (in practice its symmetric cousin, the Cholesky factorization $K=R^{\mathsf{T}}R$) and reusing the factors for every load case is exactly why finite-element analysis is tractable. Re-solving from scratch per load case would multiply the cost by the number of loads. This chapter's case study works the circuit-simulation version of the same idea.

There is a second, subtler reason factor-once-reuse-many matters, beyond saving time: it keeps the answers consistent. When the same factorization underlies every solve, every right-hand side is processed through the identical $L$ and $U$, so the only thing that varies between solutions is the input $\mathbf{b}$ — not some re-derived, slightly-differently-rounded version of the elimination. In a long simulation that takes thousands of steps, reusing one factorization avoids accumulating thousands of independent rounding-error stories. The factorization becomes a fixed, trusted object you query, rather than a computation you nervously repeat.

Real-World Application — Data pipelines and finance: many systems, shared structure. The factor-once-reuse-many pattern is not confined to physics and engineering. In quantitative finance, calibrating a model or pricing a book of derivatives often means solving $A\mathbf{x}=\mathbf{b}$ for the same covariance or system matrix $A$ against hundreds of scenario vectors $\mathbf{b}$ — one per market shock, stress test, or Monte-Carlo path. In data science, ridge regression solves $(X^{\mathsf{T}}X+\lambda I)\,\boldsymbol\beta = X^{\mathsf{T}}\mathbf{y}$, and a hyperparameter search or cross-validation loop re-solves it for many target vectors $\mathbf{y}$ (or many folds) while the matrix changes little or not at all. In each case the smart implementation factors the shared matrix once — an LU (or, since these matrices are symmetric positive-definite, a Cholesky factorization, LU's specialized cousin from Chapter 28) — and reuses the factors across the whole batch. A naive loop that re-solves from scratch can be tens or hundreds of times slower, turning an overnight job into a five-minute one. The same idea recurs whenever a fixed linear map is queried with a stream of different inputs.

Common Pitfall — Don't recompute the factorization inside the loop. A beginner writing simulation code often calls a full solver solve(A, b_k) once per right-hand side inside a loop — silently redoing the $n^3$ elimination every iteration. The fix is to factor outside the loop and reuse: in scipy, lu, piv = lu_factor(A) once, then lu_solve((lu, piv), b_k) inside the loop. The mathematics is the factor-once-reuse-many idea; the bug is putting the factorization where it gets repeated. (numpy's np.linalg.solve does not cache its factorization between calls, so calling it in a loop is the trap; scipy.linalg.lu_factor/lu_solve is the cure.)

10.7 What is the permutation matrix P, and why does PLU exist when LU doesn't?

We deferred the catch, and now we face it. The plain $A=LU$ story assumed elimination never lands a zero in a pivot position. But Chapter 4 already warned us: sometimes it does, and the remedy is to swap rows. The moment we swap, the clean $A=LU$ breaks — and the bookkeeping for the swaps is a permutation matrix, giving the factorization that real numerical software actually computes.

A permutation matrix $P$ is an identity matrix with its rows reordered. Multiplying $PA$ on the left reorders the rows of $A$ in exactly that pattern. For example, the matrix that swaps rows 1 and 2 of a $3\times3$ matrix is

$$P = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad PA \text{ has rows 1 and 2 of } A \text{ exchanged.}$$

Permutation matrices have two pleasant properties we will use: each one is orthogonal (its inverse is its transpose, $P^{-1}=P^{\mathsf{T}}$ — undoing a shuffle is just shuffling back), and the product of permutation matrices is another permutation matrix. We will meet orthogonal matrices in full generality in Chapter 21; permutations are the simplest examples.

Now suppose elimination on $A$ requires row swaps to dodge zero (or, in practice, small) pivots. The key realization is this: if we knew in advance which swaps the elimination would perform, we could apply them all to $A$ up front, and then the pre-swapped matrix would factor cleanly. Bundle all the swaps into a single permutation $P$. Then the row-reordered matrix $PA$ has an ordinary $LU$ factorization:

Why can the swaps be "front-loaded" like this? In Chapter 4 you saw that a swap is just another elementary row operation, and that the order of operations is what determines the result. The subtlety is that the swaps and the eliminations are interleaved in time — you swap, then eliminate a column, then maybe swap again, then eliminate the next. It is not obvious that you could pull all the swaps to the front. The reason it works is a piece of permutation algebra: a swap applied after some eliminations can be "commuted" back past them, at the cost of relabeling which multipliers go where. When you carry out that bookkeeping carefully (it is exactly what plu in the toolkit does — swap the rows of the partial $L$ along with the rows of the working matrix), all the swaps collect into one net permutation $P$ that you could equally have applied to $A$ at the very start. The upshot is the clean statement that $PA$ — the matrix with its rows pre-shuffled into the order partial pivoting wants — factors with no further swaps at all.

$$\boxed{\,PA = LU\,}$$

This is the PLU factorization (also called LU with partial pivoting). It says: permute the rows of $A$ first (by $P$), and the result factors into unit-lower-triangular $L$ times upper-triangular $U$. And here is the headline theorem, the one that makes LU universally applicable:

Theorem (PLU factorization — existence for every square matrix). Every square matrix $A$ has a factorization $PA=LU$, where $P$ is a permutation matrix, $L$ is unit lower-triangular, and $U$ is upper-triangular. With partial pivoting (always swap up the row whose pivot-column entry has the largest absolute value), such a factorization always exists; if $A$ is invertible the pivots of $U$ are all nonzero.

Unlike plain $A=LU$, which can fail, $PA=LU$ always works for a square matrix — that is why production libraries compute it. The permutation costs almost nothing (it is just relabeling rows), and it buys both universality (no more stalling on zero pivots) and numerical stability (no more dividing by dangerously small pivots, the issue Chapter 4's Computational Note raised).

A worked PLU: the example that forces a swap

Let's factor a matrix that cannot avoid a swap — the zero-pivot trap made concrete:

$$A = \begin{bmatrix} 0 & 2 & 1 \\ 1 & 1 & 0 \\ 2 & 1 & 3 \end{bmatrix}.$$

The top-left entry is $0$. We cannot pivot on it. Partial pivoting says: look down column 1 for the entry of largest absolute value, and swap that row to the top. Column 1 is $(0,1,2)$; the largest magnitude is the $2$ in row 3. So we swap rows 1 and 3. Record this swap in $P$ (it exchanges rows 1 and 3):

$$P = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}, \qquad PA = \begin{bmatrix} 2 & 1 & 3 \\ 1 & 1 & 0 \\ 0 & 2 & 1 \end{bmatrix}.$$

Now eliminate $PA$ as usual. Column 1, pivot $2$. Multipliers: $\ell_{21}=1/2$, $\ell_{31}=0/2=0$:

$$\begin{aligned}R_2 &\leftarrow R_2 - \tfrac12 R_1\\ R_3 &\leftarrow R_3 - 0\cdot R_1\end{aligned}\;\Longrightarrow\; \begin{bmatrix} 2 & 1 & 3 \\ 0 & \tfrac12 & -\tfrac32 \\ 0 & 2 & 1 \end{bmatrix}.$$

(Row 2: $(1,1,0)-\tfrac12(2,1,3)=(0,\tfrac12,-\tfrac32)$.) Column 2: the pivot is $\tfrac12$ in row 2. Partial pivoting would compare $|\tfrac12|$ against $|2|$ below it and prefer to swap — but to keep this hand example clean, suppose we proceed without the second swap (we eliminate with the $\tfrac12$ pivot; this is a valid LU, just not the maximal-pivot one scipy would choose). Multiplier $\ell_{32} = 2 / \tfrac12 = 4$:

$$R_3 \leftarrow R_3 - 4R_2 \;\Longrightarrow\; \begin{bmatrix} 2 & 1 & 3 \\ 0 & \tfrac12 & -\tfrac32 \\ 0 & 0 & 7 \end{bmatrix} = U.$$

(Row 3: $(0,2,1)-4(0,\tfrac12,-\tfrac32)=(0,0,1+6)=(0,0,7)$.) Assemble $L$ from $\ell_{21}=\tfrac12,\ \ell_{31}=0,\ \ell_{32}=4$:

$$L = \begin{bmatrix} 1 & 0 & 0 \\ \tfrac12 & 1 & 0 \\ 0 & 4 & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} 2 & 1 & 3 \\ 0 & \tfrac12 & -\tfrac32 \\ 0 & 0 & 7 \end{bmatrix}, \qquad P = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}.$$

Verify $PA=LU$. Compute $LU$ row 3: $0\cdot(2,1,3) + 4\cdot(0,\tfrac12,-\tfrac32) + 1\cdot(0,0,7) = (0,2,-6+7)=(0,2,1)$ — which is indeed row 3 of $PA$. ✓ Let's confirm in code and also show how to solve with a PLU (you must permute $\mathbf{b}$ the same way):

# PLU by hand vs scipy, and solving Ax=b through the permutation.
import numpy as np
from scipy.linalg import lu_factor, lu_solve
A = np.array([[0., 2., 1.],
              [1., 1., 0.],
              [2., 1., 3.]])
P = np.array([[0., 0., 1.], [0., 1., 0.], [1., 0., 0.]])
L = np.array([[1., 0., 0.], [0.5, 1., 0.], [0., 4., 1.]])
U = np.array([[2., 1., 3.], [0., 0.5, -1.5], [0., 0., 7.]])
print(np.allclose(P @ A, L @ U))     # True -> our hand PLU is exact
# Solve A x = b the production way: factor once, then lu_solve.
b = np.array([5., 2., 9.])
lu_piv = lu_factor(A)                 # packs PLU into (matrix, pivot-vector)
x = lu_solve(lu_piv, b)               # permutes b internally, then 2 triangular solves
print(np.round(x, 4))                 # [0.5714 1.4286 2.1429]
print(np.allclose(A @ x, b))          # True -> checks out

np.allclose(P @ A, L @ U) is True — our hand PLU reconstructs the row-swapped matrix exactly. And lu_factor/lu_solve is the production pattern: it computes the PLU once (with full partial pivoting, so its internal $P$, $L$, $U$ may differ from our hand version, but $A\mathbf{x}=\mathbf{b}$ has one answer and both find it), then solves cheaply. To solve with a PLU by hand, you permute the right-hand side too: from $PA=LU$, the system $A\mathbf{x}=\mathbf{b}$ becomes $LU\mathbf{x}=P\mathbf{b}$, so forward-solve $L\mathbf{y}=P\mathbf{b}$ then back-solve $U\mathbf{x}=\mathbf{y}$. The permutation just reshuffles $\mathbf{b}$ to match the reshuffled rows.

Geometric Intuition — In Chapter 4 we saw each elimination step is a shear, and a swap is a reflection (it exchanges two coordinate axes, flipping orientation). So $PA=LU$ reads, geometrically, as: first reflect/relabel the axes (by $P$), then the transformation factors into a sequence of shears that build a clean staged map (by $L$ and $U$). The permutation is the orientation bookkeeping that lets the shears proceed without dividing by zero. When we reach the determinant in Chapter 11, each swap will contribute a factor of $-1$ (a reflection flips signed volume), which is exactly how the sign of $\det A$ emerges from the number of swaps in its PLU.

Warning — state when pivoting is required. Pivoting is mandatory (not optional) precisely when a pivot position holds a zero and a nonzero entry sits below it — without a swap, you would divide by zero and the algorithm fails outright. Pivoting is advisable even when not mandatory: whenever a pivot is nonzero but small relative to the entries below it, dividing by it amplifies floating-point error, so partial pivoting swaps up the largest-magnitude entry for stability. The first reason (avoiding zero) is about existence; the second (avoiding small) is about accuracy. By hand we pivot to dodge exact zeros; in software, partial pivoting for size is the default, always on. This is the practical face of the floating-point arithmetic limits that every numerical computation lives within, and we return to it in Chapter 38.

10.8 How expensive is LU, and why is it better than computing the inverse?

We have claimed LU is efficient. Let's make the cost precise, because the numbers settle a question students ask constantly: why not just compute $A^{-1}$ and multiply?

The cost of factoring. Forward elimination — which is the factorization — does its work column by column. Clearing column $k$ requires, for each of the $\sim(n-k)$ rows below the pivot, computing one multiplier and updating $\sim(n-k)$ entries: roughly $(n-k)^2$ operations. Summing over all columns,

$$\sum_{k=1}^{n}(n-k)^2 \;\approx\; \sum_{j=1}^{n} j^2 \;\approx\; \frac{n^3}{3}.$$

So the LU factorization costs about $\tfrac13 n^3$ multiply–add pairs (often quoted as $\tfrac23 n^3$ floating-point operations, counting a multiply and an add separately — the convention varies, but the leading term is cubic with a coefficient of $\tfrac13$ or $\tfrac23$). The two triangular solves that follow cost only $\sim n^2$ each. The cubic factorization dominates; the substitutions are nearly free by comparison.

The Key Insight — Solving $A\mathbf{x}=\mathbf{b}$ has two parts of wildly different cost: an $O(n^3)$ factorization and an $O(n^2)$ pair of substitutions. All the expense is in the factorization, and the factorization does not depend on $\mathbf{b}$. That asymmetry — expensive part reusable, cheap part per-right-hand-side — is the entire reason factorization beats re-solving.

Let's put numbers on the factor-once savings, because the abstract counts can hide how dramatic it is. Take $n=1000$ and $m=200$ right-hand sides. Re-eliminating each time costs about $m\cdot\tfrac23 n^3 = 200\cdot\tfrac23\cdot10^9 \approx 1.3\times10^{11}$ operations. Factoring once costs $\tfrac23 n^3\approx 6.7\times10^8$, and the $200$ triangular-solve pairs add only $m\cdot2n^2 = 200\cdot2\times10^6 = 4\times10^8$, for a total of about $1.1\times10^9$ operations. That is more than a hundredfold reduction — and the gap only widens as $m$ grows, because the naive cost scales with $m$ while LU's dominant $n^3$ term is paid just once. The substitutions are so cheap that even hundreds of them barely register against a single factorization. This is the quantitative shape of "do the hard work once."

Why not invert? Suppose instead you compute $A^{-1}$ (Chapter 9's Gauss–Jordan on $[A\mid I]$) and then form $\mathbf{x}=A^{-1}\mathbf{b}$. Two problems, one about cost and one about accuracy:

It costs more. Computing $A^{-1}$ is itself an elimination, but applied to $n$ right-hand sides at once (the $n$ columns of the identity). It runs about $2n^3$ operations — roughly three times the $\tfrac23 n^3$ of a single LU. Then every solve $A^{-1}\mathbf{b}$ is a matrix–vector product costing $2n^2$, the same order as a triangular pair but with a worse constant and after a far more expensive setup. You pay more up front and gain nothing per solve.
It is less accurate. Forming $A^{-1}$ explicitly and multiplying introduces more rounding error than solving the triangular systems directly — the inverse can have large entries that amplify floating-point noise, and you do more arithmetic to produce and then apply it. The error analysis (Chapter 38) is unambiguous: the LU solve is the more stable path.

Hence the rule every numerical-computing course drills into you, and which this book repeats:

Common Pitfall — never invert a matrix to solve a system. Writing x = np.linalg.inv(A) @ b to solve $A\mathbf{x}=\mathbf{b}$ is slower and less accurate than x = np.linalg.solve(A, b). The inverse does about $3\times$ the work of an LU and amplifies rounding error. Compute $A^{-1}$ only when you genuinely need the inverse matrix itself (rare); to solve a system, factor and substitute. This is the practical lesson behind the whole chapter — and np.linalg.solve honors it internally, computing a PLU rather than an inverse.

# Demonstrate: solve() and inv()@b agree, but solve() does less work and is more stable.
import numpy as np
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)
x_solve = np.linalg.solve(A, b)        # PLU factor + 2 triangular solves (preferred)
x_inv   = np.linalg.inv(A) @ b         # form inverse, then multiply (avoid)
print(np.allclose(x_solve, x_inv))     # True -> same answer for a well-conditioned A
print(np.linalg.norm(A @ x_solve - b)) # ~1e-15 -> residual is at machine precision

For a small, well-conditioned matrix the two answers agree (np.allclose is True) — the inverse route is not wrong here, just wasteful. The gap widens as $n$ grows and as the matrix becomes ill-conditioned, where the inverse route loses accuracy first. The residual $\lVert A\mathbf{x}-\mathbf{b}\rVert$ for the solve is at the $10^{-15}$ level — machine precision — which is the standard the LU-based solver meets and the inversion route degrades from.

There is a broader principle hiding in the "never invert" rule, and it is worth extracting because it recurs all over computational mathematics: prefer the operation that does only what you need. Solving $A\mathbf{x}=\mathbf{b}$ asks for one vector, $\mathbf{x}$. Computing $A^{-1}$ asks for an entire matrix — the answer to "what does $A$ do to every possible right-hand side at once" — which is far more information than the single solve requires, and information you then have to pay to compute, store, and multiply through. Whenever a problem tempts you to build a big general object (an inverse, a full eigendecomposition, a dense factorization of a sparse matrix) on the way to a small specific answer, pause: there is usually a cheaper, more stable path that targets the answer directly. The matrix inverse is the canonical cautionary example — mathematically elegant, computationally a trap — and internalizing why is part of growing from someone who knows linear algebra into someone who can compute with it.

These cost arguments are the heart of numerical methods: the same matrix factored once and reused, the cubic-versus-quadratic split between factorization and solve, and the hard rule against explicit inversion. Chapter 38 returns to when even a careful LU strains (ill-conditioning, the condition number), but the operation-count story is already enough to explain why $A=LU$, not $A^{-1}$, is how the world solves linear systems.

10.9 What does the from-scratch LU look like in code?

You have factored by hand, solved by hand, and watched scipy agree. The deepest way to own LU is to write it yourself, so this chapter's toolkit contribution is the factorization and its solver, in pure Python, using scipy only to check. Let's sketch the logic; the full module is assembled separately, and the callout names exactly what you implement.

The structure mirrors what you did by hand. lu_decompose(A) runs forward elimination without pivoting, recording each multiplier $\ell_{ij}=M_{ij}/M_{jj}$ in $L$ at the moment it is used to clear entry $(i,j)$, while $M$ becomes $U$. It is the §10.3 procedure with the multipliers stored instead of discarded. plu(A) adds partial pivoting: before clearing each column, find the largest-magnitude entry in that column at-or-below the diagonal, swap that row up (recording the swap in a permutation), and also swap the already-computed multipliers in $L$ to keep the rows consistent. It returns $P,L,U$ with $PA=LU$. solve_lu(L,U,b) does the two substitutions of §10.6: forward-solve $L\mathbf{y}=\mathbf{b}$, then back-solve $U\mathbf{x}=\mathbf{y}$. Here is the heart of the no-pivot factorization — the loop you will flesh out:

# Sketch of LU without pivoting (pure Python, no numpy). Math is 1-indexed; code is 0-indexed.
def lu_decompose(A):
    n = len(A)
    U = [list(map(float, row)) for row in A]          # working copy -> becomes U
    L = [[1.0 if i == j else 0.0 for j in range(n)]   # identity -> fills with multipliers
         for i in range(n)]
    for col in range(n):                              # column being cleared
        pivot = U[col][col]
        if abs(pivot) < 1e-12:
            raise ValueError("zero pivot -- this matrix needs PLU (pivoting)")
        for r in range(col + 1, n):                   # rows below the pivot
            mult = U[r][col] / pivot                  # the multiplier l_{r,col}
            L[r][col] = mult                          # *record* it (don't discard!)
            for k in range(col, n):                   # update the row -> creates the zero
                U[r][k] -= mult * U[col][k]
    return L, U

That inner arithmetic — mult = U[r][col] / pivot then U[r][k] -= mult * U[col][k] — is literally the "subtract $\ell$ times the pivot row" you did with pencil, and the single new line L[r][col] = mult is the whole difference between Chapter 4's elimination and LU: you keep the multiplier. The plu version wraps this with a row-swap search at the top of each column, exactly the partial pivoting of §10.7. The nested loops make it $O(n^3)$, the cost we counted in §10.8.

A detail worth knowing, even though our from-scratch version keeps $L$ and $U$ as separate arrays for clarity: production libraries store both factors in a single $n\times n$ array, the same size as the original $A$. The trick is that $U$ occupies the upper triangle (including the diagonal) and the multipliers of $L$ occupy the strictly lower triangle — and since $L$'s diagonal is always $1$, it never needs to be stored at all. As elimination zeros out an entry below the diagonal, that slot is now free, so the algorithm overwrites it with the very multiplier that created the zero. The factorization thus happens in place, consuming no extra memory beyond a small pivot vector recording the swaps. This is why scipy.linalg.lu_factor returns a single combined array plus a pivot list rather than two full matrices: it is the memory-frugal form of exactly the procedure you are writing. Your version trades that frugality for readability, which is the right call when the goal is understanding rather than speed.

There is one more correspondence worth making explicit, because it ties this chapter back to Chapter 4 and forward to Chapter 11. The lu_decompose loop above is Chapter 4's forward_eliminate with two changes: it drops the augmented right-hand-side column (LU factors $A$ alone, not $[A\mid\mathbf{b}]$, precisely so the result can serve any $\mathbf{b}$ later), and it records L[r][col] = mult instead of discarding the multiplier. Everything else — the column sweep, the multiplier, the row update — is identical. That is the sense in which LU is Gaussian elimination, just remembered: the same loop, with the scratch work saved.

Build Your Toolkit — Add toolkit/lu.py with three functions, all in pure Python (no numpy in the implementation): - lu_decompose(A) — return $(L, U)$ with $A=LU$, $L$ unit lower-triangular holding the multipliers, $U$ the upper-triangular echelon form, by the §10.3 procedure. Raise a clear error if a zero (or near-zero) pivot is hit, since this version does not pivot. - plu(A) — return $(P, L, U)$ with $PA=LU$ using partial pivoting (swap up the largest-magnitude entry in each column; remember to permute the partial $L$ rows along with the swap). Return $P$ as a matrix (or a permutation list). - solve_lu(L, U, b) — solve $A\mathbf{x}=\mathbf{b}$ given the factors by forward-substituting $L\mathbf{y}=\mathbf{b}$ then back-substituting $U\mathbf{x}=\mathbf{y}$ (§10.6). For a PLU, accept the permutation and apply it to $\mathbf{b}$ first.

Verify, don't trust: for random matrices, check np.allclose(L @ U, A) for lu_decompose, check np.allclose(P @ A, L @ U) for plu, and compare solve_lu against np.linalg.solve; cross-check your factors against scipy.linalg.lu. Mind the indexing: the math writes the multiplier "$\ell_{21}$ in row 2, column 1"; your code writes L[1][0]. This module builds directly on the elimination you wrote in Chapter 4 and feeds Chapter 11, where the determinant is read straight off the diagonal of $U$ (times the sign of the permutation).

When your plu(A) returns factors with np.allclose(P @ A, L @ U) flashing True across a hundred random matrices — including ones with zero first pivots that would have broken lu_decompose — you will have built the exact algorithm running inside np.linalg.solve and scipy.linalg.lu_factor. The libraries are faster (decades of cache-tuning and pivoting heuristics in LAPACK), but the idea is the one you wrote: eliminate once, remember the multipliers, reuse forever.

10.10 Putting it together: factorization as a reusable tool

Let's consolidate the whole chapter into the procedure you now command, and — just as importantly — into the framing the style bible insists on.

Given a square matrix $A$ and one or more right-hand sides:

Factor once. Compute $A=LU$ (if no pivoting is needed) or, in general and always in software, $PA=LU$ with partial pivoting. This is the expensive $O(n^3)$ step, and it depends only on $A$.
For each right-hand side $\mathbf{b}$, permute if needed ($P\mathbf{b}$), then forward-substitute $L\mathbf{y}=P\mathbf{b}$ and back-substitute $U\mathbf{x}=\mathbf{y}$. Each is a cheap $O(n^2)$ solve.
Reuse. Every additional $\mathbf{b}$ costs only step 2. The factorization is computed once and amortized across all the solves.

Three steps, and they are how essentially every dense linear system on Earth gets solved — inside circuit simulators, finite-element solvers, optimization routines, and the solve function of every numerical library. The factorization is not a curiosity; it is the production algorithm.

It is worth naming what changed in your thinking over this chapter, because it is more than a new computation. In Chapter 4 you saw $A\mathbf{x}=\mathbf{b}$ as a single event — pose the system, eliminate, read the answer, done. This chapter reframes it as a relationship: the matrix $A$ is a fixed object you will interrogate many times, and the smart move is to invest once in understanding it (the factorization) so that each interrogation (each $\mathbf{b}$) is cheap. That shift — from "solve this system" to "prepare this matrix for repeated solving" — is the mental model behind almost all of numerical linear algebra, and it is why libraries expose a factor step separately from a solve step. You will see the same separation when you precompute a QR factorization for repeated least-squares fits (Chapter 20) or an eigendecomposition for repeated matrix powers (Chapter 25). The matrix is the asset; the factorization is how you make the asset reusable.

But hold the framing the chapter opened with, because it is the deeper point. LU is a tool, not the destination. Its job is to make solving fast and to remember an elimination so you never repeat it — and in doing so it gives you your first taste of the book's recurring big idea, that factoring a matrix exposes structure you can reuse. $L$ and $U$ are not arbitrary; they are the elimination itself, frozen and reusable. That same instinct — factor the matrix into simpler, meaningful pieces — will return as the eigendecomposition (Chapter 25), the spectral theorem (Chapter 27), the QR factorization (Chapter 20), and the singular value decomposition (Chapter 30), each factoring a matrix to reveal something different: invariant directions, orthogonal structure, the deepest "rotate–stretch–rotate" geometry of any linear map. LU reveals the humblest but most immediately useful structure of all — how to solve, cheaply, again and again.

Historical Note — The systematic study of matrix factorization for solving linear systems is usually credited to Alan Turing, whose 1948 paper Rounding-off Errors in Matrix Processes analyzed the $LU$ factorization (he called the factors a "triangular resolution") and its error behavior, laying groundwork for modern numerical linear algebra [verify]. The partial-pivoting strategy and its stability were developed in the 1940s–60s, with James Wilkinson's error analysis becoming the definitive treatment [verify]. The underlying elimination, of course, is far older — Gauss, and the Chinese Nine Chapters before him (Chapter 4) — but the idea of keeping the multipliers as a reusable factorization, rather than discarding them, is a twentieth-century insight born from the needs of automatic computation. As so often, the algorithm is ancient; the realization that it should be remembered is what the computer age contributed.

In the next chapter we stay with elimination but ask a different question of it: not "what is the solution?" but "by how much did the transformation scale space?" The answer is the determinant, and — beautifully — it falls right out of the LU factorization you just built. The product of the pivots on the diagonal of $U$, times $(-1)$ for each row swap recorded in $P$, is $\det A$. The visualizer's det annotation, which has been quietly reporting area-scaling since Chapter 1, is about to get its full meaning, and the factorization of this chapter is the fastest way to compute it. The tool you built to solve systems efficiently turns out to also measure how a transformation stretches space.