Matrix Operations: Addition, Multiplication, and the Surprising Non-Commutativity

DataField.Dev

50 min read

> Learning paths. Math majors — read everything, especially the derivation of the product rule in §8.4 and the proof of associativity in §8.10; the transpose's role as the adjoint (§8.9) is teased here and made rigorous in Chapter 18. CS / Data...

Prerequisites

chapter-07-matrices-as-functions

Learning Objectives

Add matrices and multiply them by scalars, and read both operations geometrically as combining transformations entrywise.
Define matrix multiplication AS the composition of two linear transformations, and DERIVE the row-times-column product rule from 'apply B, then apply A' rather than memorizing it.
Explain, with a rotation-then-shear versus shear-then-rotation picture, why matrix multiplication is not commutative in general.
State and use the transpose, the identity matrix, and the associativity of matrix multiplication, including the order-reversal rule for the transpose of a product.
Use the recurring 2D visualizer to confirm that A∘B and B∘A produce different pictures from the same two matrices.
Implement matmul(A, B) from scratch in toolkit/matrices.py as composition, and verify it against numpy's @ operator.

In This Chapter

8.1 What does it mean to add two matrices?
8.2 What does scaling a matrix do to its transformation?
8.3 What is matrix multiplication, really? (Composition of transformations)
8.4 How do you derive the product rule from composition?
8.5 How should you read the row-times-column rule now?
8.6 What does a composition look like, step by step?
8.7 Why is matrix multiplication not commutative?
8.8 What happens when you multiply a matrix by itself? (Powers)
8.9 What is the transpose, and what does it mean?
8.10 What are the algebraic laws of matrix multiplication?
8.11 What is the identity matrix, and why is it the "1" of matrix multiplication?
8.12 How do all the operations fit together?

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Matrix Operations: Addition, Multiplication, and the Surprising Non-Commutativity

Learning paths. Math majors — read everything, especially the derivation of the product rule in §8.4 and the proof of associativity in §8.10; the transpose's role as the adjoint (§8.9) is teased here and made rigorous in Chapter 18. CS / Data Science — focus on the Geometric Intuition callouts, the composition picture, the numpy snippets, and the two applications; matrix multiplication is the single most-executed operation in machine learning, so the "weighted sum of columns, one column at a time" reading pays off daily. Physics / Engineering — focus on composition as "chain the motions," the non-commutativity picture, and the rotation/shear experiments in the visualizer; carry the image of two transformations applied in sequence. This is the chapter where the slogan from Chapter 7 — a matrix is a function that transforms space — gets an algebra: you learn how to combine transformations, and discover that the order matters.

A note on this chapter (§9). Chapter 7 established that a matrix is a linear transformation and that the matrix-vector product is the weighted sum of the columns. This chapter asks the next question that practically writes itself: a matrix is a verb, so what happens when you do two of them? We will define addition and scalar multiplication quickly (they are easy and entrywise), and then spend the heart of the chapter on multiplication — which we introduce as composition first, "apply $B$, then apply $A$," and from which we derive the famous row-times-column rule rather than handing it down. The payoff is twofold: the rule will finally make sense, and the most counterintuitive fact in Part II — that $AB \ne BA$ — will become not a surprise but a near-certainty.

8.1 What does it mean to add two matrices?

We have spent Chapter 7 learning to see a single matrix as a motion of space — a stretch, a turn, a shear. The natural next move is to ask how matrices combine, and there are two very different ways to combine two transformations. You can add them, or you can chain them (apply one after the other). These turn out to be wildly different operations, and the contrast between them is the spine of this chapter. Addition is the gentle one, so we start there.

Two matrices of the same shape are added the way you would guess: entry by entry. If $A$ and $B$ are both $m\times n$, then $A + B$ is the $m\times n$ matrix whose $(i,j)$ entry is $a_{ij} + b_{ij}$. For $2\times 2$ matrices, $$A + B = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} + \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11}+b_{11} & a_{12}+b_{12} \\ a_{21}+b_{21} & a_{22}+b_{22} \end{bmatrix}.$$ That is the whole definition. You may only add matrices of matching dimensions; adding a $2\times 3$ to a $3\times 2$ is undefined, because there is no sensible pairing of entries. This is the algebraic echo of a geometric fact: addition combines two transformations that act on the same input space and land in the same output space, so their dimensions must match.

What does adding matrices do to the transformations they represent? Recall from Chapter 7 that $A\mathbf{v}$ is the image of $\mathbf{v}$ under $A$. The sum behaves exactly as you would hope under multiplication by a vector: $$(A + B)\mathbf{v} = A\mathbf{v} + B\mathbf{v}.$$ In words: transform $\mathbf{v}$ by $A$, transform it separately by $B$, then add the two output arrows tip-to-tail. The matrix $A + B$ is the single transformation that produces that combined output in one step. This is a parallel combination — both transformations act on the original input, and the results are summed — and it is exactly the structure behind a residual connection in a neural network (Chapter 33) or the superposition of two force fields in physics.

Geometric Intuition — Matrix addition is the parallel combination of two transformations. Send the same vector $\mathbf{v}$ through $A$ and through $B$, then add the resulting arrows. Picture it on the basis vectors: the first column of $A + B$ is where $A$ sends east plus where $B$ sends east, the two destination-arrows joined tip-to-tail. Addition never "chains" the transformations — both always act on the same untransformed input.

Let's add a concrete pair and confirm with numpy. We take $A = \begin{bmatrix} 1 & 2 \\ 3 & 4\end{bmatrix}$ and $B = \begin{bmatrix} 5 & 6 \\ 7 & 8\end{bmatrix}$.

# Matrix addition is entrywise; both matrices must share the same shape.
import numpy as np
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print((A + B).tolist())     # entry (i,j) is a_ij + b_ij
print((A - B).tolist())     # subtraction is also entrywise

[[6, 8], [10, 12]]
[[-4, -4], [-4, -4]]

Each output entry is the sum (or difference) of the corresponding inputs: $1+5=6$, $2+6=8$, and so on. Subtraction works the same way, entry by entry.

It is worth reading the geometric meaning off the columns, because it cements the "parallel combination" picture. The first column of $A$ is $(1,3)$ — where $A$ sends east — and the first column of $B$ is $(5,7)$ — where $B$ sends east. The first column of $A+B$ is their sum, $(6,10)$ — where the combined transformation sends east, namely the tip-to-tail sum of the two separate destinations. Addition simply adds the destination-arrows column by column. There is no chaining, no "first this then that": both $A$ and $B$ act on the original east vector, and we add the two results. This is precisely why addition is the gentle, commutative operation — $A + B = B + A$, because adding arrows does not care about order — in sharp contrast to the multiplication we are building toward, where order will be everything.

Matrix addition obeys all the familiar rules of ordinary addition: it is commutative ($A + B = B + A$), associative ($(A+B)+C = A+(B+C)$), it has an identity (the zero matrix $O$, all entries zero, with $A + O = A$), and every matrix has a negative ($A + (-A) = O$). These are inherited directly from the fact that the entries are ordinary numbers, added independently slot by slot. None of this will survive intact for multiplication — which is exactly what makes multiplication the interesting operation.

FAQ: Why can't I add matrices of different shapes?

Because addition pairs up entries by position, and there is no honest pairing when the grids do not line up. A $2\times 3$ matrix has an entry in row 2, column 3; a $3\times 2$ matrix has no such slot. Geometrically, an $m\times n$ matrix is a transformation from $\mathbb{R}^n$ to $\mathbb{R}^m$ (Chapter 7), so two matrices can be added only when they take the same-dimensional input and produce the same-dimensional output — when they are the same kind of transformation. Adding mismatched matrices would be like trying to average a function of one variable with a function of two: the question is ill-posed. (numpy will sometimes "broadcast" mismatched arrays, stretching one to fit the other; that is a numpy convenience, not matrix addition, and §8.11's Computational Note warns you about it.)

8.2 What does scaling a matrix do to its transformation?

The second easy operation is scalar multiplication: multiply every entry by the same number $c$. If $A$ is $m\times n$ and $c$ is a scalar, $$cA = c\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} = \begin{bmatrix} ca_{11} & ca_{12} \\ ca_{21} & ca_{22} \end{bmatrix}.$$ This dilates the transformation. Since the columns of $A$ are the images of the basis vectors, multiplying every entry by $c$ multiplies every destination-arrow's length by $|c|$ (and flips it if $c<0$). The whole transformed picture grows or shrinks uniformly. Concretely, $(cA)\mathbf{v} = c(A\mathbf{v})$: applying $cA$ is the same as applying $A$ and then scaling the result by $c$. So $2A$ is "do $A$, then double everything," and $\tfrac{1}{2}A$ is "do $A$, then halve everything."

# Scalar multiplication scales every entry, hence every destination-arrow.
import numpy as np
A = np.array([[1, 2], [3, 4]])
print((3 * A).tolist())     # triples each entry: 3A

[[3, 6], [9, 12]]

Each entry tripled: the transformation $3A$ sends every vector three times as far out along the same direction $A$ would have sent it.

These two operations — addition and scalar multiplication — make the set of all $m\times n$ matrices into a vector space in the precise sense of Chapter 5. You can add matrices, scale them, and the eight vector-space axioms all hold (the zero matrix is the additive identity, $-A = (-1)A$ is the additive inverse, and so on). This is the first concrete example of a vector space whose "vectors" are not arrows or lists of coordinates but transformations themselves. We will lean on this when we treat the space of all linear maps in Chapter 35.

The Key Insight — Matrix addition and scalar multiplication are entrywise and easy, and they make the $m\times n$ matrices a vector space (Chapter 5). They are the "linear combination" operations on transformations. The interesting operation — the one that is neither entrywise nor easy, and the real subject of this chapter — is multiplication, which is not "multiply entry by entry" but something far richer: composition.

Common Pitfall — Multiplying matrices entrywise. The single most common beginner error is to assume matrix multiplication works like addition — entry by entry — so that the $(i,j)$ entry of $AB$ is $a_{ij}b_{ij}$. It is not. That entrywise product does exist and has a name (the Hadamard product, written $A \odot B$, used in some machine-learning contexts), but it is not matrix multiplication and it does not compose transformations. Matrix multiplication is the row-times-column operation we are about to derive, and it exists precisely so that $AB$ represents "do $B$, then do $A$." Keep the two ideas in separate drawers.

8.3 What is matrix multiplication, really? (Composition of transformations)

Now the heart of the chapter. We are going to define the product of two matrices, and we will do it the right way — geometrically, as composition — before any formula appears. If you have seen matrix multiplication before as "march along the row, down the column," set that aside for a few pages. We are going to earn that rule, and watch it fall out of something you already understand.

Here is the setup. A matrix is a transformation. So take two transformations, $B$ and $A$, and chain them: apply $B$ to a vector $\mathbf{x}$, and then apply $A$ to the result. The output is $$A(B\mathbf{x}).$$ First $B$ moves $\mathbf{x}$ somewhere; then $A$ moves that somewhere else. This two-step motion is itself a transformation of space — feed in $\mathbf{x}$, get out a final vector — and, crucially, it is a linear transformation, because the composition of two linear maps is linear (we check this in §8.10). By the master theorem of Chapter 7, every linear transformation is represented by a single matrix. So the combined "do $B$, then do $A$" motion has its own matrix. We name that matrix $AB$ and call it the product of $A$ and $B$.

$$\boxed{\,(AB)\mathbf{x} = A(B\mathbf{x}) \quad\text{for every vector } \mathbf{x}.\,}$$

That boxed equation is the definition of matrix multiplication. The product $AB$ is, by definition, the one matrix that does in a single step what "$B$ then $A$" does in two. Everything else in this chapter — the row-times-column rule, the shape requirement, the non-commutativity, the associativity — is a consequence of this one geometric idea.

The Key Insight — Matrix multiplication is composition of transformations. The product $AB$ is the single matrix representing "apply $B$ first, then apply $A$." Read $AB$ right-to-left, like nested function notation $f(g(x))$: the matrix on the right acts first. This is the whole secret of the subject. The row-times-column rule is not a definition — it is the formula you get when you compute the columns of this composite transformation.

Pause on the right-to-left reading, because it trips up everyone at first. We write $AB$, with $A$ on the left, yet $B$ is the one that acts first. This is not a quirk of matrices; it is inherited from function notation. When you write $f(g(x))$, you evaluate $g$ first (it is closest to $x$), then feed its output to $f$. Matrix-vector multiplication $A(B\mathbf{x})$ is exactly this: $B$ is closest to the vector, so $B$ goes first. The combined matrix $AB$ inherits the order. Chapter 7's §7.6 previewed this with "rotate-after-scale," and now it becomes the official definition.

Geometric Intuition — Composing transformations is composing motions of space. "$B$ then $A$" is a single combined journey: every point in the plane is first dragged to a new spot by $B$, then dragged again by $A$. The product matrix $AB$ records the net displacement of that two-step journey. As always, you only need to know where the basis vectors end up — and that observation, in the next section, hands us the product rule for free.

Common Pitfall — Reading $AB$ left-to-right as "$A$ then $B$." Because we read English left-to-right, it is tempting to read the product $AB$ as "do $A$, then $B$." It is the reverse: $AB$ means "do $B$, then $A$." If you want "do $A$ first, then $B$," you must write $BA$. A reliable mantra: the matrix nearest the vector acts first. When in doubt, append an explicit input vector $\mathbf{x}$ and parse $AB\mathbf{x}$ as $A(B\mathbf{x})$ — the parentheses never lie.

8.4 How do you derive the product rule from composition?

We have a definition of $AB$ — the matrix of "$B$ then $A$" — but not yet a recipe for its entries. Let us derive the recipe, using nothing but the columns-as-images principle from Chapter 7. This is the derivation the rest of your linear-algebra life rests on, so we go slowly.

Recall the central fact of Chapter 7: the columns of a matrix are the images of the basis vectors. To find the matrix of any transformation, we ask where it sends $\mathbf{e}_1$ and $\mathbf{e}_2$ and stack the answers as columns. The composite "$B$ then $A$" is a transformation, so we apply the same recipe. Where does the composite send $\mathbf{e}_1$? It sends it to $$A(B\mathbf{e}_1).$$ But $B\mathbf{e}_1$ is just the first column of $B$ (Chapter 7: multiplying by a basis vector extracts a column). Call the columns of $B$ by name: $B = \big[\,\mathbf{b}_1 \;\; \mathbf{b}_2\,\big]$, where $\mathbf{b}_1, \mathbf{b}_2$ are its columns. Then the first column of the composite is $A\mathbf{b}_1$, and the second column is $A\mathbf{b}_2$. Stacking: $$AB = \big[\, A\mathbf{b}_1 \;\;\; A\mathbf{b}_2 \,\big].$$

The Key Insight — The columns of $AB$ are $A$ applied to the columns of $B$. Column $j$ of the product is $A\mathbf{b}_j$, the image under $A$ of the $j$-th column of $B$. This is the cleanest possible statement of matrix multiplication, and it follows directly from "columns are images of basis vectors." You multiply $A$ against each column of $B$ in turn — and each of those products is a weighted sum of the columns of $A$, by Chapter 7. Composition all the way down.

This is already a complete, usable algorithm: to multiply $AB$, apply $A$ (the weighted-sum-of-columns way) to each column of $B$. But let us push one more step and recover the scalar formula everyone knows, so you can see exactly where it comes from. Write $A$ and $B$ out in full $2\times 2$ form: $$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \qquad B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}.$$ The first column of $B$ is $\mathbf{b}_1 = \begin{bmatrix} b_{11} \\ b_{21}\end{bmatrix}$. Apply $A$ to it as a weighted sum of $A$'s columns (Chapter 7): $$A\mathbf{b}_1 = b_{11}\begin{bmatrix} a_{11} \\ a_{21}\end{bmatrix} + b_{21}\begin{bmatrix} a_{12} \\ a_{22}\end{bmatrix} = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} \\ a_{21}b_{11} + a_{22}b_{21} \end{bmatrix}.$$ That is the first column of $AB$. Doing the same with $\mathbf{b}_2 = \begin{bmatrix} b_{12} \\ b_{22}\end{bmatrix}$ gives the second column, and assembling both: $$AB = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix}.$$

Now look at the top-left entry: $a_{11}b_{11} + a_{12}b_{21}$. That is exactly the first row of $A$ dotted with the first column of $B$. The top-right entry is the first row of $A$ dotted with the second column of $B$. In general:

$$\boxed{\,(AB)_{ij} = \sum_{k} a_{ik}\,b_{kj} = (\text{row } i \text{ of } A)\cdot(\text{column } j \text{ of } B).\,}$$

There is the row-times-column rule — but now you see it is not a rule at all. It is the arithmetic shadow of composition. The dot of row $i$ of $A$ with column $j$ of $B$ is computing the $i$-th coordinate of "$A$ applied to the $j$-th column of $B$," which is the $i$-th coordinate of where the composite sends $\mathbf{e}_j$. Every step traces back to "$B$ then $A$." You never have to memorize the rule again: derive it in three lines from columns-as-images whenever you need it.

Geometric Intuition — There are three equally valid ways to read the same product $AB$, and switching between them is a sign of fluency. (1) Composition: $AB$ is "do $B$, then $A$." (2) Columns: the $j$-th column of $AB$ is $A$ applied to the $j$-th column of $B$. (3) Entries: $(AB)_{ij}$ is row $i$ of $A$ dotted with column $j$ of $B$. The first is what it means; the second is the cleanest way to think; the third is how you compute by hand. They are one fact in three costumes.

Let's compute one entirely by hand and check it. Take $$A = \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 & 4 \\ 2 & 5 \end{bmatrix}.$$ The columns way: the first column of $AB$ is $A$ applied to $\mathbf{b}_1 = (1,2)$, which is $1\cdot(2,0) + 2\cdot(1,3) = (4,6)$. The second column is $A$ applied to $\mathbf{b}_2 = (4,5)$, which is $4\cdot(2,0) + 5\cdot(1,3) = (13,15)$. So $$AB = \begin{bmatrix} 4 & 13 \\ 6 & 15 \end{bmatrix}.$$ Check the top-left by the entry rule: row 1 of $A$ is $(2,1)$, column 1 of $B$ is $(1,2)$, dot product $2\cdot 1 + 1\cdot 2 = 4$. ✓. Now numpy, where @ is matrix multiplication:

# Matrix product as composition: column j of A@B is A applied to column j of B.
import numpy as np
A = np.array([[2, 1], [0, 3]])
B = np.array([[1, 4], [2, 5]])
print((A @ B).tolist())               # the full product
print((A @ B[:, 0]).tolist(),         # A applied to column 0 of B  -> first column
      (A @ B[:, 1]).tolist())         # A applied to column 1 of B  -> second column

[[4, 13], [6, 15]]
[4, 6] [13, 15]

The full product is $\begin{bmatrix}4 & 13 \\ 6 & 15\end{bmatrix}$, exactly our hand result, and the second line confirms the columns-as-composition reading: $A$ applied to $B$'s first column gives $(4,6)$, the first column of the product; applied to $B$'s second column gives $(13,15)$, the second. (Note the index shift from Chapter 7: $B$'s first column is B[:, 0] in numpy, because numpy counts from 0 while our math counts from 1.)

Check Your Understanding — Without writing the full product, what is the second column of $AB$ for $A = \begin{bmatrix} 1 & 0 \\ 1 & 1\end{bmatrix}$ and $B = \begin{bmatrix} 3 & 2 \\ 0 & 4\end{bmatrix}$?

Answer

The second column of $AB$ is $A$ applied to the second column of $B$, which is $\mathbf{b}_2 = (2,4)$. As a weighted sum of $A$'s columns: $2\cdot(1,1) + 4\cdot(0,1) = (2, 6)$. So the second column of $AB$ is $\begin{bmatrix} 2 \\ 6\end{bmatrix}$ — and you found it without touching the first column. This is the columns-as-composition view doing its job: each column of the product is an independent little matrix-vector computation.

FAQ: Why must the inner dimensions match to multiply matrices?

Composition explains the shape rule with no memorization. To form $AB$, you apply $A$ to the output of $B$. So $B$'s outputs must be valid inputs to $A$. If $B$ is $n\times p$ (it eats $p$-vectors and produces $n$-vectors) and $A$ is $m\times n$ (it eats $n$-vectors and produces $m$-vectors), then $B$'s $n$-dimensional outputs feed perfectly into $A$, and the product $AB$ is $m\times p$ — it eats $p$-vectors (like $B$) and produces $m$-vectors (like $A$). The two inner dimensions (the $n$ that $B$ outputs and the $n$ that $A$ ingests) must agree; the outer dimensions ($m$ and $p$) become the shape of the product. Matrices whose inner dimensions match are called conformable. When they don't match, the composition is geometrically meaningless — $B$ would be handing $A$ vectors of the wrong size — and the product is undefined.

Let's see a non-square example, where the shape bookkeeping is the whole point. Let $A$ be $2\times 3$ and $B$ be $3\times 2$:

# Inner dimensions must match: (2x3)(3x2) is valid and yields a 2x2 product.
import numpy as np
A = np.array([[1, 2, 3], [4, 5, 6]])    # 2x3: maps R^3 -> R^2
B = np.array([[1, 0], [0, 1], [1, 1]])  # 3x2: maps R^2 -> R^3
print((A @ B).tolist())                 # B first (R^2->R^3), then A (R^3->R^2): R^2->R^2
print((A @ B).shape)                    # outer dimensions: (2, 2)

[[4, 5], [10, 11]]
(2, 2)

The composite eats a 2-vector, $B$ lifts it into $\mathbb{R}^3$, $A$ drops it back to $\mathbb{R}^2$, and the net transformation is the $2\times 2$ matrix $\begin{bmatrix}4 & 5 \\ 10 & 11\end{bmatrix}$. The product is $2\times 2$ — the outer dimensions — exactly as the inner-dimension rule predicts.

8.5 How should you read the row-times-column rule now?

In Chapter 7 we deliberately refused the "row dotted with vector" view of the matrix-vector product, because it hides the geometry; we promised it would return, properly motivated, when matrix-matrix multiplication arrived. That moment is now, so let us pay the debt and rehabilitate the row picture — not as a rule from the sky, but as one of several legitimate readings of a product, each useful for a different purpose.

We have already seen two readings: $AB$ is the composition "do $B$, then $A$" (what it means), and the columns of $AB$ are $A$ applied to the columns of $B$ (the cleanest way to think). The entry formula $(AB)_{ij} = (\text{row } i \text{ of } A)\cdot(\text{column } j \text{ of } B)$ gives a third. But there are two more, and seeing all of them turns matrix multiplication from a single trick into a flexible instrument.

The row reading. Just as the columns of $AB$ are $A$ acting on the columns of $B$, the rows of $AB$ are the rows of $A$ acting on $B$ from the left. The $i$-th row of the product is (row $i$ of $A$) combined with the rows of $B$ — a weighted sum of $B$'s rows, with the entries of $A$'s $i$-th row as the weights. This is the honest meaning of "row times column": it computes one row of the output at a time. It is the natural view when you care about the outputs coordinate by coordinate — for instance, when each row of $A$ represents one neuron's weights in a network layer and you want that neuron's response to every input.

The outer-product reading. There is a strikingly different way to assemble the same product: as a sum of outer products. Pair the $k$-th column of $A$ with the $k$-th row of $B$, multiply them (a column times a row gives a full matrix, called an outer product), and add up these matrices over all $k$: $$AB = \sum_k (\text{column } k \text{ of } A)(\text{row } k \text{ of } B).$$ Each term is a rank-one matrix, and the product is their sum. This decomposition looks exotic now, but it is the secret heart of the singular value decomposition (Chapter 30) and of low-rank approximation and image compression (Chapter 31): there, a matrix is written as a sum of a few outer products, and keeping only the largest terms compresses it. The outer-product view is matrix multiplication read as "build the product from rank-one pieces."

The Key Insight — A single product $AB$ admits five equivalent readings, and fluency is the ability to switch between them at will: (1) composition "do $B$ then $A$" — the meaning; (2) columns of $AB$ are $A$ times columns of $B$; (3) rows of $AB$ are rows of $A$ times $B$; (4) entries $(AB)_{ij}$ are row $i$ of $A$ dotted with column $j$ of $B$ — for hand computation; (5) outer products summed over the inner index — for decompositions. The row-times-column rule of Chapter 7's promise is reading (4); it is correct and useful, but it is the computation, never the meaning.

The lesson Chapter 7 insisted on still stands, just enriched. The row-dotted-with-column recipe is the right way to grind out the arithmetic by hand or in a tight inner loop, and you should be fluent with it. But when you want to understand what a product does, return to composition and columns; when you want to decompose a matrix, reach for outer products. The five readings are five views of one object, and the best linear algebraists hold all five at once.

8.6 What does a composition look like, step by step?

The product rule is now derived, but a formula on the page can still feel abstract. Let us slow all the way down and watch a single composition unfold geometrically, tracking the unit square through both stages, so that "the product is the combined motion" becomes something you can picture rather than just assert. We will compose two transformations you already know from Chapter 7: scaling and rotation.

Let $B$ be the non-uniform scaling that stretches east by $3$ and leaves north alone, and let $A$ be a counterclockwise quarter-turn: $$B = \begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix} \quad(\text{stretch east by } 3), \qquad A = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \quad(\text{rotate } 90°).$$ We want the product $AB$, which by our definition means "do $B$, then do $A$" — first stretch, then rotate. Watch what happens to the basis vectors, one stage at a time.

Stage 1 — apply $B$ (the stretch). East $\mathbf{e}_1 = (1,0)$ stretches to $(3, 0)$, a long arrow pointing east. North $\mathbf{e}_2 = (0,1)$ is untouched, staying at $(0,1)$. The unit square has become a wide $3\times 1$ rectangle lying along the $x$-axis.

Stage 2 — apply $A$ (the rotation) to those results. Now we rotate the stretched picture a quarter-turn counterclockwise. The long east arrow $(3,0)$ swings up to $(0, 3)$. The unchanged north arrow $(0,1)$ swings to $(-1, 0)$. The wide rectangle, having been turned, now stands tall: a $1 \times 3$ rectangle reaching up the $y$-axis.

The final landing spots of the basis vectors are $(0,3)$ and $(-1,0)$ — and those, by columns-as-images, are exactly the columns of the product $AB$: $$AB = \begin{bmatrix} 0 & -1 \\ 3 & 0 \end{bmatrix}.$$ Let us confirm the whole story with numpy — both the product and the intermediate stage — and the determinant, which should be the product of the two determinants ($\det B = 3$ times $\det A = 1$, so $\det(AB) = 3$):

# A composition step by step: stretch east by 3 (B), then rotate 90 deg (A).
import numpy as np
B = np.array([[3, 0], [0, 1]])     # do this first: stretch east by 3
A = np.array([[0, -1], [1, 0]])    # then this: rotate 90 deg
print("after B, e1 ->", (B @ np.array([1, 0])).tolist(),    # stretched east arrow
      " e2 ->", (B @ np.array([0, 1])).tolist())            # unchanged north arrow
print("AB =", (A @ B).tolist())                             # the combined transformation
print("det(AB) =", int(round(np.linalg.det(A @ B))),        # equals det(A)*det(B)
      "= det(A)*det(B) =", int(round(np.linalg.det(A) * np.linalg.det(B))))

after B, e1 -> [3, 0]  e2 -> [0, 1]
AB = [[0, -1], [3, 0]]
det(AB) = 3 = det(A)*det(B) = 3

The output matches our hand-traced picture exactly: after $B$, east is at $(3,0)$ and north at $(0,1)$; after also applying $A$, the combined matrix is $\begin{bmatrix}0 & -1 \\ 3 & 0\end{bmatrix}$, whose columns are the final landing spots $(0,3)$ and $(-1,0)$ read down each column. And the determinant of the product is the product of the determinants — a preview of one of the cleanest theorems in the book, $\det(AB) = \det(A)\det(B)$, which we prove in Chapter 11 and which says "the area-scaling of a composite is the product of the area-scalings," exactly as common sense demands when you stretch by 3 and then turn without further stretching.

Geometric Intuition — A composition is a relay race: the unit square is handed from $B$ to $A$, and each runner deforms it in turn. To find the product matrix, you do not need to memorize anything — you watch where the two basis arrows finish after both legs of the relay, and write those finishing positions down as the columns. The product matrix $AB$ is just the net deformation, with the intermediate stage erased.

Check Your Understanding — Compose in the other order for the same $A$ and $B$ above: compute $BA$ ("rotate first, then stretch east by 3") by tracking the basis vectors, and confirm it differs from $AB$.

Answer

$BA$ means "do $A$ (rotate $90°$) first, then $B$ (stretch east by 3)." Track east: rotation sends $(1,0)\to(0,1)$, then the stretch leaves $(0,1)$ alone (it has no east component), so east finishes at $(0,1)$. Track north: rotation sends $(0,1)\to(-1,0)$, then the stretch triples the east component, sending $(-1,0)\to(-3,0)$, so north finishes at $(-3,0)$. Thus $BA = \begin{bmatrix} 0 & -3 \\ 1 & 0\end{bmatrix}$, which is not $AB = \begin{bmatrix} 0 & -1 \\ 3 & 0\end{bmatrix}$. Same two transformations, opposite order, different result — exactly the non-commutativity we study next. (numpy: B @ A returns [[0, -3], [1, 0]].)

8.7 Why is matrix multiplication not commutative?

Here is the most famous surprise in elementary linear algebra, and the question search engines see constantly: why is matrix multiplication not commutative? For ordinary numbers, $3 \times 5 = 5 \times 3$ — order never matters. For matrices, $AB$ and $BA$ are usually different matrices, and often one of them is not even defined. Once you hold the composition picture, this stops being a strange algebraic accident and becomes almost obvious: $AB$ means "do $B$, then $A$," and $BA$ means "do $A$, then $B$" — and the order in which you transform space matters.

Think of everyday composed actions. Put on your socks, then your shoes — fine. Reverse the order — shoes, then socks — and you get a different, sillier result. Open the door, then walk through; versus walk through, then open the door. Composition of actions is generally order-dependent, and linear transformations are no exception. The shock would be if matrix multiplication were commutative.

The Key Insight — $AB \ne BA$ in general because composition is order-dependent: doing $B$ then $A$ is a different motion of space from doing $A$ then $B$. Numbers commute because scaling-by-3-then-by-5 equals scaling-by-5-then-by-3 — but most transformations are not mere scalings, and rotating-then-shearing is genuinely different from shearing-then-rotating. Order matters because geometry has order.

Let us see it, using the recurring 2D visualizer from Chapter 1 — the anchor experiment of this chapter. We take two transformations whose order plainly matters: a $90°$ rotation $R$ and a horizontal shear $S$ with factor $1$. $$R = R(90°) = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, \qquad S = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}.$$ We will compute both orders. Remember the right-to-left reading: $RS$ means "shear first, then rotate," and $SR$ means "rotate first, then shear." Here is the visualizer, reused verbatim exactly as frozen in the style bible:

# toolkit/visualizer.py — the recurring 2D transformation visualizer.
# Shows what a 2x2 matrix A does to the unit square and the basis vectors.
import numpy as np
import matplotlib.pyplot as plt

def visualize_2d(A, title="", ax=None):
    """Plot the action of 2x2 matrix A on the unit square and i-hat, j-hat."""
    A = np.asarray(A, dtype=float)
    square = np.array([[0, 1, 1, 0, 0],
                       [0, 0, 1, 1, 0]])          # unit-square corners (closed)
    out = A @ square                               # transformed square
    e1, e2 = A @ np.array([1, 0]), A @ np.array([0, 1])   # images of basis vectors
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 5))
    ax.plot(square[0], square[1], "b--", lw=1, label="input (unit square)")
    ax.fill(out[0], out[1], alpha=0.25, color="C1")
    ax.plot(out[0], out[1], "C1-", lw=2, label="A · (unit square)")
    ax.arrow(0, 0, *e1, color="C3", width=0.02, length_includes_head=True)  # A e1
    ax.arrow(0, 0, *e2, color="C2", width=0.02, length_includes_head=True)  # A e2
    ax.axhline(0, color="gray", lw=0.5)
    ax.axvline(0, color="gray", lw=0.5)
    ax.set_aspect("equal")
    ax.grid(True, alpha=0.3)
    ax.set_title(title or f"det = {np.linalg.det(A):.2f}")
    ax.legend(loc="best", fontsize=8)
    return ax

Now we compute the two products and visualize each, side by side:

# A∘B vs B∘A: same two matrices, opposite order, different pictures.
import numpy as np, matplotlib.pyplot as plt
from toolkit.visualizer import visualize_2d
R = np.array([[0, -1], [1, 0]])     # rotate 90 deg counterclockwise
S = np.array([[1, 1], [0, 1]])      # horizontal shear, k = 1
print("RS (shear first, then rotate) =", (R @ S).tolist())
print("SR (rotate first, then shear) =", (S @ R).tolist())
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
visualize_2d(R @ S, title="RS: shear, then rotate", ax=ax1)
visualize_2d(S @ R, title="SR: rotate, then shear", ax=ax2)
plt.show()

RS (shear first, then rotate) = [[0, -1], [1, 1]]
SR (rotate first, then shear) = [[1, -1], [1, 0]]

The two product matrices are plainly different: $RS = \begin{bmatrix}0 & -1 \\ 1 & 1\end{bmatrix}$ but $SR = \begin{bmatrix}1 & -1 \\ 1 & 0\end{bmatrix}$. They are not equal, not even close — and the visualizer shows you why in pictures.

Figure 8.1. Order matters: $RS$ versus $SR$. Two panels, each transforming the same blue dashed unit square. Left ($RS$ = shear then rotate): the square is first slanted into a rightward-leaning parallelogram by the shear, and then that parallelogram is pivoted a quarter-turn counterclockwise; the red arrow ($A\mathbf{e}_1$) lands at $(0,1)$ and the green arrow ($A\mathbf{e}_2$) at $(-1,1)$. Right ($SR$ = rotate then shear): the square is first turned a quarter-turn (so east points up, north points left), and then the whole rotated figure is slanted by the shear; the red arrow lands at $(1,1)$ and the green at $(-1,0)$. The two orange parallelograms have the same area (both determinants are $1$) but different shapes and orientations — visible proof that $RS \ne SR$. Alt-text: Two side-by-side parallelograms produced from the same unit square by a shear and a rotation in opposite orders; the parallelograms are differently slanted, showing that swapping the order of two transformations changes the result.

To make the geometric story unmistakable, follow east ($\mathbf{e}_1$) through each pipeline by hand. In $RS$ (shear first), the shear leaves east at $(1,0)$ — points on the $x$-axis don't move under a horizontal shear — and then the rotation turns $(1,0)$ into $(0,1)$. So $RS$ sends east to $(0,1)$, matching the first column of $RS$. In $SR$ (rotate first), the rotation turns east $(1,0)$ into $(0,1)$, and then the shear sends $(0,1)$ to $(1,1)$ (north slides right by 1 under the shear). So $SR$ sends east to $(1,1)$, matching the first column of $SR$. East ends up in two genuinely different places — $(0,1)$ versus $(1,1)$ — because the operations were applied in different orders. That is non-commutativity, made of nothing but tracking a basis vector through two pipelines.

Geometric Intuition — Shearing then rotating tilts the square before turning it; rotating then shearing turns it before tilting. The tilt and the turn interfere with each other differently depending on which happens first, so the final parallelograms differ. This is the entire content of $AB \ne BA$: two motions of space, composed in opposite orders, generally land you in different places. The visualizer turns the abstract inequality into two pictures you can hold side by side.

For a second, even more dramatic illustration, compose a projection with a rotation — and watch the order decide whether information survives. Let $P = \begin{bmatrix}1 & 0 \\ 0 & 0\end{bmatrix}$ be the projection onto the $x$-axis (Chapter 7's flattening map, which crushes the vertical direction), and let $R = \begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix}$ be the $90°$ rotation. Consider what each order does to the unit square. The product $RP$ means "project first, then rotate": the square is first flattened onto the $x$-axis (becoming a horizontal segment), and then that segment is turned a quarter-turn into a vertical segment on the $y$-axis. The product $PR$ means "rotate first, then project": the square is first turned a quarter-turn, and then flattened onto the $x$-axis (becoming a horizontal segment again). Both results are one-dimensional segments — projection destroyed a dimension either way — but they point in perpendicular directions, so the two products are different:

# Order changes which dimension survives: project-then-rotate vs rotate-then-project.
import numpy as np
P = np.array([[1, 0], [0, 0]])      # project onto the x-axis (flatten)
R = np.array([[0, -1], [1, 0]])     # rotate 90 deg
print("RP (project first, then rotate) =", (R @ P).tolist())
print("PR (rotate first, then project) =", (P @ R).tolist())

RP (project first, then rotate) = [[0, 0], [1, 0]]
PR (rotate first, then project) = [[0, -1], [0, 0]]

The two products $RP = \begin{bmatrix}0 & 0 \\ 1 & 0\end{bmatrix}$ and $PR = \begin{bmatrix}0 & -1 \\ 0 & 0\end{bmatrix}$ are different matrices — and reading their columns tells the whole story. Under $RP$, east lands at $(0,1)$ (it survives the projection, then rotates up to the $y$-axis) while north is annihilated; under $PR$, east is annihilated (it rotates to the $y$-axis, then the projection crushes it) while north lands at $(-1,0)$. Whether east or north is the survivor depends entirely on which transformation you do first. This is non-commutativity with teeth: in applications where one of your operations loses information — a projection, a quantization, a rounding step — doing it in the wrong order can throw away exactly the data you needed to keep. Both $RP$ and $PR$ are singular (determinant $0$), since a projection can never be undone, but they are singular in different directions.

When does the order not matter? Sometimes $AB = BA$ — we then say $A$ and $B$ commute — but it is the exception, not the rule. A matrix always commutes with itself, with the identity, and with its own powers and inverse; a uniform scaling ($cI$) commutes with everything (scaling by $c$ then doing anything equals doing anything then scaling by $c$); and two diagonal matrices commute with each other. Two rotations of the plane commute, because rotating by $\alpha$ then $\beta$ is rotating by $\alpha+\beta$ either way — but this is special to 2D; rotations in 3D famously do not commute, which is why the order you turn a Rubik's cube matters. Outside these structured cases, assume $AB \ne BA$ until proven otherwise.

There is a deeper reason commuting is rare, and it foreshadows the heart of the book. Two transformations commute essentially when they "respect each other's special directions" — more precisely, when they share a common set of eigenvectors (Chapter 23). A uniform scaling commutes with everything because every direction is special to it (it stretches all directions equally), so it cannot conflict with any other map's preferred directions. Two diagonal matrices commute because they share the same special directions, the coordinate axes. But a generic pair of transformations have different preferred directions that pull against each other, and applying them in different orders lands you in different places. We will not have the tools to make this precise until eigenvectors arrive, but you can already file away the intuition: commuting is a statement about shared structure, and most pairs of matrices share none. This is why $AB = BA$ is the surprising special case and $AB \ne BA$ is the default.

Common Pitfall — Canceling or reordering factors as if matrices were numbers. Because $AB \ne BA$, almost every algebraic reflex from high school needs care. You may not rewrite $AB$ as $BA$. The identity $(A+B)^2 = A^2 + 2AB + B^2$ is false for matrices; the honest expansion is $(A+B)^2 = A^2 + AB + BA + B^2$, and you cannot merge the middle terms because $AB$ and $BA$ differ. Likewise $AB = AC$ does not let you cancel $A$ to conclude $B = C$ unless $A$ is invertible (Chapter 9). Treat matrix algebra as a new game with its own rules, and double-check any step you imported from scalar arithmetic.

Real-World Application — Composing transformations in a game engine. Every frame your computer renders, it composes a chain of matrices to place an object in the world and then onto your screen: a model matrix (where the object sits and how it's oriented), then a view matrix (where the camera is), then a projection matrix. The renderer multiplies them once — $P V M$ — and applies the single product to every vertex, exactly the "compose first, then transform many points" efficiency we will exploit in Chapter 12. And the non-commutativity is not academic: scaling a model and then rotating it gives a tidy turned object, while rotating and then scaling along an axis produces a sheared, distorted mess. Animators who place objects in the wrong matrix order get famously bizarre bugs — limbs that stretch as they swing — precisely because $AB \ne BA$. See how this drives the rendering pipeline in transformations in games.

8.8 What happens when you multiply a matrix by itself? (Powers)

Composition with a single matrix repeated has a special name and a special importance: the powers of a matrix. Just as $A^2$ means $A\cdot A$ for a number, for a square matrix $A^2 = AA$ means "apply $A$ twice," $A^3 = AAA$ means "apply $A$ three times," and $A^n$ means "apply $A$ a total of $n$ times." (Powers only make sense for square matrices, because to compose $A$ with itself the output space must equal the input space — the inner dimensions must match, which forces $A$ to be $n\times n$.) Associativity (§8.10) is what makes $A^n$ unambiguous: it does not matter how you group the $n$ copies, the result is the same combined transformation.

Powers are where matrix multiplication earns its keep in applications, because iterating a transformation is everywhere — each step of a dynamical system, each round of a Markov chain, each layer of a repeated process is one more multiplication by the same matrix. Two clean examples show the pattern.

Powers of a shear add up. Let $S = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}$, the horizontal shear that slides north right by 1. Applying it twice should slide north right by 2, and indeed $S^2 = \begin{bmatrix}1 & 2 \\ 0 & 1\end{bmatrix}$; in general $S^n = \begin{bmatrix}1 & n \\ 0 & 1\end{bmatrix}$, a shear by $n$. The shears compose by adding their factors, which is the geometric reason a shear is so well-behaved.

Powers of a rotation add the angles. Let $R(\theta)$ be the rotation by angle $\theta$. Composing two rotations adds their angles, so $R(\theta)^n = R(n\theta)$ — applying a $45°$ rotation twice gives a $90°$ rotation, and applying it eight times gives a full $360°$ turn, which is the identity. This is the matrix expression of the angle-addition formulas from trigonometry, hidden inside a product.

# Powers iterate a transformation. Shear^n shears by n; Rotation^n adds angles.
import numpy as np
S = np.array([[1, 1], [0, 1]])                       # horizontal shear, k = 1
print("S^3 =", np.linalg.matrix_power(S, 3).tolist())  # shear by 3
theta = np.radians(45)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print("R(45)^2 =", np.round(np.linalg.matrix_power(R, 2), 4).tolist())  # = R(90)
print("R(45)^8 =", np.round(np.linalg.matrix_power(R, 8), 4).tolist())  # = identity

S^3 = [[1, 3], [0, 1]]
R(45)^2 = [[0.0, -1.0], [1.0, -0.0]]
R(45)^8 = [[1.0, 0.0], [-0.0, 1.0]]

The shear cubed is a shear by 3, just as predicted; the $45°$ rotation squared is the $90°$ rotation $\begin{bmatrix}0 & -1 \\ 1 & 0\end{bmatrix}$ (the tiny $-0.0$ is floating-point rounding of an exact zero); and the $45°$ rotation to the eighth power is the identity, because $8 \times 45° = 360°$ brings space all the way around. (np.linalg.matrix_power(A, n) computes $A^n$ efficiently; for large $n$ it uses repeated squaring rather than $n$ separate multiplications.)

Here is the catch that makes powers genuinely interesting, and that motivates a huge swath of the second half of this book. Computing $A^n$ by brute force costs $n-1$ matrix multiplications, which is expensive when $n$ is large (PageRank may iterate thousands of times). Worse, for a generic matrix it is hard to see what $A^n$ does — the entries jumble together with each multiplication. The escape is diagonalization (Chapter 25): if you can write $A = PDP^{-1}$ with $D$ diagonal, then $A^n = PD^nP^{-1}$, and raising a diagonal matrix to the $n$-th power is trivial (just raise each diagonal entry). The entire payoff of eigenvalues and eigenvectors (Chapter 23) is, in large part, the ability to compute and understand high powers of a matrix — to predict the long-run behavior of an iterated transformation. Powers are the bridge from this chapter's algebra to the heart of the book.

Geometric Intuition — A power $A^n$ is the transformation you get by running the same motion $n$ times in a row. If $A$ spirals points outward a little, $A^n$ spirals them out a lot; if $A$ nudges a population toward equilibrium, $A^n$ for large $n$ shows the equilibrium itself. The long-run question "where does everything end up if I keep applying $A$ forever?" is the question of $A^n$ as $n\to\infty$, and its answer is governed by the eigenvalues of $A$ — the largest one wins, which is exactly the principle behind PageRank in Chapter 29.

There is one more reading of matrix powers that has nothing to do with transforming geometric space, and it is so useful in computer science that it deserves its own moment: powers of an adjacency matrix count paths in a graph. Represent a network — web pages, social-media follows, road connections — by its adjacency matrix $A$, where $a_{ij} = 1$ if there is a direct edge from node $j$ to node $i$ and $0$ otherwise. Then the matrix product reveals reachability: the entry $(A^2)_{ij}$ counts the number of two-step paths from $j$ to $i$, and in general $(A^n)_{ij}$ counts the $n$-step paths. The reason is exactly the entry rule from §8.4 read combinatorially: $(A^2)_{ij} = \sum_k a_{ik}a_{kj}$ adds up, over every possible intermediate node $k$, a $1$ whenever there is an edge $j\to k$ and an edge $k\to i$ — that is, it counts the ways to get from $j$ to $i$ by hopping through one middle node. Composition of the "one-step reachability" map with itself is "two-step reachability."

# Powers of an adjacency matrix count paths. Triangle graph on 3 nodes.
import numpy as np
A = np.array([[0, 1, 1],     # undirected triangle: every node joined to the other two
              [1, 0, 1],
              [1, 1, 0]])
print("A^2 =", (A @ A).tolist())   # (i,j) = number of 2-step walks from j to i

A^2 = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]

Read the result: each diagonal entry is $2$ (from any node there are two 2-step walks back to itself — out to either neighbor and back), and each off-diagonal entry is $1$ (between any two nodes there is exactly one 2-step walk, through the third node). The same matrix multiplication that composes geometric transformations is here composing reachability relations, and it underlies graph algorithms, network analysis, and the link structure that PageRank (Chapter 29) turns into a ranking.

Real-World Application — Counting connections in a social network (computer science / network science). Recommendation systems that suggest "people you may know" often rest on exactly this idea. Store the friendship graph as an adjacency matrix $A$; then $A^2$ counts mutual friends — the entry $(A^2)_{ij}$ is the number of common connections between people $i$ and $j$ — and a large off-diagonal entry in $A^2$ where $A$ itself is zero flags two strangers with many friends in common, prime candidates to recommend to each other. Going further, $A^3$ counts friends-of-friends-of-friends, the basis of "degrees of separation" analyses. Each additional power composes one more hop through the network, which is matrix multiplication as the repeated composition of a single reachability step — the same operation a graphics engine uses to compose transforms, applied to a completely different kind of object.

8.9 What is the transpose, and what does it mean?

We pause from composition to meet a second important operation on a single matrix: the transpose. It is mechanically simple and geometrically deep, and it threads through orthogonality (Part IV), the spectral theorem (Chapter 27), and the SVD (Chapter 30), so it is worth meeting properly now.

To transpose a matrix, flip it across its main diagonal: rows become columns and columns become rows. The $(i,j)$ entry of $A^{\mathsf{T}}$ is the $(j,i)$ entry of $A$. An $m\times n$ matrix becomes $n\times m$. For example, $$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \;\longrightarrow\; A^{\mathsf{T}} = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}.$$ The first row of $A$, $(1,2,3)$, became the first column of $A^{\mathsf{T}}$. The diagonal entries stay fixed; everything else reflects across it. (Throughout this book we write the transpose as $A^{\mathsf{T}}$, with a sans-serif T, never $A^T$.)

# Transpose flips rows and columns across the main diagonal.
import numpy as np
A = np.array([[1, 2, 3], [4, 5, 6]])
print(A.T.tolist())     # numpy's .T attribute returns the transpose

[[1, 4], [2, 5], [3, 6]]

The $2\times 3$ matrix became $3\times 2$, with rows and columns swapped.

The transpose is more than bookkeeping. Its deepest meaning is about the dot product, which we develop fully in Chapter 18, but the seed is this: for any matrix $A$ and vectors $\mathbf{x}, \mathbf{y}$ of the right sizes, $$(A\mathbf{x})\cdot\mathbf{y} = \mathbf{x}\cdot(A^{\mathsf{T}}\mathbf{y}).$$ The transpose is the transformation that lets you "move $A$ to the other side of a dot product." This makes $A^{\mathsf{T}}$ the adjoint of $A$, and it is the reason the transpose governs everything about angles, projections, and orthogonality. Let us check the identity once on numbers so it is not just a symbol. Take $A = \begin{bmatrix}2 & 1 \\ 0 & 3\end{bmatrix}$, $\mathbf{x} = (1, 1)$, $\mathbf{y} = (1, 2)$. Then $A\mathbf{x} = (3, 3)$ and $(A\mathbf{x})\cdot\mathbf{y} = 3\cdot 1 + 3\cdot 2 = 9$. On the other side, $A^{\mathsf{T}} = \begin{bmatrix}2 & 0 \\ 1 & 3\end{bmatrix}$, so $A^{\mathsf{T}}\mathbf{y} = (2, 7)$ and $\mathbf{x}\cdot(A^{\mathsf{T}}\mathbf{y}) = 1\cdot 2 + 1\cdot 7 = 9$. Both sides equal $9$: the transpose moved $A$ across the dot product without changing the value. This single identity is the engine of orthogonal projection (Chapter 19), least-squares regression (Chapter 17), and the entire theory of orthogonality in Part IV. A matrix that equals its own transpose, $A = A^{\mathsf{T}}$, is called symmetric; symmetric matrices are the best-behaved objects in all of linear algebra (the spectral theorem of Chapter 27 says they can always be diagonalized by a rotation), and they show up wherever there is a natural symmetry — covariance matrices in statistics, the Hessian of second derivatives in optimization, the adjacency matrix of an undirected graph.

Geometric Intuition — The transpose is not the inverse, and it is not "the transformation run backward" — that is the inverse $A^{-1}$ (Chapter 9). The transpose is subtler: it is the adjoint, the operator that mirrors $A$ across the dot product. For one special and important class — the orthogonal matrices (rotations and reflections, Chapter 21) — the transpose happens to equal the inverse, $A^{\mathsf{T}} = A^{-1}$, because reversing a rotation and reflecting it across the dot product turn out to be the same thing. But that is a special coincidence of distance-preserving maps, not the general meaning. Keep "transpose = adjoint" and "inverse = undo" in separate compartments.

The transpose interacts with multiplication in a way that catches everyone off guard — and the surprise is, once again, non-commutativity in disguise. Transposing a product reverses the order: $$\boxed{\,(AB)^{\mathsf{T}} = B^{\mathsf{T}} A^{\mathsf{T}}.\,}$$ Not $A^{\mathsf{T}} B^{\mathsf{T}}$ — the factors flip. There is a clean reason rooted in composition: $AB$ means "do $B$, then $A$"; undoing or mirroring that compound action naturally reverses the steps, so the transpose of "do $B$ then $A$" involves $A^{\mathsf{T}}$ first. (This is the same order-reversal you will see for inverses in Chapter 9: $(AB)^{-1} = B^{-1}A^{-1}$, like taking off shoes before socks.) Let's verify:

# Transposing a product reverses the order: (AB)^T = B^T A^T, NOT A^T B^T.
import numpy as np
A = np.array([[2, 1], [0, 3]])
B = np.array([[1, 4], [2, 5]])
print((A @ B).T.tolist())          # transpose of the product
print((B.T @ A.T).tolist())        # B^T A^T  -- should match
print((A.T @ B.T).tolist())        # A^T B^T  -- the WRONG order, for contrast

[[4, 6], [13, 15]]
[[4, 6], [13, 15]]
[[2, 4], [13, 17]]

The first two lines agree — $(AB)^{\mathsf{T}} = B^{\mathsf{T}}A^{\mathsf{T}} = \begin{bmatrix}4 & 6 \\ 13 & 15\end{bmatrix}$ — while the third line, $A^{\mathsf{T}}B^{\mathsf{T}}$, gives a different matrix entirely. The order reversal is not optional.

Common Pitfall — Writing $(AB)^{\mathsf{T}} = A^{\mathsf{T}}B^{\mathsf{T}}$. This is wrong; the transpose reverses the order: $(AB)^{\mathsf{T}} = B^{\mathsf{T}}A^{\mathsf{T}}$. The same reversal holds for inverses (Chapter 9). The easy memory hook is the dressing analogy: to reverse the process of putting on socks ($B$) then shoes ($A$), you take off shoes ($A^{\mathsf{T}}$ first) then socks ($B^{\mathsf{T}}$). Order-reversal is the signature of any operation that "undoes" or "mirrors" a composition.

FAQ: How is the transpose different from the inverse?

They answer different questions. The inverse $A^{-1}$ answers "what transformation undoes $A$?" — apply $A$ then $A^{-1}$ and you are back where you started ($A^{-1}A = I$). It exists only when $A$ loses no information (Chapter 9), and computing it is real work. The transpose $A^{\mathsf{T}}$ answers "how does $A$ behave with respect to the dot product?" — it is the adjoint, and it always exists for every matrix (just flip across the diagonal), costing nothing. For most matrices $A^{\mathsf{T}} \ne A^{-1}$; they coincide only for orthogonal matrices (Chapter 21), which is exactly what makes those matrices so special. A good slogan: the inverse reverses the motion; the transpose reflects the measurement.

8.10 What are the algebraic laws of matrix multiplication?

Matrix multiplication is missing commutativity, but it keeps the other structural laws you would want — and each one has a clean reason in terms of composition. Let us collect them, because the rest of the book uses them constantly. Throughout, assume the matrices have conformable shapes.

Associativity: $(AB)C = A(BC)$. You may regroup a product of three matrices any way you like (without reordering them). This is the most important law in the list, and composition makes it transparent: $(AB)C$ and $A(BC)$ both mean "do $C$, then $B$, then $A$" — the same three-step motion of space, just with the parentheses describing which two steps you bundle first. Since function composition is associative — $f\circ(g\circ h) = (f\circ g)\circ h$, both being "apply $h$, then $g$, then $f$" — and matrices represent functions, their multiplication inherits associativity for free.

Why we care. Associativity is what makes the product $ABC$ unambiguous (no parentheses needed) and is the silent workhorse behind every chained transformation and every efficient computation — choosing where to put the parentheses can change the speed of a calculation enormously without changing the answer (a fact called matrix-chain optimization, central to deep-learning compilers).

Key idea. Both groupings are the same composite function, so they send every vector to the same place; matrices equal on every vector are equal.

Proof. Let $A, B, C$ be conformable. For an arbitrary input vector $\mathbf{x}$, apply each side using the composition definition repeatedly: $$\big((AB)C\big)\mathbf{x} = (AB)(C\mathbf{x}) = A\big(B(C\mathbf{x})\big),$$ where the first equality is the definition of the product $(AB)$ acting on the vector $C\mathbf{x}$, and the second unfolds $AB$ acting on $B(C\mathbf{x})$'s predecessor. On the other side, $$\big(A(BC)\big)\mathbf{x} = A\big((BC)\mathbf{x}\big) = A\big(B(C\mathbf{x})\big).$$ Both equal $A(B(C\mathbf{x}))$. So $(AB)C$ and $A(BC)$ send every vector $\mathbf{x}$ to the identical output. Two matrices that agree on every vector are the same matrix (apply them to the basis vectors: they have the same columns). Hence $(AB)C = A(BC)$. $\blacksquare$

What this means. Associativity is "composition doesn't care how you bundle the steps." It is not commutativity — you still may not reorder $A, B, C$; you may only regroup them.

# Associativity: (AB)C == A(BC). Order is fixed; only the grouping changes.
import numpy as np
A = np.array([[2, 1], [0, 3]])
B = np.array([[1, 4], [2, 5]])
C = np.array([[0, -1], [1, 0]])
print(((A @ B) @ C).tolist())     # group the left pair first
print((A @ (B @ C)).tolist())     # group the right pair first

[[13, -4], [15, -6]]
[[13, -4], [15, -6]]

Both groupings give $\begin{bmatrix}13 & -4 \\ 15 & -6\end{bmatrix}$, confirming associativity.

Distributivity over addition: matrix multiplication distributes across sums, on both sides: $$A(B + C) = AB + AC, \qquad (A + B)C = AC + BC.$$ We need both left- and right-distributive laws separately, precisely because multiplication is not commutative — you cannot derive one from the other by swapping. Each holds because applying a linear map to a sum splits across the sum (additivity, Chapter 7).

The Key Insight — Matrix multiplication is associative and distributive, but not commutative. In the language of abstract algebra, square matrices form a ring (you can add, subtract, and multiply with these laws) — but a non-commutative ring, which is what makes linear algebra richer and stranger than ordinary arithmetic. Almost every identity you trust for numbers survives except the freedom to swap factors.

8.11 What is the identity matrix, and why is it the "1" of matrix multiplication?

Every number system has a multiplicative identity — the number $1$, which leaves everything unchanged: $1\cdot a = a$. Matrix multiplication has one too: the identity matrix $I$, the square matrix with $1$s on the main diagonal and $0$s everywhere else. In $2\times 2$ and general $n\times n$ form, $$I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad I_n = \begin{bmatrix} 1 & & \\ & \ddots & \\ & & 1\end{bmatrix}.$$ We met $I$ in Chapter 7 as the matrix of the transformation that does nothing — its columns are the basis vectors themselves, so it sends $\mathbf{e}_1$ to $\mathbf{e}_1$ and $\mathbf{e}_2$ to $\mathbf{e}_2$, leaving every vector fixed: $I\mathbf{v} = \mathbf{v}$. Composition makes its role as the multiplicative identity immediate: for any conformable $A$, $$AI = A \qquad\text{and}\qquad IA = A.$$ "Do nothing, then do $A$" is just "do $A$"; "do $A$, then do nothing" is just "do $A$." From either side, composing with the identity changes nothing. (When the shapes differ, the two $I$'s have different sizes: if $A$ is $m\times n$, then $I_m A = A = A I_n$.)

# The identity is the "1" of matrix multiplication: AI = IA = A.
import numpy as np
A = np.array([[2, 1], [0, 3]])
I = np.eye(2, dtype=int)          # 2x2 identity: ones on the diagonal
print((A @ I).tolist(), (I @ A).tolist())

[[2, 1], [0, 3]] [[2, 1], [0, 3]]

Multiplying by $I$ on either side returns $A$ unchanged. Notice that $I$ is one of the rare matrices that commutes with everything — $AI = IA$ for all $A$ — which makes sense, since "do nothing" never interferes with the order of anything else.

The identity is the anchor for the next chapter. A matrix $A$ has an inverse $A^{-1}$ when there is a transformation that undoes it, returning every vector to its start: $A^{-1}A = AA^{-1} = I$. Composition makes the goal vivid — "do $A$, then undo it" must equal "do nothing" — and the identity $I$ is precisely "do nothing." Chapter 9 takes up exactly when such an undoing transformation exists (when $A$ loses no information, i.e. flattens nothing) and how to compute it.

Computational Note — A few numpy pitfalls around the operations in this chapter. (1) The matrix product is @ (or np.matmul/np.dot), not *. The star A * B computes the entrywise Hadamard product, not matrix multiplication — a silent and common bug, since for same-shaped matrices it runs without error but returns the wrong thing. Always use @ for composition. (2) Build the identity with np.eye(n); np.ones((n, n)) is the all-ones matrix, which is not the identity. (3) numpy broadcasts mismatched shapes in + and *, so A + 1 adds 1 to every entry and a row vector may stretch to fit a matrix — convenient, but it means a shape bug in addition can pass silently rather than erroring. When you mean matrix addition, make sure both operands genuinely share a shape.

Build Your Toolkit — Add matmul(A, B) to toolkit/matrices.py, implemented from scratch as composition — no numpy in the implementation. Your Chapter 7 contribution already gave you apply(A, v) (matrix-vector product as the weighted sum of columns). Build matmul on top of it: the $j$-th column of $AB$ is apply(A, col_j_of_B), so loop over the columns of $B$, apply $A$ to each, and assemble the results as the columns of the product. (Equivalently, write the triple loop result[i][j] += A[i][k] * B[k][j] — but the column-by-column version makes the composition literal, and is the spirit of this chapter.) Validate that the inner dimensions match (len(B) == len(A[0])), raising a clear error otherwise. Then verify against numpy on several shapes: np.allclose(np.array(matmul(A, B)), np.array(A) @ np.array(B)) should be True for square and non-square conformable pairs. Confirm non-commutativity in your own code by checking that matmul(R, S) != matmul(S, R) for the rotation and shear of §8.7.

8.12 How do all the operations fit together?

Step back and see the shape of what we have built. We started with two easy ways to combine transformations — addition and scalar multiplication, both entrywise — that make matrices into a vector space, a structure you can take linear combinations in. Then we introduced the rich operation, multiplication, which is not entrywise at all but composition: $AB$ is the single matrix representing "do $B$, then do $A$." From that one geometric definition, three things followed without any memorization. The product rule (row times column) fell out as the arithmetic of computing the composite's columns. The shape requirement (inner dimensions match) fell out of "$B$'s outputs must be valid inputs to $A$." And the non-commutativity ($AB \ne BA$) fell out of the plain fact that the order of two motions matters — which the visualizer let us see as two different parallelograms.

We also met the transpose $A^{\mathsf{T}}$ (the adjoint, reflecting $A$ across the dot product, with the order-reversing rule $(AB)^{\mathsf{T}} = B^{\mathsf{T}}A^{\mathsf{T}}$), confirmed the structural laws (associativity and distributivity, but not commutativity), and crowned the system with the identity matrix $I$, the "do nothing" transformation that is the $1$ of matrix multiplication. Together these operations turn the set of matrices into a non-commutative ring — an algebra of transformations.

Geometric Intuition — Two ways to combine transformations, two flavors of meaning. Addition is parallel: send the same input through both maps and sum the outputs. Multiplication is serial: send the input through one map, then the other. Parallel is gentle and commutative ($A + B = B + A$); serial is rich and non-commutative ($AB \ne BA$). Almost everything interesting in linear algebra — and in the deep-learning systems built on it — lives in the serial composition of many transformations, one after another.

The thread now runs straight into Chapter 9. We have "do nothing" ($I$) and we have "do $A$"; the obvious missing piece is "undo $A$." When can a transformation be reversed, and how do we compute the matrix that reverses it? The answer ties back to the projection of Chapter 7 — the transformation that flattened the plane onto a line and lost information — because the matrices you can undo are exactly the ones that lose nothing. That is the inverse, and it is next.

Real-World Application — State transitions and the Markov step (economics / data science). In Chapter 7 we met the labor-market state vector $\mathbf{x} = (e, u)$ and a transition matrix $A$ that advances it one month: $\mathbf{x}_{\text{next}} = A\mathbf{x}$. Matrix multiplication is what lets us jump two months in one shot. Two steps is $A(A\mathbf{x}) = A^2\mathbf{x}$, so the matrix $A^2 = AA$ is the "two-month transition," its columns encoding where the employed and unemployed populations land after two months. By associativity, $n$ steps is the single matrix $A^n$, computed once and reused — the same trick that lets PageRank (Chapter 29) take thousands of web-surfing steps by raising one matrix to a power. The entry $(A^n)_{ij}$ is the fraction of group $j$ that ends in state $i$ after $n$ steps. Whole fields — Markov chains in data science, population models in ecology, inventory and pricing dynamics in economics — are powered by multiplying a transition matrix by itself, which is composition applied over and over. Case Study 2 builds this out into a full customer-retention model, and it is the linear algebra under the hood of neural network layers, where each layer composes a learned matrix with the data flowing through it.

Historical Note — Arthur Cayley introduced the rules of matrix multiplication in his 1858 Memoir on the Theory of Matrices, and he defined the product as the composition of two linear substitutions — exactly the transformation-first viewpoint of this chapter, not as an arithmetic rule on grids of numbers [verify]. The composition motivation came first historically; the row-times-column formula was the consequence, just as we derived it. Cayley also recorded that matrix multiplication is associative but, strikingly for the era, not commutative — among the first widely studied algebraic systems where $ab \ne ba$, a discovery that helped open the door to modern abstract (non-commutative) algebra [verify].