
Prerequisites

  • chapter-08-matrix-operations
  • chapter-11-the-determinant

Learning Objectives

  • Explain why a pure linear map cannot translate (it fixes the origin) and how homogeneous coordinates fix this by lifting to one higher dimension.
  • Build 2D and 3D translation, rotation, and scaling matrices in homogeneous coordinates as a single 3x3 or 4x4 matrix multiply.
  • Compose rotation, scaling, and translation into one model matrix, and explain why the order of multiplication changes the result.
  • Project a 3D scene to a 2D screen with both orthographic projection and a gentle perspective projection, and explain the perspective divide.
  • Trace a point through the full rendering pipeline: model -> world -> camera -> projection -> screen.
  • Implement homogeneous transform builders (translation, rotation, scaling) returning 3x3 matrices, compose them, and verify against numpy (toolkit/capstone/render3d.py).

Application: Computer Graphics — Rotation, Scaling, Projection, and Rendering in 3D

Learning paths. Math majors — read everything; the payoff here is seeing the affine group and projective lift (§12.2, §12.9) as honest mathematics, not a hack, and you should pin down why the perspective matrix is genuinely a linear map on a larger space. CS / Data Science — this is your chapter: focus on the Geometric Intuition callouts, the numpy that builds real transformation matrices, and the rendering-pipeline walkthrough; the proofs are light by design. Physics / Engineering — focus on the geometry of composing transformations and on the camera/projection geometry, which is the same change-of-frame reasoning you use for reference frames. This chapter cashes in everything Part II built: a matrix is a function that transforms space, and a sequence of matrices turns a 3D world into the pixels in front of you.

A note on this chapter (§9). Chapters 7 through 11 built the machine; this chapter runs it on something you can watch move. We promised back in Chapter 1, and again in Chapter 3, that linear algebra is what puts a moving world on a screen, and that translation — the one motion a matrix seemingly can't do — has an elegant fix. Here we deliver both. This is an application chapter, so it leans on the geometric (G) and computational (C) tracks: we build transformation matrices, compose them in the right order (Chapter 8), read their determinants (Chapter 11), and render a wireframe with matplotlib. There is one genuinely new idea — homogeneous coordinates — and it is worth the whole chapter.

12.1 How does a computer turn a 3D world into a flat picture?

Pause on whatever screen you are reading this. Every frame a video game draws, every shot in an animated film, every rotating product on a shopping page, every furniture-in-your-living-room augmented-reality preview — all of it is the same computation, run millions of times a second: take a list of points in three-dimensional space, move them around, and flatten them onto a two-dimensional grid of pixels. That is the entire job of real-time graphics, and the language it is written in is linear algebra. This is the chapter where linear algebra in computer graphics stops being a promise and becomes a procedure you can carry out by hand.

Here is the shape of the problem, told geometrically before any symbol. You have a model — say a teapot, or for us a humble cube — described as a bag of corner points called vertices, with edges connecting them. You want to place that model somewhere in a larger scene (move it, turn it, resize it). You want to view the scene from a particular camera looking in a particular direction. And finally you want to squash the three-dimensional result down to the flat screen, with farther things drawn smaller, the way your eye sees them. Each of those steps is a transformation of points, and — this is the thesis of Part II — each transformation is a matrix.

The Key Insight — Rendering is a pipeline of matrix multiplications. A vertex starts in the model's own coordinates and is multiplied, in turn, by a model matrix, a view (camera) matrix, and a projection matrix, ending up as a 2D screen position. Change the matrices and the world moves; the points themselves never change, only the matrices that act on them. Everything in this chapter is built to make that one sentence true and computable.

So why does this chapter exist as a separate application, rather than as one more example back in Chapter 7? Because graphics forces a confrontation with a limitation we have flagged repeatedly and never resolved. A linear transformation — and therefore any matrix as we have used it so far — must fix the origin (we proved this in Chapter 7: set the scalar to zero in the homogeneity rule and $T(\mathbf 0) = \mathbf 0$). But the single most common thing you do in graphics is move an object somewhere else — slide it three units to the right. That motion, translation, moves the origin, so it is not linear, and no $2\times 2$ or $3\times 3$ matrix can perform it on its own. The Common Pitfall in Chapter 7 warned about exactly this; Case Study 1 of Chapter 7 ended by promising the fix. The fix is homogeneous coordinates, and it is so clean that it has become the universal convention of every graphics system on Earth.

We will take it in order. First we will see precisely why translation breaks the matrix framework, and then the trick that smuggles it back in (§12.2). Then we will rebuild our whole transformation zoo — rotation, scaling, translation — in the new homogeneous language, in 2D and then in 3D (§12.3–§12.4). We will compose them into a single model matrix and watch, painfully and instructively, how the order of multiplication changes everything (§12.5–§12.6, paying off Chapter 8). Then we flatten 3D to 2D with orthographic and perspective projection (§12.7). Finally we assemble the full rendering pipeline and trace a single point all the way from a model's local space to a pixel (§12.8), and render a real wireframe cube. Let's start, as always, with the picture of the thing that breaks.

FAQ: Why is linear algebra used in computer graphics at all?

Because the operations graphics needs — moving, turning, resizing, and projecting points — are exactly the operations linear (and affine) algebra describes, and because matrices let you apply one operation to a million points with the same code, and apply a whole sequence of operations as a single combined matrix. A modern graphics processor is, at its heart, a machine built to multiply many small vectors by a few small matrices in parallel, thousands of times per frame. The geometry you learned in Chapter 7 — "a matrix sends the basis vectors to its columns" — is literally the arithmetic inside the render loop. Nothing about graphics transformations is conceptually new; it is Part II applied at scale, plus the one extra idea (homogeneous coordinates) that this chapter is built around.

12.2 Why can't a matrix translate, and what is the homogeneous-coordinates fix?

Picture the flat plane again, our infinite sheet of graph paper. Translation by a vector $\mathbf t = (t_x, t_y)$ slides the entire sheet bodily sideways: every point $(x,y)$ goes to $(x + t_x,\ y + t_y)$. Geometrically it is the gentlest motion imaginable — no stretching, no turning, just a rigid shove. And yet it is the one motion that a matrix, as we have defined matrices, cannot perform.

The reason is the origin. Watch what translation does to the point $(0,0)$: it sends it to $(t_x, t_y)$, which is not the origin (unless $\mathbf t = \mathbf 0$, i.e. no translation at all). But we proved in Chapter 7 that every linear transformation fixes the origin. So translation is not linear. It belongs to a slightly larger family called affine transformations, each a linear map followed by a shift, $T(\mathbf x) = A\mathbf x + \mathbf b$. The trouble is that pesky $+\,\mathbf b$, which no amount of cleverness inside a $2\times 2$ matrix can produce: matrix-times-vector is always a weighted sum of the columns, with the input's entries as the weights (Chapter 7), so when the input is zero every weight is zero and the output is the zero vector, never a fixed offset.

Common Pitfall — "Just put the translation amounts in the matrix somewhere." You cannot make $\begin{bmatrix} a & b \\ c & d\end{bmatrix}\begin{bmatrix}x\\y\end{bmatrix}$ equal $\begin{bmatrix}x + t_x \\ y + t_y\end{bmatrix}$ for all $(x,y)$, no matter what $a,b,c,d$ are. Test it on $(0,0)$: the left side is always $(0,0)$, the right side is $(t_x, t_y)$. There is no room in a $2\times 2$ matrix for a constant term. This is not a gap in your algebra; it is a theorem (linear maps fix the origin), and the only way around a theorem is to change the setting. That is what homogeneous coordinates do.
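The pitfall can be checked numerically. What follows is a minimal sketch, not part of the chapter's toolkit: the helper name `translate`, the shift $(3, 1)$, and the random matrices are illustrative assumptions. It confirms that arbitrary $2\times 2$ matrices all send the origin to the origin, while translation moves the origin and also fails additivity.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the check is repeatable
origin = np.zeros(2)

# Any 2x2 matrix at all sends the origin to the origin.
for _ in range(5):
    A = rng.normal(size=(2, 2))         # arbitrary entries a, b, c, d
    assert np.allclose(A @ origin, origin)

# Translation, written as a plain function (hypothetical helper), moves the origin ...
def translate(p, t=np.array([3.0, 1.0])):
    return p + t

moved_origin = translate(origin)
print(moved_origin)

# ... and it also fails additivity, one of the two linearity rules.
u, v = np.array([1.0, 2.0]), np.array([-4.0, 0.5])
additive = np.allclose(translate(u + v), translate(u) + translate(v))
print(additive)
```

The additivity test fails because translating $\mathbf u$ and $\mathbf v$ separately applies the shift twice, whereas translating $\mathbf u + \mathbf v$ applies it once.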

So here is the trick, and it is genuinely beautiful. We lift the plane up into three dimensions by appending a third coordinate equal to $1$. The point $(x, y)$ becomes the triple $(x, y, 1)$. We are not adding real geometric height; we are parking our flat world on the slice $w = 1$ of a 3D space (call the third axis $w$). A 2D point is now a 3D point with its last coordinate pinned to $1$. These are called homogeneous coordinates.

Geometric Intuition — Imagine the ordinary 2D plane floating at height $w = 1$ inside a 3D room. Every 2D point $(x,y)$ is the 3D point $(x, y, 1)$ sitting on that elevated floor. Now here is the magic: a linear transformation of the big 3D room — an honest $3\times 3$ matrix, which of course fixes the room's origin $(0,0,0)$ — can nonetheless slide our elevated floor sideways, because the floor is not at the origin. A shear of the 3D room, applied to a point at height $1$, shifts its $x$ and $y$ in proportion to that height of $1$. Translation in 2D is just a shear in 3D. We did not break the "matrices fix the origin" rule; we moved up a dimension where the rule no longer pins the thing we care about.

Let's see the arithmetic. We want a $3\times 3$ matrix $T$ such that, when it hits $(x, y, 1)$, it returns $(x + t_x,\ y + t_y,\ 1)$. Reading off "where does each output coordinate come from," the matrix is

$$T(t_x, t_y) = \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}.$$

Check it by the weighted-sum-of-columns rule from Chapter 7, or just multiply it out:

$$\begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} 1\cdot x + 0\cdot y + t_x\cdot 1 \\ 0\cdot x + 1\cdot y + t_y \cdot 1 \\ 0\cdot x + 0\cdot y + 1\cdot 1\end{bmatrix} = \begin{bmatrix} x + t_x \\ y + t_y \\ 1\end{bmatrix}.$$

There it is. The translation amounts $t_x, t_y$ live in the third column, and they get multiplied by that pinned-to-$1$ last coordinate, which is why a constant shift appears in the output. The bottom row $\begin{bmatrix}0 & 0 & 1\end{bmatrix}$ keeps the last coordinate equal to $1$, so the result is again a valid homogeneous 2D point. Translation has become a matrix multiply — exactly the promise from Chapters 1 and 3, paid in full.

# Translation as a 3x3 homogeneous matrix multiply. numpy confirms the shift.
import numpy as np
def translation(tx, ty):
    return np.array([[1, 0, tx],
                     [0, 1, ty],
                     [0, 0, 1]], dtype=float)
p = np.array([1, 0, 1.0])             # the 2D point (1, 0) lifted to (1, 0, 1)
print(translation(3, 1) @ p)          # slide right 3, up 1
[4. 1. 1.]

The point $(1,0)$ slides to $(4,1)$ — the last coordinate stays $1$, confirming we are still on the $w = 1$ floor. We have just done with a matrix the one thing matrices supposedly cannot do.
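The "translation is a shear one dimension up" picture can be tested directly. This is a small sketch under the assumption that we are allowed to place the same $(x, y)$ at heights other than $w = 1$, purely to watch the shear; the heights $0$, $1$, $2$ are arbitrary choices.

```python
import numpy as np

T = np.array([[1, 0, 3],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)   # translation by (3, 1) from the text

# The same (x, y) = (1, 0) placed at different heights w in the 3D room.
for w in (0.0, 1.0, 2.0):
    p = np.array([1.0, 0.0, w])
    print(w, T @ p)

room_origin = T @ np.zeros(3)            # the 3D origin itself never moves
print(room_origin)
```

The shift applied to $x$ and $y$ is $w$ times $(3, 1)$: nothing at height $0$, the full translation at height $1$, double at height $2$. That proportionality to height is what makes the map a shear, and the room's origin stays fixed, as linearity demands.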

The Key Insight — In homogeneous coordinates, you append a $1$ to every point, and translation becomes the third column of a $3\times 3$ matrix (or, in 3D, the fourth column of a $4\times 4$). This unifies rotation, scaling, and translation into one matrix type, so you can compose them all by ordinary matrix multiplication. Every graphics API on Earth represents transformations this way — in 3D, as $4\times 4$ matrices.

Why pin the last coordinate to $1$ specifically? Because the homogeneous representation is scale-invariant: the triples $(x, y, 1)$ and $(2x, 2y, 2)$ and in general $(cx, cy, c)$ for any nonzero $c$ all stand for the same 2D point $(x, y)$ — you recover the 2D point by dividing the first two coordinates by the last. Choosing the representative with last coordinate $1$ is just the canonical normalization. This freedom looks pedantic now, but it is exactly the lever that makes perspective work in §12.7, where the last coordinate comes out not equal to $1$ and the division does something wonderful. (And the points with last coordinate $0$, which you cannot divide back into the plane, turn out to be "points at infinity" — directions rather than locations. More on that in §12.9.)
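The scale-invariance is easy to see in code. Here is a minimal sketch; `to_cartesian` is a hypothetical helper name, and raising an error for a zero last coordinate is one possible design choice, not the only one.

```python
import numpy as np

def to_cartesian(h):
    """Recover the 2D point from a homogeneous triple by dividing by the last coordinate."""
    h = np.asarray(h, dtype=float)
    if np.isclose(h[-1], 0.0):
        raise ValueError("last coordinate is 0: a direction, not a location")
    return h[:-1] / h[-1]

a = to_cartesian([2.0, 5.0, 1.0])       # canonical representative
b = to_cartesian([4.0, 10.0, 2.0])      # same point, scaled by c = 2
c = to_cartesian([-1.0, -2.5, -0.5])    # same point, scaled by c = -0.5
print(a, b, c)
```

All three triples come back as the same 2D point $(2, 5)$, which is what "the triples $(cx, cy, c)$ all stand for $(x, y)$" means in practice.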

Math-Major Sidebar (optional) — What we have built is the standard embedding of the affine group of the plane into a subgroup of the $3\times 3$ invertible matrices $GL_3(\mathbb R)$. The affine maps $\mathbf x \mapsto A\mathbf x + \mathbf b$ (with $A$ a $2\times 2$ matrix and $\mathbf b$ a shift) correspond exactly to the $3\times 3$ matrices $\begin{bmatrix} A & \mathbf b \\ \mathbf 0^{\mathsf T} & 1\end{bmatrix}$, and matrix multiplication of these block matrices reproduces affine composition: $\begin{bmatrix} A_2 & \mathbf b_2 \\ \mathbf 0^{\mathsf T} & 1\end{bmatrix}\begin{bmatrix} A_1 & \mathbf b_1 \\ \mathbf 0^{\mathsf T} & 1\end{bmatrix} = \begin{bmatrix} A_2 A_1 & A_2 \mathbf b_1 + \mathbf b_2 \\ \mathbf 0^{\mathsf T} & 1\end{bmatrix}$. The bottom-right $1$ and zero bottom row are what keep the family closed under multiplication. When we relax the bottom row to something other than $\begin{bmatrix}\mathbf 0^{\mathsf T} & 1\end{bmatrix}$, we leave the affine group and enter genuine projective transformations — which is precisely what a perspective camera is. So the homogeneous trick is not a hack; it is the entryway to projective geometry, the natural home of cameras and vanishing points.
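The block-multiplication identity in the sidebar can be spot-checked numerically. This is a sketch only: `affine` is a hypothetical helper, and a check on random inputs illustrates the identity without proving it.

```python
import numpy as np

def affine(A, b):
    """Pack x -> A x + b into the 3x3 block matrix [[A, b], [0, 1]]."""
    M = np.eye(3)
    M[:2, :2] = A
    M[:2, 2] = b
    return M

rng = np.random.default_rng(1)
A1, A2 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
b1, b2 = rng.normal(size=2), rng.normal(size=2)

product = affine(A2, b2) @ affine(A1, b1)          # apply map 1, then map 2
expected = affine(A2 @ A1, A2 @ b1 + b2)           # the composed affine map
closed = np.allclose(product, expected)
print(closed)
```

The product's bottom row is again $\begin{bmatrix}0 & 0 & 1\end{bmatrix}$, so the result is another matrix of the same block shape.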

12.3 How do you build rotation and scaling in homogeneous coordinates?

Now that translation lives happily as a $3\times 3$ matrix, we want rotation and scaling to join it in the same $3\times 3$ format, so that all three can be multiplied together. The recipe is the easiest thing in the chapter: take the ordinary $2\times 2$ transformation from Chapter 7, drop it into the top-left $2\times 2$ block, and pad with the identity's third row and column. Because these transformations do fix the origin, they need no third-column shift, so the third column is just $(0,0,1)$.

A scaling by $s_x$ horizontally and $s_y$ vertically was $\begin{bmatrix} s_x & 0 \\ 0 & s_y\end{bmatrix}$ in Chapter 7. Lifted to homogeneous coordinates:

$$\text{scaling}(s_x, s_y) = \begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

A rotation by angle $\theta$ counterclockwise was $\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta\end{bmatrix}$, which we derived in Chapter 7 by asking where the basis vectors go. Lifted:

$$\text{rotation}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

Geometric Intuition — The top-left $2\times 2$ block still does all the geometric work on the $x$ and $y$ coordinates exactly as in Chapter 7 — it rotates, scales, shears. The extra row and column are bookkeeping: the bottom row $\begin{bmatrix}0 & 0 & 1\end{bmatrix}$ preserves the homogeneous $1$, and the third column holds any translation (here, none). So every Chapter 7 transformation lives inside its homogeneous cousin unchanged, with translation now available as a slot that used to not exist.

Let's verify a rotation in this new format on a clean angle, $\theta = 90°$, and confirm it agrees with the bare $2\times 2$ version on the $x,y$ part.

# Rotation and scaling as 3x3 homogeneous matrices.
import numpy as np
def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0],
                     [s,  c, 0],
                     [0,  0, 1]], dtype=float)
def scaling(sx, sy):
    return np.array([[sx, 0, 0],
                     [0, sy, 0],
                     [0,  0, 1]], dtype=float)
print(np.round(rotation(np.radians(90)), 4))
print("(1,0) rotated 90 deg ->", np.round(rotation(np.radians(90)) @ np.array([1, 0, 1.0]), 4))
[[ 0. -1.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]]
[0. 1. 1.]

A quarter turn sends east $(1,0)$ to north $(0,1)$ — exactly as in Chapter 7 — and the homogeneous coordinate stays $1$. The geometry is identical; only the dimension of the bookkeeping changed.
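The "drop the $2\times 2$ into the top-left block" recipe can be written once and reused. The sketch below assumes a hypothetical helper `lift` and an arbitrary test point $(1, 3)$; it compares the lifted matrices against the bare $2\times 2$ versions on the $x, y$ part.

```python
import numpy as np

def lift(A):
    """Embed a 2x2 linear map in the top-left block of a 3x3 identity."""
    H = np.eye(3)
    H[:2, :2] = A
    return H

theta = np.radians(90)
R2 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])
S2 = np.diag([2.0, 0.5])                 # stretch x by 2, squash y by half

xy = np.array([1.0, 3.0])
p = np.append(xy, 1.0)                   # lifted point (1, 3, 1)

rot_agrees = np.allclose((lift(R2) @ p)[:2], R2 @ xy)
scale_agrees = np.allclose((lift(S2) @ p)[:2], S2 @ xy)
w_kept = np.isclose((lift(S2) @ lift(R2) @ p)[2], 1.0)
print(rot_agrees, scale_agrees, w_kept)
```

Both lifted maps agree with their Chapter 7 originals on $x$ and $y$, and the homogeneous coordinate survives the composition unchanged.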

Computational Note — The determinant of a homogeneous transform tells you a familiar story, with a caveat. For the $3\times 3$ rotation and scaling above, $\det = \det(\text{top-left } 2\times 2)\cdot 1$, because of the block-triangular structure (the determinant of a block-triangular matrix is the product of the diagonal blocks' determinants — Chapter 11). So a homogeneous rotation has $\det = 1$, a scaling has $\det = s_x s_y$, and — importantly — a translation has $\det = 1$ too, since its top-left block is the identity. The caveat: this $3\times 3$ determinant measures volume scaling in the lifted 3D space, which equals the 2D area scaling only because the bottom row $\begin{bmatrix}0 & 0 & 1\end{bmatrix}$ leaves the $w$ direction unscaled. Do not over-read it; the geometric content is still the Chapter 11 story applied to the $2\times 2$ block.
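These determinant claims can be confirmed with `np.linalg.det`. The angle $30°$, scale factors $(2, 3)$, and shift $(3, 1)$ below are arbitrary illustrative values.

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

def scaling(sx, sy):
    return np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1]], dtype=float)

det_R = np.linalg.det(rotation(np.radians(30)))
det_S = np.linalg.det(scaling(2, 3))
det_T = np.linalg.det(translation(3, 1))

# A composed transform: its 3x3 determinant equals that of its top-left 2x2 block.
M = translation(3, 1) @ rotation(np.radians(30)) @ scaling(2, 3)
det_M = np.linalg.det(M)
det_block = np.linalg.det(M[:2, :2])
print(np.round([det_R, det_S, det_T, det_M, det_block], 6))
```

Up to floating-point rounding, the rotation and translation have determinant $1$, the scaling has $s_x s_y = 6$, and the composed matrix and its $2\times 2$ block agree at $6$.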

FAQ: What is a 3D transformation matrix, and why is it 4x4?

A 3D transformation matrix is the homogeneous representation of a transformation of three-dimensional space: a $4\times 4$ matrix that acts on points written as $(x, y, z, 1)$. The pattern is identical to the 2D case, just one dimension up. The top-left $3\times 3$ block holds the linear part (rotation, scaling, shear of 3D space); the fourth column holds the translation $(t_x, t_y, t_z)$; the bottom row is $\begin{bmatrix}0 & 0 & 0 & 1\end{bmatrix}$ to preserve the homogeneous coordinate. It is $4\times 4$ rather than $3\times 3$ for the same reason the 2D version was $3\times 3$ rather than $2\times 2$: you need one extra dimension to give translation a column to live in. We build these next.

12.4 What do 3D rotation, scaling, and translation matrices look like?

Everything generalizes from 2D to 3D by adding one coordinate. A 3D point is $(x, y, z)$; in homogeneous coordinates it becomes $(x, y, z, 1)$, and transformations are $4\times 4$ matrices. Scaling and translation are immediate:

$$\text{scaling}_{3D}(s_x, s_y, s_z) = \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1\end{bmatrix}, \qquad \text{translation}_{3D}(t_x, t_y, t_z) = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1\end{bmatrix}.$$
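As a minimal sketch of these two matrices in numpy (the function names `scaling3d` and `translation3d` are assumptions made for illustration and are not taken from the chapter's toolkit):

```python
import numpy as np

def scaling3d(sx, sy, sz):
    # Diagonal 4x4: scale factors, then 1 to preserve the homogeneous coordinate.
    return np.diag([sx, sy, sz, 1.0])

def translation3d(tx, ty, tz):
    # Identity with the shift written into the fourth column.
    M = np.eye(4)
    M[:3, 3] = [tx, ty, tz]
    return M

p = np.array([1.0, 2.0, 3.0, 1.0])       # the 3D point (1, 2, 3), lifted
scaled = scaling3d(2, 2, 2) @ p
moved = translation3d(5, 0, -1) @ p
print(scaled, moved)
```

Doubling sends $(1, 2, 3)$ to $(2, 4, 6)$ and the translation sends it to $(6, 2, 2)$; in both cases the fourth coordinate stays $1$.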

Rotation is the only place 3D is genuinely richer than 2D, and the reason is geometric: in the plane there is essentially one way to rotate (about the origin, by an angle), but in space you must say about which axis. There are three fundamental rotations, one about each coordinate axis, and each one fixes its axis while rotating the other two coordinates exactly as in the 2D case. Rotation about the $z$-axis, for instance, leaves $z$ alone and rotates $(x,y)$ — so it is the 2D rotation matrix, padded:

$$R_z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 & 0 \\ \sin\theta & \cos\theta & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{bmatrix}.$$

Rotation about the $x$-axis fixes $x$ and rotates $(y, z)$; rotation about the $y$-axis fixes $y$ and rotates $(z, x)$:

$$R_x(\theta) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta & 0 \\ 0 & \sin\theta & \cos\theta & 0 \\ 0 & 0 & 0 & 1\end{bmatrix}, \qquad R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\theta & 0 & \cos\theta & 0 \\ 0 & 0 & 0 & 1\end{bmatrix}.$$

Geometric Intuition — Each axis rotation is "the 2D rotation, applied to the other two coordinates, with the named axis held fixed." To see where the sign pattern in $R_y$ comes from, use the Chapter 7 method — ask where the basis vectors go — but note the cyclic order: $x \to y \to z \to x$. Rotating about $y$ rotates the $z$-$x$ plane in that cyclic order, which is why the lone plus-sign $\sin\theta$ sits in the top-right rather than the bottom-left. Many a graphics bug is a sign error in $R_y$ from forgetting the cycle; when unsure, re-derive by hand on a $90°$ rotation.

# Two of the three fundamental 3D rotation matrices (4x4 homogeneous): Rz and Rx.
import numpy as np
def Rz(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c,-s,0,0],[s,c,0,0],[0,0,1,0],[0,0,0,1]], float)
def Rx(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1,0,0,0],[0,c,-s,0],[0,s,c,0],[0,0,0,1]], float)
print("Rz(30 deg) ="); print(np.round(Rz(np.radians(30)), 4))
print("Rx(30 deg) ="); print(np.round(Rx(np.radians(30)), 4))
Rz(30 deg) =
[[ 0.866 -0.5    0.     0.   ]
 [ 0.5    0.866  0.     0.   ]
 [ 0.     0.     1.     0.   ]
 [ 0.     0.     0.     1.   ]]
Rx(30 deg) =
[[ 1.     0.     0.     0.   ]
 [ 0.     0.866 -0.5    0.   ]
 [ 0.     0.5    0.866  0.   ]
 [ 0.     0.     0.     1.   ]]

Each matrix carries a recognizable $\begin{bmatrix}\cos & -\sin \\ \sin & \cos\end{bmatrix}$ block in the two coordinates it rotates, with $1$ on the diagonal of the axis it leaves fixed. The numbers $0.866 \approx \cos 30°$ and $0.5 = \sin 30°$ match Chapter 7 exactly.
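The advice to re-derive the signs on a $90°$ rotation can be automated. The sketch below writes out all three axis rotations, with $R_y$ following the matrix displayed above, and checks the cyclic pattern: a quarter turn about $z$ carries $x$ to $y$, about $x$ carries $y$ to $z$, and about $y$ carries $z$ to $x$. The variable names are illustrative.

```python
import numpy as np

def Rx(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]], float)

def Ry(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]], float)

def Rz(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], float)

quarter = np.radians(90)
ex = np.array([1.0, 0.0, 0.0, 1.0])   # unit point on the x-axis, lifted
ey = np.array([0.0, 1.0, 0.0, 1.0])
ez = np.array([0.0, 0.0, 1.0, 1.0])

z_turn = np.allclose(Rz(quarter) @ ex, ey)   # about z: x goes to y
x_turn = np.allclose(Rx(quarter) @ ey, ez)   # about x: y goes to z
y_turn = np.allclose(Ry(quarter) @ ez, ex)   # about y: z goes to x
print(z_turn, x_turn, y_turn)
```

If the two $\sin\theta$ entries of `Ry` were swapped, the last check would fail, because $z$ would be sent to $-x$ instead.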

Warning — A general 3D rotation about an arbitrary axis (not a coordinate axis) is a more involved matrix, and composing the three coordinate rotations to reach a desired orientation runs into a notorious trap called gimbal lock, where two of the three rotation axes align and you lose a degree of freedom. This is why production engines often represent orientation with quaternions rather than chained axis rotations. We are not covering quaternions — they live just past this book — but you should know that "compose $R_x$, $R_y$, $R_z$ to get any orientation" is true in principle yet numerically and ergonomically fraught in practice. For this chapter's purposes, coordinate-axis rotations are exactly what we need, and they are honest matrices.
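Gimbal lock can be seen in a few lines. The sketch below assumes one particular composition order, $R_z(a)\,R_y(b)\,R_x(c)$, wrapped in a hypothetical helper `euler`; other orders lock at other configurations. When the middle angle is $90°$, shifting both outer angles by the same amount leaves the final orientation unchanged, so two of the three angles have collapsed into one.

```python
import numpy as np

def Rx(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]], float)

def Ry(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]], float)

def Rz(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]], float)

def euler(a, b, c):
    # Assumed convention for this sketch: rotate about x, then y, then z.
    return Rz(a) @ Ry(b) @ Rx(c)

deg = np.radians
# Middle angle at 90 degrees: shift both outer angles by 20 degrees.
locked = np.allclose(euler(deg(10), deg(90), deg(40)),
                     euler(deg(30), deg(90), deg(60)))
# Middle angle at 20 degrees: the same shift gives a different orientation.
free = not np.allclose(euler(deg(10), deg(20), deg(40)),
                       euler(deg(30), deg(20), deg(60)))
print(locked, free)
```

At the locked configuration the product depends only on the difference of the outer angles, which is the lost degree of freedom described in the warning.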

12.5 Composing transformations into one model matrix

Here is where the homogeneous payoff becomes undeniable. You rarely want just a rotation or just a translation; you want to take a model sitting at the origin in its own coordinates, scale it to the right size, rotate it to the right orientation, and translate it to the right place in the world. Because all three are now the same kind of matrix, you can multiply them into a single model matrix $M$ that does the whole job in one shot:

$$M = T \, R \, S.$$

Recall from Chapter 8 the rule for reading a product of transformations: the rightmost matrix acts first. So $M = TRS$ means "scale first, then rotate, then translate," because applying $M$ to a point $\mathbf p$ is $T(R(S\mathbf p))$ — $S$ touches $\mathbf p$ first. This is the standard order in graphics, and it is the sensible order: you size the object and orient it while it is still centered at the origin (where rotation and scaling behave simply), and only then shove it out to its world position.

The Key Insight — A single model matrix $M = TRS$ encodes an object's size, orientation, and position all at once. To place the object, you multiply each of its vertices by this one matrix. Compose once, apply to thousands of vertices — that economy is the entire reason graphics is built on matrices rather than on per-point formulas.

Let's compute a concrete model matrix and apply it to the unit square. Scale by $2$, rotate by $45°$, translate by $(3, 1)$, in homogeneous 2D.

# Build one model matrix M = T R S (scale, then rotate, then translate).
import numpy as np
def translation(tx, ty): return np.array([[1,0,tx],[0,1,ty],[0,0,1]], float)
def rotation(t): c,s = np.cos(t), np.sin(t); return np.array([[c,-s,0],[s,c,0],[0,0,1]], float)
def scaling(sx, sy): return np.array([[sx,0,0],[0,sy,0],[0,0,1]], float)

S = scaling(2, 2)
R = rotation(np.radians(45))
T = translation(3, 1)
M = T @ R @ S                      # rightmost (S) acts first; T acts last
print("model matrix M ="); print(np.round(M, 4))
square = np.array([[0, 1, 1, 0],   # x of the 4 corners
                   [0, 0, 1, 1],   # y
                   [1, 1, 1, 1]])  # homogeneous 1
print("transformed corners ="); print(np.round(M @ square, 4))
model matrix M =
[[ 1.4142 -1.4142  3.    ]
 [ 1.4142  1.4142  1.    ]
 [ 0.      0.      1.    ]]
transformed corners =
[[3.     4.4142 3.     1.5858]
 [1.     2.4142 3.8284 2.4142]
 [1.     1.     1.     1.    ]]

Read the model matrix: its top-left $2\times 2$ block, $\begin{bmatrix}1.4142 & -1.4142 \\ 1.4142 & 1.4142\end{bmatrix}$, is exactly $2\begin{bmatrix}\cos45° & -\sin45° \\ \sin45° & \cos45°\end{bmatrix}$, the rotation by $45°$ scaled by $2$, since $2\cos 45° = 2\sin45° \approx 1.4142$. The top two entries of the third column, $(3, 1)$, are the translation. The corner $(0,0)$ maps to $(3,1)$ — the model's origin landed at the world position we asked for — while the other corners are rotated and doubled around it. One matrix did scale, rotate, and translate at once.
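The "compose once, apply to many" economy can be made concrete. This is a sketch with 1000 random points standing in for a model's vertices (an assumption for illustration): it checks that the single matrix $M$ gives the same result as applying $S$, then $R$, then $T$ one after another.

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

def scaling(sx, sy):
    return np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1]], dtype=float)

S, R, T = scaling(2, 2), rotation(np.radians(45)), translation(3, 1)
M = T @ R @ S                                   # composed once

rng = np.random.default_rng(2)
xy = rng.uniform(-1, 1, size=(2, 1000))         # 1000 stand-in vertices
pts = np.vstack([xy, np.ones((1, 1000))])       # lifted: one point per column

one_shot = M @ pts                              # one multiply for all points
step_by_step = T @ (R @ (S @ pts))              # three multiplies for all points
same = np.allclose(one_shot, step_by_step)
print(same, one_shot.shape)
```

Associativity of matrix multiplication guarantees the two agree; the one-shot version simply does two fewer full passes over the vertex array.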

Common Pitfall — Transform order, and the row-vs-column convention. Two traps trip nearly everyone, and they compound. (1) Order matters because matrix multiplication is not commutative (Chapter 8): $TRS \ne SRT \ne RST$. Translating then rotating swings the object around the world origin on a wide arc; rotating then translating spins it in place and then moves it. They are different motions and different matrices — track the basis vectors through each order and you will see why, exactly as in Chapter 7's §7.6 preview. (2) Row-vs-column vectors. This book, and the OpenGL tradition, treats vectors as columns and multiplies on the left: $\mathbf p' = M\mathbf p$, so the matrix that acts first is on the right of the product. Some systems (notably DirectX and much of the older graphics literature) treat vectors as rows and multiply on the right: $\mathbf p' = \mathbf p\, M$, which reverses the order, putting the first transform on the left and using the transposes of all our matrices. Both are correct; they are transposes of each other. Mixing the two conventions in one codebase is a classic, maddening source of objects that rotate the wrong way or fly off-screen. Pick one convention and never cross it. We use column vectors, left-multiplication, rightmost-acts-first, throughout.
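The two conventions can be put side by side. The sketch below reuses this section's $S$, $R$, $T$ values; the name `M_row` is an assumption made for illustration. It shows that the row-vector form uses the transposed matrices in reversed order and lands on the same point.

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

def scaling(sx, sy):
    return np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1]], dtype=float)

S, R, T = scaling(2, 2), rotation(np.radians(45)), translation(3, 1)
p = np.array([1.0, 0.0, 1.0])

# Column convention (this book): the first transform sits on the right.
M_col = T @ R @ S
col_result = M_col @ p

# Row convention: transposed matrices, first transform on the left.
M_row = S.T @ R.T @ T.T
row_result = p @ M_row

print(np.round(col_result, 4), np.round(row_result, 4))
```

Here `M_row` is exactly the transpose of `M_col`, so a codebase that mixes the two is silently applying transposed or reordered transforms.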

Geometric Intuition — Why order changes the picture. Think of "rotate $90°$ about the origin, then translate right by $5$" versus "translate right by $5$, then rotate $90°$ about the origin." In the first, a point at the origin spins to itself, then slides to $(5,0)$. In the second, the point slides to $(5,0)$ first, then the $90°$ rotation swings that whole position a quarter-turn about the origin, landing it at $(0,5)$. Same two operations, totally different destinations. The rotation always pivots about the origin, so what is sitting at the origin when the rotation happens decides the outcome.

12.6 Why does order matter, and how do you verify a composed transform?

Let's make the non-commutativity concrete and measurable, because "order matters" is the single most consequential fact a graphics programmer carries from Part II, and it is worth pinning to specific numbers. We will compute $TR$ and $RT$ for a quarter-turn $R$ and a translation $T$, and watch them disagree.

Take $R = R(90°) = \begin{bmatrix}0 & -1 & 0\\ 1 & 0 & 0 \\ 0 & 0 & 1\end{bmatrix}$ and $T = T(3, 1) = \begin{bmatrix}1 & 0 & 3 \\ 0 & 1 & 1 \\ 0 & 0 & 1\end{bmatrix}$. The product $TR$ ("rotate first, then translate") versus $RT$ ("translate first, then rotate") gives genuinely different matrices, which we confirm in numpy and check against hand reasoning.

# Order matters: T R is NOT R T. Apply each to the point (1, 0).
import numpy as np
R = np.array([[0,-1,0],[1,0,0],[0,0,1]], float)   # rotate 90 deg CCW
T = np.array([[1,0,3],[0,1,1],[0,0,1]], float)     # translate (3, 1)
p = np.array([1, 0, 1.0])
print("T@R =");  print(np.round(T @ R, 4))
print("R@T =");  print(np.round(R @ T, 4))
print("(T@R) on (1,0) ->", np.round((T @ R) @ p, 4))   # rotate, then shift
print("(R@T) on (1,0) ->", np.round((R @ T) @ p, 4))   # shift, then rotate
T@R =
[[ 0. -1.  3.]
 [ 1.  0.  1.]
 [ 0.  0.  1.]]
R@T =
[[ 0. -1. -1.]
 [ 1.  0.  3.]
 [ 0.  0.  1.]]
(T@R) on (1,0) -> [3. 2. 1.]
(R@T) on (1,0) -> [-1.  4.  1.]

The two matrices differ, and so do the destinations: rotating $(1,0)$ to $(0,1)$ and then translating gives $(3,2)$; translating $(1,0)$ to $(4,1)$ and then rotating gives $(-1, 4)$. The translation columns even differ — $(3,1)$ versus $(-1,3)$ — because in $RT$ the rotation also acts on the translation column. This is Chapter 8's non-commutativity, now with screen consequences: get the order wrong and your object lands in the wrong place, every frame.

Check Your Understanding — You want to spin a clock's hand about its own pivot at world position $(p_x, p_y)$, not about the world origin. In what order do you compose a translation $T_p = T(p_x, p_y)$ (pivot to its world spot), its inverse $T_{-p} = T(-p_x, -p_y)$, and a rotation $R(\theta)$?

Answer

$M = T_p\, R(\theta)\, T_{-p}$ (read right to left: first translate the pivot back to the origin, then rotate, then translate back out). This "translate to origin, rotate, translate back" sandwich is the universal recipe for rotating about an arbitrary point, and it is why homogeneous coordinates matter: you could not insert those translations into a bare $2\times 2$ rotation. Applying $M$ to the pivot point itself returns the pivot (it is the fixed point), and everything else swings around it. This same conjugation pattern $T_p R T_{-p}$ reappears as similarity / change of basis in Chapter 16.
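The sandwich can be sketched as a small function. The name `rotate_about`, the pivot $(2, 1)$, and the test point are illustrative assumptions.

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

def rotate_about(px, py, theta):
    # Read right to left: pivot to origin, rotate, pivot back out.
    return translation(px, py) @ rotation(theta) @ translation(-px, -py)

M = rotate_about(2, 1, np.radians(90))
pivot = np.array([2.0, 1.0, 1.0])
tip = np.array([3.0, 1.0, 1.0])          # one unit to the right of the pivot

pivot_image = M @ pivot
tip_image = M @ tip
print(np.round(pivot_image, 4), np.round(tip_image, 4))
```

The pivot maps to itself, and the tip, which started one unit to the right of the pivot, ends one unit above it at $(2, 2)$: a quarter turn about the pivot rather than about the world origin.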

Real-World Application — Skeletal animation and scene graphs (games and film). A character's arm is a chain: shoulder, elbow, wrist, each with its own local rotation, each attached to its parent. The world position of the hand is the product of all the matrices up the chain — shoulder transform, times elbow transform, times wrist transform — exactly a composition $M_{\text{hand}} = M_{\text{shoulder}} M_{\text{elbow}} M_{\text{wrist}}$. Because the matrices compose, animating the shoulder automatically drags the whole arm; the elbow's matrix is expressed relative to the shoulder, so it "inherits" the shoulder's motion through the product. This hierarchy of composed transforms is called a scene graph, and it is the backbone of every 3D engine and animation package. The order-matters discipline of this section is not academic there; it is the difference between an arm that bends naturally and one that detaches and floats away. The same matrix-composition idea drives 3D math for games, where every object's world matrix is the product of its local transform with its parent's world matrix.
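A toy 2D arm makes the chain concrete. Everything specific here is a made-up assumption for illustration: the shoulder sits at $(1, 1)$, the upper arm is $2$ units long, the forearm is $1.5$ units long, and `hand_position` is a hypothetical helper. Real engines store more per joint, but the product structure is the same.

```python
import numpy as np

def translation(tx, ty):
    return np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

def hand_position(shoulder_angle, elbow_angle, wrist_angle):
    # Each joint's matrix is relative to its parent:
    # move to the attachment point in the parent's frame, then rotate there.
    M_shoulder = translation(1.0, 1.0) @ rotation(shoulder_angle)  # shoulder at (1, 1) in the world
    M_elbow = translation(2.0, 0.0) @ rotation(elbow_angle)        # upper arm: 2 units long
    M_wrist = translation(1.5, 0.0) @ rotation(wrist_angle)        # forearm: 1.5 units long
    M_hand = M_shoulder @ M_elbow @ M_wrist
    return (M_hand @ np.array([0.0, 0.0, 1.0]))[:2]                # where the wrist frame's origin lands

straight = hand_position(0.0, 0.0, 0.0)
raised = hand_position(np.radians(90), 0.0, 0.0)   # only the shoulder turns
bent = hand_position(0.0, np.radians(90), 0.0)     # only the elbow turns
print(np.round(straight, 4), np.round(raised, 4), np.round(bent, 4))
```

Turning only the shoulder swings the whole arm to $(1, 4.5)$, while turning only the elbow leaves the upper arm in place and moves the hand to $(3, 2.5)$. The child joints inherit the parent's motion purely through the matrix product.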

12.7 How do you flatten 3D onto a 2D screen? (Projection)

We can now move objects anywhere in 3D space. But the screen is flat. The final geometric act of rendering is projection: collapsing three dimensions down to two, deciding where each 3D point lands on the 2D image plane. There are two flavors, and the difference between them is the difference between an architect's blueprint and a photograph.

12.7.1 Orthographic projection: parallel lines stay parallel

The simplest projection just throws away one coordinate. To project onto the screen along the $z$-axis (the viewing direction), keep $x$ and $y$ and drop $z$: the 3D point $(x, y, z)$ lands at the 2D point $(x, y)$, regardless of depth. This is orthographic projection — the projection from Chapter 7's §7.5.6, now in 3D — and it is exactly the singular, flattening kind of map we studied there, with the depth direction annihilated.

Geometric Intuition — Orthographic projection is the shadow cast by parallel rays of light coming straight along the viewing axis, like the sun's rays (effectively parallel because the sun is so far away). Two objects of the same size appear the same size on screen no matter how far apart in depth they are, because depth is simply discarded. Parallel lines in the world stay parallel on screen. This is wrong for mimicking human vision but exactly right for engineering and CAD drawings, where you want true, comparable measurements rather than a foreshortened photograph.

Notice that orthographic projection is precisely the singular kind of matrix we met in Chapter 11. Dropping the $z$-coordinate flattens three dimensions onto two, so the projection collapses an entire direction (depth) to nothing — its determinant, viewed as a map of 3D space, is zero, and it cannot be undone. You cannot recover an object's depth from its orthographic shadow, exactly as you could not recover a 2D point's height from its shadow on the $x$-axis back in Chapter 7. That irreversibility is not a defect; it is the point of projection. The screen genuinely has fewer dimensions than the world, and the renderer's job is to throw depth away in a controlled, meaningful way. (Depth is not entirely discarded in practice — it is stashed in a separate depth buffer so the renderer knows which surface is nearest and should be drawn in front — but the projected position on screen carries only two dimensions.)
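The singularity is easy to check numerically. The sketch below writes "drop $z$" as a $4\times 4$ homogeneous matrix with a zeroed $z$ row (one possible form, chosen here for illustration) and shows that two points differing only in depth land on the same image.

```python
import numpy as np

# Orthographic "drop z" as a 4x4 homogeneous matrix: the z row is zeroed.
ORTHO = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 1]], float)

near = np.array([1.0, 2.0, 3.0, 1.0])     # (x, y) = (1, 2) at depth 3
far = np.array([1.0, 2.0, 40.0, 1.0])     # same (x, y) at depth 40

print("det :", np.linalg.det(ORTHO))              # zero: the map is singular
print("rank:", np.linalg.matrix_rank(ORTHO))      # one direction is collapsed
print("same image:", np.array_equal(ORTHO @ near, ORTHO @ far))
```

Because both points map to the same output, no inverse matrix could tell them apart, which is the irreversibility described above.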

In homogeneous coordinates, dropping $z$ while keeping the points well-formed is a $4\times 4$ matrix with a zeroed third row (so $z$ does not propagate) — though in practice one simply selects the $x$ and $y$ rows. Let's render a unit cube under orthographic projection after rotating it so we can see its three-dimensionality.

# Orthographic projection of a rotated unit cube: rotate, then drop z.
import numpy as np
def Ry(t): c,s=np.cos(t),np.sin(t); return np.array([[c,0,s,0],[0,1,0,0],[-s,0,c,0],[0,0,0,1]],float)
def Rx(t): c,s=np.cos(t),np.sin(t); return np.array([[1,0,0,0],[0,c,-s,0],[0,s,c,0],[0,0,0,1]],float)
cube = np.array([[0,1,1,0,0,1,1,0],   # x of 8 corners (bottom face then top)
                 [0,0,1,1,0,0,1,1],   # y
                 [0,0,0,0,1,1,1,1],   # z
                 [1,1,1,1,1,1,1,1]], float)  # homogeneous
M = Rx(np.radians(20)) @ Ry(np.radians(30))  # tilt to reveal 3D
rotated = M @ cube
screen_xy = rotated[:2, :]            # orthographic: keep x, y; drop z
print(np.round(screen_xy[:, :4], 4))  # first four projected corners
[[0.     0.866  0.866  0.    ]
 [0.     0.171  1.1107 0.9397]]

Each column is a cube corner flattened to the screen; the depth ($z$) has been discarded after the rotation tilted the cube so that opposite faces no longer overlap. We will connect the edges and draw this as Figure 12.2 in §12.8.

12.7.2 Perspective projection: farther means smaller

Orthographic projection looks flat and unnatural because real vision has perspective: things farther away look smaller, and parallel lines (think of railway tracks) appear to converge toward a vanishing point on the horizon. The geometry behind this is the pinhole camera, and it is gloriously simple. Place an image plane at distance $d$ in front of the eye. A point at depth $z$ in front of the eye, at height $y$, then lands on that plane at screen height $d\,y/z$. The farther away (larger $z$), the smaller the image. Similar triangles give the whole story:

$$(x, y, z) \;\longmapsto\; \left(\frac{d\,x}{z},\ \frac{d\,y}{z}\right).$$

That division by $z$ is what shrinks distant things. And here is the moment the homogeneous machinery earns its keep spectacularly: a division is not a linear operation, so it seems we are stuck again — but homogeneous coordinates turn the division into a final normalization step we already have. We use a projection matrix whose bottom row copies $z$ into the homogeneous coordinate $w$, instead of the usual $\begin{bmatrix}0&0&0&1\end{bmatrix}$:

$$P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0\end{bmatrix} \quad\text{(taking } d = 1\text{).}$$

Watch what it does to the homogeneous point $(x, y, z, 1)$:

$$P\begin{bmatrix} x \\ y \\ z \\ 1\end{bmatrix} = \begin{bmatrix} x \\ y \\ z \\ z\end{bmatrix}.$$

The output's last coordinate is now $w = z$, not $1$. To read off the actual 2D point we do what homogeneous coordinates always demand — divide the other coordinates by the last one (the perspective divide) — and out falls $(x/z,\ y/z)$, exactly the similar-triangles formula. The division we "could not do with a matrix" was hiding in the normalization step that homogeneous coordinates required all along.

The Key Insight — Perspective projection is a linear map in homogeneous coordinates followed by the perspective divide. The matrix places $z$ into the $w$ slot; dividing through by $w = z$ at the end produces the $1/z$ shrinking that makes distant objects small. This is why graphics uses $4\times 4$ homogeneous matrices and not just a $3\times 3$ matrix plus a translation vector: the same fourth coordinate that gave translation a home also delivers perspective for free, as a division you were going to perform anyway.

# Gentle perspective: a point at distance z projects with a 1/z shrink.
import numpy as np
P = np.array([[1,0,0,0],
              [0,1,0,0],
              [0,0,1,0],
              [0,0,1,0]], float)        # bottom row copies z into w; d = 1
for z in (2, 5):
    homog = P @ np.array([1, 1, z, 1.0])         # same (x,y)=(1,1), different depth
    screen = homog[:2] / homog[3]                # perspective divide by w = z
    print(f"z={z}: w={homog[3]:.0f}, screen point = {np.round(screen, 4)}")
z=2: w=2, screen point = [0.5 0.5]
z=5: w=5, screen point = [0.2 0.2]

The same $(x, y) = (1, 1)$ appears at $(0.5, 0.5)$ when it is at depth $2$, but shrinks to $(0.2, 0.2)$ when pushed back to depth $5$: closer to the screen's center, smaller, exactly as a receding object should. That is perspective, produced by one matrix and one division.
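The matrix above fixed $d = 1$. One way to fold a general image-plane distance into the matrix, offered as an assumption since the text only shows $d = 1$, is to put $1/d$ in the bottom row so that $w = z/d$. The helper names `perspective` and `project` are hypothetical.

```python
import numpy as np

def perspective(d):
    """Pinhole projection onto an image plane at distance d (depth = +z)."""
    return np.array([[1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 1, 0],
                     [0, 0, 1.0 / d, 0]], float)   # bottom row gives w = z / d

def project(point_xyz, d):
    """Apply the matrix, then perform the perspective divide by w."""
    homog = perspective(d) @ np.append(point_xyz, 1.0)
    return homog[:2] / homog[3]

x, y, z, d = 1.0, 1.0, 4.0, 2.0
print("matrix + divide  :", project([x, y, z], d))
print("similar triangles:", (d * x / z, d * y / z))
```

Both lines should agree, since dividing $(x, y)$ by $w = z/d$ is the same as multiplying by $d/z$.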

Geometric Intuition — Two parallel rails running away from you, at $(\pm 1, 0, z)$ for growing $z$, project to $(\pm 1/z, 0)$ — which both march toward $(0,0)$ as $z\to\infty$. The rails converge to a single vanishing point at the center of the image, precisely the railway-track illusion. Orthographic projection would keep them a fixed distance $2$ apart forever (no convergence), which is why orthographic drawings look flat and perspective drawings look real. The convergence is the $1/z$ at work.
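The convergence can be tabulated directly. This sketch reuses the $d = 1$ perspective matrix and measures the on-screen gap between the two rails as they recede; the helper name `rail_gap` is illustrative.

```python
import numpy as np

# Perspective matrix with d = 1: the bottom row copies z into w.
P = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 1, 0]], float)

def rail_gap(z):
    """Projected horizontal distance between the rails at depth z."""
    left = P @ np.array([-1.0, 0.0, z, 1.0])
    right = P @ np.array([1.0, 0.0, z, 1.0])
    return right[0] / right[3] - left[0] / left[3]

for z in (1, 10, 100, 1000):
    print(f"z={z:>4}: perspective gap = {rail_gap(z):.4f}, orthographic gap = 2")
```

The perspective gap is $2/z$, so it shrinks toward zero, while the orthographic gap stays at $2$ at every depth.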

Common Pitfall — Forgetting the perspective divide, or dividing by zero. The perspective matrix alone does not give you screen coordinates; you must divide by the resulting $w$ afterward. Skipping the divide leaves objects un-shrunk and the scene looking orthographic-but-wrong. And if a point has $z = 0$ (it sits exactly at the eye/camera plane), then $w = 0$ and the division blows up: that point is "infinitely far" on the screen. Real renderers handle this with a near clipping plane that discards anything closer than some small $z > 0$ before the divide, which is one reason graphics has near and far clip planes at all. (The full projection matrices in OpenGL/Direct3D also remap $z$ into a $[-1,1]$ or $[0,1]$ depth range for the depth buffer; we are showing the gentle, conceptual core.)
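A simplified sketch of the near-plane guard: reject points whose $w$ falls below a threshold, then divide only the survivors. The threshold `NEAR` is an assumed value, and per-point rejection is a simplification; real clippers cut whole triangles against the plane rather than dropping individual vertices.

```python
import numpy as np

NEAR = 0.1   # assumed near-plane distance, chosen only for illustration

P = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 1, 0]], float)        # d = 1 perspective: w = z

# One homogeneous point per column, all with (x, y) = (1, 1).
points = np.array([[1, 1, 0.0, 1],         # exactly at the eye plane: w = 0
                   [1, 1, 0.05, 1],        # in front, but closer than NEAR
                   [1, 1, 2.0, 1],
                   [1, 1, 5.0, 1]], float).T

clip = P @ points
keep = clip[3] >= NEAR                     # test w BEFORE dividing
screen = clip[:2, keep] / clip[3, keep]    # divide only the surviving points
print("kept:", keep)
print(screen)
```

Doing the test before the divide means the $w = 0$ point never reaches the division at all.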

12.8 What is the rendering pipeline, and how does one point travel through it?

We now have every piece. Assembling them in order gives the rendering pipeline — the fixed sequence of coordinate systems a vertex passes through, each transition a matrix multiply, on its way from a model's private coordinates to a pixel on your screen. This is the organizing skeleton of all 3D graphics, and it is nothing but a composition of the matrices we built.

The four transitions between coordinate spaces, and the matrix that performs each:

  1. Model (local) space → World space, via the model matrix $M$ (§12.5). Each object is authored centered at its own origin; $M = TRS$ places it (size, orientation, position) into the shared world.
  2. World space → Camera (view) space, via the view matrix $V$. The camera sits somewhere in the world looking in some direction; the view matrix transforms the whole world so that the camera is at the origin looking down a fixed axis. Crucially, $V$ is the inverse of the camera's own placement (Chapter 9's inverse, used in earnest): if the camera is positioned in the world by a matrix $C$ — translate it to its spot, rotate it to face its direction — then the view matrix is $V = C^{-1}$, which undoes that placement and so drags the entire world into the camera's frame. To bring the world in front of a camera that sits at $(0,0,5)$, you translate the world by $(0,0,-5)$; to account for a camera that has turned, you also apply the inverse of its rotation. Moving the camera right is the same as moving the world left — there is no separate "camera object" in the math, only a matrix that re-coordinates everything relative to the viewer. This is one of the most conceptually slippery points in graphics, and it is pure Chapter 9: the view transform is the inverse of the camera transform, because seeing the world from the camera is undoing the act of placing the camera in the world.
  3. Camera space → Clip space, via the projection matrix $P$ (§12.7), orthographic or perspective. After the perspective divide this yields normalized device coordinates, a standard cube the hardware understands.
  4. Clip/NDC → Screen space, via the viewport transform, a final scale-and-translate that stretches the normalized coordinates to your window's pixel width and height (and flips $y$, since screens count rows downward — the very $y$-flip from Chapter 7's Case Study 1).
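The claim in step 2 that $V = C^{-1}$ can be checked numerically. In this sketch the camera placement $C$ (a $30°$ turn about $y$, then a move to $(0,0,5)$) is an assumed example, and `translation3` and `rotation_y` are small helpers written for the illustration.

```python
import numpy as np

def translation3(tx, ty, tz):
    M = np.eye(4)
    M[:3, 3] = (tx, ty, tz)
    return M

def rotation_y(deg):
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]], float)

C = translation3(0, 0, 5) @ rotation_y(30)             # place the camera in the world
V = np.linalg.inv(C)                                   # view matrix undoes that placement
V_closed = rotation_y(-30) @ translation3(0, 0, -5)    # inverse of a product reverses the order

camera_in_world = C @ np.array([0, 0, 0, 1.0])
print("camera position in world:", camera_in_world[:3])
print("V matches R^-1 T^-1     :", np.allclose(V, V_closed))
print("camera in its own frame :", np.allclose(V @ camera_in_world, [0, 0, 0, 1]))
```

For a camera that has only been translated, the same inverse reduces to the opposite translation, which is the "camera at $(0,0,5)$, world moves by $(0,0,-5)$" case described in step 2.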

The Key Insight — The whole pipeline is one composition: $\mathbf p_{\text{screen}} \sim V_{\!port}\, P\, V\, M\, \mathbf p_{\text{model}}$, with a perspective divide after $P$. A vertex is multiplied by model, then view, then projection (then divided, then mapped to the viewport). Change $M$ and the object moves; change $V$ and the camera moves; change $P$ and the lens changes. Every frame of every 3D scene you have ever watched is this chain evaluated for every vertex.

Let's trace a single point all the way through, with numbers, to make the abstract chain concrete. Put a model's origin at world position $(2, 0, 0)$ (so $M$ is a translation), set the camera at $(0,0,5)$ looking toward $-z$ (so $V$ translates the world by $(0,0,-5)$), and watch the point's coordinates change at each stage.

# Trace one point: model -> world -> camera. (Projection follows as in 12.7.)
import numpy as np
def translation3(tx, ty, tz):
    M = np.eye(4); M[0,3], M[1,3], M[2,3] = tx, ty, tz; return M
model_pt = np.array([0, 0, 0, 1.0])        # the model's own origin
M = translation3(2, 0, 0)                  # model matrix: place model at world (2,0,0)
V = translation3(0, 0, -5)                 # view matrix: camera at (0,0,5) looking -z
world = M @ model_pt
camera = V @ world
print("model  space:", model_pt)
print("world  space:", world)             # (2, 0, 0): out in the world
print("camera space:", camera)            # (2, 0, -5): 5 units in front of the eye
model  space: [0. 0. 0. 1.]
world  space: [2. 0. 0. 1.]
camera space: [ 2.  0. -5.  1.]

The model's origin sits at its own $(0,0,0)$, is placed by $M$ at world $(2,0,0)$, and is then seen by the camera at $(2, 0, -5)$: two units to the right and five units in front (negative $z$ is "into the screen" in our convention). To finish, feed that camera-space point through the perspective matrix of §12.7 and divide, taking the depth as $-z = 5$ because §12.7 measured depth as a positive distance in front of the eye; the result is its final screen position. The point itself never moved; the matrices moved the coordinate systems around it.
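A sketch that carries the same point the rest of the way, under stated assumptions: the camera looks down $-z$, so the projection copies $-z$ (a positive depth) into $w$; normalized coordinates run over $[-1, 1]$; and the window is an assumed $800 \times 600$ pixels. The viewport step is written as plain arithmetic rather than as a matrix.

```python
import numpy as np

def translation3(tx, ty, tz):
    M = np.eye(4)
    M[:3, 3] = (tx, ty, tz)
    return M

M = translation3(2, 0, 0)        # model matrix: place the model at world (2, 0, 0)
V = translation3(0, 0, -5)       # view matrix: camera at (0, 0, 5) looking down -z
# Perspective for a camera looking down -z: the bottom row puts -z into w.
P = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, -1, 0]], float)

WIDTH, HEIGHT = 800, 600         # assumed window size in pixels

clip = P @ V @ M @ np.array([0, 0, 0, 1.0])        # model origin, all three matrices
ndc = clip[:2] / clip[3]                           # perspective divide by w = 5
pixel = np.array([(ndc[0] + 1) / 2 * WIDTH,        # [-1, 1] -> [0, WIDTH]
                  (1 - ndc[1]) / 2 * HEIGHT])      # flip y: rows count downward
print("clip :", clip)
print("ndc  :", ndc)
print("pixel:", pixel)
```

Under these assumptions the point lands right of center, at normalized coordinates $(0.4, 0)$ and roughly pixel $(560, 300)$.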

Now let's render an actual wireframe — the complete §12.7.1 cube, with its edges drawn — so you can see the pipeline produce a picture rather than a list of numbers. We rotate the cube, project it orthographically, and connect the twelve edges.

# Render a wireframe cube: rotate in 3D, project orthographically, draw edges.
import numpy as np, matplotlib.pyplot as plt
def Ry(t): c,s=np.cos(t),np.sin(t); return np.array([[c,0,s,0],[0,1,0,0],[-s,0,c,0],[0,0,0,1]],float)
def Rx(t): c,s=np.cos(t),np.sin(t); return np.array([[1,0,0,0],[0,c,-s,0],[0,s,c,0],[0,0,0,1]],float)

# 8 cube corners (homogeneous): bottom face z=0, top face z=1
cube = np.array([[0,1,1,0,0,1,1,0],
                 [0,0,1,1,0,0,1,1],
                 [0,0,0,0,1,1,1,1],
                 [1,1,1,1,1,1,1,1]], float)
edges = [(0,1),(1,2),(2,3),(3,0),       # bottom square
         (4,5),(5,6),(6,7),(7,4),       # top square
         (0,4),(1,5),(2,6),(3,7)]       # vertical pillars
M = Rx(np.radians(20)) @ Ry(np.radians(30))
rotated = M @ cube                      # rotate all 8 corners at once
xy = rotated[:2, :]                     # orthographic projection: drop z
fig, ax = plt.subplots(figsize=(5, 5))
for a, b in edges:
    ax.plot([xy[0,a], xy[0,b]], [xy[1,a], xy[1,b]], "C0-", lw=2)
ax.set_aspect("equal"); ax.grid(True, alpha=0.3)
ax.set_title("Wireframe cube (rotated 30 deg about y, 20 deg about x)")
plt.show()

Figure 12.1. A wireframe unit cube, rotated and orthographically projected. The eight corners have been rotated $30°$ about the $y$-axis and $20°$ about the $x$-axis, then flattened by dropping the $z$-coordinate, and the twelve edges drawn as line segments. The result is the familiar "cube drawn on paper" — two offset squares joined by four pillars — with the front and back faces visibly separated by the tilt. Alt-text: A hexagonal silhouette of a cube drawn in line segments, showing a front square and a back square connected by four edges, the classic wireframe-cube appearance produced by rotating a cube and projecting it to 2D.

Geometric Intuition — Notice we drew a 3D solid using nothing but the operations of this chapter: build the cube's corners as vectors, multiply them by rotation matrices, drop a coordinate to project, and connect the dots. No special "3D graphics" magic — just matrices transforming points and a projection flattening them. The leap from this wireframe to a photorealistic game frame is more of the same (more vertices, perspective instead of orthographic, plus shading and texturing that are themselves more linear algebra), not something categorically different. You now understand the load-bearing core.

Real-World Application — Augmented reality placing a virtual object in your room (mobile AR / computer vision). When an AR app drops a virtual chair onto your real floor, it must compute, every frame, the matrix that aligns the virtual object with the camera's current view of the real world. The phone's tracking system estimates the camera's pose (its position and orientation in the room) and from that builds exactly a view matrix $V$; the chair's placement on the floor is a model matrix $M$; the phone's lens defines a projection matrix $P$. The virtual chair's vertices run through the identical $P V M$ pipeline as a game's, which is why the chair appears to sit convincingly on your floor and grows or shrinks correctly as you walk toward or away from it. AR is the rendering pipeline with a camera matrix estimated from the real world instead of chosen by a level designer: the same linear algebra, sourced differently. Visualizing such 3D scenes and point clouds is also where 3D plotting in Python comes in: the projection-to-screen math you just learned is what a 3D plotting library performs under the hood to draw axes and surfaces on a flat figure.

FAQ: What is the difference between world space, camera space, and screen space?

They are the same points described in three successively more convenient coordinate systems, with a matrix translating each description into the next. World space is the shared global stage where every object has an agreed-upon position — like map coordinates for a whole city. Camera (view) space re-expresses everything relative to the camera, with the camera at the origin looking down a fixed axis, so that "in front of me" and "to my right" become simple coordinate signs; you get there by applying the inverse of the camera's placement. Screen space is the final 2D pixel grid of your actual window, after projection has flattened the third dimension and the viewport transform has scaled to your resolution. The genius of the pipeline is that each transition is a matrix multiply, so the whole journey from a model's private coordinates to a pixel is one composed transformation — the central lesson of Part II, applied end to end.

12.9 What are homogeneous coordinates really? (Points and directions)

We close with a deeper look at the device that powered the whole chapter, because it rewards a second pass and it connects forward. We said a 2D point becomes $(x, y, 1)$, and that scaling all three by a nonzero constant gives the same point: $(x, y, 1) \sim (cx, cy, c)$. The general rule is that a homogeneous triple $(X, Y, W)$ with $W \ne 0$ represents the 2D point $(X/W,\ Y/W)$. The $w = 1$ slice we used is just the canonical choice; the perspective divide of §12.7 is exactly this $(X/W, Y/W)$ recovery, with $W$ no longer equal to $1$.

So what about the triples with $W = 0$? You cannot divide by zero, so $(X, Y, 0)$ does not correspond to any finite point. Geometrically, it represents a point at infinity — a pure direction rather than a location. As you slide a real point $(X/W, Y/W)$ off toward infinity along a direction by letting $W \to 0$, the homogeneous representative approaches $(X, Y, 0)$. This is why homogeneous coordinates so naturally describe perspective's vanishing points: the place where parallel lines "meet" is a point at infinity, $(X, Y, 0)$, that the perspective transform maps to an honest, finite location on your screen.
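A short sketch of these three facts: scaled triples name the same point, a $W = 0$ triple ignores translation, and the perspective matrix of §12.7 sends a direction to a finite vanishing point. The sample coordinates and the helper name `dehomogenize` are arbitrary illustrations.

```python
import numpy as np

def dehomogenize(h):
    """Recover the ordinary point from a homogeneous vector with last entry != 0."""
    return h[:-1] / h[-1]

# (X, Y, W) and any nonzero multiple represent the same 2D point.
print(dehomogenize(np.array([2.0, 5.0, 1.0])), dehomogenize(np.array([4.0, 10.0, 2.0])))

T = np.array([[1, 0, 3], [0, 1, 1], [0, 0, 1]], float)   # 2D translation by (3, 1)
point = np.array([2.0, 5.0, 1.0])        # W = 1: a location
direction = np.array([2.0, 5.0, 0.0])    # W = 0: a pure direction
print(T @ point)                         # the location moves
print(T @ direction)                     # the direction is untouched by translation

# 3D: the direction "straight away from the eye" is the point at infinity (0, 0, 1, 0).
P = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]], float)
image = P @ np.array([0.0, 0.0, 1.0, 0.0])
print(dehomogenize(image)[:2])           # finite vanishing point on the screen
```

The last line gives $(0, 0)$, the same image-center vanishing point the railway rails approached in §12.7.2.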

The Key Insight — Homogeneous coordinates separate points (last coordinate nonzero) from directions / points at infinity (last coordinate zero), and they make both first-class citizens of the same matrix algebra. This is the entry to projective geometry, where "parallel lines meet at infinity" stops being a figure of speech and becomes a coordinate you can compute with. The single extra coordinate that let translation be a matrix, and let perspective be a division, also unifies the finite and the infinite.

Historical Note — Homogeneous coordinates were introduced by August Ferdinand Möbius in his 1827 work Der barycentrische Calcul and developed within nineteenth-century projective geometry by mathematicians including Julius Plücker [verify]. They long predate computers; graphics simply rediscovered that this old projective device is the perfect engine for cameras and transformations. The marriage of homogeneous coordinates with the $4\times 4$ matrix pipeline became standard in computer graphics through the 1960s–70s, notably in the work at the University of Utah and in systems descending from Ivan Sutherland's Sketchpad [verify]. The mathematics was a century old before the application caught up.

Computational Note — In real-time graphics, all of this runs in $32$-bit floating point on the GPU, and the perspective divide plus the limited precision of the depth buffer cause a famous artifact called z-fighting, where two nearly-coplanar surfaces flicker because their projected depths round to the same value. The fixes (a carefully chosen near/far plane ratio, or a reversed/logarithmic depth buffer) are pure numerical-conditioning concerns — the same floating-point care we devote a whole chapter to in Chapter 38. Even here, in the most visual corner of the subject, the theme holds: theory tells you what to compute, and numerical awareness tells you how to compute it reliably.

12.10 Build Your Toolkit

This chapter's toolkit contribution seeds the 3D-render option for the Chapter 39 capstone. You will implement the homogeneous transform builders for 2D — the exact functions this chapter is built on — and compose them, verifying against numpy. They are small, but they are the literal nucleus of a renderer: with translation, rotation, and scaling returning $3\times 3$ homogeneous matrices, plus the matrix multiplication you already wrote in Chapter 8, you can place and move any 2D shape, and the 3D versions are the same idea one dimension up.

Build Your Toolkit — In toolkit/capstone/render3d.py, implement three homogeneous transform builders for 2D, returning $3\times 3$ matrices as lists of rows (pure Python; no numpy in the implementation):

  • translation(tx, ty): returns $\begin{bmatrix}1 & 0 & t_x\\ 0 & 1 & t_y \\ 0 & 0 & 1\end{bmatrix}$.
  • rotation(theta): returns the homogeneous rotation by theta radians (use Python's math.cos, math.sin); top-left block $\begin{bmatrix}\cos\theta & -\sin\theta\\ \sin\theta & \cos\theta\end{bmatrix}$.
  • scaling(sx, sy): returns the homogeneous diagonal scaling.

Then compose them with your Chapter 8 matmul to build a model matrix M = matmul(T, matmul(R, S)), and apply it to the corners of the unit square (each corner lifted to a homogeneous $(x, y, 1)$). Verify against numpy: confirm your M equals translation(...) @ rotation(...) @ scaling(...) built with np.array, and that a known transform behaves correctly — e.g. rotation(math.pi/2) applied to $(1, 0, 1)$ returns approximately $(0, 1, 1)$, and translation(3, 1) applied to $(1, 0, 1)$ returns $(4, 1, 1)$. A reference skeleton:

```python
# toolkit/capstone/render3d.py: homogeneous 2D transform builders (from scratch).
import math

def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def scaling(sx, sy):
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]

# compose with your Chapter 8 matmul:
#   M = matmul(translation(...), matmul(rotation(...), scaling(...)))
```

A quick numpy verification you can run against the from-scratch builders:

# Verify the from-scratch homogeneous builders against numpy.
import numpy as np
from toolkit.capstone.render3d import translation, rotation, scaling
import math
M_scratch = np.array(translation(3, 1)) @ np.array(rotation(math.radians(45))) @ np.array(scaling(2, 2))
print(np.round(M_scratch, 4))
print("rotate (1,0) by 90:", np.round(np.array(rotation(math.radians(90))) @ np.array([1, 0, 1.0]), 4))
[[ 1.4142 -1.4142  3.    ]
 [ 1.4142  1.4142  1.    ]
 [ 0.      0.      1.    ]]
rotate (1,0) by 90: [0. 1. 1.]

The composed model matrix matches the one we computed by hand in §12.5, and a $90°$ rotation sends $(1,0)$ to $(0,1)$ as it must. You now own the core of a transform engine — the same translation, rotation, scaling, and compose-into-a-model-matrix machinery that a real renderer runs millions of times per frame.
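To apply the model matrix to the unit square without numpy, something like the following would do. The `matmul` here is only a stand-in for your Chapter 8 version (its list-of-rows signature is an assumption), and `apply` is a hypothetical helper name; the builders are repeated so the sketch runs on its own.

```python
import math

def translation(tx, ty):
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]

def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def scaling(sx, sy):
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]

def matmul(A, B):
    """Stand-in for the Chapter 8 matmul: matrices are lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def apply(M, x, y):
    """Lift (x, y) to the column (x, y, 1), multiply, and drop the trailing 1."""
    col = matmul(M, [[x], [y], [1.0]])
    return (col[0][0], col[1][0])

# Same model matrix as the numpy check: scale by 2, rotate 45 degrees, move by (3, 1).
M = matmul(translation(3, 1), matmul(rotation(math.radians(45)), scaling(2, 2)))
for corner in [(0, 0), (1, 0), (1, 1), (0, 1)]:
    print(corner, "->", tuple(round(v, 4) for v in apply(M, *corner)))
```

The corner $(0, 0)$ lands at the translation offset $(3, 1)$, since scaling and rotation both fix the origin and only the final translation moves it.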

Computational Note — Real engines store these as $4\times 4$ matrices and let the GPU multiply thousands of vertices by them in parallel each frame; your pure-Python builders are for understanding the structure, not for speed. The capstone (Chapter 39) extends these 2D builders to the full 3D $4\times 4$ set — the $R_x, R_y, R_z$ rotations and a perspective matrix from this chapter — to render a rotating wireframe object end to end. The lesson is the recurring one: implement once from scratch to know what the operation is, then lean on optimized libraries to do it fast.

12.11 What did we just learn, and where does it go?

Step back and see the arc. Part II taught that a matrix is a function that transforms space; this chapter ran that idea on a moving 3D world and put pixels on a screen. The one genuinely new tool was homogeneous coordinates — append a $1$ to every point — and it solved the problem that had shadowed us since Chapter 7: translation, which fixes nothing and so cannot be linear, becomes a matrix multiply once you lift to one higher dimension. With that, rotation, scaling, and translation all live in one matrix type and compose by ordinary multiplication.

  • Translation is not linear (it moves the origin), so it cannot fit in an $n\times n$ matrix — but in homogeneous coordinates (§12.2) it becomes the last column of an $(n{+}1)\times(n{+}1)$ matrix, paying off the promise from Chapters 1 and 3.
  • We rebuilt rotation, scaling, and translation in $3\times 3$ (2D) and $4\times 4$ (3D) homogeneous form (§12.3–§12.4), with the three coordinate-axis rotations $R_x, R_y, R_z$ in 3D.
  • We composed them into one model matrix $M = TRS$ (§12.5) and saw, with numbers, why order matters ($TR \ne RT$) — Chapter 8's non-commutativity with screen consequences — and why row-vs-column conventions must never be mixed (§12.6).
  • We projected 3D to 2D two ways (§12.7): orthographic (drop a coordinate; parallel lines stay parallel) and perspective (the $1/z$ shrink, delivered by a projection matrix plus the perspective divide).
  • We assembled the full rendering pipeline — model → world → camera → projection → screen (§12.8) — traced one point through it, and rendered a wireframe cube with matplotlib.

The recurring themes are all present. Linear algebra is the study of transformations (theme 1): the entire pipeline is a composition of transformations, and the object never moves — the coordinate systems do. Geometry and algebra are one object (theme 2): "place the cube and tilt it toward the camera" and "multiply these vertices by $P V M$" are the same act. Computation validates theory (theme 3): every matrix we wrote produced numbers we confirmed in numpy, and your toolkit now seeds a real renderer. And this is the most applied possible face of the subject (theme 4): the same transformation matrices serve games, film, CAD, and AR.

Where does it go? Immediately, Part II is complete, and Part III turns from "what does a transformation do?" to "what can it reach, and what does it destroy?" — the four fundamental subspaces (Chapters 13–14), where the projection that flattened our cube's depth becomes the column space and null space, and the "translate to origin, rotate, translate back" sandwich of §12.6 grows into change of basis (Chapter 16). The rotations of this chapter return, generalized, as orthogonal matrices in Chapter 21. And the projection idea — collapsing onto a lower-dimensional shadow — becomes one of the most powerful tools in all of applied mathematics when we meet it again as least-squares projection (Chapters 17, 19) and principal component analysis (Chapter 32). You have been doing serious linear algebra by drawing a cube; the same machinery is about to reorganize data, fit models, and run search engines.

FAQ: Is the rendering pipeline really just matrix multiplication?

At its geometric heart, yes — and that is the remarkable thing this chapter exists to show. The journey from a 3D model to a 2D image is a fixed sequence of coordinate changes, each one a matrix multiply ($M$ to place the object, $V$ to view it, $P$ to project it, plus a viewport scale), with a single division (the perspective divide) thrown in. Real engines layer enormous additional machinery on top — shading, texturing, lighting, depth testing, clipping, anti-aliasing — and much of that is more linear algebra still (dot products for lighting, matrices for texture coordinates). But the spine, the part that decides where every point lands on your screen, is exactly the composition of transformations you built in Part II. A matrix is a function that transforms space; chain enough of them together, and you get the moving worlds on every screen you own.