Case Study 40.1 — Tensors in Deep Learning: Why the Libraries Are Named for Them

DataField.Dev

Field: Machine learning / AI. Anchor tie-in: the forward look to where linear algebra goes next; the matrix-as-transformation idea (Chapter 7) and the SVD's rank-one decomposition (Chapter 30), lifted to higher order.

The puzzle: why "TensorFlow," and not "MatrixFlow"?

Of the two most widely used deep-learning libraries, TensorFlow carries the word in its name and PyTorch makes the tensor its central object. In both, the word "tensor" was a deliberate choice, not marketing. A beginner reasonably asks: aren't neural networks just a lot of matrix multiplications? Why the fancier word? The answer is a clean illustration of §40.2's main point: real data has more than two natural axes, and a matrix has room for only two. This case study follows a small piece of data through a tiny piece of a network and shows the tensor, the contraction, and the honest caveats in action.

Step 1 — Data has more than two axes

Start with the simplest non-trivial example: a batch of three short sentences, each four words long, where every word is represented by a learned vector of length 5 (an "embedding," Chapter 33). What shape is this data?

One word is a vector (length 5) — order 1.
One sentence is a stack of word-vectors, a matrix (4 words × 5 features) — order 2.
A batch of three sentences is a stack of matrices — a $3\times 4\times 5$ block, order 3. A tensor.

You cannot store this batch as a matrix without throwing away one of its three meanings (which sentence, which position, which feature). The tensor is not decoration; it is the smallest object that holds the data's actual structure. Add a fourth axis — say, multiple attention "heads," or the height/width/channel/batch of images — and you are at order 4 routinely. This is the everyday reason the libraries traffic in tensors.
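The shape bookkeeping above can be checked directly. The following is a minimal sketch, assuming NumPy and random numbers standing in for learned embeddings (the variable names are illustrative, not from the chapter). It also shows what "throwing away a meaning" looks like in practice: flattening keeps all 60 numbers but merges two axes into one.

```python
import numpy as np

# Hypothetical toy data: 3 sentences, 4 word positions, 5 embedding features.
# Random values stand in for learned embeddings.
rng = np.random.default_rng(0)
batch = rng.standard_normal((3, 4, 5))

word = batch[0, 0]       # one word: a vector (order 1)
sentence = batch[0]      # one sentence: a matrix (order 2)
print(word.shape, sentence.shape, batch.shape)   # (5,) (4, 5) (3, 4, 5)
print(word.ndim, sentence.ndim, batch.ndim)      # 1 2 3

# Flattening to a matrix keeps every number but fuses "which sentence"
# and "which position" into a single row index.
flat = batch.reshape(12, 5)

# Row 7 is sentence 1, position 3, but only because we remember 7 = 1*4 + 3.
# The matrix itself no longer records that structure.
same_row = np.array_equal(flat[7], batch[1, 3])
print(same_row)   # True
```

The reshape is reversible only if the original sizes are stored somewhere else, which is the sense in which the matrix alone has lost one of the three meanings.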

Step 2 — The forward pass is a chain of contractions

Now watch the network compute. The fundamental operation, repeated billions of times, is the contraction of §40.2 (summing the product over a shared index), which is exactly the matrix multiplication of Chapter 8 applied across the extra axes at once. Here is the heart of a "self-attention" layer, the mechanism behind modern language models, on an even smaller example than Step 1's: one sentence of three tokens, each represented by a length-4 vector.

# Mini self-attention: contractions all the way down. One sentence, 3 tokens, d=4.
import numpy as np
rng = np.random.default_rng(7)
Q = rng.standard_normal((3, 4))     # queries: one length-4 vector per token
K = rng.standard_normal((3, 4))     # keys
V = rng.standard_normal((3, 4))     # values
scores = np.einsum('id,jd->ij', Q, K)          # contract over d -> 3x3 token-vs-token
ex = np.exp(scores - scores.max(1, keepdims=True))
weights = ex / ex.sum(1, keepdims=True)        # softmax: each row sums to 1
out = np.einsum('ij,jd->id', weights, V)        # contract weights with values -> 3x4
print(scores.shape, out.shape)        # (3, 3) (3, 4)
print(np.round(weights.sum(1), 6))    # [1. 1. 1.] -- attention weights are a distribution

Read what happened. The first einsum contracts the query and key tensors over the feature index d, producing a $3\times 3$ matrix of "how much does token $i$ attend to token $j$." After a softmax (so each row is a probability distribution that sums to 1), the second einsum contracts those weights with the value vectors, producing a new $3\times 4$ representation where each token is a weighted blend of all tokens' values. Both steps are contractions — the matrix–vector product of Chapter 7, grown extra slots. Stack dozens of these, add the batch axis (a leading index every einsum carries along), and you have the computational core of a transformer. The headline-grabbing "attention" is, underneath, the inner product of Chapter 18 (the scores are dot products of query and key vectors) followed by a projection-like weighted sum.
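To make "add the batch axis" concrete, here is a hedged sketch of the same computation with a leading batch index `b` carried through both einsums. The sizes `B`, `T`, `D` and the helper `softmax` are illustrative assumptions, not part of the chapter's listing. Like that listing, the sketch omits the $1/\sqrt{d}$ scaling and the learned projection matrices that a full attention layer would include.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(7)
B, T, D = 2, 3, 4                       # hypothetical: 2 sentences, 3 tokens, 4 features
Q = rng.standard_normal((B, T, D))      # queries
K = rng.standard_normal((B, T, D))      # keys
V = rng.standard_normal((B, T, D))      # values

# The batch index b appears on both inputs and on the output, so it is
# carried along rather than summed; only d (then j) is contracted.
scores = np.einsum('bid,bjd->bij', Q, K)        # (B, T, T)
weights = softmax(scores, axis=-1)              # each row sums to 1
out = np.einsum('bij,bjd->bid', weights, V)     # (B, T, D)
print(scores.shape, out.shape)                  # (2, 3, 3) (2, 3, 4)

# Sanity check: identical to running the single-sentence version per batch item.
for b in range(B):
    s = Q[b] @ K[b].T
    assert np.allclose(scores[b], s)
    assert np.allclose(out[b], softmax(s) @ V[b])
```

The loop at the end is only a check. The point of the einsum form is that the batch axis costs one extra letter, not an extra loop.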

The connection to the rest of the book is exact and worth pausing on. The score $\mathbf{q}_i\cdot\mathbf{k}_j$ is a dot product (Chapter 18): it measures alignment between a query and a key, the same intuition as cosine similarity, except that the vectors' lengths are not divided out. The weighted sum of value vectors is a convex combination, a point in the span (Chapter 6) of the values. Nothing in the mechanism is new linear algebra; it is Parts I–IV, arranged on tensors and run at scale.
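Both claims can be checked numerically. The sketch below rebuilds the same `Q`, `K`, `V` as the listing (same seed and draw order) and verifies that each score is a plain dot product, that dividing by the vectors' lengths yields a cosine between -1 and 1, and that every output row is a convex combination of the value vectors. The specific checks are an illustration chosen here, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(7)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))

scores = Q @ K.T                                   # same as einsum('id,jd->ij', Q, K)

# Claim 1: entry (i, j) is the dot product q_i . k_j.
assert np.isclose(scores[0, 2], np.dot(Q[0], K[2]))

# Dividing out the lengths turns the dot product into a cosine in [-1, 1].
lengths = np.linalg.norm(Q, axis=1)[:, None] * np.linalg.norm(K, axis=1)[None, :]
cosine = scores / lengths
assert np.all(np.abs(cosine) <= 1 + 1e-12)

# Claim 2: softmax weights are nonnegative and each row sums to 1,
# so each output row is a convex combination of the rows of V.
ex = np.exp(scores - scores.max(1, keepdims=True))
weights = ex / ex.sum(1, keepdims=True)
out = weights @ V
assert np.all(weights >= 0) and np.allclose(weights.sum(1), 1.0)

# A consequence of convexity: every output coordinate lies between the
# smallest and largest value of that coordinate among the value vectors.
inside = np.all(out >= V.min(0) - 1e-12) and np.all(out <= V.max(0) + 1e-12)
print(inside)   # True
```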

Step 3 — Where the matrix theory generalizes cleanly, and where it does not

The honest part. Some of what you learned transfers to tensors perfectly, and some does not — and a good engineer knows which is which.

Transfers cleanly: contraction (it is just summation over shared indices, in any order); the rank-one building block (the order-3 outer product $t_{ijk}=u_iv_jw_k$ of §40.2 is the natural analog of the rank-one matrix $\mathbf{u}\mathbf{v}^{\mathsf{T}}$ that the SVD sums in Chapter 30); the change-of-basis transformation law (Chapter 16, told for more slots); and the sheer usefulness — tensors compress, factor, and approximate data just as matrices do.
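The rank-one building block is easy to see side by side in code. This is a small sketch with hand-picked vectors `u`, `v`, `w` (illustrative values, not from the text): the matrix $\mathbf{u}\mathbf{v}^{\mathsf{T}}$ and the order-3 tensor $t_{ijk}=u_iv_jw_k$ are built the same way, and flattening the tensor along any one axis gives a matrix of rank 1.

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([1.0, 0.0, -1.0])
w = np.array([2.0, 1.0, 0.0, 3.0])

M = np.outer(u, v)                        # rank-one matrix u v^T, shape (2, 3)
T = np.einsum('i,j,k->ijk', u, v, w)      # rank-one order-3 tensor, shape (2, 3, 4)
print(M.shape, T.shape)                   # (2, 3) (2, 3, 4)

# Each entry is just the product of one entry from each vector.
assert T[1, 2, 3] == u[1] * v[2] * w[3]   # 2 * (-1) * 3 = -6

# Flatten T into a matrix along each axis in turn; every such matrix has rank 1.
unfolding_ranks = []
for mode in range(3):
    unfolded = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
    unfolding_ranks.append(int(np.linalg.matrix_rank(unfolded)))
print(unfolding_ranks)                    # [1, 1, 1]

# Contraction also behaves as expected: summing T against w over k
# returns the rank-one matrix M scaled by w . w.
contracted = np.einsum('ijk,k->ij', T, w)
assert np.allclose(contracted, np.dot(w, w) * M)
```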

Does not transfer cleanly: there is no single notion of "tensor rank" with all the tidy properties of matrix column-space rank; finding the best low-rank tensor approximation is not simply "truncate the SVD" as it was for matrices (Chapter 31); and the SVD's canonical, essentially unique factorization splits, for tensors, into several competing decompositions (CP, a sum of rank-one tensors; Tucker; and tensor-train), each with different strengths and none universally best. The clean theorem-per-question structure of matrix theory thins out in higher order. This is not a flaw in your education; it is the genuine mathematical state of affairs, and §40.2 named it deliberately so you would not be surprised.
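A small experiment makes the contrast tangible. The sketch below is an illustration under stated assumptions, not a tensor-decomposition algorithm: it uses random data, a hypothetical helper `unfold`, and NumPy only. For a matrix, truncating the SVD leaves a spectral-norm error equal to the next singular value, as the Eckart–Young theorem guarantees. For a tensor, flattening along different axes gives matrices of different rank, one concrete sense in which "the rank" stops being a single number. The last part builds a tensor in CP form, as a sum of `R` rank-one terms.

```python
import numpy as np

rng = np.random.default_rng(0)

def unfold(T, mode):
    """Flatten T into a matrix whose rows are indexed by axis `mode`."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

# Matrix case: truncating the SVD after k terms leaves a spectral-norm
# error equal to the (k+1)-th singular value.
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]
spectral_error = np.linalg.norm(A - A_k, 2)
assert np.isclose(spectral_error, s[k])

# Tensor case: one tensor, three unfoldings, three different matrix ranks.
# For a generic 2 x 3 x 4 tensor these are 2, 3 and 4.
X = rng.standard_normal((2, 3, 4))
unfolding_ranks = [int(np.linalg.matrix_rank(unfold(X, m))) for m in range(3)]
print(unfolding_ranks)

# CP form: a sum of R rank-one tensors, written as a single einsum.
R = 2
a, b, c = (rng.standard_normal((n, R)) for n in (3, 4, 5))
T = np.einsum('ir,jr,kr->ijk', a, b, c)
cp_unfolding_ranks = [int(np.linalg.matrix_rank(unfold(T, m))) for m in range(3)]
assert max(cp_unfolding_ranks) <= R    # R rank-one terms bound every unfolding rank
```

The unfolding ranks are the quantity the Tucker decomposition is organized around, while the number of rank-one terms is what CP counts. That these are different numbers for the same tensor is the "no single notion of rank" point in miniature.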

There is also a vocabulary caveat worth carrying into any ML codebase. What deep-learning practitioners call a "tensor" is, almost always, a multidimensional array equipped with contraction — the engineer's definition of §40.2. The deeper structure the word carries in physics (the transformation law that makes a quantity coordinate-independent and physically meaningful) is usually not in play. A PyTorch tensor of activations is not the stress tensor of continuum mechanics; it is a labeled block of numbers. Both are legitimately tensors; they live at different points on the spectrum from "array with contraction" to "coordinate-free multilinear object."
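The difference between "a block of numbers" and "an object with a transformation law" can be sketched for the simplest case, a linear map stored as a matrix. In the example below the matrices `A` and `P` are arbitrary illustrative choices; the columns of `P` are taken to be the new basis vectors. The entries of the matrix change under the change of basis, but the map it represents, and the full contraction of its two slots (the trace), do not.

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 1.0]])     # a linear map, written in the old basis
P = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])     # columns: new basis vectors (det = 3, invertible)
P_inv = np.linalg.inv(P)

# Transformation law for a linear map: one slot transforms with P^{-1},
# the other with P. As a contraction: A'_ij = sum_ab Pinv_ia A_ab P_bj.
A_new = np.einsum('ia,ab,bj->ij', P_inv, A, P)

# The numbers change ...
entries_changed = not np.allclose(A_new, A)

# ... but the map is the same: transform x, apply A_new, transform back.
x = np.array([1.0, -2.0, 0.5])
x_new = P_inv @ x
same_map = np.allclose(P @ (A_new @ x_new), A @ x)

# ... and contracting the two slots (the trace) gives a basis-independent number.
trace_old = np.einsum('ii->', A)
trace_new = np.einsum('ii->', A_new)
print(entries_changed, same_map, np.isclose(trace_old, trace_new))   # True True True
```

An array of activations in a network has no such `P` attached to it. Nothing says how its entries should change if the coordinates did, which is why the engineer's "tensor" sits at the array end of the spectrum.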

What this case study shows

A neural network is not "just matrices," and it is not magic. It is tensors and contractions: the matrix-as-transformation idea of Chapter 7 and the dot product of Chapter 18, lifted to objects with the extra axes that real data demands, and run at enormous scale on hardware built for exactly this operation. The libraries are named for tensors because the tensor is the smallest object that holds the structure of batched, multi-axis data, and contraction is the engine that moves it. The forward look of Chapter 40 promised that the frontier of machine learning would turn out to be linear algebra with more slots; this is that promise, made concrete in a dozen lines of NumPy built around two einsum calls. And the caveats (no clean tensor rank, several competing decompositions, the narrowed engineering meaning of the word) are the chapter's commitment to honesty about advanced fields, demonstrated rather than asserted.