Chapter 33 — Further Reading

DataField.Dev

This chapter is, by design, an overview — a tour of how three families of machine-learning systems rest on the linear algebra of this book, not a course in machine learning itself. Each of the three topics (neural networks, embeddings, recommenders) is a field with textbooks of its own, and the training side (gradients, backpropagation, optimization) is a calculus subject we deliberately deferred. The pointers below are where to go to turn this overview into real depth, organized from "linear-algebra companion" to "dedicated machine-learning text." As always, section numbers follow the most widely circulated editions and may shift between printings.

The three anchor textbooks (the linear-algebra side)

Gilbert Strang, Linear Algebra and Learning from Data (2019). The single best companion to this chapter, by the same author as our main Strang reference. This is Strang's own book on exactly this material: it develops the SVD, low-rank approximation, and PCA, then builds to the linear algebra of deep learning, covering the structure of a network layer, the role of $W\mathbf{x}+\mathbf{b}$, and the backpropagation chain as matrix multiplication. Part III, "Low Rank and Compressed Sensing," is the direct backdrop for the recommender's $R\approx UV^{\mathsf{T}}$; the final part treats neural networks from the matrix viewpoint this chapter takes. If you read one source after this chapter, read this.
Stephen Boyd & Lieven Vandenberghe, Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares (VMLS), Chapters 12–13 (least squares and least-squares data fitting) and the chapter on clustering. The applied, data-first treatment. Its development of the dot product as a similarity measure, of nearest-neighbor classification, and of least-squares model fitting is the natural foundation for the embeddings and recommender material here. Freely and legally downloadable as a PDF from the authors, with companion Python notebooks. Best matched to the CS / data-science learning path.
Sheldon Axler, Linear Algebra Done Right (4th ed.), Chapter 6 (Inner Product Spaces) and Chapter 7 (Operators on Inner Product Spaces). The rigorous complement. Axler's coordinate-free treatment of inner products and orthogonality is the abstract home of the cosine-similarity geometry an embedding lives in, and his spectral-theory chapter underlies the SVD that the recommender approximates. Math majors should pair Chapter 6 with this chapter's §33.5–33.6 and the Math-Major Sidebar tying the factorization to Eckart–Young.
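
The dot-product-as-similarity idea that VMLS develops and that Axler treats abstractly can be made concrete in a few lines. The following is a minimal sketch in plain Python, not taken from any of the books above; the three "embedding" vectors are invented for illustration and carry no real meaning.

```python
import math

def dot(u, v):
    """Standard inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: <u, v> / (||u|| ||v||)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
cat = [0.9, 0.8, 0.1]
kitten = [1.8, 1.6, 0.2]   # same direction as cat, twice the length
car = [0.1, 0.2, 0.9]

print(cosine_similarity(cat, kitten))  # near 1: same direction, length ignored
print(cosine_similarity(cat, car))     # noticeably smaller: different direction
```

Because the inner product is divided by both norms, scaling a vector does not change its similarity to anything else, which is why the measure depends only on direction.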

Dedicated machine-learning texts (the systems side)

Ian Goodfellow, Yoshua Bengio & Aaron Courville, Deep Learning (2016) — the standard deep-learning reference, free online. Chapter 2 is a self-contained linear-algebra review in exactly this book's notation; Chapter 6 develops feedforward networks as compositions of affine transformations and nonlinearities — the formal version of our §33.2–33.4, including why the nonlinearity is essential and the universal approximation results we flagged. Chapters 6–8 cover backpropagation and gradient-based training in full, which is the matrix calculus this chapter deferred. The place to make §33.10 rigorous.
Christopher Bishop, Pattern Recognition and Machine Learning (2006), and the newer Deep Learning: Foundations and Concepts (Bishop & Bishop, 2024). Classic and modern treatments of the probabilistic and linear-algebraic foundations. Bishop's careful derivations of linear models, basis functions, and (in the new book) network architectures and embeddings complement the geometric emphasis here.
Aurélien Géron, Hands-On Machine Learning (3rd ed.). The practitioner's bridge: working code for neural networks, embeddings, and recommender systems, with the linear algebra made concrete in numpy/Keras. Best if you want to run scaled-up versions of this chapter's toy demos. Its chapters on representation learning and on building recommenders extend Case Studies 1 and 2 directly.
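
The claim that the nonlinearity is essential, which Goodfellow et al. make rigorous, can be checked on a toy example. The sketch below is an illustration under assumed weights (the matrices `W1`, `W2` and the function names are hypothetical, chosen so the arithmetic is easy to follow): two stacked affine layers collapse to a single affine map, while inserting ReLU between them yields $|x_1 - x_2|$, which no single affine map can produce.

```python
def matvec(W, x):
    """Matrix-vector product W x, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product A B for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

# Hypothetical weights for a 2 -> 2 -> 1 network, chosen for illustration.
W1, b1 = [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]
W2, b2 = [[1.0, 1.0]], [0.0]

def two_layers(x, activation=None):
    """Apply layer 1, an optional activation, then layer 2."""
    h = add(matvec(W1, x), b1)
    if activation is not None:
        h = activation(h)
    return add(matvec(W2, h), b2)

# Without an activation the two layers equal one affine map W x + b,
# where W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

x = [3.0, 1.0]
print(two_layers(x), add(matvec(W, x), b))  # identical: the layers collapsed
print(two_layers(x, activation=relu))       # relu(x1 - x2) + relu(x2 - x1) = |x1 - x2|
```

With these weights the collapsed matrix `W` is zero, so the purely affine network outputs zero for every input; the ReLU version recovers the absolute difference of the two coordinates.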

On the three applications in this chapter

Neural networks (§33.2–33.4, §33.10). Beyond Deep Learning above, the data-science framing of neural networks as predictive models, and the non-technical account of how models work, complement this chapter's strict linear-algebra view. For the visual intuition of layers warping space, 3Blue1Brown's Neural Networks video series is an excellent geometric companion to §33.2. The universal approximation theorems we marked [verify] trace to Cybenko (1989, sigmoid activations) and Hornik (1991, general activations); see any of the texts above for precise statements and their (important) caveats about width and trainability.
Embeddings (§33.5–33.6, Case Study 2). The foundational word-embedding papers are Mikolov et al. (2013) on word2vec and Pennington et al. (2014) on GloVe; both are readable and show the analogy geometry empirically. For the documented bias in embeddings, where stereotypes appear as directions in the space, see Bolukbasi et al. (2016), "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings," and the broader fairness-in-ML literature. Modern contextual embeddings (where a word's vector depends on its sentence) come from the transformer line of work, which postdates Goodfellow et al.'s Deep Learning (2016); Bishop & Bishop's Deep Learning: Foundations and Concepts and Géron cover the architecture.
Recommender systems (§33.7–33.9, Case Study 1). The canonical reference is Koren, Bell & Volinsky (2009), "Matrix Factorization Techniques for Recommender Systems," written by members of the team that won the Netflix Prize — it lays out the exact $R\approx UV^{\mathsf{T}}$ model with bias terms and regularization that this chapter sketches. Aggarwal, Recommender Systems: The Textbook (2016), is the comprehensive treatment, including the cold-start and implicit-feedback issues we flagged. The connection to the SVD and Eckart–Young is the Chapter 31 thread; the gradient-descent fitting is the optimization thread of the deep-learning texts.
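
As a concrete picture of the $R\approx UV^{\mathsf{T}}$ model with regularization, here is a minimal sketch of fitting the factors by stochastic gradient descent on the observed entries only. It is an illustration, not the procedure from Koren, Bell & Volinsky: the bias terms are omitted, the rating matrix is invented, and the rank `k`, step size `lr`, penalty `reg`, and epoch count are arbitrary assumed values.

```python
import math
import random

# Hypothetical 4-user x 3-item ratings; None marks an unobserved entry.
R = [
    [5, 3, None],
    [4, None, 1],
    [1, 1, 5],
    [None, 1, 4],
]
observed = [(u, i, r) for u, row in enumerate(R)
            for i, r in enumerate(row) if r is not None]

k, lr, reg, epochs = 2, 0.01, 0.02, 2000   # assumed hyperparameters
rng = random.Random(0)                      # fixed seed for repeatability
U = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in R]      # user factors
V = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in R[0]]   # item factors

def predict(u, i):
    """Predicted rating: dot product of user row u of U and item row i of V."""
    return sum(a * b for a, b in zip(U[u], V[i]))

def rmse():
    """Root-mean-square error over the observed entries only."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in observed)
                     / len(observed))

initial_rmse = rmse()
for _ in range(epochs):
    for u, i, r in observed:
        err = r - predict(u, i)
        for f in range(k):
            u_f, v_f = U[u][f], V[i][f]   # keep old values for both updates
            U[u][f] += lr * (err * v_f - reg * u_f)
            V[i][f] += lr * (err * u_f - reg * v_f)
final_rmse = rmse()

print(initial_rmse, final_rmse)
print(predict(0, 2))   # model's guess for an entry that was never observed
```

The unobserved entries never enter the loss; the model fills them in only because each user and each item is forced to share a small number of latent factors.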

Free online resources

MIT OpenCourseWare, 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning (Gilbert Strang). The video course built around Linear Algebra and Learning from Data — Strang lecturing on low-rank approximation, the SVD, and the linear algebra of deep learning, in the same geometric spirit as this book. The closest possible video companion to this chapter.
fast.ai, Practical Deep Learning for Coders. A free, code-first course that gets you building and training real networks and embeddings quickly; pairs well with Géron for turning this chapter's toys into working systems.
The surprise and implicit Python libraries (recommenders), and PyTorch/Keras embedding layers. For hands-on extension of Case Study 1, surprise implements SVD-style matrix factorization with a friendly API; PyTorch's nn.Embedding and Keras's embedding layers are how the §33.5 ideas appear in production code.
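
An embedding layer is, in linear-algebra terms, a matrix whose rows are looked up by index, which is the same as multiplying a one-hot vector by that matrix. The sketch below shows the equivalence in plain Python; it does not call PyTorch or Keras, and the table `E` and helper names are hypothetical.

```python
# Hypothetical embedding table: 4 tokens, each with a 3-dimensional vector.
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def lookup(table, index):
    """What an embedding layer does in practice: fetch one row."""
    return table[index]

def one_hot(index, size):
    """Vector with a 1.0 in position `index` and 0.0 elsewhere."""
    return [1.0 if j == index else 0.0 for j in range(size)]

def vecmat(x, table):
    """Row vector times matrix: x^T E."""
    return [sum(x[i] * table[i][j] for i in range(len(table)))
            for j in range(len(table[0]))]

token = 2
print(lookup(E, token))
print(vecmat(one_hot(token, len(E)), E))   # same row, obtained by multiplication
```

Libraries use the direct lookup because it avoids multiplying by all the zeros, but the matrix view is what lets the table be trained like any other weight matrix.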

A note on where this is going

The matrix calculus this chapter deferred — gradients and backpropagation — is a calculus subject; see derivatives and gradients for the optimization foundation. The inner-product geometry that embeddings inhabit generalizes in Chapter 34 (abstract inner product spaces). And the recommender of this chapter is one of the application choices for the Chapter 39 capstone, where the from-scratch toolkit — vectors through SVD and PCA — is assembled on a real dataset. This chapter is the proof of the book's central promise: that the same linear algebra you learned for its own beauty is the working mathematics of the systems now reshaping the world. Learn it once, use it everywhere.