Chapter 31 — Optimization in Several Variables

DataField.Dev

Prerequisites

Chapter 30: Multivariable Chain Rule and Gradient

Learning Objectives

Find critical points $\nabla f = \mathbf{0}$ of a multivariable function.
Apply the second-derivative test to classify critical points (max, min, saddle).
Use Lagrange multipliers to optimize subject to equality constraints.
Handle multiple constraints via systems of Lagrange equations.
Apply multivariable optimization to economics, engineering, and ML.

In This Chapter

31.1 The Setup: From One Variable to a Whole Landscape
31.2 Critical Points
31.3 The Hessian Matrix
31.4 The Second-Derivative Test
31.5 Absolute Extrema on Closed Regions
31.6 Higher Dimensions
31.7 Multivariable Newton's Method (Optional)
31.8 Lagrange Multipliers: Optimizing on a Constraint
31.9 Lagrange Multipliers with Two Constraints
31.10 Application: Utility Maximization in Economics
31.11 Application: Least Squares and Maximum Likelihood
31.12 Convexity: When Optimization Is Easy
Looking Ahead
Reflection

31.1 The Setup: From One Variable to a Whole Landscape

In Chapter 10 you learned to optimize a function of one variable, and the recipe was almost mechanical. Set $f'(x) = 0$ to find critical points, then check the second derivative or a sign change to decide whether each is a peak or a valley. The graph of $f(x)$ is a curve, and on a curve there are only two interesting things that can happen at a flat spot: it can turn over (a maximum) or turn up (a minimum).

Now let the function depend on two variables. The graph of $z = f(x, y)$ is no longer a curve but a surface — a landscape of hills and basins suspended over the $xy$-plane. The same instinct applies: at the top of a hill or the bottom of a basin, the surface is momentarily level, and "level" now means level in every direction at once. That is the condition $\nabla f = \mathbf{0}$, which we met in Chapter 30 when we discovered that the gradient points in the direction of steepest ascent. Where the gradient vanishes, there is no steepest ascent — no uphill at all.

But a surface can be flat at a point in a way a curve never can. Stand at a mountain pass: the trail dips down ahead of you and behind you toward the two valleys, yet rises to your left and right toward the two peaks. You are at a level point that is neither a summit nor a basin. This is a saddle point, and it has no single-variable analog whatsoever. Its existence is the single fact that makes multivariable optimization genuinely new.

This chapter develops the whole story in three movements. First, unconstrained optimization: find the critical points where $\nabla f = \mathbf{0}$, and classify them with the second-derivative test built from the Hessian matrix. Second, absolute extrema on closed regions, where the boundary matters as much as the interior. Third, constrained optimization via Lagrange multipliers — the elegant idea that lets you optimize while tethered to a curve or surface, which turns out to be the mathematical heart of economics, statistics, and engineering design.

The Key Insight. A function of several variables is level at a critical point in every direction, but level comes in three flavors, not two: the point can curve up in all directions (minimum), down in all directions (maximum), or up in some and down in others (saddle). The entire second-derivative test is a machine for deciding which flavor you have — and the saddle is the flavor single-variable calculus could never produce.

Throughout, watch three of the book's recurring themes braid together: geometry and algebra are inseparable (every test we derive has a picture of tangent level curves behind it), calculus appears in every quantitative field (the same Lagrange equation prices a portfolio and trains a classifier), and hand computation builds understanding while machine computation builds power (we verify every hand result with scipy or numpy).

31.2 Critical Points

A critical point of $f(x, y)$ is a point $(x_0, y_0)$ where the gradient vanishes:

$$\nabla f(x_0, y_0) = \langle f_x, f_y \rangle = \mathbf{0}, \qquad\text{i.e.}\qquad f_x(x_0,y_0) = 0 \ \text{and}\ f_y(x_0,y_0) = 0.$$

Both partial derivatives must be zero simultaneously. Geometrically, the tangent plane at a critical point is horizontal: it has the equation $z = f(x_0, y_0)$, with no tilt in the $x$- or $y$-direction (Chapter 30 built this tangent plane). Just as a smooth curve can have a horizontal tangent line only at a peak, valley, or inflection, a smooth surface can have a horizontal tangent plane only at a peak, basin, or saddle.

(We should also flag, as in one variable, that an extremum can occur where $\nabla f$ fails to exist — at a corner or cusp of the surface. The function $f(x,y) = \sqrt{x^2 + y^2}$ has its minimum at the origin, a sharp cone-point where neither partial exists. Throughout this chapter we assume $f$ is smooth enough that interior extrema are critical points; non-smooth cases are handled by examining those points directly, exactly as in Chapter 10.)

A smooth critical point is one of three types:

Local maximum: $f(x, y) \le f(x_0, y_0)$ for all $(x,y)$ near $(x_0, y_0)$ — the top of a hill.
Local minimum: $f(x, y) \ge f(x_0, y_0)$ nearby — the bottom of a basin.
Saddle point: neither — $f$ increases along some directions through the point and decreases along others.

Let's meet all three through the three simplest surfaces in calculus.

Example 1 — the bowl $f = x^2 + y^2$. Here $\nabla f = \langle 2x, 2y \rangle$, which is $\mathbf 0$ only at the origin. The surface is a paraboloid opening upward; $f(0,0) = 0$ and $f > 0$ everywhere else. The origin is a local (in fact global) minimum — the bottom of the bowl.

Example 2 — the dome $f = -x^2 - y^2$. Now $\nabla f = \langle -2x, -2y \rangle = \mathbf 0$ at the origin. This is the bowl flipped upside down; $f(0,0) = 0$ and $f < 0$ elsewhere. The origin is a local maximum — the crown of the dome.

Example 3 — the saddle $f = x^2 - y^2$. Here $\nabla f = \langle 2x, -2y \rangle = \mathbf 0$ at the origin, so the origin is again critical. But watch the two cross-sections. Along the $x$-axis ($y=0$), $f = x^2 > 0$ for $x \ne 0$: the surface curves up, like the bottom of a valley. Along the $y$-axis ($x=0$), $f = -y^2 < 0$ for $y \ne 0$: the surface curves down, like the top of a ridge. The origin is a minimum along one direction and a maximum along the perpendicular direction — a textbook saddle point. It is shaped exactly like a horse's saddle or a Pringles chip.

Geometric Intuition. Picture the level curves (contour lines) near each critical point, the same way a topographic map shows elevation. Near a minimum or maximum, the contours are closed loops nested around the point like the rings of a target — you are inside a basin or atop a hill. Near a saddle, the contours do something different: two of them cross at the point, forming an X, and the level sets open outward in four directions. That crossing X is the visual signature of a saddle, and once you can spot it on a contour plot you can classify critical points by eye.

Check Your Understanding. Find all critical points of $f(x,y) = x^2 + y^2 - 6x + 4y + 13$, and guess (from the algebra alone) whether the point is a max, min, or saddle.
Answer
$f_x = 2x - 6 = 0 \Rightarrow x = 3$; $f_y = 2y + 4 = 0 \Rightarrow y = -2$. The lone critical point is $(3, -2)$. Completing the square, $f = (x-3)^2 + (y+2)^2$, so $f \ge 0$ with equality only at $(3,-2)$ — a minimum. Like Example 1, it is an upward bowl, just shifted; its lowest value is $f(3,-2) = 0$.
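The same two-equation solve can be handed to sympy. The sketch below is only a check of the answer above; the variable names are illustrative choices, not part of the exercise.

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)
f = x**2 + y**2 - 6*x + 4*y + 13

# Both partials must vanish at the same point, so solve them as one system.
grad = [sp.diff(f, v) for v in (x, y)]
crit = sp.solve(grad, (x, y), dict=True)

print(crit)                  # list of critical points as dictionaries
print(f.subs(crit[0]))       # value of f at the critical point
```

Solving the gradient as a system, rather than each partial on its own, is the habit that matters once the equations stop being linear.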

31.3 The Hessian Matrix

Eyeballing cross-sections worked for the three model surfaces because they were simple. We need a test — something mechanical that classifies any smooth critical point. That test is built from the second partial derivatives, organized into a matrix.

The Hessian matrix of $f(x,y)$ is

$$H_f = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{pmatrix}.$$

It is the second-order analog of the gradient: where the gradient collects the first partials into a vector, the Hessian collects the second partials into a matrix. By Clairaut's theorem (Chapter 29), if the second partials are continuous then the mixed partials agree, $f_{xy} = f_{yx}$, so the Hessian is symmetric. That symmetry is not a curiosity — it is what guarantees the test below has real eigenvalues and a clean answer.

From the Hessian we extract a single number, the discriminant:

$$D = f_{xx}\, f_{yy} - f_{xy}^{\,2} = \det(H_f).$$

This determinant — together with the sign of $f_{xx}$ — is all you need to classify a critical point.

Computational Note. In sympy, the Hessian is one call: H = sp.hessian(f, (x, y)). You can then take H.det() for the discriminant or H.eigenvals() for the eigenvalue classification we discuss in §31.6. We use this at the end of §31.4 to check a hand computation.
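As a small illustration of those calls, here is a sketch that computes the Hessian, the discriminant $D$, and $f_{xx}$ for the three model surfaces of §31.2. The dictionary layout and names are our own choices for the example.

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)

# The three model surfaces of Section 31.2.
surfaces = {
    "bowl":   x**2 + y**2,
    "dome":  -x**2 - y**2,
    "saddle": x**2 - y**2,
}

results = {}
for name, f in surfaces.items():
    H = sp.hessian(f, (x, y))       # matrix of second partials
    D = H.det()                     # discriminant f_xx*f_yy - f_xy**2
    results[name] = (D, H[0, 0])    # store (D, f_xx)
    print(name, "D =", D, " f_xx =", H[0, 0])
```

The bowl and the dome share the same positive $D$ and differ only in the sign of $f_{xx}$, while the saddle is the one with negative $D$.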

31.4 The Second-Derivative Test

Here is the test in its usable form. Suppose $(x_0, y_0)$ is a critical point of $f$ (so $\nabla f = \mathbf 0$ there) and the second partials are continuous nearby. Compute $D = f_{xx}f_{yy} - f_{xy}^2$ at the point. Then:

Condition	Classification
$D > 0$ and $f_{xx} > 0$	local minimum
$D > 0$ and $f_{xx} < 0$	local maximum
$D < 0$	saddle point
$D = 0$	test is inconclusive

A useful way to remember it: $D > 0$ means "definitely a max or min" (the surface curves the same way in the worst direction as in the $x$-direction), and the sign of $f_{xx}$ then tells you which one. $D < 0$ means "saddle" — the surface disagrees with itself. And $D = 0$ means the second-order information ran out and you must look harder.
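The table translates directly into a few lines of code. This is a minimal sketch; the function name and the sample inputs are hypothetical, and the function expects $D$ and $f_{xx}$ already evaluated at a critical point.

```python
def classify_critical_point(D, fxx):
    """Apply the two-variable second-derivative test.

    D   : discriminant f_xx*f_yy - f_xy**2 at the critical point
    fxx : value of f_xx at the same point
    """
    if D > 0:
        # D > 0 forces f_xx != 0, so one of these two cases always applies.
        return "local minimum" if fxx > 0 else "local maximum"
    if D < 0:
        return "saddle point"
    return "inconclusive"

print(classify_critical_point(16, 2))    # sample values with D > 0, f_xx > 0
print(classify_critical_point(-9, 0))    # sample values with D < 0
```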

Why the test works

The test is not magic; it falls straight out of the second-order Taylor expansion of $f$ near the critical point. Write $h$ and $k$ for small displacements and expand:

$$f(x_0 + h, y_0 + k) \approx f(x_0, y_0) + \underbrace{f_x\,h + f_y\,k}_{=\,0} + \tfrac{1}{2}\big(f_{xx}\,h^2 + 2 f_{xy}\,hk + f_{yy}\,k^2\big).$$

The first-order terms vanish because we are at a critical point — that is the whole point of $\nabla f = \mathbf 0$. So the local behavior of $f$, relative to its value at the critical point, is governed entirely by the quadratic form

$$Q(h, k) = f_{xx}\,h^2 + 2 f_{xy}\,hk + f_{yy}\,k^2.$$

If $Q > 0$ for every nonzero $(h,k)$, then $f$ rises in every direction: a minimum. If $Q < 0$ for every direction, $f$ falls everywhere: a maximum. If $Q$ is positive in some directions and negative in others, $f$ rises some ways and falls others: a saddle.
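To see how well the quadratic form tracks the true change in $f$, here is a small numerical sketch. It borrows the function $x^3 + y^3 - 3xy$ and its critical point $(1,1)$ from the worked examples later in this section; the displacements are arbitrary choices.

```python
def f(x, y):
    return x**3 + y**3 - 3*x*y

x0, y0 = 1.0, 1.0                  # critical point: both first partials vanish
fxx, fxy, fyy = 6.0, -3.0, 6.0     # second partials 6x, -3, 6y evaluated at (1, 1)

def Q(h, k):
    """Quadratic form built from the second partials at (x0, y0)."""
    return fxx*h**2 + 2*fxy*h*k + fyy*k**2

for h, k in [(0.1, 0.05), (0.01, -0.02), (-0.001, 0.001)]:
    exact = f(x0 + h, y0 + k) - f(x0, y0)   # true change in f
    approx = 0.5 * Q(h, k)                  # second-order prediction
    print(h, k, exact, approx, exact - approx)
```

For this cubic the leftover `exact - approx` is exactly $h^3 + k^3$, which shrinks much faster than $Q$ as the displacement gets small.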

Now complete the square on $Q$ (assume $f_{xx} \ne 0$):

$$Q(h,k) = f_{xx}\left(h + \frac{f_{xy}}{f_{xx}}k\right)^2 + \frac{f_{xx}f_{yy} - f_{xy}^2}{f_{xx}}\,k^2 = f_{xx}\left(h + \frac{f_{xy}}{f_{xx}}k\right)^2 + \frac{D}{f_{xx}}\,k^2.$$

Read off the two coefficients. The first is $f_{xx}$; the second is $D/f_{xx}$. For $Q$ to be positive in all directions, both coefficients must be positive: $f_{xx} > 0$ and $D/f_{xx} > 0$, which together force $f_{xx} > 0$ and $D > 0$. For $Q$ negative in all directions, both must be negative: $f_{xx} < 0$ and $D > 0$. And if $D < 0$, the two squared pieces carry opposite signs, so $Q$ takes both signs — a saddle, no matter what $f_{xx}$ is. The table is now proved, line by line. When $D = 0$, the second coefficient dies and the quadratic form is degenerate; the second-order terms cannot decide, and we fall through to higher order.
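The completing-the-square identity is easy to mistype, so a symbolic check is reassuring. In this sketch the symbols `fxx`, `fxy`, `fyy` stand for the second partials evaluated at the critical point; they are plain placeholders, not derivatives of any particular function.

```python
import sympy as sp

h, k, fxx, fxy, fyy = sp.symbols("h k f_xx f_xy f_yy", real=True)

Q = fxx*h**2 + 2*fxy*h*k + fyy*k**2
D = fxx*fyy - fxy**2

# Completed-square form; valid only where f_xx != 0.
completed = fxx*(h + fxy/fxx*k)**2 + D/fxx*k**2

difference = sp.simplify(sp.expand(completed) - Q)
print(difference)   # zero means the two forms agree identically
```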

Math Major Sidebar — Eigenvalues and the signature. The cleaner formulation lives in linear algebra. A symmetric matrix like $H_f$ has real eigenvalues $\lambda_1, \lambda_2$, and the quadratic form $Q(\mathbf v) = \mathbf v^\top H_f \mathbf v$ is positive in the directions of positive eigenvalues and negative in the directions of negative ones. For the $2\times 2$ case, $\det H_f = \lambda_1 \lambda_2 = D$ and $\operatorname{tr} H_f = \lambda_1 + \lambda_2 = f_{xx} + f_{yy}$. So $D > 0$ means the eigenvalues share a sign (definite — max or min), and the trace tells you which; $D < 0$ means they have opposite signs (indefinite — saddle). This eigenvalue criterion generalizes verbatim to $n$ dimensions, where the determinant test does not. We return to it in §31.6.

Worked examples, graduated

Example 1 — a clean minimum. $f(x,y) = x^2 + 4y^2 - 2x - 8y + 6.$

Critical point: $f_x = 2x - 2 = 0 \Rightarrow x = 1$; $f_y = 8y - 8 = 0 \Rightarrow y = 1$. So $(1,1)$.

Second partials: $f_{xx} = 2,\ f_{yy} = 8,\ f_{xy} = 0$, hence $D = 2\cdot 8 - 0^2 = 16 > 0$ and $f_{xx} = 2 > 0$. By the test, local minimum at $(1,1)$, with value $f(1,1) = 1 + 4 - 2 - 8 + 6 = 1$.

Example 2 — a saddle and a minimum together. $f(x,y) = x^3 + y^3 - 3xy.$

Critical points: $f_x = 3x^2 - 3y = 0 \Rightarrow y = x^2$; $f_y = 3y^2 - 3x = 0 \Rightarrow x = y^2$. Substituting the first into the second, $x = (x^2)^2 = x^4$, so $x^4 - x = x(x^3 - 1) = 0$, giving $x = 0$ or $x = 1$ (the other roots are complex and we discard them). With $y = x^2$, the real critical points are $(0,0)$ and $(1,1)$.

Second partials: $f_{xx} = 6x,\ f_{yy} = 6y,\ f_{xy} = -3$.

At $(0,0)$: $D = (0)(0) - (-3)^2 = -9 < 0$. Saddle point.
At $(1,1)$: $D = (6)(6) - (-3)^2 = 36 - 9 = 27 > 0$ and $f_{xx} = 6 > 0$. Local minimum, value $f(1,1) = 1 + 1 - 3 = -1$.

One function, two qualitatively different critical points — exactly the richness that single-variable calculus lacks.

Common Pitfall. Many students solve $f_x = 0$ and $f_y = 0$ separately and then pair up the answers however they like, manufacturing critical points that do not exist. The two equations must hold at the same point: you are solving a system, not two independent equations. In Example 2, "$x = 0$ or $1$" came from substituting one equation into the other, not from reading them in isolation. Always back-substitute and confirm both partials vanish at each candidate.

Example 3 — when $D = 0$ and the test surrenders. $f(x,y) = x^4 - 2x^2 + y^4.$

Critical points: $f_x = 4x^3 - 4x = 4x(x^2 - 1) = 0 \Rightarrow x = 0, \pm 1$; $f_y = 4y^3 = 0 \Rightarrow y = 0$. Critical points $(0,0),\ (1,0),\ (-1,0)$.

Second partials: $f_{xx} = 12x^2 - 4,\ f_{yy} = 12y^2,\ f_{xy} = 0$. Because $f_{yy} = 12y^2 = 0$ whenever $y = 0$, the discriminant $D = f_{xx}f_{yy} - 0 = 0$ at every one of these critical points. The test is inconclusive across the board — so we examine the function directly.

At $(0,0)$: along the $x$-axis, $f(x,0) = x^4 - 2x^2 = x^2(x^2 - 2)$, which is negative for small $x \ne 0$ — a local max in the $x$-direction. But along the $y$-axis, $f(0,y) = y^4 \ge 0$ — a local min in the $y$-direction. Up one way, down another: $(0,0)$ is a saddle, which the determinant test completely missed.
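At $(\pm 1, 0)$: along the $x$-axis, $f(x,0) = x^4 - 2x^2$ has its minima at $x = \pm 1$ (value $-1$). Along the $y$-direction, $f(\pm 1, y) = -1 + y^4 \ge -1$, also a minimum. Agreement along two directions is not a proof in general, but here $f$ is a sum of a function of $x$ alone and a function of $y$ alone: $f(x,y) = (x^2 - 1)^2 - 1 + y^4 \ge -1$ everywhere. So $(\pm 1, 0)$ are local (indeed global) minima, value $-1$.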
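When $D = 0$, probing $f$ along paths can also be done numerically. The following is a rough sketch for Example 3, with step sizes chosen arbitrarily; sampling a few points suggests the behavior but does not replace the algebraic argument.

```python
def f(x, y):
    return x**4 - 2*x**2 + y**4

t_values = [0.1, 0.01, 0.001]

# At (0, 0): compare f along each axis with f(0, 0) = 0.
along_x = [f(t, 0.0) - f(0.0, 0.0) for t in t_values]   # negative: f falls
along_y = [f(0.0, t) - f(0.0, 0.0) for t in t_values]   # positive: f rises
print(along_x, along_y)

# At (1, 0): sample a small grid around the point and compare with f(1, 0).
near_min = [f(1.0 + dx, dy) - f(1.0, 0.0)
            for dx in (-0.01, 0.0, 0.01) for dy in (-0.01, 0.0, 0.01)]
print(min(near_min))    # no sampled neighbor dips below f(1, 0)
```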

Common Pitfall. Do not read "$D = 0$" as "no extremum." It means only that the second-order terms are silent; the point can still be a max, a min, or a saddle, decided by higher-order behavior. The cure is to look at $f$ along well-chosen paths — the axes, the lines $y = \pm x$, or a parametrized curve through the point — and watch whether $f$ rises, falls, or does both.

Verifying with Python

The three-tier pattern from our continuity standards — pose analytically, solve by hand, confirm by machine — closes the loop. Here we let sympy find and classify critical points symbolically, with the eigenvalue criterion (§31.6) doing the classification so the same code will scale to any dimension.

# Classify the critical points of f(x,y) = x^3 + y^3 - 3xy via the Hessian.
import sympy as sp

x, y = sp.symbols('x y', real=True)
f = x**3 + y**3 - 3*x*y

grad = [sp.diff(f, v) for v in (x, y)]
H = sp.hessian(f, (x, y))                    # 2x2 matrix of second partials

crit = sp.solve(grad, (x, y), dict=True)
for c in crit:
    if all(val.is_real for val in c.values()):
        eig = H.subs(c).eigenvals()          # {eigenvalue: multiplicity}
        signs = [sp.sign(e) for e in eig]
        if all(s ==  1 for s in signs):   kind = "local minimum"
        elif all(s == -1 for s in signs): kind = "local maximum"
        elif 1 in signs and -1 in signs:  kind = "saddle point"
        else:                             kind = "inconclusive"   # a zero eigenvalue
        print(c, "->", kind, "  eigenvalues:", sorted(eig))       # sorted for a stable order
# {x: 0, y: 0} -> saddle point   eigenvalues: [-3, 3]
# {x: 1, y: 1} -> local minimum   eigenvalues: [3, 9]

The machine agrees with the hand work: $(0,0)$ has eigenvalues of opposite sign (saddle), $(1,1)$ has two positive eigenvalues (minimum). Notice that the eigenvalue approach told us the answer without ever forming the discriminant — and, unlike $D$, it will keep working in §31.6 when we leave two dimensions.

31.5 Absolute Extrema on Closed Regions

The second-derivative test finds and classifies interior critical points, but it says nothing about the edges of a region. Many real problems are confined: a budget caps spending, a material constraint caps size, a domain has a boundary. For these we want the absolute (global) maximum and minimum over a closed, bounded region — and there is a guarantee that they exist.

Extreme Value Theorem (two-variable form). A continuous function on a closed and bounded region $R \subset \mathbb{R}^2$ attains an absolute maximum and an absolute minimum somewhere on $R$.

This is the multivariable cousin of the one-dimensional Extreme Value Theorem from Chapter 9. It promises the extrema exist; finding them is a three-step search, the exact analog of the closed-interval method from Chapter 10 ("check critical points and endpoints"), with "endpoints" upgraded to "boundary."

The strategy.
1. Find all critical points of $f$ in the interior of $R$, and list the values of $f$ there.
2. Find the extreme values of $f$ on the boundary of $R$. The boundary is one dimension lower, so this is a (constrained) sub-problem you solve by parametrizing the boundary and using single-variable calculus, or by Lagrange multipliers (§31.8).
3. Compare every value collected in steps 1 and 2. The largest is the absolute maximum; the smallest is the absolute minimum.

Example — $f(x,y) = xy$ on the closed disk $x^2 + y^2 \le 1$

Interior. $\nabla f = \langle y, x \rangle = \mathbf 0$ only at $(0,0)$, where $f(0,0) = 0$.

Boundary. On the circle $x^2 + y^2 = 1$, parametrize $x = \cos\theta,\ y = \sin\theta$. Then $$f = \cos\theta\sin\theta = \tfrac{1}{2}\sin(2\theta).$$ This single-variable function of $\theta$ has maximum $\tfrac12$ where $\sin 2\theta = 1$, i.e. $\theta = \tfrac{\pi}{4}$ or $\tfrac{5\pi}{4}$, giving $(x,y) = \pm(\tfrac{\sqrt2}{2}, \tfrac{\sqrt2}{2})$, and minimum $-\tfrac12$ where $\sin 2\theta = -1$, i.e. $\theta = \tfrac{3\pi}{4}$ or $\tfrac{7\pi}{4}$, giving $(x,y) = \pm(-\tfrac{\sqrt2}{2}, \tfrac{\sqrt2}{2})$.

Compare. The candidate values are $0$ (interior), $\tfrac12$ and $-\tfrac12$ (boundary). Therefore the absolute maximum is $\tfrac12$, attained at $\pm(\tfrac{\sqrt2}{2}, \tfrac{\sqrt2}{2})$ where $x$ and $y$ share a sign, and the absolute minimum is $-\tfrac12$, attained at $\pm(\tfrac{\sqrt2}{2}, -\tfrac{\sqrt2}{2})$ where they have opposite signs. The interior critical point $(0,0)$, a saddle of $f$, is neither; it is a decoy, and the comparison step is what exposes it.
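The three-step search can be mimicked numerically. This sketch replaces the exact boundary calculus with a fine grid of angles, which is an approximation we chose for illustration; the grid size `n` is arbitrary.

```python
import math

def f(x, y):
    return x * y

# Step 1: the only interior critical point is the origin.
candidates = [((0.0, 0.0), f(0.0, 0.0))]

# Step 2: sample the boundary circle x = cos(t), y = sin(t) on a fine grid.
n = 3600
for i in range(n):
    t = 2 * math.pi * i / n
    x, y = math.cos(t), math.sin(t)
    candidates.append(((x, y), f(x, y)))

# Step 3: compare every collected value.
best_max = max(candidates, key=lambda c: c[1])
best_min = min(candidates, key=lambda c: c[1])
print(best_max, best_min)
```

The interior value $0$ sits in the candidate list alongside the boundary values, and the comparison step discards it automatically.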

Common Pitfall. Forgetting the boundary is the classic error here. A surface restricted to a closed region very often attains its extremes on the edge, not at an interior critical point — and in this example, the interior critical point is a saddle, so the true extremes live entirely on the rim. Steps 1 and 2 are not optional alternatives; you must do both and then compare.

Check Your Understanding. Find the absolute maximum of $f(x,y) = x + y$ on the closed disk $x^2 + y^2 \le 2$.
Answer
Interior: $\nabla f = \langle 1,1\rangle$ never vanishes, so there are no interior critical points. Boundary: parametrize $x = \sqrt2\cos\theta,\ y = \sqrt2\sin\theta$, so $f = \sqrt2(\cos\theta + \sin\theta) = 2\sin(\theta + \tfrac{\pi}{4})$, with maximum $2$ at $\theta = \tfrac{\pi}{4}$, i.e. $(1,1)$. The absolute maximum is $\boxed{2}$ at $(1,1)$. (In §31.8 you will get the same answer from one Lagrange equation.)

31.6 Higher Dimensions

Everything above generalizes to a function $f(x_1, \dots, x_n)$ of $n$ variables, and the generalization is where the eigenvalue viewpoint earns its keep.

A critical point is still a point where the gradient vanishes: $\nabla f = \mathbf 0$, i.e. all $n$ first partials are zero. The Hessian is now the $n \times n$ symmetric matrix $H_{ij} = \partial^2 f / \partial x_i \partial x_j$. To classify a critical point, examine the eigenvalues of the Hessian at that point:

all eigenvalues positive ($H$ positive-definite) → local minimum;
all eigenvalues negative ($H$ negative-definite) → local maximum;
eigenvalues of mixed sign ($H$ indefinite) → saddle point;
some eigenvalue zero → degenerate, test inconclusive (the analog of $D = 0$).

The single number $D$ does not generalize, but the eigenvalue criterion does, which is why it is the "real" version of the test. In two dimensions, the determinant trick is just a shortcut: $\det H = \lambda_1\lambda_2$ has the sign of "do the eigenvalues agree?" and $f_{xx}$ breaks the tie.
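The eigenvalue criterion can be sketched numerically with numpy. The function name and the tolerance `tol` are our assumptions: floating-point eigenvalues are rarely exactly zero, so some threshold is needed. The mixed-sign check comes first because one positive and one negative eigenvalue already settle the saddle.

```python
import numpy as np

def classify_by_eigenvalues(H, tol=1e-10):
    """Classify a critical point from its symmetric Hessian matrix H."""
    eigs = np.linalg.eigvalsh(np.asarray(H, dtype=float))  # real, ascending
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle point"              # indefinite
    if np.any(np.abs(eigs) <= tol):
        return "inconclusive"              # degenerate: a (near-)zero eigenvalue
    return "local minimum" if np.all(eigs > 0) else "local maximum"

# Hessian of the illustrative function x^2 + y^2 - z^2 (constant everywhere).
print(classify_by_eigenvalues(np.diag([2.0, 2.0, -2.0])))
# Hessian of x^3 + y^3 - 3xy at (1, 1), from Example 2 of Section 31.4.
print(classify_by_eigenvalues([[6.0, -3.0], [-3.0, 6.0]]))
```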

Real-World Application — Saddle points and deep learning. Training a neural network means minimizing a loss function $f(\boldsymbol\theta)$ over a parameter vector $\boldsymbol\theta$ that can have $n \approx 10^{9}$ to $10^{12}$ components. A landmark observation (Dauphin et al., 2014) is that in such high-dimensional landscapes, the overwhelming majority of critical points are saddle points, not local minima: for a critical point to be a minimum, all $n$ eigenvalues of the Hessian must be positive simultaneously, which is astronomically unlikely by chance. This reframed the central difficulty of training: the obstacle is escaping the vast plains of saddle points, not getting trapped in bad local minima. The gradient-descent dynamics of Chapter 30 are the tool that does the escaping, and modern variants (momentum, Adam) are largely engineered to slip past saddles faster.

31.7 Multivariable Newton's Method (Optional)

Finding critical points means solving $\nabla f = \mathbf 0$, a system of nonlinear equations. Chapter 11 introduced Newton's method for one equation in one unknown: to solve $g(x) = 0$, iterate $x_{k+1} = x_k - g'(x_k)^{-1}g(x_k)$. A critical point of a one-variable $f$ solves $f'(x) = 0$, so taking $g = f'$ gives $x_{k+1} = x_k - f''(x_k)^{-1}f'(x_k)$. The same idea finds critical points in several variables, with the Hessian playing the role of the second derivative:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H_f(\mathbf{x}_k)^{-1}\,\nabla f(\mathbf{x}_k).$$

The inverse Hessian $H_f^{-1}$ replaces the scalar $1/f''$, and $\nabla f$ replaces $f'$. Near a nondegenerate critical point this converges quadratically — the number of correct digits roughly doubles each step — which is far faster than the linear convergence of gradient descent (Chapter 30). The price is steep: you must compute and invert an $n \times n$ matrix at every step, costing on the order of $n^3$ operations, which is prohibitive when $n$ is in the millions.

This trade-off is exactly why machine learning rarely uses pure Newton's method. Instead it uses quasi-Newton methods such as BFGS and its limited-memory cousin L-BFGS, which build a cheap running approximation to $H_f^{-1}$ from gradient information alone, keeping much of Newton's speed while sidestepping the cost of the true Hessian. The scipy.optimize.minimize routine we use below defaults to this family: BFGS for unconstrained problems and L-BFGS-B when bounds are supplied.
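For a concrete feel of the Newton iteration, here is a minimal sketch applied to $f = x^3 + y^3 - 3xy$ from §31.4, with the gradient and Hessian coded by hand. The starting guess, iteration cap, and stopping tolerance are arbitrary choices for the illustration.

```python
import numpy as np

def grad(p):
    x, y = p
    return np.array([3*x**2 - 3*y, 3*y**2 - 3*x])

def hessian(p):
    x, y = p
    return np.array([[6*x, -3.0], [-3.0, 6*y]])

p = np.array([1.5, 1.2])            # starting guess near the minimum at (1, 1)
for step in range(20):
    g = grad(p)
    if np.linalg.norm(g) < 1e-12:   # gradient is numerically zero: stop
        break
    # Solve H * delta = grad instead of forming the inverse explicitly.
    delta = np.linalg.solve(hessian(p), g)
    p = p - delta
    print(step, p, np.linalg.norm(grad(p)))
```

Solving the linear system rather than inverting $H_f$ is the usual numerical practice. Note also that Newton's method is attracted to any nearby critical point, saddles included, so the result still needs classifying.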

31.8 Lagrange Multipliers: Optimizing on a Constraint

We now arrive at the chapter's most beautiful idea. Often we do not get to optimize freely; we must optimize subject to a constraint. Maximize utility given a fixed budget. Minimize surface area for a fixed volume. Maximize likelihood with the probabilities summing to one. The constraint confines us to a curve (or surface), and the question becomes: where on that curve is $f$ largest?

The problem. Optimize $f(x,y)$ subject to a constraint $g(x,y) = 0$.

The constraint $g(x,y) = 0$ is a curve in the plane. As you walk along it, $f$ rises and falls; we want the high and low points of that walk.

Deriving the method (not just stating it)

Here is the geometric argument, and it is worth savoring because it explains why the strange-looking Lagrange equation is forced on us.

Draw the level curves of $f$ — the curves $f = c$ for various constants $c$ — as a contour map. Now lay the constraint curve $g = 0$ across this map. As you travel along $g = 0$, you cross level curves of $f$, and each crossing means $f$ is changing: you are stepping from one contour value to another. As long as the constraint curve crosses a level curve transversally (at an angle), $f$ is strictly increasing in one direction along the constraint and decreasing in the other — so you are not at an extremum, because you could keep walking and push $f$ higher.

The only way $f$ can stop changing along the constraint — the only way to be at a constrained extremum — is for the constraint curve to be tangent to a level curve of $f$ at that point. At a tangency, the constraint momentarily runs parallel to a contour of $f$; locally you are neither climbing to a higher contour nor descending to a lower one, so $f$ has a critical value along the constraint.

Now translate "tangent" into vectors. The gradient $\nabla f$ is always perpendicular to the level curves of $f$ (Chapter 30 — the gradient points across contours, in the direction of steepest ascent). Likewise $\nabla g$ is perpendicular to the constraint curve $g = 0$ (itself a level curve, of $g$). If the two curves are tangent, they share the same tangent line, hence the same perpendicular direction, hence their gradients are parallel:

$$\boxed{\ \nabla f = \lambda\,\nabla g\ }$$

for some scalar $\lambda$, the Lagrange multiplier. Parallel vectors are scalar multiples of one another; $\lambda$ is that scalar (and the sign and size of $\lambda$ carry real meaning, as §31.10 shows). Pair this vector equation with the constraint itself, $g = 0$, and you have a complete system: in 2D, three scalar equations ($f_x = \lambda g_x$, $f_y = \lambda g_y$, $g = 0$) in three unknowns ($x, y, \lambda$).

Geometric Intuition. Imagine the level curves of $f$ as ripples spreading outward and the constraint $g = 0$ as a fixed wire bent through them. Slide along the wire and you pass through ripple after ripple — until you reach the spot where the wire just kisses a ripple without crossing it. That kiss is the constrained optimum: the wire and the ripple are tangent, their perpendiculars (the gradients) align, and $\nabla f = \lambda\nabla g$. Every Lagrange problem is the algebra of finding that kiss.

Warning. The Lagrange condition is necessary but not sufficient: it locates candidate points where the curves are tangent, but does not by itself say which are maxima, which are minima, and which are neither. As with the unconstrained case, you must evaluate $f$ at every candidate and compare, or invoke the Extreme Value Theorem (the constraint set, if closed and bounded, guarantees a max and a min exist among the candidates). Also: the derivation quietly assumed $\nabla g \ne \mathbf 0$ at the optimum. Where $\nabla g = \mathbf 0$ (a singular point of the constraint), the method can miss extrema and those points must be checked separately.

Example 1 — the simplest Lagrange problem

Maximize $f(x,y) = xy$ subject to $g(x,y) = x + y - 1 = 0$.

Gradients: $\nabla f = \langle y, x\rangle$, $\nabla g = \langle 1, 1\rangle$. The Lagrange equations $\nabla f = \lambda\nabla g$ read $y = \lambda$ and $x = \lambda$, so $x = y$. The constraint $x + y = 1$ with $x = y$ gives $2x = 1$, hence $x = y = \tfrac12$. The candidate is $(\tfrac12, \tfrac12)$ with $f = \tfrac14$.

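Is it a maximum? The constraint line is unbounded, so the Extreme Value Theorem does not apply and we check directly. Substituting $y = 1 - x$ gives $f = x(1-x) = \tfrac14 - (x - \tfrac12)^2 \le \tfrac14$, with equality only at $x = \tfrac12$. Spot checks agree: $f(1, 0) = 0 < \tfrac14$, and $f(0.4, 0.6) = 0.24 < 0.25$. The value $\tfrac14$ is the largest, so $(\tfrac12,\tfrac12)$ is the constrained maximum. (Geometrically: among all rectangles with semi-perimeter $x+y=1$, the square maximizes area, a fact you can now see in more than one way.)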
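The Lagrange system for this example is small enough to hand to sympy. This is a sketch of one way to set it up; the symbol name `lam` and the list-of-equations layout are our choices.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)
f = x * y
g = x + y - 1

# grad f = lambda * grad g, one equation per variable, plus the constraint.
equations = [sp.diff(f, v) - lam * sp.diff(g, v) for v in (x, y)] + [g]
solutions = sp.solve(equations, (x, y, lam), dict=True)

print(solutions)
print([f.subs(s) for s in solutions])   # value of f at each candidate
```

The solver returns candidates only; deciding that the candidate is a maximum still takes the separate argument given above.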

Example 2 — a sphere in three dimensions

Maximize $f = x + 2y + 3z$ subject to $g = x^2 + y^2 + z^2 - 14 = 0$.

Gradients: $\nabla f = \langle 1, 2, 3\rangle$, $\nabla g = \langle 2x, 2y, 2z\rangle$. The equations $\nabla f = \lambda\nabla g$ give $$1 = 2\lambda x,\quad 2 = 2\lambda y,\quad 3 = 2\lambda z \ \Longrightarrow\ x = \tfrac{1}{2\lambda},\ y = \tfrac{1}{\lambda},\ z = \tfrac{3}{2\lambda}.$$ Substitute into the constraint: $$\frac{1}{4\lambda^2} + \frac{1}{\lambda^2} + \frac{9}{4\lambda^2} = \frac{1 + 4 + 9}{4\lambda^2} = \frac{14}{4\lambda^2} = 14 \ \Longrightarrow\ \lambda^2 = \tfrac14 \ \Longrightarrow\ \lambda = \pm\tfrac12.$$ For $\lambda = \tfrac12$: $(x,y,z) = (1, 2, 3)$ and $f = 1 + 4 + 9 = 14$. For $\lambda = -\tfrac12$: $(x,y,z) = (-1,-2,-3)$ and $f = -14$. The constraint set (a sphere) is closed and bounded, so both extremes are attained: the maximum is $14$ at $(1,2,3)$ and the minimum is $-14$ at $(-1,-2,-3)$. (This is the Cauchy–Schwarz inequality in disguise: the linear function $\mathbf a\cdot\mathbf x$ is largest on a sphere when $\mathbf x$ points along $\mathbf a$.)
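The Cauchy–Schwarz reading suggests a quick numerical sanity check: build the point on the sphere that points along $\mathbf a = \langle 1,2,3\rangle$, then confirm that randomly sampled points on the sphere never do better. The sample size and seed below are arbitrary.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])            # coefficients of f = a . x
radius = np.sqrt(14.0)                   # constraint sphere |x|^2 = 14

x_max = radius * a / np.linalg.norm(a)   # point on the sphere along a
f_max = a @ x_max

# Random points on the sphere: none should beat the Lagrange maximum.
rng = np.random.default_rng(0)
samples = rng.normal(size=(10_000, 3))
samples = radius * samples / np.linalg.norm(samples, axis=1, keepdims=True)
f_samples = samples @ a

print(x_max, f_max, f_samples.max(), f_samples.min())
```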

31.9 Lagrange Multipliers with Two Constraints

Sometimes you are tethered to two constraints at once, $g_1 = 0$ and $g_2 = 0$. In three dimensions, each constraint is a surface, and their intersection is a curve — you are confined to that curve and want $f$'s extremes along it.

The condition generalizes naturally. At a constrained extremum, $\nabla f$ must lie in the plane spanned by the two constraint gradients:

$$\nabla f = \lambda_1\,\nabla g_1 + \lambda_2\,\nabla g_2,$$

one multiplier per constraint. The reasoning mirrors the single-constraint case: to be free to increase $f$ along the intersection curve, $\nabla f$ would need a component along the curve's tangent direction. But the tangent to the intersection is perpendicular to both $\nabla g_1$ and $\nabla g_2$. So at an extremum $\nabla f$ must have no component along the tangent, meaning it lies entirely in the space spanned by $\nabla g_1$ and $\nabla g_2$ (assumed not parallel, so that they span a plane), which is exactly what the displayed equation says. Together with the two constraints, this is a system of (in 3D) five equations, namely $f_{x}=\lambda_1 g_{1x}+\lambda_2 g_{2x}$ and its $y$- and $z$-counterparts plus $g_1 = 0$ and $g_2 = 0$, in five unknowns $x, y, z, \lambda_1, \lambda_2$.

This is how you optimize over the intersection of two surfaces — say, the highest point on the curve where a plane slices a sphere.
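As an illustration of the plane-slices-sphere problem, here is a sketch with a specific plane and sphere that we picked for the example (they do not come from the text): find the highest and lowest points, measured by $f = z$, on the curve where the plane $x + y + z = 1$ meets the unit sphere.

```python
import sympy as sp

x, y, z, l1, l2 = sp.symbols("x y z lambda1 lambda2", real=True)
f = z                                # height
g1 = x + y + z - 1                   # plane (illustrative choice)
g2 = x**2 + y**2 + z**2 - 1          # unit sphere (illustrative choice)

# grad f = lambda1 * grad g1 + lambda2 * grad g2, plus both constraints.
equations = [sp.diff(f, v) - l1*sp.diff(g1, v) - l2*sp.diff(g2, v)
             for v in (x, y, z)] + [g1, g2]
candidates = sp.solve(equations, (x, y, z, l1, l2), dict=True)

for c in candidates:
    print((c[x], c[y], c[z]), "f =", f.subs(c))
```

By hand, the first two component equations force $x = y$, and the constraints then give the candidates $(0,0,1)$ and $(\tfrac23,\tfrac23,-\tfrac13)$: the highest and lowest points of the curve.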

31.10 Application: Utility Maximization in Economics

Constrained optimization is the mathematics of economics, because economics is the study of choice under scarcity — optimizing satisfaction subject to a budget. Let's derive the central result of consumer theory from scratch with Lagrange multipliers.

A consumer chooses quantities $x$ and $y$ of two goods to maximize a utility function $U(x,y)$, subject to a budget constraint $p_x\, x + p_y\, y = B$, where $p_x, p_y$ are prices and $B$ is income. So $g(x,y) = p_x x + p_y y - B = 0$.

The Lagrange condition $\nabla U = \lambda\nabla g$ reads $$U_x = \lambda\, p_x, \qquad U_y = \lambda\, p_y.$$ Here $U_x$ and $U_y$ are the marginal utilities — the extra satisfaction from one more unit of each good. Divide the two equations to eliminate $\lambda$:

$$\frac{U_x}{U_y} = \frac{p_x}{p_y}, \qquad\text{equivalently}\qquad \boxed{\ \frac{U_x}{p_x} = \frac{U_y}{p_y}.\ }$$

The boxed form is the celebrated equal marginal utility per dollar condition: at the optimum, the last dollar spent on each good yields the same additional utility. The intuition is airtight — if a dollar spent on $x$ bought more utility than a dollar spent on $y$, you would reallocate toward $x$ until the two evened out. Lagrange multipliers turn this verbal argument into an equation, and the multiplier $\lambda$ itself is the common value $U_x/p_x = U_y/p_y$: the marginal utility of income, the extra satisfaction one more dollar of budget would buy.

Cobb–Douglas in closed form

Make it concrete with the Cobb–Douglas production/utility function $Q(L,K) = A L^a K^b$, the workhorse of economics, where (for a firm) $L$ is labor, $K$ is capital, and the firm maximizes output $Q$ subject to the cost budget $w L + r K = B$ (wage $\times$ labor plus rental rate $\times$ capital).

The Lagrange equations are $$Q_L = a A L^{a-1} K^{b} = \lambda w, \qquad Q_K = b A L^{a} K^{b-1} = \lambda r.$$ Divide the first by the second; the $A$ and most of the powers cancel: $$\frac{a A L^{a-1}K^{b}}{b A L^{a}K^{b-1}} = \frac{aK}{bL} = \frac{w}{r} \ \Longrightarrow\ K = \frac{b\,w}{a\,r}\,L.$$ Substitute into the budget $wL + rK = B$: $$wL + r\cdot\frac{bw}{ar}L = wL\Big(1 + \frac{b}{a}\Big) = wL\cdot\frac{a+b}{a} = B \ \Longrightarrow\ \boxed{\,L^* = \frac{aB}{w(a+b)},\quad K^* = \frac{bB}{r(a+b)}.\,}$$ (A symbolic solve in sympy confirms this exact solution.) The result is strikingly clean: the firm splits its budget between the two inputs in proportion to the exponents $a$ and $b$, which economists call the output elasticities. Spend the fraction $a/(a+b)$ of the budget on labor and $b/(a+b)$ on capital, full stop. This single formula underlies an enormous amount of applied production and demand analysis.
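The closed form can be checked symbolically. The following is a minimal sketch, assuming sympy is installed and that every parameter is positive; the variable names are ours. It forms the ratio $Q_L/Q_K$, sets it equal to $w/r$, and solves that tangency condition together with the budget.

```python
import sympy as sp

# Inputs and parameters, all assumed positive.
L, K = sp.symbols("L K", positive=True)
A, a, b, w, r, B = sp.symbols("A a b w r B", positive=True)

Q = A * L**a * K**b

# Dividing the two Lagrange equations eliminates lambda: Q_L / Q_K = w / r.
ratio = sp.simplify(sp.diff(Q, L) / sp.diff(Q, K))   # reduces to a*K/(b*L)

# Solve the tangency condition together with the budget w*L + r*K = B.
solution = sp.solve([ratio - w / r, w * L + r * K - B], [L, K], dict=True)[0]

L_star = sp.simplify(solution[L])
K_star = sp.simplify(solution[K])
print("L* =", L_star)
print("K* =", K_star)

# Budget shares: fraction of B spent on each input.
print("labor share   =", sp.simplify(w * L_star / B))
print("capital share =", sp.simplify(r * K_star / B))
```

The two shares should come out as $a/(a+b)$ and $b/(a+b)$, independent of prices and of $A$.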

Real-World Application — The shadow price. The Lagrange multiplier $\lambda$ is not just algebraic scaffolding; it has a precise economic meaning. It equals $\partial(\text{optimal } Q)/\partial B$, the marginal value of relaxing the constraint, or the firm's shadow price of the budget. If the firm could secure one more dollar of budget, its optimal output would rise by approximately $\lambda$. This is why $\lambda$ is quoted in units of the objective per unit of the constraint (here, output per budget dollar) and why managers care about it directly: it tells you, to first order, how much an extra unit of a scarce resource is worth, and therefore the most you should be willing to pay for it. The same interpretation reappears across engineering and operations research as the sensitivity of the optimum to the constraint.
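The shadow-price claim is easy to test numerically with the Cobb–Douglas closed form. In this sketch the parameter values ($A=1$, $a=0.3$, $b=0.5$, $w=2$, $r=4$, $B=100$) are made up for illustration. It computes $\lambda = Q_L/w$ at the optimum and compares it with a central finite-difference estimate of $dQ^*/dB$.

```python
# Illustrative (made-up) parameters for a Cobb-Douglas firm.
A, a, b = 1.0, 0.3, 0.5
w, r, B = 2.0, 4.0, 100.0

def optimal_inputs(budget):
    """Closed-form optimum L*, K* for a given budget."""
    L_opt = a * budget / (w * (a + b))
    K_opt = b * budget / (r * (a + b))
    return L_opt, K_opt

def optimal_output(budget):
    """Largest output attainable with the given budget."""
    L_opt, K_opt = optimal_inputs(budget)
    return A * L_opt**a * K_opt**b

# Multiplier from the first Lagrange equation: Q_L = lambda * w.
L_star, K_star = optimal_inputs(B)
lam = a * A * L_star**(a - 1) * K_star**b / w

# Sensitivity of the optimum to the budget, by central differences.
h = 1e-4
dQ_dB = (optimal_output(B + h) - optimal_output(B - h)) / (2 * h)

print(f"lambda = {lam:.6f}")
print(f"dQ*/dB = {dQ_dB:.6f}")
```

The two printed numbers should agree to many digits: the multiplier is the derivative of the optimal value with respect to the constraint level.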

31.11 Application: Least Squares and Maximum Likelihood

Constrained and unconstrained multivariable optimization are the twin engines of statistics and machine learning. Two cornerstones illustrate it.

Least squares (unconstrained). Fit a line $y = mx + c$ to data points $(x_i, y_i)$ by choosing $m$ and $c$ to minimize the sum of squared residuals $$S(m, c) = \sum_{i=1}^{n}\big(y_i - mx_i - c\big)^2.$$ This is an unconstrained optimization in two variables $m$ and $c$, solved by the §31.2 method: set $\nabla S = \mathbf 0$. The two equations $\partial S/\partial m = 0$ and $\partial S/\partial c = 0$ are the famous normal equations, a linear system whose solution gives the regression coefficients. Because each residual is linear in $m$ and $c$, the Hessian of $S$ is a constant matrix that is positive-semidefinite, and it is positive-definite as long as the $x_i$ are not all equal. In that case the critical point is the unique global minimum: no saddles, no local traps, a guaranteed best fit. Every least-squares line fit you will ever run is this single critical-point computation.
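To see the normal equations as an ordinary linear system, here is a small sketch on five made-up data points. Setting $\partial S/\partial m = 0$ and $\partial S/\partial c = 0$ gives $m\sum x_i^2 + c\sum x_i = \sum x_i y_i$ and $m\sum x_i + c\,n = \sum y_i$. The code solves that system, cross-checks against `np.polyfit`, and inspects the Hessian's eigenvalues.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
n = len(x)

# Normal equations from dS/dm = 0 and dS/dc = 0.
normal_matrix = np.array([[np.sum(x**2), np.sum(x)],
                          [np.sum(x),    n        ]])
rhs = np.array([np.sum(x * y), np.sum(y)])
m, c = np.linalg.solve(normal_matrix, rhs)
print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
# slope m = 1.96, intercept c = 1.10

# Cross-check against NumPy's built-in degree-1 least-squares fit.
m_ref, c_ref = np.polyfit(x, y, 1)

# The Hessian of S is twice the normal matrix; positive eigenvalues
# confirm the critical point is a minimum.
hessian = 2 * normal_matrix
eigenvalues = np.linalg.eigvalsh(hessian)
print("Hessian eigenvalues:", eigenvalues)
```

If all the $x_i$ were equal, the normal matrix would be singular and `np.linalg.solve` would fail, which is the degenerate case where the Hessian is only semidefinite.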

Maximum likelihood (often constrained). Statistical estimation frequently maximizes a log-likelihood $\ell(\boldsymbol\theta)$ over parameters $\boldsymbol\theta$. When the parameters must satisfy a constraint, most commonly that a set of probabilities sums to one, $\sum_i p_i = 1$, the natural tool is Lagrange multipliers. A classic result of the same kind: maximizing the entropy $H = -\sum_i p_i \log p_i$ subject to $\sum_i p_i = 1$ yields, via the Lagrange equations, a different distribution depending on what else is held fixed. With no further constraint the answer is the uniform distribution; with a fixed mean on a nonnegative variable it is the exponential; with a fixed mean and variance it is the Gaussian (for continuous variables the sums become integrals). The Gaussian is literally the maximum-entropy distribution for a fixed mean and variance: a deep bridge between optimization and probability, built on a Lagrange multiplier.
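The simplest of these cases can be worked in one line and then checked numerically. Setting the gradient of $H$ equal to $\lambda$ times the gradient of $\sum_i p_i - 1$ gives $-\log p_i - 1 = \lambda$ for every $i$, so all the $p_i$ are equal and $p_i = 1/n$. The sketch below assumes $n = 4$ outcomes and a deliberately lopsided starting guess; the small lower bound on each $p_i$ is our addition to keep the logarithm defined.

```python
import numpy as np
from scipy.optimize import minimize

n = 4  # assumed number of outcomes

# Maximize entropy = minimize negative entropy sum(p * log p).
neg_entropy = lambda p: np.sum(p * np.log(p))
sum_to_one = {'type': 'eq', 'fun': lambda p: np.sum(p) - 1}

p0 = np.array([0.7, 0.1, 0.1, 0.1])          # lopsided starting guess
res_h = minimize(neg_entropy, p0, method='SLSQP',
                 bounds=[(1e-9, 1.0)] * n,   # keep log(p) defined
                 constraints=sum_to_one,
                 options={'ftol': 1e-12})

p_opt = res_h.x
max_entropy = -res_h.fun
print("optimal p:", np.round(p_opt, 4))
print("entropy:", max_entropy, " log(n):", np.log(n))
```

The optimizer should return probabilities close to $1/4$ each, with entropy close to $\log 4$, matching the hand derivation.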

Python: constrained optimization with scipy

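For problems without a clean closed form, scipy.optimize handles both flavors. When no method is specified, minimize uses BFGS, a quasi-Newton method in the spirit of §31.7, for unconstrained problems, and switches to SLSQP (sequential least-squares programming) when constraints are supplied; SLSQP enforces the Lagrange conditions numerically rather than symbolically.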

# Unconstrained vs. constrained optimization, matching our hand results.
from scipy.optimize import minimize
import numpy as np

# (A) Unconstrained: minimize f(x,y) = x^2 + 4y^2 - 2x - 8y + 6  -> min at (1,1), value 1
f = lambda v: v[0]**2 + 4*v[1]**2 - 2*v[0] - 8*v[1] + 6
res_u = minimize(f, x0=np.array([0.0, 0.0]))
print(f"(A) min at {np.round(res_u.x, 3)}, value {res_u.fun:.3f}")
# (A) min at [1. 1.], value 1.000

# (B) Constrained: maximize xy on x^2 + y^2 = 1  (maximize = minimize the negative)
neg_xy   = lambda v: -v[0]*v[1]
on_circle = {'type': 'eq', 'fun': lambda v: v[0]**2 + v[1]**2 - 1}
res_c = minimize(neg_xy, x0=np.array([0.5, 0.5]), constraints=on_circle)
print(f"(B) max of xy = {-res_c.fun:.3f} at {np.round(res_c.x, 3)}")
# (B) max of xy = 0.500 at [0.707 0.707]

Result (A) reproduces Example 1 of §31.4 to the digit; result (B) reproduces the constrained optimum $xy = \tfrac12$ at $(\tfrac{\sqrt2}{2}, \tfrac{\sqrt2}{2})$ from the disk example, now read off the boundary circle directly. The lesson of our continuity theme stands: do the Lagrange algebra by hand to understand the answer, then let scipy scale it to the hundred-variable portfolios and million-parameter models where hand computation cannot follow.

Real-World Application — Markowitz portfolio theory. In finance, you choose portfolio weights $w_1, \dots, w_n$ across $n$ assets to minimize variance (risk) $\mathbf w^\top \Sigma\, \mathbf w$ subject to two constraints: the weights sum to one, $\sum_i w_i = 1$, and the expected return hits a target, $\sum_i w_i\mu_i = R$. This is a constrained quadratic program, solved exactly by Lagrange multipliers: two of them, one per constraint, in the spirit of §31.9. Sweeping the target return $R$ traces out the efficient frontier, the curve of best-possible risk-return trade-offs. Harry Markowitz shared the 1990 Nobel Memorial Prize in Economic Sciences (with Merton Miller and William Sharpe) for this framework, and Lagrange multipliers are the machinery underneath it.
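Because the objective is quadratic and both constraints are linear, the Lagrange equations for this problem form a single linear system. The sketch below uses a made-up three-asset covariance matrix, made-up expected returns, and a target $R = 0.08$; none of these numbers come from the text. The stationarity condition $2\Sigma\mathbf w = \lambda_1\mathbf 1 + \lambda_2\boldsymbol\mu$ is stacked with the two constraints and solved in one call.

```python
import numpy as np

# Made-up inputs for illustration: covariance, expected returns, target return.
Sigma = np.array([[0.040, 0.006, 0.000],
                  [0.006, 0.090, 0.012],
                  [0.000, 0.012, 0.160]])
mu = np.array([0.05, 0.08, 0.12])
R = 0.08
n = len(mu)
ones = np.ones(n)

# Unknowns: weights (n of them), then lam1, lam2.
# Rows 0..n-1:  2*Sigma w - lam1*1 - lam2*mu = 0   (gradient condition)
# Row  n:       1 . w  = 1                         (weights sum to one)
# Row  n+1:     mu . w = R                         (target return)
system = np.zeros((n + 2, n + 2))
system[:n, :n] = 2 * Sigma
system[:n, n] = -ones
system[:n, n + 1] = -mu
system[n, :n] = ones
system[n + 1, :n] = mu
rhs = np.concatenate([np.zeros(n), [1.0, R]])

solution = np.linalg.solve(system, rhs)
weights, lam1, lam2 = solution[:n], solution[n], solution[n + 1]
variance = weights @ Sigma @ weights

print("weights:", np.round(weights, 4))
print("portfolio variance:", variance)
print("multipliers:", lam1, lam2)
```

Putting everything in the middle asset also meets both constraints here (its expected return equals the target), so the optimal variance should be no larger than that asset's variance of 0.09. Re-solving for a range of $R$ values would trace out the efficient frontier.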

31.12 Convexity: When Optimization Is Easy

A final idea ties the chapter to gradient descent (Chapter 30) and explains why some problems are tractable and others are nightmares. A function $f$ is convex if its graph never bulges above any of its chords: formally, for all $\mathbf x, \mathbf y$ and $t \in [0,1]$, $$f\big(t\mathbf x + (1-t)\mathbf y\big) \le t\,f(\mathbf x) + (1-t)\,f(\mathbf y).$$ Equivalently, for twice-differentiable $f$: the Hessian is positive-semidefinite everywhere (all eigenvalues $\ge 0$ at every point), so the surface never curves downward in any direction, no matter where you stand.

The payoff is enormous: a differentiable convex function has no saddle points and no spurious local minima, because every critical point is a global minimum (and if the function is strictly convex, there is at most one). All the classification machinery of this chapter collapses into a single happy outcome. For convex problems that have a minimum, gradient descent with a suitably small step size reliably finds it from any starting point; there is nowhere bad to get stuck.
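A quick experiment makes the "any starting point" claim concrete. This sketch runs plain gradient descent on the convex quadratic $f(x,y) = x^2 + 4y^2 - 2x - 8y + 6$, whose Hessian has eigenvalues 2 and 8. The step size 0.1, the iteration count, and the four starting points are arbitrary choices of ours; the step only needs to stay below $2/8$ for this function.

```python
import numpy as np

# Gradient of f(x, y) = x^2 + 4y^2 - 2x - 8y + 6 (Hessian eigenvalues 2 and 8).
def grad_f(v):
    return np.array([2 * v[0] - 2, 8 * v[1] - 8])

def gradient_descent(start, step=0.1, iterations=200):
    """Plain fixed-step gradient descent; returns the final iterate."""
    v = np.array(start, dtype=float)
    for _ in range(iterations):
        v = v - step * grad_f(v)
    return v

# Arbitrary, widely scattered starting points.
starts = [(-5.0, 5.0), (10.0, -3.0), (0.0, 0.0), (4.0, 20.0)]
endpoints = [gradient_descent(s) for s in starts]

for s, e in zip(starts, endpoints):
    print(f"start {s} -> end {np.round(e, 6)}")
```

Every run should land on the same point, $(1, 1)$, the unique global minimum. On a non-convex function the same experiment can end at different points depending on the start.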

Many of the most important problems in applied mathematics are convex: linear and ridge regression, logistic regression, support-vector machines, and a great many financial optimizations including the Markowitz problem above. This is precisely why they are solved problems with reliable algorithms. Neural-network training, by contrast, is famously non-convex — its loss landscape is riddled with the saddle points of §31.6 — which is what makes deep learning an art as much as a science. Yet, as Chapter 30 noted, gradient descent works astonishingly well in practice even there, slipping past saddles to find minima that generalize.

Math Major Sidebar — Convex optimization as a field. Convexity anchors an entire discipline at the intersection of analysis, geometry, and algorithms. The standard reference, Boyd and Vandenberghe's Convex Optimization (2004), is freely available online and is foundational to operations research, control theory, machine learning, and quantitative finance. The deep theorems (duality, the KKT conditions that generalize Lagrange multipliers to inequality constraints, and interior-point methods) all grow from one geometric fact: the region on or above the graph of a convex function is a convex set, so the graph lies on or above each of its tangent planes, and a point where the tangent plane is horizontal cannot help but be a global minimum.

Add to Your Modeling Portfolio. Add a constrained optimization to your model, deriving the optimum with Lagrange multipliers and verifying it with scipy.optimize.minimize. Biology: maximize an organism's net energy intake (a foraging-utility function of two prey types) subject to a fixed time-or-energy budget — the optimal-foraging analog of consumer choice. Economics: maximize a Cobb–Douglas utility $U = x^a y^b$ subject to your budget $p_x x + p_y y = B$; derive the demand functions and interpret the multiplier $\lambda$ as the marginal utility of income. Physics: find a constrained equilibrium or a least-action configuration — e.g., the shape minimizing potential energy subject to fixed length or volume. Data Science: fit a model by minimizing squared error with a regularization constraint (e.g., $\|\mathbf w\|^2 \le c$), and connect the Lagrange multiplier to the ridge-regression penalty $\lambda$.

Looking Ahead

You can now find and classify the critical points of a function of several variables, locate absolute extrema on closed regions, and optimize under constraints by the elegant tangency argument of Lagrange. These are the tools that let calculus choose — the best design, the cheapest production plan, the most likely parameters.

The next move is from differentiating multivariable functions to integrating them. Chapter 32 introduces double and triple integrals — accumulation over two- and three-dimensional regions, the natural generalization of the definite integral of Chapter 13, with applications to volume, mass, center of mass, and probability over regions. Chapter 33 then develops the change of variables and the Jacobian, the multivariable version of $u$-substitution that makes those integrals computable. The optimization landscape you have just learned to read — its peaks, valleys, and saddles — is the terrain over which all of that integration will take place.

Reflection

Single-variable calculus gave you peaks and valleys. The leap to several variables gave you something genuinely new: the saddle point, a place that is level yet neither highest nor lowest, the geometric fingerprint of a world with more than one direction to move. The Hessian taught us to read that geometry off the second derivatives, and Lagrange multipliers taught us the unreasonably beautiful fact that constrained optima occur exactly where a level curve kisses a constraint — where two gradients fall into line. From the firm allocating its budget to the network learning its weights, the same equation $\nabla f = \lambda\nabla g$ is quietly doing the choosing. You have learned not just to find extremes, but to find them while bound — which is, after all, the only kind of optimization the real world ever asks for.