Case Study 39.2 — Ranking a Citation Network with PageRank (Track C)

DataField.Dev

Case Study 39.2 — Ranking a Citation Network with PageRank (Track C)

A fully worked capstone walkthrough in the PageRank track, on a real-shaped network rather than the 4-node teaching example. We follow the project end to end and arrive at a ranking that no link-count could have produced. Every number matches the accompanying numpy.

The brief

Devin, a CS student, takes the PageRank track but wants more than the toy 4-node graph of the chapter. His pitch: "Rank a small academic citation network by influence, and show that influence is not the same as being cited often." This is a genuine applied linear algebra project — the same algorithm Google built on, run on a directed graph where an edge means "cites." He will build the column-stochastic link matrix, run power_iteration from his toolkit/eigen.py, and cross-check against a direct eigensolver and the Perron–Frobenius theorem (Chapter 29).

Step 1 — Encode the problem as a matrix

Devin assembles a 7-paper network. An edge u → v means paper u cites paper v, so a citation sends authority to the cited paper. His nodes, with a plausible structure:

Node	Paper	Cites
0	Found-A	(nothing — it is the oldest)
1	Found-B	Found-A
2	Survey	Found-A, Found-B, MethodX, MethodY
3	MethodX	Found-A, Found-B
4	MethodY	Found-B, MethodX
5	AppX	Survey, MethodX
6	AppY	Survey, MethodY, MethodX

The modeling decision of §39.8 is the whole game here: rows and columns are papers, and an entry encodes "fraction of paper j's citations that point to paper i." Devin notes the in-degrees (how often each paper is cited) up front, because his thesis is that they will not match the final ranking: MethodX is cited 4 times (the most), the two foundations 3 times each, and the two applications 0 times.

Step 2 — Choose the operation: the dominant eigenvector

PageRank is the dominant eigenvector of the Google matrix

$$G = d\,M + \frac{1-d}{n}\,\mathbf{1}\mathbf{1}^{\mathsf{T}},$$

where M is the column-stochastic link matrix, d = 0.85 is the damping factor, and the teleport term makes the chain irreducible. By Perron–Frobenius (Chapter 29), a positive column-stochastic matrix has a unique dominant eigenvalue of 1 with a positive eigenvector; power iteration converges to it. Devin states this condition in his report — Perron–Frobenius needs the positivity that the damping term guarantees — which is the difference between running the algorithm and understanding it.

One real-world wrinkle the toy example dodged: Found-A cites nothing — it is a dangling node, an all-zero column that would not be stochastic. Devin uses the standard fix: a dangling column links uniformly to all pages (1/n in every row). This is exactly the dangling-node milestone (Exercise 39.18), and handling it correctly is a mark of a mature project.

Step 3 — Compute it with the toolkit

The build-and-iterate is a dozen lines, leaning on matrices.apply for each multiplication:

# Column-stochastic link matrix with the dangling-node fix, then power iteration.
M = np.zeros((n, n))
for src in range(n):
    dests = cites[src]
    if not dests:
        M[:, src] = 1.0 / n              # dangling node -> uniform teleport
    else:
        for d_ in dests:
            M[d_, src] = 1.0 / len(dests)
G = 0.85 * M + (0.15 / n) * np.ones((n, n))
v = np.ones(n) / n                       # start uniform
for _ in range(100):                     # power iteration (Chapter 29)
    v_new = G @ v; v_new /= v_new.sum()
    if np.abs(v_new - v).max() < 1e-10:  # converged
        break
    v = v_new

Step 4 — Results

Power iteration converges in about 31 iterations to:

PageRank (d = 0.85):
  Found-A   0.3178     <- ranked #1
  Found-B   0.1945
  MethodX   0.1663
  Survey    0.1025
  MethodY   0.0988
  AppX      0.0600
  AppY      0.0600

This is the result Devin came for. Found-A ranks first (0.3178) even though three papers cite it — the same in-degree as Found-B. MethodX, the most-cited paper (in-degree 4), ranks only third (0.1663). And the two application papers, cited by nobody, sit at the bottom with 0.0600 each (the teleport floor). His headline: "The most influential paper is not the most cited one — influence flows from being cited by influential papers."

Why does Found-A win? It is cited by Found-B, by the Survey, and by MethodX — and those citers are themselves well-cited, so they pass on a lot of authority. MethodX is cited more often, but several of its citers (the applications) are low-authority, so each citation carries less weight. This recursion — "important = cited by important" — is precisely why the ranking needs an eigenvector and not a tally. Found-A and Found-B differ in rank (0.3178 vs 0.1945) despite identical in-degree because Found-A is cited by the higher-authority set.

Step 5 — Validation

Devin validates at all three levels of §39.8.2.

Theoretical invariants. The PageRank vector must be a probability distribution: he confirms it is nonnegative and sums to 1.000000. And the dominant eigenvalue must be exactly 1 — the Perron–Frobenius guarantee for a column-stochastic G.

Agreement with an independent computation. He cross-checks power iteration against np.linalg.eig(G): the dominant eigenvalue comes back as 1.0, and the normalized dominant eigenvector matches his power-iteration result to a maximum absolute difference of essentially zero (< 1e-8). Two independent methods — iteration and a direct eigensolver — agreeing on the same vector is strong evidence the computation is right.

A known-structure sanity check. The dangling-node handling is testable: if he removes the fix and lets Found-A's column be all zeros, the columns no longer sum to 1, the matrix is no longer stochastic, and the iteration drifts — confirming the fix matters. With it restored, every column sums to 1 and the ranking is stable.

Step 6 — Presentation

Devin's figures make the graph and its ranking legible:

The citation graph drawn with arrows, each node sized or shaded by its PageRank — so the eye sees instantly that Found-A is the hub of authority.
A bar chart contrasting in-degree with PageRank for each paper. The mismatch is the visual punchline: MethodX has the tallest in-degree bar but only the third-tallest PageRank bar.
The convergence curve: the maximum change in v per iteration, dropping toward zero over ~31 steps, making power iteration's convergence visible.

He explains, as the chapter's FAQ does, why power iteration rather than a characteristic polynomial: it needs only cheap sparse multiplications and scales to networks far larger than any direct eigensolver could handle — the real web is billions of nodes.

Step 7 — Limitations and next steps

Honest edges (rubric item 5): the network is hand-built and tiny, so the ranking is illustrative, not authoritative; the choice d = 0.85 is conventional but affects both the ranking and the convergence speed. Devin's next step (the ⭐⭐⭐⭐ extension): scrape a real small network — a Wikipedia category's internal links, or a citation list from a paper's references — recompute PageRank, and plot how convergence speed changes as d sweeps from 0.5 to 0.95. (Larger d weights the link structure more but converges more slowly, because the gap between the dominant eigenvalue and the next shrinks.)

What this capstone demonstrates

Devin's project lands the book's themes from a different angle than the SVD case study. Eigenvalues reveal a matrix's true action — the dominant eigenvector of G is the long-run importance of each page, the sixth recurring theme made literal. Geometry and algebra are one object — the eigenvector is simultaneously an algebraic fixed point (Gv = v) and the geometric stationary distribution of a random walk. Computation validated theory — the dominant eigenvalue came out exactly 1 as Perron–Frobenius promises, and power iteration matched the direct eigensolver to machine precision. And the in-degree-versus-PageRank contrast is the whole reason the algorithm needed linear algebra in the first place: importance is recursive, and recursion of that kind is an eigenproblem. He has built, validated, and can defend a complete piece of applied linear algebra — and could explain to anyone, in one sentence, why the most-cited paper is not the most influential one.