Chapter 29 — Exercises

DataField.Dev

Chapter 29 — Exercises

Work these with pencil first, then confirm with numpy where a snippet is invited. Difficulty is marked ⭐ (conceptual) to ⭐⭐⭐⭐ (application/synthesis). Problems marked [proof] want a rigorous argument; [code] want a short numpy script. Notation follows the chapter: $M$ is the raw link matrix, $S$ the dangling-patched stochastic matrix, $G = dS + \tfrac{1-d}{n}\mathbf{1}\mathbf{1}^{\mathsf{T}}$ the Google matrix, $d = 0.85$ unless stated, and $\mathbf{r}$ the PageRank vector with $M\mathbf{r}=\mathbf{r}$ (or $G\mathbf{r}=\mathbf{r}$).

Tier 1 — Conceptual (⭐)

The one-sentence summary. In your own words, complete the sentence: "PageRank is the __ eigenvector of the _ matrix, belonging to eigenvalue , and it is computed by ___." Then state what each of those four blanks means geometrically.
Columns vs. rows. In the link matrix $M$, does column $j$ describe page $j$'s outgoing or incoming links? What does row $i$ collect? If page $7$ has five outgoing links, what is the value of each nonzero entry in column $7$?
Why eigenvalue 1? Explain why the stationary distribution must belong to eigenvalue $1$ specifically, and not to some eigenvalue greater than $1$ or less than $1$. Tie your answer to the conservation of total probability.
Stochastic by definition. State the two conditions for a matrix to be column-stochastic. Which condition fails for the column of a dangling node? Which condition would fail if some link were assigned a negative weight?
The random surfer in words. Describe the random-surfer model in two or three sentences, including what the surfer does at a normal page and what she does (in the damped model) when she gets bored. Which of these two behaviors does the damping factor $d$ govern?
Two failures, one fix. Name the two structural problems with the raw link matrix that the Google matrix repairs, and explain in one sentence each how the damping/teleport term repairs them.
Reading a rank vector. A four-page web has PageRank $\mathbf{r} = (0.10, 0.45, 0.30, 0.15)$. Which page is most important? If page $2$ is pointed to by pages $3$ and $4$ only, while page $1$ is pointed to by all three other pages, why might page $2$ still outrank page $1$? (Answer conceptually — what besides count matters?)
Eigenvalue or eigenvector? A classmate says "PageRank computes the largest eigenvalue of the web." Correct them: which part of the eigenpair is the actual ranking, and what is the value of the relevant eigenvalue?

Tier 2 — Computation by hand (⭐⭐)

Build a link matrix. A three-page web has links: page $1 \to 2$, page $1 \to 3$, page $2 \to 3$, page $3 \to 1$. Write the raw link matrix $M$ (order the pages $1,2,3$). Confirm every column sums to $1$ (no dangling nodes here).
Patch a dangling node. A four-page web has links: $1 \to 2$, $2 \to 3$, $3 \to 1$, and page $4$ has no outgoing links. Write the raw $M$ (note its zero column), then write the patched stochastic matrix $S$.
Form the Google matrix. For the matrix $S$ from Exercise 10 with $n = 4$ and $d = 0.85$, compute the teleport contribution $\tfrac{1-d}{n}$ added to every entry, and write out the full Google matrix $G$ (you may leave entries as exact fractions or round to four decimals).
One step of the surfer by hand. Take the three-page matrix $M$ from Exercise 9. Starting from the uniform vector $\mathbf{x}_0 = (\tfrac13,\tfrac13,\tfrac13)$, compute $\mathbf{x}_1 = M\mathbf{x}_0$ by hand. Does it still sum to $1$? Which page has the most rank after one step?
Solve a tiny stationary equation. For the two-state migration matrix $P = \begin{psmallmatrix} 0.7 & 0.4 \\ 0.3 & 0.6 \end{psmallmatrix}$ (a column-stochastic matrix), find the stationary distribution by solving $(P - I)\mathbf{x} = \mathbf{0}$ and normalizing so the entries sum to $1$. Confirm your answer is an eigenvector with eigenvalue $1$.
Trace and determinant check. For the matrix $P$ in Exercise 13, compute $\operatorname{tr}(P)$ and $\det(P)$, and use the §23.3.3 shortcut ($\lambda_1 + \lambda_2 = \operatorname{tr}$, $\lambda_1\lambda_2 = \det$) to find both eigenvalues. Verify one of them is exactly $1$, as every stochastic matrix requires.
Damping changes the spectrum. The patched matrix $S$ of some web has eigenvalues $\{1, 0.6, -0.4\}$. Using the relationship from §29.5.2, what are the eigenvalues of the Google matrix $G$ with $d = 0.85$? What is the spectral gap (the distance from $|\lambda_2|$ to $1$), and what does it tell you about how fast power iteration converges?
Dangling drains probability. Take the four-page raw matrix $M$ from Exercise 10 (with dangling page $4$ left unpatched). Starting from $\mathbf{x}_0 = (\tfrac14,\tfrac14,\tfrac14,\tfrac14)$, compute $\mathbf{x}_1 = M\mathbf{x}_0$ and show its entries sum to less than $1$. Explain where the missing probability went.

Tier 3 — Proof (⭐⭐⭐, [proof]) and code (⭐⭐⭐, [code])

[proof] Stochastic preserves probability. Prove that if $M$ is column-stochastic and $\mathbf{x}$ has nonnegative entries summing to $1$, then $M\mathbf{x}$ also has nonnegative entries summing to $1$. (Reproduce the §29.3 argument in your own words, making the column-sum step explicit.)
[proof] Eigenvalue 1 always exists. Prove that every column-stochastic matrix $M$ has $1$ as an eigenvalue. (Hint: show $\mathbf{1}^{\mathsf{T}} M = \mathbf{1}^{\mathsf{T}}$, conclude $1$ is an eigenvalue of $M^{\mathsf{T}}$, then use the fact that $M$ and $M^{\mathsf{T}}$ share a characteristic polynomial.)
[proof] The Google matrix is stochastic and positive. Let $S$ be column-stochastic and $G = dS + \tfrac{1-d}{n}\mathbf{1}\mathbf{1}^{\mathsf{T}}$ with $0 < d < 1$. Prove (a) every column of $G$ sums to $1$, and (b) every entry of $G$ is strictly positive. Explain why (b) is the hypothesis the Perron-Frobenius theorem needs.
[proof] The damping eigenvalue relationship. Suppose $S\mathbf{r} = \mathbf{r}$ (the stationary eigenvector, with $\mathbf{1}^{\mathsf{T}}\mathbf{r} = 1$) and let $\mathbf{v}$ be any other eigenvector of $S$, $S\mathbf{v} = \mu\mathbf{v}$ with $\mu \ne 1$. Show that $\mathbf{1}^{\mathsf{T}}\mathbf{v} = 0$ (use that $S$ is stochastic so $\mathbf{1}^{\mathsf{T}}S = \mathbf{1}^{\mathsf{T}}$), then prove $G\mathbf{v} = d\mu\,\mathbf{v}$ — i.e. every non-Perron eigenvalue of $S$ is scaled by $d$ in $G$. (This is the §29.5.2 fact.)
[code] PageRank from scratch. Implement pagerank(link_matrix, damping=0.85, tol=1e-10) matching the §29.9 specification: patch dangling columns, form $G$, run power iteration from the uniform start with $\ell_1$ renormalization. Run it on the chapter's five-page web and print the result; confirm it equals $(0.4066, 0.2198, 0.1542, 0.1202, 0.0991)$ to four decimals.
[code] Verify it is the eigenvalue-1 eigenvector. Extend Exercise 21: for the five-page web, build $G$, compute its full spectrum with np.linalg.eig, extract the eigenvector for the eigenvalue closest to $1$, normalize it to sum $1$, and confirm it matches your power-iteration result. Also print $\lVert G\mathbf{r} - \mathbf{r}\rVert$ and confirm it is near machine zero.
[code] Watch convergence track the spectral gap. For the five-page Google matrix, run power iteration and record the $\ell_1$ error $\lVert \mathbf{x}_t - \mathbf{r}\rVert_1$ at each step. Compute the ratio of successive errors and confirm it approaches $|\lambda_2|$, the second-largest eigenvalue magnitude. Then repeat with $d = 0.5$ and $d = 0.95$ and report how the number of iterations to reach $\text{tol}=10^{-8}$ changes. Explain the trend using §29.5.2.
[code] Break it on purpose. Construct a two-page web that is a periodic trap: page $1 \to 2$ and page $2 \to 1$, nothing else. Iterate the raw matrix $M$ from $\mathbf{x}_0 = (1, 0)$ and show the vector oscillates forever without converging. Then build the damped $G$ and show power iteration now converges to $(\tfrac12, \tfrac12)$. Identify the eigenvalue of $M$ responsible for the oscillation (hint: it has magnitude $1$ but is not $+1$).

Tier 4 — Application & synthesis (⭐⭐⭐⭐)

[code] Rank a citation network. Five papers cite as follows (citer $\to$ cited): $P_2 \to P_1$; $P_3 \to P_1, P_2$; $P_4 \to P_1, P_2, P_3$; $P_5 \to P_1, P_2, P_3, P_4$; and $P_1$ cites nothing (dangling). Build the link matrix (treat a citation as a "vote" so importance flows to the cited paper), compute PageRank with $d=0.85$, and compare the ranking to the raw citation counts (in-degrees) $4, 3, 2, 1, 0$. Do they agree on the order here? Construct a different small example where PageRank and raw citation count disagree on which paper is most important, and explain what feature of the link structure causes the disagreement. (This is the heart of citation-influence metrics; see Case Study 2.)
[code] Rank sports teams. Four teams play a season; a loss is modeled as a link from the loser to the winner (the loser "endorses" the team that beat it). Results — beaten-by lists: team $A$ lost to $\{C, D\}$; team $B$ lost to $\{A, D\}$; team $C$ lost to $\{B, D\}$; team $D$ lost to $\{B\}$. Build the link matrix, compute the PageRank-style ranking with $d=0.85$, and report the order. Team $D$ beat the most opponents — does it rank first? Explain, using the recursive-importance idea, why beating a strong team can matter more than beating many weak ones. (See Case Study 1.)
Effect of the damping factor. Discuss qualitatively what happens to PageRank as $d \to 1$ (trust links completely) and as $d \to 0$ (ignore links, pure teleport). What rank does every page get when $d = 0$? Why is the limit $d \to 1$ mathematically dangerous (think about what Perron-Frobenius needs)? Why is $d = 0.85$ a compromise rather than a theorem?
[proof] Why teleport guarantees uniqueness. Explain, referencing the Perron-Frobenius conditions in §29.5.1, why a strictly positive Google matrix cannot have two independent stationary distributions, whereas a web split into two disconnected islands can. Connect "two islands" to the notion of a reducible matrix and "one positive $G$" to irreducible and aperiodic (primitive).