Case Study 2: Sam's Cross-Position Performance Comparison
The Setup
Sam Okafor has a new assignment from the Riverside Raptors' front office: help evaluate potential trades by comparing player performance across positions.
The challenge is subtle. Player performance data for the Raptors comes primarily from scout evaluations — experienced basketball observers who watch games and rate players on multiple dimensions using a 1-10 scale. These ratings are ordinal: a player rated 8 is considered better than one rated 6, but a rating of 8 isn't necessarily "twice as good" as a 4, and the gap between 6 and 8 needn't equal the gap between 4 and 6. Different scouts may calibrate differently, and the psychological distance between adjacent rating levels isn't uniform.
The general manager wants to know: Do performance ratings vary systematically across positions? If guards consistently receive higher ratings than centers, that might indicate scout bias (guards are flashier and easier to evaluate) rather than genuine performance differences. Or it might mean the Raptors need to invest more in their frontcourt.
Sam has collected ratings from three scouts across four positions over the past month. Each rating represents one game evaluation.
The Data
import numpy as np
from scipy import stats
from itertools import combinations
# ============================================================
# SAM'S CROSS-POSITION PERFORMANCE ANALYSIS
# ============================================================
np.random.seed(2026)
# Scout performance ratings (1-10 ordinal scale)
# Each value = one game evaluation by one scout
point_guards = [8, 7, 9, 8, 7, 8, 9, 7, 8, 8]
shooting_guards = [7, 6, 8, 7, 7, 6, 7, 8, 7, 6]
small_forwards = [6, 7, 5, 6, 7, 6, 5, 6, 7, 6]
power_forwards = [5, 6, 7, 5, 4, 6, 5, 6, 5, 7]
centers = [5, 4, 6, 5, 5, 4, 6, 5, 4, 5]
positions = {
'PG': point_guards,
'SG': shooting_guards,
'SF': small_forwards,
'PF': power_forwards,
'C': centers
}
# ---- Part 1: Descriptive Statistics ----
print("=" * 65)
print("SAM'S ANALYSIS: Performance Ratings by Position")
print("=" * 65)
print(f"\n{'Pos':<5} {'n':>4} {'Median':>8} {'Mean':>8} "
f"{'SD':>8} {'Min':>6} {'Max':>6} {'IQR':>6}")
print("-" * 55)
for name, data in positions.items():
    d = np.array(data)
    q1, q3 = np.percentile(d, [25, 75])
    print(f"{name:<5} {len(d):>4} {np.median(d):>8.1f} "
          f"{d.mean():>8.2f} {d.std(ddof=1):>8.2f} "
          f"{d.min():>6} {d.max():>6} {q3 - q1:>6.1f}")
# Rating frequency table
print(f"\nRating Distribution:")
print(f"{'Pos':<5}", end="")
for v in range(1, 11):
    print(f" {v:>4}", end="")
print()
print("-" * 55)
for name, data in positions.items():
    d = np.array(data)
    print(f"{name:<5}", end="")
    for v in range(1, 11):
        count = np.sum(d == v)
        # Right-align the dash to the same width as the counts
        print(f" {count:>4}" if count > 0 else f" {'-':>4}", end="")
    print()
The Analysis
# ---- Part 2: Why Nonparametric? ----
print("\n" + "=" * 65)
print("DIAGNOSTIC CHECKS")
print("=" * 65)
# Normality check per group
print("\nShapiro-Wilk Normality Tests:")
for name, data in positions.items():
    stat, p = stats.shapiro(data)
    result = "Non-normal" if p < 0.05 else "Plausibly normal"
    print(f"  {name}: W = {stat:.4f}, p = {p:.4f} ({result})")
print(f"""
Assessment:
- Ordinal scale (1-10): intervals not guaranteed equal
- Small samples (n = 10 per group)
- Some distributions show non-normality (discrete data)
- Scout ratings are subjective ordinal assessments
→ Kruskal-Wallis test is more appropriate than ANOVA
""")
# ---- Part 3: Kruskal-Wallis Test ----
print("=" * 65)
print("KRUSKAL-WALLIS TEST")
print("=" * 65)
H_stat, p_value = stats.kruskal(*positions.values())
N = sum(len(d) for d in positions.values())
k = len(positions)
print(f"\nH = {H_stat:.3f}")
print(f"df = {k - 1}")
print(f"N = {N}")
print(f"p-value = {p_value:.6f}")
if p_value < 0.05:
    print("\n*** Significant at α = 0.05 ***")
    print("Performance ratings differ across positions.")
# ---- Part 4: Post-Hoc Comparisons ----
if p_value < 0.05:
    print("\n" + "=" * 65)
    print("POST-HOC: PAIRWISE MANN-WHITNEY U TESTS")
    print("=" * 65)
    pairs = list(combinations(positions.keys(), 2))
    n_comp = len(pairs)
    alpha_bonf = 0.05 / n_comp
    print(f"\nNumber of pairwise comparisons: {n_comp}")
    print(f"Bonferroni-corrected α: {alpha_bonf:.4f}")
    print(f"\n{'Comparison':<12} {'Med₁':>6} {'Med₂':>6} "
          f"{'U':>8} {'p (adj)':>10} {'Sig?':>6}")
    print("-" * 55)
    significant_pairs = []
    for g1, g2 in pairs:
        d1 = np.array(positions[g1])
        d2 = np.array(positions[g2])
        stat, p = stats.mannwhitneyu(d1, d2, alternative='two-sided')
        adj_p = min(p * n_comp, 1.0)  # Bonferroni: multiply by number of tests
        sig = "*" if adj_p < 0.05 else ""
        if adj_p < 0.05:
            significant_pairs.append((g1, g2, adj_p))
        print(f"{g1:>3} vs {g2:<4} "
              f"{np.median(d1):>6.1f} {np.median(d2):>6.1f} "
              f"{stat:>8.1f} {adj_p:>10.4f} {sig:>6}")
    print("\nSignificant pairs (after Bonferroni correction):")
    if significant_pairs:
        for g1, g2, p in significant_pairs:
            print(f"  {g1} vs {g2}: adjusted p = {p:.4f}")
    else:
        print("  None (all adjusted p-values > 0.05)")
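Sam's Bonferroni adjustment (multiplying each p-value by the number of tests) controls the family-wise error rate but is conservative with ten comparisons. Holm's step-down procedure gives the same guarantee and is never less powerful. A minimal standalone sketch; the `holm_adjust` helper and the example p-values are illustrative, not part of Sam's script:

```python
import numpy as np

def holm_adjust(pvals):
    """Holm step-down: multiply the i-th smallest p-value by (m - i),
    enforce monotonicity across the sorted sequence, cap at 1."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for i, idx in enumerate(order):
        running_max = max(running_max, (m - i) * p[idx])
        adjusted[idx] = min(running_max, 1.0)
    return adjusted

# Illustrative raw p-values from five hypothetical pairwise tests
raw = [0.001, 0.012, 0.020, 0.045, 0.300]
print("Holm:      ", holm_adjust(raw))
print("Bonferroni:", np.minimum(np.array(raw) * len(raw), 1.0))
```

With these example numbers, Holm keeps the 0.012 result significant at α = 0.05 (adjusted p = 0.048) where Bonferroni does not (adjusted p = 0.060).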
# ---- Part 5: ANOVA Comparison ----
print("\n" + "=" * 65)
print("COMPARISON: ONE-WAY ANOVA (for reference)")
print("=" * 65)
F_stat, p_anova = stats.f_oneway(*positions.values())
print(f"\nF({k-1}, {N-k}) = {F_stat:.3f}")
print(f"p-value = {p_anova:.6f}")
print(f"\nAgreement: {'Both significant' if (p_value < 0.05 and p_anova < 0.05) else 'Disagree!'}")
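The H statistic and p-value say the positions differ, not by how much. A common effect-size measure for Kruskal-Wallis is epsilon-squared, ε² = H/(N−1); an eta-squared analogue, (H − k + 1)/(N − k), is also used. A standalone sketch on Sam's data (the interpretation thresholds in the comment are rough conventions, not fixed rules):

```python
import numpy as np
from scipy import stats

positions = {
    'PG': [8, 7, 9, 8, 7, 8, 9, 7, 8, 8],
    'SG': [7, 6, 8, 7, 7, 6, 7, 8, 7, 6],
    'SF': [6, 7, 5, 6, 7, 6, 5, 6, 7, 6],
    'PF': [5, 6, 7, 5, 4, 6, 5, 6, 5, 7],
    'C':  [5, 4, 6, 5, 5, 4, 6, 5, 4, 5],
}
H, p = stats.kruskal(*positions.values())
N = sum(len(g) for g in positions.values())  # 50 ratings total
k = len(positions)                           # 5 positions

eps_sq = H / (N - 1)              # epsilon-squared
eta_sq = (H - k + 1) / (N - k)    # eta-squared analogue
print(f"epsilon^2 = {eps_sq:.3f}, eta^2_H = {eta_sq:.3f}")
# By common rules of thumb, values above roughly 0.14 read as large
```

Both measures land well above conventional "large" thresholds here, which matches the visually obvious separation between guard and center ratings.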
The Story in the Ranks
Sam creates a visualization to help the coaching staff understand what the Kruskal-Wallis test found:
import matplotlib.pyplot as plt
# ---- Part 6: Visualization ----
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Box plots
data_for_box = [positions[p] for p in positions]
bp = axes[0].boxplot(data_for_box,
                     labels=list(positions.keys()),
                     patch_artist=True)
colors = ['#2196F3', '#4CAF50', '#FF9800', '#9C27B0', '#F44336']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.6)
axes[0].set_ylabel('Performance Rating (1-10)')
axes[0].set_title('Performance Ratings by Position')
axes[0].set_ylim(1, 10)
axes[0].axhline(y=np.median(np.concatenate(list(positions.values()))),
color='gray', linestyle='--', alpha=0.5,
label='Overall median')
axes[0].legend()
# Mean rank comparison
all_data = np.concatenate(list(positions.values()))
all_labels = []
for name, data in positions.items():
    all_labels.extend([name] * len(data))
# Compute ranks over the pooled ratings
from scipy.stats import rankdata
ranks = rankdata(all_data)
mean_ranks = {}
for name in positions:
    mask = np.array([lbl == name for lbl in all_labels])
    mean_ranks[name] = ranks[mask].mean()
pos_names = list(mean_ranks.keys())
rank_values = list(mean_ranks.values())
bars = axes[1].barh(pos_names, rank_values, color=colors, alpha=0.7)
axes[1].set_xlabel('Mean Rank')
axes[1].set_title('Mean Rank by Position (Kruskal-Wallis)')
axes[1].axvline(x=(N+1)/2, color='gray', linestyle='--',
alpha=0.5, label='Expected under H₀')
axes[1].legend()
# Add mean rank values on bars
for bar, val in zip(bars, rank_values):
    axes[1].text(val + 0.5, bar.get_y() + bar.get_height() / 2,
                 f'{val:.1f}', va='center')
plt.tight_layout()
plt.savefig('position_ratings_analysis.png', dpi=150,
bbox_inches='tight')
plt.show()
print(f"\nMean ranks (expected under H₀ = {(N+1)/2:.1f}):")
for name, rank in mean_ranks.items():
    deviation = rank - (N + 1) / 2
    direction = "above" if deviation > 0 else "below"
    print(f"  {name}: {rank:.1f} ({abs(deviation):.1f} {direction} expected)")
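The mean ranks are not just a visualization device; they are the raw material of the H statistic. As a sanity check on the pipeline, H can be recomputed from them directly, including the tie correction scipy applies (the 1-10 scale produces many tied ratings). A standalone sketch repeating Sam's data:

```python
import numpy as np
from scipy import stats
from scipy.stats import rankdata

positions = {
    'PG': [8, 7, 9, 8, 7, 8, 9, 7, 8, 8],
    'SG': [7, 6, 8, 7, 7, 6, 7, 8, 7, 6],
    'SF': [6, 7, 5, 6, 7, 6, 5, 6, 7, 6],
    'PF': [5, 6, 7, 5, 4, 6, 5, 6, 5, 7],
    'C':  [5, 4, 6, 5, 5, 4, 6, 5, 4, 5],
}
all_data = np.concatenate([np.array(g) for g in positions.values()])
N = len(all_data)
ranks = rankdata(all_data)

# H = 12/(N(N+1)) * sum of n_i * (mean_rank_i - (N+1)/2)^2
H_raw, start = 0.0, 0
for g in positions.values():
    n_i = len(g)
    mean_rank = ranks[start:start + n_i].mean()
    H_raw += n_i * (mean_rank - (N + 1) / 2) ** 2
    start += n_i
H_raw *= 12 / (N * (N + 1))

# Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N) over tied values
_, t = np.unique(all_data, return_counts=True)
H_hand = H_raw / (1 - np.sum(t**3 - t) / (N**3 - N))

H_scipy, _ = stats.kruskal(*positions.values())
print(f"H by hand: {H_hand:.3f}, H from scipy: {H_scipy:.3f}")
```

The two values agree, which confirms that the mean-rank bar chart and the test statistic are telling the same story.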
Sam's Scouting Report
Sam drafts his report for the front office:
=================================================================
RIVERSIDE RAPTORS — PERFORMANCE RATING ANALYSIS
Prepared by: Sam Okafor, Analytics Intern
=================================================================
QUESTION: Do scout performance ratings differ across positions?
METHOD: Kruskal-Wallis test (nonparametric alternative to
ANOVA). Chosen because scout ratings are ordinal — a rating
of 8 indicates "better" than 6, but the magnitude of the
difference is subjective and not guaranteed to be uniform
across the scale.
RESULT: Performance ratings differ significantly across
positions (H = 32.58, df = 4, p < 0.001).
KEY FINDINGS:
- Point Guards receive the highest ratings
(median = 8.0, mean rank = 42.9)
- Centers receive the lowest ratings
(median = 5.0, mean rank = 10.2)
- The pattern is monotonically decreasing:
PG > SG > SF > PF > C
POST-HOC (Bonferroni-corrected):
- PG rated significantly higher than SF, PF, and C
- SG rated significantly higher than C
- SG vs. PF and SF vs. C land close to the corrected
threshold (adjusted p near 0.05)
- Other pairs, including PG vs. SG, are not significant
after correction
INTERPRETATION:
There are two possible explanations for this pattern:
1. GENUINE PERFORMANCE GRADIENT: Guards on the Raptors are
genuinely performing at a higher level than frontcourt
players. This would suggest the front office should
prioritize acquiring or developing frontcourt talent.
2. SCOUT RATING BIAS: Guards produce more visible, highlight-
friendly plays (three-pointers, assists, steals). Centers
contribute in ways that are harder to rate numerically
(screens, defensive positioning, rebounding position).
The rating system may systematically undervalue
frontcourt contributions.
RECOMMENDATION:
Before making personnel decisions based on these ratings:
a) Review scout rating criteria for position-specific
adjustments
b) Compare ratings to objective metrics (PER, Win Shares)
to check for systematic bias
c) Consider position-specific rating scales
d) Survey scouts about their rating processes
STATISTICAL NOTE:
The Kruskal-Wallis test was used because scout ratings are
ordinal (1-10 scale). An ANOVA on the same data gives
a concordant result (p < 0.001), but the Kruskal-Wallis
test is methodologically more appropriate for ordinal data.
=================================================================
The Deeper Question
Sam brings the analysis to the coaching staff meeting. Head Coach Williams looks at the results and asks the question Sam was hoping for:
"So are our guards actually better, or are our scouts just rating them higher because they're more fun to watch?"
Sam replies: "That's exactly the right question, Coach. The statistics can tell us that the ratings differ across positions. But whether that difference reflects genuine performance or rating bias — that's something the statistics alone can't answer. We need to dig deeper."
He continues: "Think of it this way. The Kruskal-Wallis test tells us the pattern in the ratings is real — it's not random noise. But the cause of that pattern could be performance differences, rating bias, or both. To tease those apart, we'd need to compare the ordinal ratings against objective metrics that don't depend on scout subjectivity."
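One concrete way to tease those apart is a rank correlation between scout ratings and an objective metric for the same players. Spearman's rho is the natural choice because it, like the tests above, uses only the ordering of the values. A sketch of the mechanics; the paired numbers below are invented for illustration, not real Raptors data:

```python
from scipy import stats

# Hypothetical paired data: one scout rating and one objective metric
# per player (values invented purely to illustrate the approach)
scout_rating = [8, 7, 9, 6, 5, 7, 5, 6, 4, 8]
objective_metric = [18.2, 15.1, 21.4, 16.0, 12.8, 14.5, 15.5, 13.9, 11.7, 17.3]

rho, p = stats.spearmanr(scout_rating, objective_metric)
print(f"Spearman rho = {rho:.3f}, p = {p:.4f}")
# A strong rho suggests ratings track objective play; systematic,
# position-specific gaps between the two rankings would hint at bias
```

In practice Sam would run this separately by position: if the correlation is strong for guards but weak for centers, that pattern itself is evidence the rating system handles frontcourt play poorly.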
Theme 1 — Statistics as a Superpower: The nonparametric test didn't just answer Sam's question — it reframed it. By confirming that the rating differences are real (not random), the Kruskal-Wallis test elevated the conversation from "do the numbers look different?" to "why do the numbers look different?" That's the upgrade from data to insight. And it's the kind of question that leads to better decisions — whether those decisions are about player acquisition, scout training, or rating system design.
Lessons for Your Own Work
- Ordinal data demands nonparametric methods. Scout ratings, survey responses, pain scales — any time the numbers represent ordered categories rather than precise measurements, consider nonparametric tests. The methods in this chapter are designed for exactly this situation.
- Statistical significance doesn't explain causation. Sam found that ratings differ across positions. The test doesn't tell him why. Real performance differences? Scout bias? Both? Answering that requires additional data and domain expertise — a theme that has run through this textbook since Chapter 4.
- Small samples aren't a death sentence. Sam had only 10 ratings per position — too few for reliable normality assessment. Nonparametric methods gave him trustworthy results with modest data.
- Present results in context. Sam's report doesn't just state the test statistic and p-value. It explains what the result means for the organization, offers competing interpretations, and recommends next steps. That's what separates statistical analysis from statistical reporting.
- The best analysis raises better questions. The Kruskal-Wallis test didn't end the investigation — it focused it. Now the team knows where to look: not "are there differences?" but "what's driving the differences we know are real?"
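A quick demonstration of why the first lesson holds: rank-based tests depend only on orderings, so any strictly increasing relabeling of an ordinal scale leaves them untouched, while mean-based tests shift. A sketch using two of Sam's groups (the cube relabeling is an arbitrary choice, any strictly increasing function behaves the same way):

```python
import numpy as np
from scipy import stats

pg = np.array([8, 7, 9, 8, 7, 8, 9, 7, 8, 8])  # PG ratings from the case study
c = np.array([5, 4, 6, 5, 5, 4, 6, 5, 4, 5])   # C ratings

def squash(x):
    """Strictly increasing relabeling: order preserved, spacing changed."""
    return x ** 3

u1, p1 = stats.mannwhitneyu(pg, c, alternative='two-sided')
u2, p2 = stats.mannwhitneyu(squash(pg), squash(c), alternative='two-sided')
print(p1 == p2)    # True: the rank test sees identical orderings

t1, tp1 = stats.ttest_ind(pg, c)
t2, tp2 = stats.ttest_ind(squash(pg), squash(c))
print(tp1 == tp2)  # False: means and variances change with the relabeling
```

This invariance is exactly what makes rank-based methods trustworthy for scout ratings: the conclusions cannot depend on an interval interpretation the 1-10 scale never promised.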