Exercises: Communicating with Data: Telling Stories with Numbers

These exercises progress from conceptual understanding of visualization principles through writing practice, chart revision, audience adaptation, and a full report-drafting exercise. Estimated completion time: 3 hours.

Difficulty Guide:
- ⭐ Foundational (5-10 min each)
- ⭐⭐ Intermediate (10-20 min each)
- ⭐⭐⭐ Challenging (20-40 min each)
- ⭐⭐⭐⭐ Advanced/Research (40+ min each)

Part A: Conceptual Understanding ⭐

A.1. In your own words, explain what Tufte means by "data-ink ratio." Why is maximizing this ratio generally a good idea?

A.2. True or false (explain each):

(a) A bar chart's y-axis should always start at zero.

(b) A line chart's y-axis should always start at zero.

(c) 3D effects on charts help the viewer understand the data better.

(d) Pie charts are the best way to show parts of a whole.

(e) Adding decorative icons to a bar chart increases the data-ink ratio.

(f) A graph can be technically accurate and still be misleading.

A.3. List three examples of "chartjunk" and explain why each one reduces the effectiveness of a visualization.

A.4. Explain the difference between writing statistical results for a technical audience versus a non-technical audience. Give one example of a sentence written for each audience that describes the same finding.

A.5. What is the purpose of an executive summary? Why is it placed at the beginning of a report rather than the end?

A.6. Why is reproducibility important in data analysis? List three specific practices that improve the reproducibility of an analysis.

Part B: Identifying Misleading Techniques ⭐

B.1. A news article shows a bar chart of crime rates in five cities. The y-axis starts at 850 incidents per 100,000 people. The city with the lowest rate (870) has a bar that's barely visible, while the city with the highest rate (920) has a tall bar. Identify the misleading technique and explain how you would fix it.

B.2. A company's annual report includes a line chart of revenue. Actual revenue was $50M in 2020, dipped to $35M in 2021 (during the pandemic), then recovered to $48M in 2022, $52M in 2023, and $55M in 2024. The chart's x-axis starts at 2021, omitting the 2020 figure.

(a) What misleading technique is being used?

(b) What impression does the cherry-picked time window create?

(c) How would you present this data honestly?

B.3. A presentation slide shows a dual-axis chart. The left axis shows temperature (in Fahrenheit) and the right axis shows ice cream sales (in dollars). Both lines appear to track each other perfectly. The presenter says, "As you can see, temperature directly drives ice cream sales."

Identify two problems with this presentation.

B.4. A political infographic shows two circles representing military spending: Country A ($500 billion) is shown as a circle with radius 2 inches, and Country B ($250 billion) is shown as a circle with radius 1 inch.

(a) What's the visual ratio of the two circles' areas?

(b) What's the actual spending ratio?

(c) Why is this misleading? How would you fix it?

B.5. For each of the following, identify the misleading technique or techniques being used:

(a) A pie chart with 15 slices, three of which are labeled "Other."

(b) A bar chart comparing unemployment rates: 4.2% for Party A's term and 4.5% for Party B's term. The y-axis runs from 4.0% to 4.6%.

(c) A pictogram showing company revenue where each year is represented by a progressively larger image of a dollar sign.

(d) A line chart titled "Crime at an All-Time Low" showing data from January to March of a single year.

Part C: Chart Revision ⭐⭐

C.1. You receive a chart with the following problems: 3D bar chart, rainbow colors for 8 categories, heavy gridlines, a large legend box covering part of the data, a dark blue background, no axis labels, and the title "Data." Rewrite the design specification for this chart, applying Tufte's principles. List each change you would make and why.

C.2. A colleague creates a scatterplot of test scores (y-axis) vs. study hours (x-axis). The points are gray, the regression line is thin and gray, there are no labels on either axis, no title, and no indication of sample size or $R^2$. The chart is technically correct but unhelpful. List at least five improvements you would make, explaining how each serves the viewer.

C.3. You have data on customer satisfaction (1-5 scale) for four products. Sketch or describe two different visualizations:

(a) One designed to make the differences between products look as large as possible (use misleading techniques on purpose).

(b) One designed to show the data honestly and clearly.

Explain what you changed between the two versions.

C.4. Redesign the chart described below so that it follows the principles from this chapter:

Original: A 3D pie chart showing market share for five competitors. The largest slice (32%) is "exploded" out from the rest. Colors are red, green, blue, yellow, and purple. No data labels — percentages are shown only in the legend.

Write a complete redesign specification, including: chart type, color palette, labeling strategy, title, and any annotations.

Part D: Writing Statistical Results ⭐⭐

D.1. Translate the following technical result into plain language for a general audience:

"A two-sample Welch's t-test comparing mean systolic blood pressure between the intervention group (n = 127, M = 128.3, SD = 14.2) and control group (n = 131, M = 135.7, SD = 15.8) yielded t(253.4) = -3.99, p < .001, d = 0.49, 95% CI for the difference: (-11.0, -3.7) mmHg."

D.2. Write both a technical and a plain-language version of the following finding:

A chi-square test of independence was performed on a 3×2 contingency table examining the relationship between education level (high school, bachelor's, graduate) and voting behavior (voted, did not vote) in a sample of 1,200 adults. The test yielded $\chi^2(2) = 18.7$, $p < .001$, Cramér's V = 0.12.
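
A quick check of the reported effect size can help when deciding how strongly to word this finding. The sketch below recomputes Cramér's V from the chi-square statistic, the sample size, and the table dimensions given above. The function name `cramers_v` is an illustrative choice, not something defined in the chapter.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V for an n_rows x n_cols contingency table."""
    # V = sqrt(chi2 / (n * (k - 1))), where k is the smaller table dimension
    k = min(n_rows, n_cols)
    return math.sqrt(chi2 / (n * (k - 1)))


# Values reported in D.2: chi2(2) = 18.7, n = 1,200, 3x2 table
v = cramers_v(18.7, 1200, 3, 2)
print(round(v, 2))
```

The rounded result agrees with the V = 0.12 stated in the exercise, so the plain-language version should describe a real but weak association.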

D.3. A researcher has these regression results: $\hat{y} = 12.4 + 3.2x$, where $y$ is annual income in thousands of dollars and $x$ is years of education beyond high school. $R^2 = 0.34$, $p < .001$, $n = 500$. The 95% CI for the slope is (2.7, 3.7).

Write three versions of this finding:

(a) For a peer-reviewed journal

(b) For a newspaper article

(c) For a policy brief to legislators

D.4. The following plain-language description contains errors. Identify each error and write a corrected version:

"Our study proved that the new drug cures high blood pressure. The p-value was 0.03, meaning there is only a 3% chance the drug doesn't work. The average blood pressure dropped from 142 to 138, which is clearly important."

D.5. Write a one-paragraph executive summary for the following analysis:

A consulting firm analyzed employee satisfaction survey data for a retail company. They compared satisfaction scores across four store regions (Northeast, Southeast, Midwest, West) using ANOVA. The overall test was significant: $F(3, 496) = 4.82$, $p = .003$, $\eta^2 = .028$. Post-hoc Tukey tests showed that the Southeast region ($M = 3.2$ out of 5) had significantly lower satisfaction than the Northeast ($M = 3.8$, $p = .001$) and Midwest ($M = 3.7$, $p = .008$), but not the West ($M = 3.5$, $p = .21$).
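
As a consistency check on the reported effect size (a side calculation, not part of the exercise itself), eta-squared for a one-way ANOVA can be recovered from the F statistic and its degrees of freedom:

$$
\eta^2 = \frac{F \cdot df_{\text{between}}}{F \cdot df_{\text{between}} + df_{\text{within}}} = \frac{4.82 \times 3}{4.82 \times 3 + 496} = \frac{14.46}{510.46} \approx .028
$$

This matches the reported $\eta^2 = .028$: region accounts for roughly 3% of the variance in satisfaction scores.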

Part E: Audience Adaptation ⭐⭐

E.1. You've found that a machine learning model used for hiring has a false positive rate of 15% for male applicants and 28% for female applicants. Write a 3-sentence summary of this finding for:

(a) A data science team meeting

(b) The company's VP of Human Resources

(c) A journalist writing about AI bias

E.2. You've conducted a regression analysis showing that for every 1°F increase in average summer temperature, emergency room visits for heat-related illness increase by 12 per 100,000 population ($p < .001$, $R^2 = 0.68$). Write a 2-sentence summary for:

(a) A public health journal

(b) A city council considering heat mitigation funding

E.3. Sam has finished his analysis of whether a new basketball training routine improves free throw percentage. The results: paired t-test, $t(19) = 2.34$, $p = .030$, $d = 0.52$. Average improvement: 4.7 percentage points (from 71.2% to 75.9%). 95% CI for the improvement: (0.5, 8.9) percentage points.

Write two versions:

(a) A paragraph for an academic sports science paper

(b) A slide talking point for the coaching staff
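
Before writing either version, it may help to confirm that Sam's reported numbers hang together. The sketch below rebuilds Cohen's d and the 95% confidence interval from the t statistic and the mean improvement. The critical value 2.093 is taken from a standard t table for 19 degrees of freedom, and all variable names are illustrative.

```python
import math

t_stat = 2.34      # reported paired t statistic
n = 20             # df = 19 implies 20 paired observations
mean_diff = 4.7    # average improvement in percentage points
t_crit = 2.093     # two-sided 95% critical value for df = 19 (t table)

# For a paired design, d = mean_diff / sd_diff = t / sqrt(n)
cohens_d = t_stat / math.sqrt(n)

# The standard error of the mean difference follows from t = mean_diff / se
se_diff = mean_diff / t_stat
ci_low = mean_diff - t_crit * se_diff
ci_high = mean_diff + t_crit * se_diff

print(round(cohens_d, 2), round(ci_low, 1), round(ci_high, 1))
```

The recomputed values agree with the reported d = 0.52 and interval (0.5, 8.9). Note how close the lower bound is to zero, which matters for how confidently the coaching staff version should be worded.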

Part F: Visualization Design ⭐⭐

F.1. For each of the following scenarios, recommend (1) a chart type, (2) a color strategy, and (3) one specific annotation to include:

(a) Showing how customer satisfaction has changed monthly over three years.

(b) Comparing average test scores across 10 schools.

(c) Showing the relationship between hours of sleep and reaction time for 200 participants.

(d) Showing the proportion of a city budget allocated to six departments.

(e) Comparing income distributions for three education levels.

F.2. You need to show the same dataset — monthly temperature and rainfall for a city over one year — to two different audiences. Describe how you would design the visualization differently for:

(a) A climate scientist at a research conference

(b) A tourist planning a vacation

F.3. Create the following chart in Python, applying all the design principles from this chapter:

A horizontal bar chart showing the percentage of adults who exercise regularly in five countries: Japan (35%), USA (28%), Germany (33%), Brazil (22%), and Australia (40%). Include: clean design, colorblind-friendly colors, data labels, a descriptive title that states the finding, and a source note.

# Write your solution here
# Apply: Tufte's principles, accessibility, annotation

Part G: Report Structure ⭐⭐

G.1. A student writes the following Introduction for their data analysis report:

"I used the Gapminder dataset. I loaded it into Python and cleaned it. Then I ran a regression. The results were interesting."

Identify at least four problems with this Introduction and rewrite it properly.

G.2. A Methods section includes the following sentence: "I deleted some rows that didn't look right and then ran the analysis." Explain why this is inadequate and rewrite it to meet reproducibility standards.

G.3. Write a Limitations section (4-5 sentences) for the following study: A researcher used U.S. Census data from 2020 to examine whether median household income predicts high school graduation rates across counties. They found $R^2 = 0.45$, $p < .001$.

G.4. Evaluate the following Results paragraph. Does it meet the standards from Section 25.7? What's missing?

"The regression was significant (p < .05). Income predicted graduation rates. The graph showed a positive trend."

Part H: Ethical Scenarios ⭐⭐⭐

H.1. A marketing manager asks you to create a chart showing that their product's customer satisfaction rating (4.2 out of 5) is "dramatically higher" than the competitor's (4.0 out of 5). They want the y-axis to start at 3.8. What do you do? Write a response to the manager explaining your decision.

H.2. You're preparing a report on racial disparities in mortgage lending. Your analysis shows that Black applicants are denied at a rate of 27% compared to 14% for white applicants, even after controlling for income, credit score, and debt-to-income ratio. The bank's legal team asks you to present the data "in the most favorable light possible."

(a) What ethical principles are in tension here?

(b) How would you present the data honestly while still acknowledging the complexity of the issue?

(c) What would you include in the Limitations section?

H.3. A news organization publishes an interactive map showing COVID-19 case rates by zip code. The map uses a color scale from green (low) to red (high). The areas with the highest case rates are predominantly low-income neighborhoods with larger populations of color.

(a) Is this map inherently problematic? Why or why not?

(b) What additional context or design choices could make the map more responsible?

(c) What would Tufte say about the design? What would an ethics review board say?

Part I: Reproducibility ⭐⭐

I.1. You receive a colleague's analysis file — a single Python script with no comments, variable names like x1, df2, and temp, and no documentation of the data source or cleaning steps. The script produces a chart with the title "Results." List at least five specific improvements that would make this analysis reproducible.

I.2. Write a "Reproducibility Header" for a Jupyter notebook that includes: (a) the analysis title, (b) analyst name and date, (c) data source with URL, (d) Python and library versions, and (e) random seed. Provide the actual Python code.

I.3. Explain why the following practice undermines reproducibility: "I opened the CSV in Excel, manually deleted the outliers, saved the file, and then loaded it into Python."

Part J: Comprehensive Application ⭐⭐⭐

J.1. You work for a nonprofit that runs an after-school tutoring program. You've collected data on 150 students showing that tutored students scored an average of 76 on a standardized test compared to 71 for non-tutored students. The two-sample t-test yielded $t(148) = 2.87$, $p = .005$, $d = 0.47$, 95% CI for the difference: (1.6, 8.4) points.

Write a complete one-page report including:

(a) An executive summary (3-4 sentences)

(b) A description of the key finding with appropriate hedging language

(c) A recommendation

(d) A Limitations paragraph (at least 3 limitations)

(e) A description of one visualization you would include

J.2. Maya's complete ER visit analysis includes these findings:

- Poverty rate is strongly correlated with ER visit rate ($r = 0.96$, $R^2 = 0.92$)
- Uninsured rate is also strongly correlated with ER visit rate ($r = 0.94$)
- Primary care physician access is negatively correlated ($r = -0.91$)
- In a multiple regression controlling for all three variables, only uninsured rate and PCP access are significant

Write two versions of this result: one for a technical public health audience, and one for the city council. Each should be 4-6 sentences.

J.3. Alex needs to present a dashboard to StreamVibe executives summarizing the A/B test. Design the dashboard layout by describing:

(a) The executive summary section (what goes in the first box?)

(b) Three key metrics to display prominently

(c) One visualization to include and its design specifications

(d) How you would communicate uncertainty

(e) The action recommendation

Part K: Critical Evaluation ⭐⭐⭐

K.1. Find a data visualization in a recent news article (any source). Apply the checklist from this chapter:

(a) Does it start the y-axis at an appropriate value?

(b) Is the time window appropriate?

(c) Is chartjunk present?

(d) Are colors accessible?

(e) Is the title informative or misleading?

(f) Is uncertainty shown?

Write a one-paragraph evaluation.

K.2. Evaluate the following executive summary. Identify at least three problems and rewrite it:

"Our analysis looked at the data and found that it was significant. The regression model had an R-squared of 0.15, proving that advertising causes sales to increase. We recommend doubling the advertising budget immediately. The p-value was really small."

K.3. A colleague shows you a presentation slide with the title "Customer Churn Drops 50%!" The slide shows a bar chart with churn rates of 4% (before) and 2% (after). The y-axis starts at zero.

(a) Is the title misleading? Why or why not?

(b) What additional information would you need to evaluate this claim?

(c) Rewrite the title to be more informative.

Part L: Python Implementation ⭐⭐⭐

L.1. Create a "before and after" visualization in Python. Take any dataset you've worked with in a previous chapter and create two versions of the same chart:

(a) A "before" version with at least three chartjunk elements

(b) An "after" version applying all the principles from this chapter

Include comments explaining each design decision.

L.2. Create a small multiples visualization showing a distribution or trend across four or more groups. Use shared axes, consistent colors, and clean design. Apply Tufte's principles throughout.

L.3. Write a Python function called polished_bar_chart() that takes a list of categories, a list of values, a title, and a y-axis label, and produces a publication-ready bar chart. The function should automatically:

- Start the y-axis at zero
- Remove top and right spines
- Add data labels on bars
- Use a colorblind-friendly color
- Add light horizontal gridlines
- Use a clean, descriptive title style

def polished_bar_chart(categories, values, title, ylabel,
                       color='steelblue', figsize=(8, 5)):
    """
    Create a publication-ready bar chart following Tufte's principles.

    Parameters:
    -----------
    categories : list of str
        Category labels for the x-axis
    values : list of float
        Values for each category
    title : str
        Descriptive title (state the finding, not just variables)
    ylabel : str
        Y-axis label with units
    color : str, optional
        Bar color (default: steelblue)
    figsize : tuple, optional
        Figure size (default: (8, 5))

    Returns:
    --------
    fig, ax : matplotlib figure and axes objects
    """
    # Write your implementation here
    pass
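
One way to test an implementation of `polished_bar_chart()` is to inspect the axes it returns. The sketch below is not a solution: `check_bar_chart` is a hypothetical helper that covers only some of the listed requirements (it does not check gridlines or color choice), and it assumes matplotlib is installed and that data labels are added as text objects on the axes. It is demonstrated on a plain default bar chart so that it runs before the exercise is completed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt


def check_bar_chart(ax, n_categories):
    """Return pass/fail results for a subset of the L.3 requirements."""
    return {
        "y_axis_starts_at_zero": bool(ax.get_ylim()[0] == 0),
        "top_spine_hidden": not ax.spines["top"].get_visible(),
        "right_spine_hidden": not ax.spines["right"].get_visible(),
        "one_bar_per_category": len(ax.patches) == n_categories,
        "one_label_per_bar": len(ax.texts) >= n_categories,
        "has_title": ax.get_title() != "",
        "has_ylabel": ax.get_ylabel() != "",
    }


# Demonstration on an unstyled chart: most checks should fail here
fig, ax = plt.subplots()
ax.bar(["A", "B", "C"], [3, 5, 2])
report = check_bar_chart(ax, n_categories=3)
for name, passed in report.items():
    print(f"{name}: {'ok' if passed else 'MISSING'}")
plt.close(fig)
```

Once your function is written, pass the `ax` it returns to the same helper; every entry should then report `ok`.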

Part M: Synthesis and Reflection ⭐⭐⭐⭐

M.1. "Data visualization is rhetoric." Agree or disagree? Write a 300-word argument drawing on examples from this chapter.

M.2. You're hiring a data analyst. Based on what you learned in this chapter, write three interview questions you would ask to evaluate their communication skills. For each question, explain what a good answer would include.

M.3. Reflect on a time in this course (or in another context) when you were misled by a visualization or a statistical claim. What technique was used? Now that you know the principles from this chapter, how would you have evaluated the claim differently?

M.4. The American Statistical Association's Ethical Guidelines state that statisticians should "present their findings and interpretations honestly and objectively." Tufte's principles focus on maximizing data and minimizing clutter. A marketing department wants to "tell a compelling story with the data." Are these three goals compatible? When might they conflict? Write a 200-word reflection.