Key Takeaways: Exploring Data — Graphs and Descriptive Statistics

One-Sentence Summary

Data visualization transforms raw numbers into pictures that reveal patterns, shapes, and surprises — and the first skill of statistical thinking is learning to see data as a distribution, not as individual values.

Core Concepts at a Glance

Concept	Definition	Why It Matters
Histogram	Divides numerical data into equal-width bins with touching bars	Reveals the shape of a distribution — the single most important graph in introductory statistics
Bar chart	Displays categorical frequencies with separate (non-touching) bars	Shows how observations are spread across categories
Distribution shape	The overall pattern: symmetric, skewed, unimodal, bimodal	The shape tells the story — two datasets with the same mean can have completely different shapes
Outlier	An observation far from the rest of the data	May be an error, an anomaly, a genuine extreme, or the most important data point — investigate before acting
Distribution thinking	Seeing data as a whole distribution rather than individual numbers	The threshold concept that separates looking at data from truly understanding it

Graph Selection Guide

What type of variable(s) are you graphing?
│
├── ONE CATEGORICAL variable
│   ├── Bar chart ← DEFAULT (always works)
│   └── Pie chart (≤ 5 categories, parts of a whole only)
│
├── ONE NUMERICAL variable
│   ├── Histogram ← DEFAULT (always works)
│   └── Stem-and-leaf plot (≤ 50 observations, want exact values)
│
├── TWO CATEGORICAL variables
│   ├── Grouped bar chart (side-by-side)
│   └── Stacked bar chart
│
├── ONE CATEGORICAL + ONE NUMERICAL
│   ├── Side-by-side histograms
│   └── Side-by-side box plots (Chapter 6)
│
└── TWO NUMERICAL variables
    └── Scatterplot (Chapter 22)
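
The decision tree above can also be read as a small lookup table. The sketch below is one possible encoding, not part of the chapter: the function name `suggest_graph` and the labels 'categorical' and 'numerical' are hypothetical, and where the tree lists two options without marking a default, the first one listed is returned as an assumption.

```python
def suggest_graph(var_types):
    """Return a starting-point graph for a tuple of variable types.

    var_types holds one or two of the strings 'categorical' / 'numerical'
    (hypothetical labels chosen for this sketch).
    """
    # Sort so that ('numerical', 'categorical') and the reverse match the same key.
    kinds = tuple(sorted(var_types))
    defaults = {
        ("categorical",): "bar chart",
        ("numerical",): "histogram",
        # Assumption: the tree marks no default here, so take the first option listed.
        ("categorical", "categorical"): "grouped bar chart",
        ("categorical", "numerical"): "side-by-side histograms",
        ("numerical", "numerical"): "scatterplot",
    }
    if kinds not in defaults:
        raise ValueError(f"no suggestion for variable types {var_types!r}")
    return defaults[kinds]


print(suggest_graph(("numerical",)))
print(suggest_graph(("numerical", "categorical")))
```

The lookup only returns a starting point. The conditions in the tree (for example, at most 5 categories for a pie chart) still need a human check.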

Quick Reference: Bar Chart vs. Histogram

Feature	Bar Chart	Histogram
Variable type	Categorical	Numerical
Bars touch?	No (gaps between bars)	Yes (bars are adjacent)
X-axis shows	Category names	Numerical scale (bins)
Bar order	Can rearrange	Must follow number line
Bar width	Cosmetic only	Defines bin width

The simplest test: If the x-axis has words, it's a bar chart. If the x-axis has numbers, it's probably a histogram.

Distribution Shape Vocabulary

Term	What It Looks Like	Real-World Example
Symmetric	Left and right sides are mirror images	Human body temperatures
Skewed right	Long tail stretches to the right (higher values)	Household income
Skewed left	Long tail stretches to the left (lower values)	Easy exam scores
Unimodal	One peak	Heights of adult women
Bimodal	Two peaks	Flu cases by age (children + elderly)
Uniform	All bars roughly equal height	Rolling a fair die

Memory trick: The skew is named for the direction of the tail, not the hump. Right-skewed = tail points right, hump on the left.

The Four-Part Description (Use Every Time)

When you look at any histogram, describe:

Shape — Symmetric or skewed? Unimodal, bimodal, or uniform?
Center — Where is the approximate middle?
Spread — How wide is the distribution? (Range from min to max)
Unusual features — Any outliers? Gaps? Clusters?

Example: "The distribution of watch times is skewed right and unimodal, centered around 25-30 minutes, with a spread from 5 to 180 minutes. A few outliers beyond 120 minutes represent binge-watching sessions."
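
The numbers behind the first three parts of a description can be computed directly. The sketch below uses only the Python standard library. The helper name `describe_distribution`, the made-up watch times, and the 5%-of-range tolerance are all assumptions for illustration. The mean-versus-median comparison is a common rule of thumb for skew direction, not a substitute for looking at the histogram. Unusual features (outliers, gaps, clusters) are left to visual inspection.

```python
import statistics


def describe_distribution(values):
    """Return rough numeric inputs for the shape, center, and spread parts."""
    data = sorted(values)
    low, high = data[0], data[-1]
    mean = statistics.mean(data)
    median = statistics.median(data)

    # Assumption: a mean/median gap above 5% of the range counts as skew.
    # A long right tail pulls the mean above the median, and vice versa.
    tolerance = 0.05 * (high - low)
    if mean - median > tolerance:
        shape_hint = "skewed right"
    elif median - mean > tolerance:
        shape_hint = "skewed left"
    else:
        shape_hint = "roughly symmetric"

    return {"shape_hint": shape_hint, "center": median, "spread": (low, high)}


# Made-up watch times in minutes, for illustration only.
watch_times = [5, 12, 18, 22, 25, 26, 28, 30, 33, 40, 55, 130, 180]
summary = describe_distribution(watch_times)
print(summary)
```

For these made-up values the two long sessions (130 and 180 minutes) pull the mean well above the median of 28, so the hint comes back as skewed right.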

Common Graphing Mistakes

Mistake	The Problem	The Fix
Truncated axis	Bar heights don't reflect true ratios	Start bar chart axes at zero
3D effects	Perspective distortion skews comparisons	Always use 2D charts
Unequal bin widths	Wider bins collect more observations, so their bars look taller than the data warrant	Use equal-width bins
Wrong graph type	E.g., histogram for categorical data	Match graph to variable type
Too many pie slices	Impossible to compare similar-sized slices	Limit to 5 categories or use a bar chart
Missing labels	Reader can't interpret the graph	Always include title, axis labels, and units
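
The truncated-axis problem is easy to quantify. The short sketch below compares the ratio of two drawn bar heights when the axis starts at zero and when it starts higher. The function name `drawn_height_ratio` and the values 100, 95, and 90 are hypothetical, chosen only to make the arithmetic visible.

```python
def drawn_height_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn bar heights for values a and b.

    A bar is drawn from axis_start up to its value, so its visible
    height is (value - axis_start), not the value itself.
    """
    if axis_start >= min(a, b):
        raise ValueError("axis_start must be below both values")
    return (a - axis_start) / (b - axis_start)


# Hypothetical values: two bars representing 100 and 95.
honest = drawn_height_ratio(100, 95)                    # axis starts at 0
truncated = drawn_height_ratio(100, 95, axis_start=90)  # axis starts at 90

print(f"axis from 0:  first bar is {honest:.2f}x the second")
print(f"axis from 90: first bar is {truncated:.2f}x the second")
```

With the axis starting at 90, the visible heights are 10 and 5, so the first bar is drawn twice as tall as the second even though the underlying values differ by about 5%.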

Python Quick Reference

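import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# df is assumed to be a pandas DataFrame you have already loaded
# (for example with pd.read_csv); the column names below are placeholders.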

# Bar chart (categorical variable)
sns.countplot(data=df, x='category_column', color='steelblue')
plt.title('Title')
plt.xlabel('X Label')
plt.ylabel('Count')
plt.show()

# Histogram (numerical variable)
sns.histplot(data=df, x='number_column', bins=15, edgecolor='white')
plt.title('Title')
plt.xlabel('X Label (units)')
plt.ylabel('Frequency')
plt.show()

# Overlaid histograms (comparing groups)
sns.histplot(data=df, x='number_column', hue='group_column',
             bins=10, alpha=0.5, edgecolor='white')
plt.title('Title')
plt.xlabel('X Label (units)')
plt.ylabel('Frequency')
plt.show()

Key Terms

Term	Definition
Histogram	Numerical data divided into equal-width bins, displayed as touching bars
Bar chart	Categorical data displayed as separate bars with gaps
Pie chart	Proportions shown as slices of a circle
Stem-and-leaf plot	Data split into stems and leaves, preserving exact values
Frequency distribution	Table organizing data into classes with counts
Relative frequency	Proportion of observations in a class (count / total)
Distribution shape	Overall pattern of a histogram (symmetric, skewed, etc.)
Symmetric	Left and right sides are approximately mirror images
Skewed right	Longer tail extends toward larger values
Skewed left	Longer tail extends toward smaller values
Unimodal	Distribution with one peak
Bimodal	Distribution with two peaks
Outlier	Observation far from the rest of the data
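
The frequency distribution and relative frequency entries above can be made concrete with a short standard-library sketch. The function name `frequency_table`, the made-up scores, and the choice of bins that include their lower edge and exclude their upper edge are assumptions for illustration.

```python
from collections import Counter


def frequency_table(values, bin_width, start):
    """Build rows of (low, high, count, relative_frequency).

    Bins are equal-width intervals [low, high) beginning at start.
    """
    if not values:
        raise ValueError("values must not be empty")
    if min(values) < start:
        raise ValueError("start must be at or below the smallest value")

    # Map each value to the index of the bin it falls into, then count.
    counts = Counter(int((v - start) // bin_width) for v in values)
    total = len(values)

    rows = []
    for k in range(max(counts) + 1):
        low = start + k * bin_width
        count = counts.get(k, 0)  # bins with no observations get a zero row
        rows.append((low, low + bin_width, count, count / total))
    return rows


# Made-up exam scores, for illustration only.
scores = [52, 58, 61, 67, 70, 73, 75, 78, 81, 84, 88, 95]
table = frequency_table(scores, bin_width=10, start=50)
for low, high, count, rel in table:
    print(f"{low}-{high}: count={count}, relative frequency={rel:.3f}")
```

Each relative frequency is count / total, so the relative frequencies in the table add up to 1.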

The One Thing to Remember

If you forget everything else from this chapter, remember this:

A single number can never fully describe a dataset. Two distributions can have the same mean but completely different shapes — and completely different stories. Always look at the shape. The shape is the story.
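
To see this concretely, the sketch below builds three small made-up datasets that share a mean of 6 and prints a text histogram of each. The data and the helper name `text_histogram` are hypothetical, chosen only so the shapes differ while the mean stays the same.

```python
import statistics
from collections import Counter

# Three made-up datasets, each with nine integer values summing to 54.
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]
skewed_right = [3, 3, 3, 4, 4, 4, 5, 6, 22]
bimodal = [1, 1, 2, 2, 6, 10, 10, 11, 11]


def text_histogram(values):
    """Return one line per integer value from min to max, one star per observation."""
    counts = Counter(values)
    return [
        f"{v:>2} | {'*' * counts.get(v, 0)}"
        for v in range(min(values), max(values) + 1)
    ]


for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("bimodal", bimodal)]:
    print(f"{name}: mean = {statistics.mean(data)}")
    print("\n".join(text_histogram(data)))
    print()
```

All three report the same mean, yet the printed shapes show one central peak, a long right tail, and two separate clusters.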