Key Takeaways: Exploring Data — Graphs and Descriptive Statistics

One-Sentence Summary

Data visualization transforms raw numbers into pictures that reveal patterns, shapes, and surprises — and the first skill of statistical thinking is learning to see data as a distribution, not as individual values.

Core Concepts at a Glance

| Concept | Definition | Why It Matters |
| --- | --- | --- |
| Histogram | Divides numerical data into equal-width bins with touching bars | Reveals the shape of a distribution; the single most important graph in introductory statistics |
| Bar chart | Displays categorical frequencies with separate (non-touching) bars | Shows how observations are spread across categories |
| Distribution shape | The overall pattern: symmetric, skewed, unimodal, bimodal | The shape tells the story: two datasets with the same mean can have completely different shapes |
| Outlier | An observation far from the rest of the data | May be an error, an anomaly, a genuine extreme, or the most important data point; investigate before acting |
| Distribution thinking | Seeing data as a whole distribution rather than individual numbers | The threshold concept that separates looking at data from truly understanding it |
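
To make the histogram definition concrete, here is a minimal sketch of how equal-width binning turns raw numbers into bar heights. The `watch_minutes` values are invented for illustration, and the choice of 5 bins is an arbitrary assumption, not a recommendation from the chapter.

```python
import numpy as np

# Hypothetical data: minutes watched by 12 viewers (invented for illustration)
watch_minutes = [5, 12, 18, 22, 25, 27, 28, 30, 34, 41, 60, 95]

# np.histogram splits the range from min to max into equal-width bins
# and counts how many observations land in each bin.
counts, edges = np.histogram(watch_minutes, bins=5)

bin_width = edges[1] - edges[0]
print("Bin width:", bin_width)

# Text version of the histogram: one row per bin, one '#' per observation
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * count}")
```

Each row is one bin, and the row lengths trace the shape of the distribution, which is the same information a plotted histogram shows with touching bars.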

Graph Selection Guide

What type of variable(s) are you graphing?
│
├── ONE CATEGORICAL variable
│   ├── Bar chart ← DEFAULT (always works)
│   └── Pie chart (≤ 5 categories, parts of a whole only)
│
├── ONE NUMERICAL variable
│   ├── Histogram ← DEFAULT (always works)
│   └── Stem-and-leaf plot (≤ 50 observations, want exact values)
│
├── TWO CATEGORICAL variables
│   ├── Grouped bar chart (side-by-side)
│   └── Stacked bar chart
│
├── ONE CATEGORICAL + ONE NUMERICAL
│   ├── Side-by-side histograms
│   └── Side-by-side box plots (Chapter 6)
│
└── TWO NUMERICAL variables
    └── Scatterplot (Chapter 22)
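
The guide above can be sketched as a small lookup. This is only an illustration: `suggest_graph` and `GUIDE` are hypothetical names, the example data is invented, and using the pandas dtype to decide "numerical vs. categorical" is an assumption that will not always hold.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Defaults (or first-listed options) from the guide above,
# keyed by the sorted types of the variables being graphed.
GUIDE = {
    ("categorical",): "bar chart",
    ("numerical",): "histogram",
    ("categorical", "categorical"): "grouped bar chart",
    ("categorical", "numerical"): "side-by-side histograms",
    ("numerical", "numerical"): "scatterplot",
}

def suggest_graph(*columns):
    """Suggest a graph for one or two pandas Series (hypothetical helper)."""
    kinds = tuple(sorted(
        "numerical" if is_numeric_dtype(col) else "categorical"
        for col in columns
    ))
    return GUIDE[kinds]

# Invented example data
df = pd.DataFrame({
    "genre": ["drama", "comedy", "drama", "news"],
    "device": ["tv", "phone", "tv", "tablet"],
    "minutes": [42, 25, 61, 18],
})

print(suggest_graph(df["genre"]))
print(suggest_graph(df["minutes"]))
print(suggest_graph(df["genre"], df["minutes"]))
```

The dtype is only a proxy. Numbers used as labels, such as ZIP codes or jersey numbers, are stored as numeric but are categorical variables, so the final call is still a judgment about what the variable means.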

Quick Reference: Bar Chart vs. Histogram

| Feature | Bar Chart | Histogram |
| --- | --- | --- |
| Variable type | Categorical | Numerical |
| Bars touch? | No (gaps between bars) | Yes (bars are adjacent) |
| X-axis shows | Category names | Numerical scale (bins) |
| Bar order | Can rearrange | Must follow number line |
| Bar width | Cosmetic only | Defines bin width |

The simplest test: If the x-axis has words, it's a bar chart. If the x-axis has numbers, it's probably a histogram.

Distribution Shape Vocabulary

| Term | What It Looks Like | Real-World Example |
| --- | --- | --- |
| Symmetric | Left and right sides are mirror images | Human body temperatures |
| Skewed right | Long tail stretches to the right (higher values) | Household income |
| Skewed left | Long tail stretches to the left (lower values) | Easy exam scores |
| Unimodal | One peak | Heights of adult women |
| Bimodal | Two peaks | Flu cases by age (children + elderly) |
| Uniform | All bars roughly equal height | Rolling a fair die |

Memory trick: The skew is named for the direction of the tail, not the hump. Right-skewed = tail points right, hump on the left.
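
A numerical check can back up what the eye sees. The sketch below uses invented income figures and relies on a common rule of thumb that is not stated in this summary: in a right-skewed distribution the long tail usually pulls the mean above the median, and the sample skewness is positive.

```python
import pandas as pd

# Hypothetical household incomes in thousands of dollars (invented)
incomes = pd.Series([28, 32, 35, 38, 41, 45, 52, 60, 75, 250])

mean = incomes.mean()
median = incomes.median()
skewness = incomes.skew()  # sample skewness; positive suggests a longer right tail

print(f"mean = {mean:.1f}, median = {median:.1f}, skewness = {skewness:.2f}")

if skewness > 0:
    print("Tail points right: skewed right")
elif skewness < 0:
    print("Tail points left: skewed left")
else:
    print("No skew detected")
```

Treat the number as a cross-check on the histogram, not a replacement for it; a single large value can drive the skewness on its own.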

The Four-Part Description (Use Every Time)

When you look at any histogram, describe:

  1. Shape — Symmetric or skewed? Unimodal, bimodal, or uniform?
  2. Center — Where is the approximate middle?
  3. Spread — How wide is the distribution? (Range from min to max)
  4. Unusual features — Any outliers? Gaps? Clusters?

Example: "The distribution of watch times is skewed right and unimodal, centered around 25-30 minutes, with a spread from 5 to 180 minutes. A few outliers beyond 120 minutes represent binge-watching sessions."
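
The four parts can also be roughed out numerically before writing the sentence. This is a sketch under stated assumptions: the watch times are invented, the median stands in for "center", and outliers are flagged with the 1.5 × IQR rule, which is a common convention but not one this summary defines.

```python
import pandas as pd

# Hypothetical watch times in minutes (invented for illustration)
watch_times = pd.Series([5, 12, 18, 22, 25, 26, 28, 30, 33, 40, 55, 130, 180])

# Center: the median (assumption: median is used as the "approximate middle")
center = watch_times.median()

# Spread: range from min to max
low, high = watch_times.min(), watch_times.max()

# Unusual features: values beyond 1.5 * IQR from the quartiles (assumed rule)
q1, q3 = watch_times.quantile(0.25), watch_times.quantile(0.75)
iqr = q3 - q1
outliers = watch_times[(watch_times < q1 - 1.5 * iqr) | (watch_times > q3 + 1.5 * iqr)]

# Shape: crude hint only; always confirm by looking at the histogram
shape = "skewed right" if watch_times.mean() > center else "symmetric or skewed left"

print(f"Shape: {shape}")
print(f"Center: about {center:.0f} minutes")
print(f"Spread: {low} to {high} minutes")
print(f"Possible outliers: {outliers.tolist()}")
```

The numbers supply the center, spread, and candidate outliers; modality and gaps still have to be read off the graph.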

Common Graphing Mistakes

| Mistake | The Problem | The Fix |
| --- | --- | --- |
| Truncated axis | Bar heights don't reflect true ratios | Start bar chart axes at zero |
| 3D effects | Perspective distortion skews comparisons | Always use 2D charts |
| Unequal bin widths | Wider bins appear more prominent | Use equal-width bins |
| Wrong graph type | E.g., histogram for categorical data | Match graph to variable type |
| Too many pie slices | Impossible to compare similar-sized slices | Limit to 5 categories or use a bar chart |
| Missing labels | Reader can't interpret the graph | Always include title, axis labels, and units |
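
The truncated-axis problem is easy to see with a little arithmetic. In this sketch, `visual_ratio` is a hypothetical helper and the values 50 and 55 are invented; it compares how tall two bars are drawn when the axis starts at zero versus at 45.

```python
def visual_ratio(a, b, axis_start=0):
    """Ratio of the drawn heights of bars b and a when the axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Two invented values that differ by 10%
honest = visual_ratio(50, 55)                    # axis starts at zero
truncated = visual_ratio(50, 55, axis_start=45)  # axis starts at 45

print(f"Axis from 0:  second bar looks {honest:.1f}x as tall")
print(f"Axis from 45: second bar looks {truncated:.1f}x as tall")
```

With the axis starting at 45, a 10% difference is drawn as a bar twice as tall, which is why bar chart axes should start at zero.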

Python Quick Reference

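import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# These templates assume df is a pandas DataFrame you have already loaded,
# e.g. df = pd.read_csv('your_file.csv'). Column names below are placeholders.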

# Bar chart (categorical variable)
sns.countplot(data=df, x='category_column', color='steelblue')
plt.title('Title')
plt.xlabel('X Label')
plt.ylabel('Count')
plt.show()

# Histogram (numerical variable)
sns.histplot(data=df, x='number_column', bins=15, edgecolor='white')
plt.title('Title')
plt.xlabel('X Label (units)')
plt.ylabel('Frequency')
plt.show()

# Overlaid histograms (comparing groups)
sns.histplot(data=df, x='number_column', hue='group_column',
             bins=10, alpha=0.5, edgecolor='white')
plt.title('Title')
plt.xlabel('X Label (units)')
plt.ylabel('Frequency')
plt.show()

Key Terms

| Term | Definition |
| --- | --- |
| Histogram | Numerical data divided into equal-width bins, displayed as touching bars |
| Bar chart | Categorical data displayed as separate bars with gaps |
| Pie chart | Proportions shown as slices of a circle |
| Stem-and-leaf plot | Data split into stems and leaves, preserving exact values |
| Frequency distribution | Table organizing data into classes with counts |
| Relative frequency | Proportion of observations in a class (count / total) |
| Distribution shape | Overall pattern of a histogram (symmetric, skewed, etc.) |
| Symmetric | Left and right sides are approximately mirror images |
| Skewed right | Longer tail extends toward larger values |
| Skewed left | Longer tail extends toward smaller values |
| Unimodal | Distribution with one peak |
| Bimodal | Distribution with two peaks |
| Outlier | Observation far from the rest of the data |
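
As a small illustration of the frequency distribution and relative frequency entries, the sketch below builds both for a categorical variable. The `genres` data is invented.

```python
import pandas as pd

# Hypothetical categorical data (invented for illustration)
genres = pd.Series(["drama", "comedy", "drama", "news",
                    "comedy", "drama", "sports", "drama"])

# Frequency distribution: count of observations in each class
counts = genres.value_counts()

# Relative frequency: count / total
relative = counts / len(genres)

table = pd.DataFrame({"count": counts, "relative_frequency": relative})
print(table)
```

The relative frequencies always sum to 1, which makes them the natural scale for comparing groups of different sizes.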

The One Thing to Remember

If you forget everything else from this chapter, remember this:

A single number can never fully describe a dataset. Two distributions can have the same mean but completely different shapes — and completely different stories. Always look at the shape. The shape is the story.
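
A quick way to convince yourself: the three invented datasets below all have a mean of exactly 5, yet their text histograms look nothing alike. The dataset names and values are made up for this sketch.

```python
import pandas as pd

# Three invented datasets, each with mean 5
datasets = {
    "symmetric": pd.Series([3, 4, 5, 5, 5, 6, 7]),
    "bimodal": pd.Series([1, 1, 2, 5, 8, 9, 9]),
    "skewed right": pd.Series([2, 2, 3, 3, 4, 4, 17]),
}

for name, values in datasets.items():
    print(f"{name}: mean = {values.mean():.1f}")
    # One row per distinct value, one '#' per observation
    for value, count in values.value_counts().sort_index().items():
        print(f"  {value:2d} | {'#' * count}")
```

Same mean, three different shapes, three different stories.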