Python Pandas Cheat Sheet: The Complete Quick Reference

Pandas is the backbone of data analysis in Python. Whether you are cleaning messy spreadsheets, merging datasets, or building summary reports, you will reach for pandas more than any other library. This cheat sheet covers the most common operations with copy-paste code examples you can use immediately.

Import pandas at the top of every script:

import pandas as pd
import numpy as np

Creating DataFrames

Build a DataFrame from a dictionary of lists:

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [30, 25, 35],
    'salary': [70000, 60000, 90000]
})

Create from a list of dictionaries (useful when each row comes from an API response):

records = [
    {'name': 'Alice', 'age': 30},
    {'name': 'Bob', 'age': 25},
    {'name': 'Charlie', 'age': 35}
]
df = pd.DataFrame(records)

Create from a NumPy array with custom column names:

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'])

Reading and Writing Data

| Format  | Read                            | Write                                          |
|---------|---------------------------------|------------------------------------------------|
| CSV     | pd.read_csv('file.csv')         | df.to_csv('file.csv', index=False)             |
| Excel   | pd.read_excel('file.xlsx')      | df.to_excel('file.xlsx', index=False)          |
| JSON    | pd.read_json('file.json')       | df.to_json('file.json', orient='records')      |
| SQL     | pd.read_sql(query, conn)        | df.to_sql('table', conn, if_exists='replace')  |
| Parquet | pd.read_parquet('file.parquet') | df.to_parquet('file.parquet')                  |
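As a quick sanity check of the CSV pair above, you can round-trip a frame entirely in memory with `io.StringIO` instead of a real file (the data here is illustrative):

```python
import io
import pandas as pd

# Write a small frame to an in-memory CSV, then read it back
df = pd.DataFrame({'id': [1, 2], 'amount': [9.5, 3.0]})
buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False avoids writing a stray index column
buffer.seek(0)
roundtrip = pd.read_csv(buffer)
print(roundtrip.equals(df))      # True: columns, values, and dtypes all survive
```

The same pattern works for checking any reader/writer pair before pointing it at real files.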

Common read_csv parameters you will use constantly:

df = pd.read_csv(
    'data.csv',
    sep=',',                    # delimiter
    header=0,                   # row number for column names
    names=['a', 'b', 'c'],     # custom column names
    usecols=['a', 'b'],        # only load specific columns
    dtype={'a': str, 'b': int},# force column types
    parse_dates=['date_col'],  # parse as datetime
    na_values=['N/A', 'null'], # treat these as NaN
    nrows=1000,                # only read first 1000 rows
    encoding='utf-8'
)

Read a specific Excel sheet:

df = pd.read_excel('report.xlsx', sheet_name='Q1 Sales')

Read from a SQL database:

import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql("SELECT * FROM customers WHERE state = 'CA'", conn)

Viewing and Inspecting Data

| Command                    | What It Returns                        |
|----------------------------|----------------------------------------|
| df.head(10)                | First 10 rows                          |
| df.tail(5)                 | Last 5 rows                            |
| df.shape                   | Tuple of (rows, columns)               |
| df.info()                  | Column names, types, non-null counts   |
| df.describe()              | Summary statistics for numeric columns |
| df.dtypes                  | Data type of each column               |
| df.columns                 | List of column names                   |
| df.index                   | The row index                          |
| df.nunique()               | Count of unique values per column      |
| df.value_counts('col')     | Frequency counts for a column          |
| df.sample(5)               | 5 random rows                          |
| df.memory_usage(deep=True) | Memory consumption per column          |

# Quick data audit in three lines
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())

Selecting Data

Column access — three ways to grab a column:

df['name']           # returns a Series
df.name              # dot notation (fails for names with spaces or names that clash with DataFrame methods)
df[['name', 'age']]  # multiple columns — returns a DataFrame

loc — select by label (row/column names):

df.loc[0, 'name']                    # single value
df.loc[0:4, 'name':'salary']         # slice of rows and columns (inclusive)
df.loc[df['age'] > 30, ['name', 'salary']]  # filter rows, select columns

iloc — select by integer position:

df.iloc[0, 1]           # first row, second column
df.iloc[0:5, 0:3]       # first 5 rows, first 3 columns
df.iloc[-1]              # last row
df.iloc[:, [0, 2]]       # all rows, first and third columns
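The inclusive/exclusive slicing difference between loc and iloc is a classic gotcha: loc includes the end label, iloc excludes the end position. A minimal comparison on a default integer index:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'Dana'],
                   'age': [30, 25, 35, 28]})

# loc slices by LABEL and includes the end label: rows 0, 1, and 2
print(len(df.loc[0:2]))   # 3

# iloc slices by POSITION and excludes the end position: rows 0 and 1
print(len(df.iloc[0:2]))  # 2
```

With a default integer index the labels and positions coincide, which is exactly when this difference is easiest to miss.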

Filtering Data

Boolean indexing — the most common pattern:

df[df['age'] > 30]
df[df['name'] == 'Alice']
df[(df['age'] > 25) & (df['salary'] > 65000)]  # AND — use &
df[(df['age'] < 25) | (df['age'] > 35)]        # OR — use |
df[~(df['name'] == 'Bob')]                       # NOT — use ~

isin for matching against a list of values:

df[df['name'].isin(['Alice', 'Charlie'])]
df[~df['state'].isin(['CA', 'NY'])]  # exclude specific values

query for readable string-based filtering:

df.query('age > 30 and salary > 65000')
df.query('name in ["Alice", "Charlie"]')

min_age = 25
df.query('age > @min_age')  # reference Python variables with @

between for range filtering:

df[df['age'].between(25, 35)]  # inclusive on both ends

Sorting

df.sort_values('salary')                          # ascending
df.sort_values('salary', ascending=False)          # descending
df.sort_values(['department', 'salary'], ascending=[True, False])  # multi-column
df.sort_index()                                    # sort by row index
df.nlargest(10, 'salary')                          # top 10 by salary
df.nsmallest(5, 'age')                             # bottom 5 by age

Adding and Removing Columns

# Add a new column
df['bonus'] = df['salary'] * 0.10
df['full_name'] = df['first'] + ' ' + df['last']
df['category'] = np.where(df['salary'] > 80000, 'senior', 'junior')

# Add with multiple conditions
df['tier'] = np.select(
    [df['salary'] > 100000, df['salary'] > 70000],
    ['gold', 'silver'],
    default='bronze'
)

# Rename columns
df.rename(columns={'old_name': 'new_name', 'age': 'years'}, inplace=True)

# Drop columns
df.drop(columns=['bonus', 'tier'], inplace=True)

# Reorder columns
df = df[['name', 'age', 'salary']]

Groupby and Aggregation

Basic groupby — split, apply, combine:

df.groupby('department')['salary'].mean()
df.groupby('department')['salary'].agg(['mean', 'median', 'count'])
df.groupby(['department', 'title'])['salary'].sum()

Multiple aggregations on different columns:

df.groupby('department').agg(
    avg_salary=('salary', 'mean'),
    total_sales=('sales', 'sum'),
    headcount=('employee_id', 'count'),
    max_tenure=('years', 'max')
).reset_index()

Transform — apply a function and return a same-sized result:

# Add a column with the department average salary
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')

# Percent of department total
df['pct_of_dept'] = df['salary'] / df.groupby('department')['salary'].transform('sum')
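A self-contained sketch of the transform pattern on toy data (department names and salaries here are illustrative) shows why it aligns with the original frame: transform returns one value per input row.

```python
import pandas as pd

df = pd.DataFrame({
    'department': ['Eng', 'Eng', 'Sales'],
    'salary': [100, 80, 60],
})

# transform broadcasts the group result back to every row in the group
df['dept_avg'] = df.groupby('department')['salary'].transform('mean')
df['pct_of_dept'] = df['salary'] / df.groupby('department')['salary'].transform('sum')

print(df['dept_avg'].tolist())  # [90.0, 90.0, 60.0]
```

Compare this with `.agg('mean')`, which would collapse the frame to one row per department instead.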

Merging and Joining

| Merge Type | What It Keeps                                 |
|------------|-----------------------------------------------|
| inner      | Only rows with keys in both DataFrames        |
| left       | All rows from left, matching rows from right  |
| right      | All rows from right, matching rows from left  |
| outer      | All rows from both DataFrames                 |

# Merge on a common column
merged = pd.merge(orders, customers, on='customer_id', how='left')

# Merge on differently named columns
merged = pd.merge(orders, products, left_on='prod_id', right_on='product_id')

# Merge on multiple columns
merged = pd.merge(df1, df2, on=['year', 'month'], how='inner')

# Concatenate DataFrames vertically (stacking rows)
combined = pd.concat([df1, df2, df3], ignore_index=True)

# Concatenate horizontally (adding columns side by side)
combined = pd.concat([df1, df2], axis=1)
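The merge types in the table above can be compared on toy frames; passing `indicator=True` adds a `_merge` column recording which side each row came from (the data here is illustrative):

```python
import pandas as pd

orders = pd.DataFrame({'customer_id': [1, 2, 4], 'total': [10, 20, 40]})
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ann', 'Ben', 'Cy']})

inner = pd.merge(orders, customers, on='customer_id', how='inner')
left = pd.merge(orders, customers, on='customer_id', how='left')
outer = pd.merge(orders, customers, on='customer_id', how='outer', indicator=True)

# inner keeps only matched keys (1, 2); left keeps all orders;
# outer keeps everything from both sides
print(len(inner), len(left), len(outer))  # 2 3 4
print(outer['_merge'].tolist())
```

Checking `_merge` value counts after a join is a quick way to catch unexpectedly unmatched keys.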

Pivoting and Reshaping

Pivot table — like Excel pivot tables:

pivot = df.pivot_table(
    values='revenue',
    index='region',
    columns='quarter',
    aggfunc='sum',
    fill_value=0,
    margins=True   # add row/column totals
)

Melt — unpivot from wide to long format:

# Wide format: columns are Q1, Q2, Q3, Q4
# Long format: one column for quarter, one for value
long_df = df.melt(
    id_vars=['product'],
    value_vars=['Q1', 'Q2', 'Q3', 'Q4'],
    var_name='quarter',
    value_name='revenue'
)
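A tiny wide frame (two products, two quarters, illustrative values) makes the wide-to-long reshape concrete:

```python
import pandas as pd

wide = pd.DataFrame({'product': ['A', 'B'], 'Q1': [10, 30], 'Q2': [20, 40]})

long_df = wide.melt(id_vars=['product'], value_vars=['Q1', 'Q2'],
                    var_name='quarter', value_name='revenue')

# 2 products x 2 quarters = 4 rows; 3 columns: product, quarter, revenue
print(long_df.shape)  # (4, 3)
```

The long format is what groupby and most plotting libraries expect, which is why melt shows up so often right before an aggregation.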

Crosstab for frequency tables:

pd.crosstab(df['department'], df['gender'], margins=True)

Handling Missing Data

df.isnull().sum()            # count NaN per column
df.isnull().sum().sum()      # total NaN in entire DataFrame
df.dropna()                  # drop rows with any NaN
df.dropna(subset=['email'])  # drop rows where email is NaN
df.dropna(thresh=3)          # keep rows with at least 3 non-NaN values

df.fillna(0)                           # replace NaN with 0
df['salary'] = df['salary'].fillna(df['salary'].median())  # fill with median
df.ffill()                             # forward fill (carry previous value)
df.bfill()                             # backward fill
df.interpolate()                       # linear interpolation for numeric data

Replace specific values:

df.replace({'N/A': np.nan, 'missing': np.nan})
df['status'] = df['status'].replace({'active': 1, 'inactive': 0})

String Operations

All string methods are available via the .str accessor:

df['name'].str.lower()
df['name'].str.upper()
df['name'].str.title()
df['name'].str.strip()                  # remove leading/trailing whitespace
df['name'].str.replace('old', 'new')
df['name'].str.contains('alice', case=False)  # boolean mask
df['name'].str.startswith('A')
df['name'].str.len()
df['email'].str.split('@').str[1]       # extract domain from email

# Extract with regex
df['year'] = df['date_str'].str.extract(r'(\d{4})')

# Multiple string conditions
mask = df['description'].str.contains('urgent|critical', case=False, na=False)

Date and Time Operations

# Convert string to datetime
df['date'] = pd.to_datetime(df['date_str'])
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d')

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['date'].dt.dayofweek >= 5

# Date arithmetic
df['days_since'] = (pd.Timestamp.now() - df['date']).dt.days
df['next_week'] = df['date'] + pd.Timedelta(days=7)

# Resample time series (requires datetime index)
df.set_index('date').resample('ME')['revenue'].sum()   # monthly totals ('M' before pandas 2.2)
df.set_index('date').resample('QE')['revenue'].mean()  # quarterly averages ('Q' before pandas 2.2)

# Filter by date range
df[df['date'].between('2025-01-01', '2025-12-31')]
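A runnable sketch of the datetime conversion and `.dt` accessor on two illustrative dates:

```python
import pandas as pd

df = pd.DataFrame({'date_str': ['2025-01-04', '2025-06-15']})
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d')

print(df['date'].dt.quarter.tolist())           # [1, 2]
print(df['date'].dt.day_name().tolist())        # ['Saturday', 'Sunday']
print((df['date'].dt.dayofweek >= 5).tolist())  # [True, True] -- both weekend days
```

Passing an explicit `format=` is both faster and safer than letting pandas guess, especially with ambiguous day/month orderings.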

Apply, Map, and Lambda

apply — run any function on rows or columns:

# Apply to a column (Series)
df['salary_k'] = df['salary'].apply(lambda x: round(x / 1000, 1))

# Apply to each row (axis=1)
df['summary'] = df.apply(lambda row: f"{row['name']} earns {row['salary']}", axis=1)

# Apply a named function
def categorize(salary):
    if salary > 100000:
        return 'high'
    elif salary > 60000:
        return 'medium'
    return 'low'

df['band'] = df['salary'].apply(categorize)

map — apply a mapping dictionary or function to a Series:

df['department_code'] = df['department'].map({
    'Engineering': 'ENG',
    'Marketing': 'MKT',
    'Sales': 'SLS'
})
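One map gotcha worth remembering: values with no entry in the dictionary become NaN rather than passing through unchanged (toy data, department names illustrative):

```python
import pandas as pd

df = pd.DataFrame({'department': ['Engineering', 'Marketing', 'Legal']})

# 'Legal' has no entry in the mapping, so its code becomes NaN
df['code'] = df['department'].map({'Engineering': 'ENG', 'Marketing': 'MKT'})

print(df['code'].isna().sum())  # 1
```

If you want unmapped values to survive, use `.replace` with the same dictionary instead of `.map`.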

applymap (renamed to map on DataFrame in pandas 2.1+) — element-wise on entire DataFrame:

# Format all numeric values to 2 decimal places
df[['col1', 'col2']].map(lambda x: f'{x:.2f}')

Useful One-Liners

# Remove duplicate rows
df.drop_duplicates()
df.drop_duplicates(subset=['email'], keep='last')

# Change column types
df['price'] = df['price'].astype(float)
df['id'] = df['id'].astype(str)

# Clip outliers
df['score'] = df['score'].clip(lower=0, upper=100)

# Rank values
df['salary_rank'] = df['salary'].rank(ascending=False)

# Cumulative sum
df['running_total'] = df['revenue'].cumsum()

# Percentage change
df['pct_change'] = df['revenue'].pct_change()

# Bin continuous data into categories
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 50, 100],
                         labels=['young', 'mid', 'senior', 'veteran'])

Performance Tips

For large datasets, these habits make a measurable difference:

# Read only the columns you need
df = pd.read_csv('huge_file.csv', usecols=['id', 'name', 'amount'])

# Use category dtype for low-cardinality string columns
df['state'] = df['state'].astype('category')

# Use read_csv with chunksize for files that do not fit in memory
for chunk in pd.read_csv('huge.csv', chunksize=100000):
    process(chunk)

# Avoid iterrows — use vectorized operations instead
# Slow:
for idx, row in df.iterrows():
    df.loc[idx, 'new'] = row['a'] + row['b']
# Fast:
df['new'] = df['a'] + df['b']

This cheat sheet covers the operations you will use in the vast majority of pandas workflows. Bookmark it, keep it open while you work, and refer back to it until the syntax becomes second nature.

Learn Python from scratch in our free Python for Business Beginners textbook.