Quiz: Your Data Toolkit — Python, Excel, and Jupyter Notebooks
Test your understanding before moving on. Target: 70% or higher to proceed confidently.
Section 1: Multiple Choice (1 point each)
1. What is a Jupyter notebook?
- A) A word processor for writing lab reports
- B) An interactive document that combines code, text, and output
- C) A spreadsheet application similar to Excel
- D) A programming language used for statistics
Answer
**B)** An interactive document that combines code, text, and output. *Why B:* Jupyter notebooks let you mix executable code cells with formatted text cells, and display results inline. This makes them ideal for data analysis because you can write code, see results, and add explanations in one document. *Why not A:* While you can write text, it's fundamentally a coding tool, not a word processor. *Why not C:* Jupyter is code-based, not grid-based like a spreadsheet. *Why not D:* Jupyter is an environment, not a language. It runs Python (or other languages) inside it. *Reference:* Section 3.2

2. What does the kernel do in a Jupyter notebook?
- A) Formats text cells using Markdown
- B) Saves your notebook to disk
- C) Executes your Python code and remembers variables
- D) Connects your notebook to the internet
Answer
**C)** Executes your Python code and remembers variables. *Why C:* The kernel is the running Python process that takes your code, runs it, and returns results. It maintains state — variables defined in one cell remain available in later cells. *Why not A:* Markdown rendering is handled by the notebook interface, not the kernel. *Why not B:* Saving is a notebook function, not a kernel function. *Why not D:* Internet connectivity is handled by the browser and operating system. *Reference:* Section 3.2

3. Which line of code correctly imports the pandas library?
- A) `install pandas`
- B) `import pandas as pd`
- C) `from pandas import *`
- D) `pandas.load()`
Answer
**B)** `import pandas as pd` *Why B:* This is the standard convention used by virtually all data scientists. It imports the library and gives it the shorthand alias `pd`. *Why not A:* `install` is not a Python keyword. You'd use `pip install pandas` in the terminal to install it, but `import` is how you load it in code. *Why not C:* While this technically works, it imports everything into the global namespace and is considered bad practice because it can cause naming conflicts. *Why not D:* This is not valid Python syntax for importing a library. *Reference:* Section 3.4

4. What is a DataFrame in pandas?
- A) A single number calculated from data
- B) A type of graph used in statistics
- C) A two-dimensional data structure with rows and columns, like a spreadsheet
- D) A file format for storing data
Answer
**C)** A two-dimensional data structure with rows and columns, like a spreadsheet. *Why C:* A DataFrame is pandas's core data structure. It stores tabular data — rows are observations, columns are variables — similar to a spreadsheet but manipulated through code. *Why not A:* That would be a scalar or a single statistic. *Why not B:* DataFrames hold data; they are not visualizations. *Why not D:* CSV is a file format. A DataFrame is what you get after loading a CSV into pandas. *Reference:* Section 3.4

5. What command loads a CSV file named `survey.csv` into a pandas DataFrame?
- A) `pd.open("survey.csv")`
- B) `pd.load_csv("survey.csv")`
- C) `pd.read_csv("survey.csv")`
- D) `pd.import("survey.csv")`
Answer
**C)** `pd.read_csv("survey.csv")` *Why C:* `read_csv()` is the pandas function for loading comma-separated value files. *Why not A:* There is no `pd.open()` function in pandas. *Why not B:* The function is `read_csv`, not `load_csv`. *Why not D:* `import` is a Python keyword for loading libraries, not a pandas function for loading data. *Reference:* Section 3.5

6. If `df` has 1,200 rows and 15 columns, what does `df.shape` return?
- A) `1200`
- B) `15`
- C) `(1200, 15)`
- D) `(15, 1200)`
Answer
**C)** `(1200, 15)` *Why C:* `.shape` returns a tuple of (rows, columns). Rows always come first. *Why not A:* That's only the row count — `.shape` gives both dimensions. *Why not B:* That's only the column count. *Why not D:* The order is (rows, columns), not (columns, rows). *Reference:* Section 3.5

7. What does the `.describe()` method show by default?
- A) The first 5 rows of the DataFrame
- B) The data type of each column
- C) Summary statistics (count, mean, std, min, quartiles, max) for numerical columns
- D) The number of missing values in each column
Answer
**C)** Summary statistics (count, mean, std, min, quartiles, max) for numerical columns. *Why C:* `.describe()` computes eight summary statistics for each numerical column: count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum. *Why not A:* That's `.head()`. *Why not B:* That's `.dtypes`. *Why not D:* `.info()` shows non-null counts (from which you can infer missing values). `.describe()` shows statistical summaries. *Reference:* Section 3.5

8. Which code correctly filters a DataFrame to show only rows where the `age` column is greater than 30?
- A) `df['age' > 30]`
- B) `df[df['age'] > 30]`
- C) `df.filter('age' > 30)`
- D) `df.where(age > 30)`
Answer
**B)** `df[df['age'] > 30]` *Why B:* This is the correct pandas filtering pattern. `df['age'] > 30` creates a True/False series for each row, and `df[...]` keeps only the True rows. *Why not A:* This places the condition inside the column name brackets, which is a syntax error. *Why not C:* `.filter()` exists in pandas but works differently — it filters columns by name, not rows by condition. *Why not D:* `df.where()` exists but behaves differently (it replaces False values with NaN instead of removing rows). Also, `age` without quotes would cause a NameError. *Reference:* Section 3.6

9. What does `.value_counts()` do?
- A) Counts the total number of values in a DataFrame
- B) Counts how many times each unique value appears in a column
- C) Returns the number of non-null values in each column
- D) Calculates the sum of all values in a column
Answer
**B)** Counts how many times each unique value appears in a column. *Why B:* `.value_counts()` is applied to a single column and returns a frequency table — each unique value paired with its count, sorted from most to least common. *Why not A:* That would be `len(df)` or `df.shape[0]`. *Why not C:* That's shown in `.info()`. *Why not D:* That would be `.sum()`. *Reference:* Section 3.6

10. Which of the following is NOT an advantage of Python over spreadsheets for data analysis?
- A) Better reproducibility
- B) Faster for small, one-off calculations
- C) Handles large datasets more efficiently
- D) More powerful statistical testing capabilities
Answer
**B)** Faster for small, one-off calculations. *Why B:* For quick calculations on small amounts of data, a spreadsheet is often faster — you just type numbers and formulas into cells without needing to write code or import libraries. *Why not A:* Reproducibility is a major Python advantage — code documents every step. *Why not C:* Python handles millions of rows easily; spreadsheets struggle above ~100,000. *Why not D:* Python (via scipy and statsmodels) offers a far wider range of statistical tests than Excel or Google Sheets. *Reference:* Section 3.9

Section 2: Short Answer (2 points each)
11. You type `health['State']` and get a `KeyError`. The column is actually named `state` (lowercase). Explain why this error occurs and how to avoid it in the future.
Answer
Python is **case-sensitive**, so `'State'` and `'state'` are treated as completely different column names. The DataFrame contains `state` (lowercase), so `'State'` doesn't match anything, causing a KeyError. To avoid this, use `df.columns` to see the exact column names, and always match capitalization exactly when referencing them. *Reference:* Section 3.8

12. What's the difference between `df.head()` and `df.info()`? When would you use each?
Answer
`df.head()` shows the **first 5 rows** of actual data — you see the values in each column for the first few observations. Use it to get a quick visual sense of what your data looks like. `df.info()` shows **metadata** about the DataFrame — column names, data types, non-null counts, and memory usage. It doesn't show actual data values. Use it to understand the structure of your dataset, especially to check for missing values and verify data types. Both are essential first steps when exploring a new dataset. *Reference:* Section 3.5

13. Explain what this code does, line by line:
```python
ca_smokers = health[(health['state'] == 'CA') & (health['smoker'] == 1)]
print(len(ca_smokers))
```
Answer
Line 1 creates a new DataFrame called `ca_smokers` by filtering the `health` DataFrame. It applies **two conditions** simultaneously:
- `health['state'] == 'CA'` — the respondent is from California
- `health['smoker'] == 1` — the respondent is a smoker
- `&` combines the conditions with logical AND (both must be true)
- Each condition is wrapped in parentheses (required when combining conditions in pandas)

Line 2 prints the **number of rows** in the filtered result — i.e., how many California smokers are in the dataset. *Reference:* Section 3.6

14. A dataset has a column called `satisfaction` with values 1 through 5. pandas stores it as `int64`. A student runs `df.describe()` and reports: "The average satisfaction is 3.2." Explain why this might be misleading, using concepts from Chapter 2.
Answer
`satisfaction` on a 1-5 scale is an **ordinal** variable, not a truly numerical one. The numbers represent ranked categories (e.g., Very Dissatisfied to Very Satisfied), but the distances between categories may not be equal. The "distance" between 1 and 2 might not be the same as between 4 and 5. Computing a mean treats these distances as equal, which may not reflect reality. A **median** or **mode** would be more appropriate summaries for ordinal data, along with a frequency table (`.value_counts()`). pandas can't detect this distinction — it sees integers and treats them as numbers. The analyst must apply their knowledge of variable types. *Reference:* Section 3.5, Chapter 2 review

Section 3: True or False (1 point each)
15. True or False: Google Colab requires you to install Python on your computer before you can use it.
Answer
**False.** Google Colab runs entirely in your web browser and uses Google's servers to execute Python code. You only need a Google account and an internet connection — no local installation required. *Reference:* Section 3.2

16. True or False: After restarting the kernel in a Jupyter notebook, all your variables are preserved and you can continue working without re-running any cells.
Answer
**False.** Restarting the kernel clears all variables from memory. You need to re-run your code cells from the top to recreate them. This is similar to turning off a calculator — it forgets everything. *Reference:* Section 3.2

17. True or False: The command `df.sort_values('age')` permanently changes the order of rows in the DataFrame `df`.
Answer
**False.** By default, `.sort_values()` returns a **new** sorted DataFrame without modifying the original. The original `df` remains in its original order. To permanently sort, you would need to either reassign: `df = df.sort_values('age')` or use the `inplace=True` parameter: `df.sort_values('age', inplace=True)`. *Reference:* Section 3.6

18. True or False: When a 0/1 variable (like `smoker`) is stored as an integer in pandas, its mean equals the proportion of 1s in the column.
Answer
**True.** If a column contains only 0s and 1s, the mean is the sum of all values divided by the count — which equals the number of 1s divided by the total. For example, if 89 out of 500 respondents are smokers (coded as 1), the mean is 89/500 = 0.178, meaning 17.8% are smokers. *Reference:* Section 3.5

Section 4: Applied Coding (3 points each)
19. Write the complete code to: (a) import pandas, (b) load a CSV file from a URL, (c) display the number of rows and columns, and (d) show summary statistics. Assume the URL is stored in a variable called `data_url`.
Answer
```python
import pandas as pd          # (a) import pandas with the standard alias

df = pd.read_csv(data_url)   # (b) load the CSV from the URL
print(df.shape)              # (c) (number of rows, number of columns)
df.describe()                # (d) summary statistics for numerical columns
```
This four-line workflow is the standard opening sequence for exploring any new dataset: import the library, load the data, check the dimensions, and compute summary statistics.
*Reference:* Sections 3.4-3.5
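If you want to try this workflow without a live URL, here is a minimal, hypothetical sketch — the CSV text, column names, and values are invented for illustration, and `io.StringIO` stands in for a real file or URL:

```python
import io

import pandas as pd

# Invented CSV text standing in for a real file at data_url;
# with a real dataset you would pass the URL string to pd.read_csv directly.
csv_text = "state,age,smoker\nCA,34,1\nNY,29,0\nCA,41,0\nTX,58,1\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)        # (4, 3) -> 4 rows, 3 columns
print(df.describe())   # count, mean, std, min, quartiles, max for numeric columns
```

`read_csv` accepts a file path, a URL, or any file-like object, which is why the in-memory `StringIO` buffer works as a drop-in substitute here.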
20. Given a DataFrame called `students` with columns `name`, `major`, `gpa`, and `year`, write code to:
a) Find all students with a GPA above 3.5
b) Count how many students are in each major
c) Sort students by GPA from highest to lowest and show the top 10
Answer
```python
# a) Students with GPA above 3.5
high_gpa = students[students['gpa'] > 3.5]

# b) Count students in each major
students['major'].value_counts()

# c) Top 10 by GPA (highest first)
students.sort_values('gpa', ascending=False).head(10)
```
Part (a) uses the filtering pattern `df[df['col'] > value]`. Part (b) uses `.value_counts()` for categorical data. Part (c) chains `.sort_values()` with `.head()` to get the top entries.
*Reference:* Section 3.6
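To see all three parts run end to end, here is a sketch on a tiny invented `students` table — the names, majors, GPAs, and years are made up; only the column names match the question:

```python
import pandas as pd

# Invented records purely for illustration.
students = pd.DataFrame({
    "name":  ["Ana", "Ben", "Caro", "Dev", "Elle"],
    "major": ["Bio", "Bio", "Stat", "Stat", "Bio"],
    "gpa":   [3.9, 3.1, 3.7, 2.8, 3.6],
    "year":  [2, 4, 3, 1, 2],
})

high_gpa = students[students["gpa"] > 3.5]                    # a) boolean-mask filter
major_counts = students["major"].value_counts()               # b) frequency table
top = students.sort_values("gpa", ascending=False).head(10)   # c) top 10 (all 5 here)

print(len(high_gpa))          # 3 students have a GPA above 3.5
print(major_counts["Bio"])    # 3 Bio majors
print(top.iloc[0]["name"])    # Ana has the highest GPA
```

Note that with only five rows, `.head(10)` simply returns all of them — `.head(n)` never errors when the DataFrame has fewer than `n` rows.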