Single-Variable Statistics - Ninth Grade Math

Introduction to Statistics

Statistics: The science of collecting, organizing, analyzing, and interpreting data
Single-Variable Data: Data involving one characteristic or measurement
Population: The entire group being studied
Sample: A subset of the population used to make inferences
Parameter: A numerical description of a population
Statistic: A numerical description of a sample

1. Identify Biased Samples

Biased Sample: A sample that does not fairly represent the population
Random Sample: Each member of the population has an equal chance of being selected
Representative Sample: Reflects the characteristics of the population
Sampling Bias: Systematic error in how a sample is collected

Types of Biased Samples

Common Types of Sampling Bias:

1. Convenience Sampling:
• Choosing samples that are easy to reach
• Example: Surveying only your friends about school lunch
• Problem: Not representative of all students

2. Voluntary Response Bias:
• Only people who choose to respond are included
• Example: Online polls where people opt in
• Problem: People with strong opinions more likely to respond

3. Undercoverage:
• Some groups in population are excluded
• Example: Phone survey excludes people without phones
• Problem: Missing perspectives from excluded groups

4. Nonresponse Bias:
• Selected individuals don't participate
• Example: Mail survey with low response rate
• Problem: Responders may differ from non-responders

5. Question Wording Bias:
• Questions are leading or confusing
• Example: "Don't you agree that...?"
• Problem: Influences responses

Example 1: Identify if sample is biased

Scenario: A principal wants to know students' opinions on school uniforms. She surveys students in the chess club.

Analysis:
• This is convenience sampling
• Chess club members may not represent all students
• Different clubs/groups may have different opinions

Conclusion: This is a BIASED sample
Better method: Randomly select students from all grades and activities

Example 2: Identify bias

Scenario: A store wants to know customer satisfaction. They ask every 10th customer who makes a purchase.

Analysis: Systematic sampling of actual customers
Potential bias: Only includes people who made a purchase, who are more likely to be satisfied
Missing: People who left without buying (possibly dissatisfied)

Conclusion: BIASED (undercoverage) - excludes non-purchasers

Example 3: Unbiased sample

Scenario: A researcher assigns a number to each student in the school and uses a random number generator to select 50 students for a survey.

Analysis:
• Each student has equal chance of selection
• Random selection process
• No systematic exclusion

Conclusion: This is an UNBIASED sample
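
The selection in Example 3 can be sketched in Python. This is only an illustration: the school size (800 students), the function name `simple_random_sample`, and the seed are assumptions, not part of the scenario.

```python
import random

def simple_random_sample(student_ids, k, seed=None):
    """Pick k distinct students so that every student is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed only makes the draw repeatable
    return rng.sample(student_ids, k)

# Hypothetical school: students numbered 1 to 800
student_ids = list(range(1, 801))
chosen = simple_random_sample(student_ids, 50, seed=1)

print(len(chosen))       # 50
print(len(set(chosen)))  # 50, because no student is picked twice
```

`random.sample` draws without replacement, which matches surveying 50 different students.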

2. Mean, Median, Mode, and Range

Measures of Center: Values that describe the center or typical value of data
Measures of Spread: Values that describe how spread out the data is
Central Tendency: The tendency of data to cluster around a central value

Mean (Average)

Mean Formula:

$$\text{Mean} = \bar{x} = \frac{\sum x}{n}$$

where:
• $\sum x$ = sum of all data values
• $n$ = number of data values
• $\bar{x}$ (x-bar) = mean

In words: Add all values, divide by how many values

Example 1: Find the mean of 5, 8, 12, 15, 20

$$\text{Mean} = \frac{5 + 8 + 12 + 15 + 20}{5} = \frac{60}{5} = 12$$

Answer: Mean = 12
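
The same calculation as a minimal Python sketch (the function name `mean` is just a label chosen for this example):

```python
def mean(values):
    """Add all values, then divide by how many values there are."""
    return sum(values) / len(values)

print(mean([5, 8, 12, 15, 20]))  # 12.0
```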

Median (Middle Value)

Median Steps:

Step 1: Order data from least to greatest
Step 2: Find middle value

If odd number of values:
Median is the middle value

If even number of values:
$$\text{Median} = \frac{\text{Sum of the two middle values}}{2}$$

Example 2: Find median of 3, 7, 9, 15, 20

Already ordered: 3, 7, 9, 15, 20
Middle value: 9

Answer: Median = 9

Example 3: Find median of 4, 8, 10, 12, 16, 20

Already ordered: 4, 8, 10, 12, 16, 20
Two middle values: 10 and 12
$$\text{Median} = \frac{10 + 12}{2} = \frac{22}{2} = 11$$

Answer: Median = 11
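
One possible way to code the median steps, covering both the odd and the even case (a sketch; the function name `median` is chosen for illustration):

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)   # Step 1: order from least to greatest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:             # odd count: one middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 7, 9, 15, 20]))       # 9
print(median([4, 8, 10, 12, 16, 20]))  # 11.0
```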

Mode (Most Frequent)

Mode Definition:

The value that appears most frequently in the dataset

Special Cases:
• No mode: All values appear once
• Bimodal: Two values appear most frequently
• Multimodal: More than two values tied for most frequent

Example 4: Find mode of 2, 3, 3, 5, 7, 7, 7, 9

Frequency:
2: once, 3: twice, 5: once, 7: three times, 9: once

Answer: Mode = 7

Example 5: Find mode of 1, 2, 3, 4, 5

All values appear once

Answer: No mode
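
A short Python sketch that counts frequencies and handles the special cases. It assumes the convention used in this guide, where a dataset in which every value appears once has no mode; the function name `modes` is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency (empty list = no mode)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values appear once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 3, 3, 5, 7, 7, 7, 9]))  # [7]
print(modes([1, 2, 3, 4, 5]))           # []
print(modes([1, 1, 2, 2, 3]))           # [1, 2]  (bimodal)
```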

Range (Spread)

Range Formula:

$$\text{Range} = \text{Maximum} - \text{Minimum}$$

Interpretation: Shows how spread out the data is

Example 6: Find range of 12, 18, 25, 30, 42

$$\text{Range} = 42 - 12 = 30$$

Answer: Range = 30
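
In Python the range is a one-line calculation. The function is named `data_range` here (an arbitrary choice) so that it does not hide Python's built-in `range`.

```python
def data_range(values):
    """Maximum minus minimum."""
    return max(values) - min(values)

print(data_range([12, 18, 25, 30, 42]))  # 30
```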

3. Calculate Quartiles and Interquartile Range

Quartiles: Values that divide ordered data into four equal parts
Q1 (First Quartile): 25th percentile - median of lower half
Q2 (Second Quartile): 50th percentile - median of entire dataset
Q3 (Third Quartile): 75th percentile - median of upper half
IQR: Interquartile Range - range of middle 50% of data

Quartile Steps:

Step 1: Order data from least to greatest
Step 2: Find median (Q2)
Step 3: Find median of lower half (Q1)
Step 4: Find median of upper half (Q3)
Note: If the number of values is odd, leave the median out of both halves

Interquartile Range:
$$\text{IQR} = Q3 - Q1$$

Five-Number Summary:
Minimum, Q1, Median (Q2), Q3, Maximum

Example 1: Find quartiles for: 2, 5, 7, 9, 11, 13, 15, 18, 20

Step 1: Already ordered
n = 9 values

Step 2: Find Q2 (median)
2, 5, 7, 9, 11, 13, 15, 18, 20
Q2 = 11

Step 3: Find Q1 (median of lower half)
Lower half: 2, 5, 7, 9 (the median, 11, is left out)
Q1 = $\frac{5+7}{2} = 6$

Step 4: Find Q3 (median of upper half)
Upper half: 13, 15, 18, 20
Q3 = $\frac{15+18}{2} = 16.5$

Step 5: Calculate IQR
$$\text{IQR} = 16.5 - 6 = 10.5$$

Answer: Q1 = 6, Q2 = 11, Q3 = 16.5, IQR = 10.5

Example 2: Find five-number summary for: 3, 6, 8, 10, 12, 15, 18, 22

Minimum: 3
Q1: Median of (3, 6, 8, 10) = $\frac{6+8}{2} = 7$
Q2 (Median): $\frac{10+12}{2} = 11$
Q3: Median of (12, 15, 18, 22) = $\frac{15+18}{2} = 16.5$
Maximum: 22

IQR: $16.5 - 7 = 9.5$

Answer: Min = 3, Q1 = 7, Med = 11, Q3 = 16.5, Max = 22, IQR = 9.5
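
The five-number summary can be sketched in Python using the median-of-halves method from this section. The function names are illustrative. Software libraries may use other quartile conventions, so their results can differ slightly from this method.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values before the middle
    upper = ordered[half + (n % 2):]  # skip the middle value when n is odd
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

minimum, q1, q2, q3, maximum = five_number_summary([3, 6, 8, 10, 12, 15, 18, 22])
print(minimum, q1, q2, q3, maximum)  # 3 7.0 11.0 16.5 22
print(q3 - q1)                       # 9.5
```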

4-5. Identify Outliers and Their Effects

Outlier: A data value significantly different from other values
Effect: Can greatly affect the mean, but usually has little effect on the median
Why identify: May indicate errors, special cases, or important information

Method 1: Using IQR (Most Common)

IQR Method for Outliers:

Step 1: Calculate Q1, Q3, and IQR

Step 2: Calculate boundaries
$$\text{Lower Boundary} = Q1 - 1.5 \times \text{IQR}$$
$$\text{Upper Boundary} = Q3 + 1.5 \times \text{IQR}$$

Step 3: Any value outside boundaries is an outlier
• Value < Lower Boundary → Low outlier
• Value > Upper Boundary → High outlier

Example 1: Identify outliers in: 5, 8, 10, 12, 15, 18, 20, 45

Find Q1 and Q3:
Q1 = 9 (median of 5, 8, 10, 12)
Q3 = 19 (median of 15, 18, 20, 45)

Calculate IQR:
$\text{IQR} = 19 - 9 = 10$

Calculate boundaries:
Lower: $9 - 1.5(10) = 9 - 15 = -6$
Upper: $19 + 1.5(10) = 19 + 15 = 34$

Check data:
All values except 45 are between -6 and 34
45 > 34

Answer: 45 is an outlier
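
A minimal Python sketch of the IQR method, assuming quartiles are found as medians of the lower and upper halves. The function names are chosen for illustration.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def iqr_outliers(values):
    """Return (lower boundary, upper boundary, list of outliers)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + (n % 2):])  # skip the middle value when n is odd
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower or x > upper]
    return lower, upper, outliers

print(iqr_outliers([5, 8, 10, 12, 15, 18, 20, 45]))  # (-6.0, 34.0, [45])
```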

Effects of Removing Outliers

How Outliers Affect Statistics:

Mean: GREATLY affected
• High outlier increases mean
• Low outlier decreases mean

Median: SLIGHTLY or NOT affected
• Position of middle value usually stays similar

Mode: Usually NOT affected
• Outliers typically appear only once

Range: GREATLY affected
• Outliers are often min or max values

Standard Deviation: GREATLY affected
• Measures spread from mean

Example 2: Describe effect of removing outlier

Original data: 10, 12, 13, 14, 15, 15, 16, 50

With outlier (50):
Mean: $\frac{10+12+13+14+15+15+16+50}{8} = \frac{145}{8} = 18.125$
Median: $\frac{14+15}{2} = 14.5$
Range: $50 - 10 = 40$

Without outlier:
Mean: $\frac{10+12+13+14+15+15+16}{7} = \frac{95}{7} \approx 13.57$
Median: $14$ (middle value)
Range: $16 - 10 = 6$

Analysis:
• Mean decreased from 18.125 to 13.57 (significant change)
• Median changed slightly from 14.5 to 14
• Range decreased dramatically from 40 to 6

Conclusion: Removing the outlier brings the mean and range in line with the rest of the data, while the median barely changes
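
The comparison can be checked with Python's standard `statistics` module. This is a sketch; the helper name `summarize` is an assumption made for the example.

```python
import statistics

def summarize(values):
    """Mean, median, and range of a dataset."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "range": max(values) - min(values),
    }

data = [10, 12, 13, 14, 15, 15, 16, 50]
without_outlier = [x for x in data if x != 50]  # drop the outlier

with_stats = summarize(data)
without_stats = summarize(without_outlier)

# The mean drops by about 4.55, the median by only 0.5
print(round(with_stats["mean"] - without_stats["mean"], 2))  # 4.55
print(with_stats["median"] - without_stats["median"])        # 0.5
print(with_stats["range"], without_stats["range"])           # 40 6
```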

6. Variance and Standard Deviation

Variance: Average of squared deviations from the mean
Standard Deviation: Square root of variance - measures typical distance from mean
Symbol for variance: $\sigma^2$ (population) or $s^2$ (sample)
Symbol for standard deviation: $\sigma$ (population) or $s$ (sample)

Population vs Sample

Key Difference:

Population: Entire group
• Divide by $n$
• Use $\sigma$ (sigma)

Sample: Part of group
• Divide by $n - 1$ (Bessel's correction)
• Use $s$

In this course, we typically use population formulas

Variance

Population Variance Formula:

$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$$

where:
• $x$ = each data value
• $\bar{x}$ = mean
• $n$ = number of values
• $(x - \bar{x})$ = deviation from mean

Steps:
1. Find the mean
2. Find each deviation: $(x - \bar{x})$
3. Square each deviation: $(x - \bar{x})^2$
4. Find average of squared deviations

Example 1: Find variance of 2, 4, 6, 8, 10

Step 1: Find mean
$\bar{x} = \frac{2+4+6+8+10}{5} = \frac{30}{5} = 6$

Step 2-3: Find deviations and square them

x	$(x - \bar{x})$	$(x - \bar{x})^2$
2	2 - 6 = -4	16
4	4 - 6 = -2	4
6	6 - 6 = 0	0
8	8 - 6 = 2	4
10	10 - 6 = 4	16

Step 4: Calculate variance
$$\sigma^2 = \frac{16+4+0+4+16}{5} = \frac{40}{5} = 8$$

Answer: Variance = 8
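
The four steps translate almost directly into Python. This sketch uses the population formula (divide by $n$); the function name is illustrative.

```python
def population_variance(values):
    """Average of the squared deviations from the mean (divide by n)."""
    n = len(values)
    mean = sum(values) / n                                  # Step 1
    deviations = [x - mean for x in values]                 # Step 2
    squared_deviations = [d ** 2 for d in deviations]       # Step 3
    return sum(squared_deviations) / n                      # Step 4

print(population_variance([2, 4, 6, 8, 10]))  # 8.0
```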

Standard Deviation

Standard Deviation Formula:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

In words: Square root of variance

Interpretation:
• Small standard deviation: data clustered near mean
• Large standard deviation: data spread out from mean
• Units are same as original data (unlike variance)

Example 2: Find standard deviation using variance from Example 1

Variance: $\sigma^2 = 8$

Standard deviation:
$$\sigma = \sqrt{8} = 2\sqrt{2} \approx 2.83$$

Answer: Standard deviation ≈ 2.83

Interpretation: Values typically vary about 2.83 units from mean of 6
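
A small Python sketch of the population standard deviation (the function name is an assumption for this example). Python's standard library also offers `statistics.pstdev` for the population formula and `statistics.stdev` for the sample formula, which divides by $n - 1$.

```python
import math

def population_std_dev(values):
    """Square root of the population variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return math.sqrt(variance)

sigma = population_std_dev([2, 4, 6, 8, 10])
print(round(sigma, 2))  # 2.83
```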

Using Standard Deviation to Find Outliers

Standard Deviation Method:

An outlier is any value more than 3 standard deviations from the mean

$$\text{Lower Boundary} = \bar{x} - 3\sigma$$
$$\text{Upper Boundary} = \bar{x} + 3\sigma$$

Values outside this range are outliers

Example 3: A dataset has mean = 50 and standard deviation = 5. Is 72 an outlier?

Calculate boundaries:
Lower: $50 - 3(5) = 50 - 15 = 35$
Upper: $50 + 3(5) = 50 + 15 = 65$

Check 72:
72 > 65 (upper boundary)

Answer: Yes, 72 is an outlier
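
The boundary check can be written as a short Python function. This is a sketch of the 3-standard-deviation rule stated above; the function and parameter names are hypothetical.

```python
def is_outlier_3_sigma(value, mean, std_dev):
    """True if value is more than 3 standard deviations from the mean."""
    lower = mean - 3 * std_dev
    upper = mean + 3 * std_dev
    return value < lower or value > upper

print(is_outlier_3_sigma(72, mean=50, std_dev=5))  # True  (72 > 65)
print(is_outlier_3_sigma(60, mean=50, std_dev=5))  # False (35 <= 60 <= 65)
```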

7. Choose Appropriate Measures of Center and Variation

Choosing Wisely: Different situations call for different measures
Key Question: Are there outliers or is data skewed?

Decision Guide:

Use MEAN and STANDARD DEVIATION when:
• Data is symmetric (no outliers)
• Normal distribution (bell-shaped)
• Want to use all data values
• Doing further calculations

Use MEDIAN and IQR when:
• Data has outliers
• Data is skewed (not symmetric)
• Want measure resistant to extreme values
• Dealing with ordinal data (rankings)

Use MODE when:
• Data is categorical
• Want most common value
• Data may have more than one most common value (report every mode)

Example 1: Choose appropriate measures

Scenario: Home prices in a neighborhood: $150K, $160K, $170K, $180K, $190K, $2M

Analysis:
• $2M is an outlier (much higher than others)
• Mean would be heavily influenced by $2M
• Median better represents typical home

Mean: $\frac{2,850,000}{6} = \$475,000$ (misleading!)
Median: $\frac{170,000 + 180,000}{2} = \$175,000$ (more typical)

Best choice: Median and IQR
Reason: Outlier present, better represents typical home
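
A quick check of this example in Python, using the standard `statistics` module. The variable names are chosen for illustration.

```python
import statistics

# Home prices from the scenario, in dollars
prices = [150_000, 160_000, 170_000, 180_000, 190_000, 2_000_000]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

# Count how many homes cost less than the mean
below_mean = sum(1 for p in prices if p < mean_price)

print(mean_price == 475_000)    # True
print(median_price == 175_000)  # True
print(below_mean)               # 5
```

Five of the six homes cost less than the mean, which is why the mean is a poor description of a typical home here.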

Example 2: Choose measures

Scenario: Test scores: 72, 75, 78, 80, 82, 85, 88, 90

Analysis:
• No outliers
• Fairly symmetric distribution
• All values close together

Best choice: Mean and Standard Deviation
Reason: Symmetric data, no outliers, uses all information

Example 3: Favorite colors survey

Data: Red (5), Blue (12), Green (3), Yellow (2)

Analysis:
• Categorical data (not numerical)
• Can't calculate mean or median

Best choice: Mode
Answer: Blue is most popular (mode)
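
For categorical data, finding the mode means finding the category with the largest count. A minimal Python sketch, with the survey results stored in a dictionary (an assumed layout):

```python
# Survey counts from the example
color_counts = {"Red": 5, "Blue": 12, "Green": 3, "Yellow": 2}

# The mode is the category with the highest count
mode_color = max(color_counts, key=color_counts.get)

print(mode_color)  # Blue
```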

Measures of Center Comparison

Measure	Formula/Method	Best Used When	Affected by Outliers?
Mean	$\bar{x} = \frac{\sum x}{n}$	Symmetric data, no outliers	YES - heavily affected
Median	Middle value when ordered	Skewed data, outliers present	NO - resistant to outliers
Mode	Most frequent value	Categorical data	NO - not affected

Measures of Spread Comparison

Measure	Formula	What It Shows	Affected by Outliers?
Range	Max - Min	Total spread	YES - very sensitive
IQR	Q3 - Q1	Spread of middle 50%	NO - resistant
Variance	$\sigma^2 = \frac{\sum (x-\bar{x})^2}{n}$	Average squared deviation	YES - very sensitive
Standard Deviation	$\sigma = \sqrt{\sigma^2}$	Typical distance from mean	YES - very sensitive

Outlier Detection Methods

Method	Formula	When to Use
IQR Method (Most Common)	Lower: $Q1 - 1.5 \times IQR$ Upper: $Q3 + 1.5 \times IQR$	General purpose, box plots
Standard Deviation Method	Lower: $\bar{x} - 3\sigma$ Upper: $\bar{x} + 3\sigma$	Normal distributions

Types of Sampling Bias

Type	Description	Example	Problem
Convenience	Easy to reach samples	Survey friends only	Not representative
Voluntary Response	Self-selected participants	Online poll	Strong opinions overrepresented
Undercoverage	Excludes part of population	Phone survey only	Missing perspectives
Nonresponse	Selected don't respond	Low response rate	Responders may differ
Question Wording	Leading or confusing questions	"Don't you agree that...?"	Influences responses

Success Tips for Single-Variable Statistics:
✓ Mean uses all values; median uses position
✓ Always order data before finding median or quartiles
✓ IQR measures spread of middle 50% - resistant to outliers
✓ Use IQR method (1.5 × IQR) to identify outliers
✓ Outliers greatly affect mean, range, and standard deviation
✓ Outliers barely affect median and IQR
✓ Variance is in squared units; standard deviation is in original units
✓ Choose median & IQR when outliers present
✓ Choose mean & standard deviation for symmetric data
✓ Random sampling reduces selection bias because every member has an equal chance of being chosen