Single-Variable Statistics
Complete Notes & Formulae for Eleventh Grade (Algebra 2)
1. Identify Biased Samples
What is Sampling Bias?
Sampling bias occurs when a sample does not accurately represent the population, causing some members to be overrepresented or underrepresented
Result:
• Skewed or invalid results
• Cannot generalize findings to the population
Types of Sampling Bias:
1. Self-Selection Bias (Voluntary Response Bias)
People with strong opinions volunteer to participate
Example: Online polls where only motivated people respond
2. Nonresponse Bias
People who refuse to participate differ systematically from those who do
Example: Busy people less likely to complete long surveys
3. Undercoverage Bias
Some groups in the population are inadequately represented
Example: Online surveys miss people without internet access
4. Convenience Sampling Bias
Sampling only easily accessible individuals
Example: Surveying only students in one classroom
5. Survivorship Bias
Only studying "survivors" or successful cases
Example: Only interviewing successful entrepreneurs
2. Variance and Standard Deviation
Variance (σ² or s²):
Measures the average squared deviation from the mean (spread of data)
Population Variance:
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
Sample Variance:
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} \]
Note: Divide by \( n-1 \) for sample (Bessel's correction)
Standard Deviation (σ or s):
Square root of variance; measures typical distance from the mean
\[ \sigma = \sqrt{\sigma^2} \quad \text{or} \quad s = \sqrt{s^2} \]
Key Points:
• Larger SD = more spread out data
• Smaller SD = data clustered near mean
• SD has same units as original data
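The two formulas above map directly onto Python's built-in `statistics` module; the data values here are made up for illustration:

```python
import statistics

data = [10, 12, 14, 15, 16, 18]

# Sample statistics: divide by n - 1 (Bessel's correction)
s2 = statistics.variance(data)   # sample variance
s = statistics.stdev(data)       # sample standard deviation

# Population statistics: divide by N
sigma2 = statistics.pvariance(data)
sigma = statistics.pstdev(data)

print(round(s, 1), round(sigma, 1))  # sample SD is slightly larger
```

Note that the sample SD is always a bit larger than the population SD for the same data, because dividing by \( n-1 \) instead of \( n \) inflates the result.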
3. Identify an Outlier
What is an Outlier?
A data value that is significantly different (much larger or smaller) from other values in the dataset
Method 1: Using Standard Deviation (3-Sigma Rule)
Rule:
Any data value more than 3 standard deviations from the mean is an outlier
\[ \text{Outlier if: } x < \mu - 3\sigma \text{ or } x > \mu + 3\sigma \]
Method 2: Using IQR (Interquartile Range)
Steps:
1. Find Q1 (25th percentile) and Q3 (75th percentile)
2. Calculate IQR = Q3 - Q1
3. Check for outliers:
\[ \text{Lower outliers: } x < Q1 - 1.5 \times IQR \]
\[ \text{Upper outliers: } x > Q3 + 1.5 \times IQR \]
Example:
Data: 10, 12, 14, 15, 16, 18, 50. Mean ≈ 19.3, sample SD ≈ 13.8
Upper bound: 19.3 + 3(13.8) = 60.7
Lower bound: 19.3 - 3(13.8) = -22.1
All values fall within the bounds
No outliers under the 3-sigma rule, even though 50 appears unusual; this shows the 3-sigma rule can miss outliers in small datasets, where the IQR method would flag 50
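The IQR steps above can be sketched in Python. Quartile conventions vary between textbooks and software; this sketch uses the common "median of each half" rule, which matches the hand method taught in most Algebra 2 courses:

```python
from statistics import median

def iqr_outliers(data):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    xs = sorted(data)
    half = len(xs) // 2
    q1 = median(xs[:half])    # median of the lower half
    q3 = median(xs[-half:])   # median of the upper half
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [x for x in xs if x < lower or x > upper]

print(iqr_outliers([10, 12, 14, 15, 16, 18, 50]))  # [50]
```

For this dataset Q1 = 12 and Q3 = 18, so the fences are 3 and 27, and 50 is flagged; other quartile conventions (e.g., NumPy's default interpolation) can give slightly different fences.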
4. Effect of Removing Outliers
Impact on Statistics:
Removing outliers typically affects:
Mean:
• Most affected by outliers
• Will move toward center of remaining data
Median:
• Resistant to outliers (less affected)
• May change slightly or not at all
Standard Deviation:
• Usually decreases (data less spread out)
• Indicates more consistent data
Range:
• Always decreases
Example:
Data: 10, 12, 14, 15, 16, 18, 100
| Statistic | With Outlier | Without Outlier |
|---|---|---|
| Mean | 26.4 | 14.2 |
| Median | 15 | 14.5 |
| SD | 32.5 | 2.9 |
Effect: Mean decreased significantly, SD decreased dramatically, median barely changed
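The comparison above can be reproduced with a short script; the outlier (100) is removed by filtering the list:

```python
from statistics import mean, median, stdev

data = [10, 12, 14, 15, 16, 18, 100]
trimmed = [x for x in data if x != 100]  # drop the outlier

for label, xs in [("with outlier", data), ("without outlier", trimmed)]:
    print(label, round(mean(xs), 1), median(xs), round(stdev(xs), 1))
```

Running this confirms the pattern: the mean and sample SD change dramatically when 100 is removed, while the median barely moves.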
5. Find Confidence Intervals for Population Means
Confidence Interval:
A range of values that likely contains the true population parameter with a specified level of confidence
\[ \text{CI} = \bar{x} \pm z^* \cdot \frac{\sigma}{\sqrt{n}} \]
where:
• \( \bar{x} \) = sample mean
• \( z^* \) = critical value (z-score for confidence level)
• \( \sigma \) = population standard deviation
• \( n \) = sample size
• \( \frac{\sigma}{\sqrt{n}} \) = standard error
Common Critical Values:
| Confidence Level | z* |
|---|---|
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
Example:
Sample: n = 100, \( \bar{x} = 75 \), σ = 10. Find 95% CI.
Standard error: \( \frac{10}{\sqrt{100}} = 1 \)
Margin of error: \( 1.96 \times 1 = 1.96 \)
CI: \( 75 \pm 1.96 = (73.04, 76.96) \)
95% CI: (73.04, 76.96)
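The worked example above can be checked with a small helper function (a sketch of the z-interval formula, assuming σ is known):

```python
import math

def mean_ci(xbar, sigma, n, z=1.96):
    """z-interval for a population mean: xbar +/- z * sigma / sqrt(n)."""
    se = sigma / math.sqrt(n)   # standard error
    moe = z * se                # margin of error
    return xbar - moe, xbar + moe

lo, hi = mean_ci(75, 10, 100)
print(round(lo, 2), round(hi, 2))  # 73.04 76.96
```

Passing a different `z` (1.645 for 90%, 2.576 for 99%) gives the other common confidence levels from the table above.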
6. Find Confidence Intervals for Population Proportions
Formula:
\[ \text{CI} = \hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
where:
• \( \hat{p} = \frac{x}{n} \) = sample proportion
• \( x \) = number of successes
• \( n \) = sample size
• \( z^* \) = critical value
Conditions:
• Random sample
• \( n\hat{p} \geq 10 \) and \( n(1-\hat{p}) \geq 10 \)
• Sample size < 10% of population
Example:
Survey: 200 people, 120 support a policy. Find 95% CI for proportion.
\( \hat{p} = \frac{120}{200} = 0.6 \)
Standard error: \( \sqrt{\frac{0.6(0.4)}{200}} = \sqrt{\frac{0.24}{200}} = 0.0346 \)
Margin of error: \( 1.96 \times 0.0346 = 0.0678 \)
CI: \( 0.6 \pm 0.0678 = (0.532, 0.668) \)
95% CI: (53.2%, 66.8%)
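The proportion formula translates the same way; this sketch recomputes the survey example:

```python
import math

def prop_ci(successes, n, z=1.96):
    """z-interval for a population proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error
    moe = z * se                             # margin of error
    return p_hat - moe, p_hat + moe

lo, hi = prop_ci(120, 200)
print(round(lo, 3), round(hi, 3))  # 0.532 0.668
```

Before trusting the interval, the conditions listed above should be checked: here \( n\hat{p} = 120 \) and \( n(1-\hat{p}) = 80 \), both comfortably at least 10.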
7. Interpret Confidence Intervals for Population Means
Correct Interpretation:
For a 95% confidence interval (a, b):
✓ CORRECT:
"We are 95% confident that the true population mean lies between a and b"
"If we repeated this process many times, about 95% of intervals would contain the true mean"
✗ INCORRECT:
"There is a 95% probability the true mean is between a and b" (once computed, the interval either contains μ or it doesn't; the 95% describes the method, not any one interval)
"95% of the data falls in this interval" (CI is about parameter, not data)
Confidence Level Meaning:
• Higher confidence level → Wider interval (more certainty, less precision)
• Lower confidence level → Narrower interval (less certainty, more precision)
• Larger sample size → Narrower interval (more precision)
8. Experiment Design
Key Components:
1. Control Group
Receives no treatment or standard treatment (baseline for comparison)
2. Treatment Group
Receives the experimental treatment
3. Random Assignment
Randomly assign subjects to groups to eliminate bias
4. Replication
Use enough subjects to detect effects (larger sample = more reliable)
5. Blinding
• Single-blind: Subjects don't know which group they're in
• Double-blind: Neither subjects nor researchers know (reduces bias)
Types of Studies:
Observational Study:
Observe subjects without intervention; can show association but NOT causation
Experiment:
Researcher imposes treatment; CAN establish causation with proper design
9. Analyze Results Using Simulations
Purpose of Simulation:
Simulations help determine if observed results could have occurred by chance
Steps:
1. State null hypothesis (no effect/difference)
2. Run many simulations assuming null hypothesis is true
3. Compare observed result to simulation distribution
4. Calculate p-value (proportion of simulations as extreme as observed)
5. Make conclusion
Interpretation:
P-value:
Probability of getting results as extreme as observed, assuming no real effect
• Small p-value (< 0.05): Results unlikely due to chance → Statistically significant
• Large p-value (> 0.05): Results could easily occur by chance → Not significant
Example:
A coin is flipped 100 times, getting 60 heads. Is the coin fair?
Simulate 1000 trials of flipping a fair coin 100 times
Count how many simulations give ≥60 heads
If only 30 out of 1000 simulations have ≥60 heads:
P-value = 30/1000 = 0.03
Conclusion: Result is statistically significant (p < 0.05); coin may be biased
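The simulation steps above can be sketched directly; the trial count and random seed here are arbitrary choices, and the simulated p-value will vary slightly from the worked numbers:

```python
import random

random.seed(1)          # for reproducibility
observed = 60           # heads seen in the real experiment
trials = 10_000
extreme = 0

for _ in range(trials):
    # Flip a fair coin 100 times under the null hypothesis (p = 0.5)
    heads = sum(random.random() < 0.5 for _ in range(100))
    if heads >= observed:
        extreme += 1

p_value = extreme / trials
print(p_value)  # roughly 0.03 for this one-sided test
```

Because the null distribution here is Binomial(100, 0.5), the exact one-sided probability of 60 or more heads is about 0.028, so the simulation estimate should land near that value.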
10. Quick Reference Summary
Key Formulas:
Sample Variance: \( s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1} \)
Standard Deviation: \( s = \sqrt{s^2} \)
Outlier (3-sigma): \( x < \mu - 3\sigma \) or \( x > \mu + 3\sigma \)
CI for Mean: \( \bar{x} \pm z^* \cdot \frac{\sigma}{\sqrt{n}} \)
CI for Proportion: \( \hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \)
📚 Study Tips
✓ Always check for biased sampling methods in survey design
✓ Outliers significantly affect mean and standard deviation but not median
✓ Higher confidence level = wider confidence interval
✓ Random assignment in experiments helps establish causation
✓ P-value < 0.05 typically indicates statistical significance
