
Statistics Fundamentals

Master the fundamentals of mathematical statistics. This comprehensive guide covers essential topics in probability, statistics, and their applications. It is written for students in the behavioral sciences, data science, and mathematics, and for anyone studying statistics across curricula including AP Statistics, IB Mathematics, GCSE, IGCSE, and college-level courses.

What is Statistics?

Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It provides methods and tools for making sense of numerical information and drawing meaningful conclusions from data in the presence of variability and uncertainty.

The Two Main Branches of Statistics

Descriptive Statistics

Methods for organizing, summarizing, and presenting data in a meaningful way

Purpose: Describe what the data shows

Examples: Mean, median, standard deviation, graphs, tables

Inferential Statistics

Methods for making inferences and predictions about populations based on sample data

Purpose: Draw conclusions beyond the immediate data

Examples: Hypothesis testing, confidence intervals, regression analysis

Key Distinction:

  • Population: The entire group of individuals or items of interest
  • Sample: A subset of the population selected for study
  • Parameter: A numerical characteristic of a population (usually unknown)
  • Statistic: A numerical characteristic of a sample (calculated from data)

Fundamentals of Probability and Statistics

Probability is the foundation of statistical inference. It quantifies uncertainty and provides the mathematical framework for making predictions about random events.

Basic Probability Concepts

Probability Definition:

The probability of an event A, denoted \( P(A) \), is a number between 0 and 1 that represents the likelihood of event A occurring.

\[ 0 \leq P(A) \leq 1 \]

Key Terms:

  • Random Experiment: A process with uncertain outcomes
  • Sample Space (S): The set of all possible outcomes
  • Event: A subset of the sample space
  • Probability: The likelihood of an event occurring

Classical Probability:

\[ P(A) = \frac{\text{Number of favorable outcomes}}{\text{Total number of possible outcomes}} \]

Probability Rules and Laws

1. Complement Rule:

\[ P(A^c) = 1 - P(A) \]

Where \( A^c \) is the complement of event A (A does not occur)

2. Addition Rule (Mutually Exclusive Events):

For events that cannot occur simultaneously:

\[ P(A \cup B) = P(A) + P(B) \]

3. General Addition Rule:

\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]

4. Multiplication Rule (Independent Events):

For independent events (the occurrence of one does not change the probability of the other):

\[ P(A \cap B) = P(A) \times P(B) \]

5. Conditional Probability:

The probability of A given that B has occurred:

\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
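These rules can be verified by brute-force enumeration. The sketch below checks the addition, complement, and conditional-probability rules on the sample space of two fair dice; the events are chosen purely for illustration and are not from the text:

```python
from fractions import Fraction

# Sample space for two fair six-sided dice: 36 equally likely outcomes.
space = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]

def prob(event):
    """Classical probability: favorable outcomes / total outcomes."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

A = lambda o: o[0] + o[1] == 7      # event A: the sum is 7
B = lambda o: o[0] == 1             # event B: the first die shows 1

p_A, p_B = prob(A), prob(B)
p_and = prob(lambda o: A(o) and B(o))
p_or = prob(lambda o: A(o) or B(o))

# General addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert p_or == p_A + p_B - p_and
# Complement rule: P(A^c) = 1 - P(A)
assert prob(lambda o: not A(o)) == 1 - p_A
# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
print(p_and / p_B)  # P(sum is 7 | first die is 1) = 1/6
```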

Bayes' Theorem

Bayes' Theorem: A fundamental formula for updating probabilities based on new information

\[ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} \]

Extended Form:

\[ P(A|B) = \frac{P(B|A) \times P(A)}{P(B|A) \times P(A) + P(B|A^c) \times P(A^c)} \]

Applications: Medical diagnosis, spam filtering, machine learning, decision analysis
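A short numeric sketch of the extended form in a medical-diagnosis setting. The numbers (1% prevalence, 95% sensitivity, 90% specificity) are hypothetical, chosen only to illustrate the calculation:

```python
# Bayes' theorem with hypothetical numbers (not from the text).
p_disease = 0.01                 # P(A): prior probability of disease
p_pos_given_disease = 0.95       # P(B|A): sensitivity
p_pos_given_healthy = 0.10       # P(B|A^c): false-positive rate (1 - specificity)

# Extended form denominator: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ≈ 0.088
```

Even with a fairly accurate test, a positive result here implies only about a 9% chance of disease, because the condition is rare; this is exactly the kind of update Bayes' theorem formalizes.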

Random Variables and Probability Distributions

A random variable is a numerical outcome of a random phenomenon. It assigns a numerical value to each outcome in a sample space.

Types of Random Variables

Discrete Random Variables

Take on a countable set of values (often integers)

Examples:

  • Number of heads in 10 coin flips
  • Number of students in a class
  • Number of defective items in a batch

Described by: Probability Mass Function (PMF)

Continuous Random Variables

Take on any value in an interval (uncountable)

Examples:

  • Height of individuals
  • Time until an event occurs
  • Temperature measurements

Described by: Probability Density Function (PDF)

Expected Value and Variance

Expected Value (Mean) - Discrete:

\[ E(X) = \mu = \sum_{i} x_i P(X = x_i) \]

Expected Value - Continuous:

\[ E(X) = \mu = \int_{-\infty}^{\infty} x f(x) \, dx \]

Variance - Discrete:

\[ \text{Var}(X) = \sigma^2 = \sum_{i} (x_i - \mu)^2 P(X = x_i) \]

Alternative Formula:

\[ \text{Var}(X) = E(X^2) - [E(X)]^2 \]

Standard Deviation:

\[ \sigma = \sqrt{\text{Var}(X)} \]
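These definitions are easy to check numerically. The sketch below computes \( E(X) \) and \( \text{Var}(X) \) for a single fair die (a standard illustration, not an example from the text) and confirms that the two variance formulas agree:

```python
from fractions import Fraction

# PMF of one fair die: each face 1..6 has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())                  # E(X) = Σ x P(X=x)
ex2 = sum(x**2 * p for x, p in pmf.items())                # E(X²)
var = ex2 - mean**2                                        # Var(X) = E(X²) - [E(X)]²
var_direct = sum((x - mean)**2 * p for x, p in pmf.items())

print(mean)               # 7/2
print(var)                # 35/12
assert var == var_direct  # both variance formulas agree
```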

Common Probability Distributions

Discrete Distributions

1. Bernoulli Distribution

Models a single trial with two outcomes: success (1) or failure (0)

Parameters: \( p \) = probability of success

PMF:

\[ P(X = x) = p^x(1-p)^{1-x}, \quad x \in \{0, 1\} \]

Mean: \( E(X) = p \)

Variance: \( \text{Var}(X) = p(1-p) \)

2. Binomial Distribution

Models the number of successes in \( n \) independent Bernoulli trials

Notation: \( X \sim \text{Binomial}(n, p) \)

PMF:

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]

Where \( \binom{n}{k} = \frac{n!}{k!(n-k)!} \)

Mean: \( E(X) = np \)

Variance: \( \text{Var}(X) = np(1-p) \)
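A minimal sketch of the binomial PMF using only the standard library; the coin-flip parameters are illustrative:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5  # e.g. 10 fair coin flips (illustrative values)
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

print(round(binom_pmf(6, n, p), 3))        # P(exactly 6 successes) ≈ 0.205
assert abs(sum(pmf) - 1) < 1e-12           # the PMF sums to 1
mean = sum(k * binom_pmf(k, n, p) for k in range(n + 1))
assert abs(mean - n * p) < 1e-12           # E(X) = np, as stated above
```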

3. Poisson Distribution

Models the number of events occurring in a fixed interval of time or space

Notation: \( X \sim \text{Poisson}(\lambda) \)

PMF:

\[ P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!} \]

Mean: \( E(X) = \lambda \)

Variance: \( \text{Var}(X) = \lambda \)

Applications: Modeling rare events, call centers, radioactive decay
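A similar sketch for the Poisson PMF, confirming numerically that the mean and variance both equal \( \lambda \). The rate \( \lambda = 3 \) is a hypothetical choice:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

lam = 3.0  # hypothetical rate, e.g. 3 calls per hour
# Mean and variance should both equal lambda; sum far enough into the
# tail that the truncation error is negligible.
ks = range(100)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(round(mean, 6), round(var, 6))  # both ≈ 3.0
```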

Continuous Distributions

1. Uniform Distribution

All values in an interval are equally likely

Notation: \( X \sim \text{Uniform}(a, b) \)

PDF:

\[ f(x) = \frac{1}{b-a}, \quad a \leq x \leq b \]

Mean: \( E(X) = \frac{a+b}{2} \)

Variance: \( \text{Var}(X) = \frac{(b-a)^2}{12} \)

2. Normal (Gaussian) Distribution

The most important continuous distribution, characterized by its bell-shaped curve

Notation: \( X \sim N(\mu, \sigma^2) \)

PDF:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

Mean: \( E(X) = \mu \)

Variance: \( \text{Var}(X) = \sigma^2 \)

Standard Normal Distribution: \( Z \sim N(0, 1) \)

Standardization:

\[ Z = \frac{X - \mu}{\sigma} \]

Empirical Rule (68-95-99.7 Rule):

  • 68% of data falls within 1 standard deviation of the mean
  • 95% of data falls within 2 standard deviations of the mean
  • 99.7% of data falls within 3 standard deviations of the mean
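Python's standard library can verify the empirical rule against the exact normal CDF via `statistics.NormalDist`; the IQ-style scale at the end is a hypothetical example of standardization:

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # the standard normal Z ~ N(0, 1)

for k in (1, 2, 3):
    within = Z.cdf(k) - Z.cdf(-k)  # P(-k <= Z <= k)
    print(k, round(within * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7

# Standardization on a hypothetical IQ-style scale (mean 100, sd 15):
iq = NormalDist(mu=100, sigma=15)
print(iq.zscore(130))  # z = (130 - 100) / 15 = 2.0
```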

3. Exponential Distribution

Models the time between events in a Poisson process

Notation: \( X \sim \text{Exp}(\lambda) \)

PDF:

\[ f(x) = \lambda e^{-\lambda x}, \quad x \geq 0 \]

Mean: \( E(X) = \frac{1}{\lambda} \)

Variance: \( \text{Var}(X) = \frac{1}{\lambda^2} \)

Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset.

Measures of Central Tendency

| Measure | Formula | Description |
|---|---|---|
| Mean (average) | \( \bar{x} = \frac{\sum_{i=1}^n x_i}{n} \) | Sum of all values divided by the number of values |
| Median | Middle value when the data are ordered | 50th percentile; resistant to outliers |
| Mode | Most frequent value | Can have multiple modes or none |

Measures of Variability (Dispersion)

1. Range:

\[ \text{Range} = \text{Maximum} - \text{Minimum} \]

2. Variance (Sample):

\[ s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} \]

3. Standard Deviation (Sample):

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}} \]

4. Interquartile Range (IQR):

\[ \text{IQR} = Q_3 - Q_1 \]

Where \( Q_1 \) is the 25th percentile and \( Q_3 \) is the 75th percentile

Measures of Position

Percentiles: Values that divide the data into 100 equal parts

Quartiles: Divide the data into four equal parts

  • \( Q_1 \): 25th percentile (first quartile)
  • \( Q_2 \): 50th percentile (median, second quartile)
  • \( Q_3 \): 75th percentile (third quartile)

z-score (Standard Score):

\[ z = \frac{x - \mu}{\sigma} \]

Measures how many standard deviations a value is from the mean
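All of the measures above can be computed directly with the standard `statistics` module. The dataset below is hypothetical; note that `stdev` uses the sample (\( n-1 \)) denominator, matching the formulas in this section:

```python
import statistics as st

data = [4, 8, 15, 16, 23, 42]  # hypothetical dataset

mean = st.mean(data)
median = st.median(data)
s = st.stdev(data)                      # sample standard deviation (n-1 denominator)
q1, q2, q3 = st.quantiles(data, n=4)    # quartiles (default "exclusive" method)
iqr = q3 - q1                           # IQR = Q3 - Q1

print(mean, median, round(s, 2))
print("IQR:", iqr)
```

Different quartile conventions (inclusive vs. exclusive interpolation) give slightly different \( Q_1 \) and \( Q_3 \) on small samples, so software output may differ from hand calculations.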

Inferential Statistics

Inferential statistics use sample data to make inferences about populations.

Sampling Distributions

Central Limit Theorem (CLT):

For a large sample size \( n \), the sampling distribution of the sample mean \( \bar{X} \) is approximately normal, regardless of the population's distribution:

\[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]

Standard Error of the Mean:

\[ SE = \frac{\sigma}{\sqrt{n}} \]

Key Points:

  • Generally, \( n \geq 30 \) is considered large enough
  • The CLT is fundamental to statistical inference
  • Allows us to use normal distribution methods even when the population is not normal
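The CLT and the standard-error formula can be checked by simulation. The sketch below draws repeated samples from a Uniform(0, 1) population, which is far from normal, and compares the observed spread of the sample means to \( \sigma/\sqrt{n} \); the sample size and repetition count are arbitrary choices:

```python
import random
import statistics as st

random.seed(0)

n = 36          # sample size (illustrative)
reps = 20_000   # number of simulated samples

sample_means = [st.mean(random.random() for _ in range(n)) for _ in range(reps)]

mu, sigma = 0.5, (1 / 12) ** 0.5        # mean and sd of Uniform(0, 1)
se_theory = sigma / n ** 0.5            # SE = sigma / sqrt(n)
se_observed = st.stdev(sample_means)

print(round(se_theory, 4), round(se_observed, 4))  # should be close
```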

Confidence Intervals

A confidence interval provides a range of plausible values for a population parameter

Confidence Interval for Population Mean (\( \sigma \) known):

\[ \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} \]

Confidence Interval for Population Mean (\( \sigma \) unknown):

\[ \bar{x} \pm t^* \frac{s}{\sqrt{n}} \]

Where \( t^* \) is from the t-distribution with \( n-1 \) degrees of freedom

Common Confidence Levels:

  • 90% confidence: \( z^* = 1.645 \)
  • 95% confidence: \( z^* = 1.96 \)
  • 99% confidence: \( z^* = 2.576 \)

Interpretation: We are 95% confident that the true population parameter lies within the calculated interval. More precisely, if the sampling procedure were repeated many times, about 95% of the intervals constructed this way would contain the true parameter.
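A small helper for the z-interval (\( \sigma \) known), using `statistics.NormalDist` to recover \( z^* \) from the confidence level; the summary statistics passed in match Example 3 later in this guide:

```python
from statistics import NormalDist

def mean_ci(xbar, sigma, n, level=0.95):
    """Confidence interval for a population mean with sigma known."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. 1.96 for 95%
    margin = z_star * sigma / n ** 0.5
    return xbar - margin, xbar + margin

lo, hi = mean_ci(xbar=75, sigma=12, n=36)
print(round(lo, 2), round(hi, 2))  # (71.08, 78.92)
```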

Hypothesis Testing

Hypothesis testing is a statistical method for making decisions about population parameters based on sample data.

Steps in Hypothesis Testing

  1. State the Hypotheses
    • Null Hypothesis (\( H_0 \)): The claim to be tested (usually "no effect" or "no difference")
    • Alternative Hypothesis (\( H_a \) or \( H_1 \)): What we hope to support
  2. Choose Significance Level (\( \alpha \))
    • Commonly \( \alpha = 0.05 \) (5%)
    • Represents the probability of Type I error
  3. Calculate Test Statistic
    • z-test, t-test, chi-square test, etc.
  4. Determine p-value
    • Probability of obtaining results at least as extreme as observed, assuming \( H_0 \) is true
  5. Make Decision
    • If p-value ≤ \( \alpha \): Reject \( H_0 \)
    • If p-value > \( \alpha \): Fail to reject \( H_0 \)

Common Hypothesis Tests

1. One-Sample z-Test for Mean

When to use: Testing a claim about a population mean when \( \sigma \) is known and \( n \) is large

Test Statistic:

\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \]

Hypotheses:

  • Two-tailed: \( H_0: \mu = \mu_0 \) vs \( H_a: \mu \neq \mu_0 \)
  • Right-tailed: \( H_0: \mu = \mu_0 \) vs \( H_a: \mu > \mu_0 \)
  • Left-tailed: \( H_0: \mu = \mu_0 \) vs \( H_a: \mu < \mu_0 \)

2. One-Sample t-Test for Mean

When to use: Testing a claim about a population mean when \( \sigma \) is unknown

Test Statistic:

\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \]

Follows a t-distribution with \( n-1 \) degrees of freedom

3. Two-Sample t-Test

When to use: Comparing means of two independent groups

Test Statistic:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]

4. Paired t-Test

When to use: Comparing two related samples (before/after, matched pairs)

Test Statistic:

\[ t = \frac{\bar{d} - 0}{s_d/\sqrt{n}} \]

Where \( \bar{d} \) is the mean of differences and \( s_d \) is the standard deviation of differences
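The one-sample and paired test statistics translate directly to code, since a paired test is just a one-sample test on the within-pair differences. The before/after scores below are made-up illustrative data (computing the p-value would additionally require the t-distribution CDF, e.g. from SciPy):

```python
import statistics as st

def one_sample_t(data, mu0):
    """t = (xbar - mu0) / (s / sqrt(n)), with df = n - 1."""
    n = len(data)
    t = (st.mean(data) - mu0) / (st.stdev(data) / n ** 0.5)
    return t, n - 1

def paired_t(before, after):
    """Paired t-test statistic: one-sample t on the differences vs. 0."""
    diffs = [a - b for b, a in zip(before, after)]
    return one_sample_t(diffs, 0)

# Hypothetical before/after scores for five subjects.
before = [70, 68, 75, 80, 72]
after = [74, 71, 76, 85, 74]
t, df = paired_t(before, after)
print(round(t, 3), df)  # 4.243 4
```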

Type I and Type II Errors

| Reality | Decision: Reject \( H_0 \) | Decision: Fail to Reject \( H_0 \) |
|---|---|---|
| \( H_0 \) is true | Type I error (false positive), probability \( = \alpha \) | Correct decision |
| \( H_0 \) is false | Correct decision, power \( = 1 - \beta \) | Type II error (false negative), probability \( = \beta \) |

Statistics for the Behavioral Sciences

Statistics plays a crucial role in behavioral sciences, including psychology, sociology, education, and related fields.

Key Applications

Experimental Design

  • Randomized controlled trials
  • Between-subjects designs
  • Within-subjects designs
  • Factorial designs

Correlation and Regression

  • Pearson correlation coefficient
  • Simple linear regression
  • Multiple regression
  • Prediction models

Analysis of Variance (ANOVA)

  • One-way ANOVA
  • Two-way ANOVA
  • Repeated measures ANOVA
  • Post-hoc tests

Non-Parametric Tests

  • Chi-square test
  • Mann-Whitney U test
  • Wilcoxon signed-rank test
  • Kruskal-Wallis test

Correlation

Pearson Correlation Coefficient (r):

Measures the strength and direction of linear relationship between two variables

\[ r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \]

Properties:

  • \( -1 \leq r \leq 1 \)
  • \( r = 1 \): Perfect positive linear relationship
  • \( r = -1 \): Perfect negative linear relationship
  • \( r = 0 \): No linear relationship

Coefficient of Determination:

\[ r^2 = \text{proportion of variance explained} \]
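Pearson's \( r \) can be computed directly from the deviation-product formula above; the study-hours data below are invented for illustration:

```python
import statistics as st

def pearson_r(xs, ys):
    """Pearson correlation via the deviation-product formula."""
    xbar, ybar = st.mean(xs), st.mean(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = (sum((x - xbar) ** 2 for x in xs) ** 0.5
           * sum((y - ybar) ** 2 for y in ys) ** 0.5)
    return num / den

# Hypothetical study-hours vs. exam-score data.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]

r = pearson_r(hours, scores)
print(round(r, 3), round(r ** 2, 3))  # r and coefficient of determination
```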

Simple Linear Regression

Regression Equation:

\[ \hat{y} = a + bx \]

Slope (b):

\[ b = r \frac{s_y}{s_x} = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} \]

Intercept (a):

\[ a = \bar{y} - b\bar{x} \]

Interpretation:

  • \( b \): Change in y for a one-unit change in x
  • \( a \): Predicted value of y when x = 0
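The slope and intercept formulas translate directly to code; the data below are hypothetical:

```python
import statistics as st

def least_squares(xs, ys):
    """Slope b and intercept a for the fitted line y-hat = a + b x."""
    xbar, ybar = st.mean(xs), st.mean(ys)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar  # the fitted line passes through (xbar, ybar)
    return a, b

xs = [1, 2, 3, 4, 5]                 # hypothetical predictor values
ys = [2.1, 4.2, 5.9, 8.1, 9.8]       # hypothetical responses

a, b = least_squares(xs, ys)
print(round(a, 3), round(b, 3))
print(round(a + b * 6, 2))           # predicted y-hat at x = 6
```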

Worked Examples

Example 1: Probability Calculation

Problem: A bag contains 5 red balls and 3 blue balls. Two balls are drawn without replacement. What is the probability both are red?

Solution:

\[ P(\text{both red}) = P(\text{1st red}) \times P(\text{2nd red | 1st red}) \]

\[ = \frac{5}{8} \times \frac{4}{7} = \frac{20}{56} = \frac{5}{14} \approx 0.357 \]

Example 2: Binomial Probability

Problem: A fair coin is flipped 10 times. What is the probability of getting exactly 6 heads?

Solution:

\( n = 10, k = 6, p = 0.5 \)

\[ P(X = 6) = \binom{10}{6} (0.5)^6(0.5)^4 = \frac{10!}{6!4!} (0.5)^{10} \]

\[ = 210 \times 0.0009766 \approx 0.205 \]

Example 3: Confidence Interval

Problem: A sample of 36 students has a mean test score of 75 with a standard deviation of 12. Calculate a 95% confidence interval for the population mean.

Solution:

\( \bar{x} = 75, s = 12, n = 36, z^* = 1.96 \)

\[ CI = 75 \pm 1.96 \times \frac{12}{\sqrt{36}} = 75 \pm 1.96 \times 2 \]

\[ = 75 \pm 3.92 = (71.08, 78.92) \]

Example 4: Hypothesis Test

Problem: A manufacturer claims the mean lifetime of batteries is 500 hours. A sample of 25 batteries has a mean of 485 hours with a standard deviation of 40 hours. Test at \( \alpha = 0.05 \).

Solution:

\( H_0: \mu = 500 \) vs \( H_a: \mu \neq 500 \)

\[ t = \frac{485 - 500}{40/\sqrt{25}} = \frac{-15}{8} = -1.875 \]

Critical value at \( \alpha = 0.05 \), df = 24: \( \pm 2.064 \)

Since \( |-1.875| < 2.064 \), we fail to reject \( H_0 \). Insufficient evidence to reject the manufacturer's claim.

Summary of Key Statistical Concepts

| Concept | Key Formula/Idea | Application |
|---|---|---|
| Mean | \( \bar{x} = \frac{\sum x_i}{n} \) | Measure of central tendency |
| Standard deviation | \( s = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n-1}} \) | Measure of variability |
| z-score | \( z = \frac{x - \mu}{\sigma} \) | Standardization |
| Probability | \( 0 \leq P(A) \leq 1 \) | Quantify uncertainty |
| Normal distribution | \( X \sim N(\mu, \sigma^2) \) | Most common distribution |
| Confidence interval | \( \bar{x} \pm z^* \frac{\sigma}{\sqrt{n}} \) | Estimate a population parameter |
| t-test | \( t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \) | Test hypotheses about means |
| Correlation | \( -1 \leq r \leq 1 \) | Measure relationship strength |

Study Tips for Mastering Statistics

  • Understand the Concepts: Focus on understanding what formulas mean, not just memorizing them
  • Practice Regularly: Work through many problems to develop intuition
  • Visualize the Data: Use graphs and plots to understand distributions and relationships
  • Know When to Use Tests: Understand the conditions and assumptions for each statistical test
  • Check Assumptions: Always verify assumptions (normality, independence, etc.) before applying tests
  • Interpret Results: Focus on what statistical results mean in practical contexts
  • Use Software: Familiarize yourself with statistical software (R, SPSS, Python, Excel)
  • Connect to Real Life: Relate statistical concepts to real-world applications

About the Author

Adam


Co-Founder @RevisionTown

info@revisiontown.com

Math expert in various curricula including IB, AP, GCSE, IGCSE, and more
