Comprehensive Statistics Guide
Master statistics with examples, methods, and practice problems
1. Introduction to Statistics
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied.
Types of Statistics
Statistics is generally divided into two main categories:
- Descriptive Statistics: Summarizes data using measures such as the mean, median, mode, and variance.
- Inferential Statistics: Uses data from samples to make inferences about larger populations.
Important Terminology
- Population: The entire group that is the subject of study.
- Sample: A subset of the population.
- Variable: A characteristic or attribute that can be measured.
- Data: The actual values of variables.
2. Descriptive Statistics
Descriptive statistics involves methods of organizing, summarizing, and presenting data in a convenient and informative way using tables, graphs, and summary measures.
Measures of Central Tendency
Mean
The arithmetic average of a data set.
Example: Find the mean of 4, 7, 2, 9, 3
Solution: Mean = (4 + 7 + 2 + 9 + 3) / 5 = 25 / 5 = 5
Median
The middle value of a data set when arranged in order.
Example: Find the median of 4, 7, 2, 9, 3
Solution:
- Arrange in ascending order: 2, 3, 4, 7, 9
- Find the middle value: Median = 4
Example 2: Find the median of 4, 7, 2, 9, 3, 6
Solution:
- Arrange in ascending order: 2, 3, 4, 6, 7, 9
- For even number of values, average the two middle values: Median = (4 + 6) / 2 = 5
Mode
The most frequently occurring value in a data set.
Example: Find the mode of 2, 3, 4, 4, 5, 5, 5, 6, 7
Solution: The value 5 appears three times, which is more than any other value, so mode = 5.
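The three worked examples above can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative and not part of the guide.

```python
import statistics

data = [4, 7, 2, 9, 3]

# Mean: sum of the values divided by how many there are.
mean_value = statistics.mean(data)

# Median with an odd count: the single middle value of the sorted data.
median_odd = statistics.median(data)

# Median with an even count: the average of the two middle values.
median_even = statistics.median(data + [6])

# Mode: the most frequently occurring value.
mode_value = statistics.mode([2, 3, 4, 4, 5, 5, 5, 6, 7])

print(mean_value, median_odd, median_even, mode_value)
```

`statistics.median` sorts the data internally, so the input does not need to be ordered first.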
Measures of Dispersion
Range
The difference between the maximum and minimum values in a data set.
Example: Find the range of 4, 7, 2, 9, 3
Solution: Range = 9 - 2 = 7
Variance
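The average of the squared differences from the mean. For a sample, the sum of squared differences is divided by n - 1 rather than n, as in the example below.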
Example: Find the variance of 4, 7, 2, 9, 3
Solution:
- Find the mean: (4 + 7 + 2 + 9 + 3) / 5 = 5
- Find the squared differences from the mean:
- (4 - 5)² = (-1)² = 1
- (7 - 5)² = (2)² = 4
- (2 - 5)² = (-3)² = 9
- (9 - 5)² = (4)² = 16
- (3 - 5)² = (-2)² = 4
- Sum the squared differences: 1 + 4 + 9 + 16 + 4 = 34
- Divide by (n-1) for sample variance: 34 / 4 = 8.5
Standard Deviation
The square root of the variance. It measures the typical distance of the values from the mean, in the same units as the data.
Example: Find the standard deviation of 4, 7, 2, 9, 3
Solution:
- From our previous calculation, variance = 8.5
- Standard deviation = √8.5 ≈ 2.92
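As a quick check of the dispersion examples, the sketch below uses the standard `statistics` module. Note that `variance` and `stdev` divide by n - 1 (sample versions), while `pvariance` divides by n (population version). The variable names are illustrative.

```python
import statistics

data = [4, 7, 2, 9, 3]

# Range: maximum minus minimum.
data_range = max(data) - min(data)

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = statistics.variance(data)

# Population variance: the same sum divided by n instead.
population_variance = statistics.pvariance(data)

# Sample standard deviation: square root of the sample variance.
sample_std = statistics.stdev(data)

print(data_range, sample_variance, population_variance, round(sample_std, 2))
```

The gap between 8.5 (sample) and 6.8 (population) shows why it matters to state which divisor is being used.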
3. Probability Distributions
A probability distribution is a function that describes the likelihood of obtaining the possible values of a random variable.
Discrete Probability Distributions
Binomial Distribution
Describes the number of successes in a fixed number of independent Bernoulli trials with the same probability of success.
P(X = k) = (n choose k) × p^k × (1 - p)^(n - k)
Where:
- n = number of trials
- k = number of successes
- p = probability of success in a single trial
Example: A fair coin is tossed 5 times. What is the probability of getting exactly 3 heads?
Solution:
- n = 5, k = 3, p = 0.5
- P(X = 3) = (5 choose 3) × (0.5)³ × (0.5)² = 10 × 0.125 × 0.25 = 0.3125
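The binomial probability can be computed directly from its definition. The helper name `binomial_pmf` below is an illustrative choice, not something defined in this guide; `math.comb` supplies the "n choose k" term.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 3 heads in 5 tosses of a fair coin.
p_three_heads = binomial_pmf(3, 5, 0.5)
print(p_three_heads)

# Sanity check: the probabilities over all possible k sum to 1.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
print(total)
```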
Poisson Distribution
Describes the number of events occurring in a fixed interval of time or space, assuming events occur independently.
P(X = k) = (e^(-λ) × λ^k) / k!
Where:
- λ (lambda) = average number of events in the interval
- k = number of events
- e = Euler's number (≈ 2.71828)
Example: The average number of calls received by a call center in a 10-minute period is 5. What is the probability of receiving exactly 3 calls in a 10-minute period?
Solution:
- λ = 5, k = 3
- P(X = 3) = (e⁻⁵ × 5³) / 3! = (0.00674 × 125) / 6 ≈ 0.1404 or about 14.0%
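The call-center example can be reproduced with a few lines of Python. The function name `poisson_pmf` is an assumption made for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average count is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Exactly 3 calls in a 10-minute period that averages 5 calls.
p_three_calls = poisson_pmf(3, 5)
print(round(p_three_calls, 4))
```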
Continuous Probability Distributions
Normal Distribution
A continuous probability distribution that is symmetric about the mean, showing that data near the mean are more frequent than data far from the mean.
f(x) = (1 / (σ√(2π))) × e^(-(x - μ)² / (2σ²))
Where:
- μ = mean
- σ = standard deviation
- π = pi (≈ 3.14159)
- e = Euler's number (≈ 2.71828)
Example: IQ scores are normally distributed with a mean of 100 and a standard deviation of 15. What is the probability that a randomly selected person has an IQ between 85 and 115?
Solution:
- We need to find P(85 ≤ X ≤ 115)
- Standardize: z₁ = (85 - 100) / 15 = -1, z₂ = (115 - 100) / 15 = 1
- From the standard normal table, P(-1 ≤ Z ≤ 1) = 0.6827 or about 68.27%
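Instead of a printed table, the same probability can be read from the cumulative distribution function. This sketch uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the variable names are illustrative.

```python
from statistics import NormalDist

# IQ scores: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

# P(85 <= X <= 115) is the difference of two cumulative probabilities.
p_between = iq.cdf(115) - iq.cdf(85)

# The same answer after standardizing to z-scores of -1 and 1.
standard = NormalDist()
p_standardized = standard.cdf(1) - standard.cdf(-1)

print(round(p_between, 4), round(p_standardized, 4))
```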
Exponential Distribution
Describes the time between events in a Poisson process.
f(x) = λe^(-λx) for x ≥ 0, which gives P(X > x) = e^(-λx)
Where:
- λ = rate parameter
- x = random variable (often time)
Example: The average lifetime of a certain electronic component is 1000 hours. Assuming an exponential distribution, what is the probability that a component will last more than 1500 hours?
Solution:
- λ = 1/1000 = 0.001
- P(X > 1500) = e^(-λx) = e^(-0.001 × 1500) = e^(-1.5) ≈ 0.223 or about 22.3%
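A short sketch of the component-lifetime calculation follows. The helper name `exponential_survival` is hypothetical; it evaluates P(X > x) = e^(-λx) for a given rate.

```python
import math

def exponential_survival(x, rate):
    """P(X > x) for an exponential distribution with the given rate."""
    return math.exp(-rate * x)

mean_lifetime = 1000          # hours
rate = 1 / mean_lifetime      # the rate is the reciprocal of the mean

p_over_1500 = exponential_survival(1500, rate)
print(round(p_over_1500, 3))
```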
4. Hypothesis Testing
Hypothesis testing is a method of statistical inference used to decide whether experimental data support a particular claim about a population parameter.
Steps in Hypothesis Testing
- State the hypotheses: Null hypothesis (H₀) and Alternative hypothesis (H₁ or Ha)
- Choose the significance level (α): Common values are 0.01, 0.05, and 0.10
- Calculate the test statistic
- Find the p-value or critical value
- Make a decision: Reject H₀ if p-value < α
- State the conclusion in context
Z-Test (For Known Population Standard Deviation)
Z = (x̄ - μ) / (σ / √n)
Where:
- x̄ = sample mean
- μ = population mean (hypothesized value)
- σ = population standard deviation
- n = sample size
Example: A manufacturer claims that the mean lifetime of its lightbulbs is at least 1000 hours. A random sample of 36 bulbs has a mean lifetime of 950 hours. The population standard deviation is known to be 120 hours. Test the claim at α = 0.05.
Solution:
- H₀: μ ≥ 1000 hours
- H₁: μ < 1000 hours (This is a left-tailed test)
- α = 0.05
- Z = (950 - 1000) / (120 / √36) = -50 / 20 = -2.5
- For a left-tailed test with α = 0.05, the critical value is -1.645
- Since -2.5 < -1.645, we reject H₀
- Conclusion: There is sufficient evidence to conclude that the mean lifetime of the lightbulbs is less than 1000 hours.
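The lightbulb test can be sketched in Python as below. The function name `z_test_left_tailed` and its return layout are assumptions for illustration; the critical value and p-value come from the standard normal distribution in `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def z_test_left_tailed(sample_mean, mu0, sigma, n, alpha=0.05):
    """Left-tailed z-test for a mean with known population sigma."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_value = NormalDist().cdf(z)           # area to the left of z
    critical = NormalDist().inv_cdf(alpha)  # left-tail cutoff
    return z, p_value, critical, p_value < alpha

z, p_value, critical, reject_h0 = z_test_left_tailed(950, 1000, 120, 36)
print(z, round(p_value, 4), round(critical, 3), reject_h0)
```

Comparing the p-value with α and comparing z with the critical value are equivalent decision rules, so both lead to the same conclusion.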
T-Test (For Unknown Population Standard Deviation)
t = (x̄ - μ) / (s / √n)
Where:
- x̄ = sample mean
- μ = population mean (hypothesized value)
- s = sample standard deviation
- n = sample size
Example: A researcher claims that college students sleep an average of 7 hours per night. A random sample of 25 students has a mean of 6.5 hours with a standard deviation of 1.2 hours. Test the claim at α = 0.05.
Solution:
- H₀: μ = 7 hours
- H₁: μ ≠ 7 hours (This is a two-tailed test)
- α = 0.05
- t = (6.5 - 7) / (1.2 / √25) = -0.5 / 0.24 = -2.08
- For a two-tailed test with α = 0.05 and 24 degrees of freedom, the critical values are ±2.064
- Since -2.08 < -2.064, we reject H₀
- Conclusion: There is sufficient evidence to conclude that college students do not sleep an average of 7 hours per night.
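The t statistic for the sleep example can be computed as follows. Python's standard library has no t distribution, so this sketch takes the critical value 2.064 from the example above rather than computing it; the function name `one_sample_t` is illustrative.

```python
import math

def one_sample_t(sample_mean, mu0, s, n):
    """t statistic for a one-sample test of the mean."""
    return (sample_mean - mu0) / (s / math.sqrt(n))

t_stat = one_sample_t(6.5, 7, 1.2, 25)

# Table value for a two-tailed test, alpha = 0.05, df = 24 (from the text).
t_critical = 2.064

# Two-tailed rule: reject when |t| exceeds the critical value.
reject_h0 = abs(t_stat) > t_critical
print(round(t_stat, 2), reject_h0)
```

The statistic falls only just past the critical value, so the rejection here is a close call.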
Chi-Square Test (For Association)
Used to determine if there is a significant association between two categorical variables.
χ² = ∑ (O - E)² / E
Where:
- O = observed frequency
- E = expected frequency
Example: A survey asks males and females whether they prefer coffee, tea, or neither. The results are shown in the table:
Gender | Coffee | Tea | Neither | Total |
---|---|---|---|---|
Male | 40 | 30 | 30 | 100 |
Female | 35 | 50 | 15 | 100 |
Total | 75 | 80 | 45 | 200 |
Is there a significant association between gender and beverage preference at α = 0.05?
Solution: (simplified)
- H₀: There is no association between gender and beverage preference
- H₁: There is an association between gender and beverage preference
- Calculate expected frequencies as E = (row total × column total) / grand total: 37.5 (coffee), 40 (tea), and 22.5 (neither) for each gender
- χ² = ∑ (O - E)² / E = 2 × (0.167 + 2.5 + 2.5) ≈ 10.33, degrees of freedom = (2-1)(3-1) = 2
- For α = 0.05 and df = 2, the critical value is 5.991
- Since 10.33 > 5.991, we reject H₀
- Conclusion: There is sufficient evidence to conclude that there is an association between gender and beverage preference.
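The expected frequencies and the test statistic can be computed directly from the observed counts in the table. The sketch below is one way to do it in plain Python; the function name `chi_square_independence` is hypothetical. The p-value line relies on the fact that, for exactly 2 degrees of freedom, the chi-square upper-tail probability equals e^(-χ²/2); it does not generalize to other degrees of freedom.

```python
import math

# Observed counts: rows are Male, Female; columns are Coffee, Tea, Neither.
observed = [[40, 30, 30],
            [35, 50, 15]]

def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count = row total * column total / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected

chi2, df, expected = chi_square_independence(observed)

# Closed-form upper-tail probability, valid only because df == 2.
p_value = math.exp(-chi2 / 2)

print(expected)
print(round(chi2, 2), df, round(p_value, 4))
```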
5. Regression Analysis
Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables.
Simple Linear Regression
Models the relationship between two variables by fitting a linear equation to the observed data.
ŷ = b₀ + b₁x
Where:
- ŷ = predicted value of the dependent variable
- x = independent variable
- b₀ = y-intercept (constant term)
- b₁ = slope (regression coefficient)
The coefficients are calculated as:
b₁ = ∑(x - x̄)(y - ȳ) / ∑(x - x̄)²
b₀ = ȳ - b₁x̄
Example: The following data represents the hours studied (x) and the exam scores (y) for 5 students:
Hours Studied (x) | Exam Score (y) |
---|---|
1 | 65 |
2 | 70 |
3 | 80 |
4 | 85 |
5 | 90 |
Find the regression equation and predict the exam score for a student who studies 3.5 hours.
Solution:
- Calculate means: x̄ = 3, ȳ = 78
- Calculate the numerator and denominator for b₁:
x | y | (x - x̄) | (y - ȳ) | (x - x̄)(y - ȳ) | (x - x̄)² |
---|---|---|---|---|---|
1 | 65 | -2 | -13 | 26 | 4 |
2 | 70 | -1 | -8 | 8 | 1 |
3 | 80 | 0 | 2 | 0 | 0 |
4 | 85 | 1 | 7 | 7 | 1 |
5 | 90 | 2 | 12 | 24 | 4 |
∑ | | | | 65 | 10 |
- Calculate b₁ = 65 / 10 = 6.5
- Calculate b₀ = 78 - (6.5 × 3) = 78 - 19.5 = 58.5
- Regression equation: ŷ = 58.5 + 6.5x
- For x = 3.5: ŷ = 58.5 + 6.5(3.5) = 58.5 + 22.75 = 81.25
Prediction: A student who studies 3.5 hours is predicted to score 81.25 on the exam.
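The least-squares calculation above can be sketched in plain Python. The function name `fit_line` is an illustrative choice; it computes the slope as the sum of cross-deviations divided by the sum of squared x-deviations, then solves for the intercept.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

hours = [1, 2, 3, 4, 5]
scores = [65, 70, 80, 85, 90]

b0, b1 = fit_line(hours, scores)
predicted = b0 + b1 * 3.5
print(b0, b1, predicted)
```

Python 3.10 and later also provide `statistics.linear_regression`, which returns the same slope and intercept for this kind of data.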
Coefficient of Determination (R²)
Measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
R² = 1 - SSE / SST
Where:
- SSE = Sum of Squared Errors = ∑(y - ŷ)²
- SST = Total Sum of Squares = ∑(y - ȳ)²
R² ranges from 0 to 1, with 1 indicating perfect prediction and 0 indicating that the model doesn't explain any of the variation.
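As an illustration, R² can be computed for the hours-studied example using the fitted line ŷ = 58.5 + 6.5x from the previous section. The variable names are assumptions made for this sketch.

```python
hours = [1, 2, 3, 4, 5]
scores = [65, 70, 80, 85, 90]

# Coefficients of the fitted line from the worked example.
b0, b1 = 58.5, 6.5

predictions = [b0 + b1 * x for x in hours]
y_bar = sum(scores) / len(scores)

# SSE: squared gaps between observed and predicted scores.
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(scores, predictions))

# SST: squared gaps between observed scores and their mean.
sst = sum((y - y_bar) ** 2 for y in scores)

r_squared = 1 - sse / sst
print(sse, sst, round(r_squared, 4))
```

An R² of roughly 0.98 means the line accounts for about 98% of the variation in these five exam scores.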
Multiple Linear Regression
An extension of simple linear regression that uses multiple independent variables to predict the dependent variable.
ŷ = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ
Where:
- ŷ = predicted value of the dependent variable
- x₁, x₂, ..., xₙ = independent variables
- b₀ = y-intercept
- b₁, b₂, ..., bₙ = regression coefficients
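One common way to estimate these coefficients is to solve the normal equations (XᵀX)b = Xᵀy. The guide does not prescribe this method, so treat the sketch below as one possible approach. The data set is hypothetical: the scores were constructed from ŷ = 50 + 6x₁ + 2x₂ so that the fit can be checked against known coefficients. Both function names are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_multiple_regression(rows, ys):
    """Return [b0, b1, ..., bn] from the normal equations."""
    # A leading column of ones represents the intercept b0.
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data: (hours studied, hours slept) and exam score.
predictors = [(1, 8), (2, 6), (3, 7), (4, 5), (5, 9)]
exam_scores = [72, 74, 82, 84, 98]

coefficients = fit_multiple_regression(predictors, exam_scores)
print([round(c, 6) for c in coefficients])
```

In practice a numerical library would usually handle this step, since the normal equations become unstable when predictors are strongly correlated.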