Bivariate Statistics
Complete Notes & Formulae for Twelfth Grade (Precalculus)
1. Scatter Plots
Definition:
A scatter plot is a graph that shows the relationship between two quantitative variables
• Independent variable (x): Plotted on horizontal axis
• Dependent variable (y): Plotted on vertical axis
• Each point represents one observation (x, y)
Outliers in Scatter Plots:
An outlier is a point that lies far away from the general pattern of the data
• Outliers can significantly affect correlation and regression
• Always check for outliers before analysis
• Outliers may indicate measurement errors or special cases
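To see how much a single outlier can matter, the sketch below computes the correlation coefficient r (defined in Section 2) with and without one extreme point. The data values and the helper name `correlation` are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r (deviation form)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied (x) and test score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 57, 58, 63, 65]

# Same data plus one made-up outlier far from the pattern
hours_with_outlier = hours + [10]
scores_with_outlier = scores + [20]

r_without = correlation(hours, scores)
r_with = correlation(hours_with_outlier, scores_with_outlier)

print(f"r without outlier: {r_without:.3f}")
print(f"r with outlier:    {r_with:.3f}")
```

With this made-up data, one extra point is enough to turn a strong positive correlation into a negative one.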
2. Correlation Coefficient (r)
Definition:
The correlation coefficient (r) measures the strength and direction of a linear relationship between two variables
\[ r = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2 \sum(y - \bar{y})^2}} \]
\[ \text{or} \quad r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]
Properties of r:
• Range: \( -1 \leq r \leq 1 \)
• \( r = 1 \): Perfect positive linear correlation
• \( r = -1 \): Perfect negative linear correlation
• \( r = 0 \): No linear correlation
• \( 0 < r < 1 \): Positive correlation (y tends to increase as x increases)
• \( -1 < r < 0 \): Negative correlation (y tends to decrease as x increases)
Strength Interpretation:
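| Value of \( \lvert r \rvert \) | Strength |
|---|---|
| 0.0 to under 0.3 | Weak |
| 0.3 to under 0.7 | Moderate |
| 0.7 to 1.0 | Strong |
These cutoffs are a common rule of thumb; textbooks differ on the exact boundaries.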
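As a check on the two formulas for r, the sketch below evaluates both on the same small data set and labels the result using the strength table above. The data and the function names are hypothetical, and assigning the boundary values 0.3 and 0.7 to the higher category is an assumption.

```python
import math

def r_deviation_form(xs, ys):
    """r from deviations about the means (first formula)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_sums_form(xs, ys):
    """r from raw sums (second, computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

def strength(r):
    """Label |r| with the table's categories (boundaries go to the higher one)."""
    size = abs(r)
    if size < 0.3:
        return "Weak"
    if size < 0.7:
        return "Moderate"
    return "Strong"

# Hypothetical data: hours studied (x) and test score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 57, 58, 63, 65]

r1 = r_deviation_form(hours, scores)
r2 = r_sums_form(hours, scores)
print(f"deviation form: {r1:.4f}")
print(f"sums form:      {r2:.4f}")
print(f"strength:       {strength(r1)}")
```

Both forms give the same value up to rounding; the sums form is usually the easier one to work by hand from a table of x, y, xy, x² and y².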
3. Linear Regression Line
Equation:
The regression line (line of best fit) has the equation:
\[ \hat{y} = a + bx \]
where:
• \( \hat{y} \) = predicted value of y
• \( a \) = y-intercept (predicted value of y when x = 0)
• \( b \) = slope (change in y per unit change in x)
• \( x \) = independent variable
Finding Slope and Intercept:
Slope:
\[ b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} \]
Or using correlation:
\[ b = r \frac{s_y}{s_x} \]
Y-Intercept:
\[ a = \bar{y} - b\bar{x} \]
where:
• \( s_x, s_y \) = standard deviations of x and y
• \( \bar{x}, \bar{y} \) = means of x and y
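The slope and intercept formulas can be worked through in a few lines. This is a minimal sketch on made-up data (hours studied and test scores); it also checks that the sums formula and \( b = r \frac{s_y}{s_x} \) give the same slope.

```python
import math
import statistics

# Hypothetical data: hours studied (x) and test score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 57, 58, 63, 65]

n = len(hours)
sum_x, sum_y = sum(hours), sum(scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in scores)

# Slope from the sums formula
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = y-bar - b * x-bar
a = statistics.mean(scores) - b * statistics.mean(hours)

# Cross-check: slope from the correlation coefficient.
# Sample standard deviations are used here; the ratio s_y / s_x is the
# same as long as both use the same divisor (n - 1 or n).
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
b_from_r = r * statistics.stdev(scores) / statistics.stdev(hours)

print(f"y-hat = {a:.1f} + {b:.1f}x")
print(f"slope via r: {b_from_r:.1f}")
```

For this made-up data the line works out to \( \hat{y} = 49.4 + 3.2x \).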
4. Interpreting Regression Lines
Slope Interpretation:
The slope tells us how much the predicted value of y changes for each 1-unit increase in x
Example:
If \( \hat{y} = 50 + 3x \) where y = test score, x = study hours
Interpretation: For each additional hour of study, the predicted test score increases by 3 points
Y-intercept: With no studying (x = 0), the predicted score is 50
Making Predictions:
Substitute the x-value into the regression equation to predict y
⚠️ Caution: Only predict within the range of x-values in your data (interpolation)
Predicting outside the data range (extrapolation) can be unreliable
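The caution about extrapolation can be built into a prediction helper. The sketch below uses the example line \( \hat{y} = 50 + 3x \); the function name `predict` and the observed range of 1 to 8 hours are assumptions made for illustration.

```python
def predict(a, b, x, x_min, x_max):
    """Return the predicted y and whether x lies inside the observed range."""
    y_hat = a + b * x
    is_interpolation = x_min <= x <= x_max
    return y_hat, is_interpolation

# Regression line from the example: y-hat = 50 + 3x
A, B = 50, 3
# Assumption: the study-hours data ranged from 1 to 8 hours
X_MIN, X_MAX = 1, 8

for study_hours in (4, 20):
    score, inside = predict(A, B, study_hours, X_MIN, X_MAX)
    label = "interpolation" if inside else "extrapolation (treat with caution)"
    print(f"x = {study_hours}: predicted score {score} ({label})")
```

The prediction of 110 for 20 hours shows the problem: if the test is scored out of 100 (an assumption), that value is impossible, even though the arithmetic is correct.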
5. Coefficient of Determination (r²)
Definition:
r² measures the proportion of the variance in y that is explained by the linear relationship with x
\[ r^2 = (\text{correlation coefficient})^2 \]
• Range: \( 0 \leq r^2 \leq 1 \)
• \( r^2 = 0.75 \) means 75% of variation in y is explained by x
• Higher r² indicates better fit of regression line to data
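The "proportion of variation explained" reading can be verified numerically. For a least-squares line, r² equals 1 − SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares of y about its mean; that identity is a standard result rather than something stated in these notes. The data and names below are hypothetical.

```python
import math

# Hypothetical data: hours studied (x) and test score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 57, 58, 63, 65]

x_bar = sum(hours) / len(hours)
y_bar = sum(scores) / len(scores)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
sxx = sum((x - x_bar) ** 2 for x in hours)
syy = sum((y - y_bar) ** 2 for y in scores)

# Least-squares line and correlation coefficient
b = sxy / sxx
a = y_bar - b * x_bar
r = sxy / math.sqrt(sxx * syy)

# Unexplained variation (SSE) versus total variation (SST)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, scores))
sst = syy
r_squared_from_fit = 1 - sse / sst

print(f"r squared:     {r ** 2:.4f}")
print(f"1 - SSE / SST: {r_squared_from_fit:.4f}")
```

The two printed values agree, which is why r² can be read as the fraction of the variation in y accounted for by the line.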
6. Residuals
Definition:
A residual is the difference between observed and predicted values
\[ \text{Residual} = y - \hat{y} \]
• Positive residual: Actual value is above the regression line
• Negative residual: Actual value is below the regression line
• For a least-squares line with an intercept, the residuals always sum to zero
Residual Plots:
Plot residuals vs. x-values to check if linear model is appropriate
• Random scatter: Linear model is appropriate
• Pattern present: Linear model may not be appropriate
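A short sketch of computing residuals and confirming that they sum to zero for a least-squares line. The data are made up, and `fit_line` is a hypothetical helper that applies the slope and intercept formulas from Section 3.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b x."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data: hours studied (x) and test score (y)
hours = [1, 2, 3, 4, 5]
scores = [52, 57, 58, 63, 65]

a, b = fit_line(hours, scores)

# Residual = observed y minus predicted y
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]

for x, res in zip(hours, residuals):
    side = "above" if res > 0 else "below"
    print(f"x = {x}: residual {res:+.1f} (point is {side} the line)")

print(f"sum of residuals: {sum(residuals):.6f}")
```

Plotting these residuals against the x-values gives the residual plot; with only five made-up points there is too little data to judge a pattern, so the plot check matters more on real data sets.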
7. Exponential Regression
Model:
When data shows exponential growth or decay, use exponential regression
\[ y = ab^x \]
where:
• \( a \) = initial value (when x = 0)
• \( b \) = growth/decay factor
• If \( b > 1 \): exponential growth
• If \( 0 < b < 1 \): exponential decay
Finding the Model:
Method: Transform the data using logarithms (every y-value must be positive), then perform linear regression on the points (x, ln y)
Step 1: Take ln of both sides
\[ \ln(y) = \ln(a) + x \cdot \ln(b) \]
Step 2: This is linear in form: Y = C + mX
where Y = ln(y), X = x, C = ln(a), m = ln(b)
Step 3: Find a and b
\[ a = e^C, \quad b = e^m \]
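The three steps above can be sketched as follows. The data are made up to follow \( y = 3 \cdot 2^x \) exactly so the recovered values are easy to check; `fit_line` and `fit_exponential` are hypothetical helper names.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for a straight line."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def fit_exponential(xs, ys):
    """Fit y = a * b**x by linear regression on (x, ln y)."""
    if any(y <= 0 for y in ys):
        raise ValueError("all y-values must be positive to take logarithms")
    log_ys = [math.log(y) for y in ys]   # Step 1: Y = ln(y)
    c, m = fit_line(xs, log_ys)          # Step 2: Y = C + mX
    return math.exp(c), math.exp(m)      # Step 3: a = e^C, b = e^m

# Hypothetical data: a count that doubles each day, starting at 3
days = [0, 1, 2, 3, 4]
counts = [3, 6, 12, 24, 48]

a, b = fit_exponential(days, counts)
print(f"y = {a:.2f} * {b:.2f}^x")
```

Because b comes out greater than 1, the fitted model describes exponential growth. Note that this method minimizes squared error in ln(y), not in y itself, so a calculator's exponential regression may report slightly different values on noisy data if it uses a different method.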
8. Choosing the Right Model
Model Selection:
| Pattern in Scatter Plot | Model to Use |
|---|---|
| Straight line pattern | Linear: \( y = a + bx \) |
| J-shaped curve (rapid growth) | Exponential: \( y = ab^x \) |
| U-shaped curve | Quadratic: \( y = ax^2 + bx + c \) |
| No clear pattern | None (no apparent relationship to model) |
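When the scatter plot alone does not settle the choice between a linear and an exponential model, one possible numerical check is to fit both and compare the sum of squared errors (SSE) on the original y-values. This is a sketch of that idea only, not a rule from these notes; the data and helper names are hypothetical, and the scatter plot and residual plots remain the main guide.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for a straight line."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sse(ys, predictions):
    """Sum of squared differences between observed and predicted y."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions))

# Hypothetical data with a J-shaped pattern
xs = [0, 1, 2, 3, 4]
ys = [3, 6, 12, 24, 48]

# Linear model: y-hat = a + b x
a_lin, b_lin = fit_line(xs, ys)
linear_sse = sse(ys, [a_lin + b_lin * x for x in xs])

# Exponential model: y-hat = a * b**x, fitted on (x, ln y)
c, m = fit_line(xs, [math.log(y) for y in ys])
a_exp, b_exp = math.exp(c), math.exp(m)
exponential_sse = sse(ys, [a_exp * b_exp ** x for x in xs])

best = "exponential" if exponential_sse < linear_sse else "linear"
print(f"linear SSE:      {linear_sse:.2f}")
print(f"exponential SSE: {exponential_sse:.2f}")
print(f"smaller SSE:     {best}")
```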
9. Quick Reference Summary
Key Formulas:
Correlation: \( r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \)
Linear Regression: \( \hat{y} = a + bx \)
Slope: \( b = r\frac{s_y}{s_x} \)
Intercept: \( a = \bar{y} - b\bar{x} \)
Residual: \( y - \hat{y} \)
Exponential: \( y = ab^x \)
📚 Study Tips
✓ Correlation measures strength and direction of linear relationship
✓ Correlation does NOT imply causation
✓ Regression line always passes through (\(\bar{x}\), \(\bar{y}\))
✓ Check residual plots to verify linear model is appropriate
✓ Use exponential regression when data shows rapid growth or decay
