Bivariate Statistics
Complete Notes & Formulae for Eleventh Grade (Algebra 2)
1. Outliers in Scatter Plots
What is a Scatter Plot?
A graph showing the relationship between two quantitative variables, with each point representing one observation
Variables:
• Explanatory variable (x): Independent variable (horizontal axis)
• Response variable (y): Dependent variable (vertical axis)
Identifying Outliers:
An outlier in a scatter plot is a point that doesn't follow the general pattern of the data
Characteristics:
• Lies far from the overall trend or pattern
• Has an unusually large or small x or y value compared to other points
• Can significantly affect the regression line and correlation coefficient
⚠️ Effect of Outliers:
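• Can pull the regression line toward itself, especially when its x-value is far from \( \bar{x} \)
• Can strengthen or weaken the correlation coefficient, depending on whether it falls along or away from the trend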
• Should be investigated (measurement error? special case?)
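To see how much one point can matter, here is a minimal Python sketch. The data values are made up for illustration, and `pearson_r` is a helper name chosen here; it applies the correlation formula from Section 3.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r, computed from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: five points lying exactly on the line y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pearson_r(xs, ys)

# Same data plus one made-up outlier at (6, 0), far below the trend
r_with_outlier = pearson_r(xs + [6], ys + [0])

print(round(r_clean, 3), round(r_with_outlier, 3))  # 1.0 0.143
```

In this example a single outlier drops r from a perfect 1 to about 0.14, which is why unusual points should be investigated before trusting the statistics.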
2. Match Correlation Coefficients to Scatter Plots
Correlation Coefficient (r):
A measure of the strength and direction of the linear relationship between two variables
\[ -1 \leq r \leq 1 \]
Interpreting Correlation Values (the 0.3 and 0.7 cutoffs are common guidelines, not strict rules):
| Value of r | Direction | Strength | Pattern |
|---|---|---|---|
| r = 1 | Positive | Perfect | All points on line, rising |
| 0.7 ≤ r < 1 | Positive | Strong | Points close to line, rising |
| 0.3 ≤ r < 0.7 | Positive | Moderate | Moderate scatter, rising trend |
| 0 < r < 0.3 | Positive | Weak | Very scattered, slight rise |
| r = 0 | None | None | No linear pattern (a curved pattern is still possible) |
| -0.3 < r < 0 | Negative | Weak | Very scattered, slight decline |
| -0.7 < r ≤ -0.3 | Negative | Moderate | Moderate scatter, declining |
| -1 < r ≤ -0.7 | Negative | Strong | Points close to line, declining |
| r = -1 | Negative | Perfect | All points on line, declining |
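The table can be turned into a small lookup, sketched below in Python. The function name `describe_r` is hypothetical, and assigning the boundary values 0.3 and 0.7 to the stronger category is an assumption made here, since textbooks differ on the exact cutoffs.

```python
def describe_r(r):
    """Return (direction, strength) for a correlation coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    if r == 0:
        return ("none", "none")
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.7:      # assumption: 0.7 itself counts as strong
        strength = "strong"
    elif size >= 0.3:      # assumption: 0.3 itself counts as moderate
        strength = "moderate"
    else:
        strength = "weak"
    return (direction, strength)

print(describe_r(0.85))   # ('positive', 'strong')
print(describe_r(-0.5))   # ('negative', 'moderate')
```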
3. Calculate Correlation Coefficients
Pearson Correlation Coefficient Formula:
\[ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2} \cdot \sqrt{\sum(y_i - \bar{y})^2}} \]
where:
• \( x_i, y_i \) = individual data points
• \( \bar{x}, \bar{y} \) = means of x and y
• \( n \) = number of data points (each sum runs over all \( n \) points)
Alternative Formula (Computational):
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]
Steps to Calculate:
1. Calculate the means \( \bar{x} \) and \( \bar{y} \)
2. Find each deviation from the mean: \( (x_i - \bar{x}) \) and \( (y_i - \bar{y}) \)
3. Calculate the numerator: sum of products of deviations
4. Calculate the denominator: the product of the square roots of the two sums of squared deviations (one for x, one for y)
5. Divide numerator by denominator
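The five steps can be followed line by line in a short Python sketch. The data set is made up for illustration; the last part checks that the computational formula gives the same value.

```python
import math

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Step 1: means
x_bar = sum(xs) / n   # 3.0
y_bar = sum(ys) / n   # 4.0

# Step 2: deviations from the means
dx = [x - x_bar for x in xs]   # -2, -1, 0, 1, 2
dy = [y - y_bar for y in ys]   # -2, 0, 1, 0, 1

# Step 3: numerator = sum of products of deviations (equals 6 here)
numerator = sum(a * b for a, b in zip(dx, dy))

# Step 4: denominator = sqrt(sum dx^2) * sqrt(sum dy^2) = sqrt(10) * sqrt(6)
denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))

# Step 5: divide
r_def = numerator / denominator

# Computational formula, using raw sums only
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
r_comp = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(round(r_def, 4), round(r_comp, 4))  # 0.7746 0.7746
```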
4. Find the Equation of a Regression Line
Least Squares Regression Line (LSRL):
The line of best fit that minimizes the sum of squared residuals
\[ \hat{y} = a + bx \]
where:
• \( \hat{y} \) = predicted value of y
• \( a \) = y-intercept
• \( b \) = slope
• \( x \) = explanatory variable
Formulas for Slope and Intercept:
Slope (b):
\[ b = r \cdot \frac{s_y}{s_x} \]
or
\[ b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} \]
Y-intercept (a):
\[ a = \bar{y} - b\bar{x} \]
where:
• \( r \) = correlation coefficient
• \( s_x \) = standard deviation of x
• \( s_y \) = standard deviation of y
Key Point:
The regression line ALWAYS passes through the point \( (\bar{x}, \bar{y}) \)
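As a check on these formulas, the sketch below computes the slope and intercept for a made-up data set, then confirms that the line passes through \( (\bar{x}, \bar{y}) \) and that both slope formulas agree. The function name `least_squares_line` is chosen here for illustration.

```python
import math
import statistics

def least_squares_line(xs, ys):
    """Return (a, b) for y_hat = a + b*x using the deviation formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(round(a, 3), round(b, 3))  # 2.2 0.6

# Key point: the line passes through (x_bar, y_bar) = (3, 4)
x_bar = statistics.mean(xs)
y_bar = statistics.mean(ys)
passes_through_means = math.isclose(a + b * x_bar, y_bar)

# The other slope formula, b = r * s_y / s_x, gives the same value
s_x = statistics.stdev(xs)   # sample standard deviation of x
s_y = statistics.stdev(ys)   # sample standard deviation of y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
r = sxy / ((len(xs) - 1) * s_x * s_y)
slopes_agree = math.isclose(b, r * s_y / s_x)
```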
5. Interpret Regression Lines
Interpreting the Slope (b):
Template for interpretation:
"For each one-unit increase in [x-variable], the predicted [y-variable] increases/decreases by [slope] units."
Example:
If \( \hat{y} = 20 + 3x \) where x = study hours, y = test score:
"For each additional hour of study, the predicted test score increases by 3 points."
Interpreting the Y-intercept (a):
Template for interpretation:
"When [x-variable] is zero, the predicted [y-variable] is [intercept] units."
⚠️ Warning:
Only interpret the intercept if x = 0 makes sense in context and falls within or near the range of the data; otherwise the intercept is an extrapolation.
Predictions vs. Causation:
✓ Regression CAN:
• Show association between variables
• Make predictions within data range
✗ Regression CANNOT:
• Prove causation (correlation ≠ causation)
• Make accurate predictions far outside data range (extrapolation is risky)
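The study-hours line \( \hat{y} = 20 + 3x \) can illustrate the difference between a prediction and an extrapolation. In the sketch below, the observed range of 1 to 8 study hours is an assumption made for illustration (the notes do not give the data range), and `predict_score` is a hypothetical helper name.

```python
def predict_score(hours, x_min=1, x_max=8):
    """Return (predicted score, is_extrapolation) for y_hat = 20 + 3x.

    x_min and x_max are the assumed smallest and largest study hours
    in the original data; they are illustrative values only.
    """
    y_hat = 20 + 3 * hours
    is_extrapolation = not (x_min <= hours <= x_max)
    return y_hat, is_extrapolation

print(predict_score(5))    # (35, False)  inside the assumed data range
print(predict_score(30))   # (110, True)  far outside the assumed data range
```

If the test were scored out of 100 (also an assumption), the second prediction would be impossible, which shows why extrapolation is risky.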
6. Analyze a Regression Line of a Data Set
Residuals:
The difference between observed and predicted values
\[ \text{Residual} = y - \hat{y} = \text{Observed} - \text{Predicted} \]
Interpretation:
• Positive residual: Actual value is above the line (underpredicted)
• Negative residual: Actual value is below the line (overpredicted)
• Zero residual: Point lies exactly on the line
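A short Python sketch of residuals, using made-up data and its least squares line \( \hat{y} = 2.2 + 0.6x \):

```python
# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6   # least squares line for this data: y_hat = 2.2 + 0.6x

predicted = [a + b * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]   # observed - predicted

for x, y, e in zip(xs, ys, residuals):
    side = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"x={x}, y={y}, residual={round(e, 2)}, point is {side} the line")

# Rounded residuals: -0.8, 0.6, 1.0, -0.6, -0.2
```

For a least squares line the residuals add up to zero (apart from rounding), so the positive and negative residuals balance.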
Coefficient of Determination (R²):
\[ R^2 = r^2 \]
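Measures the proportion of variation in y explained by the regression line (the identity \( R^2 = r^2 \) holds for a least squares line with one explanatory variable)
Interpretation:
• \( R^2 = 0.85 \) means 85% of the variation in y is explained by the linear relationship with x
• Higher \( R^2 \) (closer to 1) = points lie closer to the line; still check the residual plot before trusting the model
• Always between 0 and 1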
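"Proportion of variation explained" can be checked numerically. The sketch below uses made-up data and the standard definition \( R^2 = 1 - \text{SSE}/\text{SST} \) (a formula not stated in these notes, added here as an assumption about how the proportion is computed), then compares the result with \( r^2 \).

```python
import math

# Made-up data and its least squares line y_hat = 2.2 + 0.6x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# SSE: variation left over after using the line (sum of squared residuals)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # about 2.4
# SST: total variation of y about its mean
sst = sum((y - y_bar) ** 2 for y in ys)                      # 6.0

r_squared = 1 - sse / sst

# Correlation coefficient for the same data
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
r = sxy / math.sqrt(sxx * sst)

print(round(r_squared, 2), round(r ** 2, 2))  # 0.6 0.6
```

Here 60% of the variation in y is explained by the line, and the remaining 40% shows up in the residuals.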
Residual Plots:
A plot of residuals vs. x-values used to check model appropriateness
Good residual plot (linear model appropriate):
Random scatter around horizontal line at y = 0
Bad residual plot (linear model NOT appropriate):
• Clear pattern or curve
• Fan shape (increasing/decreasing spread)
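A "clear pattern or curve" can be seen even without drawing the plot. The sketch below fits a least squares line to made-up data that actually follows \( y = x^2 \) and lists the residuals in order of x.

```python
# Made-up curved data: y = x^2
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]   # [0, 1, 4, 9, 16]

# Least squares line through the curved data
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar       # line is y_hat = -2 + 4x

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals run positive, negative, negative, negative, positive: a U-shape rather than random scatter, so a linear model is not appropriate for this data.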
7. Exponential Regression
Exponential Model:
Used when data shows exponential growth or decay (multiplicative change)
\[ y = ab^x \]
where:
• \( a \) = initial value (when x = 0)
• \( b \) = growth/decay factor (base)
• If \( b > 1 \): exponential growth
• If \( 0 < b < 1 \): exponential decay
Finding Exponential Model:
Method: Linearization with Logarithms
1. Take natural log of both sides (requires every y > 0): \( \ln(y) = \ln(a) + x \cdot \ln(b) \)
2. This becomes linear: \( \ln(y) = c + mx \) where \( c = \ln(a), m = \ln(b) \)
3. Fit a least squares regression line to the \( (x, \ln(y)) \) data to get c (intercept) and m (slope)
4. Convert back: \( a = e^c \) and \( b = e^m \)
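The four steps can be sketched in Python as follows. The function name `fit_exponential` is chosen here for illustration, and the data are made up so that they follow \( y = 3 \cdot 2^x \) exactly, which makes the answer easy to check.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * b**x by a least squares line on (x, ln y). Requires every y > 0."""
    logs = [math.log(y) for y in ys]              # step 1: take ln of each y
    n = len(xs)
    x_bar = sum(xs) / n
    l_bar = sum(logs) / n
    # steps 2-3: linear regression ln(y) = c + m*x
    m = sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, logs)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    c = l_bar - m * x_bar
    # step 4: convert back to a and b
    return math.exp(c), math.exp(m)

# Made-up data following y = 3 * 2^x
xs = [0, 1, 2, 3, 4]
ys = [3, 6, 12, 24, 48]
a, b = fit_exponential(xs, ys)
print(round(a, 6), round(b, 6))  # 3.0 2.0
```

With real data the points will not fit perfectly, so a and b will be estimates rather than exact values.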
When to Use Exponential vs. Linear:
Use LINEAR when:
• Scatter plot shows straight-line pattern
• Equal changes in x produce equal changes in y
Use EXPONENTIAL when:
• Scatter plot shows curved (exponential) pattern
• Y-values multiply by a constant factor for equal x-intervals
• Common in population growth, compound interest, radioactive decay
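A quick way to choose between the two models is to compare successive differences and successive ratios of the y-values. The sketch below assumes equally spaced x-values and nonzero y-values; both data sets are made up, and the helper names are chosen here for illustration.

```python
def first_differences(ys):
    """Differences between consecutive y-values (constant for linear data)."""
    return [later - earlier for earlier, later in zip(ys, ys[1:])]

def successive_ratios(ys):
    """Ratios of consecutive y-values (constant for exponential data)."""
    return [later / earlier for earlier, later in zip(ys, ys[1:])]

# Made-up y-values at x = 0, 1, 2, 3
linear_ys = [5, 8, 11, 14]
exponential_ys = [5, 10, 20, 40]

print(first_differences(linear_ys))        # [3, 3, 3]  constant difference: linear
print(successive_ratios(exponential_ys))   # [2.0, 2.0, 2.0]  constant ratio: exponential
print(first_differences(exponential_ys))   # [5, 10, 20]  differences grow: not linear
```

Real data rarely gives perfectly constant values, so look for differences or ratios that are roughly constant.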
8. Quick Reference Summary
Key Formulas:
Correlation Coefficient:
\[ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2} \cdot \sqrt{\sum(y_i - \bar{y})^2}} \]
Regression Line: \( \hat{y} = a + bx \)
Slope: \( b = r \cdot \frac{s_y}{s_x} \)
Intercept: \( a = \bar{y} - b\bar{x} \)
Residual: \( y - \hat{y} \)
Coefficient of Determination: \( R^2 = r^2 \)
Exponential Model: \( y = ab^x \)
📚 Study Tips
✓ Strong correlation (|r| > 0.7) indicates points cluster tightly around a line
✓ Correlation does NOT imply causation
✓ Always check residual plots to verify that a linear model is appropriate
✓ Outliers can dramatically affect the regression line and the correlation
✓ Use exponential regression when data shows multiplicative growth/decay
