Bivariate Statistics

Complete Notes & Formulae for Eleventh Grade (Algebra 2)

1. Outliers in Scatter Plots

What is a Scatter Plot?

A graph showing the relationship between two quantitative variables, with each point representing one observation

Variables:

Explanatory variable (x): Independent variable (horizontal axis)

Response variable (y): Dependent variable (vertical axis)

Identifying Outliers:

An outlier in a scatter plot is a point that doesn't follow the general pattern of the data

Characteristics:

• Lies far from the overall trend or pattern

• Has an unusually large or small x or y value compared to other points

• Can significantly affect the regression line and correlation coefficient

⚠️ Effect of Outliers:

• Can pull the regression line toward it

• Can strengthen or weaken the correlation coefficient, depending on where the outlier lies relative to the trend

• Should be investigated (measurement error? special case?)
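To make the effect of an outlier concrete, the sketch below compares the correlation coefficient and the least-squares slope with and without one unusual point. The data values and the helper name `r_and_slope` are hypothetical and chosen only for illustration.

```python
import math

def r_and_slope(xs, ys):
    """Return (correlation coefficient, least-squares slope) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy), sxy / sxx

# Hypothetical data lying exactly on the line y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean, slope_clean = r_and_slope(xs, ys)

# Same data plus one point far below the trend: (6, 0)
r_out, slope_out = r_and_slope(xs + [6], ys + [0])

print(f"without outlier: r = {r_clean:.3f}, slope = {slope_clean:.3f}")
print(f"with outlier:    r = {r_out:.3f}, slope = {slope_out:.3f}")
```

In this made-up data set, a single point pulls r from 1 down to about 0.14 and the slope from 2 down to about 0.29, which is why outliers should be investigated before a line is trusted.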

2. Match Correlation Coefficients to Scatter Plots

Correlation Coefficient (r):

A measure of the strength and direction of the linear relationship between two variables

\[ -1 \leq r \leq 1 \]

Interpreting Correlation Values:

| Value of r | Direction | Strength | Pattern |
|---|---|---|---|
| r = 1 | Positive | Perfect | All points on line, rising |
| 0.7 ≤ r < 1 | Positive | Strong | Points close to line, rising |
| 0.3 ≤ r < 0.7 | Positive | Moderate | Moderate scatter, rising trend |
| 0 < r < 0.3 | Positive | Weak | Very scattered, slight rise |
| r = 0 | None | None | No linear pattern |
| -0.3 < r < 0 | Negative | Weak | Very scattered, slight decline |
| -0.7 < r ≤ -0.3 | Negative | Moderate | Moderate scatter, declining |
| -1 < r ≤ -0.7 | Negative | Strong | Points close to line, declining |
| r = -1 | Negative | Perfect | All points on line, declining |
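The table can be read as a simple lookup, sketched below. The function name `describe_r` is hypothetical, and placing the boundary values 0.3 and 0.7 in the stronger category is an assumption; the cutoffs themselves are conventions rather than strict rules.

```python
def describe_r(r):
    """Return (direction, strength) for a correlation coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return ("none", "none")
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.7:      # assumption: 0.7 itself counts as strong
        strength = "strong"
    elif size >= 0.3:      # assumption: 0.3 itself counts as moderate
        strength = "moderate"
    else:
        strength = "weak"
    return (direction, strength)

for value in (0.85, -0.5, 0.1, -1, 0):
    print(value, describe_r(value))
```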

3. Calculate Correlation Coefficients

Pearson Correlation Coefficient Formula:

\[ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2} \cdot \sqrt{\sum(y_i - \bar{y})^2}} \]

where:

• \( x_i, y_i \) = individual data points

• \( \bar{x}, \bar{y} \) = means of x and y

• \( n \) = number of data points (appears in the computational formula below)

Alternative Formula (Computational):

\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]

Steps to Calculate:

1. Calculate the means \( \bar{x} \) and \( \bar{y} \)

2. Find each deviation from the mean: \( (x_i - \bar{x}) \) and \( (y_i - \bar{y}) \)

3. Calculate the numerator: sum of products of deviations

4. Calculate the denominator: the product of the square roots of the sums of squared deviations in x and in y

5. Divide numerator by denominator
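The five steps can be followed line by line in a short script. This is a minimal sketch on a small hypothetical data set; it also evaluates the computational formula to check that both versions give the same value.

```python
import math

# Hypothetical paired data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Step 1: means
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Step 2: deviations from the means
dev_x = [x - mean_x for x in xs]
dev_y = [y - mean_y for y in ys]

# Step 3: numerator, the sum of products of deviations
numerator = sum(dx * dy for dx, dy in zip(dev_x, dev_y))

# Step 4: denominator, the product of the square roots of the sums of squares
denominator = math.sqrt(sum(dx ** 2 for dx in dev_x)) * math.sqrt(sum(dy ** 2 for dy in dev_y))

# Step 5: divide
r_steps = numerator / denominator

# Computational formula, for comparison
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
r_comp = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(f"r from deviations:          {r_steps:.4f}")
print(f"r from computational form:  {r_comp:.4f}")
```

For this data the numerator is 6 and the denominator is the square root of 60, so both formulas give r ≈ 0.7746.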

4. Find the Equation of a Regression Line

Least Squares Regression Line (LSRL):

The line of best fit that minimizes the sum of squared residuals

\[ \hat{y} = a + bx \]

where:

• \( \hat{y} \) = predicted value of y

• \( a \) = y-intercept

• \( b \) = slope

• \( x \) = explanatory variable

Formulas for Slope and Intercept:

Slope (b):

\[ b = r \cdot \frac{s_y}{s_x} \]

or

\[ b = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} \]

Y-intercept (a):

\[ a = \bar{y} - b\bar{x} \]

where:

• \( r \) = correlation coefficient

• \( s_x \) = standard deviation of x

• \( s_y \) = standard deviation of y

Key Point:

The regression line ALWAYS passes through the point \( (\bar{x}, \bar{y}) \)
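The slope and intercept formulas can be checked numerically. The sketch below uses a small hypothetical data set, computes the slope both ways, and confirms that the fitted line passes through the point of means. It assumes the sample standard deviation (as returned by Python's `statistics.stdev`) for both variables.

```python
import math
import statistics

# Hypothetical paired data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

mean_x = statistics.mean(xs)
mean_y = statistics.mean(ys)

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Slope from sums of deviations, then the intercept
b = sxy / sxx
a = mean_y - b * mean_x

# Slope from r and the standard deviations (should match b)
r = sxy / math.sqrt(sxx * syy)
b_alt = r * statistics.stdev(ys) / statistics.stdev(xs)

print(f"y-hat = {a:.2f} + {b:.2f}x")
print(f"slope via r * s_y / s_x: {b_alt:.2f}")
print(f"prediction at mean of x: {a + b * mean_x:.2f} (mean of y is {mean_y})")
```

Here both slope formulas give 0.6, the intercept is 2.2, and substituting the mean of x (3) returns the mean of y (4).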

5. Interpret Regression Lines

Interpreting the Slope (b):

Template for interpretation:

"For each one-unit increase in [x-variable], the predicted [y-variable] increases/decreases by [slope] units."

Example:

If \( \hat{y} = 20 + 3x \) where x = study hours, y = test score:

"For each additional hour of study, the predicted test score increases by 3 points."

Interpreting the Y-intercept (a):

Template for interpretation:

"When [x-variable] is zero, the predicted [y-variable] is [intercept] units."

⚠️ Warning:

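Interpret the intercept only if x = 0 makes sense in context and lies within (or close to) the range of the observed x-values. Otherwise the intercept is an extrapolation and may have no practical meaning.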

Predictions vs. Causation:

✓ Regression CAN:

• Show association between variables

• Make predictions within data range

✗ Regression CANNOT:

• Prove causation (correlation ≠ causation)

• Make accurate predictions far outside data range (extrapolation is risky)
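The difference between predicting within the data range and extrapolating can be illustrated with the example line \( \hat{y} = 20 + 3x \). In the sketch below, the function name `predict_score` and the observed range of 1 to 8 study hours are assumptions made for illustration; the source does not state the range of the data.

```python
def predict_score(hours, observed_min=1, observed_max=8):
    """Predict a test score from study hours using y-hat = 20 + 3x.

    observed_min and observed_max are hypothetical bounds on the study
    hours in the original data; predictions outside them are extrapolation.
    """
    predicted = 20 + 3 * hours
    within_range = observed_min <= hours <= observed_max
    return predicted, within_range

for hours in (5, 40):
    predicted, within_range = predict_score(hours)
    note = "within data range" if within_range else "extrapolation, treat with caution"
    print(f"{hours} hours -> predicted score {predicted} ({note})")
```

At 5 hours the prediction is 35. At 40 hours the formula returns 140, which illustrates how the line can give unreasonable values far outside the range it was fitted on.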

6. Analyze a Regression Line of a Data Set

Residuals:

The difference between observed and predicted values

\[ \text{Residual} = y - \hat{y} = \text{Observed} - \text{Predicted} \]

Interpretation:

• Positive residual: Actual value is above the line (underpredicted)

• Negative residual: Actual value is below the line (overpredicted)

• Zero residual: Point lies exactly on the line
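As a quick illustration, residuals can be computed point by point. The data below are hypothetical, and the line \( \hat{y} = 2.2 + 0.6x \) is the least-squares fit for this particular data set.

```python
# Hypothetical data and its least-squares line y-hat = 2.2 + 0.6x
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

residuals = []
for x, y in zip(xs, ys):
    predicted = a + b * x
    residual = y - predicted          # observed minus predicted
    residuals.append(residual)
    if residual > 0:
        label = "above the line (underpredicted)"
    elif residual < 0:
        label = "below the line (overpredicted)"
    else:
        label = "on the line"
    print(f"x = {x}: observed {y}, predicted {predicted:.1f}, residual {residual:+.1f}, {label}")

print(f"sum of residuals: {sum(residuals):.1f}")
```

For a least-squares line, the residuals sum to zero up to rounding, so positive and negative residuals balance out.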

Coefficient of Determination (R²):

\[ R^2 = r^2 \]

Measures the proportion of variation in y explained by the regression line

Interpretation:

• \( R^2 = 0.85 \) means 85% of the variation in y is explained by the linear relationship with x

• Higher \( R^2 \) = better fit (closer to 1)

• Always between 0 and 1
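The "proportion of variation explained" reading can be verified directly. The sketch below computes \( R^2 \) as one minus the ratio of the sum of squared residuals to the total sum of squares about the mean of y, an equivalent form for a least-squares line that is not written out in these notes, and compares it with \( r^2 \). The data are hypothetical.

```python
import math

# Hypothetical paired data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)   # total variation in y

# Least-squares line and correlation coefficient
b = sxy / sxx
a = mean_y - b * mean_x
r = sxy / math.sqrt(sxx * syy)

# Unexplained variation: sum of squared residuals
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r_squared = 1 - sse / syy
print(f"R^2 from residuals: {r_squared:.3f}")
print(f"r squared:          {r ** 2:.3f}")
```

Both values come out to 0.6 here, so about 60% of the variation in y is accounted for by the line.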

Residual Plots:

A plot of residuals vs. x-values used to check model appropriateness

Good residual plot (linear model appropriate):

Random scatter around horizontal line at y = 0

Bad residual plot (linear model NOT appropriate):

• Clear pattern or curve

• Fan shape (increasing/decreasing spread)
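A curved pattern in the residuals can be seen even without drawing the plot. The sketch below fits a least-squares line to hypothetical data that actually follow \( y = x^2 \) and lists the residuals in order of x.

```python
# Hypothetical data that follow y = x^2, so a straight line is the wrong model
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
signs = ["+" if res > 0 else "-" if res < 0 else "0" for res in residuals]

print(f"fitted line: y-hat = {a:.0f} + {b:.0f}x")
print("residuals:", residuals)
print("signs in order of x:", " ".join(signs))
```

The residuals run positive, negative, negative, negative, positive: a U-shaped pattern rather than random scatter, which signals that a linear model is not appropriate for this data.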

7. Exponential Regression

Exponential Model:

Used when data shows exponential growth or decay (multiplicative change)

\[ y = ab^x \]

where:

• \( a \) = initial value (when x = 0)

• \( b \) = growth/decay factor (base)

• If \( b > 1 \): exponential growth

• If \( 0 < b < 1 \): exponential decay

Finding Exponential Model:

Method: Linearization with Logarithms

1. Take natural log of both sides: \( \ln(y) = \ln(a) + x \cdot \ln(b) \)

2. This becomes linear: \( \ln(y) = c + mx \) where \( c = \ln(a), m = \ln(b) \)

3. Fit a linear regression to the \( (x, \ln(y)) \) data (this requires every y-value to be positive)

4. Convert back: \( a = e^c \) and \( b = e^m \)
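The linearization method can be sketched in a few lines. The data below are hypothetical and were generated from \( y = 3 \cdot 2^x \), so the method should recover a = 3 and b = 2; the helper name `fit_line` is also hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of ys = intercept + slope * xs; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data generated from y = 3 * 2^x
xs = [0, 1, 2, 3, 4]
ys = [3, 6, 12, 24, 48]

# Steps 1-2: take logs so the relationship becomes linear
log_ys = [math.log(y) for y in ys]

# Step 3: linear regression on (x, ln y)
c, m = fit_line(xs, log_ys)

# Step 4: convert back to the exponential parameters
a = math.exp(c)
b = math.exp(m)

print(f"y = {a:.3f} * {b:.3f}^x")
```

With real data the points will not lie exactly on a curve, and the fitted a and b minimize squared error in \( \ln(y) \) rather than in y itself.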

When to Use Exponential vs. Linear:

Use LINEAR when:

• Scatter plot shows straight-line pattern

• Equal changes in x produce equal changes in y

Use EXPONENTIAL when:

• Scatter plot shows curved (exponential) pattern

• Y-values multiply by constant factor for equal x-intervals

• Common in population growth, compound interest, radioactive decay
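One quick way to apply these criteria to a table of values is to compare successive differences with successive ratios. The sketch below assumes the x-values are equally spaced; the data and the function names `differences` and `ratios` are hypothetical.

```python
def differences(values):
    """Successive differences y[i+1] - y[i]."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

def ratios(values):
    """Successive ratios y[i+1] / y[i] (values must be nonzero)."""
    return [later / earlier for earlier, later in zip(values, values[1:])]

# Hypothetical y-values at equally spaced x-values
linear_ys = [5, 8, 11, 14]
exponential_ys = [3, 6, 12, 24]

print("linear data:      differences", differences(linear_ys), "ratios", ratios(linear_ys))
print("exponential data: differences", differences(exponential_ys), "ratios", ratios(exponential_ys))
```

Constant differences point to a linear model, while constant ratios point to an exponential one. Real data are noisy, so look for values that are roughly constant rather than exactly equal.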

8. Quick Reference Summary

Key Formulas:

Correlation Coefficient:

\[ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2} \cdot \sqrt{\sum(y_i - \bar{y})^2}} \]

Regression Line: \( \hat{y} = a + bx \)

Slope: \( b = r \cdot \frac{s_y}{s_x} \)

Intercept: \( a = \bar{y} - b\bar{x} \)

Residual: \( y - \hat{y} \)

Coefficient of Determination: \( R^2 = r^2 \)

Exponential Model: \( y = ab^x \)

📚 Study Tips

✓ Strong correlation (|r| > 0.7) indicates that points cluster tightly around a line

✓ Correlation does NOT imply causation

✓ Always check residual plots to verify linear model appropriateness

✓ Outliers can dramatically affect regression line and correlation

✓ Use exponential regression when data shows multiplicative growth/decay
