Bivariate Statistics - Ninth Grade Math

Introduction to Bivariate Data

Bivariate Data: Data involving two variables
Purpose: To examine relationships between two variables
Key Questions:
• Is there a relationship between the variables?
• How strong is the relationship?
• Can we predict one variable from the other?
Variables:
Independent Variable (x): The input or predictor variable
Dependent Variable (y): The output or response variable

1. Interpret a Scatter Plot

Scatter Plot: A graph showing the relationship between two quantitative variables
Each Point: Represents one data pair (x, y)
x-axis: Independent variable
y-axis: Dependent variable

Types of Associations

Direction of Association:

1. Positive Association (Positive Correlation):
• As x increases, y increases
• Points trend upward from left to right
• Example: Hours studied vs. test score

2. Negative Association (Negative Correlation):
• As x increases, y decreases
• Points trend downward from left to right
• Example: Hours watching TV vs. test score

3. No Association (No Correlation):
• No clear pattern
• Points scattered randomly
• Example: Shoe size vs. test score

Strength of Association

How Close to a Line:

Strong Association:
• Points cluster tightly around a line
• Clear pattern visible

Moderate Association:
• Points generally follow a pattern but with scatter
• Trend visible but not tight

Weak Association:
• Points loosely follow a pattern
• Much scatter, unclear trend

Form of Association

Shape of Pattern:

Linear: Points follow a straight line pattern
Nonlinear: Points follow a curved pattern (quadratic, exponential, etc.)
No Form: No discernible pattern
Example 1: Interpret scatter plot

Data: Hours studying (x) vs. Test score (y)
Pattern: Points trend upward from left to right, fairly tight to a line

Interpretation:
Direction: Positive association
Strength: Strong
Form: Linear

Conclusion: There is a strong, positive, linear association between hours studying and test scores. As study time increases, test scores tend to increase.

2. Outliers in Scatter Plots

Outlier in Scatter Plot: A point that doesn't fit the general pattern
Characteristics:
• Far from other points
• Doesn't follow the trend
• May indicate error or special case
Identifying Outliers:

Visual Method:
• Look for points far from the main cluster
• Points that don't fit the linear pattern

Types of Outliers:
Vertical outlier: Unusual y-value for its x-value
Horizontal outlier: Unusual x-value
Influential outlier: Point that significantly affects correlation/regression line
Example 1: Identify outlier

Data points: (1, 3), (2, 5), (3, 7), (4, 9), (5, 11), (6, 2)

Analysis:
Most points follow pattern: y ≈ 2x + 1
Point (6, 2) doesn't fit: should be around (6, 13)

Conclusion: (6, 2) is an outlier
Effect: Would weaken the correlation and pull regression line down
Effect of Outliers:
• Can weaken correlation
• Affects slope of regression line
• May dramatically change predictions
• Should investigate: error, unusual case, or valid extreme?
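
To see how much a single outlier can matter, the sketch below recomputes the correlation for the example data with and without the point (6, 2). It uses the standard Pearson formula covered in Section 4; the helper name `pearson_r` is only an illustrative choice.

```python
from math import sqrt


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


points = [(1, 3), (2, 5), (3, 7), (4, 9), (5, 11), (6, 2)]
without_outlier = [p for p in points if p != (6, 2)]

r_all = pearson_r(points)
r_trimmed = pearson_r(without_outlier)
print(f"r with (6, 2):    {r_all:.2f}")
print(f"r without (6, 2): {r_trimmed:.2f}")
```

With the outlier included, r is about 0.23; without it, the remaining five points lie exactly on y = 2x + 1 and r is 1.00.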

3. Match Correlation Coefficients to Scatter Plots

Correlation Coefficient (r): A number measuring strength and direction of linear relationship
Symbol: $r$
Range: $-1 \leq r \leq 1$
Also called: Pearson correlation coefficient
Correlation Coefficient Values:

Perfect Positive: $r = +1$
• All points on line with positive slope
• Perfect positive linear relationship

Strong Positive: $0.7 \leq r < 1$
• Points cluster tightly around upward line
• Strong positive association

Moderate Positive: $0.3 \leq r < 0.7$
• Clear upward trend with scatter
• Moderate positive association

Weak Positive: $0 < r < 0.3$
• Slight upward trend, much scatter
• Weak positive association

No Correlation: $r = 0$
• No linear pattern (a curved pattern can still give $r$ near 0)
• No linear relationship

Weak Negative: $-0.3 < r < 0$
• Slight downward trend, much scatter

Moderate Negative: $-0.7 < r \leq -0.3$
• Clear downward trend with scatter

Strong Negative: $-1 < r \leq -0.7$
• Points cluster tightly around downward line

Perfect Negative: $r = -1$
• All points on line with negative slope
Example 1: Match r-values to scatter plots

Given r-values: -0.95, -0.4, 0.1, 0.85

Plot A: Points tightly clustered downward
→ $r = -0.95$ (strong negative)

Plot B: Points loosely trending downward
→ $r = -0.4$ (moderate negative)

Plot C: Random scatter, slight upward
→ $r = 0.1$ (very weak positive)

Plot D: Points tightly clustered upward
→ $r = 0.85$ (strong positive)
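
The matching above can be sketched as a small lookup. The function name `describe_r` is hypothetical, and the cutoffs 0.3 and 0.7 are this guide's rule of thumb (other textbooks use slightly different values). Placing a value that falls exactly on a cutoff in the stronger category is an assumption of this sketch.

```python
def describe_r(r):
    """Label r using the cutoffs 0.3 and 0.7 (a rule of thumb, not a standard)."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    if r == 0:
        return "no linear correlation"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.7:      # assumption: boundary values count as the stronger label
        strength = "strong"
    elif size >= 0.3:
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


for r in (-0.95, -0.4, 0.1, 0.85):
    print(r, "->", describe_r(r))
```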

4. Calculate Correlation Coefficients

Correlation Coefficient Formula:

$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$

where:
• $n$ = number of data pairs
• $x$ = values of independent variable
• $y$ = values of dependent variable
• $\sum xy$ = sum of products of paired values
• $\sum x$ = sum of x-values
• $\sum y$ = sum of y-values
• $\sum x^2$ = sum of squared x-values
• $\sum y^2$ = sum of squared y-values
Steps to Calculate r:
Step 1: Create table with columns: x, y, xy, x², y²
Step 2: Calculate each column
Step 3: Find sum of each column
Step 4: Substitute into formula
Step 5: Simplify to find r
Step 6: Interpret the value
Example 1: Calculate r for data: (1, 2), (2, 3), (3, 5), (4, 6)

Create table:
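| x | y | xy | x² | y² |
|---|---|----|----|----|
| 1 | 2 | 2 | 1 | 4 |
| 2 | 3 | 6 | 4 | 9 |
| 3 | 5 | 15 | 9 | 25 |
| 4 | 6 | 24 | 16 | 36 |
| Σx = 10 | Σy = 16 | Σxy = 47 | Σx² = 30 | Σy² = 74 |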

Apply formula: $n = 4$
$$r = \frac{4(47) - (10)(16)}{\sqrt{[4(30) - 10^2][4(74) - 16^2]}}$$

$$r = \frac{188 - 160}{\sqrt{[120 - 100][296 - 256]}}$$

$$r = \frac{28}{\sqrt{20 \times 40}} = \frac{28}{\sqrt{800}} = \frac{28}{28.28} \approx 0.99$$

Answer: $r \approx 0.99$ (very strong positive correlation)
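
The same table-based steps can be sketched in a few lines of Python, which is a handy way to check the column sums before substituting them into the formula. The variable names are illustrative only.

```python
from math import sqrt

data = [(1, 2), (2, 3), (3, 5), (4, 6)]

# Steps 1-2: build the table rows (x, y, xy, x^2, y^2).
rows = [(x, y, x * y, x ** 2, y ** 2) for x, y in data]

# Step 3: sum each column.
sum_x, sum_y, sum_xy, sum_x2, sum_y2 = (sum(col) for col in zip(*rows))
n = len(data)

# Steps 4-5: substitute into the formula and simplify.
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print("sums:", sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(f"r = {r:.2f}")
```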

5-6. Write and Interpret Lines of Best Fit

Line of Best Fit: A line that best represents the data in a scatter plot
Also called: Trend line or regression line
Purpose: To model relationship and make predictions
Form: $y = mx + b$ (slope-intercept form)
Line of Best Fit Equation:

$$y = mx + b$$

where:
• $m$ = slope (rate of change)
• $b$ = y-intercept (value when x = 0)

Slope Formula (using two points on line):
$$m = \frac{y_2 - y_1}{x_2 - x_1}$$

Finding b: Use a point on the line
$$b = y - mx$$
Example 1: Write equation of line of best fit

Given: Line passes through (1, 3) and (5, 11)

Find slope:
$$m = \frac{11 - 3}{5 - 1} = \frac{8}{4} = 2$$

Find y-intercept using (1, 3):
$3 = 2(1) + b$
$3 = 2 + b$
$b = 1$

Equation: $y = 2x + 1$
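
As a quick check of the two-point method, here is a minimal sketch that computes the slope and then the y-intercept from $b = y - mx$. The function name `line_through` is hypothetical.

```python
def line_through(p1, p2):
    """Return (m, b) for the line y = mx + b through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    m = (y2 - y1) / (x2 - x1)   # slope formula
    b = y1 - m * x1             # b = y - mx, using the first point
    return m, b


m, b = line_through((1, 3), (5, 11))
print(f"y = {m}x + {b}")
```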

Interpreting Lines of Best Fit

Interpretation Guide:

Slope (m):
• Represents rate of change
• "For every 1 unit increase in x, y changes by m units"
• Positive m: y increases as x increases
• Negative m: y decreases as x increases

Y-intercept (b):
• Value of y when x = 0
• Starting value or initial amount
• May or may not be meaningful in context

Making Predictions:
• Substitute x-value into equation
• Solve for y
Interpolation: Predicting within data range (generally reliable)
Extrapolation: Predicting outside data range (less reliable, because the pattern may not continue)
Example 2: Interpret line of best fit

Equation: $y = 5x + 60$
Context: x = hours studied, y = test score

Slope interpretation:
For every additional hour studied, test score increases by 5 points on average.

Y-intercept interpretation:
A student who studies 0 hours would be expected to score 60 points.

Prediction: If a student studies 8 hours:
$y = 5(8) + 60 = 40 + 60 = 100$ points
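
A prediction is only as trustworthy as the data range behind it. The sketch below labels each prediction as interpolation or extrapolation. The example does not state the observed range of study hours, so the range of 1 to 6 hours used here is purely hypothetical, as is the function name `predict`.

```python
def predict(x, m, b, x_min, x_max):
    """Return (predicted y, kind); kind is 'interpolation' or 'extrapolation'."""
    y_hat = m * x + b
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


# Hypothetical observed range: students who studied between 1 and 6 hours.
print(predict(4, m=5, b=60, x_min=1, x_max=6))
print(predict(8, m=5, b=60, x_min=1, x_max=6))
```

Under that assumed range, the 8-hour prediction of 100 points would be an extrapolation and should be read with caution.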

7-9. Find, Interpret, and Analyze Regression Lines

Regression Line: The line of best fit calculated using least squares method
Least Squares: Minimizes sum of squared vertical distances from points to line
Also called: Linear regression, least squares regression line (LSRL)
Linear Regression Formulas:

Slope:
$$m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$$

Or using the correlation coefficient and standard deviations:
$$m = r \cdot \frac{s_y}{s_x}$$

Y-intercept:
$$b = \bar{y} - m\bar{x}$$

where:
• $\bar{x}$ = mean of x-values
• $\bar{y}$ = mean of y-values
• $s_x$ = standard deviation of x
• $s_y$ = standard deviation of y
• $r$ = correlation coefficient

Regression Equation:
$$\hat{y} = mx + b$$
(hat symbol indicates predicted value)
Example 1: Find regression equation

Given data: (2, 3), (4, 5), (6, 7), (8, 10)

Calculate means:
$\bar{x} = \frac{2+4+6+8}{4} = 5$
$\bar{y} = \frac{3+5+7+10}{4} = 6.25$

Calculate sums:
$\sum x = 20$, $\sum y = 25$, $\sum xy = 2(3) + 4(5) + 6(7) + 8(10) = 148$, $\sum x^2 = 120$, $n = 4$

Calculate slope:
$$m = \frac{4(148) - (20)(25)}{4(120) - 20^2} = \frac{592 - 500}{480 - 400} = \frac{92}{80} = 1.15$$

Calculate y-intercept:
$$b = 6.25 - 1.15(5) = 6.25 - 5.75 = 0.5$$

Regression equation: $\hat{y} = 1.15x + 0.5$
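
Hand arithmetic in regression problems is easy to get wrong, so it can help to recompute the sums directly from the data points. The following is a minimal sketch of the slope and intercept formulas above; the function name `least_squares` is hypothetical.

```python
def least_squares(points):
    """Return (m, b) of the least squares line for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * (sum_x / n)   # b = y-bar - m * x-bar
    return m, b


data = [(2, 3), (4, 5), (6, 7), (8, 10)]
m, b = least_squares(data)
print(f"y-hat = {m:.2f}x + {b:.2f}")
```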

Analyzing Regression Lines

Key Concepts:

Residual: Difference between actual and predicted value
$$\text{Residual} = y - \hat{y}$$

Positive residual: Actual value above prediction
Negative residual: Actual value below prediction

Coefficient of Determination ($r^2$):
• Square of correlation coefficient
• Represents proportion of variation in y explained by the linear relationship with x
• Range: 0 to 1
• Example: $r^2 = 0.81$ means 81% of variation in y is explained by x
Example 2: Calculate and interpret residual

Regression equation: $\hat{y} = 2x + 3$
Actual data point: (5, 15)

Predicted value:
$\hat{y} = 2(5) + 3 = 13$

Residual:
$15 - 13 = 2$

Interpretation: The actual y-value is 2 units higher than predicted by the regression line.
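
The residual calculation is short enough to sketch directly. The function name `residual` is an illustrative choice; the sign of the result tells you whether the point lies above or below the line.

```python
def residual(x, y, m, b):
    """Actual y minus the value predicted by y-hat = mx + b."""
    return y - (m * x + b)


res = residual(5, 15, m=2, b=3)
position = "above" if res > 0 else "below" if res < 0 else "on"
print(f"residual = {res}: the point lies {position} the line")
```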

10. Exponential Regression

Exponential Regression: Finding best-fit exponential curve for data
Used when: Data shows exponential growth or decay pattern
Form: $y = ab^x$ or $y = ae^{kx}$
Exponential Model:

$$y = ab^x$$

where:
• $a$ = initial value (y-intercept, when x = 0)
• $b$ = growth/decay factor
• If $b > 1$: exponential growth
• If $0 < b < 1$: exponential decay

Alternative form:
$$y = ae^{kx}$$

where:
• $k > 0$: growth
• $k < 0$: decay
When to Use Exponential vs Linear:

Use Linear when:
• Constant rate of change
• Points follow straight line
• Add/subtract same amount each time

Use Exponential when:
• Rate of change increases/decreases
• Points follow curved pattern
• Multiply by same factor each time
• Data doubles, triples, or halves at regular intervals
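
The add-versus-multiply test can be sketched as follows for y-values taken at equally spaced x-values. The function name `classify_pattern` is hypothetical, and the check assumes exact patterns. Real measured data is noisy and would call for a scatter plot or a regression fit instead.

```python
def classify_pattern(ys, tol=1e-9):
    """Classify y-values taken at equally spaced x-values.

    Returns 'linear' for constant differences, 'exponential' for constant
    ratios, and 'neither' otherwise.
    """
    if len(ys) < 3:
        raise ValueError("need at least three values to see a pattern")
    diffs = [b - a for a, b in zip(ys, ys[1:])]
    if all(abs(d - diffs[0]) <= tol for d in diffs):
        return "linear"
    if all(y != 0 for y in ys):
        ratios = [b / a for a, b in zip(ys, ys[1:])]
        if all(abs(q - ratios[0]) <= tol for q in ratios):
            return "exponential"
    return "neither"


print(classify_pattern([2, 5, 8, 11, 14]))   # add 3 each time
print(classify_pattern([2, 6, 18, 54]))      # multiply by 3 each time
print(classify_pattern([1, 4, 9, 16]))       # no constant difference or ratio
```
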
Example 1: Identify exponential pattern

Data: (0, 5), (1, 10), (2, 20), (3, 40)

Check for pattern:
$\frac{10}{5} = 2$, $\frac{20}{10} = 2$, $\frac{40}{20} = 2$

Each y-value is double the previous → exponential!

Model: $y = 5(2)^x$
• Initial value: $a = 5$
• Growth factor: $b = 2$ (doubles each time)
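
When the ratios are not perfectly constant, a regression fit is needed. This guide does not specify a fitting method, so the sketch below assumes one common approach: take the natural log of y, which turns $y = ab^x$ into $\ln y = \ln a + x \ln b$, and fit a straight line to the pairs (x, ln y). It requires every y-value to be positive, and the function name `fit_exponential` is hypothetical.

```python
from math import exp, log


def fit_exponential(points):
    """Fit y = a * b**x by least squares on (x, ln y). Requires every y > 0."""
    n = len(points)
    xs = [x for x, _ in points]
    log_ys = [log(y) for _, y in points]
    sum_x = sum(xs)
    sum_ly = sum(log_ys)
    sum_x_ly = sum(x * ly for x, ly in zip(xs, log_ys))
    sum_x2 = sum(x * x for x in xs)
    slope = (n * sum_x_ly - sum_x * sum_ly) / (n * sum_x2 - sum_x ** 2)
    intercept = sum_ly / n - slope * (sum_x / n)
    return exp(intercept), exp(slope)  # a = e^intercept, b = e^slope


a, b = fit_exponential([(0, 5), (1, 10), (2, 20), (3, 40)])
print(f"y = {a:.2f} * {b:.2f}^x")
```

For this exact doubling data the fit recovers a = 5 and b = 2 up to rounding. On noisy data the result can differ slightly from other fitting methods, because the squared errors are minimized on the log scale.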

11. Correlation and Causation

Correlation: A statistical relationship between two variables
Causation: One variable directly causes changes in another
Key Principle: Correlation does NOT imply causation!
Important Distinctions:

Correlation means:
• Two variables are associated
• They change together
• You can predict one from the other
• Does NOT mean one causes the other

Causation means:
• One variable directly influences another
• Change in one causes change in the other
• There is a cause-and-effect relationship
• Much harder to prove than correlation

Why Correlation ≠ Causation

Three Main Reasons:

1. Third Variable (Confounding Variable):
• A hidden variable affects both
• Example: Ice cream sales and drowning deaths
  → Both caused by hot weather (third variable)

2. Reverse Causation (Directionality Problem):
• Don't know which variable causes which
• Example: Depression and low vitamin D
  → Does depression cause low vitamin D, or vice versa?

3. Coincidence:
• Pure chance
• Example: Number of Nicolas Cage movies and swimming pool drownings
  → No real connection, just coincidence
Example 1: Correlation without causation

Observation: There is a strong positive correlation between shoe size and reading ability in children.

Do large feet cause better reading? NO!

Explanation:
• Third variable: AGE
• Older children have bigger feet
• Older children read better
• Age causes both variables to increase

Conclusion: Correlation exists, but no causal relationship between shoe size and reading ability.
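
A small simulation can make the third-variable idea concrete. Everything in the sketch below is made up for illustration: the formulas linking age to shoe size and reading score, the noise ranges, and the sample size. In this hypothetical model, age drives both variables and they never influence each other.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


random.seed(0)  # fixed seed so the run is repeatable

# Made-up model: 100 children at each age from 6 to 12.
ages = [age for age in range(6, 13) for _ in range(100)]
shoe = [0.5 * age + random.uniform(-0.5, 0.5) for age in ages]
reading = [10 * age + random.uniform(-5, 5) for age in ages]

r_all = pearson_r(shoe, reading)

# Hold the third variable fixed: look only at the 8-year-olds.
idx = [i for i, age in enumerate(ages) if age == 8]
r_age8 = pearson_r([shoe[i] for i in idx], [reading[i] for i in idx])

print(f"all ages:   r = {r_all:.2f}")
print(f"age 8 only: r = {r_age8:.2f}")
```

Across all ages the correlation should come out strongly positive, while within a single age group it should be close to zero, which is what we would expect if age alone explains the association.
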
Example 2: Identify causation

Scenario A: Hours of exercise and calories burned
Analysis: Exercise directly causes calorie burning
Conclusion: Causation ✓

Scenario B: Coffee consumption and heart disease
Analysis: Many confounding variables (stress, sleep, diet)
Conclusion: Correlation, but causation unclear

Scenario C: Hours studied and test scores
Analysis: Studying directly improves knowledge
Conclusion: Strong evidence for causation ✓

Establishing Causation

Requirements for Causation:

1. Correlation exists: Variables must be related
2. Temporal precedence: Cause must come before effect
3. No alternative explanation: Rule out confounding variables

Gold Standard: Controlled experiment
• Random assignment
• Control group vs. experimental group
• Manipulate one variable, measure effect on other
• Control for confounding variables

Correlation Coefficient Guide

| r Value | Strength | Direction | Description |
|---|---|---|---|
| $r = 1$ | Perfect | Positive | All points on line, upward slope |
| $0.7 \leq r < 1$ | Strong | Positive | Points tightly clustered, upward trend |
| $0.3 \leq r < 0.7$ | Moderate | Positive | Clear upward trend with scatter |
| $0 < r < 0.3$ | Weak | Positive | Slight upward trend, much scatter |
| $r = 0$ | None | None | No linear relationship |
| $-0.3 < r < 0$ | Weak | Negative | Slight downward trend, much scatter |
| $-0.7 < r \leq -0.3$ | Moderate | Negative | Clear downward trend with scatter |
| $-1 < r \leq -0.7$ | Strong | Negative | Points tightly clustered, downward trend |
| $r = -1$ | Perfect | Negative | All points on line, downward slope |

Linear vs Exponential Models

| Feature | Linear Model | Exponential Model |
|---|---|---|
| Equation | $y = mx + b$ | $y = ab^x$ |
| Shape | Straight line | Curved (J-shape or decay) |
| Rate of Change | Constant (same each time) | Increasing or decreasing |
| Pattern | Add/subtract same amount | Multiply/divide by same factor |
| Example | 2, 5, 8, 11, 14 (+3 each) | 2, 6, 18, 54 (×3 each) |
| Real-world | Constant speed, hourly wage | Population growth, compound interest |

Correlation vs Causation

| Aspect | Correlation | Causation |
|---|---|---|
| Definition | Variables are related | One variable causes another |
| How to Find | Calculate r, observe pattern | Controlled experiment required |
| Implication | Can predict, not explain | Change one affects the other |
| Example | Shoe size and reading ability | Exercise and calories burned |
| Caution | May have third variable | Hard to prove definitively |

Success Tips for Bivariate Statistics:
✓ Scatter plots show relationships between two quantitative variables
✓ Correlation coefficient r measures strength and direction of a linear relationship: closer to ±1 = stronger
✓ Positive r: both variables increase together; Negative r: one decreases as other increases
✓ Line of best fit equation: y = mx + b (slope + y-intercept)
✓ Slope tells rate of change; y-intercept is starting value
✓ Outliers don't fit the pattern and can affect correlation/regression
✓ r² shows percentage of variation explained (coefficient of determination)
✓ Residual = actual - predicted value
✓ Use exponential model when data multiplies by constant factor
✓ CORRELATION ≠ CAUSATION! Always consider third variables!