Bivariate Statistics

Bivariate Statistics - Formulas & Concepts

IB Mathematics Analysis & Approaches (SL & HL)

📊 Bivariate Data & Scatter Diagrams

Definition:

Bivariate data consists of paired values from two variables, typically denoted \((x, y)\), and is used to examine the relationship between the two variables.

Variables:

Independent variable (x): The explanatory variable (plotted on x-axis)
Dependent variable (y): The response variable (plotted on y-axis)

Mean Point:

\[(\bar{x}, \bar{y})\]

The line of best fit always passes through the mean point

🔗 Correlation

Types of Correlation:

• Positive correlation: As x increases, y increases
• Negative correlation: As x increases, y decreases
• No correlation: No clear linear relationship
• Strong correlation: Points lie close to a straight line
• Weak correlation: Points are scattered

Important Note:

Correlation does NOT imply causation!

📈 Pearson's Correlation Coefficient (r)

Formula:

\[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}\]

Given in the formula booklet

Range:

\[-1 \leq r \leq 1\]

Interpretation:

• \(r = +1\): Perfect positive linear correlation
• \(r = -1\): Perfect negative linear correlation
• \(r = 0\): No linear correlation
• \(|r|\) close to 1: Strong linear relationship
• \(|r|\) close to 0: Weak or no linear relationship
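The formula-booklet definition of \(r\) can be computed step by step. A minimal sketch in Python, using small made-up \((x, y)\) data (the values are illustrative, not from these notes):

```python
# Pearson's r computed directly from the formula-booklet definition:
# r = S_xy / sqrt(S_xx * S_yy), with deviations taken from the means.
from math import sqrt

xs = [1, 2, 3, 4, 5]  # illustrative data
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)  # mean of x
y_bar = sum(ys) / len(ys)  # mean of y

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / sqrt(s_xx * s_yy)
print(round(r, 4))  # 0.7746
```

A value of about 0.77 here indicates a moderately strong positive linear correlation.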

💯 Coefficient of Determination

Notation:

\[r^2\]

Interpretation:

• \(r^2\) represents the proportion (or percentage) of variance in y that can be explained by x
• Also called "explained variation"
• Always between 0 and 1 (or 0% to 100%)
• Higher values indicate a better fit

Example:

If \(r = 0.8\), then \(r^2 = 0.64\), meaning 64% of the variance in y is explained by x

📉 Regression Line of y on x

Equation:

\[y = ax + b\]

Also written as \(y = mx + c\) in some contexts

Gradient (Slope):

\[a = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}\]

Alternative form: \(a = r \cdot \frac{\sigma_y}{\sigma_x}\)

Y-intercept:

\[b = \bar{y} - a\bar{x}\]

Key Properties:

• The line always passes through \((\bar{x}, \bar{y})\)
• Minimizes the sum of squared vertical distances (residuals)
• Used to predict y from x
• Called "least squares regression line"
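The gradient and intercept formulas above can be checked numerically. A short sketch, reusing the same illustrative data as before (not from these notes):

```python
# Least-squares gradient a = S_xy / S_xx and intercept b = y_bar - a * x_bar,
# computed directly from the booklet formulas on made-up data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
b = y_bar - a * x_bar

print(round(a, 4), round(b, 4))  # 0.6 2.2

# Key property: the line passes through the mean point (x_bar, y_bar)
assert abs((a * x_bar + b) - y_bar) < 1e-9
```

The final assertion confirms the first key property: substituting \(\bar{x}\) into the fitted line returns exactly \(\bar{y}\).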

📊 Regression Line of x on y (HL)

Equation:

\[x = cy + d\]

Gradient (Slope):

\[c = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(y_i - \bar{y})^2}\]

Alternative form: \(c = r \cdot \frac{\sigma_x}{\sigma_y}\)

X-intercept:

\[d = \bar{x} - c\bar{y}\]

Usage:

Used to predict x from y (minimizes horizontal distances)
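The x on y line uses the same numerator \(S_{xy}\) but divides by \(S_{yy}\), and the roles of \(\bar{x}\) and \(\bar{y}\) swap in the intercept. A sketch on the same illustrative data:

```python
# Regression of x on y: c = S_xy / S_yy and d = x_bar - c * y_bar.
# Data is made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

c = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((y - y_bar) ** 2 for y in ys)
d = x_bar - c * y_bar

print(round(c, 4), round(d, 4))  # 1.0 -1.0
```

Note that this line is generally different from the y on x line for the same data (here \(x = y - 1\), versus \(y = 0.6x + 2.2\) for y on x), because each minimizes distances in a different direction.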

🔍 Interpreting Parameters

For \(y = ax + b\):

Gradient (a): The rate of change of y with respect to x. For each one-unit increase in x, y changes by \(a\) units
Y-intercept (b): The predicted value of y when \(x = 0\)

Example Interpretation:

If \(y = 3x + 2\):
• For each additional unit of x, y increases by 3 units
• When x = 0, the predicted y-value is 2

🎯 Making Predictions

Interpolation:

• Prediction made within the range of observed data
• Generally reliable
• Substitute x-value into regression equation to find predicted y-value

Extrapolation:

• Prediction made outside the range of observed data
• May be unreliable - the linear trend may not continue
• Use with caution
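Both kinds of prediction are the same substitution; only the reliability differs. A sketch using a hypothetical fitted line \(y = 0.6x + 2.2\), assumed to come from data observed over \(1 \leq x \leq 5\):

```python
# Prediction with a hypothetical fitted line y = 0.6x + 2.2,
# assumed fitted to data observed over the range 1 <= x <= 5.
def predict_y(x, a=0.6, b=2.2):
    return a * x + b

# Interpolation: x = 2.5 lies inside the observed range, so this is reliable.
print(round(predict_y(2.5), 4))  # 3.7

# Extrapolation: x = 20 is far outside the range; the linear trend
# may not continue, so treat this value with caution.
print(round(predict_y(20), 4))   # 14.2
```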

📍 Residuals

Definition:

\[\text{Residual} = y_{\text{observed}} - y_{\text{predicted}}\]

Also written as: \(e_i = y_i - \hat{y}_i\)

Properties:

• Residuals represent vertical distances from points to the regression line
• Sum of all residuals = 0
• Positive residual: actual y-value is above the line
• Negative residual: actual y-value is below the line
• The regression line minimizes \(\sum e_i^2\) (sum of squared residuals)
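The properties above can be verified directly. A sketch using the same illustrative data and its least-squares line \(y = 0.6x + 2.2\):

```python
# Residuals e_i = y_i - y_hat_i for made-up data and its
# least-squares line y = 0.6x + 2.2.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 0.6, 2.2  # least-squares coefficients for this data set

residuals = [y - (a * x + b) for x, y in zip(xs, ys)]  # observed - predicted

print([round(e, 2) for e in residuals])  # [-0.8, 0.6, 1.0, -0.6, -0.2]
print(abs(sum(residuals)) < 1e-9)        # True: residuals sum to zero
```

Points above the line (positive residuals) and below it (negative residuals) balance out exactly, which is why the sum is zero for a least-squares line.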

🔢 Using Technology (GDC)

GDC Can Calculate:

• Correlation coefficient \(r\)
• Coefficient of determination \(r^2\)
• Regression line equation \(y = ax + b\)
• Values of \(a\) and \(b\)
• Predicted values
• Residuals

Typical Steps:

1. Enter x-values in List 1 (L1)
2. Enter y-values in List 2 (L2)
3. Access Statistics menu
4. Choose Linear Regression (LinReg)
5. Read values of \(a\), \(b\), and \(r\)

💡 Exam Tip: Most bivariate formulas (correlation coefficient, regression line parameters) are given in the IB formula booklet. Always use your GDC to calculate regression lines and correlation coefficients - it's faster and more accurate! Remember: the line always passes through \((\bar{x}, \bar{y})\), correlation does NOT imply causation, and be careful about extrapolation. For y on x, minimize vertical distances; for x on y, minimize horizontal distances.