Bivariate Statistics - Formulas & Concepts
IB Mathematics Analysis & Approaches (SL & HL)
📊 Bivariate Data & Scatter Diagrams
Definition:
Bivariate data consists of paired values from two variables, typically denoted as \((x, y)\). It is used to examine the relationship between the two variables.
Variables:
• Independent variable (x): The explanatory variable (plotted on x-axis)
• Dependent variable (y): The response variable (plotted on y-axis)
Mean Point:
\[(\bar{x}, \bar{y})\]
The least squares regression line always passes through the mean point \((\bar{x}, \bar{y})\)
🔗 Correlation
Types of Correlation:
• Positive correlation: As x increases, y increases
• Negative correlation: As x increases, y decreases
• No correlation: No clear linear relationship
• Strong correlation: Points lie close to a straight line
• Weak correlation: Points are scattered
Important Note:
Correlation does NOT imply causation!
📈 Pearson's Correlation Coefficient (r)
Formula:
\[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}}\]
Given in formula booklet
Range:
\[-1 \leq r \leq 1\]
Interpretation:
• \(r = +1\): Perfect positive linear correlation
• \(r = -1\): Perfect negative linear correlation
• \(r = 0\): No linear correlation
• \(|r|\) close to 1: Strong linear relationship
• \(|r|\) close to 0: Weak or no linear relationship
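The formula above can be sketched directly in Python. This is a minimal illustration using a small hypothetical data set (the values are invented for demonstration, not taken from any exam):

```python
import math

# Hypothetical paired data (x, y) for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Numerator: sum of products of deviations from the means
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
# Denominator: square root of the product of the sums of squared deviations
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 4))  # → 0.7746
```

Here \(|r| \approx 0.77\), which would be described as a moderately strong positive linear correlation.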
💯 Coefficient of Determination
Notation:
\[r^2\]
Interpretation:
• \(r^2\) represents the proportion (or percentage) of variance in y that can be explained by x
• Also called "explained variation"
• Always between 0 and 1 (or 0% to 100%)
• Higher values indicate a better fit
Example:
If \(r = 0.8\), then \(r^2 = 0.64\), meaning 64% of the variance in y is explained by x
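The "explained variation" reading of \(r^2\) can be checked numerically: the variance of the predicted y-values about \(\bar{y}\), as a fraction of the total variance, equals \(r^2\). A sketch with hypothetical data:

```python
# Hypothetical data; demonstrates r^2 = explained variation / total variation
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

r_squared = sxy ** 2 / (sxx * syy)

# Explained variation: squared deviations of predictions about y_bar,
# divided by total squared deviations of observed y about y_bar
a = sxy / sxx                                  # gradient of the y on x line
b = y_bar - a * x_bar
preds = [a * x + b for x in xs]
explained = sum((p - y_bar) ** 2 for p in preds) / syy

print(round(r_squared, 4), round(explained, 4))  # both values agree
```

For this data set both quantities come out to 0.6, i.e. 60% of the variance in y is explained by x.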
📉 Regression Line of y on x
Equation:
\[y = ax + b\]
Also written as \(y = mx + c\) in some contexts
Gradient (Slope):
\[a = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}\]
Alternative form: \(a = r \cdot \frac{\sigma_y}{\sigma_x}\)
Y-intercept:
\[b = \bar{y} - a\bar{x}\]
Key Properties:
• The line always passes through \((\bar{x}, \bar{y})\)
• Minimizes the sum of squared vertical distances (residuals)
• Used to predict y from x
• Called "least squares regression line"
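The booklet formulas for \(a\) and \(b\) can be applied by hand as follows; the check at the end confirms the key property that the line passes through the mean point (data values are hypothetical):

```python
# Hypothetical data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Gradient a and y-intercept b from the formulas above
a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
b = y_bar - a * x_bar

print(f"y = {a:.2f}x + {b:.2f}")              # → y = 0.60x + 2.20
# The regression line passes through the mean point (x_bar, y_bar):
print(abs((a * x_bar + b) - y_bar) < 1e-9)    # → True
```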
📊 Regression Line of x on y (HL)
Equation:
\[x = cy + d\]
Gradient (Slope):
\[c = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(y_i - \bar{y})^2}\]
Alternative form: \(c = r \cdot \frac{\sigma_x}{\sigma_y}\)
X-intercept:
\[d = \bar{x} - c\bar{y}\]
Usage:
Used to predict x from y (minimizes horizontal distances)
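The two regression lines are generally different lines. One useful link between them, which follows from the alternative forms \(a = r\,\sigma_y/\sigma_x\) and \(c = r\,\sigma_x/\sigma_y\), is that the product of the gradients equals \(r^2\). A sketch with hypothetical data:

```python
# Hypothetical data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

a = sxy / sxx              # gradient of the y on x line
c = sxy / syy              # gradient of the x on y line
d = x_bar - c * y_bar      # intercept of the x on y line

# Product of the two gradients equals r^2
r_squared = sxy ** 2 / (sxx * syy)
print(abs(a * c - r_squared) < 1e-9)   # → True
```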
🔍 Interpreting Parameters
For \(y = ax + b\):
• Gradient (a): The rate of change of y with respect to x. For each one-unit increase in x, y changes by \(a\) units
• Y-intercept (b): The predicted value of y when \(x = 0\)
Example Interpretation:
If \(y = 3x + 2\):
• For each additional unit of x, y increases by 3 units
• When x = 0, the predicted y-value is 2
🎯 Making Predictions
Interpolation:
• Prediction made within the range of observed data
• Generally reliable
• Substitute x-value into regression equation to find predicted y-value
Extrapolation:
• Prediction made outside the range of observed data
• May be unreliable - the linear trend may not continue
• Use with caution
📍 Residuals
Definition:
\[\text{Residual} = y_{\text{observed}} - y_{\text{predicted}}\]
Also written as: \(e_i = y_i - \hat{y}_i\)
Properties:
• Residuals represent vertical distances from points to the regression line
• Sum of all residuals = 0
• Positive residual: actual y-value is above the line
• Negative residual: actual y-value is below the line
• The regression line minimizes \(\sum e_i^2\) (sum of squared residuals)
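The residual properties above can be verified numerically: compute \(e_i = y_i - \hat{y}_i\) for each point and check that the residuals sum to zero. A sketch with hypothetical data:

```python
# Hypothetical data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
a = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
    / sum((x - x_bar) ** 2 for x in xs)
b = y_bar - a * x_bar

# Residual e_i = observed y_i minus predicted y-hat_i
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

print([round(e, 2) for e in residuals])   # mix of positive and negative values
print(abs(sum(residuals)) < 1e-9)         # → True: residuals sum to zero
```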
🔢 Using Technology (GDC)
GDC Can Calculate:
• Correlation coefficient \(r\)
• Coefficient of determination \(r^2\)
• Regression line equation \(y = ax + b\)
• Values of \(a\) and \(b\)
• Predicted values
• Residuals
Typical Steps:
1. Enter x-values in List 1 (L1)
2. Enter y-values in List 2 (L2)
3. Access Statistics menu
4. Choose Linear Regression (LinReg)
5. Read values of \(a\), \(b\), and \(r\)
💡 Exam Tip: Most bivariate formulas (correlation coefficient, regression line parameters) are given in the IB formula booklet. Always use your GDC to calculate regression lines and correlation coefficients - it's faster and more accurate! Remember: the line always passes through \((\bar{x}, \bar{y})\), correlation does NOT imply causation, and be careful about extrapolation. For y on x, minimize vertical distances; for x on y, minimize horizontal distances.
