Simple Linear Regression

Dr. Mine Dogucu

Data: babies from the openintro package

Rows: 1,236
Columns: 8
$ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
$ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
$ parity    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
$ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
$ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
$ smoke     <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, …

Baby Weights

ggplot(babies, 
       aes(x = gestation, y = bwt)) +
  geom_point()

A scatter plot displays birth weight (bwt) on the y-axis, ranging from approximately 50 to 175, against gestational age (gestation) on the x-axis, ranging from approximately 150 to 350. There is a moderate positive relationship, indicating that birth weight generally increases with gestational age. The majority of data points form a dense cluster between 250-300 for gestation and 100-150 for bwt. Scattered points exist across the range, including some with lower birth weights for both very short and very long gestations.

Baby Weights

ggplot(babies,
       aes(x = gestation, y = bwt)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) 

A prominent blue line is drawn through the data. This line slopes upwards, from approximately (150, 60) to (350, 155), visually summarizing the positive relationship. While the line indicates an average trend, the data points show considerable scatter around it, implying variability in birth weight for a given gestational age.

lm stands for linear model
se stands for standard error; se = FALSE hides the confidence band around the line

Variables

y Response Birth weight Numeric
x Explanatory Gestation Numeric

Linear Equations Review

Recall from your previous math classes \(y = mx + b\)

where \(m\) is the slope and \(b\) is the y-intercept

e.g. \(y = 2x -1\)

A line plot displays a perfectly linear relationship between x and y. The x-axis ranges from 0 to 5, and the y-axis from -1 to 9. Black data points are plotted at integer x-values: (0, -1), (1, 1), (2, 3), (3, 5), (4, 7), and (5, 9). A straight blue line connects these points, indicating a consistent positive slope where y increases by 2 for every unit increase in x.

Notice anything different between baby weights plot and this one?

Linear Equation

Math class

\(y = b + mx\)

\(b\) is y-intercept
\(m\) is slope

Stats class

\(y_i = \beta_0 +\beta_1x_i + \epsilon_i\)

\(\beta_0\) is y-intercept
\(\beta_1\) is slope
\(\epsilon_i\) is error/residual
\(i = 1, 2, \ldots, n\) is the identifier for each point

Linear Model in R

model_g <- lm(bwt ~ gestation, data = babies)

lm stands for linear model. We are fitting a linear regression model.
Note that the variables are entered in y ~ x order. This is also the same order of variables in the linear equation \(y_i = \beta_0 +\beta_1x_i + \epsilon_i\).

Model Results

broom::tidy(model_g)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  -10.1      8.32       -1.21 2.27e- 1
2 gestation      0.464    0.0297     15.6  3.22e-50

\(\hat {y}_i = b_0 + b_1 x_i\)

\(\hat {\text{bwt}_i} = b_0 + b_1 \text{ gestation}_i\)

\(\hat {\text{bwt}_i} = -10.1 + 0.464\text{ gestation}_i\)

Expected bwt for a baby with 300 days of gestation

\(\hat {\text{bwt}_i} = -10.1 + 0.464\text{ gestation}_i\)

\(\hat {\text{bwt}} = -10.1 + 0.464 \times 300\)

\(\hat {\text{bwt}} =\) 129.1

For a baby with 300 days of gestation the expected birth weight is 129.1 ounces.
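
The arithmetic above can be reproduced in R. This is a minimal sketch that uses the rounded coefficients from the model output; the names b0, b1, and predict_bwt are illustrative and do not appear in the slides.

```r
# Rounded estimates copied from the tidy(model_g) output
b0 <- -10.1   # intercept
b1 <- 0.464   # slope for gestation (ounces per day)

# Hypothetical helper: expected birth weight (ounces) for a gestation length (days)
predict_bwt <- function(gestation) {
  b0 + b1 * gestation
}

predict_bwt(300)
```

With the fitted model available, predict(model_g, newdata = data.frame(gestation = 300)) should give a similar value; it uses the unrounded coefficients, so the result may differ slightly from 129.1.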

Interpretation of estimates

The left column shows the earlier scatter plot with the blue regression line.

\(b_1 = 0.464\), which means that for a one-unit (one-day) increase in gestation period, the expected increase in birth weight is 0.464 ounces.

The scatter plot in the left column starts at about 150 days of gestation. The scatter plot in the right column extends the x-axis to 0 days of gestation, and the line is extended to show the y-intercept.

\(b_0 = -10.1\), which means that for a gestation period of 0 days the expected birth weight is -10.1 ounces! (This does NOT make sense.)

Extrapolation

  • There is no such thing as 0 days of gestation.
  • Birth weight cannot possibly be -10.1 ounces.
  • Extrapolation happens when we use a model outside the range of the x-values that are observed. After all, we cannot really know how the model behaves (e.g. the relationship may be non-linear) outside of the scope of what we have observed.
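
One way to guard against extrapolation is to compare a new x-value with the observed range before predicting. The sketch below uses only the first 12 gestation values visible in the data preview, so its range is narrower than that of the full data; the helper is_extrapolating is a hypothetical name introduced here.

```r
# First 12 gestation values shown in the data preview (one is missing)
gestation_obs <- c(284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282)

# Hypothetical helper: TRUE when x_new lies outside the observed range of x_obs
is_extrapolating <- function(x_new, x_obs) {
  obs_range <- range(x_obs, na.rm = TRUE)
  x_new < obs_range[1] | x_new > obs_range[2]
}

# 0 days is outside the observed range; 300 days is inside it
is_extrapolating(c(0, 300), gestation_obs)
```

With the full data, range(babies$gestation, na.rm = TRUE) would supply the observed range.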

Baby number 148

babies |> 
  filter(case == 148) |> 
  select(bwt, gestation)
# A tibble: 1 × 2
    bwt gestation
  <int>     <int>
1   160       300

The scatter plot of the birth weight data with the regression line. The point for baby 148, at gestation = 300 and bwt = 160, is highlighted in red. This point is above the blue line.

Baby #148

Expected

\(\hat y_{148} = b_0 +b_1x_{148}\)

\(\hat y_{148} = -10.1 + 0.464\times300\)

\(\hat y_{148}\) = 129.1

Observed

\(y_{148} =\) 160

Residual for i = 148

A red vertical line connects the red point to the blue line below it.

\(y_{148} = 160\)

\(\hat y_{148}\) = 129.1

\(e_{148} = y_{148} - \hat y_{148}\)

\(e_{148} =\) 30.9
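
The same residual can be computed in R. A minimal sketch using the rounded coefficients from the model output; the variable names are illustrative.

```r
# Observed values for baby 148, taken from the filter() output
y_148 <- 160   # observed birth weight (ounces)
x_148 <- 300   # gestation (days)

# Expected birth weight from the fitted line (rounded coefficients)
y_hat_148 <- -10.1 + 0.464 * x_148

# Residual = observed - expected; positive means the point is above the line
e_148 <- y_148 - y_hat_148
e_148
```

The function residuals(model_g) returns the residuals for all observations used in the fit.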

Least Squares Regression

The goal is to minimize

\[e_1^2 + e_2^2 + ... + e_n^2\]

which can be rewritten as

\[\sum_{i = 1}^n e_i^2\]
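
To see what minimizing this sum means in practice, the sketch below compares the sum of squared residuals of the least squares line with that of two nearby lines. The data are a small made-up example, not the babies data, and sse is a hypothetical helper. The closed-form expressions for the estimates are the standard ones for simple linear regression; they are not shown in the slides.

```r
# Small made-up data set for illustration only (not from babies)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)

# Sum of squared residuals for a candidate line with intercept b0 and slope b1
sse <- function(b0, b1) {
  sum((y - (b0 + b1 * x))^2)
}

# Closed-form least squares estimates for simple linear regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() arrives at the same estimates
fit <- lm(y ~ x)
coef(fit)

# Shifting the intercept or the slope away from the estimates increases the sum
sse(b0, b1)
sse(b0 + 0.5, b1)
sse(b0, b1 + 0.1)
```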

Conditions for Least Squares Regression

  • Linearity

  • Independence

  • Normality of Residuals

  • Equality of Variance (Constant Variance)

Look at the first letter of each condition: LINE

Linearity

Linear

A scatter plot displays a strong positive linear relationship between x and y. The x-axis ranges from -2 to 2, and the y-axis from -1 to 15. Numerous black data points generally increase in y-value as x increases. A prominent blue regression line passes through the center of the data, sloping upwards from approximately (-2, -2) to (2, 13), indicating a clear linear trend despite some moderate scatter of points around the line.

Non-linear

A scatter plot displays data points with x-values from -2 to 2 and y-values from 0 to 20. Numerous black data points show a clear quadratic, U-shaped pattern, with y-values high at x=-2, decreasing to a minimum around x=0 (y-value approximately 5), and then increasing again to high y-values at x=2. A blue linear regression line, sloping slightly upwards, is shown, but it does not fit the U-shaped data well. A black curved line, representing a quadratic fit, accurately follows the U-shaped distribution of the data points, indicating a much better fit than the linear line.

Independence

Harder to check because we need to know how the data were collected.

The description of the dataset says that [a study] considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area.

It is possible that babies born in the same hospital have similar birth weights.

Correlated data (not independent)

Correlated data examples: patients within hospitals, students within schools, people within neighborhoods, time-series data.

Normality

Nearly normal

A density plot shows the distribution of "resid" (residuals) on the x-axis, ranging from approximately -3 to 4. The y-axis represents density, from 0 to 0.4. A black curve illustrates the distribution, which is unimodal with a sharp peak slightly above 0, reaching a density of over 0.4. The distribution is right-skewed, showing a longer tail of positive residuals, and also has a small secondary hump or flattening around "resid" values of 3 to 4.

Not normal

A density plot shows the distribution of "resid" (residuals) on the x-axis, ranging from approximately -7 to 10, with density on the y-axis, ranging from 0 to 0.08. The black curve indicates a bimodal distribution with two main peaks of similar height, both reaching a density of approximately 0.08. The distribution appears right-skewed, with a longer tail extending towards positive "resid" values.

Equal Variance

Equal Variance

A scatter plot displays residuals ("resid") on the y-axis, ranging from -2 to 4, against "x" on the x-axis, ranging from -2 to 2. A horizontal black line is present at "resid" equals 0. Numerous black data points are scattered seemingly randomly above and below this zero line, without any discernible pattern, trend, or systematic curvature. The distribution of residuals appears fairly even across the range of x-values.

Unequal variance

A scatter plot displays residuals ("resid") on the y-axis, ranging from approximately -12 to 10, against "x" on the x-axis, ranging from approximately 0.5 to 4. A horizontal black line is present at "resid" equals 0. The black data points show a clear pattern of increasing spread (heteroscedasticity) as x increases. For smaller x-values (e.g., below 2), residuals are tightly clustered around 0. As x increases towards 4, the residuals fan out, showing greater variability with both larger positive and larger negative values, moving further away from the zero line.
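
Diagnostic plots like the ones described above can be drawn from any fitted model. The sketch below uses simulated data, not the babies data, in which the error spread deliberately grows with x, so the residual plot should fan out. All variable names are illustrative.

```r
set.seed(1)  # simulated data for illustration only, not the babies data

x <- runif(200, min = 0.5, max = 4)
y <- 1 + 2 * x + rnorm(200, sd = x)   # error spread grows with x
fit <- lm(y ~ x)

# Residuals against x: look for curvature (linearity) and fanning (equal variance)
plot(x, resid(fit), ylab = "resid")
abline(h = 0)

# Density of residuals: look for a roughly bell-shaped curve (normality)
plot(density(resid(fit)), main = "Density of residuals")
```

For the model in these slides, resid(model_g) and fitted(model_g) would take the place of resid(fit) and x. Plotting against the fitted values avoids a length mismatch, because rows with missing gestation are dropped from the fit.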

Inference: Confidence Interval (theoretical)

confint(model_g)
                  2.5 %    97.5 %
(Intercept) -26.3915884 6.2632199
gestation     0.4059083 0.5226169

Note that the 95% confidence interval for the slope does not contain zero and all the values in the interval are positive, indicating a significant positive relationship between gestation and birth weight.
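
The interval for the slope can be approximated by hand as estimate ± t* × standard error. The sketch below uses the rounded estimate and standard error from the model output. The degrees of freedom are an assumption (roughly 1,200), because the exact number of rows used in the fit is not shown in the slides.

```r
# Rounded values copied from the tidy(model_g) output
estimate  <- 0.464
std_error <- 0.0297

# ASSUMPTION: about 1,200 residual degrees of freedom; the exact value
# could be read from df.residual(model_g)
df <- 1200

# Critical value for a 95% interval
t_star <- qt(0.975, df)

# Lower and upper bounds
ci <- estimate + c(-1, 1) * t_star * std_error
round(ci, 3)
```

The result should be close to the gestation row of the confint(model_g) output. Small differences come from the rounded inputs and the assumed degrees of freedom.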

Variables

y Response Birth weight Numeric
x Explanatory Smoke Categorical

Notation

\(y_i = \beta_0 +\beta_1x_i + \epsilon_i\)

\(\beta_0\) is y-intercept
\(\beta_1\) is slope
\(\epsilon_i\) is error/residual
\(i = 1, 2, \ldots, n\) is the identifier for each point

Linear Model in R

model_s <- lm(bwt ~ smoke, data = babies)
broom::tidy(model_s)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   123.       0.649    190.   0       
2 smoke          -8.94     1.03      -8.65 1.55e-17

\(\hat {y}_i = b_0 + b_1 x_i\)

\(\hat {\text{bwt}_i} = b_0 + b_1 \text{ smoke}_i\)

\(\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\)

Understanding Slope

\[\hat {\text{bwt}_i} = 123 + (-8.94\text{ smoke}_i)\]


Expected bwt for a baby with a non-smoker mother

\(\hat {\text{bwt}_i} = 123 + (-8.94\times 0)\)

\(\hat {\text{bwt}_i} = 123\)

\(E[\text{bwt}_i \mid \text{smoke}_i = 0] = b_0\)


Expected bwt for a baby with a smoker mother

\(\hat {\text{bwt}_i} = 123 + (-8.94\times 1)\)

\(\hat {\text{bwt}_i} = 114.06\)

\(E[\text{bwt}_i \mid \text{smoke}_i = 1] = b_0 + b_1\)
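
With a 0/1 explanatory variable, the intercept equals the mean of the group coded 0, and the intercept plus the slope equals the mean of the group coded 1. The sketch below checks this on a tiny made-up data set (not the babies data); all values and names are illustrative.

```r
# Made-up values for illustration only (not from babies)
smoke <- c(0, 0, 0, 1, 1, 1)
bwt   <- c(120, 125, 124, 112, 118, 115)

fit <- lm(bwt ~ smoke)
b <- unname(coef(fit))   # b[1] is the intercept, b[2] is the "slope"

# Group means computed directly
mean_nonsmoker <- mean(bwt[smoke == 0])
mean_smoker    <- mean(bwt[smoke == 1])

# Intercept matches the non-smoker mean
c(b0 = b[1], group_mean = mean_nonsmoker)

# Intercept plus slope matches the smoker mean
c(b0_plus_b1 = b[1] + b[2], group_mean = mean_smoker)
```

The slope is therefore the difference between the two group means.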

Confidence Interval

confint(model_s)
                2.5 %     97.5 %
(Intercept) 121.77391 124.320430
smoke       -10.96413  -6.911199

Note that the confidence interval for the “slope” does not contain 0 and all the values in the interval are negative.

We are confident that the parameter \(\beta_1\) is negative. In other words, the expected birth weight is lower for babies of mothers who smoke than for babies of mothers who do not.

Understanding Relationships

  • Just because we observe a significant relationship between \(x\) and \(y\), it does not mean that \(x\) causes \(y\). We need rigorously designed randomized experiments to establish causality (see spurious correlations).

  • Just because we observe a significant relationship in a sample, it does not mean the findings will generalize to the population. We need samples that are representative of the population to generalize the findings.