Linear Regression Review

(AST405) Lifetime data analysis

Authors
Affiliation

0 Linear Regression Models

0.1 Multiple Linear Regression Models

Multiple Linear Regression Models

  • Let {(yi,xi1,,xip),i=1,,n} be the data obtained from the ith subject

    • yi response

    • xij jth independent variable


  • A multiple linear regression model yi=β0+β1xi1++βpxip+ϵij

  • In matrix notation y=xβ+ϵ

    • y=(y1,,yn)

    • x n×(p+1) matrix with first column is a vector of one’s

    • β=(β0,β1,,βp)

    • ϵij error term


  • General assumptions

    • ϵ’s are independent

    • E(ϵij)=0

    • V(ϵij)=σ2

    • ϵijN(0,σ2)


  • The fitted model y^i=β^0+β^1xi1++β^pxip

    • Ordinary least squares or maximum likelihood estimators β^=(xx)1xy

    • Asymptotically β^ follows a normal distribution with mean vector β and variance-covariance matrix (xx)1σ2


  • Residuals ϵ^=yy^

    • Estimate of error variance σ^2=ϵϵnp1

    • Residuals are used for model diagnostics


  • Statistical inference regarding multiple linear regression models are based on t-test, F-test, and chi-square test

H01:βj=0(j=1,,p)H02:β1==βp=0H03:β1==βq=0(q<p)H04:β1==βq(q<p)

0.2 An example with data on inheritance of height

Inheritance of height

  • During 1893–1898 in the UK, K. Pearson (a famous statistician) organized the collection of heights of 1375 mothers aged 65 or less and one of their adult daughters aged 18 or more ()

    • Mother height (x) predictor

    • Daughter height (y) response

Does taller mother tend to have taller daughter?

  • Assumed model “Daughter height on mother height” yi=β0+β1xi+ϵi

Inheritance of height

  • The data heights
library(alr4)
data("Heights")
heights <- as_tibble(Heights)

Scatterplot of mother height and daughter height

ggplot(heights) + 
  geom_point(aes(mheight, dheight), size = .75) +
  theme_bw()

0.3 Model fit

lm()

  • lm() is the most popular R function to fit linear model (with continuous response)

    • A typical syntax of lm() function lm(formula, data), which returns a list
  • For example, codes for fitting the model “Daughter height on mother height”

mod_h <- lm(formula = dheight ~ mheight,
            data = heights)

  • Elements of lm() output object contain useful objects related to the corresponding fit of linear model
names(mod_h)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"        
mod_h$coefficients
(Intercept)     mheight 
  29.917437    0.541747 

  • Elements of lm() output object contain useful objects related to the corresponding fit of linear model
names(summary(mod_h))
 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled" 
summary(mod_h)$coefficients
             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 29.917437 1.62246940 18.43945 5.211879e-68
mheight      0.541747 0.02596069 20.86797 3.216915e-84

  • Output of lm() object can also be used as an argument of some useful functions, such as

    • coefficients() returns the estimates of regression parameters

    • residuals() returns associated residuals (there are different types of residuals, use type argument to specify this)

    • fitted() returns fitted values corresponding to the predictor values of the data

    • anova() returns ANOVA table

    • summary() returns objects related to linear model fits, some of them are not included in the lm() object

    • confint() returns confidence intervals of the regression parameters

0.4 broom package

broom

  • Most of the built-in R objects related to model fits (e.g. lm(), t.test(), etc.) require tidy data as input, but its outputs are messy (not tidy), which cannot be used as input for the methods of tidyverse

    • e.g. lm() returns a list, not a data frame
  • broom package has functions that transform messy data into tidy data, which are used as inputs of different tidyverse functions, such as ggplot(), kable(), etc.


  • broom has three functions that takes model fit object as an argument and returns a tibble (tidy data)

    • glance() returns a signle row summary of the model fit, which contains estimates of coefficient of determination, error variance, etc.

    • tidy() returns different values corresponding to each parameter, such as estimates, t-stat, p-value, etc.

    • augment() returns fitted information corresponding to each observations, e.g. residuals, fitted values, SE of fitted values, etc.

glance()

Single row summary of the model fit

glance(mod_h)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.241         0.240  2.27      435. 3.22e-84     1 -3075. 6156. 6172.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

tidy()

Summary of the parameter estimates

tidy(mod_h)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   29.9      1.62        18.4 5.21e-68
2 mheight        0.542    0.0260      20.9 3.22e-84

augment()

Observation-wise values of model fit

augment(mod_h) %>%
  slice(1:6)
# A tibble: 6 × 8
  dheight mheight .fitted .resid     .hat .sigma .cooksd .std.resid
    <dbl>   <dbl>   <dbl>  <dbl>    <dbl>  <dbl>   <dbl>      <dbl>
1    55.1    59.7    62.3  -7.16 0.00172    2.26 0.00862     -3.16 
2    56.5    58.2    61.4  -4.95 0.00310    2.26 0.00743     -2.19 
3    56      60.6    62.7  -6.75 0.00118    2.26 0.00523     -2.98 
4    56.8    60.7    62.8  -6.00 0.00113    2.26 0.00397     -2.65 
5    56      61.8    63.4  -7.40 0.000783   2.26 0.00418     -3.27 
6    57.9    55.5    60.0  -2.08 0.00707    2.27 0.00303     -0.923

0.5 Model diagnostics

Model diagnostics

  • Residual vs fitted (Independence of errors, Constant variance)

  • Residual vs predictor (Linearity, Zero mean)

  • Q-Q normal plot of residuals (Normality of error)

Residual plots

Scatterplot of fitted values and residuals

ggplot(augment(mod_h)) +
  geom_point(aes(.fitted, .resid), size = .75) +
  labs(x = "fitted", y = "residuals") + 
  theme_minimal()


Scatterplot of predictor and residuals

ggplot(augment(mod_h)) +
  geom_point(aes(mheight, .resid), size = .75) +
  labs(x = "Mother Ht", y = "residuals") + 
  theme_minimal()

Q-Q normal plot

Using base R functions



  • For Q-Q plot, ggplot2 functions stat_qq() and stat_qq_line() can be used
ggplot(augment(mod_h), aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  theme_minimal()

Summary

  • lm() function is used to fit the models

  • Functions of broom package can be used to obtain tidy data from the output of lm() function, which is usually messy

  • Output of functions of broom package can be used as arguments of ggplot and other tidyverse functions

Exercise

  • Using the FEV data (Download the FEV data), fit the following models and perform the model diagnostics

    1. fev on Age

    2. fev on Age and Hgt

0.6 Regression models with categorical predictors

Regression models with categorical predictors

  • In general, for a predictor with two levels, define a dummy or binary variable that takes the values 1 and 0 corresponding to the two levels of the original variable

    • For example, if the variable X has the levels “male” and “female”, we can define a dummy variable D such that D={1if X=male0if X=female

    • The level “female” is considered as the “reference” category in this case

    • Dummy variable can also be defined with “male” as the reference category


  • The model “Y on X” can be expressed in terms of “D” as E(Y|D)=β0+β1D

    • β0=E(Y|D=0)

    • β1=E(Y|D=1)E(Y|D=0)

  • Interpretations of regression parameters depend on the reference category considered


  • For a categorical variable with more that two categories, more than one dummy variables needed to be defined

  • Let X be a categorical variable with three categories, say “poor”, “middle”, and “rich”

    • To consider X as a predictor, two dummy variables need to be defined D1={1if X=poor0otherwiseD2={1if X=middle0otherwise

    • In this case, the category “rich” is considered as the reference category and dummy variables can be defined with other category as the reference


  • The regression model “Y on X”, where X has three categories can be defined as E(Y|D1,D2)=β0+β1D1+β2D2

    • β0=E(Y|D1=0,D2=0)

    • β1=E(Y|D1=1,D2=0)E(Y|D1=0,D2=0)

    • β2=E(Y|D1=0,D2=1)E(Y|D1=0,D2=0)

0.7 Interaction

Regression models with one categorical and one continuous predictors

  • Consider a regression model “Y on X1 and X2”, where X1 has two levels (“male” and “female”) and X2 is continuous (say age in years)

  • Define a dummay variable for X1

    • D1M=I(X1=Male)

Consider the models (1)E(Y|D1M,X2)=β0+β1D1M+β2X2(2)E(Y|D1M,X2)=β0+β1D1M+β2X2+θD1MX2

  • How would you interpret the parameters in model (1)?
  • How would you interpret θ in model (2)?

(2)E(Y|D1M,X2)=β0+β1D1M+β2X2+θD1MX2={β0+β2X2for female(β0+β1+(β2+θ)X2for male

  • β0 mean response of female of age 0

  • β1 difference of mean response between male and female when both of them at age of 0

  • β2 change of mean response for 1 year change in age for female

  • θ difference in the change of the mean response between males and females when their age changes by 1 year

    • it represents how the effect of age differs between males and females

Regression model with two categorical variables

  • Consider a regression model “Y on X1 and X2”, where X1 has two levels (“male” and “female”) and X2 has three levels (“poor”, “middle”, and “rich”)

  • Define the dummy variables for the categorical variables X1 and X2

    • For X1, D1M=I(X1=male)

    • For X2, D21=I(X2=poor) and D22=I(X2=middle)


Model 1

E(Y|D1M,D21,D22)=β0+β1D1M+β21D21+β22D22

  • β0 mean response of rich female individuals

  • β1 mean difference between male and female when X2 is fixed

  • β21 mean difference between poor and rich when X1 is fixed

  • β22 mean difference between middle and rich when X1 is fixed


Model 2

  • The “Model 2” contains both main effects and interactions E(Y|D1M,D21,D22)=β0+β1D1M+β21D21+β22D22+θ1D1MD21+θ2D1MD22

    • β0 mean response of rich female individuals

    • Interpretations of other parameters are complicated!!


  • The following table of expected response would help us to define parameters of “Model 2”

 gender  Poor  Middle  Rich  Male β0+β1+β21+θ1β0+β1+β22+θ2β0+β1 Female β0+β21β0+β22β0

  • Interpret θ’s

  • Difference of mean response between male and female among the “rich” E(Y|Male,Rich)E(Y|Female,Rich)=β1

  • Difference of mean response between male and female among the “middle” E(Y|Male,Middle)E(Y|Female,Middle)=β1+θ1


  • Difference in differences (DID) {E(Y|Male,Middlw)E(Y|Female,Middle)}{E(Y|Male,Rich)E(Y|Female,Rich)}=θ1

  • Interaction term θ1 measures whether the effect of “gender” is the same at the levels “Rich” and “Middle”

  • Similarly, the interaction term θ2 measures whether the effect of “gender” is the same at the levels “Rich” and “Poor”


  • In the presence of significant interactions, the main effects have no interesting interpretations

  • Interaction terms should not be in the model if both the corresponding main effects are not significant

  • Insignificant interaction terms should not be in the model

Acknowledgements

This lecture is adapted from materials created by Mahbub Latif

References

Pearson, Karl, and Alice Lee. 1903. “On the Laws of Inheritance in Man: I. Inheritance of Physical Characters.” Biometrika 2 (4): 357–462.