(Pre-requisite)

(AST405) Lifetime data analysis

Author

Md Rasel Biswas

0 Linear Regression Models

0.1 Multiple Linear Regression Models

Multiple Linear Regression Models

Let ${(y_{i}, x_{i 1}, \dots, x_{i p}), i = 1, \dots, n}$ be the data obtained from the $i^{t h}$ subject
- $y_{i} \to$ response
- $x_{i j} \to$ $j t h$ independent variable

A multiple linear regression model $y_{i} = β_{0} + β_{1} x_{i 1} + \dots + β_{p} x_{i p} + ϵ_{i j}$
In matrix notation $y = x^{'} β + ϵ$
- $y = (y_{1}, \dots, y_{n})^{'}$
- $x \to$ $n \times (p + 1)$ matrix with first column is a vector of one’s
- $β = (β_{0}, β_{1}, \dots, β_{p})^{'}$
- $ϵ_{i j} \to$ error term

General assumptions
- $ϵ$ ’s are independent
- $E (ϵ_{i j}) = 0$
- $V (ϵ_{i j}) = σ^{2}$
- $ϵ_{i j} \sim N (0, σ^{2})$

The fitted model ${\hat{y}}_{i} = {\hat{β}}_{0} + {\hat{β}}_{1} x_{i 1} + \dots + {\hat{β}}_{p} x_{i p}$
- Ordinary least squares or maximum likelihood estimators $\hat{β} = (x^{'} x)^{- 1} x^{'} y$
- Asymptotically $\hat{β}$ follows a normal distribution with mean vector $β$ and variance-covariance matrix $(x^{'} x)^{- 1} σ^{2}$

Residuals $\hat{ϵ} = y - \hat{y}$
- Estimate of error variance ${\hat{σ}}^{2} = \frac{ϵ^{'} ϵ}{n - p - 1}$
- Residuals are used for model diagnostics

Statistical inference regarding multiple linear regression models are based on t-test, F-test, and chi-square test

$\begin{aligned} H_{01} & : β_{j} = 0 (j = 1, \dots, p) \\ H_{02} & : β_{1} = \dots = β_{p} = 0 \\ H_{03} & : β_{1} = \dots = β_{q} = 0 (q < p) \\ H_{04} & : β_{1} = \dots = β_{q} (q < p) \end{aligned}$

0.2 An example with data on inheritance of height

Inheritance of height

During 1893–1898 in the UK, K. Pearson (a famous statistician) organized the collection of heights of 1375 mothers aged 65 or less and one of their adult daughters aged 18 or more (Pearson and Lee 1903)
- Mother height $(x)$ $\to$ predictor
- Daughter height $(y)$ $\to$ response

Does taller mother tend to have taller daughter?

Assumed model “Daughter height on mother height” $y_{i} = β_{0} + β_{1} x_{i} + ϵ_{i}$

Inheritance of height

The data heights

library(alr4)
data("Heights")

Transform Heights from data.frame() to tibble()

heights <- as_tibble(Heights)

Scatterplot of mother height and daughter height

ggplot(heights) + 
  geom_point(aes(mheight, dheight), size = .75) +
  theme_bw()

0.3 Model fit

`lm()`

lm() is the most popular R function to fit linear model (with continuous response)
- A typical syntax of lm() function lm(formula, data), which returns a list
For example, codes for fitting the model “Daughter height on mother height”

mod_h <- lm(formula = dheight ~ mheight,
            data = heights)

Elements of lm() output object contain useful objects related to the corresponding fit of linear model

names(mod_h)

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"

mod_h$coefficients

(Intercept)     mheight 
  29.917437    0.541747

Elements of lm() output object contain useful objects related to the corresponding fit of linear model

names(summary(mod_h))

 [1] "call"          "terms"         "residuals"     "coefficients" 
 [5] "aliased"       "sigma"         "df"            "r.squared"    
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

summary(mod_h)$coefficients

             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 29.917437 1.62246940 18.43945 5.211879e-68
mheight      0.541747 0.02596069 20.86797 3.216915e-84

Output of lm() object can also be used as an argument of some useful functions, such as
- coefficients() returns the estimates of regression parameters
- residuals() returns associated residuals (there are different types of residuals, use type argument to specify this)
- fitted() returns fitted values corresponding to the predictor values of the data
- anova() returns ANOVA table
- summary() returns objects related to linear model fits, some of them are not included in the lm() object
- confint() returns confidence intervals of the regression parameters

0.4 `broom` package

`broom`

Most of the built-in R objects related to model fits (e.g. lm(), t.test(), etc.) require tidy data as input, but its outputs are messy (not tidy), which cannot be used as input for the methods of tidyverse
- e.g. lm() returns a list, not a data frame
broom package has functions that transform messy data into tidy data, which are used as inputs of different tidyverse functions, such as ggplot(), kable(), etc.

broom has three functions that takes model fit object as an argument and returns a tibble (tidy data)
- glance() returns a signle row summary of the model fit, which contains estimates of coefficient of determination, error variance, etc.
- tidy() returns different values corresponding to each parameter, such as estimates, t-stat, p-value, etc.
- augment() returns fitted information corresponding to each observations, e.g. residuals, fitted values, SE of fitted values, etc.

`glance()`

Single row summary of the model fit

glance(mod_h)

# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.241         0.240  2.27      435. 3.22e-84     1 -3075. 6156. 6172.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

`tidy()`

Summary of the parameter estimates

tidy(mod_h)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   29.9      1.62        18.4 5.21e-68
2 mheight        0.542    0.0260      20.9 3.22e-84

`augment()`

Observation-wise values of model fit

augment(mod_h) %>%
  slice(1:6)

# A tibble: 6 × 8
  dheight mheight .fitted .resid     .hat .sigma .cooksd .std.resid
    <dbl>   <dbl>   <dbl>  <dbl>    <dbl>  <dbl>   <dbl>      <dbl>
1    55.1    59.7    62.3  -7.16 0.00172    2.26 0.00862     -3.16 
2    56.5    58.2    61.4  -4.95 0.00310    2.26 0.00743     -2.19 
3    56      60.6    62.7  -6.75 0.00118    2.26 0.00523     -2.98 
4    56.8    60.7    62.8  -6.00 0.00113    2.26 0.00397     -2.65 
5    56      61.8    63.4  -7.40 0.000783   2.26 0.00418     -3.27 
6    57.9    55.5    60.0  -2.08 0.00707    2.27 0.00303     -0.923

0.5 Model diagnostics

Model diagnostics

Residual vs fitted (Independence of errors, Constant variance)
Residual vs predictor (Linearity, Zero mean)
Q-Q normal plot of residuals (Normality of error)

Residual plots

Scatterplot of fitted values and residuals

ggplot(augment(mod_h)) +
  geom_point(aes(.fitted, .resid), size = .75) +
  labs(x = "fitted", y = "residuals") + 
  theme_minimal()

Scatterplot of predictor and residuals

ggplot(augment(mod_h)) +
  geom_point(aes(mheight, .resid), size = .75) +
  labs(x = "Mother Ht", y = "residuals") + 
  theme_minimal()

Q-Q normal plot

Using base R functions

qqnorm(residuals(mod_h))
qqline(residuals(mod_h))

For Q-Q plot, ggplot2 functions stat_qq() and stat_qq_line() can be used

ggplot(augment(mod_h), aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  theme_minimal()

Summary

lm() function is used to fit the models
Functions of broom package can be used to obtain tidy data from the output of lm() function, which is usually messy
Output of functions of broom package can be used as arguments of ggplot and other tidyverse functions

Exercise

Using the FEV data (Download the FEV data), fit the following models and perform the model diagnostics
1. fev on Age
2. fev on Age and Hgt

0.6 Regression models with categorical predictors

Regression models with categorical predictors

In general, for a predictor with two levels, define a dummy or binary variable that takes the values 1 and 0 corresponding to the two levels of the original variable
- For example, if the variable $X$ has the levels “male” and “female”, we can define a dummy variable $D$ such that $D = {\begin{cases} 1 & if X = male \\ 0 & if X = female \end{cases}$
- The level “female” is considered as the “reference” category in this case
- Dummy variable can also be defined with “male” as the reference category

The model “ $Y$ on $X$ ” can be expressed in terms of “ $D$ ” as $E (Y | D) = β_{0} + β_{1} D$
- $β_{0} = E (Y | D = 0)$
- $β_{1} = E (Y | D = 1) - E (Y | D = 0)$
Interpretations of regression parameters depend on the reference category considered

For a categorical variable with more that two categories, more than one dummy variables needed to be defined
Let $X$ be a categorical variable with three categories, say “poor”, “middle”, and “rich”
- To consider $X$ as a predictor, two dummy variables need to be defined $\begin{aligned} D_{1} = {\begin{cases} 1 & if X = poor \\ 0 & otherwise \end{cases} & D_{2} = {\begin{cases} 1 & if X = middle \\ 0 & otherwise \end{cases} \end{aligned}$
- In this case, the category “rich” is considered as the reference category and dummy variables can be defined with other category as the reference

The regression model “ $Y$ on $X$ ”, where $X$ has three categories can be defined as $E (Y | D_{1}, D_{2}) = β_{0} + β_{1} D_{1} + β_{2} D_{2}$
- $β_{0} = E (Y | D_{1} = 0, D_{2} = 0)$
- $β_{1} = E (Y | D_{1} = 1, D_{2} = 0) - E (Y | D_{1} = 0, D_{2} = 0)$
- $β_{2} = E (Y | D_{1} = 0, D_{2} = 1) - E (Y | D_{1} = 0, D_{2} = 0)$

0.7 Interaction

Regression models with one categorical and one continuous predictors

Consider a regression model “ $Y$ on $X_{1}$ and $X_{2}$ ”, where $X_{1}$ has two levels (“male” and “female”) and $X_{2}$ is continuous (say age in years)
Define a dummay variable for $X_{1}$
- $D_{1 M} = I (X_{1} = Male)$

Consider the models $\begin{aligned} (1) E (Y | D_{1 M}, X_{2}) & = β_{0} + β_{1} D_{1 M} + β_{2} X_{2} \\ (2) E (Y | D_{1 M}, X_{2}) & = β_{0} + β_{1} D_{1 M} + β_{2} X_{2} + θ D_{1 M} X_{2} \end{aligned}$

How would you interpret the parameters in model (1)?
How would you interpret $θ$ in model (2)?

$\begin{aligned} (2) E (Y | D_{1 M}, X_{2}) & = β_{0} + β_{1} D_{1 M} + β_{2} X_{2} + θ D_{1 M} X_{2} \\ = {\begin{cases} β_{0} + β_{2} X_{2} & for female \\ (β_{0} + β_{1} + (β_{2} + θ) X_{2} & for male \end{cases} \end{aligned}$

$β_{0} \to$ mean response of female of age 0
$β_{1} \to$ difference of mean response between male and female when both of them at age of 0
$β_{2} \to$ change of mean response for 1 year change in age for female
$θ \to$ difference in the change of the mean response between males and females when their age changes by 1 year
- it represents how the effect of age differs between males and females

Regression model with two categorical variables

Consider a regression model “ $Y$ on $X_{1}$ and $X_{2}$ ”, where $X_{1}$ has two levels (“male” and “female”) and $X_{2}$ has three levels (“poor”, “middle”, and “rich”)
Define the dummy variables for the categorical variables $X_{1}$ and $X_{2}$
- For $X_{1}$ , $D_{1 M} = I (X_{1} = male)$
- For $X_{2}$ , $D_{21} = I (X_{2} = poor)$ and $D_{22} = I (X_{2} = middle)$

Model 1

$E (Y | D_{1 M}, D_{21}, D_{22}) = β_{0} + β_{1} D_{1 M} + β_{21} D_{21} + β_{22} D_{22}$

$β_{0} \to$ mean response of rich female individuals
$β_{1} \to$ mean difference between male and female when $X_{2}$ is fixed
$β_{21} \to$ mean difference between poor and rich when $X_{1}$ is fixed
$β_{22} \to$ mean difference between middle and rich when $X_{1}$ is fixed

Model 2

The “Model 2” contains both main effects and interactions $\begin{aligned} E (Y | D_{1 M}, D_{21}, D_{22}) & = β_{0} + β_{1} D_{1 M} + β_{21} D_{21} + β_{22} D_{22} \\ + θ_{1} D_{1 M} D_{21} + θ_{2} D_{1 M} D_{22} \end{aligned}$
- $β_{0} \to$ mean response of rich female individuals
- Interpretations of other parameters are complicated!!

The following table of expected response would help us to define parameters of “Model 2”

$\begin{array}{llll} gender & Poor & Middle & Rich \\ Male & β_{0} + β_{1} + β_{21} + θ_{1} & β_{0} + β_{1} + β_{22} + θ_{2} & β_{0} + β_{1} \\ Female & β_{0} + β_{21} & β_{0} + β_{22} & β_{0} \end{array}$

Interpret $θ$ ’s

Difference of mean response between male and female among the “rich” $E (Y | Male, Rich) - E (Y | Female, Rich) = β_{1}$
Difference of mean response between male and female among the “middle” $E (Y | Male, Middle) - E (Y | Female, Middle) = β_{1} + θ_{1}$

Difference in differences (DID) $\begin{aligned} {E (Y | Male, Middlw) - E (Y | Female, Middle)} \\ - {E (Y | Male, Rich) - E (Y | Female, Rich)} & = θ_{1} \end{aligned}$
Interaction term $θ_{1}$ measures whether the effect of “gender” is the same at the levels “Rich” and “Middle”
Similarly, the interaction term $θ_{2}$ measures whether the effect of “gender” is the same at the levels “Rich” and “Poor”

In the presence of significant interactions, the main effects have no interesting interpretations
Interaction terms should not be in the model if both the corresponding main effects are not significant
Insignificant interaction terms should not be in the model

Acknowledgements

This lecture is adapted from materials created by Mahbub Latif

References

Pearson, Karl, and Alice Lee. 1903. “On the Laws of Inheritance in Man: I. Inheritance of Physical Characters.” Biometrika 2 (4): 357–462.

0 Linear Regression Models

0.1 Multiple Linear Regression Models

Multiple Linear Regression Models

0.2 An example with data on inheritance of height

Inheritance of height

Does taller mother tend to have taller daughter?

Inheritance of height

Scatterplot of mother height and daughter height

0.3 Model fit

lm()

0.4 broom package

broom

glance()

tidy()

augment()

0.5 Model diagnostics

Model diagnostics

Residual plots

Scatterplot of fitted values and residuals

Scatterplot of predictor and residuals

Q-Q normal plot

Using base R functions

Summary

Exercise

0.6 Regression models with categorical predictors

Regression models with categorical predictors

0.7 Interaction

Regression models with one categorical and one continuous predictors

Regression model with two categorical variables

Model 1

Model 2

Acknowledgements

References

`lm()`

0.4 `broom` package

`broom`

`glance()`

`tidy()`

`augment()`