Recap: Basic Statistics
mean (expected value) : the mean of the possible values a random variable can take
  notation : E(X) = \mu
  finitely many outcomes : E(X) = x_1p_1 + ... + x_np_n
  countably infinitely many outcomes : E(X) = \sum_{i=1}^\infty x_ip_i
  random variables with density : E(X) = \int_{-\infty}^\infty xf(x)dx
  properties :
    E(c) = c
    E(aX + b) = aE(X) + b
    E(a_1X_1 + ... + a_nX_n) = a_1E(X_1) + ... + a_nE(X_n)
    E(X_1 \cdot ... \cdot X_n) = E(X_1) \cdot ... \cdot E(X_n) for independent (uncorrelated) X_i
    E(g(X)) = \int_\R g(x)f(x)dx
standard deviation : a measure of the amount of variation of the values of a variable about its mean
  definition : \sigma = \sqrt{Var(X)}
standard error : the standard deviation of its sampling distribution or an estimate of that standard deviation
  definition : se = \frac{\sigma}{\sqrt{n}} for n observations
variance : a measure of how far a set of numbers is spread out from their average value
  notation : Var(X) = \sigma^2
  definition : Var(X) = E(X^2) - E(X)^2
  properties :
    Var(c) = 0
    Var(X + a) = Var(X)
    Var(aX) = a^2Var(X)
    Var(aX \pm bY) = a^2Var(X) + b^2Var(Y) \pm 2ab\;Cov(X,Y)
    Var(X_1 + ... + X_n) = Var(X_1) + ... + Var(X_n) for independent (uncorrelated) X_i
covariance : a measure of the joint variability of two random variables
  definition : Cov(X,Y) = E(XY) - E(X)E(Y)
  positively correlated variables : Cov(X,Y) > 0
  negatively correlated variables : Cov(X,Y) < 0
  uncorrelated variables : Cov(X,Y) = 0
  independent variables : Cov(X,Y) = 0 (but not the other way around: zero covariance does not imply independence!)
  properties :
    Cov(X,X) = Var(X)
    Cov(X,Y) = Cov(Y,X)
    Cov(X,c) = 0
    Cov(aX, bY) = ab\;Cov(X,Y)
    Cov(X+a, Y+b) = Cov(X,Y)
    Cov(aX + bY, cW + dV) = ac\;Cov(X,W) + ad\;Cov(X,V) + bc\;Cov(Y,W) + bd\;Cov(Y,V)
correlation : any statistical relationship, whether causal or not, between two random variables
  definition : Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}
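sketch in R (not from the notes; data and the constants a = 2, b = 3 are illustrative) : a quick check of the rule Var(aX \pm bY) = a^2Var(X) + b^2Var(Y) \pm 2ab\;Cov(X,Y) on simulated data

  # check Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X,Y)
  set.seed(1)
  n <- 1e5
  x <- rnorm(n)
  y <- 0.5 * x + rnorm(n)        # correlated with x
  a <- 2; b <- 3
  lhs <- var(a * x + b * y)
  rhs <- a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)
  c(lhs = lhs, rhs = rhs)        # equal: the identity also holds exactly for sample variances/covariances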
Introduction
regression : creates a functional relationship between a response (dependent) variable and a set of explanatory (predictor) variables (covariates)
regression model : which explanatory variables have an effect on the response?
deterministic relationship : a certain input will always lead to the same result
parameter : an unknown constant, most likely to be estimated by collecting and using data
empirical model : any kind of model based on empirical observations rather than on mathematically describable (theory-based) relationships of the system modelled
controlled experiment : one where the experimenter can set the values of the explanatory variable(s)
line definition (linear model) : y = \beta_0 + \beta_1 x (+ \epsilon)
  \beta_0, \beta_1 : constants (parameters)
  intercept \beta_0 : y when x = 0
  slope \beta_1 : change in y if x is increased by 1 unit
  \epsilon : random disturbance (error)
  \beta_0 + \beta_1 x : deterministic
  \epsilon : random, models variability in measurements around the regression line
  linear in \beta_0 and \beta_1
for each experiment : y_i = \beta_0 + \beta_1 x_i + \epsilon_i
  input and result : (x_i, y_i)
  \beta_0, \beta_1 remain constant
  x_i, \epsilon_i vary per experiment i = 1,2,...,n
  mean E(\epsilon_i) = 0
  variance Var(\epsilon_i) = \sigma^2
  \epsilon_i, \epsilon_j independent random variables for i \neq j
  x_i deterministic (i.e. the input data is clearly and certainly defined; it can also be noisy, in which case x_i is not deterministic)
  \implies y_i random variable; y_i, y_j independent for i \neq j
  mean E(y_i) = E(\beta_0 + \beta_1 x_i + \epsilon_i) = \beta_0 + \beta_1 x_i + \underbrace{E(\epsilon_i)}_0 = \beta_0 + \beta_1 x_i
  variance Var(y_i) = \sigma^2
  unexplained variability \sigma
general : y = \mu + \epsilon
  deterministic component \mu = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p
  explanatory variables x_1, ..., x_p (assume fixed, measured without error)
  \beta_i, i = 1,2,...,p : change in \mu when changing x_i by one unit while keeping all other explanatory variables the same
  E(y) = \mu, Var(y) = \sigma^2
  linearity : the derivatives of \mu with respect to the parameters \beta_i do not depend on the variables
notation : x_{ij} for the i-th unit (i.e. row in a table) and the j-th explanatory variable (i.e. column in a table) (R: table[i,j])
dependent variable : depends on an independent variable
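sketch in R (not from the notes; all parameter values are arbitrary choices for illustration) : simulating data from y_i = \beta_0 + \beta_1 x_i + \epsilon_i with \epsilon_i \sim N(0, \sigma^2)

  set.seed(42)
  n     <- 50
  beta0 <- 1.5; beta1 <- 0.8; sigma <- 0.5   # "true" parameters (unknown in practice)
  x     <- seq(0, 10, length.out = n)        # fixed inputs, chosen by the experimenter
  eps   <- rnorm(n, mean = 0, sd = sigma)    # random disturbances
  y     <- beta0 + beta1 * x + eps           # observed responses
  plot(x, y)                                 # scatter plot of the simulated experiments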
Simple Linear Regression
simple linear regression model : y = \mu + \epsilon
  mean E(y) = \mu = \beta_0 + \beta_1 x
  one predictor (regressor) variable x
  one response variable y
  random error \epsilon
for n pairs of observations (x_i, y_i) : y_i = \beta_0 + \beta_1 x_i + \epsilon_i,\;\; i = 1, ..., n
  x_i not random (can be selected by experimenter)
  \epsilon_i \sim N(0, \sigma^2)
  y_i \sim N(\mu_i, \sigma^2), where \mu_i = \beta_0 + \beta_1 x_i
  E(\epsilon_i) = 0
  E(y_i) = \mu_i = \beta_0 + \beta_1 x_i
  Var(\epsilon_i) = \sigma^2
  Var(y_i) = \sigma^2
  Cov(\epsilon_i, \epsilon_j) = 0 for i \neq j
  any two observations y_i, y_j are independent for i \neq j
goal : estimate \beta_0, \beta_1, \sigma^2 from available data (x_i, y_i)
zero slope \implies absence of linear association
unbiased parameter estimate : E(\hat\theta) = \theta
biased parameter estimate : E(\hat\theta) \neq \theta
least squares estimation (LSE) : a mathematical procedure for finding the best-fitting curve to a given set of points by minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve
  goal : minimize \sum_{i=1}^n (y_i - \hat y_i)^2, where \hat y_i = \hat \beta_0 + \hat \beta_1 x_i (fitted value)
  LSE \hat \beta_1 = \frac{\sum_{i=1}^n (x_i - \overline x)(y_i - \overline y)}{\sum_{i=1}^n (x_i - \overline x)^2} = \frac{s_{xy}}{s_{xx}}
    s_{xy} = \sum_{i=1}^n (x_i - \overline x)(y_i - \overline y)
    s_{xx} = \sum_{i=1}^n (x_i - \overline x)(x_i - \overline x)
    (!) reordering using \sum_{i=1}^n (x_i - \overline x) = 0 :
      \hat \beta_1 = \frac{\sum_{i=1}^n (x_i - \overline x) y_i}{\sum_{i=1}^n (x_i - \overline x)^2}
      s_{xx} = \sum_{i=1}^n x_i(x_i - \overline x)
  LSE \hat \beta_0 = \overline y - \hat \beta_1 \overline x
  LSE s^2 = \frac1{n-2}\sum_{i=1}^n (y_i - \hat y_i)^2
    short : s^2 = \frac{\sum_{i=1}^n e_i^2}{n-2}
    residual e_i = y_i - \hat y_i
    degree of freedom : number of independent observations (n) minus the number of estimated parameters (here 2, \beta_0 and \beta_1)
  sample mean \overline x = \frac1n\sum_{i=1}^n x_i
  result mean \overline y = \frac1n\sum_{i=1}^n y_i
  \overline y = \hat\beta_0 + \hat\beta_1 \overline x
  E(\hat \beta_1) = \beta_1
  E(\hat \beta_0) = \beta_0
  E(\overline y) = \beta_0 + \beta_1 \overline x
  E(s^2) = \sigma^2
  Var(\hat \beta_1) = \frac{\sigma^2}{s_{xx}}
  Var(\hat \beta_0) = \sigma^2\left(\frac1n + \frac{\overline x^2}{s_{xx}}\right)
  se(\hat \beta_1) = \frac{s}{\sqrt{s_{xx}}}
maximum likelihood estimation (MLE) : a method of estimating the parameters of an assumed probability distribution by maximizing a likelihood function
  \hat \sigma^2 = \frac1n\sum_{i=1}^n(y_i - \hat y_i)^2 (biased!)
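sketch in R (reusing the simulated x and y from the sketch above, or any numeric vectors of equal length) : the least squares estimates computed by hand and checked against lm()

  sxx <- sum((x - mean(x))^2)
  sxy <- sum((x - mean(x)) * (y - mean(y)))
  b1  <- sxy / sxx                      # \hat\beta_1 = s_xy / s_xx
  b0  <- mean(y) - b1 * mean(x)         # \hat\beta_0 = ybar - \hat\beta_1 * xbar
  e   <- y - (b0 + b1 * x)              # residuals
  s2  <- sum(e^2) / (length(y) - 2)     # unbiased estimate of sigma^2
  se_b1 <- sqrt(s2 / sxx)               # standard error of \hat\beta_1
  fit <- lm(y ~ x)
  coef(fit)                             # matches c(b0, b1)
  summary(fit)$sigma^2                  # matches s2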
null hypothesis testing : a method of statistical inference used to decide whether the data sufficiently supports a particular hypothesis
t-test : a statistical test used to test whether the difference between the response of two groups is statistically significant or not (here: two-sided)
  H_0: \beta_1 = 0 vs. H_A: \beta_1 \neq 0 (\leq or >)
  T = \frac{\hat \beta_1}{se(\hat \beta_1)} \sim t_{n-2}
  \alpha usually 0.05
  quantile approach : reject H_0 if |T| > t_{n-2, 1-\alpha/2}
  probability approach : reject H_0 if the p-value is less than \alpha
  p-value : the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct
    R : 2 * pt(abs(tval), df, lower.tail = FALSE)
    the lower the p-value, the more far-fetched the null hypothesis is
confidence interval : an interval which is expected to typically contain the parameter being estimated
  100(1-\alpha)\% confidence interval for \beta_1 : \hat \beta_1 \pm t_{n-2, 1-\alpha/2} \cdot se(\hat\beta_1)
  general : \text{Estimate} \pm \text{(t value)(standard error of estimate)}
prediction of a new point : y_p = \hat \beta_0 + \hat \beta_1 x + \epsilon, where \epsilon \sim N(0, \sigma^2)
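sketch in R (continuing the lm() fit from the sketch above; x0 = 5 is an arbitrary new input) : test, confidence interval and prediction in one place

  summary(fit)$coefficients       # estimates, standard errors, t values, p-values
  confint(fit, level = 0.95)      # 95% confidence intervals for beta_0 and beta_1
  x0 <- 5                         # a new input value
  predict(fit, newdata = data.frame(x = x0),
          interval = "prediction", level = 0.95)   # prediction interval for y_p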
analysis of variance (ANOVA) : a collection of statistical models and their associated estimation procedures used to analyze the differences between groups
  SST = SSR + SSE
  ANOVA table :
    Source           | d.f. | SS (Sum of Squares)                           | MS (Mean Square) | F
    Regression       | 1    | SSR = \sum_{i=1}^n(\hat y_i - \overline y)^2  | MSR = SSR        | \frac{MSR}{MSE}
    Residual (Error) | n-2  | SSE = \sum_{i=1}^n(y_i - \hat y_i)^2          | MSE = s^2        |
    Total            | n-1  | SST = \sum_{i=1}^n(y_i - \overline y)^2       |                  |
F-test : any statistical test used to compare the variances of two samples or the ratio of variances between multiple samples
  H_0: \beta_1 = 0 vs. H_A: \beta_1 \neq 0
  T \sim t_v \implies T^2 \sim F_{1,v}
  quantile approach : reject H_0 if F > F_{1,n-2,1-\alpha} (qf(1 - alpha, 1, n - 2))
  probability approach : reject H_0 if P(f > F) < \alpha, where f \sim F_{1,n-2}
coefficient of determination : the proportion of the variation in the dependent variable that is predictable from the independent variable(s)
  R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
  interpretation : 100 R^2 \% of the variation in y can be explained by x
  0 \leq R^2 \leq 1
  the better the linear regression fits the data in comparison to the simple average \overline y, the closer the value of R^2 is to 1
Pearson correlation : a correlation coefficient that measures linear (!) correlation between two variables x, y
  r = \text{sign}(\hat \beta_1) \sqrt{R^2}
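sketch in R (same fit as above) : the ANOVA decomposition, F statistic and R^2, both built in and by hand

  anova(fit)                      # SSR, SSE with their d.f., MSR, MSE and the F statistic
  summary(fit)$r.squared          # R^2 = SSR / SST
  yhat <- fitted(fit)             # by hand, for comparison
  SSR  <- sum((yhat - mean(y))^2)
  SSE  <- sum((y - yhat)^2)
  SST  <- SSR + SSE
  c(R2 = SSR / SST, F = (SSR / 1) / (SSE / (length(y) - 2)))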
in R ...
  cov(x,y) returns \frac1{n-1}\sum_{i=1}^n(x_i-\overline x)(y_i - \overline y)
  var(x) returns \frac1{n-1}\sum_{i=1}^n(x_i-\overline x)^2
  cor(x,y) returns \frac{Cov(x,y)}{sd(x)sd(y)}
  sd(x) returns sqrt(var(x))
diagnostics : y \sim N(\beta_0 + \beta_1 x, \sigma^2)
  independence : told by investigator
  linearity : plot y against x
  constant variance : plot y - \hat y against \hat y
  normal distribution : plot y - \hat y against normal quantiles
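sketch in R (same fit as above) : the corresponding diagnostic plots in base R

  res <- resid(fit)
  plot(x, y)                      # linearity: response against the predictor
  abline(fit)
  plot(fitted(fit), res)          # constant variance: residuals against fitted values
  abline(h = 0, lty = 2)
  qqnorm(res); qqline(res)        # normality: residuals against normal quantiles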
Recap: Matrix Algebra
matrix : a rectangular array of numbers
  formally : A \in p \times q (matrix A with p rows and q columns)
  A = (a_{ij}), where a_{ij} is the entry in row i and column j
square matrix : same number of rows and columns (p = q)
  det(AB) = det(A)det(B) (also written as |AB| = |A||B|)
identity matrix I : square matrix with ones in the diagonal and zeros everywhere else
zero matrix O : matrix of all zeros
diagonal (square) matrix : all entries outside the diagonal are zero
(column) vector : a matrix consisting of a single column
  formally : \boldsymbol{x} \in p \times 1
  \boldsymbol{x} = (x_i)
  elements : x_1, ..., x_p
unit vector \boldsymbol{1} : vector with all elements equal to one
zero vector \boldsymbol{0} : vector with all elements equal to zero
Operations and Special Types
matrix / vector addition : element-wise, same dimensions
matrix / vector multiplication : for A \in p \times q,\;\;B \in q \times t, go through each row of the first matrix and multiply and add its elements with the elements of each column of the second matrix; that gives one complete row of the result matrix
  formally : C = AB = (c_{ij}) \in p \times t with c_{ij} = \sum_{r=1}^q a_{ir}b_{rj}
  (AB)C = A(BC)
  (A+B)C = AC + BC
  A(B+C) = AB + AC
matrix transposition : interchange rows and columns
  formally : A' = (a_{ji}) \in q \times p for A \in p \times q
symmetric matrix : A = A'
  (A+B)' = A' + B'
  (A')' = A
  (cA)' = cA'
  (AB)' = B'A' for A \in m \times n, B \in n \times p
(inner) vector product : multiply element-wise, then add all together \to scalar
  formally : \boldsymbol{x'y} = \sum_{i=1}^p x_iy_i
orthogonal vectors : inner product 0
euclidean norm (length) : ||\boldsymbol{x}|| = \sqrt{\boldsymbol{x'x}}
set of linearly dependent vectors : there exist scalars c_i, not all simultaneously zero, such that c_1\boldsymbol{x_1} + ... + c_k\boldsymbol{x_k} = 0
  (!) at least one vector can be written as a linear combination of the remaining ones (for example, a column in a matrix is the sum of two other columns)
linearly independent : otherwise
matrix rank : largest number of linearly independent columns (or rows)
nonsingular matrix : square matrix with rank equal to its row / column number
  formally : A \in m \times m, \;\; rank(A) = m
matrix inverse : AA^{-1} = A^{-1}A = I
  ABB^{-1}A^{-1} = I
  (A^{-1})' = (A')^{-1}
  (\lambda A)^{-1} = \frac1\lambda A^{-1}
  for nonsingular matrices : (AB)^{-1} = B^{-1}A^{-1}
orthogonal (square) matrix : AA' = A'A = I
  A' = A^{-1}
  the rows (columns) are mutually orthogonal
  the length of the rows (columns) is one
  det(A) = \pm 1
trace of a (square) matrix : the sum of its diagonal elements
  formally : tr(A) = \sum_{i=1}^m a_{ii}
  tr(A) = tr(A')
  tr(A+B) = tr(A) + tr(B)
  tr(CDE) = tr(ECD) = tr(DEC) for conformable matrices C, D, E (matrices s.t. the products are defined)
  tr(c) = c
  bonus : E(tr(\cdot)) = tr(E(\cdot))
idempotent (square) matrix : AA = A
  det(A) = 0 or 1
  rank(A) = tr(A)
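sketch in R (the 2x2 example matrix is arbitrary) : the corresponding base-R matrix operations

  A <- matrix(c(2, 1, 0, 3), nrow = 2)   # 2 x 2 example matrix
  B <- diag(2)                           # 2 x 2 identity matrix
  A %*% B                                # matrix multiplication
  t(A)                                   # transpose A'
  solve(A)                               # inverse A^{-1}
  det(A)                                 # determinant
  sum(diag(A))                           # trace tr(A)
  qr(A)$rank                             # rank of A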
Simple Regression (Matrix)
simple regression (matrix approach) : \boldsymbol{y = X\beta + \epsilon}
  \boldsymbol{y, \epsilon} are (n \times 1) random vectors
  \boldsymbol X is an (n \times 2) matrix (first column ones, second column x_i)
  LSE \boldsymbol{\hat \beta = (X'X)^{-1}X'y}
  fitted value vector \boldsymbol{\hat y = X \hat \beta}
  residual vector \boldsymbol{e = y - \hat y = y - X \hat \beta}
  LSE s^2 = \frac1{n-2}\boldsymbol{e'e}
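sketch in R (reusing the x and y vectors from the earlier sketches) : the matrix form of the LSE, checked against lm()

  X    <- cbind(1, x)                         # n x 2 design matrix: intercept column and x
  beta <- solve(t(X) %*% X) %*% t(X) %*% y    # (X'X)^{-1} X'y
  yhat <- X %*% beta                          # fitted values
  e    <- y - yhat                            # residuals
  s2   <- as.numeric(t(e) %*% e) / (nrow(X) - 2)
  cbind(beta, coef(lm(y ~ x)))                # same estimates as lm()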
random vector : vector \boldsymbol y of random variables
  mean (expected value) E(\boldsymbol y) = (E(y_1), ..., E(y_n))' = \boldsymbol \mu (non-random vector)
    E(y_i) = \mu_i
  for a random matrix : E(Y) = \begin{pmatrix} E(y_{11}) & \dots & E(y_{1n}) \\ \vdots & \ddots & \vdots \\ E(y_{n1}) & \dots & E(y_{nn}) \end{pmatrix} (non-random matrix)
  properties : a scalar constant, \boldsymbol b vector of constants, \boldsymbol y random vector, A matrix of constants...
    E(a\boldsymbol{y + b}) = aE(\boldsymbol y) + \boldsymbol b
    E(A \boldsymbol y) = A \; E(\boldsymbol y)
    E(\boldsymbol y' A) = E(\boldsymbol y)' A
    Var(A \boldsymbol y) = A \; Var(\boldsymbol y) \, A'
    if \boldsymbol y is normally distributed, so is A \boldsymbol y
  covariance matrix \Sigma : diagonal elements Var(y_i), off-diagonal elements Cov(y_i, y_j)
    formally : Var(Y) = \Sigma = \begin{pmatrix} E((y_1 - \mu_1)(y_1 - \mu_1)) & \dots & E((y_1 - \mu_1)(y_n - \mu_n)) \\ \vdots & \ddots & \vdots \\ E((y_n - \mu_n)(y_1 - \mu_1)) & \dots & E((y_n - \mu_n)(y_n - \mu_n)) \end{pmatrix} = E((\boldsymbol{y-\mu})(\boldsymbol{y-\mu})')
    \Sigma symmetric, because Cov(y_i, y_j) = Cov(y_j, y_i)
    \Sigma diagonal if the observations y_i are independent, because Cov(y_i, y_j) = 0 for i \neq j
Multiple Regression
general linear model : y = \beta_0 + \beta_1 x_1 + ... + \beta_p x_p + \epsilon
  response variable y
  several independent (predictor, explanatory) variables x_i
n cases, p predictor values : y_i = \beta_0 + \beta_1x_{i1} + ... + \beta_p x_{ip} + \epsilon_i = \mu_i + \epsilon_i
  x_{ij} : value of the j-th predictor variable of the i-th case
  y_1, ..., y_n independent, normally distributed, y_i \sim N(\mu_i, \sigma^2)
  \mu_i non-random (deterministic)
  E(\epsilon_i) = 0
  E(y_i) = \mu_i = \beta_0 + \beta_1 x_{i1} + ... + \beta_p x_{ip}
  Var(\epsilon_i) = \sigma^2
  Var(y_i) = \sigma^2
vector form : \boldsymbol y = X \boldsymbol \beta + \boldsymbol \epsilon
  \boldsymbol y = (y_1, ..., y_n)', \boldsymbol y \sim N(X \boldsymbol \beta, \sigma^2 I), E(\boldsymbol y) = X \boldsymbol \beta, Var(\boldsymbol y) = \sigma^2 I
  X = \begin{pmatrix} 1 & x_{11} & \dots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{pmatrix} fixed, non-random, full rank
  \boldsymbol \beta = (\beta_0, ..., \beta_p)'
  \boldsymbol \epsilon = (\epsilon_1, ..., \epsilon_n)', \boldsymbol \epsilon \sim N(\boldsymbol 0, \sigma^2 I), E(\boldsymbol \epsilon) = \boldsymbol 0, Var(\boldsymbol \epsilon) = \sigma^2 I
LSE : \boldsymbol{\hat\beta} = (X'X)^{-1}X'\boldsymbol y
  fitted values \boldsymbol{\hat y} = X\boldsymbol{\hat \beta} = H\boldsymbol y
    H = X(X'X)^{-1}X' \in n \times n
    H is the orthogonal projection of \boldsymbol y onto the linear space spanned by the column vectors of X
    H symmetric (H' = H)
    H idempotent (HH = H)
    E(\boldsymbol{\hat y}) = X \boldsymbol \beta
    Var(\boldsymbol{\hat y}) = \sigma^2 H
  residuals \boldsymbol e = \boldsymbol y - \boldsymbol{\hat y} = (I-H)\boldsymbol y
    (I-H) projects \boldsymbol y onto the space perpendicular to the linear space spanned by the column vectors of X
    (I-H) symmetric ((I-H)' = (I-H))
    (I-H) idempotent ((I-H)(I-H) = (I-H))
    rearranged : \boldsymbol y = \boldsymbol{\hat y} + \boldsymbol e = H\boldsymbol y + (I-H)\boldsymbol y
    E(\boldsymbol e) = \boldsymbol 0
    Var(\boldsymbol e) = \sigma^2(I-H)
  E(\boldsymbol{\hat \beta}) = \boldsymbol \beta
    E(\hat \beta_i) = \beta_i
  Var(\boldsymbol{\hat \beta}) = \sigma^2(X'X)^{-1}
    Var(\hat \beta_i) = \sigma^2 v_{ii}, where v_{ii} is the corresponding diagonal element of (X'X)^{-1}
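sketch in R (x1, x2 and the response yy are simulated purely for illustration) : the hat matrix and the coefficient covariance for a small multiple regression

  set.seed(7)
  n  <- 30
  x1 <- rnorm(n); x2 <- rnorm(n)
  yy <- 1 + 2 * x1 - x2 + rnorm(n)
  X  <- cbind(1, x1, x2)                     # n x (k+1) design matrix
  H  <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix
  all.equal(H %*% H, H)                      # idempotent
  all.equal(t(H), H)                         # symmetric
  fit2 <- lm(yy ~ x1 + x2)
  vcov(fit2)                                 # estimate of sigma^2 (X'X)^{-1}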
LSE : s^2 = \frac{SSE}{n-k-1} = \frac1{n-k-1}\sum_{i=1}^n(y_i - \hat y_i)^2 (for k predictors, not including the intercept!)
ANOVA table :
  Source           | d.f.  | SS (Sum of Squares)                           | MS (Mean Square)              | F
  Regression       | k     | SSR = \sum_{i=1}^n(\hat y_i - \overline y)^2  | MSR = \frac{SSR}k             | \frac{MSR}{MSE}
  Residual (Error) | n-k-1 | SSE = \sum_{i=1}^n(y_i - \hat y_i)^2          | MSE = \frac{SSE}{n-k-1} = s^2 |
  Total            | n-1   | SST = \sum_{i=1}^n(y_i - \overline y)^2       |                               |
alternative ANOVA calculations :
  SST = \boldsymbol{y'y} - n\overline y^2
  SSE = \boldsymbol{y'y} - \boldsymbol{\hat \beta'}X'X\boldsymbol{\hat\beta}
  SSR = SST - SSE = \boldsymbol{\hat \beta'}X'X\boldsymbol{\hat\beta} - n\overline y^2
multiple R^2 : "usefulness" of the regression...
  R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} (variation due to regression over total variation)
  adding a variable to a model increases the regression sum of squares, and hence R^2
  if adding a variable only marginally increases R^2, it might cast doubt on its inclusion in the model
F-test : H_0: \beta_1 = ... = \beta_k = 0 vs. H_A: at least one \beta_j \neq 0
  alternative : H_{restrict}: E(y) = \beta_0 vs. H_{full}: E(y) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
  F = \frac{MSR}{MSE} \sim F_{k, n-k-1} (bottom of the R output)
  quantile approach : reject H_0 if F > F_{k,n-k-1,1-\alpha}
  probability approach : reject H_0 if P(F_{random} > F) < \alpha, where F_{random} \sim F_{k,n-k-1}
t-test : H_0: \beta_j = 0 vs. H_A: \beta_j \neq 0
  t = \frac{\hat \beta_j}{se(\hat\beta_j)} \sim t_{n-k-1}
  reject H_0 if 2\cdot P(T>|t|) < \alpha, where T \sim t_{n-k-1}
  100(1-\alpha)\% confidence interval for \beta_j : \hat \beta_j \pm t_{n-k-1, 1-\alpha/2}\cdot se(\hat\beta_j)
linear combination of coefficients : for when we want to estimate a result with given predictors
  example (book) : estimating the avg. formaldehyde concentration in homes with UFFI (x_1 = 1) and airtightness 5 (x_2 = 5)
  \theta = \beta_0 + \beta_1 + 5 \beta_2 = \boldsymbol{a'\beta} with \boldsymbol{a'} = (1,1,5)
  estimate : \hat \theta = \boldsymbol{a'\hat\beta} = (1,1,5)\begin{pmatrix}31.37\\9.31\\2.85\end{pmatrix} = 54.96
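sketch in R (continuing fit2 from the sketch above; the weight vector a is illustrative) : estimating \theta = \boldsymbol{a'\beta} and its standard error \sqrt{\boldsymbol{a'}\,\widehat{Var}(\boldsymbol{\hat\beta})\,\boldsymbol a}

  a         <- c(1, 1, 5)                       # weights for (intercept, x1, x2)
  theta_hat <- sum(a * coef(fit2))              # a' beta_hat
  se_theta  <- sqrt(t(a) %*% vcov(fit2) %*% a)  # sqrt(a' Var(beta_hat) a)
  c(estimate = theta_hat, se = as.numeric(se_theta))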
additional sum of squares principle (linear hypotheses) : testing simultaneous statements about several parameters
  example :
    full model : y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \epsilon
    restrictions : each restriction is one equation equaling 0
      \beta_1 = 2\beta_2 (or \beta_1 - 2\beta_2 = 0)
      \beta_3 = 0
    matrix form : matrix A \in a \times (k+1) has one row for each restriction and one column per parameter (+ full rank)
      \begin{pmatrix}0 & 1 & -2 & 0 \\ 0 & 0 & 0 & 1\end{pmatrix} \begin{pmatrix}\beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3\end{pmatrix} = \begin{pmatrix}0\\0\end{pmatrix}
    hypothesis : H_0: A\boldsymbol\beta = \boldsymbol 0 vs. H_A: at least one of these restrictions does not hold
    alternative : H_{restrict}: \mu = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 vs. H_{full}: \mu = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5 + \beta_6x_6
    restricted model : y = \beta_0 + \beta_2(2x_1 + x_2) + \epsilon
  additional sum of squares : SSE_{restrict} - SSE_{full}
    (!) for \mu = \beta_0 : SSE_{restrict} = SST
  test statistic : F = \frac{(SSE_{restrict}-SSE_{full})/a}{SSE_{full} / (n-k-1)} \sim F_{a,n-k-1} for a rows in A, k parameters, n observations
  reject H_0 if p-value < \alpha
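sketch in R (formulas are placeholders, reusing yy, x1, x2 from the earlier multiple-regression sketch) : the additional sum of squares F-test is the comparison of two nested lm() fits

  fit_full     <- lm(yy ~ x1 + x2)     # full model
  fit_restrict <- lm(yy ~ 1)           # restricted model, here mu = beta_0
  anova(fit_restrict, fit_full)        # F = ((SSE_r - SSE_f)/a) / (SSE_f/(n-k-1)) and its p-value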
Specification
one-sample problem : y_i = \beta_0 + \epsilon_i
  y_1, ..., y_n observations taken under uniform conditions from a stable model with mean level \beta_0
  E(y_i) = \beta_0
  E(\boldsymbol y) = X \boldsymbol\beta
    \boldsymbol y = (y_1, ..., y_n)'
    X = (1,...,1)'
    \boldsymbol \beta = \beta_0
  \hat\beta_0 = \overline y
  \hat\sigma^2 = s^2 = \frac{s_{yy}}{n-1} = \frac{\sum_{i=1}^n(y_i-\overline y)^2}{n-1}
  SSE = SST
  in R : lm(y~1)
two-sample problem : y_i = \begin{cases}\beta_1 + \epsilon_i & i = 1,2,...,m \\ \beta_2 + \epsilon_i & i = m+1, ..., n\end{cases}
  y_1, ..., y_m taken under one set of conditions (standard process), mean \beta_1
  y_{m+1}, ..., y_n taken under another set of conditions (new process), mean \beta_2
  alternative : y_i = \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i
    E(y_i) = \beta_1x_{i1} + \beta_2x_{i2}
    x_{i1}, x_{i2} indicator variables
      x_{i1} = \begin{cases}1 & i = 1,2,...,m \\ 0 & i = m+1,...,n\end{cases}
      x_{i2} = \begin{cases}0 & i = 1,2,...,m \\ 1 & i = m+1,...,n\end{cases}
    E\begin{pmatrix}y_1 \\ \vdots \\ y_m \\ y_{m+1} \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix}1 \\ \vdots \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \beta_1 + \begin{pmatrix}0 \\ \vdots \\ 0 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \beta_2
    matrix form E(\boldsymbol y) = X \boldsymbol \beta
      X = \begin{pmatrix}1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{pmatrix}
      \boldsymbol \beta = \begin{pmatrix}\beta_1 \\ \beta_2 \end{pmatrix}
    in R : lm(y~x1+x2-1)
  hypothesis : \beta_1 = \beta_2
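sketch in R (m, n and the data are illustrative; variable names g1, g2, ygrp chosen to avoid clashing with earlier sketches) : building the indicator columns for the no-intercept fit lm(y~x1+x2-1)

  m <- 10; n <- 25
  ygrp <- c(rnorm(m, mean = 5), rnorm(n - m, mean = 6))   # placeholder data for the two groups
  g1 <- rep(c(1, 0), times = c(m, n - m))                 # 1 for the first m observations
  g2 <- rep(c(0, 1), times = c(m, n - m))                 # 1 for the remaining n - m observations
  fit_two <- lm(ygrp ~ g1 + g2 - 1)                       # beta_1, beta_2 = the two group means
  coef(fit_two)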
polynomial models :
  linear : y_i = \beta_0 + \beta_1x_i + \epsilon_i
    X = \begin{pmatrix}1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
    lm(y~x)
  quadratic : y_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + \epsilon_i
    X = \begin{pmatrix}1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2\end{pmatrix}
    lm(y~x+I(x^2))
  k-th degree : y_i = \beta_0 + \beta_1 x_i + ... + \beta_kx_i^k + \epsilon_i
    X = \begin{pmatrix}1 & ... & x_1^k \\ \vdots & \ddots & \vdots \\ 1 & ... & x_n^k \end{pmatrix}
    lm(y~poly(x, degree=k, raw=T))
systems of straight lines : yields of a chemical process which change linearly with temperature...
  y_1, ..., y_m : yields of a chemical process at temperatures t_1,...,t_m in the absence of a catalyst (x_i = 0)
  y_{m+1}, ..., y_{2m} : yields of a chemical process at the same temperatures t_1,...,t_m in the presence of a catalyst (x_i = 1)
  case a (main effects) : the catalyst has an effect; the effect is the same at all temperatures
    \mu_i = \begin{cases}\beta_0 + \beta_1 t_i & i = 1,2,...,m \\ \beta_0+\beta_1t_{i-m}+\beta_2 & i=m+1,...,2m\end{cases}
    alternative (indicator variable) : E(y_i) = \beta_0 + \beta_1t_i + \beta_2x_i
      x_i = \begin{cases}0 & i = 1,2,...,m \\ 1 & i = m+1,...,2m\end{cases}
      t_{i+m} = t_i, \; i = 1,2,...,m
    matrix form : E(\boldsymbol y) = X \boldsymbol \beta
      \boldsymbol y = \begin{pmatrix}y_1\\ \vdots \\ y_m \\ y_{m+1} \\ \vdots \\ y_{2m}\end{pmatrix}
      X = \begin{pmatrix}1 & t_1 & 0 \\ \vdots & \vdots & \vdots \\ 1 & t_m & 0 \\ 1 & t_1 & 1 \\ \vdots & \vdots & \vdots \\ 1 & t_m & 1 \end{pmatrix}
      \boldsymbol \beta = \begin{pmatrix}\beta_0 \\ \beta_1 \\ \beta_2\end{pmatrix}
    hypothesis : \beta_2 = 0
  case b (interaction) : the catalyst has an effect; the effect changes with temperature
    \mu_i = \beta_0 + \beta_1t_i + \beta_2x_i + \beta_3t_ix_i,\;\;\; i = 1,2,...,2m
    catalyst absent (x_i = 0) : \mu_i = \beta_0 + \beta_1t_i,\;\;\; i=1,2,...,m
    catalyst present (x_i = 1) : \mu_i = \beta_0 + \beta_1t_{i-m}+\beta_2+\beta_3t_{i-m},\;\;\; i = m+1,...,2m
      \mu_i = \beta_0 + \beta_2 + (\beta_1 + \beta_3)t_{i-m}
    matrix form : E(\boldsymbol y) = X \boldsymbol \beta
      \boldsymbol y = \begin{pmatrix}y_1\\ \vdots \\ y_m \\ y_{m+1} \\ \vdots \\ y_{2m}\end{pmatrix}
      X = \begin{pmatrix}1 & t_1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & t_m & 0 & 0 \\ 1 & t_1 & 1 & t_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & t_m & 1 & t_m \end{pmatrix}
      \boldsymbol \beta = \begin{pmatrix}\beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3\end{pmatrix}
    hypothesis : \beta_2 = \beta_3 = 0 (no rejection \to catalyst has no effect)
    catalyst depends on temperature? \beta_3 = 0
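sketch in R (temp, catalyst and yield are hypothetical names with simulated values) : case a is a main-effects fit, case b adds the interaction; yield ~ temp * catalyst expands to temp + catalyst + temp:catalyst

  m        <- 8
  temp     <- rep(seq(50, 120, length.out = m), 2)            # same temperatures twice
  catalyst <- rep(c(0, 1), each = m)                          # indicator x_i
  yield    <- 10 + 0.3 * temp + 5 * catalyst + rnorm(2 * m)   # illustrative yields
  fit_a <- lm(yield ~ temp + catalyst)    # case a: mu = b0 + b1*t + b2*x
  fit_b <- lm(yield ~ temp * catalyst)    # case b: adds b3 * t * x (interaction)
  anova(fit_a, fit_b)                     # tests beta_3 = 0 (does the catalyst effect depend on temperature?)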
one-way classification (k-sample problem) : comparison of several
"treatments"; generalization of the two-sample problem
k
k
k
k
catalysts, n_i
n
i
n_i
n i
observations with the i
i
i
i -th catalyst
(i = 1,...,k
i
=
1
,
.
.
.
,
k
i = 1,...,k
i = 1 , ... , k )
n = n_1 + ... + n_k
n
=
n
1
+
.
.
.
+
n
k
n = n_1 + ... + n_k
n = n 1 + ... + n k
total observations
y_{ij}
y
i
j
y_{ij}
y ij :
j
j
j
j -th
observation from the i
i
i
i -th catalyst
group (i=1,...,k;\;\;j=1,...,n_i
i
=
1
,
.
.
.
,
k
;
j
=
1
,
.
.
.
,
n
i
i=1,...,k;\;\;j=1,...,n_i
i = 1 , ... , k ; j = 1 , ... , n i )
E(y_{ij}) = \beta_i
E
(
y
i
j
)
=
β
i
E(y_{ij}) = \beta_i
E ( y ij ) = β i
matrix form : E(\boldsymbol y) = X \boldsymbol \beta = \beta_1 \boldsymbol x_1 + ... + \beta_k \boldsymbol x_k
\boldsymbol x_i : regressor vectors indicating the group membership of the observations
x_{ji} = \begin{cases}1&y_{ij} \text{ from group } i\\0&\text{otherwise}\end{cases}
example (3 groups) :
\boldsymbol y = \begin{pmatrix} y_{11} \\ \vdots \\ y_{1n_1} \\ y_{21} \\ \vdots \\ y_{2n_2} \\ y_{31} \\ \vdots \\ y_{3n_3} \end{pmatrix}
X = (\boldsymbol x_1, \boldsymbol x_2, \boldsymbol x_3) = \begin{pmatrix}1 & 0 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 0 & 0 & 1\end{pmatrix}
\boldsymbol \beta = \begin{pmatrix}\beta_1 \\ \beta_2 \\ \beta_3 \end{pmatrix}
LSE : \hat \beta_i = \overline y_i
hypothesis : \beta_1 = \beta_2 = ... = \beta_k
alternative (reference group) : relate group means to the mean of a reference group (here, the first group)
\beta_i = \beta_1 + \delta_i,\;\;\; i = 2,3,...,k
E(y_{ij})=\begin{cases}\beta_1&i=1\\\beta_1+\delta_i&i=2,...,k\end{cases}
matrix form : E(\boldsymbol y) = X\boldsymbol \beta where X = (\boldsymbol 1,\boldsymbol x_2, ..., \boldsymbol x_k) and \boldsymbol \beta = (\beta_1, \delta_2, ..., \delta_k)'
example (3 groups) :
X = (\boldsymbol 1,\boldsymbol x_2, \boldsymbol x_3) = \begin{pmatrix}1 & 0 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 0 \\ 1 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 1\end{pmatrix}
\boldsymbol \beta = \begin{pmatrix}\beta_1 \\ \delta_2 \\ \delta_3 \end{pmatrix}
LSE : \hat{\boldsymbol\beta} = (\overline y_1,\;\; \overline y_2 - \overline y_1,\;\; ...,\;\; \overline y_k - \overline y_1)'
multicollinearity : in the presence of one variable, the other is not important enough to be included; the two variables express the same information, so there is no point in including both (p. 157 / 171)
typically shown by the fact that, in a model which includes both covariates,
neither is significant on its own (t-test)
orthogonality : special properties for X matrices with orthogonal columns (the dot product of any two distinct columns is 0)...
non-changing estimates : \hat\beta_i remains the same, regardless of how many variables there are in the model
additivity of SSRs : SSR(x_1, ... , x_k) = SSR(x_1) + ... + SSR(x_k), for a differing number of variables in a model
orthogonal \implies independence : the components of \boldsymbol{\hat\beta} are independent (covariances between the \hat\beta_i are zero)
Model Diagnostics
possible reasons for a model being
inadequate :
inadequate functional form : missing needed variables and nonlinear
components
incorrect error specification : non-constant Var(\epsilon_i), non-normal distribution, non-independent errors
unusual observations : outliers playing a big part
residual
analysis : using the residual to assess the adequacy of a model
residual : \boldsymbol e = \boldsymbol y - \boldsymbol{\hat y}
\boldsymbol{\hat y}=H\boldsymbol y
i-th case in dataset : e_i = y_i - \hat y_i
estimates the random component \boldsymbol \epsilon
E(\boldsymbol e) = (I-H)E(\boldsymbol y)
correctly specified model : E(\boldsymbol e) = \boldsymbol 0
E(\boldsymbol e) = (I-H)E(\boldsymbol y) = (I-H)X\boldsymbol \beta = ... = X\boldsymbol\beta - X\boldsymbol\beta = \boldsymbol 0
incorrectly specified model : E(\boldsymbol e) \neq \boldsymbol 0
"true" model : E(\boldsymbol y) = X \boldsymbol \beta + \boldsymbol u \gamma
\boldsymbol u : regressor vector not in L(X)
\gamma : a parameter
E(\boldsymbol e) = (I-H)E(\boldsymbol y) = (I-H)(X\boldsymbol\beta + \boldsymbol u \gamma) = \gamma(I-H)\boldsymbol u \neq 0
\boldsymbol e and \boldsymbol{\hat y} should be uncorrelated
fitted values should not carry any information on the residuals
in other words : a graph of the residuals against the fitted values
should show no patterns
properties : for \boldsymbol y = X \boldsymbol \beta + \boldsymbol \epsilon, where h_{ij} are elements of H ...
Var(\epsilon_i) = \sigma^2 (constant), but Var(e_i) = \sigma^2(1-h_{ii}) (not constant)
Cov(\epsilon_i, \epsilon_j) = 0,\;\;\; i \neq j (uncorrelated), but Cov(e_i, e_j) = -\sigma^2h_{ij},\;\;\; i \neq j (not uncorrelated)
standardized residuals : residuals standardized to have approx. mean zero
and variance one
definition : e_i^s=\frac{e_i}s
recall : \hat\sigma^2 = s^2 = \frac{\boldsymbol{e'e}}{n-k-1}
studentized
residuals : the dimensionless ratio resulting from the division of a
residual by an estimate of its standard deviation
|d_i| > 2 or 3 would make us question whether the model is adequate for that case i
a histogram or a dot plot of the studentized residuals helps us assess whether one
or more of the residuals are unusually large
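in R (sketch, not from the notes) : a minimal residual check, assuming an lm fit named fit on a hypothetical data frame dataset; rstandard() and rstudent() return the standardized and studentized residuals
fit <- lm(y ~ x1 + x2, data = dataset)                  # hypothetical model / data
plot(fitted(fit), resid(fit), xlab = "fitted values", ylab = "residuals")  # should show no pattern
abline(h = 0, lty = 2)
rstandard(fit)                                          # standardized residuals
rstudent(fit)                                           # studentized residuals
which(abs(rstudent(fit)) > 2)                           # cases worth a closer look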
serial correlation (autocorrelation) : if a regression model is fit to time series data (e.g. monthly, yearly...), it is likely that the errors are serially correlated (as opposed to the errors \epsilon_t being independent for time indices t)
positively autocorrelated : a positive error last time unit implies a
similar positive error this time unit
detection : calculate the lag k sample autocorrelation r_k of the residuals (r_0 = 1)
measures the association within the same series (residuals) k steps apart
sample correlation between e_t and its k-th lag, e_{t-k}
lag k autocorrelation is always between -1 and +1
graphically : plot e_t against e_{t-k} and look for associations (positive: upwards, negative: downwards)
in R : acf(fit$residuals,las=1)
autocorrelation function (of the residuals) : graph of the autocorrelations r_k as a function of the lag k
two horizontal bands at \pm\frac2{\sqrt n} are added to the graph
sample autocorrelations that are outside these limits are
indications of autocorrelation
if (almost) all autocorrelations are within these limits, one can make the
assumption of independent errors
Durbin-Watson test : examines the lag 1 autocorrelation r_1 in more detail; complicated to compute
DW \approx 2 : independent errors
DW > 2 or DW < 2 : correlated errors
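in R (sketch) : assuming the lmtest package is installed and fit is an lm object, dwtest() runs the Durbin-Watson test; the residual acf is an alternative check
library(lmtest)
dwtest(fit)                     # DW close to 2 -> little evidence of lag-1 autocorrelation
acf(residuals(fit), las = 1)    # bands at +-2/sqrt(n) flag autocorrelation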
outlier : an observation that differs from the majority of the cases in the data set
one must distinguish among outliers in the y (response) dimension (a) vs. outliers in the x (covariate) dimension (b) vs. outliers in both dimensions (c)
x dimension : outliers that have unusual values on one or more of the covariates
y dimension : outliers are linked to the regression model
random component too large?
response or covariates recorded incorrectly?
missing covariate?
detection : graphically, studentized residual, leverage
influence : an individual case has a major influence on a
statistical procedure if the effects of the analysis are significantly
altered when the case is omitted
leverage : a measure
of how far away the independent variable values of an observation are from those of the other
observations
definition : h_{ii} for the i-th independent observation, i=1,...,n (entry in the hat matrix H)
properties :
h_{ii} is a function of the covariates (x) but not the response
h_{ii} is higher for x farther away from the centroid \overline x
\sum_{i=1}^n h_{ii} = tr(H)= k+1
\overline h =\frac{k+1}n
rule of thumb : a case for which the leverage exceeds twice
the average is considered a high-leverage case
formally : h_{ii} > 2\overline h = \frac{2(k+1)}n
influence : study how the deletion of a case affects the parameter estimates
after deleting the i-th case : \boldsymbol y = X \boldsymbol \beta + \boldsymbol \epsilon for the remaining n-1 cases
\boldsymbol{\hat\beta}_{(i)} : the estimate of \boldsymbol\beta without the i-th case
\boldsymbol{\hat\beta} : the estimate of \boldsymbol\beta for all cases
influence of the i-th case : \boldsymbol{\hat\beta} - \boldsymbol{\hat\beta}_{(i)}
Cook's D
statistic : estimate of the influence of a data point when performing a
least-squares regression analysis
D_i > 0.5 should be examined
D_i > 1 : great concern
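in R (sketch) : leverage and influence diagnostics for a hypothetical lm fit named fit; hatvalues() returns the h_{ii} and cooks.distance() the D_i
h <- hatvalues(fit)                     # leverages h_ii
k <- length(coef(fit)) - 1              # number of covariates
n <- nobs(fit)
which(h > 2 * (k + 1) / n)              # high-leverage cases (rule of thumb)
D <- cooks.distance(fit)                # Cook's D
which(D > 0.5)                          # cases to examine; D_i > 1: great concern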
Lack of Fit
lack of fit test : can be performed if there are repeated observations at some of
the constellations of the explanatory variables
formally : for n observations, only k different values of x were observed
there were n_i values of y measured at covariate value x_i,\;\; i = 1,...,k
x_1: y_{11},...,y_{1n_1}\\\vdots\\x_k: y_{k1},...,y_{kn_k}
one-way classification model : y_{ij} = \beta_1I(x_i = x_1) + ... + \beta_kI(x_i = x_k) + \epsilon_{ij}
so, in essence, each individual result for a certain constellation x_i is just \beta_i + \epsilon_{ij}
E(\epsilon_{ij})=0
Var(\epsilon_{ij}) = \sigma^2
matrix form : \boldsymbol y = X \boldsymbol\beta + \boldsymbol\epsilon
\boldsymbol y \in n \times 1 : vector of responses, n = \sum_{i=1}^kn_i
X \in n \times k : design matrix with ones and zeros representing the k groups
\boldsymbol \beta \in k \times 1 : vector of unknown means \mu_i
(\hat\beta_1, ..., \hat\beta_k) = (\overline y_1, ..., \overline y_k)
\overline y_i = \frac1{n_i}\sum_{j=1}^{n_i}y_{ij} (avg. of group i)
restricted (parametric) model : y_{ij} = \beta_0 + \beta_1x_i + \epsilon_{ij}
estimate via least squares
PESS : PESS = SSE_{full} = \sum_{i=1}^k\sum_{j=1}^{n_i}(y_{ij}-\overline y_i)^2
d.f. : number of observations minus number of groups (n-k)
LFSS : LFSS = \sum_{i=1}^k n_i(\overline y_i - \hat \beta_0 - \hat \beta_1 x_i)^2 \geq 0
d.f. : number of groups k minus number of parameters (linear: 2 parameters)
SSE_{restrict} = PESS + LFSS
SSE_{restrict} \geq SSE_{full}
test (linear) : H_{restrict}: \mu_i = \beta_0 + \beta_1x_i vs. H_{full}: \mu_i = \beta_1I(x_i=x_1) + ... + \beta_kI(x_i=x_k)
F=\frac{LFSS/(k-2)}{PESS/(n-k)} \sim F_{k-2,n-k}
reject H_0 \implies lack of fit, reject restricted model
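in R (sketch) : one way to get this F test is to compare the restricted linear model with the one-way classification (full) model via anova(); dat is a hypothetical data frame with columns y and x and repeated observations at some x values
fit_restrict <- lm(y ~ x, data = dat)             # restricted (linear) model
fit_full     <- lm(y ~ factor(x), data = dat)     # one-way classification (full) model
anova(fit_restrict, fit_full)                     # F = (LFSS/(k-2)) / (PESS/(n-k))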
test (general) : H_{restrict}: \mu_i = \beta_0 + \beta_1x_i + \beta_2x_i^2 + ... vs. H_{full}: \mu_i = \beta_1I(x_i=x_1) + ... + \beta_kI(x_i=x_k)
F=\frac{LFSS/(k-\dim(\beta))}{PESS/(n-k)} \sim F_{k-\dim(\beta),n-k}
variance-stabilizing transformations : find a simple function g to apply to values x in a data set to create new values y = g(x) such that the variability of the values y is not related to their mean value
assume y = \mu + \epsilon where \mu is a fixed mean
Var(y) = (h(\mu))^2\sigma^2
has a non-constant variance that depends on the mean (h known)
goal : find g(x) such that Var(g(x)) is constant and does not depend on \mu
Var(g(y)) = (g'(\mu))^2(h(\mu))^2\sigma^2
goal : find g such that g'(\mu) = \frac1{h(\mu)}, so that finally Var(g(y)) \approx \sigma^2
example : h(\mu) = \mu \implies g'(\mu) = \frac1\mu \implies g(\mu) = \ln(\mu)
Box-Cox transformations : find \lambda s.t. the transformed response y_i^{(\lambda)} minimizes SSE(\lambda) (done in a table, compare various \lambda s to their SSE(\lambda) s)
\lambda = 0 : log transform (using the limit) lm(log(...)~., data=...)
\lambda = \frac12 : square root transform lm(sqrt(...)~., data=...)
\lambda = 1 : no transform lm(...~., data=...)
\lambda = 2 : square transform lm((...)*(...)~., data=...)
in R : library(MASS); boxcox(fit)
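usage sketch (hypothetical lm fit on a positive response) : boxcox() profiles the log-likelihood over a grid of lambda values; the maximizing lambda suggests the transformation
library(MASS)
fit <- lm(y ~ ., data = dataset)                  # hypothetical model
bc  <- boxcox(fit, lambda = seq(-2, 2, 0.1))
(lambda_hat <- bc$x[which.max(bc$y)])             # lambda with the highest profile log-likelihood
# lambda_hat near 0  ->  lm(log(y) ~ ., data = dataset)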
Model Selection
goal : given observational data , find the best model which
incorporates the concepts of model fit and model simplicity
increasing the number of predictors increases variability
in the predictions
Var(\hat{\boldsymbol y}) = \sigma^2H \implies average variance \frac1n\sum_{i=1}^nVar(\hat y_i)=\frac{\sigma^2(k+1)}n for k covariates and sample size n
multicollinearity : different methods of analysis may end up with final
models that look very different, but describe the data equally well
model selection : given observations on a response y and q potential explanatory variables v_1,...,v_q, select a model y = \beta_0 + \beta_1x_1 + ... + \beta_px_p + \epsilon where...
x_1,...,x_p is a subset of the original regressors v_1,...,v_q
no important variable is left out of the model
no unimportant variable is included in the model
all possible regressions : fit 2^q models if q variables are involved
R_k^2 = 1 - \frac{SSE_k}{SST} for k variables and k+1 regression coefficients
an increase in k means a decrease in SSE_k, approaching 0 when k=n-1
an increase in R^2 means a decrease in s^2
SST does not depend on the covariates; just on y (see definition)
therefore, R^2 approaches 1 as k increases, so we don't use R^2, since we'd just choose the model with the most variables
R_{adj}^2 : adjusted R^2
remedies the problem of R^2 continually increasing by adjusting for the degrees of freedom
ideal model and choice of k : highest R_{adj}^2
equivalent : smallest s^2
AIC : Akaike's Information Criterion
prefer models with smaller AIC
BIC : Bayesian Information Criterion
larger penalty for more variables
automatic model selection methods : forward selection, backward elimination, stepwise regression (needed in R: library(MASS))
forward selection : start with the smallest model, build up to the optimal
model
in R : stepAIC(lm(y~1, data=dataset), direction = "forward", scope = list(upper = lm(y~., data=dataset))[, k = log(nrow(dataset))])
(for BIC: [...])
backward elimination : start with the largest model and build down to the
optimal model
in R : stepAIC(lm(y~., data=dataset), direction = "backward"[, k = log(nrow(dataset))])
stepwise regression : oscillate between forward selection and backward
elimination
in R : use either previous function, but with direction = "both"
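usage sketch (hypothetical data frame dataset with response y) : one common way to call stepAIC for stepwise regression; add k = log(nrow(dataset)) for BIC
library(MASS)
null <- lm(y ~ 1, data = dataset)                 # intercept-only model
full <- lm(y ~ ., data = dataset)                 # model with all covariates
both <- stepAIC(null, direction = "both", scope = list(lower = null, upper = full))
summary(both)                                     # selected model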
forward selection
INIT
M = intercept-only model
P = all covariates
REPEAT
IF P empty STOP
ELSE
calculate AIC for sizeof(P) models, each adding one covariate from P to M
IF all AICs > AIC(M) STOP
ELSE
update M with covariate whose addition had minimum AIC
remove covariate from P
backward elimination
INIT
M = model with all covariates
P = all covariates
REPEAT
IF P empty STOP
ELSE
calculate AIC for sizeof(P) models, each dropping one covariate in P from M
IF all AICs > AIC(M) STOP
ELSE
update M by deleting covariate that led to minimum AIC
remove covariate from P
stepwise regression
INIT
M = intercept-only model OR full model
e = small threshold
REPEAT UNTIL STOP
do a forward step on M
do a backward step on M
# For both steps, the differences in AIC need to be
# greater than e for the selection to go forward, otherwise
# the changes can keep undoing each other.
Nonlinear Regression
linear model (recap) : y=\mu+\epsilon where \mu = \beta_0 + \beta_1x_1 + ... + \beta_kx_k
key : linearity of the parameters \beta_i
the regressor variables x_i can be any known nonlinear function of the regressors...
intrinsically nonlinear model : a nonlinear model that cannot be transformed into a
linear model
counterexample : y = \alpha x_1^\beta x_2^\gamma\epsilon can be transformed into \ln(y)=\ln(\alpha)+\beta\ln(x_1)+\gamma\ln(x_2)+\ln(\epsilon)
require iterative algorithms and convergence (vs. linear models, which are
analytic )
linear trend model : \mu_t = \alpha + \gamma t
\gamma : growth rate, unbounded
\alpha : starting value at t=0
nonlinear regression model : y_i = \mu_i + \epsilon_i = \mu(\boldsymbol x_i, \boldsymbol\beta) + \epsilon_i
\epsilon_i \sim N(0,\sigma^2) iid. for i=1,...,n
\boldsymbol x_i = (x_{i1},...,x_{im})' : vector of m covariates for the i-th case (typically m=1, where the covariate is time)
\boldsymbol \beta : vector of p parameters to be estimated along with \sigma^2 (usually a different number of parameters than covariates)
\mu(\boldsymbol x_i, \boldsymbol\beta) : nonlinear model component
S(\boldsymbol{\hat\beta}) = \sum_{i=1}^n(y_i-\mu(\boldsymbol x_i, \boldsymbol \beta))^2
estimates :
\boldsymbol{\hat\beta} : no closed form!
use an iterative method to minimize S(\boldsymbol{\hat\beta})
\hat\sigma^2 = s^2 = \frac{S(\boldsymbol{\hat\beta})}{n-p} = \frac{\sum_{i=1}^n(y_i-\mu(\boldsymbol x_i, \boldsymbol \beta))^2}{n-p}
Var(\boldsymbol{\hat\beta}) \approx s^2(X'X)^{-1}
s.e.(\hat\beta_i) = \sqrt{v_{ii}}
(the square roots of the diagonal elements in the covariance matrix provide estimates of the standard errors)
off-diagonal elements provide estimates of the covariances among the estimates
100(1-\alpha)\% C.I. for \beta_j : \hat\beta_j \pm t_{n-p,1-\alpha/2}s.e.(\hat\beta_j)
H_0: \beta_j = 0 vs. H_a: \beta_j \neq 0 : t = \frac{\hat\beta_j}{s.e.(\hat\beta_j)} \sim t_{n-p}
for constants : H_0: \beta_j = c vs. H_a: \beta_j \neq c : t = \frac{\hat\beta_j - c}{s.e.(\hat\beta_j)} \sim t_{n-p}
restricted models or goodness-of-fit tests can be performed similarly as for linear models
in R : fitnls = nls(y~(formula using a, b),start=list(a=...,b=...))
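usage sketch (hypothetical data frame dat with columns y and t) : fitting an exponential trend \mu(t) = a e^{bt} with nls; reasonable starting values matter for convergence
fitnls <- nls(y ~ a * exp(b * t), data = dat, start = list(a = 1, b = 0.1))
summary(fitnls)                 # estimates, standard errors, t tests
coef(fitnls)                    # \hat a, \hat b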
Newton-Raphson method : a
root-finding algorithm which produces successively better approximations to the zeroes of a
real-valued function
goal : find \boldsymbol\beta that minimizes f(\boldsymbol \beta), here S(\boldsymbol \beta) or -\log L(\boldsymbol\beta)
Df(\boldsymbol \beta) = (\frac{\partial f}{\partial \beta_1}, ... , \frac{\partial f}{\partial \beta_p})' : the p-vector containing the first derivatives of f w.r.t. \beta_i
D^2f(\boldsymbol \beta) : the p \times p matrix of second derivatives with ij-th element \frac{\partial^2 f}{\partial \beta_i \partial \beta_j} (Hessian matrix)
general : initialize \boldsymbol\beta_{old} = starting value, then repeat until convergence:
\boldsymbol \beta_{new} \approx \boldsymbol \beta_{old} - (D^2f(\boldsymbol\beta_{old}))^{-1}Df(\boldsymbol\beta_{old})
\boldsymbol \beta_{old} = \boldsymbol \beta_{new}
problem : unstable due to inversion
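illustrative sketch (not the course's implementation) : a generic Newton-Raphson iteration in R, given functions for the gradient Df and the Hessian D^2f
newton_raphson <- function(grad, hess, beta0, tol = 1e-8, maxit = 100) {
  beta <- beta0
  for (it in seq_len(maxit)) {
    step <- solve(hess(beta), grad(beta))         # (D^2 f)^{-1} D f
    beta <- beta - step
    if (sqrt(sum(step^2)) < tol) break            # stop when the update is tiny
  }
  beta
}
# example: minimize f(b) = (b1 - 1)^2 + 2 (b2 + 3)^2
grad <- function(b) c(2 * (b[1] - 1), 4 * (b[2] + 3))
hess <- function(b) diag(c(2, 4))
newton_raphson(grad, hess, c(0, 0))               # converges to (1, -3)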
scoring : initialize \boldsymbol\beta_{old} = starting value, then repeat until convergence:
\boldsymbol \beta_{new} \approx \boldsymbol \beta_{old} + (I(\boldsymbol\beta_{old}))^{-1}D\log L(\boldsymbol\beta_{old})
\boldsymbol \beta_{old} = \boldsymbol \beta_{new}
information matrix : I(\boldsymbol\beta) = E(-D^2\log L(\boldsymbol \beta))
problematic : local minima that "trap" iterative algorithms,
parameters of highly varying magnitudes (e.g. one parameter in range 0-1, another in the
thousands), badly specified models with non-identifiable parameters (similar to
multicollinearity)
Time Series Models
first-order autoregressive model (AR1) : y_t = \mu(X_t, \beta) + \epsilon_t
autocorrelations of observations 1 step apart : \phi
all correlations among observations one step apart are the same: \phi = Corr(\epsilon_1, \epsilon_2) = ... = Corr(\epsilon_{n-1}, \epsilon_n)
|\phi| < 1 (correlations between -1 and +1)
autocorrelations of observations k steps apart : \phi^k
\phi^k = Corr(\epsilon_1, \epsilon_{k+1}) = ... = Corr(\epsilon_{n-k}, \epsilon_n)
properties of autocorrelations :
they depend only on the time lag between the observations (so the time indices don't
matter; just the time distance)
they decrease exponentially with the time lag (because -1 < \phi < 1)
the farther apart the observations, the weaker the autocorrelation
if \phi is close to 1, the decay is slow
autocorrelation function (lag k correlation) : \rho_k = \phi^k = Corr(\epsilon_{t-k}, \epsilon_t)
\rho_0 = 1
\rho_k = \rho_{-k}
correlated error at time t : \epsilon_t = \phi\epsilon_{t-1} + a_t where a_t \sim N(0, \sigma_a^2)
white noise (random shocks) : a_t is the "usual" regression model error; mean 0, all uncorrelated
Corr(a_{t-k},a_t) = 0 for all k \neq 0
expanded : \epsilon_t = a_t + \phi a_{t-1} + \phi^2a_{t-2} + ...
E(\epsilon_t) = 0
Var(\epsilon_t) \to \frac{\sigma_a^2}{1-\phi^2}
stationary model : fixed level 0; realizations scatter around the fixed
level and sample paths don't leave this level for long periods
in R : library(nlme); fitgls=gls(y~x,correlation=corARMA(p=1,q=0))
(change p=2
for AR2)
random walk model : y_t = \mu(X_t, \beta) + \epsilon_t with \phi = 1
\epsilon_t = \epsilon_{t-1} + a_t = a_t + a_{t-1} + a_{t-2} + ...
cumulative sum of all random shocks up to time t
nonstationary model : no fixed level; paths can deviate for long periods
from the starting point
first-order difference : w_t = \epsilon_t - \epsilon_{t-1} = a_t
well-behaved, stationary, uncorrelated
effects of ignoring autocorrelation : what happens when we fit a standard linear
model even if the errors are correlated?
stationary errors : the variance of \hat\beta will be overestimated compared to the true variance (inefficiency)
t ratios too small \to null hypothesis less likely to be rejected when it should be
non-stationary errors : the variance of \hat\beta will be underestimated compared to the true variance
t ratios too large \to null hypothesis likely to be rejected when it shouldn't be
forecasting (prediction) : given data up to time period n, predict the response at time period n + r (r step-ahead forecast)
r step-ahead forecast : y_n(r) = \hat y_{n+r}
n : forecast origin
r : forecast horizon
assumption : future values of the covariate x_t are known (e.g. own future investments)
1 step forecast (AR1, one covariate) : assume x_{n+1} known...
observation : y_{n+1} = \phi y_n + (1-\phi)\beta_0 + (x_{n+1}-\phi x_n)\beta_1 + a_{n+1}
prediction : \hat y_{n+1} = \hat \phi y_n + (1-\hat\phi)\hat\beta_0 + (x_{n+1}-\hat\phi x_n)\hat\beta_1
95% CI : \hat y_{n+1} \pm 1.96se(\hat y_{n+1})
general step forecast (AR1, one covariate) : \hat y_{n+r} = \hat \phi \hat y_{n+r-1} + (1-\hat\phi)\hat\beta_0 + (x_{n+r}-\hat\phi x_{n+r-1})\hat\beta_1 for r \geq 2
95% CI : \hat y_{n+r} \pm 1.96se(\hat y_{n+r})
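sketch (hypothetical estimates and known future covariate values) : the recursive forecast formula above implemented directly in R
forecast_ar1 <- function(y_n, x_n, x_future, phi, b0, b1) {
  preds  <- numeric(length(x_future))
  prev_y <- y_n
  prev_x <- x_n
  for (r in seq_along(x_future)) {
    preds[r] <- phi * prev_y + (1 - phi) * b0 + (x_future[r] - phi * prev_x) * b1
    prev_y   <- preds[r]                          # plug the forecast back in for the next step
    prev_x   <- x_future[r]
  }
  preds
}
# forecast_ar1(y_n = 10, x_n = 2, x_future = c(2.1, 2.2), phi = 0.6, b0 = 1, b1 = 3)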
Logistic Regression
logistic regression :
regression where the response variable is binary (general:
categorical )
y_i : outcome of case i, \;\;\; i = 1,2,...,n
y_i\sim Ber(\pi), independent
P(y_i=1)=\pi (success)
\ln\left(\frac{\pi(x_i)}{1-\pi(x_i)}\right)=x'_i\beta = \beta_0 + \beta_1x_{i1}+...+\beta_px_{ip}
\pi(x_i)=\frac{e^{x_i'\beta}}{1+e^{x_i'\beta}}
1-\pi(x_i)=\frac{1}{1+e^{x_i'\beta}}
\beta_0 : inflection point
\beta_1 : steepness of the sigmoid-like function
risk of y for factor x : \pi(x) = P(y=1|x)=\frac{e^{x'\beta}}{1+e^{x'\beta}}
odds of y for a fixed x : Odds(x) = \frac{\pi(x)}{1-\pi(x)}=\frac{P(y=1|x)}{1-P(y=1|x)}=\exp(x'\beta)
how much higher is the probability of the occurrence of y compared to the nonoccurrence of y?
odds of n:1 \implies occurrence is n times more likely than nonoccurrence
odds ratio : OR=\frac{Odds(x=1)}{Odds(x=0)}=\frac{\frac{P(y=1|x=1)}{P(y=0|x=1)}}{\frac{P(y=1|x=0)}{P(y=0|x=0)}}
\beta=\ln\left(\frac{\pi(x+1)}{1-\pi(x+1)}\right)-\ln\left(\frac{\pi(x)}{1-\pi(x)}\right)=\ln(OR)
vector of log odds ratios
\exp(\beta)=OR
what is the multiplicative factor by which the odds of occurrence increase / decrease for a change from x to x+1?
e.g. \beta = -0.2 \to \exp(\beta) = 0.82 \implies a change from x to x+1 decreases the odds of occurrence by 18\%
for k units : \beta k with the ratio measured as \exp(\beta k)
P(y_i=0)=1-\pi (failure)
E(y_i)=\pi
one covariate model : x'\beta = \beta_0 + \beta_1 x
\ln\left(\frac{P(y=1|x)}{1-P(y=1|x)}\right)=\beta_0+\beta_1x
Odds(x)=\frac{P(y=1|x)}{1-P(y=1|x)}=\exp(\beta_0+\beta_1x)
Odds(x=0)=\exp(\beta_0+\beta_1 \cdot 0) = \exp(\beta_0)
Odds(x=1)=\exp(\beta_0+\beta_1 \cdot 1) = \exp(\beta_0 + \beta_1)
OR = \frac{Odds(x=1)}{Odds(x=0)}=\exp(\beta_1)
\beta_1 = \ln(OR)
H_0: \beta_1 = 0 (no assoc. between x and y)
\beta_0 = \ln(Odds(x = 0))
MLE \hat \beta : Newton-Raphson...
CIs and tests :
100(1-\alpha)\% CI for \ln(OR) : \hat\beta_j \pm z_{1-\alpha/2}se(\hat\beta_j)
100(1-\alpha)\% CI for OR : \exp(\hat\beta_j \pm z_{1-\alpha/2}se(\hat\beta_j))
Wald test : H_0: \beta_j = 0 vs. H_A: \beta_j \neq 0 : \frac{\hat\beta_j}{se(\hat\beta_j)}\sim N(0,1)
case : an individual observation
constellation : grouped information at distinct levels of the explanatory
variables
n_k : number of cases at the k-th constellation
y_k : number of successes at the k-th constellation
prob. of success for the k-th constellation : \pi(x_k, \beta) = \frac{\exp(x_k\beta)}{1+\exp(x_k\beta)}
likelihood ratio tests (LRT) : used to compare the maximum likelihood under
the current model (the “full” model), with the maximum likelihood obtained under alternative
competing models ("restricted" models)
H_{restrict}: linear predictor x'_{res}\beta_{res} vs. H_{full}: linear predictor x'\beta
x_{res} is a subset of x
LRT statistic : 2 \cdot \ln\left(\frac{L(full)}{L(restrict)}\right) = 2 \cdot \ln\left(\frac{L(\hat\beta)}{L(\hat\beta_{res})}\right) \sim \chi_a^2
equiv. : 2 \cdot (\ln(L(full))-\ln(L(restrict)))
a = \dim(\beta) - \dim(\beta_{res})
reject H_{restrict} if the statistic is greater than the corresponding chi-square value
large value \implies the success probability depends on one or more of the regressors (i.e. full model better)
small value \implies none of the regressors in the model influence the success probability
deviance : twice the log-likelihood ratio between the saturated model and the parameterized (full) model; m constellations
saturated model : each constellation of the explanatory variables is
allowed its own distinct success probability
\hat\pi_k = \frac{y_k}{n_k}
D = 2 \cdot \ln\left(\frac{L(saturated)}{L(full)}\right) = 2 \cdot \ln \left(\frac{L(\hat\pi_1,...,\hat\pi_m)}{L(\hat\beta)}\right) \sim \chi_a^2
a = m - \dim(\beta)
LRT = D(restricted) - D(full)
p-value : 1 - pchisq(D, a)
in R : freqs <- cbind(yes, no); fit <- glm(freqs~x[+...], family="binomial")
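usage sketch (hypothetical success/failure counts yes, no per constellation and covariate x) : fitting, odds ratios, Wald CIs and deviance-based tests
freqs <- cbind(yes, no)                           # successes / failures per constellation
fit   <- glm(freqs ~ x, family = "binomial")
summary(fit)                                      # Wald tests for the coefficients
exp(coef(fit))                                    # odds ratios
exp(confint.default(fit))                         # Wald CIs for the odds ratios
anova(fit, test = "Chisq")                        # likelihood ratio (deviance) tests
1 - pchisq(deviance(fit), df.residual(fit))       # lack-of-fit p-value from the deviance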
Poisson Regression
generalized linear model
(GLM) : generalizes linear regression by allowing the linear model to be related
to the response variable via a link function
response variables y_1,...,y_n : share the same distribution from the exponential family (Normal, Poisson, Binomial...)
parameters \beta and explanatory variables x_1,...,x_p
monotone link function g : relates a transform of the mean \mu_i linearly to the explanatory variables
g(\mu_i) = \beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip}
standard linear regression : g(\mu) = \mu (identity function)
logistic regression : g(\mu) = \ln\left(\frac\mu{1-\mu}\right) (logit)
Poisson regression : g(\mu) = \ln(\mu)
Poisson regression model : response represents count data (e.g.
number of daily equipment failures, weekly traffic fatalities...)
P(Y = y) = \frac{\mu^y}{y!}e^{-\mu},\;\;\;y=0,1,2,...
E(y) = Var(y) = \mu > 0
g(\mu)=\ln(\mu)= \beta_0+\beta_1x_1+...+\beta_px_p
\mu=\exp(\beta_0+\beta_1x_1+...+\beta_px_p)
interpretation of coefficients : changing x_i by one unit to x_i+1 while keeping all other regressors fixed affects the mean of the response by 100(\exp(\beta_i)-1)\%
example : \frac{\exp(\beta_0+\beta_1(x_1+1)+...+\beta_px_p)}{\exp(\beta_0+\beta_1x_1+...+\beta_px_p)}=\exp(\beta_1)
95% CI : \hat\beta \pm 1.96se(\hat\beta)
for the mean ratio \exp(\beta) : \exp(\hat\beta \pm 1.96se(\hat\beta))
everything else identical to logistic regression
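in R (sketch, hypothetical data frame dat with a count response y and covariates x1, x2) :
fit <- glm(y ~ x1 + x2, family = "poisson", data = dat)
summary(fit)
exp(coef(fit))                                    # multiplicative effects on the mean
100 * (exp(coef(fit)) - 1)                        # percent change in the mean per unit increase
exp(confint.default(fit))                         # Wald CIs for the mean ratios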
Linear Mixed Effects Models
linear mixed effects models : some subset of regression parameters vary randomly
from one individual to another
individuals are assumed to have their own subject-specific mean response trajectories over
time
simple mixed effects model : Y_{ij} = \beta + b_i + e_{ij}
(observation = population mean + individual deviation + measurement error)
\beta : population mean (fixed effects, constant)
b_i : individual deviation from the population mean (random effects) (i-th individual)
b_i \sim N(0,d)
positive : individual responds higher than the population average (higher on the y-axis)
negative : individual responds lower than the population average (lower on the y-axis)
e_{ij} : within-individual deviations (measurement error) (i-th individual, j-th observation)
e_{ij} \sim N(0,\sigma^2)
E(Y_{ij}) = \beta
Var(Y_{ij}) = d + \sigma^2
Cov(Y_{ij},Y_{km}) = 0 for i \neq k
Cov(Y_{ij},Y_{ij}) = Var(Y_{ij}) = d + \sigma^2
Cov(Y_{ij},Y_{ik}) = d
Cor(Y_{ij},Y_{km}) = 0 for i \neq k
Cor(Y_{ij},Y_{ij}) = 1
Cor(Y_{ij},Y_{ik}) = \frac{d}{d+\sigma^2}
in R : matrix format, n rows for n individuals and m columns for m time points per individual
has to be transformed into longitudinal format for use with lme
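in R (sketch) : a random-intercept fit with nlme, assuming a hypothetical longitudinal data frame long with columns y, time and a subject identifier id
library(nlme)
fit <- lme(y ~ time, random = ~ 1 | id, data = long)   # Y_ij = beta + b_i + e_ij (plus a time effect)
summary(fit)                                           # fixed effects and variance components (d, sigma^2)
ranef(fit)                                             # predicted random effects b_i (BLUPs)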
general linear mixed effects model : Y_i = X_i\beta + Z_ib_i + e_i,\;\;b_i\sim N_q(0,D),\;\;e_i\sim N_{n_i}(0,R_i), where R_i = \sigma^2I_{n_i}
Y_i \in \R^{n_i} : outcomes (for the i-th individual, as for nearly everything here...)
X_i \in \R^{n_i \times p} : design matrix for fixed effects
Z_i \in \R^{n_i \times q} : design matrix for random effects (columns are usually a subset of the columns of X_i)
\beta \in \R^p : fixed effects
any component of \beta can be allowed to vary randomly by including the corresponding column of X_i in Z_i
b_i \in \R^q : random effects
independent of the covariates X_i
e_i \in \R^{n_i} : within-individual errors
conditional mean : E(Y_i \;|\; b_i) = X_i\beta + Z_ib_i
marginal mean : E(Y_i) = X_i\beta
conditional variance : Var(Y_i\;|\;b_i) = R_i
marginal variance : Var(Y_i) = Z_iDZ_i'+R_i
\sigma^2_{REML}=\frac1{n-1}\sum_{i=1}^n(x_i-\overline x)^2 (unbiased, restricted maximum likelihood)
\sigma^2_{ML}=\frac1{n}\sum_{i=1}^n(x_i-\overline x)^2 (biased)
\hat b_i : best linear unbiased predictor (BLUP)
"shrinks" the i-th individual's predicted response profile towards the population-averaged mean response profile
large R_i compared to D \implies more shrinkage to the mean
small R_i compared to D \implies closer to the observed value
large n_i \implies less shrinkage
Statistical Learning (Machine Learning)
supervised learning : an outcome (which guides the learning process) predicted based
on a set of features
unsupervised learning : no outcome; only features observed (not relevant
here)
outcome (outputs, responses, dependent variables) : the thing to predict; can be quantitative (ordered, e.g. stock price) or qualitative (unordered, categorical, factors, e.g. species of Iris)
predicting quantitative outcomes \to
→
\to
→
regression
predicting qualitative outcomes \to
→
\to
→
classification
features (inputs, predictors, independent variables) : the data used to make predictions for the outcome
training set : data set containing both features and outcomes to
build the model
Prediction Methods
least squares model (linear model) : high stability but low accuracy (high bias, low variance)
goal : predict Y by f(X) = X'\beta
\beta_0 : intercept (bias)
X \in \R^p : random input vector (first element 1 for the intercept)
Y \in \R : random outcome (to predict)
p(X,Y) : joint distribution
f(X) : function for predicting Y based on X
f'(X)=\beta \in \R^p : vector that points in the steepest uphill direction
(x_1,y_1),...,(x_n,y_n) : training data
method : pick \beta to minimize the residual sum of squares RSS(\beta) = \sum_{i=1}^n(y_i - x_i'\beta)^2
matrix notation : RSS(\beta) = (y-X\beta)'(y-X\beta)
LSE : \hat\beta = (X'X)^{-1}X'y (see the R sketch after this block)
fitted value : \hat y_i = x_i'\hat\beta
theoretical : \beta = (E(X'X))^{-1}E(X'Y)
should only be used when Y is continuous and normally distributed
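A base-R sketch of the closed-form LSE \hat\beta = (X'X)^{-1}X'y, checked against lm(); the simulated data and coefficient values are made up.

set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)         # made-up linear model

X        <- cbind(1, x1, x2)                   # first column of 1s for the intercept
beta_hat <- solve(t(X) %*% X, t(X) %*% y)      # (X'X)^{-1} X'y
y_hat    <- X %*% beta_hat                     # fitted values

cbind(beta_hat, coef(lm(y ~ x1 + x2)))         # the two columns should agree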
k-nearest neighbor model : low stability but high accuracy (low bias, high variance)
use observations in training set closest in input space to x to form \hat Y
formally : f(x)=\frac1k\sum_{x_i \in N_k(x)}y_i
N_k(x) : neighborhood of x defined by the k closest points x_i in the training sample
find k observations with x_i closest to new x in input space, and average their responses
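A minimal base-R sketch of k-nearest-neighbor regression as described above (Euclidean distance); the training data, the query point and k are made up.

set.seed(1)
x_train <- matrix(rnorm(50 * 2), ncol = 2)         # 50 training points with p = 2 features
y_train <- x_train[, 1] + rnorm(50, sd = 0.1)      # made-up responses

knn_predict <- function(x_new, x_train, y_train, k = 5) {
  d <- sqrt(colSums((t(x_train) - x_new)^2))       # Euclidean distance to every training point
  mean(y_train[order(d)[1:k]])                     # average the responses of the k nearest neighbors
}

knn_predict(c(0.5, -0.2), x_train, y_train, k = 5)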
Statistical Decision Theory
loss function L(Y,f(X)) : penalizes errors in prediction
expected prediction error : EPE(f) = E_{x,y}(L(Y,f(X)))
expected (squared) prediction error : criterion for choosing f based on the squared error loss function (L2)
L2 Loss : L_2 = L(Y,f(X)) = (Y-f(X))^2 (most popular)
EPE(f) = E_{x,y}(Y-f(X))^2 = \int\int(y-f(x))^2p(x,y)\,dx\,dy = \int\left(\int(y-f(x))^2p(y|x)\,dy\right)p(x)\,dx = E_x(E_{y|x}((Y-f(X))^2\;|\;X))
optimal Bayes classifier : minimize E_{y|x}((Y-f(X))^2\;|\;X) for all X
f_{bayes}(X)=E_{y|x}(Y\;|\;X)
nearest neighbor : f(x) = Ave(y_i \; | \; x_i \in N_k(x))
as n,k \to \infty, \frac kn\to 0 : f(x) \to E(Y\;|\;X=x)
L1 Loss : E|Y-f(X)| (abs. value)
f(x) = \text{median}(Y\;|\;X=x)
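A quick empirical check over a grid of constant predictions (sample values made up): the average L2 loss is minimized near the mean, the average L1 loss near the median.

set.seed(1)
y <- rexp(1000)                      # skewed sample, so mean and median differ
f <- seq(0, 3, by = 0.01)            # candidate constant predictions

l2 <- sapply(f, function(c) mean((y - c)^2))
l1 <- sapply(f, function(c) mean(abs(y - c)))

c(f[which.min(l2)], mean(y))         # L2-optimal constant ~ mean
c(f[which.min(l1)], median(y))       # L1-optimal constant ~ median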
Categorical Data
estimate \hat G : contains values in the set of possible classes \mathcal G where |\mathcal G| = K
loss function : L \in \R^{K \times K}
zero on the diagonal
nonnegative elsewhere
L(k,l) : price paid for classifying an observation belonging to class \mathcal G_k as \mathcal G_l
zero-one loss function : all misclassifications are charged one unit
L(k,l)=\begin{cases}0&k=l\\1&k \neq l\end{cases}
EPE = E(L(G,\hat G(X)))=E_x(E_{g|x}(L(G,\hat G(X))\;|\;X))
E_{g|x}(L(G,\hat G(X))\;|\;X) = \sum_{k=1}^K L(\mathcal G_k, f(X))\,p(\mathcal G_k \;|\; X)
\sum_{k=1}^K p(\mathcal G_k \;|\; X) = 1
f(X) = \argmin_{g\in G} \sum_{k=1}^K L(\mathcal G_k, g)\,p(\mathcal G_k \;|\; X)
zero-one loss : f(X) = \argmax_{g\in G}\,p(g\;|\;X) (see the sketch at the end of this block)
f(X) = f_{bayes}(X)
EPE(f_{bayes}) = E_{x,y}(L(Y,f_{bayes}(X)))
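A tiny base-R sketch of the zero-one-loss rule above: with assumed (made-up) class posteriors p(\mathcal G_k \;|\; X=x), the expected-loss minimizer is simply the class with the largest posterior.

classes <- c("setosa", "versicolor", "virginica")  # assumed class labels
post    <- c(0.10, 0.25, 0.65)                     # assumed posteriors p(G_k | X = x), summing to 1

K <- length(classes)
L <- 1 - diag(K)                                   # zero-one loss: 0 on the diagonal, 1 elsewhere

expected_loss <- as.vector(t(L) %*% post)          # expected loss of predicting each class
classes[which.min(expected_loss)]                  # same as classes[which.max(post)]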
Summary by Flavius Schmidt, ge83pux, 2025.
https://home.cit.tum.de/~scfl/
Images from Wikimedia.