Lecture 4: Simple Linear Regression Models, with
Hints at Their Estimation
36-401, Fall 2017, Section B
1 The Simple Linear Regression Model
Let’s recall the simple linear regression model from last time. This is a statistical
model with two variables X and Y , where we try to predict Y from X. The
assumptions of the model are as follows:
1. The distribution of $X$ is arbitrary (and perhaps $X$ is even non-random).
2. If $X = x$, then $Y = \beta_0 + \beta_1 x + \epsilon$, for some constants ("coefficients", "parameters") $\beta_0$ and $\beta_1$, and some random noise variable $\epsilon$.
3. $E[\epsilon | X = x] = 0$ (no matter what $x$ is), $\mathrm{Var}[\epsilon | X = x] = \sigma^2$ (no matter what $x$ is).
4. $\epsilon$ is uncorrelated across observations.
To elaborate, with multiple data points $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, the model says that, for each $i \in 1:n$,
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \tag{1} \]
where the noise variables $\epsilon_i$ all have the same expectation (0) and the same variance ($\sigma^2$), and $\mathrm{Cov}[\epsilon_i, \epsilon_j] = 0$ (unless $i = j$, of course).
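To make these assumptions concrete, here is a minimal R sketch (not part of the original notes) that simulates data from the model in Eq. 1. The parameter values, the sample size, and the uniform distribution for $X$ are made-up choices for illustration; Gaussian noise is used only for convenience, since the simple model itself only requires mean-zero, constant-variance, uncorrelated noise.

# Simulate n observations from the simple linear regression model (Eq. 1).
# beta0, beta1, sigma, n and the distribution of X are illustrative choices.
set.seed(36401)
n <- 100
beta0 <- 5; beta1 <- -2; sigma <- 0.5
x <- runif(n, min = 0, max = 3)            # assumption 1: X can have any distribution
epsilon <- rnorm(n, mean = 0, sd = sigma)  # mean-0, constant-variance noise (Gaussian only for convenience)
y <- beta0 + beta1 * x + epsilon           # assumption 2: Y is linear in X plus noise
plot(x, y)
abline(a = beta0, b = beta1, col = "blue") # the true regression line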
1.1 “Plug-In” Estimates
In lecture 1, we saw that the optimal linear predictor of $Y$ from $X$ has slope $\beta_1 = \mathrm{Cov}[X, Y] / \mathrm{Var}[X]$, and intercept $\beta_0 = E[Y] - \beta_1 E[X]$. A common tactic in devising estimators is to use what's sometimes called the "plug-in principle", where we find equations for the parameters which would hold if we knew the full distribution, and "plug in" the sample versions of the population quantities. We saw this in the last lecture, where we estimated $\beta_1$ by the ratio of the sample covariance to the sample variance:
\[ \hat{\beta}_1 = \frac{c_{XY}}{s^2_X} \tag{2} \]
We also saw, in the notes to the last lecture, that so long as the law of large numbers holds,
\[ \hat{\beta}_1 \rightarrow \beta_1 \tag{3} \]
as $n \rightarrow \infty$. It follows easily that
\[ \hat{\beta}_0 = \overline{Y} - \hat{\beta}_1 \overline{X} \tag{4} \]
will also converge on $\beta_0$.
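As a concrete follow-up (again not from the original notes), the plug-in estimates of Eqs. 2 and 4 are one line each in R; this reuses the simulated x and y from the sketch above, so the true values are 5 and -2.

# Plug-in estimates of the slope (Eq. 2) and intercept (Eq. 4).
# cov() and var() both divide by n-1, so their ratio equals c_XY / s_X^2.
b1.hat <- cov(x, y) / var(x)
b0.hat <- mean(y) - b1.hat * mean(x)
c(b0.hat, b1.hat)                          # should be close to c(5, -2)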
1.2 Least Squares Estimates
An alternative way of estimating the simple linear regression model starts from
the objective we are trying to reach, rather than from the formula for the slope.
Recall, from lecture 1, that the true optimal slope and intercept are the ones
which minimize the mean squared error:
\[ (\beta_0, \beta_1) = \operatorname*{argmin}_{(b_0, b_1)} E\left[ \left( Y - (b_0 + b_1 X) \right)^2 \right] \tag{5} \]
This is a function of the complete distribution, so we can’t get it from data,
but we can approximate it with data. The in-sample, empirical or training
MSE is
\[ \widehat{MSE}(b_0, b_1) \equiv \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_i) \right)^2 \tag{6} \]
Notice that this is a function of $b_0$ and $b_1$; it is also, of course, a function of the data, $(x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)$, but we will generally suppress that in our notation.
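For concreteness, here is a small hypothetical helper (not in the notes) implementing the in-sample MSE of Eq. 6; the function name and the reuse of the simulated data from above are my own choices.

# In-sample (training) MSE of Eq. 6, as a function of a candidate (b0, b1)
mse.hat <- function(b0, b1, x, y) {
  mean((y - (b0 + b1 * x))^2)
}
mse.hat(b0.hat, b1.hat, x, y)              # evaluate at the plug-in estimates from above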
If our samples are all independent, for any fixed $(b_0, b_1)$, the law of large numbers tells us that $\widehat{MSE}(b_0, b_1) \rightarrow MSE(b_0, b_1)$ as $n \rightarrow \infty$. So it doesn't seem unreasonable to try minimizing the in-sample error, which we can compute, as a proxy for minimizing the true MSE, which we can't. Where does it lead us?
Start by taking the derivatives with respect to the slope and the intercept:
\[ \frac{\partial \widehat{MSE}}{\partial b_0} = \frac{1}{n} \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))(-2) \tag{7} \]
\[ \frac{\partial \widehat{MSE}}{\partial b_1} = \frac{1}{n} \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))(-2 x_i) \tag{8} \]
Set these to zero at the optimum $(\hat{\beta}_0, \hat{\beta}_1)$:
\[ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) = 0 \tag{9} \]
\[ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) x_i = 0 \]
These are often called the normal equations for least-squares estimation, or the estimating equations: a system of two equations in two unknowns, whose solution gives the estimate. Many people would, at this point, remove the factor of $1/n$, but I think it makes it easier to understand the next steps:
\[ \overline{y} - \hat{\beta}_0 - \hat{\beta}_1 \overline{x} = 0 \tag{10} \]
\[ \overline{xy} - \hat{\beta}_0 \overline{x} - \hat{\beta}_1 \overline{x^2} = 0 \tag{11} \]
The first equation, re-written, gives
\[ \hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x} \tag{12} \]
Substituting this into the remaining equation,
\[ 0 = \overline{xy} - \overline{y}\,\overline{x} + \hat{\beta}_1 \overline{x}\,\overline{x} - \hat{\beta}_1 \overline{x^2} \tag{13} \]
\[ 0 = c_{XY} - \hat{\beta}_1 s^2_X \tag{14} \]
\[ \hat{\beta}_1 = \frac{c_{XY}}{s^2_X} \tag{15} \]
That is, the least-squares estimate of the slope is our old friend the plug-in
estimate of the slope, and thus the least-squares intercept is also the plug-in
intercept.
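As a quick numerical check (not in the original notes), R's lm(), which minimizes the in-sample MSE, should reproduce the plug-in estimates computed above on the same illustrative data.

# Least squares via lm() versus the plug-in estimates from above
coef(lm(y ~ x))
c(b0.hat, b1.hat)                          # the two should agree (up to rounding)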
Going forward The equivalence between the plug-in estimator and the least-
squares estimator is a bit of a special case for linear models. In some non-linear
models, least squares is quite feasible (though the optimum can only be found
numerically, not in closed form); in others, plug-in estimates are more useful
than optimization.
1.3 Bias, Variance and Standard Error of Parameter Estimates
Whether we think of it as deriving from plugging in or from least squares, we can now work out some of the properties of this estimator of the coefficients, using the model assumptions. We'll start with the slope, $\hat{\beta}_1$.
\[ \hat{\beta}_1 = \frac{c_{XY}}{s^2_X} \tag{16} \]
\[ = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i y_i - \overline{x}\,\overline{y}}{s^2_X} \tag{17} \]
\[ = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i (\beta_0 + \beta_1 x_i + \epsilon_i) - \overline{x}(\beta_0 + \beta_1 \overline{x} + \overline{\epsilon})}{s^2_X} \tag{18} \]
\[ = \frac{\beta_0 \overline{x} + \beta_1 \overline{x^2} + \frac{1}{n} \sum_{i=1}^{n} x_i \epsilon_i - \overline{x}\beta_0 - \beta_1 \overline{x}^2 - \overline{x}\,\overline{\epsilon}}{s^2_X} \tag{19} \]
\[ = \frac{\beta_1 s^2_X + \frac{1}{n} \sum_{i=1}^{n} x_i \epsilon_i - \overline{x}\,\overline{\epsilon}}{s^2_X} \tag{20} \]
\[ = \beta_1 + \frac{\frac{1}{n} \sum_{i=1}^{n} x_i \epsilon_i - \overline{x}\,\overline{\epsilon}}{s^2_X} \tag{21} \]
Since $\overline{x}\,\overline{\epsilon} = n^{-1} \sum_i \overline{x} \epsilon_i$,
\[ \hat{\beta}_1 = \beta_1 + \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x}) \epsilon_i}{s^2_X} \tag{22} \]
This representation of the slope estimate shows that it is equal to the true slope ($\beta_1$) plus something which depends on the noise terms (the $\epsilon_i$, and their sample average $\overline{\epsilon}$). We'll use this to find the expected value and the variance of the estimator $\hat{\beta}_1$.
In the next couple of paragraphs, I am going to treat the $x_i$ as non-random variables. This is appropriate in "designed" or "controlled" experiments, where we get to choose their values. In randomized experiments or in observational studies, obviously the $x_i$ aren't necessarily fixed; however, these expressions will be correct for the conditional expectation $E[\hat{\beta}_1 | x_1, \ldots x_n]$ and conditional variance $\mathrm{Var}[\hat{\beta}_1 | x_1, \ldots x_n]$, and I will come back to how we get the unconditional expectation and variance.
Expected value and bias Recall that $E[\epsilon_i | X_i] = 0$, so
\[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x}) E[\epsilon_i] = 0 \tag{23} \]
Thus,
\[ E[\hat{\beta}_1] = \beta_1 \tag{24} \]
Since the bias of an estimator is the difference between its expected value and the truth, $\hat{\beta}_1$ is an unbiased estimator of the optimal slope.
(To repeat what I'm sure you remember from mathematical statistics: "bias" here is a technical term, meaning no more and no less than $E[\hat{\beta}_1] - \beta_1$. An unbiased estimator could still make systematic mistakes; for instance, it could underestimate 99% of the time, provided that the 1% of the time it over-estimates, it does so by much more than it under-estimates. Moreover, unbiased estimators are not necessarily superior to biased ones: the total error depends on both the bias of the estimator and its variance, and there are many situations where you can remove lots of bias at the cost of adding a little variance. Least squares for simple linear regression happens not to be one of them, but you shouldn't expect that as a general rule.)
Turning to the intercept,
\[ E[\hat{\beta}_0] = E[\overline{Y} - \hat{\beta}_1 \overline{X}] \tag{25} \]
\[ = \beta_0 + \beta_1 \overline{X} - E[\hat{\beta}_1] \overline{X} \tag{26} \]
\[ = \beta_0 + \beta_1 \overline{X} - \beta_1 \overline{X} \tag{27} \]
\[ = \beta_0 \tag{28} \]
so it, too, is unbiased.
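As a sanity check (not part of the notes), a small Monte Carlo simulation in R, with made-up parameter values and the x's held fixed across replications, illustrates this unbiasedness.

# Monte Carlo check of unbiasedness, holding the x's fixed across replications.
# beta0, beta1, sigma, n and the x values are illustrative choices.
set.seed(401)
n <- 30; beta0 <- 5; beta1 <- -2; sigma <- 3
x <- runif(n, 0, 10)                       # a fixed design
one.fit <- function() {
  y <- beta0 + beta1 * x + rnorm(n, 0, sigma)
  b1 <- cov(x, y) / var(x)                 # least-squares / plug-in slope
  b0 <- mean(y) - b1 * mean(x)             # least-squares / plug-in intercept
  c(b0, b1)
}
estimates <- replicate(1e4, one.fit())     # 2 x 10000 matrix of estimates
rowMeans(estimates)                        # should be close to c(5, -2)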
Variance and Standard Error Using the formula for the variance of a sum from lecture 1, and the model assumption that all the $\epsilon_i$ are uncorrelated with each other,
\[ \mathrm{Var}[\hat{\beta}_1] = \mathrm{Var}\left[ \beta_1 + \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x}) \epsilon_i}{s^2_X} \right] \tag{29} \]
\[ = \mathrm{Var}\left[ \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x}) \epsilon_i}{s^2_X} \right] \tag{30} \]
\[ = \frac{\frac{1}{n^2} \sum_{i=1}^{n} (x_i - \overline{x})^2 \mathrm{Var}[\epsilon_i]}{(s^2_X)^2} \tag{31} \]
\[ = \frac{\frac{\sigma^2}{n} s^2_X}{(s^2_X)^2} \tag{32} \]
\[ = \frac{\sigma^2}{n s^2_X} \tag{33} \]
In words, this says that the variance of the slope estimate goes up as the noise around the regression line ($\sigma^2$) gets bigger, and goes down as we have more observations ($n$), which are further spread out along the horizontal axis ($s^2_X$); it should not be surprising that it's easier to work out the slope of a line from many, well-separated points on the line than from a few points smushed together.
The standard error of an estimator is just its standard deviation, or the square root of its variance:
\[ \mathrm{se}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{n s^2_X}} \tag{34} \]
I will leave working out the variance of $\hat{\beta}_0$ as an exercise.
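Continuing the same hypothetical simulation (the objects x, n, sigma and estimates come from the sketch above), one can check Eq. 33 numerically; the agreement is only approximate, as with any Monte Carlo.

# Monte Carlo variance of the slope versus sigma^2 / (n * s_X^2) from Eq. 33
s2.X <- mean((x - mean(x))^2)              # "divide by n" sample variance, as in the notes
var(estimates[2, ])                        # simulated variance of the slope estimates
sigma^2 / (n * s2.X)                       # theoretical value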
Unconditional-on-X Properties The last few paragraphs, as I said, have looked at the expectation and variance of $\hat{\beta}_1$ conditional on $x_1, \ldots x_n$, either because the x's really are non-random (e.g., controlled by us), or because we're just interested in conditional inference. If we do care about unconditional properties, then we still need to find $E[\hat{\beta}_1]$ and $\mathrm{Var}[\hat{\beta}_1]$, not just $E[\hat{\beta}_1 | x_1, \ldots x_n]$ and $\mathrm{Var}[\hat{\beta}_1 | x_1, \ldots x_n]$. Fortunately, this is easy, so long as the simple linear regression model holds.
To get the unconditional expectation, we use the "law of total expectation":
\[ E[\hat{\beta}_1] = E\left[ E[\hat{\beta}_1 | X_1, \ldots X_n] \right] \tag{35} \]
\[ = E[\beta_1] = \beta_1 \tag{36} \]
That is, the estimator is unconditionally unbiased.
To get the unconditional variance, we use the "law of total variance":
\[ \mathrm{Var}[\hat{\beta}_1] = E\left[ \mathrm{Var}[\hat{\beta}_1 | X_1, \ldots X_n] \right] + \mathrm{Var}\left[ E[\hat{\beta}_1 | X_1, \ldots X_n] \right] \tag{37} \]
\[ = E\left[ \frac{\sigma^2}{n s^2_X} \right] + \mathrm{Var}[\beta_1] \tag{38} \]
\[ = \frac{\sigma^2}{n} E\left[ \frac{1}{s^2_X} \right] \tag{39} \]
1.4 Parameter Interpretation; Causality
Two of the parameters are easy to interpret.
$\sigma^2$ is the variance of the noise around the regression line; $\sigma$ is a typical distance of a point from the line. ("Typical" here in a special sense: it's the root-mean-squared distance, rather than, say, the average absolute distance.)
$\beta_0$ is simply the expected value of $Y$ when $X$ is 0, $E[Y | X = 0]$. The point $X = 0$ usually has no special significance, but this setting does ensure that the line goes through the point $(E[X], E[Y])$.
The interpretation of the slope is both very straightforward and very tricky. Mathematically, it's easy to convince yourself that, for any $x$,
\[ \beta_1 = E[Y | X = x] - E[Y | X = x - 1] \tag{40} \]
or, for any $x_1, x_2$,
\[ \beta_1 = \frac{E[Y | X = x_2] - E[Y | X = x_1]}{x_2 - x_1} \tag{41} \]
This is just saying that the slope of a line is "rise/run".
The tricky part is that we have a very strong, natural tendency to interpret this as telling us something about causation: "If we change $X$ by 1, then on average $Y$ will change by $\beta_1$". This interpretation is usually completely unsupported by the analysis. If I use an old-fashioned mercury thermometer, the height of mercury in the tube usually has a nice linear relationship with the temperature of the room the thermometer is in. This linear relationship goes both ways, so we could regress temperature ($Y$) on mercury height ($X$). But if I manipulate the height of the mercury (say, by changing the ambient pressure, or shining a laser into the tube, etc.), changing the height $X$ will not, in fact, change the temperature outside.
The right way to interpret $\beta_1$ is not as the result of a change, but as an expected difference. The correct catch-phrase would be something like "If we select two sets of cases from the un-manipulated distribution where $X$ differs by 1, we expect $Y$ to differ by $\beta_1$." This covers the thermometer example, and every other example I can think of. It is, I admit, much more inelegant than "If $X$ changes by 1, $Y$ changes by $\beta_1$ on average", but it has the advantage of being true, which the other does not.
There are circumstances where regression can be a useful part of causal
inference, but we will need a lot more tools to grasp them; that will come
towards the end of 402.
2 The Gaussian-Noise Simple Linear Regression
Model
We have, so far, assumed comparatively little about the noise term $\epsilon$. The advantage of this is that our conclusions apply to lots of different situations; the drawback is that there's really not all that much more to say about our estimator $\hat{\beta}$ or our predictions than we've already gone over. If we made more detailed assumptions about $\epsilon$, we could make more precise inferences.
There are lots of forms of distributions for $\epsilon$ which we might contemplate, and which are compatible with the assumptions of the simple linear regression model (Figure 1). The one which has become the most common over the last two centuries is to assume $\epsilon$ follows a Gaussian distribution.
The result is the Gaussian-noise simple linear regression model$^1$:
1. The distribution of $X$ is arbitrary (and perhaps $X$ is even non-random).
2. If $X = x$, then $Y = \beta_0 + \beta_1 x + \epsilon$, for some constants ("coefficients", "parameters") $\beta_0$ and $\beta_1$, and some random noise variable $\epsilon$.
3. $\epsilon \sim N(0, \sigma^2)$, independent of $X$.
$^1$ Our textbook, rather old-fashionedly, calls this the "normal error" model rather than "Gaussian noise". I dislike this: "normal" is an over-loaded word in math, while "Gaussian" is (comparatively) specific; "error" made sense in Gauss's original context of modeling, specifically, errors of observation, but is misleading generally; and calling Gaussian distributions "normal" suggests they are much more common than they really are.
[Figure 1 appears here: six panels, each plotting a candidate noise density against $\epsilon$ over $(-3, 3)$; see the caption and the code below.]
par(mfrow=c(2,3))
curve(dnorm(x), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))  # Gaussian
curve(exp(-abs(x))/2, from=-3, to=3, xlab=expression(epsilon), ylab="",
      ylim=c(0,1))                                            # double-exponential (Laplacian)
curve(sqrt(pmax(0, 1-x^2))/(pi/2), from=-3, to=3, xlab=expression(epsilon),
      ylab="", ylim=c(0,1))                                   # "circular" distribution
curve(dt(x,3), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))   # t with 3 df
curve(dgamma(x+1.5, shape=1.5, scale=1), from=-3, to=3,
      xlab=expression(epsilon), ylab="", ylim=c(0,1))         # shifted gamma, mean 0
curve(0.5*dgamma(x+1.5, shape=1.5, scale=1) +
        0.5*dgamma(0.5-x, shape=0.5, scale=1), from=-3,
      to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))   # mixture of two gammas
par(mfrow=c(1,1))
Figure 1: Some possible noise distributions for the simple linear regression model, since all have $E[\epsilon] = 0$, and could get any variance by scaling. (The model is even compatible with each observation taking its $\epsilon_i$ from a different distribution.) From top left to bottom right: Gaussian; double-exponential ("Laplacian"); "circular" distribution; t with 3 degrees of freedom; a gamma distribution (shape 1.5, scale 1) shifted to have mean 0; mixture of two gammas with shape 1.5 and shape 0.5, each off-set to have expectation 0. The first three were all used as error models in the 18th and 19th centuries. (See the source file for how to get the code below the figure.)
4. $\epsilon$ is independent across observations.
You will notice that these assumptions are strictly stronger than those of the simple linear regression model. More exactly, the first two assumptions are the same, while the third and fourth assumptions of the Gaussian-noise model imply the corresponding assumptions of the other model. This means that everything we have done so far directly applies to the Gaussian-noise model. On the other hand, the stronger assumptions let us say more. They tell us, exactly, the probability distribution for $Y$ given $X$, and so will let us get exact distributions for predictions and for other inferential statistics.
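As one concrete illustration of the extra strength (not from the notes), under the Gaussian-noise model the conditional distribution of $Y$ given $X = x$ is exactly $N(\beta_0 + \beta_1 x, \sigma^2)$, which a quick simulation at a single, arbitrarily chosen $x$ can confirm; the parameter values below are made up.

# Under the Gaussian-noise model, Y | X = x0 is exactly N(beta0 + beta1*x0, sigma^2)
beta0 <- 10; beta1 <- -5; sigma <- 0.5; x0 <- 2
y.sim <- beta0 + beta1 * x0 + rnorm(1e4, 0, sigma)
hist(y.sim, breaks = 50, freq = FALSE, main = "", xlab = "y")
curve(dnorm(x, mean = beta0 + beta1 * x0, sd = sigma),
      add = TRUE, col = "blue")            # exact conditional density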
Why the Gaussian noise model? Why should we think that the noise
around the regression line would follow a Gaussian distribution, independent of
X? There are two big reasons.
1. Central limit theorem The noise might be due to adding up the effects
of lots of little random causes, all nearly independent of each other and
of X, where each of the effects are of roughly similar magnitude. Then
the central limit theorem will take over, and the distribution of the sum
of effects will indeed be pretty Gaussian. For Gauss’s original context,
X was (simplifying) “Where is such-and-such-a-planet in space?”, Y was
“Where does an astronomer record the planet as appearing in the sky?”,
and noise came from defects in the telescope, eye-twitches, atmospheric
distortions, etc., etc., so this was pretty reasonable. It is clearly not a universal truth of nature, however, or even something we should expect to hold true as a general rule, as the name "normal" suggests.
2. Mathematical convenience Assuming Gaussian noise lets us work out a
very complete theory of inference and prediction for the model, with lots
of closed-form answers to questions like “What is the optimal estimate of
the variance?” or “What is the probability that we’d see a fit this good
from a line with a non-zero intercept if the true line goes through the
origin?”, etc., etc. Answering such questions without the Gaussian-noise
assumption needs somewhat more advanced techniques, and much more
advanced computing; we’ll get to it towards the end of the class.
2.1 Visualizing the Gaussian Noise Model
The Gaussian noise model gives us not just an expected value for $Y$ at each $x$, but a whole conditional distribution for $Y$ at each $x$. To visualize it, then, it's not enough to just sketch a curve; we need a three-dimensional surface, showing, for each combination of $x$ and $y$, the probability density of $Y$ around that $y$ given that $x$. Figure 2 illustrates.
2.2 Maximum Likelihood vs. Least Squares
As you remember from your mathematical statistics class, the likelihood of a parameter value on a data set is the probability density at the data under
[Figure 2 appears here: a perspective plot of the conditional density $p(y|x)$ over $x \in (-1, 3)$ and $y \in (-4, 4)$; see the caption and the code below.]
x.seq <- seq(from=-1, to=3, length.out=150)
y.seq <- seq(from=-5, to=5, length.out=150)
cond.pdf <- function(x, y) { dnorm(y, mean=10-5*x, sd=0.5) }  # density of Y given X = x
z <- outer(x.seq, y.seq, cond.pdf)                            # evaluate on the whole grid
persp(x.seq, y.seq, z, ticktype="detailed", phi=75, xlab="x",
      ylab="y", zlab=expression(p(y|x)), cex.axis=0.8)
Figure 2: Illustrating how the conditional pdf of $Y$ varies as a function of $X$, for a hypothetical Gaussian noise simple linear regression where $\beta_0 = 10$, $\beta_1 = -5$, and $\sigma^2 = (0.5)^2$. The perspective is adjusted so that we are looking nearly straight down from above on the surface. (Can you find a better viewing angle?) See help(persp) for the 3D plotting (especially the examples), and help(outer) for the outer function, which takes all combinations of elements from two vectors and pushes them through a function. How would you modify this so that the regression line went through the origin with a slope of 4/3 and a standard deviation of 5?
those parameters. We could not work with the likelihood with the simple linear regression model, because it didn't specify enough about the distribution to let us calculate a density. With the Gaussian-noise model, however, we can write down a likelihood$^2$. By the model's assumptions, if we think the parameters are $b_0, b_1, s^2$ (reserving the Greek letters for their true values), then $Y | X = x \sim N(b_0 + b_1 x, s^2)$, and $Y_i$ and $Y_j$ are independent given $X_i$ and $X_j$, so the over-all likelihood is
\[ \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi s^2}} e^{-\frac{(y_i - (b_0 + b_1 x_i))^2}{2 s^2}} \tag{42} \]
As usual, we work with the log-likelihood, which gives us the same information$^3$ but replaces products with sums:
\[ L(b_0, b_1, s^2) = -\frac{n}{2} \log 2\pi - n \log s - \frac{1}{2 s^2} \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_i) \right)^2 \tag{43} \]
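As a check on the algebra (not from the notes), Eq. 43 can be coded directly and compared against R's built-in Gaussian log-density; the data and parameter values below are made up.

# Log-likelihood of (b0, b1, s2) under the Gaussian-noise model, Eq. 43
loglik <- function(b0, b1, s2, x, y) {
  n <- length(y)
  -n/2 * log(2*pi) - n/2 * log(s2) - sum((y - (b0 + b1*x))^2) / (2*s2)
}
set.seed(404)
x <- runif(20); y <- 3 + 2*x + rnorm(20, 0, 0.5)
loglik(3, 2, 0.25, x, y)
sum(dnorm(y, mean = 3 + 2*x, sd = 0.5, log = TRUE))   # should give the same number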
We recall from mathematical statistics that when we’ve got a likelihood func-
tion, we generally want to maximize it. That is, we want to find the parameter
values which make the data we observed as likely, as probable, as the model
will allow. (This may not be very likely; that’s another issue.) We recall from
calculus that one way to maximize is to take derivatives and set them to zero.
\[ \frac{\partial L}{\partial b_0} = -\frac{1}{2 s^2} \sum_{i=1}^{n} 2 (y_i - (b_0 + b_1 x_i))(-1) \tag{44} \]
\[ \frac{\partial L}{\partial b_1} = -\frac{1}{2 s^2} \sum_{i=1}^{n} 2 (y_i - (b_0 + b_1 x_i))(-x_i) \tag{45} \]
Notice that when we set these derivatives to zero, all the multiplicative constants, in particular the prefactor of $\frac{1}{2 s^2}$, go away. We are left with
\[ \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) = 0 \tag{46} \]
\[ \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right) x_i = 0 \tag{47} \]
These are, up to a factor of $1/n$, exactly the equations we got from the method of least squares (Eq. 9). That means that the least squares solution is the maximum likelihood estimate under the Gaussian noise model; this is no coincidence$^4$.
$^2$ Strictly speaking, this is a "conditional" (on $X$) likelihood; but only pedants use the adjective in this context.
$^3$ Why is this?
$^4$ It's no coincidence because, to put it somewhat anachronistically, what Gauss did was ask himself "for what distribution of the noise would least squares maximize the likelihood?".
Now let's take the derivative with respect to $s$:
\[ \frac{\partial L}{\partial s} = -\frac{n}{s} + 2 \frac{1}{2 s^3} \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_i) \right)^2 \tag{48} \]
Setting this to 0 at the optimum, including multiplying through by $\hat{\sigma}^3$, we get
\[ \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \right)^2 \tag{49} \]
Notice that the right-hand side is just the in-sample mean squared error.
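In R (a sketch, not from the notes), this means the maximum likelihood estimate of $\sigma^2$ is just the mean of the squared residuals; note that summary() of an lm fit reports a slightly different estimate, which divides by $n - 2$ rather than $n$. The simulated data below are illustrative.

# MLE of sigma^2 (Eq. 49) is the in-sample MSE, dividing by n;
# summary(fit)$sigma^2 divides by n - 2 instead.
set.seed(403)
x <- runif(50, 0, 3)
y <- 10 - 5*x + rnorm(50, 0, 0.5)
fit <- lm(y ~ x)
mean(residuals(fit)^2)                     # MLE of sigma^2
summary(fit)$sigma^2                       # the n - 2 version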
Other models Maximum likelihood estimates of the regression curve coincide with least-squares estimates when the noise around the curve is additive, Gaussian, of constant variance, and both independent of $X$ and of other noise terms. These were all assumptions we used in setting up a log-likelihood which was, up to constants, proportional to the (negative) mean-squared error. If any of those assumptions fail, maximum likelihood and least squares estimates can diverge, though sometimes the MLE solves a "generalized" least squares problem (as we'll see later in this course).
Exercises
To think through, not to hand in.
1. Show that if $E[\epsilon | X = x] = 0$ for all $x$, then $\mathrm{Cov}[X, \epsilon] = 0$. Would this still be true if $E[\epsilon | X = x] = a$ for some other constant $a$?
2. Find the variance of $\hat{\beta}_0$. Hint: Do you need to worry about covariance between $\overline{Y}$ and $\hat{\beta}_1$?