Let's discuss the method Ashenfelter used to build his model, linear regression. We'll start with one-variable linear regression, which uses just one independent variable to predict the dependent variable.

This figure shows a plot of one of the independent variables, average growing season temperature, and the dependent variable, wine price. The goal of linear regression is to create a predictive line through the data. There are many different lines that could be drawn to predict wine price using average growing season temperature. A simple option would be a flat line at the average price, in this case 7.07. The equation for this line is y = 7.07. This linear regression model would predict 7.07 regardless of the temperature. But it looks like a better line would have a positive slope, such as this line in blue. The equation for this line is y = 0.5*(AGST) - 1.25. This linear regression model would predict a higher price when the temperature is higher.

Let's make this idea a little more formal. In general form, a one-variable linear regression model is a linear equation to predict the dependent variable, y, using the independent variable, x. Beta 0 is the intercept term, or intercept coefficient, and Beta 1 is the slope of the line, or the coefficient for the independent variable, x. For each observation, i, we have data for the dependent variable, y_i, and data for the independent variable, x_i. Using this equation, we make a prediction, Beta 0 plus Beta 1 times x_i, for each data point, i. This prediction is hopefully close to the true outcome, y_i. But since the coefficients have to be the same for all data points, i, we often make a small error, which we'll call epsilon_i. This error term is also often called a residual. Our errors will only all be zero if all our points lie perfectly on the same line. This rarely happens, so we know that our model will probably make some errors. The best model, or best choice of coefficients Beta 0 and Beta 1, has the smallest error terms, or smallest residuals.
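Before looking at the residuals point by point, here is a minimal Python sketch of such a one-variable least-squares fit. The temperature and price values are made up purely for illustration (they are not the actual wine data from the lecture), and the closed-form expressions for Beta 1 and Beta 0 are the standard least-squares estimates.

```python
import numpy as np

# Hypothetical data: average growing season temperature (x) and wine price (y).
# Illustrative values only, not the actual data set from the lecture.
x = np.array([15.0, 15.8, 16.2, 16.8, 17.1, 17.5])
y = np.array([6.2, 6.6, 7.0, 7.3, 7.5, 8.0])

# Least-squares estimates for the model y = Beta0 + Beta1 * x + epsilon:
# Beta1 = cov(x, y) / var(x), Beta0 = mean(y) - Beta1 * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Predictions and residuals (epsilon_i = actual y_i minus the prediction for point i)
predictions = beta0 + beta1 * x
residuals = y - predictions

print(f"Beta0 = {beta0:.3f}, Beta1 = {beta1:.3f}")
print("residuals:", np.round(residuals, 3))
```

The coefficients chosen this way are exactly the ones that minimize the sum of the squared residuals, which is the criterion the rest of this section builds on.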
This figure shows the blue line that we drew in the beginning. We can compute the residuals, or errors, of this line for each data point. For example, for this point the actual value is about 6.2. Using our regression model, we predict about 6.5. So the error for this data point is negative 0.3, which is the actual value minus our prediction. As another example, for this point the actual value is about 8. Using our regression model, we predict about 7.5. So the error for this data point is about 0.5, again the actual value minus our prediction.

One measure of the quality of a regression line is the sum of squared errors, or SSE. This is the sum of the squared residuals, or error terms. Let n equal the number of data points that we have in our data set. Then the sum of squared errors is equal to the error we make on the first data point, squared, plus the error we make on the second data point, squared, and so on, up to the error we make on the n-th data point, squared.

We can compute the sum of squared errors for both the red line and the blue line. As expected, the blue line is a better fit than the red line, since it has a smaller sum of squared errors. The line that gives the minimum sum of squared errors is shown in green. This is the line that our regression model will find.

Although the sum of squared errors allows us to compare lines on the same data set, it's hard to interpret for two reasons. The first is that it scales with n, the number of data points. If we built the same model with twice as much data, the sum of squared errors might be twice as big, but this wouldn't mean it's a worse model. The second is that the units are hard to understand: the sum of squared errors is in squared units of the dependent variable. Because of these problems, Root Mean Squared Error, or RMSE, is often used. This divides the sum of squared errors by n and then takes the square root, so it's normalized by n and is in the same units as the dependent variable.
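To make these two error measures concrete, the short Python sketch below computes SSE and RMSE from a vector of actual values and a vector of model predictions. The numbers are hypothetical, chosen only to mirror the per-point errors worked out above; they are not the lecture's wine data.

```python
import numpy as np

def sse(actual, predicted):
    """Sum of squared errors: add up (actual_i - predicted_i)^2 over all n points."""
    return np.sum((actual - predicted) ** 2)

def rmse(actual, predicted):
    """Root mean squared error: divide SSE by n, then take the square root,
    which puts the error back in the units of the dependent variable."""
    return np.sqrt(sse(actual, predicted) / len(actual))

# Hypothetical actual prices and model predictions (illustrative values only).
actual = np.array([6.2, 7.1, 8.0, 7.4, 6.8])
predicted = np.array([6.5, 7.0, 7.5, 7.6, 6.9])

errors = actual - predicted  # e.g. 6.2 - 6.5 = -0.3 and 8.0 - 7.5 = 0.5, as above
print("errors:", np.round(errors, 2))
print("SSE: ", sse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```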
Another common error measure for linear regression is R squared. This error measure is nice because it compares the best model to a baseline model, the model that does not use any variables, or the red line from before. The baseline model predicts the average value of the dependent variable, regardless of the value of the independent variable. We can compute that the sum of squared errors for the best fit line, or the green line, is 5.73, and the sum of squared errors for the baseline, or the red line, is 10.15. The sum of squared errors for the baseline model is also known as the total sum of squares, commonly referred to as SST. Then the formula for R squared is R squared = 1 - SSE/SST, that is, 1 minus the sum of squared errors divided by the total sum of squares. In this case it equals 1 minus 5.73 divided by 10.15, which equals 0.44. R squared is nice because it captures the value added from using a linear regression model over just predicting the average outcome for every data point.

So what values do we expect to see for R squared? Well, both the sum of squared errors and the total sum of squares have to be greater than or equal to zero, because they are sums of squared terms, so they can't be negative. Additionally, the sum of squared errors has to be less than or equal to the total sum of squares. This is because our linear regression model could just set the coefficient for the independent variable to 0, and then we would have the baseline model. So our linear regression model will never be worse than the baseline model. In the worst case, the sum of squared errors equals the total sum of squares, and our R squared is equal to 0, which means no improvement over the baseline. In the best case, our linear regression model makes no errors, the sum of squared errors is equal to 0, and our R squared is equal to 1. So an R squared equal to 1, or close to 1, means a perfect or almost perfect predictive model.

R squared is nice because it's unitless and therefore universally interpretable between problems. However, it can still be hard to compare between problems. Good models for easy problems will have an R squared close to 1, but good models for hard problems can still have an R squared close to zero. Throughout this course we will see examples of both types of problems.
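As a closing numerical check of the R squared formula from this section, the small Python sketch below plugs in the SSE and SST values quoted for the wine example (5.73 and 10.15) and the two edge cases described above. The helper function is just an illustration of R squared = 1 - SSE/SST.

```python
def r_squared(sse, sst):
    """R squared = 1 - SSE / SST: the improvement over the baseline model
    that always predicts the average of the dependent variable."""
    return 1 - sse / sst

# Values quoted in the lecture for the wine example.
sse_best_fit = 5.73    # sum of squared errors of the green (best-fit) line
sst_baseline = 10.15   # sum of squared errors of the red (baseline) line, i.e. SST

print(round(r_squared(sse_best_fit, sst_baseline), 2))  # -> 0.44

# Edge cases: SSE equal to SST gives 0 (no improvement over the baseline),
# SSE equal to 0 gives 1 (a perfect predictive model).
print(r_squared(sst_baseline, sst_baseline))  # -> 0.0
print(r_squared(0.0, sst_baseline))           # -> 1.0
```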