In the previous video, we created linear regression models in R. Using the summary function, we were able to see the coefficients, as well as some other information. The output of the coefficients section of the summary function is shown here.

The independent variables are listed on the left. The Estimate column gives the coefficients for the intercept and for each of the independent variables in our model. The remaining columns help us determine if a variable should be included in the model, or if its coefficient is significantly different from 0. A coefficient of 0 means that the value of the independent variable does not change our prediction for the dependent variable. If a coefficient is not significantly different from 0, then we should probably remove the variable from our model, since it's not helping to predict the dependent variable.

The standard error gives a measure of how much the coefficient is likely to vary from the estimate value. The t value is the estimate divided by the standard error. It will be negative if the estimate is negative, and positive if the estimate is positive. The larger the absolute value of the t statistic, the more likely the coefficient is to be significant, so we want variables with a large absolute value in this column.

The last column gives the probability that a coefficient is actually 0. It will be large if the absolute value of the t statistic is small, and it will be small if the absolute value of the t statistic is large. We want variables with small values in this column.
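To see how these columns fit together, here is a minimal sketch (assuming the model3 fit from the previous video, though any lm fit would do) that pulls the coefficients table out of the summary and recomputes the last two columns by hand:

    # Extract the coefficients table from the summary output.
    coefs <- coef(summary(model3))

    # The t value is the estimate divided by the standard error ...
    t_values <- coefs[, "Estimate"] / coefs[, "Std. Error"]

    # ... and the last column is the two-sided tail probability from the
    # t distribution with the model's residual degrees of freedom.
    p_values <- 2 * pt(abs(t_values), df = model3$df.residual, lower.tail = FALSE)

    # These reproduce the "t value" and "Pr(>|t|)" columns of the summary.
    cbind(t_values, p_values)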
This is a lot of information, but the easiest way in R to determine if a variable is significant is to look at the stars at the end of each row. Three stars is the highest level of significance, and corresponds to a probability less than 0.001. The star coding scheme is explained at the bottom of the coefficients output. Three stars corresponds to probability values between 0 and 0.001, or the smallest possible probabilities. Two stars is also very significant, and corresponds to a probability between 0.001 and 0.01. One star is also significant, and corresponds to a probability between 0.01 and 0.05. A period, or dot, means that the variable is almost significant, and corresponds to a probability between 0.05 and 0.1. When we ask you to list the significant variables in the model, we will usually not include these. Nothing at the end of a row means that the variable is not significant in the model. Age and FrancePopulation are both insignificant in our model.

Let's switch to R, and see if we can improve our model. In the previous video, we built a linear regression model called model3 that used all of our independent variables to predict the dependent variable, Price. In the R console, we can see the summary output for this model. By looking at the coefficients section, we can see that both Age and FrancePopulation are insignificant in our model. Because of this, we should consider removing these variables from our model.

Let's start by just removing FrancePopulation, which we intuitively don't expect to be predictive of wine price anyway. Let's create a new model called model4, which again uses the lm function to predict Price using the independent variables AGST, HarvestRain, WinterRain, and Age. Here, we're not using FrancePopulation. Our data set, again, is wine, which will be the data used to create our model. Then let's look at the summary of model4.
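Typed out, this step looks roughly like the following (a sketch; the wine data frame is the one loaded in the previous video, e.g. with read.csv):

    # Fit model4: predict Price, leaving out FrancePopulation.
    model4 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
    summary(model4)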
We can see that our R-squared in this model is 0.8286, and our adjusted R-squared is 0.79. If we scroll back up to our previous model, our R-squared was 0.8294, and our adjusted R-squared was 0.784. So this model is just as strong, if not stronger, than before, because our adjusted R-squared actually increased when we removed FrancePopulation.

If we look at each of our variables and the stars, we now see that something a little strange happened. Before, Age was not significant at all in our model. But now, the variable Age has two stars, meaning that it's very significant in this new model. Why did this happen? This is due to something called multicollinearity. Age and FrancePopulation are what we call highly correlated.

Let's learn a bit more about correlation. Correlation measures the linear relationship between two variables, and is a number between +1 and -1. The highest a correlation can be is positive 1, which corresponds to a perfect positive linear relationship between the two variables. The smallest a correlation can be is negative 1, which corresponds to a perfect negative linear relationship between the two variables. In the middle of these two extremes is a correlation of 0, which corresponds to no linear relationship between the two variables.

Let's look at some examples. This plot graphs WinterRain on the x-axis, and wine price on the y-axis. By visually inspecting this plot, we can see that it looks like there's a slight positive linear relationship between these two variables. It turns out that the correlation between WinterRain and wine price is 0.14, which corresponds to a slightly positive linear relationship, as we saw visually.

This plot graphs HarvestRain on the x-axis, and AGST on the y-axis. It's hard to visually see a positive or negative linear trend in this data. It turns out that the correlation is equal to negative 0.06, which is very close to 0, and corresponds to very little linear relationship.

This plot shows the age of wine compared to the population of France. It looks like there's a very strong negative linear relationship between these two variables. It turns out that the correlation is equal to -0.99, which is very close to -1, the smallest a correlation can be.

Let's compute some correlations in R. We can compute the correlation between a pair of variables by using the cor function. Let's compute the correlation between WinterRain and Price. We start by typing the name of the function, cor, then the name of the first variable, wine$WinterRain, followed by a comma, and then the name of the second variable.
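Typed out, the correlation commands from this segment look like this (a sketch; the values in the comments are the ones quoted in the video):

    # Correlation between an independent variable and the dependent variable.
    cor(wine$WinterRain, wine$Price)       # about 0.1366505

    # Correlation between two independent variables.
    cor(wine$Age, wine$FrancePopulation)   # about -0.99

    # Correlation matrix for every pair of variables in the data set.
    cor(wine)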
If we hit Enter, we see that the correlation here between WinterRain and Price is 0.1366505. We can also compute the correlation between two other variables, let's say Age and FrancePopulation. Again, we use the function cor, and then type the names of the two variables. If we hit Enter, we see that the correlation between Age and FrancePopulation is about -0.99.

We can also compute the correlation between all variables in our data set by using the same function, cor, and typing the name of the data set, wine. Here, our output shows us a lot of numbers, and both the rows and the columns are labeled by the variable names. To find the correlation between two variables, we find one variable name on the row, and then go to the column labeled by the other variable name. So for example, if we want to find the correlation between AGST and Price, we can go down to the row labeled Price, and then across that row to the column labeled AGST. Here, we find that the correlation is about 0.6595.

So we have confirmed that the correlation between Age and FrancePopulation is very high, so we do have multicollinearity in our model. Note that multicollinearity refers to the situation where two independent variables are highly correlated. A high correlation between an independent variable and the dependent variable, like the correlation between AGST and Price, is a good thing, since we're trying to predict the dependent variable using the independent variable. Multicollinearity only applies to the case where two independent variables are highly positively or negatively correlated.

Because of multicollinearity, you always want to remove the insignificant variables one at a time. Let's see what would have happened if we had removed both Age and FrancePopulation at the same time. We'll call this model model5, and use the lm function to predict Price using only AGST, HarvestRain, and WinterRain. Again, we'll use the data set wine.
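That step looks like the following (a sketch, in the same style as model4 above):

    # Fit model5: remove both Age and FrancePopulation at once.
    model5 <- lm(Price ~ AGST + HarvestRain + WinterRain, data = wine)
    summary(model5)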
If we look at the summary of our model, model5, we can see that AGST, HarvestRain, and WinterRain are all fairly significant, but our multiple R-squared dropped to 0.75. In our previous model, model4, our R-squared was about 0.83. So if we had removed both Age and FrancePopulation at the same time, we would have lost a significant variable, Age, and the R-squared of our model would have been lower.

Why did we keep Age, and remove FrancePopulation instead? Well, we expect Age to be significant: older wines are typically more expensive. Since the population of France steadily increases with the year, FrancePopulation captures the same effect as Age, but is less interpretable in our model. Multicollinearity reminds us that coefficients are only interpretable in the presence of the other variables being used. High correlations can even cause coefficients to have the wrong sign. We'll see this in the next lecture.

So we fixed the multicollinearity problem between Age and FrancePopulation. Do we have any other highly correlated independent variables? If you look back at our correlation matrix, you can see that we don't. Since all of our other remaining variables are also significant, we'll stick with model4 as our model for the rest of this lecture.
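One quick way to double-check both of these conclusions at the console is sketched below; it is not part of the original lecture, and the 0.7 cutoff is an arbitrary illustration, not a rule from the video:

    # Correlations among the candidate independent variables only
    # (Price is the dependent variable, so we leave it out).
    ind <- wine[, c("WinterRain", "AGST", "HarvestRain", "Age", "FrancePopulation")]
    corMatrix <- cor(ind)
    diag(corMatrix) <- 0   # ignore each variable's correlation with itself
    which(abs(corMatrix) > 0.7, arr.ind = TRUE)   # only the Age/FrancePopulation pair

    # Adjusted R-squared comparison: model4 comes out ahead.
    summary(model4)$adj.r.squared   # about 0.79
    summary(model5)$adj.r.squared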