1
00:00:04,600 --> 00:00:08,210
Now, as we start to think about
building regression models

2
00:00:08,210 --> 00:00:11,080
with this data set, we need
to consider the possibility

3
00:00:11,080 --> 00:00:13,880
that there is
multicollinearity within

4
00:00:13,880 --> 00:00:15,080
the independent variables.

5
00:00:15,080 --> 00:00:16,580
And there's a good
reason to suspect

6
00:00:16,580 --> 00:00:19,060
that there would be
multicollinearity amongst

7
00:00:19,060 --> 00:00:20,790
the variables,
because in some sense,

8
00:00:20,790 --> 00:00:23,030
they're all measuring
the same thing, which

9
00:00:23,030 --> 00:00:26,620
is how strongly the Republican
candidate is performing

10
00:00:26,620 --> 00:00:28,620
in the particular state.

11
00:00:28,620 --> 00:00:32,009
So while normally, we would
run the correlation function

12
00:00:32,009 --> 00:00:35,380
on the training set, in
this case, it doesn't work.

13
00:00:35,380 --> 00:00:38,060
It says, 'x' must be numeric.

14
00:00:38,060 --> 00:00:41,750
And if we go back and look at
the structure of the training

15
00:00:41,750 --> 00:00:44,850
set, it jumps out why
we're getting this issue.

16
00:00:44,850 --> 00:00:46,230
It's because we're
trying to take

17
00:00:46,230 --> 00:00:48,150
the correlations of
the names of states,

18
00:00:48,150 --> 00:00:50,040
which doesn't make any sense.

19
00:00:50,040 --> 00:00:51,950
So to compute the
correlation, we're

20
00:00:51,950 --> 00:00:54,580
going to want to take the
correlation amongst just

21
00:00:54,580 --> 00:00:56,080
the independent
variables that we're

22
00:00:56,080 --> 00:00:58,820
going to be using to
predict, and we can also

23
00:00:58,820 --> 00:01:02,400
add in the dependent variable
to this correlation matrix.

24
00:01:02,400 --> 00:01:05,550
So I'll take cor
of the training set

25
00:01:05,550 --> 00:01:09,830
but just limit it to the
independent variables--

26
00:01:09,830 --> 00:01:17,640
Rasmussen, SurveyUSA,
PropR, and DiffCount.

27
00:01:17,640 --> 00:01:20,940
And then also, we'll add in the
dependent variable, Republican.
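
The correlation step described here can be sketched in R roughly as follows. The data frame `toy` is a made-up stand-in for the training set (its values are simulated, not the lecture's polling data), so the numbers it prints will not match the ones discussed in the video.

```r
# Hypothetical stand-in for the training set; every value is simulated.
set.seed(1)
n <- 60
strength <- rnorm(n)  # invented latent "Republican strength" per state
toy <- data.frame(
  State      = paste0("State", seq_len(n)),          # non-numeric column
  Rasmussen  = round(10 * strength + rnorm(n, sd = 3)),
  SurveyUSA  = round(10 * strength + rnorm(n, sd = 3)),
  PropR      = plogis(3 * strength + rnorm(n)),
  DiffCount  = round(3 * strength + rnorm(n, sd = 2)),
  Republican = as.integer(strength + rnorm(n) > 0)
)

# cor() on the whole data frame fails because State is not numeric.
whole <- try(cor(toy), silent = TRUE)
inherits(whole, "try-error")

# Limiting the data frame to the numeric columns of interest works.
vars <- c("Rasmussen", "SurveyUSA", "PropR", "DiffCount", "Republican")
cm <- cor(toy[vars])
round(cm, 2)
```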

28
00:01:26,450 --> 00:01:27,320
So there we go.

29
00:01:27,320 --> 00:01:30,260
We're seeing a lot
of big values here.

30
00:01:30,260 --> 00:01:33,680
For instance,
SurveyUSA and Rasmussen

31
00:01:33,680 --> 00:01:37,460
are independent variables that
have a correlation of 0.94,

32
00:01:37,460 --> 00:01:39,580
which is very, very
large and something

33
00:01:39,580 --> 00:01:40,870
that would be concerning.

34
00:01:40,870 --> 00:01:42,440
It means that probably
combining them

35
00:01:42,440 --> 00:01:45,800
together isn't going to do much
to produce a working regression

36
00:01:45,800 --> 00:01:47,670
model.

37
00:01:47,670 --> 00:01:50,400
So let's first
consider the case where

38
00:01:50,400 --> 00:01:54,170
we want to build a logistic
regression model with just one

39
00:01:54,170 --> 00:01:55,250
variable.

40
00:01:55,250 --> 00:01:57,580
So in this case,
it stands to reason

41
00:01:57,580 --> 00:01:59,330
that the variable
we'd want to add

42
00:01:59,330 --> 00:02:00,830
would be the one
that is most highly

43
00:02:00,830 --> 00:02:04,410
correlated with the
outcome, Republican.

44
00:02:04,410 --> 00:02:06,270
So if we read the
bottom row, which

45
00:02:06,270 --> 00:02:08,830
is the correlation of each
variable to Republican,

46
00:02:08,830 --> 00:02:12,220
we see that PropR is
probably the best candidate

47
00:02:12,220 --> 00:02:14,620
to include in our
single-variable model,

48
00:02:14,620 --> 00:02:16,490
because it's so
highly correlated,

49
00:02:16,490 --> 00:02:18,840
meaning it's going to do
a good job of predicting

50
00:02:18,840 --> 00:02:21,500
the Republican status.

51
00:02:21,500 --> 00:02:23,680
So let's build a model.

52
00:02:23,680 --> 00:02:26,290
We can call it mod1.

53
00:02:26,290 --> 00:02:31,190
So we'll call the glm function,
predicting Republican,

54
00:02:31,190 --> 00:02:34,880
using PropR alone.

55
00:02:34,880 --> 00:02:36,940
As always, we'll
pass along the data

56
00:02:36,940 --> 00:02:39,300
to train with as
our training set.

57
00:02:39,300 --> 00:02:41,720
And because we have
logistic regression,

58
00:02:41,720 --> 00:02:43,180
we need family = "binomial".

59
00:02:46,670 --> 00:02:51,170
And we can take a look at
this model using the summary

60
00:02:51,170 --> 00:02:52,490
function.
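
As a rough sketch of the one-variable model, assuming a simulated data frame `toy` in place of the real training set (the values are invented, so the coefficients and AIC will differ from the lecture's):

```r
# Hypothetical stand-in for the training set; every value is simulated.
set.seed(1)
n <- 60
strength <- rnorm(n)
toy <- data.frame(
  PropR      = plogis(3 * strength + rnorm(n)),
  Republican = as.integer(strength + rnorm(n) > 0)
)

# Logistic regression: family = "binomial" selects the logit link.
mod1 <- glm(Republican ~ PropR, data = toy, family = "binomial")

# Coefficient estimates, significance stars, and AIC are in the summary.
summary(mod1)
```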

61
00:02:52,490 --> 00:02:54,820
And we can see that
it looks pretty

62
00:02:54,820 --> 00:02:56,910
nice in terms of
its significance

63
00:02:56,910 --> 00:02:59,920
and the sign of
the coefficients.

64
00:02:59,920 --> 00:03:02,500
We have a lot of
stars over here.

65
00:03:02,500 --> 00:03:05,380
PropR is the
proportion of the polls

66
00:03:05,380 --> 00:03:06,700
that predicted a Republican win.

67
00:03:06,700 --> 00:03:10,370
We see that that has a very
high coefficient in terms

68
00:03:10,370 --> 00:03:12,120
of predicting that the
Republican will win

69
00:03:12,120 --> 00:03:14,850
in the state, which
makes a lot of sense.

70
00:03:14,850 --> 00:03:16,930
And we'll note down
that the AIC measuring

71
00:03:16,930 --> 00:03:20,230
the strength of
the model is 19.8.

72
00:03:20,230 --> 00:03:22,160
So this seems like a
very reasonable model.

73
00:03:22,160 --> 00:03:25,030
Let's see how it does in
terms of actually predicting

74
00:03:25,030 --> 00:03:28,440
the Republican outcome
on the training set.

75
00:03:28,440 --> 00:03:30,890
So first, we want to
compute the predictions,

76
00:03:30,890 --> 00:03:33,130
the predicted probabilities
that the Republican

77
00:03:33,130 --> 00:03:35,380
is going to win on
the training set.

78
00:03:35,380 --> 00:03:41,210
So we'll create a vector
called pred1, prediction one,

79
00:03:41,210 --> 00:03:43,630
then we'll call the
predict function.

80
00:03:43,630 --> 00:03:46,300
We'll pass it our model one.

81
00:03:46,300 --> 00:03:48,860
And we're not going
to pass it newdata,

82
00:03:48,860 --> 00:03:50,410
because we're just
making predictions

83
00:03:50,410 --> 00:03:51,760
on the training set right now.

84
00:03:51,760 --> 00:03:53,810
We're not looking at
test set predictions.

85
00:03:53,810 --> 00:03:59,300
But we do need to pass it type =
"response" to get probabilities

86
00:03:59,300 --> 00:04:01,470
out as the predictions.
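
A small sketch of why type = "response" matters, again using a simulated data frame `toy` as an assumed stand-in for the training set. Without it, predict returns values on the log-odds scale rather than probabilities.

```r
# Hypothetical stand-in for the training set; every value is simulated.
set.seed(1)
n <- 60
strength <- rnorm(n)
toy <- data.frame(
  PropR      = plogis(3 * strength + rnorm(n)),
  Republican = as.integer(strength + rnorm(n) > 0)
)
mod1 <- glm(Republican ~ PropR, data = toy, family = "binomial")

# No newdata argument: predictions are made on the data used for fitting.
pred1 <- predict(mod1, type = "response")  # probabilities
link1 <- predict(mod1)                     # default scale: log-odds

range(pred1)
# The inverse logit of the log-odds recovers the probabilities.
all.equal(unname(pred1), unname(plogis(link1)))
```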

87
00:04:01,470 --> 00:04:03,310
And now, we want to see
how well it's doing.

88
00:04:03,310 --> 00:04:05,650
So if we used a
threshold of 0.5,

89
00:04:05,650 --> 00:04:07,980
where we said if the
probability is at least 1/2,

90
00:04:07,980 --> 00:04:09,790
we're going to
predict Republican,

91
00:04:09,790 --> 00:04:11,860
otherwise, we'll
predict Democrat.

92
00:04:11,860 --> 00:04:14,600
Let's see how that would
do on the training set.

93
00:04:14,600 --> 00:04:17,269
So we'll want to use
the table function

94
00:04:17,269 --> 00:04:21,010
and look at the training
set Republican value

95
00:04:21,010 --> 00:04:25,120
against the logical
of whether pred1

96
00:04:25,120 --> 00:04:29,050
is greater than or equal to 0.5.

97
00:04:29,050 --> 00:04:33,260
So here, the rows, as usual, are
the outcome -- 1 is Republican,

98
00:04:33,260 --> 00:04:34,950
0 is Democrat.

99
00:04:34,950 --> 00:04:37,730
And the columns-- TRUE
means that we predicted

100
00:04:37,730 --> 00:04:40,580
Republican, FALSE means
we predicted Democrat.
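
The thresholding step might look like this in R. The data frame `toy` is simulated and only stands in for the training set, so the counts in the table are not the lecture's.

```r
# Hypothetical stand-in for the training set; every value is simulated.
set.seed(1)
n <- 60
strength <- rnorm(n)
toy <- data.frame(
  PropR      = plogis(3 * strength + rnorm(n)),
  Republican = as.integer(strength + rnorm(n) > 0)
)
mod1  <- glm(Republican ~ PropR, data = toy, family = "binomial")
pred1 <- predict(mod1, type = "response")

# Rows: actual outcome (1 = Republican, 0 = Democrat).
# Columns: TRUE = predicted Republican, FALSE = predicted Democrat.
conf <- table(Actual = toy$Republican, Predicted = pred1 >= 0.5)
conf

# Mistakes are the cases where prediction and outcome disagree.
mistakes <- sum(toy$Republican != as.integer(pred1 >= 0.5))
mistakes
```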

101
00:04:40,580 --> 00:04:42,870
So we see that on
the training set,

102
00:04:42,870 --> 00:04:45,320
this model with one
variable as a predictor

103
00:04:45,320 --> 00:04:48,550
makes four mistakes,
which is just

104
00:04:48,550 --> 00:04:52,280
about the same as our
smart baseline model.

105
00:04:52,280 --> 00:04:55,440
So now, let's see if we can
improve on this performance

106
00:04:55,440 --> 00:04:57,760
by adding in another variable.

107
00:04:57,760 --> 00:05:01,890
So if we go back up to
our correlations here,

108
00:05:01,890 --> 00:05:03,640
since there's so
much multicollinearity,

109
00:05:03,640 --> 00:05:06,250
we might want to search

110
00:05:06,250 --> 00:05:10,020
for a pair of variables that has
a relatively lower correlation

111
00:05:10,020 --> 00:05:13,900
with each other, because they
might kind of work together

112
00:05:13,900 --> 00:05:15,970
to improve the
prediction overall

113
00:05:15,970 --> 00:05:17,250
of the Republican outcome.

114
00:05:17,250 --> 00:05:20,020
If two variables are
highly, highly correlated,

115
00:05:20,020 --> 00:05:23,720
they're less likely to
improve predictions together,

116
00:05:23,720 --> 00:05:28,240
since they're so similar in
their correlation structure.

117
00:05:28,240 --> 00:05:31,760
So it looks like, just looking
at this top left four by four

118
00:05:31,760 --> 00:05:34,920
matrix, which is the
correlations between all

119
00:05:34,920 --> 00:05:38,260
the independent variables,
basically the least correlated

120
00:05:38,260 --> 00:05:42,530
pairs of variables are either
Rasmussen and DiffCount,

121
00:05:42,530 --> 00:05:45,480
or SurveyUSA and DiffCount.
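
Instead of reading the matrix by eye, one possible way to locate the least correlated pair of independent variables is sketched below. The data frame `toy` is simulated, so the pair it finds need not be the one from the lecture.

```r
# Hypothetical stand-in for the training set; every value is simulated.
set.seed(1)
n <- 60
strength <- rnorm(n)
toy <- data.frame(
  Rasmussen = round(10 * strength + rnorm(n, sd = 3)),
  SurveyUSA = round(10 * strength + rnorm(n, sd = 3)),
  PropR     = plogis(3 * strength + rnorm(n)),
  DiffCount = round(3 * strength + rnorm(n, sd = 2))
)

ivars <- c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")
cm <- cor(toy[ivars])

# Blank out the diagonal and upper triangle so each pair appears once.
cm[upper.tri(cm, diag = TRUE)] <- NA

# Position of the smallest absolute correlation among the remaining pairs.
idx  <- arrayInd(which.min(abs(cm)), dim(cm))
pair <- c(rownames(cm)[idx[1, 1]], colnames(cm)[idx[1, 2]])
pair
```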

122
00:05:45,480 --> 00:05:47,800
So the idea would
be to try out one

123
00:05:47,800 --> 00:05:50,420
of these pairs in our
two-variable model.

124
00:05:50,420 --> 00:05:54,690
So we'll go ahead and try
out SurveyUSA and DiffCount

125
00:05:54,690 --> 00:05:57,520
together in our second model.

126
00:05:57,520 --> 00:06:00,670
So to save ourselves
some typing,

127
00:06:00,670 --> 00:06:02,830
we can hit the up arrow a
few times until we

128
00:06:02,830 --> 00:06:05,740
get to the model
definition for model one.

129
00:06:05,740 --> 00:06:08,950
And then we can just
change the variables.

130
00:06:08,950 --> 00:06:15,420
In this case, we're now using
SurveyUSA plus DiffCount.

131
00:06:15,420 --> 00:06:18,210
We'll also need to remember to
change the name of our model

132
00:06:18,210 --> 00:06:19,560
from mod1 to mod2.

133
00:06:22,430 --> 00:06:24,160
And now, just like
before, we're going

134
00:06:24,160 --> 00:06:27,190
to want to compute
our predictions.

135
00:06:27,190 --> 00:06:33,020
So we'll say pred2 is equal
to the predict of our model 2,

136
00:06:33,020 --> 00:06:34,920
again, with type =
"response", because we

137
00:06:34,920 --> 00:06:36,260
need to get those probabilities.

138
00:06:36,260 --> 00:06:38,230
Again, we're not
passing in newdata.

139
00:06:38,230 --> 00:06:39,650
This is a training
set prediction.

140
00:06:42,180 --> 00:06:46,570
And finally, we can
use the up arrows

141
00:06:46,570 --> 00:06:49,570
to see how our second
model's predictions are doing

142
00:06:49,570 --> 00:06:53,890
at predicting the Republican
outcome in the training set.
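
Put together, the two-variable workflow could be sketched like this, assuming a simulated data frame `toy` in place of the real training set (so the mistake count will not match the video):

```r
# Hypothetical stand-in for the training set; every value is simulated.
set.seed(1)
n <- 60
strength <- rnorm(n)
toy <- data.frame(
  SurveyUSA  = round(10 * strength + rnorm(n, sd = 3)),
  DiffCount  = round(3 * strength + rnorm(n, sd = 2)),
  Republican = as.integer(strength + rnorm(n) > 0)
)

# Same call as the one-variable model, with the formula changed.
mod2 <- glm(Republican ~ SurveyUSA + DiffCount,
            data = toy, family = "binomial")

# Training-set probabilities, then the table at a 0.5 threshold.
pred2 <- predict(mod2, type = "response")
conf2 <- table(Actual = toy$Republican, Predicted = pred2 >= 0.5)
conf2
```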

143
00:06:53,890 --> 00:06:56,920
And we can see that we
made one fewer mistake.

144
00:06:56,920 --> 00:06:59,840
We made three mistakes instead
of four on the training

145
00:06:59,840 --> 00:07:02,990
set-- so a little better
than the smart baseline

146
00:07:02,990 --> 00:07:04,470
but nothing too impressive.

147
00:07:04,470 --> 00:07:06,310
And the last thing we're
going to want to do

148
00:07:06,310 --> 00:07:09,380
is to actually look at the
model and see if it makes sense.

149
00:07:09,380 --> 00:07:14,250
So we can run summary
of our model two.

150
00:07:14,250 --> 00:07:17,160
And we can see that there are
some things that are pluses.

151
00:07:17,160 --> 00:07:19,760
For instance, the AIC
has a smaller value,

152
00:07:19,760 --> 00:07:22,460
which suggests a stronger model.

153
00:07:22,460 --> 00:07:26,160
And the estimates have, again,
the sign we would expect.

154
00:07:26,160 --> 00:07:29,880
So SurveyUSA and DiffCount
both have positive coefficients

155
00:07:29,880 --> 00:07:31,780
in predicting if
the Republican wins

156
00:07:31,780 --> 00:07:33,770
the state, which makes sense.

157
00:07:33,770 --> 00:07:38,080
But a weakness of this model is
that neither of these variables

158
00:07:38,080 --> 00:07:41,790
has a significance
of a star or better,

159
00:07:41,790 --> 00:07:46,400
which means that they are less
significant statistically.
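
The comparison made here (AIC, coefficient signs, and significance) can also be pulled out of the fitted models directly. This is only a sketch on a simulated data frame `toy`; the AIC values and p-values it prints are not the lecture's.

```r
# Hypothetical stand-in for the training set; every value is simulated.
set.seed(1)
n <- 60
strength <- rnorm(n)
toy <- data.frame(
  SurveyUSA  = round(10 * strength + rnorm(n, sd = 3)),
  DiffCount  = round(3 * strength + rnorm(n, sd = 2)),
  PropR      = plogis(3 * strength + rnorm(n)),
  Republican = as.integer(strength + rnorm(n) > 0)
)
mod1 <- glm(Republican ~ PropR, data = toy, family = "binomial")
mod2 <- glm(Republican ~ SurveyUSA + DiffCount,
            data = toy, family = "binomial")

# Lower AIC suggests the better model of the two.
aics <- c(mod1 = AIC(mod1), mod2 = AIC(mod2))
aics

# Estimates (check the signs) and p-values (what the stars summarize).
ctab <- coef(summary(mod2))[, c("Estimate", "Pr(>|z|)")]
ctab
```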

160
00:07:46,400 --> 00:07:48,800
So there are definitely some
strengths and weaknesses

161
00:07:48,800 --> 00:07:51,850
between the two-variable
and the one-variable model.

162
00:07:51,850 --> 00:07:54,610
We'll go ahead and use
the two-variable model

163
00:07:54,610 --> 00:07:57,890
when we make our predictions
on the testing set.