1 00:00:04,600 --> 00:00:08,210 Now, as we start to think about building regression models 2 00:00:08,210 --> 00:00:11,080 with this data set, we need to consider the possibility 3 00:00:11,080 --> 00:00:13,880 that there is multicollinearity within 4 00:00:13,880 --> 00:00:15,080 the independent variables. 5 00:00:15,080 --> 00:00:16,580 And there's a good reason to suspect 6 00:00:16,580 --> 00:00:19,060 that there would be multicollinearity amongst 7 00:00:19,060 --> 00:00:20,790 the variables, because in some sense, 8 00:00:20,790 --> 00:00:23,030 they're all measuring the same thing, which 9 00:00:23,030 --> 00:00:26,620 is how strong the Republican candidate is performing 10 00:00:26,620 --> 00:00:28,620 in the particular state. 11 00:00:28,620 --> 00:00:32,009 So while normally, we would run the correlation function 12 00:00:32,009 --> 00:00:35,380 on the training set, in this case, it doesn't work. 13 00:00:35,380 --> 00:00:38,060 It says, x must be numeric. 14 00:00:38,060 --> 00:00:41,750 And if we go back and look at the structure of the training 15 00:00:41,750 --> 00:00:44,850 set, it jumps out why we're getting this issue. 16 00:00:44,850 --> 00:00:46,230 It's because we're trying to take 17 00:00:46,230 --> 00:00:48,150 the correlations of the names of states, 18 00:00:48,150 --> 00:00:50,040 which doesn't make any sense. 19 00:00:50,040 --> 00:00:51,950 So to compute the correlation, we're 20 00:00:51,950 --> 00:00:54,580 going to want to take the correlation amongst just 21 00:00:54,580 --> 00:00:56,080 the independent variables that we're 22 00:00:56,080 --> 00:00:58,820 going to be using to predict, and we can also 23 00:00:58,820 --> 00:01:02,400 add in the dependent variable to this correlation matrix. 24 00:01:02,400 --> 00:01:05,550 So I'll take cor of the training set 25 00:01:05,550 --> 00:01:09,830 but just limit it to the independent variables-- 26 00:01:09,830 --> 00:01:17,640 Rasmussen, SurveyUSA, PropR, and DiffCount. 27 00:01:17,640 --> 00:01:20,940 And then also, we'll add in the dependent variable, Republican. 28 00:01:26,450 --> 00:01:27,320 So there we go. 29 00:01:27,320 --> 00:01:30,260 We're seeing a lot of big values here. 30 00:01:30,260 --> 00:01:33,680 For instance, SurveyUSA and Rasmussen 31 00:01:33,680 --> 00:01:37,460 are independent variables that have a correlation of 0.94, 32 00:01:37,460 --> 00:01:39,580 which is very, very large and something 33 00:01:39,580 --> 00:01:40,870 that would be concerning. 34 00:01:40,870 --> 00:01:42,440 It means that probably combining them 35 00:01:42,440 --> 00:01:45,800 together isn't going to do much to produce a working regression 36 00:01:45,800 --> 00:01:47,670 model. 37 00:01:47,670 --> 00:01:50,400 So let's first consider the case where 38 00:01:50,400 --> 00:01:54,170 we want to build a logistic regression model with just one 39 00:01:54,170 --> 00:01:55,250 variable. 40 00:01:55,250 --> 00:01:57,580 So in this case, it stands to reason 41 00:01:57,580 --> 00:01:59,330 that the variable we'd want to add 42 00:01:59,330 --> 00:02:00,830 would be the one that is most highly 43 00:02:00,830 --> 00:02:04,410 correlated with the outcome, Republican. 44 00:02:04,410 --> 00:02:06,270 So if we read the bottom row, which 45 00:02:06,270 --> 00:02:08,830 is the correlation of each variable to Republican, 46 00:02:08,830 --> 00:02:12,220 we see that PropR is probably the best candidate 47 00:02:12,220 --> 00:02:14,620 to include in our single-variable model, 48 00:02:14,620 --> 00:02:16,490 because it's so highly correlated, 49 00:02:16,490 --> 00:02:18,840 meaning it's going to do a good job of predicting 50 00:02:18,840 --> 00:02:21,500 the Republican status. 51 00:02:21,500 --> 00:02:23,680 So let's build a model. 52 00:02:23,680 --> 00:02:26,290 We can call it mod1. 53 00:02:26,290 --> 00:02:31,190 So we'll call the glm function, predicting Republican, 54 00:02:31,190 --> 00:02:34,880 using PropR alone. 55 00:02:34,880 --> 00:02:36,940 As always, we'll pass along the data 56 00:02:36,940 --> 00:02:39,300 to train with as our training set. 57 00:02:39,300 --> 00:02:41,720 And because we have logistic regression, 58 00:02:41,720 --> 00:02:43,180 we need family = "binomial". 59 00:02:46,670 --> 00:02:51,170 And we can take a look at this model using the summary 60 00:02:51,170 --> 00:02:52,490 function. 61 00:02:52,490 --> 00:02:54,820 And we can see that it looks pretty 62 00:02:54,820 --> 00:02:56,910 nice in terms of its significance 63 00:02:56,910 --> 00:02:59,920 and the sign of the coefficients. 64 00:02:59,920 --> 00:03:02,500 We have a lot of stars over here. 65 00:03:02,500 --> 00:03:05,380 PropR is the proportion of the polls 66 00:03:05,380 --> 00:03:06,700 that said the Republican won. 67 00:03:06,700 --> 00:03:10,370 We see that that has a very high coefficient in terms 68 00:03:10,370 --> 00:03:12,120 of predicting that the Republican will win 69 00:03:12,120 --> 00:03:14,850 in the state, which makes a lot of sense. 70 00:03:14,850 --> 00:03:16,930 And we'll note down that the AIC measuring 71 00:03:16,930 --> 00:03:20,230 the strength of the model is 19.8. 72 00:03:20,230 --> 00:03:22,160 So this seems like a very reasonable model. 73 00:03:22,160 --> 00:03:25,030 Let's see how it does in terms of actually predicting 74 00:03:25,030 --> 00:03:28,440 the Republican outcome on the training set. 75 00:03:28,440 --> 00:03:30,890 So first, we want to compute the predictions, 76 00:03:30,890 --> 00:03:33,130 the predicted probabilities that the Republican 77 00:03:33,130 --> 00:03:35,380 is going to win on the training set. 78 00:03:35,380 --> 00:03:41,210 So we'll create a vector called pred1, prediction one, 79 00:03:41,210 --> 00:03:43,630 then we'll call the predict function. 80 00:03:43,630 --> 00:03:46,300 We'll pass it our model one. 81 00:03:46,300 --> 00:03:48,860 And we're not going to pass it newdata, 82 00:03:48,860 --> 00:03:50,410 because we're just making predictions 83 00:03:50,410 --> 00:03:51,760 on the training set right now. 84 00:03:51,760 --> 00:03:53,810 We're not looking at test set predictions. 85 00:03:53,810 --> 00:03:59,300 But we do need to pass it type = "response" to get probabilities 86 00:03:59,300 --> 00:04:01,470 out as the predictions. 87 00:04:01,470 --> 00:04:03,310 And now, we want to see how well it's doing. 88 00:04:03,310 --> 00:04:05,650 So if we used a threshold of 0.5, 89 00:04:05,650 --> 00:04:07,980 where we said if the probability is at least 1/2, 90 00:04:07,980 --> 00:04:09,790 we're going to predict Republican, 91 00:04:09,790 --> 00:04:11,860 otherwise, we'll predict Democrat. 92 00:04:11,860 --> 00:04:14,600 Let's see how that would do on the training set. 93 00:04:14,600 --> 00:04:17,269 So we'll want to use the table function 94 00:04:17,269 --> 00:04:21,010 and look at the training set Republican value 95 00:04:21,010 --> 00:04:25,120 against the logical of whether pred1 96 00:04:25,120 --> 00:04:29,050 is greater than or equal to 0.5. 97 00:04:29,050 --> 00:04:33,260 So here, the rows, as usual, are the outcome -- 1 is Republican, 98 00:04:33,260 --> 00:04:34,950 0 is Democrat. 99 00:04:34,950 --> 00:04:37,730 And the columns-- TRUE means that we predicted 100 00:04:37,730 --> 00:04:40,580 Republican, FALSE means we predicted Democrat. 101 00:04:40,580 --> 00:04:42,870 So we see that on the training set, 102 00:04:42,870 --> 00:04:45,320 this model with one variable as a prediction 103 00:04:45,320 --> 00:04:48,550 makes four mistakes, which is just 104 00:04:48,550 --> 00:04:52,280 about the same as our smart baseline model. 105 00:04:52,280 --> 00:04:55,440 So now, let's see if we can improve on this performance 106 00:04:55,440 --> 00:04:57,760 by adding in another variable. 107 00:04:57,760 --> 00:05:01,890 So if we go back up to our correlations here, 108 00:05:01,890 --> 00:05:03,640 we're going to be searching, since there's 109 00:05:03,640 --> 00:05:06,250 so much multicollinearity, we might be searching 110 00:05:06,250 --> 00:05:10,020 for a pair of variables that has a relatively lower correlation 111 00:05:10,020 --> 00:05:13,900 with each other, because they might kind of work together 112 00:05:13,900 --> 00:05:15,970 to improve the prediction overall 113 00:05:15,970 --> 00:05:17,250 of the Republican outcome. 114 00:05:17,250 --> 00:05:20,020 If two variables are highly, highly correlated, 115 00:05:20,020 --> 00:05:23,720 they're less likely to improve predictions together, 116 00:05:23,720 --> 00:05:28,240 since they're so similar in their correlation structure. 117 00:05:28,240 --> 00:05:31,760 So it looks like, just looking at this top left four by four 118 00:05:31,760 --> 00:05:34,920 matrix, which is the correlations between all 119 00:05:34,920 --> 00:05:38,260 the independent variables, basically the least correlated 120 00:05:38,260 --> 00:05:42,530 pairs of variables are either Rasmussen and DiffCount, 121 00:05:42,530 --> 00:05:45,480 or SurveyUSA and DiffCount. 122 00:05:45,480 --> 00:05:47,800 So the idea would be to try out one 123 00:05:47,800 --> 00:05:50,420 of these pairs in our two-variable model. 124 00:05:50,420 --> 00:05:54,690 So we'll go ahead and try out SurveyUSA and DiffCount 125 00:05:54,690 --> 00:05:57,520 together in our second model. 126 00:05:57,520 --> 00:06:00,670 So to save ourselves some typing, 127 00:06:00,670 --> 00:06:02,830 we can hit up a few times until we 128 00:06:02,830 --> 00:06:05,740 get to the model definition for model one. 129 00:06:05,740 --> 00:06:08,950 And then we can just change the variables. 130 00:06:08,950 --> 00:06:15,420 In this case, we're now using SurveyUSA plus DiffCount. 131 00:06:15,420 --> 00:06:18,210 We'll also need to remember to change the name of our model 132 00:06:18,210 --> 00:06:19,560 from mod1 to mod2. 133 00:06:22,430 --> 00:06:24,160 And now, just like before, we're going 134 00:06:24,160 --> 00:06:27,190 to want to compute out our predictions. 135 00:06:27,190 --> 00:06:33,020 So we'll say pred2 is equal to the predict of our model 2, 136 00:06:33,020 --> 00:06:34,920 again, with type = "response", because we 137 00:06:34,920 --> 00:06:36,260 need to get those probabilities. 138 00:06:36,260 --> 00:06:38,230 Again, we're not passing in newdata. 139 00:06:38,230 --> 00:06:39,650 This is a training set prediction. 140 00:06:42,180 --> 00:06:46,570 And finally, we can use the up arrows 141 00:06:46,570 --> 00:06:49,570 to see how our second model's predictions are doing 142 00:06:49,570 --> 00:06:53,890 at predicting the Republican outcome in the training set. 143 00:06:53,890 --> 00:06:56,920 And we can see that we made one less mistake. 144 00:06:56,920 --> 00:06:59,840 We made three mistakes instead of four on the training 145 00:06:59,840 --> 00:07:02,990 set-- so a little better than the smart baseline 146 00:07:02,990 --> 00:07:04,470 but nothing too impressive. 147 00:07:04,470 --> 00:07:06,310 And the last thing we're going to want to do 148 00:07:06,310 --> 00:07:09,380 is to actually look at the model and see if it makes sense. 149 00:07:09,380 --> 00:07:14,250 So we can run summary of our model two. 150 00:07:14,250 --> 00:07:17,160 And we can see that there are some things that are pluses. 151 00:07:17,160 --> 00:07:19,760 For instance, the AIC has a smaller value, 152 00:07:19,760 --> 00:07:22,460 which suggests a stronger model. 153 00:07:22,460 --> 00:07:26,160 And the estimates have, again, the sign we would expect. 154 00:07:26,160 --> 00:07:29,880 So SurveyUSA and DiffCount both have positive coefficients 155 00:07:29,880 --> 00:07:31,780 in predicting if the Republican wins 156 00:07:31,780 --> 00:07:33,770 the state, which makes sense. 157 00:07:33,770 --> 00:07:38,080 But a weakness of this model is that neither of these variables 158 00:07:38,080 --> 00:07:41,790 has a significance of a star or better, 159 00:07:41,790 --> 00:07:46,400 which means that they are less significant statistically. 160 00:07:46,400 --> 00:07:48,800 So there are definitely some strengths and weaknesses 161 00:07:48,800 --> 00:07:51,850 between the two-variable and the one-variable model. 162 00:07:51,850 --> 00:07:54,610 We'll go ahead and use the two-variable model 163 00:07:54,610 --> 00:07:57,890 when we make our predictions on the testing set.