1 00:00:04,500 --> 00:00:07,630 Now, we're ready to actually start building models. 2 00:00:07,630 --> 00:00:10,130 So as usual, the first thing we're going to do 3 00:00:10,130 --> 00:00:13,120 is split our data into a training and a testing set. 4 00:00:13,120 --> 00:00:14,620 And for this problem, we're actually 5 00:00:14,620 --> 00:00:18,290 going to train on data from the 2004 and 2008 elections, 6 00:00:18,290 --> 00:00:21,680 and we're going to test on data from the 2012 7 00:00:21,680 --> 00:00:23,290 presidential election. 8 00:00:23,290 --> 00:00:25,640 So to do that, we'll create a data frame 9 00:00:25,640 --> 00:00:29,380 called Train, using the subset function that breaks down 10 00:00:29,380 --> 00:00:32,259 the original polling data frame and only 11 00:00:32,259 --> 00:00:37,080 stores the observations when either the Year was 2004 12 00:00:37,080 --> 00:00:42,240 or when the Year was 2008. 13 00:00:42,240 --> 00:00:43,960 And to obtain the testing set, we're 14 00:00:43,960 --> 00:00:49,070 going to use subset to create a data frame called Test that 15 00:00:49,070 --> 00:00:51,900 saves the observations in polling where 16 00:00:51,900 --> 00:00:55,330 the year was 2012. 17 00:00:55,330 --> 00:00:58,740 So now that we've broken it down into a training and a testing 18 00:00:58,740 --> 00:01:02,290 set, we want to understand the prediction of our baseline 19 00:01:02,290 --> 00:01:04,660 model against which we want to compare 20 00:01:04,660 --> 00:01:06,870 a later logistic regression model. 21 00:01:06,870 --> 00:01:09,039 So to do that, we'll look at the breakdown 22 00:01:09,039 --> 00:01:10,990 of the dependent variable in the training 23 00:01:10,990 --> 00:01:13,100 set using the table function. 
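The split described above can be sketched in R roughly as follows. This is a minimal illustration, not the lecture's actual session: the tiny `polling` data frame here is a hypothetical stand-in with made-up values, and the outcome column name `Republican` is an assumption, since the transcript does not name it.

```r
# Hypothetical toy stand-in for the polling data frame (values are made up).
polling <- data.frame(
  Year       = c(2004, 2004, 2008, 2008, 2012, 2012),
  Rasmussen  = c(11, -9, 3, -15, 0, 6),
  Republican = c(1, 0, 1, 0, 0, 1)  # assumed name for the outcome column
)

# Train on the 2004 and 2008 elections, test on 2012.
Train <- subset(polling, Year == 2004 | Year == 2008)
Test  <- subset(polling, Year == 2012)

# Breakdown of the dependent variable in the training set.
train_breakdown <- table(Train$Republican)
train_breakdown
```

Splitting by year rather than at random mirrors the real use case: the model is built on past elections and evaluated on a later one.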
24 00:01:17,380 --> 00:01:21,820 What we can see here is that in 47 of the 100 training 25 00:01:21,820 --> 00:01:24,320 observations, the Democrat won the state, 26 00:01:24,320 --> 00:01:27,830 and in 53 of the observations, the Republican won the state. 27 00:01:27,830 --> 00:01:30,570 So our simple baseline model is always 28 00:01:30,570 --> 00:01:33,080 going to predict the more common outcome, which 29 00:01:33,080 --> 00:01:35,690 is that the Republican is going to win the state. 30 00:01:35,690 --> 00:01:37,490 And we see that the simple baseline model 31 00:01:37,490 --> 00:01:42,410 will have an accuracy of 53% on the training set. 32 00:01:42,410 --> 00:01:46,070 Now, unfortunately, this is a pretty weak model. 33 00:01:46,070 --> 00:01:49,450 It always predicts Republican, even for a landslide 34 00:01:49,450 --> 00:01:52,300 Democratic state, where the Democrat was polling 35 00:01:52,300 --> 00:01:55,610 15% or 20% ahead of the Republican. 36 00:01:55,610 --> 00:01:59,240 So nobody would really consider this to be a credible model. 37 00:01:59,240 --> 00:02:03,490 So we need to think of a smarter baseline model against which 38 00:02:03,490 --> 00:02:06,580 we can compare the logistic regression models that we're 39 00:02:06,580 --> 00:02:08,070 going to develop later. 40 00:02:08,070 --> 00:02:10,860 So a reasonable smart baseline would 41 00:02:10,860 --> 00:02:13,660 be to just take one of the polls-- in our case, 42 00:02:13,660 --> 00:02:16,100 we'll take Rasmussen-- and make a prediction 43 00:02:16,100 --> 00:02:18,920 based on who the poll said was winning in the state. 44 00:02:18,920 --> 00:02:22,090 So for instance, if the Republican is polling ahead, 45 00:02:22,090 --> 00:02:25,010 the Rasmussen smart baseline would just 46 00:02:25,010 --> 00:02:27,040 pick the Republican to be the winner. 47 00:02:27,040 --> 00:02:29,880 If the Democrat was ahead, it would pick the Democrat.
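The 53% figure for the simple baseline follows directly from the two counts quoted above. A small sketch of that arithmetic in R, where the variable names are illustrative and only the counts 47 and 53 come from the lecture:

```r
# Outcome counts reported for the 100 training observations.
outcome_counts <- c(Democrat = 47, Republican = 53)

# The simple baseline always predicts the most common outcome.
baseline_prediction <- names(which.max(outcome_counts))

# Its training accuracy is the share of observations with that outcome.
baseline_accuracy <- max(outcome_counts) / sum(outcome_counts)

baseline_prediction  # "Republican"
baseline_accuracy    # 0.53
```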
48 00:02:29,880 --> 00:02:31,970 And if they were tied, the model would not 49 00:02:31,970 --> 00:02:34,490 know which one to select. 50 00:02:34,490 --> 00:02:36,860 So to compute this smart baseline, 51 00:02:36,860 --> 00:02:39,260 we're going to use a new function called the sign 52 00:02:39,260 --> 00:02:40,500 function. 53 00:02:40,500 --> 00:02:42,650 And what this function does is, if it's 54 00:02:42,650 --> 00:02:45,650 passed a positive number, it returns the value 1. 55 00:02:45,650 --> 00:02:49,270 If it's passed a negative number, it returns negative 1. 56 00:02:49,270 --> 00:02:52,079 And if it's passed 0, it returns 0. 57 00:02:52,079 --> 00:02:56,810 So if we passed the Rasmussen variable into sign, 58 00:02:56,810 --> 00:02:59,600 whenever the Republican was winning the state, meaning 59 00:02:59,600 --> 00:03:02,840 Rasmussen is positive, it's going to return a 1. 60 00:03:02,840 --> 00:03:05,380 So for instance, if the value 20 is 61 00:03:05,380 --> 00:03:08,440 passed, meaning the Republican is polling 20 ahead, 62 00:03:08,440 --> 00:03:09,470 it returns 1. 63 00:03:09,470 --> 00:03:13,510 So 1 signifies that the Republican is predicted to win. 64 00:03:13,510 --> 00:03:15,950 If the Democrat is leading in the Rasmussen poll, 65 00:03:15,950 --> 00:03:18,170 it'll take on a negative value. 66 00:03:18,170 --> 00:03:22,150 So if we took for instance the sign of -10, we get -1. 67 00:03:22,150 --> 00:03:25,260 So -1 means this smart baseline is 68 00:03:25,260 --> 00:03:28,220 predicting that the Democrat won the state. 69 00:03:28,220 --> 00:03:30,490 And finally, if we took the sign of 0, 70 00:03:30,490 --> 00:03:33,270 meaning that the Rasmussen poll had a tie, 71 00:03:33,270 --> 00:03:35,320 it returns 0, saying that the model is 72 00:03:35,320 --> 00:03:39,140 inconclusive about who's going to win the state. 
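The three cases just described can be checked directly with R's built-in `sign` function. The inputs 20, -10, and 0 are the ones used in the lecture; the variable name `examples` is only for illustration.

```r
sign(20)   # 1:  Republican polling ahead, so the Republican is predicted to win
sign(-10)  # -1: Democrat polling ahead, so the Democrat is predicted to win
sign(0)    # 0:  tied poll, so the smart baseline is inconclusive

# sign is vectorized, so it can be applied to a whole column at once.
examples <- sign(c(20, -10, 0))
examples   # 1 -1 0
```

Because `sign` works on entire vectors, passing it a poll column yields one prediction per observation without any loop.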
73 00:03:39,140 --> 00:03:41,930 So now, we're ready to actually compute 74 00:03:41,930 --> 00:03:45,520 this prediction for all of our training set. 75 00:03:45,520 --> 00:03:47,280 And we can take a look at the breakdown 76 00:03:47,280 --> 00:03:50,190 of that using the table function applied 77 00:03:50,190 --> 00:03:56,829 to the sign of the training set's Rasmussen variable. 78 00:03:56,829 --> 00:04:00,340 And what we can see is that in 56 of the 100 training set 79 00:04:00,340 --> 00:04:03,500 observations, the smart baseline predicted 80 00:04:03,500 --> 00:04:05,740 that the Republican was going to win. 81 00:04:05,740 --> 00:04:08,750 In 42 instances, it predicted the Democrat. 82 00:04:08,750 --> 00:04:11,640 And in two instances, it was inconclusive. 83 00:04:11,640 --> 00:04:15,100 So what we really want to do is to see the breakdown of how 84 00:04:15,100 --> 00:04:19,290 the smart baseline model does, compared to the actual result 85 00:04:19,290 --> 00:04:21,390 -- who actually won the state. 86 00:04:21,390 --> 00:04:23,760 So we want to again use the table function, 87 00:04:23,760 --> 00:04:27,240 but this time, we want to compare the training set's 88 00:04:27,240 --> 00:04:32,650 outcome against the sign of the polling data. 89 00:04:36,180 --> 00:04:39,590 So in this table, the rows are the true outcome -- 90 00:04:39,590 --> 00:04:42,320 1 is for Republican, 0 is for Democrat -- 91 00:04:42,320 --> 00:04:46,190 and the columns are the smart baseline predictions, -1, 0, 92 00:04:46,190 --> 00:04:47,280 or 1. 93 00:04:47,280 --> 00:04:51,130 What we can see is in the top left corner over here, 94 00:04:51,130 --> 00:04:55,990 we have 42 observations where the Rasmussen smart baseline 95 00:04:55,990 --> 00:04:57,580 predicted the Democrat would win, 96 00:04:57,580 --> 00:04:59,840 and the Democrat actually did win. 
97 00:04:59,840 --> 00:05:03,380 There were 52 observations where the smart baseline predicted 98 00:05:03,380 --> 00:05:05,940 the Republican would win, and the Republican actually 99 00:05:05,940 --> 00:05:07,070 did win. 100 00:05:07,070 --> 00:05:10,460 Again, there were those two inconclusive observations. 101 00:05:10,460 --> 00:05:12,200 And finally, there were four mistakes. 102 00:05:12,200 --> 00:05:15,740 There were four times when the smart baseline model predicted 103 00:05:15,740 --> 00:05:18,480 that the Republican would win, but actually the Democrat 104 00:05:18,480 --> 00:05:19,760 won the state. 105 00:05:19,760 --> 00:05:21,770 So as we can see, this model, with four mistakes 106 00:05:21,770 --> 00:05:25,000 and two inconclusive results out of the 100 training 107 00:05:25,000 --> 00:05:29,060 set observations, is doing much, much better than the naive 108 00:05:29,060 --> 00:05:31,560 baseline, which always predicted 109 00:05:31,560 --> 00:05:33,040 the Republican would win and made 110 00:05:33,040 --> 00:05:35,780 47 mistakes on the same data. 111 00:05:35,780 --> 00:05:39,300 So we see that this is a much more reasonable baseline model 112 00:05:39,300 --> 00:05:42,320 to carry forward, against which we can compare 113 00:05:42,320 --> 00:05:45,150 our logistic regression-based approach.
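The counts quoted in this walkthrough can be laid out as a table and checked for consistency. The sketch below rebuilds the table by hand rather than from the real data. The cells 42, 52, and 4 are stated in the lecture. The split of the two inconclusive cases (one in each row) and the zero in the remaining cell are inferred from the 47/53 outcome totals and the 42/2/56 prediction totals, so treat them as a derived assumption. With the actual data, the same table would come from something like `table(Train$Republican, sign(Train$Rasmussen))`, where `Republican` is an assumed name for the outcome column.

```r
# Rows: actual outcome (0 = Democrat, 1 = Republican).
# Columns: smart baseline prediction (-1 = Democrat, 0 = tie, 1 = Republican).
confusion <- matrix(
  c(42, 1,  4,
     0, 1, 52),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("-1", "0", "1"))
)

rowSums(confusion)  # actual outcomes: 47 Democrat, 53 Republican
colSums(confusion)  # predictions: 42 Democrat, 2 inconclusive, 56 Republican

# Correct calls sit where the prediction matches the actual outcome.
correct      <- confusion["0", "-1"] + confusion["1", "1"]
mistakes     <- confusion["0", "1"] + confusion["1", "-1"]
inconclusive <- sum(confusion[, "0"])

# The naive baseline always predicts Republican, so it misses every Democrat.
naive_mistakes <- sum(confusion["0", ])

c(correct = correct, mistakes = mistakes,
  inconclusive = inconclusive, naive_mistakes = naive_mistakes)
```

The margins reproduce every total mentioned earlier, which is a quick way to confirm that the individual cells were read correctly.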