1 00:00:04,500 --> 00:00:06,890 As usual, we will start by reading in our data 2 00:00:06,890 --> 00:00:09,350 and looking at it in the R console. 3 00:00:09,350 --> 00:00:12,250 So we can create a data frame called polling 4 00:00:12,250 --> 00:00:18,600 using the read.csv function for our PollingData.csv file. 5 00:00:18,600 --> 00:00:23,610 And we can take a look at its structure with the str command. 6 00:00:23,610 --> 00:00:26,300 And what we can see is that as expected, 7 00:00:26,300 --> 00:00:29,460 we have a state and a year variable for each observation, 8 00:00:29,460 --> 00:00:31,760 as well as some polling data and the outcome 9 00:00:31,760 --> 00:00:33,620 variable, Republican. 10 00:00:33,620 --> 00:00:35,480 So something we notice right off the bat 11 00:00:35,480 --> 00:00:38,260 is that even though there are 50 states and three election 12 00:00:38,260 --> 00:00:41,810 years, so we would expect 150 observations, 13 00:00:41,810 --> 00:00:46,440 we actually only have 145 observations in the data frame. 14 00:00:46,440 --> 00:00:48,710 So using the table function, we can 15 00:00:48,710 --> 00:00:52,650 look at the breakdown of the polling data frame's Year 16 00:00:52,650 --> 00:00:54,340 variable. 17 00:00:54,340 --> 00:00:58,470 And what we see is that while in the 2004 and 2008 elections, 18 00:00:58,470 --> 00:01:02,310 all 50 states have data reported, in 2012, only 45 19 00:01:02,310 --> 00:01:04,430 of the 50 states have data. 20 00:01:04,430 --> 00:01:05,890 And actually, what happened here is 21 00:01:05,890 --> 00:01:09,050 that pollsters were so sure about the five missing states 22 00:01:09,050 --> 00:01:11,750 that they didn't perform any polls in the months leading up 23 00:01:11,750 --> 00:01:13,740 to the 2012 election. 
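The loading and inspection steps described so far can be sketched roughly as follows. The real PollingData.csv is not reproduced here, so this sketch assumes a tiny made-up file with the same column names, written to a temporary path; only read.csv, str, and table come from the lecture.

```r
# Hypothetical stand-in for PollingData.csv: a few invented rows with the
# column names mentioned in the lecture, written to a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  State      = c("Alabama", "Alaska", "Alabama", "Alaska", "Alabama"),
  Year       = c(2004, 2004, 2008, 2008, 2012),
  Rasmussen  = c(11, NA, 21, 16, NA),
  SurveyUSA  = c(18, NA, 25, NA, NA),
  DiffCount  = c(5, 1, 5, 6, 1),
  PropR      = c(1, 1, 1, 1, 1),
  Republican = c(1, 1, 1, 1, 1)
)
write.csv(toy, csv_path, row.names = FALSE)

# Read the file into a data frame and inspect its structure.
polling <- read.csv(csv_path)
str(polling)

# Count observations per election year to spot years with missing states.
year_counts <- table(polling$Year)
print(year_counts)
```

With the real file, the same table call is what reveals that 2012 has fewer observations than 2004 and 2008.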
24 00:01:13,740 --> 00:01:16,580 So since these states are particularly easy to predict, 25 00:01:16,580 --> 00:01:19,140 we feel pretty comfortable moving forward, making 26 00:01:19,140 --> 00:01:23,390 predictions just for the 45 remaining states. 27 00:01:23,390 --> 00:01:24,810 So the second thing that we notice 28 00:01:24,810 --> 00:01:26,760 is that there are these NA values, which 29 00:01:26,760 --> 00:01:28,800 signify missing data. 30 00:01:28,800 --> 00:01:31,740 So to get a handle on just how many values are missing, 31 00:01:31,740 --> 00:01:36,370 we can use our summary function on the polling data frame. 32 00:01:36,370 --> 00:01:38,320 And what we see is that while for the majority 33 00:01:38,320 --> 00:01:41,090 of our variables, there's actually no missing data, 34 00:01:41,090 --> 00:01:43,410 we see that for the Rasmussen polling data 35 00:01:43,410 --> 00:01:46,710 and also for the SurveyUSA polling data, 36 00:01:46,710 --> 00:01:49,470 there are a decent number of missing values. 37 00:01:49,470 --> 00:01:50,930 So let's take a look at just how we 38 00:01:50,930 --> 00:01:53,759 can handle this missing data. 39 00:01:53,759 --> 00:01:55,660 There are a number of simple approaches 40 00:01:55,660 --> 00:01:58,140 to dealing with missing data. 41 00:01:58,140 --> 00:02:00,340 One would be to delete observations 42 00:02:00,340 --> 00:02:03,430 that are missing at least one variable value. 43 00:02:03,430 --> 00:02:04,960 Unfortunately, in this case, that 44 00:02:04,960 --> 00:02:07,090 would result in throwing away more than 50% 45 00:02:07,090 --> 00:02:08,690 of the observations. 46 00:02:08,690 --> 00:02:10,820 And further, we want to be able to make predictions 47 00:02:10,820 --> 00:02:13,380 for all states, not just for the ones that 48 00:02:13,380 --> 00:02:16,329 report all of their variable values. 
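To make the cost of the first approach concrete, here is a small sketch that counts missing values per variable and measures how many observations would be lost by deleting incomplete rows. The data frame is invented for illustration; colSums, is.na, and complete.cases are base R functions that the lecture does not use itself.

```r
# Made-up polling-style data with missing values in two columns.
polling <- data.frame(
  Rasmussen = c(11, NA, 21, 16, NA, -5),
  SurveyUSA = c(18, NA, 25, NA, NA, -8),
  DiffCount = c(5, 1, 5, 6, 1, -4),
  PropR     = c(1, 1, 1, 1, 1, 0)
)

# summary() reports the number of NA values for each variable.
summary(polling)

# The same counts as a plain named vector.
na_per_column <- colSums(is.na(polling))
print(na_per_column)

# Rows with no missing values, and the share of rows that deletion would drop.
complete_rows <- complete.cases(polling)
dropped_share <- mean(!complete_rows)
print(dropped_share)
```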
49 00:02:16,329 --> 00:02:19,320 Another approach would be to remove the variables that 50 00:02:19,320 --> 00:02:22,290 have missing values, in this case, the Rasmussen 51 00:02:22,290 --> 00:02:25,010 and SurveyUSA variables. 52 00:02:25,010 --> 00:02:28,070 However, we expect Rasmussen and SurveyUSA 53 00:02:28,070 --> 00:02:31,430 to be qualitatively different from aggregate variables, 54 00:02:31,430 --> 00:02:34,210 such as DiffCount and PropR, so we 55 00:02:34,210 --> 00:02:36,640 want to retain them in our data set. 56 00:02:36,640 --> 00:02:39,040 A third approach would be to fill the missing data 57 00:02:39,040 --> 00:02:41,130 points with average values. 58 00:02:41,130 --> 00:02:44,490 So for Rasmussen and SurveyUSA, the average value for a poll 59 00:02:44,490 --> 00:02:47,320 would be very close to zero across all the times 60 00:02:47,320 --> 00:02:49,380 it was reported, which is roughly a tie 61 00:02:49,380 --> 00:02:52,240 between the Democratic and Republican candidates. 62 00:02:52,240 --> 00:02:56,530 However, if PropR is very close to one or zero, 63 00:02:56,530 --> 00:02:59,440 we would expect the Rasmussen or SurveyUSA 64 00:02:59,440 --> 00:03:01,250 values that are currently missing 65 00:03:01,250 --> 00:03:05,260 to be positive or negative, respectively. 66 00:03:05,260 --> 00:03:07,240 This leads to a more complicated approach 67 00:03:07,240 --> 00:03:10,120 called multiple imputation in which we fill in the missing 68 00:03:10,120 --> 00:03:13,500 values based on the non-missing values for an observation. 69 00:03:13,500 --> 00:03:16,640 So for instance, if the Rasmussen variable is reported 70 00:03:16,640 --> 00:03:20,090 and is very negative, then the missing SurveyUSA value 71 00:03:20,090 --> 00:03:23,820 would likely be filled in as a negative value as well. 
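The weakness of filling with a single average can be seen in a small sketch. The numbers are invented, and the grouping by PropR is only a crude illustration of why the other variables carry information about the missing poll; it is not multiple imputation and not a method from the lecture.

```r
# Made-up data: Rasmussen is positive where PropR is 1 and negative where it is 0.
polls <- data.frame(
  Rasmussen = c(12, -9, NA, 10, -11, NA),
  PropR     = c(1, 0, 1, 1, 0, 0)
)

# Overall mean of the reported values: close to zero, roughly a tie.
mean_fill <- mean(polls$Rasmussen, na.rm = TRUE)

# Mean filling gives every missing poll the same near-zero value,
# regardless of what PropR says about the state.
filled <- ifelse(is.na(polls$Rasmussen), mean_fill, polls$Rasmussen)
print(filled)

# Averages within each PropR group show the sign the missing values
# would plausibly have had.
group_means <- tapply(polls$Rasmussen, polls$PropR, mean, na.rm = TRUE)
print(group_means)
```

Both missing rows receive the same fill value even though one has PropR of one and the other has PropR of zero, which is the gap that imputation based on the other variables is meant to close.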
72 00:03:23,820 --> 00:03:26,100 Just like in the sample.split function, 73 00:03:26,100 --> 00:03:28,630 multiple runs of multiple imputation 74 00:03:28,630 --> 00:03:32,240 will in general result in different missing values being 75 00:03:32,240 --> 00:03:37,430 filled in based on the random seed that is set. 76 00:03:37,430 --> 00:03:39,550 Although multiple imputation is in general 77 00:03:39,550 --> 00:03:41,930 a mathematically sophisticated approach, 78 00:03:41,930 --> 00:03:44,640 we can use it rather easily through pre-existing R 79 00:03:44,640 --> 00:03:45,840 libraries. 80 00:03:45,840 --> 00:03:47,420 We will use the Multiple Imputation 81 00:03:47,420 --> 00:03:50,980 by Chained Equations, or mice package. 82 00:03:50,980 --> 00:03:54,329 So just like we did in lecture with the ROCR package, 83 00:03:54,329 --> 00:03:56,150 we're going to install and then load 84 00:03:56,150 --> 00:03:58,910 a new package, the mice package. 85 00:03:58,910 --> 00:04:02,440 So we run install.packages, and we 86 00:04:02,440 --> 00:04:04,630 pass it mice, which is the name of the package we 87 00:04:04,630 --> 00:04:06,320 want to install. 88 00:04:06,320 --> 00:04:11,420 So you have to select a mirror near you for the installation, 89 00:04:11,420 --> 00:04:15,330 and hopefully everything will go smoothly 90 00:04:15,330 --> 00:04:17,450 and you'll get the package mice installed. 91 00:04:17,450 --> 00:04:18,850 So after it's installed, we still 92 00:04:18,850 --> 00:04:20,990 need to load it so that we can actually use it, 93 00:04:20,990 --> 00:04:23,830 so we do that with the library command. 94 00:04:23,830 --> 00:04:26,850 If you have to use it in the future, all you'll have to do 95 00:04:26,850 --> 00:04:30,050 is run library instead of installing and then running 96 00:04:30,050 --> 00:04:31,600 library. 
97 00:04:31,600 --> 00:04:35,409 So for our multiple imputation to be useful, 98 00:04:35,409 --> 00:04:37,790 we have to be able to find out the values of our missing 99 00:04:37,790 --> 00:04:42,200 variables without using the outcome of Republican. 100 00:04:42,200 --> 00:04:44,510 So, what we're going to do here is 101 00:04:44,510 --> 00:04:46,420 we're going to limit our data frame to just 102 00:04:46,420 --> 00:04:48,710 the four polling related variables 103 00:04:48,710 --> 00:04:51,970 before we actually perform multiple imputation. 104 00:04:51,970 --> 00:04:55,350 So we're going to create a new data frame called simple, 105 00:04:55,350 --> 00:04:57,840 and that's just going to be our original polling data 106 00:04:57,840 --> 00:05:08,570 frame limited to Rasmussen, SurveyUSA, PropR, 107 00:05:08,570 --> 00:05:09,250 and DiffCount. 108 00:05:14,390 --> 00:05:17,360 We can take a look at the simple data 109 00:05:17,360 --> 00:05:20,670 frame using the summary command. 110 00:05:20,670 --> 00:05:22,380 What we can see is that we haven't 111 00:05:22,380 --> 00:05:23,420 done anything fancy yet. 112 00:05:23,420 --> 00:05:25,550 We still have our missing values. 113 00:05:25,550 --> 00:05:27,170 All that's changed is now we have 114 00:05:27,170 --> 00:05:30,790 a smaller number of variables in total. 115 00:05:30,790 --> 00:05:34,950 So again, multiple imputation, if you ran it twice, 116 00:05:34,950 --> 00:05:37,150 you would get different values that were filled in. 117 00:05:37,150 --> 00:05:41,220 So, to make sure that everybody following along 118 00:05:41,220 --> 00:05:43,370 gets the same results from imputation, 119 00:05:43,370 --> 00:05:46,090 we're going to set the random seed to a value. 120 00:05:46,090 --> 00:05:47,550 It doesn't really matter what value 121 00:05:47,550 --> 00:05:52,310 we pick, so we'll just pick my favorite number, 144. 
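The preparation steps just described can be sketched as follows, assuming a small invented polling data frame in place of the real one. The column selection keeps the four polling variables and leaves out the outcome, and the short runif check only illustrates that setting the same seed reproduces the same random draws.

```r
# Made-up stand-in for the polling data frame.
polling <- data.frame(
  State      = c("Alabama", "Alaska", "Arizona", "Arkansas"),
  Year       = c(2004, 2004, 2004, 2004),
  Rasmussen  = c(11, NA, 5, 7),
  SurveyUSA  = c(18, NA, 15, NA),
  PropR      = c(1, 1, 1, 1),
  DiffCount  = c(5, 1, 8, 8),
  Republican = c(1, 1, 1, 1)
)

# Keep only the four polling-related variables; the outcome is excluded.
simple <- polling[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]

# Nothing has been imputed yet, so the NA counts are unchanged.
summary(simple)

# Fixing the seed makes later random steps repeatable.
set.seed(144)
first_draw <- runif(3)
set.seed(144)
second_draw <- runif(3)
print(identical(first_draw, second_draw))
```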
122 00:05:52,310 --> 00:05:54,110 And now we're ready to do imputation, 123 00:05:54,110 --> 00:05:55,580 which is just one line. 124 00:05:55,580 --> 00:05:59,930 So we're going to create a new data frame called imputed, 125 00:05:59,930 --> 00:06:02,820 and we're going to use the function complete, 126 00:06:02,820 --> 00:06:05,770 called on the function mice, called on simple. 127 00:06:08,400 --> 00:06:12,460 So the output here shows us that five rounds of imputation 128 00:06:12,460 --> 00:06:16,020 have been run, and now all of the variables 129 00:06:16,020 --> 00:06:17,070 have been filled in. 130 00:06:17,070 --> 00:06:18,580 So there's no more missing values, 131 00:06:18,580 --> 00:06:23,850 and we can see that using the summary function on imputed. 132 00:06:23,850 --> 00:06:26,840 So Rasmussen and SurveyUSA both have no more 133 00:06:26,840 --> 00:06:29,290 of those NA or missing values. 134 00:06:29,290 --> 00:06:32,800 So the last step in this imputation process 135 00:06:32,800 --> 00:06:36,060 is to actually copy the Rasmussen and SurveyUSA 136 00:06:36,060 --> 00:06:39,659 variables back into our original polling data frame, which 137 00:06:39,659 --> 00:06:41,830 has all the variables for the problem. 138 00:06:41,830 --> 00:06:45,050 And we can do that with two simple assignments. 139 00:06:45,050 --> 00:06:49,690 So we'll just copy over to polling Rasmussen, the value 140 00:06:49,690 --> 00:06:53,940 from the imputed data frame, and then we'll 141 00:06:53,940 --> 00:06:58,480 do the same for the SurveyUSA variable. 142 00:07:01,170 --> 00:07:04,900 And we'll use one final check using summary 143 00:07:04,900 --> 00:07:07,800 on the final polling data frame. 144 00:07:07,800 --> 00:07:10,860 And as we can see, Rasmussen and SurveyUSA 145 00:07:10,860 --> 00:07:13,160 are no longer missing values.
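Putting the whole imputation step together, a rough sketch might look like the following. It assumes the mice package is already installed and skips the imputation otherwise; the data are simulated for illustration, and printFlag = FALSE is added here only to silence the iteration log that the lecture shows.

```r
# Simulated stand-in for the polling data (not the lecture's file).
set.seed(144)
n <- 40
lean <- runif(n, min = -1, max = 1)            # hypothetical underlying lean
polling <- data.frame(
  Rasmussen  = round(20 * lean + rnorm(n, sd = 5)),
  SurveyUSA  = round(20 * lean + rnorm(n, sd = 5)),
  PropR      = pmin(pmax(0.5 + 0.5 * lean + rnorm(n, sd = 0.15), 0), 1),
  DiffCount  = round(6 * lean + rnorm(n, sd = 2)),
  Republican = as.integer(lean > 0)
)

# Knock out some poll values to mimic missing data.
polling$Rasmussen[sample(n, 8)] <- NA
polling$SurveyUSA[sample(n, 10)] <- NA

# Impute using only the four polling variables, never the outcome.
simple <- polling[c("Rasmussen", "SurveyUSA", "PropR", "DiffCount")]

if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  set.seed(144)
  imputed <- complete(mice(simple, printFlag = FALSE))

  # Copy the filled-in poll variables back into the full data frame.
  polling$Rasmussen <- imputed$Rasmussen
  polling$SurveyUSA <- imputed$SurveyUSA
}

summary(polling)
```

Because only the two poll columns are copied back, the outcome and the other variables in polling are left exactly as they were.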