1
00:00:04,500 --> 00:00:06,890
As usual, we will start
by reading in our data

2
00:00:06,890 --> 00:00:09,350
and looking at it
in the R console.

3
00:00:09,350 --> 00:00:12,250
So we can create a data
frame called polling

4
00:00:12,250 --> 00:00:18,600
using the read.csv function
for our PollingData.csv file.

5
00:00:18,600 --> 00:00:23,610
And we can take a look at its
structure with the str command.

6
00:00:23,610 --> 00:00:26,300
And what we can see
is that as expected,

7
00:00:26,300 --> 00:00:29,460
we have a state and a year
variable for each observation,

8
00:00:29,460 --> 00:00:31,760
as well as some polling
data and the outcome

9
00:00:31,760 --> 00:00:33,620
variable, Republican.

10
00:00:33,620 --> 00:00:35,480
So something we notice
right off the bat

11
00:00:35,480 --> 00:00:38,260
is that even though there are
50 states and three election

12
00:00:38,260 --> 00:00:41,810
years, so we would
expect 150 observations,

13
00:00:41,810 --> 00:00:46,440
we actually only have 145
observations in the data frame.

14
00:00:46,440 --> 00:00:48,710
So using the table
function, we can

15
00:00:48,710 --> 00:00:52,650
look at the breakdown of the
polling data frame's Year

16
00:00:52,650 --> 00:00:54,340
variable.

17
00:00:54,340 --> 00:00:58,470
And what we see is that while
in the 2004 and 2008 elections,

18
00:00:58,470 --> 00:01:02,310
all 50 states have data
reported, in 2012, only 45

19
00:01:02,310 --> 00:01:04,430
of the 50 states have data.

20
00:01:04,430 --> 00:01:05,890
And actually, what
happened here is

21
00:01:05,890 --> 00:01:09,050
that pollsters were so sure
about the five missing states

22
00:01:09,050 --> 00:01:11,750
that they didn't perform any
polls in the months leading up

23
00:01:11,750 --> 00:01:13,740
to the 2012 election.

24
00:01:13,740 --> 00:01:16,580
So since these states are
particularly easy to predict,

25
00:01:16,580 --> 00:01:19,140
we feel pretty comfortable
moving forward, making

26
00:01:19,140 --> 00:01:23,390
predictions just for
the 45 remaining states.

27
00:01:23,390 --> 00:01:24,810
So the second thing
that we notice

28
00:01:24,810 --> 00:01:26,760
is that there are
these NA values, which

29
00:01:26,760 --> 00:01:28,800
signify missing data.

30
00:01:28,800 --> 00:01:31,740
So to get a handle on just
how many values are missing,

31
00:01:31,740 --> 00:01:36,370
we can use our summary function
on the polling data frame.

32
00:01:36,370 --> 00:01:38,320
And what we see is that
while for the majority

33
00:01:38,320 --> 00:01:41,090
of our variables, there's
actually no missing data,

34
00:01:41,090 --> 00:01:43,410
we see that for the
Rasmussen polling data

35
00:01:43,410 --> 00:01:46,710
and also for the
SurveyUSA polling data,

36
00:01:46,710 --> 00:01:49,470
there are a decent
number of missing values.

37
00:01:49,470 --> 00:01:50,930
So let's take a
look at just how we

38
00:01:50,930 --> 00:01:53,759
can handle this missing data.

39
00:01:53,759 --> 00:01:55,660
There are a number
of simple approaches

40
00:01:55,660 --> 00:01:58,140
to dealing with missing data.

41
00:01:58,140 --> 00:02:00,340
One would be to
delete observations

42
00:02:00,340 --> 00:02:03,430
that are missing at
least one variable value.

43
00:02:03,430 --> 00:02:04,960
Unfortunately, in
this case, that

44
00:02:04,960 --> 00:02:07,090
would result in throwing
away more than 50%

45
00:02:07,090 --> 00:02:08,690
of the observations.

46
00:02:08,690 --> 00:02:10,820
And further, we want to be
able to make predictions

47
00:02:10,820 --> 00:02:13,380
for all states, not
just for the ones that

48
00:02:13,380 --> 00:02:16,329
report all of their
variable values.

49
00:02:16,329 --> 00:02:19,320
Another observation would be
to remove the variables that

50
00:02:19,320 --> 00:02:22,290
have missing values, in
this case, the Rasmussen

51
00:02:22,290 --> 00:02:25,010
and SurveyUSA variables.

52
00:02:25,010 --> 00:02:28,070
However, we expect
Rasmussen and SurveyUSA

53
00:02:28,070 --> 00:02:31,430
to be qualitatively different
from aggregate variables,

54
00:02:31,430 --> 00:02:34,210
such as DiffCount
and PropR, so we

55
00:02:34,210 --> 00:02:36,640
want to retain them
in our data set.

56
00:02:36,640 --> 00:02:39,040
A third approach would be
to fill the missing data

57
00:02:39,040 --> 00:02:41,130
points with average values.

58
00:02:41,130 --> 00:02:44,490
So for Rasmussen and SurveyUSA,
the average value for a poll

59
00:02:44,490 --> 00:02:47,320
would be very close to
zero across all the times

60
00:02:47,320 --> 00:02:49,380
with it reported,
which is roughly a tie

61
00:02:49,380 --> 00:02:52,240
between the Democrat and
Republican candidate.

62
00:02:52,240 --> 00:02:56,530
However, if PropR is very
close to one or zero,

63
00:02:56,530 --> 00:02:59,440
we would expect the
Rasmussen or SurveyUSA

64
00:02:59,440 --> 00:03:01,250
values that are
currently missing

65
00:03:01,250 --> 00:03:05,260
to be positive or
negative, respectively.

66
00:03:05,260 --> 00:03:07,240
This leads to a more
complicated approach

67
00:03:07,240 --> 00:03:10,120
called multiple imputation in
which we fill in the missing

68
00:03:10,120 --> 00:03:13,500
values based on the non-missing
values for an observation.

69
00:03:13,500 --> 00:03:16,640
So for instance, if the
Rasmussen variable is reported

70
00:03:16,640 --> 00:03:20,090
and is very negative, then
the missing SurveyUSA value

71
00:03:20,090 --> 00:03:23,820
would likely be filled in
as a negative value as well.

72
00:03:23,820 --> 00:03:26,100
Just like in the
sample.split function,

73
00:03:26,100 --> 00:03:28,630
multiple runs of
multiple imputation

74
00:03:28,630 --> 00:03:32,240
will in general result in
different missing values being

75
00:03:32,240 --> 00:03:37,430
filled in based on the
random seed that is set.

76
00:03:37,430 --> 00:03:39,550
Although multiple
imputation is in general

77
00:03:39,550 --> 00:03:41,930
a mathematically
sophisticated approach,

78
00:03:41,930 --> 00:03:44,640
we can use it rather easily
through pre-existing R

79
00:03:44,640 --> 00:03:45,840
libraries.

80
00:03:45,840 --> 00:03:47,420
We will use the
Multiple Imputation

81
00:03:47,420 --> 00:03:50,980
by Chained Equations,
or mice package.

82
00:03:50,980 --> 00:03:54,329
So just like we did in
lecture with the ROCR package,

83
00:03:54,329 --> 00:03:56,150
we're going to
install and then load

84
00:03:56,150 --> 00:03:58,910
a new package, the mice package.

85
00:03:58,910 --> 00:04:02,440
So we run
install.packages, and we

86
00:04:02,440 --> 00:04:04,630
pass it mice, which is
the name of the package we

87
00:04:04,630 --> 00:04:06,320
want to install.

88
00:04:06,320 --> 00:04:11,420
So you have to select a mirror
near you for the installation,

89
00:04:11,420 --> 00:04:15,330
and hopefully everything
will go smoothly

90
00:04:15,330 --> 00:04:17,450
and you'll get the
package mice installed.

91
00:04:17,450 --> 00:04:18,850
So after it's
installed, we still

92
00:04:18,850 --> 00:04:20,990
need to load it so that
we can actually use it,

93
00:04:20,990 --> 00:04:23,830
so we do that with
the library command.

94
00:04:23,830 --> 00:04:26,850
If you have to use it in the
future, all you'll have to do

95
00:04:26,850 --> 00:04:30,050
is run library instead of
installing and then running

96
00:04:30,050 --> 00:04:31,600
library.

97
00:04:31,600 --> 00:04:35,409
So for our multiple
imputation to be useful,

98
00:04:35,409 --> 00:04:37,790
we have to be able to find
out the values of our missing

99
00:04:37,790 --> 00:04:42,200
variables without using
the outcome of Republican.

100
00:04:42,200 --> 00:04:44,510
So, what we're
going to do here is

101
00:04:44,510 --> 00:04:46,420
we're going to limit
our data frame to just

102
00:04:46,420 --> 00:04:48,710
the four polling
related variables

103
00:04:48,710 --> 00:04:51,970
before we actually perform
multiple imputation.

104
00:04:51,970 --> 00:04:55,350
So we're going to create a
new data frame called simple,

105
00:04:55,350 --> 00:04:57,840
and that's just going to be
our original polling data

106
00:04:57,840 --> 00:05:08,570
frame limited to Rasmussen,
SurveyUSA, PropR,

107
00:05:08,570 --> 00:05:09,250
and DiffCount.

108
00:05:14,390 --> 00:05:17,360
We can take a look
at the simple data

109
00:05:17,360 --> 00:05:20,670
frame using the summary command.

110
00:05:20,670 --> 00:05:22,380
What we can see
is that we haven't

111
00:05:22,380 --> 00:05:23,420
done anything fancy yet.

112
00:05:23,420 --> 00:05:25,550
We still have our
missing values.

113
00:05:25,550 --> 00:05:27,170
All that's changed
is now we have

114
00:05:27,170 --> 00:05:30,790
a smaller number of
variables in total.

115
00:05:30,790 --> 00:05:34,950
So again, multiple imputation,
if you ran it twice,

116
00:05:34,950 --> 00:05:37,150
you would get different
values that were filled in.

117
00:05:37,150 --> 00:05:41,220
So, to make sure that
everybody following along

118
00:05:41,220 --> 00:05:43,370
gets the same results
from imputation,

119
00:05:43,370 --> 00:05:46,090
we're going to set the
random seed to a value.

120
00:05:46,090 --> 00:05:47,550
It doesn't really
matter what value

121
00:05:47,550 --> 00:05:52,310
we pick, so we'll just pick
my favorite number, 144.

122
00:05:52,310 --> 00:05:54,110
And now we're ready
to do imputation,

123
00:05:54,110 --> 00:05:55,580
which is just one line.

124
00:05:55,580 --> 00:05:59,930
So we're going to create a
new data frame called imputed,

125
00:05:59,930 --> 00:06:02,820
and we're going to use
the function complete,

126
00:06:02,820 --> 00:06:05,770
called on the function
mice, called on simple.

127
00:06:08,400 --> 00:06:12,460
So the output here shows us
that five rounds of imputation

128
00:06:12,460 --> 00:06:16,020
have been run, and now
all of the variables

129
00:06:16,020 --> 00:06:17,070
have been filled in.

130
00:06:17,070 --> 00:06:18,580
So there's no more
missing values,

131
00:06:18,580 --> 00:06:23,850
and we can see that using the
summary function on imputed.

132
00:06:23,850 --> 00:06:26,840
So Rasmussen and SurveyUSA
both have no more

133
00:06:26,840 --> 00:06:29,290
of those NA or missing values.

134
00:06:29,290 --> 00:06:32,800
So the last step in
this imputation process

135
00:06:32,800 --> 00:06:36,060
is to actually copy the
Rasmussen and SurveyUSA

136
00:06:36,060 --> 00:06:39,659
variables back into our original
polling data frame, which

137
00:06:39,659 --> 00:06:41,830
has all the variables
for the problem.

138
00:06:41,830 --> 00:06:45,050
And we can do that with
two simple assignments.

139
00:06:45,050 --> 00:06:49,690
So we'll just copy over to
polling Rasmussen, the value

140
00:06:49,690 --> 00:06:53,940
from the imputed data
frame, and then we'll

141
00:06:53,940 --> 00:06:58,480
do the same for the
SurveyUSA variable.

142
00:07:01,170 --> 00:07:04,900
And we'll use one final
check using summary

143
00:07:04,900 --> 00:07:07,800
on the final polling data frame.

144
00:07:07,800 --> 00:07:10,860
And as we can see,
Rasmussen and SurveyUSA

145
00:07:10,860 --> 00:07:13,160
are no longer missing values.