1
00:00:04,500 --> 00:00:07,630
Now, we're ready to actually
start building models.

2
00:00:07,630 --> 00:00:10,130
So as usual, the first
thing we're going to do

3
00:00:10,130 --> 00:00:13,120
is split our data into a
training and a testing set.

4
00:00:13,120 --> 00:00:14,620
And for this problem,
we're actually

5
00:00:14,620 --> 00:00:18,290
going to train on data from
the 2004 and 2008 elections,

6
00:00:18,290 --> 00:00:21,680
and we're going to test
on data from the 2012

7
00:00:21,680 --> 00:00:23,290
presidential election.

8
00:00:23,290 --> 00:00:25,640
So to do that, we'll
create a data frame

9
00:00:25,640 --> 00:00:29,380
called Train, using the subset
function that breaks down

10
00:00:29,380 --> 00:00:32,259
the original polling
data frame and only

11
00:00:32,259 --> 00:00:37,080
stores the observations when
either the Year was 2004

12
00:00:37,080 --> 00:00:42,240
or when the Year was 2008.

13
00:00:42,240 --> 00:00:43,960
And to obtain the
testing set, we're

14
00:00:43,960 --> 00:00:49,070
going to use subset to create
a data frame called Test that

15
00:00:49,070 --> 00:00:51,900
saves the observations
in polling where

16
00:00:51,900 --> 00:00:55,330
the year was 2012.
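
The split just described might look roughly like this in R. This is only a sketch: the small `polling` data frame below is invented stand-in data, and only the `Year` column and the `Train` and `Test` names come from the narration.

```r
# Hypothetical stand-in for the polling data frame (all values invented).
polling <- data.frame(
  State = c("A", "B", "C", "A", "B", "C"),
  Year  = c(2004, 2004, 2008, 2008, 2012, 2012)
)

# Training set: observations from the 2004 and 2008 elections.
Train <- subset(polling, Year == 2004 | Year == 2008)

# Testing set: observations from the 2012 election.
Test <- subset(polling, Year == 2012)

nrow(Train)  # 4 rows in this toy example
nrow(Test)   # 2 rows in this toy example
```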

17
00:00:55,330 --> 00:00:58,740
So now that we've broken it down
into a training and a testing

18
00:00:58,740 --> 00:01:02,290
set, we want to understand
the prediction of our baseline

19
00:01:02,290 --> 00:01:04,660
model against which
we want to compare

20
00:01:04,660 --> 00:01:06,870
a later logistic
regression model.

21
00:01:06,870 --> 00:01:09,039
So to do that, we'll
look at the breakdown

22
00:01:09,039 --> 00:01:10,990
of the dependent
variable in the training

23
00:01:10,990 --> 00:01:13,100
set using the table function.

24
00:01:17,380 --> 00:01:21,820
What we can see here is that
in 47 of the 100 training

25
00:01:21,820 --> 00:01:24,320
observations, the
Democrat won the state,

26
00:01:24,320 --> 00:01:27,830
and in 53 of the observations,
the Republican won the state.

27
00:01:27,830 --> 00:01:30,570
So our simple baseline
model is always

28
00:01:30,570 --> 00:01:33,080
going to predict the more
common outcome, which

29
00:01:33,080 --> 00:01:35,690
is that the Republican is
going to win the state.

30
00:01:35,690 --> 00:01:37,490
And we see that the
simple baseline model

31
00:01:37,490 --> 00:01:42,410
will have an accuracy of
53% on the training set.
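
As a rough illustration of where the 53% comes from, the outcome counts quoted above can be rebuilt and tabulated in R. The vector below is synthetic, reconstructed only from those counts, and the name `outcome` is an assumption, with 1 standing for a Republican win and 0 for a Democratic win.

```r
# Synthetic outcome vector rebuilt from the counts in the narration:
# 47 Democratic wins (0) and 53 Republican wins (1).
outcome <- c(rep(0, 47), rep(1, 53))

# Breakdown of the dependent variable.
counts <- table(outcome)
counts

# Accuracy of always predicting the more common outcome.
max(counts) / sum(counts)  # 0.53
```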

32
00:01:42,410 --> 00:01:46,070
Now, unfortunately, this
is a pretty weak model.

33
00:01:46,070 --> 00:01:49,450
It always predicts Republican,
even for a landslide

34
00:01:49,450 --> 00:01:52,300
Democratic state, where
the Democrat was polling

35
00:01:52,300 --> 00:01:55,610
15% or 20% ahead
of the Republican.

36
00:01:55,610 --> 00:01:59,240
So nobody would really consider
this to be a credible model.

37
00:01:59,240 --> 00:02:03,490
So we need to think of a smarter
baseline model against which

38
00:02:03,490 --> 00:02:06,580
we can compare our logistic
regression models that we're

39
00:02:06,580 --> 00:02:08,070
going to develop later.

40
00:02:08,070 --> 00:02:10,860
So a reasonable
smart baseline would

41
00:02:10,860 --> 00:02:13,660
be to just take one of
the polls-- in our case,

42
00:02:13,660 --> 00:02:16,100
we'll take Rasmussen--
and make a prediction

43
00:02:16,100 --> 00:02:18,920
based on who the poll said
was winning in the state.

44
00:02:18,920 --> 00:02:22,090
So for instance, if the
Republican is polling ahead,

45
00:02:22,090 --> 00:02:25,010
the Rasmussen smart
baseline would just

46
00:02:25,010 --> 00:02:27,040
pick the Republican
to be the winner.

47
00:02:27,040 --> 00:02:29,880
If the Democrat was ahead,
it would pick the Democrat.

48
00:02:29,880 --> 00:02:31,970
And if they were tied,
the model would not

49
00:02:31,970 --> 00:02:34,490
know which one to select.

50
00:02:34,490 --> 00:02:36,860
So to compute this
smart baseline,

51
00:02:36,860 --> 00:02:39,260
we're going to use a new
function called the sign

52
00:02:39,260 --> 00:02:40,500
function.

53
00:02:40,500 --> 00:02:42,650
And what this function
does is, if it's

54
00:02:42,650 --> 00:02:45,650
passed a positive number,
it returns the value 1.

55
00:02:45,650 --> 00:02:49,270
If it's passed a negative
number, it returns negative 1.

56
00:02:49,270 --> 00:02:52,079
And if it's passed
0, it returns 0.

57
00:02:52,079 --> 00:02:56,810
So if we pass the
Rasmussen variable into sign,

58
00:02:56,810 --> 00:02:59,600
whenever the Republican was
winning the state, meaning

59
00:02:59,600 --> 00:03:02,840
Rasmussen is positive,
it's going to return a 1.

60
00:03:02,840 --> 00:03:05,380
So for instance,
if the value 20 is

61
00:03:05,380 --> 00:03:08,440
passed, meaning the Republican
is polling 20 ahead,

62
00:03:08,440 --> 00:03:09,470
it returns 1.

63
00:03:09,470 --> 00:03:13,510
So 1 signifies that the
Republican is predicted to win.

64
00:03:13,510 --> 00:03:15,950
If the Democrat is leading
in the Rasmussen poll,

65
00:03:15,950 --> 00:03:18,170
it'll take on a negative value.

66
00:03:18,170 --> 00:03:22,150
So if we take, for instance,
the sign of -10, we get -1.

67
00:03:22,150 --> 00:03:25,260
So -1 means this
smart baseline is

68
00:03:25,260 --> 00:03:28,220
predicting that the
Democrat won the state.

69
00:03:28,220 --> 00:03:30,490
And finally, if we
took the sign of 0,

70
00:03:30,490 --> 00:03:33,270
meaning that the
Rasmussen poll had a tie,

71
00:03:33,270 --> 00:03:35,320
it returns 0, saying
that the model is

72
00:03:35,320 --> 00:03:39,140
inconclusive about who's
going to win the state.
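
The three cases just walked through can be tried directly with R's built-in `sign` function. The comments restate the interpretation given in the narration, where a positive value means the Republican is ahead in the poll.

```r
sign(20)   # 1: Republican predicted to win
sign(-10)  # -1: Democrat predicted to win
sign(0)    # 0: inconclusive

# sign() is vectorized, so it can be applied to a whole column at once.
sign(c(20, -10, 0))  # 1 -1 0
```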

73
00:03:39,140 --> 00:03:41,930
So now, we're ready
to actually compute

74
00:03:41,930 --> 00:03:45,520
this prediction for all
of our training set.

75
00:03:45,520 --> 00:03:47,280
And we can take a
look at the breakdown

76
00:03:47,280 --> 00:03:50,190
of that using the
table function applied

77
00:03:50,190 --> 00:03:56,829
to the sign of the training
set's Rasmussen variable.

78
00:03:56,829 --> 00:04:00,340
And what we can see is that
in 56 of the 100 training set

79
00:04:00,340 --> 00:04:03,500
observations, the smart
baseline predicted

80
00:04:03,500 --> 00:04:05,740
that the Republican
was going to win.

81
00:04:05,740 --> 00:04:08,750
In 42 instances, it
predicted the Democrat.

82
00:04:08,750 --> 00:04:11,640
And in two instances,
it was inconclusive.
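
A minimal sketch of this tabulation, assuming synthetic poll margins: the vector below only reproduces the counts quoted in the narration, and the margin sizes of 5 and -5 are invented. With the real data, the same call would be applied to the training set's Rasmussen column instead.

```r
# Synthetic margins: 56 Republican leads, 42 Democratic leads, 2 ties.
Rasmussen <- c(rep(5, 56), rep(-5, 42), rep(0, 2))

# Breakdown of the smart baseline predictions (-1, 0, 1).
table(sign(Rasmussen))
```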

83
00:04:11,640 --> 00:04:15,100
So what we really want to do
is to see the breakdown of how

84
00:04:15,100 --> 00:04:19,290
the smart baseline model does,
compared to the actual result

85
00:04:19,290 --> 00:04:21,390
-- who actually won the state.

86
00:04:21,390 --> 00:04:23,760
So we want to again
use the table function,

87
00:04:23,760 --> 00:04:27,240
but this time, we want to
compare the training set's

88
00:04:27,240 --> 00:04:32,650
outcome against the sign
of the polling data.

89
00:04:36,180 --> 00:04:39,590
So in this table, the rows
are the true outcome --

90
00:04:39,590 --> 00:04:42,320
1 is for Republican,
0 is for Democrat --

91
00:04:42,320 --> 00:04:46,190
and the columns are the smart
baseline predictions, -1, 0,

92
00:04:46,190 --> 00:04:47,280
or 1.

93
00:04:47,280 --> 00:04:51,130
What we can see is in the
top left corner over here,

94
00:04:51,130 --> 00:04:55,990
we have 42 observations where
the Rasmussen smart baseline

95
00:04:55,990 --> 00:04:57,580
predicted the
Democrat would win,

96
00:04:57,580 --> 00:04:59,840
and the Democrat
actually did win.

97
00:04:59,840 --> 00:05:03,380
There were 52 observations where
the smart baseline predicted

98
00:05:03,380 --> 00:05:05,940
the Republican would win,
and the Republican actually

99
00:05:05,940 --> 00:05:07,070
did win.

100
00:05:07,070 --> 00:05:10,460
Again, there were those two
inconclusive observations.

101
00:05:10,460 --> 00:05:12,200
And finally, there
were four mistakes.

102
00:05:12,200 --> 00:05:15,740
There were four times where the
smart baseline model predicted

103
00:05:15,740 --> 00:05:18,480
that the Republican would
win, but actually the Democrat

104
00:05:18,480 --> 00:05:19,760
won the state.

105
00:05:19,760 --> 00:05:21,770
So as we can see, this
model, with four mistakes

106
00:05:21,770 --> 00:05:25,000
and two inconclusive results
out of the 100 training

107
00:05:25,000 --> 00:05:29,060
set observations is doing much,
much better than the naive

108
00:05:29,060 --> 00:05:31,560
baseline, which simply
was always predicting

109
00:05:31,560 --> 00:05:33,040
the Republican
would win and made

110
00:05:33,040 --> 00:05:35,780
47 mistakes on the same data.
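
The comparison table can be sketched in R as follows. The data frame is synthetic, rebuilt from the counts in the narration with invented margin sizes, and `Republican` as the name of the outcome column (1 for a Republican win, 0 for a Democratic win) is an assumption. The split of the two ties, one per row, follows from the row totals of 47 and 53.

```r
# Synthetic training set reconstructed from the counts in the narration.
Train <- data.frame(
  Republican = c(rep(0, 42), rep(0, 1), rep(0, 4), rep(1, 1), rep(1, 52)),
  Rasmussen  = c(rep(-5, 42), rep(0, 1), rep(5, 4), rep(0, 1), rep(5, 52))
)

# Rows: true outcome. Columns: smart baseline prediction (-1, 0, 1).
tab <- table(Train$Republican, sign(Train$Rasmussen))
tab

# Mistakes: the baseline picked a winner and the pick was wrong.
mistakes <- tab["0", "1"] + tab["1", "-1"]
inconclusive <- sum(tab[, "0"])
mistakes      # 4
inconclusive  # 2
```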

111
00:05:35,780 --> 00:05:39,300
So we see that this is a much
more reasonable baseline model

112
00:05:39,300 --> 00:05:42,320
to carry forward, against
which we can compare

113
00:05:42,320 --> 00:05:45,150
our logistic
regression-based approach.