Now that we've prepared our data set, let's use CART to build a predictive model. First, we need to load the necessary packages in our R console by typing library(rpart), and then library(rpart.plot).

Now let's build our model. We'll call it tweetCART, and we'll use the rpart function to predict Negative using all of the other variables as our independent variables and the data set trainSparse. We'll add one more argument, method = "class", so that the rpart function knows to build a classification model. We're just using the default parameter settings, so we won't add anything for minbucket or cp. Now let's plot the tree using the prp function.

Our tree says that if the word "freak" is in the tweet, then predict TRUE, or negative sentiment. If the word "freak" is not in the tweet, but the word "hate" is, again predict TRUE. If neither of these two words is in the tweet, but the word "wtf" is, also predict TRUE, or negative sentiment. If none of these three words is in the tweet, then predict FALSE, or non-negative sentiment.
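The commands narrated above might look like the following sketch. It assumes trainSparse was built in the earlier data-preparation step, with Negative as a factor outcome; the exact tree you get (the splits on "freak", "hate", and "wtf") depends on your data.

```r
# Load the CART packages (install.packages("rpart") and
# install.packages("rpart.plot") first if they are not installed)
library(rpart)
library(rpart.plot)

# Build a classification tree: Negative as the outcome, all other
# variables (the word frequencies) as predictors, default minbucket and cp
tweetCART <- rpart(Negative ~ ., data = trainSparse, method = "class")

# Plot the tree
prp(tweetCART)
```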
This tree makes sense intuitively, since these three words are generally seen as negative words. Now let's go back to our R console and evaluate the numerical performance of our model by making predictions on the test set. We'll call our predictions predictCART, and we'll use the predict function to predict using our model tweetCART on the new data set testSparse. We'll add one more argument, type = "class", to make sure we get class predictions.

Now let's make our confusion matrix using the table function. We'll give as the first argument the actual outcomes, testSparse$Negative, and then as the second argument our predictions, predictCART. To compute the accuracy of our model, we add up the numbers on the diagonal, 294 plus 18 (these are the observations we predicted correctly) and divide by the total number of observations in the table, which is the total number of observations in our test set. So the accuracy of our CART model is about 0.88. Let's compare this to a simple baseline model that always predicts non-negative.
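In code, the evaluation step might look like this. The counts 294 and 18 are the diagonal entries reported in the lecture; your confusion matrix may differ slightly with a different data split.

```r
# Class predictions on the test set
predictCART <- predict(tweetCART, newdata = testSparse, type = "class")

# Confusion matrix: rows are actual outcomes, columns are predictions
table(testSparse$Negative, predictCART)

# Accuracy: correct predictions (the diagonal) over all test observations
(294 + 18) / nrow(testSparse)   # about 0.879 in the lecture's run
```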
To compute the accuracy of the baseline model, let's make a table of just the outcome variable, Negative. So we'll type table, and then in parentheses, testSparse$Negative. This tells us that in our test set we have 300 observations with non-negative sentiment and 55 observations with negative sentiment. So the accuracy of a baseline model that always predicts non-negative would be 300 divided by 355, or about 0.845. So our CART model does better than the simple baseline model.

How about a random forest model? How well would that do? Let's first load the random forest package with library(randomForest), and then we'll set the seed to 123 so that we can replicate our model if we want to. Keep in mind that even if you set the seed to 123, you might get a different random forest model than I do, depending on your operating system. Now let's create our model. We'll call it tweetRF and use the randomForest function to predict Negative, again using all of our other variables as independent variables and the data set trainSparse. We'll again use the default parameter settings.
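A sketch of these two steps, assuming the same trainSparse and testSparse objects as before (install.packages("randomForest") first if needed):

```r
# Baseline: distribution of the outcome variable in the test set
table(testSparse$Negative)
# Accuracy of always predicting non-negative:
300 / 355   # about 0.845, using the counts from the lecture

# Random forest with default parameters; set the seed first so the
# run is reproducible (results can still vary across operating systems)
library(randomForest)
set.seed(123)
tweetRF <- randomForest(Negative ~ ., data = trainSparse)
```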
The random forest model takes significantly longer to build than the CART model. We've seen this before when building CART and random forest models, but in this case the difference is particularly drastic. This is because we have so many independent variables: about 300 different words. So far in this course, we haven't seen data sets with this many independent variables. So keep in mind that for text analytics problems, building a random forest model will take significantly longer than building a CART model.

Now that our model is finished, let's make predictions on our test set. We'll call them predictRF, and again we'll use the predict function to make predictions, this time using the model tweetRF, and again the new data set testSparse. Now let's make our confusion matrix using the table function, first giving the actual outcomes, testSparse$Negative, and then giving our predictions, predictRF. To compute the accuracy of the random forest model, we again sum up the cases we got right, 293 plus 21, and divide by the total number of observations in the table. So our random forest model has an accuracy of about 0.885.
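The evaluation mirrors the CART version; because Negative is a factor, predict on a randomForest object returns class labels by default. The counts 293 and 21 are from the lecture's run and may differ on your machine.

```r
# Predictions on the test set (class labels, since the outcome is a factor)
predictRF <- predict(tweetRF, newdata = testSparse)

# Confusion matrix: actual outcomes vs. random forest predictions
table(testSparse$Negative, predictRF)

# Accuracy: diagonal entries over all test observations
(293 + 21) / nrow(testSparse)   # about 0.885 in the lecture's run
```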
This is a little better than our CART model, but due to the interpretability of our CART model, I'd probably prefer it over the random forest model. If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy would increase to about the same as the random forest model's. So by using a bag-of-words approach and these models, we can reasonably predict sentiment even with a relatively small data set of tweets.
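The cross-validation step is only mentioned, not shown. One way to do it is with the caret package, which is an assumption here, not part of the lecture; the fold count and cp grid below are illustrative choices.

```r
# A sketch of picking cp by 10-fold cross-validation with caret
library(caret)

numFolds <- trainControl(method = "cv", number = 10)
cpGrid <- expand.grid(.cp = seq(0.002, 0.05, by = 0.002))

# train() reports the cp value with the best cross-validated accuracy;
# refit rpart with that cp and re-evaluate on testSparse as before
train(Negative ~ ., data = trainSparse, method = "rpart",
      trControl = numFolds, tuneGrid = cpGrid)
```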