1
00:00:04,500 --> 00:00:06,840
In this video, we'll
introduce a method

2
00:00:06,840 --> 00:00:10,750
that is similar to CART
called Random Forests.

3
00:00:10,750 --> 00:00:14,130
This method was designed to
improve the prediction accuracy

4
00:00:14,130 --> 00:00:19,110
of CART and works by building
a large number of CART trees.

5
00:00:19,110 --> 00:00:21,190
Unfortunately, this
makes the method

6
00:00:21,190 --> 00:00:24,130
less interpretable
than CART, so often you

7
00:00:24,130 --> 00:00:27,110
need to decide if you
value the interpretability

8
00:00:27,110 --> 00:00:30,420
or the increase
in accuracy more.

9
00:00:30,420 --> 00:00:33,950
To make a prediction for a
new observation, each tree

10
00:00:33,950 --> 00:00:36,680
in the forest votes
on the outcome

11
00:00:36,680 --> 00:00:38,420
and we pick the
outcome that receives

12
00:00:38,420 --> 00:00:41,640
the majority of the votes.

13
00:00:41,640 --> 00:00:45,480
So how does Random Forests
build many CART trees?

14
00:00:45,480 --> 00:00:48,120
We can't just run
CART multiple times

15
00:00:48,120 --> 00:00:51,640
because it would create
the same tree every time.

16
00:00:51,640 --> 00:00:54,330
To prevent this,
Random Forests only

17
00:00:54,330 --> 00:00:58,050
allows each tree to
split on a random subset

18
00:00:58,050 --> 00:01:00,790
of the available
independent variables.

19
00:01:00,790 --> 00:01:02,940
And each tree is
built from what we

20
00:01:02,940 --> 00:01:07,260
call a bagged or bootstrapped
sample of the data.

21
00:01:07,260 --> 00:01:09,110
This just means
that the data used

22
00:01:09,110 --> 00:01:11,260
as the training
data for each tree

23
00:01:11,260 --> 00:01:14,400
is selected randomly
with replacement.

24
00:01:14,400 --> 00:01:16,440
Let's look at an example.

25
00:01:16,440 --> 00:01:19,120
Suppose we have five data
points in our training set.

26
00:01:19,120 --> 00:01:22,890
We'll call them
1, 2, 3, 4, and 5.

27
00:01:22,890 --> 00:01:25,220
For the first tree,
we'll randomly

28
00:01:25,220 --> 00:01:29,930
pick five data points randomly
sampled with replacement.

29
00:01:29,930 --> 00:01:36,650
So the data could be
2, 4, 5, 2, and 1.

30
00:01:36,650 --> 00:01:38,890
Each time we pick
one of the five data

31
00:01:38,890 --> 00:01:43,330
points regardless of whether or
not it's been selected already.

32
00:01:43,330 --> 00:01:45,320
These would be the
five data points

33
00:01:45,320 --> 00:01:49,060
we use when constructing
the first CART tree.

34
00:01:49,060 --> 00:01:52,400
Then we repeat this process
for the second tree.

35
00:01:52,400 --> 00:01:57,950
This time the data set
might be 3, 5, 1, 5, and 2.

36
00:01:57,950 --> 00:02:02,460
And we would use this data when
building the second CART tree.

37
00:02:02,460 --> 00:02:04,210
Then we would
repeat this process

38
00:02:04,210 --> 00:02:07,840
for each additional
tree we want to create.

39
00:02:07,840 --> 00:02:11,140
So since each tree sees a
different set of variables

40
00:02:11,140 --> 00:02:13,770
and a different
set of data, we get

41
00:02:13,770 --> 00:02:18,710
what's called a forest
of many different trees.

42
00:02:18,710 --> 00:02:22,250
Just like CART, Random Forests
has some parameter values

43
00:02:22,250 --> 00:02:24,170
that need to be selected.

44
00:02:24,170 --> 00:02:28,070
The first is the minimum number
of observations in a subset,

45
00:02:28,070 --> 00:02:30,880
or the minbucket
parameter from CART.

46
00:02:30,880 --> 00:02:33,390
When we create a
random forest in R,

47
00:02:33,390 --> 00:02:36,680
this will be called nodesize.

48
00:02:36,680 --> 00:02:40,550
A smaller value of nodesize,
which leads to bigger trees,

49
00:02:40,550 --> 00:02:43,630
may take longer in
R. Random Forests

50
00:02:43,630 --> 00:02:48,170
is much more computationally
intensive than CART.

51
00:02:48,170 --> 00:02:51,410
The second parameter is the
number of trees to build,

52
00:02:51,410 --> 00:02:55,140
which is called intree
in R. This should not

53
00:02:55,140 --> 00:02:59,210
be set too small, but the larger
it is the longer it will take.

54
00:02:59,210 --> 00:03:03,030
A couple hundred trees
is typically plenty.

55
00:03:03,030 --> 00:03:05,030
A nice thing about
Random Forests

56
00:03:05,030 --> 00:03:07,790
is that it's not as sensitive
to the parameter values

57
00:03:07,790 --> 00:03:09,390
as CART is.

58
00:03:09,390 --> 00:03:11,920
In the next video, we'll
talk about a nice way

59
00:03:11,920 --> 00:03:13,890
to pick the CART parameter.

60
00:03:13,890 --> 00:03:17,240
For Random Forests, as long as
this selection is a reasonable

61
00:03:17,240 --> 00:03:19,000
it's OK.

62
00:03:19,000 --> 00:03:21,880
Let's switch to R and
create a Random Forest model

63
00:03:21,880 --> 00:03:24,140
to predict the decisions
of Justice Stevens.

64
00:03:27,170 --> 00:03:31,270
In our R console, let's start
by installing and loading

65
00:03:31,270 --> 00:03:33,910
the package "randomForest."

66
00:03:33,910 --> 00:03:36,560
We first need to
install the package

67
00:03:36,560 --> 00:03:40,690
using the install.packages
function for the package

68
00:03:40,690 --> 00:03:43,980
"randomForest."

69
00:03:43,980 --> 00:03:47,270
You should see a few lines
run in your R console

70
00:03:47,270 --> 00:03:49,720
and then when you're back
to the blinking cursor,

71
00:03:49,720 --> 00:03:51,810
load the package with
the library command.

72
00:03:56,240 --> 00:03:59,520
Now we're ready to build
our Random Forests model.

73
00:03:59,520 --> 00:04:05,370
We'll call it StevensForest and
use the randomForest function,

74
00:04:05,370 --> 00:04:07,670
first giving our
dependent variable,

75
00:04:07,670 --> 00:04:10,660
Reverse, followed
by a tilde sign,

76
00:04:10,660 --> 00:04:12,240
and then our
independent variable

77
00:04:12,240 --> 00:04:14,400
separated by plus signs.

78
00:04:14,400 --> 00:04:15,120
Circuit.

79
00:04:15,120 --> 00:04:16,060
Issue.

80
00:04:16,060 --> 00:04:17,579
Petitioner.

81
00:04:17,579 --> 00:04:18,079
Respondent.

82
00:04:21,200 --> 00:04:21,700
LowerCourt.

83
00:04:24,550 --> 00:04:26,790
And Unconst.

84
00:04:26,790 --> 00:04:28,890
We'll use the data set Train.

85
00:04:31,440 --> 00:04:35,060
For Random Forests we need to
give two additional arguments.

86
00:04:35,060 --> 00:04:39,880
These are nodesize, also
known as minbucket for CART,

87
00:04:39,880 --> 00:04:42,670
and we'll set this equal
to 25, the same value we

88
00:04:42,670 --> 00:04:44,520
used for our CART model.

89
00:04:44,520 --> 00:04:47,330
And then we need to set
the parameter ntree.

90
00:04:47,330 --> 00:04:49,490
This is the number
of trees to build.

91
00:04:49,490 --> 00:04:52,280
And we'll build 200 trees here.

92
00:04:52,280 --> 00:04:54,450
Then hit Enter.

93
00:04:54,450 --> 00:04:57,560
You should see an interesting
warning message here.

94
00:04:57,560 --> 00:05:01,320
In CART, we added the
argument method="class",

95
00:05:01,320 --> 00:05:05,030
so that it was clear that we're
doing a classification problem.

96
00:05:05,030 --> 00:05:07,380
As I mentioned
earlier, trees can also

97
00:05:07,380 --> 00:05:09,440
be used for regression
problems, which

98
00:05:09,440 --> 00:05:11,470
you'll see in the recitation.

99
00:05:11,470 --> 00:05:15,290
The Random Forest function does
not have a method argument.

100
00:05:15,290 --> 00:05:18,120
So when we want to do a
classification problem,

101
00:05:18,120 --> 00:05:21,250
we need to make sure
outcome is a factor.

102
00:05:21,250 --> 00:05:24,960
Let's convert the variable
Reverse to a factor variable

103
00:05:24,960 --> 00:05:28,180
in both our training
and our testing sets.

104
00:05:28,180 --> 00:05:31,410
We do this by typing the
name of the variable we want

105
00:05:31,410 --> 00:05:34,960
to convert-- in our
case Train$Reverse--

106
00:05:34,960 --> 00:05:40,600
and then type as.factor and
then in parentheses the variable

107
00:05:40,600 --> 00:05:43,580
name, Train$Reverse.

108
00:05:43,580 --> 00:05:46,550
And just repeat this for
the test set as well.

109
00:05:46,550 --> 00:05:55,200
Test$Reverse=as.factor(Test$Reverse)

110
00:05:55,200 --> 00:05:58,310
Now let's try creating
our Random Forest again.

111
00:05:58,310 --> 00:06:01,450
Just use the up arrow to get
back to the Random Forest line

112
00:06:01,450 --> 00:06:02,860
and hit Enter.

113
00:06:02,860 --> 00:06:05,100
We didn't get a warning
message this time

114
00:06:05,100 --> 00:06:08,370
so our model is ready
to make predictions.

115
00:06:08,370 --> 00:06:11,290
Let's compute predictions
on our test set.

116
00:06:11,290 --> 00:06:16,320
We'll call our predictions
PredictForest and use

117
00:06:16,320 --> 00:06:20,010
the predict function to make
predictions using our model,

118
00:06:20,010 --> 00:06:25,210
StevensForest, and
the new data set Test.

119
00:06:28,260 --> 00:06:32,180
Let's look at the confusion
matrix to compute our accuracy.

120
00:06:32,180 --> 00:06:36,550
We'll use the table function
and first give the true outcome,

121
00:06:36,550 --> 00:06:39,740
Test$Reverse, and then our
predictions, PredictForest.

122
00:06:43,290 --> 00:06:45,710
Our accuracy here is
(40+74)/(40+37+19+74).

123
00:06:56,330 --> 00:07:01,650
So the accuracy of our Random
Forest model is about 67%.

124
00:07:01,650 --> 00:07:03,460
Recall that our
logistic regression

125
00:07:03,460 --> 00:07:08,230
model had an accuracy of
66.5% and our CART model

126
00:07:08,230 --> 00:07:11,620
had an accuracy of 65.9%.

127
00:07:11,620 --> 00:07:14,460
So our Random Forest model
improved our accuracy

128
00:07:14,460 --> 00:07:16,470
a little bit over CART.

129
00:07:16,470 --> 00:07:19,850
Sometimes you'll see a smaller
improvement in accuracy

130
00:07:19,850 --> 00:07:22,180
and sometimes you'll see
that Random Forests can

131
00:07:22,180 --> 00:07:25,290
significantly improve
in accuracy over CART.

132
00:07:25,290 --> 00:07:28,150
We'll see this a lot in the
recitation in the homework

133
00:07:28,150 --> 00:07:30,010
assignments.

134
00:07:30,010 --> 00:07:33,940
Keep in mind that Random
Forests has a random component.

135
00:07:33,940 --> 00:07:37,070
You may have gotten a different
confusion matrix than me

136
00:07:37,070 --> 00:07:40,440
because there's a random
component to this method.