In this video, we'll see how to build a CART model in R. Let's start by reading in the data file "stevens.csv". We'll call our data frame stevens and use the read.csv function to read in the data file "stevens.csv". Remember to navigate to the directory on your computer containing the file "stevens.csv" first.

Now, let's take a look at our data using the str function. We have 566 observations, or Supreme Court cases, and nine different variables. Docket is just a unique identifier for each case, and Term is the year of the case. Then we have our six independent variables: the circuit court of origin, the issue area of the case, the type of petitioner, the type of respondent, the lower court direction, and whether or not the petitioner argued that a law or practice was unconstitutional. The last variable is our dependent variable, whether or not Justice Stevens voted to reverse the case: 1 for reverse, and 0 for affirm.

Now, before building models, we need to split our data into a training set and a testing set. We'll do this using the sample.split function, like we did last week for logistic regression. First, we need to load the package caTools with library(caTools). Now, so that we all get the same split, we need to set the seed. Remember that this can be any number, as long as we all use the same number. Let's set the seed to 3000.

Now, let's create our split. We'll call it spl, and we'll use the sample.split function, where the first argument needs to be our outcome variable, stevens$Reverse, and the second argument is the SplitRatio, or the percentage of data that we want to put in the training set. In this case, we'll put 70% of the data in the training set.

Now, let's create our training and testing sets using the subset function. We'll call our training set Train, and we'll take a subset of stevens, only taking the observations for which spl is equal to TRUE. We'll call our testing set Test, and here take a subset of stevens, but this time taking the observations for which spl is equal to FALSE.
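Collected together, the commands described so far look roughly like this. This is just a sketch, assuming "stevens.csv" is in your working directory and that you use the same object names (stevens, spl, Train, Test):

# Read in the data and inspect it
stevens = read.csv("stevens.csv")
str(stevens)

# Split the data: 70% training, 30% testing
library(caTools)
set.seed(3000)
spl = sample.split(stevens$Reverse, SplitRatio = 0.7)
Train = subset(stevens, spl == TRUE)
Test = subset(stevens, spl == FALSE)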
Now, we're ready to build our CART model. First we need to install and load the rpart package and the rpart plotting package, rpart.plot. Remember that to install a new package, we use the install.packages function, and then in parentheses and quotes give the name of the package we want to install. In this case, rpart. After you hit Enter, a CRAN mirror should pop up asking you to pick a location near you. Go ahead and pick the appropriate location. In my case, I'll pick Pennsylvania in the United States, and hit OK. You should see some lines run in your R Console, and then, when you're back to the blinking cursor, load the package with library(rpart).

Now, let's install the package rpart.plot. Again, some lines should run in your R Console, and when you're back to the blinking cursor, load the package with library(rpart.plot).

Now we can create our CART model using the rpart function. We'll call our model StevensTree, and we'll use the rpart function, where the first argument is the same as if we were building a linear or logistic regression model. We give our dependent variable, in our case Reverse, followed by a tilde sign, and then the independent variables separated by plus signs: Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst. We also need to give the data set that should be used to build our model, which in our case is Train.

Now we'll give two additional arguments here. The first one is method = "class". This tells rpart to build a classification tree, instead of a regression tree. You'll see how we can create regression trees in recitation. The last argument we'll give is minbucket = 25. This limits the tree so that it doesn't overfit to our training set. We selected a value of 25, but we could pick a smaller or larger value. We'll see another way to limit the tree later in this lecture.

Now let's plot our tree using the prp function, where the only argument is the name of our model, StevensTree.
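Put together, the tree-building commands look roughly like this (a sketch using the same names as above; the install.packages lines are only needed the first time):

# Install the packages (only needed once), then load them
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

# Build a classification tree on the training set
StevensTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst,
                    data = Train, method = "class", minbucket = 25)

# Plot the tree
prp(StevensTree)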
You should see the tree pop up in the graphics window. The first split of our tree is whether or not the lower court decision is liberal. If it is, then we move to the left in the tree and check the respondent. If the respondent is a criminal defendant, injured person, politician, state, or the United States, we predict 0, or affirm.

You can see here that the prp function abbreviates the values of the independent variables. If you're not sure what the abbreviations are, you can create a table of the variable to see all of the possible values. prp selects the abbreviations so that they're uniquely identifiable. So if you made a table, you would see that CRI stands for criminal defendant, INJ stands for injured person, etc.

Moving on in our tree, if the respondent is not one of these types, we move on to the next split and check the petitioner. If the petitioner is a city, employee, employer, government official, or politician, then we predict 0, or affirm. If not, then we check the circuit court of origin. If it's the 10th, 1st, 3rd, 4th, D.C., or Federal Circuit, then we predict 0. Otherwise, we predict 1, or reverse. We can repeat this same process on the other side of the tree, where the lower court decision is not liberal.

Comparing this to a logistic regression model, we can see that it's very interpretable. A CART tree is a series of decision rules which can easily be explained.

Now let's see how well our CART model does at making predictions for the test set. Back in our R Console, we'll call our predictions PredictCART, and we'll use the predict function, where the first argument is the name of our model, StevensTree. The second argument is the new data we want to make predictions for, Test. And we'll add a third argument here, which is type = "class". We need to give this argument when making predictions for our CART model if we want the majority class predictions. This is like using a threshold of 0.5.
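As a sketch of those two steps, here is how you might look up the full values behind prp's abbreviations, along with the predict call just described (Respondent is used only as an example variable; any of the factor variables would work):

# See the full category names behind prp's abbreviations
table(stevens$Respondent)

# Majority-class predictions on the test set
PredictCART = predict(StevensTree, newdata = Test, type = "class")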
We'll see in a few minutes how we can leave this argument out and still get probabilities from our CART model.

Now let's compute the accuracy of our model by building a confusion matrix. We'll use the table function, and first give the true outcome values, Test$Reverse, and then our predictions, PredictCART. To compute the accuracy, we need to add up the observations we got correct, 41 plus 71, and divide by the total number of observations in the table, or the total number of observations in our test set. So the accuracy of our CART model is 0.659. If you were to build a logistic regression model, you would get an accuracy of 0.665, and a baseline model that always predicts Reverse, the most common outcome, has an accuracy of 0.547. So our CART model significantly beats the baseline and is competitive with logistic regression. It's also much more interpretable than a logistic regression model would be.

Lastly, to evaluate our model, let's generate an ROC curve for our CART model using the ROCR package. First, we need to load the package with the library function, and then we need to generate our predictions again, this time without the type = "class" argument. We'll call them PredictROC, and we'll use the predict function, giving just two arguments: StevensTree and newdata = Test. Let's take a look at what this looks like by typing PredictROC and hitting Enter.

For each observation in the test set, it gives two numbers, which can be thought of as the probability of outcome 0 and the probability of outcome 1. More concretely, each test set observation is classified into a subset, or bucket, of our CART tree. These numbers give the percentage of training set data in that bucket with outcome 0 and the percentage with outcome 1. We'll use the second column as our probabilities to generate an ROC curve.
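A sketch of these commands, using the same object names as before; cmat is just a name introduced here so the accuracy can be computed directly from the table, and your exact counts may differ slightly depending on your R and package versions:

# Confusion matrix on the test set (rows = true outcomes, columns = predictions)
cmat = table(Test$Reverse, PredictCART)
cmat

# Accuracy = correct predictions / total test-set observations
sum(diag(cmat)) / sum(cmat)

# Predictions again, this time as class probabilities (no type = "class")
library(ROCR)
PredictROC = predict(StevensTree, newdata = Test)
head(PredictROC)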
So just like we did last week for logistic regression, we'll start by using the prediction function. We'll call the output pred, and then use prediction, where the first argument is the second column of PredictROC, which we can access with square brackets, and the second argument is the true outcome values, Test$Reverse.

Now we need to use the performance function, and we'll call its output perf. The first argument is the output of the prediction function, and the next two arguments are the true positive rate and the false positive rate, which are what we want on the y- and x-axes of our ROC curve. Now we can just plot our ROC curve by typing plot(perf).

If you switch back to your graphics window, you should see the ROC curve for our model. In the next quick question, we'll ask you to compute the test set AUC of this model.
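Putting the ROCR steps together, a sketch using the objects created above; the last line is one common way to compute the test set AUC asked about in the quick question:

# Build the ROCR prediction object from the second column (probability of outcome 1)
pred = prediction(PredictROC[,2], Test$Reverse)

# True positive rate vs. false positive rate, then plot the ROC curve
perf = performance(pred, "tpr", "fpr")
plot(perf)

# Test set AUC
as.numeric(performance(pred, "auc")@y.values)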