In this video, we'll see how to build a CART model in R. Let's start by reading in the data file "stevens.csv". We'll call our data frame stevens and use the read.csv function to read in the data file "stevens.csv". Remember to navigate to the directory on your computer containing the file "stevens.csv" first.

Now, let's take a look at our data using the str function. We have 566 observations, or Supreme Court cases, and nine different variables. Docket is just a unique identifier for each case, and Term is the year of the case. Then we have our six independent variables: the circuit court of origin, the issue area of the case, the type of petitioner, the type of respondent, the lower court direction, and whether or not the petitioner argued that a law or practice was unconstitutional. The last variable is our dependent variable, whether or not Justice Stevens voted to reverse the case: 1 for reverse, and 0 for affirm.

Now, before building models, we need to split our data into a training set and a testing set. We'll do this using the sample.split function, like we did last week for logistic regression. First, we need to load the package caTools with library(caTools). Now, so that we all get the same split, we need to set the seed. Remember that this can be any number, as long as we all use the same number. Let's set the seed to 3000.

Now, let's create our split. We'll call it spl, and we'll use the sample.split function, where the first argument needs to be our outcome variable, stevens$Reverse, and the second argument is the SplitRatio, or the percentage of data that we want to put in the training set. In this case, we'll put 70% of the data in the training set.

Now, let's create our training and testing sets using the subset function. We'll call our training set Train, and we'll take a subset of stevens, only taking the observations for which spl is equal to TRUE. We'll call our testing set Test, and here take a subset of stevens, but this time taking the observations for which spl is equal to FALSE.
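Collected together, the commands described so far look roughly like this. This is just a sketch, assuming "stevens.csv" is in your working directory and that you use the same object names (stevens, spl, Train, Test):

# Read in the data and inspect it
stevens = read.csv("stevens.csv")
str(stevens)

# Split the data: 70% training, 30% testing
library(caTools)
set.seed(3000)
spl = sample.split(stevens$Reverse, SplitRatio = 0.7)
Train = subset(stevens, spl == TRUE)
Test = subset(stevens, spl == FALSE)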
Now, we're ready to build our CART model. First we need to install and load the rpart package and the rpart plotting package, rpart.plot. Remember that to install a new package, we use the install.packages function, and then in parentheses and quotes give the name of the package we want to install. In this case, rpart. After you hit Enter, a CRAN mirror should pop up asking you to pick a location near you. Go ahead and pick the appropriate location. In my case, I'll pick Pennsylvania in the United States, and hit OK. You should see some lines run in your R Console, and then, when you're back to the blinking cursor, load the package with library(rpart).

Now, let's install the package rpart.plot. Again, some lines should run in your R Console, and when you're back to the blinking cursor, load the package with library(rpart.plot).

Now we can create our CART model using the rpart function. We'll call our model StevensTree, and we'll use the rpart function, where the first argument is the same as if we were building a linear or logistic regression model. We give our dependent variable, in our case Reverse, followed by a tilde sign, and then the independent variables separated by plus signs: Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst. We also need to give the data set that should be used to build our model, which in our case is Train.

Now we'll give two additional arguments here. The first one is method = "class". This tells rpart to build a classification tree, instead of a regression tree. You'll see how we can create regression trees in recitation. The last argument we'll give is minbucket = 25. This limits the tree so that it doesn't overfit to our training set. We selected a value of 25, but we could pick a smaller or larger value. We'll see another way to limit the tree later in this lecture.

Now let's plot our tree using the prp function, where the only argument is the name of our model, StevensTree.
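Put together, the tree-building commands look roughly like this (a sketch using the same names as above; the install.packages lines are only needed the first time):

# Install the packages (only needed once), then load them
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

# Build a classification tree on the training set
StevensTree = rpart(Reverse ~ Circuit + Issue + Petitioner + Respondent + LowerCourt + Unconst,
                    data = Train, method = "class", minbucket = 25)

# Plot the tree
prp(StevensTree)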
You should see the tree pop up in the graphics window. The first split of our tree is whether or not the lower court decision is liberal. If it is, then we move to the left in the tree and check the respondent. If the respondent is a criminal defendant, injured person, politician, state, or the United States, we predict 0, or affirm.

You can see here that the prp function abbreviates the values of the independent variables. If you're not sure what the abbreviations are, you can create a table of the variable to see all of the possible values. prp selects the abbreviations so that they're uniquely identifiable. So if you made a table, you would see that CRI stands for criminal defendant, INJ stands for injured person, etc.

Moving on in our tree, if the respondent is not one of these types, we move on to the next split and check the petitioner. If the petitioner is a city, employee, employer, government official, or politician, then we predict 0, or affirm. If not, then we check the circuit court of origin. If it's the 10th, 1st, 3rd, 4th, D.C., or Federal Circuit, then we predict 0. Otherwise, we predict 1, or reverse. We can repeat this same process on the other side of the tree, where the lower court decision is not liberal.

Comparing this to a logistic regression model, we can see that it's very interpretable. A CART tree is a series of decision rules which can easily be explained.

Now let's see how well our CART model does at making predictions for the test set. Back in our R Console, we'll call our predictions PredictCART, and we'll use the predict function, where the first argument is the name of our model, StevensTree. The second argument is the new data we want to make predictions for, Test. And we'll add a third argument here, which is type = "class". We need to give this argument when making predictions for our CART model if we want the majority class predictions. This is like using a threshold of 0.5.
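As a sketch of those two steps, here is how you might look up the full values behind prp's abbreviations, along with the predict call just described (Respondent is used only as an example variable; any of the factor variables would work):

# See the full category names behind prp's abbreviations
table(stevens$Respondent)

# Majority-class predictions on the test set
PredictCART = predict(StevensTree, newdata = Test, type = "class")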
We'll see in a few minutes how we can leave this argument out and still get probabilities from our CART model.

Now let's compute the accuracy of our model by building a confusion matrix. We'll use the table function, and first give the true outcome values, Test$Reverse, and then our predictions, PredictCART. To compute the accuracy, we need to add up the observations we got correct, 41 plus 71, and divide by the total number of observations in the table, or the total number of observations in our test set. So the accuracy of our CART model is 0.659. If you were to build a logistic regression model, you would get an accuracy of 0.665, and a baseline model that always predicts Reverse, the most common outcome, has an accuracy of 0.547. So our CART model significantly beats the baseline and is competitive with logistic regression. It's also much more interpretable than a logistic regression model would be.

Lastly, to evaluate our model, let's generate an ROC curve for our CART model using the ROCR package. First, we need to load the package with the library function, and then we need to generate our predictions again, this time without the type = "class" argument. We'll call them PredictROC, and we'll use the predict function, giving just two arguments: StevensTree and newdata = Test. Let's take a look at what this looks like by typing PredictROC and hitting Enter.

For each observation in the test set, it gives two numbers, which can be thought of as the probability of outcome 0 and the probability of outcome 1. More concretely, each test set observation is classified into a subset, or bucket, of our CART tree. These numbers give the percentage of training set data in that bucket with outcome 0 and the percentage with outcome 1. We'll use the second column as our probabilities to generate an ROC curve.
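A sketch of these commands, using the same object names as before; cmat is just a name introduced here so the accuracy can be computed directly from the table, and your exact counts may differ slightly depending on your R and package versions:

# Confusion matrix on the test set (rows = true outcomes, columns = predictions)
cmat = table(Test$Reverse, PredictCART)
cmat

# Accuracy = correct predictions / total test-set observations
sum(diag(cmat)) / sum(cmat)

# Predictions again, this time as class probabilities (no type = "class")
library(ROCR)
PredictROC = predict(StevensTree, newdata = Test)
head(PredictROC)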
So just like we did last week for logistic regression, we'll start by using the prediction function. We'll call the output pred, and then use prediction, where the first argument is the second column of PredictROC, which we can access with square brackets, and the second argument is the true outcome values, Test$Reverse.

Now we need to use the performance function, and we'll call its output perf. The first argument is the output of the prediction function, and the next two arguments are the true positive rate and the false positive rate, which are what we want on the y- and x-axes of our ROC curve. Now we can just plot our ROC curve by typing plot(perf).

If you switch back to your graphics window, you should see the ROC curve for our model. In the next quick question, we'll ask you to compute the test set AUC of this model.
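Putting the ROCR steps together, a sketch using the objects created above; the last line is one common way to compute the test set AUC asked about in the quick question:

# Build the ROCR prediction object from the second column (probability of outcome 1)
pred = prediction(PredictROC[,2], Test$Reverse)

# True positive rate vs. false positive rate, then plot the ROC curve
perf = performance(pred, "tpr", "fpr")
plot(perf)

# Test set AUC
as.numeric(performance(pred, "auc")@y.values)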