1
00:00:04,500 --> 00:00:06,840
In this video, we'll introduce a method

2
00:00:06,840 --> 00:00:10,750
that is similar to CART called Random Forests.

3
00:00:10,750 --> 00:00:14,130
This method was designed to improve the prediction accuracy

4
00:00:14,130 --> 00:00:19,110
of CART and works by building a large number of CART trees.

5
00:00:19,110 --> 00:00:21,190
Unfortunately, this makes the method

6
00:00:21,190 --> 00:00:24,130
less interpretable than CART, so often you

7
00:00:24,130 --> 00:00:27,110
need to decide if you value the interpretability

8
00:00:27,110 --> 00:00:30,420
or the increase in accuracy more.

9
00:00:30,420 --> 00:00:33,950
To make a prediction for a new observation, each tree

10
00:00:33,950 --> 00:00:36,680
in the forest votes on the outcome

11
00:00:36,680 --> 00:00:38,420
and we pick the outcome that receives

12
00:00:38,420 --> 00:00:41,640
the majority of the votes.

13
00:00:41,640 --> 00:00:45,480
So how does Random Forests build many CART trees?

14
00:00:45,480 --> 00:00:48,120
We can't just run CART multiple times

15
00:00:48,120 --> 00:00:51,640
because it would create the same tree every time.

16
00:00:51,640 --> 00:00:54,330
To prevent this, Random Forests only

17
00:00:54,330 --> 00:00:58,050
allows each tree to split on a random subset

18
00:00:58,050 --> 00:01:00,790
of the available independent variables.

19
00:01:00,790 --> 00:01:02,940
And each tree is built from what we

20
00:01:02,940 --> 00:01:07,260
call a bagged or bootstrapped sample of the data.

21
00:01:07,260 --> 00:01:09,110
This just means that the data used

22
00:01:09,110 --> 00:01:11,260
as the training data for each tree

23
00:01:11,260 --> 00:01:14,400
is selected randomly with replacement.

24
00:01:14,400 --> 00:01:16,440
Let's look at an example.

25
00:01:16,440 --> 00:01:19,120
Suppose we have five data points in our training set.

26
00:01:19,120 --> 00:01:22,890
We'll call them 1, 2, 3, 4, and 5.
27
00:01:22,890 --> 00:01:25,220
For the first tree, we'll

28
00:01:25,220 --> 00:01:29,930
pick five data points, randomly sampled with replacement.

29
00:01:29,930 --> 00:01:36,650
So the data could be 2, 4, 5, 2, and 1.

30
00:01:36,650 --> 00:01:38,890
Each time we pick one of the five data

31
00:01:38,890 --> 00:01:43,330
points regardless of whether or not it's been selected already.

32
00:01:43,330 --> 00:01:45,320
These would be the five data points

33
00:01:45,320 --> 00:01:49,060
we use when constructing the first CART tree.

34
00:01:49,060 --> 00:01:52,400
Then we repeat this process for the second tree.

35
00:01:52,400 --> 00:01:57,950
This time the data set might be 3, 5, 1, 5, and 2.

36
00:01:57,950 --> 00:02:02,460
And we would use this data when building the second CART tree.

37
00:02:02,460 --> 00:02:04,210
Then we would repeat this process

38
00:02:04,210 --> 00:02:07,840
for each additional tree we want to create.

39
00:02:07,840 --> 00:02:11,140
So since each tree sees a different set of variables

40
00:02:11,140 --> 00:02:13,770
and a different set of data, we get

41
00:02:13,770 --> 00:02:18,710
what's called a forest of many different trees.

42
00:02:18,710 --> 00:02:22,250
Just like CART, Random Forests has some parameter values

43
00:02:22,250 --> 00:02:24,170
that need to be selected.

44
00:02:24,170 --> 00:02:28,070
The first is the minimum number of observations in a subset,

45
00:02:28,070 --> 00:02:30,880
or the minbucket parameter from CART.

46
00:02:30,880 --> 00:02:33,390
When we create a random forest in R,

47
00:02:33,390 --> 00:02:36,680
this will be called nodesize.

48
00:02:36,680 --> 00:02:40,550
A smaller value of nodesize, which leads to bigger trees,

49
00:02:40,550 --> 00:02:43,630
may take longer in R. Random Forests

50
00:02:43,630 --> 00:02:48,170
is much more computationally intensive than CART.
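The bootstrap example from earlier in this section (five points labelled 1 through 5, resampled with replacement for each tree) can be sketched in base R. This is only an illustration and was not shown in the video: the seed value and the names `data_points`, `tree1_sample`, and `tree2_sample` are assumptions, and the values drawn will generally differ from the ones read out in the example.

```r
# Five training points, labelled 1 through 5 as in the spoken example
data_points <- 1:5

# Assumed seed so the sketch is repeatable; any value would do
set.seed(123)

# Each tree gets its own sample of the same size as the training set,
# drawn with replacement, so a point can appear twice or not at all
tree1_sample <- sample(data_points, size = length(data_points), replace = TRUE)
tree2_sample <- sample(data_points, size = length(data_points), replace = TRUE)

print(tree1_sample)
print(tree2_sample)
```

Because the draws are with replacement, each sample typically repeats some points and leaves others out, which is part of what makes the trees differ from one another.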
51
00:02:48,170 --> 00:02:51,410
The second parameter is the number of trees to build,

52
00:02:51,410 --> 00:02:55,140
which is called ntree in R. This should not

53
00:02:55,140 --> 00:02:59,210
be set too small, but the larger it is, the longer it will take.

54
00:02:59,210 --> 00:03:03,030
A couple hundred trees is typically plenty.

55
00:03:03,030 --> 00:03:05,030
A nice thing about Random Forests

56
00:03:05,030 --> 00:03:07,790
is that it's not as sensitive to the parameter values

57
00:03:07,790 --> 00:03:09,390
as CART is.

58
00:03:09,390 --> 00:03:11,920
In the next video, we'll talk about a nice way

59
00:03:11,920 --> 00:03:13,890
to pick the CART parameter.

60
00:03:13,890 --> 00:03:17,240
For Random Forests, as long as the selection is reasonable,

61
00:03:17,240 --> 00:03:19,000
it's OK.

62
00:03:19,000 --> 00:03:21,880
Let's switch to R and create a Random Forest model

63
00:03:21,880 --> 00:03:24,140
to predict the decisions of Justice Stevens.

64
00:03:27,170 --> 00:03:31,270
In our R console, let's start by installing and loading

65
00:03:31,270 --> 00:03:33,910
the package "randomForest."

66
00:03:33,910 --> 00:03:36,560
We first need to install the package

67
00:03:36,560 --> 00:03:40,690
using the install.packages function for the package

68
00:03:40,690 --> 00:03:43,980
"randomForest."

69
00:03:43,980 --> 00:03:47,270
You should see a few lines run in your R console

70
00:03:47,270 --> 00:03:49,720
and then when you're back to the blinking cursor,

71
00:03:49,720 --> 00:03:51,810
load the package with the library command.

72
00:03:56,240 --> 00:03:59,520
Now we're ready to build our Random Forests model.
73
00:03:59,520 --> 00:04:05,370
We'll call it StevensForest and use the randomForest function,

74
00:04:05,370 --> 00:04:07,670
first giving our dependent variable,

75
00:04:07,670 --> 00:04:10,660
Reverse, followed by a tilde sign,

76
00:04:10,660 --> 00:04:12,240
and then our independent variables

77
00:04:12,240 --> 00:04:14,400
separated by plus signs.

78
00:04:14,400 --> 00:04:15,120
Circuit.

79
00:04:15,120 --> 00:04:16,060
Issue.

80
00:04:16,060 --> 00:04:17,579
Petitioner.

81
00:04:17,579 --> 00:04:18,079
Respondent.

82
00:04:21,200 --> 00:04:21,700
LowerCourt.

83
00:04:24,550 --> 00:04:26,790
And Unconst.

84
00:04:26,790 --> 00:04:28,890
We'll use the data set Train.

85
00:04:31,440 --> 00:04:35,060
For Random Forests we need to give two additional arguments.

86
00:04:35,060 --> 00:04:39,880
These are nodesize, also known as minbucket for CART,

87
00:04:39,880 --> 00:04:42,670
and we'll set this equal to 25, the same value we

88
00:04:42,670 --> 00:04:44,520
used for our CART model.

89
00:04:44,520 --> 00:04:47,330
And then we need to set the parameter ntree.

90
00:04:47,330 --> 00:04:49,490
This is the number of trees to build.

91
00:04:49,490 --> 00:04:52,280
And we'll build 200 trees here.

92
00:04:52,280 --> 00:04:54,450
Then hit Enter.

93
00:04:54,450 --> 00:04:57,560
You should see an interesting warning message here.

94
00:04:57,560 --> 00:05:01,320
In CART, we added the argument method="class",

95
00:05:01,320 --> 00:05:05,030
so that it was clear that we're doing a classification problem.

96
00:05:05,030 --> 00:05:07,380
As I mentioned earlier, trees can also

97
00:05:07,380 --> 00:05:09,440
be used for regression problems, which

98
00:05:09,440 --> 00:05:11,470
you'll see in the recitation.

99
00:05:11,470 --> 00:05:15,290
The randomForest function does not have a method argument.

100
00:05:15,290 --> 00:05:18,120
So when we want to do a classification problem,

101
00:05:18,120 --> 00:05:21,250
we need to make sure the outcome is a factor.
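The model call described above can be written out roughly as follows, with the outcome already stored as a factor so that the forest does classification. This is a hedged sketch, not the code from the video: the `Train` data frame here is a made-up stand-in with random values and only three of the six predictors, and the seed is arbitrary. The argument names `data`, `nodesize`, and `ntree` are the ones named in the transcript, and the fit only runs if the randomForest package is installed.

```r
# Toy stand-in for the Train data frame: made-up values, NOT the Stevens data
set.seed(1)
n <- 200
Train <- data.frame(
  Reverse    = factor(sample(0:1, n, replace = TRUE)),  # outcome stored as a factor
  Circuit    = factor(sample(c("1st", "2nd", "9th"), n, replace = TRUE)),
  LowerCourt = factor(sample(c("liberal", "conser"), n, replace = TRUE)),
  Unconst    = sample(0:1, n, replace = TRUE)
)

# Fit the forest only if the randomForest package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  StevensForest <- randomForest::randomForest(
    Reverse ~ Circuit + LowerCourt + Unconst,  # the video lists six predictors
    data = Train,
    nodesize = 25,  # minimum number of observations in a terminal node
    ntree = 200     # number of trees to build
  )
  print(StevensForest)
}
```

With the real data, the formula would list all six variables named in the video (Circuit, Issue, Petitioner, Respondent, LowerCourt, and Unconst).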
102
00:05:21,250 --> 00:05:24,960
Let's convert the variable Reverse to a factor variable

103
00:05:24,960 --> 00:05:28,180
in both our training and our testing sets.

104
00:05:28,180 --> 00:05:31,410
We do this by typing the name of the variable we want

105
00:05:31,410 --> 00:05:34,960
to convert-- in our case Train$Reverse--

106
00:05:34,960 --> 00:05:40,600
and then type = as.factor and then in parentheses the variable

107
00:05:40,600 --> 00:05:43,580
name, Train$Reverse.

108
00:05:43,580 --> 00:05:46,550
And just repeat this for the test set as well.

109
00:05:46,550 --> 00:05:55,200
Test$Reverse=as.factor(Test$Reverse)

110
00:05:55,200 --> 00:05:58,310
Now let's try creating our Random Forest again.

111
00:05:58,310 --> 00:06:01,450
Just use the up arrow to get back to the Random Forest line

112
00:06:01,450 --> 00:06:02,860
and hit Enter.

113
00:06:02,860 --> 00:06:05,100
We didn't get a warning message this time

114
00:06:05,100 --> 00:06:08,370
so our model is ready to make predictions.

115
00:06:08,370 --> 00:06:11,290
Let's compute predictions on our test set.

116
00:06:11,290 --> 00:06:16,320
We'll call our predictions PredictForest and use

117
00:06:16,320 --> 00:06:20,010
the predict function to make predictions using our model,

118
00:06:20,010 --> 00:06:25,210
StevensForest, and the new data set Test.

119
00:06:28,260 --> 00:06:32,180
Let's look at the confusion matrix to compute our accuracy.

120
00:06:32,180 --> 00:06:36,550
We'll use the table function and first give the true outcome,

121
00:06:36,550 --> 00:06:39,740
Test$Reverse, and then our predictions, PredictForest.

122
00:06:43,290 --> 00:06:45,710
Our accuracy here is (40+74)/(40+37+19+74).

123
00:06:56,330 --> 00:07:01,650
So the accuracy of our Random Forest model is about 67%.

124
00:07:01,650 --> 00:07:03,460
Recall that our logistic regression

125
00:07:03,460 --> 00:07:08,230
model had an accuracy of 66.5% and our CART model

126
00:07:08,230 --> 00:07:11,620
had an accuracy of 65.9%.
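The accuracy arithmetic above can be checked with a short base R sketch. The four counts are the ones read out in the video; arranging them with the true outcome in rows and the prediction in columns is an assumption based on the order of the arguments given to the table function, and the names `conf_matrix` and `accuracy` are illustrative.

```r
# Counts as read out in the video. In the video the table comes from
# table(Test$Reverse, PredictForest); the layout below (rows = true
# outcome, columns = prediction) is an assumption.
conf_matrix <- matrix(c(40, 37,
                        19, 74),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(Actual = c("0", "1"),
                                      Predicted = c("0", "1")))

# Accuracy = correctly classified cases (the diagonal) / all cases
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
print(round(accuracy, 3))  # 114 / 170, about 0.671
```

The diagonal holds the 40 + 74 = 114 correct predictions out of 170 test cases, which is where the figure of about 67% comes from.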
127
00:07:11,620 --> 00:07:14,460
So our Random Forest model improved our accuracy

128
00:07:14,460 --> 00:07:16,470
a little bit over CART.

129
00:07:16,470 --> 00:07:19,850
Sometimes you'll see a smaller improvement in accuracy

130
00:07:19,850 --> 00:07:22,180
and sometimes you'll see that Random Forests can

131
00:07:22,180 --> 00:07:25,290
significantly improve accuracy over CART.

132
00:07:25,290 --> 00:07:28,150
We'll see this a lot in the recitation and the homework

133
00:07:28,150 --> 00:07:30,010
assignments.

134
00:07:30,010 --> 00:07:33,940
Keep in mind that Random Forests has a random component.

135
00:07:33,940 --> 00:07:37,070
You may have gotten a different confusion matrix than me

136
00:07:37,070 --> 00:07:40,440
because there's a random component to this method.
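One common way to make a random procedure repeatable in R is to fix the random seed with set.seed before running it. The video does not do this, so the following is only a sketch under assumptions: it uses base R's sample function as a stand-in for the random draws a forest makes, the seed value 200 is arbitrary, and the names `first_run` and `second_run` are illustrative.

```r
# Same seed before each draw, so both runs produce the same sample
set.seed(200)
first_run <- sample(1:5, size = 5, replace = TRUE)

set.seed(200)
second_run <- sample(1:5, size = 5, replace = TRUE)

print(identical(first_run, second_run))  # TRUE: same seed, same draws
```

Calling set.seed immediately before building the forest should likewise make the result repeatable on the same machine and package version, although results may still differ from the ones shown in the video.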