In the previous video, we generated a CART tree with three splits, but why not two, or four, or even five? There are different ways to control how many splits are generated. One way is by setting a lower bound on the number of data points in each subset. In R, this is called the minbucket parameter, short for the minimum number of observations in each bucket, or subset. The smaller minbucket is, the more splits will be generated. But if it's too small, overfitting will occur: CART will fit the training set almost perfectly, which is bad because the model will then probably not perform well on the test set or on new data. On the other hand, if the minbucket parameter is too large, the model will be too simple and the accuracy will be poor. Later in the lecture, we will learn about a nice method for selecting this stopping parameter (a short R sketch of setting minbucket appears at the end of this transcript).

In each subset of a CART tree, we have a bucket of observations, which may contain both possible outcomes. In the small example we showed in the previous video, we classified each subset as either red or gray, depending on the majority in that subset. In the Supreme Court case, we'll be classifying observations as either affirm or reverse. Instead of just taking the majority outcome to be the prediction, we can compute the percentage of data in a subset with each type of outcome. As an example, if we have a subset with 10 affirms and 2 reverses, then about 83% of the data (10 out of 12) is affirm. Then, just like in logistic regression, we can use a threshold value to obtain our prediction. For this example, we would predict affirm with a threshold of 0.5, since the majority is affirm. But if we increased that threshold to 0.9, we would predict reverse for this example.

Then, by varying the threshold value, we can compute an ROC curve and an AUC value to evaluate our model. In the next video, we'll build a CART tree in R to predict the decisions of Justice Stevens and evaluate our model using an ROC curve.
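To make the minbucket discussion above concrete, here is a minimal R sketch using the rpart package, a standard CART implementation in R. The data frame stevens, the outcome Decision, and the predictor names are hypothetical stand-ins for illustration; they are not objects defined in this video.

```r
# Minimal sketch: controlling tree size with minbucket. Assumes a
# hypothetical data frame `stevens` with a factor outcome `Decision`
# ("affirm"/"reverse") and some hypothetical case-level predictors.
library(rpart)

# A small minbucket allows tiny buckets, so many splits are generated
# and the tree risks overfitting the training set.
bushy_tree <- rpart(Decision ~ Circuit + Issue + LowerCourt,
                    data = stevens, method = "class",
                    control = rpart.control(minbucket = 5))

# A large minbucket forces every bucket to hold many observations,
# so few splits are generated and the model may be too simple.
stubby_tree <- rpart(Decision ~ Circuit + Issue + LowerCourt,
                     data = stevens, method = "class",
                     control = rpart.control(minbucket = 100))

# Count the leaves (buckets) each setting produces.
sum(bushy_tree$frame$var == "<leaf>")
sum(stubby_tree$frame$var == "<leaf>")
```

rpart.control also exposes related stopping controls, such as minsplit, maxdepth, and the complexity parameter cp, which bound tree growth in similar ways.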
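The bucket percentages and thresholds described above can be reproduced by asking the tree for class probabilities instead of class labels. This sketch continues the same hypothetical setup; the columns of the probability matrix are named after the outcome factor's levels, so a level called "affirm" is assumed.

```r
library(rpart)

# Same hypothetical tree and data as in the sketch above.
tree <- rpart(Decision ~ Circuit + Issue + LowerCourt,
              data = stevens, method = "class",
              control = rpart.control(minbucket = 25))

# type = "prob" returns, for each observation, the fraction of each
# outcome in its bucket. A bucket with 10 affirms and 2 reverses gives
# an affirm probability of 10/12, about 0.83.
pred_probs <- predict(tree, newdata = stevens, type = "prob")
affirm_prob <- pred_probs[, "affirm"]  # assumes "affirm" is a factor level

# With a 0.5 threshold, the 0.83 bucket is classified affirm (majority
# vote); raising the threshold to 0.9 flips that same bucket to reverse.
pred_05 <- ifelse(affirm_prob >= 0.5, "affirm", "reverse")
pred_09 <- ifelse(affirm_prob >= 0.9, "affirm", "reverse")
```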
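Finally, sweeping the threshold across all possible values gives the ROC curve and AUC mentioned at the end of the video. One common way to do this in R is the ROCR package, sketched here under the same hypothetical names; in practice the curve would be computed on held-out test data rather than on the training set.

```r
library(rpart)
library(ROCR)

# Same hypothetical tree and data as in the sketches above.
tree <- rpart(Decision ~ Circuit + Issue + LowerCourt,
              data = stevens, method = "class",
              control = rpart.control(minbucket = 25))

# Probability of "reverse" for each observation, taken from its bucket.
prob_reverse <- predict(tree, newdata = stevens, type = "prob")[, "reverse"]

# ROCR pairs each probability with the true outcome, then sweeps the
# threshold from 0 to 1 to trace out true and false positive rates.
rocr_pred <- prediction(prob_reverse, stevens$Decision)
roc_curve <- performance(rocr_pred, "tpr", "fpr")
plot(roc_curve)

# The AUC summarizes the whole curve in a single number.
as.numeric(performance(rocr_pred, "auc")@y.values)
```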