1
00:00:04,490 --> 00:00:10,220
The cp parameter-- cp stands for complexity parameter.

2
00:00:10,220 --> 00:00:12,180
Recall that the first tree we made

3
00:00:12,180 --> 00:00:15,630
using latitude and longitude only had many splits,

4
00:00:15,630 --> 00:00:19,370
but we were able to trim it without losing much accuracy.

5
00:00:19,370 --> 00:00:22,010
The intuition we gain is, having too many splits

6
00:00:22,010 --> 00:00:25,680
is bad for generalization-- that is, performance on the test

7
00:00:25,680 --> 00:00:27,930
set-- so we should penalize the complexity.

8
00:00:30,710 --> 00:00:35,800
Let us define RSS to be the residual sum of squares, also

9
00:00:35,800 --> 00:00:39,930
known as the sum of squared differences.

10
00:00:39,930 --> 00:00:41,740
Our goal when building the tree is

11
00:00:41,740 --> 00:00:44,630
to minimize the RSS by making splits,

12
00:00:44,630 --> 00:00:48,780
but we want to penalize having too many splits now.

13
00:00:48,780 --> 00:00:51,210
Define S to be the number of splits,

14
00:00:51,210 --> 00:00:53,900
and lambda to be our penalty.

15
00:00:53,900 --> 00:00:56,050
Our new goal is to find a tree that

16
00:00:56,050 --> 00:01:00,080
minimizes the sum of the RSS at each leaf,

17
00:01:00,080 --> 00:01:04,730
plus lambda times S, the number of splits.

18
00:01:04,730 --> 00:01:08,289
Let us consider the following example.

19
00:01:08,289 --> 00:01:12,280
Here we have set lambda to be equal to 0.5.

20
00:01:12,280 --> 00:01:14,840
Initially, we have a tree with no splits.

21
00:01:14,840 --> 00:01:17,360
We simply take the average of the data.

22
00:01:17,360 --> 00:01:23,190
The RSS in this case is 5, thus our total penalty is also 5.

23
00:01:23,190 --> 00:01:27,150
If we make one split, we now have two leaves.

24
00:01:27,150 --> 00:01:32,600
At each of these leaves, say, we have an error, or RSS, of 2.

25
00:01:32,600 --> 00:01:37,039
The total RSS error is then 2+2=4.

26
00:01:37,039 --> 00:01:43,370
And the total penalty is 4+0.5*1, where 1 is the number of splits.

27
00:01:43,370 --> 00:01:47,410
Our total penalty in this case is 4.5.

28
00:01:47,410 --> 00:01:50,190
If we split again on one of our leaves,

29
00:01:50,190 --> 00:01:54,100
we now have a total of three leaves for two splits.

30
00:01:54,100 --> 00:01:56,940
The error at our left-most leaf is 1.

31
00:01:56,940 --> 00:01:59,600
The next leaf has an error of 0.8.

32
00:01:59,600 --> 00:02:04,340
And the next leaf has an error of 2, for a total error of 3.8.

33
00:02:04,340 --> 00:02:09,630
The total penalty is thus 3.8+0.5*2,

34
00:02:09,630 --> 00:02:14,220
for a total penalty of 4.8.

35
00:02:14,220 --> 00:02:16,950
Notice that if we pick a large value of lambda,

36
00:02:16,950 --> 00:02:18,970
we won't make many splits, because we

37
00:02:18,970 --> 00:02:21,380
pay a big price for every additional split, a price that

38
00:02:21,380 --> 00:02:24,470
will outweigh the decrease in error.

39
00:02:24,470 --> 00:02:27,040
If we pick a small, or zero, value of lambda,

40
00:02:27,040 --> 00:02:29,960
we will make splits until they no longer decrease the error.

41
00:02:32,650 --> 00:02:35,690
You may be wondering at this point what the definition of cp

42
00:02:35,690 --> 00:02:37,750
is, exactly.

43
00:02:37,750 --> 00:02:41,200
Well, it's very closely related to lambda.

44
00:02:41,200 --> 00:02:44,020
Considering a tree with no splits,

45
00:02:44,020 --> 00:02:46,740
we simply take the average of our data,

46
00:02:46,740 --> 00:02:48,890
calculate RSS for that so-called tree,

47
00:02:48,890 --> 00:02:52,540
and let us call that RSS for no splits.

48
00:02:52,540 --> 00:02:54,370
Then we can define cp=lambda/RSS(no splits).

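
The arithmetic in the example above can be checked with a short script. This is only a sketch of the lecture's numbers in Python rather than R; the helper name penalized_cost and the variable names are illustrative assumptions and are not part of any tree-fitting library.

```python
# Worked example from the lecture: total penalty = sum of leaf RSS + lambda * S.
# penalized_cost and the names below are hypothetical, chosen for illustration.

def penalized_cost(leaf_rss, lam):
    """Return the total RSS over the leaves plus lam times the number of splits."""
    n_splits = len(leaf_rss) - 1  # a binary tree with k leaves has k - 1 splits
    return sum(leaf_rss) + lam * n_splits

LAMBDA = 0.5

# Per-leaf RSS values for the three trees described in the lecture.
trees = {
    "no splits": [5.0],
    "one split": [2.0, 2.0],
    "two splits": [1.0, 0.8, 2.0],
}

costs = {name: penalized_cost(rss, LAMBDA) for name, rss in trees.items()}
best = min(costs, key=costs.get)

# cp = lambda / RSS(no splits), as defined in the lecture.
rss_no_splits = sum(trees["no splits"])
cp = LAMBDA / rss_no_splits

for name, cost in costs.items():
    print(f"{name}: {cost:.1f}")
print("lowest total penalty:", best)
print(f"cp = {cp:.1f}")
```

With lambda set to 0.5, the three totals are 5, 4.5, and 4.8, so the one-split tree has the lowest total penalty, and the corresponding cp is 0.5/5 = 0.1.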
49
00:02:58,950 --> 00:03:01,880
When you're actually using cp in your R code,

50
00:03:01,880 --> 00:03:05,000
you don't need to think about exactly what it means-- just remember

51
00:03:05,000 --> 00:03:08,420
that small values of cp encourage large trees,

52
00:03:08,420 --> 00:03:12,400
and large values of cp encourage small trees.

53
00:03:12,400 --> 00:03:15,450
Let's go back to R now, and apply cross-validation

54
00:03:15,450 --> 00:03:17,720
to our training data.
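
The rule of thumb that small values of cp encourage large trees can be illustrated with the lecture's own numbers. The sketch below is written in Python rather than R and assumes only three candidate trees, with RSS 5, 4, and 3.8 for zero, one, and two splits; chosen_splits is a hypothetical helper and not a function from any R package.

```python
# RSS of the lecture's three candidate trees, indexed by number of splits.
# chosen_splits is a hypothetical helper written for this illustration.
RSS_BY_SPLITS = [5.0, 4.0, 3.8]

def chosen_splits(cp, rss_by_splits=RSS_BY_SPLITS):
    """Return the number of splits that minimizes RSS + lambda * S."""
    lam = cp * rss_by_splits[0]  # invert cp = lambda / RSS(no splits)
    costs = [rss + lam * s for s, rss in enumerate(rss_by_splits)]
    return costs.index(min(costs))

for cp in (0.0, 0.1, 0.5):
    print(f"cp = {cp}: {chosen_splits(cp)} splits")
```

Under these assumptions, cp = 0 selects the two-split tree, cp = 0.1 selects the one-split tree, and cp = 0.5 selects the tree with no splits.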