1 00:00:05,150 --> 00:00:09,700 OK, so now we know what CP is, we can go ahead and build 2 00:00:09,700 --> 00:00:13,100 one last tree using cross validation. 3 00:00:13,100 --> 00:00:15,190 So we need to make sure first we have the required 4 00:00:15,190 --> 00:00:18,710 libraries installed and in use. 5 00:00:18,710 --> 00:00:22,840 So the first package is the "caret" package. 6 00:00:25,720 --> 00:00:32,189 And the second we need is the "e1071" package. 7 00:00:32,189 --> 00:00:33,730 OK. 8 00:00:33,730 --> 00:00:37,300 So we need to tell the caret package how exactly we 9 00:00:37,300 --> 00:00:39,730 want to do our parameter tuning. 10 00:00:39,730 --> 00:00:42,290 There are actually quite a few ways of doing it. 11 00:00:42,290 --> 00:00:44,520 But we're going to restrict ourselves in this course 12 00:00:44,520 --> 00:00:46,720 to just 10-fold cross validation, 13 00:00:46,720 --> 00:00:48,880 as was explained in the lecture. 14 00:00:48,880 --> 00:00:50,960 So let's say tr.control=trainControl(method="cv", 15 00:00:50,960 --> 00:00:51,460 number=10). 16 00:01:05,260 --> 00:01:07,470 OK, that was easy enough. 17 00:01:07,470 --> 00:01:11,280 Now we need to tell caret which range of CP parameters 18 00:01:11,280 --> 00:01:12,770 to try out. 19 00:01:12,770 --> 00:01:16,890 Now remember that CP varies between 0 and 1. 20 00:01:16,890 --> 00:01:18,600 It's likely for any given problem 21 00:01:18,600 --> 00:01:21,500 that we don't need to explore the whole range. 22 00:01:21,500 --> 00:01:23,530 I happen to know, by the fact that I 23 00:01:23,530 --> 00:01:27,090 made this presentation ahead of time, that the value of CP 24 00:01:27,090 --> 00:01:29,700 we're going to pick is very small. 25 00:01:29,700 --> 00:01:36,160 So what I want to do is make a grid of CP values to try. 26 00:01:36,160 --> 00:01:53,400 And it will be over the range of 0 to 0.01. 27 00:01:53,400 --> 00:01:57,170 OK, so how does what I wrote feed into that? 
28 00:01:57,170 --> 00:02:04,240 Well, 1 times 0.001 is obviously 0.001. 29 00:02:04,240 --> 00:02:10,810 And 10 times 0.001 is obviously 0.01. 30 00:02:10,810 --> 00:02:15,300 0 to 10 means the numbers 31 00:02:15,300 --> 00:02:19,140 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. 32 00:02:19,140 --> 00:02:30,680 So 0 to 10 times 0.001 is those numbers scaled by 0.001. 33 00:02:30,680 --> 00:02:35,650 So those are the values of CP that caret will try. 34 00:02:35,650 --> 00:02:40,150 So let's store the results of the cross validation fitting 35 00:02:40,150 --> 00:02:42,370 in a variable called tr. 36 00:02:42,370 --> 00:02:45,530 And we'll use the train function. 37 00:02:45,530 --> 00:02:59,120 We predict MEDV from LAT, LON, CRIM, zoning, industry, 38 00:02:59,120 --> 00:03:09,610 Charles River, pollution, rooms, age, distance, 39 00:03:09,610 --> 00:03:16,850 distance from highways, tax, and pupil-teacher ratio. 40 00:03:16,850 --> 00:03:22,840 OK, we're using the train data set. 41 00:03:22,840 --> 00:03:29,270 We're using trees (rpart), our train control 42 00:03:29,270 --> 00:03:34,460 is what we just made before, and our tuning grid 43 00:03:34,460 --> 00:03:40,540 is the other thing we just made, which we called CP grid. 44 00:03:40,540 --> 00:03:41,700 And it whirrs away. 45 00:03:41,700 --> 00:03:44,370 And what it's doing there is it's trying all the different values 46 00:03:44,370 --> 00:03:47,060 of CP that we asked it to. 47 00:03:47,060 --> 00:03:51,240 So we can see what it's done by typing tr. 48 00:03:51,240 --> 00:03:55,600 You can see it tried 11 different values of CP. 49 00:03:55,600 --> 00:04:01,800 And it decided that CP equals 0.001 was the best because it 50 00:04:01,800 --> 00:04:07,380 had the best RMSE (Root Mean Square Error). 51 00:04:07,380 --> 00:04:11,970 And it was 5.03 for 0.001. 52 00:04:11,970 --> 00:04:17,740 You see it's pretty insensitive to a particular value of CP. 
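The grid arithmetic described above can be sketched in base R. This is only an illustration: the names cp.values and cp.grid are assumptions (the transcript only says the grid was "called CP grid"), and the column name cp is assumed to match the tuning parameter that caret's rpart method expects; older caret versions may want it written as .cp.

```r
# The integers 0 through 10, each scaled by 0.001,
# give 11 candidate CP values from 0 to 0.01.
cp.values <- (0:10) * 0.001

# One row per candidate value. The column name "cp" is an assumption
# about what the tuning grid should look like (see the note above).
cp.grid <- expand.grid(cp = cp.values)

print(cp.grid)
```

A data frame like this is the kind of object that would be handed to train as its tuning grid, alongside the train control object made earlier.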
53 00:04:17,740 --> 00:04:20,690 So it's maybe not too important. 54 00:04:20,690 --> 00:04:23,260 It's interesting though that the numbers are so low. 55 00:04:23,260 --> 00:04:26,420 I tried it for a much larger range of CP values, 56 00:04:26,420 --> 00:04:31,930 and the best solutions are always very close to 0. 57 00:04:31,930 --> 00:04:35,659 So it wants us to build a very detail-rich tree. 58 00:04:35,659 --> 00:04:39,659 So let's see what the tree that that value of CP corresponds to 59 00:04:39,659 --> 00:04:40,159 is. 60 00:04:40,159 --> 00:04:42,430 So we can get that by typing best.tree=tr$finalModel. 61 00:04:56,100 --> 00:04:58,620 And we can plot that tree. 62 00:04:58,620 --> 00:05:04,160 So that's the model that corresponds to 0.001. 63 00:05:04,160 --> 00:05:07,310 Plot it. 64 00:05:07,310 --> 00:05:11,020 Wow, OK, so that's a very detailed tree. 65 00:05:11,020 --> 00:05:13,320 You see that it looks pretty much like the same tree we 66 00:05:13,320 --> 00:05:15,300 had before, initially. 67 00:05:15,300 --> 00:05:17,880 But then it starts to get much more detailed at the bottom. 68 00:05:17,880 --> 00:05:19,980 And in fact if you can see close enough, 69 00:05:19,980 --> 00:05:21,980 there's actually latitude and longitude in there 70 00:05:21,980 --> 00:05:24,140 right down at the bottom as well. 71 00:05:24,140 --> 00:05:26,650 So maybe the tree is finally going 72 00:05:26,650 --> 00:05:29,460 to beat the linear regression model. 73 00:05:29,460 --> 00:05:31,990 Well, we can test that out the same way as we did before. 74 00:05:31,990 --> 00:05:34,030 best.tree.pred=predict(best.tree, newdata=test). 75 00:05:43,070 --> 00:05:48,140 best.tree.sse, the Sum of Squared Errors, 76 00:05:48,140 --> 00:05:54,320 is the sum of the best tree's predictions 77 00:05:54,320 --> 00:06:01,160 minus the true values squared. 78 00:06:01,160 --> 00:06:07,410 That number is 3,675. 
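The sum of squared errors described above can be sketched in base R. The numbers here are hypothetical toy values, not the course data, and the names predicted, actual and sse are placeholders chosen for the example.

```r
# Hypothetical toy values standing in for model predictions
# and the true target values on a test set.
predicted <- c(21, 30, 18)
actual    <- c(24, 29, 17)

# SSE: subtract element-wise, square each difference, then add them up.
sse <- sum((predicted - actual)^2)
print(sse)

# In the setting of the video this would presumably be
#   best.tree.sse = sum((best.tree.pred - test$MEDV)^2)
# assuming the test data frame holds the true values in a MEDV column.
```

Squaring happens before summing, so positive and negative errors cannot cancel each other out.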
79 00:06:07,410 --> 00:06:10,150 So if you can remember from the last video, 80 00:06:10,150 --> 00:06:15,890 the tree from the previous video actually only got something 81 00:06:15,890 --> 00:06:16,530 in the 4,000s. 82 00:06:16,530 --> 00:06:17,370 So not very good. 83 00:06:17,370 --> 00:06:18,580 So we have actually improved. 84 00:06:18,580 --> 00:06:20,940 This tree is better on the testing set 85 00:06:20,940 --> 00:06:23,390 than the original tree we created. 86 00:06:23,390 --> 00:06:26,280 But, you may also remember that a linear regression 87 00:06:26,280 --> 00:06:29,510 model actually did better than that still. 88 00:06:29,510 --> 00:06:34,720 The linear regression SSE was more around 3,030. 89 00:06:34,720 --> 00:06:39,390 So the best tree is not as good as a linear regression model. 90 00:06:39,390 --> 00:06:43,930 But cross validation did improve performance. 91 00:06:43,930 --> 00:06:46,980 So the takeaway is, I guess, that trees 92 00:06:46,980 --> 00:06:50,040 aren't always the best method you have available to you. 93 00:06:50,040 --> 00:06:53,960 But you should always try cross validating 94 00:06:53,960 --> 00:06:57,330 them to get as much performance out of them as you can. 95 00:06:57,330 --> 00:07:01,000 And that's the end of the presentation. Thank you.