In the previous video, we got a feel for how regression trees can do things linear regression cannot. But what really matters at the end of the day is whether they can predict better than linear regression. So let's try that right now. We're going to try to predict house prices using all the variables we have available to us.

First we'll load the caTools library, which will help us split the data. We'll set the seed so our results are reproducible. Then we'll say our split will be on the Boston house prices, with 70% going to training and 30% to testing. Our training data is the subset of the Boston data where the split is TRUE, and the testing data is the subset where the split is FALSE.

OK, first of all, let's make a linear regression model, nice and easy. It's a linear model, and the variables are latitude, longitude, crime, residential zoning, industry, whether it's on the Charles River or not, air pollution, rooms, age, distance to employment centers, accessibility to highways, the tax rate, and the pupil-teacher ratio. The data is the training data.

OK, let's see what our linear regression looks like. We see that latitude and longitude are not significant in the linear regression, which is perhaps not surprising, because linear regression didn't seem to be able to take advantage of them. Crime is very important. Residential zoning might be important. Whether it's on the Charles River or not is a useful factor. Air pollution does seem to matter, and the coefficient is negative, as you'd expect. The average number of rooms is significant. Age is somewhat important. Distance to centers of employment (DIS) is very important. Accessibility to highways and the tax rate are somewhat important, and the pupil-teacher ratio is also very significant. Some of these might be correlated, so we can't put too much stock in interpreting them directly, but it's interesting. The adjusted R-squared is 0.65, which is pretty good.
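For reference, here is a minimal sketch in R of the steps just described. The data frame name (boston), the outcome column (MEDV), the other column names, and the seed value are assumptions based on a typical version of the Boston housing data used for this exercise; they are not confirmed by the video.

    library(caTools)

    set.seed(123)  # illustrative seed; any fixed value makes the split reproducible
    split <- sample.split(boston$MEDV, SplitRatio = 0.7)  # 70% training, 30% testing
    train <- subset(boston, split == TRUE)
    test  <- subset(boston, split == FALSE)

    # Linear regression on all the variables listed above
    linreg <- lm(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM +
                   AGE + DIS + RAD + TAX + PTRATIO, data = train)
    summary(linreg)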
Because it's harder to compare out-of-sample accuracy for regression than for classification, we need to think about how we're going to do that. With classification, we just say this method got X% correct and that method got Y% correct. Since we're predicting a continuous variable here, let's calculate the sum of squared errors, which we discussed in the original linear regression video. So the linear regression's predictions are predict(linreg, newdata=test), and the linear regression's sum of squared errors is simply the sum of the squared differences between the predicted values and the actual values. Let's see what that number is: 3,037.008.

OK, so what we're interested to see now is: can we beat this using regression trees? Let's build a tree, using the rpart command again. Actually, to save myself from typing it all out again, I'm going to go back to the regression command and just change "lm" to "rpart" and change "linreg" to "tree"-- much easier. All right. So we've built our tree; let's have a look at it using the prp command from rpart.plot.

And here we go. So again, latitude and longitude aren't really important as far as the tree is concerned. The number of rooms is the most important split. Pollution appears in there twice, so it's in some sense nonlinear in the amount of pollution: if it's greater than a certain amount or less than a certain amount, it does different things. Crime is in there, age is in there. Rooms actually appears three times-- sorry, that's interesting. So it's very nonlinear in the number of rooms. Things that were important for the linear regression but don't appear in our tree include the pupil-teacher ratio, and the DIS variable doesn't appear in our regression tree at all either. So they're definitely doing different things-- but how do they compare?

So we'll predict, again, from the tree. "tree.pred" is the prediction of the tree on the new data, and the tree's sum of squared errors is the sum of the squared differences between the tree's predictions and what they really should be.
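Here is a sketch of the out-of-sample comparison described above, under the same assumed column names; linreg.sse and tree.sse are illustrative names for the two error sums.

    # Out-of-sample sum of squared errors for the linear model
    linreg.pred <- predict(linreg, newdata = test)
    linreg.sse  <- sum((linreg.pred - test$MEDV)^2)

    # Same formula, but with rpart instead of lm, then plot the tree
    library(rpart)
    library(rpart.plot)
    tree <- rpart(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM +
                    AGE + DIS + RAD + TAX + PTRATIO, data = train)
    prp(tree)

    # Out-of-sample sum of squared errors for the tree
    tree.pred <- predict(tree, newdata = test)
    tree.sse  <- sum((tree.pred - test$MEDV)^2)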
And then the moment of truth: 4,328. So, simply put, regression trees are not as good as linear regression for this problem. What this says to us, given what we saw with latitude and longitude, is that latitude and longitude are apparently nowhere near as useful for prediction as these other variables are. That's just the way it goes, I guess. It's always nice when a new method does better, but there's no guarantee that's going to happen; we need a special structure for it to really be useful. Let's stop here with R, go back to the slides, and discuss how the cp (complexity) parameter works. Then we'll apply cross-validation to our tree, and we'll see if maybe we can improve our results.
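As a preview of what "applying cross-validation to our tree" typically looks like in R, here is one common approach using the caret package to choose the cp parameter; the grid of cp values and the object names are illustrative assumptions, not the lecture's exact code.

    library(caret)
    library(e1071)

    # 10-fold cross-validation over a grid of cp values
    tr.control <- trainControl(method = "cv", number = 10)
    cp.grid <- expand.grid(cp = seq(0.002, 0.1, by = 0.002))  # illustrative grid
    tr <- train(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM +
                  AGE + DIS + RAD + TAX + PTRATIO,
                data = train, method = "rpart",
                trControl = tr.control, tuneGrid = cp.grid)

    best.tree <- tr$finalModel  # the tree refit with the best cp found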