1 00:00:05,040 --> 00:00:07,920 Let's see how regression trees do. 2 00:00:07,920 --> 00:00:11,450 We'll first load the rpart library 3 00:00:11,450 --> 00:00:16,550 and also load the rpart plotting library. 4 00:00:16,550 --> 00:00:18,800 We build a regression tree in the same way 5 00:00:18,800 --> 00:00:22,400 we would build a classification tree, using the rpart command. 6 00:00:25,390 --> 00:00:29,570 We predict MEDV as a function of latitude and longitude, 7 00:00:29,570 --> 00:00:30,620 using the Boston dataset. 8 00:00:33,950 --> 00:00:37,870 If we now plot the tree using the prp command, which 9 00:00:37,870 --> 00:00:43,620 is to find an rpart.plot, we can see it makes a lot of splits 10 00:00:43,620 --> 00:00:45,850 and is a little bit hard to interpret. 11 00:00:45,850 --> 00:00:49,340 But the important thing is look at the leaves. 12 00:00:49,340 --> 00:00:51,000 In a classification tree, the leaves 13 00:00:51,000 --> 00:00:54,680 would be the classification we assign 14 00:00:54,680 --> 00:00:58,000 that these splits would apply to. 15 00:00:58,000 --> 00:01:02,930 But in regression trees, we instead predict a number. 16 00:01:02,930 --> 00:01:06,230 That number is the average of the median house 17 00:01:06,230 --> 00:01:11,080 prices in that bucket or leaf. 18 00:01:11,080 --> 00:01:14,160 So let's see what that means in practice. 19 00:01:14,160 --> 00:01:18,460 So we'll plot again the latitude-- the points. 20 00:01:22,910 --> 00:01:29,590 And we'll again plot the points with above median prices. 21 00:01:29,590 --> 00:01:33,180 I just scrolled up from my command history to do that. 22 00:01:33,180 --> 00:01:35,650 Now we want to predict what the tree thinks 23 00:01:35,650 --> 00:01:38,310 is above median, just like we did with linear regression. 24 00:01:38,310 --> 00:01:42,030 So we'll say the fitted values we 25 00:01:42,030 --> 00:01:45,460 can get from using the predict command on the tree we just 26 00:01:45,460 --> 00:01:45,960 built. 27 00:01:49,070 --> 00:01:51,400 And we can do another points command, 28 00:01:51,400 --> 00:01:53,710 just like we did before. 29 00:01:53,710 --> 00:02:09,830 The fitted values are greater than 21.2 The color is blue. 30 00:02:09,830 --> 00:02:14,090 And the character is a dollar sign. 31 00:02:17,210 --> 00:02:19,730 Now we see that we've done a much better job 32 00:02:19,730 --> 00:02:21,930 than linear regression was able to do. 33 00:02:21,930 --> 00:02:25,840 We've correctly left the low value area in Boston 34 00:02:25,840 --> 00:02:29,110 and below out, and we've correctly 35 00:02:29,110 --> 00:02:30,780 managed to classify some of those points 36 00:02:30,780 --> 00:02:33,600 in the bottom right and top right. 37 00:02:33,600 --> 00:02:36,270 We're still making mistakes, but we're 38 00:02:36,270 --> 00:02:38,740 able to make a nonlinear prediction 39 00:02:38,740 --> 00:02:41,600 on latitude and longitude. 40 00:02:41,600 --> 00:02:45,840 So that's interesting, but the tree was very complicated. 41 00:02:45,840 --> 00:02:49,420 So maybe it's drastically overfitting. 42 00:02:49,420 --> 00:02:53,230 Can we get most of this effect with a much simpler tree? 43 00:02:53,230 --> 00:02:53,940 We can. 44 00:02:53,940 --> 00:02:56,170 We would just change the minbucket size. 45 00:02:56,170 --> 00:03:01,930 So let's build a new tree using the rpart command again: 46 00:03:01,930 --> 00:03:08,800 MEDV as a function of LAT and LON, the data=boston. 47 00:03:08,800 --> 00:03:11,840 But this time we'll say the minbucket size must be 50. 48 00:03:14,620 --> 00:03:20,270 We'll use the other way of plotting trees, plot, 49 00:03:20,270 --> 00:03:22,110 and we'll add text to the text command. 50 00:03:25,820 --> 00:03:27,710 And we see we have far fewer splits, 51 00:03:27,710 --> 00:03:30,040 and it's far more interpretable. 52 00:03:30,040 --> 00:03:32,590 The first split says if the longitude 53 00:03:32,590 --> 00:03:34,970 is greater than or equal to negative 71.07-- 54 00:03:34,970 --> 00:03:36,890 so if you're on the right side of the picture. 55 00:03:39,510 --> 00:03:41,510 So the left-hand branch is on the left-hand side 56 00:03:41,510 --> 00:03:42,960 of the picture and the right-hand-- 57 00:03:42,960 --> 00:03:44,810 So the left-hand side of the tree 58 00:03:44,810 --> 00:03:48,079 corresponds to the right-hand side of the map. 59 00:03:48,079 --> 00:03:49,829 And the right side of the tree corresponds 60 00:03:49,829 --> 00:03:51,770 to the left side of the map. 61 00:03:51,770 --> 00:03:54,340 That's a little bit of a mouthful. 62 00:03:54,340 --> 00:03:55,990 Let's see what it means visually. 63 00:03:55,990 --> 00:04:01,820 So we'll remember these values, and we'll 64 00:04:01,820 --> 00:04:07,460 plot the longitude and latitude again. 65 00:04:07,460 --> 00:04:08,510 So here's our map. 66 00:04:08,510 --> 00:04:09,120 OK. 67 00:04:09,120 --> 00:04:12,680 So the first split was on longitude, 68 00:04:12,680 --> 00:04:16,000 and it was negative 71.07. 69 00:04:16,000 --> 00:04:19,570 So there's a very handy command, "abline," 70 00:04:19,570 --> 00:04:23,670 which can put horizontal or vertical lines easily. 71 00:04:23,670 --> 00:04:27,070 So we're going to put a vertical line, so v, 72 00:04:27,070 --> 00:04:32,070 and we wanted to plot it at negative 71.07. 73 00:04:32,070 --> 00:04:32,850 OK. 74 00:04:32,850 --> 00:04:34,760 So that's that first split from the tree. 75 00:04:34,760 --> 00:04:37,159 It corresponds to being on either the left or right-hand 76 00:04:37,159 --> 00:04:40,909 side of this tree. 77 00:04:40,909 --> 00:04:48,240 We'll plot the-- what we want to do is, we'll focus on one area. 78 00:04:48,240 --> 00:04:53,040 We'll focus on the lowest price prediction, which 79 00:04:53,040 --> 00:04:55,050 is in the bottom left corner of the tree, 80 00:04:55,050 --> 00:04:57,140 right down the bottom left after all those splits. 81 00:04:57,140 --> 00:04:58,550 So that's where we want to get to. 82 00:04:58,550 --> 00:05:01,640 So let's plot again the points. 83 00:05:01,640 --> 00:05:03,400 Plot a vertical line. 84 00:05:03,400 --> 00:05:06,340 The next split down towards that bottom left corner 85 00:05:06,340 --> 00:05:12,900 was a horizontal line at 42.21. 86 00:05:12,900 --> 00:05:14,970 So I put that in. 87 00:05:14,970 --> 00:05:15,770 That's interesting. 88 00:05:15,770 --> 00:05:18,000 So that line corresponds pretty much to where 89 00:05:18,000 --> 00:05:20,940 the Charles River was from before. 90 00:05:20,940 --> 00:05:23,600 The final split you need to get to that bottom left corner I 91 00:05:23,600 --> 00:05:27,840 was pointing out is 42.17. 92 00:05:27,840 --> 00:05:31,260 It was above this line. 93 00:05:31,260 --> 00:05:32,810 And now that's interesting. 94 00:05:32,810 --> 00:05:35,960 If we look at the right side of the middle of the three 95 00:05:35,960 --> 00:05:37,970 rectangles on the right side, that 96 00:05:37,970 --> 00:05:39,340 is the bucket we were predicting. 97 00:05:39,340 --> 00:05:42,820 And it corresponds to that rectangle, those areas. 98 00:05:42,820 --> 00:05:47,890 That's the South Boston low price area we saw before. 99 00:05:47,890 --> 00:05:52,360 So maybe we can make that more clear by plotting, now, 100 00:05:52,360 --> 00:05:53,900 the high value prices. 101 00:05:53,900 --> 00:05:59,210 So let's go back up to where we plotted all the red dots 102 00:05:59,210 --> 00:06:00,700 and overlay it. 103 00:06:00,700 --> 00:06:02,470 So this makes it even more clear. 104 00:06:02,470 --> 00:06:06,510 We've correctly shown how the regression tree carves out 105 00:06:06,510 --> 00:06:09,120 that rectangle in the bottom of Boston 106 00:06:09,120 --> 00:06:13,460 and says that is a low value area. 107 00:06:13,460 --> 00:06:15,640 So that's actually very interesting. 108 00:06:15,640 --> 00:06:17,690 It's shown us something that regression trees can 109 00:06:17,690 --> 00:06:19,890 do that we would never expect linear regression to be 110 00:06:19,890 --> 00:06:20,880 able to do. 111 00:06:20,880 --> 00:06:23,180 So the question we're going to answer in the next video 112 00:06:23,180 --> 00:06:26,600 is given that regression trees can do these fancy things 113 00:06:26,600 --> 00:06:28,020 with latitude and longitude, is it 114 00:06:28,020 --> 00:06:29,890 actually going to help us to be able to build 115 00:06:29,890 --> 00:06:32,220 predictive models, predicting house prices? 116 00:06:32,220 --> 00:06:34,380 Well, we'll have to see.