In real estate, there is a famous saying that the most important thing is location, location, location. In this recitation, we will be looking at regression trees and applying them to data on house prices and locations.

Boston is the capital of the state of Massachusetts, USA. It was first settled in 1630, and the greater Boston area is home to about 5 million people, featuring some of the highest population densities in America.

Here is a shot of Boston from above. In the middle of the picture, we have the Charles River. I'm talking to you from my office at MIT, which is here. MIT lies in the city of Cambridge, north of the river, and south of the river is the city of Boston itself. In this recitation, when we talk about Boston, we mean the greater Boston area. If we look at the housing in Boston itself, we can see that it is very dense, but over the greater Boston area the nature of the housing varies widely.

The data comes from the paper "Hedonic Housing Prices and the Demand for Clean Air," which has been cited more than 1,000 times. It was written in the late 1970s by David Harrison of Harvard and Daniel Rubinfeld of the University of Michigan, and it studies the relationship between house prices and clean air. The data set is widely used to evaluate algorithms of the kind we discuss in this class.

Now, in the lecture, we mostly discussed classification trees, where the output is a factor or a category. Trees can also be used for regression tasks: the output at each leaf of the tree is then no longer a category, but a number. Just like classification trees, regression trees can capture nonlinearities that linear regression can't.

So what does that mean? Well, with classification trees we report the average outcome at each leaf of our tree. For example, if the outcome is true 15 times and false 5 times, the value at that leaf of the tree would be 15/(15+5) = 0.75, and if we use the default threshold of 0.5, we would say the value at this leaf is true. With regression trees, we now have continuous values, so instead we report the average of the values at that leaf. Suppose we had the values 3, 4, and 5 at one of the leaves of our tree. We just take the average of these numbers, which is 4, and that is what we report.
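To make the two leaf rules concrete, here is a minimal R sketch; the vectors simply restate the worked examples above and are not drawn from any real data set.

    # Classification leaf: 15 true outcomes and 5 false outcomes
    leaf_outcomes <- c(rep(TRUE, 15), rep(FALSE, 5))
    p_true <- mean(leaf_outcomes)   # 15/(15+5) = 0.75
    p_true >= 0.5                   # TRUE under the default 0.5 threshold

    # Regression leaf: continuous values 3, 4, and 5
    leaf_values <- c(3, 4, 5)
    mean(leaf_values)               # 4, the number the leaf reports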
That might be a bit confusing, so let's look at a picture. Here is some fake data that I made up in R. We see x on the x-axis and y on the y-axis; y is the variable we are trying to predict using x. If we fit a linear regression to this data set, we obtain the following line. As you can see, linear regression does not do very well on this data set. However, we can notice that the data lies in three different groups. If we draw these lines here, we see that x is either less than 10, between 10 and 20, or greater than 20, and there is very different behavior in each group.

Regression trees can fit that kind of thing exactly. The splits would be: x less than or equal to 10, take the average of those values; x between 10 and 20, take the average of those values; x between 20 and 30, take the average of those values. We see that regression trees can fit some kinds of data very well that linear regression completely fails on. Of course, in reality nothing is ever so nice and simple, but it gives us some idea of why we might be interested in regression trees.

So in this recitation, we will explore the data set with the aid of trees, compare linear regression with regression trees, discuss what the cp parameter that we brought up when we did cross-validation in the lecture means, and apply cross-validation to regression trees.
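As a rough sketch of that picture, one might generate similar three-group data in R and fit both models; the seed, sample size, and group means here are made up for illustration and are not the recitation's actual code.

    library(rpart)

    # Fake three-group data, similar in spirit to the plot
    set.seed(1)
    x <- runif(90, 0, 30)
    y <- ifelse(x <= 10, 2, ifelse(x <= 20, 8, 4)) + rnorm(90, sd = 0.5)
    fake <- data.frame(x, y)

    # Linear regression: a single straight line through all three groups
    linreg <- lm(y ~ x, data = fake)

    # Regression tree: rpart switches to method = "anova" automatically
    # because y is numeric; cp defaults to 0.01, a parameter we return to
    tree <- rpart(y ~ x, data = fake)

    # The tree's predictions recover the three flat segments that the
    # straight line cannot
    plot(fake$x, fake$y)
    abline(linreg, col = "red")
    points(fake$x, predict(tree, fake), col = "blue", pch = 20)

Each blue point is just the mean of the y values in its group, which is exactly the leaf rule described earlier.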