In real estate, there is a famous saying that the most important thing is location, location, location. In this recitation, we will be looking at regression trees and applying them to data on house prices and locations.

Boston is the capital of the state of Massachusetts, USA. It was first settled in 1630, and the greater Boston area is home to about 5 million people, featuring some of the highest population densities in America.

Here is a shot of Boston from above. In the middle of the picture, we have the Charles River. I'm talking to you from my office at MIT, which is here. MIT lies in the city of Cambridge, north of the river, and south of the river is the city of Boston itself. In this recitation, when we talk about Boston, we mean the greater Boston area. If we look at the housing in Boston itself, we can see that it is very dense, but over the greater Boston area the nature of the housing varies widely.

The data comes from the paper "Hedonic Housing Prices and the Demand for Clean Air," which has been cited more than 1,000 times. It was written in the late 1970s by David Harrison of Harvard and Daniel Rubinfeld of the University of Michigan, and it studies the relationship between house prices and clean air. The data set is widely used to evaluate algorithms of the kind we discuss in this class.

Now, in the lecture, we mostly discussed classification trees, where the output is a factor or a category. Trees can also be used for regression tasks: the output at each leaf of the tree is then no longer a category, but a number. Just like classification trees, regression trees can capture nonlinearities that linear regression can't.

So what does that mean? Well, with classification trees we report the average outcome at each leaf of our tree. For example, if the outcome is true 15 times and false 5 times, the value at that leaf of the tree would be 15/(15+5) = 0.75, and if we use the default threshold of 0.5, we would say the value at this leaf is true. With regression trees, we now have continuous values, so instead we report the average of the values at that leaf. Suppose we had the values 3, 4, and 5 at one of the leaves of our tree. We just take the average of these numbers, which is 4, and that is what we report.
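To make the two leaf rules concrete, here is a minimal R sketch; the vectors simply restate the worked examples above and are not drawn from any real data set.

    # Classification leaf: 15 true outcomes and 5 false outcomes
    leaf_outcomes <- c(rep(TRUE, 15), rep(FALSE, 5))
    p_true <- mean(leaf_outcomes)   # 15/(15+5) = 0.75
    p_true >= 0.5                   # TRUE under the default 0.5 threshold

    # Regression leaf: continuous values 3, 4, and 5
    leaf_values <- c(3, 4, 5)
    mean(leaf_values)               # 4, the number the leaf reports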
That might be a bit confusing, so let's look at a picture. Here is some fake data that I made up in R. We see x on the x-axis and y on the y-axis; y is the variable we are trying to predict using x. If we fit a linear regression to this data set, we obtain the following line. As you can see, linear regression does not do very well on this data set. However, we can notice that the data lies in three different groups. If we draw these lines here, we see that x is either less than 10, between 10 and 20, or greater than 20, and there is very different behavior in each group.

Regression trees can fit that kind of thing exactly. The splits would be: x less than or equal to 10, take the average of those values; x between 10 and 20, take the average of those values; x between 20 and 30, take the average of those values. We see that regression trees can fit some kinds of data very well that linear regression completely fails on. Of course, in reality nothing is ever so nice and simple, but it gives us some idea of why we might be interested in regression trees.

So in this recitation, we will explore the data set with the aid of trees, compare linear regression with regression trees, discuss what the cp parameter that we brought up when we did cross-validation in the lecture means, and apply cross-validation to regression trees.
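As a rough sketch of that picture, one might generate similar three-group data in R and fit both models; the seed, sample size, and group means here are made up for illustration and are not the recitation's actual code.

    library(rpart)

    # Fake three-group data, similar in spirit to the plot
    set.seed(1)
    x <- runif(90, 0, 30)
    y <- ifelse(x <= 10, 2, ifelse(x <= 20, 8, 4)) + rnorm(90, sd = 0.5)
    fake <- data.frame(x, y)

    # Linear regression: a single straight line through all three groups
    linreg <- lm(y ~ x, data = fake)

    # Regression tree: rpart switches to method = "anova" automatically
    # because y is numeric; cp defaults to 0.01, a parameter we return to
    tree <- rpart(y ~ x, data = fake)

    # The tree's predictions recover the three flat segments that the
    # straight line cannot
    plot(fake$x, fake$y)
    abline(linreg, col = "red")
    points(fake$x, predict(tree, fake), col = "blue", pch = 20)

Each blue point is just the mean of the y values in its group, which is exactly the leaf rule described earlier.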