In the previous video, we got a feel for how regression trees can do things linear regression cannot. But what really matters at the end of the day is whether they can predict better than linear regression. So let's try that right now. We're going to try to predict house prices using all the variables we have available to us.

First we'll load the caTools library, which will help us split the data. We'll set the seed so our results are reproducible. Then we'll say our split will be on the Boston house prices, with 70% going to training and 30% to testing. Our training data is the subset of the Boston data where the split is TRUE, and the testing data is the subset where the split is FALSE.

OK, first of all, let's make a linear regression model, nice and easy. It's a linear model, and the variables are latitude, longitude, crime, residential zoning, industry, whether it's on the Charles River or not, air pollution, rooms, age, distance to employment centers, accessibility to highways, the tax rate, and the pupil-teacher ratio. The data is the training data.

OK, let's see what our linear regression looks like. We see that latitude and longitude are not significant in the linear regression, which is perhaps not surprising, because linear regression didn't seem to be able to take advantage of them. Crime is very important. Residential zoning might be important. Whether it's on the Charles River or not is a useful factor. Air pollution does seem to matter, and the coefficient is negative, as you'd expect. The average number of rooms is significant. Age is somewhat important. Distance to centers of employment (DIS) is very important. Accessibility to highways and the tax rate are somewhat important, and the pupil-teacher ratio is also very significant. Some of these might be correlated, so we can't put too much stock in interpreting them directly, but it's interesting. The adjusted R-squared is 0.65, which is pretty good.
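For reference, here is a minimal sketch in R of the steps just described. The data frame name (boston), the outcome column (MEDV), the other column names, and the seed value are assumptions based on a typical version of the Boston housing data used for this exercise; they are not confirmed by the video.

    library(caTools)

    set.seed(123)  # illustrative seed; any fixed value makes the split reproducible
    split <- sample.split(boston$MEDV, SplitRatio = 0.7)  # 70% training, 30% testing
    train <- subset(boston, split == TRUE)
    test  <- subset(boston, split == FALSE)

    # Linear regression on all the variables listed above
    linreg <- lm(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM +
                   AGE + DIS + RAD + TAX + PTRATIO, data = train)
    summary(linreg)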
Because it's harder to compare out-of-sample accuracy for regression than for classification, we need to think about how we're going to do that. With classification, we just say this method got X% correct and that method got Y% correct. Since we're predicting a continuous variable here, let's calculate the sum of squared errors, which we discussed in the original linear regression video. So the linear regression's predictions are predict(linreg, newdata=test), and the linear regression's sum of squared errors is simply the sum of the squared differences between the predicted values and the actual values. Let's see what that number is: 3,037.008.

OK, so what we're interested to see now is: can we beat this using regression trees? Let's build a tree, using the rpart command again. Actually, to save myself from typing it all out again, I'm going to go back to the regression command and just change "lm" to "rpart" and change "linreg" to "tree"-- much easier. All right. So we've built our tree; let's have a look at it using the prp command from rpart.plot.

And here we go. So again, latitude and longitude aren't really important as far as the tree is concerned. The number of rooms is the most important split. Pollution appears in there twice, so it's in some sense nonlinear in the amount of pollution: if it's greater than a certain amount or less than a certain amount, it does different things. Crime is in there, age is in there. Rooms actually appears three times-- sorry, that's interesting. So it's very nonlinear in the number of rooms. Things that were important for the linear regression but don't appear in our tree include the pupil-teacher ratio, and the DIS variable doesn't appear in our regression tree at all either. So they're definitely doing different things-- but how do they compare?

So we'll predict, again, from the tree. "tree.pred" is the prediction of the tree on the new data, and the tree's sum of squared errors is the sum of the squared differences between the tree's predictions and what they really should be.
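Here is a sketch of the out-of-sample comparison described above, under the same assumed column names; linreg.sse and tree.sse are illustrative names for the two error sums.

    # Out-of-sample sum of squared errors for the linear model
    linreg.pred <- predict(linreg, newdata = test)
    linreg.sse  <- sum((linreg.pred - test$MEDV)^2)

    # Same formula, but with rpart instead of lm, then plot the tree
    library(rpart)
    library(rpart.plot)
    tree <- rpart(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM +
                    AGE + DIS + RAD + TAX + PTRATIO, data = train)
    prp(tree)

    # Out-of-sample sum of squared errors for the tree
    tree.pred <- predict(tree, newdata = test)
    tree.sse  <- sum((tree.pred - test$MEDV)^2)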
And then the moment of truth: 4,328. So, simply put, regression trees are not as good as linear regression for this problem. What this says to us, given what we saw with latitude and longitude, is that latitude and longitude are apparently nowhere near as useful for prediction as these other variables are. That's just the way it goes, I guess. It's always nice when a new method does better, but there's no guarantee that's going to happen; we need a special structure for it to really be useful. Let's stop here with R, go back to the slides, and discuss how the cp (complexity) parameter works. Then we'll apply cross-validation to our tree, and we'll see if maybe we can improve our results.
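As a preview of what "applying cross-validation to our tree" typically looks like in R, here is one common approach using the caret package to choose the cp parameter; the grid of cp values and the object names are illustrative assumptions, not the lecture's exact code.

    library(caret)
    library(e1071)

    # 10-fold cross-validation over a grid of cp values
    tr.control <- trainControl(method = "cv", number = 10)
    cp.grid <- expand.grid(cp = seq(0.002, 0.1, by = 0.002))  # illustrative grid
    tr <- train(MEDV ~ LAT + LON + CRIM + ZN + INDUS + CHAS + NOX + RM +
                  AGE + DIS + RAD + TAX + PTRATIO,
                data = train, method = "rpart",
                trControl = tr.control, tuneGrid = cp.grid)

    best.tree <- tr$finalModel  # the tree refit with the best cp found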