1
00:00:05,150 --> 00:00:09,700
OK, so now we know what CP
is, we can go ahead and build

2
00:00:09,700 --> 00:00:13,100
one last tree using
cross validation.

3
00:00:13,100 --> 00:00:15,190
So we need to make sure
first we have the required

4
00:00:15,190 --> 00:00:18,710
libraries installed and in use.

5
00:00:18,710 --> 00:00:22,840
So the first package
is the "caret" package.

6
00:00:25,720 --> 00:00:32,189
And the second we need
is the "e1071" package.

7
00:00:32,189 --> 00:00:33,730
OK.

8
00:00:33,730 --> 00:00:37,300
So we need to tell the
caret package how exactly we

9
00:00:37,300 --> 00:00:39,730
want to do our parameter tuning.

10
00:00:39,730 --> 00:00:42,290
There are actually quite
a few ways of doing it.

11
00:00:42,290 --> 00:00:44,520
But we're going to restrict
ourselves in this course

12
00:00:44,520 --> 00:00:46,720
to just 10-fold
cross validation,

13
00:00:46,720 --> 00:00:48,880
as was explained in the lecture.

14
00:00:48,880 --> 00:00:50,960
So let's say
tr.control=trainControl(method="cv",

15
00:00:50,960 --> 00:00:51,460
number=10).

16
00:01:05,260 --> 00:01:07,470
OK, that was easy enough.

17
00:01:07,470 --> 00:01:11,280
Now we need to tell caret
which range of CP parameters

18
00:01:11,280 --> 00:01:12,770
to try out.

19
00:01:12,770 --> 00:01:16,890
Now remember that CP
varies between 0 and 1.

20
00:01:16,890 --> 00:01:18,600
It's likely for
any given problem

21
00:01:18,600 --> 00:01:21,500
that we don't need to
explore the whole range.

22
00:01:21,500 --> 00:01:23,530
I happen to know,
by the fact that I

23
00:01:23,530 --> 00:01:27,090
made this presentation ahead
of time, that the value of CP

24
00:01:27,090 --> 00:01:29,700
we're going to
pick is very small.

25
00:01:29,700 --> 00:01:36,160
So what I want to do is make
a grid of CP values to try.

26
00:01:36,160 --> 00:01:53,400
And it will be over
the range of 0 to 0.01.

27
00:01:53,400 --> 00:01:57,170
OK, so how does what I
wrote feed into that?

28
00:01:57,170 --> 00:02:04,240
Well, 1 times 0.001
is obviously 0.001.

29
00:02:04,240 --> 00:02:10,810
And 10 times 0.001
is obviously 0.01.

30
00:02:10,810 --> 00:02:15,300
0 to 5, or 0 to 10,
means the numbers

31
00:02:15,300 --> 00:02:19,140
0, 1, 2, 3, 4 5, 6, 7, 8, 9, 10.

32
00:02:19,140 --> 00:02:30,680
So 0 to 10 times 0.001 is
those numbers scaled by 0.001.

33
00:02:30,680 --> 00:02:35,650
So those are the values
of CP that caret will try.

34
00:02:35,650 --> 00:02:40,150
So let's store the results of
the cross validation fitting

35
00:02:40,150 --> 00:02:42,370
in a variable called tr.

36
00:02:42,370 --> 00:02:45,530
And we'll use the
train function.

37
00:02:45,530 --> 00:02:59,120
Predicting MEDV is the LAT,
LON, CRIM, zoning, industry,

38
00:02:59,120 --> 00:03:09,610
Charles River, pollution,
rooms, age, distance,

39
00:03:09,610 --> 00:03:16,850
distance from highways, tax,
and pupil-teacher ratio.

40
00:03:16,850 --> 00:03:22,840
OK, we're using
the train data set.

41
00:03:22,840 --> 00:03:29,270
We're using trees
(rpart), our train control

42
00:03:29,270 --> 00:03:34,460
is what we just made
before, and our tuning grid

43
00:03:34,460 --> 00:03:40,540
is the other thing we just
made, which we called CP grid.

44
00:03:40,540 --> 00:03:41,700
And it whirrs away.

45
00:03:41,700 --> 00:03:44,370
And what its doing there is it's
trying all the different values

46
00:03:44,370 --> 00:03:47,060
of CP that we asked it to.

47
00:03:47,060 --> 00:03:51,240
So we can see what it's
done but typing tr.

48
00:03:51,240 --> 00:03:55,600
You can see it tried 11
different values of CP.

49
00:03:55,600 --> 00:04:01,800
And it decided that CP equals
0.001 was the best because it

50
00:04:01,800 --> 00:04:07,380
had the best RMSE--
Root Mean Square Error.

51
00:04:07,380 --> 00:04:11,970
And it was 5.03 for 0.001.

52
00:04:11,970 --> 00:04:17,740
You see it's pretty insensitive
to a particular value of CP.

53
00:04:17,740 --> 00:04:20,690
So it's maybe not too important.

54
00:04:20,690 --> 00:04:23,260
It's interesting though
that the numbers are so low.

55
00:04:23,260 --> 00:04:26,420
I tried it for a much
larger range of CP values,

56
00:04:26,420 --> 00:04:31,930
and the best solutions are
always very close to 0.

57
00:04:31,930 --> 00:04:35,659
So it wants us to build
a very detail-rich tree.

58
00:04:35,659 --> 00:04:39,659
So let's see what the tree that
that value of CP corresponds to

59
00:04:39,659 --> 00:04:40,159
is.

60
00:04:40,159 --> 00:04:42,430
So we can get that from going
best.tree=tr$finalModel.

61
00:04:56,100 --> 00:04:58,620
And we can plot that tree.

62
00:04:58,620 --> 00:05:04,160
So that's the model that
corresponds to 0.001.

63
00:05:04,160 --> 00:05:07,310
Plot it.

64
00:05:07,310 --> 00:05:11,020
Wow, OK, so that's a
very detailed tree.

65
00:05:11,020 --> 00:05:13,320
You see that it looks pretty
much like the same tree we

66
00:05:13,320 --> 00:05:15,300
had before, initially.

67
00:05:15,300 --> 00:05:17,880
But then it starts to get much
more detailed at the bottom.

68
00:05:17,880 --> 00:05:19,980
And in fact if you
can see close enough,

69
00:05:19,980 --> 00:05:21,980
there's actually latitude
and longitude in there

70
00:05:21,980 --> 00:05:24,140
right down at the
bottom as well.

71
00:05:24,140 --> 00:05:26,650
So maybe the tree
is finally going

72
00:05:26,650 --> 00:05:29,460
to be a linear regression model.

73
00:05:29,460 --> 00:05:31,990
Well, we can test that out
same way as we did before.

74
00:05:31,990 --> 00:05:34,030
best.tree.pred=predict(best.tree,
newdata=test).

75
00:05:43,070 --> 00:05:48,140
best.tree.sse, the
Sum of Squared Errors,

76
00:05:48,140 --> 00:05:54,320
is the sum of the best
tree's predictions

77
00:05:54,320 --> 00:06:01,160
minus the true values squared.

78
00:06:01,160 --> 00:06:07,410
That number is 3,675.

79
00:06:07,410 --> 00:06:10,150
So if you can remember
from the last video,

80
00:06:10,150 --> 00:06:15,890
the tree from the previous video
actually only got something

81
00:06:15,890 --> 00:06:16,530
in the 4,000s.

82
00:06:16,530 --> 00:06:17,370
So not very good.

83
00:06:17,370 --> 00:06:18,580
So we have actually improved.

84
00:06:18,580 --> 00:06:20,940
This tree is better
on the testing set

85
00:06:20,940 --> 00:06:23,390
than the original
tree we created.

86
00:06:23,390 --> 00:06:26,280
But, you may also remember
that a linear regression

87
00:06:26,280 --> 00:06:29,510
model did actually
better than that still.

88
00:06:29,510 --> 00:06:34,720
The linear regression SSE
was more around 3,030.

89
00:06:34,720 --> 00:06:39,390
So the best tree is not as good
as a linear regression model.

90
00:06:39,390 --> 00:06:43,930
But cross validation
did improve performance.

91
00:06:43,930 --> 00:06:46,980
So the takeaway is,
I guess, that trees

92
00:06:46,980 --> 00:06:50,040
aren't always the best method
you have available to you.

93
00:06:50,040 --> 00:06:53,960
But you should always
try cross validating

94
00:06:53,960 --> 00:06:57,330
them to get as much performance
out of them as you can.

95
00:06:57,330 --> 00:07:01,000
And that's the end of the
presentation Thank you.