At long last, we're ready to split our data into a training and testing set, and to actually build a model. So we'll start by loading the caTools package so that we can split our data: library(caTools). And then, as usual, we're going to set our random seed so that everybody gets the same results. We'll use set.seed and pick the number 144. Again, the number itself isn't particularly important; the important thing is that we all use the same one. So, as usual, we obtain the split variable, which we'll call spl, using sample.split. The outcome variable we pass is labeledTerms$responsive, and we want a 70/30 split, so we pass 0.7 here. Then train, the training data frame, can be obtained by taking the subset of labeledTerms where spl is TRUE, and test is the subset where spl is FALSE.

So now we're ready to build the model. We'll build a simple CART model using the default parameters, but a random forest would be another good choice from our toolset. We'll start by loading the packages for the CART model: library(rpart), and also the rpart.plot package so that we can plot the outcome. Then we'll create a model called emailCART using the rpart function. We're predicting responsive, and we're predicting it using all of the other variables, that is, all the term frequencies that are included. The "tilde period" (~ .) shorthand is important here, because there are 788 terms, way too many to actually type out. The data we're using to train the model is just our training data frame, train, and the method is "class", since we have a classification problem here. And once we've trained the CART model, we can plot it out using prp. (The full sequence of commands is recapped at the end of this section.)

There we go. So we can see that at the very top of the tree is the word "california". If "california" appears at least twice in an email, we take the right branch over here and predict that the document is responsive.
It's somewhat unsurprising that california shows up, because we know that Enron had a heavy involvement in the California energy markets. Further down the tree, we see a number of other terms that we could plausibly expect to be related to energy bids and energy scheduling, like system, demand, bid, and gas. Down here at the bottom is jeff, which is perhaps a reference to Enron's CEO, Jeff Skilling, who ended up actually being jailed for his involvement in the fraud at the company.
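For reference, here is a sketch of the R commands described in this section, assuming labeledTerms is the pre-processed data frame built earlier in the lecture, with a 0/1 responsive column and one column per term:

  # Split labeledTerms into training and testing sets (70/30),
  # keeping the proportion of responsive documents balanced
  library(caTools)
  set.seed(144)
  spl = sample.split(labeledTerms$responsive, 0.7)
  train = subset(labeledTerms, spl == TRUE)
  test = subset(labeledTerms, spl == FALSE)

  # Build a CART model with the default parameters and plot the tree
  library(rpart)
  library(rpart.plot)
  emailCART = rpart(responsive ~ ., data = train, method = "class")
  prp(emailCART)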