At long last, we're ready to split our data into a training and testing set, and to actually build a model. So we'll start by loading the caTools package so that we can split our data: library(caTools). And then, as usual, we're going to set our random seed so that everybody gets the same results. We'll use set.seed and pick the number 144. Again, the number itself isn't particularly important; the important thing is that we all use the same one. So, as usual, we obtain the split variable, which we'll call spl, using sample.split. The outcome variable we pass is labeledTerms$responsive, and we want a 70/30 split, so we pass 0.7 here. Then train, the training data frame, can be obtained by taking the subset of labeledTerms where spl is TRUE, and test is the subset where spl is FALSE.

So now we're ready to build the model. We'll build a simple CART model using the default parameters, but a random forest would be another good choice from our toolset. We'll start by loading the packages for the CART model: library(rpart), and also the rpart.plot package so that we can plot the outcome. Then we'll create a model called emailCART using the rpart function. We're predicting responsive, and we're predicting it using all of the other variables, that is, all the term frequencies that are included. The "tilde period" (~ .) shorthand is important here, because there are 788 terms, way too many to actually type out. The data we're using to train the model is just our training data frame, train, and the method is "class", since we have a classification problem here. And once we've trained the CART model, we can plot it out using prp. (The full sequence of commands is recapped at the end of this section.)

There we go. So we can see that at the very top of the tree is the word "california". If "california" appears at least twice in an email, we take the right branch over here and predict that the document is responsive.
It's somewhat unsurprising that california shows up, because we know that Enron had a heavy involvement in the California energy markets. Further down the tree, we see a number of other terms that we could plausibly expect to be related to energy bids and energy scheduling, like system, demand, bid, and gas. Down here at the bottom is jeff, which is perhaps a reference to Enron's CEO, Jeff Skilling, who ended up actually being jailed for his involvement in the fraud at the company.
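For reference, here is a sketch of the R commands described in this section, assuming labeledTerms is the pre-processed data frame built earlier in the lecture, with a 0/1 responsive column and one column per term:

  # Split labeledTerms into training and testing sets (70/30),
  # keeping the proportion of responsive documents balanced
  library(caTools)
  set.seed(144)
  spl = sample.split(labeledTerms$responsive, 0.7)
  train = subset(labeledTerms, spl == TRUE)
  test = subset(labeledTerms, spl == FALSE)

  # Build a CART model with the default parameters and plot the tree
  library(rpart)
  library(rpart.plot)
  emailCART = rpart(responsive ~ ., data = train, method = "class")
  prp(emailCART)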