Now that we've trained a model, we need to evaluate it on the test set. So let's build an object called pred that has the predicted probabilities for each class from our CART model. We'll use predict of emailCART, our CART model, passing it newdata = test, to get the test-set predicted probabilities.

To recall the structure of pred, we can look at the first 10 rows with pred[1:10,]. The 1:10 gives the rows we want; we want all the columns, so we just leave a comma and nothing else afterward. The left column here is the predicted probability of the document being non-responsive, and the right column is the predicted probability of the document being responsive. They sum to 1.

In our case, we want to extract the predicted probability of the document being responsive, so we're looking for the rightmost column. We'll create an object called pred.prob and select the rightmost, or second, column.

So pred.prob now contains our test-set predicted probabilities, and we're interested in the accuracy of our model on the test set. For this computation, we'll use a cutoff of 0.5. So we can just table the true outcome, which is test$responsive, against the predicted outcome, which is pred.prob >= 0.5.

What we can see here is that in 195 cases, we predicted FALSE, in the left column, and the true outcome was 0, non-responsive, so we were correct. In another 25, we correctly identified a responsive document. In 20 cases, we identified a document as responsive, but it was actually non-responsive. And in 17, the opposite happened: we identified a document as non-responsive, but it actually was responsive.

So our accuracy is 195 + 25, our correct results, divided by the total number of elements in the testing set, 195 + 25 + 17 + 20. That gives us an accuracy on the test set of 85.6%. And now we want to compare ourselves to the accuracy of the baseline model.
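For reference, here is a minimal R sketch of the commands walked through above, assuming emailCART and the test data frame exist from the earlier videos and that the confusion-matrix counts come out as in the table described:

    # Test-set predicted probabilities for each class from the CART model
    pred <- predict(emailCART, newdata = test)

    # First 10 rows: column 1 is P(non-responsive), column 2 is P(responsive)
    pred[1:10, ]

    # Keep only the probability of being responsive (the second column)
    pred.prob <- pred[, 2]

    # Confusion matrix at a 0.5 cutoff: true outcome vs. predicted outcome
    table(test$responsive, pred.prob >= 0.5)

    # Accuracy: correct predictions over all test-set observations
    (195 + 25) / (195 + 25 + 20 + 17)   # about 0.856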
As we've already established, the baseline model is always going to predict that the document is non-responsive. So if we table test$responsive, we see that it's going to be correct in 215 of the cases. The accuracy is then 215 divided by the total number of test-set observations, which gives 83.7%.

So we see just a small improvement in accuracy using the CART model, which, as we know, is a common situation in unbalanced data sets. However, as in most document retrieval applications, there are uneven costs for the different types of errors here. Typically, a human will still have to manually review all of the predicted responsive documents to make sure they are actually responsive. Therefore, if we have a false positive, in which a non-responsive document is labeled as responsive, the mistake translates to a bit of additional work in the manual review process but no further harm, since the manual review will remove this erroneous result.

On the other hand, if we have a false negative, in which a responsive document is labeled as non-responsive by our model, we will miss the document entirely in our predictive coding process. Therefore, we're going to assign a higher cost to false negatives than to false positives, which makes this a good time to look at other cutoffs on our ROC curve.
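A short sketch of the baseline comparison and the ROC step that follows; the object names predROCR and perfROCR are illustrative choices, and the ROCR package is assumed to be installed:

    # Baseline: always predict non-responsive
    table(test$responsive)   # 215 non-responsive, 42 responsive
    215 / (215 + 42)         # about 0.837

    # ROC curve to examine other cutoffs (illustrative, using the ROCR package)
    library(ROCR)
    predROCR <- prediction(pred.prob, test$responsive)
    perfROCR <- performance(predROCR, "tpr", "fpr")
    plot(perfROCR, colorize = TRUE)   # color indicates the cutoff along the curve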