1 00:00:04,500 --> 00:00:06,440 Now let's look at the ROC curve so we 2 00:00:06,440 --> 00:00:08,670 can understand the performance of our model 3 00:00:08,670 --> 00:00:10,360 at different cutoffs. 4 00:00:10,360 --> 00:00:12,990 We'll first need to load the ROCR package 5 00:00:12,990 --> 00:00:13,870 with library(ROCR). 6 00:00:19,110 --> 00:00:22,270 Next, we'll build our ROCR prediction object. 7 00:00:22,270 --> 00:00:23,770 So we'll call this object predROCR = 8 00:00:23,770 --> 00:00:25,400 prediction(pred.prob, test$responsive). 9 00:00:47,910 --> 00:00:48,410 All right. 10 00:00:48,410 --> 00:00:51,420 So now we want to plot the ROC curve, 11 00:00:51,420 --> 00:00:54,830 so we'll use the performance function to extract 12 00:00:54,830 --> 00:00:58,260 the true positive rate and false positive rate. 13 00:00:58,260 --> 00:01:00,110 So we'll create something called perfROCR = 14 00:01:00,110 --> 00:01:01,610 performance(predROCR, "tpr", "fpr"). 15 00:01:11,170 --> 00:01:19,690 And then we'll plot(perfROCR, colorize=TRUE), 16 00:01:19,690 --> 00:01:22,560 so that we can see the colors for the different cutoff 17 00:01:22,560 --> 00:01:25,170 thresholds. 18 00:01:25,170 --> 00:01:26,170 All right. 19 00:01:26,170 --> 00:01:28,539 Now, of course, the best cutoff to select 20 00:01:28,539 --> 00:01:32,220 is entirely dependent on the costs assigned by the decision 21 00:01:32,220 --> 00:01:35,479 maker to false positives and false negatives. 22 00:01:35,479 --> 00:01:39,160 However, again, we do favor cutoffs 23 00:01:39,160 --> 00:01:41,780 that give us a high sensitivity. 24 00:01:41,780 --> 00:01:44,970 We want to identify a large number of the responsive 25 00:01:44,970 --> 00:01:46,180 documents.
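As a rough sketch, the ROCR steps described so far might look like the following. The lecture's pred.prob and test$responsive come from the model and test set built earlier, which are not shown here, so this example substitutes synthetic stand-ins (an assumption made purely for illustration).

```r
library(ROCR)

# Synthetic stand-ins (assumption): in the lecture, pred.prob holds the
# model's predicted probabilities and test$responsive the true labels.
set.seed(1)
responsive <- rbinom(200, 1, 0.2)
pred.prob <- ifelse(responsive == 1, rbeta(200, 3, 2), rbeta(200, 2, 4))

# Build the ROCR prediction object from scores and true labels.
predROCR <- prediction(pred.prob, responsive)

# Extract the true positive rate (y axis) and false positive rate (x axis).
perfROCR <- performance(predROCR, "tpr", "fpr")

# Plot the ROC curve, coloring each point by its cutoff threshold.
plot(perfROCR, colorize = TRUE)
```

With colorize = TRUE, the color scale beside the plot shows which cutoff each part of the curve corresponds to.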
26 00:01:46,180 --> 00:01:48,070 So something that might look promising 27 00:01:48,070 --> 00:01:50,210 might be a point right around here, 28 00:01:50,210 --> 00:01:52,810 in this part of the curve, where we 29 00:01:52,810 --> 00:01:55,990 have a true positive rate of around 70%, 30 00:01:55,990 --> 00:01:58,350 meaning that we're getting about 70% 31 00:01:58,350 --> 00:02:01,630 of all the responsive documents, and a false positive rate 32 00:02:01,630 --> 00:02:05,220 of about 20%, meaning that we're making mistakes 33 00:02:05,220 --> 00:02:09,199 and accidentally identifying as responsive 20% 34 00:02:09,199 --> 00:02:11,540 of the non-responsive documents. 35 00:02:11,540 --> 00:02:14,190 Now, since, typically, the vast majority of documents 36 00:02:14,190 --> 00:02:18,210 are non-responsive, operating at this cutoff 37 00:02:18,210 --> 00:02:20,110 would result, perhaps, in a large decrease 38 00:02:20,110 --> 00:02:22,240 in the amount of manual effort needed 39 00:02:22,240 --> 00:02:24,490 in the e-discovery process. 40 00:02:24,490 --> 00:02:26,790 And we can see from the blue color 41 00:02:26,790 --> 00:02:29,340 of the plot at this particular location 42 00:02:29,340 --> 00:02:33,610 that we're looking at a threshold around maybe 0.15 43 00:02:33,610 --> 00:02:37,790 or so, significantly lower than 0.5, which is definitely 44 00:02:37,790 --> 00:02:40,270 what we would expect since we prefer 45 00:02:40,270 --> 00:02:44,570 false positives to false negatives. 46 00:02:44,570 --> 00:02:47,710 So lastly, we can use the ROCR package 47 00:02:47,710 --> 00:02:50,690 to compute our AUC value. 48 00:02:50,690 --> 00:02:53,910 So, again, call the performance function 49 00:02:53,910 --> 00:02:59,610 with our prediction object, this time extracting the AUC value 50 00:02:59,610 --> 00:03:04,000 and just grabbing its y.values slot.
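A minimal sketch of these two ideas, the rates at a single cutoff and the AUC call, might look like this. The data are synthetic stand-ins (an assumption), so the resulting numbers will not match the 70% and 20% read off the lecture's plot; the cutoff of 0.15 is simply the value mentioned above.

```r
library(ROCR)

# Synthetic stand-ins (assumption), not the lecture's data.
set.seed(1)
responsive <- rbinom(200, 1, 0.2)
pred.prob <- ifelse(responsive == 1, rbeta(200, 3, 2), rbeta(200, 2, 4))

# Flag every document whose predicted probability reaches the cutoff.
cutoff <- 0.15
flagged <- pred.prob >= cutoff

# True positive rate: share of responsive documents that were flagged.
tpr <- sum(flagged & responsive == 1) / sum(responsive == 1)

# False positive rate: share of non-responsive documents that were flagged.
fpr <- sum(flagged & responsive == 0) / sum(responsive == 0)

# AUC: performance() returns an S4 object, and the value sits in y.values.
predROCR <- prediction(pred.prob, responsive)
auc <- as.numeric(performance(predROCR, "auc")@y.values)
```

Lowering the cutoff flags more documents, which raises both rates together; that is the trade-off the ROC curve traces out.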
51 00:03:04,000 --> 00:03:09,780 We can see that we have an AUC in the test set of 79.4%, which 52 00:03:09,780 --> 00:03:11,710 means that our model can differentiate 53 00:03:11,710 --> 00:03:15,220 between a randomly selected responsive document and a randomly 54 00:03:15,220 --> 00:03:19,170 selected non-responsive document about 80% of the time.
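To make that interpretation concrete, here is a small base R sketch that computes the AUC directly as the share of responsive and non-responsive pairs ranked in the right order. The scores are hypothetical values chosen for illustration, not the lecture's predictions.

```r
# Hypothetical scores (assumption), chosen only to illustrate the idea.
pos <- c(0.9, 0.6, 0.4)        # predicted probabilities, responsive documents
neg <- c(0.7, 0.3, 0.2, 0.1)   # predicted probabilities, non-responsive documents

# Compare every responsive score with every non-responsive score.
wins <- outer(pos, neg, ">")
ties <- outer(pos, neg, "==")

# Share of pairs in which the responsive document scores higher,
# counting tied pairs as half.
auc.pairs <- (sum(wins) + 0.5 * sum(ties)) / (length(pos) * length(neg))
```

Here 10 of the 12 pairs are ordered correctly, so auc.pairs is about 0.83. An AUC of 79.4% is the same kind of statement made about the test set.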