1
00:00:04,500 --> 00:00:06,440
Now let's look at
the ROC curve so we

2
00:00:06,440 --> 00:00:08,670
can understand the
performance of our model

3
00:00:08,670 --> 00:00:10,360
at different cutoffs.

4
00:00:10,360 --> 00:00:12,990
We'll first need to
load the ROCR package

5
00:00:12,990 --> 00:00:13,870
with library(ROCR).

6
00:00:19,110 --> 00:00:22,270
Next, we'll build our
ROCR prediction object.

7
00:00:22,270 --> 00:00:23,770
So we'll call this
object predROCR =

8
00:00:23,770 --> 00:00:25,400
prediction(pred.prob,
test$responsive).

9
00:00:47,910 --> 00:00:48,410
All right.

10
00:00:48,410 --> 00:00:51,420
So now we want to
plot the ROC curve

11
00:00:51,420 --> 00:00:54,830
so we'll use the performance
function to extract

12
00:00:54,830 --> 00:00:58,260
the true positive rate
and false positive rate.

13
00:00:58,260 --> 00:01:00,110
So create something
called perfROCR =

14
00:01:00,110 --> 00:01:01,610
performance(predROCR,
"tpr", "fpr").

15
00:01:11,170 --> 00:01:19,690
And then we'll plot(perfROCR,
colorize=TRUE),

16
00:01:19,690 --> 00:01:22,560
so that we can see the colors
for the different cutoff

17
00:01:22,560 --> 00:01:25,170
thresholds.

18
00:01:25,170 --> 00:01:26,170
All right.

19
00:01:26,170 --> 00:01:28,539
Now, of course, the
best cutoff to select

20
00:01:28,539 --> 00:01:32,220
is entirely dependent on the
costs assigned by the decision

21
00:01:32,220 --> 00:01:35,479
maker to false positives
and false negatives.

22
00:01:35,479 --> 00:01:39,160
However, again, we
do favor cutoffs

23
00:01:39,160 --> 00:01:41,780
that give us a high sensitivity.

24
00:01:41,780 --> 00:01:44,970
We want to identify a large
number of the responsive

25
00:01:44,970 --> 00:01:46,180
documents.

26
00:01:46,180 --> 00:01:48,070
So something that
might look promising

27
00:01:48,070 --> 00:01:50,210
might be a point
right around here,

28
00:01:50,210 --> 00:01:52,810
in this part of
the curve, where we

29
00:01:52,810 --> 00:01:55,990
have a true positive
rate of around 70%,

30
00:01:55,990 --> 00:01:58,350
meaning that we're
getting about 70%

31
00:01:58,350 --> 00:02:01,630
of all the responsive documents,
and a false positive rate

32
00:02:01,630 --> 00:02:05,220
of about 20%, meaning
that we're making mistakes

33
00:02:05,220 --> 00:02:09,199
and accidentally identifying
as responsive 20%

34
00:02:09,199 --> 00:02:11,540
of the non-responsive documents.

35
00:02:11,540 --> 00:02:14,190
Now, since, typically, the
vast majority of documents

36
00:02:14,190 --> 00:02:18,210
are non-responsive,
operating at this cutoff

37
00:02:18,210 --> 00:02:20,110
would result, perhaps,
in a large decrease

38
00:02:20,110 --> 00:02:22,240
in the amount of
manual effort needed

39
00:02:22,240 --> 00:02:24,490
in the e-discovery process.

40
00:02:24,490 --> 00:02:26,790
And we can see
from the blue color

41
00:02:26,790 --> 00:02:29,340
of the plot at this
particular location

42
00:02:29,340 --> 00:02:33,610
that we're looking at a
threshold around maybe 0.15

43
00:02:33,610 --> 00:02:37,790
or so, significantly lower
than 0.5, which is definitely

44
00:02:37,790 --> 00:02:40,270
what we would expect
since we prefer

45
00:02:40,270 --> 00:02:44,570
false positives to
false negatives.

46
00:02:44,570 --> 00:02:47,710
So lastly, we can
use the ROCR package

47
00:02:47,710 --> 00:02:50,690
to compute our AUC value.

48
00:02:50,690 --> 00:02:53,910
So, again, call the
performance function

49
00:02:53,910 --> 00:02:59,610
with our prediction object, this
time extracting the AUC value

50
00:02:59,610 --> 00:03:04,000
and just grabbing the
y.values slot of it.

51
00:03:04,000 --> 00:03:09,780
We can see that we have an AUC
on the test set of 79.4%, which

52
00:03:09,780 --> 00:03:11,710
means that our model
can differentiate

53
00:03:11,710 --> 00:03:15,220
between a randomly selected
responsive and non-responsive

54
00:03:15,220 --> 00:03:19,170
document about 80% of the time.