Picking a good threshold value is often challenging. A Receiver Operating Characteristic curve, or ROC curve, can help you decide which value of the threshold is best. The ROC curve for our problem is shown on the right of this slide. The sensitivity, or true positive rate of the model, is shown on the y-axis, and the false positive rate, or 1 minus the specificity, is given on the x-axis. The line shows how these two outcome measures vary with different threshold values.

The ROC curve always starts at the point (0, 0), which corresponds to a threshold value of 1. If you have a threshold of 1, you will not catch any poor care cases, so you have a sensitivity of 0. But you will correctly label all of the good care cases, meaning you have a false positive rate of 0. The ROC curve always ends at the point (1, 1), which corresponds to a threshold value of 0. If you have a threshold of 0, you'll catch all of the poor care cases, so you have a sensitivity of 1, but you'll label all of the good care cases as poor care cases too, meaning you have a false positive rate of 1. The threshold decreases as you move from (0, 0) to (1, 1).

At the point (0, 0.4), or about here, you're correctly labeling about 40% of the poor care cases with a very small false positive rate. On the other hand, at the point (0.6, 0.9), you're correctly labeling about 90% of the poor care cases, but you have a false positive rate of 60%. In the middle, around (0.3, 0.8), you're correctly labeling about 80% of the poor care cases, with a 30% false positive rate.

The ROC curve captures all thresholds simultaneously. The higher the threshold, or the closer to (0, 0), the higher the specificity and the lower the sensitivity. The lower the threshold, or the closer to (1, 1), the higher the sensitivity and the lower the specificity. So which threshold value should you pick? You should select the best threshold for the trade-off you want to make.
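To make the trade-off concrete, here is a minimal R sketch that computes the sensitivity and false positive rate at a single threshold, assuming the predictTrain predictions and the qualityTrain data frame from earlier in the lecture are already in your workspace (the "TRUE" column only exists if at least one prediction exceeds the chosen threshold):

    # Confusion matrix at a threshold of 0.3 (rows: true outcome, columns: predicted poor care)
    t <- 0.3
    confMat <- table(qualityTrain$PoorCare, predictTrain > t)

    # Sensitivity (true positive rate): true positives over all actual poor care cases
    confMat["1", "TRUE"] / sum(confMat["1", ])

    # False positive rate: false positives over all actual good care cases
    confMat["0", "TRUE"] / sum(confMat["0", ])

Lowering t moves you up and to the right along the ROC curve; raising it moves you back toward (0, 0).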
If you're more concerned with having a high specificity, or low false positive rate, pick the threshold that maximizes the true positive rate while keeping the false positive rate really low. A threshold around (0.1, 0.5) on this ROC curve looks like a good choice in this case. On the other hand, if you're more concerned with having a high sensitivity, or high true positive rate, pick a threshold that gives a very high true positive rate while keeping the false positive rate as low as you can. A threshold around (0.3, 0.8) looks like a good choice in this case.

You can label the threshold values in R by color-coding the curve. The legend is shown on the right. Say we want to pick a threshold value around here: this corresponds to the boundary between the aqua color and the green color, which looks like a threshold of about 0.3. Instead, if we wanted to pick a threshold around here, this looks like the start of the darker blue color, which is probably a threshold of around 0.2. We can also add specific threshold labels to the curve in R. This helps you see which threshold value you want to use.

Let's go into R and see how to generate these ROC curves. We'll walk through the commands one at a time; a consolidated sketch of the full sequence appears at the end of this section. To generate ROC curves in R, we need to install a new package. We'll use the same two commands as we did earlier in the lecture, but this time the name of the package is ROCR. So first type install.packages("ROCR") and hit Enter. Since we picked a CRAN mirror already in this R session, we shouldn't have to pick it again. You know the package is done installing when you see the arrow and blinking cursor in your R console. Now let's load the package using the library function.

Recall that we made predictions on our training set and called them predictTrain. We'll use these predictions to create our ROC curve. First, we'll call the prediction function of ROCR. We'll call the output of this function ROCRpred, and then use the prediction function. This function takes two arguments.
The first is the predictions we made with our model, which we called predictTrain. The second argument is the true outcomes of our data points, which in our case is qualityTrain$PoorCare.

Now we need to use the performance function. This defines what we'd like to plot on the x- and y-axes of our ROC curve. We'll call the output of this ROCRperf, and use the performance function, which takes as arguments the output of the prediction function and then what we want on the y-axis and the x-axis. In this case, it's true positive rate, or "tpr", and false positive rate, or "fpr". Now we just need to plot the output of the performance function, ROCRperf. You should see the ROC curve pop up in a new window. This should look exactly like the one we saw on the slides.

Now you can add colors by adding one additional argument to the plot function. So in your R console, hit the up arrow, and after ROCRperf, type colorize=TRUE, and hit Enter. If you go back to your plot window, you should see the ROC curve with the colors for the threshold values added.

Now finally, let's add the threshold labels to our plot. Back in your R console, hit the up arrow again to get the plot function, and after colorize=TRUE, we'll add two more arguments. The first is print.cutoffs.at=seq(0,1,0.1), which will print the threshold values in increments of 0.1. If you want finer increments, just decrease the value of 0.1. The final argument is text.adj=c(-0.2,1.7). Hit Enter. If you go back to your plot window, you should see the ROC curve with the threshold values added.

Using this curve, we can determine which threshold value we want to use, depending on our preferences as a decision-maker. In the next video, we'll discuss how to assess the strength of our model.
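For reference, here are the commands from this walkthrough collected into one short sketch. It assumes the predictTrain predictions and the qualityTrain data frame from earlier in the lecture are already in your workspace; everything else uses only the ROCR functions named above:

    # Install and load the ROCR package (the install only needs to be done once)
    install.packages("ROCR")
    library(ROCR)

    # Combine the predicted probabilities with the true outcomes
    ROCRpred <- prediction(predictTrain, qualityTrain$PoorCare)

    # Ask for true positive rate on the y-axis and false positive rate on the x-axis
    ROCRperf <- performance(ROCRpred, "tpr", "fpr")

    # Plot the plain ROC curve
    plot(ROCRperf)

    # The same curve, color-coded by threshold value
    plot(ROCRperf, colorize=TRUE)

    # The same curve, with threshold labels printed every 0.1 along the line
    plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,0.1), text.adj=c(-0.2,1.7))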