1
00:00:04,500 --> 00:00:07,940
This plot shows two of
our independent variables,

2
00:00:07,940 --> 00:00:11,600
the number of office
visits on the x-axis

3
00:00:11,600 --> 00:00:16,260
and the number of narcotics
prescribed on the y-axis.

4
00:00:16,260 --> 00:00:20,770
Each point is an observation
or a patient in our data set.

5
00:00:20,770 --> 00:00:24,300
The red points are patients
who received poor care,

6
00:00:24,300 --> 00:00:25,930
and the green
points are patients

7
00:00:25,930 --> 00:00:28,220
who received good care.

8
00:00:28,220 --> 00:00:30,280
It's hard to see a
trend in the data

9
00:00:30,280 --> 00:00:32,600
by just visually inspecting it.

10
00:00:32,600 --> 00:00:36,580
But it looks like maybe
more office visits and more

11
00:00:36,580 --> 00:00:40,750
narcotics, or data points
to the right of this line,

12
00:00:40,750 --> 00:00:43,960
are more likely
to have poor care.

13
00:00:43,960 --> 00:00:47,630
Let's see if we can build a
logistic regression model in R

14
00:00:47,630 --> 00:00:50,850
to better predict poor care.

15
00:00:50,850 --> 00:00:53,780
In our R console,
let's start by reading

16
00:00:53,780 --> 00:00:57,410
our data set into R.
Remember to navigate

17
00:00:57,410 --> 00:00:59,250
to the directory
on your computer

18
00:00:59,250 --> 00:01:03,100
containing the file quality.csv.

19
00:01:03,100 --> 00:01:05,780
We'll call our
data set "quality"

20
00:01:05,780 --> 00:01:09,870
and use the read.csv function
to read in the data file

21
00:01:09,870 --> 00:01:10,370
quality.csv.

22
00:01:13,330 --> 00:01:14,850
Let's take a look
at the structure

23
00:01:14,850 --> 00:01:18,780
of our data set by
using the str function.

24
00:01:18,780 --> 00:01:22,870
We have 131 observations,
one for each

25
00:01:22,870 --> 00:01:27,680
of the patients in our data
set, and 14 different variables.

26
00:01:27,680 --> 00:01:32,930
MemberID just numbers the
patients from 1 to 131.

27
00:01:32,930 --> 00:01:35,870
The 12 variables
from InpatientDays

28
00:01:35,870 --> 00:01:40,100
to AcuteDrugGapSmall are
the independent variables.

29
00:01:40,100 --> 00:01:43,320
We'll be using the
number of office visits

30
00:01:43,320 --> 00:01:46,150
and the number of
prescriptions for narcotics

31
00:01:46,150 --> 00:01:47,830
that the patient had.

32
00:01:47,830 --> 00:01:49,970
But you can read
descriptions of all

33
00:01:49,970 --> 00:01:53,190
of the independent
variables below this video.

34
00:01:53,190 --> 00:01:55,400
After the lecture,
try building models

35
00:01:55,400 --> 00:01:58,170
with different subsets
of independent variables

36
00:01:58,170 --> 00:02:01,990
to see what the best model
is that you can find.

37
00:02:01,990 --> 00:02:06,020
The final variable,
PoorCare, is our outcome

38
00:02:06,020 --> 00:02:08,960
or dependent variable
and is equal to 1

39
00:02:08,960 --> 00:02:12,150
if the patient had poor
care and equal to 0

40
00:02:12,150 --> 00:02:14,630
if the patient had good care.

41
00:02:14,630 --> 00:02:17,910
Let's see how many
patients received poor care

42
00:02:17,910 --> 00:02:20,630
and how many patients
received good care

43
00:02:20,630 --> 00:02:23,170
by using the table function.

44
00:02:23,170 --> 00:02:26,800
Let's make a table of our
outcome variable PoorCare.

45
00:02:29,480 --> 00:02:34,110
We can see that 98 out of the
131 patients in our data set

46
00:02:34,110 --> 00:02:40,050
received good care, or 0, and
33 patients received poor care,

47
00:02:40,050 --> 00:02:42,870
or those labeled with 1.

48
00:02:42,870 --> 00:02:45,680
Before building any
models, let's consider

49
00:02:45,680 --> 00:02:48,730
using a simple baseline method.

50
00:02:48,730 --> 00:02:51,050
Last week when we
computed the R-squared

51
00:02:51,050 --> 00:02:54,510
for linear regression, we
compared our predictions

52
00:02:54,510 --> 00:02:57,960
to the baseline method of
predicting the average outcome

53
00:02:57,960 --> 00:03:00,130
for all data points.

54
00:03:00,130 --> 00:03:04,100
In a classification problem,
a standard baseline method

55
00:03:04,100 --> 00:03:07,060
is to just predict the
most frequent outcome

56
00:03:07,060 --> 00:03:09,070
for all observations.

57
00:03:09,070 --> 00:03:13,040
Since good care is more common
than poor care, in this case,

58
00:03:13,040 --> 00:03:14,710
we would predict
that all patients

59
00:03:14,710 --> 00:03:16,460
are receiving good care.

60
00:03:16,460 --> 00:03:19,090
If we did this, we
would get 98 out

61
00:03:19,090 --> 00:03:22,930
of the 131 observations
correct, or have

62
00:03:22,930 --> 00:03:26,730
an accuracy of about 75%.

63
00:03:26,730 --> 00:03:31,210
So our baseline model
has an accuracy of 75%.

64
00:03:31,210 --> 00:03:34,220
This is what we'll try to beat
with our logistic regression

65
00:03:34,220 --> 00:03:36,630
model.

66
00:03:36,630 --> 00:03:40,020
Last week, we always gave
you the training data set

67
00:03:40,020 --> 00:03:43,329
and the testing data
set in separate files.

68
00:03:43,329 --> 00:03:46,130
This week, we only
have one data set.

69
00:03:46,130 --> 00:03:50,100
So we want to randomly split
our data set into a training set

70
00:03:50,100 --> 00:03:53,260
and testing set so that
we'll have a test set

71
00:03:53,260 --> 00:03:56,590
to measure our
out-of-sample accuracy.

72
00:03:56,590 --> 00:04:00,440
To do this, we need to
add a new package to R.

73
00:04:00,440 --> 00:04:04,150
There are many functions
and algorithms built into R,

74
00:04:04,150 --> 00:04:06,870
but there are many more
that you can install.

75
00:04:06,870 --> 00:04:10,780
We'll do this several times
throughout this course.

76
00:04:10,780 --> 00:04:14,230
First, let's install
the new package using

77
00:04:14,230 --> 00:04:18,290
the install.packages
function and then give

78
00:04:18,290 --> 00:04:21,510
the name of the package we
want to install in quotes.

79
00:04:21,510 --> 00:04:24,620
In this case, it's caTools.

80
00:04:24,620 --> 00:04:27,430
If you hit Enter, a
window should pop up

81
00:04:27,430 --> 00:04:30,270
asking you to pick
a CR AN mirror.

82
00:04:30,270 --> 00:04:33,590
Here, you should pick
a location near you.

83
00:04:33,590 --> 00:04:36,270
In my case, I'll
pick Pennsylvania

84
00:04:36,270 --> 00:04:37,170
in the United States.

85
00:04:39,860 --> 00:04:41,960
After you've
selected a location,

86
00:04:41,960 --> 00:04:45,570
you should see some information
pop up in your R console.

87
00:04:45,570 --> 00:04:50,330
It's done when you see the arrow
with the blinking cursor again.

88
00:04:50,330 --> 00:04:54,270
Now, we need to load the package
into our current R session.

89
00:04:54,270 --> 00:04:57,110
To do this, we'll use
the library function.

90
00:04:57,110 --> 00:05:00,080
So type library and
then in parentheses

91
00:05:00,080 --> 00:05:03,460
the name of our
package, caTools.

92
00:05:03,460 --> 00:05:05,610
Sometimes you'll get
a warning message

93
00:05:05,610 --> 00:05:09,040
based on the version
of R that you're using.

94
00:05:09,040 --> 00:05:12,200
This can usually
safely be ignored.

95
00:05:12,200 --> 00:05:15,490
In the future, whenever
you want to use a package,

96
00:05:15,490 --> 00:05:19,720
you won't need to install it,
but you will need to load it.

97
00:05:19,720 --> 00:05:23,400
Now, let's use this package
to randomly split our data

98
00:05:23,400 --> 00:05:26,250
into a training set
and testing set.

99
00:05:26,250 --> 00:05:30,120
We'll do this using a
function sample.split which

100
00:05:30,120 --> 00:05:32,740
is part of the caTools package.

101
00:05:32,740 --> 00:05:35,950
Since sample.split
randomly splits your data,

102
00:05:35,950 --> 00:05:38,580
it could split it
differently for each of us.

103
00:05:38,580 --> 00:05:42,810
To make sure that we all get the
same split, we'll set our seed.

104
00:05:42,810 --> 00:05:46,080
This initializes the
random number generator.

105
00:05:46,080 --> 00:05:51,130
So type set.seed(88),
a number I selected.

106
00:05:53,690 --> 00:05:56,500
Now, let's use sample.split.

107
00:05:56,500 --> 00:06:02,230
Type split =
sample.split, and then

108
00:06:02,230 --> 00:06:05,430
in parentheses, we need
to give two arguments.

109
00:06:05,430 --> 00:06:11,500
The first is our outcome
variable or quality$PoorCare,

110
00:06:11,500 --> 00:06:14,970
and the second argument is the
percentage of the data that we

111
00:06:14,970 --> 00:06:18,210
want in the training set.

112
00:06:18,210 --> 00:06:23,120
We type SplitRatio
equals, and in this case,

113
00:06:23,120 --> 00:06:27,020
we'll put 75% of the data in
the training set, which we'll

114
00:06:27,020 --> 00:06:31,120
use to build the model, and
25% of the data in the testing

115
00:06:31,120 --> 00:06:33,420
set to test our model.

116
00:06:33,420 --> 00:06:36,170
Sample.split randomly
splits the data.

117
00:06:36,170 --> 00:06:38,770
But it also makes sure
that the outcome variable

118
00:06:38,770 --> 00:06:41,280
is well-balanced in each piece.

119
00:06:41,280 --> 00:06:44,940
We saw earlier that
about 75% of our patients

120
00:06:44,940 --> 00:06:46,980
are receiving good care.

121
00:06:46,980 --> 00:06:50,060
This function makes sure
that in our training set,

122
00:06:50,060 --> 00:06:53,580
75% of our patients
are receiving good care

123
00:06:53,580 --> 00:06:57,010
and in our testing set
75% of our patients

124
00:06:57,010 --> 00:06:59,100
are receiving good care.

125
00:06:59,100 --> 00:07:02,510
We want to do this so that
our test set is representative

126
00:07:02,510 --> 00:07:04,450
of our training set.

127
00:07:04,450 --> 00:07:07,810
Let's take a look at split.

128
00:07:07,810 --> 00:07:12,260
There is a TRUE or FALSE value
for each of our observations.

129
00:07:12,260 --> 00:07:14,960
TRUE means that we should
put that observation

130
00:07:14,960 --> 00:07:17,680
in the training
set, and FALSE means

131
00:07:17,680 --> 00:07:21,600
that we should put that
observation in the testing set.

132
00:07:21,600 --> 00:07:24,470
So now let's create our
training and testing

133
00:07:24,470 --> 00:07:27,680
sets using the subset function.

134
00:07:27,680 --> 00:07:31,590
We'll call our training
set qualityTrain

135
00:07:31,590 --> 00:07:35,909
and use the subset function
to take a subset of quality

136
00:07:35,909 --> 00:07:38,220
and only taking the
observations for which

137
00:07:38,220 --> 00:07:40,940
split is equal to TRUE.

138
00:07:40,940 --> 00:07:44,340
We'll call our testing
set qualityTest

139
00:07:44,340 --> 00:07:46,540
and, again, use
the subset function

140
00:07:46,540 --> 00:07:48,909
to take the
observations of quality,

141
00:07:48,909 --> 00:07:53,409
but this time those for which
split is equal to FALSE.

142
00:07:53,409 --> 00:07:56,920
If you look at the number of
rows in each of our data sets,

143
00:07:56,920 --> 00:08:00,930
the training set and
then the testing set,

144
00:08:00,930 --> 00:08:04,430
you can see that there are 99
observations in the training

145
00:08:04,430 --> 00:08:09,580
set and 32 observations
in the testing set.

146
00:08:09,580 --> 00:08:12,380
Now, we are ready to build
a logistic regression

147
00:08:12,380 --> 00:08:15,260
model using OfficeVisits
and Narcotics

148
00:08:15,260 --> 00:08:17,440
as independent variables.

149
00:08:17,440 --> 00:08:23,110
We'll call our model QualityLog
and use the "glm" function

150
00:08:23,110 --> 00:08:26,710
for "generalized
linear model" to build

151
00:08:26,710 --> 00:08:28,770
our logistic regression model.

152
00:08:28,770 --> 00:08:30,850
We start by giving
the equation we

153
00:08:30,850 --> 00:08:33,960
want to build just like
in linear regression.

154
00:08:33,960 --> 00:08:36,490
We start with the dependent
variable, and then

155
00:08:36,490 --> 00:08:39,650
the tilde sign, and then
the independent variables

156
00:08:39,650 --> 00:08:43,820
we want to use separated
by the plus sign.

157
00:08:43,820 --> 00:08:45,890
We then give the
name of the data

158
00:08:45,890 --> 00:08:48,290
set we want to use
to build the model,

159
00:08:48,290 --> 00:08:51,260
in this case, qualityTrain.

160
00:08:51,260 --> 00:08:55,250
For a logistic regression model,
we need one last argument,

161
00:08:55,250 --> 00:08:56,300
which is family=binomial.

162
00:09:01,860 --> 00:09:05,670
This tells the glm function
to build a logistic regression

163
00:09:05,670 --> 00:09:07,480
model.

164
00:09:07,480 --> 00:09:10,020
Now, let's look at our model
using the summary function.

165
00:09:12,740 --> 00:09:17,910
The output looks similar to that
of a linear regression model.

166
00:09:17,910 --> 00:09:21,510
What we want to focus on
is the coefficients table.

167
00:09:21,510 --> 00:09:24,360
This gives the estimate
values for the coefficients,

168
00:09:24,360 --> 00:09:27,580
or the betas, for our
logistic regression model.

169
00:09:27,580 --> 00:09:30,990
We see here that the
coefficients for OfficeVisits

170
00:09:30,990 --> 00:09:34,060
and Narcotics are
both positive, which

171
00:09:34,060 --> 00:09:37,330
means that higher values
in these two variables

172
00:09:37,330 --> 00:09:39,890
are indicative of
poor care as we

173
00:09:39,890 --> 00:09:42,620
suspected from
looking at the data.

174
00:09:42,620 --> 00:09:44,990
We also see that both
of these variables

175
00:09:44,990 --> 00:09:47,500
have at least one star,
meaning that they're

176
00:09:47,500 --> 00:09:50,400
significant in our model.

177
00:09:50,400 --> 00:09:52,660
The last thing we want
to look at in the output

178
00:09:52,660 --> 00:09:55,140
is the AIC value.

179
00:09:55,140 --> 00:09:57,900
This is a measure of
the quality of the model

180
00:09:57,900 --> 00:10:00,880
and is like Adjusted
R-squared in that it accounts

181
00:10:00,880 --> 00:10:03,660
for the number of
variables used compared

182
00:10:03,660 --> 00:10:05,970
to the number of observations.

183
00:10:05,970 --> 00:10:08,300
Unfortunately, it
can only be compared

184
00:10:08,300 --> 00:10:11,230
between models on
the same data set.

185
00:10:11,230 --> 00:10:14,760
But it provides a means
for model selection.

186
00:10:14,760 --> 00:10:19,170
The preferred model is the
one with the minimum AIC.

187
00:10:19,170 --> 00:10:22,790
Now, let's make predictions
on the training set.

188
00:10:22,790 --> 00:10:27,800
We'll call them predictTrain
and use the predict function

189
00:10:27,800 --> 00:10:32,120
to make predictions using
the model QualityLog,

190
00:10:32,120 --> 00:10:33,920
and we'll give a
second argument,

191
00:10:33,920 --> 00:10:34,960
which is type="response".

192
00:10:39,350 --> 00:10:43,850
This tells the predict function
to give us probabilities.

193
00:10:43,850 --> 00:10:46,090
Let's take a look at
the statistical summary

194
00:10:46,090 --> 00:10:46,880
of our predictions.

195
00:10:51,780 --> 00:10:53,740
Since we're expecting
probabilities,

196
00:10:53,740 --> 00:10:57,160
all of the numbers should
be between zero and one.

197
00:10:57,160 --> 00:11:01,090
And we see that the
minimum value is about 0.07

198
00:11:01,090 --> 00:11:04,730
and the maximum value is 0.98.

199
00:11:04,730 --> 00:11:07,410
Let's see if we're predicting
higher probabilities

200
00:11:07,410 --> 00:11:11,430
for the actual poor
care cases as we expect.

201
00:11:11,430 --> 00:11:15,740
To do this, use the
tapply function,

202
00:11:15,740 --> 00:11:25,140
giving as arguments predictTrain
and then QualityTrain$PoorCare

203
00:11:25,140 --> 00:11:26,460
and then mean.

204
00:11:26,460 --> 00:11:28,980
This will compute the
average prediction

205
00:11:28,980 --> 00:11:31,770
for each of the true outcomes.

206
00:11:31,770 --> 00:11:35,720
So we see that for all of
the true poor care cases,

207
00:11:35,720 --> 00:11:39,750
we predict an average
probability of about 0.44.

208
00:11:39,750 --> 00:11:42,110
And all of the true
good care cases,

209
00:11:42,110 --> 00:11:46,020
we predict an average
probability of about 0.19.

210
00:11:46,020 --> 00:11:47,620
So this is a good
sign, because it

211
00:11:47,620 --> 00:11:49,140
looks like we're
predicting a higher

212
00:11:49,140 --> 00:11:53,010
probability for the
actual poor care cases.

213
00:11:53,010 --> 00:11:55,730
Now that we have our
model, in the next video,

214
00:11:55,730 --> 00:11:59,460
we'll discuss how to assess the
accuracy of our predictions.