1
00:00:04,500 --> 00:00:07,640
Now that we have identified
a set of risk factors,

2
00:00:07,640 --> 00:00:12,720
let's use this data to predict
the 10 year risk of CHD.

3
00:00:12,720 --> 00:00:15,190
First, we'll randomly
split our patients

4
00:00:15,190 --> 00:00:18,420
into a training set
and a testing set.

5
00:00:18,420 --> 00:00:20,830
Then, we'll use
logistic regression

6
00:00:20,830 --> 00:00:24,790
to predict whether or not
a patient experienced CHD

7
00:00:24,790 --> 00:00:28,640
within 10 years of
the first examination.

8
00:00:28,640 --> 00:00:30,800
Keep in mind that all
of the risk factors

9
00:00:30,800 --> 00:00:35,780
were collected at the first
examination of the patients.

10
00:00:35,780 --> 00:00:39,320
After building our model, we'll
evaluate the predictive power

11
00:00:39,320 --> 00:00:41,850
of the model on the test set.

12
00:00:41,850 --> 00:00:46,730
Let's go to R and create our
logistic regression model.

13
00:00:46,730 --> 00:00:51,220
In our R console, we'll
call our data set framingham

14
00:00:51,220 --> 00:00:56,200
and use the read.csv function
to read in the data file

15
00:00:56,200 --> 00:00:56,910
"framingham.csv".

16
00:00:59,760 --> 00:01:02,320
Remember to navigate to
the directory containing

17
00:01:02,320 --> 00:01:07,930
the file "framingham.csv"
before reading in the data.

18
00:01:07,930 --> 00:01:10,960
Let's take a look at our data
by using the str function.

19
00:01:14,720 --> 00:01:20,800
We have data for 4,240
patients and 16 variables.

20
00:01:20,800 --> 00:01:24,470
We have the demographic
risk factors male, age,

21
00:01:24,470 --> 00:01:27,240
and education; the
behavioral risk

22
00:01:27,240 --> 00:01:30,880
factors currentSmoker
and cigsPerDay;

23
00:01:30,880 --> 00:01:36,840
the medical history risk
factors BPMeds, prevalentStroke,

24
00:01:36,840 --> 00:01:41,500
prevalentHyp, and diabetes;
and the physical exam risk

25
00:01:41,500 --> 00:01:50,600
factors totChol, sysBP, diaBP,
BMI, heartRate, and glucose

26
00:01:50,600 --> 00:01:51,890
level.

27
00:01:51,890 --> 00:01:55,940
The last variable is the
outcome or dependent variable,

28
00:01:55,940 --> 00:01:57,940
whether or not the
patient developed

29
00:01:57,940 --> 00:02:01,360
CHD in the next 10 years.

30
00:02:01,360 --> 00:02:05,760
Now let's split our data into a
training set and a testing set

31
00:02:05,760 --> 00:02:10,320
using sample.split like we
did in the previous lecture.

32
00:02:10,320 --> 00:02:13,380
We first need to load
the library caTools.

33
00:02:17,660 --> 00:02:21,579
Now, let's set our seed
and create our split.

34
00:02:21,579 --> 00:02:25,829
We'll start by setting
our seed to 1000,

35
00:02:25,829 --> 00:02:31,230
and then use the sample.split
function to create the split.

36
00:02:35,170 --> 00:02:38,140
The first argument is
the outcome variable

37
00:02:38,140 --> 00:02:39,060
framingham$TenYearCHD.

38
00:02:43,970 --> 00:02:46,890
And the second argument
is the percentage of data

39
00:02:46,890 --> 00:02:50,880
that we want in the training
set or the SplitRatio.

40
00:02:50,880 --> 00:02:55,620
Here, we'll put 65% of the
data in the training set.

41
00:02:55,620 --> 00:02:58,300
When you have more
data like we do here,

42
00:02:58,300 --> 00:03:01,410
you can afford to put less
data in the training set

43
00:03:01,410 --> 00:03:03,650
and more in the testing set.

44
00:03:03,650 --> 00:03:05,610
This will increase
our confidence

45
00:03:05,610 --> 00:03:08,480
in the ability of the
model to extend to new data

46
00:03:08,480 --> 00:03:11,340
since we have a larger
test set, and still

47
00:03:11,340 --> 00:03:13,310
give us enough data
in the training set

48
00:03:13,310 --> 00:03:15,060
to create our model.

49
00:03:15,060 --> 00:03:19,000
You typically want to put
somewhere between 50% and 80%

50
00:03:19,000 --> 00:03:22,130
of the data in the training set.

51
00:03:22,130 --> 00:03:25,040
Now, let's split up
our data using subset.

52
00:03:25,040 --> 00:03:28,190
We'll call our
training set "train"

53
00:03:28,190 --> 00:03:32,400
and use the subset function
to take a subset of framingham

54
00:03:32,400 --> 00:03:37,310
and take the observations for
which split is equal to TRUE.

55
00:03:37,310 --> 00:03:39,730
We'll call our
testing set "test"

56
00:03:39,730 --> 00:03:41,890
and again use the
subset function

57
00:03:41,890 --> 00:03:44,690
to take a subset of
framingham and take

58
00:03:44,690 --> 00:03:47,400
the observations for
which split equals FALSE.

59
00:03:51,120 --> 00:03:53,780
Now we're ready to build
our logistic regression

60
00:03:53,780 --> 00:03:56,280
model using the training set.

61
00:03:56,280 --> 00:04:02,340
Let's call it framinghamLog,
and we'll use the glm function

62
00:04:02,340 --> 00:04:04,160
like we did in the
previous lecture

63
00:04:04,160 --> 00:04:07,800
to create a logistic
regression model.

64
00:04:07,800 --> 00:04:09,510
We'll use a nice
little trick here

65
00:04:09,510 --> 00:04:11,730
where we predict our
dependent variable

66
00:04:11,730 --> 00:04:14,450
using all of the other
variables in the data set

67
00:04:14,450 --> 00:04:16,810
as independent variables.

68
00:04:16,810 --> 00:04:20,120
First, type the name of
the dependent variable,

69
00:04:20,120 --> 00:04:25,910
TenYearCHD, followed by the
tilde and then a period.

70
00:04:28,560 --> 00:04:30,800
This will use all of
the other variables

71
00:04:30,800 --> 00:04:33,610
in the data set as
independent variables

72
00:04:33,610 --> 00:04:36,260
and is used in
place of listing out

73
00:04:36,260 --> 00:04:39,150
all of the independent
variables' names separated

74
00:04:39,150 --> 00:04:41,420
by the plus sign.

75
00:04:41,420 --> 00:04:43,340
Be careful doing
this with data sets

76
00:04:43,340 --> 00:04:46,380
that have identifying
variables like a patient ID

77
00:04:46,380 --> 00:04:48,780
or name since you
wouldn't want to use

78
00:04:48,780 --> 00:04:52,380
these as independent variables.

79
00:04:52,380 --> 00:04:55,100
Following the period, we
need to give the argument

80
00:04:55,100 --> 00:04:59,409
that defines the data
set to use, data = train.

81
00:04:59,409 --> 00:05:02,890
And then, the final argument
for a logistic regression model

82
00:05:02,890 --> 00:05:04,820
is family = binomial.

83
00:05:07,450 --> 00:05:09,500
Let's take a look at the
summary of our model.

84
00:05:17,810 --> 00:05:22,860
It looks like male,
age, prevalent stroke,

85
00:05:22,860 --> 00:05:27,090
total cholesterol, systolic
blood pressure, and glucose

86
00:05:27,090 --> 00:05:29,960
are all significant
in our model.

87
00:05:29,960 --> 00:05:32,680
Cigarettes per day and
prevalent hypertension

88
00:05:32,680 --> 00:05:35,159
are almost significant.

89
00:05:35,159 --> 00:05:39,220
All of significant variables
have positive coefficients,

90
00:05:39,220 --> 00:05:42,070
meaning that higher
values in these variables

91
00:05:42,070 --> 00:05:44,240
contribute to a
higher probability

92
00:05:44,240 --> 00:05:47,850
of 10-year coronary
heart disease.

93
00:05:47,850 --> 00:05:50,540
Now, let's use this
model to make predictions

94
00:05:50,540 --> 00:05:51,950
on our test set.

95
00:05:51,950 --> 00:05:56,260
We'll call our
predictions predictTest

96
00:05:56,260 --> 00:05:58,370
and use the predict
function, which

97
00:05:58,370 --> 00:06:03,190
takes as arguments the name
of our model, framinghamLog,

98
00:06:03,190 --> 00:06:08,790
then type = "response",
which gives us probabilities,

99
00:06:08,790 --> 00:06:16,670
and lastly newdata = test,
the name of our testing set.

100
00:06:16,670 --> 00:06:19,770
Now, let's use a
threshold value of 0.5

101
00:06:19,770 --> 00:06:22,360
to create a confusion matrix.

102
00:06:22,360 --> 00:06:26,580
We'll use the table function
and give as the first argument,

103
00:06:26,580 --> 00:06:33,590
the actual values,
test$TenYearCHD,

104
00:06:33,590 --> 00:06:36,990
and then as the second
argument our predictions,

105
00:06:36,990 --> 00:06:42,190
predictTest > 0.5.

106
00:06:42,190 --> 00:06:46,600
With a threshold of 0.5,
we predict an outcome of 1,

107
00:06:46,600 --> 00:06:49,630
the true column, very rarely.

108
00:06:49,630 --> 00:06:53,610
This means that our model
rarely predicts a 10-year CHD

109
00:06:53,610 --> 00:06:56,770
risk above 50%.

110
00:06:56,770 --> 00:06:59,690
What is the accuracy
of this model?

111
00:06:59,690 --> 00:07:05,280
Well, it's the sum of the cases
we get right, 1069 plus 11,

112
00:07:05,280 --> 00:07:08,580
divided by the total number
of observations in our data

113
00:07:08,580 --> 00:07:15,750
set, 1069 + 6 + 187 + 11.

114
00:07:15,750 --> 00:07:21,720
So the accuracy of our
model is about 84.8%.

115
00:07:21,720 --> 00:07:24,840
We need to compare this to the
accuracy of a simple baseline

116
00:07:24,840 --> 00:07:26,060
method.

117
00:07:26,060 --> 00:07:28,900
The more frequent outcome
in this case is 0,

118
00:07:28,900 --> 00:07:34,680
so the baseline method would
always predict 0 or no CHD.

119
00:07:34,680 --> 00:07:40,290
This baseline method would
get an accuracy of 1069

120
00:07:40,290 --> 00:07:46,710
+ 6-- this is the total number
of true negative cases--

121
00:07:46,710 --> 00:07:50,380
divided by the total number
of observations in our data

122
00:07:50,380 --> 00:07:57,130
set, 1069 + 6 + 187 + 11.

123
00:07:57,130 --> 00:08:03,590
So the baseline model would
get an accuracy of about 84.4%.

124
00:08:03,590 --> 00:08:08,330
So our model barely beats the
baseline in terms of accuracy.

125
00:08:08,330 --> 00:08:12,490
But do we still have a valuable
model by varying the threshold?

126
00:08:12,490 --> 00:08:15,790
Let's compute the
out-of-sample AUC.

127
00:08:15,790 --> 00:08:21,630
To do this, we first need
to load the ROCR package.

128
00:08:21,630 --> 00:08:24,040
And then, we'll use
the prediction function

129
00:08:24,040 --> 00:08:28,310
of the ROCR package to
make our predictions.

130
00:08:28,310 --> 00:08:33,440
Let's call the output of that
ROCRpred and use the prediction

131
00:08:33,440 --> 00:08:37,360
function, which takes as a
first argument our predictions,

132
00:08:37,360 --> 00:08:41,610
predictTest, and then as
a second argument the true

133
00:08:41,610 --> 00:08:42,900
outcome, test$TenYearCHD.

134
00:08:47,530 --> 00:08:49,990
Then, we need to type
as.numeric(performance(ROCRpred,

135
00:08:49,990 --> 00:08:50,690
"auc")@y.values).

136
00:09:13,710 --> 00:09:18,060
This will give us the AUC
value on our testing set.

137
00:09:18,060 --> 00:09:22,050
So we have an AUC of
about 74% on our test set,

138
00:09:22,050 --> 00:09:25,210
which means that the model can
differentiate between low risk

139
00:09:25,210 --> 00:09:29,010
patients and high risk
patients pretty well.

140
00:09:29,010 --> 00:09:33,070
As we saw in R, we were able
to build a logistic regression

141
00:09:33,070 --> 00:09:36,520
model with a few
interesting properties.

142
00:09:36,520 --> 00:09:41,440
It rarely predicted
10-year CHD risk above 50%.

143
00:09:41,440 --> 00:09:44,870
So the accuracy of the model
was very close to the baseline

144
00:09:44,870 --> 00:09:46,120
model.

145
00:09:46,120 --> 00:09:49,490
However, the model could
differentiate between low risk

146
00:09:49,490 --> 00:09:52,720
patients and high risk
patients pretty well

147
00:09:52,720 --> 00:09:57,140
with an out-of-sample
AUC of 0.74.

148
00:09:57,140 --> 00:09:59,960
Additionally, some of
the significant variables

149
00:09:59,960 --> 00:10:04,490
suggest possible
interventions to prevent CHD.

150
00:10:04,490 --> 00:10:09,240
We saw that more cigarettes per
day, higher cholesterol, higher

151
00:10:09,240 --> 00:10:12,900
systolic blood pressure,
and higher glucose levels

152
00:10:12,900 --> 00:10:15,480
all increased risk.

153
00:10:15,480 --> 00:10:17,500
Later in the lecture,
we'll discuss

154
00:10:17,500 --> 00:10:19,720
some medical interventions
that are currently

155
00:10:19,720 --> 00:10:22,750
used to prevent CHD.