1 00:00:04,500 --> 00:00:07,640 Now that we have identified a set of risk factors, 2 00:00:07,640 --> 00:00:12,720 let's use this data to predict the 10 year risk of CHD. 3 00:00:12,720 --> 00:00:15,190 First, we'll randomly split our patients 4 00:00:15,190 --> 00:00:18,420 into a training set and a testing set. 5 00:00:18,420 --> 00:00:20,830 Then, we'll use logistic regression 6 00:00:20,830 --> 00:00:24,790 to predict whether or not a patient experienced CHD 7 00:00:24,790 --> 00:00:28,640 within 10 years of the first examination. 8 00:00:28,640 --> 00:00:30,800 Keep in mind that all of the risk factors 9 00:00:30,800 --> 00:00:35,780 were collected at the first examination of the patients. 10 00:00:35,780 --> 00:00:39,320 After building our model, we'll evaluate the predictive power 11 00:00:39,320 --> 00:00:41,850 of the model on the test set. 12 00:00:41,850 --> 00:00:46,730 Let's go to R and create our logistic regression model. 13 00:00:46,730 --> 00:00:51,220 In our R console, we'll call our data set framingham 14 00:00:51,220 --> 00:00:56,200 and use the read.csv function to read in the data file 15 00:00:56,200 --> 00:00:56,910 "framingham.csv". 16 00:00:59,760 --> 00:01:02,320 Remember to navigate to the directory containing 17 00:01:02,320 --> 00:01:07,930 the file "framingham.csv" before reading in the data. 18 00:01:07,930 --> 00:01:10,960 Let's take a look at our data by using the str function. 19 00:01:14,720 --> 00:01:20,800 We have data for 4,240 patients and 16 variables. 20 00:01:20,800 --> 00:01:24,470 We have the demographic risk factors male, age, 21 00:01:24,470 --> 00:01:27,240 and education; the behavioral risk 22 00:01:27,240 --> 00:01:30,880 factors currentSmoker and cigsPerDay; 23 00:01:30,880 --> 00:01:36,840 the medical history risk factors BPMeds, prevalentStroke, 24 00:01:36,840 --> 00:01:41,500 prevalentHyp, and diabetes; and the physical exam risk 25 00:01:41,500 --> 00:01:50,600 factors totChol, sysBP, diaBP, BMI, heartRate, and glucose 26 00:01:50,600 --> 00:01:51,890 level. 27 00:01:51,890 --> 00:01:55,940 The last variable is the outcome or dependent variable, 28 00:01:55,940 --> 00:01:57,940 whether or not the patient developed 29 00:01:57,940 --> 00:02:01,360 CHD in the next 10 years. 30 00:02:01,360 --> 00:02:05,760 Now let's split our data into a training set and a testing set 31 00:02:05,760 --> 00:02:10,320 using sample.split like we did in the previous lecture. 32 00:02:10,320 --> 00:02:13,380 We first need to load the library caTools. 33 00:02:17,660 --> 00:02:21,579 Now, let's set our seed and create our split. 34 00:02:21,579 --> 00:02:25,829 We'll start by setting our seed to 1000, 35 00:02:25,829 --> 00:02:31,230 and then use the sample.split function to create the split. 36 00:02:35,170 --> 00:02:38,140 The first argument is the outcome variable 37 00:02:38,140 --> 00:02:39,060 framingham$TenYearCHD. 38 00:02:43,970 --> 00:02:46,890 And the second argument is the percentage of data 39 00:02:46,890 --> 00:02:50,880 that we want in the training set or the SplitRatio. 40 00:02:50,880 --> 00:02:55,620 Here, we'll put 65% of the data in the training set. 41 00:02:55,620 --> 00:02:58,300 When you have more data like we do here, 42 00:02:58,300 --> 00:03:01,410 you can afford to put less data in the training set 43 00:03:01,410 --> 00:03:03,650 and more in the testing set. 44 00:03:03,650 --> 00:03:05,610 This will increase our confidence 45 00:03:05,610 --> 00:03:08,480 in the ability of the model to extend to new data 46 00:03:08,480 --> 00:03:11,340 since we have a larger test set, and still 47 00:03:11,340 --> 00:03:13,310 give us enough data in the training set 48 00:03:13,310 --> 00:03:15,060 to create our model. 49 00:03:15,060 --> 00:03:19,000 You typically want to put somewhere between 50% and 80% 50 00:03:19,000 --> 00:03:22,130 of the data in the training set. 51 00:03:22,130 --> 00:03:25,040 Now, let's split up our data using subset. 52 00:03:25,040 --> 00:03:28,190 We'll call our training set "train" 53 00:03:28,190 --> 00:03:32,400 and use the subset function to take a subset of framingham 54 00:03:32,400 --> 00:03:37,310 and take the observations for which split is equal to TRUE. 55 00:03:37,310 --> 00:03:39,730 We'll call our testing set "test" 56 00:03:39,730 --> 00:03:41,890 and again use the subset function 57 00:03:41,890 --> 00:03:44,690 to take a subset of framingham and take 58 00:03:44,690 --> 00:03:47,400 the observations for which split equals FALSE. 59 00:03:51,120 --> 00:03:53,780 Now we're ready to build our logistic regression 60 00:03:53,780 --> 00:03:56,280 model using the training set. 61 00:03:56,280 --> 00:04:02,340 Let's call it framinghamLog, and we'll use the glm function 62 00:04:02,340 --> 00:04:04,160 like we did in the previous lecture 63 00:04:04,160 --> 00:04:07,800 to create a logistic regression model. 64 00:04:07,800 --> 00:04:09,510 We'll use a nice little trick here 65 00:04:09,510 --> 00:04:11,730 where we predict our dependent variable 66 00:04:11,730 --> 00:04:14,450 using all of the other variables in the data set 67 00:04:14,450 --> 00:04:16,810 as independent variables. 68 00:04:16,810 --> 00:04:20,120 First, type the name of the dependent variable, 69 00:04:20,120 --> 00:04:25,910 TenYearCHD, followed by the tilde and then a period. 70 00:04:28,560 --> 00:04:30,800 This will use all of the other variables 71 00:04:30,800 --> 00:04:33,610 in the data set as independent variables 72 00:04:33,610 --> 00:04:36,260 and is used in place of listing out 73 00:04:36,260 --> 00:04:39,150 all of the independent variables' names separated 74 00:04:39,150 --> 00:04:41,420 by the plus sign. 75 00:04:41,420 --> 00:04:43,340 Be careful doing this with data sets 76 00:04:43,340 --> 00:04:46,380 that have identifying variables like a patient ID 77 00:04:46,380 --> 00:04:48,780 or name since you wouldn't want to use 78 00:04:48,780 --> 00:04:52,380 these as independent variables. 79 00:04:52,380 --> 00:04:55,100 Following the period, we need to give the argument 80 00:04:55,100 --> 00:04:59,409 that defines the data set to use, data = train. 81 00:04:59,409 --> 00:05:02,890 And then, the final argument for a logistic regression model 82 00:05:02,890 --> 00:05:04,820 is family = binomial. 83 00:05:07,450 --> 00:05:09,500 Let's take a look at the summary of our model. 84 00:05:17,810 --> 00:05:22,860 It looks like male, age, prevalent stroke, 85 00:05:22,860 --> 00:05:27,090 total cholesterol, systolic blood pressure, and glucose 86 00:05:27,090 --> 00:05:29,960 are all significant in our model. 87 00:05:29,960 --> 00:05:32,680 Cigarettes per day and prevalent hypertension 88 00:05:32,680 --> 00:05:35,159 are almost significant. 89 00:05:35,159 --> 00:05:39,220 All of significant variables have positive coefficients, 90 00:05:39,220 --> 00:05:42,070 meaning that higher values in these variables 91 00:05:42,070 --> 00:05:44,240 contribute to a higher probability 92 00:05:44,240 --> 00:05:47,850 of 10-year coronary heart disease. 93 00:05:47,850 --> 00:05:50,540 Now, let's use this model to make predictions 94 00:05:50,540 --> 00:05:51,950 on our test set. 95 00:05:51,950 --> 00:05:56,260 We'll call our predictions predictTest 96 00:05:56,260 --> 00:05:58,370 and use the predict function, which 97 00:05:58,370 --> 00:06:03,190 takes as arguments the name of our model, framinghamLog, 98 00:06:03,190 --> 00:06:08,790 then type = "response", which gives us probabilities, 99 00:06:08,790 --> 00:06:16,670 and lastly newdata = test, the name of our testing set. 100 00:06:16,670 --> 00:06:19,770 Now, let's use a threshold value of 0.5 101 00:06:19,770 --> 00:06:22,360 to create a confusion matrix. 102 00:06:22,360 --> 00:06:26,580 We'll use the table function and give as the first argument, 103 00:06:26,580 --> 00:06:33,590 the actual values, test$TenYearCHD, 104 00:06:33,590 --> 00:06:36,990 and then as the second argument our predictions, 105 00:06:36,990 --> 00:06:42,190 predictTest > 0.5. 106 00:06:42,190 --> 00:06:46,600 With a threshold of 0.5, we predict an outcome of 1, 107 00:06:46,600 --> 00:06:49,630 the true column, very rarely. 108 00:06:49,630 --> 00:06:53,610 This means that our model rarely predicts a 10-year CHD 109 00:06:53,610 --> 00:06:56,770 risk above 50%. 110 00:06:56,770 --> 00:06:59,690 What is the accuracy of this model? 111 00:06:59,690 --> 00:07:05,280 Well, it's the sum of the cases we get right, 1069 plus 11, 112 00:07:05,280 --> 00:07:08,580 divided by the total number of observations in our data 113 00:07:08,580 --> 00:07:15,750 set, 1069 + 6 + 187 + 11. 114 00:07:15,750 --> 00:07:21,720 So the accuracy of our model is about 84.8%. 115 00:07:21,720 --> 00:07:24,840 We need to compare this to the accuracy of a simple baseline 116 00:07:24,840 --> 00:07:26,060 method. 117 00:07:26,060 --> 00:07:28,900 The more frequent outcome in this case is 0, 118 00:07:28,900 --> 00:07:34,680 so the baseline method would always predict 0 or no CHD. 119 00:07:34,680 --> 00:07:40,290 This baseline method would get an accuracy of 1069 120 00:07:40,290 --> 00:07:46,710 + 6-- this is the total number of true negative cases-- 121 00:07:46,710 --> 00:07:50,380 divided by the total number of observations in our data 122 00:07:50,380 --> 00:07:57,130 set, 1069 + 6 + 187 + 11. 123 00:07:57,130 --> 00:08:03,590 So the baseline model would get an accuracy of about 84.4%. 124 00:08:03,590 --> 00:08:08,330 So our model barely beats the baseline in terms of accuracy. 125 00:08:08,330 --> 00:08:12,490 But do we still have a valuable model by varying the threshold? 126 00:08:12,490 --> 00:08:15,790 Let's compute the out-of-sample AUC. 127 00:08:15,790 --> 00:08:21,630 To do this, we first need to load the ROCR package. 128 00:08:21,630 --> 00:08:24,040 And then, we'll use the prediction function 129 00:08:24,040 --> 00:08:28,310 of the ROCR package to make our predictions. 130 00:08:28,310 --> 00:08:33,440 Let's call the output of that ROCRpred and use the prediction 131 00:08:33,440 --> 00:08:37,360 function, which takes as a first argument our predictions, 132 00:08:37,360 --> 00:08:41,610 predictTest, and then as a second argument the true 133 00:08:41,610 --> 00:08:42,900 outcome, test$TenYearCHD. 134 00:08:47,530 --> 00:08:49,990 Then, we need to type as.numeric(performance(ROCRpred, 135 00:08:49,990 --> 00:08:50,690 "auc")@y.values). 136 00:09:13,710 --> 00:09:18,060 This will give us the AUC value on our testing set. 137 00:09:18,060 --> 00:09:22,050 So we have an AUC of about 74% on our test set, 138 00:09:22,050 --> 00:09:25,210 which means that the model can differentiate between low risk 139 00:09:25,210 --> 00:09:29,010 patients and high risk patients pretty well. 140 00:09:29,010 --> 00:09:33,070 As we saw in R, we were able to build a logistic regression 141 00:09:33,070 --> 00:09:36,520 model with a few interesting properties. 142 00:09:36,520 --> 00:09:41,440 It rarely predicted 10-year CHD risk above 50%. 143 00:09:41,440 --> 00:09:44,870 So the accuracy of the model was very close to the baseline 144 00:09:44,870 --> 00:09:46,120 model. 145 00:09:46,120 --> 00:09:49,490 However, the model could differentiate between low risk 146 00:09:49,490 --> 00:09:52,720 patients and high risk patients pretty well 147 00:09:52,720 --> 00:09:57,140 with an out-of-sample AUC of 0.74. 148 00:09:57,140 --> 00:09:59,960 Additionally, some of the significant variables 149 00:09:59,960 --> 00:10:04,490 suggest possible interventions to prevent CHD. 150 00:10:04,490 --> 00:10:09,240 We saw that more cigarettes per day, higher cholesterol, higher 151 00:10:09,240 --> 00:10:12,900 systolic blood pressure, and higher glucose levels 152 00:10:12,900 --> 00:10:15,480 all increased risk. 153 00:10:15,480 --> 00:10:17,500 Later in the lecture, we'll discuss 154 00:10:17,500 --> 00:10:19,720 some medical interventions that are currently 155 00:10:19,720 --> 00:10:22,750 used to prevent CHD.