1 00:00:04,500 --> 00:00:07,940 This plot shows two of our independent variables, 2 00:00:07,940 --> 00:00:11,600 the number of office visits on the x-axis 3 00:00:11,600 --> 00:00:16,260 and the number of narcotics prescribed on the y-axis. 4 00:00:16,260 --> 00:00:20,770 Each point is an observation or a patient in our data set. 5 00:00:20,770 --> 00:00:24,300 The red points are patients who received poor care, 6 00:00:24,300 --> 00:00:25,930 and the green points are patients 7 00:00:25,930 --> 00:00:28,220 who received good care. 8 00:00:28,220 --> 00:00:30,280 It's hard to see a trend in the data 9 00:00:30,280 --> 00:00:32,600 by just visually inspecting it. 10 00:00:32,600 --> 00:00:36,580 But it looks like maybe more office visits and more 11 00:00:36,580 --> 00:00:40,750 narcotics, or data points to the right of this line, 12 00:00:40,750 --> 00:00:43,960 are more likely to have poor care. 13 00:00:43,960 --> 00:00:47,630 Let's see if we can build a logistic regression model in R 14 00:00:47,630 --> 00:00:50,850 to better predict poor care. 15 00:00:50,850 --> 00:00:53,780 In our R console, let's start by reading 16 00:00:53,780 --> 00:00:57,410 our data set into R. Remember to navigate 17 00:00:57,410 --> 00:00:59,250 to the directory on your computer 18 00:00:59,250 --> 00:01:03,100 containing the file quality.csv. 19 00:01:03,100 --> 00:01:05,780 We'll call our data set "quality" 20 00:01:05,780 --> 00:01:09,870 and use the read.csv function to read in the data file 21 00:01:09,870 --> 00:01:10,370 quality.csv. 22 00:01:13,330 --> 00:01:14,850 Let's take a look at the structure 23 00:01:14,850 --> 00:01:18,780 of our data set by using the str function. 24 00:01:18,780 --> 00:01:22,870 We have 131 observations, one for each 25 00:01:22,870 --> 00:01:27,680 of the patients in our data set, and 14 different variables. 26 00:01:27,680 --> 00:01:32,930 MemberID just numbers the patients from 1 to 131. 27 00:01:32,930 --> 00:01:35,870 The 12 variables from InpatientDays 28 00:01:35,870 --> 00:01:40,100 to AcuteDrugGapSmall are the independent variables. 29 00:01:40,100 --> 00:01:43,320 We'll be using the number of office visits 30 00:01:43,320 --> 00:01:46,150 and the number of prescriptions for narcotics 31 00:01:46,150 --> 00:01:47,830 that the patient had. 32 00:01:47,830 --> 00:01:49,970 But you can read descriptions of all 33 00:01:49,970 --> 00:01:53,190 of the independent variables below this video. 34 00:01:53,190 --> 00:01:55,400 After the lecture, try building models 35 00:01:55,400 --> 00:01:58,170 with different subsets of independent variables 36 00:01:58,170 --> 00:02:01,990 to see what the best model is that you can find. 37 00:02:01,990 --> 00:02:06,020 The final variable, PoorCare, is our outcome 38 00:02:06,020 --> 00:02:08,960 or dependent variable and is equal to 1 39 00:02:08,960 --> 00:02:12,150 if the patient had poor care and equal to 0 40 00:02:12,150 --> 00:02:14,630 if the patient had good care. 41 00:02:14,630 --> 00:02:17,910 Let's see how many patients received poor care 42 00:02:17,910 --> 00:02:20,630 and how many patients received good care 43 00:02:20,630 --> 00:02:23,170 by using the table function. 44 00:02:23,170 --> 00:02:26,800 Let's make a table of our outcome variable PoorCare. 45 00:02:29,480 --> 00:02:34,110 We can see that 98 out of the 131 patients in our data set 46 00:02:34,110 --> 00:02:40,050 received good care, or 0, and 33 patients received poor care, 47 00:02:40,050 --> 00:02:42,870 or those labeled with 1. 48 00:02:42,870 --> 00:02:45,680 Before building any models, let's consider 49 00:02:45,680 --> 00:02:48,730 using a simple baseline method. 50 00:02:48,730 --> 00:02:51,050 Last week when we computed the R-squared 51 00:02:51,050 --> 00:02:54,510 for linear regression, we compared our predictions 52 00:02:54,510 --> 00:02:57,960 to the baseline method of predicting the average outcome 53 00:02:57,960 --> 00:03:00,130 for all data points. 54 00:03:00,130 --> 00:03:04,100 In a classification problem, a standard baseline method 55 00:03:04,100 --> 00:03:07,060 is to just predict the most frequent outcome 56 00:03:07,060 --> 00:03:09,070 for all observations. 57 00:03:09,070 --> 00:03:13,040 Since good care is more common than poor care, in this case, 58 00:03:13,040 --> 00:03:14,710 we would predict that all patients 59 00:03:14,710 --> 00:03:16,460 are receiving good care. 60 00:03:16,460 --> 00:03:19,090 If we did this, we would get 98 out 61 00:03:19,090 --> 00:03:22,930 of the 131 observations correct, or have 62 00:03:22,930 --> 00:03:26,730 an accuracy of about 75%. 63 00:03:26,730 --> 00:03:31,210 So our baseline model has an accuracy of 75%. 64 00:03:31,210 --> 00:03:34,220 This is what we'll try to beat with our logistic regression 65 00:03:34,220 --> 00:03:36,630 model. 66 00:03:36,630 --> 00:03:40,020 Last week, we always gave you the training data set 67 00:03:40,020 --> 00:03:43,329 and the testing data set in separate files. 68 00:03:43,329 --> 00:03:46,130 This week, we only have one data set. 69 00:03:46,130 --> 00:03:50,100 So we want to randomly split our data set into a training set 70 00:03:50,100 --> 00:03:53,260 and testing set so that we'll have a test set 71 00:03:53,260 --> 00:03:56,590 to measure our out-of-sample accuracy. 72 00:03:56,590 --> 00:04:00,440 To do this, we need to add a new package to R. 73 00:04:00,440 --> 00:04:04,150 There are many functions and algorithms built into R, 74 00:04:04,150 --> 00:04:06,870 but there are many more that you can install. 75 00:04:06,870 --> 00:04:10,780 We'll do this several times throughout this course. 76 00:04:10,780 --> 00:04:14,230 First, let's install the new package using 77 00:04:14,230 --> 00:04:18,290 the install.packages function and then give 78 00:04:18,290 --> 00:04:21,510 the name of the package we want to install in quotes. 79 00:04:21,510 --> 00:04:24,620 In this case, it's caTools. 80 00:04:24,620 --> 00:04:27,430 If you hit Enter, a window should pop up 81 00:04:27,430 --> 00:04:30,270 asking you to pick a CR AN mirror. 82 00:04:30,270 --> 00:04:33,590 Here, you should pick a location near you. 83 00:04:33,590 --> 00:04:36,270 In my case, I'll pick Pennsylvania 84 00:04:36,270 --> 00:04:37,170 in the United States. 85 00:04:39,860 --> 00:04:41,960 After you've selected a location, 86 00:04:41,960 --> 00:04:45,570 you should see some information pop up in your R console. 87 00:04:45,570 --> 00:04:50,330 It's done when you see the arrow with the blinking cursor again. 88 00:04:50,330 --> 00:04:54,270 Now, we need to load the package into our current R session. 89 00:04:54,270 --> 00:04:57,110 To do this, we'll use the library function. 90 00:04:57,110 --> 00:05:00,080 So type library and then in parentheses 91 00:05:00,080 --> 00:05:03,460 the name of our package, caTools. 92 00:05:03,460 --> 00:05:05,610 Sometimes you'll get a warning message 93 00:05:05,610 --> 00:05:09,040 based on the version of R that you're using. 94 00:05:09,040 --> 00:05:12,200 This can usually safely be ignored. 95 00:05:12,200 --> 00:05:15,490 In the future, whenever you want to use a package, 96 00:05:15,490 --> 00:05:19,720 you won't need to install it, but you will need to load it. 97 00:05:19,720 --> 00:05:23,400 Now, let's use this package to randomly split our data 98 00:05:23,400 --> 00:05:26,250 into a training set and testing set. 99 00:05:26,250 --> 00:05:30,120 We'll do this using a function sample.split which 100 00:05:30,120 --> 00:05:32,740 is part of the caTools package. 101 00:05:32,740 --> 00:05:35,950 Since sample.split randomly splits your data, 102 00:05:35,950 --> 00:05:38,580 it could split it differently for each of us. 103 00:05:38,580 --> 00:05:42,810 To make sure that we all get the same split, we'll set our seed. 104 00:05:42,810 --> 00:05:46,080 This initializes the random number generator. 105 00:05:46,080 --> 00:05:51,130 So type set.seed(88), a number I selected. 106 00:05:53,690 --> 00:05:56,500 Now, let's use sample.split. 107 00:05:56,500 --> 00:06:02,230 Type split = sample.split, and then 108 00:06:02,230 --> 00:06:05,430 in parentheses, we need to give two arguments. 109 00:06:05,430 --> 00:06:11,500 The first is our outcome variable or quality$PoorCare, 110 00:06:11,500 --> 00:06:14,970 and the second argument is the percentage of the data that we 111 00:06:14,970 --> 00:06:18,210 want in the training set. 112 00:06:18,210 --> 00:06:23,120 We type SplitRatio equals, and in this case, 113 00:06:23,120 --> 00:06:27,020 we'll put 75% of the data in the training set, which we'll 114 00:06:27,020 --> 00:06:31,120 use to build the model, and 25% of the data in the testing 115 00:06:31,120 --> 00:06:33,420 set to test our model. 116 00:06:33,420 --> 00:06:36,170 Sample.split randomly splits the data. 117 00:06:36,170 --> 00:06:38,770 But it also makes sure that the outcome variable 118 00:06:38,770 --> 00:06:41,280 is well-balanced in each piece. 119 00:06:41,280 --> 00:06:44,940 We saw earlier that about 75% of our patients 120 00:06:44,940 --> 00:06:46,980 are receiving good care. 121 00:06:46,980 --> 00:06:50,060 This function makes sure that in our training set, 122 00:06:50,060 --> 00:06:53,580 75% of our patients are receiving good care 123 00:06:53,580 --> 00:06:57,010 and in our testing set 75% of our patients 124 00:06:57,010 --> 00:06:59,100 are receiving good care. 125 00:06:59,100 --> 00:07:02,510 We want to do this so that our test set is representative 126 00:07:02,510 --> 00:07:04,450 of our training set. 127 00:07:04,450 --> 00:07:07,810 Let's take a look at split. 128 00:07:07,810 --> 00:07:12,260 There is a TRUE or FALSE value for each of our observations. 129 00:07:12,260 --> 00:07:14,960 TRUE means that we should put that observation 130 00:07:14,960 --> 00:07:17,680 in the training set, and FALSE means 131 00:07:17,680 --> 00:07:21,600 that we should put that observation in the testing set. 132 00:07:21,600 --> 00:07:24,470 So now let's create our training and testing 133 00:07:24,470 --> 00:07:27,680 sets using the subset function. 134 00:07:27,680 --> 00:07:31,590 We'll call our training set qualityTrain 135 00:07:31,590 --> 00:07:35,909 and use the subset function to take a subset of quality 136 00:07:35,909 --> 00:07:38,220 and only taking the observations for which 137 00:07:38,220 --> 00:07:40,940 split is equal to TRUE. 138 00:07:40,940 --> 00:07:44,340 We'll call our testing set qualityTest 139 00:07:44,340 --> 00:07:46,540 and, again, use the subset function 140 00:07:46,540 --> 00:07:48,909 to take the observations of quality, 141 00:07:48,909 --> 00:07:53,409 but this time those for which split is equal to FALSE. 142 00:07:53,409 --> 00:07:56,920 If you look at the number of rows in each of our data sets, 143 00:07:56,920 --> 00:08:00,930 the training set and then the testing set, 144 00:08:00,930 --> 00:08:04,430 you can see that there are 99 observations in the training 145 00:08:04,430 --> 00:08:09,580 set and 32 observations in the testing set. 146 00:08:09,580 --> 00:08:12,380 Now, we are ready to build a logistic regression 147 00:08:12,380 --> 00:08:15,260 model using OfficeVisits and Narcotics 148 00:08:15,260 --> 00:08:17,440 as independent variables. 149 00:08:17,440 --> 00:08:23,110 We'll call our model QualityLog and use the "glm" function 150 00:08:23,110 --> 00:08:26,710 for "generalized linear model" to build 151 00:08:26,710 --> 00:08:28,770 our logistic regression model. 152 00:08:28,770 --> 00:08:30,850 We start by giving the equation we 153 00:08:30,850 --> 00:08:33,960 want to build just like in linear regression. 154 00:08:33,960 --> 00:08:36,490 We start with the dependent variable, and then 155 00:08:36,490 --> 00:08:39,650 the tilde sign, and then the independent variables 156 00:08:39,650 --> 00:08:43,820 we want to use separated by the plus sign. 157 00:08:43,820 --> 00:08:45,890 We then give the name of the data 158 00:08:45,890 --> 00:08:48,290 set we want to use to build the model, 159 00:08:48,290 --> 00:08:51,260 in this case, qualityTrain. 160 00:08:51,260 --> 00:08:55,250 For a logistic regression model, we need one last argument, 161 00:08:55,250 --> 00:08:56,300 which is family=binomial. 162 00:09:01,860 --> 00:09:05,670 This tells the glm function to build a logistic regression 163 00:09:05,670 --> 00:09:07,480 model. 164 00:09:07,480 --> 00:09:10,020 Now, let's look at our model using the summary function. 165 00:09:12,740 --> 00:09:17,910 The output looks similar to that of a linear regression model. 166 00:09:17,910 --> 00:09:21,510 What we want to focus on is the coefficients table. 167 00:09:21,510 --> 00:09:24,360 This gives the estimate values for the coefficients, 168 00:09:24,360 --> 00:09:27,580 or the betas, for our logistic regression model. 169 00:09:27,580 --> 00:09:30,990 We see here that the coefficients for OfficeVisits 170 00:09:30,990 --> 00:09:34,060 and Narcotics are both positive, which 171 00:09:34,060 --> 00:09:37,330 means that higher values in these two variables 172 00:09:37,330 --> 00:09:39,890 are indicative of poor care as we 173 00:09:39,890 --> 00:09:42,620 suspected from looking at the data. 174 00:09:42,620 --> 00:09:44,990 We also see that both of these variables 175 00:09:44,990 --> 00:09:47,500 have at least one star, meaning that they're 176 00:09:47,500 --> 00:09:50,400 significant in our model. 177 00:09:50,400 --> 00:09:52,660 The last thing we want to look at in the output 178 00:09:52,660 --> 00:09:55,140 is the AIC value. 179 00:09:55,140 --> 00:09:57,900 This is a measure of the quality of the model 180 00:09:57,900 --> 00:10:00,880 and is like Adjusted R-squared in that it accounts 181 00:10:00,880 --> 00:10:03,660 for the number of variables used compared 182 00:10:03,660 --> 00:10:05,970 to the number of observations. 183 00:10:05,970 --> 00:10:08,300 Unfortunately, it can only be compared 184 00:10:08,300 --> 00:10:11,230 between models on the same data set. 185 00:10:11,230 --> 00:10:14,760 But it provides a means for model selection. 186 00:10:14,760 --> 00:10:19,170 The preferred model is the one with the minimum AIC. 187 00:10:19,170 --> 00:10:22,790 Now, let's make predictions on the training set. 188 00:10:22,790 --> 00:10:27,800 We'll call them predictTrain and use the predict function 189 00:10:27,800 --> 00:10:32,120 to make predictions using the model QualityLog, 190 00:10:32,120 --> 00:10:33,920 and we'll give a second argument, 191 00:10:33,920 --> 00:10:34,960 which is type="response". 192 00:10:39,350 --> 00:10:43,850 This tells the predict function to give us probabilities. 193 00:10:43,850 --> 00:10:46,090 Let's take a look at the statistical summary 194 00:10:46,090 --> 00:10:46,880 of our predictions. 195 00:10:51,780 --> 00:10:53,740 Since we're expecting probabilities, 196 00:10:53,740 --> 00:10:57,160 all of the numbers should be between zero and one. 197 00:10:57,160 --> 00:11:01,090 And we see that the minimum value is about 0.07 198 00:11:01,090 --> 00:11:04,730 and the maximum value is 0.98. 199 00:11:04,730 --> 00:11:07,410 Let's see if we're predicting higher probabilities 200 00:11:07,410 --> 00:11:11,430 for the actual poor care cases as we expect. 201 00:11:11,430 --> 00:11:15,740 To do this, use the tapply function, 202 00:11:15,740 --> 00:11:25,140 giving as arguments predictTrain and then QualityTrain$PoorCare 203 00:11:25,140 --> 00:11:26,460 and then mean. 204 00:11:26,460 --> 00:11:28,980 This will compute the average prediction 205 00:11:28,980 --> 00:11:31,770 for each of the true outcomes. 206 00:11:31,770 --> 00:11:35,720 So we see that for all of the true poor care cases, 207 00:11:35,720 --> 00:11:39,750 we predict an average probability of about 0.44. 208 00:11:39,750 --> 00:11:42,110 And all of the true good care cases, 209 00:11:42,110 --> 00:11:46,020 we predict an average probability of about 0.19. 210 00:11:46,020 --> 00:11:47,620 So this is a good sign, because it 211 00:11:47,620 --> 00:11:49,140 looks like we're predicting a higher 212 00:11:49,140 --> 00:11:53,010 probability for the actual poor care cases. 213 00:11:53,010 --> 00:11:55,730 Now that we have our model, in the next video, 214 00:11:55,730 --> 00:11:59,460 we'll discuss how to assess the accuracy of our predictions.