1 00:00:04,500 --> 00:00:08,920 Let us examine how to interpret the model we developed. 2 00:00:08,920 --> 00:00:11,450 One of the things we should look after 3 00:00:11,450 --> 00:00:14,770 is that there might be what is called multicollinearity. 4 00:00:14,770 --> 00:00:18,240 Multicollinearity occurs when the various independent 5 00:00:18,240 --> 00:00:21,280 variables are correlated, and this 6 00:00:21,280 --> 00:00:26,510 might confuse the coefficients-- the betas-- in the model. 7 00:00:26,510 --> 00:00:32,360 So tests to address that involve checking the correlations 8 00:00:32,360 --> 00:00:35,200 of independent variables. 9 00:00:35,200 --> 00:00:38,250 If they are excessively high, this 10 00:00:38,250 --> 00:00:40,700 would mean that there might be multicollinearity, 11 00:00:40,700 --> 00:00:44,230 and you have to potentially revisit the model, 12 00:00:44,230 --> 00:00:48,380 as well as whether the signs of the coefficients make sense. 13 00:00:48,380 --> 00:00:52,720 Is the coefficient beta positive or negative? 14 00:00:52,720 --> 00:00:58,250 If it agrees with intuition, then multicollinearity 15 00:00:58,250 --> 00:01:02,830 has not been a problem, but if intuition suggests 16 00:01:02,830 --> 00:01:09,400 a different sign, this might be a sign of multicollinearity. 17 00:01:09,400 --> 00:01:13,840 The next important element is significance. 18 00:01:13,840 --> 00:01:16,180 So how do we interpret the results, 19 00:01:16,180 --> 00:01:23,110 and how do we understand whether we have a good model or not? 20 00:01:23,110 --> 00:01:26,130 For that purpose, let's take a look 21 00:01:26,130 --> 00:01:30,490 at what is called Area Under the Curve, or AUC for short. 22 00:01:30,490 --> 00:01:36,150 So the Area Under the Curve shows an absolute measure 23 00:01:36,150 --> 00:01:41,450 of quality of prediction-- in this particular case, 77.5%, 24 00:01:41,450 --> 00:01:47,470 which means that, given that the perfect score is 100%, 25 00:01:47,470 --> 00:01:52,920 so this is like a B, whereas, as we'll see later, 26 00:01:52,920 --> 00:01:56,100 a 50% score, which is pure guessing, 27 00:01:56,100 --> 00:02:05,090 is a 50% rate of success. 28 00:02:05,090 --> 00:02:10,090 So the area under the curve gives an absolute measure 29 00:02:10,090 --> 00:02:16,090 of quality, and it's less affected by various benchmarks. 30 00:02:19,560 --> 00:02:22,870 So it illustrates how accurate the model 31 00:02:22,870 --> 00:02:26,790 is on a more absolute sense. 32 00:02:26,790 --> 00:02:28,840 So what is a good AUC? 33 00:02:28,840 --> 00:02:33,100 The area on the right shows the maximum possible 34 00:02:33,100 --> 00:02:41,480 of a perfect prediction, whereas the area on this 35 00:02:41,480 --> 00:02:46,970 curve now-- it is 0.5, and it's pure guessing. 36 00:02:46,970 --> 00:02:52,390 Other outcome measures that are important for us to discuss 37 00:02:52,390 --> 00:02:55,510 is the so-called confusion matrix. 38 00:02:55,510 --> 00:03:01,430 So the matrix here is formulas for the various terms we use. 39 00:03:04,870 --> 00:03:09,470 The actual class is 0-- means, in our example, 40 00:03:09,470 --> 00:03:12,430 good quality of care, and actual class = 1 41 00:03:12,430 --> 00:03:17,630 means poor quality of care, whereas the predicted class = 42 00:03:17,630 --> 00:03:20,420 0 means that will predict good quality, 43 00:03:20,420 --> 00:03:25,810 and the predicted class = 1 mean that we predict poor quality. 44 00:03:25,810 --> 00:03:29,210 So we define true negatives, short by TN. 45 00:03:29,210 --> 00:03:32,470 False positives, short by FP. 46 00:03:32,470 --> 00:03:36,660 False negatives, FN, and true positives by TP. 47 00:03:36,660 --> 00:03:38,990 So if N is the number of observations, 48 00:03:38,990 --> 00:03:42,420 the overall accuracy is basically 49 00:03:42,420 --> 00:03:47,079 the number of true negatives and true positives divided by N. 50 00:03:47,079 --> 00:03:51,820 It's basically the terms in the diagonal of this two 51 00:03:51,820 --> 00:03:54,730 by two matrix divided by the total observations. 52 00:03:54,730 --> 00:03:59,829 The overall error rate is the terms off-diagonal-- 53 00:03:59,829 --> 00:04:02,720 the false positives, plus the false negatives, divided 54 00:04:02,720 --> 00:04:04,790 by the total number of observations. 55 00:04:04,790 --> 00:04:10,370 That's the overall measure of an error rate. 56 00:04:10,370 --> 00:04:14,050 An important component is the so-called sensitivity, 57 00:04:14,050 --> 00:04:19,079 and sensitivity is TP, the true positives, whenever 58 00:04:19,079 --> 00:04:22,650 we predict poor quality, and indeed it 59 00:04:22,650 --> 00:04:30,120 is poor quality, divided by TP, these true positives, plus FN, 60 00:04:30,120 --> 00:04:37,630 which is the total number of cases of poor quality. 61 00:04:37,630 --> 00:04:39,850 So this is the total number of times 62 00:04:39,850 --> 00:04:45,590 that we predict poor quality, and it is, indeed, 63 00:04:45,590 --> 00:04:52,430 poor quality, versus the total number of times 64 00:04:52,430 --> 00:04:57,830 the actual quality is, in fact, poor. 65 00:04:57,830 --> 00:05:04,630 False negative rate is FN, the number of false negatives, 66 00:05:04,630 --> 00:05:10,500 divided by the number of true positives, 67 00:05:10,500 --> 00:05:12,740 plus the number of false negatives. 68 00:05:12,740 --> 00:05:19,550 And specificity is TN, true negatives, 69 00:05:19,550 --> 00:05:22,670 the number of times we predict the quality is good, 70 00:05:22,670 --> 00:05:24,810 and, in fact, the quality is good, 71 00:05:24,810 --> 00:05:31,540 divided by this number, TN, plus false positives. 72 00:05:31,540 --> 00:05:35,900 So specificity is the number of times 73 00:05:35,900 --> 00:05:40,780 we predict the quality is good, and it is indeed good, 74 00:05:40,780 --> 00:05:46,909 versus the total times we have good quality, 75 00:05:46,909 --> 00:05:51,930 and the false positive error rate is 76 00:05:51,930 --> 00:05:54,060 1 minus the specificity. 77 00:05:58,570 --> 00:06:02,060 So in this particular example that we have discussed, 78 00:06:02,060 --> 00:06:05,730 quality of care, just like in linear regression, 79 00:06:05,730 --> 00:06:09,010 we want to make predictions on a test set 80 00:06:09,010 --> 00:06:10,780 to compute out-of-sample metrics. 81 00:06:10,780 --> 00:06:17,850 We develop the logistic regression model using data, 82 00:06:17,850 --> 00:06:21,440 but would like to make predictions out-of-sample. 83 00:06:21,440 --> 00:06:29,840 So in our test, we utilized 32 cases, 84 00:06:29,840 --> 00:06:38,900 and the R command that makes the statements 85 00:06:38,900 --> 00:06:41,560 about the quality of a prediction out-of-sample 86 00:06:41,560 --> 00:06:44,930 is illustrated here in the slide. 87 00:06:44,930 --> 00:06:46,810 So in that way, we make predictions 88 00:06:46,810 --> 00:06:48,630 about probabilities, of course, simply 89 00:06:48,630 --> 00:06:52,380 because logistic regression makes predictions 90 00:06:52,380 --> 00:06:56,400 about probabilities, and then we transform them 91 00:06:56,400 --> 00:06:58,560 to a binary outcome-- the quality is good, 92 00:06:58,560 --> 00:07:01,700 or the quality is poor-- using a threshold. 93 00:07:01,700 --> 00:07:05,760 In this particular example, we used a threshold value of 0.3, 94 00:07:05,760 --> 00:07:10,060 and in doing so, we obtain the following confusion matrix. 95 00:07:10,060 --> 00:07:14,170 So there were, as I mentioned, there are 32 cases, 96 00:07:14,170 --> 00:07:18,830 out of which 24 of them are actually good care, 97 00:07:18,830 --> 00:07:22,490 and eight of them are actually poor care. 98 00:07:22,490 --> 00:07:27,310 We observe that the overall accuracy of the model 99 00:07:27,310 --> 00:07:33,000 is 19 plus 6, is 25, over 32. 100 00:07:33,000 --> 00:07:41,390 The false positive rate is, in this case, 5 over 24, 101 00:07:41,390 --> 00:07:49,650 19 plus 5, whereas the true positive rate is 6 out of 8-- 6 102 00:07:49,650 --> 00:07:51,540 plus 2. 103 00:07:51,540 --> 00:07:56,540 Notice, if you compare this model with making always-- 104 00:07:56,540 --> 00:08:00,370 let's say one alternative is to say 105 00:08:00,370 --> 00:08:04,010 we predict good care all the time. 106 00:08:04,010 --> 00:08:07,720 In that situation, we will be correct 19 107 00:08:07,720 --> 00:08:13,620 plus 5, 24 times, versus 25 times, in our case. 108 00:08:13,620 --> 00:08:19,520 But notice that predicting always good care 109 00:08:19,520 --> 00:08:24,920 does not capture the dynamics of what is happening, 110 00:08:24,920 --> 00:08:28,250 versus the logistic regression model that 111 00:08:28,250 --> 00:08:32,390 is far more intelligent in capturing these effects.