1 00:00:04,500 --> 00:00:08,310 Let's now see how the baseline method used by D2Hawkeye 2 00:00:08,310 --> 00:00:11,250 would perform on this data set. 3 00:00:11,250 --> 00:00:13,140 The baseline method would predict 4 00:00:13,140 --> 00:00:16,320 that the cost bucket for a patient in 2009 5 00:00:16,320 --> 00:00:19,860 will be the same as it was in 2008. 6 00:00:19,860 --> 00:00:23,680 So let's create a classification matrix to compute the accuracy 7 00:00:23,680 --> 00:00:26,830 for the baseline method on the test set. 8 00:00:26,830 --> 00:00:31,800 So we'll use the table function, where the actual outcomes are 9 00:00:31,800 --> 00:00:42,110 ClaimsTest$bucket2009, and our predictions are 10 00:00:42,110 --> 00:00:53,530 ClaimsTest$bucket2008. 11 00:00:53,530 --> 00:00:57,570 The accuracy is the sum of the diagonal, the observations that 12 00:00:57,570 --> 00:01:00,030 were classified correctly, divided 13 00:01:00,030 --> 00:01:03,850 by the total number of observations in our test set. 14 00:01:03,850 --> 00:01:26,780 So we want to add up 110138 + 10721 + 2774 + 1539 + 104. 15 00:01:26,780 --> 00:01:29,039 And we want to divide by the total number 16 00:01:29,039 --> 00:01:32,960 of observations in this table, or the number of rows 17 00:01:32,960 --> 00:01:33,640 in ClaimsTest. 18 00:01:39,380 --> 00:01:44,380 So the accuracy of the baseline method is 0.68. 19 00:01:44,380 --> 00:01:46,970 Now how about the penalty error? 20 00:01:46,970 --> 00:01:50,729 To compute this, we need to first create a penalty matrix 21 00:01:50,729 --> 00:01:53,460 in R. Keep in mind that we'll put 22 00:01:53,460 --> 00:01:57,560 the actual outcomes on the left, and the predicted outcomes 23 00:01:57,560 --> 00:01:59,210 on the top. 24 00:01:59,210 --> 00:02:05,460 So we'll call it PenaltyMatrix, which 25 00:02:05,460 --> 00:02:10,289 will be equal to a matrix object in R. 26 00:02:10,289 --> 00:02:12,020 And then we need to give the numbers 27 00:02:12,020 --> 00:02:18,650 that should fill up the matrix: 0, 1, 2, 3, 4. 28 00:02:18,650 --> 00:02:21,740 That'll be the first row. 29 00:02:21,740 --> 00:02:26,590 And then 2, 0, 1, 2, 3. 30 00:02:26,590 --> 00:02:28,800 That'll be the second row. 31 00:02:28,800 --> 00:02:34,170 4, 2, 0, 1, 2 for the third row. 32 00:02:34,170 --> 00:02:39,110 6, 4, 2, 0, 1 for the fourth row. 33 00:02:39,110 --> 00:02:45,720 And finally, 8, 6, 4, 2, 0 for the fifth row. 34 00:02:45,720 --> 00:02:49,130 And then after the parentheses, type a comma, 35 00:02:49,130 --> 00:03:00,420 and then byrow = TRUE, and then add nrow = 5. 36 00:03:00,420 --> 00:03:03,360 Close the parentheses, and hit Enter. 37 00:03:03,360 --> 00:03:05,400 So what did we just create? 38 00:03:05,400 --> 00:03:09,110 Type PenaltyMatrix and hit Enter. 39 00:03:12,260 --> 00:03:15,700 So with the previous command, we filled up our matrix row 40 00:03:15,700 --> 00:03:17,620 by row. 41 00:03:17,620 --> 00:03:20,160 The actual outcomes are on the left, 42 00:03:20,160 --> 00:03:22,950 and the predicted outcomes are on the top. 43 00:03:22,950 --> 00:03:25,810 So as we saw in the slides, the worst outcomes 44 00:03:25,810 --> 00:03:29,160 are when we predict a low cost bucket, 45 00:03:29,160 --> 00:03:33,110 but the actual outcome is a high cost bucket. 46 00:03:33,110 --> 00:03:34,900 We still give ourselves a penalty 47 00:03:34,900 --> 00:03:37,670 when we predict a high cost bucket 48 00:03:37,670 --> 00:03:43,090 and it's actually a low cost bucket, but it's not as bad. 49 00:03:43,090 --> 00:03:46,820 So now to compute the penalty error of the baseline method, 50 00:03:46,820 --> 00:03:49,570 we can multiply our classification matrix 51 00:03:49,570 --> 00:03:51,700 by the penalty matrix. 52 00:03:51,700 --> 00:03:54,530 So go ahead and hit the Up arrow to get back 53 00:03:54,530 --> 00:03:56,390 to where you created the classification 54 00:03:56,390 --> 00:03:59,210 matrix with the table function. 55 00:03:59,210 --> 00:04:02,950 And we're going to surround the entire table function 56 00:04:02,950 --> 00:04:08,150 by as.matrix to convert it to a matrix 57 00:04:08,150 --> 00:04:11,610 so that we can multiply it by our penalty matrix. 58 00:04:11,610 --> 00:04:14,730 So now at the end, close the parentheses 59 00:04:14,730 --> 00:04:21,220 and then multiply by PenaltyMatrix and hit Enter. 60 00:04:21,220 --> 00:04:24,100 So what this does is it takes each number 61 00:04:24,100 --> 00:04:27,160 in the classification matrix and multiplies it 62 00:04:27,160 --> 00:04:31,170 by the corresponding number in the penalty matrix. 63 00:04:31,170 --> 00:04:33,360 So now to compute the penalty error, 64 00:04:33,360 --> 00:04:35,750 we just need to sum it up and divide 65 00:04:35,750 --> 00:04:38,930 by the number of observations in our test set. 66 00:04:38,930 --> 00:04:41,740 So scroll up once, and then we'll 67 00:04:41,740 --> 00:04:44,760 just surround our entire previous command 68 00:04:44,760 --> 00:04:45,740 by the sum function. 69 00:04:51,960 --> 00:05:00,400 And we'll divide by the number of rows in ClaimsTest 70 00:05:00,400 --> 00:05:02,150 and hit Enter. 71 00:05:02,150 --> 00:05:06,850 So the penalty error for the baseline method is 0.74. 72 00:05:06,850 --> 00:05:09,180 In the next video, our goal will be 73 00:05:09,180 --> 00:05:14,790 to create a CART model that has an accuracy higher than 68% 74 00:05:14,790 --> 00:05:18,290 and a penalty error lower than 0.74.