1 00:00:04,500 --> 00:00:08,090 In the next few videos, we'll be using a data set published 2 00:00:08,090 --> 00:00:12,530 by the United States Centers for Medicare and Medicaid Services 3 00:00:12,530 --> 00:00:16,990 to practice creating CART models to predict health care cost. 4 00:00:16,990 --> 00:00:20,400 We unfortunately can't use the D2Hawkeye data 5 00:00:20,400 --> 00:00:22,460 due to privacy issues. 6 00:00:22,460 --> 00:00:24,700 The data set we'll be using instead, 7 00:00:24,700 --> 00:00:30,300 ClaimsData.csv, is structured to represent a sample of patients 8 00:00:30,300 --> 00:00:32,870 in the Medicare program, which provides 9 00:00:32,870 --> 00:00:37,080 health insurance to Americans aged 65 and older, 10 00:00:37,080 --> 00:00:39,610 as well as some younger people with certain medical 11 00:00:39,610 --> 00:00:41,070 conditions. 12 00:00:41,070 --> 00:00:44,130 To protect the privacy of patients represented 13 00:00:44,130 --> 00:00:47,760 in this publicly available data set, a number of steps 14 00:00:47,760 --> 00:00:50,440 are performed to anonymize the data. 15 00:00:50,440 --> 00:00:53,180 So we would need to retrain the models we develop 16 00:00:53,180 --> 00:00:56,060 in this lecture on de-anonymized data 17 00:00:56,060 --> 00:00:59,510 if we wanted to apply our models in the real world. 18 00:00:59,510 --> 00:01:02,580 Let's start by reading our data set into R 19 00:01:02,580 --> 00:01:05,010 and taking a look at its structure. 20 00:01:05,010 --> 00:01:09,120 We'll call our data set Claims, and we'll 21 00:01:09,120 --> 00:01:14,750 use the read.csv function to read in the data file 22 00:01:14,750 --> 00:01:15,380 ClaimsData.csv. 23 00:01:21,590 --> 00:01:24,420 Make sure to navigate to the directory on your computer 24 00:01:24,420 --> 00:01:29,390 containing the file ClaimsData.csv first. 25 00:01:29,390 --> 00:01:32,310 Now let's take a look at the structure of our data frame 26 00:01:32,310 --> 00:01:33,870 using the str function. 27 00:01:36,920 --> 00:01:40,710 The observations represent a 1% random sample 28 00:01:40,710 --> 00:01:43,410 of Medicare beneficiaries, limited 29 00:01:43,410 --> 00:01:47,360 to those still alive at the end of 2008. 30 00:01:47,360 --> 00:01:50,560 Our independent variables are from 2008, 31 00:01:50,560 --> 00:01:54,590 and we will be predicting cost in 2009. 32 00:01:54,590 --> 00:01:58,450 Our independent variables are the patient's age 33 00:01:58,450 --> 00:02:03,570 in years at the end of 2008, and then several binary variables 34 00:02:03,570 --> 00:02:05,680 indicating whether or not the patient had 35 00:02:05,680 --> 00:02:08,590 diagnosis codes for a particular disease 36 00:02:08,590 --> 00:02:16,020 or related disorder in 2008: alzheimers, arthritis, cancer, 37 00:02:16,020 --> 00:02:21,730 chronic obstructive pulmonary disease, or copd, depression, 38 00:02:21,730 --> 00:02:25,970 diabetes, heart.failure, ischemic heart disease, 39 00:02:25,970 --> 00:02:33,290 or ihd, kidney disease, osteoporosis, and stroke. 40 00:02:33,290 --> 00:02:36,940 Each of these variables will take value 1 if the patient had 41 00:02:36,940 --> 00:02:41,150 a diagnosis code for the particular disease and value 0 42 00:02:41,150 --> 00:02:42,950 otherwise. 43 00:02:42,950 --> 00:02:46,900 Reimbursement2008 is the total amount 44 00:02:46,900 --> 00:02:50,490 of Medicare reimbursements for this patient in 2008. 45 00:02:50,490 --> 00:02:53,550 And reimbursement2009 is the total value 46 00:02:53,550 --> 00:02:58,010 of all Medicare reimbursements for the patient in 2009. 47 00:02:58,010 --> 00:03:03,040 Bucket2008 is the cost bucket the patient fell into in 2008, 48 00:03:03,040 --> 00:03:05,600 and bucket2009 is the cost bucket 49 00:03:05,600 --> 00:03:08,680 the patient fell into in 2009. 50 00:03:08,680 --> 00:03:12,670 These cost buckets are defined using the thresholds determined 51 00:03:12,670 --> 00:03:14,500 by D2Hawkeye. 52 00:03:14,500 --> 00:03:17,090 So the first cost bucket contains patients 53 00:03:17,090 --> 00:03:21,100 with costs less than $3,000, the second cost bucket 54 00:03:21,100 --> 00:03:26,110 contains patients with costs between $3,000 and $8,000, 55 00:03:26,110 --> 00:03:27,850 and so on. 56 00:03:27,850 --> 00:03:31,880 We can verify that the number of patients in each cost bucket 57 00:03:31,880 --> 00:03:33,630 has the same structure as what we 58 00:03:33,630 --> 00:03:37,560 saw for D2Hawkeye by computing the percentage of patients 59 00:03:37,560 --> 00:03:39,400 in each cost bucket. 60 00:03:39,400 --> 00:03:48,160 So we'll create a table of the variable bucket2009 61 00:03:48,160 --> 00:03:53,730 and divide by the number of rows in Claims. 62 00:03:53,730 --> 00:03:55,800 This gives the percentage of patients 63 00:03:55,800 --> 00:03:58,180 in each of the cost buckets. 64 00:03:58,180 --> 00:04:02,100 The first cost bucket has almost 70% of the patients. 65 00:04:02,100 --> 00:04:05,990 The second cost bucket has about 20% of the patients. 66 00:04:05,990 --> 00:04:09,740 And the remaining 10% are split between the final three cost 67 00:04:09,740 --> 00:04:10,920 buckets. 68 00:04:10,920 --> 00:04:16,470 So the vast majority of patients in this data set have low cost. 69 00:04:16,470 --> 00:04:19,829 Our goal will be to predict the cost bucket the patient fell 70 00:04:19,829 --> 00:04:23,720 into in 2009 using a CART model. 71 00:04:23,720 --> 00:04:25,800 But before we build our model, we 72 00:04:25,800 --> 00:04:30,610 need to split our data into a training set and a testing set. 73 00:04:30,610 --> 00:04:36,230 So we'll load the package caTools, 74 00:04:36,230 --> 00:04:39,600 and then we'll set our random seed to 88 75 00:04:39,600 --> 00:04:42,260 so that we all get the same split. 76 00:04:42,260 --> 00:04:47,430 And we'll use the sample.split function, 77 00:04:47,430 --> 00:04:55,070 where our dependent variable is Claims$bucket2009, 78 00:04:55,070 --> 00:05:00,160 and we'll set our SplitRatio to be 0.6. 79 00:05:00,160 --> 00:05:04,170 So we'll put 60% of the data in the training set. 80 00:05:04,170 --> 00:05:08,930 We'll call our training set ClaimsTrain, 81 00:05:08,930 --> 00:05:14,660 and we'll take the observations of Claims 82 00:05:14,660 --> 00:05:20,100 for which spl is exactly equal to TRUE. 83 00:05:20,100 --> 00:05:24,540 And our testing set will be called ClaimsTest, 84 00:05:24,540 --> 00:05:31,790 where we'll take the observations of Claims 85 00:05:31,790 --> 00:05:35,640 for which spl is exactly equal to FALSE. 86 00:05:38,409 --> 00:05:41,950 Now that our data set is ready, we'll see in the next video 87 00:05:41,950 --> 00:05:45,570 how a smart baseline method would perform.