1
00:00:04,500 --> 00:00:08,090
In the next few videos, we'll
be using a data set published

2
00:00:08,090 --> 00:00:12,530
by the United States Centers for
Medicare and Medicaid Services

3
00:00:12,530 --> 00:00:16,990
to practice creating CART models
to predict health care cost.

4
00:00:16,990 --> 00:00:20,400
We unfortunately can't
use the D2Hawkeye data

5
00:00:20,400 --> 00:00:22,460
due to privacy issues.

6
00:00:22,460 --> 00:00:24,700
The data set we'll
be using instead,

7
00:00:24,700 --> 00:00:30,300
ClaimsData.csv, is structured to
represent a sample of patients

8
00:00:30,300 --> 00:00:32,870
in the Medicare
program, which provides

9
00:00:32,870 --> 00:00:37,080
health insurance to
Americans aged 65 and older,

10
00:00:37,080 --> 00:00:39,610
as well as some younger
people with certain medical

11
00:00:39,610 --> 00:00:41,070
conditions.

12
00:00:41,070 --> 00:00:44,130
To protect the privacy
of patients represented

13
00:00:44,130 --> 00:00:47,760
in this publicly available
data set, a number of steps

14
00:00:47,760 --> 00:00:50,440
are performed to
anonymize the data.

15
00:00:50,440 --> 00:00:53,180
So we would need to retrain
the models we develop

16
00:00:53,180 --> 00:00:56,060
in this lecture on
non-anonymized data

17
00:00:56,060 --> 00:00:59,510
if we wanted to apply our
models in the real world.

18
00:00:59,510 --> 00:01:02,580
Let's start by reading
our data set into R

19
00:01:02,580 --> 00:01:05,010
and taking a look
at its structure.

20
00:01:05,010 --> 00:01:09,120
We'll call our data
set Claims, and we'll

21
00:01:09,120 --> 00:01:14,750
use the read.csv function
to read in the data file

22
00:01:14,750 --> 00:01:15,380
ClaimsData.csv.

23
00:01:21,590 --> 00:01:24,420
Make sure to navigate to the
directory on your computer

24
00:01:24,420 --> 00:01:29,390
containing the file
ClaimsData.csv first.

25
00:01:29,390 --> 00:01:32,310
Now let's take a look at the
structure of our data frame

26
00:01:32,310 --> 00:01:33,870
using the str function.
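
The two steps just described can be sketched in R as follows. This is a minimal sketch, assuming ClaimsData.csv sits in the current working directory; the names Claims, read.csv, and str come from the lecture.

```r
# Assumption: ClaimsData.csv is in the current working directory.
# Otherwise call setwd() first or pass a full path to read.csv().
Claims <- read.csv("ClaimsData.csv")

# Print each column's name, type, and first few values.
str(Claims)
```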

27
00:01:36,920 --> 00:01:40,710
The observations represent
a 1% random sample

28
00:01:40,710 --> 00:01:43,410
of Medicare
beneficiaries, limited

29
00:01:43,410 --> 00:01:47,360
to those still alive
at the end of 2008.

30
00:01:47,360 --> 00:01:50,560
Our independent
variables are from 2008,

31
00:01:50,560 --> 00:01:54,590
and we will be
predicting cost in 2009.

32
00:01:54,590 --> 00:01:58,450
Our independent variables
are the patient's age

33
00:01:58,450 --> 00:02:03,570
in years at the end of 2008, and
then several binary variables

34
00:02:03,570 --> 00:02:05,680
indicating whether or
not the patient had

35
00:02:05,680 --> 00:02:08,590
diagnosis codes for
a particular disease

36
00:02:08,590 --> 00:02:16,020
or related disorder in 2008:
alzheimers, arthritis, cancer,

37
00:02:16,020 --> 00:02:21,730
chronic obstructive pulmonary
disease, or copd, depression,

38
00:02:21,730 --> 00:02:25,970
diabetes, heart.failure,
ischemic heart disease,

39
00:02:25,970 --> 00:02:33,290
or ihd, kidney disease,
osteoporosis, and stroke.

40
00:02:33,290 --> 00:02:36,940
Each of these variables will
take value 1 if the patient had

41
00:02:36,940 --> 00:02:41,150
a diagnosis code for the
particular disease and value 0

42
00:02:41,150 --> 00:02:42,950
otherwise.

43
00:02:42,950 --> 00:02:46,900
reimbursement2008
is the total amount

44
00:02:46,900 --> 00:02:50,490
of Medicare reimbursements
for this patient in 2008.

45
00:02:50,490 --> 00:02:53,550
And reimbursement2009
is the total value

46
00:02:53,550 --> 00:02:58,010
of all Medicare reimbursements
for the patient in 2009.

47
00:02:58,010 --> 00:03:03,040
bucket2008 is the cost bucket
the patient fell into in 2008,

48
00:03:03,040 --> 00:03:05,600
and bucket2009 is
the cost bucket

49
00:03:05,600 --> 00:03:08,680
the patient fell into in 2009.

50
00:03:08,680 --> 00:03:12,670
These cost buckets are defined
using the thresholds determined

51
00:03:12,670 --> 00:03:14,500
by D2Hawkeye.

52
00:03:14,500 --> 00:03:17,090
So the first cost
bucket contains patients

53
00:03:17,090 --> 00:03:21,100
with costs less than $3,000,
the second cost bucket

54
00:03:21,100 --> 00:03:26,110
contains patients with costs
between $3,000 and $8,000,

55
00:03:26,110 --> 00:03:27,850
and so on.

56
00:03:27,850 --> 00:03:31,880
We can verify that the number
of patients in each cost bucket

57
00:03:31,880 --> 00:03:33,630
has the same
structure as what we

58
00:03:33,630 --> 00:03:37,560
saw for D2Hawkeye by computing
the percentage of patients

59
00:03:37,560 --> 00:03:39,400
in each cost bucket.

60
00:03:39,400 --> 00:03:48,160
So we'll create a table
of the variable bucket2009

61
00:03:48,160 --> 00:03:53,730
and divide by the number
of rows in Claims.

62
00:03:53,730 --> 00:03:55,800
This gives the
fraction of patients

63
00:03:55,800 --> 00:03:58,180
in each of the cost buckets.
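
The computation just described might look like this in R. It is a sketch that assumes ClaimsData.csv is in the working directory; bucket_share is a hypothetical name used only for illustration.

```r
# Assumption: ClaimsData.csv is in the current working directory.
Claims <- read.csv("ClaimsData.csv")

# Count the patients in each 2009 cost bucket, then divide by the total
# number of rows to get the fraction of patients in each bucket.
# bucket_share is a hypothetical name, not one used in the lecture.
bucket_share <- table(Claims$bucket2009) / nrow(Claims)
print(bucket_share)
```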

64
00:03:58,180 --> 00:04:02,100
The first cost bucket has
almost 70% of the patients.

65
00:04:02,100 --> 00:04:05,990
The second cost bucket has
about 20% of the patients.

66
00:04:05,990 --> 00:04:09,740
And the remaining 10% are split
between the final three cost

67
00:04:09,740 --> 00:04:10,920
buckets.

68
00:04:10,920 --> 00:04:16,470
So the vast majority of patients
in this data set have low cost.

69
00:04:16,470 --> 00:04:19,829
Our goal will be to predict the
cost bucket the patient fell

70
00:04:19,829 --> 00:04:23,720
into in 2009 using a CART model.

71
00:04:23,720 --> 00:04:25,800
But before we
build our model, we

72
00:04:25,800 --> 00:04:30,610
need to split our data into a
training set and a testing set.

73
00:04:30,610 --> 00:04:36,230
So we'll load the
package caTools,

74
00:04:36,230 --> 00:04:39,600
and then we'll set
our random seed to 88

75
00:04:39,600 --> 00:04:42,260
so that we all get
the same split.

76
00:04:42,260 --> 00:04:47,430
And we'll use the
sample.split function,

77
00:04:47,430 --> 00:04:55,070
where our dependent variable
is Claims$bucket2009,

78
00:04:55,070 --> 00:05:00,160
and we'll set our
SplitRatio to be 0.6.

79
00:05:00,160 --> 00:05:04,170
So we'll put 60% of the
data in the training set.

80
00:05:04,170 --> 00:05:08,930
We'll call our training
set ClaimsTrain,

81
00:05:08,930 --> 00:05:14,660
and we'll take the
observations of Claims

82
00:05:14,660 --> 00:05:20,100
for which spl is
exactly equal to TRUE.

83
00:05:20,100 --> 00:05:24,540
And our testing set will
be called ClaimsTest,

84
00:05:24,540 --> 00:05:31,790
where we'll take the
observations of Claims

85
00:05:31,790 --> 00:05:35,640
for which spl is
exactly equal to FALSE.
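
Put together, the split described above could be written roughly as follows. This is a sketch, assuming the caTools package is installed and ClaimsData.csv is in the working directory; the names spl, ClaimsTrain, and ClaimsTest are the ones used in the lecture.

```r
library(caTools)  # provides sample.split()

# Assumption: ClaimsData.csv is in the current working directory.
Claims <- read.csv("ClaimsData.csv")

# Fix the random seed so the split is reproducible.
set.seed(88)

# sample.split() returns a logical vector: TRUE marks a training row.
# It keeps the relative frequency of each bucket2009 value similar
# in both parts of the split.
spl <- sample.split(Claims$bucket2009, SplitRatio = 0.6)

ClaimsTrain <- subset(Claims, spl == TRUE)   # about 60% of the rows
ClaimsTest  <- subset(Claims, spl == FALSE)  # the remaining rows
```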

86
00:05:38,409 --> 00:05:41,950
Now that our data set is ready,
we'll see in the next video

87
00:05:41,950 --> 00:05:45,570
how a smart baseline
method would perform.