1 00:00:04,490 --> 00:00:07,240 Claims data offer an expansive view of the patient's health 2 00:00:07,240 --> 00:00:08,230 history. 3 00:00:08,230 --> 00:00:10,850 Specifically, claims data include information 4 00:00:10,850 --> 00:00:14,710 on demographics, medical history, and medications. 5 00:00:14,710 --> 00:00:17,240 They offer insights regarding a patient's risk. 6 00:00:17,240 --> 00:00:21,710 And, as I will demonstrate, they may reveal indicative signals 7 00:00:21,710 --> 00:00:23,050 and patterns. 8 00:00:23,050 --> 00:00:25,980 We'll use health insurance claims 9 00:00:25,980 --> 00:00:29,690 filed for about 7,000 members from January 2000 10 00:00:29,690 --> 00:00:34,620 until November 2007. 11 00:00:34,620 --> 00:00:37,980 We concentrated on members with the following main attributes: 12 00:00:37,980 --> 00:00:41,200 at least five claims with coronary artery disease 13 00:00:41,200 --> 00:00:44,720 diagnosis, at least five claims with hypertension 14 00:00:44,720 --> 00:00:48,480 diagnostic codes, at least 100 total medical claims, 15 00:00:48,480 --> 00:00:51,420 at least five pharmacy claims, and data 16 00:00:51,420 --> 00:00:53,740 from at least five years. 17 00:00:53,740 --> 00:00:57,020 These selections yield patients with a high risk 18 00:00:57,020 --> 00:00:58,030 of heart attack 19 00:00:58,030 --> 00:01:00,740 and a reasonably rich medical history with continuous 20 00:01:00,740 --> 00:01:01,240 coverage. 21 00:01:04,819 --> 00:01:08,360 Let us discuss how we've aggregated this data. 22 00:01:08,360 --> 00:01:11,930 The resulting data set includes about 20 million health 23 00:01:11,930 --> 00:01:14,520 insurance entries, including individual medical 24 00:01:14,520 --> 00:01:16,150 and pharmaceutical records. 25 00:01:16,150 --> 00:01:19,600 Diagnosis, procedure, and drug codes in the data 26 00:01:19,600 --> 00:01:22,640 set comprised tens of thousands of attributes. 27 00:01:22,640 --> 00:01:25,460 The codes were aggregated into groups.
28 00:01:25,460 --> 00:01:29,500 218 diagnosis groups, 180 procedure groups, 29 00:01:29,500 --> 00:01:33,150 and 538 drug groups. 30 00:01:33,150 --> 00:01:35,350 46 diagnosis groups were considered 31 00:01:35,350 --> 00:01:38,720 by clinicians as possible risk factors for heart attacks. 32 00:01:43,240 --> 00:01:46,759 Let us discuss how we view the data over time. 33 00:01:46,759 --> 00:01:49,940 It is important in this study to view the medical records 34 00:01:49,940 --> 00:01:53,100 chronologically, and to represent a patient's diagnosis 35 00:01:53,100 --> 00:01:54,770 profile over time. 36 00:01:54,770 --> 00:01:59,970 So we record the cost and number of medical claims and hospital 37 00:01:59,970 --> 00:02:02,390 visits by diagnosis. 38 00:02:02,390 --> 00:02:06,440 All the observations we have span five years of data. 39 00:02:06,440 --> 00:02:11,890 They were split into 21 periods, each 90 days in length. 40 00:02:11,890 --> 00:02:15,110 We examine nine months of diagnostic history 41 00:02:15,110 --> 00:02:18,440 leading up to a heart attack or no-heart-attack event, 42 00:02:18,440 --> 00:02:23,980 and align the data to make observations date-independent 43 00:02:23,980 --> 00:02:26,910 while preserving the order of events. 44 00:02:26,910 --> 00:02:29,950 We recorded the diagnostic history in three periods: 45 00:02:29,950 --> 00:02:33,230 zero to three months before the event, 46 00:02:33,230 --> 00:02:35,090 three to six months before the event, 47 00:02:35,090 --> 00:02:37,600 and six to nine months before the event. 48 00:02:40,280 --> 00:02:42,940 What is the target variable we're trying to predict? 49 00:02:42,940 --> 00:02:44,940 The target prediction variable is the occurrence 50 00:02:44,940 --> 00:02:46,350 of a heart attack. 51 00:02:46,350 --> 00:02:49,730 We define this from a combination of several claims.
52 00:02:49,730 --> 00:02:52,060 Namely, diagnosis of a heart attack, 53 00:02:52,060 --> 00:02:54,700 alongside a trip to the emergency room, 54 00:02:54,700 --> 00:02:58,000 followed by subsequent hospitalization. 55 00:02:58,000 --> 00:03:00,510 Considering only heart attack diagnoses 56 00:03:00,510 --> 00:03:03,340 that are associated with a visit to an emergency room 57 00:03:03,340 --> 00:03:05,810 and a subsequent hospitalization helps 58 00:03:05,810 --> 00:03:09,050 ensure that the target outcome is in fact a heart attack 59 00:03:09,050 --> 00:03:10,100 event. 60 00:03:10,100 --> 00:03:12,190 The target variable is binary. 61 00:03:12,190 --> 00:03:14,670 It is denoted by plus 1 or minus 1 62 00:03:14,670 --> 00:03:16,790 for the occurrence or non-occurrence of a heart 63 00:03:16,790 --> 00:03:19,790 attack in the targeted period of 90 days. 64 00:03:22,720 --> 00:03:24,090 How is the data organized? 65 00:03:24,090 --> 00:03:26,690 There were 147 variables. 66 00:03:26,690 --> 00:03:29,650 Variable one is the patient's identification number, 67 00:03:29,650 --> 00:03:32,030 and variable two is the patient's gender. 68 00:03:32,030 --> 00:03:35,160 There were variables related to the diagnosis group 69 00:03:35,160 --> 00:03:38,410 counts nine, six, and three months before the heart attack 70 00:03:38,410 --> 00:03:39,660 target period. 71 00:03:39,660 --> 00:03:44,950 There were variables related to the total cost nine, six, 72 00:03:44,950 --> 00:03:46,910 and three months before the heart attack target 73 00:03:46,910 --> 00:03:51,630 period, and the final variable, 147, 74 00:03:51,630 --> 00:03:54,870 includes the classification of whether the event was a heart 75 00:03:54,870 --> 00:03:56,020 attack or not. 76 00:03:58,860 --> 00:04:02,940 Cost of medical care is a good summary of a person's health.
77 00:04:02,940 --> 00:04:08,110 In our database, the total cost of medical care in the three 90- 78 00:04:08,110 --> 00:04:15,160 day periods preceding the heart attack target event ranged from 79 00:04:15,160 --> 00:04:20,760 $0 to $636,000, and approximately 70% 80 00:04:20,760 --> 00:04:24,880 of the overall cost was generated by only 11% 81 00:04:24,880 --> 00:04:26,360 of the population. 82 00:04:26,360 --> 00:04:28,920 This means that the patients 83 00:04:28,920 --> 00:04:32,460 with high medical expenses are a very small proportion 84 00:04:32,460 --> 00:04:35,810 of the data, and could skew our final results. 85 00:04:35,810 --> 00:04:39,700 According to the American Medical Association, only 10% 86 00:04:39,700 --> 00:04:42,640 of individuals have projected medical expenses 87 00:04:42,640 --> 00:04:45,790 of approximately $10,000 or greater 88 00:04:45,790 --> 00:04:48,960 per year, which is more than four times greater 89 00:04:48,960 --> 00:04:52,470 than the average projected medical expenses of $2,400 90 00:04:52,470 --> 00:04:53,880 per year. 91 00:04:53,880 --> 00:04:57,090 To lessen the effects of these high-cost outliers, 92 00:04:57,090 --> 00:04:59,960 we divided the data into different cost buckets, 93 00:04:59,960 --> 00:05:03,770 based on the findings of the American Medical Association. 94 00:05:03,770 --> 00:05:06,140 We did not want to have too many cost bins 95 00:05:06,140 --> 00:05:08,840 because of the size of the data set. 96 00:05:08,840 --> 00:05:12,140 The table in the slide gives a summary 97 00:05:12,140 --> 00:05:13,990 of the cost bucket partitions. 98 00:05:13,990 --> 00:05:17,550 Patients with expenses over $10,000 in the nine-month 99 00:05:17,550 --> 00:05:20,980 period were allocated to cost bucket 3. 100 00:05:20,980 --> 00:05:23,840 Patients with less than $2,000 in expenses 101 00:05:23,840 --> 00:05:25,910 were allocated to cost bucket 1.
102 00:05:25,910 --> 00:05:30,170 And the remaining patients, with costs between $2,000 and $10,000, 103 00:05:30,170 --> 00:05:31,850 were allocated to cost bucket 2. 104 00:05:31,850 --> 00:05:40,360 Please note that the majority of patients, 4,400 out of 6,500, 105 00:05:40,360 --> 00:05:44,590 or 67.5% of all patients, fell into the first bucket 106 00:05:44,590 --> 00:05:46,470 of low expenses.
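
The cost bucket rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function names `cost_bucket` and `bucket_shares` are hypothetical, and the handling of costs exactly equal to $2,000 or $10,000 (both placed in bucket 2 here) is an assumption, since the lecture only says "less than $2,000" and "over $10,000".

```python
from collections import Counter
from typing import Dict, Iterable


def cost_bucket(total_cost: float) -> int:
    """Map a patient's nine-month total cost (in dollars) to a cost bucket.

    Thresholds follow the lecture: under $2,000 is bucket 1 and over
    $10,000 is bucket 3. Treating exactly $2,000 and exactly $10,000 as
    bucket 2 is an assumption made for this sketch.
    """
    if total_cost < 0:
        raise ValueError("total_cost must be non-negative")
    if total_cost < 2_000:
        return 1
    if total_cost <= 10_000:
        return 2
    return 3


def bucket_shares(costs: Iterable[float]) -> Dict[int, float]:
    """Return the fraction of patients that falls into each cost bucket."""
    counts = Counter(cost_bucket(c) for c in costs)
    total = sum(counts.values())
    if total == 0:
        return {1: 0.0, 2: 0.0, 3: 0.0}
    return {bucket: counts.get(bucket, 0) / total for bucket in (1, 2, 3)}


if __name__ == "__main__":
    # Made-up costs for illustration only; these are not values from the study.
    example_costs = [0.0, 150.0, 1_999.99, 2_000.0, 7_500.0, 636_000.0]
    print(bucket_shares(example_costs))
```

Applied to the real nine-month totals, `bucket_shares` would give the kind of per-bucket summary shown in the slide's table.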