1
00:00:04,490 --> 00:00:09,180
Let us discuss data sources in the health care industry.

2
00:00:09,180 --> 00:00:14,800
So the industry is data-rich, but data may be hard to access.

3
00:00:14,800 --> 00:00:17,950
Sometimes it involves unstructured data

4
00:00:17,950 --> 00:00:21,280
like doctor's notes.

5
00:00:21,280 --> 00:00:24,660
Often the data is hard to get due to differences

6
00:00:24,660 --> 00:00:26,520
in technology.

7
00:00:26,520 --> 00:00:31,170
Hospitals in southern Massachusetts versus California

8
00:00:31,170 --> 00:00:36,110
might use different technologies and different platforms.

9
00:00:36,110 --> 00:00:40,460
Finally there are strong privacy laws, HIPAA,

10
00:00:40,460 --> 00:00:42,540
around health care data sharing.

11
00:00:42,540 --> 00:00:43,420
So what is available?

12
00:00:48,170 --> 00:00:51,230
Claims data is a major source.

13
00:00:51,230 --> 00:00:54,520
Claims data are requests for reimbursement submitted

14
00:00:54,520 --> 00:00:57,780
to insurance companies or state-provided insurance

15
00:00:57,780 --> 00:01:00,320
from doctors, hospitals and pharmacies.

16
00:01:03,160 --> 00:01:06,150
Another source of data is the eligibility information

17
00:01:06,150 --> 00:01:08,660
for employees.

18
00:01:08,660 --> 00:01:12,320
And finally demographic information: gender and age.

19
00:01:15,539 --> 00:01:20,940
Let me give you some examples on claims data.

20
00:01:20,940 --> 00:01:25,160
So this shows six different claims.

21
00:01:25,160 --> 00:01:28,180
Let's consider this one.

22
00:01:28,180 --> 00:01:31,560
So this is the provider's name.

23
00:01:31,560 --> 00:01:35,200
The corresponding diagnostic code.

24
00:01:35,200 --> 00:01:41,080
This is about upper respiratory disorders.

25
00:01:41,080 --> 00:01:46,400
This is another code associated with the diagnosis.

26
00:01:46,400 --> 00:01:52,640
This is the scientific term for the diagnosis.

27
00:01:52,640 --> 00:01:55,950
The specific code again.

28
00:01:55,950 --> 00:02:01,620
This was an office visit, and it's an established patient.

29
00:02:01,620 --> 00:02:03,760
The date.

30
00:02:03,760 --> 00:02:12,460
And the amount of money that was claimed by the physician.

31
00:02:12,460 --> 00:02:14,400
Other claims are similar.

32
00:02:17,920 --> 00:02:26,290
As we see, the claims data is a rich, structured data source.

33
00:02:26,290 --> 00:02:29,620
It is very high dimensional.

34
00:02:29,620 --> 00:02:34,470
For example, claims involving diagnosis

35
00:02:34,470 --> 00:02:37,329
involve thousands of different codes.

36
00:02:37,329 --> 00:02:40,870
Similarly with drugs, where there are tens of thousands,

37
00:02:40,870 --> 00:02:43,000
and procedures.

38
00:02:43,000 --> 00:02:46,530
However, this collection of data does not

39
00:02:46,530 --> 00:02:49,890
capture all aspects of a person's treatment or health.

40
00:02:49,890 --> 00:02:53,480
Many things must be inferred.

41
00:02:53,480 --> 00:02:56,300
Unlike electronic medical records,

42
00:02:56,300 --> 00:02:58,510
we do not know the results of a test,

43
00:02:58,510 --> 00:03:00,660
only that the test was administered.

44
00:03:00,660 --> 00:03:07,070
For example, we do not know the results of a blood test,

45
00:03:07,070 --> 00:03:09,240
but we do know that the blood test was administered.

46
00:03:15,060 --> 00:03:19,550
The specific exercise we are going to see in this lecture

47
00:03:19,550 --> 00:03:25,350
is an analytics approach to building models starting

48
00:03:25,350 --> 00:03:28,700
with 2.4 million people over a three year span.

49
00:03:33,150 --> 00:03:37,570
The observation period was 2001 to 2003.

50
00:03:37,570 --> 00:03:40,270
This is where this data is coming from.

51
00:03:40,270 --> 00:03:42,990
And then out of sample, we make predictions

52
00:03:42,990 --> 00:03:46,590
for the period of 2003 and 2004.

53
00:03:46,590 --> 00:03:48,600
This was in the early years of D2Hawkeye.

54
00:03:52,610 --> 00:03:56,990
Out of the 2.4 million people, we included only people

55
00:03:56,990 --> 00:03:59,720
with data for at least 10 months in both periods,

56
00:03:59,720 --> 00:04:02,850
both in the observation period and the results period.

57
00:04:02,850 --> 00:04:06,490
This decreased the data to 400,000 people.
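
The inclusion rule just described (at least 10 months of data in both the observation period and the results period) can be sketched as a simple filter. This is only an illustrative sketch, not the procedure actually used in the lecture: the `Coverage` mapping, the function name `filter_by_coverage`, and the sample member ids are all hypothetical, and the sketch assumes months of data per period have already been counted for each person.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape (an assumption, not from the lecture):
# member id -> (months of data in the observation period,
#               months of data in the results period)
Coverage = Dict[str, Tuple[int, int]]


def filter_by_coverage(coverage: Coverage, min_months: int = 10) -> List[str]:
    """Return the ids of members with at least `min_months` of data in BOTH periods."""
    return sorted(
        member_id
        for member_id, (obs_months, result_months) in coverage.items()
        if obs_months >= min_months and result_months >= min_months
    )


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {
        "a": (12, 11),  # enough data in both periods: kept
        "b": (9, 12),   # too little data in the observation period: dropped
        "c": (10, 10),  # exactly at the threshold: kept
        "d": (36, 3),   # too little data in the results period: dropped
    }
    print(filter_by_coverage(sample))
```

Requiring the threshold in both periods, rather than in total, is what removes people who have a long history in one period but are barely present in the other.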