1 00:00:04,810 --> 00:00:09,240 So let us explain what claims data is. 2 00:00:09,240 --> 00:00:17,720 So medical claims are generated when a patient visits a doctor. 3 00:00:17,720 --> 00:00:22,600 Medical claims include diagnosis codes, procedure codes, 4 00:00:22,600 --> 00:00:25,370 as well as costs. 5 00:00:25,370 --> 00:00:31,070 Pharmacy claims involve drugs, the quantity of these drugs, 6 00:00:31,070 --> 00:00:36,150 the prescribing doctor, as well as the medication costs. 7 00:00:36,150 --> 00:00:39,500 Claims data are electronically available, they 8 00:00:39,500 --> 00:00:43,490 are standardized, and they use well-established codes. 9 00:00:43,490 --> 00:00:45,630 However, since humans generate them, 10 00:00:45,630 --> 00:00:47,980 they are not 100% accurate. 11 00:00:47,980 --> 00:00:52,250 Under-reporting is also common, in the sense 12 00:00:52,250 --> 00:00:55,730 that it's a tedious job to record these claims, 13 00:00:55,730 --> 00:00:58,320 and as a result, people often under-report them. 14 00:00:58,320 --> 00:01:03,950 Also, claims for hospital visits can be vague. 15 00:01:03,950 --> 00:01:07,430 In creating a data set, our objective 16 00:01:07,430 --> 00:01:11,590 was to assess health care quality. 17 00:01:11,590 --> 00:01:15,820 So we used a large health insurance claims database, 18 00:01:15,820 --> 00:01:22,270 and we randomly selected 131 diabetes patients. 19 00:01:22,270 --> 00:01:27,360 The ages ranged from 35 to 55, and the costs 20 00:01:27,360 --> 00:01:32,340 were in the neighborhood of $10,000 to $20,000. 21 00:01:32,340 --> 00:01:35,190 The period in which these claims were recorded 22 00:01:35,190 --> 00:01:41,780 was September 1, 2003 to August 31, 2005. 
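As a rough illustration of the two claim types described above, the sketch below models a medical claim and a pharmacy claim as simple records. All field names, identifiers, codes, and cost values are hypothetical placeholders chosen for this example; they are not the schema or contents of the insurance database used in the lecture.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MedicalClaim:
    """One medical claim: diagnosis codes, procedure codes, and a cost."""
    patient_id: str             # hypothetical identifier
    diagnosis_codes: List[str]  # placeholder codes, not real diagnosis codes
    procedure_codes: List[str]  # placeholder codes, not real procedure codes
    cost: float


@dataclass
class PharmacyClaim:
    """One pharmacy claim: drug, quantity, prescribing doctor, and cost."""
    patient_id: str
    drug: str
    quantity: int
    prescribing_doctor: str
    medication_cost: float


def total_cost(medical: List[MedicalClaim], pharmacy: List[PharmacyClaim]) -> float:
    """Sum the costs over both kinds of claims."""
    return sum(c.cost for c in medical) + sum(c.medication_cost for c in pharmacy)


# Made-up claims for a single made-up patient.
medical = [MedicalClaim("P001", ["DX-A"], ["PX-1"], 120.0)]
pharmacy = [PharmacyClaim("P001", "drug-x", 30, "DR-7", 45.5)]
example_total = total_cost(medical, pharmacy)
print(example_total)
```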
23 00:01:41,780 --> 00:01:44,590 An expert physician reviewed the claims 24 00:01:44,590 --> 00:01:48,020 and wrote descriptive notes, like "ongoing use 25 00:01:48,020 --> 00:01:52,210 of narcotics"; "only on Avandia, not a good first choice drug"; 26 00:01:52,210 --> 00:01:55,140 "had regular visits, mammogram, and immunizations"; 27 00:01:55,140 --> 00:01:59,100 "was given home testing supplies". 28 00:01:59,100 --> 00:02:02,810 After this review, this expert physician 29 00:02:02,810 --> 00:02:07,520 rated the quality of care on a two-point scale, poor or good. 30 00:02:07,520 --> 00:02:12,000 Examples included: "I'd say care was poor. 31 00:02:12,000 --> 00:02:13,080 Poorly treated diabetes"; 32 00:02:13,080 --> 00:02:17,900 "Not an eye exam, but overall I'd say high quality". 33 00:02:17,900 --> 00:02:20,900 So based on these comments, we extracted variables. 34 00:02:20,900 --> 00:02:24,070 The dependent variable was the quality of care. 35 00:02:24,070 --> 00:02:27,720 The independent variables involved "ongoing use 36 00:02:27,720 --> 00:02:32,150 of narcotics"; "only on Avandia, not a good first choice drug"; 37 00:02:32,150 --> 00:02:34,520 "had regular visits, mammogram, and immunizations"; 38 00:02:34,520 --> 00:02:37,540 "was given home testing supplies". 39 00:02:37,540 --> 00:02:39,660 Overall, the independent variables 40 00:02:39,660 --> 00:02:41,900 involved diabetes treatment variables, 41 00:02:41,900 --> 00:02:45,710 patient demographics, health care utilization, providers, 42 00:02:45,710 --> 00:02:47,160 claims, and prescriptions. 43 00:02:50,720 --> 00:02:55,270 The dependent variable was modeled as a binary variable -- 44 00:02:55,270 --> 00:02:59,100 1 for low-quality care and 0 for high-quality care. 45 00:02:59,100 --> 00:03:01,770 This is by its nature a categorical variable. 46 00:03:01,770 --> 00:03:05,040 It only takes two possible values. 
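The binary coding just described (1 for low-quality care, 0 for high-quality care) can be sketched as a small mapping from the physician's two-point rating. This is only an illustration: the function name `encode_quality` and the sample ratings are made up here, not taken from the study's data.

```python
def encode_quality(rating: str) -> int:
    """Return 1 for a 'poor' rating (low-quality care), 0 for 'good'."""
    labels = {"poor": 1, "good": 0}
    key = rating.strip().lower()
    if key not in labels:
        # The scale has only two points, so anything else is an error.
        raise ValueError(f"unexpected rating: {rating!r}")
    return labels[key]


# Made-up ratings for four hypothetical patients.
ratings = ["poor", "good", "good", "Poor"]
poor_care = [encode_quality(r) for r in ratings]
print(poor_care)  # [1, 0, 0, 1]
```

Because the encoded variable only ever takes the values 0 and 1, its mean is simply the fraction of patients rated as receiving poor care.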
47 00:03:05,040 --> 00:03:08,530 We have seen linear regression as a way 48 00:03:08,530 --> 00:03:11,740 of predicting continuous outcomes. 49 00:03:11,740 --> 00:03:17,190 Of course, we can utilize linear regression 50 00:03:17,190 --> 00:03:19,470 to predict quality of care here, but then we 51 00:03:19,470 --> 00:03:22,710 have to round the outcome to 0 or 1. 52 00:03:22,710 --> 00:03:28,260 Instead, we will explain in this lecture 53 00:03:28,260 --> 00:03:31,090 how we can use logistic regression, which 54 00:03:31,090 --> 00:03:33,290 is an extension of linear regression, 55 00:03:33,290 --> 00:03:36,590 to environments where the dependent variable is 56 00:03:36,590 --> 00:03:37,280 categorical. 57 00:03:37,280 --> 00:03:40,460 In our case, 0 or 1.
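To give a feel for why logistic regression suits a 0-or-1 outcome, here is a minimal sketch of the logistic function, which turns a linear combination of the independent variables into a probability between 0 and 1. The coefficients, the two input variables, and the 0.5 threshold are hypothetical values picked for illustration; they are not fitted to the lecture's data set.

```python
import math


def logistic(z: float) -> float:
    """Map any real-valued score to a probability strictly between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow for very negative z.
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients, for illustration only.
INTERCEPT = -2.0
COEF_NARCOTICS = 1.5       # assumed weight per narcotic prescription
COEF_OFFICE_VISITS = 0.1   # assumed weight per office visit


def linear_score(narcotics: int, office_visits: int) -> float:
    """Linear combination of the independent variables (can be any real number)."""
    return INTERCEPT + COEF_NARCOTICS * narcotics + COEF_OFFICE_VISITS * office_visits


def prob_poor_care(narcotics: int, office_visits: int) -> float:
    """Probability that the outcome is 1 (poor care) under the assumed coefficients."""
    return logistic(linear_score(narcotics, office_visits))


def classify(probability: float, threshold: float = 0.5) -> int:
    """Turn a probability into a 0/1 prediction using an assumed threshold."""
    return 1 if probability >= threshold else 0


for narcotics, visits in [(0, 0), (3, 10)]:
    score = linear_score(narcotics, visits)
    p = prob_poor_care(narcotics, visits)
    print(f"score={score:.2f} probability={p:.3f} prediction={classify(p)}")
```

The linear score itself can land well outside the range 0 to 1 (here, -2.0 and 3.5), while the logistic function always returns a value that can be read as a probability.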