1 00:00:04,500 --> 00:00:07,710 In this lecture, we'll be using analytical models 2 00:00:07,710 --> 00:00:09,900 to prevent heart disease. 3 00:00:09,900 --> 00:00:13,050 The first step is to identify risk factors, 4 00:00:13,050 --> 00:00:17,350 or the independent variables, that we will use in our model. 5 00:00:17,350 --> 00:00:21,490 Then, using data, we'll create a logistic regression model 6 00:00:21,490 --> 00:00:23,800 to predict heart disease. 7 00:00:23,800 --> 00:00:26,570 Using more data, we'll validate our model 8 00:00:26,570 --> 00:00:29,520 to make sure it performs well out of sample 9 00:00:29,520 --> 00:00:32,340 and on different populations than the training set 10 00:00:32,340 --> 00:00:34,310 population. 11 00:00:34,310 --> 00:00:37,260 Lastly, we'll discuss how medical interventions 12 00:00:37,260 --> 00:00:39,350 can be defined using the model. 13 00:00:41,930 --> 00:00:44,040 We'll be predicting the 10-year risk 14 00:00:44,040 --> 00:00:47,670 of coronary heart disease, or CHD. 15 00:00:47,670 --> 00:00:51,600 This was the subject of an important 1998 paper 16 00:00:51,600 --> 00:00:55,420 introducing what is known as the Framingham Risk Score. 17 00:00:55,420 --> 00:00:58,210 This is one of the most influential applications 18 00:00:58,210 --> 00:01:00,890 of the Framingham Heart Study data. 19 00:01:00,890 --> 00:01:05,630 We'll use logistic regression to create a similar model. 20 00:01:05,630 --> 00:01:10,170 CHD is a disease of the blood vessels supplying the heart. 21 00:01:10,170 --> 00:01:12,320 This is one type of heart disease, which 22 00:01:12,320 --> 00:01:17,510 has been the leading cause of death worldwide since 1921. 23 00:01:17,510 --> 00:01:23,230 In 2008, 7.3 million people died from CHD. 24 00:01:23,230 --> 00:01:27,500 Even though the number of deaths due to CHD is still very high, 25 00:01:27,500 --> 00:01:29,480 age-adjusted death rates have actually 26 00:01:29,480 --> 00:01:33,860 declined 60% since 1950.
27 00:01:33,860 --> 00:01:38,140 This is in part due to earlier detection and monitoring, partly 28 00:01:38,140 --> 00:01:40,210 because of the Framingham Heart Study. 29 00:01:43,050 --> 00:01:45,920 Before building a logistic regression model, 30 00:01:45,920 --> 00:01:48,570 we need to identify the independent variables 31 00:01:48,570 --> 00:01:50,509 we want to use. 32 00:01:50,509 --> 00:01:52,530 When predicting the risk of a disease, 33 00:01:52,530 --> 00:01:57,070 we want to identify what are known as risk factors. 34 00:01:57,070 --> 00:01:59,020 These are the variables that increase 35 00:01:59,020 --> 00:02:02,340 the chances of developing a disease. 36 00:02:02,340 --> 00:02:04,480 The term risk factors was actually 37 00:02:04,480 --> 00:02:07,140 coined by William Kannel and Roy Dawber 38 00:02:07,140 --> 00:02:10,020 from the Framingham Heart Study. 39 00:02:10,020 --> 00:02:12,050 Identifying these risk factors is 40 00:02:12,050 --> 00:02:14,450 the key to successful prediction of CHD. 41 00:02:17,220 --> 00:02:20,140 In this lecture, we'll focus on the risk factors 42 00:02:20,140 --> 00:02:23,060 that they collected data for in the original data 43 00:02:23,060 --> 00:02:26,100 collection for the Framingham Heart Study. 44 00:02:26,100 --> 00:02:28,320 We'll be using an anonymized version 45 00:02:28,320 --> 00:02:31,690 of the original data that was collected. 46 00:02:31,690 --> 00:02:35,490 This data set includes several demographic risk factors-- 47 00:02:35,490 --> 00:02:38,690 the sex of the patient, male or female; 48 00:02:38,690 --> 00:02:43,200 the age of the patient in years; the education level coded 49 00:02:43,200 --> 00:02:45,590 as either 1 for some high school, 50 00:02:45,590 --> 00:02:48,900 2 for a high school diploma or GED, 51 00:02:48,900 --> 00:02:51,920 3 for some college or vocational school, 52 00:02:51,920 --> 00:02:55,700 and 4 for a college degree.
53 00:02:55,700 --> 00:02:58,680 The data set also includes behavioral risk factors 54 00:02:58,680 --> 00:03:02,060 associated with smoking-- whether or not 55 00:03:02,060 --> 00:03:06,120 the patient is a current smoker and the number of cigarettes 56 00:03:06,120 --> 00:03:09,510 that the person smoked on average in one day. 57 00:03:09,510 --> 00:03:12,930 While it is now widely known that smoking increases 58 00:03:12,930 --> 00:03:14,980 the risk of heart disease, the idea 59 00:03:14,980 --> 00:03:20,579 of smoking being bad for you was a novel idea in the 1940s. 60 00:03:20,579 --> 00:03:24,230 Medical history risk factors were also included. 61 00:03:24,230 --> 00:03:25,940 These were whether or not the patient was 62 00:03:25,940 --> 00:03:29,660 on blood pressure medication, whether or not the patient had 63 00:03:29,660 --> 00:03:33,260 previously had a stroke, whether or not the patient was 64 00:03:33,260 --> 00:03:36,720 hypertensive, and whether or not the patient had diabetes. 65 00:03:39,650 --> 00:03:42,220 Lastly, the data set includes risk factors 66 00:03:42,220 --> 00:03:45,740 from the first physical examination of the patient. 67 00:03:45,740 --> 00:03:49,720 The total cholesterol level, systolic blood pressure, 68 00:03:49,720 --> 00:03:55,370 diastolic blood pressure, Body Mass Index, or BMI, heart rate, 69 00:03:55,370 --> 00:03:59,260 and blood glucose level of the patient were measured. 70 00:03:59,260 --> 00:04:02,480 In the next video, we'll use these risk factors 71 00:04:02,480 --> 00:04:06,450 to see if we can predict the 10-year risk of CHD.