Let us discuss the performance of a benchmark algorithm. The Random Forest algorithm is known for its attractive property of detecting variable interactions and for its excellent performance as a learning algorithm. For these reasons, we select the Random Forest algorithm as our benchmark. Initially, we randomly partitioned the full data set into two separate parts, where the split was 50-50 and the partitioning was done evenly within each cost bin. The first part, the training set, was used to develop the method. The second part, the test set, was used to evaluate the model's performance. The table in this slide reports the accuracy of the Random Forest algorithm on each of the three buckets.

Let us now introduce the idea of clustering. Patients in each bucket may have different characteristics. For this reason, we create clusters within each cost bucket and make predictions for each cluster using the Random Forest algorithm.

Clustering is mostly used in the absence of a target variable, to search for relationships among input variables or to organize data into meaningful groups. In this study, although the target variable is well-defined as heart attack or no heart attack, there are many different trajectories associated with the target. There is not one set pattern of health or combination of diagnoses that leads a person to a heart attack. Instead, we will show that there are many different dynamic health patterns and time-series diagnostic relations preceding a heart attack.

The clustering methods used were spectral clustering and k-means clustering. In this lecture, we focus on k-means clustering. The broad description of the algorithm is as follows:

1. Specify the number of clusters k.
2. Randomly assign each data point to a cluster.
3. Compute the cluster centroids.
4. Re-assign each point to the closest cluster centroid.
5. Re-compute the cluster centroids.

We repeat steps 4 and 5 until no improvement is made.
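To make these steps concrete, here is a minimal from-scratch sketch of the loop in R, the language used in the recitations. The function name, the data matrix x, and the iteration cap are illustrative assumptions, not code from the course:

# Minimal k-means sketch; x is assumed to be a numeric matrix with one
# row per data point. Illustrative only.
kmeans_sketch <- function(x, k, max_iter = 100) {
  # Step 2: randomly assign each data point to one of the k clusters
  assignments <- sample(1:k, nrow(x), replace = TRUE)
  for (iter in 1:max_iter) {
    # Steps 3 and 5: the centroid of a cluster is the mean of its points
    # (this sketch assumes no cluster ever becomes empty)
    centroids <- t(sapply(1:k, function(j)
      colMeans(x[assignments == j, , drop = FALSE])))
    # Step 4: re-assign each point to the closest centroid
    dists <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    new_assignments <- apply(dists, 1, which.min)
    # Stop when no assignment changes, i.e., no improvement is made
    if (all(new_assignments == assignments)) break
    assignments <- new_assignments
  }
  list(cluster = assignments, centers = centroids)
}

Calling kmeans_sketch(x, k = 2) mirrors the two-cluster example that follows.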
Let us illustrate the k-means algorithm in action. We specify the desired number of clusters k; in this case, we use k = 2. We then randomly assign each data point to a cluster. In this case, we have the three points in red and the two points in black. We then compute the cluster centroids, indicated by the red x and the grey x. We re-assign each point to the closest cluster centroid, and now you observe that this point changes from red to grey. We re-compute the cluster centroids, and we repeat the previous steps, 4 and 5, until no improvement is made. We observe that, in this case, the k-means clustering is done, and this is our final clustering.

Let us discuss some practical considerations. The number of clusters k can be selected from previous knowledge or by simply experimenting. We can strategically select the initial partition of points into clusters if we have some knowledge of the data. We can also run the algorithm several times with different random starting points. In the recitations, we'll learn how to run the k-means algorithm in R.

So how do we measure performance? After we construct the clusters in the training set, we assign new observations to clusters by proximity to the centroid of each cluster. We then measure performance by recording the average prediction rate in each cluster.

Let us now discuss the performance of the clustering methods. We perform clustering on each bucket using k = 10 clusters. In the table, we record the average prediction rate for each cost bucket. We observe a very visible improvement when we use clustering: from 49% to 64%, from 56% to 73%, and from 58% to 78% across the three buckets.
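As a companion to the evaluation step described above, here is a brief sketch in R of fitting k = 10 clusters with the base kmeans function and assigning new observations to the nearest training centroid. The train and test objects are hypothetical feature matrices, and nstart = 20 is an illustrative choice for the random restarts mentioned earlier:

# train and test are assumed numeric feature matrices (hypothetical)
set.seed(1)                                     # reproducible random starts
km <- kmeans(train, centers = 10, nstart = 20)  # k = 10, best of 20 restarts

# Assign each test observation to the cluster with the closest centroid
closest_cluster <- function(obs, centers) {
  which.min(colSums((t(centers) - obs)^2))  # squared distance to each centroid
}
test_clusters <- apply(test, 1, closest_cluster, centers = km$centers)

The average prediction rate reported in the table can then be recorded within each of these clusters, as described above.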