1 00:00:04,970 --> 00:00:07,850 In this video we will try to segment 2 00:00:07,850 --> 00:00:10,700 an MRI brain image of a healthy patient using 3 00:00:10,700 --> 00:00:12,860 hierarchical clustering. 4 00:00:12,860 --> 00:00:14,790 Make sure that you are in the directory where 5 00:00:14,790 --> 00:00:17,340 you saved the healthy.csv file. 6 00:00:17,340 --> 00:00:19,730 We will be following the exact same steps 7 00:00:19,730 --> 00:00:21,960 we did in the previous video. 8 00:00:21,960 --> 00:00:25,610 First, read in the data, and call the data frame healthy. 9 00:00:25,610 --> 00:00:29,800 Use the read.csv function to read in the healthy data set. 10 00:00:29,800 --> 00:00:31,800 And remember that this healthy data set consists 11 00:00:31,800 --> 00:00:36,220 of a matrix of intensity values, so let's set the header 12 00:00:36,220 --> 00:00:38,740 to false. 13 00:00:38,740 --> 00:00:41,610 And now let's create the healthy matrix using 14 00:00:41,610 --> 00:00:45,210 the as.matrix function, which takes as an input 15 00:00:45,210 --> 00:00:46,170 the healthy data frame. 16 00:00:48,850 --> 00:00:50,570 And now let's output the structure 17 00:00:50,570 --> 00:00:53,030 of the healthy matrix. 18 00:00:53,030 --> 00:00:56,990 And then we realize that we have 566 19 00:00:56,990 --> 00:01:00,760 by 646 pixel resolution for our image. 20 00:01:00,760 --> 00:01:04,430 So this MRI image is considerably larger 21 00:01:04,430 --> 00:01:06,690 than the little flower image that we saw, 22 00:01:06,690 --> 00:01:10,170 and we worked with in the previous two videos. 23 00:01:10,170 --> 00:01:13,620 To see the MRI image, we can use the image function 24 00:01:13,620 --> 00:01:17,360 in R, which takes as an input the healthy matrix. 25 00:01:17,360 --> 00:01:20,560 And then let's turn our axes off. 26 00:01:20,560 --> 00:01:22,810 And then use the grey-scale color scheme. 27 00:01:22,810 --> 00:01:26,400 So the color is equal to grey, which shades a sequence 28 00:01:26,400 --> 00:01:31,970 of values going from zero to one, with the length of 256. 29 00:01:31,970 --> 00:01:34,450 And now going to our graphics window, 30 00:01:34,450 --> 00:01:37,680 we see that what we have is the T2-weighted MRI 31 00:01:37,680 --> 00:01:39,880 imaging of a top section of the brain. 32 00:01:39,880 --> 00:01:41,630 And it shows different substances, 33 00:01:41,630 --> 00:01:44,110 such as the gray matter, the white matter, 34 00:01:44,110 --> 00:01:47,039 and the cerebrospinal fluid. 35 00:01:47,039 --> 00:01:49,190 Now let us see if we can isolate these substances 36 00:01:49,190 --> 00:01:52,500 via hierarchical clustering. 37 00:01:52,500 --> 00:01:54,840 We first need to convert the healthy matrix to a vector, 38 00:01:54,840 --> 00:01:58,320 and let's call it healthy vector. 39 00:01:58,320 --> 00:02:02,800 And that is equal to S dot vector of the healthy matrix. 40 00:02:05,320 --> 00:02:07,230 And now the first step in performing 41 00:02:07,230 --> 00:02:10,669 hierarchical clustering is computing the distance matrix. 42 00:02:10,669 --> 00:02:17,060 So let's type distance equals dist of healthy vector. 43 00:02:17,060 --> 00:02:19,070 And let's specify the method to be euclidean. 44 00:02:23,530 --> 00:02:27,140 Oh, R gives us an error that seems to tell us that 45 00:02:27,140 --> 00:02:31,110 our vector is huge, and R cannot allocate enough memory. 46 00:02:31,110 --> 00:02:33,740 Well let us see how big is our vector. 47 00:02:33,740 --> 00:02:36,930 So we're going to go and use the structure 48 00:02:36,930 --> 00:02:39,350 function over the healthy vector, 49 00:02:39,350 --> 00:02:43,170 and let's see what we obtain. 50 00:02:43,170 --> 00:02:43,690 Hm. 51 00:02:43,690 --> 00:02:49,070 The healthy vector has 365,636 elements. 52 00:02:49,070 --> 00:02:50,450 Let's call this number n. 53 00:02:50,450 --> 00:02:53,800 And remember, from our previous video, 54 00:02:53,800 --> 00:02:56,579 that for R to calculate the pairwise distances, 55 00:02:56,579 --> 00:03:03,820 it would actually need to calculate n*(n-1)/2 and then 56 00:03:03,820 --> 00:03:06,010 store them in the distance matrix. 57 00:03:06,010 --> 00:03:08,800 Let's see how big this number is. 58 00:03:08,800 --> 00:03:09,600 Wow. 59 00:03:09,600 --> 00:03:11,820 Of course R would complain. 60 00:03:11,820 --> 00:03:14,320 It's 67 billion values that we're 61 00:03:14,320 --> 00:03:17,430 asking R to store in a matrix. 62 00:03:17,430 --> 00:03:20,930 The bad news now is that we cannot use hierarchical 63 00:03:20,930 --> 00:03:22,440 clustering. 64 00:03:22,440 --> 00:03:24,610 Is there any other solution? 65 00:03:24,610 --> 00:03:26,750 Well, we have seen in lecture two 66 00:03:26,750 --> 00:03:30,300 that another clustering method is k-means. 67 00:03:30,300 --> 00:03:32,160 Let us review it first, and see if it 68 00:03:32,160 --> 00:03:35,620 could work on our high resolution image. 69 00:03:35,620 --> 00:03:37,800 The k-means clustering algorithm aims 70 00:03:37,800 --> 00:03:40,720 at partitioning the data into k clusters, 71 00:03:40,720 --> 00:03:42,829 in a way that each data point belongs 72 00:03:42,829 --> 00:03:46,210 to the cluster whose mean is the nearest to it. 73 00:03:46,210 --> 00:03:49,740 Let's go over the algorithm step-by-step. 74 00:03:49,740 --> 00:03:52,560 In this example we have five data points. 75 00:03:52,560 --> 00:03:55,610 The first step is to specify the number of clusters. 76 00:03:55,610 --> 00:04:00,410 And suppose we wish to find two clusters, so set k=2. 77 00:04:00,410 --> 00:04:02,990 Then we start by randomly grouping the data 78 00:04:02,990 --> 00:04:04,450 into two clusters. 79 00:04:04,450 --> 00:04:07,140 For instance, three points in the red cluster, 80 00:04:07,140 --> 00:04:10,570 and the remaining two points in the grey cluster. 81 00:04:10,570 --> 00:04:14,360 The next step is to compute the cluster means or centroids. 82 00:04:14,360 --> 00:04:17,310 Let's first compute the mean of the red cluster, 83 00:04:17,310 --> 00:04:21,120 and then the mean of the grey cluster is simply the midpoint. 84 00:04:21,120 --> 00:04:24,010 Now remember that the k-means clustering algorithm 85 00:04:24,010 --> 00:04:27,790 tries to cluster points according to the nearest mean. 86 00:04:27,790 --> 00:04:31,260 But this red point over here seems to be closer to the mean 87 00:04:31,260 --> 00:04:34,290 of the grey cluster, then to the mean of the red cluster 88 00:04:34,290 --> 00:04:37,530 to which it was assigned in the previous step. 89 00:04:37,530 --> 00:04:40,850 So intuitively, the next step in the k-means algorithm 90 00:04:40,850 --> 00:04:45,220 is to re-assign the data points to the closest cluster mean. 91 00:04:45,220 --> 00:04:49,920 As a result, now this red point should be in the grey cluster. 92 00:04:49,920 --> 00:04:52,710 Now that we moved one point from the red cluster over 93 00:04:52,710 --> 00:04:55,900 to the grey cluster, we need to update the means. 94 00:04:55,900 --> 00:04:59,140 This is exactly the next step in the k-means algorithm. 95 00:04:59,140 --> 00:05:03,340 So let's recompute the mean of the red cluster, 96 00:05:03,340 --> 00:05:06,970 and then re-compute the mean of the grey cluster. 97 00:05:06,970 --> 00:05:09,340 Now we go back to Step 4. 98 00:05:09,340 --> 00:05:11,060 Is there any point here that seems 99 00:05:11,060 --> 00:05:14,850 to be cluster to a cluster mean that it does not belong to? 100 00:05:14,850 --> 00:05:18,700 If so, we need to re-assign it to the other cluster. 101 00:05:18,700 --> 00:05:20,500 However, in this case, all points 102 00:05:20,500 --> 00:05:24,000 are closest to their cluster mean, so the algorithm is done, 103 00:05:24,000 --> 00:05:26,310 and we can stop. 104 00:05:26,310 --> 00:05:28,800 In the next video, we will implement the k-means algorithm 105 00:05:28,800 --> 00:05:32,470 in R to try to segment the MRI brain image.