1 00:00:04,500 --> 00:00:07,540 In our previous video, we found the distance matrix, 2 00:00:07,540 --> 00:00:10,190 which computes the pairwise distances between all 3 00:00:10,190 --> 00:00:13,040 the intensity values in the flower vector. 4 00:00:13,040 --> 00:00:15,430 Now we can cluster the intensity values 5 00:00:15,430 --> 00:00:17,920 using hierarchical clustering. 6 00:00:17,920 --> 00:00:21,880 So we're going to type clusterIntensity. 7 00:00:21,880 --> 00:00:24,540 And then we're going to use the hclust function, which 8 00:00:24,540 --> 00:00:27,500 is the hierarchical clustering function in R, which 9 00:00:27,500 --> 00:00:30,080 takes as an input the distance matrix. 10 00:00:30,080 --> 00:00:32,810 And then we're going to specify the clustering method 11 00:00:32,810 --> 00:00:35,450 to be "ward." 12 00:00:35,450 --> 00:00:37,640 As a reminder, Ward's method 13 00:00:37,640 --> 00:00:39,480 is a minimum variance method, which 14 00:00:39,480 --> 00:00:42,650 tries to find compact and spherical clusters. 15 00:00:42,650 --> 00:00:45,500 We can think about it as trying to minimize the variance 16 00:00:45,500 --> 00:00:49,250 within each cluster and the distance among clusters. 17 00:00:49,250 --> 00:00:51,000 Now we can plot the cluster dendrogram. 18 00:00:51,000 --> 00:00:52,170 So-- plot(clusterIntensity). 19 00:00:58,060 --> 00:01:01,240 And now we obtain the cluster dendrogram. 20 00:01:01,240 --> 00:01:04,540 Let's have here a little aside or a quick reminder 21 00:01:04,540 --> 00:01:09,120 about how to read a dendrogram and make sense of it. 22 00:01:09,120 --> 00:01:12,620 Let us first consider this toy dendrogram example. 23 00:01:12,620 --> 00:01:14,390 The lowest row of nodes represents 24 00:01:14,390 --> 00:01:17,250 the data or the individual observations, 25 00:01:17,250 --> 00:01:20,170 and the remaining nodes represent the clusters. 
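Before continuing with the toy dendrogram, here is a minimal sketch of the hclust and plot steps described earlier in this segment. The intensity values are synthetic stand-ins (an assumption, not the flower data from the video), and the names flowerVector and distance are assumed for illustration. Recent versions of R name the Ward option "ward.D"; older versions accepted "ward".

```r
# Synthetic stand-in for the flower intensities (hypothetical data,
# not the file used in the video): 2,500 values in three rough bands.
set.seed(42)
flowerVector <- c(runif(1000, 0.00, 0.15),   # dark pixels
                  runif(800,  0.40, 0.60),   # mid-gray pixels
                  runif(700,  0.85, 1.00))   # light pixels

# Pairwise Euclidean distances between all intensity values.
distance <- dist(flowerVector, method = "euclidean")

# Hierarchical clustering with Ward's minimum variance method.
# "ward.D" is the current name of the option formerly called "ward".
clusterIntensity <- hclust(distance, method = "ward.D")

# Draw the cluster dendrogram.
plot(clusterIntensity)
```

With 2,500 observations the leaf labels overlap heavily, so the overall shape of the tree is the useful part of this plot.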
26 00:01:20,170 --> 00:01:22,270 The vertical lines depict the distance 27 00:01:22,270 --> 00:01:24,720 between two nodes or clusters. 28 00:01:24,720 --> 00:01:28,120 The taller the line, the more dissimilar the clusters are. 29 00:01:28,120 --> 00:01:33,479 For instance, cluster D-E-F is closer to cluster B-C-D-E-F 30 00:01:33,479 --> 00:01:35,600 than cluster B-C is. 31 00:01:35,600 --> 00:01:38,720 And this is well depicted by the height of the lines connecting 32 00:01:38,720 --> 00:01:43,720 each of clusters B-C and D-E-F to their parent node. 33 00:01:43,720 --> 00:01:46,759 Now cutting the dendrogram at a given level 34 00:01:46,759 --> 00:01:49,160 yields a certain partitioning of the data. 35 00:01:49,160 --> 00:01:53,110 For instance, if we cut the tree between levels two and three, 36 00:01:53,110 --> 00:01:58,590 we obtain four clusters, A, B-C, D-E, and F. 37 00:01:58,590 --> 00:02:02,120 If we cut the dendrogram between levels three and four, 38 00:02:02,120 --> 00:02:07,690 then we obtain three clusters, A, B-C, and D-E-F. 39 00:02:07,690 --> 00:02:10,580 And if we were to cut the dendrogram between levels four 40 00:02:10,580 --> 00:02:16,800 and five, then we obtain two clusters, A and B-C-D-E-F. 41 00:02:16,800 --> 00:02:20,120 What to choose, two, three, or four clusters? 42 00:02:20,120 --> 00:02:23,670 Well, the smaller the number of clusters, the coarser 43 00:02:23,670 --> 00:02:25,230 the clustering is. 44 00:02:25,230 --> 00:02:27,850 But at the same time, having many clusters 45 00:02:27,850 --> 00:02:30,020 may be too much of a stretch. 46 00:02:30,020 --> 00:02:33,410 We should always have this trade-off in mind. 47 00:02:33,410 --> 00:02:35,750 Now the distance information between clusters 48 00:02:35,750 --> 00:02:38,970 can guide our choice of the number of clusters. 49 00:02:38,970 --> 00:02:42,300 A good partition belongs to a cut that has a good enough room 50 00:02:42,300 --> 00:02:43,890 to move up and down. 
51 00:02:43,890 --> 00:02:47,230 For instance, the cut between levels two and three can go up 52 00:02:47,230 --> 00:02:51,280 until it reaches cluster D-E-F. The cut between levels three 53 00:02:51,280 --> 00:02:54,310 and four has more room to move until it reaches the cluster 54 00:02:54,310 --> 00:02:58,590 B-C-D-E-F. And the cut between levels four and five has 55 00:02:58,590 --> 00:03:00,080 the least room. 56 00:03:00,080 --> 00:03:02,720 So it seems like choosing three clusters 57 00:03:02,720 --> 00:03:06,040 is reasonable in this case. 58 00:03:06,040 --> 00:03:08,280 Going back to our dendrogram, it seems 59 00:03:08,280 --> 00:03:11,580 that having two clusters or three clusters 60 00:03:11,580 --> 00:03:13,390 is reasonable in our case. 61 00:03:13,390 --> 00:03:15,360 We can actually visualize the cuts 62 00:03:15,360 --> 00:03:18,770 by plotting rectangles around the clusters on this tree. 63 00:03:18,770 --> 00:03:23,560 To do so, we can use the rect.hclust function, 64 00:03:23,560 --> 00:03:26,190 which takes as an input clusterIntensity, which 65 00:03:26,190 --> 00:03:27,540 is our tree. 66 00:03:27,540 --> 00:03:30,140 And then we can specify the number of clusters 67 00:03:30,140 --> 00:03:30,760 that we want. 68 00:03:30,760 --> 00:03:33,260 So let's set k=3. 69 00:03:33,260 --> 00:03:35,660 And we can color the borders of the rectangles. 70 00:03:35,660 --> 00:03:39,420 And let's color them, for instance, in red. 71 00:03:39,420 --> 00:03:41,810 Now going back to our dendrogram, 72 00:03:41,810 --> 00:03:44,010 now we can see the three clusters 73 00:03:44,010 --> 00:03:46,870 in these red rectangles. 74 00:03:46,870 --> 00:03:50,400 Now let us split the data into these three clusters. 75 00:03:50,400 --> 00:03:51,940 We're going to call our clusters, 76 00:03:51,940 --> 00:03:55,650 for instance, flowerClusters. 77 00:03:55,650 --> 00:03:59,410 And then we're going to use the function cutree. 
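The dendrogram-reading aside and the rectangle overlay can be tried on a small example. The sketch below is an illustration under stated assumptions: the six one-dimensional points labeled A to F and the single-linkage method are hypothetical choices, picked only so that the tree has the same shape as the toy dendrogram discussed above. They are not taken from the video.

```r
# Hypothetical 1-D coordinates for six observations A..F, chosen so the
# merge order is D-E, B-C, DE-F, BC-DEF, and finally A with the rest.
toyPoints <- c(A = -0.5, B = 10, C = 11, D = 20, E = 20.5, F = 24)

# Single linkage is an assumption made for this toy example only.
toyTree <- hclust(dist(toyPoints), method = "single")

# "Room to move" for each cut: the gap between consecutive merge heights.
# With 6 points, the i-th gap corresponds to a cut giving 6 - i clusters.
gaps <- diff(toyTree$height)
names(gaps) <- paste0("k=", 5:2)
print(gaps)

# Plot the tree and draw red rectangles around the three clusters.
plot(toyTree)
rect.hclust(toyTree, k = 3, border = "red")

# Cluster membership for two, three, and four clusters.
print(cutree(toyTree, k = 2:4))
```

In this toy data the largest gap belongs to the three-cluster cut and the smallest of the two-, three-, and four-cluster cuts belongs to the two-cluster cut, which mirrors the reasoning in the aside.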
78 00:03:59,410 --> 00:04:02,870 And literally, this function cuts the dendrogram 79 00:04:02,870 --> 00:04:05,420 into however many clusters we want. 80 00:04:05,420 --> 00:04:06,920 The input would be clusterIntensity. 81 00:04:09,500 --> 00:04:13,610 And then we have to specify k=3, because we would like to have 82 00:04:13,610 --> 00:04:16,670 three clusters. 83 00:04:16,670 --> 00:04:18,769 Now let us output the flowerClusters 84 00:04:18,769 --> 00:04:20,769 variable to see how it looks. 85 00:04:20,769 --> 00:04:23,760 So flowerClusters. 86 00:04:23,760 --> 00:04:26,860 And we see that flowerClusters is actually 87 00:04:26,860 --> 00:04:30,330 a vector that assigns each intensity value in the flower 88 00:04:30,330 --> 00:04:31,830 vector to a cluster. 89 00:04:31,830 --> 00:04:35,240 It actually has the same length, which is 2,500, 90 00:04:35,240 --> 00:04:38,820 and has values 1, 2, and 3, which 91 00:04:38,820 --> 00:04:41,030 correspond to each cluster. 92 00:04:41,030 --> 00:04:44,470 To find the mean intensity value of each of our clusters, 93 00:04:44,470 --> 00:04:48,380 we can use the tapply function and ask R to group 94 00:04:48,380 --> 00:04:52,630 the values in the flower vector according to the flower 95 00:04:52,630 --> 00:04:55,770 clusters, and then apply the mean statistic 96 00:04:55,770 --> 00:04:57,650 to each of the groups. 97 00:04:57,650 --> 00:05:00,540 What we obtain is that the first cluster has a mean intensity 98 00:05:00,540 --> 00:05:04,170 value of 0.08, which is closest to zero, 99 00:05:04,170 --> 00:05:05,620 and this means that it corresponds 100 00:05:05,620 --> 00:05:07,850 to the darkest shade in our image. 101 00:05:07,850 --> 00:05:11,210 And then the third cluster here, which is closest to 1, 102 00:05:11,210 --> 00:05:13,940 corresponds to the fairest shade. 103 00:05:13,940 --> 00:05:15,540 And now the fun part. 104 00:05:15,540 --> 00:05:18,020 Let us see how the image was segmented. 
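The cutree and tapply steps just described can be sketched as follows. The intensities are again synthetic stand-ins (an assumption, not the flower data), so the cluster means printed here will differ from the values quoted in the video. The name flowerVector is assumed for illustration.

```r
# Synthetic stand-in for the 2,500 flower intensities (hypothetical data).
set.seed(42)
flowerVector <- c(runif(1000, 0.00, 0.15),
                  runif(800,  0.40, 0.60),
                  runif(700,  0.85, 1.00))

# Build the tree as before ("ward.D" is the current name for "ward").
clusterIntensity <- hclust(dist(flowerVector, method = "euclidean"),
                           method = "ward.D")

# Cut the dendrogram into three clusters; the result is a vector with
# one cluster label per intensity value.
flowerClusters <- cutree(clusterIntensity, k = 3)
print(length(flowerClusters))
print(table(flowerClusters))

# Group the intensities by cluster label and take the mean of each group.
clusterMeans <- tapply(flowerVector, flowerClusters, mean)
print(clusterMeans)
```

A cluster whose mean is near 0 holds the darkest pixels, and one whose mean is near 1 holds the lightest.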
105 00:05:18,020 --> 00:05:20,810 To output an image, we can use the image function 106 00:05:20,810 --> 00:05:24,150 in R, which takes a matrix as an input. 107 00:05:24,150 --> 00:05:28,460 But the variable flowerClusters, as we just saw, is a vector. 108 00:05:28,460 --> 00:05:31,340 So we need to convert it into a matrix. 109 00:05:31,340 --> 00:05:33,900 We can do this by setting the dimension of this variable 110 00:05:33,900 --> 00:05:35,930 by using the dimension function. 111 00:05:35,930 --> 00:05:38,940 So, let's use the dimension function, or dim, 112 00:05:38,940 --> 00:05:42,510 which takes as input flowerClusters. 113 00:05:42,510 --> 00:05:44,690 And then we're going to use the combine function, 114 00:05:44,690 --> 00:05:46,050 or the c function. 115 00:05:46,050 --> 00:05:48,690 And its first argument will be the number of rows 116 00:05:48,690 --> 00:05:51,680 that we want for the matrix, and that would be 50. 117 00:05:51,680 --> 00:05:54,620 And the second argument would be the number of columns. 118 00:05:54,620 --> 00:05:56,330 Why did we use 50? 119 00:05:56,330 --> 00:06:01,310 Simply because we have a 50 by 50 resolution picture. 120 00:06:01,310 --> 00:06:03,660 Now pressing Enter, and flowerClusters 121 00:06:03,660 --> 00:06:05,960 looks like a matrix. 122 00:06:05,960 --> 00:06:08,040 Now we can use the function image, 123 00:06:08,040 --> 00:06:12,160 which takes as an input the flowerClusters matrix. 124 00:06:12,160 --> 00:06:16,650 And let's turn off the axes by writing axes=FALSE. 125 00:06:16,650 --> 00:06:20,180 And now, going back to our graphics window, 126 00:06:20,180 --> 00:06:22,830 we can see our segmented image here. 127 00:06:22,830 --> 00:06:25,280 The darkest shade corresponds to the background, 128 00:06:25,280 --> 00:06:29,350 and this is actually associated with the first cluster. 
129 00:06:29,350 --> 00:06:31,650 The one in the middle is the core of the flower, 130 00:06:31,650 --> 00:06:33,290 and this is cluster 2. 131 00:06:33,290 --> 00:06:35,720 And then the petals correspond to cluster 3, 132 00:06:35,720 --> 00:06:39,520 which has the fairest shade in our image. 133 00:06:39,520 --> 00:06:42,680 Let us now check how the original image looked. 134 00:06:42,680 --> 00:06:46,430 Let's go back to the console and then maximize it here. 135 00:06:46,430 --> 00:06:48,470 So let's go back to our image function, 136 00:06:48,470 --> 00:06:53,980 but now this time the input is the flower matrix. 137 00:06:53,980 --> 00:06:56,190 And then let's keep axes as FALSE. 138 00:06:56,190 --> 00:06:59,950 But now, how about we add an additional argument regarding 139 00:06:59,950 --> 00:07:01,220 the color scheme? 140 00:07:01,220 --> 00:07:02,580 Let's make it grayscale. 141 00:07:02,580 --> 00:07:05,100 So we're going to set the col argument, 142 00:07:05,100 --> 00:07:07,920 and it's going to take the function gray. 143 00:07:07,920 --> 00:07:10,240 And the input to this function is a sequence 144 00:07:10,240 --> 00:07:13,470 of values that goes from 0 to 1, which 145 00:07:13,470 --> 00:07:15,770 actually is from black to white. 146 00:07:15,770 --> 00:07:17,860 And then we have to also specify its length, 147 00:07:17,860 --> 00:07:22,250 and that's specified as 256, because this corresponds 148 00:07:22,250 --> 00:07:24,950 to the convention for grayscale. 149 00:07:24,950 --> 00:07:26,980 And now, going back to our image, 150 00:07:26,980 --> 00:07:30,780 now we can see our original grayscale image here. 151 00:07:30,780 --> 00:07:34,200 You can see that it has a very, very low resolution. 152 00:07:34,200 --> 00:07:36,860 But in our next video, we will try to segment 153 00:07:36,860 --> 00:07:39,760 an MRI image of the brain that has 154 00:07:39,760 --> 00:07:42,350 a much, much higher resolution.
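The reshaping and plotting steps from this segment can be sketched end to end. The 50 by 50 image below is a synthetic stand-in (an assumption, not the flower picture from the video): a mid-gray disk for the core, a light ring for the petals, and a dark background. The name flowerMatrix follows the "flower matrix" mentioned above, and its exact spelling is assumed.

```r
# Build a hypothetical 50 x 50 grayscale image with intensities in [0, 1].
set.seed(1)
n <- 50
r <- sqrt(outer((1:n - 25.5)^2, (1:n - 25.5)^2, "+"))  # distance from center
flowerMatrix <- ifelse(r < 6, 0.5, ifelse(r < 18, 0.95, 0.05))
flowerMatrix <- flowerMatrix + matrix(runif(n * n, -0.04, 0.04), n, n)

# Flatten to a vector, cluster the intensities, and cut into three clusters.
flowerVector <- as.vector(flowerMatrix)
clusterIntensity <- hclust(dist(flowerVector, method = "euclidean"),
                           method = "ward.D")
flowerClusters <- cutree(clusterIntensity, k = 3)

# cutree returns a vector; give it 50 rows and 50 columns so that
# image() can treat it as a matrix.
dim(flowerClusters) <- c(n, n)

# Segmented image: one color per cluster label.
image(flowerClusters, axes = FALSE)

# Original image in grayscale: 256 gray levels from black (0) to white (1).
image(flowerMatrix, axes = FALSE, col = gray(seq(0, 1, length.out = 256)))
```

Because as.vector and dim both work column by column, each cluster label lands back on the pixel it came from.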