1 00:00:04,500 --> 00:00:07,560 Let us try to understand the format of the data handed 2 00:00:07,560 --> 00:00:10,030 to us in the CSV files. 3 00:00:10,030 --> 00:00:13,760 Grayscale images are represented as a matrix of pixel intensity 4 00:00:13,760 --> 00:00:16,740 values that range from zero to one. 5 00:00:16,740 --> 00:00:19,840 The intensity value zero corresponds to the absence 6 00:00:19,840 --> 00:00:24,530 of color, or black, and the value one corresponds to white. 7 00:00:24,530 --> 00:00:29,250 For 8 bits per pixel images, we have 256 color levels 8 00:00:29,250 --> 00:00:31,680 ranging from zero to one. 9 00:00:31,680 --> 00:00:34,640 For instance, if we have the following grayscale image, 10 00:00:34,640 --> 00:00:37,000 the pixel information can be translated 11 00:00:37,000 --> 00:00:40,290 to a matrix of values between zero and one. 12 00:00:40,290 --> 00:00:44,050 It is exactly this matrix that we are given in our datasets. 13 00:00:44,050 --> 00:00:46,680 In other words, the datasets contain a table 14 00:00:46,680 --> 00:00:49,000 of values between zero and one. 15 00:00:49,000 --> 00:00:50,800 And the number of columns corresponds 16 00:00:50,800 --> 00:00:53,590 to the width of the image, whereas the number of rows 17 00:00:53,590 --> 00:00:56,150 corresponds to the height of the image. 18 00:00:56,150 --> 00:01:00,950 In this example, the resolution is 7 by 7 pixels. 19 00:01:00,950 --> 00:01:04,370 We have to be careful when reading the dataset in R. 20 00:01:04,370 --> 00:01:06,070 We need to make sure that R reads 21 00:01:06,070 --> 00:01:08,150 in the matrix appropriately. 22 00:01:08,150 --> 00:01:10,150 Until now in this class, our datasets 23 00:01:10,150 --> 00:01:14,090 were structured in a way where the rows refer to observations 24 00:01:14,090 --> 00:01:16,560 and the columns refer to variables. 25 00:01:16,560 --> 00:01:19,870 But this is not the case for the intensity matrix. 
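The matrix layout described above can be sketched in R. This is a minimal illustration, not the course data: the 2 by 3 image and the name `raw_levels` are made up, and dividing by 255 is an assumption about how 8-bit levels were rescaled to the zero-to-one range.

```r
# Hypothetical 8-bit grey levels (0 = black, 255 = white) for a tiny image
# that is 2 pixels high and 3 pixels wide.
raw_levels <- matrix(c(  0, 128, 255,
                        64, 192,  32),
                     nrow = 2, byrow = TRUE)

# Assumed rescaling to the [0, 1] range used in the CSV files.
intensity <- raw_levels / 255

nrow(intensity)   # number of rows = image height in pixels
ncol(intensity)   # number of columns = image width in pixels
range(intensity)  # smallest and largest intensity value
```

Each cell is one pixel, so the rows here are not observations and the columns are not variables.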
26 00:01:19,870 --> 00:01:22,580 So keep in mind that we need to do some maneuvering 27 00:01:22,580 --> 00:01:27,380 to make sure that R recognizes the data as a matrix. 28 00:01:27,380 --> 00:01:31,120 Grayscale image segmentation can be done by clustering pixels 29 00:01:31,120 --> 00:01:33,560 according to their intensity values. 30 00:01:33,560 --> 00:01:35,630 So we can think of our clustering algorithm 31 00:01:35,630 --> 00:01:38,870 as trying to divide the spectrum of intensity values 32 00:01:38,870 --> 00:01:42,600 from zero to one into intervals, or clusters. 33 00:01:42,600 --> 00:01:44,840 For instance, the red cluster corresponds 34 00:01:44,840 --> 00:01:48,970 to the darkest shades, and the green cluster to the lightest. 35 00:01:48,970 --> 00:01:53,030 Now, what should the input be to the clustering algorithm? 36 00:01:53,030 --> 00:01:55,539 Well, our observations should be all of the 7 37 00:01:55,539 --> 00:01:57,860 by 7 intensity values. 38 00:01:57,860 --> 00:02:00,740 Hence, we should have 49 observations. 39 00:02:00,740 --> 00:02:02,600 And we only have one variable, which 40 00:02:02,600 --> 00:02:04,790 is the pixel intensity value. 41 00:02:04,790 --> 00:02:07,830 So in other words, the input to the clustering algorithm 42 00:02:07,830 --> 00:02:12,410 should be a vector containing 49 elements, or intensity values. 43 00:02:12,410 --> 00:02:15,780 But what we have is a 7 by 7 matrix. 44 00:02:15,780 --> 00:02:18,230 A crucial step before feeding the intensity 45 00:02:18,230 --> 00:02:22,380 values to the clustering algorithm is morphing our data. 46 00:02:22,380 --> 00:02:24,670 We should modify the matrix structure 47 00:02:24,670 --> 00:02:28,640 and lump all the intensity values into a single vector. 48 00:02:28,640 --> 00:02:30,640 We will see that we can do this in R 49 00:02:30,640 --> 00:02:33,650 using the as.vector function. 
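As a rough sketch of this reshaping step, the snippet below flattens a 7 by 7 matrix into a vector of 49 values with `as.vector`. The matrix is filled with random numbers as a stand-in for real pixel intensities, and the names `pixels` and `pixel_vector` are hypothetical.

```r
# Hypothetical 7 x 7 intensity matrix with values in [0, 1].
set.seed(1)
pixels <- matrix(runif(49), nrow = 7, ncol = 7)

# as.vector() drops the dimensions and returns the values column by column.
pixel_vector <- as.vector(pixels)

length(pixel_vector)                        # 49 observations of one variable
identical(pixel_vector[1:7], pixels[, 1])   # the first 7 entries are column 1
```

The order of the values does not matter for clustering on intensity alone, but it matters later if the cluster labels are to be put back into a matrix of the same shape.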
50 00:02:33,650 --> 00:02:36,060 Now, once we have the vector, we can simply 51 00:02:36,060 --> 00:02:37,910 feed it into the clustering algorithm 52 00:02:37,910 --> 00:02:42,160 and assign each element in the vector to a cluster. 53 00:02:42,160 --> 00:02:44,620 Let us first use hierarchical clustering 54 00:02:44,620 --> 00:02:46,690 since we are familiar with it. 55 00:02:46,690 --> 00:02:49,770 The first step is to calculate the distance matrix, which 56 00:02:49,770 --> 00:02:52,390 computes the pairwise distances among the elements 57 00:02:52,390 --> 00:02:54,290 of the intensity vector. 58 00:02:54,290 --> 00:02:57,400 How many such distances do we need to calculate? 59 00:02:57,400 --> 00:02:59,980 Well, for each element in the intensity vector, 60 00:02:59,980 --> 00:03:03,020 we need to calculate its distance from the other 48 61 00:03:03,020 --> 00:03:03,920 elements. 62 00:03:03,920 --> 00:03:07,450 So this makes 48 calculations per element. 63 00:03:07,450 --> 00:03:11,030 And we have 49 such elements in the intensity vector. 64 00:03:11,030 --> 00:03:16,310 In total, we should compute 49 times 48 pairwise distances. 65 00:03:16,310 --> 00:03:20,340 But due to symmetry, we really need to calculate half of them. 66 00:03:20,340 --> 00:03:23,380 So the number of pairwise distance calculations is 67 00:03:23,380 --> 00:03:24,210 actually (49*48)/2. 68 00:03:27,060 --> 00:03:30,990 In general, if we call the size of the intensity vector n, 69 00:03:30,990 --> 00:03:36,320 then we need to compute n*(n-1)/2 pairwise distances 70 00:03:36,320 --> 00:03:39,320 and store them in the distance matrix. 71 00:03:39,320 --> 00:03:42,780 Now we should be ready to go to R. 72 00:03:42,780 --> 00:03:45,070 I already navigated to the directory 73 00:03:45,070 --> 00:03:48,370 where we saved the flower.csv file, which 74 00:03:48,370 --> 00:03:52,020 contains the matrix of pixel intensities of a flower image. 
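The count of pairwise distances can be checked directly in R. This is only a sanity check on the formula, using an evenly spaced stand-in vector rather than real pixel data; the names `n_pairs` and `d` are hypothetical.

```r
n <- 49

# Number of unordered pairs: n * (n - 1) / 2.
n_pairs <- n * (n - 1) / 2
n_pairs

# choose(n, 2) counts the same thing.
choose(n, 2)

# dist() stores only one triangle of the distance matrix, so its length
# should match the count above.
d <- dist(seq(0, 1, length.out = n))
length(d)
```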
75 00:03:52,020 --> 00:03:54,860 Let us read in the matrix and save it to a data frame 76 00:03:54,860 --> 00:03:59,170 and call it flower, then use the read.csv function 77 00:03:59,170 --> 00:04:02,690 to instruct R to read in the flower dataset. 78 00:04:02,690 --> 00:04:05,060 And then we have to explicitly mention 79 00:04:05,060 --> 00:04:07,800 that we have no headers in the CSV file 80 00:04:07,800 --> 00:04:11,610 because it only contains a matrix of intensity values. 81 00:04:11,610 --> 00:04:13,200 So we're going to type header=FALSE. 82 00:04:16,240 --> 00:04:18,279 Note that the default in R assumes 83 00:04:18,279 --> 00:04:21,140 that the first row in the dataset is the header. 84 00:04:21,140 --> 00:04:25,030 So if we didn't specify that we have no headers in this case, 85 00:04:25,030 --> 00:04:26,450 we would have lost the information 86 00:04:26,450 --> 00:04:29,880 from the first row of the pixel intensity matrix. 87 00:04:29,880 --> 00:04:34,460 Now let us look at the structure of the flower data frame. 88 00:04:34,460 --> 00:04:36,810 We realize that the way the data is stored 89 00:04:36,810 --> 00:04:40,340 does not reflect that this is a matrix of intensity values. 90 00:04:40,340 --> 00:04:44,409 Actually, R treats the rows as observations and the columns 91 00:04:44,409 --> 00:04:46,340 as variables. 92 00:04:46,340 --> 00:04:48,820 Let's try to change the data type to a matrix 93 00:04:48,820 --> 00:04:51,630 by using the as.matrix function. 94 00:04:51,630 --> 00:04:55,790 So let's define our variable flowerMatrix 95 00:04:55,790 --> 00:04:58,670 and then use the as.matrix function, which 96 00:04:58,670 --> 00:05:02,120 takes as an input the flower data frame. 97 00:05:02,120 --> 00:05:06,930 And now if we look at the structure of the flower matrix, 98 00:05:06,930 --> 00:05:11,770 we realize that we have 50 rows and 50 columns. 
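To see why `header=FALSE` matters, here is a small self-contained sketch. It does not use flower.csv; instead it writes a made-up 3 by 3 matrix to a temporary file, so the file and the names `small`, `with_header`, `without_header` and `small_matrix` are all assumptions for illustration.

```r
# Hypothetical 3 x 3 intensity matrix written to a temporary CSV
# with no header row, mimicking the layout of the image files.
tmp <- tempfile(fileext = ".csv")
small <- matrix(c(0.1, 0.2, 0.3,
                  0.4, 0.5, 0.6,
                  0.7, 0.8, 0.9),
                nrow = 3, byrow = TRUE)
write.table(small, tmp, sep = ",", row.names = FALSE, col.names = FALSE)

with_header    <- read.csv(tmp)                  # default: first row used as names
without_header <- read.csv(tmp, header = FALSE)  # first row kept as data

nrow(with_header)     # one row of pixels is lost
nrow(without_header)  # all rows are kept

# Convert the data frame to a numeric matrix and inspect it.
small_matrix <- as.matrix(without_header)
str(small_matrix)
```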
99 00:05:11,770 --> 00:05:14,130 What this suggests is that the resolution of the image 100 00:05:14,130 --> 00:05:17,670 is 50 pixels in width and 50 pixels in height. 101 00:05:17,670 --> 00:05:20,850 This is actually a very, very small picture. 102 00:05:20,850 --> 00:05:24,290 I am very curious to see what this image looks like, but let's 103 00:05:24,290 --> 00:05:27,110 hold off now and do our clustering first. 104 00:05:27,110 --> 00:05:29,730 We do not want to be influenced by how the image looks 105 00:05:29,730 --> 00:05:32,240 in our decision of the number of clusters 106 00:05:32,240 --> 00:05:34,840 we want to pick. 107 00:05:34,840 --> 00:05:36,480 To perform any type of clustering, 108 00:05:36,480 --> 00:05:38,070 we saw earlier that we would need 109 00:05:38,070 --> 00:05:40,470 to convert the matrix of pixel intensities 110 00:05:40,470 --> 00:05:44,500 to a vector that contains all the intensity values ranging 111 00:05:44,500 --> 00:05:45,930 from zero to one. 112 00:05:45,930 --> 00:05:49,550 And the clustering algorithm divides the intensity spectrum, 113 00:05:49,550 --> 00:05:52,490 the interval zero to one, into disjoint clusters 114 00:05:52,490 --> 00:05:53,490 or intervals. 115 00:05:53,490 --> 00:05:58,100 So let us define the vector flowerVector, 116 00:05:58,100 --> 00:06:02,650 and now we're going to use the function as.vector, which 117 00:06:02,650 --> 00:06:06,240 takes as an input the flowerMatrix. 118 00:06:06,240 --> 00:06:11,020 And now if we look at the structure of the flowerVector, 119 00:06:11,020 --> 00:06:16,570 we realize that we have 2,500 numerical values, which 120 00:06:16,570 --> 00:06:18,110 range between zero and one. 121 00:06:18,110 --> 00:06:22,310 And this totally makes sense because this reflects the 50 122 00:06:22,310 --> 00:06:26,830 times 50 intensity values that we had in our matrix. 
123 00:06:26,830 --> 00:06:29,680 Now you might be wondering why we can't immediately 124 00:06:29,680 --> 00:06:33,000 convert the data frame flower to a vector. 125 00:06:33,000 --> 00:06:34,340 Let's try to do this. 126 00:06:34,340 --> 00:06:37,700 So let's go back to our as.vector function 127 00:06:37,700 --> 00:06:40,620 and then have the input be the flower data 128 00:06:40,620 --> 00:06:42,850 frame instead of the flower matrix. 129 00:06:42,850 --> 00:06:46,990 And then, let's name this variable flowerVector2, simply 130 00:06:46,990 --> 00:06:49,790 so that we don't overwrite the flower vector. 131 00:06:49,790 --> 00:06:51,290 And now let's look at its structure. 132 00:06:53,930 --> 00:06:57,750 It seems that R reads it exactly like the flower data frame 133 00:06:57,750 --> 00:07:02,360 and sees it as 50 observations and 50 variables. 134 00:07:02,360 --> 00:07:06,010 So converting the data to a matrix and then to the vector 135 00:07:06,010 --> 00:07:08,210 is a crucial step. 136 00:07:08,210 --> 00:07:11,660 Now we should be ready to start our hierarchical clustering. 137 00:07:11,660 --> 00:07:14,850 The first step is to create the distance matrix, as you already 138 00:07:14,850 --> 00:07:16,850 know, which in this case computes 139 00:07:16,850 --> 00:07:19,970 the difference between every two intensity values in our flower 140 00:07:19,970 --> 00:07:20,470 vector. 141 00:07:20,470 --> 00:07:22,340 So let's type distance=dist(flowerVector, 142 00:07:22,340 --> 00:07:23,170 method="euclidean"). 143 00:07:35,930 --> 00:07:38,020 Now that we have the distance, next 144 00:07:38,020 --> 00:07:41,540 we will be computing the hierarchical clusters.
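The conversion and distance steps above can be sketched end to end. Since flower.csv is not reproduced here, the snippet uses a random 50 by 50 data frame as a stand-in; the names `flower_df`, `flower_matrix`, `flower_vector`, `direct` and `distance` are hypothetical, and only `as.matrix`, `as.vector` and `dist` come from the lecture.

```r
# Stand-in for the flower data frame: 50 rows and 50 columns of values in [0, 1].
set.seed(42)
flower_df <- as.data.frame(matrix(runif(50 * 50), nrow = 50))

# Data frame -> matrix -> vector gives a plain numeric vector.
flower_matrix <- as.matrix(flower_df)
flower_vector <- as.vector(flower_matrix)
length(flower_vector)   # 50 * 50 intensity values

# Skipping the matrix step does not give a plain numeric vector:
# the result is still list-like, with one element per column.
direct <- as.vector(flower_df)
is.atomic(direct)
length(direct)

# Pairwise Euclidean distances between all intensity values.
distance <- dist(flower_vector, method = "euclidean")
length(distance)        # n * (n - 1) / 2 with n = 2500
```

With a single variable, the Euclidean distance between two observations is just the absolute difference of their intensities.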