1
00:00:04,500 --> 00:00:07,560
Let us try to understand the
format of the data handed

2
00:00:07,560 --> 00:00:10,030
to us in the CSV files.

3
00:00:10,030 --> 00:00:13,760
Grayscale images are represented
as a matrix of pixel intensity

4
00:00:13,760 --> 00:00:16,740
values that range
from zero to one.

5
00:00:16,740 --> 00:00:19,840
The intensity value zero
corresponds to the absence

6
00:00:19,840 --> 00:00:24,530
of color, or black, and the
value one corresponds to white.

7
00:00:24,530 --> 00:00:29,250
For 8 bits per pixel images,
we have 256 color levels

8
00:00:29,250 --> 00:00:31,680
ranging from zero to one.

9
00:00:31,680 --> 00:00:34,640
For instance, if we have the
following grayscale image,

10
00:00:34,640 --> 00:00:37,000
the pixel information
can be translated

11
00:00:37,000 --> 00:00:40,290
to a matrix of values
between zero and one.

12
00:00:40,290 --> 00:00:44,050
It is exactly this matrix that
we are given in our datasets.

13
00:00:44,050 --> 00:00:46,680
In other words, the
datasets contain a table

14
00:00:46,680 --> 00:00:49,000
of values between zero and one.

15
00:00:49,000 --> 00:00:50,800
And the number of
columns corresponds

16
00:00:50,800 --> 00:00:53,590
to the width of the image,
whereas the number of rows

17
00:00:53,590 --> 00:00:56,150
corresponds to the
height of the image.

18
00:00:56,150 --> 00:01:00,950
In this example, the
resolution is 7 by 7 pixels.

19
00:01:00,950 --> 00:01:04,370
We have to be careful when
reading the dataset in R.

20
00:01:04,370 --> 00:01:06,070
We need to make
sure that R reads

21
00:01:06,070 --> 00:01:08,150
in the matrix appropriately.

22
00:01:08,150 --> 00:01:10,150
Until now in this
class, our datasets

23
00:01:10,150 --> 00:01:14,090
were structured in a way where
the rows refer to observations

24
00:01:14,090 --> 00:01:16,560
and the columns
refer to variables.

25
00:01:16,560 --> 00:01:19,870
But this is not the case
for the intensity matrix.

26
00:01:19,870 --> 00:01:22,580
So keep in mind that we
need to do some maneuvering

27
00:01:22,580 --> 00:01:27,380
to make sure that R recognizes
the data as a matrix.

28
00:01:27,380 --> 00:01:31,120
Grayscale image segmentation
can be done by clustering pixels

29
00:01:31,120 --> 00:01:33,560
according to their
intensity values.

30
00:01:33,560 --> 00:01:35,630
So we can think of our
clustering algorithm

31
00:01:35,630 --> 00:01:38,870
as trying to divide the
spectrum of intensity values

32
00:01:38,870 --> 00:01:42,600
from zero to one into
intervals, or clusters.

33
00:01:42,600 --> 00:01:44,840
For instance, the red
cluster corresponds

34
00:01:44,840 --> 00:01:48,970
to the darkest shades, and the
green cluster to the lightest.

35
00:01:48,970 --> 00:01:53,030
Now, what should the input be
to the clustering algorithm?

36
00:01:53,030 --> 00:01:55,539
Well, our observations
should be all of the 7

37
00:01:55,539 --> 00:01:57,860
by 7 intensity values.

38
00:01:57,860 --> 00:02:00,740
Hence, we should
have 49 observations.

39
00:02:00,740 --> 00:02:02,600
And we only have
one variable, which

40
00:02:02,600 --> 00:02:04,790
is the pixel intensity value.

41
00:02:04,790 --> 00:02:07,830
So in other words, the input
to the clustering algorithm

42
00:02:07,830 --> 00:02:12,410
should be a vector containing 49
elements, or intensity values.

43
00:02:12,410 --> 00:02:15,780
But what we have
is a 7 by 7 matrix.

44
00:02:15,780 --> 00:02:18,230
A crucial step before
feeding the intensity

45
00:02:18,230 --> 00:02:22,380
values to the clustering
algorithm is morphing our data.

46
00:02:22,380 --> 00:02:24,670
We should modify
the matrix structure

47
00:02:24,670 --> 00:02:28,640
and lump all the intensity
values into a single vector.

48
00:02:28,640 --> 00:02:30,640
We will see that
we can do this in R

49
00:02:30,640 --> 00:02:33,650
using the as.vector function.

50
00:02:33,650 --> 00:02:36,060
Now, once we have the
vector, we can simply

51
00:02:36,060 --> 00:02:37,910
feed it into the
clustering algorithm

52
00:02:37,910 --> 00:02:42,160
and assign each element in
the vector to a cluster.

53
00:02:42,160 --> 00:02:44,620
Let us first use
hierarchical clustering

54
00:02:44,620 --> 00:02:46,690
since we are familiar with it.

55
00:02:46,690 --> 00:02:49,770
The first step is to calculate
the distance matrix, which

56
00:02:49,770 --> 00:02:52,390
computes the pairwise
distances among the elements

57
00:02:52,390 --> 00:02:54,290
of the intensity vector.

58
00:02:54,290 --> 00:02:57,400
How many such distances
do we need to calculate?

59
00:02:57,400 --> 00:02:59,980
Well, for each element
in the intensity vector,

60
00:02:59,980 --> 00:03:03,020
we need to calculate its
distance from the other 48

61
00:03:03,020 --> 00:03:03,920
elements.

62
00:03:03,920 --> 00:03:07,450
So this makes 48
calculations per element.

63
00:03:07,450 --> 00:03:11,030
And we have 49 such elements
in the intensity vector.

64
00:03:11,030 --> 00:03:16,310
In total, we should compute 49
times 48 pairwise distances.

65
00:03:16,310 --> 00:03:20,340
But due to symmetry, we really
need to calculate half of them.

66
00:03:20,340 --> 00:03:23,380
So the number of pairwise
distance calculations is

67
00:03:23,380 --> 00:03:24,210
actually (49*48)/2.

68
00:03:27,060 --> 00:03:30,990
In general, if we call the
size of the intensity vector n,

69
00:03:30,990 --> 00:03:36,320
then we need to compute
n*(n-1)/2 pairwise distances

70
00:03:36,320 --> 00:03:39,320
and store them in
the distance matrix.

71
00:03:39,320 --> 00:03:42,780
Now we should be
ready to go to R.

72
00:03:42,780 --> 00:03:45,070
I already navigated
to the directory

73
00:03:45,070 --> 00:03:48,370
where we saved the
flower.csv file, which

74
00:03:48,370 --> 00:03:52,020
contains the matrix of pixel
intensities of a flower image.

75
00:03:52,020 --> 00:03:54,860
Let us read in the matrix
and save it to a data frame

76
00:03:54,860 --> 00:03:59,170
and call it flower, then
use the read.csv function

77
00:03:59,170 --> 00:04:02,690
to instruct R to read
in the flower dataset.

78
00:04:02,690 --> 00:04:05,060
And then we have to
explicitly mention

79
00:04:05,060 --> 00:04:07,800
that we have no
headers in the CSV file

80
00:04:07,800 --> 00:04:11,610
because it only contains a
matrix of intensity values.

81
00:04:11,610 --> 00:04:13,200
So we're going to
type header=FALSE.

82
00:04:16,240 --> 00:04:18,279
Note that the
default in R assumes

83
00:04:18,279 --> 00:04:21,140
that the first row in the
dataset is the header.

84
00:04:21,140 --> 00:04:25,030
So if we didn't specify that we
have no headers in this case,

85
00:04:25,030 --> 00:04:26,450
we would have lost
the information

86
00:04:26,450 --> 00:04:29,880
from the first row of the
pixel intensity matrix.

87
00:04:29,880 --> 00:04:34,460
Now let us look at the structure
of the flower data frame.

88
00:04:34,460 --> 00:04:36,810
We realize that the
way the data is stored

89
00:04:36,810 --> 00:04:40,340
does not reflect that this is
a matrix of intensity values.

90
00:04:40,340 --> 00:04:44,409
Actually, R treats the rows as
observations and the columns

91
00:04:44,409 --> 00:04:46,340
as variables.

92
00:04:46,340 --> 00:04:48,820
Let's try to change the
data type to a matrix

93
00:04:48,820 --> 00:04:51,630
by using the as.matrix function.

94
00:04:51,630 --> 00:04:55,790
So let's define our
variable flowerMatrix

95
00:04:55,790 --> 00:04:58,670
and then use the
as.matrix function, which

96
00:04:58,670 --> 00:05:02,120
takes as an input the
flower data frame.

97
00:05:02,120 --> 00:05:06,930
And now if we look at the
structure of the flower matrix,

98
00:05:06,930 --> 00:05:11,770
we realize that we have
50 rows and 50 columns.

99
00:05:11,770 --> 00:05:14,130
What this suggests is that
the resolution of the image

100
00:05:14,130 --> 00:05:17,670
is 50 pixels in width
and 50 pixels in height.

101
00:05:17,670 --> 00:05:20,850
This is actually a very,
very small picture.

102
00:05:20,850 --> 00:05:24,290
I am very curious to see how
this image looks like, but lets

103
00:05:24,290 --> 00:05:27,110
hold off now and do
our clustering first.

104
00:05:27,110 --> 00:05:29,730
We do not want to be influenced
by how the image looks

105
00:05:29,730 --> 00:05:32,240
like in our decision of
the numbers of clusters

106
00:05:32,240 --> 00:05:34,840
we want to pick.

107
00:05:34,840 --> 00:05:36,480
To perform any
type of clustering,

108
00:05:36,480 --> 00:05:38,070
we saw earlier
that we would need

109
00:05:38,070 --> 00:05:40,470
to convert the matrix
of pixel intensities

110
00:05:40,470 --> 00:05:44,500
to a vector that contains all
the intensity values ranging

111
00:05:44,500 --> 00:05:45,930
from zero to one.

112
00:05:45,930 --> 00:05:49,550
And the clustering algorithm
divides the intensity spectrum,

113
00:05:49,550 --> 00:05:52,490
the interval zero to one,
into these joint clusters

114
00:05:52,490 --> 00:05:53,490
or intervals.

115
00:05:53,490 --> 00:05:58,100
So let us define the
vector flowerVector,

116
00:05:58,100 --> 00:06:02,650
and then now we're going to use
the function as.vector, which

117
00:06:02,650 --> 00:06:06,240
takes as an input
the flowerMatrix.

118
00:06:06,240 --> 00:06:11,020
And now if we look at the
structure of the flowerVector,

119
00:06:11,020 --> 00:06:16,570
we realize that we have
2,500 numerical values, which

120
00:06:16,570 --> 00:06:18,110
range between zero and one.

121
00:06:18,110 --> 00:06:22,310
And this totally makes sense
because this reflects the 50

122
00:06:22,310 --> 00:06:26,830
times 50 intensity values
that we had in our matrix.

123
00:06:26,830 --> 00:06:29,680
Now you might be wondering
why we can't immediately

124
00:06:29,680 --> 00:06:33,000
convert the data frame
flower to a vector.

125
00:06:33,000 --> 00:06:34,340
Let's try to do this.

126
00:06:34,340 --> 00:06:37,700
So let's go back to
our as.vector function

127
00:06:37,700 --> 00:06:40,620
and then have the input
be the flower data

128
00:06:40,620 --> 00:06:42,850
frame instead of
the flower matrix.

129
00:06:42,850 --> 00:06:46,990
And then, let's name this
variable flowerVector2, simply

130
00:06:46,990 --> 00:06:49,790
so that we don't overwrite
the flower vector.

131
00:06:49,790 --> 00:06:51,290
And now let's look
at its structure.

132
00:06:53,930 --> 00:06:57,750
It seems that R reads it exactly
like the flower data frame

133
00:06:57,750 --> 00:07:02,360
and sees it as 50
observations and 50 variables.

134
00:07:02,360 --> 00:07:06,010
So converting the data to a
matrix and then to the vector

135
00:07:06,010 --> 00:07:08,210
is a crucial step.

136
00:07:08,210 --> 00:07:11,660
Now we should be ready to start
our hierarchical clustering.

137
00:07:11,660 --> 00:07:14,850
The first step is to create the
distance matrix, as you already

138
00:07:14,850 --> 00:07:16,850
know, which in
this case computes

139
00:07:16,850 --> 00:07:19,970
the difference between every two
intensity values in our flower

140
00:07:19,970 --> 00:07:20,470
vector.

141
00:07:20,470 --> 00:07:22,340
So let's type
distance=dist(flowerVector,

142
00:07:22,340 --> 00:07:23,170
method="euclidean").

143
00:07:35,930 --> 00:07:38,020
Now that we have
the distance, next

144
00:07:38,020 --> 00:07:41,540
we will be computing the
hierarchical clusters.