1
00:00:04,970 --> 00:00:07,850
In this video we
will try to segment

2
00:00:07,850 --> 00:00:10,700
an MRI brain image of
a healthy patient using

3
00:00:10,700 --> 00:00:12,860
hierarchical clustering.

4
00:00:12,860 --> 00:00:14,790
Make sure that you are
in the directory where

5
00:00:14,790 --> 00:00:17,340
you saved the healthy.csv file.

6
00:00:17,340 --> 00:00:19,730
We will be following
the exact same steps

7
00:00:19,730 --> 00:00:21,960
we did in the previous video.

8
00:00:21,960 --> 00:00:25,610
First, read in the data, and
call the data frame healthy.

9
00:00:25,610 --> 00:00:29,800
Use the read.csv function to
read in the healthy data set.

10
00:00:29,800 --> 00:00:31,800
And remember that this
healthy data set consists

11
00:00:31,800 --> 00:00:36,220
of a matrix of intensity
values, so let's set the header

12
00:00:36,220 --> 00:00:38,740
to false.

13
00:00:38,740 --> 00:00:41,610
And now let's create
the healthy matrix using

14
00:00:41,610 --> 00:00:45,210
the as.matrix function,
which takes as an input

15
00:00:45,210 --> 00:00:46,170
the healthy data frame.

16
00:00:48,850 --> 00:00:50,570
And now let's
output the structure

17
00:00:50,570 --> 00:00:53,030
of the healthy matrix.

18
00:00:53,030 --> 00:00:56,990
And then we realize
that we have 566

19
00:00:56,990 --> 00:01:00,760
by 646 pixel resolution
for our image.

20
00:01:00,760 --> 00:01:04,430
So this MRI image is
considerably larger

21
00:01:04,430 --> 00:01:06,690
than the little flower
image that we saw,

22
00:01:06,690 --> 00:01:10,170
and worked with in
the previous two videos.

23
00:01:10,170 --> 00:01:13,620
To see the MRI image, we
can use the image function

24
00:01:13,620 --> 00:01:17,360
in R, which takes as an
input the healthy matrix.

25
00:01:17,360 --> 00:01:20,560
And then let's
turn our axes off.

26
00:01:20,560 --> 00:01:22,810
And then use the
grey-scale color scheme.

27
00:01:22,810 --> 00:01:26,400
So the color is equal to
grey, which takes a sequence

28
00:01:26,400 --> 00:01:31,970
of values going from zero to
one, with a length of 256.
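
The steps so far can be sketched in R roughly as follows. The variable names healthy and healthyMatrix are assumptions based on the spoken names, and the small random matrix is only a hypothetical stand-in so the sketch still runs when healthy.csv is not in the working directory.

```r
# Read the intensity values; the file has no header row.
if (file.exists("healthy.csv")) {
  healthy <- read.csv("healthy.csv", header = FALSE)
} else {
  # Hypothetical stand-in data, NOT the real MRI image.
  set.seed(1)
  healthy <- as.data.frame(matrix(runif(20 * 30), nrow = 20))
}

# Convert the data frame to a numeric matrix and inspect its structure.
healthyMatrix <- as.matrix(healthy)
str(healthyMatrix)

# Display the matrix as a grey-scale image with the axes turned off.
image(healthyMatrix, axes = FALSE, col = grey(seq(0, 1, length = 256)))
```

With the real file, str should report the 566 by 646 dimensions mentioned earlier.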

29
00:01:31,970 --> 00:01:34,450
And now going to
our graphics window,

30
00:01:34,450 --> 00:01:37,680
we see that what we have
is the T2-weighted MRI

31
00:01:37,680 --> 00:01:39,880
image of a top
section of the brain.

32
00:01:39,880 --> 00:01:41,630
And it shows
different substances,

33
00:01:41,630 --> 00:01:44,110
such as the gray matter,
the white matter,

34
00:01:44,110 --> 00:01:47,039
and the cerebrospinal fluid.

35
00:01:47,039 --> 00:01:49,190
Now let us see if we can
isolate these substances

36
00:01:49,190 --> 00:01:52,500
via hierarchical clustering.

37
00:01:52,500 --> 00:01:54,840
We first need to convert the
healthy matrix to a vector,

38
00:01:54,840 --> 00:01:58,320
and let's call it
healthy vector.

39
00:01:58,320 --> 00:02:02,800
And that is equal to as.vector
of the healthy matrix.

40
00:02:05,320 --> 00:02:07,230
And now the first
step in performing

41
00:02:07,230 --> 00:02:10,669
hierarchical clustering is
computing the distance matrix.

42
00:02:10,669 --> 00:02:17,060
So let's type distance equals
dist of healthy vector.

43
00:02:17,060 --> 00:02:19,070
And let's specify the
method to be euclidean.
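
To see what these two calls do without running out of memory, here is a minimal sketch on a tiny, made-up matrix. The values and the names m and v are hypothetical; on the real image the same dist call is the one that fails.

```r
# A tiny hypothetical "image": 2 rows by 3 columns of intensities.
m <- matrix(c(0.1, 0.4, 0.2, 0.9, 0.5, 0.7), nrow = 2)

# as.vector flattens the matrix column by column.
v <- as.vector(m)

# dist computes every pairwise distance between the 6 values.
distance <- dist(v, method = "euclidean")

# 6 values give 6 * 5 / 2 = 15 pairwise distances.
length(distance)
```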

44
00:02:23,530 --> 00:02:27,140
Oh, R gives us an error
that seems to tell us that

45
00:02:27,140 --> 00:02:31,110
our vector is huge, and R
cannot allocate enough memory.

46
00:02:31,110 --> 00:02:33,740
Well, let us see how
big our vector is.

47
00:02:33,740 --> 00:02:36,930
So we're going to go
and use the str

48
00:02:36,930 --> 00:02:39,350
function over the
healthy vector,

49
00:02:39,350 --> 00:02:43,170
and let's see what we obtain.

50
00:02:43,170 --> 00:02:43,690
Hm.

51
00:02:43,690 --> 00:02:49,070
The healthy vector
has 365,636 elements.

52
00:02:49,070 --> 00:02:50,450
Let's call this number n.

53
00:02:50,450 --> 00:02:53,800
And remember, from
our previous video,

54
00:02:53,800 --> 00:02:56,579
that for R to calculate
the pairwise distances,

55
00:02:56,579 --> 00:03:03,820
it would actually need to
calculate n*(n-1)/2 values and then

56
00:03:03,820 --> 00:03:06,010
store them in the
distance matrix.

57
00:03:06,010 --> 00:03:08,800
Let's see how big
this number is.

58
00:03:08,800 --> 00:03:09,600
Wow.

59
00:03:09,600 --> 00:03:11,820
Of course R would complain.

60
00:03:11,820 --> 00:03:14,320
It's about 67 billion
values that we're

61
00:03:14,320 --> 00:03:17,430
asking R to store in a matrix.
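
The arithmetic can be checked directly in R. This is only a back-of-the-envelope sketch; the memory estimate assumes each distance is stored as an 8-byte double, and the names n_pairs and gigabytes are illustrative.

```r
# Number of pixels in the 566 by 646 image.
n <- 566 * 646

# Number of pairwise distances that dist would have to store.
n_pairs <- n * (n - 1) / 2
n_pairs

# Rough memory needed, assuming 8 bytes per stored distance.
gigabytes <- n_pairs * 8 / 1e9
gigabytes
```

That works out to hundreds of gigabytes, far more than a typical machine can allocate.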

62
00:03:17,430 --> 00:03:20,930
The bad news now is that
we cannot use hierarchical

63
00:03:20,930 --> 00:03:22,440
clustering.

64
00:03:22,440 --> 00:03:24,610
Is there any other solution?

65
00:03:24,610 --> 00:03:26,750
Well, we have seen
in lecture two

66
00:03:26,750 --> 00:03:30,300
that another clustering
method is k-means.

67
00:03:30,300 --> 00:03:32,160
Let us review it
first, and see if it

68
00:03:32,160 --> 00:03:35,620
could work on our
high resolution image.

69
00:03:35,620 --> 00:03:37,800
The k-means clustering
algorithm aims

70
00:03:37,800 --> 00:03:40,720
at partitioning the
data into k clusters,

71
00:03:40,720 --> 00:03:42,829
in a way that each
data point belongs

72
00:03:42,829 --> 00:03:46,210
to the cluster whose mean
is the nearest to it.

73
00:03:46,210 --> 00:03:49,740
Let's go over the
algorithm step-by-step.

74
00:03:49,740 --> 00:03:52,560
In this example we
have five data points.

75
00:03:52,560 --> 00:03:55,610
The first step is to specify
the number of clusters.

76
00:03:55,610 --> 00:04:00,410
And suppose we wish to find
two clusters, so set k=2.

77
00:04:00,410 --> 00:04:02,990
Then we start by randomly
grouping the data

78
00:04:02,990 --> 00:04:04,450
into two clusters.

79
00:04:04,450 --> 00:04:07,140
For instance, three
points in the red cluster,

80
00:04:07,140 --> 00:04:10,570
and the remaining two
points in the grey cluster.

81
00:04:10,570 --> 00:04:14,360
The next step is to compute
the cluster means or centroids.

82
00:04:14,360 --> 00:04:17,310
Let's first compute the
mean of the red cluster,

83
00:04:17,310 --> 00:04:21,120
and then the mean of the grey
cluster is simply the midpoint.

84
00:04:21,120 --> 00:04:24,010
Now remember that the
k-means clustering algorithm

85
00:04:24,010 --> 00:04:27,790
tries to cluster points
according to the nearest mean.

86
00:04:27,790 --> 00:04:31,260
But this red point over here
seems to be closer to the mean

87
00:04:31,260 --> 00:04:34,290
of the grey cluster than to
the mean of the red cluster

88
00:04:34,290 --> 00:04:37,530
to which it was assigned
in the previous step.

89
00:04:37,530 --> 00:04:40,850
So intuitively, the next
step in the k-means algorithm

90
00:04:40,850 --> 00:04:45,220
is to re-assign the data points
to the closest cluster mean.

91
00:04:45,220 --> 00:04:49,920
As a result, now this red point
should be in the grey cluster.

92
00:04:49,920 --> 00:04:52,710
Now that we moved one point
from the red cluster over

93
00:04:52,710 --> 00:04:55,900
to the grey cluster, we
need to update the means.

94
00:04:55,900 --> 00:04:59,140
This is exactly the next step
in the k-means algorithm.

95
00:04:59,140 --> 00:05:03,340
So let's recompute the
mean of the red cluster,

96
00:05:03,340 --> 00:05:06,970
and then re-compute the
mean of the grey cluster.

97
00:05:06,970 --> 00:05:09,340
Now we go back to Step 4.

98
00:05:09,340 --> 00:05:11,060
Is there any point
here that seems

99
00:05:11,060 --> 00:05:14,850
to be closer to a cluster mean
that it does not belong to?

100
00:05:14,850 --> 00:05:18,700
If so, we need to re-assign
it to the other cluster.

101
00:05:18,700 --> 00:05:20,500
However, in this
case, all points

102
00:05:20,500 --> 00:05:24,000
are closest to their cluster
mean, so the algorithm is done,

103
00:05:24,000 --> 00:05:26,310
and we can stop.
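
The steps above can be sketched by hand in R. This is a minimal illustration, not the implementation used in the next video: the five one-dimensional points are made up, and the initial grouping is fixed rather than random so that the run is reproducible.

```r
# Five hypothetical data points on a line, and k = 2 clusters.
x <- c(1, 2, 6, 8, 9)

# Initial grouping (fixed here; the algorithm would pick it at random):
# three points in cluster 1 ("red"), two in cluster 2 ("grey").
cluster <- c(1, 1, 1, 2, 2)

iterations <- 0
repeat {
  # Compute the mean (centroid) of each cluster.
  # Note: this sketch does not handle a cluster becoming empty.
  centroids <- as.numeric(tapply(x, cluster, mean))

  # Distance from every point to every centroid (rows = points).
  d <- abs(outer(x, centroids, "-"))

  # Re-assign each point to its closest centroid.
  new_cluster <- apply(d, 1, which.min)

  # Stop when no point changes cluster.
  if (all(new_cluster == cluster)) break
  cluster <- new_cluster
  iterations <- iterations + 1
}

cluster    # final assignment of each point
centroids  # final cluster means
```

Here the point at 6 starts in the first cluster, is closer to the mean of the second one, and moves over after the first pass, which mirrors the red point in the example.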

104
00:05:26,310 --> 00:05:28,800
In the next video, we will
implement the k-means algorithm

105
00:05:28,800 --> 00:05:32,470
in R to try to segment
the MRI brain image.