So how does clustering work? The first step in clustering is to define the distance between two data points. The most popular way to compute the distance is what's called Euclidean distance. This is the standard way to compute distance that you might have seen before. Suppose we have two data points, i and j. The distance between the two points, which we'll call d_ij, is equal to the square root of the difference between the two points in the first component, squared, plus the difference between the two points in the second component, squared, all the way up to the difference between the two points in the k-th component, squared, where k here is the number of attributes or independent variables.

Let's see how this works by looking at an example. In our MovieLens dataset, we have binary vectors for each movie, classifying that movie into genres. The movie Toy Story is categorized as an animation, comedy, and children's movie. So the data for Toy Story has a 1 in the spot for these three genres and a 0 everywhere else. The movie Batman Forever is categorized as an action, adventure, comedy, and crime movie. So Batman Forever has a 1 in the spot for these four genres and a 0 everywhere else.

So given these two data observations, let's compute the distance between them. The distance d would be equal to the square root of (0-0)^2 + (0-1)^2 + (0-1)^2 + (1-0)^2, et cetera, with one term for each genre. This ends up being equal to the square root of 5.

In addition to Euclidean distance, there are many other popular distance metrics that could be used. One is called Manhattan distance, where the distance is computed as the sum of the absolute values of the differences instead of the sum of squared differences. Another is called maximum coordinate distance, where we only consider the measurement for which the data points deviate the most.

Another important distance that we have to compute for clustering is the distance between clusters, where a cluster is a group of data points.
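To make these distance calculations concrete, here is a minimal Python sketch (the lecture itself does not show code) that computes the Euclidean, Manhattan, and maximum coordinate distances between the Toy Story and Batman Forever genre vectors. The genre list and ordering are simplified assumptions, not the exact MovieLens columns; only the genres named above are set to 1.

```python
import math

# Simplified, assumed genre ordering (not the exact MovieLens column order).
genres = ["Action", "Adventure", "Animation", "Childrens", "Comedy",
          "Crime", "Documentary", "Drama", "Fantasy", "Horror"]

def genre_vector(active):
    """Binary vector with a 1 for each genre the movie belongs to."""
    return [1 if g in active else 0 for g in genres]

toy_story = genre_vector({"Animation", "Comedy", "Childrens"})
batman_forever = genre_vector({"Action", "Adventure", "Comedy", "Crime"})

def euclidean(x, y):
    # Square root of the sum of squared component differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # Sum of the absolute component differences.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def max_coordinate(x, y):
    # Largest single component difference.
    return max(abs(xi - yi) for xi, yi in zip(x, y))

print(euclidean(toy_story, batman_forever))       # sqrt(5) ~= 2.236
print(manhattan(toy_story, batman_forever))       # 5
print(max_coordinate(toy_story, batman_forever))  # 1
```

The two movies differ in five genre components, so the Euclidean distance comes out to the square root of 5, matching the worked example above.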
We just discussed how to compute the distance between two individual points, but how do we compute the distance between groups of points? One way of doing this is by using what's called the minimum distance. This defines the distance between clusters as the distance between the two data points in the clusters that are closest together. For example, we would define the distance between the yellow and red clusters by computing the Euclidean distance between these two points. The other points in the clusters could be really far away, but it doesn't matter if we use minimum distance. The only thing we care about is how close together the closest points are.

Alternatively, we could use maximum distance. This computes the distance between the two clusters as the distance between the two points that are the farthest apart. So for example, we would compute the distance between the yellow and red clusters by looking at these two points. Here, it doesn't matter how close together the other points are. All we care about is how far apart the furthest points are.

The most common distance metric between clusters is called centroid distance, and this is what we'll use. It defines the distance between clusters by computing the centroid of each cluster. The centroid is the point whose value in each component is the average of that component over all data points in the cluster. This takes all data points in each cluster into account and can be thought of as the middle data point. In our example, the centroids of the yellow and red clusters are here, and we would compute the distance between the clusters by computing the Euclidean distance between those two points.

When we compute distances, the result is highly influenced by the scale of the variables. As an example, suppose you're computing the distance between two data points, where one variable is the revenue of a company in thousands of dollars, and another is the age of the company in years.
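As a rough illustration of the three between-cluster distances, here is a short Python sketch. The two small 2-D clusters are made-up stand-ins for the yellow and red clusters shown on the slides.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Invented 2-D points standing in for the yellow and red clusters.
yellow = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
red    = [(5.0, 5.0), (6.0, 5.5), (5.5, 6.5)]

# Minimum distance: closest pair of points, one from each cluster.
min_dist = min(euclidean(p, q) for p in yellow for q in red)

# Maximum distance: farthest pair of points, one from each cluster.
max_dist = max(euclidean(p, q) for p in yellow for q in red)

# Centroid distance: distance between the component-wise averages.
def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

centroid_dist = euclidean(centroid(yellow), centroid(red))

print(min_dist, max_dist, centroid_dist)
```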
The revenue variable would really dominate the distance calculation. The differences between the data points for revenue would be in the thousands, whereas the differences in the age variable would probably be less than 10. To handle this, it's customary to normalize the data first. We can normalize by subtracting the mean of the data and dividing by the standard deviation. We'll see more of this in the homework.

In our movie dataset, all of our genre variables are on the same scale, so we don't have to worry about normalizing. But if we wanted to add a variable, like box office revenue, we would need to normalize so that this variable didn't dominate all of the others.

Now that we've defined how we'll compute the distances, we'll talk about a specific clustering algorithm, hierarchical clustering, in the next video.
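Here is a minimal sketch of the normalization step described above: subtract each variable's mean and divide by its standard deviation, so that a large-scale variable like revenue no longer dominates a small-scale one like age. The revenue and age values are invented for illustration.

```python
import statistics

revenue = [5200.0, 4300.0, 8900.0, 1200.0]  # invented values, in thousands of dollars
age     = [3.0, 12.0, 7.0, 1.0]             # invented values, in years

def normalize(values):
    # Subtract the mean and divide by the standard deviation (z-score).
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

print(normalize(revenue))  # now on a comparable scale to normalized age
print(normalize(age))
```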