1
00:00:04,490 --> 00:00:07,860
In this video we'll use
hierarchical clustering

2
00:00:07,860 --> 00:00:12,160
to cluster the movies in the
MovieLens data set by genre.

3
00:00:12,160 --> 00:00:14,150
After we make our
clusters, we'll

4
00:00:14,150 --> 00:00:17,960
see how they can be used
to make recommendations.

5
00:00:17,960 --> 00:00:21,220
There are two steps to
hierarchical clustering.

6
00:00:21,220 --> 00:00:24,390
First we have to compute the
distances between all data

7
00:00:24,390 --> 00:00:25,450
points.

8
00:00:25,450 --> 00:00:28,400
And then we need to
cluster the points.

9
00:00:28,400 --> 00:00:32,350
To compute the distances we
can use the dist function.

10
00:00:32,350 --> 00:00:36,320
We only want to cluster
movies on the genre variable,

11
00:00:36,320 --> 00:00:40,400
not on the title variable, so
we'll cluster on columns two

12
00:00:40,400 --> 00:00:42,000
through 20.

13
00:00:42,000 --> 00:00:45,220
So let's call the
output distances,

14
00:00:45,220 --> 00:00:48,710
and we'll use the dist function,
where the first argument is

15
00:00:48,710 --> 00:00:54,460
movies[2:20], this is what
we want to cluster on.

16
00:00:54,460 --> 00:00:59,520
And the second argument
is method="euclidean",

17
00:00:59,520 --> 00:01:04,060
meaning that we want to
use euclidean distance.
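
The dist call described above can be tried on a small stand-in data frame. This is a minimal sketch: the `toy` data frame, its titles, and its three genre columns are made up for illustration, whereas the video uses columns 2 through 20 of `movies`.

```r
# Toy stand-in for the movies data frame (hypothetical titles and genres).
toy <- data.frame(
  Title  = c("A", "B", "C", "D"),
  Action = c(1, 1, 0, 0),
  Comedy = c(0, 0, 1, 1),
  Drama  = c(0, 1, 0, 1)
)

# Use only the genre columns (2 through 4 here), not the title.
distances <- dist(toy[2:4], method = "euclidean")
print(distances)

# Movies A and B differ in exactly one genre flag, so their distance is 1.
as.matrix(distances)[1, 2]
```

With 0/1 genre flags, the Euclidean distance between two movies is the square root of the number of genres on which they disagree.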

18
00:01:04,060 --> 00:01:07,660
Now let's cluster our movies
using the hclust function

19
00:01:07,660 --> 00:01:09,960
for hierarchical clustering.

20
00:01:09,960 --> 00:01:13,210
We'll call the
output clusterMovies,

21
00:01:13,210 --> 00:01:18,170
and use hclust where the
first argument is distances,

22
00:01:18,170 --> 00:01:20,500
the output of the dist function.

23
00:01:20,500 --> 00:01:22,210
And the second argument
is method="ward".

24
00:01:25,210 --> 00:01:30,160
The Ward method cares about the
distance between clusters using

25
00:01:30,160 --> 00:01:33,120
centroid distance, and
also the variance in each

26
00:01:33,120 --> 00:01:35,440
of the clusters.

27
00:01:35,440 --> 00:01:39,110
Now let's plot the dendrogram
of our clustering algorithm

28
00:01:39,110 --> 00:01:42,710
by typing plot, and then in
parentheses clusterMovies.
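
The hclust and plot steps can be sketched on simulated data. The `genres` matrix below is random and purely hypothetical. One assumption to flag: in recent versions of R, the method the video calls "ward" is, to my knowledge, named "ward.D", so the sketch uses that name.

```r
set.seed(1)

# Hypothetical 0/1 genre matrix: 30 movies, 5 genres.
genres <- matrix(rbinom(30 * 5, 1, 0.3), nrow = 30)

# Step 1: pairwise Euclidean distances between movies.
distances <- dist(genres, method = "euclidean")

# Step 2: hierarchical clustering.
# "ward.D" is assumed here to correspond to the video's method="ward".
clusterMovies <- hclust(distances, method = "ward.D")

# Draw the dendrogram; with only 30 points the leaf labels stay readable.
plot(clusterMovies)
```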

29
00:01:46,560 --> 00:01:48,940
This dendrogram might
look a little strange.

30
00:01:48,940 --> 00:01:52,180
We have all this black
along the bottom.

31
00:01:52,180 --> 00:01:54,150
Remember that the
dendrogram lists

32
00:01:54,150 --> 00:01:56,900
all of the data points
along the bottom.

33
00:01:56,900 --> 00:01:59,420
But when there are
over 1,000 data points

34
00:01:59,420 --> 00:02:01,420
it's impossible to read.

35
00:02:01,420 --> 00:02:04,900
We'll see later how to assign
our clusters to groups so

36
00:02:04,900 --> 00:02:09,419
that we can analyze which data
points are in which cluster.

37
00:02:09,419 --> 00:02:12,470
So looking at this
dendrogram, how many clusters

38
00:02:12,470 --> 00:02:13,800
would you pick?

39
00:02:13,800 --> 00:02:17,120
It looks like maybe
three or four clusters

40
00:02:17,120 --> 00:02:20,120
would be a good choice
according to the dendrogram.

41
00:02:20,120 --> 00:02:23,090
But let's keep our
application in mind, too.

42
00:02:23,090 --> 00:02:27,050
We probably want more than two,
three, or even four clusters

43
00:02:27,050 --> 00:02:31,260
of movies to make
recommendations to users.

44
00:02:31,260 --> 00:02:32,829
It looks like
there's a nice spot

45
00:02:32,829 --> 00:02:36,610
down here where
there are 10 clusters.

46
00:02:36,610 --> 00:02:39,870
This is probably better
for our application.

47
00:02:39,870 --> 00:02:42,400
We could select
even more clusters

48
00:02:42,400 --> 00:02:46,030
if we want to have very
specific genre groups.

49
00:02:46,030 --> 00:02:47,900
If you want a lot
of clusters it's

50
00:02:47,900 --> 00:02:50,750
hard to pick the right
number from the dendrogram.

51
00:02:50,750 --> 00:02:53,610
You need to use your
understanding of the problem

52
00:02:53,610 --> 00:02:56,020
to pick the number of clusters.

53
00:02:56,020 --> 00:02:58,400
Let's stick with 10
clusters for now,

54
00:02:58,400 --> 00:03:01,030
combining what we learned
from the dendrogram

55
00:03:01,030 --> 00:03:04,590
with our understanding
of the problem.

56
00:03:04,590 --> 00:03:10,010
Now back in our R console we can
label each of the data points

57
00:03:10,010 --> 00:03:11,910
according to what
cluster it belongs

58
00:03:11,910 --> 00:03:14,940
to using the cutree function.

59
00:03:14,940 --> 00:03:17,030
So let's type
clusterGroups=cutree(clusterMovies,

60
00:03:17,030 --> 00:03:17,530
k=10).
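
A small sketch of what cutree returns, again on simulated data (the `genres` matrix is hypothetical, and "ward.D" is assumed to be the current name for the video's "ward" method):

```r
set.seed(1)

# Hypothetical 0/1 genre matrix: 200 movies, 6 genres.
genres <- matrix(rbinom(200 * 6, 1, 0.3), nrow = 200)
clusterMovies <- hclust(dist(genres, method = "euclidean"), method = "ward.D")

# Cut the tree into 10 groups: one integer label (1 to 10) per movie,
# in the same order as the rows of the input data.
clusterGroups <- cutree(clusterMovies, k = 10)

# Count how many movies landed in each cluster.
table(clusterGroups)
```

Because the labels follow the row order of the data, they can be used directly to index or subset the original data frame.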

61
00:03:32,450 --> 00:03:35,360
Now let's figure out what
the clusters are like.

62
00:03:35,360 --> 00:03:37,990
We'll use the tapply
function to compute

63
00:03:37,990 --> 00:03:42,250
the percentage of movies
in each genre and cluster.

64
00:03:42,250 --> 00:03:46,880
So let's type tapply, and then
give as the first argument,

65
00:03:46,880 --> 00:03:51,970
movies$Action-- we'll
start with the action genre--

66
00:03:51,970 --> 00:03:58,790
and then clusterGroups,
and then mean.

67
00:03:58,790 --> 00:04:00,370
So what does this do?

68
00:04:00,370 --> 00:04:04,050
It divides our data points
into the 10 clusters

69
00:04:04,050 --> 00:04:06,290
and then computes
the average value

70
00:04:06,290 --> 00:04:09,710
of the action variable
for each cluster.

71
00:04:09,710 --> 00:04:11,550
Remember that the
action variable

72
00:04:11,550 --> 00:04:15,320
is a binary variable
with value 0 or 1.

73
00:04:15,320 --> 00:04:18,370
So by computing the
average of this variable

74
00:04:18,370 --> 00:04:20,519
we're computing the
percentage of movies

75
00:04:20,519 --> 00:04:24,390
in that cluster that
belong in that genre.
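
To make the averaging concrete, here is a tiny hand-checkable sketch with made-up values (eight hypothetical movies in two clusters rather than ten):

```r
# Hypothetical action flags and cluster labels for eight movies.
Action        <- c(1, 1, 1, 0, 0, 0, 0, 1)
clusterGroups <- c(1, 1, 1, 1, 2, 2, 2, 2)

# Split Action by cluster label, then take the mean within each group.
actionShare <- tapply(Action, clusterGroups, mean)
actionShare
# Cluster 1 holds 1, 1, 1, 0, so its mean is 0.75 (75% action movies).
# Cluster 2 holds 0, 0, 0, 1, so its mean is 0.25 (25% action movies).
```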

76
00:04:24,390 --> 00:04:28,990
So we can see here that
in cluster 2, about 78%

77
00:04:28,990 --> 00:04:32,360
of the movies have
the action genre

78
00:04:32,360 --> 00:04:36,390
label, whereas in cluster
4 none of the movies

79
00:04:36,390 --> 00:04:38,940
are labeled as action movies.

80
00:04:38,940 --> 00:04:41,630
Let's try this
again, but this time

81
00:04:41,630 --> 00:04:43,270
let's look at the romance genre.

82
00:04:47,050 --> 00:04:50,850
Here we can see that all of
the movies in clusters six

83
00:04:50,850 --> 00:04:56,300
and seven are labeled as
romance movies, whereas only 4%

84
00:04:56,300 --> 00:05:00,420
of the movies in cluster two
are labeled as romance movies.

85
00:05:00,420 --> 00:05:03,580
We can repeat this
for each genre.

86
00:05:03,580 --> 00:05:06,050
If you do you can
create a large table

87
00:05:06,050 --> 00:05:10,250
to better analyze the clusters,
which I saved to a spreadsheet.

88
00:05:10,250 --> 00:05:11,110
Let's take a look.

89
00:05:13,730 --> 00:05:16,810
Here we have in each
column the cluster,

90
00:05:16,810 --> 00:05:19,740
and in each row the genre.
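
Rather than calling tapply once per genre by hand, the whole table can be built in one step. This is one possible approach, not necessarily how the spreadsheet in the video was produced; the `toy` data frame and its labels are hypothetical.

```r
# Hypothetical genre flags for six movies, already assigned to three clusters.
toy <- data.frame(
  Action  = c(1, 1, 0, 0, 0, 0),
  Comedy  = c(0, 0, 1, 1, 0, 1),
  Romance = c(0, 1, 0, 1, 1, 1)
)
clusterGroups <- c(1, 1, 2, 2, 3, 3)

# Apply the per-cluster mean to every genre column, then transpose so that
# each row is a genre and each column is a cluster.
genreByCluster <- t(sapply(toy, function(g) tapply(g, clusterGroups, mean)))
round(genreByCluster, 2)
```

On the real data, the same call could be applied to the genre columns only, for example `movies[2:20]`.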

91
00:05:19,740 --> 00:05:21,950
I highlighted the
cells that have

92
00:05:21,950 --> 00:05:24,420
a higher than average value.

93
00:05:24,420 --> 00:05:28,390
So we can see here in
cluster 2, as we saw before,

94
00:05:28,390 --> 00:05:32,780
that cluster 2 has a high
number of action movies.

95
00:05:32,780 --> 00:05:36,710
Cluster 1 has a little bit of
everything, some animation,

96
00:05:36,710 --> 00:05:41,390
children's, fantasy,
musicals, war and westerns.

97
00:05:41,390 --> 00:05:44,690
So I'm calling this the
miscellaneous cluster.

98
00:05:44,690 --> 00:05:47,770
Cluster 2 has a lot of
the action, adventure,

99
00:05:47,770 --> 00:05:50,210
and sci-fi movies.

100
00:05:50,210 --> 00:05:55,520
Cluster 3 has the crime,
mystery, thriller movies.

101
00:05:55,520 --> 00:06:00,350
Cluster 4 exclusively
has drama movies.

102
00:06:00,350 --> 00:06:05,040
Cluster 5, exclusively
has comedies.

103
00:06:05,040 --> 00:06:09,880
Cluster 6 has a lot
of the romance movies.

104
00:06:09,880 --> 00:06:14,440
Cluster 7 has movies that are
comedies and romance movies.

105
00:06:14,440 --> 00:06:17,920
So I'm calling these
the romantic comedies.

106
00:06:17,920 --> 00:06:21,520
Cluster 8 has the documentaries.

107
00:06:21,520 --> 00:06:25,360
Cluster 9 has the movies
that are comedies and dramas,

108
00:06:25,360 --> 00:06:27,750
so the dramatic comedies.

109
00:06:27,750 --> 00:06:31,410
And cluster 10 has
the horror flicks.

110
00:06:31,410 --> 00:06:34,030
Knowing common movie
genres, these clusters

111
00:06:34,030 --> 00:06:36,610
seem to make a lot of sense.

112
00:06:36,610 --> 00:06:39,380
So now, back in our
R console, let's see

113
00:06:39,380 --> 00:06:42,670
how these clusters could be
used in a recommendation system.

114
00:06:45,220 --> 00:06:48,650
Remember that Amy liked
the movie Men in Black.

115
00:06:48,650 --> 00:06:51,810
Let's figure out what
cluster Men in Black is in.

116
00:06:51,810 --> 00:06:56,820
We'll use the subset function
to take a subset of movies

117
00:06:56,820 --> 00:06:59,190
and only look at the movies
where the Title=="Men in Black

118
00:06:59,190 --> 00:07:08,000
(1997)".

119
00:07:08,000 --> 00:07:10,450
Close the quotes
and the parentheses.

120
00:07:10,450 --> 00:07:12,590
I knew that this is the
title of Men in Black

121
00:07:12,590 --> 00:07:16,430
because I looked it
up in our data set.

122
00:07:16,430 --> 00:07:21,660
So it looks like Men in Black
is the 257th row in our data.

123
00:07:21,660 --> 00:07:25,780
So which cluster did
the 257th movie go into?

124
00:07:25,780 --> 00:07:28,490
We can figure this out by
typing clusterGroups[257].

125
00:07:34,110 --> 00:07:37,400
It looks like Men in
Black went into cluster 2.

126
00:07:37,400 --> 00:07:39,070
That makes sense
since we just saw

127
00:07:39,070 --> 00:07:43,680
that cluster 2 is the action,
adventure, sci-fi cluster.

128
00:07:43,680 --> 00:07:45,890
So let's create a new
data set with just

129
00:07:45,890 --> 00:07:47,810
the movies from cluster two.

130
00:07:47,810 --> 00:07:52,130
We'll call it cluster2,
and use the subset function

131
00:07:52,130 --> 00:07:54,890
to take a subset of
movies only taking

132
00:07:54,890 --> 00:07:58,730
the observations for which
clusterGroups is equal to 2.

133
00:08:01,250 --> 00:08:03,740
Let's look at the first
10 titles in this cluster.

134
00:08:03,740 --> 00:08:06,240
We can do this by typing
cluster2$Title[1:10].
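
The lookup and subsetting steps can be sketched end to end on a toy data frame. Apart from "Men in Black (1997)", the titles and the cluster labels below are hypothetical. The sketch uses `which` to find the row number and `head` instead of `[1:10]`, because the toy cluster has fewer than 10 movies.

```r
# Hypothetical stand-in for the movies data frame and its cluster labels.
movies <- data.frame(
  Title  = c("Men in Black (1997)", "Movie B", "Movie C", "Movie D"),
  Action = c(1, 1, 0, 0),
  stringsAsFactors = FALSE
)
clusterGroups <- c(2, 2, 1, 1)

# Find the movie; note the double equals sign for comparison.
subset(movies, Title == "Men in Black (1997)")

# Look up its row number and the cluster that row was assigned to.
mibRow <- which(movies$Title == "Men in Black (1997)")
clusterGroups[mibRow]

# Keep only the movies in that cluster and list up to 10 titles.
cluster2 <- subset(movies, clusterGroups == 2)
head(cluster2$Title, 10)
```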

135
00:08:12,920 --> 00:08:16,150
So it looks like good
movies to recommend to Amy,

136
00:08:16,150 --> 00:08:18,440
according to our
clustering algorithm,

137
00:08:18,440 --> 00:08:23,900
would be movies like Apollo
13 and Jurassic Park.

138
00:08:23,900 --> 00:08:26,310
In this video we
saw how clustering

139
00:08:26,310 --> 00:08:29,940
can be applied to create a
movie recommendation system.

140
00:08:29,940 --> 00:08:32,350
In the next video,
we'll conclude

141
00:08:32,350 --> 00:08:35,850
by learning who ended up winning
the million-dollar Netflix

142
00:08:35,850 --> 00:08:37,400
Prize.