1 00:00:04,490 --> 00:00:07,860 In this video we'll use hierarchical clustering 2 00:00:07,860 --> 00:00:12,160 to cluster the movies in the MovieLens data set by genre. 3 00:00:12,160 --> 00:00:14,150 After we make our clusters, we'll 4 00:00:14,150 --> 00:00:17,960 see how they can be used to make recommendations. 5 00:00:17,960 --> 00:00:21,220 There are two steps to hierarchical clustering. 6 00:00:21,220 --> 00:00:24,390 First we have to compute the distances between all data 7 00:00:24,390 --> 00:00:25,450 points. 8 00:00:25,450 --> 00:00:28,400 And then we need to cluster the points. 9 00:00:28,400 --> 00:00:32,350 To compute the distances we can use the dist function. 10 00:00:32,350 --> 00:00:36,320 We only want to cluster movies on the genre variables, 11 00:00:36,320 --> 00:00:40,400 not on the title variable, so we'll cluster on columns two 12 00:00:40,400 --> 00:00:42,000 through 20. 13 00:00:42,000 --> 00:00:45,220 So let's call the output distances, 14 00:00:45,220 --> 00:00:48,710 and we'll use the dist function, where the first argument is 15 00:00:48,710 --> 00:00:54,460 movies[2:20], this is what we want to cluster on. 16 00:00:54,460 --> 00:00:59,520 And the second argument is method="euclidean", 17 00:00:59,520 --> 00:01:04,060 meaning that we want to use Euclidean distance. 18 00:01:04,060 --> 00:01:07,660 Now let's cluster our movies using the hclust function 19 00:01:07,660 --> 00:01:09,960 for hierarchical clustering. 20 00:01:09,960 --> 00:01:13,210 We'll call the output clusterMovies, 21 00:01:13,210 --> 00:01:18,170 and use hclust where the first argument is distances, 22 00:01:18,170 --> 00:01:20,500 the output of the dist function. 23 00:01:20,500 --> 00:01:22,210 And the second argument is method="ward". 24 00:01:25,210 --> 00:01:30,160 The Ward method cares about the distance between clusters using 25 00:01:30,160 --> 00:01:33,120 centroid distance, and also the variance in each 26 00:01:33,120 --> 00:01:35,440 of the clusters.
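The two steps described so far (distances, then clustering) can be sketched as follows. This is a minimal illustration, not the lecture's own session: the small `movies` data frame is a made-up stand-in for the MovieLens data, so its genre columns are 2 through 4 rather than 2 through 20. It also assumes a recent version of R, where the Ward option of `hclust` is spelled `"ward.D"`; plain `"ward"` is still accepted there but prints a message about the rename.

```r
# Toy stand-in for the movies data frame (hypothetical rows, not MovieLens):
# a Title column followed by binary genre indicator columns.
movies <- data.frame(
  Title   = c("Toy A", "Toy B", "Toy C", "Toy D", "Toy E", "Toy F"),
  Action  = c(1, 1, 0, 0, 0, 1),
  Comedy  = c(0, 0, 1, 1, 0, 0),
  Romance = c(0, 0, 1, 0, 1, 0)
)

# Step 1: pairwise Euclidean distances, computed on the genre columns only
# (the Title column is left out).
distances <- dist(movies[2:4], method = "euclidean")

# Step 2: hierarchical clustering on those distances with Ward's method.
# Assumption: recent R versions name this option "ward.D".
clusterMovies <- hclust(distances, method = "ward.D")

print(clusterMovies)
```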
27 00:01:35,440 --> 00:01:39,110 Now let's plot the dendrogram of our clustering algorithm 28 00:01:39,110 --> 00:01:42,710 by typing plot, and then in parentheses clusterMovies. 29 00:01:46,560 --> 00:01:48,940 This dendrogram might look a little strange. 30 00:01:48,940 --> 00:01:52,180 We have all this black along the bottom. 31 00:01:52,180 --> 00:01:54,150 Remember that the dendrogram lists 32 00:01:54,150 --> 00:01:56,900 all of the data points along the bottom. 33 00:01:56,900 --> 00:01:59,420 But when there are over 1,000 data points 34 00:01:59,420 --> 00:02:01,420 it's impossible to read. 35 00:02:01,420 --> 00:02:04,900 We'll see later how to assign our clusters to groups so 36 00:02:04,900 --> 00:02:09,419 that we can analyze which data points are in which cluster. 37 00:02:09,419 --> 00:02:12,470 So looking at this dendrogram, how many clusters 38 00:02:12,470 --> 00:02:13,800 would you pick? 39 00:02:13,800 --> 00:02:17,120 It looks like maybe three or four clusters 40 00:02:17,120 --> 00:02:20,120 would be a good choice according to the dendrogram. 41 00:02:20,120 --> 00:02:23,090 But let's keep our application in mind, too. 42 00:02:23,090 --> 00:02:27,050 We probably want more than two, three, or even four clusters 43 00:02:27,050 --> 00:02:31,260 of movies to make recommendations to users. 44 00:02:31,260 --> 00:02:32,829 It looks like there's a nice spot 45 00:02:32,829 --> 00:02:36,610 down here where there's 10 clusters. 46 00:02:36,610 --> 00:02:39,870 This is probably better for our application. 47 00:02:39,870 --> 00:02:42,400 We could select even more clusters 48 00:02:42,400 --> 00:02:46,030 if we want to have very specific genre groups. 49 00:02:46,030 --> 00:02:47,900 If you want a lot of clusters it's 50 00:02:47,900 --> 00:02:50,750 hard to pick the right number from the dendrogram. 51 00:02:50,750 --> 00:02:53,610 You need to use your understanding of the problem 52 00:02:53,610 --> 00:02:56,020 to pick the number of clusters.
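One possible way to make such a dendrogram easier to read is sketched below. It is not part of the lecture: it uses the `labels = FALSE` argument of `plot` to drop the unreadable leaf labels and `rect.hclust` to outline a chosen number of clusters. The toy `movies` data frame is hypothetical, and `k = 3` is used only because the toy data has three obvious groups; the lecture's choice for the real data is 10.

```r
# Hypothetical toy data with three obvious genre groups (not MovieLens).
movies <- data.frame(
  Title   = paste("Toy Movie", 1:8),
  Action  = c(1, 1, 1, 0, 0, 0, 0, 0),
  Comedy  = c(0, 0, 0, 1, 1, 1, 0, 0),
  Romance = c(0, 0, 0, 0, 0, 0, 1, 1)
)

distances <- dist(movies[2:4], method = "euclidean")
# Assumption: recent R versions name the Ward option "ward.D".
clusterMovies <- hclust(distances, method = "ward.D")

# Draw the dendrogram without leaf labels, so the bottom is not a black band
# when there are many data points.
plot(clusterMovies, labels = FALSE)

# Outline the clusters obtained by cutting the tree into k groups.
# rect.hclust draws on the existing plot and invisibly returns the
# members of each outlined cluster.
boxes <- rect.hclust(clusterMovies, k = 3, border = "red")
```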
53 00:02:56,020 --> 00:02:58,400 Let's stick with 10 clusters for now, 54 00:02:58,400 --> 00:03:01,030 combining what we learned from the dendrogram 55 00:03:01,030 --> 00:03:04,590 with our understanding of the problem. 56 00:03:04,590 --> 00:03:10,010 Now back in our R console we can label each of the data points 57 00:03:10,010 --> 00:03:11,910 according to what cluster it belongs 58 00:03:11,910 --> 00:03:14,940 to using the cutree function. 59 00:03:14,940 --> 00:03:17,030 So let's type clusterGroups=cutree(clusterMovies, 60 00:03:17,030 --> 00:03:17,530 k=10). 61 00:03:32,450 --> 00:03:35,360 Now let's figure out what the clusters are like. 62 00:03:35,360 --> 00:03:37,990 We'll use the tapply function to compute 63 00:03:37,990 --> 00:03:42,250 the percentage of movies in each genre and cluster. 64 00:03:42,250 --> 00:03:46,880 So let's type tapply, and then give as the first argument, 65 00:03:46,880 --> 00:03:51,970 movies$Action-- we'll start with the action genre-- 66 00:03:51,970 --> 00:03:58,790 and then clusterGroups, and then mean. 67 00:03:58,790 --> 00:04:00,370 So what does this do? 68 00:04:00,370 --> 00:04:04,050 It divides our data points into the 10 clusters 69 00:04:04,050 --> 00:04:06,290 and then computes the average value 70 00:04:06,290 --> 00:04:09,710 of the action variable for each cluster. 71 00:04:09,710 --> 00:04:11,550 Remember that the action variable 72 00:04:11,550 --> 00:04:15,320 is a binary variable with value 0 or 1. 73 00:04:15,320 --> 00:04:18,370 So by computing the average of this variable 74 00:04:18,370 --> 00:04:20,519 we're computing the percentage of movies 75 00:04:20,519 --> 00:04:24,390 in that cluster that belong in that genre. 76 00:04:24,390 --> 00:04:28,990 So we can see here that in cluster 2, about 78% 77 00:04:28,990 --> 00:04:32,360 of the movies have the action genre 78 00:04:32,360 --> 00:04:36,390 label, whereas in cluster 4 none of the movies 79 00:04:36,390 --> 00:04:38,940 are labeled as action movies.
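The `cutree` and `tapply` steps above can be tried on a small example. The sketch below uses a hypothetical toy `movies` data frame and `k = 3` instead of the lecture's 10 clusters. The final `sapply` call is one possible way to run the same `tapply` over every genre column at once; it is an addition for illustration, not something shown in the lecture.

```r
# Hypothetical toy data with three obvious genre groups (not MovieLens).
movies <- data.frame(
  Title   = paste("Toy Movie", 1:8),
  Action  = c(1, 1, 1, 0, 0, 0, 0, 0),
  Comedy  = c(0, 0, 0, 1, 1, 1, 0, 0),
  Romance = c(0, 0, 0, 0, 0, 0, 1, 1)
)

distances <- dist(movies[2:4], method = "euclidean")
# Assumption: recent R versions name the Ward option "ward.D".
clusterMovies <- hclust(distances, method = "ward.D")

# Label each movie with the cluster it falls into when the tree is cut
# into k groups (k = 3 here; the lecture uses k = 10 on the real data).
clusterGroups <- cutree(clusterMovies, k = 3)

# Average of the 0/1 Action variable within each cluster, which is the
# share of movies in that cluster carrying the Action label.
actionShare <- tapply(movies$Action, clusterGroups, mean)
print(actionShare)

# The same computation for every genre column: rows are genres,
# columns are clusters.
genreTable <- t(sapply(movies[2:4], function(g) tapply(g, clusterGroups, mean)))
print(genreTable)
```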
80 00:04:38,940 --> 00:04:41,630 Let's try this again, but this time 81 00:04:41,630 --> 00:04:43,270 let's look at the romance genre. 82 00:04:47,050 --> 00:04:50,850 Here we can see that all of the movies in clusters six 83 00:04:50,850 --> 00:04:56,300 and seven are labeled as romance movies, whereas only 4% 84 00:04:56,300 --> 00:05:00,420 of the movies in cluster two are labeled as romance movies. 85 00:05:00,420 --> 00:05:03,580 We can repeat this for each genre. 86 00:05:03,580 --> 00:05:06,050 If you do you can create a large table 87 00:05:06,050 --> 00:05:10,250 to better analyze the clusters, which I saved to a spreadsheet. 88 00:05:10,250 --> 00:05:11,110 Let's take a look. 89 00:05:13,730 --> 00:05:16,810 Here we have in each column the cluster, 90 00:05:16,810 --> 00:05:19,740 and in each row the genre. 91 00:05:19,740 --> 00:05:21,950 I highlighted the cells that have 92 00:05:21,950 --> 00:05:24,420 a higher than average value. 93 00:05:24,420 --> 00:05:28,390 So we can see here in cluster 2, as we saw before, 94 00:05:28,390 --> 00:05:32,780 that cluster 2 has a high number of action movies. 95 00:05:32,780 --> 00:05:36,710 Cluster 1 has a little bit of everything, some animation, 96 00:05:36,710 --> 00:05:41,390 children's, fantasy, musicals, war and westerns. 97 00:05:41,390 --> 00:05:44,690 So I'm calling this the miscellaneous cluster. 98 00:05:44,690 --> 00:05:47,770 Cluster 2 has a lot of the action, adventure, 99 00:05:47,770 --> 00:05:50,210 and sci-fi movies. 100 00:05:50,210 --> 00:05:55,520 Cluster 3 has the crime, mystery, thriller movies. 101 00:05:55,520 --> 00:06:00,350 Cluster 4 exclusively has drama movies. 102 00:06:00,350 --> 00:06:05,040 Cluster 5, exclusively has comedies. 103 00:06:05,040 --> 00:06:09,880 Cluster 6 has a lot of the romance movies. 104 00:06:09,880 --> 00:06:14,440 Cluster 7 has movies that are comedies and romance movies. 105 00:06:14,440 --> 00:06:17,920 So I'm calling these the romantic comedies.
106 00:06:17,920 --> 00:06:21,520 Cluster 8 has the documentaries. 107 00:06:21,520 --> 00:06:25,360 Cluster 9 has the movies that are comedies and dramas, 108 00:06:25,360 --> 00:06:27,750 so the dramatic comedies. 109 00:06:27,750 --> 00:06:31,410 And cluster 10 has the horror flicks. 110 00:06:31,410 --> 00:06:34,030 Knowing common movie genres, these clusters 111 00:06:34,030 --> 00:06:36,610 seem to make a lot of sense. 112 00:06:36,610 --> 00:06:39,380 So now, back in our R console, let's see 113 00:06:39,380 --> 00:06:42,670 how these clusters could be used in a recommendation system. 114 00:06:45,220 --> 00:06:48,650 Remember that Amy liked the movie Men in Black. 115 00:06:48,650 --> 00:06:51,810 Let's figure out what cluster Men in Black is in. 116 00:06:51,810 --> 00:06:56,820 We'll use the subset function to take a subset of movies 117 00:06:56,820 --> 00:06:59,190 and only look at the movies where the Title=="Men in Black 118 00:06:59,190 --> 00:07:08,000 (1997)". 119 00:07:08,000 --> 00:07:10,450 Close the quotes and the parentheses. 120 00:07:10,450 --> 00:07:12,590 I knew that this is the title of Men in Black 121 00:07:12,590 --> 00:07:16,430 because I looked it up in our data set. 122 00:07:16,430 --> 00:07:21,660 So it looks like Men in Black is the 257th row in our data. 123 00:07:21,660 --> 00:07:25,780 So which cluster did the 257th movie go into? 124 00:07:25,780 --> 00:07:28,490 We can figure this out by typing clusterGroups[257]. 125 00:07:34,110 --> 00:07:37,400 It looks like Men in Black went into cluster 2. 126 00:07:37,400 --> 00:07:39,070 That makes sense since we just saw 127 00:07:39,070 --> 00:07:43,680 that cluster 2 is the action, adventure, sci-fi cluster. 128 00:07:43,680 --> 00:07:45,890 So let's create a new data set with just 129 00:07:45,890 --> 00:07:47,810 the movies from cluster two.
130 00:07:47,810 --> 00:07:52,130 We'll call it cluster2, and use the subset function 131 00:07:52,130 --> 00:07:54,890 to take a subset of movies only taking 132 00:07:54,890 --> 00:07:58,730 the observations for which clusterGroups is equal to 2. 133 00:08:01,250 --> 00:08:03,740 Let's look at the first 10 titles in this cluster. 134 00:08:03,740 --> 00:08:06,240 We can do this by typing cluster2$Title[1:10]. 135 00:08:12,920 --> 00:08:16,150 So it looks like good movies to recommend to Amy, 136 00:08:16,150 --> 00:08:18,440 according to our clustering algorithm, 137 00:08:18,440 --> 00:08:23,900 would be movies like Apollo 13 and Jurassic Park. 138 00:08:23,900 --> 00:08:26,310 In this video we saw how clustering 139 00:08:26,310 --> 00:08:29,940 can be applied to create a movie recommendation system. 140 00:08:29,940 --> 00:08:32,350 In the next video, we'll conclude 141 00:08:32,350 --> 00:08:35,850 by learning who ended up winning the million dollar Netflix 142 00:08:35,850 --> 00:08:37,400 prize.
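The recommendation lookup walked through above can be put together as a short sketch. Everything here other than the title "Men in Black (1997)" is an assumption for illustration: the other titles are placeholders, the genre flags are made up, and `k = 3` replaces the lecture's 10 clusters. The row is found with `which` rather than by typing the row number, and the names `likedTitle`, `likedRow`, `likedCluster`, `sameCluster`, and `recommendations` are hypothetical.

```r
# Hypothetical toy data (placeholder titles and made-up genre flags).
movies <- data.frame(
  Title   = c("Men in Black (1997)", "Toy Action 2", "Toy Action 3",
              "Toy Comedy 1", "Toy Comedy 2",
              "Toy Romance 1", "Toy Romance 2"),
  Action  = c(1, 1, 1, 0, 0, 0, 0),
  Comedy  = c(0, 0, 0, 1, 1, 0, 0),
  Romance = c(0, 0, 0, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)

distances <- dist(movies[2:4], method = "euclidean")
# Assumption: recent R versions name the Ward option "ward.D".
clusterMovies <- hclust(distances, method = "ward.D")
clusterGroups <- cutree(clusterMovies, k = 3)

# Find the row of the movie the user liked, then the cluster of that row.
likedTitle   <- "Men in Black (1997)"
likedRow     <- which(movies$Title == likedTitle)
likedCluster <- clusterGroups[likedRow]

# Keep only the movies assigned to the same cluster.
sameCluster <- subset(movies, clusterGroups == likedCluster)

# Up to 10 titles from that cluster, leaving out the movie already seen.
recommendations <- head(setdiff(sameCluster$Title, likedTitle), 10)
print(recommendations)
```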