1
00:00:09,570 --> 00:00:13,150
In this lecture, we'll be using data from MovieLens

2
00:00:13,150 --> 00:00:17,720
to explain clustering and perform content filtering.

3
00:00:17,720 --> 00:00:21,700
movielens.org is a movie recommendation website

4
00:00:21,700 --> 00:00:23,910
run by the GroupLens research lab

5
00:00:23,910 --> 00:00:26,630
at the University of Minnesota.

6
00:00:26,630 --> 00:00:29,740
They collect user preferences about movies

7
00:00:29,740 --> 00:00:32,759
and do collaborative filtering to make recommendations

8
00:00:32,759 --> 00:00:37,200
to users, based on the similarities between users.

9
00:00:37,200 --> 00:00:41,140
We'll use their movie database to do content filtering

10
00:00:41,140 --> 00:00:44,840
using a technique called clustering.

11
00:00:44,840 --> 00:00:48,200
First, let's discuss what data we have.

12
00:00:48,200 --> 00:00:51,410
Movies in the MovieLens data set are categorized

13
00:00:51,410 --> 00:00:53,970
as belonging to different genres.

14
00:00:53,970 --> 00:00:58,990
There are 18 different genres as well as an unknown category.

15
00:00:58,990 --> 00:01:03,250
The genres include crime, musical, mystery,

16
00:01:03,250 --> 00:01:05,069
and children's.

17
00:01:05,069 --> 00:01:08,490
Each movie may belong to many different genres.

18
00:01:08,490 --> 00:01:12,150
So a movie could be classified as drama, adventure,

19
00:01:12,150 --> 00:01:14,160
and sci-fi.

20
00:01:14,160 --> 00:01:18,670
The question we want to answer is, can we systematically

21
00:01:18,670 --> 00:01:23,210
find groups of movies with similar sets of genres?

22
00:01:23,210 --> 00:01:27,680
To answer this question, we'll use a method called clustering.

23
00:01:27,680 --> 00:01:30,770
Clustering is different from the other analytics methods

24
00:01:30,770 --> 00:01:32,650
we've covered so far.

25
00:01:32,650 --> 00:01:36,970
It's called an unsupervised learning method.

26
00:01:36,970 --> 00:01:38,660
This means that we're just trying

27
00:01:38,660 --> 00:01:41,390
to segment the data into similar groups,

28
00:01:41,390 --> 00:01:44,430
instead of trying to predict an outcome.

29
00:01:44,430 --> 00:01:46,740
In this image on the slide, based

30
00:01:46,740 --> 00:01:49,020
on the locations of points, we've

31
00:01:49,020 --> 00:01:53,509
divided them into three clusters-- a blue cluster,

32
00:01:53,509 --> 00:01:57,789
a red cluster, and a yellow cluster.

33
00:01:57,789 --> 00:02:00,880
This is the goal of clustering-- to put each data

34
00:02:00,880 --> 00:02:05,840
point into a group with similar values in the data.

35
00:02:05,840 --> 00:02:09,949
A clustering algorithm does not predict anything.

36
00:02:09,949 --> 00:02:15,560
However, clustering can be used to improve predictive methods.

37
00:02:15,560 --> 00:02:18,860
You can cluster the data into similar groups

38
00:02:18,860 --> 00:02:22,410
and then build a predictive model for each group.

39
00:02:22,410 --> 00:02:26,810
This can often improve the accuracy of predictive methods.

40
00:02:26,810 --> 00:02:30,520
But as a warning, be careful not to over-fit your model

41
00:02:30,520 --> 00:02:32,190
to the training set.

42
00:02:32,190 --> 00:02:35,910
This works best for large data sets.

43
00:02:35,910 --> 00:02:39,100
There are many different algorithms for clustering.

44
00:02:39,100 --> 00:02:41,430
They differ in what makes a cluster

45
00:02:41,430 --> 00:02:43,850
and how the clusters are found.

46
00:02:43,850 --> 00:02:47,690
In this class, we'll cover hierarchical clustering

47
00:02:47,690 --> 00:02:49,760
and K-means clustering.

48
00:02:49,760 --> 00:02:54,020
In this lecture, we'll discuss hierarchical clustering.

49
00:02:54,020 --> 00:02:58,490
And in the next lecture, we'll discuss K-means clustering.

50
00:02:58,490 --> 00:03:01,520
You'll learn how to create clusters using either method

51
00:03:01,520 --> 00:03:05,390
in R.
There are other clustering methods also,

52
00:03:05,390 --> 00:03:08,010
but hierarchical and K-means are two

53
00:03:08,010 --> 00:03:10,800
of the most popular methods.

54
00:03:10,800 --> 00:03:13,460
To cluster data points, we need to compute

55
00:03:13,460 --> 00:03:16,000
how similar the points are.

56
00:03:16,000 --> 00:03:19,400
This is done by computing the distance between points, which

57
00:03:19,400 --> 00:03:22,530
we'll discuss in the next video.
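
[Editor's sketch, not part of the spoken lecture] The idea of measuring how similar two movies are by the distance between them can be illustrated with a small example. This is a minimal sketch under stated assumptions: the lecture works in R and has not yet said which distance it uses, so Python and Euclidean distance are chosen here only for illustration. The genre list is a shortened, hypothetical subset of the MovieLens genres, and the movie names and genre assignments are made up.

```python
import math

# Hypothetical, shortened genre list (the lecture mentions 18 genres plus "unknown").
GENRES = ["adventure", "children's", "crime", "drama", "musical", "mystery", "sci-fi"]


def genre_vector(genres):
    """Encode a movie as a 0/1 vector: 1 if it belongs to the genre, else 0."""
    return [1 if g in genres else 0 for g in GENRES]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Made-up movies, for illustration only.
movies = {
    "Movie A": genre_vector(["drama", "adventure", "sci-fi"]),
    "Movie B": genre_vector(["adventure", "sci-fi"]),
    "Movie C": genre_vector(["crime", "mystery"]),
}

d_ab = euclidean_distance(movies["Movie A"], movies["Movie B"])
d_ac = euclidean_distance(movies["Movie A"], movies["Movie C"])

print("A vs B:", d_ab)
print("A vs C:", d_ac)
```

Movie A and Movie B share two genres and differ in one, so their distance is smaller than the distance between Movie A and Movie C, which share none. A clustering method would therefore be more likely to group A with B than with C.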