1 00:00:04,760 --> 00:00:07,550 As we discussed in the previous video, 2 00:00:07,550 --> 00:00:10,260 Netflix was willing to pay over $1 million 3 00:00:10,260 --> 00:00:13,590 for the best user rating algorithm, which 4 00:00:13,590 --> 00:00:16,450 shows how critical the recommendation system is 5 00:00:16,450 --> 00:00:18,100 to their business. 6 00:00:18,100 --> 00:00:22,610 In this video, we'll discuss how recommendation systems work. 7 00:00:22,610 --> 00:00:25,490 Let's start by thinking about the data. 8 00:00:25,490 --> 00:00:29,960 When predicting user ratings, what data could be useful? 9 00:00:29,960 --> 00:00:33,310 There are two main types of data that we could use. 10 00:00:33,310 --> 00:00:37,260 The first is that for every movie in Netflix's database, 11 00:00:37,260 --> 00:00:41,320 we have ratings from all users who have rated that movie. 12 00:00:41,320 --> 00:00:45,170 The second is that we know facts about the movie itself-- 13 00:00:45,170 --> 00:00:47,740 the actors in the movie, the director, 14 00:00:47,740 --> 00:00:50,160 the genre classifications of the movie, 15 00:00:50,160 --> 00:00:53,370 the year it was released, et cetera. 16 00:00:53,370 --> 00:00:56,540 As an example, suppose we have the following user 17 00:00:56,540 --> 00:01:00,050 ratings for four users and four movies. 18 00:01:00,050 --> 00:01:04,050 Our users are Amy, Bob, Carl, and Dan. 19 00:01:04,050 --> 00:01:08,789 And our movies are Men in Black, Apollo 13, Top Gun, 20 00:01:08,789 --> 00:01:10,530 and Terminator. 21 00:01:10,530 --> 00:01:14,380 The ratings are on a one to five scale, where one is the lowest 22 00:01:14,380 --> 00:01:17,500 rating and five is the highest rating. 23 00:01:17,500 --> 00:01:22,010 The blank entries mean that the user has not rated the movie. 24 00:01:22,010 --> 00:01:25,760 We could suggest to Carl that he watch Men in Black, 25 00:01:25,760 --> 00:01:30,100 since Amy rated it highly. She gave it a rating of five. 
26 00:01:30,100 --> 00:01:33,210 And Amy and Carl seem to have similar ratings 27 00:01:33,210 --> 00:01:35,100 for the other movies. 28 00:01:35,100 --> 00:01:38,050 This technique of using other users' ratings 29 00:01:38,050 --> 00:01:42,220 to make predictions is called collaborative filtering. 30 00:01:42,220 --> 00:01:44,810 Note that we're not using any information 31 00:01:44,810 --> 00:01:48,190 about the movie itself here, just the similarity 32 00:01:48,190 --> 00:01:50,570 between users. 33 00:01:50,570 --> 00:01:53,580 Instead, we could use movie information 34 00:01:53,580 --> 00:01:55,520 to predict user ratings. 35 00:01:55,520 --> 00:01:58,560 We saw on the table that Amy liked Men in Black. 36 00:01:58,560 --> 00:02:00,680 She gave it a rating of five. 37 00:02:00,680 --> 00:02:04,650 We know that this movie was directed by Barry Sonnenfeld, 38 00:02:04,650 --> 00:02:08,780 is classified in the genres of action, adventure, sci-fi, 39 00:02:08,780 --> 00:02:13,350 and comedy, and it stars actor Will Smith. 40 00:02:13,350 --> 00:02:16,810 Based on this information, we could make recommendations 41 00:02:16,810 --> 00:02:18,200 to Amy. 42 00:02:18,200 --> 00:02:20,890 We could recommend to Amy another movie 43 00:02:20,890 --> 00:02:25,160 by the same director, Barry Sonnenfeld's movie, Get Shorty. 44 00:02:25,160 --> 00:02:27,120 We can instead recommend the movie 45 00:02:27,120 --> 00:02:29,880 Jurassic Park, which is also classified 46 00:02:29,880 --> 00:02:34,010 in the genres of action, adventure, and sci-fi. 47 00:02:34,010 --> 00:02:37,220 Or we could recommend to Amy another movie starring 48 00:02:37,220 --> 00:02:39,550 Will Smith-- Hitch. 49 00:02:39,550 --> 00:02:42,280 Note that we're not using the ratings of other users 50 00:02:42,280 --> 00:02:45,810 at all here, just information about the movie. 51 00:02:45,810 --> 00:02:48,030 This technique is called content filtering. 
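The two techniques described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rating values are hypothetical placeholders (the transcript gives only Amy's five for Men in Black, not the rest of the on-screen table), the genres for Hitch are assumed, and the similarity measures (mean absolute rating difference for users, Jaccard overlap for genres) are choices made for this sketch, not necessarily what Netflix uses.

```python
# Hypothetical ratings on a 1-5 scale; a missing key means "not rated".
ratings = {
    "Amy":  {"Men in Black": 5, "Apollo 13": 4, "Top Gun": 5, "Terminator": 4},
    "Bob":  {"Men in Black": 3, "Top Gun": 2, "Terminator": 2},
    "Carl": {"Apollo 13": 4, "Top Gun": 5, "Terminator": 4},
    "Dan":  {"Men in Black": 4, "Apollo 13": 2},
}

# Genres for Men in Black and Jurassic Park follow the transcript;
# the genres for Hitch are an assumption for illustration.
genres = {
    "Men in Black": {"action", "adventure", "sci-fi", "comedy"},
    "Jurassic Park": {"action", "adventure", "sci-fi"},
    "Hitch": {"comedy", "romance"},
}


def user_similarity(a, b):
    """Collaborative filtering: compare two users on the movies both rated.

    Returns 1 / (1 + mean absolute rating difference), or 0.0 if the
    users have no rated movies in common.
    """
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    gap = sum(abs(ratings[a][m] - ratings[b][m]) for m in shared) / len(shared)
    return 1.0 / (1.0 + gap)


def collaborative_recommend(user):
    """Suggest movies the most similar user rated but `user` has not, best first."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: user_similarity(user, u))
    unseen = {m: r for m, r in ratings[nearest].items() if m not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)


def jaccard(a, b):
    """Overlap between two genre sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)


def content_recommend(liked_movie):
    """Content filtering: pick the other movie whose genres overlap most."""
    candidates = [m for m in genres if m != liked_movie]
    return max(candidates, key=lambda m: jaccard(genres[liked_movie], genres[m]))


print(collaborative_recommend("Carl"))    # uses only other users' ratings
print(content_recommend("Men in Black"))  # uses only movie attributes
```

With this placeholder data, Carl's nearest neighbor is Amy, so the collaborative step suggests Men in Black to him, while the content step pairs Men in Black with Jurassic Park because they share three of four genres. Neither function looks at the other's data source, which mirrors the distinction drawn in the video.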
52 00:02:50,890 --> 00:02:52,510 There are strengths and weaknesses 53 00:02:52,510 --> 00:02:55,480 to both types of recommendation systems. 54 00:02:55,480 --> 00:02:57,680 Collaborative filtering can accurately 55 00:02:57,680 --> 00:03:00,930 suggest complex items without understanding 56 00:03:00,930 --> 00:03:02,840 the nature of the items. 57 00:03:02,840 --> 00:03:04,990 It didn't matter at all that our items were 58 00:03:04,990 --> 00:03:07,890 movies in the collaborative filtering example. 59 00:03:07,890 --> 00:03:10,820 We were just comparing user ratings. 60 00:03:10,820 --> 00:03:14,450 However, this requires a lot of data about the user 61 00:03:14,450 --> 00:03:17,280 to make accurate recommendations. 62 00:03:17,280 --> 00:03:19,910 Also, when there are millions of items, 63 00:03:19,910 --> 00:03:21,950 it needs a lot of computing power 64 00:03:21,950 --> 00:03:24,900 to compute the user similarities. 65 00:03:24,900 --> 00:03:27,480 On the other hand, content filtering 66 00:03:27,480 --> 00:03:30,370 requires very little data to get started. 67 00:03:30,370 --> 00:03:33,280 But the major weakness of content filtering 68 00:03:33,280 --> 00:03:35,650 is that it can be limited in scope. 69 00:03:35,650 --> 00:03:37,880 You're recommending similar things 70 00:03:37,880 --> 00:03:40,110 to what the user has already liked. 71 00:03:40,110 --> 00:03:43,050 So the recommendations are often not surprising 72 00:03:43,050 --> 00:03:46,290 or particularly insightful. 73 00:03:46,290 --> 00:03:49,600 Netflix actually uses what's called a hybrid recommendation 74 00:03:49,600 --> 00:03:50,570 system. 75 00:03:50,570 --> 00:03:54,350 They use both collaborative and content filtering. 76 00:03:54,350 --> 00:03:57,400 As an example, consider a collaborative filtering 77 00:03:57,400 --> 00:04:00,540 approach, where we determine that Amy and Carl have 78 00:04:00,540 --> 00:04:02,290 similar preferences. 
79 00:04:02,290 --> 00:04:05,130 We could then do content filtering as well, 80 00:04:05,130 --> 00:04:08,180 where we could find that the movie Terminator, which they 81 00:04:08,180 --> 00:04:11,900 both liked, is classified in almost the same set of genres 82 00:04:11,900 --> 00:04:14,540 as Starship Troopers. 83 00:04:14,540 --> 00:04:17,880 So then we could recommend Starship Troopers 84 00:04:17,880 --> 00:04:21,180 to both Amy and Carl, even though neither of them 85 00:04:21,180 --> 00:04:22,890 has seen it before. 86 00:04:22,890 --> 00:04:25,550 If we were only doing collaborative filtering, 87 00:04:25,550 --> 00:04:28,680 one of them would have had to have seen it before. 88 00:04:28,680 --> 00:04:31,380 And if we were only doing content filtering, 89 00:04:31,380 --> 00:04:34,880 we would only be recommending to one user at a time. 90 00:04:34,880 --> 00:04:37,640 So by combining the two methods, the algorithm 91 00:04:37,640 --> 00:04:40,460 can be much more efficient and accurate. 92 00:04:40,460 --> 00:04:44,330 In the next video, we'll see how we can do content filtering 93 00:04:44,330 --> 00:04:47,230 by using a method called clustering.
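The hybrid idea in the Amy and Carl example can be sketched as a two-step procedure in Python: first group users with similar ratings (the collaborative step), then look for an unseen movie whose genres resemble a movie the whole group liked (the content step). All rating values and genre tags below are hypothetical, as are the function names, the `max_gap` and `liked` thresholds, and the use of Jaccard overlap. This is one plausible way to combine the two methods, not a description of Netflix's actual system.

```python
# Hypothetical ratings (1-5 scale) and genre tags, for illustration only.
ratings = {
    "Amy":  {"Terminator": 5, "Top Gun": 4},
    "Carl": {"Terminator": 5, "Top Gun": 4},
    "Bob":  {"Terminator": 1, "Top Gun": 2},
}
genres = {
    "Terminator": {"action", "sci-fi", "thriller"},
    "Top Gun": {"action", "drama"},
    "Starship Troopers": {"action", "sci-fi", "thriller", "adventure"},
    "Hitch": {"comedy", "romance"},
}


def jaccard(a, b):
    """Overlap between two genre sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)


def similar_users(user, max_gap=1.0):
    """Collaborative step: users whose ratings on shared movies differ
    from `user`'s by at most `max_gap` on average."""
    group = []
    for other in ratings:
        if other == user:
            continue
        shared = set(ratings[user]) & set(ratings[other])
        if not shared:
            continue
        gap = sum(abs(ratings[user][m] - ratings[other][m]) for m in shared) / len(shared)
        if gap <= max_gap:
            group.append(other)
    return group


def hybrid_recommend(user, liked=4):
    """Return (group, movie): the group of like-minded users and one unseen
    movie whose genres best match a movie everyone in the group liked."""
    group = [user] + similar_users(user)
    # Movies that every member of the group rated at least `liked`.
    common = [m for m in ratings[user]
              if all(ratings[u].get(m, 0) >= liked for u in group)]
    # Content step: consider only movies nobody in the group has rated.
    seen = set().union(*(ratings[u].keys() for u in group))
    candidates = [m for m in genres if m not in seen]
    if not common or not candidates:
        return group, None
    best = max(candidates,
               key=lambda c: max(jaccard(genres[c], genres[m]) for m in common))
    return group, best


group, movie = hybrid_recommend("Amy")
print(group, movie)
```

With this placeholder data, Amy and Carl form a group, and Starship Troopers is chosen because its genres overlap most with Terminator, which both rated highly. The single recommendation then goes to the whole group at once, even though no one in the group has rated the movie, which is the advantage the video attributes to combining the two methods.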