1
00:00:04,760 --> 00:00:07,550
As we discussed in
the previous video,

2
00:00:07,550 --> 00:00:10,260
Netflix was willing
to pay over $1 million

3
00:00:10,260 --> 00:00:13,590
for the best user
rating algorithm, which

4
00:00:13,590 --> 00:00:16,450
shows how critical the
recommendation system is

5
00:00:16,450 --> 00:00:18,100
to their business.

6
00:00:18,100 --> 00:00:22,610
In this video, we'll discuss
how recommendation systems work.

7
00:00:22,610 --> 00:00:25,490
Let's start by thinking
about the data.

8
00:00:25,490 --> 00:00:29,960
When predicting user ratings,
what data could be useful?

9
00:00:29,960 --> 00:00:33,310
There are two main types
of data that we could use.

10
00:00:33,310 --> 00:00:37,260
The first is that for every
movie in Netflix's database,

11
00:00:37,260 --> 00:00:41,320
we have a ranking from all users
who have ranked that movie.

12
00:00:41,320 --> 00:00:45,170
The second is that we know
facts about the movie itself--

13
00:00:45,170 --> 00:00:47,740
the actors in the
movie, the director,

14
00:00:47,740 --> 00:00:50,160
the genre classifications
of the movie,

15
00:00:50,160 --> 00:00:53,370
the year it was
released, et cetera.

16
00:00:53,370 --> 00:00:56,540
As an example, suppose we
have the following user

17
00:00:56,540 --> 00:01:00,050
ratings for four
users and four movies.

18
00:01:00,050 --> 00:01:04,050
Our users are Amy,
Bob, Carl, and Dan.

19
00:01:04,050 --> 00:01:08,789
And our movies are Men in
Black, Apollo 13, Top Gun,

20
00:01:08,789 --> 00:01:10,530
and Terminator.

21
00:01:10,530 --> 00:01:14,380
The ratings are on a one to five
scale, where one is the lowest

22
00:01:14,380 --> 00:01:17,500
rating and five is
the highest rating.

23
00:01:17,500 --> 00:01:22,010
The blank entries mean that the
user has not rated the movie.

24
00:01:22,010 --> 00:01:25,760
We could suggest to Carl
that he watch Men in Black.

25
00:01:25,760 --> 00:01:30,100
Since Amy rated it highly,
she gave it a rating of five.

26
00:01:30,100 --> 00:01:33,210
And Amy and Carl seem
to have similar ratings

27
00:01:33,210 --> 00:01:35,100
for the other movies.

28
00:01:35,100 --> 00:01:38,050
This technique of using
other user's ratings

29
00:01:38,050 --> 00:01:42,220
to make predictions is called
collaborative filtering.

30
00:01:42,220 --> 00:01:44,810
Note that we're not
using any information

31
00:01:44,810 --> 00:01:48,190
about the movie itself
here, just the similarity

32
00:01:48,190 --> 00:01:50,570
between users.

33
00:01:50,570 --> 00:01:53,580
Instead, we could
use movie information

34
00:01:53,580 --> 00:01:55,520
to predict user ratings.

35
00:01:55,520 --> 00:01:58,560
We saw on the table that
Amy liked Men in Black.

36
00:01:58,560 --> 00:02:00,680
She gave it a rating of five.

37
00:02:00,680 --> 00:02:04,650
We know that this movie was
directed by Barry Sonnenfeld,

38
00:02:04,650 --> 00:02:08,780
is classified in the genres
of action, adventure, sci-fi,

39
00:02:08,780 --> 00:02:13,350
and comedy, and it
stars actor Will Smith.

40
00:02:13,350 --> 00:02:16,810
Based on this information,
we could make recommendations

41
00:02:16,810 --> 00:02:18,200
to Amy.

42
00:02:18,200 --> 00:02:20,890
We could recommend
to Amy another movie

43
00:02:20,890 --> 00:02:25,160
by the same director, Berry
Sonnenfeld's movie, Get Shorty.

44
00:02:25,160 --> 00:02:27,120
We can instead
recommend the movie

45
00:02:27,120 --> 00:02:29,880
Jurassic Park, which
is also classified

46
00:02:29,880 --> 00:02:34,010
in the genres of action,
adventure, and sci-fi.

47
00:02:34,010 --> 00:02:37,220
Or we could recommend to
Amy another movie starring

48
00:02:37,220 --> 00:02:39,550
Will Smith-- Hitch.

49
00:02:39,550 --> 00:02:42,280
Note that we're not using
the ratings of other users

50
00:02:42,280 --> 00:02:45,810
at all here, just
information about the movie.

51
00:02:45,810 --> 00:02:48,030
This technique is called
content filtering.

52
00:02:50,890 --> 00:02:52,510
There are strengths
and weaknesses

53
00:02:52,510 --> 00:02:55,480
to both types of
recommendation systems.

54
00:02:55,480 --> 00:02:57,680
Collaborative filtering
can accurately

55
00:02:57,680 --> 00:03:00,930
suggest complex items
without understanding

56
00:03:00,930 --> 00:03:02,840
the nature of the items.

57
00:03:02,840 --> 00:03:04,990
It didn't matter at
all that our items were

58
00:03:04,990 --> 00:03:07,890
movies in the collaborative
filtering example.

59
00:03:07,890 --> 00:03:10,820
We were just comparing
user ratings.

60
00:03:10,820 --> 00:03:14,450
However, this requires a
lot of data about the user

61
00:03:14,450 --> 00:03:17,280
to make accurate
recommendations.

62
00:03:17,280 --> 00:03:19,910
Also, when there are
millions of items,

63
00:03:19,910 --> 00:03:21,950
it needs a lot of
computing power

64
00:03:21,950 --> 00:03:24,900
to compute the
user similarities.

65
00:03:24,900 --> 00:03:27,480
On the other hand,
content filtering

66
00:03:27,480 --> 00:03:30,370
requires very little
data to get started.

67
00:03:30,370 --> 00:03:33,280
But the major weakness
of content filtering

68
00:03:33,280 --> 00:03:35,650
is that it can be
limited in scope.

69
00:03:35,650 --> 00:03:37,880
You're recommending
similar things

70
00:03:37,880 --> 00:03:40,110
to what the user
has already liked.

71
00:03:40,110 --> 00:03:43,050
So the recommendations
are often not surprising

72
00:03:43,050 --> 00:03:46,290
or particularly insightful.

73
00:03:46,290 --> 00:03:49,600
Netflix actually uses what's
called a hybrid recommendation

74
00:03:49,600 --> 00:03:50,570
system.

75
00:03:50,570 --> 00:03:54,350
They use both collaborative
and content filtering.

76
00:03:54,350 --> 00:03:57,400
As an example, consider
a collaborative filtering

77
00:03:57,400 --> 00:04:00,540
approach, where we determine
that Amy and Carl have

78
00:04:00,540 --> 00:04:02,290
similar preferences.

79
00:04:02,290 --> 00:04:05,130
We could then do content
filtering as well,

80
00:04:05,130 --> 00:04:08,180
where we could find that the
movie Terminator, which they

81
00:04:08,180 --> 00:04:11,900
both liked, is classified in
almost the same set of genres

82
00:04:11,900 --> 00:04:14,540
as Starship Troopers.

83
00:04:14,540 --> 00:04:17,880
So then we could recommend
Starship Troopers

84
00:04:17,880 --> 00:04:21,180
to both Amy and Carl, even
though neither of them

85
00:04:21,180 --> 00:04:22,890
have seen it before.

86
00:04:22,890 --> 00:04:25,550
If we were only doing
collaborative filtering,

87
00:04:25,550 --> 00:04:28,680
one of them would have had
to have seen it before.

88
00:04:28,680 --> 00:04:31,380
And if we were only
doing content filtering,

89
00:04:31,380 --> 00:04:34,880
we would only be recommending
to one user at a time.

90
00:04:34,880 --> 00:04:37,640
So by combining the two
methods, the algorithm

91
00:04:37,640 --> 00:04:40,460
can be much more
efficient and accurate.

92
00:04:40,460 --> 00:04:44,330
In the next video, we'll see
how we can do content filtering

93
00:04:44,330 --> 00:04:47,230
by using a method
called clustering.