1
00:00:09,500 --> 00:00:11,350
In the optimization
problem, we assumed

2
00:00:11,350 --> 00:00:13,480
the compatibility
scores were data

3
00:00:13,480 --> 00:00:16,890
that we could input directly
into the optimization model.

4
00:00:16,890 --> 00:00:19,150
But where do these
scores come from?

5
00:00:19,150 --> 00:00:23,240
In the words of eHarmony's founder,
Neil Clark Warren, opposites

6
00:00:23,240 --> 00:00:25,320
attract, then they attack.

7
00:00:25,320 --> 00:00:27,180
eHarmony's compatibility
match score

8
00:00:27,180 --> 00:00:30,070
is based on similarity
between users' answers

9
00:00:30,070 --> 00:00:32,340
to the questionnaire.

10
00:00:32,340 --> 00:00:35,370
Let us attempt to
demonstrate an approach

11
00:00:35,370 --> 00:00:37,750
to develop compatibility scores.

12
00:00:37,750 --> 00:00:41,320
We use public data from
eHarmony containing features

13
00:00:41,320 --> 00:00:46,060
for 275,000 users and
binary compatibility labels.

14
00:00:46,060 --> 00:00:48,070
Feature names and
exact values are

15
00:00:48,070 --> 00:00:50,920
masked to protect
users' privacy.

16
00:00:50,920 --> 00:00:54,260
Consequently, we won't be
able to directly interpret

17
00:00:54,260 --> 00:00:56,400
which features are
important as we do not

18
00:00:56,400 --> 00:00:59,250
know the identity
of these features.

19
00:00:59,250 --> 00:01:04,030
We used logistic regression on feature
differences between pairs of users

20
00:01:04,030 --> 00:01:05,080
to predict compatibility.

21
00:01:07,840 --> 00:01:09,520
To reduce the size
of the problem,

22
00:01:09,520 --> 00:01:14,260
we filtered the data to include
only users in the Boston area

23
00:01:14,260 --> 00:01:17,710
who have compatibility scores
listed in the data set.

24
00:01:17,710 --> 00:01:24,020
We computed the absolute difference
in features for these 1,475

25
00:01:24,020 --> 00:01:26,930
pairs and trained a
logistic regression model

26
00:01:26,930 --> 00:01:29,450
on these differences.

27
00:01:29,450 --> 00:01:33,200
Let us observe the results
of this experiment.

28
00:01:33,200 --> 00:01:36,340
If we use a low threshold in
the logistic regression model,

29
00:01:36,340 --> 00:01:39,930
we predict more false
positives but also get

30
00:01:39,930 --> 00:01:41,140
more true positives.

31
00:01:41,140 --> 00:01:43,100
For example, the
classification matrix

32
00:01:43,100 --> 00:01:46,759
for a threshold equal
to 0.2 is as follows.

33
00:01:50,820 --> 00:01:56,580
Note that we correctly identified
1,030 pairs that are not compatible

34
00:01:56,580 --> 00:02:00,090
and 92 pairs that
are compatible.

35
00:02:00,090 --> 00:02:07,900
Note that 92 out of 319--
which is 227 plus 92 --of the pairs

36
00:02:07,900 --> 00:02:10,780
we predicted to be compatible
were correctly identified.

37
00:02:10,780 --> 00:02:14,480
That is, 29% of
the matches we recommend

38
00:02:14,480 --> 00:02:19,440
would be successful, a very high
success rate for online dating.

39
00:02:27,310 --> 00:02:29,160
Clearly, there is a
potential for using

40
00:02:29,160 --> 00:02:31,170
many other analytic methods.

41
00:02:31,170 --> 00:02:34,079
Specifically, trees
are especially

42
00:02:34,079 --> 00:02:35,890
useful for predicting
compatibility

43
00:02:35,890 --> 00:02:37,730
if there are nonlinear
relationships

44
00:02:37,730 --> 00:02:40,090
between variables.

45
00:02:40,090 --> 00:02:42,160
Clustering is another
potential approach

46
00:02:42,160 --> 00:02:44,390
with the idea of
segmenting the users.

47
00:02:44,390 --> 00:02:47,660
Finally, text analytics
is yet another approach

48
00:02:47,660 --> 00:02:52,100
with the idea of analyzing
the text of users' profiles.

49
00:02:52,100 --> 00:02:56,150
Of course, many other
techniques are possible.

50
00:02:56,150 --> 00:02:58,630
To give some intuition
about various features,

51
00:02:58,630 --> 00:03:01,220
let us see how the
probability of a match

52
00:03:01,220 --> 00:03:05,000
changes with the distance
between the two adults.

53
00:03:05,000 --> 00:03:09,430
It is interesting to note
that the probability drops

54
00:03:09,430 --> 00:03:11,770
with distance, and then
for a very long distance,

55
00:03:11,770 --> 00:03:15,900
the probability increases again.

56
00:03:15,900 --> 00:03:18,190
Also interesting
is this graph that

57
00:03:18,190 --> 00:03:21,890
shows that if the attractiveness
is too high or too low,

58
00:03:21,890 --> 00:03:24,160
the probability of a
successful match decreases.

59
00:03:26,990 --> 00:03:31,530
Finally, if the difference in
height is too large or too small,

60
00:03:31,530 --> 00:03:35,000
the probability of
the match also drops.

61
00:03:35,000 --> 00:03:38,480
It seems the sweet spot
is a difference in height

62
00:03:38,480 --> 00:03:41,370
between four and eight inches.