1 00:00:09,500 --> 00:00:11,350 In the optimization problem, we assumed 2 00:00:11,350 --> 00:00:13,480 the compatibility scores were data 3 00:00:13,480 --> 00:00:16,890 that we could input directly into the optimization model. 4 00:00:16,890 --> 00:00:19,150 But where do these scores come from? 5 00:00:19,150 --> 00:00:23,240 In the words of the founder-- Neil Clark Warren --opposites 6 00:00:23,240 --> 00:00:25,320 attract, then they attack. 7 00:00:25,320 --> 00:00:27,180 eHarmony's compatibility match score 8 00:00:27,180 --> 00:00:30,070 is based on similarity between users' answers 9 00:00:30,070 --> 00:00:32,340 to the questionnaire. 10 00:00:32,340 --> 00:00:35,370 Let us attempt to demonstrate an approach 11 00:00:35,370 --> 00:00:37,750 to develop compatibility scores. 12 00:00:37,750 --> 00:00:41,320 We utilize public data from eHarmony containing features 13 00:00:41,320 --> 00:00:46,060 for 275,000 users and binary compatibility. 14 00:00:46,060 --> 00:00:48,070 Feature names and exact values are 15 00:00:48,070 --> 00:00:50,920 masked to protect users' privacy. 16 00:00:50,920 --> 00:00:54,260 Correspondingly we won't be able to directly interpret 17 00:00:54,260 --> 00:00:56,400 which features are important as we do not 18 00:00:56,400 --> 00:00:59,250 know the identity of these features. 19 00:00:59,250 --> 00:01:04,030 We used logistic regression on pairs of users' differences 20 00:01:04,030 --> 00:01:05,080 to predict compatibility. 21 00:01:07,840 --> 00:01:09,520 To reduce the size of the problem, 22 00:01:09,520 --> 00:01:14,260 we filtered the data to include only users in the Boston area 23 00:01:14,260 --> 00:01:17,710 who have compatibility scores listed in the data set. 24 00:01:17,710 --> 00:01:24,020 We computed absolute difference in features for these 1,475 25 00:01:24,020 --> 00:01:26,930 pairs and trained a logistic regression model 26 00:01:26,930 --> 00:01:29,450 on these differences. 27 00:01:29,450 --> 00:01:33,200 Let us observe the results of this experiment. 28 00:01:33,200 --> 00:01:36,340 If we use a low threshold in the logistic regression model, 29 00:01:36,340 --> 00:01:39,930 we predict more false positives but also get 30 00:01:39,930 --> 00:01:41,140 more true positives. 31 00:01:41,140 --> 00:01:43,100 For example, the classification matrix 32 00:01:43,100 --> 00:01:46,759 for threshold equal to 0.2 is as follows. 33 00:01:50,820 --> 00:01:56,580 Note that we found 1,030 pairs that are not compatible 34 00:01:56,580 --> 00:02:00,090 and 92 pairs that are compatible correctly. 35 00:02:00,090 --> 00:02:07,900 Note that 92 out of 319-- which is 227 plus 92 --of these 36 00:02:07,900 --> 00:02:10,780 were correctly identified. 37 00:02:10,780 --> 00:02:14,480 That is, 29% percent of the matches we recommend 38 00:02:14,480 --> 00:02:19,440 would be successful, a very high success rate for online dating. 39 00:02:27,310 --> 00:02:29,160 Clearly, there is a potential for using 40 00:02:29,160 --> 00:02:31,170 many other analytic methods. 41 00:02:31,170 --> 00:02:34,079 Specifically trees, which are especially 42 00:02:34,079 --> 00:02:35,890 useful for predicting compatibility 43 00:02:35,890 --> 00:02:37,730 if there are nonlinear relationships 44 00:02:37,730 --> 00:02:40,090 between variables. 45 00:02:40,090 --> 00:02:42,160 Clustering is another potential approach 46 00:02:42,160 --> 00:02:44,390 with the idea of segmenting the users. 47 00:02:44,390 --> 00:02:47,660 Finally, text analytics is yet another approach 48 00:02:47,660 --> 00:02:52,100 with the idea of analyzing the text of users' profiles. 49 00:02:52,100 --> 00:02:56,150 Of course, many other techniques are possible. 50 00:02:56,150 --> 00:02:58,630 To give some intuition of various features, 51 00:02:58,630 --> 00:03:01,220 let us see how the probability of a match 52 00:03:01,220 --> 00:03:05,000 changes with the distance between the two adults. 53 00:03:05,000 --> 00:03:09,430 It is interesting to note that the probability drops 54 00:03:09,430 --> 00:03:11,770 with distance, and then for a very long distance, 55 00:03:11,770 --> 00:03:15,900 the probability increases again. 56 00:03:15,900 --> 00:03:18,190 Also interesting is this graph that 57 00:03:18,190 --> 00:03:21,890 shows that if the attractiveness is too high or too low, 58 00:03:21,890 --> 00:03:24,160 the probability of a successful match decreases. 59 00:03:26,990 --> 00:03:31,530 Finally, if the difference in height is too high or too low, 60 00:03:31,530 --> 00:03:35,000 the probability of the match also drops. 61 00:03:35,000 --> 00:03:38,480 It seems the sweet spot is a difference in height 62 00:03:38,480 --> 00:03:41,370 between four and eight inches.