1 00:00:01,790 --> 00:00:04,290 PROFESSOR: Now you may remember a discussion of the birthday 2 00:00:04,290 --> 00:00:08,680 paradox, which says that if you have 3 00:00:08,680 --> 00:00:12,270 a group of 27 random people. 4 00:00:12,270 --> 00:00:16,160 The probability is almost 2/3 that two of them 5 00:00:16,160 --> 00:00:18,660 are going to have a matching birthday, even though there 6 00:00:18,660 --> 00:00:22,150 are 365 birthdays in the year. 7 00:00:22,150 --> 00:00:25,160 You might sloppily think that with 27 people 8 00:00:25,160 --> 00:00:29,930 there'd only be a 27 out of 365, or some chance like that. 9 00:00:29,930 --> 00:00:31,830 It's actually 2/3. 10 00:00:31,830 --> 00:00:34,800 And by the time you get to a class of 110-- 11 00:00:34,800 --> 00:00:37,120 which is what we have data for and we're 12 00:00:37,120 --> 00:00:38,840 going to be looking at-- it turns out 13 00:00:38,840 --> 00:00:42,820 that the odds are almost 3/4 of a million to one 14 00:00:42,820 --> 00:00:46,490 that you'll have a couple of people with matching birthdays. 15 00:00:46,490 --> 00:00:49,360 So let's look at the matching birthday problem a little bit 16 00:00:49,360 --> 00:00:49,884 more today. 17 00:00:49,884 --> 00:00:51,300 And the reason we're looking at it 18 00:00:51,300 --> 00:00:54,350 is because it's a lovely example where there really 19 00:00:54,350 --> 00:00:57,920 is pairwise independence, and not mutual independence. 20 00:00:57,920 --> 00:01:01,740 So it's reinforcing the key idea behind the additivity 21 00:01:01,740 --> 00:01:04,844 of variance, and the pairwise independent sampling theorem. 22 00:01:04,844 --> 00:01:07,260 We're not going to use the sampling theorem here, but just 23 00:01:07,260 --> 00:01:09,587 pairwise independence, but it's worth looking at. 24 00:01:09,587 --> 00:01:11,170 Now before I go further let me mention 25 00:01:11,170 --> 00:01:14,990 that the birthday problem is just what we're doing for fun. 26 00:01:14,990 --> 00:01:17,390 But in fact, it has some real applications 27 00:01:17,390 --> 00:01:21,820 in more than one area, but the most famous one 28 00:01:21,820 --> 00:01:25,060 is the so-called birthday attack on a cryptosystem, which 29 00:01:25,060 --> 00:01:29,730 involves being able to search for matching pairs of keys 30 00:01:29,730 --> 00:01:32,250 with a relatively small sample. 31 00:01:32,250 --> 00:01:36,080 And you're very likely to find at least two that match. 32 00:01:36,080 --> 00:01:40,470 So with that motivation claimed, but not examined, 33 00:01:40,470 --> 00:01:43,020 let's just go back to thinking about birthdays. 34 00:01:43,020 --> 00:01:47,940 OK so let's suppose that I have some group of n people, 35 00:01:47,940 --> 00:01:49,360 and there are d-days in the year, 36 00:01:49,360 --> 00:01:52,289 just to keep the parameters abstract 37 00:01:52,289 --> 00:01:53,830 and not get too stuck on the numbers. 38 00:01:53,830 --> 00:01:55,840 Keeping the parameters makes it actually 39 00:01:55,840 --> 00:01:57,360 clearer to reason about. 40 00:01:57,360 --> 00:01:59,180 So we're implicitly assuming here 41 00:01:59,180 --> 00:02:03,270 that each person is kind of a random variable, 42 00:02:03,270 --> 00:02:07,080 or a random choice of a birthday. 43 00:02:07,080 --> 00:02:09,600 So each of these people are really 44 00:02:09,600 --> 00:02:14,065 random variables that return the value of a birthday. 45 00:02:14,065 --> 00:02:15,440 And it is a matter of fact, we're 46 00:02:15,440 --> 00:02:18,060 going assume that all the birthdays are equally likely. 47 00:02:18,060 --> 00:02:19,400 Real birthdays aren't. 48 00:02:19,400 --> 00:02:22,790 They tend to be of January tends to be a popular month, 49 00:02:22,790 --> 00:02:26,850 November tends to be a more popular month than other times. 50 00:02:26,850 --> 00:02:31,540 But let's ignore that because if the applications in crypto 51 00:02:31,540 --> 00:02:33,030 things really are uniform. 52 00:02:33,030 --> 00:02:36,340 And it makes our analysis plausible, still plausible 53 00:02:36,340 --> 00:02:38,560 but easy if we assume that birthdays 54 00:02:38,560 --> 00:02:40,050 are equally likely, OK. 55 00:02:40,050 --> 00:02:43,870 P is the number of pairs of birthdays that match 56 00:02:43,870 --> 00:02:46,690 in this population of n people. 57 00:02:46,690 --> 00:02:49,790 OK, let's get a grip on p by thinking 58 00:02:49,790 --> 00:02:51,840 of it as a sum of indicator variables. 59 00:02:51,840 --> 00:02:54,850 So let's let M sub ij be the indicator variable 60 00:02:54,850 --> 00:02:59,720 that ith and jith people among the n have a matching birthday. 61 00:02:59,720 --> 00:03:03,390 Well the number of matching birthdays 62 00:03:03,390 --> 00:03:08,440 is then simply the sum over all the possible pairs of people of 63 00:03:08,440 --> 00:03:10,550 whether or not they have a matching birthday. 64 00:03:10,550 --> 00:03:13,980 It's the sum of these indicator variables M sub ij. 65 00:03:13,980 --> 00:03:16,010 And the number of these indicator variables 66 00:03:16,010 --> 00:03:20,460 is of course all the ways of choosing two out of n people. 67 00:03:20,460 --> 00:03:25,420 So in short, if I look at the expectation M sub ij, 68 00:03:25,420 --> 00:03:26,990 let's think about that for a minute. 69 00:03:26,990 --> 00:03:30,060 We're assuming that all the birthdays are equally likely. 70 00:03:30,060 --> 00:03:32,620 And so I'm asking whether the ith and the jith people 71 00:03:32,620 --> 00:03:34,080 have the same birthday. 72 00:03:34,080 --> 00:03:36,190 Well whatever the ith's person birthday turns out 73 00:03:36,190 --> 00:03:39,370 to be, let's say it's November 5, 74 00:03:39,370 --> 00:03:42,870 the jth person, who has a uniform probability 75 00:03:42,870 --> 00:03:46,820 of equalling any birthday, still has a uniform probability 1 76 00:03:46,820 --> 00:03:49,840 chance in d of equalling November 5, which 77 00:03:49,840 --> 00:03:52,030 happens to be my birthday. 78 00:03:52,030 --> 00:03:57,010 OK so in short the probability that any two people 79 00:03:57,010 --> 00:04:00,420 have a matching birthday is one chance in d. 80 00:04:00,420 --> 00:04:02,680 And that means that the expectation 81 00:04:02,680 --> 00:04:05,570 of the indicator variable for that event, M sub ij, 82 00:04:05,570 --> 00:04:07,130 is 1 over d. 83 00:04:07,130 --> 00:04:10,010 And that tells us, by linearity of expectation, 84 00:04:10,010 --> 00:04:11,920 that the expected number of pairs 85 00:04:11,920 --> 00:04:15,390 is simply the number of those pairs 86 00:04:15,390 --> 00:04:17,690 times the expected number per pair, 87 00:04:17,690 --> 00:04:20,279 and choose 2 times 1 over d. 88 00:04:20,279 --> 00:04:23,130 Well as I said we have data for 110 students. 89 00:04:23,130 --> 00:04:27,250 So the expected number of pairs in a collection in a student 90 00:04:27,250 --> 00:04:32,730 body of 110 is 110 choose, 2 times 1 over 365, 91 00:04:32,730 --> 00:04:39,020 or about 16.4 pairs is the expected number of pairs 92 00:04:39,020 --> 00:04:41,640 of matching birthdays. 93 00:04:41,640 --> 00:04:44,020 OK, now that's an expected value. 94 00:04:44,020 --> 00:04:48,300 How likely is it to be if I take a selection of 110 students, 95 00:04:48,300 --> 00:04:49,960 and I count how many pairs of birthdays 96 00:04:49,960 --> 00:04:54,310 are there, do I really expect to get close to 16.4 or not? 97 00:04:54,310 --> 00:04:58,140 Well what we're asking for is the probability that p 98 00:04:58,140 --> 00:05:03,480 is near its mean, that the distance between P and 16.4 99 00:05:03,480 --> 00:05:04,790 is greater than k. 100 00:05:04,790 --> 00:05:09,200 I hope that as k gets bigger this probability is small. 101 00:05:09,200 --> 00:05:14,410 And so I'm really quite likely to have close to 16.4 birthdays 102 00:05:14,410 --> 00:05:17,140 in my sample of 110. 103 00:05:17,140 --> 00:05:21,060 But this probability is one that's a mess to calculate. 104 00:05:21,060 --> 00:05:25,190 But we can get a grip on it because the variance of P 105 00:05:25,190 --> 00:05:26,540 is easy to calculate. 106 00:05:26,540 --> 00:05:30,430 And that will allow us to apply the Chebyshev bound, 107 00:05:30,430 --> 00:05:33,380 and get some kind of an estimate on the likelihood 108 00:05:33,380 --> 00:05:37,590 that P is near its expectation. 109 00:05:37,590 --> 00:05:40,910 So the key observation that we need 110 00:05:40,910 --> 00:05:45,910 is that the indicator variables are pairwise independent. 111 00:05:45,910 --> 00:05:47,470 So let's think about the indicator 112 00:05:47,470 --> 00:05:50,820 variable for the event that the ith and the jth people 113 00:05:50,820 --> 00:05:53,520 have the same birthday, let's call them Albert and Drew. 114 00:05:53,520 --> 00:05:56,540 So Albert's the ith person, Drew is the jth person. 115 00:05:56,540 --> 00:05:59,240 And I'm interested in the event that Albert and Drew 116 00:05:59,240 --> 00:06:01,020 have the same birthday. 117 00:06:01,020 --> 00:06:04,292 And let's compare that to another pair of people, 118 00:06:04,292 --> 00:06:06,250 and whether or not they have the same birthday. 119 00:06:06,250 --> 00:06:08,400 So let's first of all think about Dave and Mike, 120 00:06:08,400 --> 00:06:10,600 whether Dave and Mike have the same birthday. 121 00:06:10,600 --> 00:06:14,260 And I want to know if these two events are independent. 122 00:06:14,260 --> 00:06:17,880 Well remember we are assuming that Albert's birthday is 123 00:06:17,880 --> 00:06:20,380 independent of Drew's birthday is independent of David's, is 124 00:06:20,380 --> 00:06:21,590 independent of Mike. 125 00:06:21,590 --> 00:06:25,130 Each of the people is supposedly chosen independently, 126 00:06:25,130 --> 00:06:28,220 and their birthdays are independent. 127 00:06:28,220 --> 00:06:31,640 So it's obvious that these two pairs that don't overlap 128 00:06:31,640 --> 00:06:33,290 have nothing to do with each other, 129 00:06:33,290 --> 00:06:34,873 and we don't have to worry about that. 130 00:06:34,873 --> 00:06:37,390 You could prove that formally, but it is obvious 131 00:06:37,390 --> 00:06:42,400 because each of the individual birthdays are independent. 132 00:06:42,400 --> 00:06:44,360 Now what's more interesting is the case 133 00:06:44,360 --> 00:06:48,610 when I asked whether or not Albert and Drew having 134 00:06:48,610 --> 00:06:51,290 the same birthday is independent of Albert and Mike 135 00:06:51,290 --> 00:06:53,400 having the same birthday. 136 00:06:53,400 --> 00:06:55,670 And that one is not so obvious. 137 00:06:55,670 --> 00:06:59,610 Here's a way to think about what could go wrong. 138 00:06:59,610 --> 00:07:03,130 Suppose that in fact the birthdays weren't uniform, 139 00:07:03,130 --> 00:07:05,940 suppose that some birthdays were more common than others. 140 00:07:05,940 --> 00:07:09,980 OK that makes it more likely that if Albert and Drew have 141 00:07:09,980 --> 00:07:12,740 the same birthday it slants things, 142 00:07:12,740 --> 00:07:16,890 so that they're more likely to have this very common birthday 143 00:07:16,890 --> 00:07:19,990 than they would have been otherwise. 144 00:07:19,990 --> 00:07:24,470 And now once I know that they match, and therefore are 145 00:07:24,470 --> 00:07:27,260 more likely to have the common birthday than they would have 146 00:07:27,260 --> 00:07:29,970 without any information, I know that Albert 147 00:07:29,970 --> 00:07:34,140 is more likely to have this common birthday than otherwise. 148 00:07:34,140 --> 00:07:39,010 And that means that Mike is even more likely to match Albert, 149 00:07:39,010 --> 00:07:41,740 because Albert's got the common birthday than Mike 150 00:07:41,740 --> 00:07:46,880 was to match Albert without any further information about what 151 00:07:46,880 --> 00:07:49,105 Albert's likely birthday was. 152 00:07:49,105 --> 00:07:51,730 You can think about that, and it can be worked out numerically, 153 00:07:51,730 --> 00:07:52,313 easily enough. 154 00:07:52,313 --> 00:07:55,830 So uniform is going to be a crucial factor here in order 155 00:07:55,830 --> 00:07:57,820 to conclude that Albert and Drew, 156 00:07:57,820 --> 00:08:04,909 and Albert and Mike are mutually independent events. 157 00:08:04,909 --> 00:08:06,450 But let's go back and think about it. 158 00:08:06,450 --> 00:08:08,722 All we really need is that Mike is uniform 159 00:08:08,722 --> 00:08:11,180 in order to conclude that these two events are independent. 160 00:08:11,180 --> 00:08:14,020 Because we know that Mike and Andrew and Albert 161 00:08:14,020 --> 00:08:16,000 separately are independent of each other. 162 00:08:16,000 --> 00:08:18,230 Their birthdays are chosen independently. 163 00:08:18,230 --> 00:08:21,870 So that intuitively means that the probability that Mike 164 00:08:21,870 --> 00:08:23,950 has any given birthday doesn't really 165 00:08:23,950 --> 00:08:26,170 matter what's going on with Albert and Drew, 166 00:08:26,170 --> 00:08:28,340 because Mike is independent of Albert and Drew. 167 00:08:28,340 --> 00:08:31,440 And if we know that Mike's probability of having 168 00:08:31,440 --> 00:08:36,419 a birthday is uniform, then whatever the birthday 169 00:08:36,419 --> 00:08:39,000 that Albert has, whether he matches Drew or not, 170 00:08:39,000 --> 00:08:43,059 Mike has a 1 chance in d of hitting 171 00:08:43,059 --> 00:08:46,000 the same birthday of whatever Albert wound up having. 172 00:08:46,000 --> 00:08:50,300 And that means that the probability that Mike matches 173 00:08:50,300 --> 00:08:54,050 Albert is the same one over d than it would have been if we 174 00:08:54,050 --> 00:08:56,230 had no further information. 175 00:08:56,230 --> 00:08:58,670 This is an argument that, in fact, is made rigorous 176 00:08:58,670 --> 00:09:01,490 in some class problems and a problem set, 177 00:09:01,490 --> 00:09:05,630 but let's just take it as plausible enough 178 00:09:05,630 --> 00:09:09,340 based on this hand-waving argument that I articulated, 179 00:09:09,340 --> 00:09:13,840 that these two events are independent pairwise and so 180 00:09:13,840 --> 00:09:18,150 the corresponding indicator variables and M Albert Drew, 181 00:09:18,150 --> 00:09:24,684 and M Albert Mike are independent of each other. 182 00:09:24,684 --> 00:09:25,850 So that's what we've argued. 183 00:09:25,850 --> 00:09:28,770 But notice that these events of pairwise matching 184 00:09:28,770 --> 00:09:30,470 are certainly not three-way independent, 185 00:09:30,470 --> 00:09:32,880 because after all if I know that Albert and Drew have 186 00:09:32,880 --> 00:09:34,910 the same birthday, and that Albert and Mike have 187 00:09:34,910 --> 00:09:36,850 the same birthday, I absolutely know 188 00:09:36,850 --> 00:09:39,420 with certainty that Drew and Mike have the same birthday. 189 00:09:39,420 --> 00:09:42,500 So this is a very nice, basic example 190 00:09:42,500 --> 00:09:45,340 where you have pairwise independence, but not 191 00:09:45,340 --> 00:09:47,760 three-way independence, assuming that 192 00:09:47,760 --> 00:09:52,950 all of these random variables Albert, Drew, and Mike are 193 00:09:52,950 --> 00:09:56,450 uniform in what birthday they have. 194 00:09:56,450 --> 00:09:59,560 OK so let's go back to counting birthdays. 195 00:09:59,560 --> 00:10:02,770 The variance of an indicator is pq. 196 00:10:02,770 --> 00:10:09,020 So in this case p is 1 over 365, and q is 1 minus 1 over 365. 197 00:10:09,020 --> 00:10:12,620 And because of pairwise independence, 198 00:10:12,620 --> 00:10:16,840 the variance of p, which is the sum of the M sub 199 00:10:16,840 --> 00:10:21,040 ijs, the variance of the number of birthday pairs, 200 00:10:21,040 --> 00:10:23,060 is the sum of those variances. 201 00:10:23,060 --> 00:10:27,900 It's 110 choose to times the variance of the M sub ij 202 00:10:27,900 --> 00:10:31,790 turns out to be about 16.37, which 203 00:10:31,790 --> 00:10:36,630 means that the standard deviation sigma is less than 4. 204 00:10:36,630 --> 00:10:40,170 Now I can apply Chebyshev, because by the Chebyshev band 205 00:10:40,170 --> 00:10:46,730 the probability that 16.4 is within a 2 206 00:10:46,730 --> 00:10:49,570 sigma, is further away than 2 sigma, 207 00:10:49,570 --> 00:10:51,230 is only one chance in four. 208 00:10:51,230 --> 00:10:54,230 Which means the probability that it's within 2 sigma, 209 00:10:54,230 --> 00:10:57,050 that the actual number of measured pairs 210 00:10:57,050 --> 00:11:00,700 is within 2 sigma of the expected number 16.4 211 00:11:00,700 --> 00:11:06,000 is greater than 1 minus 1/4, or 3/4. 212 00:11:06,000 --> 00:11:11,450 There's a 3/4 chance that the number of pairs that we find 213 00:11:11,450 --> 00:11:16,160 is within 2 sigma of the expected number 16.4. 214 00:11:16,160 --> 00:11:19,710 Sigma was about 4, so this is 8, which 215 00:11:19,710 --> 00:11:24,170 means that we're expecting, with 3/4 probability, 216 00:11:24,170 --> 00:11:31,090 somewhere between 8.4, meaning 9, and 24.4, meaning 25 pairs. 217 00:11:31,090 --> 00:11:34,350 So 75% of the time, in a class of 110, 218 00:11:34,350 --> 00:11:40,430 we're going to find between 9 and 25 pairs of birthdays. 219 00:11:40,430 --> 00:11:41,670 Did that actually happen? 220 00:11:41,670 --> 00:11:42,810 Well it did. 221 00:11:42,810 --> 00:11:45,960 In our class of 110 for whom we had data, 222 00:11:45,960 --> 00:11:49,170 we actually found 21 pairs of matching birthdays. 223 00:11:49,170 --> 00:11:51,900 Literally we found 12 pairs and three triples, 224 00:11:51,900 --> 00:11:55,440 but each triple counts as three matching pairs. 225 00:11:55,440 --> 00:11:59,180 And there they are, the blues are triples. 226 00:11:59,180 --> 00:12:00,870 And you can see whether your birthday 227 00:12:00,870 --> 00:12:02,620 is among those, and knowing that you 228 00:12:02,620 --> 00:12:05,620 have a classmate or two that have the same birthday 229 00:12:05,620 --> 00:12:06,780 that you do. 230 00:12:06,780 --> 00:12:08,470 So there are 15 different birthdays, 231 00:12:08,470 --> 00:12:10,680 but they count as 21 pairs because it's 232 00:12:10,680 --> 00:12:15,140 12 single pairs, and three triplets, each of which 233 00:12:15,140 --> 00:12:17,360 counts for three pairs.