1
00:00:01,790 --> 00:00:04,290
PROFESSOR: Now you may remember
a discussion of the birthday

2
00:00:04,290 --> 00:00:08,680
paradox, which says
that if you have

3
00:00:08,680 --> 00:00:12,270
a group of 27 random people,

4
00:00:12,270 --> 00:00:16,160
the probability is almost
2/3 that two of them

5
00:00:16,160 --> 00:00:18,660
are going to have a matching
birthday, even though there

6
00:00:18,660 --> 00:00:22,150
are 365 birthdays in the year.

7
00:00:22,150 --> 00:00:25,160
You might sloppily think
that with 27 people

8
00:00:25,160 --> 00:00:29,930
there'd only be a 27 out of
365, or some chance like that.

9
00:00:29,930 --> 00:00:31,830
It's actually 2/3.

10
00:00:31,830 --> 00:00:34,800
And by the time you
get to a class of 110--

11
00:00:34,800 --> 00:00:37,120
which is what we have
data for and we're

12
00:00:37,120 --> 00:00:38,840
going to be looking
at-- it turns out

13
00:00:38,840 --> 00:00:42,820
that the odds are almost
3/4 of a million to one

14
00:00:42,820 --> 00:00:46,490
that you'll have a couple of
people with matching birthdays.
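
[Editor's sketch] The probabilities quoted above can be checked directly. This is a minimal sketch, assuming 365 equally likely days and independent birthdays; the function name `prob_some_match` is illustrative and not from the lecture.

```python
from fractions import Fraction


def prob_some_match(n: int, d: int = 365) -> Fraction:
    """Exact probability that at least two of n people share a birthday,
    assuming d equally likely days and independent birthdays."""
    no_match = Fraction(1)
    for k in range(n):
        # Person k+1 must avoid the k distinct days already taken.
        no_match *= Fraction(d - k, d)
    return 1 - no_match


p27 = float(prob_some_match(27))
no_match_110 = float(1 - prob_some_match(110))
print(f"P(some match among 27 people)  = {p27:.4f}")
print(f"P(no match among 110 people)   = {no_match_110:.3e}")
```

Exact fractions are used so that the very small no-match probability for 110 people is not lost to floating-point rounding before the final conversion.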

15
00:00:46,490 --> 00:00:49,360
So let's look at the matching
birthday problem a little bit

16
00:00:49,360 --> 00:00:49,884
more today.

17
00:00:49,884 --> 00:00:51,300
And the reason
we're looking at it

18
00:00:51,300 --> 00:00:54,350
is because it's a lovely
example where there really

19
00:00:54,350 --> 00:00:57,920
is pairwise independence,
and not mutual independence.

20
00:00:57,920 --> 00:01:01,740
So it's reinforcing the key
idea behind the additivity

21
00:01:01,740 --> 00:01:04,844
of variance, and the pairwise
independent sampling theorem.

22
00:01:04,844 --> 00:01:07,260
We're not going to use the
sampling theorem here, but just

23
00:01:07,260 --> 00:01:09,587
pairwise independence,
but it's worth looking at.

24
00:01:09,587 --> 00:01:11,170
Now before I go
further let me mention

25
00:01:11,170 --> 00:01:14,990
that the birthday problem is
just what we're doing for fun.

26
00:01:14,990 --> 00:01:17,390
But in fact, it has
some real applications

27
00:01:17,390 --> 00:01:21,820
in more than one area,
but the most famous one

28
00:01:21,820 --> 00:01:25,060
is the so-called birthday
attack on a cryptosystem, which

29
00:01:25,060 --> 00:01:29,730
involves being able to search
for matching pairs of keys

30
00:01:29,730 --> 00:01:32,250
with a relatively small sample.

31
00:01:32,250 --> 00:01:36,080
And you're very likely to
find at least two that match.

32
00:01:36,080 --> 00:01:40,470
So with that motivation
claimed, but not examined,

33
00:01:40,470 --> 00:01:43,020
let's just go back to
thinking about birthdays.

34
00:01:43,020 --> 00:01:47,940
OK so let's suppose that I
have some group of n people,

35
00:01:47,940 --> 00:01:49,360
and there are
d days in the year,

36
00:01:49,360 --> 00:01:52,289
just to keep the
parameters abstract

37
00:01:52,289 --> 00:01:53,830
and not get too
stuck on the numbers.

38
00:01:53,830 --> 00:01:55,840
Keeping the parameters
makes it actually

39
00:01:55,840 --> 00:01:57,360
clearer to reason about.

40
00:01:57,360 --> 00:01:59,180
So we're implicitly
assuming here

41
00:01:59,180 --> 00:02:03,270
that each person is kind
of a random variable,

42
00:02:03,270 --> 00:02:07,080
or a random choice
of a birthday.

43
00:02:07,080 --> 00:02:09,600
So each of these
people are really

44
00:02:09,600 --> 00:02:14,065
random variables that return
the value of a birthday.

45
00:02:14,065 --> 00:02:15,440
And as a matter
of fact, we're

46
00:02:15,440 --> 00:02:18,060
going to assume that all the
birthdays are equally likely.

47
00:02:18,060 --> 00:02:19,400
Real birthdays aren't.

48
00:02:19,400 --> 00:02:22,790
They tend to be-- January
tends to be a popular month,

49
00:02:22,790 --> 00:02:26,850
November tends to be a more
popular month than other times.

50
00:02:26,850 --> 00:02:31,540
But let's ignore that because
in the applications in crypto

51
00:02:31,540 --> 00:02:33,030
things really are uniform.

52
00:02:33,030 --> 00:02:36,340
And it makes our analysis
plausible, still plausible

53
00:02:36,340 --> 00:02:38,560
but easy if we
assume that birthdays

54
00:02:38,560 --> 00:02:40,050
are equally likely, OK.

55
00:02:40,050 --> 00:02:43,870
P is the number of pairs
of birthdays that match

56
00:02:43,870 --> 00:02:46,690
in this population of n people.

57
00:02:46,690 --> 00:02:49,790
OK, let's get a grip
on P by thinking

58
00:02:49,790 --> 00:02:51,840
of it as a sum of
indicator variables.

59
00:02:51,840 --> 00:02:54,850
So let's let M sub ij be
the indicator variable

60
00:02:54,850 --> 00:02:59,720
that the ith and jth people among
the n have a matching birthday.

61
00:02:59,720 --> 00:03:03,390
Well the number of
matching birthdays

62
00:03:03,390 --> 00:03:08,440
is then simply the sum over all
the possible pairs of people of

63
00:03:08,440 --> 00:03:10,550
whether or not they have
a matching birthday.

64
00:03:10,550 --> 00:03:13,980
It's the sum of these
indicator variables M sub ij.

65
00:03:13,980 --> 00:03:16,010
And the number of these
indicator variables

66
00:03:16,010 --> 00:03:20,460
is of course all the ways of
choosing two out of n people.

67
00:03:20,460 --> 00:03:25,420
So in short, if I look at
the expectation M sub ij,

68
00:03:25,420 --> 00:03:26,990
let's think about
that for a minute.

69
00:03:26,990 --> 00:03:30,060
We're assuming that all the
birthdays are equally likely.

70
00:03:30,060 --> 00:03:32,620
And so I'm asking whether
the ith and the jth people

71
00:03:32,620 --> 00:03:34,080
have the same birthday.

72
00:03:34,080 --> 00:03:36,190
Well whatever the ith
person's birthday turns out

73
00:03:36,190 --> 00:03:39,370
to be, let's say
it's November 5,

74
00:03:39,370 --> 00:03:42,870
the jth person, who has
a uniform probability

75
00:03:42,870 --> 00:03:46,820
of equalling any birthday, still
has a uniform probability 1

76
00:03:46,820 --> 00:03:49,840
chance in d of equalling
November 5, which

77
00:03:49,840 --> 00:03:52,030
happens to be my birthday.

78
00:03:52,030 --> 00:03:57,010
OK so in short the probability
that any two people

79
00:03:57,010 --> 00:04:00,420
have a matching birthday
is one chance in d.

80
00:04:00,420 --> 00:04:02,680
And that means that
the expectation

81
00:04:02,680 --> 00:04:05,570
of the indicator variable
for that event, M sub ij,

82
00:04:05,570 --> 00:04:07,130
is 1 over d.

83
00:04:07,130 --> 00:04:10,010
And that tells us, by
linearity of expectation,

84
00:04:10,010 --> 00:04:11,920
that the expected
number of pairs

85
00:04:11,920 --> 00:04:15,390
is simply the number
of those pairs

86
00:04:15,390 --> 00:04:17,690
times the expected
number per pair,

87
00:04:17,690 --> 00:04:20,279
n choose 2 times 1 over d.

88
00:04:20,279 --> 00:04:23,130
Well as I said we have
data for 110 students.

89
00:04:23,130 --> 00:04:27,250
So the expected number of pairs
in a collection in a student

90
00:04:27,250 --> 00:04:32,730
body of 110 is 110 choose 2,
times 1 over 365,

91
00:04:32,730 --> 00:04:39,020
or about 16.4 pairs is the
expected number of pairs

92
00:04:39,020 --> 00:04:41,640
of matching birthdays.
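
[Editor's sketch] The linearity-of-expectation step above can be written out in a few lines. This is a minimal sketch; the function name `expected_matching_pairs` is illustrative and not from the lecture.

```python
from fractions import Fraction
from math import comb


def expected_matching_pairs(n: int, d: int) -> Fraction:
    """E[P] for n people and d equally likely days.

    There are comb(n, 2) indicator variables M_ij, each with
    expectation 1/d, and expectations add regardless of dependence.
    """
    return comb(n, 2) * Fraction(1, d)


mean_pairs = expected_matching_pairs(110, 365)
print(comb(110, 2), "pairs of people")
print(f"expected matching pairs = {float(mean_pairs):.2f}")
```

Note that this step needs no independence assumption at all; independence only matters later, for the variance.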

93
00:04:41,640 --> 00:04:44,020
OK, now that's an
expected value.

94
00:04:44,020 --> 00:04:48,300
How likely is it to be if I take
a selection of 110 students,

95
00:04:48,300 --> 00:04:49,960
and I count how many
pairs of birthdays

96
00:04:49,960 --> 00:04:54,310
are there, do I really expect
to get close to 16.4 or not?

97
00:04:54,310 --> 00:04:58,140
Well what we're asking for
is the probability that P

98
00:04:58,140 --> 00:05:03,480
is near its mean, that the
distance between P and 16.4

99
00:05:03,480 --> 00:05:04,790
is greater than k.

100
00:05:04,790 --> 00:05:09,200
I hope that as k gets bigger
this probability is small.

101
00:05:09,200 --> 00:05:14,410
And so I'm really quite likely
to have close to 16.4 matching pairs

102
00:05:14,410 --> 00:05:17,140
in my sample of 110.

103
00:05:17,140 --> 00:05:21,060
But this probability is one
that's a mess to calculate.

104
00:05:21,060 --> 00:05:25,190
But we can get a grip on it
because the variance of P

105
00:05:25,190 --> 00:05:26,540
is easy to calculate.

106
00:05:26,540 --> 00:05:30,430
And that will allow us to
apply the Chebyshev bound,

107
00:05:30,430 --> 00:05:33,380
and get some kind of an
estimate on the likelihood

108
00:05:33,380 --> 00:05:37,590
that P is near its expectation.

109
00:05:37,590 --> 00:05:40,910
So the key observation
that we need

110
00:05:40,910 --> 00:05:45,910
is that the indicator variables
are pairwise independent.

111
00:05:45,910 --> 00:05:47,470
So let's think
about the indicator

112
00:05:47,470 --> 00:05:50,820
variable for the event that
the ith and the jth people

113
00:05:50,820 --> 00:05:53,520
have the same birthday, let's
call them Albert and Drew.

114
00:05:53,520 --> 00:05:56,540
So Albert's the ith person,
Drew is the jth person.

115
00:05:56,540 --> 00:05:59,240
And I'm interested in the
event that Albert and Drew

116
00:05:59,240 --> 00:06:01,020
have the same birthday.

117
00:06:01,020 --> 00:06:04,292
And let's compare that to
another pair of people,

118
00:06:04,292 --> 00:06:06,250
and whether or not they
have the same birthday.

119
00:06:06,250 --> 00:06:08,400
So let's first of all
think about Dave and Mike,

120
00:06:08,400 --> 00:06:10,600
whether Dave and Mike
have the same birthday.

121
00:06:10,600 --> 00:06:14,260
And I want to know if these
two events are independent.

122
00:06:14,260 --> 00:06:17,880
Well remember we are assuming
that Albert's birthday is

123
00:06:17,880 --> 00:06:20,380
independent of Drew's birthday
is independent of David's, is

124
00:06:20,380 --> 00:06:21,590
independent of Mike's.

125
00:06:21,590 --> 00:06:25,130
Each of the people is
supposedly chosen independently,

126
00:06:25,130 --> 00:06:28,220
and their birthdays
are independent.

127
00:06:28,220 --> 00:06:31,640
So it's obvious that these
two pairs that don't overlap

128
00:06:31,640 --> 00:06:33,290
have nothing to do
with each other,

129
00:06:33,290 --> 00:06:34,873
and we don't have
to worry about that.

130
00:06:34,873 --> 00:06:37,390
You could prove that
formally, but it is obvious

131
00:06:37,390 --> 00:06:42,400
because each of the individual
birthdays are independent.

132
00:06:42,400 --> 00:06:44,360
Now what's more
interesting is the case

133
00:06:44,360 --> 00:06:48,610
when I asked whether or
not Albert and Drew having

134
00:06:48,610 --> 00:06:51,290
the same birthday is
independent of Albert and Mike

135
00:06:51,290 --> 00:06:53,400
having the same birthday.

136
00:06:53,400 --> 00:06:55,670
And that one is not so obvious.

137
00:06:55,670 --> 00:06:59,610
Here's a way to think
about what could go wrong.

138
00:06:59,610 --> 00:07:03,130
Suppose that in fact the
birthdays weren't uniform,

139
00:07:03,130 --> 00:07:05,940
suppose that some birthdays
were more common than others.

140
00:07:05,940 --> 00:07:09,980
OK that makes it more likely
that if Albert and Drew have

141
00:07:09,980 --> 00:07:12,740
the same birthday
it slants things,

142
00:07:12,740 --> 00:07:16,890
so that they're more likely to
have this very common birthday

143
00:07:16,890 --> 00:07:19,990
than they would
have been otherwise.

144
00:07:19,990 --> 00:07:24,470
And now once I know that
they match, and therefore are

145
00:07:24,470 --> 00:07:27,260
more likely to have the common
birthday than they would have

146
00:07:27,260 --> 00:07:29,970
without any information,
I know that Albert

147
00:07:29,970 --> 00:07:34,140
is more likely to have this
common birthday than otherwise.

148
00:07:34,140 --> 00:07:39,010
And that means that Mike is even
more likely to match Albert,

149
00:07:39,010 --> 00:07:41,740
because Albert's got the
common birthday than Mike

150
00:07:41,740 --> 00:07:46,880
was to match Albert without any
further information about what

151
00:07:46,880 --> 00:07:49,105
Albert's likely birthday was.

152
00:07:49,105 --> 00:07:51,730
You can think about that, and it
can be worked out numerically,

153
00:07:51,730 --> 00:07:52,313
easily enough.

154
00:07:52,313 --> 00:07:55,830
So uniform is going to be a
crucial factor here in order

155
00:07:55,830 --> 00:07:57,820
to conclude that
Albert and Drew,

156
00:07:57,820 --> 00:08:04,909
and Albert and Mike are
mutually independent events.

157
00:08:04,909 --> 00:08:06,450
But let's go back
and think about it.

158
00:08:06,450 --> 00:08:08,722
All we really need is
that Mike is uniform

159
00:08:08,722 --> 00:08:11,180
in order to conclude that these
two events are independent.

160
00:08:11,180 --> 00:08:14,020
Because we know that Mike
and Drew and Albert
and Andrew and Albert

161
00:08:14,020 --> 00:08:16,000
separately are
independent of each other.

162
00:08:16,000 --> 00:08:18,230
Their birthdays are
chosen independently.

163
00:08:18,230 --> 00:08:21,870
So that intuitively means
that the probability that Mike

164
00:08:21,870 --> 00:08:23,950
has any given birthday
doesn't really

165
00:08:23,950 --> 00:08:26,170
matter what's going on
with Albert and Drew,

166
00:08:26,170 --> 00:08:28,340
because Mike is independent
of Albert and Drew.

167
00:08:28,340 --> 00:08:31,440
And if we know that Mike's
probability of having

168
00:08:31,440 --> 00:08:36,419
a birthday is uniform,
then whatever the birthday

169
00:08:36,419 --> 00:08:39,000
that Albert has, whether
he matches Drew or not,

170
00:08:39,000 --> 00:08:43,059
Mike has a 1 chance
in d of hitting

171
00:08:43,059 --> 00:08:46,000
the same birthday of whatever
Albert wound up having.

172
00:08:46,000 --> 00:08:50,300
And that means that the
probability that Mike matches

173
00:08:50,300 --> 00:08:54,050
Albert is the same 1 over d
that it would have been if we

174
00:08:54,050 --> 00:08:56,230
had no further information.
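
[Editor's sketch] The role of uniformity in this argument can be checked by brute force on a tiny "year". This is a minimal sketch: the 4-day year, the skewed distribution, and the name `match_probs` are hypothetical choices made for illustration, not data from the lecture.

```python
from fractions import Fraction
from itertools import product


def match_probs(dist):
    """dist[k] is the probability of birthday k.

    Returns (Pr[A=D], Pr[A=M], Pr[A=D and A=M]) when Albert (A),
    Drew (D) and Mike (M) draw birthdays independently from dist.
    """
    days = range(len(dist))
    p_ad = p_am = p_both = Fraction(0)
    for a, dr, m in product(days, repeat=3):
        weight = dist[a] * dist[dr] * dist[m]
        if a == dr:
            p_ad += weight
        if a == m:
            p_am += weight
        if a == dr and a == m:
            p_both += weight
    return p_ad, p_am, p_both


uniform = [Fraction(1, 4)] * 4
skewed = [Fraction(1, 2), Fraction(1, 6), Fraction(1, 6), Fraction(1, 6)]

results = {}
for name, dist in (("uniform", uniform), ("skewed", skewed)):
    p_ad, p_am, p_both = match_probs(dist)
    results[name] = (p_ad, p_am, p_both)
    print(name, "independent:", p_both == p_ad * p_am)
```

With the uniform distribution the joint probability equals the product, so the two match events are independent. With the skewed one the joint probability is larger than the product, which is the "Mike is even more likely to match Albert" effect described above.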

175
00:08:56,230 --> 00:08:58,670
This is an argument that,
in fact, is made rigorous

176
00:08:58,670 --> 00:09:01,490
in some class problems
and a problem set,

177
00:09:01,490 --> 00:09:05,630
but let's just take
it as plausible enough

178
00:09:05,630 --> 00:09:09,340
based on this hand-waving
argument that I articulated,

179
00:09:09,340 --> 00:09:13,840
that these two events are
independent pairwise and so

180
00:09:13,840 --> 00:09:18,150
the corresponding indicator
variables, M Albert Drew,

181
00:09:18,150 --> 00:09:24,684
and M Albert Mike are
independent of each other.

182
00:09:24,684 --> 00:09:25,850
So that's what we've argued.

183
00:09:25,850 --> 00:09:28,770
But notice that these
events of pairwise matching

184
00:09:28,770 --> 00:09:30,470
are certainly not
three-way independent,

185
00:09:30,470 --> 00:09:32,880
because after all if I know
that Albert and Drew have

186
00:09:32,880 --> 00:09:34,910
the same birthday, and
that Albert and Mike have

187
00:09:34,910 --> 00:09:36,850
the same birthday,
I absolutely know

188
00:09:36,850 --> 00:09:39,420
with certainty that Drew and
Mike have the same birthday.

189
00:09:39,420 --> 00:09:42,500
So this is a very
nice, basic example

190
00:09:42,500 --> 00:09:45,340
where you have pairwise
independence, but not

191
00:09:45,340 --> 00:09:47,760
three-way independence,
assuming that

192
00:09:47,760 --> 00:09:52,950
all of these random variables
Albert, Drew, and Mike are

193
00:09:52,950 --> 00:09:56,450
uniform in what
birthday they have.
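
[Editor's sketch] Pairwise independence without three-way independence can be confirmed by enumerating every outcome for three people. This is a minimal sketch, assuming uniform birthdays; the 5-day year `d = 5` is a hypothetical choice that keeps the enumeration small.

```python
from fractions import Fraction
from itertools import product

d = 5  # hypothetical small "year"; all d**3 outcomes are equally likely
outcomes = list(product(range(d), repeat=3))  # (albert, drew, mike)


def prob(event):
    """Probability of an event under the uniform distribution on outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def ad(o):  # Albert and Drew match
    return o[0] == o[1]


def am(o):  # Albert and Mike match
    return o[0] == o[2]


def dm(o):  # Drew and Mike match
    return o[1] == o[2]


pairwise_ok = all(
    prob(lambda o: x(o) and y(o)) == prob(x) * prob(y)
    for x, y in ((ad, am), (ad, dm), (am, dm))
)
p_all = prob(lambda o: ad(o) and am(o) and dm(o))
mutual_ok = p_all == prob(ad) * prob(am) * prob(dm)

print("pairwise independent:", pairwise_ok)
print("mutually independent:", mutual_ok)
```

All three matches happen together with probability 1/d^2, not 1/d^3, because any two of the matches force the third.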

194
00:09:56,450 --> 00:09:59,560
OK so let's go back
to counting birthdays.

195
00:09:59,560 --> 00:10:02,770
The variance of an
indicator is pq.

196
00:10:02,770 --> 00:10:09,020
So in this case p is 1 over 365,
and q is 1 minus 1 over 365.

197
00:10:09,020 --> 00:10:12,620
And because of
pairwise independence,

198
00:10:12,620 --> 00:10:16,840
the variance of P, which
is the sum of the M sub

199
00:10:16,840 --> 00:10:21,040
ijs, the variance of the
number of birthday pairs,

200
00:10:21,040 --> 00:10:23,060
is the sum of those variances.

201
00:10:23,060 --> 00:10:27,900
It's 110 choose 2 times
the variance of the M sub ij

202
00:10:27,900 --> 00:10:31,790
turns out to be
about 16.37, which

203
00:10:31,790 --> 00:10:36,630
means that the standard
deviation sigma is about 4.

204
00:10:36,630 --> 00:10:40,170
Now I can apply Chebyshev,
because by the Chebyshev bound

205
00:10:40,170 --> 00:10:46,730
the probability that P
is within--

206
00:10:46,730 --> 00:10:49,570
is further than 2 sigma
away from 16.4--

207
00:10:49,570 --> 00:10:51,230
is only one chance in four.

208
00:10:51,230 --> 00:10:54,230
Which means the probability
that it's within 2 sigma,

209
00:10:54,230 --> 00:10:57,050
that the actual number
of measured pairs

210
00:10:57,050 --> 00:11:00,700
is within 2 sigma of
the expected number 16.4

211
00:11:00,700 --> 00:11:06,000
is greater than 1
minus 1/4, or 3/4.

212
00:11:06,000 --> 00:11:11,450
There's a 3/4 chance that the
number of pairs that we find

213
00:11:11,450 --> 00:11:16,160
is within 2 sigma of the
expected number 16.4.

214
00:11:16,160 --> 00:11:19,710
Sigma was about 4,
so this is 8, which

215
00:11:19,710 --> 00:11:24,170
means that we're expecting,
with 3/4 probability,

216
00:11:24,170 --> 00:11:31,090
somewhere between 8.4, meaning
9, and 24.4, meaning 25 pairs.

217
00:11:31,090 --> 00:11:34,350
So 75% of the time,
in a class of 110,

218
00:11:34,350 --> 00:11:40,430
we're going to find between
9 and 25 pairs of birthdays.
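
[Editor's sketch] The variance and Chebyshev arithmetic above can be reproduced without rounding sigma. This is a minimal sketch, assuming the pairwise-independent indicators argued for earlier; the variable names are illustrative.

```python
from math import comb, sqrt

n, d = 110, 365
p = 1 / d          # Pr[a given pair matches]
q = 1 - p

mean = comb(n, 2) * p
# Variance of a sum of pairwise independent indicators is the sum of
# the individual variances, each equal to p*q.
variance = comb(n, 2) * p * q
sigma = sqrt(variance)

k = 2
tail_bound = 1 / k**2  # Chebyshev: Pr[|P - mean| >= k*sigma] <= 1/k^2
low, high = mean - k * sigma, mean + k * sigma

print(f"mean = {mean:.2f}, variance = {variance:.2f}, sigma = {sigma:.3f}")
print(f"Pr[{low:.2f} < P < {high:.2f}] >= {1 - tail_bound}")
```

The lecture rounds sigma to 4 for the interval; using the unrounded value moves the endpoints only slightly.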

219
00:11:40,430 --> 00:11:41,670
Did that actually happen?

220
00:11:41,670 --> 00:11:42,810
Well it did.

221
00:11:42,810 --> 00:11:45,960
In our class of 110
for whom we had data,

222
00:11:45,960 --> 00:11:49,170
we actually found 21 pairs
of matching birthdays.

223
00:11:49,170 --> 00:11:51,900
Literally we found 12
pairs and three triples,

224
00:11:51,900 --> 00:11:55,440
but each triple counts
as three matching pairs.

225
00:11:55,440 --> 00:11:59,180
And there they are,
the blues are triples.

226
00:11:59,180 --> 00:12:00,870
And you can see
whether your birthday

227
00:12:00,870 --> 00:12:02,620
is among those, and
knowing that you

228
00:12:02,620 --> 00:12:05,620
have a classmate or two
that have the same birthday

229
00:12:05,620 --> 00:12:06,780
that you do.

230
00:12:06,780 --> 00:12:08,470
So there are 15
different birthdays,

231
00:12:08,470 --> 00:12:10,680
but they count as 21
pairs because it's

232
00:12:10,680 --> 00:12:15,140
12 single pairs, and three
triplets, each of which

233
00:12:15,140 --> 00:12:17,360
counts for three pairs.
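
[Editor's sketch] The pair counting described above (a triple counts as three pairs) can be done with a tally of birthdays. This is a minimal sketch: the list `sample` is made-up data built to have the same shape as the class result, not the actual class data, and `count_matching_pairs` is an illustrative name.

```python
from collections import Counter
from math import comb


def count_matching_pairs(birthdays):
    """Number of pairs of people sharing a birthday.

    A day shared by c people contributes comb(c, 2) pairs, so a
    triple contributes 3 and an unshared day contributes 0.
    """
    counts = Counter(birthdays)
    return sum(comb(c, 2) for c in counts.values())


# Hypothetical class of 110: 12 days shared by two people,
# 3 days shared by three people, and 77 people with unique days.
sample = []
for day in range(12):
    sample += [day] * 2
for day in range(12, 15):
    sample += [day] * 3
sample += list(range(100, 177))

shared_days = sum(1 for c in Counter(sample).values() if c > 1)
print(len(sample), "people,", shared_days, "shared birthdays,",
      count_matching_pairs(sample), "matching pairs")
```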