1
00:00:00,790 --> 00:00:03,130
The following content is
provided under a Creative

2
00:00:03,130 --> 00:00:04,550
Commons license.

3
00:00:04,550 --> 00:00:06,760
Your support will help
MIT OpenCourseWare

4
00:00:06,760 --> 00:00:10,850
continue to offer high quality
educational resources for free.

5
00:00:10,850 --> 00:00:13,390
To make a donation or to
view additional materials

6
00:00:13,390 --> 00:00:17,320
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,320 --> 00:00:18,570
at ocw.mit.edu.

8
00:00:28,762 --> 00:00:31,140
JOHN GUTTAG: So
today, we're going

9
00:00:31,140 --> 00:00:34,560
to move on to a fairly
different world than the world

10
00:00:34,560 --> 00:00:36,090
we've been living in.

11
00:00:36,090 --> 00:00:37,950
And this will be a
world we'll be living in

12
00:00:37,950 --> 00:00:40,580
for quite a few lectures.

13
00:00:40,580 --> 00:00:42,990
But before I do that,
I want to get back

14
00:00:42,990 --> 00:00:47,170
to just finish up something
that Professor Grimson started.

15
00:00:47,170 --> 00:00:50,520
You may recall he talked
about family trees

16
00:00:50,520 --> 00:00:52,650
and raised the question,
was it actually

17
00:00:52,650 --> 00:00:55,890
possible to represent all
ancestral relationships

18
00:00:55,890 --> 00:00:57,560
as a tree?

19
00:00:57,560 --> 00:00:59,890
Well, as a counterexample,
I'm sure some of you

20
00:00:59,890 --> 00:01:03,770
are familiar with Oedipus Rex.

21
00:01:03,770 --> 00:01:05,269
For those of you
who are not, I'm

22
00:01:05,269 --> 00:01:07,940
happy to give you a plot summary
at the end of the lecture.

23
00:01:07,940 --> 00:01:10,880
It's a rather bizarre plot.

24
00:01:10,880 --> 00:01:16,110
But it was captured in a
wonderful song by Tom Lehrer.

25
00:01:16,110 --> 00:01:19,160
The short story is Oedipus
ended up marrying his mother

26
00:01:19,160 --> 00:01:22,430
and having four children.

27
00:01:22,430 --> 00:01:25,100
And Tom Lehrer, if you've
never heard of Tom Lehrer,

28
00:01:25,100 --> 00:01:29,660
you're missing one of the
world's funniest songwriters.

29
00:01:29,660 --> 00:01:32,300
And he had a wonderful
song called "Oedipus Rex,"

30
00:01:32,300 --> 00:01:38,510
and I recommend this YouTube as
a way to go and listen to it.

31
00:01:38,510 --> 00:01:44,870
And you can gather from the
quote what the story is about.

32
00:01:44,870 --> 00:01:46,970
I also recommend the
play, by the way.

33
00:01:46,970 --> 00:01:50,330
It's really kind of
appalling what goes on,

34
00:01:50,330 --> 00:01:53,090
but it's beautiful.

35
00:01:53,090 --> 00:01:57,050
Back to the main topic,
here's the relevant reading--

36
00:01:59,570 --> 00:02:05,250
a small bit from later in
the book and then chapter 14.

37
00:02:05,250 --> 00:02:07,260
You may notice that
we're not actually going

38
00:02:07,260 --> 00:02:09,264
through the book in order.

39
00:02:09,264 --> 00:02:11,430
And the reason we're not
doing that is because we're

40
00:02:11,430 --> 00:02:13,440
trying to get you
information you need in time

41
00:02:13,440 --> 00:02:14,395
to do problem sets.

42
00:02:18,810 --> 00:02:24,480
So the topic of today is
really uncertainty and the fact

43
00:02:24,480 --> 00:02:29,550
that the world is really
annoyingly hard to understand.

44
00:02:32,520 --> 00:02:36,480
This is a signpost
related to 6.0002,

45
00:02:36,480 --> 00:02:41,170
but we won't go into too
much detail about it.

46
00:02:41,170 --> 00:02:43,050
We'd rather things were certain.

47
00:02:43,050 --> 00:02:47,330
But in fact, they
usually are not.

48
00:02:47,330 --> 00:02:51,710
And this is a place
where 6.0002 diverges

49
00:02:51,710 --> 00:02:53,930
from the typical
introductory computer science

50
00:02:53,930 --> 00:02:58,250
course, which focuses on
things that are functional--

51
00:02:58,250 --> 00:03:02,030
given an input, you always
get the same output.

52
00:03:02,030 --> 00:03:03,950
It's predictable.

53
00:03:03,950 --> 00:03:07,760
And we like to do that,
because that's easier to teach.

54
00:03:07,760 --> 00:03:11,300
But in fact, for reasons
we'll be talking about,

55
00:03:11,300 --> 00:03:14,210
it's not nearly as
useful if you're

56
00:03:14,210 --> 00:03:16,580
trying to actually
write computations that

57
00:03:16,580 --> 00:03:18,860
help you understand the world.

58
00:03:18,860 --> 00:03:21,110
You have to face
uncertainty head on.

59
00:03:25,030 --> 00:03:27,860
An analogy is, for many
years, people believed

60
00:03:27,860 --> 00:03:31,490
in Newtonian mechanics--

61
00:03:31,490 --> 00:03:34,520
I guess they still
do in 8.01 maybe--

62
00:03:34,520 --> 00:03:38,030
that every effect has a cause.

63
00:03:38,030 --> 00:03:40,430
An apple falls from the
tree because of gravity,

64
00:03:40,430 --> 00:03:42,390
and you know where
it's going to land.

65
00:03:42,390 --> 00:03:45,080
And the world can be
understood causally.

66
00:03:45,080 --> 00:03:50,090
And people believed this
really for quite a long time,

67
00:03:50,090 --> 00:03:54,500
most of history,
until the early part

68
00:03:54,500 --> 00:03:58,250
of the 20th century, when
the so-called Copenhagen

69
00:03:58,250 --> 00:04:00,770
doctrine was put forth.

70
00:04:03,880 --> 00:04:06,670
The doctrine there from
Bohr and Heisenberg,

71
00:04:06,670 --> 00:04:09,910
two very famous
physicists, was one

72
00:04:09,910 --> 00:04:13,930
of what they called
causal nondeterminism.

73
00:04:13,930 --> 00:04:17,709
And their assertion was that
the world at its very most

74
00:04:17,709 --> 00:04:24,610
fundamental level behaves in
a way that you cannot predict.

75
00:04:24,610 --> 00:04:28,990
It's OK to make a statement that
x is highly likely to occur,

76
00:04:28,990 --> 00:04:33,430
almost certain to occur,
but for no case can

77
00:04:33,430 --> 00:04:36,310
you make a statement
x will occur.

78
00:04:36,310 --> 00:04:40,360
Nothing has a
probability of one.

79
00:04:40,360 --> 00:04:43,720
This is hard for us to
imagine today, when we all

80
00:04:43,720 --> 00:04:45,580
know quantum mechanics.

81
00:04:45,580 --> 00:04:50,320
But at the turn of the century,
this was a shocking statement.

82
00:04:50,320 --> 00:04:53,230
And two other very
well-known physicists,

83
00:04:53,230 --> 00:04:55,900
Albert Einstein and
Schrodinger, basically

84
00:04:55,900 --> 00:04:57,460
said, no, this is wrong.

85
00:04:57,460 --> 00:05:00,130
Bohr, Heisenberg,
you guys are idiots.

86
00:05:00,130 --> 00:05:01,570
It's just not true.

87
00:05:01,570 --> 00:05:03,670
They probably didn't
call them idiots.

88
00:05:03,670 --> 00:05:06,730
And this is most exemplified
by Einstein's famous quote

89
00:05:06,730 --> 00:05:11,230
that "God does not play dice,"
which is indicative of the fact

90
00:05:11,230 --> 00:05:13,990
that this was actually a
discussion that permeated

91
00:05:13,990 --> 00:05:19,570
not just the world of physics,
but society in general. People

92
00:05:19,570 --> 00:05:22,150
really turned it into
literally a religious issue,

93
00:05:22,150 --> 00:05:24,900
as did Einstein.

94
00:05:24,900 --> 00:05:26,940
Well, so now we should
ask the question,

95
00:05:26,940 --> 00:05:28,830
does it really matter?

96
00:05:28,830 --> 00:05:31,260
And to illustrate
that, I need two coins.

97
00:05:31,260 --> 00:05:33,900
I forgot to bring
any coins with me.

98
00:05:33,900 --> 00:05:35,840
Does anyone have a
coin they can lend me?

99
00:05:35,840 --> 00:05:37,301
AUDIENCE: I have some coins.

100
00:05:37,301 --> 00:05:39,900
JOHN GUTTAG: All right.

101
00:05:39,900 --> 00:05:42,300
Now, this is where I see how
much the students trust me.

102
00:05:42,300 --> 00:05:44,190
Do I get a penny?

103
00:05:44,190 --> 00:05:46,440
Do I get a silver dollar?

104
00:05:46,440 --> 00:05:47,460
So what do we got here?

105
00:05:50,500 --> 00:05:54,600
This is someone who's entrusting
me with quarters, not so bad.

106
00:05:57,500 --> 00:06:00,149
So we'll take these quarters,
and we'll shake them up,

107
00:06:00,149 --> 00:06:01,690
and we'll put them
down on the table.

108
00:06:04,240 --> 00:06:07,000
And now, we'll ask a question--

109
00:06:07,000 --> 00:06:13,140
do we have two heads, two
tails, or one head and one tail?

110
00:06:13,140 --> 00:06:17,220
So who thinks we have two heads?

111
00:06:17,220 --> 00:06:20,370
Who thinks we have two tails?

112
00:06:20,370 --> 00:06:23,230
Who thinks we have one of each?

113
00:06:23,230 --> 00:06:26,580
Well, clearly, everyone except
a few people-- for example,

114
00:06:26,580 --> 00:06:29,730
the Indians fan, who clearly
believes in the counterfactual--

115
00:06:33,030 --> 00:06:37,080
made the most
probabilistic decision.

116
00:06:37,080 --> 00:06:40,550
But in fact, there is
no nondeterminism here.

117
00:06:40,550 --> 00:06:43,040
I know the answer.

118
00:06:43,040 --> 00:06:47,600
And so in some sense,
it doesn't matter

119
00:06:47,600 --> 00:06:49,820
whether it's deterministic,
because in fact, it's

120
00:06:49,820 --> 00:06:52,070
not causally nondeterministic.

121
00:06:52,070 --> 00:06:58,120
The answer is quite clear,
but you don't know the answer.

122
00:06:58,120 --> 00:07:03,870
And so whether or not the world
is inherently unpredictable,

123
00:07:03,870 --> 00:07:08,760
the fact that we never have
complete knowledge of the world

124
00:07:08,760 --> 00:07:10,770
suggests that we
might as well treat

125
00:07:10,770 --> 00:07:15,130
it as inherently unpredictable.

126
00:07:15,130 --> 00:07:19,060
And so this is called
predictive nondeterminism.

127
00:07:19,060 --> 00:07:21,365
And this really is
what's going to underlie

128
00:07:21,365 --> 00:07:23,740
pretty much everything else
we're going to be doing here.

129
00:07:30,370 --> 00:07:34,000
No comments about that?

130
00:07:34,000 --> 00:07:37,150
I wouldn't do that to you.

131
00:07:37,150 --> 00:07:39,700
Thank you.

132
00:07:39,700 --> 00:07:42,260
I know you are wishing to
get interest on the money,

133
00:07:42,260 --> 00:07:44,140
but you don't get any.

134
00:07:44,140 --> 00:07:46,060
AUDIENCE: Was it heads or tails?

135
00:07:51,376 --> 00:07:52,500
JOHN GUTTAG: What was that?

136
00:07:56,160 --> 00:08:00,660
So when we think about
nondeterminism in computation,

137
00:08:00,660 --> 00:08:04,150
we use the term
stochastic process.

138
00:08:04,150 --> 00:08:07,020
And that's any
process that's ongoing

139
00:08:07,020 --> 00:08:12,180
in which the next state depends
upon the previous states

140
00:08:12,180 --> 00:08:14,800
and some random element.

141
00:08:14,800 --> 00:08:18,450
So typically up till now
when we've written code,

142
00:08:18,450 --> 00:08:20,890
what one line of code
did depended only

143
00:08:20,890 --> 00:08:23,260
on what the previous
lines of code did.

144
00:08:23,260 --> 00:08:25,810
There was no randomness.

145
00:08:25,810 --> 00:08:28,282
Here, we're going
to have randomness.

146
00:08:28,282 --> 00:08:29,740
And we can see the
difference if we

147
00:08:29,740 --> 00:08:34,450
look at these two
specifications of rolling a die.

148
00:08:34,450 --> 00:08:38,320
The first one, returns
an int between 1 and 6,

149
00:08:38,320 --> 00:08:41,890
is what I'll call
underdetermined.

150
00:08:41,890 --> 00:08:45,940
By that I mean you can't tell
what it's going to return.

151
00:08:45,940 --> 00:08:49,540
Maybe it will return a different
number each time you call it,

152
00:08:49,540 --> 00:08:51,700
but it's not required to.

153
00:08:51,700 --> 00:08:55,120
Maybe it will return three
every time you call it.

154
00:08:55,120 --> 00:08:58,690
The second specification
requires randomness.

155
00:08:58,690 --> 00:09:01,360
It says, it returns a
randomly chosen int.

156
00:09:01,360 --> 00:09:06,710
So it requires a
stochastic implementation.
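
A minimal Python sketch of the difference between the two specifications. The function names are hypothetical (the transcript does not show the slide's code). The first docstring only constrains the range, so an implementation that always returns 3 satisfies it; the second cannot be met without a random choice.

```python
import random

def roll_die_underdetermined():
    """Returns an int between 1 and 6."""
    # Legal under this spec: the same value on every call.
    return 3

def roll_die_random():
    """Returns a randomly chosen int between 1 and 6."""
    # This spec demands randomness, so the implementation is stochastic.
    return random.choice([1, 2, 3, 4, 5, 6])
```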

157
00:09:06,710 --> 00:09:11,090
Let's look at how we implement
a random process in Python.

158
00:09:11,090 --> 00:09:15,890
We start by importing
the library random.

159
00:09:15,890 --> 00:09:17,520
This is not to
say you can import

160
00:09:17,520 --> 00:09:19,770
any random library you want.

161
00:09:19,770 --> 00:09:22,530
It's to say you import
the library called random.

162
00:09:22,530 --> 00:09:23,810
Let me get my pen out of here.

163
00:09:27,310 --> 00:09:29,230
And we'll use that a lot.

164
00:09:29,230 --> 00:09:32,590
And then we're going to use
the function in random called

165
00:09:32,590 --> 00:09:34,940
random.choice.

166
00:09:34,940 --> 00:09:39,530
It takes as an argument a
sequence, in this case a list,

167
00:09:39,530 --> 00:09:43,850
and randomly chooses
one member of the list.

168
00:09:43,850 --> 00:09:46,160
And it chooses it uniformly.

169
00:09:49,010 --> 00:09:52,930
It's a uniform distribution.

170
00:09:52,930 --> 00:09:56,860
And what that means is
that it's equally probable

171
00:09:56,860 --> 00:09:59,650
that it will choose any
number in that list each time

172
00:09:59,650 --> 00:10:01,690
you call it.

173
00:10:01,690 --> 00:10:03,820
We'll later look
at distributions

174
00:10:03,820 --> 00:10:06,700
that are not uniform,
not equally probable,

175
00:10:06,700 --> 00:10:08,320
where things are weighted.

176
00:10:08,320 --> 00:10:10,375
But here, it's quite
simple, it's just uniform.

177
00:10:13,470 --> 00:10:16,980
And then we can test
it using testRoll--

178
00:10:16,980 --> 00:10:21,930
takes some number n and
rolls the die that many times

179
00:10:21,930 --> 00:10:24,970
and creates a string
telling us what we got.
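
A sketch of what the die-rolling code described here might look like. Only the names `random.choice` and `testRoll` come from the lecture; `rollDie`, the default argument, and returning the string (rather than printing it inside the function) are assumptions.

```python
import random

def rollDie():
    """Returns a randomly chosen int between 1 and 6 (uniform)."""
    return random.choice([1, 2, 3, 4, 5, 6])

def testRoll(n=10):
    """Rolls the die n times and returns the results as one string."""
    result = ''
    for _ in range(n):
        result += str(rollDie())
    return result

print(testRoll(5))  # five digits, each 1-6; varies from run to run
```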

180
00:10:29,750 --> 00:10:36,300
So let's consider running this
on, say, testRoll of five.

181
00:10:36,300 --> 00:10:38,680
And we'll ask the
question, if we run it,

182
00:10:38,680 --> 00:10:43,180
how probable is it that it's
going to return a string

183
00:10:43,180 --> 00:10:43,960
of five 1's?

184
00:10:50,100 --> 00:10:51,120
How do we do that?

185
00:10:51,120 --> 00:10:54,420
Now, how many people
here are either in 6.041

186
00:10:54,420 --> 00:10:56,670
or have taken 6.041?

187
00:10:56,670 --> 00:10:59,280
Raise your hand.

188
00:10:59,280 --> 00:10:59,850
Oh, good.

189
00:10:59,850 --> 00:11:02,830
So very few of you
know probability.

190
00:11:02,830 --> 00:11:03,330
That helps.

191
00:11:06,450 --> 00:11:09,170
So how do we think
about that question?

192
00:11:09,170 --> 00:11:14,480
Well, probability, to me at
least, is all about counting,

193
00:11:14,480 --> 00:11:16,740
especially discrete
probability, which

194
00:11:16,740 --> 00:11:19,900
is what we're looking at here.

195
00:11:19,900 --> 00:11:23,830
What you do is you start by
counting the number of events

196
00:11:23,830 --> 00:11:29,710
that have the
property of interest

197
00:11:29,710 --> 00:11:31,480
and the number of
possible events

198
00:11:31,480 --> 00:11:32,940
and divide one by the other.

199
00:11:35,580 --> 00:11:41,430
So if we think about
rolling a die five times,

200
00:11:41,430 --> 00:11:44,070
we can enumerate all of
the possible outcomes

201
00:11:44,070 --> 00:11:44,885
of five rolls.

202
00:11:47,390 --> 00:11:50,870
So if we look at that,
what are the outcomes?

203
00:11:50,870 --> 00:11:54,150
Well, I could get five 1's.

204
00:11:54,150 --> 00:12:00,720
I could get four 1's and a 2
or four 1's and 3, skip a few.

205
00:12:00,720 --> 00:12:05,040
The next one would be three 1's,
a 2 and a 1, then a 2 and 2,

206
00:12:05,040 --> 00:12:08,850
and finally, at
the end, all 6's.

207
00:12:08,850 --> 00:12:13,320
So remember, we
looked before at when

208
00:12:13,320 --> 00:12:17,160
we were looking at optimization
problems about binary numbers.

209
00:12:17,160 --> 00:12:20,670
And we said we can look at all
the possible choices of items

210
00:12:20,670 --> 00:12:24,460
in the knapsack by a
vector of 0's and 1's.

211
00:12:24,460 --> 00:12:27,340
We said, how many possible
choices are there?

212
00:12:27,340 --> 00:12:30,200
Well, it depended on how
many binary numbers you could

213
00:12:30,200 --> 00:12:32,910
get in that number of digits.

214
00:12:32,910 --> 00:12:36,410
Well, here we're doing the same
thing, but instead of base 2,

215
00:12:36,410 --> 00:12:37,710
it's base 6.

216
00:12:40,590 --> 00:12:45,150
And so the number of possible
outcomes of five rolls

217
00:12:45,150 --> 00:12:45,990
is quite high.

218
00:12:48,760 --> 00:12:50,860
How many of those are five 1's?

219
00:12:50,860 --> 00:12:54,180
Only one of them, right?

220
00:12:54,180 --> 00:12:58,300
So in order to get the
probability of a five 1's, I

221
00:12:58,300 --> 00:13:00,370
divide 1 by 6 to the fifth.

222
00:13:03,232 --> 00:13:06,720
Does that make
sense to everybody?

223
00:13:06,720 --> 00:13:10,565
So in fact, we see
it's highly unlikely.

224
00:13:10,565 --> 00:13:15,460
The probability of a
five 1's is quite small.

225
00:13:15,460 --> 00:13:17,770
Now, suppose we were to
ask about the probability

226
00:13:17,770 --> 00:13:19,870
of something else--

227
00:13:19,870 --> 00:13:27,120
instead of five 1's, say 53421.

228
00:13:27,120 --> 00:13:31,230
It kind of looks more likely
than five 1's

229
00:13:31,230 --> 00:13:33,630
in a row, but of
course, it isn't, right?

230
00:13:33,630 --> 00:13:37,620
Any specific combination
is equally probable.

231
00:13:37,620 --> 00:13:40,420
And there are a lot of them.

232
00:13:40,420 --> 00:13:44,920
So all the probability
we're going to think about, we

233
00:13:44,920 --> 00:13:48,550
could think about this way, as
simply a matter of counting--

234
00:13:48,550 --> 00:13:51,640
the number of possible events,
the number of events that have

235
00:13:51,640 --> 00:13:54,970
the property of interest--
in this case being all 1's--

236
00:13:54,970 --> 00:13:56,680
and then simple division.
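
The counting argument can be checked by brute force. This is an illustrative sketch, not code from the lecture; `probability_by_counting` is a hypothetical helper. It lists every possible sequence of rolls, counts the ones that match, and divides.

```python
from fractions import Fraction
from itertools import product

def probability_by_counting(goal, faces='123456'):
    """Counts matching outcomes and divides by all possible outcomes."""
    # Every sequence of len(goal) rolls, written as a string like '11111'.
    outcomes = [''.join(rolls) for rolls in product(faces, repeat=len(goal))]
    favorable = sum(1 for outcome in outcomes if outcome == goal)
    return Fraction(favorable, len(outcomes))

p_ones = probability_by_counting('11111')
p_mixed = probability_by_counting('53421')
print(p_ones, p_mixed)  # 1/7776 1/7776
```

Both sequences come out as 1 in 6 to the fifth, which is 7,776: any specific combination is equally probable.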

237
00:13:59,530 --> 00:14:03,010
Given that framework, there
are three basic facts

238
00:14:03,010 --> 00:14:07,870
about probability we're
going to be using a lot of.

239
00:14:07,870 --> 00:14:15,980
So one, probabilities
always range from 0 to 1.

240
00:14:15,980 --> 00:14:17,460
How do we know that?

241
00:14:17,460 --> 00:14:19,930
Well, we've got a
fraction, right?

242
00:14:19,930 --> 00:14:25,190
And the denominator is
all possible events.

243
00:14:25,190 --> 00:14:29,840
The numerator is the subset
of that that's of interest.

244
00:14:29,840 --> 00:14:35,680
So it has to range from
0 to the denominator.

245
00:14:35,680 --> 00:14:37,330
And that tells us
that the fraction

246
00:14:37,330 --> 00:14:40,250
has to range from 0 to 1.

247
00:14:40,250 --> 00:14:43,430
So 1 says it's always
going to happen, 0 never.

248
00:14:46,870 --> 00:14:50,290
So if the probability of
an event occurring is p,

249
00:14:50,290 --> 00:14:54,250
what's the probability
of it not occurring?

250
00:14:54,250 --> 00:14:57,060
This follows from
the first bullet.

251
00:14:57,060 --> 00:15:04,050
It's simply going
to be 1 minus p.

252
00:15:04,050 --> 00:15:07,650
This is a trick that we'll
find we'll use a lot.

253
00:15:07,650 --> 00:15:09,660
Because it's often
the case when you

254
00:15:09,660 --> 00:15:13,080
want to compute the probability
of something happening,

255
00:15:13,080 --> 00:15:16,680
it's easier to compute the
probability of it not happening

256
00:15:16,680 --> 00:15:18,980
and subtract it from 1.

257
00:15:18,980 --> 00:15:21,560
And we'll see an example
of that later today.

258
00:15:24,550 --> 00:15:27,940
Now, here's the biggie.

259
00:15:27,940 --> 00:15:31,650
When events are
independent of each other,

260
00:15:31,650 --> 00:15:35,000
the probability of all
of the events occurring

261
00:15:35,000 --> 00:15:39,380
is equal to the product of
the probabilities of each

262
00:15:39,380 --> 00:15:40,885
of the events occurring.

263
00:15:44,280 --> 00:15:53,890
So if the probability of A is
0.5 and the probability of B

264
00:15:53,890 --> 00:16:01,150
is 0.4, the probability
of A and B is what?

265
00:16:06,110 --> 00:16:07,670
0.5 times 0.4.

266
00:16:07,670 --> 00:16:10,680
You guys can figure that out.

267
00:16:10,680 --> 00:16:14,330
I think that's 0.2.

268
00:16:14,330 --> 00:16:16,100
So you'd expect
that, that it should

269
00:16:16,100 --> 00:16:20,390
be much smaller than either of
the first two probabilities.

270
00:16:20,390 --> 00:16:22,060
This is the most
common rule, it's

271
00:16:22,060 --> 00:16:24,460
something we use all the
time in probabilities,

272
00:16:24,460 --> 00:16:28,360
the so-called
multiplicative law.

273
00:16:28,360 --> 00:16:33,120
We have to be careful
about it, however,

274
00:16:33,120 --> 00:16:37,470
in that it only holds if
the events are actually

275
00:16:37,470 --> 00:16:40,570
independent.

276
00:16:40,570 --> 00:16:44,920
Two events are independent
if the outcome of one

277
00:16:44,920 --> 00:16:47,110
has no influence on the
outcome of the other.
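
A small check of the 1 minus p rule and the multiplicative law by counting. The setup is an assumption made up for illustration: two independent spins of a fair 10-slot spinner, with events chosen so that P(A) is 0.5 and P(B) is 0.4 as in the example above.

```python
from fractions import Fraction
from itertools import product

# Hypothetical setup: two independent spins of a fair 10-slot spinner (0..9).
slots = range(10)
outcomes = list(product(slots, slots))  # 100 equally likely pairs

def prob(event):
    """Fraction of outcomes for which event(outcome) is true."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def a(o):
    return o[0] < 5   # event A: first spin lands in 0..4, so P(A) = 0.5

def b(o):
    return o[1] < 4   # event B: second spin lands in 0..3, so P(B) = 0.4

p_a, p_b = prob(a), prob(b)
p_not_a = prob(lambda o: not a(o))
p_a_and_b = prob(lambda o: a(o) and b(o))

print(p_not_a == 1 - p_a)    # True: the 1 minus p rule
print(p_a, p_b, p_a_and_b)   # 1/2 2/5 1/5
```

The product 1/2 times 2/5 equals the counted 1/5 only because the two spins do not influence each other.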

278
00:16:50,010 --> 00:16:52,370
So when we roll
the die, we assume

279
00:16:52,370 --> 00:16:54,350
that the first
roll, the outcome,

280
00:16:54,350 --> 00:16:55,870
was independent of the--

281
00:16:55,870 --> 00:16:58,370
or the second roll was
independent of the first roll,

282
00:16:58,370 --> 00:17:00,910
independent of the fourth roll.

283
00:17:00,910 --> 00:17:02,560
When we looked at
the two coins, we

284
00:17:02,560 --> 00:17:05,410
assume that heads and
tails of each coin

285
00:17:05,410 --> 00:17:08,460
was independent
of the other coin.

286
00:17:08,460 --> 00:17:10,200
I didn't, for example,
look at one coin

287
00:17:10,200 --> 00:17:12,304
and make sure that the
other one was different.

288
00:17:15,700 --> 00:17:19,079
The danger here is
that people often

289
00:17:19,079 --> 00:17:22,950
compute probabilities assuming
independence when you don't

290
00:17:22,950 --> 00:17:26,099
actually have independence.

291
00:17:26,099 --> 00:17:29,470
So let's look at an example.

292
00:17:29,470 --> 00:17:32,980
For those of you familiar
with American football,

293
00:17:32,980 --> 00:17:35,800
the New England Patriots
and the Denver Broncos

294
00:17:35,800 --> 00:17:38,380
are two prominent teams.

295
00:17:38,380 --> 00:17:40,660
And let's look at
computing the probability

296
00:17:40,660 --> 00:17:45,690
of whether one of them will
lose on a given Sunday.

297
00:17:45,690 --> 00:17:48,840
So the Patriots have a
winning percentage of 7 of 8--

298
00:17:48,840 --> 00:17:51,420
they've won 7 of
their 8 games so far--

299
00:17:51,420 --> 00:17:54,590
and the Broncos 6 of 8.

300
00:17:54,590 --> 00:17:57,560
The probability of both
winning next Sunday,

301
00:17:57,560 --> 00:18:00,860
assuming that this is
indicative of how good they are,

302
00:18:00,860 --> 00:18:03,470
we can get with the
multiplicative rule.

303
00:18:03,470 --> 00:18:08,750
So it's 7/8 times 6/8, or 42/64.

304
00:18:08,750 --> 00:18:12,060
We could simplify that
fraction, I suppose.

305
00:18:12,060 --> 00:18:14,370
Does that make sense?

306
00:18:14,370 --> 00:18:17,840
So this is probably a pretty
good estimate of both of them

307
00:18:17,840 --> 00:18:20,600
winning next Sunday.

308
00:18:20,600 --> 00:18:24,380
So the probability of at
least one of them losing

309
00:18:24,380 --> 00:18:27,740
is 1 minus that.

310
00:18:27,740 --> 00:18:30,430
So here's an example
of why we often use

311
00:18:30,430 --> 00:18:34,120
the 1 minus rule,
because we could

312
00:18:34,120 --> 00:18:38,020
compute the probability
of both of them

313
00:18:38,020 --> 00:18:41,440
winning by simply multiplying.

314
00:18:41,440 --> 00:18:44,130
And we subtract that from 1.

315
00:18:44,130 --> 00:18:47,220
However, what about
Sunday, December 18?

316
00:18:47,220 --> 00:18:50,440
What's the probability?

317
00:18:50,440 --> 00:18:53,920
Well, as it happens,
that day the Patriots

318
00:18:53,920 --> 00:18:55,025
are playing the Broncos.

319
00:18:58,380 --> 00:19:02,230
So now suddenly, the
outcomes are not independent.

320
00:19:02,230 --> 00:19:05,550
The probability of
one of them losing

321
00:19:05,550 --> 00:19:10,470
is influenced by the probability
of the other winning.

322
00:19:10,470 --> 00:19:13,540
So you would expect
the probability of one

323
00:19:13,540 --> 00:19:17,989
of them losing is much
closer to 1 than 22/64,

324
00:19:17,989 --> 00:19:18,780
which is about 1/3.

325
00:19:21,780 --> 00:19:25,490
So in this case, it's easy.

326
00:19:25,490 --> 00:19:28,430
But as we'll see, as we
get through the term,

327
00:19:28,430 --> 00:19:30,560
there are lots of
cases where you

328
00:19:30,560 --> 00:19:33,950
have to work pretty hard to
understand whether or not two

329
00:19:33,950 --> 00:19:36,350
events really are independent.

330
00:19:36,350 --> 00:19:40,410
And if you get it wrong, you
get a totally bogus answer.

331
00:19:40,410 --> 00:19:45,530
1/3 versus 1 is a
pretty big difference.
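
The two football calculations, worked as a sketch. The variable names are hypothetical, and the head-to-head case assumes the game cannot end in a tie.

```python
from fractions import Fraction

p_patriots_win = Fraction(7, 8)
p_broncos_win = Fraction(6, 8)

# Different opponents: treat the two games as independent and multiply.
p_both_win = p_patriots_win * p_broncos_win   # 42/64, reduced to 21/32
p_some_loss_independent = 1 - p_both_win      # 22/64, reduced to 11/32

# Same game (assumption: no ties): list the possible results as
# (Patriots win?, Broncos win?) and check whether someone always loses.
head_to_head_results = [(True, False), (False, True)]
always_a_loss = all(not (pats and broncos)
                    for pats, broncos in head_to_head_results)

print(p_some_loss_independent, always_a_loss)  # 11/32 True
```

Multiplying is only valid in the first case. In the second, every possible result contains a loser, so the probability is 1 no matter what the winning percentages are.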

332
00:19:45,530 --> 00:19:49,010
By the way, as it happens,
the probability of the Broncos

333
00:19:49,010 --> 00:19:50,070
losing is about 1.

334
00:19:56,190 --> 00:19:58,400
Let's go look at some code.

335
00:20:01,040 --> 00:20:03,260
And we'll go back to
our dice, because it's

336
00:20:03,260 --> 00:20:05,420
much easier to
simulate dice games

337
00:20:05,420 --> 00:20:08,300
than it is to simulate
football games.

338
00:20:11,510 --> 00:20:13,340
So here it is.

339
00:20:13,340 --> 00:20:17,030
And we're going to talk
a lot about simulations.

340
00:20:17,030 --> 00:20:18,980
So here, rather than
rolling the die,

341
00:20:18,980 --> 00:20:20,480
I've written a program to do it.

342
00:20:23,980 --> 00:20:27,660
We've already seen the
code for rolling a die.

343
00:20:27,660 --> 00:20:32,990
And so to run this simulation,
typically what we're doing here

344
00:20:32,990 --> 00:20:35,480
is I'm giving you the goal--

345
00:20:35,480 --> 00:20:38,510
for example, are we
going to get five 1's--

346
00:20:38,510 --> 00:20:41,800
the number of trials--

347
00:20:41,800 --> 00:20:47,060
each trial, in this case,
will be say of length 5--

348
00:20:47,060 --> 00:20:48,770
so I'm going to
roll the same die

349
00:20:48,770 --> 00:20:55,130
five times say 1,000 different
times, and then just some text

350
00:20:55,130 --> 00:20:57,910
as to what I'm going to print.

351
00:20:57,910 --> 00:21:01,090
Almost all the
simulations we look at

352
00:21:01,090 --> 00:21:05,630
are going to start with lines
that look a lot like that.

353
00:21:05,630 --> 00:21:08,650
We're going to
initialize some variable.

354
00:21:08,650 --> 00:21:11,755
And then we're going to
run some number of trials.

355
00:21:16,160 --> 00:21:19,860
So in this case,
we're going to get

356
00:21:19,860 --> 00:21:21,340
from the length of the goal--

357
00:21:21,340 --> 00:21:23,790
so if the goal is
five 1's, then we're

358
00:21:23,790 --> 00:21:26,490
going to roll the die five
times; if it's ten 1's,

359
00:21:26,490 --> 00:21:29,830
we'll roll it 10 times.

360
00:21:29,830 --> 00:21:35,310
So this is essentially
one trial, one attempt.

361
00:21:38,850 --> 00:21:41,850
And then we'll check
the result. And if it

362
00:21:41,850 --> 00:21:43,720
has the property we want--

363
00:21:43,720 --> 00:21:47,460
in this case, it's
equal to the goal--

364
00:21:47,460 --> 00:21:50,040
then we're going to
increment the total, which

365
00:21:50,040 --> 00:21:54,380
we initialized up here, by 1.

366
00:21:54,380 --> 00:21:57,170
So we'll keep track
with just the counting--

367
00:21:57,170 --> 00:22:01,610
the number of trials that
actually meet the goal.

368
00:22:01,610 --> 00:22:04,990
And then when we're done,
what we're going to do

369
00:22:04,990 --> 00:22:08,560
is divide the number
that met the goal

370
00:22:08,560 --> 00:22:10,870
by the number of trials--

371
00:22:10,870 --> 00:22:14,170
exactly the counting
argument we just looked at.

372
00:22:14,170 --> 00:22:19,700
And then we'll print the result.

373
00:22:19,700 --> 00:22:22,220
Almost every
simulation we look at

374
00:22:22,220 --> 00:22:24,360
is going to have this structure.

375
00:22:24,360 --> 00:22:27,680
There'll be an outer loop,
which is the number of trials.

376
00:22:27,680 --> 00:22:29,870
And then inside-- maybe
it'll have a loop,

377
00:22:29,870 --> 00:22:32,600
or maybe it won't--
will be a single trial.

378
00:22:32,600 --> 00:22:33,770
We'll sum up the results.

379
00:22:33,770 --> 00:22:36,920
And then we'll divide
by the number of trials.
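
A sketch of the simulation structure just described, under stated assumptions: the names `roll_die` and `run_sim` are hypothetical, and the descriptive-text argument mentioned in the lecture is left out so the function can simply return its estimate.

```python
import random

def roll_die():
    """Returns a randomly chosen int between 1 and 6."""
    return random.choice([1, 2, 3, 4, 5, 6])

def run_sim(goal, num_trials):
    """Estimates the probability that len(goal) rolls spell out goal."""
    total = 0                              # trials that met the goal
    for _trial in range(num_trials):       # outer loop: one pass per trial
        result = ''
        for _roll in range(len(goal)):     # inner loop: a single trial
            result += str(roll_die())
        if result == goal:                 # does it have the property?
            total += 1
    return total / num_trials              # the counting argument again

goal = '11111'
print('Estimated probability:', run_sim(goal, 1000))
print('Actual probability:', 1 / 6 ** len(goal))
```

With only 1,000 trials and an actual probability of 1 in 7,776, the estimate will often be exactly 0.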

380
00:22:36,920 --> 00:22:37,490
Let's run it.

381
00:22:45,300 --> 00:22:49,650
So a couple of things
are going to go on here.

382
00:22:49,650 --> 00:22:59,570
If you look at the code as
we've looked at it before,

383
00:22:59,570 --> 00:23:02,780
what you're seeing is I'm
computing the estimated

384
00:23:02,780 --> 00:23:05,180
probability by the simulation.

385
00:23:05,180 --> 00:23:08,270
And I'm comparing it to the
actual probability, which we've

386
00:23:08,270 --> 00:23:09,590
already seen how to compute.

387
00:23:12,117 --> 00:23:14,700
So if you look at it, there are
a couple of things to look at.

388
00:23:17,370 --> 00:23:19,260
The estimated
probability is pretty

389
00:23:19,260 --> 00:23:24,704
close to the actual
probability but not the same.

390
00:23:24,704 --> 00:23:26,590
So let's go back
to the PowerPoint.

391
00:23:31,860 --> 00:23:34,240
Here are the results.

392
00:23:34,240 --> 00:23:37,680
And there are at least
two questions raised

393
00:23:37,680 --> 00:23:40,050
by this result.
First of all, how

394
00:23:40,050 --> 00:23:43,290
did I know that this is
what would get printed?

395
00:23:43,290 --> 00:23:45,610
Remember, this is random.

396
00:23:45,610 --> 00:23:48,520
How did I know that the
estimate-- well, there's

397
00:23:48,520 --> 00:23:51,790
nothing random about
the actual probability.

398
00:23:51,790 --> 00:23:55,390
But how did I know that
the estimated probability

399
00:23:55,390 --> 00:23:57,180
would be 0?

400
00:23:57,180 --> 00:23:58,470
And why did it print it twice?

401
00:23:58,470 --> 00:24:00,330
Because I messed
up the PowerPoint.

402
00:24:00,330 --> 00:24:04,140
At any rate, so how do I know
what would get printed?

403
00:24:04,140 --> 00:24:12,610
Well, a confession--
random.choice

404
00:24:12,610 --> 00:24:14,920
is not actually random.

405
00:24:14,920 --> 00:24:20,140
In fact, nothing we can do in
a computer is actually random.

406
00:24:20,140 --> 00:24:23,650
You can prove that it's
impossible to build

407
00:24:23,650 --> 00:24:28,950
a computer that actually
generates truly random numbers.

408
00:24:28,950 --> 00:24:32,520
What they do instead
is generate numbers

409
00:24:32,520 --> 00:24:34,050
that called pseudorandom.

410
00:24:42,120 --> 00:24:44,740
How do they do that?

411
00:24:44,740 --> 00:24:48,930
They have an algorithm that
given one number generates

412
00:24:48,930 --> 00:24:52,700
the next number in a sequence.

413
00:24:52,700 --> 00:24:56,375
And they start that
algorithm with a seed.
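
To make the idea concrete, here is a toy generator in which each number is computed from the previous one, starting from a seed. It is an illustration only: the class name is hypothetical, the multiplier and increment are commonly quoted textbook constants, and Python's own `random` module uses a different algorithm (the Mersenne Twister).

```python
class ToyGenerator:
    """Toy linear congruential generator, for illustration only."""

    def __init__(self, seed):
        self.state = seed

    def next_number(self):
        # The next number depends only on the previous one.
        self.state = (1664525 * self.state + 1013904223) % 2 ** 32
        return self.state

    def roll_die(self):
        # Use the higher bits; the low bits of this kind of generator are weak.
        return (self.next_number() >> 16) % 6 + 1

gen_a = ToyGenerator(seed=42)
gen_b = ToyGenerator(seed=42)
rolls_a = [gen_a.roll_die() for _ in range(5)]
rolls_b = [gen_b.roll_die() for _ in range(5)]
print(rolls_a == rolls_b)  # True: same seed, same sequence
```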

414
00:25:00,050 --> 00:25:02,630
Now, typically,
they get that seed

415
00:25:02,630 --> 00:25:05,930
by reading the clock
of the computer.

416
00:25:05,930 --> 00:25:08,090
So most computers have
a clock that, say,

417
00:25:08,090 --> 00:25:12,080
keeps track of the number of
microseconds since January 1,

418
00:25:12,080 --> 00:25:14,174
1978.

419
00:25:14,174 --> 00:25:15,590
I don't know if
that's still true.

420
00:25:15,590 --> 00:25:18,590
That's what Unix used to do.

421
00:25:18,590 --> 00:25:22,070
So the notion is, you
start your program,

422
00:25:22,070 --> 00:25:26,420
there's no way of knowing how
many microseconds have elapsed.

423
00:25:26,420 --> 00:25:29,395
And so you're getting a random
number to start the process.

424
00:25:32,040 --> 00:25:33,660
Since you don't know
where it starts,

425
00:25:33,660 --> 00:25:34,800
you don't know what
the second number

426
00:25:34,800 --> 00:25:37,050
is, you don't know what the
third number is, you don't

427
00:25:37,050 --> 00:25:38,580
know what the fourth number is.

428
00:25:38,580 --> 00:25:42,570
And so it's predictably
nondeterministic,

429
00:25:42,570 --> 00:25:46,600
because you don't know what
the seed is going to be.

430
00:25:46,600 --> 00:25:49,180
Now, you can imagine
that this makes

431
00:25:49,180 --> 00:25:52,460
programs really hard to debug.

432
00:25:52,460 --> 00:25:55,850
Every time you run it, something
different could happen.

433
00:25:55,850 --> 00:25:59,220
Now, we'll see often you want
them to be unpredictable.

434
00:25:59,220 --> 00:26:02,300
But for now, we want them to
be predictable, makes it easier

435
00:26:02,300 --> 00:26:04,130
to prepare PowerPoint.

436
00:26:04,130 --> 00:26:08,635
So what you have is a command.

437
00:26:13,040 --> 00:26:19,190
You can call random.seed
and give it a value

438
00:26:19,190 --> 00:26:21,800
and say, I don't want you to
just choose some random seed,

439
00:26:21,800 --> 00:26:24,890
I want you to use 0 as the seed.

440
00:26:24,890 --> 00:26:27,530
For the same seed, you
always get the same sequence

441
00:26:27,530 --> 00:26:30,120
of random values.
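A short sketch of that behavior with Python's random module. The helper name draw is hypothetical; random.seed and random.choice are the standard library calls.
```python
import random
def draw(seed, count=5):
    """Seed the generator, then make `count` random choices from 1 to 6."""
    random.seed(seed)
    return [random.choice([1, 2, 3, 4, 5, 6]) for _ in range(count)]
first = draw(0)
second = draw(0)
print(first == second)  # the same seed reproduces the same sequence
```
Trying a few different seeds while debugging guards against a result that only holds for one lucky sequence.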

442
00:26:30,120 --> 00:26:33,410
And so what I've done is I
set the seed to be, I think, 0

443
00:26:33,410 --> 00:26:36,620
in this case, not because
there's anything magic about 0,

444
00:26:36,620 --> 00:26:38,780
it's just sort of habit.

445
00:26:38,780 --> 00:26:41,540
But it made it predictable.

446
00:26:41,540 --> 00:26:43,640
As you write programs
with randomness

447
00:26:43,640 --> 00:26:45,980
in them, and when you're debugging
them, you will almost surely

448
00:26:45,980 --> 00:26:49,550
want to start by setting
random.seed to a value

449
00:26:49,550 --> 00:26:51,590
so you get the same answer.

450
00:26:51,590 --> 00:26:54,950
But make sure you debug it with
more than one value of this,

451
00:26:54,950 --> 00:26:58,320
so you didn't just get
lucky with your seed.

452
00:26:58,320 --> 00:27:01,460
So that's how I knew
what would get printed.

453
00:27:01,460 --> 00:27:06,480
The next question is,
why did the simulation

454
00:27:06,480 --> 00:27:09,670
give me the wrong answer?

455
00:27:09,670 --> 00:27:14,530
The actual probability
is 0.0001286.

456
00:27:14,530 --> 00:27:16,630
But it's estimated
a probability of 0.

457
00:27:19,150 --> 00:27:20,140
Why is it wrong?

458
00:27:24,200 --> 00:27:27,100
Well, let's think about this.

459
00:27:27,100 --> 00:27:30,020
I ran 1,000 trials.

460
00:27:30,020 --> 00:27:32,430
What does it mean to say
the probability is zero?

461
00:27:32,430 --> 00:27:36,670
It means that I tried it 1,000
times and didn't ever get

462
00:27:36,670 --> 00:27:39,380
a sequence of five 1's.

463
00:27:39,380 --> 00:27:44,500
So the numerator of the
division at the bottom was 0.

464
00:27:44,500 --> 00:27:46,150
Hence, the answer is 0.

465
00:27:46,150 --> 00:27:47,890
Is this surprising?

466
00:27:47,890 --> 00:27:49,440
Well, no.

467
00:27:49,440 --> 00:27:54,200
Because if that's the actual
probability of getting five

468
00:27:54,200 --> 00:27:58,075
1's, it's not very shocking
that in 1,000 trials

469
00:27:58,075 --> 00:27:58,825
it never happened.

470
00:28:02,260 --> 00:28:06,140
It's not a surprising
result. And so we

471
00:28:06,140 --> 00:28:09,230
have to be careful when we
run these things to understand

472
00:28:09,230 --> 00:28:14,250
the difference between what's in
this case an actual probability

473
00:28:14,250 --> 00:28:17,510
and what statisticians
call a sample probability.

474
00:28:25,530 --> 00:28:28,970
So what we got with
the sample was 0.

475
00:28:28,970 --> 00:28:32,740
So what's the
obvious thing to do?

476
00:28:32,740 --> 00:28:35,590
If you're doing a
simulation of an event

477
00:28:35,590 --> 00:28:39,020
and the event is
pretty rare, you

478
00:28:39,020 --> 00:28:43,520
want to try it on a very
large number of trials.

479
00:28:43,520 --> 00:28:45,050
So let's go back to our code.

480
00:28:51,350 --> 00:28:58,720
And we'll change it to
instead of 1,000, 1,000,000.

481
00:28:58,720 --> 00:29:01,572
You can see up here, by the
way, where I set the seed.

482
00:29:01,572 --> 00:29:02,840
And now, let's run it.

483
00:29:17,760 --> 00:29:19,650
We did a lot better.

484
00:29:19,650 --> 00:29:22,470
If we look here at our
estimated probability,

485
00:29:22,470 --> 00:29:25,980
it's 0.000128,
still not quite

486
00:29:25,980 --> 00:29:30,142
the actual probability
but darn close.

487
00:29:30,142 --> 00:29:31,600
And maybe if I had
done 10 million,

488
00:29:31,600 --> 00:29:32,891
it would have been even closer.

489
00:29:35,610 --> 00:29:38,040
So if you're
writing a simulation

490
00:29:38,040 --> 00:29:41,130
to compute the
probability of an event

491
00:29:41,130 --> 00:29:44,040
and the event is
moderately rare,

492
00:29:44,040 --> 00:29:47,310
then you better
run a lot of trials

493
00:29:47,310 --> 00:29:51,750
before you believe your
estimated probability.

494
00:29:51,750 --> 00:29:55,440
In a week or so, we'll
actually look at that more

495
00:29:55,440 --> 00:29:57,810
mathematically and
say, what is a lot,

496
00:29:57,810 --> 00:29:59,130
how do we know what is enough.

497
00:30:12,110 --> 00:30:13,550
What are the morals here?

498
00:30:13,550 --> 00:30:15,430
Moral one, I've just told you--

499
00:30:15,430 --> 00:30:18,950
it takes a lot of trials to get a
good estimate of the frequency

500
00:30:18,950 --> 00:30:21,510
of a rare event.

501
00:30:21,510 --> 00:30:26,470
Moral two, we should always,
if we're getting an estimated

502
00:30:26,470 --> 00:30:29,290
probability, know
that, and probably

503
00:30:29,290 --> 00:30:33,570
say that, and not confuse it
with the actual probability.

504
00:30:33,570 --> 00:30:36,400
The third moral here
is, it was kind of

505
00:30:36,400 --> 00:30:38,830
stupid to do a simulation.

506
00:30:38,830 --> 00:30:42,430
Since it was a very
simple closed-form answer

507
00:30:42,430 --> 00:30:45,550
that we could compute
that would really tell us

508
00:30:45,550 --> 00:30:48,220
what the actual
probability is, why even

509
00:30:48,220 --> 00:30:51,550
bother with the simulation?

510
00:30:51,550 --> 00:30:53,880
Well, we're going
to see why now,

511
00:30:53,880 --> 00:30:57,340
because simulations
can be very useful.

512
00:30:57,340 --> 00:31:00,390
Let's look at another problem.

513
00:31:00,390 --> 00:31:02,070
This is the famous
birthday problem.

514
00:31:02,070 --> 00:31:03,660
Some of you have seen it.

515
00:31:03,660 --> 00:31:06,240
What's the probability of at
least two people in a group

516
00:31:06,240 --> 00:31:08,770
having the same birthday?

517
00:31:08,770 --> 00:31:10,600
There's a URL at the bottom.

518
00:31:10,600 --> 00:31:12,760
That's pointing
to a Google form.

519
00:31:12,760 --> 00:31:15,940
I'd like please all of you
who have a computing device

520
00:31:15,940 --> 00:31:20,100
to go to it and fill
out your birthday.

521
00:31:20,100 --> 00:31:22,942
It's anonymous, so we won't know
how old you are, don't worry.

522
00:31:22,942 --> 00:31:24,150
Actually, it's only the date.

523
00:31:24,150 --> 00:31:25,290
It's not the year.

524
00:31:27,880 --> 00:31:33,870
So suppose there were 367
people in the group, roughly

525
00:31:33,870 --> 00:31:40,680
the number of people who
took the 6.0001/6.00 midterm.

526
00:31:40,680 --> 00:31:44,070
If there are 367 people, what's
the probability of at least two

527
00:31:44,070 --> 00:31:45,230
of them sharing a birthday?

528
00:31:49,790 --> 00:31:54,110
One, by something called
the pigeonhole principle.

529
00:31:54,110 --> 00:31:56,000
You got some number of holes.

530
00:31:56,000 --> 00:31:57,800
And if you have more
pigeons than holes,

531
00:31:57,800 --> 00:32:01,430
two pigeons have
to share a hole.

532
00:32:01,430 --> 00:32:04,040
What about smaller numbers?

533
00:32:04,040 --> 00:32:07,430
Well, if we make a
simplifying assumption

534
00:32:07,430 --> 00:32:10,650
that each birthdate
is equally likely,

535
00:32:10,650 --> 00:32:13,970
then there's actually a nice
closed-form solution for it.

536
00:32:17,760 --> 00:32:20,730
Again, this is a question
where it's easier

537
00:32:20,730 --> 00:32:24,210
to compute the opposite
of what you're trying

538
00:32:24,210 --> 00:32:26,670
to do and subtract it from 1.

539
00:32:26,670 --> 00:32:32,160
And so this fraction is giving
the probability of two people

540
00:32:32,160 --> 00:32:35,190
not sharing a birthday.

541
00:32:35,190 --> 00:32:38,560
The proof that this is right,
it's a little bit elaborate.

542
00:32:38,560 --> 00:32:42,450
But you can trust
me, it's accurate.

543
00:32:42,450 --> 00:32:46,150
But it's a formula, and it's
not that complicated a formula.

544
00:32:46,150 --> 00:32:49,800
So numbers like 366
factorial are big.
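For reference, under the equally-likely assumption with 366 dates, the standard closed form for N people is 1 - 366!/(366^N * (366-N)!). Here is a sketch of it in Python; the function name actual_shared_birthday_prob is hypothetical.
```python
import math
def actual_shared_birthday_prob(num_people, num_days=366):
    """Probability that at least two of num_people share a birthday,
    assuming every one of num_days dates is equally likely."""
    if num_people > num_days:
        return 1.0  # pigeonhole principle
    # Ways to give everyone a different date, out of all assignments.
    all_different = math.factorial(num_days) // math.factorial(num_days - num_people)
    return 1 - all_different / num_days ** num_people
for n in (10, 20, 40, 100):
    print(n, actual_shared_birthday_prob(n))
```
Python integers have arbitrary precision, so the large factorials are computed exactly before the final division.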

545
00:32:55,240 --> 00:32:57,460
So let's approximate a solution.

546
00:32:57,460 --> 00:33:00,940
We'll write a simulation and
see if we get the same answer

547
00:33:00,940 --> 00:33:03,920
that that formula gave us.

548
00:33:03,920 --> 00:33:05,200
So here's the code for that--

549
00:33:07,810 --> 00:33:09,550
two arguments-- the
number of people

550
00:33:09,550 --> 00:33:14,780
in the group and the
number that we're asking, do

551
00:33:14,780 --> 00:33:17,520
they have the same birthday.

552
00:33:17,520 --> 00:33:21,120
So since I'm assuming for now
that every birthday is equally

553
00:33:21,120 --> 00:33:26,100
likely, the possible
dates range from 1 to 366,

554
00:33:26,100 --> 00:33:28,005
because some years
have a February 29.

555
00:33:31,200 --> 00:33:35,490
I'll keep track of the number
of people born on each date

556
00:33:35,490 --> 00:33:38,640
by starting with none.

557
00:33:38,640 --> 00:33:41,470
And then for p in the
range of number of people,

558
00:33:41,470 --> 00:33:45,240
I'll make a random choice
of the possible dates

559
00:33:45,240 --> 00:33:49,999
and increment that
element of the list by 1.

560
00:33:49,999 --> 00:33:51,540
And then at the end,
we can say, look

561
00:33:51,540 --> 00:33:54,330
at the maximum
number of birthdays

562
00:33:54,330 --> 00:33:59,560
and see if it's greater than
or equal to the number of same.

563
00:33:59,560 --> 00:34:01,240
So that tells us that.

564
00:34:04,490 --> 00:34:07,220
And then we can actually look
at the birthday problem--

565
00:34:07,220 --> 00:34:09,640
number of people, the number
of same, and, as usual,

566
00:34:09,640 --> 00:34:10,514
the number of trials.

567
00:34:13,750 --> 00:34:17,840
So the number of hits is 0. For
t in range number of trials,

568
00:34:17,840 --> 00:34:21,940
if sameDate is true, then
we'll increment the number

569
00:34:21,940 --> 00:34:28,590
of hits by 1 and then as usual
divide by the number of trials.
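The two functions just described might look roughly like this. It is a sketch rather than the code on the slide: sameDate is the name used in the lecture, while birthdayProb and the parameter names are assumptions, and dates are numbered 0 to 365 so they can index the list directly.
```python
import random
def sameDate(numPeople, numSame):
    """Return True if, in one simulated group, at least numSame of
    numPeople randomly assigned birthdays fall on the same date."""
    possibleDates = range(366)
    birthdays = [0] * 366  # people born on each date, starting with none
    for _ in range(numPeople):
        birthdays[random.choice(possibleDates)] += 1
    return max(birthdays) >= numSame
def birthdayProb(numPeople, numSame, numTrials):
    """Estimate the probability as the fraction of trials with a hit."""
    numHits = 0
    for _ in range(numTrials):
        if sameDate(numPeople, numSame):
            numHits += 1
    return numHits / numTrials
random.seed(0)
for numPeople in (10, 20, 40, 100):
    print(numPeople, birthdayProb(numPeople, 2, 1000))
```
Because numSame is a parameter, asking about three people sharing a birthday only requires passing 3 instead of 2.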

570
00:34:28,590 --> 00:34:34,739
And we'll try it for 10,
20, 40, and 100 people.

571
00:34:37,310 --> 00:34:41,480
And then just, we'll print
the estimated probability

572
00:34:41,480 --> 00:34:46,429
and the actual
probability computed using

573
00:34:46,429 --> 00:34:48,320
that formula I showed you.

574
00:34:48,320 --> 00:34:50,600
I have not shown you,
but I've imported

575
00:34:50,600 --> 00:34:53,480
a library called
math, because it

576
00:34:53,480 --> 00:34:55,040
has a factorial implementation.

577
00:34:55,040 --> 00:34:56,900
It's way faster than
the recursive one

578
00:34:56,900 --> 00:35:00,270
that we've seen before.

579
00:35:00,270 --> 00:35:00,880
Let's run it.

580
00:35:23,920 --> 00:35:25,040
And we'll see what we get.

581
00:35:25,040 --> 00:35:30,580
So for 10, the estimated
probability is 0.11 now.

582
00:35:30,580 --> 00:35:36,720
So you can see, the estimates
are really pretty good.

583
00:35:36,720 --> 00:35:39,450
Once again, we have this
business that for 100,

584
00:35:39,450 --> 00:35:43,450
we're estimating 1, when the
real answer is point many,

585
00:35:43,450 --> 00:35:45,150
many 9's.

586
00:35:45,150 --> 00:35:47,400
But again, this is
sample probability.

587
00:35:47,400 --> 00:35:53,250
It just means in the number
of trials we did, every one

588
00:35:53,250 --> 00:35:56,190
for 100 people, there
was a shared birthday.

589
00:35:56,190 --> 00:35:59,010
This is a number that
usually surprises people,

590
00:35:59,010 --> 00:36:03,690
as to why with 100 people
the probability is so high.

591
00:36:03,690 --> 00:36:06,990
But we could work out
the formula and see it.

592
00:36:06,990 --> 00:36:08,460
And as you can
see, the estimates

593
00:36:08,460 --> 00:36:10,930
are pretty good
from my simulation.

594
00:36:20,252 --> 00:36:22,210
Now, we're going to see
why we did a simulation

595
00:36:22,210 --> 00:36:23,720
in the first place.

596
00:36:23,720 --> 00:36:27,970
Suppose we want the probability
of three people sharing

597
00:36:27,970 --> 00:36:29,260
a birthday instead of two.

598
00:36:34,030 --> 00:36:37,240
It's pretty easy to see how
we changed the simulation.

599
00:36:37,240 --> 00:36:38,980
I even made a parameter.

600
00:36:38,980 --> 00:36:42,190
I just changed the
number 2 to number 3.

601
00:36:42,190 --> 00:36:45,200
The math, on the
other hand, is ugly.

602
00:36:48,030 --> 00:36:52,190
Why is the math so much
uglier for 3 than for 2?

603
00:36:52,190 --> 00:36:55,400
Because for 2, the
complementary problem--

604
00:36:55,400 --> 00:36:58,040
the number we're
subtracting from 1--

605
00:36:58,040 --> 00:37:03,640
is simply the question of,
are all birthdays different?

606
00:37:03,640 --> 00:37:08,170
So did two people share a
birthday is 1 minus

607
00:37:08,170 --> 00:37:11,570
does everybody have
a different birthday.

608
00:37:11,570 --> 00:37:16,250
On the other hand, for 3 people,
the complementary problem is

609
00:37:16,250 --> 00:37:19,490
a complicated disjunct--
a bunch of ors--

610
00:37:19,490 --> 00:37:22,190
either all birthdays
are distinct,

611
00:37:22,190 --> 00:37:26,240
or two people share a birthday
and the rest are distinct,

612
00:37:26,240 --> 00:37:30,140
or there are two groups of
two people sharing a birthday

613
00:37:30,140 --> 00:37:31,970
and everything else is distinct.

614
00:37:31,970 --> 00:37:36,450
So you can see here, there's
a lot of possibilities.

615
00:37:36,450 --> 00:37:40,800
And so it's 1 minus now a
very complicated formula.

616
00:37:40,800 --> 00:37:42,840
And in fact, if you try
and look how to do this,

617
00:37:42,840 --> 00:37:45,450
most people will tell
you don't bother.

618
00:37:45,450 --> 00:37:48,490
Here's kind of a
good approximation.

619
00:37:48,490 --> 00:37:50,320
But the math gets very hairy.

620
00:37:53,040 --> 00:37:57,160
In contrast, changing the
simulation is dead easy.

621
00:37:57,160 --> 00:37:57,880
We can do that.

622
00:38:03,808 --> 00:38:06,280
Whoops.

623
00:38:06,280 --> 00:38:13,650
So if we come over here for
the code, all I have to do

624
00:38:13,650 --> 00:38:15,075
is change this 2 to a 3.

625
00:38:25,090 --> 00:38:27,190
And I'm going to leave
in this code, which

626
00:38:27,190 --> 00:38:31,180
is the wrong code, computing
the actual probability now

627
00:38:31,180 --> 00:38:35,110
for 2 people sharing rather
than 3, because I want

628
00:38:35,110 --> 00:38:37,660
to make it easy for you to see
the difference between what

629
00:38:37,660 --> 00:38:41,260
happens when we look at 3
shared rather than 2 shared.

630
00:38:53,140 --> 00:38:55,980
And I get invalid syntax.

631
00:38:55,980 --> 00:38:58,766
That's not good.

632
00:38:58,766 --> 00:39:00,640
That's what happens when
I type in real time.

633
00:39:07,970 --> 00:39:10,010
Why do I have invalid syntax?

634
00:39:10,010 --> 00:39:11,337
AUDIENCE: Line 56.

635
00:39:11,337 --> 00:39:12,170
JOHN GUTTAG: Pardon.

636
00:39:12,170 --> 00:39:13,631
AUDIENCE: Line 56.

637
00:39:13,631 --> 00:39:15,170
JOHN GUTTAG: One person, Anna.

638
00:39:15,170 --> 00:39:17,660
AUDIENCE: Line 56,
there's a comma.

639
00:39:17,660 --> 00:39:20,532
JOHN GUTTAG: Oh.

640
00:39:20,532 --> 00:39:21,490
That's not a good line.

641
00:39:32,960 --> 00:39:40,410
So now, we see that if we get,
say, to n equals 100, for 2,

642
00:39:40,410 --> 00:39:42,530
you'll remember, it was 0.99.

643
00:39:42,530 --> 00:39:46,000
But for 3, it's only 0.63.

644
00:39:46,000 --> 00:39:49,590
So we see going from two
sharing to three sharing

645
00:39:49,590 --> 00:39:54,930
gets us a radically different
answer, not surprisingly.

646
00:39:54,930 --> 00:39:57,240
But we also-- and the real
thing I wanted you to see--

647
00:39:57,240 --> 00:39:59,310
is how easy it was to
answer this question

648
00:39:59,310 --> 00:40:01,810
with the simulation.

649
00:40:01,810 --> 00:40:05,940
And that's a primary
reason we use simulations

650
00:40:05,940 --> 00:40:09,000
to answer probabilistic
questions rather

651
00:40:09,000 --> 00:40:11,190
than sitting down with
pencil and paper

652
00:40:11,190 --> 00:40:14,460
and doing fancy
probability calculations,

653
00:40:14,460 --> 00:40:19,300
because it's often way
easier to do a simulation.

654
00:40:19,300 --> 00:40:22,220
We can see that in spades if
we look at the next question.

655
00:40:26,680 --> 00:40:28,210
Let's think about
this assumption

656
00:40:28,210 --> 00:40:31,270
that all birthdays
are equally likely.

657
00:40:31,270 --> 00:40:33,370
Well, as you can
see, this is a chart

658
00:40:33,370 --> 00:40:38,440
of how common birthdates
are in the US, a heat map.

659
00:40:38,440 --> 00:40:44,820
And you'll see, for
example, that February 29

660
00:40:44,820 --> 00:40:47,930
is quite an uncommon birthday.

661
00:40:47,930 --> 00:40:52,010
So we should probably
treat that differently.

662
00:40:52,010 --> 00:40:53,480
Somewhat surprisingly,
you'll see

663
00:40:53,480 --> 00:40:57,160
that July 4 is a very
uncommon birthday as well.

664
00:40:57,160 --> 00:41:00,410
It's easy to understand
why February 29.

665
00:41:00,410 --> 00:41:02,570
The only thing I can
figure out for July 4

666
00:41:02,570 --> 00:41:06,230
is obstetricians don't
like working on holidays.

667
00:41:06,230 --> 00:41:08,300
And so they induce
labor sometime

668
00:41:08,300 --> 00:41:10,790
around the 2nd or
the 3rd, so they

669
00:41:10,790 --> 00:41:14,420
don't have to come to work
on the 4th or the 5th.

670
00:41:14,420 --> 00:41:15,680
Sounds like a horrible thought.

671
00:41:15,680 --> 00:41:19,952
But I can't think of any other
explanation for this anomaly.

672
00:41:19,952 --> 00:41:21,410
You'll probably,
if you look at it,

673
00:41:21,410 --> 00:41:25,580
see Christmas day is
not so common either.

674
00:41:25,580 --> 00:41:27,170
So now, the question,
which we can

675
00:41:27,170 --> 00:41:29,120
answer, since you've
all fill out this form,

676
00:41:29,120 --> 00:41:32,810
is how exceptional
are MIT students?

677
00:41:32,810 --> 00:41:35,930
We like to think that you're
different in every respect.

678
00:41:35,930 --> 00:41:38,960
So are your birthdays
distributed differently

679
00:41:38,960 --> 00:41:40,830
than other dates?

680
00:41:40,830 --> 00:41:43,220
Have we got that data?

681
00:41:43,220 --> 00:41:44,890
So now we'll go look at that.

682
00:41:49,180 --> 00:41:50,920
We should have a heat
map for you guys.

683
00:41:53,900 --> 00:41:54,400
This one?

684
00:41:54,400 --> 00:41:56,850
AUDIENCE: Yep.

685
00:41:56,850 --> 00:41:59,300
I removed all the February 31s.

686
00:41:59,300 --> 00:42:02,240
Thank you for those submissions.

687
00:42:02,240 --> 00:42:05,910
[LAUGHTER]

688
00:42:06,525 --> 00:42:08,750
JOHN GUTTAG: So here it is.

689
00:42:08,750 --> 00:42:13,310
And we can see that,
well, they don't

690
00:42:13,310 --> 00:42:17,790
seem to be banded quite as
much in the summer months,

691
00:42:17,790 --> 00:42:20,370
probably says more about your
parents than it does about you.

692
00:42:23,030 --> 00:42:26,090
But you can see that,
indeed, we do have--

693
00:42:26,090 --> 00:42:28,280
wow, we have a day
where there are

694
00:42:28,280 --> 00:42:30,110
five birthdays, is that what it looks like?

695
00:42:30,110 --> 00:42:30,620
Or no?

696
00:42:30,620 --> 00:42:32,556
AUDIENCE: February 12.

697
00:42:32,556 --> 00:42:33,355
JOHN GUTTAG: Wow.

698
00:42:36,654 --> 00:42:39,070
You want to raise your hand
if you're born on February 12?

699
00:42:42,388 --> 00:42:45,800
[LAUGHTER]

700
00:42:46,670 --> 00:42:51,902
So you are exceptional in that
you lie about when you're born.

701
00:42:51,902 --> 00:42:57,470
But if you hadn't lied, I
think we would have still seen

702
00:42:57,470 --> 00:42:59,450
the probabilities would hold.

703
00:42:59,450 --> 00:43:03,155
How many people were
there, do we know?

704
00:43:03,155 --> 00:43:07,865
AUDIENCE: 146 with
112 unique birthdays.

705
00:43:07,865 --> 00:43:12,190
JOHN GUTTAG: 146 people,
112 unique birthdays.

706
00:43:12,190 --> 00:43:16,220
So indeed, the
probability does work.

707
00:43:26,470 --> 00:43:28,990
So we know you're
exceptional in a funny way.

708
00:43:28,990 --> 00:43:32,240
Well, you can
imagine how hard it

709
00:43:32,240 --> 00:43:36,080
would be to adjust the
analytic model to account

710
00:43:36,080 --> 00:43:40,370
for a weird distribution
of birthdates.

711
00:43:40,370 --> 00:43:44,900
But again, adjusting the
simulation model is easy.

712
00:43:44,900 --> 00:43:46,700
I could have gone
back to that heat

713
00:43:46,700 --> 00:43:49,670
map I showed you of
birthdays in the US

714
00:43:49,670 --> 00:43:52,550
and gotten a separate
probability for each day,

715
00:43:52,550 --> 00:43:55,130
but I was too lazy.

716
00:43:55,130 --> 00:44:01,220
And instead, what I observed
was that we had a few days,

717
00:44:01,220 --> 00:44:06,950
like February 29, highly
unlikely, and this band

718
00:44:06,950 --> 00:44:10,040
in the middle of people
who were conceived

719
00:44:10,040 --> 00:44:13,670
in the late fall
and early winter.

720
00:44:13,670 --> 00:44:19,950
So what I did is I
duplicated some dates.

721
00:44:19,950 --> 00:44:25,565
So the 58th day of the year,
February 29, occurs only once.

722
00:44:28,750 --> 00:44:30,590
The dates before
that, I said, let's

723
00:44:30,590 --> 00:44:32,730
pretend they occur four times.

724
00:44:32,730 --> 00:44:34,590
What matters
here is not how often
here is not how often

725
00:44:34,590 --> 00:44:36,450
they occur but the
relative frequency.

726
00:44:40,700 --> 00:44:46,000
And then the dates after
that occur four times

727
00:44:46,000 --> 00:44:49,480
except for the dates in
that band, which I'm going

728
00:44:49,480 --> 00:44:52,180
to have occur yet more often.

729
00:44:52,180 --> 00:44:56,000
So now-- and don't worry
about the exact details here--

730
00:44:56,000 --> 00:44:58,840
but what I'm doing is simply
adjusting the simulation

731
00:44:58,840 --> 00:45:02,140
to change the probability
of each date getting

732
00:45:02,140 --> 00:45:04,378
chosen by sameDate.
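One way to sketch the duplication trick is below. The band endpoints (180 to 270) and the number of extra copies are made up for illustration, and the dates parameter is an assumption added so both distributions can be compared. The lecture calls February 29 the 58th day; this sketch numbers dates from 0, which puts February 29 at index 59.
```python
import random
FEB_29 = 59                   # zero-based index of February 29 (my bookkeeping)
BAND = range(180, 270)        # hypothetical stretch of more common dates
possibleDates = (4 * list(range(0, FEB_29))          # dates before Feb 29
                 + [FEB_29]                           # Feb 29 appears once
                 + 4 * list(range(FEB_29 + 1, 366))   # dates after Feb 29
                 + 4 * list(BAND))                    # extra copies for the band
def sameDate(numPeople, numSame, dates):
    """True if at least numSame of numPeople share a date drawn from dates."""
    birthdays = [0] * 366
    for _ in range(numPeople):
        birthdays[random.choice(dates)] += 1
    return max(birthdays) >= numSame
def birthdayProb(numPeople, numSame, numTrials, dates):
    """Fraction of trials in which at least numSame people share a date."""
    hits = sum(sameDate(numPeople, numSame, dates) for _ in range(numTrials))
    return hits / numTrials
random.seed(0)
print(birthdayProb(100, 3, 1000, list(range(366))))  # uniform dates
print(birthdayProb(100, 3, 1000, possibleDates))     # duplicated dates
```
random.choice picks each list element with equal probability, so a date that appears eight times is eight times as likely to be chosen as one that appears once.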

733
00:45:07,190 --> 00:45:09,170
And then I can run
the simulation model.

734
00:45:09,170 --> 00:45:13,450
And, again, with a very
small change to code,

735
00:45:13,450 --> 00:45:15,900
I've modeled something
that's mathematically

736
00:45:15,900 --> 00:45:18,360
enormously complex.

737
00:45:18,360 --> 00:45:22,050
I have no idea how to
actually do this probability

738
00:45:22,050 --> 00:45:23,670
mathematically.

739
00:45:23,670 --> 00:45:27,046
But the code is, as you can
see, quite straightforward.

740
00:45:33,850 --> 00:45:35,460
So let's go to that here.

741
00:45:39,090 --> 00:45:45,450
So what I'm going to do
is comment this one out

742
00:45:45,450 --> 00:46:02,660
and uncomment this more
complicated set of dates

743
00:46:02,660 --> 00:46:03,500
and see what we get.

744
00:46:14,020 --> 00:46:16,240
And again, it changes
quite dramatically.

745
00:46:16,240 --> 00:46:18,240
You might remember, before
it was around I think

746
00:46:18,240 --> 00:46:23,460
0.6-something for 100,
and now, it's 0.75.

747
00:46:23,460 --> 00:46:26,460
So getting away from the notion
that birthdays are uniformly

748
00:46:26,460 --> 00:46:28,710
distributed to saying
some birthdays are

749
00:46:28,710 --> 00:46:32,010
more common than others,
again, dramatically changes

750
00:46:32,010 --> 00:46:34,570
the answer.

751
00:46:34,570 --> 00:46:36,589
And we can easily look at that.

752
00:46:43,080 --> 00:46:49,730
So that gets us to the big
topic of simulation models.

753
00:46:49,730 --> 00:46:52,820
It's a program that
describes a computation that

754
00:46:52,820 --> 00:46:57,830
provides information about the
possible behaviors of a system.

755
00:46:57,830 --> 00:47:00,050
I say possible
behaviors, because I'm

756
00:47:00,050 --> 00:47:02,835
particularly interested
in stochastic systems.

757
00:47:05,720 --> 00:47:10,350
They're descriptive not
prescriptive in the sense

758
00:47:10,350 --> 00:47:13,740
that they describe
the possible outcomes.

759
00:47:13,740 --> 00:47:18,800
They don't tell you how to
achieve possible outcomes.

760
00:47:18,800 --> 00:47:20,720
This is different
from what we've

761
00:47:20,720 --> 00:47:22,550
looked at earlier in
the course, where we

762
00:47:22,550 --> 00:47:25,700
looked at optimization models.

763
00:47:25,700 --> 00:47:30,440
So an optimization
model is prescriptive.

764
00:47:30,440 --> 00:47:33,800
It tells you how to
achieve an effect,

765
00:47:33,800 --> 00:47:38,000
how to get the most value
out of your knapsack,

766
00:47:38,000 --> 00:47:42,350
how to find the shortest
path from A to B in a graph.

767
00:47:42,350 --> 00:47:44,750
In contrast, a
simulation model says,

768
00:47:44,750 --> 00:47:48,170
if I do this,
here's what happens.

769
00:47:48,170 --> 00:47:52,290
It doesn't tell you how to
make something happen.

770
00:47:52,290 --> 00:47:53,970
So it's very
different, and it's why

771
00:47:53,970 --> 00:47:57,390
we need both, why we
need optimization models

772
00:47:57,390 --> 00:48:00,570
and we need simulation models.

773
00:48:00,570 --> 00:48:03,750
We have to remember that
a simulation model is only

774
00:48:03,750 --> 00:48:06,570
an approximation to reality.

775
00:48:06,570 --> 00:48:10,110
I put in an approximation to
the distribution of birthdates,

776
00:48:10,110 --> 00:48:12,910
but it wasn't quite right.

777
00:48:12,910 --> 00:48:16,770
And as the very famous
statistician George Box said,

778
00:48:16,770 --> 00:48:22,320
"all models are wrong, but
some are actually very useful."

779
00:48:22,320 --> 00:48:27,930
In the next lecture, we'll look
at a useful class of models.

780
00:48:27,930 --> 00:48:30,610
When do we use simulations?

781
00:48:30,610 --> 00:48:33,310
Typically, as we've just
shown, to model systems that

782
00:48:33,310 --> 00:48:37,180
are mathematically intractable,
like the birthday problem

783
00:48:37,180 --> 00:48:39,740
we just looked at.

784
00:48:39,740 --> 00:48:43,130
In other situations, to
extract intermediate results--

785
00:48:43,130 --> 00:48:47,660
something happens along
the way to the answer.

786
00:48:47,660 --> 00:48:50,410
And as I hope you've
seen, simulations

787
00:48:50,410 --> 00:48:55,480
are used because we can play
what-if games by successively

788
00:48:55,480 --> 00:48:57,340
refining them.

789
00:48:57,340 --> 00:48:59,230
We started with a
simple simulation

790
00:48:59,230 --> 00:49:01,960
that assumed that we only
asked the question of, do

791
00:49:01,960 --> 00:49:04,540
two people share a birthday.

792
00:49:04,540 --> 00:49:08,080
We showed how we could change
it to ask do three people share

793
00:49:08,080 --> 00:49:10,020
a birthday.

794
00:49:10,020 --> 00:49:11,910
We then saw that
we could change it

795
00:49:11,910 --> 00:49:16,260
to assume a different
distribution of birthdates

796
00:49:16,260 --> 00:49:18,620
in the group.

797
00:49:18,620 --> 00:49:20,520
And so we can start
with something simple.

798
00:49:20,520 --> 00:49:23,310
And we make it ever
more complex

799
00:49:23,310 --> 00:49:25,415
to answer what-if questions.

800
00:49:29,510 --> 00:49:32,030
We're going to start
in the next lecture

801
00:49:32,030 --> 00:49:36,680
by producing a simulation
of a random walk.

802
00:49:36,680 --> 00:49:38,120
And with that, I'll stop.

803
00:49:38,120 --> 00:49:40,840
And see you guys soon.