The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN GUTTAG: Hello, everybody. Some announcements. The last reading assignment of the semester, at least from us. Course evaluations are still available through this Friday, but only till noon. Again, I urge you all to do it.

And then finally, for the final exam, we're going to be giving you some code to study in advance of the exam. And then we will ask questions about that code on the exam itself. This was described in the announcement for the exam. And we will be making this code available later today. Now, I would suggest that you try and get your heads around it. If you are confused, that's a good thing to talk about in office hours, to get some help with it, as opposed to waiting till 20 minutes before the exam and realizing you're confused.

All right. I want to pick up where we left off on Monday. So you may recall that we were comparing results of KNN and logistic regression on our Titanic data. And we had this up, using 10 80/20 splits for KNN with k equals 3 and logistic regression with p equals 0.5. And what I observed is that logistic regression happened to perform slightly better, but certainly nothing that you would choose to write home about. It's a little bit better. That isn't to say it will always be better. It happens to be here.

But the point I closed with is that one of the things we care about when we use machine learning is not only our ability to make predictions with the model, but what we can learn by studying the model itself. Remember, the idea is that the model is somehow capturing the system or the process that generated the data. And by studying the model we can learn something useful.
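For reference, a minimal sketch of the 10-splits comparison described above, assuming scikit-learn and a feature matrix X with 0/1 labels y already built from the Titanic data; the function and variable names are illustrative, not the course code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def compare_models(X, y, num_splits=10):
    # Average accuracy of KNN (k=3) and logistic regression over
    # num_splits random 80/20 train/test splits.
    knn_acc, lr_acc = [], []
    for trial in range(num_splits):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=trial)        # one 80/20 split
        knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
        knn_acc.append(knn.score(X_test, y_test))
        lr = LogisticRegression().fit(X_train, y_train)      # predict() uses the p = 0.5 cutoff
        lr_acc.append(lr.score(X_test, y_test))
    print('Mean accuracy, KNN (k=3):          ', round(float(np.mean(knn_acc)), 3))
    print('Mean accuracy, logistic regression:', round(float(np.mean(lr_acc)), 3))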
So to do that for logistic regression, we begin by looking at the weights of the different variables. And we had this up in the last slide. The model classes are "Died" and "Survived." For the label Survived, we said that if you were in a first-class cabin, that had a positive impact on your survival, a pretty strong positive impact.

You can't interpret these weights in and of themselves. If I said it's 1.6, that really doesn't mean anything. So what you have to look at is the relative weights, not the absolute weights. And we see that it's a pretty strong relative weight. A second-class cabin also has a positive weight, in this case, of 0.46. So it was indicating you had a better-than-average chance of surviving, but much less strong than first class. And if you were one of those poor people in a third-class cabin, well, that had a negative weight on survival. You were less likely to survive.

Age had a very small effect here, slightly negative. What that meant is the older you were, the less likely you were to have survived. But it's a very small negative value. The male gender had a relatively large negative weight, suggesting that if you were a male you were more likely to die than if you were a female. This might be true in the general population, but it was especially true on the Titanic.

Finally, I warned you that while what I just went through is something you will read in lots of papers that use machine learning, and will hear in lots of talks about people who have used machine learning, you should be very wary when people speak that way. It's not nonsense, but there are some cautionary notes. In particular, there's a big issue because the features are often correlated with one another. And so you can't interpret the weights one feature at a time.

To get a little bit technical, there are two major ways people use logistic regression. They're called L1 and L2. We used an L2. I'll come back to that in a minute, because that's the default in Python, or in [INAUDIBLE].
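If the library in question is scikit-learn, the parameter being referred to looks roughly like this; a sketch, not the course code (the solver argument is only there because scikit-learn's L1 penalty needs a solver that supports it):

from sklearn.linear_model import LogisticRegression

l2_model = LogisticRegression(penalty='l2')                      # the default
l1_model = LogisticRegression(penalty='l1', solver='liblinear')  # drives some weights to 0

# After fitting, the weights being discussed live in the coef_ attribute, e.g.:
#   l2_model.fit(X_train, y_train); print(l2_model.coef_)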
You can set that parameter to L2, and change it to L1 if you want. I experimented with it. It didn't change the results that much. But what an L1 regression is designed to do is to find some weights and drive them to 0. This is particularly useful when you have a very high-dimensional problem relative to the number of examples. And this gets back to that question we've talked about many times, of overfitting. If you've got 1,000 variables and 1,000 examples, you're very likely to overfit. L1 is designed to avoid overfitting by taking many of those 1,000 variables and just giving them 0 weight. And it does typically generalize better. But if you have two variables that are correlated, L1 will drive one to 0, and it will look like it's unimportant. But in fact, it might be important. It's just correlated with another one, which has gotten all the credit.

L2, which is what we did, does the opposite. It spreads the weight across the variables. So if you have a bunch of correlated variables, it might look like none of them are very important, because each of them gets a small amount of the weight. Again, not so important when you have four or five variables, which is what I'm showing you. But it matters when you have 100 or 1,000 variables.

Let's look at an example. So the cabin classes, the way we set it up, c1 plus c2 plus c3 -- whoops -- is not equal to 0. What is it equal to? I'll fix this right now. What should that have said? What's the invariant here? Well, a person is in exactly one class. I guess if you're really rich, maybe you rented two cabins, one in first and one in second. But probably not. Or if you did, you put your servants in second or third. But what does this have to add up to? Yeah?

AUDIENCE: 1.

JOHN GUTTAG: Has to add up to 1. Thank you. So it adds up to 1. Whoa. Got his attention, at least.
So what this tells us is the values are not independent. Because if c1 is 1, then c2 and c3 must be 0. Right? And so now we could go back to the previous slide and ask the question, well, is it that being in first class is protective? Or is it that being in second or third class is risky? And there's no simple answer to that.

So let's do an experiment. We have these correlated variables. Suppose we eliminate c1 altogether. So I did that by changing the init method of class Passenger. It takes the same arguments, but we'll look at the code, because it's a little bit clearer there. So there was the original one. And I'm going to replace that by this. You can compare that with the original one. So what you see is that instead of having five features, I now have four. I've eliminated the c1 binary feature. And then the code is straightforward. I've just come through here, and I've just enumerated the possibilities. So if you're in first class, then second and third are both 0. Otherwise, one of them is a 1. So my invariant is gone now, right? It's not the case that we know that these two things have to add up to 1, because maybe I'm in the third case.

OK, let's go run that code and see what happens. Well, if you remember, we see that our accuracy has not really declined much. Pretty much the same results we got before. But our weights are really quite different. Now, suddenly, c2 and c3 have large negative weights. We can look at them side by side here. So you see, not much difference. It actually performs maybe -- well, really no real difference in performance. But you'll notice that the weights are really quite different. What had been a strong positive weight and relatively weak negative weights is now replaced by two strong negative weights. And age and gender change just a little bit.
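For concreteness, here is roughly what the change amounts to; a sketch with made-up names, not the actual init method of class Passenger:

def features_with_c1(cabin_class, age, is_male):
    # One-hot cabin class: c1 + c2 + c3 is always 1.
    c1 = 1 if cabin_class == 1 else 0
    c2 = 1 if cabin_class == 2 else 0
    c3 = 1 if cabin_class == 3 else 0
    return [c1, c2, c3, age, is_male]      # five features

def features_without_c1(cabin_class, age, is_male):
    # Drop c1: first class is now encoded as c2 == c3 == 0,
    # so the sum-to-1 invariant is gone.
    c2 = 1 if cabin_class == 2 else 0
    c3 = 1 if cabin_class == 3 else 0
    return [c2, c3, age, is_male]          # four features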
So the whole point here is that we have to be very careful, when you have correlated features, about over-interpreting the weights. It is generally pretty safe to rely on the sign, whether it's negative or positive.

All right, changing the topic but sticking with logistic regression, there is this parameter, you may recall, p, which is the probability. And that was the cutoff. And we set it to 0.5, saying if it estimates the probability of survival to be 0.5 or higher, then we're going to guess survived, predict survived. Otherwise, deceased. You can change that. And so I'm going to try two extreme values, setting p to 0.1 and p to 0.9.

Now, what do we think that's likely to change? Remember, we looked at a bunch of different attributes. In particular, what attributes do we think are most likely to change? Anyone who has not answered a question want to volunteer? I have nothing against you, it's just I'm trying to spread the wealth. And I don't want to give you diabetes, with all the candy. All right, you get to go again.

AUDIENCE: Sensitivity.

JOHN GUTTAG: Pardon?

AUDIENCE: The sensitivity and specificity.

JOHN GUTTAG: Sensitivity and specificity, positive predictive value. Because we're shifting. And we're saying, well, by changing the probability, we're making a decision that it's more important to not miss survivors than it is to, say, let the number of false positives get too high.

So let's look at what happens when we run that. I won't run it for you. But these are the results we got. So as it happens, 0.9 gave me higher accuracy. But the key thing is, notice the big difference here. So what is that telling me? Well, it's telling me that if I predict you're going to survive, you probably did. But look what it did to the sensitivity. It means that for most of the survivors, I'm predicting they died. Why is the accuracy still OK?
Well, because most people died on the boat -- on the ship, right? So we would have done pretty well, you recall, if we just guessed died for everybody. So it's important to understand these things.

I once did some work using machine learning for an insurance company that was trying to set rates. And I asked them what they wanted to do. And they said they didn't want to lose money. They didn't want to insure people who were going to get in accidents. So I was able to change this p parameter so that it did a great job. The problem was they got to write almost no policies. Because I could pretty much guarantee the people I said wouldn't get in an accident wouldn't. But there were a whole bunch of people who didn't get in accidents who they wouldn't write policies for. So they ended up not making any money. It was a bad decision.

So we can change the cutoff. That leads to a really important concept, something called the Receiver Operating Characteristic. And it's a funny name, having to do with it originally going back to radio receivers. But we can ignore that. The goal here is to say, suppose I don't want to make a decision about where the cutoff is, but I want to look at, in some sense, all possible cutoffs and look at the shape of it. And that's what this code is designed to do.

So the way it works is I'll take a training set and a test set, the usual thing. I'll build one model. And that's an important thing, that there's only one model getting built. And then I'm going to vary p. And I'm going to call apply model with the same model and the same test set, but different p's, and keep track of all of those results. I'm then going to plot a two-dimensional plot. The y-axis will have sensitivity. And the x-axis will have 1 minus specificity. So I am accumulating a bunch of results. And then I'm going to produce this curve calling sklearn.metrics.auc -- that's not the curve. AUC stands for Area Under the Curve.
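A rough sketch of that procedure, assuming a scikit-learn style model with predict_proba and numpy arrays of 0/1 test labels; the lecture's apply_model and plotting code may differ:

import numpy as np
import sklearn.metrics
import matplotlib.pyplot as plt

def build_roc(model, X_test, y_test, title='ROC'):
    probs = model.predict_proba(X_test)[:, 1]     # estimated P(survived)
    x_vals, y_vals = [], []                       # 1 - specificity, sensitivity
    for p in np.arange(0.0, 1.01, 0.01):          # sweep the cutoff p
        preds = probs >= p
        tp = np.sum(preds & (y_test == 1))
        fn = np.sum(~preds & (y_test == 1))
        tn = np.sum(~preds & (y_test == 0))
        fp = np.sum(preds & (y_test == 0))
        y_vals.append(tp / (tp + fn))             # sensitivity
        x_vals.append(1 - tn / (tn + fp))         # 1 - specificity
    auroc = sklearn.metrics.auc(x_vals, y_vals)   # area under the curve
    plt.plot(x_vals, y_vals, label='model')
    plt.plot([0, 1], [0, 1], '--', label='random classifier')
    plt.xlabel('1 - specificity')
    plt.ylabel('Sensitivity')
    plt.title(title + ', AUROC = ' + str(round(auroc, 3)))
    plt.legend()
    plt.show()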
And we'll see why we want to get that area under the curve. When I run that, it produces this. So here's the curve, the blue line. And there are some things to note about it.

Way down at this end I can have 0, right? I can set it so that I don't make any predictions. And this is interesting. So at this end it is saying what? Remember that my x-axis is not specificity, but 1 minus specificity. So what we see is that this corner is highly sensitive and very unspecific. So I'll get a lot of false positives. This corner is very specific, because 1 minus specificity is 0, and very insensitive. So way down at the bottom, I'm declaring nobody to be positive. And way up here, everybody. Clearly, I don't want to be at either of these places on the curve, right? Typically I want to be somewhere in the middle. And here, we can see, there's a nice knee in the curve. We can choose a place.

What does this green line represent, do you think? The green line represents a random classifier. I flip a coin and I just classify something positive or negative, depending on heads or tails, in this case.

So now we can look at an interesting region, which is this region, the area between the curve and a random classifier. And that sort of tells me how much better I am than random. I can look at the whole area, the area under the curve. And that's this, the area under the Receiver Operating Curve. In the best of all worlds, that area would be 1. That would be a perfect classifier. In the worst of all worlds, it would be 0. But in practice it's never 0, because we never have to do worse than 0.5. We hope not to do worse than random. And if we do, we can just reverse our predictions, and then we're better than random. So random is as bad as you can do, really. And so this is a very important concept. And it lets us evaluate how good a classifier is independently of what we choose to be the cutoff.
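For reference, the standard definitions behind those two axes, written in terms of true/false positives and negatives (TP, FP, TN, FN):

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
1 - \text{specificity} = \frac{FP}{TN + FP}
\]

So the x-axis is the false positive rate, which is why a random classifier traces the diagonal.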
So when you read the literature and people say, I have this wonderful method of making predictions, you'll almost always see them cite the AUROC.

Any questions about this, or about machine learning in general? If so, this would be a good time to ask them, since I'm about to totally change the topic. Yes?

AUDIENCE: At what level does AUROC start to be statistically significant? And how many data points do you need to also prove that [INAUDIBLE]?

JOHN GUTTAG: Right. So the question is, at what point does the AUROC become statistically significant? And that is, essentially, an unanswerable question.

Whoops, relay it back. Needed to put more air under the throw. I look like the quarterback for the Rams, if you saw them play lately.

So if you ask this question about significance, it will depend upon a number of things. So you're always asking, is it significantly better than x? And so the question is, is it significantly better than random? And you can't just say, for example, that 0.6 isn't and 0.7 is, because it depends how many points you have. If you have a lot of points, it could be only a tiny bit better than 0.5 and still be statistically significant. It may be uninterestingly better. It may not be significant in the English sense, but you still get statistical significance. So that's a problem when studies have lots of points.

In general, it depends upon the application. For a lot of applications, you'll see things in the 0.7s being considered pretty useful. And the real question shouldn't be whether it's significant, but whether it's useful. Can you make useful decisions based upon it? And the other thing is, typically, when you're talking about that, you're selecting some point and really talking about a region relative to that point. We usually don't really care what it does out here, because we hardly ever operate out there anyway. We're usually somewhere in the middle.
But good question. Yeah?

AUDIENCE: Why are we doing 1 minus specificity?

JOHN GUTTAG: Why are we doing 1 minus specificity instead of specificity? Is that the question? And the answer is, essentially, so we can do this trick of computing the area. It gives us this nice curve, this nice, if you will, concave curve, which lets us compute this area under here nicely. If you were to take specificity and just draw it, it would look different. Obviously, mathematically, they're, in some sense, the same, right? If you have 1 minus x and x, you can get either from the other. So it really just has to do with the way people want to draw this picture.

AUDIENCE: [INAUDIBLE]?

JOHN GUTTAG: Pardon?

AUDIENCE: Does that not change [INAUDIBLE]?

JOHN GUTTAG: Does it not--

AUDIENCE: Doesn't it change the meaning of what you're [INAUDIBLE]?

JOHN GUTTAG: Well, you'd have to use a different statistic. You couldn't cite the AUROC if you did specificity directly, which is why they do 1 minus. The goal is you want to have this point at (0, 0) and this one at (1, 1). And plotting 1 minus gives you this trick of anchoring those two points. And so then you get a curve connecting them, which you can then easily compare to the random curve. It's just one of these little tricks that statisticians like to play to make things easy to visualize and easy to compute statistics about. It's not a fundamentally important issue.

Anything else? All right, so I told you I was going to change topics -- finally got one completed -- and I am. And this is a topic I approach with some reluctance. So you have probably all heard this expression, that there are three kinds of lies: lies, damn lies, and statistics. And we've been talking a lot about statistics.
And now I want to spend the rest of today's lecture and the start of Wednesday's lecture talking about how to lie with statistics. So at this point, I usually put on my "Numbers Never Lie" hat. But I do say that numbers never lie, but liars use numbers. And I hope none of you will ever go work for a politician and put this knowledge to bad use.

This quote is well known. It's variously attributed, often to Mark Twain, the fellow on the left. He claimed not to have invented it, but said it was invented by Benjamin Disraeli. And I prefer to believe that, since it does seem like something a Prime Minister would invent.

So let's think about this. The issue here is the way the human mind works and statistics. Darrell Huff, a well-known statistician who did write a book called How to Lie with Statistics, says, "If you can't prove what you want to prove, demonstrate something else and pretend they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anyone will notice the difference." And indeed, empirically, he seems to be right.

So let's look at some examples. Here's one I like. This is from another famous statistician called Anscombe. And he invented this thing called Anscombe's Quartet. I'll take my hat off now. It's too hot in here.

A bunch of numbers, 11 x, y pairs. I know you don't want to look at the numbers, so here are some statistics about them. Each of those sets of pairs has the same mean value for x, the same mean for y, the same variance for x, the same variance for y. And then I went and I fit a linear regression model to it. And lo and behold, I got the same equation for every one: y equals 0.5x plus 3. So that raises the question, if we go back, is there really much difference between these sets of x, y pairs? Are they really similar? And the answer is, that's what they look like if you plot them.
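If you want to reproduce that check yourself, a quick sketch using the copy of Anscombe's quartet that ships with seaborn (assuming seaborn, numpy, and matplotlib are available):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

quartet = sns.load_dataset('anscombe')                 # columns: dataset, x, y
for name, group in quartet.groupby('dataset'):
    slope, intercept = np.polyfit(group.x, group.y, 1)
    print(name,
          'mean x =', round(group.x.mean(), 2), 'mean y =', round(group.y.mean(), 2),
          'var x =', round(group.x.var(), 2), 'var y =', round(group.y.var(), 2),
          'fit: y =', round(slope, 2), '* x +', round(intercept, 2))

# The summary statistics all match; the pictures do not.
sns.lmplot(data=quartet, x='x', y='y', col='dataset', col_wrap=2)
plt.show()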
So even though statistically they appear to be kind of the same, they could hardly be more different, right? Those are not the same distributions. So there's an important moral here, which is that statistics about data is not the same thing as the data itself. And this seems obvious, but it's amazing how easy it is to forget it. The number of papers I've read where I see a bunch of statistics about the data but don't see the data is enormous. And it's easy to lose track of the fact that the statistics don't tell the whole story.

So the answer is the old Chinese proverb, a picture is worth a thousand words. I urge you, the first thing you should do when you get a data set is plot it. If it's got too many points to plot all the points, subsample it and plot the subsample. Use some visualization tool to look at the data itself.

Now, that said, pictures are wonderful, but you can lie with pictures. So here's an interesting chart. These are grades in 6.0001 by gender. So the males are blue and the females are pink. Sorry for being such a traditionalist. And as you can see, the women did way better than the men. Now, I know for some of you this is confirmation bias. You say, of course. Others say, impossible. But in fact, if you look carefully, you'll see that's not what this chart says at all. Because if you look at the axis here, you'll see that actually there's not much difference.

Here's what I get if I plot it from 0 to 5. Yeah, the women did a little bit better. But that's not a statistically significant difference. And by the way, when I plotted it last year for 6.0002, the blue was about that much higher than the pink. Don't read much into either of them. But the trick was, here I took the y-axis and ran it from 3.9 to 4.05. I cleverly chose my baseline in such a way as to make the difference look much bigger than it is. Here I did the honest thing of putting the baseline at 0 and running it to 5.
Because that's the range of grades at MIT. And so when you look at a chart, it's important to keep in mind that you need to look at the axis labels and the scales.

Let's look at another chart, just in case you think I'm the only one who likes to play with graphics. This is a chart from Fox News. And they're arguing here -- it's the shocking statistic that there are 108.6 million people on welfare, and 101.7 million with a full-time job. And you can imagine the rhetoric that accompanies this chart. This is actually correct. It is true from the Census Bureau data. Sort of.

But notice that I said you should read the labels on the axes. There is no label here. But you can bet that the y-intercept is not 0 on this, because you can see how small 101.7 looks. So it makes the difference look bigger than it is.

Now, that's not the only funny thing about it. I said you should look at the labels on the x-axis. Well, they've labeled them. But what do these things mean? Well, I looked it up, and I'll tell you what they actually mean. People on welfare counts the number of people in a household in which at least one person is on welfare. So if there are, say, two parents, one is working and one is collecting welfare, and there are four kids, that counts as six people on welfare. People with a full-time job, on the other hand, does not count households. So in the same family, you would have six on the bar on the left, and one on the bar on the right. Clearly giving a very different impression.

And so again, pictures can be good. But if you don't dive deep into them, they really can fool you. Now, before I leave this slide, I should say that it's not the case that you can't believe anything you read on Fox News. Because in fact, the Red Sox did beat the St. Louis Cardinals 4 to 2 that day.

So the moral here is to ask whether the things being compared are actually comparable.
Or whether you're really comparing apples and oranges, as they say.

OK, this is probably the most common statistical sin. It's called GIGO. And perhaps this picture can make you guess what the G's stand for. GIGO is Garbage In, Garbage Out.

So here's a great quote about it. Charles Babbage designed the first digital computer, the first actual computation engine. He was unable to build it. But more than a hundred years after he died, one was built according to his design, and it actually worked. No electronics, really. So he was a famous person. And he was asked by Parliament about his machine, which he was asking them to fund: well, if you put wrong numbers into the machine, will the machine have right numbers come out the other end? And of course, he was a very smart guy, and he was totally baffled. This question seemed so stupid, he couldn't believe anyone would even ask it. It was just computation. And the answers you get are based on the data you put in. If you put in garbage, you get out garbage.

So here is an example from the 1840s. They did a census in the 1840s. And for those of you who are not familiar with American history, it was a very contentious time in the US. The country was divided between states that had slavery and states that didn't. And that was the dominant political issue of the day. John Calhoun, who was Secretary of State and a leader in the Senate, was from South Carolina and probably the strongest proponent of slavery. And he used the census data to say that slavery was actually good for the slaves. Kind of an amazing thought. Basically saying that this data claimed that freed slaves were more likely to be insane than enslaved slaves.

He was rebutted in the House by John Quincy Adams, who had formerly been President of the United States. After he stopped being President, he ran for Congress, from Braintree, Massachusetts.
Actually, the part he's from is now called Quincy, after his family. And he claimed that atrocious misrepresentations had been made on a subject of deep importance. He was an abolitionist. So you don't even have to look at the statistics to know who to believe. Just look at these pictures. Are you going to believe this nice gentleman from Braintree or this scary guy from South Carolina?

But setting looks aside, Calhoun eventually admitted that the census was indeed full of errors. But he said that was fine, because there were so many of them that they would balance each other out and lead to the same conclusion as if they were all correct. So he didn't believe in garbage in, garbage out. He said, yeah, it is garbage, but it'll all come out in the end OK.

Well, now we know enough to ask the question. This isn't totally brain dead, in that we've already looked at experiments and said we get experimental error. And under some circumstances, you can manage the error. The data isn't garbage, it just has errors. But that's true only if the measurement errors are unbiased and independent of each other, and almost identically distributed on either side of the mean, right? That's why we spend so much time looking at the normal distribution, and why it's called Gaussian. Because Gauss said, yes, I know I have errors in my astronomical measurements. But I believe my errors are distributed in what we now call a Gaussian curve. And therefore, I can still work with them and get an accurate estimate of the values.

Now, of course, that wasn't true here. The errors were not random. They were, in fact, quite systematic, designed to produce a certain conclusion. And the last word was from another abolitionist who claimed it was the census that was insane.

All right, that's Garbage In, Garbage Out. The moral here is that analysis of bad data is worse than no analysis at all, really.
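A tiny simulation of the distinction being drawn, with made-up numbers: unbiased, independent measurement errors average out, while a systematic error does not.

import random

random.seed(0)
true_value = 100.0
n = 10000

# Unbiased errors: Gaussian noise centered on 0.
unbiased = [true_value + random.gauss(0, 5) for _ in range(n)]
# Systematic errors: every measurement is pushed the same way (+3 here).
biased = [true_value + random.gauss(3, 5) for _ in range(n)]

print('True value:                 ', true_value)
print('Estimate, unbiased errors:  ', round(sum(unbiased) / n, 2))
print('Estimate, systematic errors:', round(sum(biased) / n, 2))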
Time and again we see people doing, actually quite often, a correct statistical analysis of incorrect data and reaching conclusions. And that's really risky. So before one goes off and starts using statistical techniques of the sort we've been discussing, the first question you have to ask is, is the data itself worth analyzing? And it often isn't.

Now, you could argue that this is a thing of the past, and no modern politician would make these kinds of mistakes. I'm not going to insert a photo here. But I leave it to you to think which politician's photo you might paste in this frame.

All right, on to another statistical sin. This is a picture of a World War II fighter plane. I don't know enough about planes to know what kind of plane it is. Anyone here? There must be an Aero student who will be able to tell me what plane this is. Don't they teach you guys anything in Aero these days? Shame on them. All right. Anyway, it's a plane. That much I know. And it has a propeller. And that's all I can tell you about the airplane.

So this was a photo taken at an airfield in Britain. And the Allies would send planes over Germany for bombing runs, and fighters to protect the bombers. And when they came back, the planes were often damaged. And they would inspect the damage and say, look, there's a lot of flak there -- the Germans shot flak at the planes -- and that would be a part of the plane that maybe we should reinforce in the future. So when it gets hit by flak it survives it; the flak does less damage. So you could analyze where the Germans were hitting the planes, and you would add a little extra armor to that part of the plane.

What's the flaw in that? Yeah?

AUDIENCE: They didn't look at the planes that actually got shot down.

JOHN GUTTAG: Yeah. This is what's called, in the jargon, survivor bias. S-U-R-V-I-V-O-R.
The planes they really should have been analyzing were the ones that got shot down. But those were hard to analyze. So they analyzed the ones they had and drew conclusions, and perhaps totally the wrong conclusion. Maybe the conclusion they should have drawn is, well, it's OK if you get hit here; let's reinforce the other places. I don't know enough to know what the right answer was. I do know that this was statistically the wrong thing to be thinking about doing.

And this is an issue we have whenever we do sampling. All statistical techniques are based upon the assumption that by sampling a subset of the population we can infer things about the population as a whole. Everything we've done this term has been based on that. When we were fitting curves, we were doing that. When we were talking about the empirical rule and Monte Carlo simulation, we were doing that. When we were building models with machine learning, we were doing that.

And if random sampling is used, you can make meaningful mathematical statements about the relation of the sample to the entire population. And that's why so much of what we did works. And when we're doing simulations, that's really easy. When we were choosing random values of the needles for trying to find pi, or random values for the roulette wheel spins, we could be pretty sure our samples were, indeed, random.

In the field, it's not so easy. Right? Because some samples are much more convenient to acquire than others. It's much easier to acquire a plane on an airfield in Britain than a plane on the ground in France. Convenience sampling, as it's often called, is not usually random. So you have survivor bias.

So I asked you to do course evaluations. Well, there's survivor bias there. The people who really hated this course have already dropped it. And so we won't sample them. That's good for me, at least. But we see that. We see that with grades.
The people who are really struggling, who were most likely to fail, have probably dropped the course too. That's one of the reasons I don't think it's fair to say, we're going to have a curve, and we're going to always fail this fraction and give A's to this fraction. Because by the end of the term, we have a lot of survivor bias. The students who are left are, on average, better than the students who started the semester. So you need to take that into account.

Another kind of non-representative sampling, or convenience sampling, is opinion polls, in that you have something there called non-response bias. So I don't know about you, but I get phone calls asking my opinion about things. Surveys about products, whatever. I never answer. I just hang up the phone. I get a zillion emails. Every time I stay in a hotel, I get an email asking me to rate the hotel. When I fly, I get emails from the airline. I don't answer any of those surveys. But some people do, presumably, or they wouldn't send them out. But why should they think that the people who answer the survey are representative of all the people who stay in the hotel, or all the people who fly on the plane? They're not. They're the kind of people who maybe have time to answer surveys. And so you get a non-response bias. And that tends to distort your results.

When samples are not random and independent, we can still run statistics on them. We can compute means and standard deviations. And that's fine. But we can't draw conclusions using things like the empirical rule, the central limit theorem, or the standard error. Because the basic assumption underlying all of that is that the samples are random and independent.

This is one of the reasons why political polls are so unreliable. They compute statistics using the standard error, assuming that the samples are random and independent. But they, for example, get them mostly by calling landlines.
And so they get a bias towards people who actually answer the phone on a landline. How many of you have a landline where you live? Not many, right? Mostly you rely on your cell phones. And so any survey that depends on landlines is going to leave a lot of the population out. They'll get a lot of people of my vintage, not of your vintage. And that gets you in trouble.

So the moral here is: always understand how the data was collected, what the assumptions in the analysis were, and whether they're satisfied. If these things are not true, you need to be very wary of the results.

All right, I think I'll stop here. We'll finish up our panoply of statistical sins on Wednesday, in the first half. Then we'll do a course wrap-up. Then I'll wish you all godspeed and a good final. See you Wednesday.