1
00:00:09,310 --> 00:00:11,300
PATRICK WINSTON: Ladies and
gentlemen, the Romanian

2
00:00:11,300 --> 00:00:12,960
national anthem.

3
00:00:12,960 --> 00:00:15,930
I did not ask you to stand,
because I didn't play it as a

4
00:00:15,930 --> 00:00:18,250
symbol of Romanian national
identity.

5
00:00:18,250 --> 00:00:21,580
But rather, to celebrate the
end of the Cold War, which

6
00:00:21,580 --> 00:00:24,740
occurred about the time
that you were born.

7
00:00:24,740 --> 00:00:28,980
Before that, no one came to
MIT from Eastern Europe.

8
00:00:28,980 --> 00:00:33,070
But since that time, we've been
blessed by having in our

9
00:00:33,070 --> 00:00:40,850
midst Lithuanians, Estonians,
Poles, Czechs, Slovaks,

10
00:00:40,850 --> 00:00:47,400
Bulgarians, Romanians,
Slovenians, Serbs, and all

11
00:00:47,400 --> 00:00:50,510
sorts of people from regions
of the world

12
00:00:50,510 --> 00:00:53,580
formerly excluded to us.

13
00:00:53,580 --> 00:00:57,900
Believe me, you are all
welcome in our house.

14
00:00:57,900 --> 00:01:00,110
Almost all, that is to say.

15
00:01:00,110 --> 00:01:04,530
Because you may recall that
Romania is the traditional

16
00:01:04,530 --> 00:01:07,430
home of vampires.

17
00:01:07,430 --> 00:01:10,180
And since the end of the Cold
War, vampires have had new

18
00:01:10,180 --> 00:01:14,360
vectors for emerging from their
traditional places and

19
00:01:14,360 --> 00:01:16,010
penetrating into the
world at large.

20
00:01:16,010 --> 00:01:22,560
You may have a vampire in your
suite, or on your floor.

21
00:01:22,560 --> 00:01:26,740
And it's important to know how
to recognize them, and take

22
00:01:26,740 --> 00:01:30,420
the necessary precautions.

23
00:01:30,420 --> 00:01:36,860
So if you have this concern, I
would expect that the first

24
00:01:36,860 --> 00:01:40,009
thing you would do would be to
look at some data concerning

25
00:01:40,009 --> 00:01:41,890
the characteristics
of vampires.

26
00:01:55,550 --> 00:02:00,460
So there's a little database of
samples of individuals who

27
00:02:00,460 --> 00:02:06,560
have been determined to be
vampires and not vampires.

28
00:02:06,560 --> 00:02:08,020
And our task today--

29
00:02:08,020 --> 00:02:10,919
and what you'll understand how
to do by the end of the hour--

30
00:02:10,919 --> 00:02:14,640
is to use data like this to
build a recognition mechanism

31
00:02:14,640 --> 00:02:17,740
that would help you to identify
whether someone is a

32
00:02:17,740 --> 00:02:21,000
vampire or an ordinary person.

33
00:02:21,000 --> 00:02:24,079
So this is a little different
from the kind of problem we

34
00:02:24,079 --> 00:02:27,360
worked with neural nets.

35
00:02:27,360 --> 00:02:27,700
Right?

36
00:02:27,700 --> 00:02:32,579
So what's the most conspicuous
difference between this data

37
00:02:32,579 --> 00:02:38,590
set and anything you could think
to work on with nearest

38
00:02:38,590 --> 00:02:40,079
neighbors, which we
studied last time?

39
00:02:40,079 --> 00:02:42,772
Katie, do you have any thoughts
about why it would be

40
00:02:42,772 --> 00:02:45,020
difficult to use nearest
neighbors with data like this?

41
00:02:48,260 --> 00:02:51,690
The question mark is there
because this is MIT, and a lot

42
00:02:51,690 --> 00:02:53,170
of people are completely
nocturnal.

43
00:02:53,170 --> 00:02:57,630
So you can't tell whether they
cast a shadow or not.

44
00:02:57,630 --> 00:03:01,220
We want to take that
into account.

45
00:03:01,220 --> 00:03:04,150
So what's different about
this from the

46
00:03:04,150 --> 00:03:06,851
electrical cover data set?

47
00:03:06,851 --> 00:03:08,101
STUDENT: [INAUDIBLE]

48
00:03:12,810 --> 00:03:14,550
PATRICK WINSTON: Could you use
the nearest neighbor technique

49
00:03:14,550 --> 00:03:18,180
to identify vampires
with this data?

50
00:03:18,180 --> 00:03:19,430
STUDENT: [INAUDIBLE]

51
00:03:25,620 --> 00:03:27,120
PATRICK WINSTON:
So obviously--

52
00:03:27,120 --> 00:03:29,831
Yes, Lana?

53
00:03:29,831 --> 00:03:31,081
STUDENT: [INAUDIBLE]

54
00:03:33,807 --> 00:03:35,800
STUDENT: You cannot
really quantify--

55
00:03:35,800 --> 00:03:37,110
PATRICK WINSTON: Oh,
that's the problem.

56
00:03:37,110 --> 00:03:38,970
This is not numerical data.

57
00:03:38,970 --> 00:03:41,138
This is symbolic.

58
00:03:41,138 --> 00:03:44,579
So we're not saying that
your ability to

59
00:03:44,579 --> 00:03:47,320
cast a shadow is 0.7.

60
00:03:47,320 --> 00:03:49,970
You either cast a shadow,
don't cast a

61
00:03:49,970 --> 00:03:51,200
shadow, or we can't tell.

62
00:03:51,200 --> 00:03:53,770
It's a symbolic result.

63
00:03:53,770 --> 00:03:57,810
So problem number one we have to
face with data of this kind

64
00:03:57,810 --> 00:03:59,060
is that it's not numeric.

65
00:04:05,480 --> 00:04:07,950
And there are other
characteristics, as well.

66
00:04:07,950 --> 00:04:11,180
For example, it's not clear
that all of these

67
00:04:11,180 --> 00:04:13,820
characteristics actually
matter.

68
00:04:13,820 --> 00:04:15,490
So some characteristics
don't matter.

69
00:04:26,460 --> 00:04:29,990
And a corollary to that is that
some characteristics do

70
00:04:29,990 --> 00:04:33,460
matter, but they only matter
part of the time.

71
00:04:52,240 --> 00:04:56,180
And finally, there's
the matter of cost.

72
00:04:58,880 --> 00:05:01,400
Some of these tests may
be more expensive to

73
00:05:01,400 --> 00:05:03,350
perform than others.

74
00:05:03,350 --> 00:05:05,740
For example, if you wanted to
determine whether someone

75
00:05:05,740 --> 00:05:07,930
casts a shadow, you'd have to
go to the trouble of getting

76
00:05:07,930 --> 00:05:09,970
up during daylight.

77
00:05:09,970 --> 00:05:12,000
That might be an expensive
operation for you.

78
00:05:15,100 --> 00:05:17,570
You'd have to go find some
garlic and ask them to eat it.

79
00:05:17,570 --> 00:05:19,070
That might be expensive.

80
00:05:19,070 --> 00:05:21,140
So some of these tests
might be expensive

81
00:05:21,140 --> 00:05:23,560
relative to other tests.

82
00:05:23,560 --> 00:05:28,110
But once you realize that we are
talking in terms of tests,

83
00:05:28,110 --> 00:05:32,530
and not a vector of
real values, then

84
00:05:32,530 --> 00:05:34,920
what you do is clear.

85
00:05:34,920 --> 00:05:37,305
You build yourself a little
tree of tests.

86
00:05:40,680 --> 00:05:43,810
So who knows how this problem
will turn out?

87
00:05:43,810 --> 00:05:47,210
But you can imagine a situation
where you have one

88
00:05:47,210 --> 00:05:52,740
test up here which might
have three outcomes.

89
00:05:52,740 --> 00:05:57,520
And one but only one of those
outcomes might require you to

90
00:05:57,520 --> 00:06:01,020
perform another test.

91
00:06:01,020 --> 00:06:05,940
And only when you've created
the tree of tests that looks

92
00:06:05,940 --> 00:06:07,835
like this are you finished.

93
00:06:10,540 --> 00:06:15,060
So given this set of tests and a
set of samples, the question

94
00:06:15,060 --> 00:06:18,640
becomes, how do you arrange the
tests in a tree like that

95
00:06:18,640 --> 00:06:22,250
so as to do the identification
that you want to do?

96
00:06:22,250 --> 00:06:25,030
So since we're talking about
identification, it's not

97
00:06:25,030 --> 00:06:28,250
surprising that this kind
of tree is called an

98
00:06:28,250 --> 00:06:29,500
identification tree.
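[Editor's sketch] The tree of tests described above can be written down as a small data structure. This is a minimal sketch, not the lecturer's notation: the nested-dictionary layout, the test names, the outcome strings, and the labels are all assumptions chosen for illustration. It shows the shape described in the lecture, one test with three outcomes, only one of which leads to a second test.

```python
# An identification tree as nested dictionaries (hypothetical layout).
# An inner node names a test and maps each outcome to a subtree;
# a leaf is just a label string.
tree = {
    "test": "shadow",
    "branches": {
        "yes": "not vampire",
        "no": "vampire",
        "?": {  # only this outcome requires a second test
            "test": "garlic",
            "branches": {"yes": "not vampire", "no": "vampire"},
        },
    },
}

def identify(node, sample):
    """Walk down from the root, performing only the tests on the path taken."""
    while isinstance(node, dict):
        outcome = sample[node["test"]]
        node = node["branches"][outcome]
    return node

print(identify(tree, {"shadow": "?", "garlic": "no"}))
print(identify(tree, {"shadow": "yes"}))  # the garlic test is never performed
```

Because a sample only pays for the tests along its own path, a smaller tree means fewer tests per identification.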

99
00:06:41,560 --> 00:06:44,420
And there's a tendency-- and I
may slip into it myself-- to

100
00:06:44,420 --> 00:06:46,210
call this a decision tree.

101
00:06:46,210 --> 00:06:48,970
But a decision tree is a label
for something else.

102
00:06:48,970 --> 00:06:51,060
This is an identification
tree.

103
00:06:51,060 --> 00:06:55,170
And the task is to create
a good one.

104
00:06:55,170 --> 00:06:59,330
So what is a good one versus
a not so good one?

105
00:06:59,330 --> 00:07:02,070
What characteristic would you
like for a decision tree--

106
00:07:05,010 --> 00:07:08,250
for an identification tree to
have, if you're going to call

107
00:07:08,250 --> 00:07:12,170
it a good identification tree?

108
00:07:12,170 --> 00:07:13,265
What do you think, Krishna?

109
00:07:13,265 --> 00:07:15,964
What would be a good
characteristic?

110
00:07:15,964 --> 00:07:18,399
STUDENT: Maybe the minimum
number of levels?

111
00:07:18,399 --> 00:07:18,886
PATRICK WINSTON: Yeah.

112
00:07:18,886 --> 00:07:22,110
He said minimum number
of levels.

113
00:07:22,110 --> 00:07:26,960
What's another way you could
say what a good one is?

114
00:07:26,960 --> 00:07:28,680
Each test costs something,
right?

115
00:07:28,680 --> 00:07:32,050
So what's another way of
thinking about what a good

116
00:07:32,050 --> 00:07:33,864
tree would look like?

117
00:07:33,864 --> 00:07:34,820
STUDENT: Minimum cost.

118
00:07:34,820 --> 00:07:36,280
PATRICK WINSTON: The
minimum cost.

119
00:07:36,280 --> 00:07:39,810
And if they all have the
same cost, then it's

120
00:07:39,810 --> 00:07:41,770
the number of tests.

121
00:07:41,770 --> 00:07:45,530
So overall, what you like
is a small tree

122
00:07:45,530 --> 00:07:47,400
rather than a big one.

123
00:07:47,400 --> 00:07:49,730
So you might be able to take
your sample data and divide it

124
00:07:49,730 --> 00:07:53,190
up, so that at the bottom of the
tree, at the leaves, all

125
00:07:53,190 --> 00:07:57,380
of the sets that are produced
by the tests are uniform,

126
00:07:57,380 --> 00:07:59,070
homogeneous.

127
00:07:59,070 --> 00:08:01,905
We'd like that tree to be the
simplest possible tree you can

128
00:08:01,905 --> 00:08:05,460
find, not some big complicated
one that also divides up all

129
00:08:05,460 --> 00:08:08,920
the data into uniform subsets.

130
00:08:08,920 --> 00:08:10,070
By uniform subset--

131
00:08:10,070 --> 00:08:13,100
at the bottom of the tree, you
have all of the vampires

132
00:08:13,100 --> 00:08:16,980
together, and all the
non-vampires together.

133
00:08:16,980 --> 00:08:17,840
So you'd like a small tree.

134
00:08:17,840 --> 00:08:23,590
So why not just go all the way
and do British Museum, and

135
00:08:23,590 --> 00:08:25,946
calculate all possible trees?

136
00:08:25,946 --> 00:08:28,730
Well, you can do that, but it's
one of those NP problems.

137
00:08:28,730 --> 00:08:32,620
And as you know, NP problems
suck in general.

138
00:08:32,620 --> 00:08:33,929
And so you don't want
to do that.

139
00:08:33,929 --> 00:08:37,820
You want to have some kind of
heuristic mechanism for

140
00:08:37,820 --> 00:08:39,640
building a small tree.

141
00:08:39,640 --> 00:08:42,590
And we want a small
tree because--

142
00:08:42,590 --> 00:08:43,900
Why do we want a small tree?

143
00:08:43,900 --> 00:08:45,040
Because of the cost.

144
00:08:45,040 --> 00:08:46,870
But there's another, more
important reason why we want a

145
00:08:46,870 --> 00:08:48,025
small tree.

146
00:08:48,025 --> 00:08:50,250
Let me give you a hint.

147
00:08:50,250 --> 00:08:52,090
It's Occam's Razor.

148
00:08:52,090 --> 00:08:57,110
The simplest explanation is
often the best explanation.

149
00:08:57,110 --> 00:08:58,650
So if you have a big,
complicated explanation,

150
00:08:58,650 --> 00:09:06,150
that's probably less good than
a simple, small explanation.

151
00:09:06,150 --> 00:09:07,400
Occam's Razor.

152
00:09:07,400 --> 00:09:10,290
Spelled so many ways it doesn't
matter how I spell it.

153
00:09:10,290 --> 00:09:12,090
And that's good, because
I can't spell.

154
00:09:15,260 --> 00:09:20,040
So how are we going to go
about finding the best

155
00:09:20,040 --> 00:09:24,180
possible arrangement
of those four tests

156
00:09:24,180 --> 00:09:26,470
in a tree like that?

157
00:09:26,470 --> 00:09:29,630
Well, step one will be
to see what each test

158
00:09:29,630 --> 00:09:30,340
does with the data.

159
00:09:30,340 --> 00:09:36,070
And by the way, before I go a
step further, you know and I

160
00:09:36,070 --> 00:09:40,740
know that this is a sample data
set that's very small,

161
00:09:40,740 --> 00:09:43,090
suitable for classroom
manipulation.

162
00:09:43,090 --> 00:09:47,090
You'd never bet your life on
a data set this small.

163
00:09:47,090 --> 00:09:49,020
We use it only for classroom
illustration.

164
00:09:49,020 --> 00:09:51,970
But imagine that these rows
are multiplied by 10.

165
00:09:51,970 --> 00:09:55,500
So instead of eight samples,
you've got 80.

166
00:09:55,500 --> 00:09:56,670
Then you might begin
to believe the

167
00:09:56,670 --> 00:09:58,220
results that are produced.

168
00:09:58,220 --> 00:10:00,700
So I'm just going to pretend
that each one of those

169
00:10:00,700 --> 00:10:04,330
represents 10 other
samples that I

170
00:10:04,330 --> 00:10:07,530
haven't bothered to show.

171
00:10:07,530 --> 00:10:10,290
But we can work with this one in
the classroom, because it's

172
00:10:10,290 --> 00:10:10,990
pretty small.

173
00:10:10,990 --> 00:10:13,070
And we can say, well, what
does this shadow test do?

174
00:10:16,034 --> 00:10:20,310
Well, the shadow test divides
the sample population into

175
00:10:20,310 --> 00:10:21,780
three groups.

176
00:10:21,780 --> 00:10:24,790
There's the I Don't Know group
of people who are nocturnal.

177
00:10:24,790 --> 00:10:26,110
There are the people
who do cast the

178
00:10:26,110 --> 00:10:28,360
shadow, the Yes people.

179
00:10:28,360 --> 00:10:32,220
And the people who do not cast
a shadow, the No people.

180
00:10:32,220 --> 00:10:37,430
So if I look at those rows up
there and see which ones are

181
00:10:37,430 --> 00:10:44,440
vampires, it looks to me that
if there's no shadow cast--

182
00:10:44,440 --> 00:10:46,140
there's only one that doesn't
cast a shadow--

183
00:10:46,140 --> 00:10:47,615
and that is a vampire.

184
00:10:51,670 --> 00:10:53,950
So that's a plus over there.

185
00:10:53,950 --> 00:10:56,400
Vampire.

186
00:10:56,400 --> 00:10:59,680
Now, if we look at the ones
who do cast a shadow, all

187
00:10:59,680 --> 00:11:00,715
those are not vampires.

188
00:11:00,715 --> 00:11:01,965
They're all OK.

189
00:11:05,870 --> 00:11:07,520
And now there're 8.

190
00:11:07,520 --> 00:11:09,450
Three are vampires.

191
00:11:09,450 --> 00:11:13,040
So that means that two of
these must be vampires.

192
00:11:13,040 --> 00:11:15,420
And I've got three, four,
five, six so far.

193
00:11:15,420 --> 00:11:17,340
So there must be two left.

194
00:11:17,340 --> 00:11:20,730
So that's the way the shadow
test divides up the data.
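[Editor's sketch] Seeing what a test "does with the data" amounts to grouping the samples by that test's outcome and listing the labels in each group. The table on the board is not part of this transcript, so the eight rows below are a hypothetical reconstruction chosen only to agree with the counts quoted in the lecture; the column order and value names are likewise assumptions.

```python
from collections import defaultdict

# Hypothetical table: (shadow, garlic, complexion, accent, is_vampire).
# Reconstructed to match the counts quoted in the lecture, not copied from it.
SAMPLES = [
    ("?",   "yes", "pale",    "none",  False),
    ("yes", "yes", "ruddy",   "none",  False),
    ("?",   "no",  "ruddy",   "none",  True),
    ("no",  "no",  "average", "heavy", True),
    ("?",   "no",  "average", "odd",   True),
    ("yes", "no",  "pale",    "heavy", False),
    ("yes", "no",  "average", "heavy", False),
    ("?",   "yes", "ruddy",   "odd",   False),
]
TESTS = {"shadow": 0, "garlic": 1, "complexion": 2, "accent": 3}

def split(samples, test):
    """Group samples by the outcome of one test; '+' marks a vampire."""
    groups = defaultdict(list)
    for row in samples:
        groups[row[TESTS[test]]].append("+" if row[-1] else "-")
    return dict(groups)

for outcome, labels in split(SAMPLES, "shadow").items():
    print(outcome, "".join(labels))
```

With these rows the shadow test yields one vampire under "no", three non-vampires under "yes", and two of each under "?", as described above.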

195
00:11:20,730 --> 00:11:21,980
Now let's do garlic.

196
00:11:25,620 --> 00:11:28,040
Vampires traditionally
don't eat garlic.

197
00:11:28,040 --> 00:11:29,290
I don't know why.

198
00:11:31,710 --> 00:11:33,480
So we look at the garlic
test, and we see

199
00:11:33,480 --> 00:11:37,010
that all of the Nos--

200
00:11:37,010 --> 00:11:39,880
well, there're three
Yeses, and they all

201
00:11:39,880 --> 00:11:41,920
produce a No answer.

202
00:11:41,920 --> 00:11:46,590
So if somebody eats garlic,
they're not vampires.

203
00:11:46,590 --> 00:11:49,650
That means the three vampires
must be over here.

204
00:11:49,650 --> 00:11:50,580
Then there are two left.

205
00:11:50,580 --> 00:11:53,040
So that's what the
garlic test does.

206
00:11:53,040 --> 00:11:53,960
See what we're trying to do?

207
00:11:53,960 --> 00:11:56,530
We're trying to look at all
these tests to see which one

208
00:11:56,530 --> 00:12:01,220
we like best on the basis of
how it divides up the data.

209
00:12:01,220 --> 00:12:06,940
So now we've got complexion.

210
00:12:13,210 --> 00:12:15,970
And there are three
choices for this.

211
00:12:15,970 --> 00:12:17,530
You can have an average
complexion.

212
00:12:17,530 --> 00:12:21,490
But a lot of vampires, in my
experience, are rather pale.

213
00:12:21,490 --> 00:12:23,650
So pale is a possibility.

214
00:12:23,650 --> 00:12:26,210
And then the other option is
that just after gorging

215
00:12:26,210 --> 00:12:27,610
themselves with blood,
they tend to get a

216
00:12:27,610 --> 00:12:29,300
little red in the face.

217
00:12:29,300 --> 00:12:32,640
So we'll have a ruddy
over here.

218
00:12:32,640 --> 00:12:35,650
Once again, we have to go back
to our data set to see how

219
00:12:35,650 --> 00:12:37,640
this test divides things up.

220
00:12:37,640 --> 00:12:43,980
So there are three ruddies, and
one's a No, one's a No,

221
00:12:43,980 --> 00:12:44,670
and one's a Yes.

222
00:12:44,670 --> 00:12:45,920
So two Nos and a Yes.

223
00:12:49,650 --> 00:12:52,680
Two Nos and a Yes.

224
00:12:52,680 --> 00:12:55,300
Now we can try for pale
complexion people.

225
00:12:55,300 --> 00:12:57,100
There are only two of those.

226
00:12:57,100 --> 00:12:58,405
A No and a No.

227
00:13:03,310 --> 00:13:06,310
That must mean that there are
two pluses over here, because

228
00:13:06,310 --> 00:13:08,370
there are three vampires
altogether.

229
00:13:08,370 --> 00:13:13,681
Two, four, six, seven,
eight, nine.

230
00:13:13,681 --> 00:13:14,848
Eight, sorry.

231
00:13:14,848 --> 00:13:15,724
Eight.

232
00:13:15,724 --> 00:13:17,480
Only eight.

233
00:13:17,480 --> 00:13:19,150
Just one more to go, and
that's the accent.

234
00:13:22,520 --> 00:13:27,560
Historically, vampires go to
great lengths to protect their

235
00:13:27,560 --> 00:13:29,600
accent and not betray
their origins.

236
00:13:29,600 --> 00:13:32,900
But nevertheless, we
can expect that if

237
00:13:32,900 --> 00:13:33,850
they've just arrived--

238
00:13:33,850 --> 00:13:35,450
if they're just in from

239
00:13:35,450 --> 00:13:37,470
Transylvania, part of Romania--

240
00:13:37,470 --> 00:13:38,590
they may still have an accent.

241
00:13:38,590 --> 00:13:41,950
So there's a normal, some still
have a heavy accent, and

242
00:13:41,950 --> 00:13:44,080
some persist in having
odd accents.

243
00:13:49,940 --> 00:13:51,090
So let's see.

244
00:13:51,090 --> 00:13:51,660
Accent.

245
00:13:51,660 --> 00:13:54,630
Four of them, right at the
top, have no accent.

246
00:13:54,630 --> 00:13:55,880
Two Nos and a Yes.

247
00:14:03,050 --> 00:14:04,930
Heavy accent.

248
00:14:04,930 --> 00:14:05,910
Three of those.

249
00:14:05,910 --> 00:14:08,390
A Yes and two Nos.

250
00:14:13,190 --> 00:14:15,710
That means we must
have a plus here.

251
00:14:15,710 --> 00:14:18,290
3, 6, plus and a minus.

252
00:14:20,790 --> 00:14:24,400
So we can look at this data and
say, well, what will be

253
00:14:24,400 --> 00:14:25,610
the best test to use?

254
00:14:25,610 --> 00:14:30,860
And the best test to use would
surely be the one that

255
00:14:30,860 --> 00:14:36,710
produces sets here, at the
bottom of the branches, that

256
00:14:36,710 --> 00:14:38,950
correspond to the outcomes
of the test.

257
00:14:38,950 --> 00:14:44,750
We're looking for a test that
produces homogeneous groups.

258
00:14:44,750 --> 00:14:48,120
So just for the sake of
illustration, I'm going to

259
00:14:48,120 --> 00:14:51,200
suppose that we're going to
judge the quality of the test

260
00:14:51,200 --> 00:14:56,010
by how many sample individuals
it put into a homogeneous set.

261
00:14:56,010 --> 00:15:00,910
So ideally, we'd like a test
that will put all the vampires

262
00:15:00,910 --> 00:15:03,670
in one group and all the
ordinary people in another

263
00:15:03,670 --> 00:15:05,530
group right off the bat.

264
00:15:05,530 --> 00:15:07,260
But there are no such tests.

265
00:15:07,260 --> 00:15:11,400
But we can add up the number of
sample individuals who are

266
00:15:11,400 --> 00:15:14,460
put in to at least
homogeneous sets.

267
00:15:14,460 --> 00:15:18,140
So when we do that, this
guy has 3 in a

268
00:15:18,140 --> 00:15:19,470
homogeneous set here.

269
00:15:19,470 --> 00:15:20,480
A fourth.

270
00:15:20,480 --> 00:15:22,770
But these are not a
homogeneous set.

271
00:15:22,770 --> 00:15:25,725
So the overall score for
this guy will be 4.

272
00:15:28,740 --> 00:15:31,690
This one, well, not
quite as good.

273
00:15:31,690 --> 00:15:34,130
It only puts 3 individuals
in a homogeneous set.

274
00:15:38,110 --> 00:15:42,870
This one here, 2 individuals
into a homogeneous set.

275
00:15:42,870 --> 00:15:45,150
Everybody else is all
mixed up with some

276
00:15:45,150 --> 00:15:46,950
other kind of person.

277
00:15:46,950 --> 00:15:49,920
And over here, how many
samples are in

278
00:15:49,920 --> 00:15:51,670
a homogeneous set?

279
00:15:51,670 --> 00:15:52,920
0.

280
00:15:55,260 --> 00:15:59,370
So on the basis of this
analysis, you would conclude

281
00:15:59,370 --> 00:16:02,470
that the ordering of the tests
with respect to their quality

282
00:16:02,470 --> 00:16:04,250
is left to right.

283
00:16:04,250 --> 00:16:07,210
So the best test must
be the shadow test.
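[Editor's sketch] The classroom scoring rule can be stated in a few lines: for each test, count the samples that fall into branches containing only one kind of label. As before, the eight rows are a hypothetical reconstruction consistent with the counts in the lecture, and the function name is an assumption.

```python
# Hypothetical table: (shadow, garlic, complexion, accent, is_vampire).
SAMPLES = [
    ("?",   "yes", "pale",    "none",  False),
    ("yes", "yes", "ruddy",   "none",  False),
    ("?",   "no",  "ruddy",   "none",  True),
    ("no",  "no",  "average", "heavy", True),
    ("?",   "no",  "average", "odd",   True),
    ("yes", "no",  "pale",    "heavy", False),
    ("yes", "no",  "average", "heavy", False),
    ("?",   "yes", "ruddy",   "odd",   False),
]
TESTS = {"shadow": 0, "garlic": 1, "complexion": 2, "accent": 3}

def homogeneity_score(samples, column):
    """Number of samples that land in a branch with a single label."""
    groups = {}
    for row in samples:
        groups.setdefault(row[column], []).append(row[-1])
    return sum(len(labels) for labels in groups.values()
               if len(set(labels)) == 1)

scores = {name: homogeneity_score(SAMPLES, col) for name, col in TESTS.items()}
print(scores)  # {'shadow': 4, 'garlic': 3, 'complexion': 2, 'accent': 0}
```

The scores reproduce the left-to-right ordering from the board: shadow 4, garlic 3, complexion 2, accent 0.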

284
00:16:07,210 --> 00:16:11,840
So let's pick the shadow
test first, see what

285
00:16:11,840 --> 00:16:13,330
we can do with that.

286
00:16:13,330 --> 00:16:16,850
If we pick the shadow test
first, then we have this

287
00:16:16,850 --> 00:16:18,100
arrangement.

288
00:16:20,080 --> 00:16:25,200
We have question mark, and we
have Yes, casts a shadow, and

289
00:16:25,200 --> 00:16:26,610
No, doesn't.

290
00:16:26,610 --> 00:16:28,800
We have 3 minuses here.

291
00:16:28,800 --> 00:16:30,290
We have a plus here.

292
00:16:30,290 --> 00:16:32,890
And unfortunately, over
here, we have plus,

293
00:16:32,890 --> 00:16:34,060
plus, minus, minus.

294
00:16:34,060 --> 00:16:37,757
So we need another test to
divide that group up.

295
00:16:37,757 --> 00:16:38,731
Yes.

296
00:16:38,731 --> 00:16:41,653
STUDENT: How did you get the
4 on the shadow test again?

297
00:16:41,653 --> 00:16:44,100
Why was it 4?

298
00:16:44,100 --> 00:16:44,760
PATRICK WINSTON: Well,
if I look at the

299
00:16:44,760 --> 00:16:48,870
data and I see who--

300
00:16:48,870 --> 00:16:51,100
the question is, what about
that shadow test?

301
00:16:51,100 --> 00:16:52,990
If you look at the shadow test,
and you say, well, there

302
00:16:52,990 --> 00:16:55,270
are 4 question marks.

303
00:16:55,270 --> 00:16:57,210
And if we look and see what kind
of people belong to those

304
00:16:57,210 --> 00:17:00,990
4 question marks, there are 2
vampires and 2 non-vampires.

305
00:17:00,990 --> 00:17:02,760
That's why it's 2 pluses
and 2 minuses.

306
00:17:02,760 --> 00:17:04,148
STUDENT: No, I understand
that.

307
00:17:04,148 --> 00:17:07,853
The question is, how did you
get to the score of 4?

308
00:17:07,853 --> 00:17:10,450
PATRICK WINSTON: Oh, yeah.

309
00:17:10,450 --> 00:17:12,780
The question is how did
I get this number 4?

310
00:17:12,780 --> 00:17:14,829
It has nothing to do with this,
because this is a mixed set.

311
00:17:14,829 --> 00:17:17,810
In fact, I've got three guys in
a homogeneous set here, and

312
00:17:17,810 --> 00:17:19,240
one guy in a homogeneous
set here, and I'm

313
00:17:19,240 --> 00:17:20,398
just adding them up.

314
00:17:20,398 --> 00:17:21,079
STUDENT: OK.

315
00:17:21,079 --> 00:17:23,190
PATRICK WINSTON: So very simple
classroom illustration.

316
00:17:23,190 --> 00:17:24,790
Wouldn't work in practice.

317
00:17:24,790 --> 00:17:25,098
Yes.

318
00:17:25,098 --> 00:17:27,969
STUDENT: How do you adjust
this for larger data sets

319
00:17:27,969 --> 00:17:30,390
where it's unlikely you're going
to have any [INAUDIBLE]?

320
00:17:30,390 --> 00:17:31,770
PATRICK WINSTON: The question
is, how do I adjust this for

321
00:17:31,770 --> 00:17:32,580
larger data sets?

322
00:17:32,580 --> 00:17:33,830
You're one step ahead.

323
00:17:38,540 --> 00:17:40,540
Trust me, I'll be doing large
data sets in a moment.

324
00:17:40,540 --> 00:17:43,550
I just want to get
the idea across.

325
00:17:43,550 --> 00:17:46,210
And I don't want there to be any
thought that the method we

326
00:17:46,210 --> 00:17:50,720
use for larger data sets has got
anything magic about it.

327
00:17:50,720 --> 00:17:52,450
OK, so we're off and running.

328
00:17:52,450 --> 00:17:56,710
And now we have to pick a
test that will divide

329
00:17:56,710 --> 00:17:58,590
those four guys up.

330
00:17:58,590 --> 00:18:02,060
So we're going to have to work
this a little harder, and

331
00:18:02,060 --> 00:18:03,610
repeat the analysis
we did there.

332
00:18:03,610 --> 00:18:05,350
But at least it'll be simpler,
because now we're only

333
00:18:05,350 --> 00:18:07,220
considering 4 samples, not 8.

334
00:18:07,220 --> 00:18:09,930
Just the 4 samples that we still
have to divide up that

335
00:18:09,930 --> 00:18:12,240
have come down that
left branch.

336
00:18:12,240 --> 00:18:13,590
So I have the shadow test.

337
00:18:18,190 --> 00:18:19,580
It has 3 outcomes.

338
00:18:19,580 --> 00:18:21,380
We have the garlic test.

339
00:18:24,000 --> 00:18:26,080
It has 2 outcomes.

340
00:18:26,080 --> 00:18:27,560
Yes and No.

341
00:18:27,560 --> 00:18:29,430
We have the complexion test.

342
00:18:34,290 --> 00:18:37,100
There's 3 outcomes.

343
00:18:37,100 --> 00:18:40,230
Average, pale, and ruddy.

344
00:18:40,230 --> 00:18:42,300
And we have finally
the accent test.

345
00:18:44,860 --> 00:18:50,610
And that comes out to be either
normal, heavy, or odd.

346
00:18:50,610 --> 00:18:53,340
And now, it's a little awkward
to figure out what the results

347
00:18:53,340 --> 00:18:56,500
are for this data
set as shown.

348
00:18:56,500 --> 00:19:00,910
So let me just strike out

349
00:19:00,910 --> 00:19:04,000
the ones that we're no longer
concerned with, and limit our

350
00:19:04,000 --> 00:19:07,240
analysis to the samples for
which the outcome of the

351
00:19:07,240 --> 00:19:08,790
shadow test is a
question mark.

352
00:19:08,790 --> 00:19:10,740
This is exactly the four
people we still need to

353
00:19:10,740 --> 00:19:13,240
separate, right?

354
00:19:13,240 --> 00:19:18,170
So switching colors, keeping
the color the same.

355
00:19:18,170 --> 00:19:20,460
We actually don't want
to do the shadow

356
00:19:20,460 --> 00:19:21,340
test anymore, right?

357
00:19:21,340 --> 00:19:22,810
Because we've already
done that.

358
00:19:22,810 --> 00:19:24,100
There's no point in
doing that again.

359
00:19:24,100 --> 00:19:27,170
We don't have to look at that.

360
00:19:27,170 --> 00:19:30,890
It's already done all the
division of data that it can.

361
00:19:30,890 --> 00:19:32,260
So the garlic test.

362
00:19:32,260 --> 00:19:33,050
Well, let's see.

363
00:19:33,050 --> 00:19:33,920
Garlic.

364
00:19:33,920 --> 00:19:35,480
2 Yeses, 2 Nos.

365
00:19:35,480 --> 00:19:39,040
The Yeses produce Nos and
the Nos produce Yeses.

366
00:19:39,040 --> 00:19:44,210
So if the person does eat
garlic, they're OK.

367
00:19:44,210 --> 00:19:48,686
And if they don't eat garlic,
bad news-- they're vampires.

368
00:19:48,686 --> 00:19:50,320
Well, that looks like
a pretty good test.

369
00:19:50,320 --> 00:19:52,420
But just for the sake of working
it all out, let's try

370
00:19:52,420 --> 00:19:53,820
the others.

371
00:19:53,820 --> 00:19:55,070
Complexion.

372
00:19:56,920 --> 00:19:58,575
2 Ruddies, a Yes, and a No.

373
00:20:05,190 --> 00:20:09,540
1 pale, and that's a No.

374
00:20:09,540 --> 00:20:12,370
1 pale, and that's a No.

375
00:20:12,370 --> 00:20:17,590
And we must have 1 average, and
sure enough, that's a Yes.

376
00:20:17,590 --> 00:20:19,790
Now we can do accent, the one on
the far right, and look at

377
00:20:19,790 --> 00:20:23,820
how that measures up against the
people who are still under

378
00:20:23,820 --> 00:20:25,860
consideration as samples.

379
00:20:25,860 --> 00:20:26,550
Accent.

380
00:20:26,550 --> 00:20:26,990
Let's see.

381
00:20:26,990 --> 00:20:29,010
2 Nones, a Yes and a No.

382
00:20:33,930 --> 00:20:34,840
No Heavies.

383
00:20:34,840 --> 00:20:39,396
2 Odds, a Yes and a No.

384
00:20:39,396 --> 00:20:39,850
All right.

385
00:20:39,850 --> 00:20:42,060
So now we can do the same thing
we did before, and just

386
00:20:42,060 --> 00:20:45,400
say, for sake of classroom
illustration, how many

387
00:20:45,400 --> 00:20:48,230
individuals are put into
homogeneous sets.

388
00:20:48,230 --> 00:20:51,120
And here we have 4.

389
00:20:51,120 --> 00:20:54,830
And here we have 2.

390
00:20:54,830 --> 00:20:58,420
And here we have 0.

391
00:20:58,420 --> 00:21:02,000
So plainly, the garlic test
is the test of choice.

392
00:21:02,000 --> 00:21:04,810
So we go back over here, and
we've completed the work that

393
00:21:04,810 --> 00:21:06,860
we needed to do.

394
00:21:06,860 --> 00:21:09,970
So that's the garlic test.

395
00:21:09,970 --> 00:21:13,070
And that produces 2 pluses.

396
00:21:13,070 --> 00:21:14,630
Let's see.

397
00:21:14,630 --> 00:21:16,340
Eats garlic, Yes.

398
00:21:16,340 --> 00:21:18,480
Eats garlic, No.

399
00:21:18,480 --> 00:21:22,565
I guess the pluses go
over here like so.

400
00:21:22,565 --> 00:21:24,860
And these are the two
ordinary people.

401
00:21:24,860 --> 00:21:26,710
And we're done with our task.
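[Editor's sketch] The whole procedure (score every test, pick the best, then repeat on each mixed branch with the remaining tests) can be sketched as a short recursive function. This is an illustration under stated assumptions, not a definitive implementation: the table is a hypothetical reconstruction matching the lecture's counts, and the names `build` and `score` and the nested-dictionary output are invented here.

```python
# Hypothetical table: (shadow, garlic, complexion, accent, is_vampire).
SAMPLES = [
    ("?",   "yes", "pale",    "none",  False),
    ("yes", "yes", "ruddy",   "none",  False),
    ("?",   "no",  "ruddy",   "none",  True),
    ("no",  "no",  "average", "heavy", True),
    ("?",   "no",  "average", "odd",   True),
    ("yes", "no",  "pale",    "heavy", False),
    ("yes", "no",  "average", "heavy", False),
    ("?",   "yes", "ruddy",   "odd",   False),
]
TESTS = {"shadow": 0, "garlic": 1, "complexion": 2, "accent": 3}

def group_by(samples, column):
    groups = {}
    for row in samples:
        groups.setdefault(row[column], []).append(row)
    return groups

def score(samples, column):
    """Classroom score: samples placed into single-label branches."""
    return sum(len(rows) for rows in group_by(samples, column).values()
               if len({r[-1] for r in rows}) == 1)

def build(samples, tests):
    """Greedily grow an identification tree from the remaining tests."""
    labels = {row[-1] for row in samples}
    if len(labels) == 1:               # homogeneous set: make a leaf
        return "vampire" if labels.pop() else "ordinary"
    if not tests:
        raise ValueError("mixed set but no tests left to apply")
    best = max(tests, key=lambda t: score(samples, tests[t]))
    remaining = {t: c for t, c in tests.items() if t != best}
    return {best: {outcome: build(rows, remaining)
                   for outcome, rows in group_by(samples, tests[best]).items()}}

tree = build(SAMPLES, TESTS)
print(tree)
```

On these rows the function picks the shadow test first and the garlic test for the question-mark branch, matching the tree worked out on the board.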

402
00:21:26,710 --> 00:21:30,080
And now you can quickly run
off and put this into your

403
00:21:30,080 --> 00:21:33,360
PDA, and forever be protected
against the possibility that

404
00:21:33,360 --> 00:21:36,135
one of those vampires got out
in the flood of people that

405
00:21:36,135 --> 00:21:38,380
came in from Eastern Europe.

406
00:21:38,380 --> 00:21:42,220
Except what do we do with
a large data set?

407
00:21:42,220 --> 00:21:44,280
Well, the trouble is,
a large data set's

408
00:21:44,280 --> 00:21:45,530
not likely to produce--

409
00:21:52,130 --> 00:21:55,320
if you have a large data set,
no test is likely to put

410
00:21:55,320 --> 00:21:58,310
together any homogeneous
set right off.

411
00:21:58,310 --> 00:22:00,300
So you never get started.

412
00:22:00,300 --> 00:22:02,210
Everything would be 0.

413
00:22:02,210 --> 00:22:05,110
Every test would say, oh it
doesn't put anybody into

414
00:22:05,110 --> 00:22:06,280
homogeneous sets.

415
00:22:06,280 --> 00:22:07,530
So you're screwed.
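[Editor's sketch] A small example shows why the homogeneous-count score stops being useful on larger data. The two 80-sample splits below are invented for illustration: one test separates the classes almost perfectly, the other is no better than a coin flip, yet both score 0 because no branch is perfectly pure.

```python
def homogeneity_score(branches):
    """Count samples that land in branches holding only one kind of label."""
    return sum(len(labels) for labels in branches.values()
               if len(set(labels)) == 1)

# Hypothetical outcomes of two tests on 80 samples ('+' marks a vampire).
nearly_perfect = {"yes": ["-"] * 39 + ["+"], "no": ["+"] * 39 + ["-"]}
coin_flip = {"yes": ["-"] * 20 + ["+"] * 20, "no": ["-"] * 20 + ["+"] * 20}

print(homogeneity_score(nearly_perfect))  # 0
print(homogeneity_score(coin_flip))       # 0
```

A single stray sample in each branch is enough to hide the difference between a very informative test and a useless one, which is the motivation for a graded measure of disorder.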

416
00:22:10,910 --> 00:22:16,000
You need some other, more
sophisticated way of measuring

417
00:22:16,000 --> 00:22:17,580
how disordered this data is.

418
00:22:17,580 --> 00:22:22,040
Or how disordered these sets
are that you find at the

419
00:22:22,040 --> 00:22:24,740
bottom of the tree branches.

420
00:22:24,740 --> 00:22:25,500
That's what you need.

421
00:22:25,500 --> 00:22:31,130
You need a way of measuring
disorder of these sets that

422
00:22:31,130 --> 00:22:34,740
you find at the bottom of these
branches, so you can

423
00:22:34,740 --> 00:22:37,750
find a kind of overall quality
to the test based on your

424
00:22:37,750 --> 00:22:40,250
measurement of disorder.

425
00:22:40,250 --> 00:22:44,010
Now, the first heuristic of a
good life is, when you have a

426
00:22:44,010 --> 00:22:47,120
problem to solve, ask somebody
who knows the answer.

427
00:22:47,120 --> 00:22:48,180
It's the least amount of work.

428
00:22:48,180 --> 00:22:51,060
It's not even as hard
as going to Google.

429
00:22:51,060 --> 00:22:55,010
So who would you ask
about ways of

430
00:22:55,010 --> 00:22:58,260
measuring disorder in sets?

431
00:22:58,260 --> 00:22:59,510
There are two possible
answers.

432
00:23:05,050 --> 00:23:07,060
STUDENT: You could
just do entropy.

433
00:23:07,060 --> 00:23:07,325
PATRICK WINSTON: What?

434
00:23:07,325 --> 00:23:09,840
STUDENT: Find the entropy
of the set.

435
00:23:09,840 --> 00:23:11,345
PATRICK WINSTON: Who
studies entropy?

436
00:23:11,345 --> 00:23:13,350
STUDENT: Probability.

437
00:23:13,350 --> 00:23:15,402
PATRICK WINSTON: What
kind of classes?

438
00:23:15,402 --> 00:23:15,898
STUDENT: Physics.

439
00:23:15,898 --> 00:23:16,890
STUDENT: Thermodynamics.

440
00:23:16,890 --> 00:23:18,415
PATRICK WINSTON:
Thermodynamics!

441
00:23:18,415 --> 00:23:21,125
The thermodynamicists are good
at measuring disorder, because

442
00:23:21,125 --> 00:23:22,900
that's what thermodynamics
is all about.

443
00:23:22,900 --> 00:23:25,640
Entropy increasing over time,
and all that sort of stuff.

444
00:23:25,640 --> 00:23:28,834
There's another equally
good answer.

445
00:23:28,834 --> 00:23:30,720
STUDENT: Statisticians?

446
00:23:30,720 --> 00:23:33,400
PATRICK WINSTON:
Statisticians.

447
00:23:33,400 --> 00:23:37,680
Perhaps, but it's not the
second best answer.

448
00:23:37,680 --> 00:23:39,190
It's actually not even
the best answer.

449
00:23:39,190 --> 00:23:39,930
That's the best answer.

450
00:23:39,930 --> 00:23:40,643
What's your name?

451
00:23:40,643 --> 00:23:41,510
STUDENT: Leo.

452
00:23:41,510 --> 00:23:42,760
PATRICK WINSTON: Oh, yeah.

453
00:23:45,322 --> 00:23:49,150
[LAUGHTER]

454
00:23:49,150 --> 00:23:50,870
PATRICK WINSTON: Leonardo has
got his finger on it.

455
00:23:50,870 --> 00:23:53,420
The information theorists are
pretty good at measuring

456
00:23:53,420 --> 00:23:57,440
disorder, because that's what
information is all about, too.

457
00:23:57,440 --> 00:24:00,130
So we might as well borrow a
mechanism for measuring the

458
00:24:00,130 --> 00:24:03,970
disorder of a set from those
information theory guys.

459
00:24:03,970 --> 00:24:05,710
So what we're going to
do is exactly that.

460
00:24:08,610 --> 00:24:11,760
Let's put it over here, so we'll
have it handy when we

461
00:24:11,760 --> 00:24:13,060
want to try to measure
those things.

462
00:24:16,910 --> 00:24:21,800
The gospel according to
information theorists is that

463
00:24:21,800 --> 00:24:28,910
the disorder, D, of some set
is equal to-- now let's

464
00:24:28,910 --> 00:24:32,500
suppose that this is a
set of binary values.

465
00:24:32,500 --> 00:24:34,720
So we have positives and
then we have negatives.

466
00:24:34,720 --> 00:24:36,490
Pluses and minuses.

467
00:24:36,490 --> 00:24:39,640
But pluses, they don't go very
well in an algebraic equation,

468
00:24:39,640 --> 00:24:41,790
because they might be confused
with adding.

469
00:24:41,790 --> 00:24:45,900
So I'm going to say P and N. And
then it'll be the total,

470
00:24:45,900 --> 00:24:48,030
which is P plus N. We only
have two choices,

471
00:24:48,030 --> 00:24:50,090
positive and negative.

472
00:24:50,090 --> 00:24:53,850
So the disorder of a set,
according to those guys, is equal

473
00:24:53,850 --> 00:24:59,400
to minus the number of positives
over the total

474
00:24:59,400 --> 00:25:05,440
number, times the log to the
base 2 of the positives over

475
00:25:05,440 --> 00:25:13,260
the total, minus the negatives
over the total, times the log

476
00:25:13,260 --> 00:25:16,270
to the base 2 of the negatives
over the total.

477
00:25:16,270 --> 00:25:18,780
Those negatives look a little
worrisome, because you think,

478
00:25:18,780 --> 00:25:20,210
well, maybe this thing
can go negative.

479
00:25:20,210 --> 00:25:21,390
But that's not going
to be true, right?

480
00:25:21,390 --> 00:25:26,170
Because these ratios are all
less than 1, and the logarithm

481
00:25:26,170 --> 00:25:29,830
of something that's less
than 1 is negative.

482
00:25:29,830 --> 00:25:32,770
So we're OK.
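
The formula just stated can be sketched in a few lines of Python. This is only an illustration: the function name `disorder` and the arguments `p` and `n` (counts of positives and negatives) are assumptions made here, not names from the lecture. A class with zero members is taken to contribute 0, which is the limiting value of the term.

```python
import math

def disorder(p, n):
    """Disorder, in bits, of a set with p positives and n negatives."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:  # an empty class contributes 0 (limit of x * log2(x))
            ratio = count / total
            result -= ratio * math.log2(ratio)
    return result

print(disorder(4, 4))  # evenly mixed set
print(disorder(8, 0))  # homogeneous set
print(disorder(2, 1))  # a 2/3 to 1/3 mix
```

Because each ratio is at most 1, each logarithm is at most 0, so every subtracted term is non-positive and the result never goes negative.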

483
00:25:32,770 --> 00:25:37,040
So that's a lovely way of
measuring disorder.

484
00:25:37,040 --> 00:25:39,020
And then we ought to draw
a graph of what

485
00:25:39,020 --> 00:25:40,270
that curve looks like.

486
00:25:44,370 --> 00:25:47,720
And what we're going to graph
it against is the ratio of

487
00:25:47,720 --> 00:25:52,500
positives to the total number.

488
00:25:52,500 --> 00:25:56,730
So that's going to be an axis
where we go from 0 to 1.

489
00:26:01,140 --> 00:26:05,750
So let's just find a couple
of useful values.

490
00:26:05,750 --> 00:26:08,920
And by the way, it pays to pay
attention to these curves,

491
00:26:08,920 --> 00:26:11,890
because if you pay attention
to this stuff, you can work

492
00:26:11,890 --> 00:26:14,980
the quiz questions on
this very rapidly.

493
00:26:14,980 --> 00:26:19,460
Otherwise, we see people getting
out their calculators

494
00:26:19,460 --> 00:26:24,020
and quickly becoming both
lost and screwed.

495
00:26:24,020 --> 00:26:26,160
OK so let's see.

496
00:26:26,160 --> 00:26:29,100
Let's suppose that the number
of positives is equal to the

497
00:26:29,100 --> 00:26:29,810
number of negatives.

498
00:26:29,810 --> 00:26:32,260
So we've got a completely
mixed-up set.

499
00:26:32,260 --> 00:26:34,470
It has no bias in either
direction.

500
00:26:34,470 --> 00:26:41,310
So in that case, if P over T is
equal to 1/2, then this is

501
00:26:41,310 --> 00:26:51,230
equal to minus 1/2, times
the logarithm of 1/2.

502
00:26:51,230 --> 00:26:53,360
And I guess, since they're
both the same, we

503
00:26:53,360 --> 00:26:55,400
can multiply by two.

504
00:26:55,400 --> 00:26:56,650
And what's that value?

505
00:27:00,596 --> 00:27:04,768
[INAUDIBLE], what does that
calculate out to?

506
00:27:04,768 --> 00:27:06,716
STUDENT: Minus [INAUDIBLE]

507
00:27:06,716 --> 00:27:09,650
PATRICK WINSTON: Minus
[INAUDIBLE].

508
00:27:09,650 --> 00:27:11,790
Well, with a minus sign, you
just turn the argument upside

509
00:27:11,790 --> 00:27:12,950
down, so it's log(2).

510
00:27:12,950 --> 00:27:14,890
So what's log(2)?

511
00:27:14,890 --> 00:27:18,920
Logarithm to the base 2 of 2?

512
00:27:18,920 --> 00:27:19,870
1!

513
00:27:19,870 --> 00:27:22,720
So this whole thing is--

514
00:27:22,720 --> 00:27:23,020
STUDENT: 1.

515
00:27:23,020 --> 00:27:24,000
PATRICK WINSTON: 1.

516
00:27:24,000 --> 00:27:27,650
So [INAUDIBLE], in her soft way,
says, well, let's see.

517
00:27:27,650 --> 00:27:28,800
2 times 1/2.

518
00:27:28,800 --> 00:27:29,710
That cancels out.

519
00:27:29,710 --> 00:27:32,680
The minus, that flips the
arguments so it's log to the

520
00:27:32,680 --> 00:27:35,640
base 2 of 2, and that's 1.

521
00:27:35,640 --> 00:27:38,785
So this whole thing, you
work out the algebra,

522
00:27:38,785 --> 00:27:40,710
it gives you 1.

523
00:27:40,710 --> 00:27:43,740
So that's cool.

524
00:27:43,740 --> 00:27:47,460
So right here in the middle
where they're equal, we get a

525
00:27:47,460 --> 00:27:48,710
value of 1.

526
00:27:51,800 --> 00:27:54,610
Next thing we need to do is
let's calculate what happens

527
00:27:54,610 --> 00:27:59,610
if P over T is equal to 1.

528
00:27:59,610 --> 00:28:01,450
That is to say, everything
is a positive.

529
00:28:01,450 --> 00:28:03,100
Any guesses?

530
00:28:03,100 --> 00:28:06,410
Maybe 10, 20, minus 15?

531
00:28:06,410 --> 00:28:08,650
Let's work it out.

532
00:28:08,650 --> 00:28:16,946
So if P over T equals 1, that
would be minus 1 times the log

533
00:28:16,946 --> 00:28:21,120
to the base 2 of 1.

534
00:28:21,120 --> 00:28:22,642
What's that?

535
00:28:22,642 --> 00:28:24,040
STUDENT: [INAUDIBLE]

536
00:28:24,040 --> 00:28:25,570
PATRICK WINSTON: A 0?

537
00:28:25,570 --> 00:28:25,910
Oh, yeah.

538
00:28:25,910 --> 00:28:30,430
Because 2 raised to
the 0 is 1.

539
00:28:30,430 --> 00:28:31,715
So this part is 0.

540
00:28:34,310 --> 00:28:35,560
Now, what about this
other part?

541
00:28:38,810 --> 00:28:41,190
If everything's a P, then
nothing's an N.

542
00:28:41,190 --> 00:28:42,320
So we've got 0.

543
00:28:42,320 --> 00:28:44,820
And we can quit already.

544
00:28:44,820 --> 00:28:45,590
Well, not quite.

545
00:28:45,590 --> 00:28:46,710
We ought to work it out.

546
00:28:46,710 --> 00:28:50,290
Log to the base 2 of 0.

547
00:28:50,290 --> 00:28:51,344
What's that?

548
00:28:51,344 --> 00:28:53,320
STUDENT: [INAUDIBLE]

549
00:28:53,320 --> 00:28:54,710
PATRICK WINSTON: Who?

550
00:28:54,710 --> 00:28:56,110
Minus infinity?

551
00:28:56,110 --> 00:28:57,100
Uh oh.

552
00:28:57,100 --> 00:29:00,140
0 times minus infinity is-- what?
I didn't get that when I was

553
00:29:00,140 --> 00:29:02,470
in high school.

554
00:29:02,470 --> 00:29:06,550
Finally, 18.01 makes
a difference.

555
00:29:06,550 --> 00:29:07,390
Finally.

556
00:29:07,390 --> 00:29:09,630
What's the answer?

557
00:29:09,630 --> 00:29:16,280
We're interested in the limit as
N over T goes to 0, right?

558
00:29:16,280 --> 00:29:20,392
And when you have a deal like
this, what do you do?

559
00:29:20,392 --> 00:29:25,330
You use that famous rule, that
we all mispronounce when we

560
00:29:25,330 --> 00:29:27,600
see it written, right?

561
00:29:27,600 --> 00:29:31,240
We use the good old El
Hospital's rule.

562
00:29:31,240 --> 00:29:33,210
OK, it's L'Hopital.

563
00:29:33,210 --> 00:29:34,610
L'Hopital's Rule.

564
00:29:34,610 --> 00:29:36,720
You have to differentiate
the--

565
00:29:36,720 --> 00:29:40,880
I guess we differentiate this
guy as a ratio or something,

566
00:29:40,880 --> 00:29:42,610
and see what happens
when it goes to 0.

567
00:29:42,610 --> 00:29:46,130
And what we get when we use
L'Hopital's Rule is that, oh

568
00:29:46,130 --> 00:29:50,100
thank God, this is still zero.
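
One way to carry out that step, writing x for the ratio N over T (the symbol x is introduced here only for the sketch), is to rewrite the product as a quotient and differentiate top and bottom:

```latex
\lim_{x \to 0^{+}} x \log_2 x
  = \lim_{x \to 0^{+}} \frac{\log_2 x}{1/x}
  = \lim_{x \to 0^{+}} \frac{1/(x \ln 2)}{-1/x^{2}}
  = \lim_{x \to 0^{+}} \left( -\frac{x}{\ln 2} \right)
  = 0
```

So the term for the empty class contributes nothing, and the disorder of a homogeneous set is 0.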

569
00:29:50,100 --> 00:29:52,980
So now we know that we have a
point up there and a point

570
00:29:52,980 --> 00:29:55,740
down there.

571
00:29:55,740 --> 00:29:57,410
So now we've got three
points on the curve,

572
00:29:57,410 --> 00:29:58,660
and we can draw it.

573
00:30:01,310 --> 00:30:02,560
It goes like that.

574
00:30:05,140 --> 00:30:06,260
No, it doesn't go like that.

575
00:30:06,260 --> 00:30:08,400
It's obviously a Gaussian,
right?

576
00:30:08,400 --> 00:30:10,240
Because everything in
nature is a Gaussian.

577
00:30:10,240 --> 00:30:11,640
Can you put that laptop
away, please?

578
00:30:11,640 --> 00:30:13,260
Everything in nature
is a Gaussian, so

579
00:30:13,260 --> 00:30:14,510
it looks like this.

580
00:30:18,158 --> 00:30:21,020
That right?

581
00:30:21,020 --> 00:30:23,840
No, actually, not everything
in nature is a Gaussian.

582
00:30:23,840 --> 00:30:26,880
And in particular, this one
isn't a Gaussian either.

583
00:30:26,880 --> 00:30:30,400
It looks more like one of those
metal things they used

584
00:30:30,400 --> 00:30:32,660
to call quonset huts.

585
00:30:32,660 --> 00:30:34,612
That's what it looks like.

586
00:30:34,612 --> 00:30:37,100
Boom, like so.

587
00:30:37,100 --> 00:30:40,270
So that is the curve
of interest.

588
00:30:40,270 --> 00:30:43,540
Now, did God say that using this
way of measuring disorder

589
00:30:43,540 --> 00:30:45,950
was the best way?

590
00:30:45,950 --> 00:30:51,930
No, God has not indicated
any choice here.

591
00:30:51,930 --> 00:30:55,570
We use this because it's a
convenient mechanism, it seems

592
00:30:55,570 --> 00:30:58,440
to make sense, but in contrast
to the reason it's used in

593
00:30:58,440 --> 00:31:01,690
information theory, it's not
the result of some elegant

594
00:31:01,690 --> 00:31:02,270
mathematics.

595
00:31:02,270 --> 00:31:04,870
It's just a borrowing of
something that seems to work

596
00:31:04,870 --> 00:31:06,870
pretty well.

597
00:31:06,870 --> 00:31:09,560
Any of those curves would work
just about the same, because

598
00:31:09,560 --> 00:31:11,240
all we're doing with
it is measuring how

599
00:31:11,240 --> 00:31:13,780
disordered a set is.

600
00:31:13,780 --> 00:31:18,950
So one thing to note here is
that in this situation, where

601
00:31:18,950 --> 00:31:20,350
we're dealing with
two choices--

602
00:31:20,350 --> 00:31:23,340
P and N, positives
and negatives--

603
00:31:23,340 --> 00:31:26,450
we get a curve that
maxes out at one.

604
00:31:26,450 --> 00:31:28,830
And notice that it kind of gets
up there pretty fast.

605
00:31:28,830 --> 00:31:33,890
In fact, if you're down here
at 2/3, you're up here,

606
00:31:33,890 --> 00:31:39,090
this is about 0.9.

607
00:31:39,090 --> 00:31:43,770
So it gives you a large number
for quite a bit of that area

608
00:31:43,770 --> 00:31:45,940
in the middle.

609
00:31:45,940 --> 00:31:49,210
So that, unfortunately, still
doesn't tell us everything we

610
00:31:49,210 --> 00:31:49,550
need to know.

611
00:31:49,550 --> 00:31:53,700
That tells us how to measure the
disorder in one of these sets.

612
00:31:53,700 --> 00:31:55,680
But we want to know how to
measure the quality of the

613
00:31:55,680 --> 00:31:57,750
test overall.

614
00:31:57,750 --> 00:32:00,950
So we need some mechanism that
says, OK, given that this test

615
00:32:00,950 --> 00:32:04,370
produces three different sets,
and we now have a measure of

616
00:32:04,370 --> 00:32:08,100
the disorder in each of these
sets, how do we measure the

617
00:32:08,100 --> 00:32:11,496
overall quality of the test?

618
00:32:11,496 --> 00:32:14,360
Well, you could just add
up the disorder.

619
00:32:14,360 --> 00:32:16,630
Let's write that down, because
that sounds good.

620
00:32:23,960 --> 00:32:33,200
So you can say that the quality
of a test is equal to

621
00:32:33,200 --> 00:32:36,160
some sum over the
sets produced.

622
00:32:41,280 --> 00:32:42,640
And what we're going to do is
we're going to add up the

623
00:32:42,640 --> 00:32:45,620
disorder of each
of those sets.

624
00:32:49,380 --> 00:32:53,220
I'm almost home, except that
this means we're going to give

625
00:32:53,220 --> 00:33:00,210
equal weight to a branch that
has almost nothing down it--

626
00:33:00,210 --> 00:33:03,190
we're going to give the same
weight to that as a branch

627
00:33:03,190 --> 00:33:05,920
that has almost everything
going down it.

628
00:33:05,920 --> 00:33:07,150
So that doesn't seem
to make sense.

629
00:33:07,150 --> 00:33:12,050
So one final flourish is we're
going to weight this sum

630
00:33:12,050 --> 00:33:16,360
according to the fraction of the
samples that end up down

631
00:33:16,360 --> 00:33:18,200
that branch.

632
00:33:18,200 --> 00:33:21,840
So it's, as usual, easier to
write it down than to say it.

633
00:33:21,840 --> 00:33:27,370
So we're going to multiply
that times the number of

634
00:33:27,370 --> 00:33:41,530
samples in the set, divided
by the number of

635
00:33:41,530 --> 00:33:51,390
samples handled by the test.

636
00:33:54,610 --> 00:33:57,570
So if half the samples go down
a branch, and if that branch

637
00:33:57,570 --> 00:34:01,500
has a certain disorder, then
we're going to multiply that

638
00:34:01,500 --> 00:34:04,090
disorder times 1/2.
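
The weighted sum described above might be sketched as follows. The names `disorder` and `weighted_disorder`, and the representation of each branch as a `(positives, negatives)` pair, are assumptions made for this illustration.

```python
import math

def disorder(p, n):
    """Disorder, in bits, of a set with p positives and n negatives."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:  # an empty class contributes 0
            ratio = count / total
            result -= ratio * math.log2(ratio)
    return result

def weighted_disorder(branches):
    """Quality of a test: each branch's disorder, weighted by the
    fraction of the test's samples that went down that branch."""
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * disorder(p, n) for p, n in branches)

# Hypothetical test: half the samples land in an evenly mixed set,
# the other half in a homogeneous one.
print(weighted_disorder([(1, 1), (2, 0)]))  # 0.5
```

Smaller is better here: a test whose branches are all homogeneous scores 0.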

639
00:34:04,090 --> 00:34:04,570
All right.

640
00:34:04,570 --> 00:34:08,610
So now let's see how it works
with our sample problem.

641
00:34:08,610 --> 00:34:11,139
Well, here is our sample data.

642
00:34:11,139 --> 00:34:12,780
And we didn't need anything
fancy for it.

643
00:34:12,780 --> 00:34:16,270
But let's pretend it was
a large data set.

644
00:34:16,270 --> 00:34:16,790
Well, let's see.

645
00:34:16,790 --> 00:34:17,590
What would we do?

646
00:34:17,590 --> 00:34:23,020
Well, go down this
way, there are 4

647
00:34:23,020 --> 00:34:24,639
samples down that direction.

648
00:34:24,639 --> 00:34:26,790
That's half of the total
number of samples.

649
00:34:26,790 --> 00:34:28,139
So whatever we find down
there, we're going

650
00:34:28,139 --> 00:34:29,820
to multiply by 1/2.

651
00:34:29,820 --> 00:34:32,150
This one we're going
to multiply by 3/8.

652
00:34:32,150 --> 00:34:35,889
And this one we're going
to multiply by 1/8.

653
00:34:35,889 --> 00:34:37,969
Now, what do we actually find at
the bottom of these things?

654
00:34:37,969 --> 00:34:40,170
Well, here's a homogeneous
set.

655
00:34:40,170 --> 00:34:41,770
Everything's the same.

656
00:34:41,770 --> 00:34:44,560
So we go to that curve and say,
what is the disorder of a

657
00:34:44,560 --> 00:34:45,949
homogeneous set?

658
00:34:45,949 --> 00:34:47,199
It's zero.

659
00:34:50,380 --> 00:34:52,090
Let's see, they're
all the same.

660
00:34:52,090 --> 00:34:57,640
I guess that means it's
0 over there.

661
00:34:57,640 --> 00:35:04,470
So the disorder of this set
of three samples is zero.

662
00:35:04,470 --> 00:35:07,260
The disorder of this set
of one sample, all

663
00:35:07,260 --> 00:35:10,110
the same, is zero.

664
00:35:10,110 --> 00:35:13,720
The disorder of this set--
well, let's see.

665
00:35:13,720 --> 00:35:16,780
Half of the samples there are
plus, and half are minus, so

666
00:35:16,780 --> 00:35:20,830
we go over to our curve, and we
say, what's the disorder of

667
00:35:20,830 --> 00:35:23,640
something with equal mixture
of pluses and minuses?

668
00:35:23,640 --> 00:35:25,500
And that's one.

669
00:35:25,500 --> 00:35:28,560
So the disorder of
this guy is one.

670
00:35:28,560 --> 00:35:33,590
So now we've got 1/2 times 1,
and 3/8 times 0, 1/8 times 0.

671
00:35:33,590 --> 00:35:38,660
So the quality of this
particular test, as determined

672
00:35:38,660 --> 00:35:43,770
by the disorder of the sets
it produces, is 1/2.

673
00:35:43,770 --> 00:35:45,020
0.5.

674
00:35:48,420 --> 00:35:50,420
Let's do this one.

675
00:35:50,420 --> 00:35:53,910
So we have 3/8 coming
down this way, 5/8

676
00:35:53,910 --> 00:35:55,270
coming down this way.

677
00:35:55,270 --> 00:35:57,610
3/8 is multiplied by
the disorder of a

678
00:35:57,610 --> 00:35:58,730
set of uniform things.

679
00:35:58,730 --> 00:36:01,370
That's disorder 0.

680
00:36:01,370 --> 00:36:04,540
So this guy over here,
let's see.

681
00:36:04,540 --> 00:36:09,160
That's 2/5 and 3/5
multiplied--

682
00:36:09,160 --> 00:36:11,010
You know, this is one of those
deals where if you look at the

683
00:36:11,010 --> 00:36:14,840
curve, you're pretty close
to the middle.

684
00:36:14,840 --> 00:36:18,470
And that curve goes all the
way up to about 0.9 there.

685
00:36:18,470 --> 00:36:20,560
So you can kind of just look at
this, and eyeball it, and

686
00:36:20,560 --> 00:36:27,010
say, well, whatever it is, the
overall, this is going to be

687
00:36:27,010 --> 00:36:29,060
something multiplied
times 5/8.

688
00:36:29,060 --> 00:36:31,680
Something like 0.9 times 5/8.

689
00:36:31,680 --> 00:36:35,440
So let's just say, for the
sake of discussion, that

690
00:36:35,440 --> 00:36:39,550
that's going to be about 0.6,
which is within a hundredth, I

691
00:36:39,550 --> 00:36:41,340
think, of being right.

692
00:36:41,340 --> 00:36:44,220
Just kind of guessing.

693
00:36:44,220 --> 00:36:45,440
OK, well now we're on a roll.

694
00:36:45,440 --> 00:36:49,040
Here, we have 3/8 coming down
this branch, 3/8 coming down

695
00:36:49,040 --> 00:36:52,640
this branch, 1/4 coming
down this branch.

696
00:36:52,640 --> 00:36:54,510
This is 0.

697
00:36:54,510 --> 00:36:58,910
And this is one of those deals
where these two are about 0.9.

698
00:36:58,910 --> 00:37:05,680
So it looks like it's going
to be 3/8 plus 3/8 is 3/4.

699
00:37:05,680 --> 00:37:07,640
Times about 0.9.

700
00:37:07,640 --> 00:37:09,602
So that's going to turn
out to be about 0.7.

701
00:37:17,710 --> 00:37:19,230
So one last go here.

702
00:37:19,230 --> 00:37:24,850
3/8, 3/8, and 1/4.

703
00:37:24,850 --> 00:37:26,280
Oh, that's interesting.

704
00:37:26,280 --> 00:37:29,890
Because these two
are what we got

705
00:37:29,890 --> 00:37:32,490
contributed up to that 0.7.

706
00:37:32,490 --> 00:37:34,390
This one is 1/4 times--

707
00:37:34,390 --> 00:37:37,090
this is evenly divided,
so that's going to

708
00:37:37,090 --> 00:37:40,380
have disorder of 1.

709
00:37:40,380 --> 00:37:43,910
So that's going to be
0.25 bigger than the

710
00:37:43,910 --> 00:37:45,610
number we got over here.

711
00:37:45,610 --> 00:37:51,410
So that's going to end
up being about 0.95.

712
00:37:51,410 --> 00:37:53,980
So thank God our answer is the
same as we got with our

713
00:37:53,980 --> 00:37:57,130
simple classroom measurement
of disorder.

714
00:37:57,130 --> 00:37:59,800
Except this is measuring how
disordered stuff is, we want

715
00:37:59,800 --> 00:38:02,520
the small number, not
the big number.

716
00:38:02,520 --> 00:38:06,110
So once again, based on this
analysis, you'll be sure to

717
00:38:06,110 --> 00:38:10,730
pick the shadow test, because
0.5 is less than 0.6, which is

718
00:38:10,730 --> 00:38:13,640
less than 0.7, which
is less than 0.95.
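
The four eyeballed numbers can be checked with a short script. The branch sizes and mixes below follow what is said in the lecture (for example 4, 3, and 1 samples for the shadow test, with the set of 4 evenly split); which class is the majority within a mixed branch is an assumption made here, and it does not affect the disorder. The test names and function names are likewise only labels for this sketch.

```python
import math

def disorder(p, n):
    """Disorder, in bits, of a set with p positives and n negatives."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:  # an empty class contributes 0
            ratio = count / total
            result -= ratio * math.log2(ratio)
    return result

def weighted_disorder(branches):
    """Sum of branch disorders, weighted by fraction of samples per branch."""
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * disorder(p, n) for p, n in branches)

# (positives, negatives) per branch; the sign assignments are assumed.
tests = {
    "shadow": [(2, 2), (0, 3), (1, 0)],
    "garlic": [(0, 3), (3, 2)],
    "complexion": [(2, 1), (1, 2), (0, 2)],
    "accent": [(1, 2), (1, 2), (1, 1)],
}

scores = {name: weighted_disorder(b) for name, b in tests.items()}
ranking = sorted(scores, key=scores.get)  # least disorder first
for name in ranking:
    print(name, round(scores[name], 2))
```

Under these assumed counts the exact values come out near 0.50, 0.61, 0.69, and 0.94, close to the eyeballed figures and in the same order.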

719
00:38:13,640 --> 00:38:16,310
So that accent test is
really horrible.

720
00:38:16,310 --> 00:38:18,380
Don't use it.

721
00:38:18,380 --> 00:38:20,160
Just because somebody has a
heavy accent doesn't mean

722
00:38:20,160 --> 00:38:21,430
they're a vampire.

723
00:38:21,430 --> 00:38:24,240
In fact, most vampires have
worked very hard on their

724
00:38:24,240 --> 00:38:26,450
accent, as I mentioned before.

725
00:38:26,450 --> 00:38:28,470
All right, so now we know that
we're still going to pick the

726
00:38:28,470 --> 00:38:32,440
shadow test as our first go.

727
00:38:32,440 --> 00:38:34,040
So that's good.

728
00:38:34,040 --> 00:38:36,460
Now, let's see if we can repeat
the exercise with our

729
00:38:36,460 --> 00:38:39,290
second selection, the one we
have to have to pick those

730
00:38:39,290 --> 00:38:40,910
guys apart.

731
00:38:40,910 --> 00:38:42,760
And this is going to be easier,
because there are

732
00:38:42,760 --> 00:38:44,280
fewer things to work with.

733
00:38:44,280 --> 00:38:45,700
Ooh, wow, look.

734
00:38:45,700 --> 00:38:47,930
That's 0.

735
00:38:47,930 --> 00:38:49,030
That's 0.

736
00:38:49,030 --> 00:38:50,530
That's 1/2.

737
00:38:50,530 --> 00:38:53,000
That's 1/2.

738
00:38:53,000 --> 00:38:58,030
So the disorder of
this guy is 0.0.

739
00:38:58,030 --> 00:39:05,630
So this is 1/4, 1/4,
1/2, 0, 0.

740
00:39:05,630 --> 00:39:07,140
1/2 times 1.

741
00:39:07,140 --> 00:39:09,590
Ooh, that's 0.5.

742
00:39:09,590 --> 00:39:10,500
That was easy.

743
00:39:10,500 --> 00:39:11,740
How about this one?

744
00:39:11,740 --> 00:39:13,300
Oh, he says 1.

745
00:39:13,300 --> 00:39:13,500
Let's see.

746
00:39:13,500 --> 00:39:14,210
That's 1.

747
00:39:14,210 --> 00:39:15,170
That's 1.

748
00:39:15,170 --> 00:39:16,130
That's 1/2.

749
00:39:16,130 --> 00:39:16,890
That's 1/2.

750
00:39:16,890 --> 00:39:20,130
Yeah, it is one.

751
00:39:20,130 --> 00:39:22,480
So sure enough, the answer also
comes out to be the same

752
00:39:22,480 --> 00:39:24,450
as before, when we did our just

753
00:39:24,450 --> 00:39:27,160
simple intuition exercise.

754
00:39:27,160 --> 00:39:28,710
So I don't know.

755
00:39:28,710 --> 00:39:33,895
Christopher, is this all about
using information theory?

756
00:39:33,895 --> 00:39:34,320
STUDENT: No.

757
00:39:34,320 --> 00:39:35,570
PATRICK WINSTON: No, no, no.

758
00:39:38,230 --> 00:39:40,010
See, it's not about the math.

759
00:39:40,010 --> 00:39:41,000
It's about the intuition.

760
00:39:41,000 --> 00:39:43,630
And the intuition is that you
want to build a tree that's as

761
00:39:43,630 --> 00:39:44,750
simple as possible.

762
00:39:44,750 --> 00:39:47,500
And you can build a tree that's
as simple as possible

763
00:39:47,500 --> 00:39:50,610
if you look at the data, and
say, well, which test does the

764
00:39:50,610 --> 00:39:52,640
best job of splitting
things up?

765
00:39:52,640 --> 00:39:56,150
Which test does the best job of
building subsets underneath

766
00:39:56,150 --> 00:39:59,310
it that are as homogeneous
as possible?

767
00:39:59,310 --> 00:40:03,330
So all this information theory,
all this entropy

768
00:40:03,330 --> 00:40:07,290
stuff, is just a convenient
mechanism for doing something

769
00:40:07,290 --> 00:40:09,440
that is intuitionally sound.

770
00:40:09,440 --> 00:40:10,175
OK?

771
00:40:10,175 --> 00:40:11,840
It's not about information
theory.

772
00:40:11,840 --> 00:40:15,990
It's about a sound intuition.

773
00:40:15,990 --> 00:40:16,400
Oh, by the way.

774
00:40:16,400 --> 00:40:19,440
Does this kind of stuff ever
get used in practice?

775
00:40:19,440 --> 00:40:21,780
Tens of thousands of times.

776
00:40:21,780 --> 00:40:25,400
This is a winning mechanism
that's used over and over

777
00:40:25,400 --> 00:40:30,320
again, even when the
data is numeric.

778
00:40:30,320 --> 00:40:32,170
How would it work if
it's numeric data?

779
00:40:32,170 --> 00:40:33,660
Well, let's think about
that for a little bit.

780
00:40:41,820 --> 00:40:45,620
So let's suppose that we
have an opportunity.

781
00:40:45,620 --> 00:40:48,430
We're an EMT or something,
we work in the infirmary.

782
00:40:48,430 --> 00:40:49,360
What do they call
it these days?

783
00:40:49,360 --> 00:40:49,940
Something else.

784
00:40:49,940 --> 00:40:52,500
But anyhow, you work in that
kind of area, and you have the

785
00:40:52,500 --> 00:40:55,420
opportunity to take people's
temperature.

786
00:40:55,420 --> 00:41:00,230
And so over time, you've
accumulated some data on the

787
00:41:00,230 --> 00:41:02,320
temperature of people.

788
00:41:02,320 --> 00:41:04,020
And maybe you've found that
there's a vampire

789
00:41:04,020 --> 00:41:07,140
here at about 102.

790
00:41:07,140 --> 00:41:09,980
There's a normal person
here, about 98.6.

791
00:41:09,980 --> 00:41:12,000
But then they're scattered
around.

792
00:41:14,590 --> 00:41:16,960
Some people have fevers
when they come in.

793
00:41:16,960 --> 00:41:19,950
So the question is, is there a
way of using numerical data--

794
00:41:19,950 --> 00:41:22,750
things that you can
put real numbers--

795
00:41:22,750 --> 00:41:25,180
is there a way of using that
with this mechanism?

796
00:41:25,180 --> 00:41:26,320
And the answer is yes.

797
00:41:26,320 --> 00:41:29,180
You just say, is the temperature
greater than or

798
00:41:29,180 --> 00:41:30,910
less than some threshold?

799
00:41:30,910 --> 00:41:33,620
And that gives you a test, a
binary test, just like any of

800
00:41:33,620 --> 00:41:36,460
these other tests.

801
00:41:36,460 --> 00:41:37,000
[? Krishna? ?]

802
00:41:37,000 --> 00:41:38,930
Right?

803
00:41:38,930 --> 00:41:40,180
But where would I put
the threshold?

804
00:41:44,660 --> 00:41:48,480
I suppose I could just put
it at the average value.

805
00:41:48,480 --> 00:41:51,100
But that might not be the place
that does the best job

806
00:41:51,100 --> 00:41:57,400
of splitting the samples into
homogeneous groups.

807
00:41:57,400 --> 00:41:57,690
Christopher?

808
00:41:57,690 --> 00:41:59,788
STUDENT: So you run this
numerical analysis on

809
00:41:59,788 --> 00:42:01,150
different places with different
thresholds.

810
00:42:01,150 --> 00:42:04,070
PATRICK WINSTON: So you try
different places, he says.

811
00:42:04,070 --> 00:42:05,340
And he's right.

812
00:42:05,340 --> 00:42:07,980
Because this is a computer,
this is our slave.

813
00:42:07,980 --> 00:42:09,930
We don't care how much
it works to figure

814
00:42:09,930 --> 00:42:11,660
out the right threshold.

815
00:42:11,660 --> 00:42:15,610
So what we do is we say, well,
maybe the threshold's halfway

816
00:42:15,610 --> 00:42:17,835
between those two guys, or
halfway between those two

817
00:42:17,835 --> 00:42:19,450
guys, or those two guys,
or those two guys,

818
00:42:19,450 --> 00:42:20,940
or those two guys.

819
00:42:20,940 --> 00:42:22,540
So we can try one
less threshold

820
00:42:22,540 --> 00:42:23,600
than we have samples.

821
00:42:23,600 --> 00:42:25,600
And we don't care if there are
10,000 samples, because this

822
00:42:25,600 --> 00:42:28,720
is a computer, and we don't care
if it works all night.

823
00:42:28,720 --> 00:42:32,750
So that's how you find the
threshold for a numeric test.
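
A minimal sketch of that threshold search, assuming each sample is a `(temperature, is_vampire)` pair. The temperatures below are made up for illustration, and `best_threshold` is a name chosen here, not one from the lecture.

```python
import math

def disorder(p, n):
    """Disorder, in bits, of a set with p positives and n negatives."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:  # an empty class contributes 0
            ratio = count / total
            result -= ratio * math.log2(ratio)
    return result

def best_threshold(samples):
    """Try the midpoint between each pair of adjacent values and return
    (threshold, weighted disorder) for the split with least disorder."""
    ordered = sorted(samples)
    best = None
    for (a, _), (b, _) in zip(ordered, ordered[1:]):
        if a == b:
            continue  # no room for a threshold between equal values
        t = (a + b) / 2
        sides = ([lab for v, lab in ordered if v < t],
                 [lab for v, lab in ordered if v >= t])
        q = sum(len(s) / len(ordered) * disorder(sum(s), len(s) - sum(s))
                for s in sides)
        if best is None or q < best[1]:
            best = (t, q)
    return best

# Hypothetical data: one non-vampire has a fever of 101.5.
temps = [(98.1, False), (98.6, False), (99.0, False),
         (101.5, False), (102.0, True), (102.4, True)]
threshold, quality = best_threshold(temps)
print(threshold, quality)
```

With n distinct values there are at most n - 1 candidate thresholds, so the search is cheap even for large data sets.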

824
00:42:32,750 --> 00:42:34,520
By the way, I assured you
earlier on you would never use

825
00:42:34,520 --> 00:42:35,360
the same test twice.

826
00:42:35,360 --> 00:42:38,190
Is that true for this?

827
00:42:38,190 --> 00:42:39,910
Yes, you would still never
use the same test twice.

828
00:42:39,910 --> 00:42:41,320
But what you might do is you
might use a different

829
00:42:41,320 --> 00:42:44,860
threshold on the same
measurement

830
00:42:44,860 --> 00:42:46,940
the next time around.

831
00:42:46,940 --> 00:42:49,480
So when you start having
numerical data, you may find

832
00:42:49,480 --> 00:42:57,730
yourself using the same test
with the same axis but with a

833
00:42:57,730 --> 00:43:00,000
different value.

834
00:43:00,000 --> 00:43:00,330
All right.

835
00:43:00,330 --> 00:43:05,500
So now that we have this, then
we can go back and compare how

836
00:43:05,500 --> 00:43:10,660
this method would look when we
put it up against the sort of

837
00:43:10,660 --> 00:43:15,290
stuff we were talking about
last time, with

838
00:43:15,290 --> 00:43:16,700
the electrical covers.

839
00:43:20,980 --> 00:43:25,020
So with the electrical covers,
we had a situation like this.

840
00:43:25,020 --> 00:43:25,470
I don't know.

841
00:43:25,470 --> 00:43:31,290
We had samples that were places
like this, and we had a

842
00:43:31,290 --> 00:43:34,510
division of the space that looked
pretty much like that.

843
00:43:37,220 --> 00:43:42,000
Not quite exactly in the right
spots, but pretty close.

844
00:43:42,000 --> 00:43:45,660
So these are the decision
boundaries for the situation

845
00:43:45,660 --> 00:43:47,540
where we are using nearest
neighbors to

846
00:43:47,540 --> 00:43:50,608
divide up the data.

847
00:43:50,608 --> 00:43:53,310
What would the decision
boundaries look like if these

848
00:43:53,310 --> 00:43:57,940
were four different kinds of
things, and we were using this

849
00:43:57,940 --> 00:43:59,190
kind of mechanism?

850
00:44:01,560 --> 00:44:04,640
And maybe there's a lot of
samples all clustered around

851
00:44:04,640 --> 00:44:07,170
places like that.

852
00:44:07,170 --> 00:44:09,280
What would the decision
boundaries look like?

853
00:44:09,280 --> 00:44:11,830
Would they be the
same as this?

854
00:44:11,830 --> 00:44:12,470
God, I hope not.

855
00:44:12,470 --> 00:44:14,400
Why?

856
00:44:14,400 --> 00:44:16,700
Because what we're going to
do is we're going to use a

857
00:44:16,700 --> 00:44:20,320
threshold on each axis.

858
00:44:20,320 --> 00:44:22,990
So therefore, the decision
boundaries are going to be

859
00:44:22,990 --> 00:44:26,100
parallel to one axis
or the other.

860
00:44:26,100 --> 00:44:29,460
So we might decide,
for example--

861
00:44:29,460 --> 00:44:30,220
Oh, shoot.

862
00:44:30,220 --> 00:44:32,480
I think I'll draw it again,
because it'll get confused if

863
00:44:32,480 --> 00:44:34,530
I draw it over the other one.

864
00:44:34,530 --> 00:44:37,720
So it looks like this.

865
00:44:37,720 --> 00:44:40,270
And that's how nearest
neighbors does it.

866
00:44:40,270 --> 00:44:44,540
But an identification tree
approach will pick a threshold

867
00:44:44,540 --> 00:44:45,870
along one axis or the other.

868
00:44:45,870 --> 00:44:48,340
Let's say it's this axis.

869
00:44:48,340 --> 00:44:49,580
It's only got one
choice there.

870
00:44:49,580 --> 00:44:53,370
So it's going to put
a line there.

871
00:44:53,370 --> 00:44:55,820
And now, what's the next
thing it does?

872
00:44:55,820 --> 00:44:59,370
Well, it still has these
two different kinds

873
00:44:59,370 --> 00:45:00,250
of things to separate.

874
00:45:00,250 --> 00:45:01,300
We're going to assume
we've got four

875
00:45:01,300 --> 00:45:03,380
different kinds of things.

876
00:45:03,380 --> 00:45:06,220
So it's going to say, oh!

877
00:45:06,220 --> 00:45:12,650
I've come down the negative
side, so I need a threshold on

878
00:45:12,650 --> 00:45:14,440
the remaining data.

879
00:45:14,440 --> 00:45:17,260
And these are the only two
things that are now remaining.

880
00:45:17,260 --> 00:45:23,030
So my only choice is to put
a threshold in there.

881
00:45:23,030 --> 00:45:26,370
Now I guarantee this, absolutely
guaranteed--

882
00:45:26,370 --> 00:45:28,570
on the quiz, somebody--

883
00:45:28,570 --> 00:45:31,090
presumably somebody who doesn't
go to lectures--

884
00:45:31,090 --> 00:45:32,890
will draw that line all
the way across.

885
00:45:32,890 --> 00:45:35,730
And that's desperately wrong.

886
00:45:35,730 --> 00:45:39,780
Because we've already divided
this data set in half.

887
00:45:39,780 --> 00:45:43,170
Now the choice of what we do
over here is governed only by

888
00:45:43,170 --> 00:45:46,270
the remaining samples that
we see, these two.

889
00:45:46,270 --> 00:45:49,420
And so the threshold is going
to go in there like that.
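
[Editor's sketch, not part of the lecture: the point above, that the second threshold is chosen from only the samples remaining on one side of the first split, can be illustrated roughly as follows. The four sample points, the names SAMPLES, candidate_thresholds, and build_tree, and the shortcut of taking the first available threshold instead of minimizing disorder are all assumptions made for illustration.]

```python
# Hypothetical 2-D samples: (x, y, label). Four kinds of things, one of each.
SAMPLES = [
    (1.0, 1.0, "A"),
    (1.0, 3.0, "B"),
    (4.0, 1.5, "C"),
    (4.0, 3.5, "D"),
]


def candidate_thresholds(samples, axis):
    """Midpoints between adjacent distinct values along one axis."""
    values = sorted({s[axis] for s in samples})
    return [(a + b) / 2 for a, b in zip(values, values[1:])]


def build_tree(samples):
    """Recursively split; each call sees ONLY the samples passed to it."""
    labels = {s[2] for s in samples}
    if len(labels) == 1:
        return labels.pop()  # homogeneous subset: this is a leaf
    for axis in (0, 1):
        thresholds = candidate_thresholds(samples, axis)
        if thresholds:
            # Simplification: take the first threshold found rather than
            # the one that minimizes disorder.
            t = thresholds[0]
            below = [s for s in samples if s[axis] < t]
            above = [s for s in samples if s[axis] >= t]
            return (axis, t, build_tree(below), build_tree(above))
    raise ValueError("identical points with different labels")


tree = build_tree(SAMPLES)
print(tree)
```

[In this sketch the two sides of the first split get different thresholds on the other axis, each computed from its own two samples, so neither line runs all the way across the plane.]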

890
00:45:52,150 --> 00:45:55,050
So that's what happens
when you go back.

891
00:45:55,050 --> 00:45:59,270
This is used tens of thousands
of times.

892
00:45:59,270 --> 00:46:00,250
Always used.

893
00:46:00,250 --> 00:46:01,460
What are the virtues of it?

894
00:46:01,460 --> 00:46:05,360
Number one, you don't
use all the tests.

895
00:46:05,360 --> 00:46:07,630
You use only the tests that seem
to be doing some useful

896
00:46:07,630 --> 00:46:09,250
work for you.

897
00:46:09,250 --> 00:46:11,750
So that means that you do a
better job, because your

898
00:46:11,750 --> 00:46:13,580
measurement technique
is simpler.

899
00:46:13,580 --> 00:46:17,220
And it costs less, because
you're not going to the

900
00:46:17,220 --> 00:46:21,140
expense of doing all
of the testing.

901
00:46:21,140 --> 00:46:22,910
So it's a real winner.

902
00:46:22,910 --> 00:46:24,230
But you know what?

903
00:46:24,230 --> 00:46:26,200
Some classes of people--

904
00:46:26,200 --> 00:46:30,420
not scientists, but I mean
people like doctors and stuff.

905
00:46:30,420 --> 00:46:33,750
They don't like to look
at these trees.

906
00:46:33,750 --> 00:46:36,170
They're kind of rule-oriented.

907
00:46:36,170 --> 00:46:40,370
So they look at a tree like this
for determining what kind of

908
00:46:40,370 --> 00:46:45,420
thyroid disease you have, and
it would have maybe 20 or so

909
00:46:45,420 --> 00:46:48,670
tests in it of various kinds
of hormones, like thyroxine

910
00:46:48,670 --> 00:46:50,200
and this and that.

911
00:46:50,200 --> 00:46:52,560
And they say, ah, we can't
deal with that.

912
00:46:52,560 --> 00:46:56,400
So we have to work with them.

913
00:46:56,400 --> 00:47:01,230
So what we do is we convert the
tree into a set of rules.

914
00:47:01,230 --> 00:47:03,030
How do we convert the tree
into a set of rules?

915
00:47:09,720 --> 00:47:12,560
Oops, wrong one.

916
00:47:12,560 --> 00:47:13,810
Go away, go away.

917
00:47:16,310 --> 00:47:17,030
Here's what I want.

918
00:47:17,030 --> 00:47:18,070
Yeah, good.

919
00:47:18,070 --> 00:47:21,480
How would we convert this tree
into a set of rules?

920
00:47:21,480 --> 00:47:22,470
It's straightforward.

921
00:47:22,470 --> 00:47:23,680
[INAUDIBLE], what do we do?

922
00:47:23,680 --> 00:47:25,390
STUDENT: You'd basically just
look down each branch--

923
00:47:25,390 --> 00:47:26,730
PATRICK WINSTON: You'd basically
just go down each

924
00:47:26,730 --> 00:47:28,510
branch to a leaf.

925
00:47:28,510 --> 00:47:31,990
So you say, for example,
here's one rule.

926
00:47:31,990 --> 00:47:44,970
If shadow equals question mark,
and garlic equals oh,

927
00:47:44,970 --> 00:47:45,770
[INAUDIBLE]

928
00:47:45,770 --> 00:47:47,020
want to choose No.

929
00:47:50,012 --> 00:47:51,880
Doesn't eat garlic.

930
00:47:51,880 --> 00:47:52,350
No.

931
00:47:52,350 --> 00:47:54,130
I think I'll say Yes.

932
00:47:54,130 --> 00:47:54,810
Yes.

933
00:47:54,810 --> 00:47:56,060
That changes the answer.

934
00:48:03,740 --> 00:48:07,430
Then if it eats garlic, it's
not a vampire, right?

935
00:48:07,430 --> 00:48:09,700
That's one of four possible
rules, because there are four

936
00:48:09,700 --> 00:48:12,440
leaf nodes.

937
00:48:12,440 --> 00:48:15,900
Now, almost done.

938
00:48:15,900 --> 00:48:17,170
We are done, except
for one thing.

939
00:48:17,170 --> 00:48:20,160
We can actually take these four
rules, and start thinking

940
00:48:20,160 --> 00:48:21,950
about how to simplify them.

941
00:48:21,950 --> 00:48:26,140
You can ask questions like, if
I have a rule that tests both

942
00:48:26,140 --> 00:48:30,280
the shadow and the garlic, do I
actually need both of those

943
00:48:30,280 --> 00:48:30,550
antecedents?

944
00:48:30,550 --> 00:48:32,950
And the answer is, in
many cases, no.

945
00:48:32,950 --> 00:48:35,890
And in particular,
in this case, no.

946
00:48:35,890 --> 00:48:40,700
Because if we look at our data
set, what we discover is that

947
00:48:40,700 --> 00:48:45,180
in the event that we're
talking about a shadow

948
00:48:45,180 --> 00:48:46,285
question mark--

949
00:48:46,285 --> 00:48:49,240
oh, I guess I had a better
choice the other way.

950
00:48:49,240 --> 00:48:49,880
Oh, no.

951
00:48:49,880 --> 00:48:53,200
If you look at the garlic,
all the garlics--

952
00:48:53,200 --> 00:48:55,300
Yes, Yes, and Yes--

953
00:48:55,300 --> 00:48:57,600
it turns out that the answer is
no, independent of what the

954
00:48:57,600 --> 00:48:59,630
shadow condition is.

955
00:48:59,630 --> 00:49:01,680
So we can look at the rules,
and in some cases, we'll

956
00:49:01,680 --> 00:49:04,050
discover that our tree is a
little bit more complicated

957
00:49:04,050 --> 00:49:04,730
than it needs to be.

958
00:49:04,730 --> 00:49:06,800
We can actually get rid of
some of the clauses.
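
[Editor's sketch, not part of the lecture: one way to check whether an antecedent is needed is to drop it and see whether every sample that still matches agrees with the rule's conclusion. The rows in DATA are illustrative assumptions, chosen only so that all three garlic eaters are non-vampires as stated above; the names DATA, matches, and simplify are hypothetical.]

```python
# Illustrative samples (assumed, not quoted from the lecture's table).
DATA = [
    {"shadow": "?",   "garlic": "yes", "vampire": False},
    {"shadow": "yes", "garlic": "yes", "vampire": False},
    {"shadow": "?",   "garlic": "no",  "vampire": True},
    {"shadow": "no",  "garlic": "no",  "vampire": True},
    {"shadow": "?",   "garlic": "no",  "vampire": True},
    {"shadow": "yes", "garlic": "no",  "vampire": False},
    {"shadow": "yes", "garlic": "no",  "vampire": False},
    {"shadow": "?",   "garlic": "yes", "vampire": False},
]


def matches(sample, conditions):
    """True if the sample satisfies every (feature, value) antecedent."""
    return all(sample[f] == v for f, v in conditions)


def simplify(conditions, conclusion, data):
    """Drop each antecedent whose removal never changes the answer on data."""
    kept = list(conditions)
    for cond in list(conditions):
        trial = [c for c in kept if c != cond]
        covered = [s for s in data if matches(s, trial)]
        if covered and all(s["vampire"] == conclusion for s in covered):
            kept = trial  # the data never needed this test
    return kept


rule = [("shadow", "?"), ("garlic", "yes")]
print(simplify(rule, False, DATA))
```

[On this assumed data the shadow test drops out of the garlic=yes rule, while the rule for shadow=? and garlic=no keeps both tests.]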

959
00:49:06,800 --> 00:49:10,340
So in the end, we can develop a
very simple mechanism based

960
00:49:10,340 --> 00:49:12,760
on good old-fashioned rule-based
behavior, like you

961
00:49:12,760 --> 00:49:15,820
saw almost in the beginning
of the subject,

962
00:49:15,820 --> 00:49:16,940
that does the job.

963
00:49:16,940 --> 00:49:22,710
And now, without any royalty,
you're all free to put this in

964
00:49:22,710 --> 00:49:25,610
your PDA and use it to protect
yourself in the days to come,

965
00:49:25,610 --> 00:49:27,430
especially since Halloween's
just around the corner.