The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIC GRIMSON: OK. Welcome back. You know, it's that time of term when we're all kind of doing this. So let me see if I can get a few smiles by simply noting to you that two weeks from today is the last class. Should be worth at least a little bit of a smile, right? Professor Guttag is smiling. He likes that idea. You're almost there.

What have we been doing for the last couple of lectures? We've been talking about linear regression. And I just want to remind you, this was the idea that I have some experimental data, the case of a spring where I put different weights on and measure the displacements. And regression was giving us a way of deducing a model to fit that data. In some cases it was easy. We knew, for example, it was going to be a linear model, and we found the best line that would fit that data. In some cases, we said we could use validation to actually let us explore to find the best model that would fit it, whether linear, quadratic, cubic, or some higher-order thing. So we were using that to deduce something about a model.

That's a nice segue into the topic for the next three lectures, the last big topic of the class, which is machine learning. And I'm going to argue, you can debate whether that's actually an example of learning, but it has many of the elements that we want to talk about when we talk about machine learning.

So as always, there's a reading assignment. Chapter 22 of the book gives you a good start on this, and we'll follow up with other pieces. And I want to start by basically outlining what we're going to do. And I'm going to begin by saying, as I'm sure you're aware, this is a huge topic.
I've listed just five subjects in Course 6 that all focus on machine learning. And that doesn't include other subjects where learning is a central part. So natural language processing, computational biology, computer vision, and robotics all rely heavily today on machine learning, and you'll see it in those subjects as well.

So we're not going to compress five subjects into three lectures. But what we are going to do is give you the introduction. We're going to start by talking about the basic concepts of machine learning: the idea of having examples, how you talk about features representing those examples, how you measure distances between them, and how you use the notion of distance to try and group similar things together as a way of doing machine learning.

And we're going to look, as a consequence, at two different standard ways of doing learning. One we call classification methods; an example we're going to see is something called "k nearest neighbor." The second class is called clustering methods. Classification works well when I have what we would call labeled data. I know labels on my examples, and I'm going to use that to try and define classes that I can learn. Clustering works well when I don't have labeled data. We'll see what that means in a couple of minutes. But we're going to give you an early view of this.

Unless Professor Guttag changes his mind, we're probably not going to show you the current really sophisticated machine learning methods like convolutional neural nets or deep learning, things you'll read about in the news. But you're going to get a sense of what's behind those by looking at what we do when we talk about learning algorithms.

Before I do that, I want to point out to you just how prevalent this is. And I'm going to admit, with my gray hair, I started working in AI in 1975, when machine learning was a pretty simple thing to do.
And it's been fascinating to watch the change over 40 years. And if you think about it, just think about where you see it.

AlphaGo, a machine learning-based system from Google that beat a world-class Go player. Chess had already been conquered by computers for a while; Go now belongs to computers too. The best Go players in the world are computers.

I'm sure many of you use Netflix. Any recommendation system, Netflix, Amazon, pick your favorite, uses a machine learning algorithm to suggest things for you. And in fact, you've probably seen it on Google, right? The ads that pop up on Google are coming from a machine learning algorithm that's looking at your preferences. Scary thought.

Drug discovery. Character recognition: the post office does character recognition of handwritten characters using a machine learning algorithm with a computer vision system behind it.

You probably don't know this company. It's actually an MIT spin-off called Two Sigma; it's a hedge fund in New York. They heavily use AI and machine learning techniques, and two years ago their fund returned 56%. I wish I'd invested in the fund. I don't have the kinds of millions you need, but that's an impressive return, 56% on your money in one year. Last year they didn't do quite as well, but they do extremely well using machine learning techniques.

Siri. Another great MIT company called Mobileye that does computer vision systems with a heavy machine learning component, used in assistive driving and eventually in completely autonomous driving. It will do things like kick in your brakes if you're closing too fast on the car in front of you, which is going to be really bad for me because I drive like a Bostonian, and it would be kicking in constantly.

Face recognition. Facebook uses this, as do many other systems, to both detect and recognize faces. IBM Watson, cancer diagnosis.
These are all just examples of machine learning being used everywhere. And it really is. I've only picked nine.

So what is it? I'm going to make an obnoxious statement; you're now used to that. I'm going to claim that you could argue that almost every computer program learns something. But the level of learning really varies a lot.

If you think back to the first lecture in 6.0001, we showed you Newton's method for computing square roots. And you could argue, you'd have to stretch it, but you could argue that that method learns something about how to compute square roots. In fact, you could generalize it to roots of any power. But it really didn't learn. I really had to program it.

All right, think about last week when we talked about linear regression. Now it starts to feel a little bit more like a learning algorithm. Because what did we do? We gave you a set of data points, mass-displacement data points. And then we showed you how the computer could essentially fit a curve to those data points. And it was, in some sense, learning a model for that data that it could then use to predict behavior in other situations. And that's getting closer to what we would like when we think about a machine learning algorithm. We'd like to have a program that can learn from experience, something that it can then use to deduce new facts.

Now this has been a problem in AI for a very long time. And I love this quote. It's from a gentleman named Art Samuel, from 1959, in which he gives his definition of machine learning: the field of study that gives computers the ability to learn without being explicitly programmed. And I think many people would argue he wrote the first such program. It learned from experience. In his case, it played checkers. It kind of shows you how the field has progressed: we started with checkers, we got to chess, we now do Go.
But it played checkers. It beat national-level players. Most importantly, it learned to improve its methods by watching how it did in games and then inferring something to change what it thought about as it did that. Samuel did a bunch of other things; I've just highlighted one. You may see in a follow-on course that he invented what's called alpha-beta pruning, which is a really useful technique for doing search.

But the idea is: how can we have the computer learn without being explicitly programmed? And one way to think about this is to think about the difference between how we would normally program and what we would like from a machine learning algorithm.

Normal programming, I know you're not convinced there's such a thing as normal programming, but if you think of traditional programming, what's the process? I write a program that I input to the computer so that it can then take data and produce some appropriate output. And the square root finder really fits there, right? I wrote code using Newton's method to find a square root, and then it gave me a process that, given any number, would give me its square root.

But if you think about what we did last time, it was a little different. In fact, in a machine learning approach, the idea is that I'm going to give the computer output. I'm going to give it examples of what I want the program to do, labels on data, characterizations of different classes of things. And what I want the computer to do is, given that characterization of output and data, I want the machine learning algorithm to actually produce for me a program, a program that I can then use to infer new information about things. And that creates, if you like, a really nice loop where I can have the machine learning algorithm learn a program, which I can then use to solve some other problem. That would be really great if we could do it.
And as I suggested, that curve-fitting algorithm is a simple version of that. It learned a model for the data, which I could then use to label other instances of the data or to predict what I would see in terms of spring displacement as I changed the masses. So that's the kind of idea we're going to explore.

If we want to learn things, we could also ask, how do you learn? And how should a computer learn? Well, for you as a human, there are a couple of possibilities. This is the boring one. This is the old-style way of doing it, right? Memorize facts. Memorize as many facts as you can and hope that we ask you on the final exam about instances of those facts, as opposed to some other facts you haven't memorized. This is, if you think way back to the first lecture, an example of declarative knowledge, statements of truth. Memorize as many as you can. Have Wikipedia in your back pocket.

A better way to learn is to be able to infer, to deduce new information from old. And if you think about this, this gets closer to what we called imperative knowledge, ways to deduce new things. Now, in the first case, we built that in when we wrote that program to do square roots. But what we'd like in a learning algorithm is something much more like that generalization idea. We're interested in extending our capabilities to write programs that can infer useful information from implicit patterns in the data. So not something explicitly built in, like that comparison of weights and displacements, but actually implicit patterns in the data, and have the algorithm figure out what those patterns are and use them to generate a program you can use to infer new data about objects, about spring displacements, whatever it is you're trying to do.

OK. So the idea then, the basic paradigm that we're going to see, is that we're going to give the system some training data, some observations. We did that last time with just the spring displacements.
We're then going to try and have a way to figure out how we write code, how we write a program, a system, that will infer something about the process that generated the data. And then, from that, we want to be able to use it to make predictions about things we haven't seen before.

So again, I want to drive home this point. If you think about it, the spring example fit that model. I gave you a set of data, spring displacements relative to the masses. For different masses, how far did the spring move? I then inferred something about the underlying process. In the first case, I said I know it's linear, but let me figure out what the actual linear equation is, what the spring constant associated with it is. And based on that result, I got a piece of code I could use to predict new displacements. So it's got all of those elements: training data, an inference engine, and then the ability to use that to make new predictions.

But that's a very simple kind of learning setting. The more common one is the one I'm going to use as an example, which is when I give you a set of examples, and those examples have some data associated with them, some features, and some labels. For each example, I might say this one is a particular kind of thing, and this other one is another kind of thing. And what I want to do is figure out how to do inference on labeling new things. So it's not just, what's the displacement of the mass; it's actually a label.

And I'm going to use one of my favorite examples. I'm a big New England Patriots fan; if you're not, my apologies. But I'm going to use football players. So I'm going to show you in a second, I'm going to give you a set of examples of football players. The label is the position they play. And the data, well, it could be lots of things. We're going to use height and weight.
But what we want to do is then see how we would come up with a way of characterizing the implicit pattern of how weight and height predict the kind of position this player could play, and then come up with an algorithm that will predict the position of new players. We'll do the draft for next year. Where do we want them to play?

That's the paradigm. A set of observations, potentially labeled, potentially not. Think about how we do inference to find a model. And then think about how we use that model to make predictions.

What we're going to see, and we're going to see multiple examples today, is that that learning can be done in one of two very broad ways. The first one is called supervised learning. In that case, for every example I give you as part of the training data, I have a label on it. I know the kind of thing it is. And what I'm going to do is look for how I find a rule that would predict the label associated with unseen input, based on those examples. It's supervised because I know what the labeling is.

The second kind, if this is supervised, the obvious other one is called unsupervised. In that case, I'm just going to give you a bunch of examples, but I don't know the labels associated with them. I'm going to just try and find what the natural ways are to group those examples together into different models. In some cases, I may know how many models there are; in some cases, I may want to just say what's the best grouping I can find.

OK. What I'm going to do today is not a lot of code. I was expecting cheers for that, John, but I didn't get them. Not a lot of code. What I'm going to do is show you, basically, the intuitions behind doing this learning. And I'm going to start with my New England Patriots example. So here are some data points about current Patriots players. And I've got two kinds of positions: I've got receivers, and I have linemen.
Each one is just labeled by the name, the height in inches, and the weight in pounds. OK? Five of each.

If I plot those on a two-dimensional plot, this is what I get. OK? No big deal. What am I trying to do? I'm trying to learn, are there characteristics that distinguish the two classes from one another? And in the unlabeled case, all I have is just a set of examples. So what I want to do is decide what makes two players similar, with the goal of seeing, can I separate this distribution into two or more natural groups?

Similar is a distance measure. It says, how do I take two examples with values or features associated with them and decide how far apart they are? And in the unlabeled case, the simple way to do it is to say, if I know that there are at least k groups there, in this case, I'm going to tell you there are two different groups there, how could I decide how best to cluster things together so that all the examples in one group are close to each other, all the examples in the other group are close to each other, and the two groups are reasonably far apart?

There are many ways to do it. I'm going to show you one. It's a very standard way, and it works, basically, as follows. If all I know is that there are two groups there, I'm going to start by just picking two examples as my exemplars. Pick them at random. Actually, at random is not great; I don't want to pick two that are too close to each other, so I'm going to try and pick them far apart. But I pick two examples as my exemplars. And for all the other examples in the training data, I say which one it is closest to. What I'm going to try and do is create clusters with the property that the distances between all of the examples of each cluster are small, the average distance is small, and see if I can find clusters that get the average distance for both clusters as small as possible.
This algorithm works by picking two examples, then clustering all the other examples by simply saying, put each one in the group whose exemplar it's closest to. Once I've got those clusters, I'm going to find the median element of each group, not the mean, but the median, the one closest to the center, treat those as the new exemplars, and repeat the process. And I'll just do that either some number of times or until I don't get any change in the process. So it's clustering based on distance. And we'll come back to distance in a second.

So here's what that would do with my football players. If I just did this based on weight, there's the natural dividing line. And it kind of makes sense, right? These three are obviously clustered; again, it's just on this axis, they're all down here. These seven are at a different place. There's a natural dividing line there.

If I were to do it based on height, it's not as clean. This is what my algorithm came up with as the best dividing line here, meaning that these four, again just based on this axis, are close together, and these six are close together. But it's not nearly as clean. And that's part of the issue we'll look at: how do I find the best clusters?

If I use both height and weight, I get that, which is actually kind of nice, right? Those three cluster together; they're near each other in terms of just distance in the plane. Those seven are near each other. There's a nice, natural dividing line through here. And in fact, that gives me a classifier. This line is the equidistant line between the centers of those two clusters, meaning any point along this line is the same distance to the center of that group as it is to that group. And so for any new example, if it's above the line, I would say it gets that label; if it's below the line, it gets that label. In a second, we'll come back to look at how we measure the distances, but the idea here is pretty simple.
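To make that loop concrete, here is a minimal sketch of the exemplar-based clustering just described. It is not the course's actual code: the (height, weight) tuples are made-up stand-ins for the roster numbers, and it assumes plain Euclidean distance, which is only one of the distance choices we'll come back to.

    import math

    def euclidean(p, q):
        # Straight-line distance between two equal-length feature tuples.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def cluster(examples, exemplars, num_iterations=100):
        # Assign every example to its nearest exemplar, replace each exemplar
        # with the median element of its group (the member closest on average
        # to the rest of the group), and repeat until nothing changes.
        for _ in range(num_iterations):
            groups = [[] for _ in exemplars]
            for e in examples:
                dists = [euclidean(e, ex) for ex in exemplars]
                groups[dists.index(min(dists))].append(e)
            new_exemplars = [min(g, key=lambda c: sum(euclidean(c, o) for o in g))
                             for g in groups]
            if new_exemplars == exemplars:   # no change, so stop early
                return groups
            exemplars = new_exemplars
        return groups

    # Hypothetical (height in inches, weight in pounds) tuples, not the real roster
    players = [(70, 180), (72, 190), (73, 205), (75, 250),
               (76, 305), (77, 315), (78, 300)]
    print(cluster(players, exemplars=[players[0], players[-1]]))

The version we'll actually build later worries much more about how the initial exemplars get picked and about what "distance" should mean; this is just the skeleton of assign, re-center, repeat.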
I want to find groupings whose members are near each other and far apart from the other group.

Now suppose I actually knew the labels on these players. These are the receivers; those are the linemen. And for those of you who are football fans, you can figure it out, right? Those are the two tight ends; they are much bigger. I think that's Bennett and that's Gronk, if you're really a big Patriots fan. But those are tight ends, those are wide receivers, and that's going to come back in a second, but there are the labels.

Now what I want to do is say, if I could take advantage of knowing the labels, how would I divide these groups up? And that's kind of easy to see. The basic idea, in this case, is that if I've got labeled groups in that feature space, what I want to do is find a subsurface that naturally divides that space. Now subsurface is a fancy word. It says, in the two-dimensional case, I want to know what's the best line, if I can find a single line, that separates all the examples with one label from all the examples with the second label. We'll see that if the examples are well separated, this is easy to do, and it's great. But in some cases it's going to be more complicated, because some of the examples may be very close to one another. And that's going to raise a problem that you saw last lecture: I want to avoid overfitting. I don't want to create a really complicated surface to separate things. And so we may have to tolerate a few incorrectly labeled things if we can't pull them apart.

And as you've already figured out, in this case, with the labeled data, there's the best-fitting line right there. Anybody over 280 pounds is going to be a great lineman. Anybody under 280 pounds is more likely to be a receiver.

OK. So I've got two different ways of trying to think about doing this labeling. I'm going to come back to both of them in a second. Now suppose I add in some new data.
I want to label new instances. Now these are actually players at a different position, these are running backs, but I say all I know about is receivers and linemen. I get these two new data points, and I'd like to know, are they more likely to be a receiver or a lineman? And there's the data for these two gentlemen.

So if I now go back to plotting them, you notice one of the issues. There are my linemen, the red ones are my receivers, and the two black dots are the two running backs. And notice, right here, it's going to be really hard to separate those two examples from one another. They are so close to each other. And that's going to be one of the things we have to trade off.

But if I think about using what I learned as a classifier with unlabeled data, there were my two clusters. Now you see, oh, I've got an interesting example. This new example I would say is clearly more like a receiver than a lineman. But that one there is unclear. It lies almost exactly along that dividing line between those two clusters. And I would either say I want to rethink the clustering, or I want to say, you know what, maybe there aren't two clusters here. Maybe there are three. And I want to classify them a little differently. So I'll come back to that.

On the other hand, if I had used the labeled data, there was my dividing line. This is really easy. Both of those new examples are clearly below the dividing line. They are clearly examples that I would categorize as being more like receivers than they are like linemen. And I know it's a football example. If you don't like football, pick another example. But you get the sense of why I can use the data in the labeled case and the unlabeled case to come up with different ways of building the clusters.

So what we're going to do over the next two and a half lectures is look at how we can write code to learn that way of separating things out.
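Here is one hedged way to turn the labeled picture into code, again with made-up heights and weights. It labels a new player by whichever class center it is closer to, which induces a straight dividing line; it is a stand-in for the weight-threshold rule on the slide, not the exact classifier we'll build in the coming lectures.

    import math

    def euclidean(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def class_means(labeled):
        # labeled maps a position name to a list of (height, weight) tuples;
        # return the mean feature vector for each position.
        return {label: tuple(sum(col) / len(pts) for col in zip(*pts))
                for label, pts in labeled.items()}

    def classify(example, means):
        # Give the new example the label of the class center it is closest to.
        return min(means, key=lambda label: euclidean(example, means[label]))

    # Hypothetical training data standing in for the labeled roster
    training = {'receiver': [(70, 180), (72, 190), (73, 200), (74, 210), (75, 205)],
                'lineman':  [(76, 305), (77, 315), (78, 300), (76, 310), (77, 320)]}
    means = class_means(training)
    for new_player in [(71, 205), (73, 215)]:   # e.g., the two running backs
        print(new_player, '->', classify(new_player, means))

With these invented numbers, both running backs land on the receiver side of the boundary, which matches the picture above.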
We're going to learn models based on unlabeled data, the case where I don't know what the labels are, by simply trying to find ways to cluster things together that are nearby, and then use the clusters to assign labels to new data. And we're going to learn models by looking at labeled data and seeing how we best come up with a way of separating, with a line or a plane or a collection of lines, examples from one group from examples of the other group. With the acknowledgment that we want to avoid overfitting, we don't want to create a really complicated system, and as a consequence we're going to have to make some trade-offs between what we call false positives and false negatives. But the resulting classifier can then label any new data by just deciding where it sits with respect to that separating line.

So here's what you're going to see over the next two and a half lectures. Every machine learning method has five essential components. We need to decide what the training data is, and how we're going to evaluate the success of that system. We've already seen some examples of that. We need to decide how we're going to represent each instance that we're giving it. I happened to choose height and weight for football players, but I might have been better off picking average speed or, I don't know, arm length, something else. How do I figure out what the right features are? And associated with that, how do I measure distances between those features? How do I decide what's close and what's not close? Maybe it should be different in terms of weight versus height, for example. I need to make that decision. Those are the two things we're going to show you examples of today, how to go through that.

Starting next week, Professor Guttag is going to show you how you take those and actually start building more detailed versions: measuring clustering, measuring similarities, to find an objective function that you want to minimize to decide what the best cluster is to use.
And then, what is the best optimization method you want to use to learn that model?

So let's start talking about features. I've got a set of examples, labeled or not. I need to decide what it is about those examples that's useful to use when I want to decide what's close to another thing or not. And one of the problems is, if it were really easy, it would be really easy. Features don't always capture what you want. I'm going to belabor that football analogy, but why did I pick height and weight? Because they were easy to find. You know, if you work for the New England Patriots, what is the thing that you really look for when you're asking what the right feature is? It's probably some other combination of things. So you, as a designer, have to say what features I want to use. That quote, by the way, is from one of the great statisticians of the 20th century, and I think it captures it well.

So feature engineering, for you as a programmer, comes down to deciding both what features I want to measure in that vector that I'm going to put together, and how I decide the relative ways to weight them.

So John and Ana and I could have made our job this term really easy if we had sat down at the beginning of the term and said, you know, we've taught this course many times. We've got data from, I don't know, John, thousands of students, probably, over this time. Let's just build a little learning algorithm that takes a set of data and predicts your final grade. You don't have to come to class, you don't have to go through all the problems, because we'll just predict your final grade. Wouldn't that be nice? It would make our job a little easier, and you may or may not like that idea. But I could think about predicting that grade. Now, why am I telling you this example? I was trying to see if I could get a few smiles. I saw a couple of them there. But think about the features. What would I measure? Actually, I'll put this on John because it's his idea.
What would he measure? Well, GPA is probably not a bad predictor of performance. If you do well in other classes, you're likely to do well in this class. I'm going to use this one very carefully: prior programming experience is at least a predictor, but it is not a perfect predictor. Those of you who haven't programmed before this class can still do really well in this class. But it's an indication that you've seen other programming languages. On the other hand, I don't believe in astrology, so I don't think the month in which you were born, the astrological sign under which you were born, has anything to do with how well you'd program. I doubt that eye color has anything to do with how well you'd program. You get the idea. Some features matter, others don't.

Now, I could just throw all the features in and hope that the machine learning algorithm sorts out those it wants to keep from those it doesn't. But I remind you of that idea of overfitting. If I do that, there is the danger that it will find some correlation between birth month, eye color, and GPA. And that's going to lead to a conclusion that we really don't like. By the way, in case you're worried, I can assure you that Stu Schmill, the dean of admissions, does not use machine learning to pick you. He actually looks at a whole bunch of things, because it's not easy to replace him with a machine, yet.

All right. So what this says is we need to think about how we pick the features. And mostly, what we're trying to do is maximize something called the signal-to-noise ratio: keep the features that carry the most information, and remove the ones that don't.

So I want to show you an example of how you might think about this. I want to label reptiles. I want to come up with a way of labeling animals as, are they a reptile or not? And I give you a single example.
With a single example, you can't really do much. But from this example, I know that a cobra lays eggs, has scales, is poisonous, is cold-blooded, has no legs, and is a reptile. So I could say my model of a reptile is, well, I'm not certain. I don't have enough data yet.

But if I give you a second example, and it also happens to be egg-laying, with scales, poisonous, cold-blooded, and no legs, there is my model, right? A perfectly reasonable model, whether I design it or a machine learning algorithm comes up with it, that says: if all of these are true, label it as a reptile. OK?

And now I give you a boa constrictor. Ah. It's a reptile, but it doesn't fit the model. In particular, it's not egg-laying, and it's not poisonous. So I've got to refine the model, or the algorithm has got to refine the model. And this, I want to remind you, is looking at the features. So I started out with five features, and this doesn't fit. So probably what I should do is reduce it: I'm going to look at scales, I'm going to look at cold-blooded, I'm going to look at legs. That captures all three examples. Again, if you think about this in terms of clustering, all three of them would fit with that.

OK. Now I give you another example: chicken. I don't think it's a reptile. In fact, I'm pretty sure it's not a reptile. And it nicely still fits this model, right? Because, while it has scales, which you may or may not realize, it's not cold-blooded, and it has legs. So it is a negative example that reinforces the model. Sounds good.

And now I'll give you an alligator. It's a reptile. And oh fudge, right? It doesn't satisfy the model. Because while it does have scales and it is cold-blooded, it has legs. I'm almost done with the example, but you see the point.
Again, I've got to think about how I refine this. And I could, by saying, all right, let's make it a little more complicated: has scales, cold-blooded, zero or four legs, and I'm going to say it's a reptile.

I'll give you the dart frog. Not a reptile; it's an amphibian. And that's nice, because it still satisfies this. It's an example outside of the cluster: no scales, not cold-blooded, but it happens to have four legs. It's not a reptile. That's good.

And then I give you-- I have to give you a python, right? I mean, there has to be a python in here. Oh, come on. At least groan at me when I say that. There has to be a python here. And I give you that and a salmon. And now I am in trouble. Because look at scales, look at cold-blooded, look at legs: I can't separate them. On those features, there's no way to come up with a rule that will correctly say that the python is a reptile and the salmon is not. And so there's no easy way to add in that rule.

And probably my best move is to simply go back to just two features, scales and cold-blooded, and basically say, if something has scales and it's cold-blooded, I'm going to call it a reptile; if it doesn't have both of those, I'm going to say it's not a reptile. It won't be perfect. It's going to incorrectly label the salmon. But I've made a design choice here that's important. And the design choice is that I will have no false negatives. What that means is that there's not going to be any instance of something that is a reptile that I'm going to label as not a reptile. I may have some false positives, in that I may have a few things that I will incorrectly label as a reptile. And in particular, the salmon is going to be an instance of that.
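As a tiny, hedged sketch of that design choice, here is the final two-feature rule applied to the animals from the example. It never misses an actual reptile, so there are no false negatives, but it does produce the salmon as a false positive, exactly the trade-off just described.

    def is_reptile(has_scales, is_cold_blooded):
        # Final, deliberately simple rule: scales AND cold-blooded -> reptile.
        return has_scales and is_cold_blooded

    # (has scales, cold-blooded, actually a reptile), as given in the example
    animals = {'cobra':     (True,  True,  True),
               'boa':       (True,  True,  True),
               'alligator': (True,  True,  True),
               'chicken':   (True,  False, False),
               'dart frog': (False, False, False),
               'salmon':    (True,  True,  False)}   # the tolerated false positive
    for name, (scales, cold, truth) in animals.items():
        print(name, 'predicted:', is_reptile(scales, cold), 'actual:', truth)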
798 00:32:19,620 --> 00:32:22,099 This trade off of false positives and false negatives 799 00:32:22,099 --> 00:32:24,390 is something that we worry about, as we think about it. 800 00:32:24,390 --> 00:32:26,690 Because there's no perfect way, in many cases, 801 00:32:26,690 --> 00:32:28,391 to separate out the data. 802 00:32:28,391 --> 00:32:30,640 And if you think back to my example of the New England 803 00:32:30,640 --> 00:32:33,876 Patriots, that running back and that wide receiver were 804 00:32:33,876 --> 00:32:35,500 so close together in height and weight, 805 00:32:35,500 --> 00:32:38,256 there was no way I'm going to be able to separate them apart. 806 00:32:38,256 --> 00:32:39,880 And I just have to be willing to decide 807 00:32:39,880 --> 00:32:42,320 how many false positives or false negatives 808 00:32:42,320 --> 00:32:45,370 do I want to tolerate. 809 00:32:45,370 --> 00:32:49,980 Once I've figured out what features to use, which is good, 810 00:32:49,980 --> 00:32:52,210 then I have to decide about distance. 811 00:32:52,210 --> 00:32:53,960 How do I compare two feature vectors? 812 00:32:53,960 --> 00:32:54,960 I'm going to say vector because there could 813 00:32:54,960 --> 00:32:56,640 be multiple dimensions to it. 814 00:32:56,640 --> 00:32:58,260 How do I decide how to compare them? 815 00:32:58,260 --> 00:33:01,350 Because I want to use the distances to figure out either 816 00:33:01,350 --> 00:33:03,960 how to group things together or how to find a dividing line 817 00:33:03,960 --> 00:33:05,940 that separates things apart. 818 00:33:05,940 --> 00:33:09,470 So one of the things I have to decide is which features. 819 00:33:09,470 --> 00:33:10,970 I also have to decide the distance. 820 00:33:10,970 --> 00:33:12,710 And finally, I may want to decide 821 00:33:12,710 --> 00:33:16,660 how to weigh relative importance of different dimensions 822 00:33:16,660 --> 00:33:17,990 in the feature vector. 823 00:33:17,990 --> 00:33:21,400 Some may be more valuable than others in making that decision. 824 00:33:21,400 --> 00:33:24,570 And I want to show you an example of that. 825 00:33:24,570 --> 00:33:27,909 So let's go back to my animals. 826 00:33:27,909 --> 00:33:29,950 I started off with a feature vector that actually 827 00:33:29,950 --> 00:33:31,390 had five dimensions to it. 828 00:33:31,390 --> 00:33:36,305 It was egg-laying, cold blooded, has scales, 829 00:33:36,305 --> 00:33:39,700 I forget what the other one was, and number of legs. 830 00:33:39,700 --> 00:33:41,560 So one of the ways I could think about this 831 00:33:41,560 --> 00:33:46,180 is saying I've got four binary features and one integer 832 00:33:46,180 --> 00:33:48,910 feature associated with each animal. 833 00:33:48,910 --> 00:33:52,000 And one way to learn to separate out reptiles from non reptiles 834 00:33:52,000 --> 00:33:56,591 is to measure the distance between pairs of examples 835 00:33:56,591 --> 00:33:58,840 and use that distance to decide what's near each other 836 00:33:58,840 --> 00:33:59,664 and what's not. 837 00:33:59,664 --> 00:34:01,330 And as we've said before, it will either 838 00:34:01,330 --> 00:34:04,210 be used to cluster things or to find a classifier surface that 839 00:34:04,210 --> 00:34:06,620 separates them. 840 00:34:06,620 --> 00:34:09,070 So here's a simple way to do it. 841 00:34:09,070 --> 00:34:11,470 For each of these examples, I'm going to just let true 842 00:34:11,470 --> 00:34:13,060 be 1, false be 0. 
843 00:34:13,060 --> 00:34:15,310 So the first four are either 0s or 1s. 844 00:34:15,310 --> 00:34:17,709 And the last one is the number of legs. 845 00:34:17,709 --> 00:34:19,000 And now I could say, all right. 846 00:34:19,000 --> 00:34:22,540 How do I measure distances between animals, 847 00:34:22,540 --> 00:34:25,884 or anything else, with these kinds of feature vectors? 848 00:34:25,884 --> 00:34:27,300 Here, we're going to use something 849 00:34:27,300 --> 00:34:30,750 called the Minkowski metric or the Minkowski difference. 850 00:34:30,750 --> 00:34:34,080 Given two vectors and a power, p, 851 00:34:34,080 --> 00:34:36,300 we basically take the absolute value 852 00:34:36,300 --> 00:34:38,429 of the difference between each of the components 853 00:34:38,429 --> 00:34:43,969 of the vector, raise it to the p-th power, take the sum, 854 00:34:43,969 --> 00:34:46,840 and take the p-th root of that. 855 00:34:46,840 --> 00:34:48,460 So let's do the two obvious examples. 856 00:34:48,460 --> 00:34:51,699 If p is equal to 1, I just measure the absolute distance 857 00:34:51,699 --> 00:34:56,469 between each component, add them up, and that's my distance. 858 00:34:56,469 --> 00:34:58,715 It's called the Manhattan metric. 859 00:34:58,715 --> 00:35:00,840 The one you've seen more, the one we saw last time, 860 00:35:00,840 --> 00:35:03,900 if p is equal to 2, this is Euclidean distance, right? 861 00:35:03,900 --> 00:35:05,990 It's the sum of the squares of the differences 862 00:35:05,990 --> 00:35:07,050 of the components. 863 00:35:07,050 --> 00:35:08,149 Take the square root. 864 00:35:08,149 --> 00:35:09,690 Take the square root because it makes 865 00:35:09,690 --> 00:35:12,420 it have certain properties of a distance. 866 00:35:12,420 --> 00:35:16,540 That's the Euclidean distance. 867 00:35:16,540 --> 00:35:20,240 So now if I want to measure the difference between these two, 868 00:35:20,240 --> 00:35:22,750 here's the question. 869 00:35:22,750 --> 00:35:27,780 Is this circle closer to the star or closer to the cross? 870 00:35:27,780 --> 00:35:30,310 Unfortunately, I put the answer up here. 871 00:35:30,310 --> 00:35:33,260 But it differs, depending on the metric I use. 872 00:35:33,260 --> 00:35:33,760 Right? 873 00:35:33,760 --> 00:35:37,000 Euclidean distance, well, that's the square root of 2 squared plus 2 squared, 874 00:35:37,000 --> 00:35:38,692 so it's about 2.8. 875 00:35:38,692 --> 00:35:39,400 And that's three. 876 00:35:39,400 --> 00:35:42,580 So in terms of just standard distance in the plane, 877 00:35:42,580 --> 00:35:46,680 we would say that these two are closer than those two are. 878 00:35:46,680 --> 00:35:48,430 Manhattan distance, why is it called that? 879 00:35:48,430 --> 00:35:52,040 Because you can only walk along the avenues and the streets. 880 00:35:52,040 --> 00:35:53,500 Manhattan distance would basically 881 00:35:53,500 --> 00:35:56,500 say this is one, two, three, four units away. 882 00:35:56,500 --> 00:35:59,170 This is one, two, three units away. 883 00:35:59,170 --> 00:36:02,020 And under Manhattan distance, this is closer, 884 00:36:02,020 --> 00:36:05,847 this pairing is closer than that pairing is. 885 00:36:05,847 --> 00:36:07,430 Now you're used to thinking Euclidean. 886 00:36:07,430 --> 00:36:08,220 We're going to use that. 887 00:36:08,220 --> 00:36:09,595 But this is going to be important 888 00:36:09,595 --> 00:36:12,080 when we think about how we are comparing distances 889 00:36:12,080 --> 00:36:15,360 between these different pieces.
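As a minimal sketch of that metric (a hypothetical Python helper, not the handout code; the coordinates are made up to match the picture, with one pair offset by (2, 2) and the other by 3 along a single axis):

def minkowski_dist(v1, v2, p):
    # Minkowski distance: p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1.0 / p)

# Hypothetical coordinates standing in for the circle, star, and cross.
circle, star, cross = (0, 0), (2, 2), (0, 3)

print(minkowski_dist(circle, star, 2))   # ~2.83: under Euclidean, the star is closer
print(minkowski_dist(circle, cross, 2))  # 3.0
print(minkowski_dist(circle, star, 1))   # 4: under Manhattan, the cross is closer
print(minkowski_dist(circle, cross, 1))  # 3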
890 00:36:15,360 --> 00:36:16,912 So typically, we'll use Euclidean. 891 00:36:16,912 --> 00:36:19,120 We're going to see Manhattan actually has some value. 892 00:36:19,120 --> 00:36:20,960 So if I go back to my three examples-- boy, that's 893 00:36:20,960 --> 00:36:21,960 a gross slide, isn't it? 894 00:36:21,960 --> 00:36:22,790 But there we go-- 895 00:36:22,790 --> 00:36:25,570 rattlesnake, boa constrictor, and dart frog. 896 00:36:25,570 --> 00:36:26,870 There is the representation. 897 00:36:26,870 --> 00:36:29,457 I can ask, what's the distance between them? 898 00:36:29,457 --> 00:36:31,790 In the handout for today, we've given you a little piece 899 00:36:31,790 --> 00:36:32,914 of code that would do that. 900 00:36:32,914 --> 00:36:36,050 And if I actually run through it, I get, 901 00:36:36,050 --> 00:36:38,510 actually, a nice little result. Here 902 00:36:38,510 --> 00:36:43,199 are the distances between those vectors using the Euclidean metric. 903 00:36:43,199 --> 00:36:44,490 I'm going to come back to them. 904 00:36:44,490 --> 00:36:48,350 But you can see the two snakes, nicely, are 905 00:36:48,350 --> 00:36:50,030 reasonably close to each other. 906 00:36:50,030 --> 00:36:54,220 Whereas, the dart frog is a fair distance away from that. 907 00:36:54,220 --> 00:36:54,910 Nice, right? 908 00:36:54,910 --> 00:36:56,740 That's a nice separation that says there's 909 00:36:56,740 --> 00:36:58,480 a difference between these two. 910 00:36:58,480 --> 00:37:00,220 OK. 911 00:37:00,220 --> 00:37:03,160 Now I throw in the alligator. 912 00:37:03,160 --> 00:37:04,810 Sounds like a Dungeons & Dragons game. 913 00:37:04,810 --> 00:37:09,480 I throw in the alligator, and I want to do the same comparison. 914 00:37:09,480 --> 00:37:14,720 And I don't get nearly as nice a result. Because now it says, 915 00:37:14,720 --> 00:37:19,320 as before, the two snakes are close to each other. 916 00:37:19,320 --> 00:37:21,700 But it says that the dart frog and the alligator 917 00:37:21,700 --> 00:37:24,640 are much closer to each other, under this measurement, 918 00:37:24,640 --> 00:37:27,185 than either of them is to the snakes. 919 00:37:27,185 --> 00:37:30,220 And to remind you, right, the alligator and the two 920 00:37:30,220 --> 00:37:33,250 snakes I would like to be close to one another and a distance 921 00:37:33,250 --> 00:37:34,640 away from the frog. 922 00:37:34,640 --> 00:37:38,470 Because I'm trying to classify reptiles versus not. 923 00:37:38,470 --> 00:37:41,015 So what happened here? 924 00:37:41,015 --> 00:37:43,140 Well, this is a place where the feature engineering 925 00:37:43,140 --> 00:37:44,640 is going to be important. 926 00:37:44,640 --> 00:37:47,820 Because in fact, the alligator differs from the frog 927 00:37:47,820 --> 00:37:49,120 in three features. 928 00:37:51,810 --> 00:37:55,300 And only in two features from, say, the boa constrictor. 929 00:37:55,300 --> 00:37:57,590 But one of those features is the number of legs. 930 00:37:57,590 --> 00:37:59,650 And there, while on the binary axes, 931 00:37:59,650 --> 00:38:01,540 the difference is between a 0 and 1, 932 00:38:01,540 --> 00:38:05,620 here it can be between 0 and 4. 933 00:38:05,620 --> 00:38:09,100 So that is weighing the distance a lot more than we would like. 934 00:38:09,100 --> 00:38:13,520 The legs dimension is too large, if you like. 935 00:38:13,520 --> 00:38:15,416 How would I fix this?
936 00:38:15,416 --> 00:38:18,020 This is actually, I would argue, a natural place 937 00:38:18,020 --> 00:38:20,690 to use Manhattan distance. 938 00:38:20,690 --> 00:38:22,520 Why should I think that the difference 939 00:38:22,520 --> 00:38:26,160 in the number of legs 940 00:38:26,160 --> 00:38:30,400 is more important than whether it has scales or not? 941 00:38:30,400 --> 00:38:32,620 Why should I think that measuring that distance 942 00:38:32,620 --> 00:38:34,300 Euclidean-wise makes sense? 943 00:38:34,300 --> 00:38:36,590 They are really completely different measurements. 944 00:38:36,590 --> 00:38:38,090 And in fact, I'm not going to do it, 945 00:38:38,090 --> 00:38:39,880 but if I ran the Manhattan metric on this, 946 00:38:39,880 --> 00:38:43,160 it would get the alligator much closer to the snakes, 947 00:38:43,160 --> 00:38:48,160 exactly because it differs only in two features, not three. 948 00:38:48,160 --> 00:38:49,900 The other way I could fix it would 949 00:38:49,900 --> 00:38:52,510 be to say I'm letting too much weight be associated 950 00:38:52,510 --> 00:38:54,430 with the difference in the number of legs. 951 00:38:54,430 --> 00:38:56,800 So let's just make it a binary feature. 952 00:38:56,800 --> 00:39:00,310 Either it doesn't have legs or it does have legs. 953 00:39:00,310 --> 00:39:03,040 Run the same classification. 954 00:39:03,040 --> 00:39:07,450 And now you see the snakes and the alligator 955 00:39:07,450 --> 00:39:09,510 are all close to each other. 956 00:39:09,510 --> 00:39:13,290 Whereas the dart frog is not as far away as it was before, 957 00:39:13,290 --> 00:39:15,480 but there's still a pretty natural separation 958 00:39:15,480 --> 00:39:18,450 between them. 959 00:39:18,450 --> 00:39:20,180 What's my point? 960 00:39:20,180 --> 00:39:22,610 Choice of features matters. 961 00:39:22,610 --> 00:39:24,710 Throwing too many features in may, in fact, 962 00:39:24,710 --> 00:39:27,450 give us some overfitting. 963 00:39:27,450 --> 00:39:29,300 And in particular, deciding the weights 964 00:39:29,300 --> 00:39:32,090 that I want on those features has a real impact. 965 00:39:32,090 --> 00:39:33,830 And you, as a designer or a programmer, 966 00:39:33,830 --> 00:39:37,340 have a lot of influence in how you think about using those. 967 00:39:37,340 --> 00:39:38,930 So feature engineering really matters. 968 00:39:38,930 --> 00:39:40,610 How you pick the features, what you use 969 00:39:40,610 --> 00:39:43,580 is going to be important. 970 00:39:43,580 --> 00:39:44,880 OK. 971 00:39:44,880 --> 00:39:47,740 The last piece of this then is we're 972 00:39:47,740 --> 00:39:51,370 going to look at some examples where we give you data with 973 00:39:51,370 --> 00:39:53,180 features associated with them. 974 00:39:53,180 --> 00:39:55,180 We're going to, in some cases, have them labeled, 975 00:39:55,180 --> 00:39:56,120 in other cases not. 976 00:39:56,120 --> 00:39:57,970 And we now know how to think about how we 977 00:39:57,970 --> 00:39:59,261 measure distances between them. 978 00:39:59,261 --> 00:40:00,425 John. 979 00:40:00,425 --> 00:40:02,050 JOHN GUTTAG: You probably didn't intend 980 00:40:02,050 --> 00:40:03,460 to say weights of features. 981 00:40:03,460 --> 00:40:04,780 You intended to say how they're scaled. 982 00:40:04,780 --> 00:40:04,990 ERIC GRIMSON: Sorry. 983 00:40:04,990 --> 00:40:06,530 The scales and not the-- thank you, John. 984 00:40:06,530 --> 00:40:07,029 No, I did.
985 00:40:07,029 --> 00:40:07,850 I take that back. 986 00:40:07,850 --> 00:40:09,600 I did not mean to say weights of features. 987 00:40:09,600 --> 00:40:11,650 I meant to say the scale of the dimension 988 00:40:11,650 --> 00:40:12,900 is going to be important here. 989 00:40:12,900 --> 00:40:15,210 Thank you for the amplification and correction. 990 00:40:15,210 --> 00:40:16,210 You're absolutely right. 991 00:40:16,210 --> 00:40:18,082 JOHN GUTTAG: Weights, we use in a different way, 992 00:40:18,082 --> 00:40:19,020 as we'll see next time. 993 00:40:19,020 --> 00:40:19,590 ERIC GRIMSON: And we're going to see 994 00:40:19,590 --> 00:40:21,450 next time why we're going to use weights in different ways. 995 00:40:21,450 --> 00:40:22,404 So rephrase it. 996 00:40:22,404 --> 00:40:23,570 Block that out of your mind. 997 00:40:23,570 --> 00:40:26,070 We're going to talk about scales and the scale on the axes 998 00:40:26,070 --> 00:40:27,536 as being important here. 999 00:40:27,536 --> 00:40:29,160 And we already said we're going to look 1000 00:40:29,160 --> 00:40:31,740 at two different kinds of learning, 1001 00:40:31,740 --> 00:40:34,920 labeled and unlabeled, clustering and classifying. 1002 00:40:34,920 --> 00:40:37,530 And I want to just finish up by showing you 1003 00:40:37,530 --> 00:40:38,940 two examples of that. 1004 00:40:38,940 --> 00:40:41,310 How we would think about them algorithmically, 1005 00:40:41,310 --> 00:40:44,004 and we'll look at them in more detail next time. 1006 00:40:44,004 --> 00:40:45,420 As we look at it, I want to remind 1007 00:40:45,420 --> 00:40:48,530 you of the things that are going to be important to you. 1008 00:40:48,530 --> 00:40:50,930 How do I measure distance between examples? 1009 00:40:50,930 --> 00:40:53,060 What's the right way to design that? 1010 00:40:53,060 --> 00:40:57,060 What is the right set of features to use in that vector? 1011 00:40:57,060 --> 00:41:01,520 And then, what constraints do I want to put on the model? 1012 00:41:01,520 --> 00:41:03,020 In the case of unlabeled data, how 1013 00:41:03,020 --> 00:41:06,424 do I decide how many clusters I want to have? 1014 00:41:06,424 --> 00:41:08,840 Because I can give you a really easy way to do clustering. 1015 00:41:08,840 --> 00:41:12,110 If I give you 100 examples, I say build 100 clusters. 1016 00:41:12,110 --> 00:41:14,250 Every example is its own cluster. 1017 00:41:14,250 --> 00:41:15,470 Distance is really good. 1018 00:41:15,470 --> 00:41:18,110 It's really close to itself, but it does a lousy job 1019 00:41:18,110 --> 00:41:19,240 of labeling things on it. 1020 00:41:19,240 --> 00:41:20,656 So I have to think about, how do I 1021 00:41:20,656 --> 00:41:23,526 decide how many clusters, what's the complexity 1022 00:41:23,526 --> 00:41:24,650 of that separating surface? 1023 00:41:24,650 --> 00:41:27,710 How do I basically avoid the overfitting problem, 1024 00:41:27,710 --> 00:41:30,840 which I don't want to have? 1025 00:41:30,840 --> 00:41:32,850 So just to remind you, we've already 1026 00:41:32,850 --> 00:41:36,240 seen a little version of this, the clustering method. 1027 00:41:36,240 --> 00:41:39,276 This is a standard way to do it, simply repeating what 1028 00:41:39,276 --> 00:41:40,400 we had on an earlier slide. 1029 00:41:40,400 --> 00:41:42,420 If I want to cluster it into groups, 1030 00:41:42,420 --> 00:41:45,410 I start by saying how many clusters am I looking for? 1031 00:41:45,410 --> 00:41:48,590 Pick an example that I take as my initial representation.
1032 00:41:48,590 --> 00:41:50,640 For every other example in the training data, 1033 00:41:50,640 --> 00:41:53,210 put it in the closest cluster. 1034 00:41:53,210 --> 00:41:57,080 Once I've got those, find the median, repeat the process. 1035 00:41:57,080 --> 00:42:01,820 And that led to that separation. 1036 00:42:01,820 --> 00:42:03,930 Now once I've got it, I like to validate it. 1037 00:42:03,930 --> 00:42:05,780 And in fact, I should have said this better. 1038 00:42:05,780 --> 00:42:09,980 Those two clusters came without looking at the two black dots. 1039 00:42:09,980 --> 00:42:11,630 Once I put the black dots in, I'd 1040 00:42:11,630 --> 00:42:14,510 like to validate, how well does this really work? 1041 00:42:14,510 --> 00:42:17,780 And that example there is really not very encouraging. 1042 00:42:17,780 --> 00:42:19,590 It's too close. 1043 00:42:19,590 --> 00:42:22,020 So that's a natural place to say, OK, what if I did this 1044 00:42:22,020 --> 00:42:25,360 with three clusters? 1045 00:42:25,360 --> 00:42:27,970 That's what I get. 1046 00:42:27,970 --> 00:42:29,240 I like that. 1047 00:42:29,240 --> 00:42:29,860 All right? 1048 00:42:29,860 --> 00:42:33,460 That has a really nice cluster up here. 1049 00:42:33,460 --> 00:42:35,630 The fact that the algorithm didn't know the labeling 1050 00:42:35,630 --> 00:42:36,213 is irrelevant. 1051 00:42:36,213 --> 00:42:37,720 There's a nice grouping of five. 1052 00:42:37,720 --> 00:42:39,710 There's a nice grouping of four. 1053 00:42:39,710 --> 00:42:42,620 And there's a nice grouping of three in between. 1054 00:42:42,620 --> 00:42:45,980 And in fact, if I looked at the average distance 1055 00:42:45,980 --> 00:42:48,200 between examples in each of these clusters, 1056 00:42:48,200 --> 00:42:52,440 it is much tighter than in that example. 1057 00:42:52,440 --> 00:42:56,550 And so that leads to, then, the question of should I 1058 00:42:56,550 --> 00:42:57,642 look for four clusters? 1059 00:42:57,642 --> 00:42:58,350 Question, please. 1060 00:42:58,350 --> 00:43:01,020 AUDIENCE: Is that overlap between the two clusters 1061 00:43:01,020 --> 00:43:01,690 not an issue? 1062 00:43:01,690 --> 00:43:02,440 ERIC GRIMSON: Yes. 1063 00:43:02,440 --> 00:43:04,600 The question is, is the overlap between the two clusters 1064 00:43:04,600 --> 00:43:05,099 a problem? 1065 00:43:05,099 --> 00:43:05,824 No. 1066 00:43:05,824 --> 00:43:07,240 I just drew it here so I could let 1067 00:43:07,240 --> 00:43:09,010 you see where those pieces are. 1068 00:43:09,010 --> 00:43:13,090 But in fact, if you like, the center is there. 1069 00:43:13,090 --> 00:43:15,550 Those three points are all closer to that center 1070 00:43:15,550 --> 00:43:16,780 than they are to that center. 1071 00:43:16,780 --> 00:43:18,260 So the fact that they overlap, a good question, 1072 00:43:18,260 --> 00:43:20,020 is just the way I happened to draw them. 1073 00:43:20,020 --> 00:43:21,490 I should really draw these, not as 1074 00:43:21,490 --> 00:43:25,104 circles, but as somewhat more convoluted surfaces. 1075 00:43:25,104 --> 00:43:26,050 OK? 1076 00:43:26,050 --> 00:43:28,900 Having done three, I could say should I look for four? 1077 00:43:28,900 --> 00:43:31,919 Well, those points down there, as I've already said, 1078 00:43:31,919 --> 00:43:33,460 are an example where it's going to be 1079 00:43:33,460 --> 00:43:34,750 hard to separate them out. 1080 00:43:34,750 --> 00:43:35,920 And I don't want to overfit.
1081 00:43:35,920 --> 00:43:37,720 Because the only way to separate those out 1082 00:43:37,720 --> 00:43:40,900 is going to be to come up with a really convoluted cluster, 1083 00:43:40,900 --> 00:43:41,950 which I don't like. 1084 00:43:41,950 --> 00:43:43,580 All right? 1085 00:43:43,580 --> 00:43:46,480 Let me finish by showing you one other example 1086 00:43:46,480 --> 00:43:47,650 from the other direction. 1087 00:43:47,650 --> 00:43:52,010 Which is, suppose I give you labeled examples. 1088 00:43:52,010 --> 00:43:54,200 So again, the goal is I've got features 1089 00:43:54,200 --> 00:43:55,470 associated with each example. 1090 00:43:55,470 --> 00:43:57,470 They're going to have multiple dimensions to them. 1091 00:43:57,470 --> 00:43:59,450 But I also know the label associated with them. 1092 00:43:59,450 --> 00:44:01,880 And I want to learn what is the best 1093 00:44:01,880 --> 00:44:04,760 way to come up with a rule that will let me take new examples 1094 00:44:04,760 --> 00:44:07,301 and assign them to the right group. 1095 00:44:07,301 --> 00:44:08,910 A number of ways to do this. 1096 00:44:08,910 --> 00:44:12,020 You can simply say I'm looking for the simplest surface that 1097 00:44:12,020 --> 00:44:13,927 will separate those examples. 1098 00:44:13,927 --> 00:44:16,010 In my football case, where they were in the plane, what's 1099 00:44:16,010 --> 00:44:17,660 the best line that separates them? 1100 00:44:17,660 --> 00:44:19,610 That turns out to be easy. 1101 00:44:19,610 --> 00:44:21,725 I might look for a more complicated surface. 1102 00:44:21,725 --> 00:44:23,600 And we're going to see an example in a second 1103 00:44:23,600 --> 00:44:26,261 where maybe it's a sequence of line segments 1104 00:44:26,261 --> 00:44:27,260 that separates them out. 1105 00:44:27,260 --> 00:44:30,920 Because there's not just one line that does the separation. 1106 00:44:30,920 --> 00:44:32,790 As before, I want to be careful. 1107 00:44:32,790 --> 00:44:34,370 If I make it too complicated, I may 1108 00:44:34,370 --> 00:44:38,054 get a really good separator, but I overfit to the data. 1109 00:44:38,054 --> 00:44:39,470 And you're going to see that next time. 1110 00:44:39,470 --> 00:44:40,520 I'm going to just highlight it here. 1111 00:44:40,520 --> 00:44:42,019 There's a third way, which will lead 1112 00:44:42,019 --> 00:44:43,910 to almost the same kind of result, 1113 00:44:43,910 --> 00:44:46,160 called k nearest neighbors. 1114 00:44:46,160 --> 00:44:49,550 And the idea here is I've got a set of labeled data. 1115 00:44:49,550 --> 00:44:52,100 And what I'm going to do is, for every new example, 1116 00:44:52,100 --> 00:44:57,250 find the k, say the five, closest labeled examples. 1117 00:44:57,250 --> 00:44:58,600 And take a vote. 1118 00:44:58,600 --> 00:45:01,870 If 3 out of 5 or 4 out of 5 or 5 out of 5 of those labels 1119 00:45:01,870 --> 00:45:04,605 are the same, I'm going to say it's part of that group. 1120 00:45:04,605 --> 00:45:05,980 And if I have less than that, I'm 1121 00:45:05,980 --> 00:45:07,510 going to leave it as unclassified. 1122 00:45:07,510 --> 00:45:09,259 And that's a nice way of actually thinking 1123 00:45:09,259 --> 00:45:10,870 about how to learn them. 1124 00:45:10,870 --> 00:45:12,940 And let me just finish by showing you an example. 1125 00:45:12,940 --> 00:45:14,814 Now I won't use football players on this one. 1126 00:45:14,814 --> 00:45:17,380 I'll use a different example. 1127 00:45:17,380 --> 00:45:20,020 I'm going to give you some voting data.
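Before turning to that data, here is a minimal sketch of the k nearest neighbors vote just described (a hypothetical Python illustration; the helper names and the points are made up, not the course code): take the k closest labeled examples, count their labels, and only assign a label if enough of them agree.

from collections import Counter

def euclidean(v1, v2):
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

def knn_label(new_point, labeled_points, k=5, votes_needed=3):
    # Sort the labeled examples by distance, keep the k nearest, and take a vote.
    nearest = sorted(labeled_points, key=lambda pair: euclidean(new_point, pair[0]))[:k]
    counts = Counter(label for _, label in nearest)
    label, count = counts.most_common(1)[0]
    return label if count >= votes_needed else None   # None means leave it unclassified

# Hypothetical (age, distance from Boston) points with party labels.
training = [((25, 3), "D"), ((30, 5), "D"), ((38, 10), "D"),
            ((45, 40), "R"), ((50, 60), "R"), ((60, 80), "R")]

print(knn_label((28, 8), training))   # "D" under these made-up points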
1128 00:45:20,020 --> 00:45:21,800 I think this is actually simulated data. 1129 00:45:21,800 --> 00:45:25,974 But these are a set of voters in the United States 1130 00:45:25,974 --> 00:45:26,890 with their preference. 1131 00:45:26,890 --> 00:45:28,136 Some tend to vote Republican. 1132 00:45:28,136 --> 00:45:29,260 Some tend to vote Democrat. 1133 00:45:29,260 --> 00:45:32,800 And the two features are their age and how far away 1134 00:45:32,800 --> 00:45:34,441 they live from Boston. 1135 00:45:34,441 --> 00:45:36,440 Whether those are relevant or not, I don't know, 1136 00:45:36,440 --> 00:45:39,064 but they are just two things I'm going to use to classify them. 1137 00:45:39,064 --> 00:45:41,750 And I'd like to say, how would I fit a curve 1138 00:45:41,750 --> 00:45:46,110 to separate those two classes? 1139 00:45:46,110 --> 00:45:48,690 I'm going to keep half the data to test. 1140 00:45:48,690 --> 00:45:50,910 I'm going to use half the data to train. 1141 00:45:50,910 --> 00:45:52,590 So if this is my training data, I 1142 00:45:52,590 --> 00:45:57,040 can say what's the best line that separates these? 1143 00:45:57,040 --> 00:46:00,200 I don't know about best, but here are two examples. 1144 00:46:00,200 --> 00:46:03,880 This solid line has the property that all the Democrats 1145 00:46:03,880 --> 00:46:05,620 are on one side. 1146 00:46:05,620 --> 00:46:07,540 Everything on the other side is a Republican, 1147 00:46:07,540 --> 00:46:10,000 but there are some Republicans on this side of the line. 1148 00:46:10,000 --> 00:46:12,310 I can't find a line that completely separates these, 1149 00:46:12,310 --> 00:46:14,260 as I did with the football players. 1150 00:46:14,260 --> 00:46:17,659 But there is a decent line to separate them. 1151 00:46:17,659 --> 00:46:18,700 Here's another candidate. 1152 00:46:18,700 --> 00:46:22,695 That dashed line has the property that on the right side 1153 00:46:22,695 --> 00:46:24,820 you've got-- boy, I don't think this is deliberate, 1154 00:46:24,820 --> 00:46:26,480 John, right-- but on the right side, 1155 00:46:26,480 --> 00:46:28,195 you've got almost all Republicans. 1156 00:46:28,195 --> 00:46:30,730 It seems perfectly appropriate. 1157 00:46:30,730 --> 00:46:34,000 One Democrat, but there's a pretty good separation there. 1158 00:46:34,000 --> 00:46:36,130 And on the left side, you've got a mix of things. 1159 00:46:36,130 --> 00:46:39,980 But most of the Democrats are on the left side of that line. 1160 00:46:39,980 --> 00:46:40,480 All right? 1161 00:46:40,480 --> 00:46:42,104 The fact that left and right correlates 1162 00:46:42,104 --> 00:46:44,470 with distance from Boston is completely irrelevant here. 1163 00:46:44,470 --> 00:46:46,620 But it has a nice punch to it. 1164 00:46:46,620 --> 00:46:48,370 JOHN GUTTAG: Relevant, but not accidental. 1165 00:46:48,370 --> 00:46:49,745 ERIC GRIMSON: But not accidental. 1166 00:46:49,745 --> 00:46:50,570 Thank you. 1167 00:46:50,570 --> 00:46:51,070 All right. 1168 00:46:51,070 --> 00:46:53,194 So now the question is, how would I evaluate these? 1169 00:46:53,194 --> 00:46:55,306 How do I decide which one is better? 1170 00:46:55,306 --> 00:46:56,680 And I'm simply going to show you, 1171 00:46:56,680 --> 00:46:58,880 very quickly, some examples. 1172 00:46:58,880 --> 00:47:02,747 First one is to look at what's called the confusion matrix. 1173 00:47:02,747 --> 00:47:03,580 What does that mean?
1174 00:47:03,580 --> 00:47:07,090 It says, for one of these classifiers, for example 1175 00:47:07,090 --> 00:47:07,760 the solid line, 1176 00:47:07,760 --> 00:47:10,260 here are the predictions, based on that line, 1177 00:47:10,260 --> 00:47:12,010 of whether they would be more likely to be 1178 00:47:12,010 --> 00:47:13,540 Democrat or Republican. 1179 00:47:13,540 --> 00:47:16,090 And here is the actual label. 1180 00:47:16,090 --> 00:47:17,410 Same thing for the dashed line. 1181 00:47:17,410 --> 00:47:21,280 And that diagonal is important because those are 1182 00:47:21,280 --> 00:47:23,740 the correctly labeled results. 1183 00:47:23,740 --> 00:47:24,540 Right? 1184 00:47:24,540 --> 00:47:27,460 In the solid line case, it correctly 1185 00:47:27,460 --> 00:47:30,400 gets all of the labelings of the Democrats. 1186 00:47:30,400 --> 00:47:32,080 It gets half of the Republicans right. 1187 00:47:32,080 --> 00:47:35,080 But it has some where it's actually Republican, 1188 00:47:35,080 --> 00:47:37,700 but it labels it as a Democrat. 1189 00:47:37,700 --> 00:47:40,580 That diagonal, we'd like to be really large. 1190 00:47:40,580 --> 00:47:43,070 And in fact, it leads to a natural measure 1191 00:47:43,070 --> 00:47:44,880 called the accuracy. 1192 00:47:44,880 --> 00:47:46,520 Which is, just to go back to that, 1193 00:47:46,520 --> 00:47:48,650 we say that these are true positives. 1194 00:47:48,650 --> 00:47:52,070 Meaning, I labeled it as being an instance, and it really is. 1195 00:47:52,070 --> 00:47:53,330 These are true negatives. 1196 00:47:53,330 --> 00:47:56,330 I label it as not being an instance, and it really isn't. 1197 00:47:56,330 --> 00:47:59,450 And then these are the false positives. 1198 00:47:59,450 --> 00:48:01,424 I labeled it as being an instance and it's not, 1199 00:48:01,424 --> 00:48:02,840 and these are the false negatives. 1200 00:48:02,840 --> 00:48:05,680 I labeled it as not being an instance, and it is. 1201 00:48:05,680 --> 00:48:09,620 And an easy way to measure it is to look at the correct labels 1202 00:48:09,620 --> 00:48:11,290 over all of the labels. 1203 00:48:11,290 --> 00:48:13,040 The true positives and the true negatives, 1204 00:48:13,040 --> 00:48:14,820 the ones I got right. 1205 00:48:14,820 --> 00:48:19,862 And in that case, both models come up with a value of 0.7. 1206 00:48:19,862 --> 00:48:20,820 So which one is better? 1207 00:48:20,820 --> 00:48:21,900 Well, I should validate that. 1208 00:48:21,900 --> 00:48:23,399 And I'm going to do that in a second 1209 00:48:23,399 --> 00:48:25,511 by looking at other data. 1210 00:48:25,511 --> 00:48:27,260 We could also ask, could we find something 1211 00:48:27,260 --> 00:48:28,660 with less training error? 1212 00:48:28,660 --> 00:48:31,310 This is only getting 70% right. 1213 00:48:31,310 --> 00:48:33,250 Not great. 1214 00:48:33,250 --> 00:48:35,692 Well, here is a more complicated model. 1215 00:48:35,692 --> 00:48:37,150 And this is where you start getting 1216 00:48:37,150 --> 00:48:38,140 worried about overfitting. 1217 00:48:38,140 --> 00:48:39,598 Now what I've done is I've come up 1218 00:48:39,598 --> 00:48:42,430 with a sequence of lines that separate them. 1219 00:48:42,430 --> 00:48:45,260 So everything above this line, I'm going to say 1220 00:48:45,260 --> 00:48:46,080 is a Republican. 1221 00:48:46,080 --> 00:48:48,910 Everything below this line, I'm going to say is a Democrat. 1222 00:48:48,910 --> 00:48:50,350 So I'm avoiding that one.
1223 00:48:50,350 --> 00:48:51,310 I'm avoiding that one. 1224 00:48:51,310 --> 00:48:54,340 I'm still capturing many of the same things. 1225 00:48:54,340 --> 00:48:59,140 And in this case, I get 12 true positives, 13 true negatives, 1226 00:48:59,140 --> 00:49:02,001 and only 5 false positives. 1227 00:49:02,001 --> 00:49:03,000 And that's kind of nice. 1228 00:49:03,000 --> 00:49:03,790 You can see the 5. 1229 00:49:03,790 --> 00:49:06,040 It's those five red ones down there. 1230 00:49:06,040 --> 00:49:09,360 Its accuracy is 0.833. 1231 00:49:09,360 --> 00:49:15,596 And now, if I apply that to the test data, I get an OK result. 1232 00:49:15,596 --> 00:49:19,440 It has an accuracy of about 0.6. 1233 00:49:19,440 --> 00:49:21,930 I could use this idea to try and generalize to say could I 1234 00:49:21,930 --> 00:49:23,100 come up with a better model. 1235 00:49:23,100 --> 00:49:25,860 And you're going to see that next time. 1236 00:49:25,860 --> 00:49:28,050 There could be other ways in which I measure this. 1237 00:49:28,050 --> 00:49:29,860 And I want to use this as the last example. 1238 00:49:29,860 --> 00:49:34,740 Another good measure we use is called PPV, or Positive Predictive 1239 00:49:34,740 --> 00:49:39,290 Value, which is how many true positives I come up with out 1240 00:49:39,290 --> 00:49:42,580 of all the things I labeled positively. 1241 00:49:42,580 --> 00:49:45,630 And with this solid line model and the dashed line model, 1242 00:49:45,630 --> 00:49:48,120 I can get values of about 0.57. 1243 00:49:48,120 --> 00:49:50,500 The complex model on the training data is better. 1244 00:49:50,500 --> 00:49:53,960 And then the testing data is even stronger. 1245 00:49:53,960 --> 00:49:55,820 And finally, two other examples are called 1246 00:49:55,820 --> 00:49:58,850 sensitivity and specificity. 1247 00:49:58,850 --> 00:50:01,490 Sensitivity basically tells you what percentage 1248 00:50:01,490 --> 00:50:03,610 did I correctly find. 1249 00:50:03,610 --> 00:50:05,510 And specificity says what percentage 1250 00:50:05,510 --> 00:50:07,700 did I correctly reject. 1251 00:50:07,700 --> 00:50:09,290 And I show you this because this is 1252 00:50:09,290 --> 00:50:12,240 where the trade-off comes in. 1253 00:50:12,240 --> 00:50:14,210 Sensitivity is, out of those that I correctly 1254 00:50:14,210 --> 00:50:16,400 labeled as positive plus those that I incorrectly 1255 00:50:16,400 --> 00:50:20,382 labeled as negative, 1256 00:50:20,382 --> 00:50:21,840 how many of them did I correctly label 1257 00:50:21,840 --> 00:50:23,980 as being the kind that I want? 1258 00:50:23,980 --> 00:50:27,310 I can make sensitivity 1. 1259 00:50:27,310 --> 00:50:30,090 Label everything as the thing I'm looking for. 1260 00:50:30,090 --> 00:50:30,710 Great. 1261 00:50:30,710 --> 00:50:32,140 Everything I'm looking for gets found. 1262 00:50:32,140 --> 00:50:35,740 But the specificity will be 0. 1263 00:50:35,740 --> 00:50:39,020 Because I'll have a bunch of things incorrectly labeled. 1264 00:50:39,020 --> 00:50:43,460 I could make the specificity 1, reject everything. 1265 00:50:43,460 --> 00:50:45,410 Say nothing is an instance. 1266 00:50:45,410 --> 00:50:52,070 The true negative rate goes to 1, and I'm in a great place there, 1267 00:50:52,070 --> 00:50:55,170 but my sensitivity goes to 0. 1268 00:50:55,170 --> 00:50:56,130 I've got a trade-off.
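As a minimal sketch of those four measures (hypothetical Python helpers, plugging in the counts quoted above for the more complicated model on the training data: 12 true positives, 13 true negatives, 5 false positives, and, as implied by the quoted 0.833 accuracy, no false negatives):

def accuracy(tp, tn, fp, fn):
    # Fraction of all labels that were correct.
    return (tp + tn) / (tp + tn + fp + fn)

def ppv(tp, fp):
    # Positive predictive value: of everything labeled positive, how much really is positive.
    return tp / (tp + fp)

def sensitivity(tp, fn):
    # Of the things that really are positive, what fraction did I correctly find.
    return tp / (tp + fn)

def specificity(tn, fp):
    # Of the things that really are negative, what fraction did I correctly reject.
    return tn / (tn + fp)

tp, tn, fp, fn = 12, 13, 5, 0
print(accuracy(tp, tn, fp, fn))    # 0.833...
print(ppv(tp, fp))                 # ~0.71 with these counts
print(sensitivity(tp, fn))         # 1.0 -- no false negatives here
print(specificity(tn, fp))         # ~0.72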
1269 00:50:56,130 --> 00:50:58,680 As I think about the machine learning algorithm I'm using 1270 00:50:58,680 --> 00:51:01,164 and my choice of that classifier, 1271 00:51:01,164 --> 00:51:02,580 I'm going to see a trade-off where 1272 00:51:02,580 --> 00:51:07,260 I can increase specificity at the cost of sensitivity or vice 1273 00:51:07,260 --> 00:51:08,310 versa. 1274 00:51:08,310 --> 00:51:11,430 And you'll see a nice technique called ROC, or the Receiver Operating 1275 00:51:11,430 --> 00:51:14,576 Characteristic curve, that gives you a sense of how you want to deal with that. 1276 00:51:14,576 --> 00:51:16,200 And with that, we'll see you next time. 1277 00:51:16,200 --> 00:51:17,180 We'll take your question offline 1278 00:51:17,180 --> 00:51:18,680 if you don't mind, because I've run over time. 1279 00:51:18,680 --> 00:51:20,763 But we'll see you next time, when Professor Guttag 1280 00:51:20,763 --> 00:51:22,930 will show you examples of this.