1 00:00:00,790 --> 00:00:03,130 The following content is provided under a Creative 2 00:00:03,130 --> 00:00:04,550 Commons license. 3 00:00:04,550 --> 00:00:06,760 Your support will help MIT OpenCourseWare 4 00:00:06,760 --> 00:00:10,850 continue to offer high quality educational resources for free. 5 00:00:10,850 --> 00:00:13,390 To make a donation, or to view additional materials 6 00:00:13,390 --> 00:00:17,320 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,320 --> 00:00:18,270 at ocw.mit.edu. 8 00:00:28,431 --> 00:00:30,380 PROFESSOR: Hello, everybody. 9 00:00:30,380 --> 00:00:35,060 Before we start the material, a couple of announcements. 10 00:00:35,060 --> 00:00:37,280 As usual, there's some reading assignments, 11 00:00:37,280 --> 00:00:40,940 and you might be surprised to see something from Chapter 12 00:00:40,940 --> 00:00:43,370 5 suddenly popping up. 13 00:00:43,370 --> 00:00:45,380 But this is my relentless attempt 14 00:00:45,380 --> 00:00:47,050 to introduce more Python. 15 00:00:47,050 --> 00:00:51,690 We'll see one new concept later today, list comprehension. 16 00:00:51,690 --> 00:00:55,650 Today we're going to look at classification. 17 00:00:55,650 --> 00:00:58,580 And you remember last, on Monday, 18 00:00:58,580 --> 00:01:01,640 we looked at unsupervised learning. 19 00:01:01,640 --> 00:01:04,489 Today we're looking at supervised learning. 20 00:01:04,489 --> 00:01:08,660 It can usually be divided into two categories. 21 00:01:08,660 --> 00:01:11,940 Regression, where you try and predict 22 00:01:11,940 --> 00:01:15,420 some real number associated with the feature vector, 23 00:01:15,420 --> 00:01:18,540 and this is something we've already done really, 24 00:01:18,540 --> 00:01:22,980 back when we looked at curve fitting, linear regression 25 00:01:22,980 --> 00:01:24,150 in particular. 26 00:01:24,150 --> 00:01:28,080 It was exactly building a model that, given some features, 27 00:01:28,080 --> 00:01:30,232 would predict a point. 28 00:01:30,232 --> 00:01:31,690 In this case, it was pretty simple. 29 00:01:31,690 --> 00:01:33,810 It was given x predict y. 30 00:01:33,810 --> 00:01:38,640 You can imagine generalizing that to multi dimensions. 31 00:01:38,640 --> 00:01:42,660 Today I'm going to talk about classification, 32 00:01:42,660 --> 00:01:45,840 which is very common, in many ways more 33 00:01:45,840 --> 00:01:48,390 common than regression for-- 34 00:01:48,390 --> 00:01:50,550 in the machine learning world. 35 00:01:50,550 --> 00:01:55,170 And here the goal is to predict a discrete value, often called 36 00:01:55,170 --> 00:02:00,420 a label, associated with some feature vector. 37 00:02:00,420 --> 00:02:04,400 So this is the sort of thing where you try and, for example, 38 00:02:04,400 --> 00:02:07,550 predict whether a person will have 39 00:02:07,550 --> 00:02:10,340 an adverse reaction to a drug. 40 00:02:10,340 --> 00:02:12,290 You're not looking for a real number, 41 00:02:12,290 --> 00:02:17,720 you're looking for will they get sick, will they not get sick. 42 00:02:17,720 --> 00:02:21,230 Maybe you're trying to predict the grade in a course A, B, C, 43 00:02:21,230 --> 00:02:25,350 D, and other grades we won't mention. 44 00:02:25,350 --> 00:02:27,020 Again, those are labels, so it doesn't 45 00:02:27,020 --> 00:02:32,860 have to be a binary label but it's a finite number of labels. 46 00:02:32,860 --> 00:02:34,720 So here's an example to start with. 47 00:02:34,720 --> 00:02:37,580 We won't linger on it too long. 
48 00:02:37,580 --> 00:02:40,660 This is basically something you saw 49 00:02:40,660 --> 00:02:44,470 in an earlier lecture, where we had a bunch of animals 50 00:02:44,470 --> 00:02:48,070 and a bunch of properties, and a label identifying 51 00:02:48,070 --> 00:02:49,885 whether or not they were a reptile. 52 00:02:55,810 --> 00:03:01,640 So we start by building a distance matrix. 53 00:03:01,640 --> 00:03:07,270 How far apart they are, and in fact, in this case, 54 00:03:07,270 --> 00:03:11,020 I'm not using the representation you just saw. 55 00:03:11,020 --> 00:03:15,010 I'm going to use the binary representation, 56 00:03:15,010 --> 00:03:17,667 as Professor Grimson showed you, and for the reasons 57 00:03:17,667 --> 00:03:18,250 he showed you. 58 00:03:21,240 --> 00:03:25,320 If you're interested, I didn't produce this table by hand, 59 00:03:25,320 --> 00:03:28,500 I wrote some Python code to produce it, 60 00:03:28,500 --> 00:03:30,420 not only to compute the distances, 61 00:03:30,420 --> 00:03:36,030 but more delicately to produce the actual table. 62 00:03:36,030 --> 00:03:39,030 And you'll probably find it instructive at some point 63 00:03:39,030 --> 00:03:41,700 to at least remember that that code is there, 64 00:03:41,700 --> 00:03:45,910 in case you need to ever produce a table for some paper. 65 00:03:45,910 --> 00:03:51,100 In general, you probably noticed I spent relatively little time 66 00:03:51,100 --> 00:03:53,560 going over the actual vast amounts of code 67 00:03:53,560 --> 00:03:55,930 we've been posting. 68 00:03:55,930 --> 00:03:58,930 That doesn't mean you shouldn't look at it. 69 00:03:58,930 --> 00:04:02,380 In part, a lot of it's there because I'm 70 00:04:02,380 --> 00:04:04,510 hoping at some point in the future it will be handy 71 00:04:04,510 --> 00:04:08,680 for you to have a model on how to do something. 72 00:04:08,680 --> 00:04:09,610 All right. 73 00:04:09,610 --> 00:04:12,640 So we have all these distances. 74 00:04:12,640 --> 00:04:18,070 And we can tell how far apart one animal is from another. 75 00:04:18,070 --> 00:04:22,320 Now how do we use those to classify animals? 76 00:04:22,320 --> 00:04:25,020 And the simplest approach to classification, 77 00:04:25,020 --> 00:04:28,320 and it's actually one that's used a fair amount in practice 78 00:04:28,320 --> 00:04:31,750 is called nearest neighbor. 79 00:04:31,750 --> 00:04:35,140 So the learning part is trivial. 80 00:04:35,140 --> 00:04:39,010 We don't actually learn anything other than we just remember. 81 00:04:39,010 --> 00:04:42,010 So we remember the training data. 82 00:04:42,010 --> 00:04:45,640 And when we want to predict the label of a new example, 83 00:04:45,640 --> 00:04:48,240 we find the nearest example in the training data, 84 00:04:48,240 --> 00:04:53,030 and just choose the label associated with that example. 85 00:04:53,030 --> 00:04:55,570 So here I'm just drawing a cloud 86 00:04:55,570 --> 00:04:59,060 of red dots and black dots. 87 00:04:59,060 --> 00:05:02,060 I have a fuchsia colored X. And if I 88 00:05:02,060 --> 00:05:05,230 want to classify X as black or red, 89 00:05:05,230 --> 00:05:08,100 I'd say well its nearest neighbor is red. 90 00:05:08,100 --> 00:05:10,010 So we'll call X red. 91 00:05:12,650 --> 00:05:14,210 Doesn't get much simpler than that.
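A minimal sketch of the nearest-neighbor idea just described, assuming the animals are encoded as binary feature vectors like the ones on the slide. The distance function, the particular bit patterns, and the helper names here are illustrative only; they are not the course's posted code.

    # Nearest-neighbor classification over binary feature vectors.
    # The bit patterns below are made up for illustration.

    def euclidean(v1, v2):
        """Euclidean distance between two equal-length feature vectors."""
        return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

    # label True means "reptile" in this toy training set
    training = {
        'cobra':     ((1, 1, 1, 1, 0), True),
        'boa':       ((0, 1, 0, 1, 0), True),
        'chicken':   ((1, 1, 0, 1, 1), False),
        'guppy':     ((0, 1, 0, 0, 0), False),
        'dart frog': ((1, 0, 1, 0, 1), False),
    }

    def nearest_neighbor_label(features):
        """Return the label of the single closest training example."""
        closest = min(training,
                      key=lambda name: euclidean(features, training[name][0]))
        return training[closest][1]

    # classify a new, unlabeled animal from its (made-up) feature vector
    print(nearest_neighbor_label((1, 1, 0, 1, 0)))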
92 00:05:18,781 --> 00:05:19,280 All right. 93 00:05:19,280 --> 00:05:22,970 Let's try and do it now for our animals. 94 00:05:22,970 --> 00:05:26,250 I've blocked out this lower right hand corner, 95 00:05:26,250 --> 00:05:30,260 because I want to classify these three animals that are in gray. 96 00:05:30,260 --> 00:05:34,310 So my training data, very small, are these animals. 97 00:05:34,310 --> 00:05:37,170 And these are my test set here. 98 00:05:37,170 --> 00:05:40,980 So let's first try and classify the zebra. 99 00:05:40,980 --> 00:05:43,710 We look at the zebra's nearest neighbor. 100 00:05:43,710 --> 00:05:48,230 Well it's either a guppy or a dart frog. 101 00:05:48,230 --> 00:05:49,430 Well, let's just choose one. 102 00:05:49,430 --> 00:05:51,050 Let's choose the guppy. 103 00:05:51,050 --> 00:05:54,650 And if we look at the guppy, it's not a reptile, 104 00:05:54,650 --> 00:05:57,780 so we say the zebra is not a reptile. 105 00:05:57,780 --> 00:05:59,300 So got one right. 106 00:06:02,760 --> 00:06:05,130 Look at the python, choose its nearest neighbor, 107 00:06:05,130 --> 00:06:06,920 say it's a cobra. 108 00:06:06,920 --> 00:06:09,990 The label associated with cobra is reptile, 109 00:06:09,990 --> 00:06:12,535 so we win again on the python. 110 00:06:16,030 --> 00:06:22,170 Alligator, its nearest neighbor is clearly a chicken. 111 00:06:22,170 --> 00:06:27,770 And so we classify the alligator as not a reptile. 112 00:06:31,450 --> 00:06:33,400 Oh, dear. 113 00:06:33,400 --> 00:06:34,840 Clearly the wrong answer. 114 00:06:38,720 --> 00:06:39,890 All right. 115 00:06:39,890 --> 00:06:43,540 What might have gone wrong? 116 00:06:43,540 --> 00:06:49,040 Well, the problem with nearest neighbor, 117 00:06:49,040 --> 00:06:52,340 we can illustrate it by looking at this example. 118 00:06:52,340 --> 00:06:55,310 So one of the things people do with classifiers these days is 119 00:06:55,310 --> 00:06:57,750 handwriting recognition. 120 00:06:57,750 --> 00:07:01,800 So I just copied from a website a bunch of numbers, 121 00:07:01,800 --> 00:07:06,810 then I wrote the number 40 in my own inimitable handwriting. 122 00:07:06,810 --> 00:07:09,300 So if we go and we look for, say, the nearest neighbor 123 00:07:09,300 --> 00:07:10,500 of four-- 124 00:07:10,500 --> 00:07:13,020 or sorry, of whatever that digit is. 125 00:07:17,530 --> 00:07:20,080 It is, I believe, this one. 126 00:07:20,080 --> 00:07:23,640 And sure enough that's the row of fours. 127 00:07:23,640 --> 00:07:24,810 We're OK on this. 128 00:07:27,570 --> 00:07:32,910 Now if we want to classify my zero, 129 00:07:32,910 --> 00:07:35,610 the actual nearest neighbor, in terms 130 00:07:35,610 --> 00:07:39,770 of the bitmaps if you will, turns out to be this guy. 131 00:07:39,770 --> 00:07:42,240 A very poorly written nine. 132 00:07:42,240 --> 00:07:45,930 I didn't make up this nine, it was already there. 133 00:07:45,930 --> 00:07:50,670 And the problem we see here when we use nearest neighbor is 134 00:07:50,670 --> 00:07:55,540 if something is noisy, if you have one noisy piece of data, 135 00:07:55,540 --> 00:07:59,040 in this case, it's a rather ugly looking version of a nine, 136 00:07:59,040 --> 00:08:01,170 you can get the wrong answer because you match it. 137 00:08:03,830 --> 00:08:07,490 And indeed, in this case, you would get the wrong answer. 138 00:08:07,490 --> 00:08:10,960 What is usually done to avoid that is something 139 00:08:10,960 --> 00:08:12,940 called K nearest neighbors.
140 00:08:16,300 --> 00:08:19,930 And the basic idea here is that we don't just 141 00:08:19,930 --> 00:08:22,600 take the nearest neighbors, we take 142 00:08:22,600 --> 00:08:26,440 some number of nearest neighbors, usually 143 00:08:26,440 --> 00:08:30,730 an odd number, and we just let them vote. 144 00:08:30,730 --> 00:08:36,900 So now if we want to classify this fuchsia X, 145 00:08:36,900 --> 00:08:39,630 and we set K equal to three, we say well these 146 00:08:39,630 --> 00:08:42,600 are its three nearest neighbors. 147 00:08:42,600 --> 00:08:45,570 One is red, two are black, so we're 148 00:08:45,570 --> 00:08:49,540 going to call X black as our better guess. 149 00:08:49,540 --> 00:08:51,670 And maybe that actually is a better guess, 150 00:08:51,670 --> 00:08:54,310 because it looks like this red point here is really 151 00:08:54,310 --> 00:08:59,320 an outlier, and we don't want to let the outliers dominate 152 00:08:59,320 --> 00:09:01,450 our classification. 153 00:09:01,450 --> 00:09:05,560 And this is why people almost always use K nearest neighbors 154 00:09:05,560 --> 00:09:09,410 rather than just nearest neighbor. 155 00:09:09,410 --> 00:09:14,520 Now if we look at this, and we use K nearest neighbors, 156 00:09:14,520 --> 00:09:18,270 those are the three nearest to the first numeral, 157 00:09:18,270 --> 00:09:21,132 and they are all fours. 158 00:09:21,132 --> 00:09:22,840 And if we look at the K nearest neighbors 159 00:09:22,840 --> 00:09:25,840 for the second numeral, we still have this nine 160 00:09:25,840 --> 00:09:28,600 but now we have two zeros. 161 00:09:28,600 --> 00:09:32,150 And so we vote and we decide it's a zero. 162 00:09:32,150 --> 00:09:33,290 Is it infallible? 163 00:09:33,290 --> 00:09:34,130 No. 164 00:09:34,130 --> 00:09:37,130 But it's typically much more reliable 165 00:09:37,130 --> 00:09:41,620 than just nearest neighbors, hence used much more often. 166 00:09:45,880 --> 00:09:49,120 And that was our problem, by the way, with the alligator. 167 00:09:49,120 --> 00:09:51,830 The nearest neighbor was the chicken, 168 00:09:51,830 --> 00:09:54,170 but if we went back and looked at it-- 169 00:09:54,170 --> 00:09:55,470 maybe we should go do that. 170 00:10:01,950 --> 00:10:04,930 And we take the alligator's three nearest neighbors, 171 00:10:04,930 --> 00:10:09,870 it would be the chicken, a cobra, and the rattlesnake-- 172 00:10:09,870 --> 00:10:12,120 or the boa, we don't care, and we 173 00:10:12,120 --> 00:10:15,180 would end up correctly classifying it now 174 00:10:15,180 --> 00:10:17,070 as a reptile. 175 00:10:17,070 --> 00:10:18,304 Yes? 176 00:10:18,304 --> 00:10:22,222 AUDIENCE: Is there like a limit to how many [INAUDIBLE]? 177 00:10:22,222 --> 00:10:23,680 PROFESSOR: The question is, is there 178 00:10:23,680 --> 00:10:26,980 a limit to how many nearest neighbors you'd want? 179 00:10:26,980 --> 00:10:29,560 Absolutely. 180 00:10:29,560 --> 00:10:33,850 Most obviously, there's no point in setting K equal to-- whoops. 181 00:10:33,850 --> 00:10:36,090 Ooh, on the rebound-- 182 00:10:36,090 --> 00:10:40,270 to the size of the training set. 183 00:10:40,270 --> 00:10:42,940 So one of the problems with K nearest neighbors 184 00:10:42,940 --> 00:10:44,980 is efficiency. 185 00:10:44,980 --> 00:10:47,680 If you're trying to find the K nearest neighbors 186 00:10:47,680 --> 00:10:51,530 and K is bigger, it takes longer. 187 00:10:51,530 --> 00:10:55,460 So we worry about how big K should be.
188 00:10:55,460 --> 00:10:58,400 And if we make it too big-- 189 00:10:58,400 --> 00:11:00,650 and this is a crucial thing-- 190 00:11:00,650 --> 00:11:07,240 we end up getting dominated by the size of the class. 191 00:11:07,240 --> 00:11:10,650 So let's look at this picture we had before. 192 00:11:10,650 --> 00:11:14,650 There happen to be more red dots than black dots. 193 00:11:14,650 --> 00:11:20,440 If I make K 10 or 15, I'm going to classify a lot of things 194 00:11:20,440 --> 00:11:26,230 as red, just because red is so much more prevalent than black. 195 00:11:26,230 --> 00:11:29,140 And so when you have an imbalance, which you usually 196 00:11:29,140 --> 00:11:34,250 do, you have to be very careful about K. Does that make sense? 197 00:11:34,250 --> 00:11:36,525 AUDIENCE: [INAUDIBLE] choose K? 198 00:11:36,525 --> 00:11:38,710 PROFESSOR: So how do you choose K? 199 00:11:38,710 --> 00:11:43,780 Remember back on Monday when we talked about choosing K for K- 200 00:11:43,780 --> 00:11:45,900 means clustering? 201 00:11:45,900 --> 00:11:49,740 We typically do a very similar kind of thing. 202 00:11:49,740 --> 00:11:56,230 We take our training data and we split it into two parts. 203 00:11:56,230 --> 00:11:58,550 So we have training and testing, but now 204 00:11:58,550 --> 00:12:01,260 we just take the training, and we split that 205 00:12:01,260 --> 00:12:05,270 into training and testing multiple times. 206 00:12:05,270 --> 00:12:08,540 And we experiment with different K's, and we 207 00:12:08,540 --> 00:12:13,110 see which K gives us the best result on the training data. 208 00:12:13,110 --> 00:12:19,190 And then that becomes our K. And that's a very common method. 209 00:12:19,190 --> 00:12:22,570 It's called cross-validation, and it's-- 210 00:12:22,570 --> 00:12:26,760 for almost all of machine learning, the algorithms 211 00:12:26,760 --> 00:12:30,960 have parameters. In this case, it's just one parameter, K. 212 00:12:30,960 --> 00:12:34,170 And the way we typically choose the parameter values 213 00:12:34,170 --> 00:12:37,110 is by searching through the space using 214 00:12:37,110 --> 00:12:40,720 this cross-validation in the training data. 215 00:12:40,720 --> 00:12:43,350 Does that make sense to everybody? 216 00:12:43,350 --> 00:12:44,481 Great question. 217 00:12:44,481 --> 00:12:46,230 And there was someone else who had a question, 218 00:12:46,230 --> 00:12:47,362 but maybe it was the same. 219 00:12:47,362 --> 00:12:48,570 Do you still have a question? 220 00:12:48,570 --> 00:12:52,310 AUDIENCE: Well, just that you were using like K nearest 221 00:12:52,310 --> 00:12:54,351 and you get, like if my K is three 222 00:12:54,351 --> 00:12:56,684 and I get three different clusters for the K [INAUDIBLE] 223 00:12:56,684 --> 00:12:58,183 PROFESSOR: Three different clusters? 224 00:12:58,183 --> 00:12:59,126 AUDIENCE: [INAUDIBLE] 225 00:12:59,126 --> 00:13:00,370 PROFESSOR: Well, right. 226 00:13:00,370 --> 00:13:05,250 So if K is 3, and I had red, black, and purple 227 00:13:05,250 --> 00:13:08,190 and I get one of each, then what do I do? 228 00:13:08,190 --> 00:13:10,120 And then I'm kind of stuck. 229 00:13:10,120 --> 00:13:13,260 So you need to typically choose K in such a way 230 00:13:13,260 --> 00:13:16,140 that when you vote you get a winner. 231 00:13:16,140 --> 00:13:16,670 Nice. 232 00:13:16,670 --> 00:13:19,880 So if there's two, any odd number will do.
233 00:13:19,880 --> 00:13:22,070 If it's three, well then you need another number 234 00:13:22,070 --> 00:13:25,410 so that there's some-- so there's always a majority. 235 00:13:25,410 --> 00:13:27,070 Right? 236 00:13:27,070 --> 00:13:30,920 You want to make sure that there is a winner. 237 00:13:30,920 --> 00:13:31,940 Also a good question. 238 00:13:36,900 --> 00:13:39,210 Let's see if I get this to you directly. 239 00:13:41,870 --> 00:13:45,560 I'm much better at throwing overhand, I guess. 240 00:13:45,560 --> 00:13:46,430 Wow. 241 00:13:46,430 --> 00:13:48,140 Finally got applause for something. 242 00:13:48,140 --> 00:13:52,770 All right, advantages and disadvantages of KNN? 243 00:13:52,770 --> 00:13:54,930 The learning is really fast, right? 244 00:13:54,930 --> 00:13:57,120 I just remember everything. 245 00:13:57,120 --> 00:13:59,472 No math is required. 246 00:13:59,472 --> 00:14:00,930 Didn't have to show you any theory. 247 00:14:00,930 --> 00:14:03,660 Was obviously an idea. 248 00:14:03,660 --> 00:14:06,900 It's easy to explain the method to somebody, and the results. 249 00:14:06,900 --> 00:14:08,430 Why did I label it black? 250 00:14:08,430 --> 00:14:12,210 Because that's who it was closest to. 251 00:14:12,210 --> 00:14:15,730 The disadvantage is it's memory intensive. 252 00:14:15,730 --> 00:14:19,740 If I've got a million examples, I have to store them all. 253 00:14:19,740 --> 00:14:23,840 And the predictions can take a long time. 254 00:14:23,840 --> 00:14:27,650 If I have an example and I want to find its K nearest 255 00:14:27,650 --> 00:14:30,480 neighbors, I'm doing a lot of comparisons. 256 00:14:30,480 --> 00:14:30,980 Right? 257 00:14:30,980 --> 00:14:33,710 If I have a million training points 258 00:14:33,710 --> 00:14:37,550 I have to compare my example to all a million. 259 00:14:37,550 --> 00:14:41,100 So I have no real pre-processing overhead. 260 00:14:41,100 --> 00:14:43,460 But each time I need to do a classification, 261 00:14:43,460 --> 00:14:46,030 it takes a long time. 262 00:14:46,030 --> 00:14:48,700 Now there are better algorithms than brute force 263 00:14:48,700 --> 00:14:53,230 that give you approximate K nearest neighbors. 264 00:14:53,230 --> 00:14:56,760 But on the whole, it's still not fast. 265 00:14:56,760 --> 00:15:02,920 And we're not getting any information about what process 266 00:15:02,920 --> 00:15:06,210 might have generated the data. 267 00:15:06,210 --> 00:15:10,290 We don't have a model of the data in the way that, say, when 268 00:15:10,290 --> 00:15:13,680 we did our linear regression for curve fitting, 269 00:15:13,680 --> 00:15:18,280 we had a model for the data that sort of described the pattern. 270 00:15:18,280 --> 00:15:23,240 We don't get that out of K nearest neighbors. 271 00:15:23,240 --> 00:15:25,340 I'm going to show you a different approach where 272 00:15:25,340 --> 00:15:27,180 we do get that. 273 00:15:27,180 --> 00:15:29,540 And I'm going to do it on a more interesting example 274 00:15:29,540 --> 00:15:32,030 than reptiles. 275 00:15:32,030 --> 00:15:36,230 I apologize to those of you who are reptologists.
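A minimal sketch of the K-nearest-neighbors voting just discussed, plus the idea from the Q&A of picking K by holding out part of the training data. The distance function, the candidate K's, and the single 80/20 split are illustrative assumptions; the lecture suggests repeating such splits several times, and this is not the course's posted implementation.

    from collections import Counter

    def euclidean(v1, v2):
        """Euclidean distance between two equal-length feature vectors."""
        return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

    def k_nearest_label(features, training, k=3):
        """Majority vote among the k closest training examples.

        training is a list of (feature_vector, label) pairs; using an
        odd k helps avoid ties when there are two classes."""
        by_distance = sorted(training, key=lambda ex: euclidean(features, ex[0]))
        votes = Counter(label for _, label in by_distance[:k])
        return votes.most_common(1)[0][0]

    def choose_k(training, candidate_ks=(1, 3, 5, 7)):
        """Pick K by splitting the training data itself and scoring each
        candidate on the held-out part (a single split, for brevity)."""
        split = int(0.8 * len(training))
        train_part, held_out = training[:split], training[split:]
        def accuracy(k):
            correct = sum(k_nearest_label(f, train_part, k) == lab
                          for f, lab in held_out)
            return correct / len(held_out)
        return max(candidate_ks, key=accuracy)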
276 00:15:36,230 --> 00:15:40,160 So you probably all heard of the Titanic. 277 00:15:40,160 --> 00:15:43,670 There was a movie about it, I'm told. 278 00:15:43,670 --> 00:15:47,610 It was one of the great sea disasters of all time, 279 00:15:47,610 --> 00:15:50,300 a so-called unsinkable ship-- 280 00:15:50,300 --> 00:15:53,060 they had advertised it as unsinkable-- 281 00:15:53,060 --> 00:15:55,025 hit an iceberg and went down. 282 00:15:55,025 --> 00:15:58,760 Of the 1,300 passengers, 812 died. 283 00:15:58,760 --> 00:16:00,829 The crew did way worse. 284 00:16:00,829 --> 00:16:02,870 So at least it looks as if the crew was actually 285 00:16:02,870 --> 00:16:04,070 pretty heroic. 286 00:16:04,070 --> 00:16:06,530 They had a higher death rate. 287 00:16:06,530 --> 00:16:08,870 So we're going to use machine learning 288 00:16:08,870 --> 00:16:12,530 to see if we can predict which passengers survived. 289 00:16:15,940 --> 00:16:17,960 There's an online database I'm using. 290 00:16:17,960 --> 00:16:20,280 It doesn't have all 1,200 passengers, 291 00:16:20,280 --> 00:16:24,790 but it has information about 1,046 of them. 292 00:16:24,790 --> 00:16:27,220 For some of them they couldn't get the information. 293 00:16:27,220 --> 00:16:29,830 It says what cabin class they were in, first, second, 294 00:16:29,830 --> 00:16:33,760 or third, how old they were, and their gender. 295 00:16:33,760 --> 00:16:36,100 Also has their name and their home 296 00:16:36,100 --> 00:16:39,450 address and things, which I'm not using. 297 00:16:39,450 --> 00:16:42,990 We want to use these features to see 298 00:16:42,990 --> 00:16:46,020 if we can predict which passengers were 299 00:16:46,020 --> 00:16:50,030 going to survive the disaster. 300 00:16:50,030 --> 00:16:52,870 Well, the first question is something 301 00:16:52,870 --> 00:16:57,530 that Professor Grimson alluded to: is it OK 302 00:16:57,530 --> 00:16:58,940 just to look at accuracy? 303 00:16:58,940 --> 00:17:03,560 How are we going to evaluate our machine learning? 304 00:17:03,560 --> 00:17:04,329 And it's not. 305 00:17:04,329 --> 00:17:08,290 If we just predict died for everybody, well then 306 00:17:08,290 --> 00:17:14,319 we'll be 62% accurate for the passengers and 76% accurate 307 00:17:14,319 --> 00:17:16,270 for the crew members. 308 00:17:16,270 --> 00:17:18,760 Usually in machine learning, if you're 76% 309 00:17:18,760 --> 00:17:20,710 accurate you say that's not bad. 310 00:17:20,710 --> 00:17:25,329 Well, here I can get that just by predicting died. 311 00:17:25,329 --> 00:17:30,490 So whenever you have a class imbalance, that much more of one 312 00:17:30,490 --> 00:17:33,960 than the other, accuracy isn't a particularly meaningful 313 00:17:33,960 --> 00:17:34,460 measure. 314 00:17:37,340 --> 00:17:41,460 I discovered this early on in my work in the medical area. 315 00:17:41,460 --> 00:17:43,550 There are a lot of diseases that rarely occur, 316 00:17:43,550 --> 00:17:46,970 they occur in say 0.1% of the population. 317 00:17:46,970 --> 00:17:49,280 And I can build a great model for predicting it 318 00:17:49,280 --> 00:17:51,500 by just saying, no, you don't have 319 00:17:51,500 --> 00:17:57,170 it, which will be 99.9% accurate, but totally useless. 320 00:18:00,650 --> 00:18:02,810 Unfortunately, you do see people doing that sort 321 00:18:02,810 --> 00:18:04,200 of thing in the literature. 322 00:18:06,750 --> 00:18:10,710 You saw these in an earlier lecture, just to remind you, 323 00:18:10,710 --> 00:18:15,110 we're going to be looking at other metrics.
324 00:18:15,110 --> 00:18:18,870 Sensitivity, think of that as how good 325 00:18:18,870 --> 00:18:22,260 is it at identifying the positive cases. 326 00:18:22,260 --> 00:18:26,980 In this case, positive is going to be dead. 327 00:18:26,980 --> 00:18:33,110 How specific is it, and the positive predictive value. 328 00:18:33,110 --> 00:18:35,820 If we say somebody died, what's the probability 329 00:18:35,820 --> 00:18:38,172 that they really did? 330 00:18:38,172 --> 00:18:40,130 And then there's the negative predictive value. 331 00:18:40,130 --> 00:18:41,900 If we say they didn't die, what's 332 00:18:41,900 --> 00:18:43,430 the probability they didn't die? 333 00:18:46,380 --> 00:18:50,040 So these are four very common metrics. 334 00:18:50,040 --> 00:18:54,660 There is something called an F score that combines them, 335 00:18:54,660 --> 00:18:58,500 but I'm not going to be showing you that today. 336 00:18:58,500 --> 00:19:00,810 I will mention that in the literature, 337 00:19:00,810 --> 00:19:04,170 people often use the word recall to mean sensitivity, 338 00:19:04,170 --> 00:19:09,480 or sensitivity to mean recall, and specificity and precision 339 00:19:09,480 --> 00:19:12,160 are used pretty much interchangeably. 340 00:19:12,160 --> 00:19:16,080 So you might see various combinations of these words. 341 00:19:16,080 --> 00:19:18,840 Typically, people talk about recall and precision 342 00:19:18,840 --> 00:19:22,730 or sensitivity and specificity. 343 00:19:22,730 --> 00:19:24,400 Does that make sense, why we want 344 00:19:24,400 --> 00:19:27,010 to look at the measures other than accuracy? 345 00:19:27,010 --> 00:19:31,330 We will look at accuracy, too, and how they all tell us 346 00:19:31,330 --> 00:19:34,510 kind of different things, and how you might 347 00:19:34,510 --> 00:19:37,840 choose a different balance. 348 00:19:37,840 --> 00:19:42,550 For example, if I'm running a screening test, say 349 00:19:42,550 --> 00:19:47,600 for breast cancer, a mammogram, and trying 350 00:19:47,600 --> 00:19:49,310 to find the people who should go on 351 00:19:49,310 --> 00:19:52,580 for a more extensive examination, 352 00:19:52,580 --> 00:19:55,990 what do I want to emphasize here? 353 00:19:55,990 --> 00:19:58,610 Which of these is likely to be the most important? 354 00:20:02,050 --> 00:20:04,750 Or what would you care about most? 355 00:20:08,190 --> 00:20:10,830 Well, maybe I want sensitivity. 356 00:20:10,830 --> 00:20:15,390 Since I'm going to send this person on for future tests, 357 00:20:15,390 --> 00:20:19,760 I really don't want to miss somebody who has cancer, 358 00:20:19,760 --> 00:20:22,580 and so I might think sensitivity is 359 00:20:22,580 --> 00:20:27,460 more important than specificity in that particular case. 360 00:20:27,460 --> 00:20:30,720 On the other hand, if I'm deciding 361 00:20:30,720 --> 00:20:36,710 who is so sick I should do open heart surgery on them, 362 00:20:36,710 --> 00:20:39,860 maybe I want to be pretty specific. 363 00:20:39,860 --> 00:20:43,190 Because the risks of the surgery itself are very high. 364 00:20:43,190 --> 00:20:47,060 I don't want to do it on people who don't need it. 365 00:20:47,060 --> 00:20:51,530 So we end up having to choose a balance between these things, 366 00:20:51,530 --> 00:20:53,210 depending upon our application.
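A small sketch of the four metrics just described, computed from counts of true/false positives and negatives, with accuracy alongside for comparison. These are the standard textbook definitions; the function name is made up and this is not the course's getStats code, which also prints its results.

    def classifier_stats(true_pos, false_pos, true_neg, false_neg):
        """Standard definitions of the metrics discussed above.
        (Assumes each denominator is nonzero; a real version would guard that.)"""
        total       = true_pos + false_pos + true_neg + false_neg
        accuracy    = (true_pos + true_neg) / total
        sensitivity = true_pos / (true_pos + false_neg)   # a.k.a. recall
        specificity = true_neg / (true_neg + false_pos)
        pos_pred_val = true_pos / (true_pos + false_pos)  # a.k.a. precision
        neg_pred_val = true_neg / (true_neg + false_neg)
        return accuracy, sensitivity, specificity, pos_pred_val, neg_pred_val

    # Note how the imbalance problem shows up: a classifier that calls
    # everyone "died" can have decent accuracy and perfect sensitivity,
    # but its specificity is 0.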
367 00:20:57,160 --> 00:21:01,050 The other thing I want to talk about before actually building 368 00:21:01,050 --> 00:21:07,760 a classifier is how we test our classifier, 369 00:21:07,760 --> 00:21:09,870 because this is very important. 370 00:21:09,870 --> 00:21:13,190 I'm going to talk about two different methods, 371 00:21:13,190 --> 00:21:17,150 leave-one-out testing and repeated 372 00:21:17,150 --> 00:21:21,730 random subsampling. 373 00:21:21,730 --> 00:21:24,780 For leave one out, it's typically 374 00:21:24,780 --> 00:21:31,140 used when you have a small number of examples, 375 00:21:31,140 --> 00:21:34,200 so you want as much training data as possible 376 00:21:34,200 --> 00:21:36,730 as you build your model. 377 00:21:36,730 --> 00:21:41,680 So you take all of your n examples, remove one of them, 378 00:21:41,680 --> 00:21:45,850 train on n minus 1, test on the 1. 379 00:21:45,850 --> 00:21:49,450 Then you put that 1 back and remove another 1. 380 00:21:49,450 --> 00:21:53,110 Train on n minus 1, test on 1. 381 00:21:53,110 --> 00:21:56,284 And you do this for each element of the data, 382 00:21:56,284 --> 00:21:57,700 and then you average your results. 383 00:22:02,670 --> 00:22:05,490 Repeated random subsampling is done 384 00:22:05,490 --> 00:22:10,860 when you have a larger set of data, and there you might say 385 00:22:10,860 --> 00:22:13,730 split your data 80/20. 386 00:22:13,730 --> 00:22:20,130 Take 80% of the data to train on, test on the other 20. 387 00:22:20,130 --> 00:22:23,910 So this is very similar to what I talked about earlier, 388 00:22:23,910 --> 00:22:26,310 and answered the question about how 389 00:22:26,310 --> 00:22:32,340 to choose K. I haven't seen the future examples, 390 00:22:32,340 --> 00:22:35,930 but in order to believe in my model 391 00:22:35,930 --> 00:22:38,930 and, say, my parameter settings, I do this repeated 392 00:22:38,930 --> 00:22:44,090 random subsampling or leave one out, either one. 393 00:22:44,090 --> 00:22:45,680 There's the code for leave one out. 394 00:22:48,790 --> 00:22:51,340 Absolutely nothing interesting about it, 395 00:22:51,340 --> 00:22:54,670 so I'm not going to waste your time looking at it. 396 00:22:57,430 --> 00:23:04,110 Repeated random subsampling is a little more interesting. 397 00:23:04,110 --> 00:23:10,640 What I've done here is I first sample-- 398 00:23:10,640 --> 00:23:13,600 this one just splits it 80/20. 399 00:23:13,600 --> 00:23:15,810 It's not doing anything repeated, 400 00:23:15,810 --> 00:23:27,445 and I start by sampling 20% of the indices, not the samples. 401 00:23:30,040 --> 00:23:31,690 And I want to do that at random. 402 00:23:31,690 --> 00:23:33,655 I don't want to, say, get consecutive ones. 403 00:23:37,840 --> 00:23:42,050 So we do that, and then once I've got the indices, 404 00:23:42,050 --> 00:23:44,550 I just go through and assign each example 405 00:23:44,550 --> 00:23:50,560 to either test or training, and then return the two sets. 406 00:23:50,560 --> 00:23:54,500 But if I just sort of sampled one, 407 00:23:54,500 --> 00:23:56,480 then I'd have to do a more complicated thing 408 00:23:56,480 --> 00:23:57,740 to subtract it from the other. 409 00:23:57,740 --> 00:23:59,780 This is just efficiency. 410 00:23:59,780 --> 00:24:02,370 And then here's the-- 411 00:24:02,370 --> 00:24:04,640 sorry about the yellow there-- 412 00:24:04,640 --> 00:24:05,520 the random splits.
413 00:24:09,110 --> 00:24:10,820 Obviously, I was searching for results 414 00:24:10,820 --> 00:24:12,440 when I did my screen capture. 415 00:24:15,579 --> 00:24:17,620 I'm just going to, for i in range number of splits, 416 00:24:17,620 --> 00:24:19,617 I'm going to split it 80/20. 417 00:24:22,240 --> 00:24:26,550 It takes a parameter method, and that's interesting, 418 00:24:26,550 --> 00:24:29,800 and we'll see the ramifications of that later. 419 00:24:29,800 --> 00:24:32,520 That's going to be the machine learning method. 420 00:24:32,520 --> 00:24:35,850 We're going to compare KNN to another method called 421 00:24:35,850 --> 00:24:37,620 logistic regression. 422 00:24:37,620 --> 00:24:41,160 I didn't want to have to do this code 423 00:24:41,160 --> 00:24:45,260 twice, so I made the method itself a parameter. 424 00:24:45,260 --> 00:24:47,870 We'll see that introduces a slight complication, 425 00:24:47,870 --> 00:24:51,140 but we'll get to it when we get to it. 426 00:24:51,140 --> 00:24:54,090 So I split it, I apply whatever that method is 427 00:24:54,090 --> 00:25:01,040 to the training and the test set, I get the results, 428 00:25:01,040 --> 00:25:05,330 true positives, false positives, true negatives, false negatives. 429 00:25:05,330 --> 00:25:08,210 And then I call this thing get stats, 430 00:25:08,210 --> 00:25:11,300 but I'm dividing it by the number of splits, 431 00:25:11,300 --> 00:25:13,580 so that will give me the average number 432 00:25:13,580 --> 00:25:18,320 of true positives, the average number of false positives, etc. 433 00:25:18,320 --> 00:25:22,340 And then I'm just going to return the average. 434 00:25:22,340 --> 00:25:27,770 Get stats actually just prints a bunch of statistics for us. 435 00:25:27,770 --> 00:25:29,840 Any questions about the two methods, 436 00:25:29,840 --> 00:25:32,300 leave one out versus repeated random sampling? 437 00:25:38,690 --> 00:25:41,870 Let's try it for KNN on the Titanic. 438 00:25:45,120 --> 00:25:50,400 So I'm not going to show you the code for K nearest classify. 439 00:25:50,400 --> 00:25:53,160 It's in the code we uploaded. 440 00:25:53,160 --> 00:25:56,520 It takes four arguments: the training set, 441 00:25:56,520 --> 00:26:01,620 the test set, the label that we're trying to classify. 442 00:26:01,620 --> 00:26:03,270 Are we looking for the people who died? 443 00:26:03,270 --> 00:26:04,478 Or the people who didn't die? 444 00:26:04,478 --> 00:26:07,410 Are we looking for reptiles or not reptiles? 445 00:26:07,410 --> 00:26:09,240 Or, in the case where there were six labels, 446 00:26:09,240 --> 00:26:11,910 which one are we trying to detect? 447 00:26:11,910 --> 00:26:16,470 And K as in how many nearest neighbors? 448 00:26:16,470 --> 00:26:18,990 And then it returns the true positives, the false positives, 449 00:26:18,990 --> 00:26:20,970 the true negatives, and the false negatives. 450 00:26:26,440 --> 00:26:30,820 Then you'll recall we'd already looked at lambda 451 00:26:30,820 --> 00:26:32,950 in a different context. 452 00:26:32,950 --> 00:26:41,250 The issue here is K nearest classify takes four arguments, 453 00:26:41,250 --> 00:26:47,180 yet if we go back here, for example, to random splits, 454 00:26:47,180 --> 00:26:51,320 what we're seeing is I'm calling the method with only two 455 00:26:51,320 --> 00:26:53,640 arguments. 456 00:26:53,640 --> 00:26:56,910 Because after all, if I'm not doing K nearest neighbors, 457 00:26:56,910 --> 00:27:02,120 maybe I don't need to pass in K. I'm sure I don't.
458 00:27:02,120 --> 00:27:04,070 Different methods will take different numbers 459 00:27:04,070 --> 00:27:09,920 of parameters, and yet I want to use the same function here 460 00:27:09,920 --> 00:27:12,630 method. 461 00:27:12,630 --> 00:27:14,760 So the trick I use to get around that-- 462 00:27:14,760 --> 00:27:17,900 and this is a very common programming trick-- 463 00:27:17,900 --> 00:27:18,550 in math. 464 00:27:18,550 --> 00:27:22,380 It's called currying, after the mathematician Curry, 465 00:27:22,380 --> 00:27:25,990 not the Indian dish. 466 00:27:25,990 --> 00:27:30,520 I'm creating a function a new function called KNN. 467 00:27:30,520 --> 00:27:33,580 This will be a function of two arguments, the training 468 00:27:33,580 --> 00:27:36,070 set and the test set, and it will 469 00:27:36,070 --> 00:27:40,240 be K nearest classifier with training set and test 470 00:27:40,240 --> 00:27:46,970 set as variables, and two constants, survived-- 471 00:27:46,970 --> 00:27:48,890 so I'm going to predict who survived-- 472 00:27:48,890 --> 00:27:53,420 and 3, the K. 473 00:27:53,420 --> 00:27:56,450 I've been able to turn a function of four arguments, 474 00:27:56,450 --> 00:28:00,140 K nearest classify, into a function of two arguments 475 00:28:00,140 --> 00:28:05,570 KNN by using lambda abstraction. 476 00:28:05,570 --> 00:28:09,000 This is something that people do fairly frequently, 477 00:28:09,000 --> 00:28:12,690 because it lets you build much more general programs when 478 00:28:12,690 --> 00:28:16,030 you don't have to worry about the number of arguments. 479 00:28:16,030 --> 00:28:19,500 So it's a good trick to keeping your bag of tricks. 480 00:28:19,500 --> 00:28:23,110 Again, it's a trick we've used before. 481 00:28:23,110 --> 00:28:26,740 Then I've just chosen 10 for the number of splits, 482 00:28:26,740 --> 00:28:36,850 and we'll try it, and we'll try it for both methods of testing. 483 00:28:36,850 --> 00:28:38,990 Any questions before I run this code? 484 00:28:52,720 --> 00:28:53,309 So here it is. 485 00:28:53,309 --> 00:28:53,850 We'll run it. 486 00:28:59,470 --> 00:29:02,020 Well, I should learn how to spell finished, shouldn't I? 487 00:29:02,020 --> 00:29:03,050 But that's OK. 488 00:29:11,220 --> 00:29:16,680 Here we have the results, and they're-- 489 00:29:16,680 --> 00:29:18,780 well, what can we say about them? 490 00:29:18,780 --> 00:29:21,750 They're not much different to start with, 491 00:29:21,750 --> 00:29:24,630 so it doesn't appear that our testing methodology had 492 00:29:24,630 --> 00:29:29,640 much of a difference on how well the KNN worked, 493 00:29:29,640 --> 00:29:33,060 and that's actually kind of comforting. 494 00:29:33,060 --> 00:29:36,480 The accurate-- none of the evaluation criteria 495 00:29:36,480 --> 00:29:39,660 are radically different, so that's kind of good. 496 00:29:39,660 --> 00:29:42,880 We hoped that was true. 497 00:29:42,880 --> 00:29:45,390 The other thing to notice is that we're actually 498 00:29:45,390 --> 00:29:50,040 doing considerably better than just always predicting, say, 499 00:29:50,040 --> 00:29:50,745 didn't survive. 500 00:29:56,070 --> 00:29:59,750 We're doing better than a random prediction. 501 00:29:59,750 --> 00:30:01,617 Let's go back now to the Power Point. 502 00:30:08,075 --> 00:30:08,950 Here are the results. 503 00:30:08,950 --> 00:30:10,738 We don't need to study them anymore. 
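Before moving on from the KNN experiment, here is a sketch of the testing harness just described: sampling 20% of the indices for the test set, averaging a method's results over several random 80/20 splits, and currying the four-argument classifier down to the two arguments the harness expects. The structure follows what the slides describe, but the details are a reconstruction, not the posted code; kNearestClassify stands for the course's function of that name and is not defined here.

    import random

    def split_80_20(examples):
        """Randomly pick 20% of the indices for the test set."""
        test_indices = set(random.sample(range(len(examples)), len(examples) // 5))
        training_set, test_set = [], []
        for i, e in enumerate(examples):
            (test_set if i in test_indices else training_set).append(e)
        return training_set, test_set

    def random_splits(examples, method, num_splits):
        """Average true/false positives/negatives of `method` over random splits.

        `method` is any function of just (training_set, test_set) that returns
        (true_pos, false_pos, true_neg, false_neg)."""
        totals = [0, 0, 0, 0]
        for _ in range(num_splits):
            training_set, test_set = split_80_20(examples)
            for i, count in enumerate(method(training_set, test_set)):
                totals[i] += count
        return [t / num_splits for t in totals]

    # Currying with a lambda: fix the label ('Survived') and K (3) as constants,
    # leaving a two-argument function that random_splits can call.
    knn = lambda training_set, test_set: kNearestClassify(training_set, test_set,
                                                          'Survived', 3)
    # average_counts = random_splits(examples, knn, 10)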
504 00:30:14,020 --> 00:30:18,550 Better than 62% accuracy, but not much difference 505 00:30:18,550 --> 00:30:21,490 between the experiments. 506 00:30:21,490 --> 00:30:23,770 So that's one method. 507 00:30:23,770 --> 00:30:26,240 Now let's look at a different method, 508 00:30:26,240 --> 00:30:28,340 and this is probably the most common method 509 00:30:28,340 --> 00:30:30,290 used in machine learning. 510 00:30:30,290 --> 00:30:34,830 It's called logistic regression. 511 00:30:34,830 --> 00:30:37,800 It's, in some ways, if you look at it, similar 512 00:30:37,800 --> 00:30:40,200 to a linear regression, but different 513 00:30:40,200 --> 00:30:41,475 in some important ways. 514 00:30:44,490 --> 00:30:49,900 Linear regression, you will I'm sure recall, 515 00:30:49,900 --> 00:30:51,870 is designed to predict a real number. 516 00:30:54,920 --> 00:31:02,220 Now what we want here is a probability, so 517 00:31:02,220 --> 00:31:04,770 the probability of some event. 518 00:31:04,770 --> 00:31:07,140 We know that the dependent variable can only 519 00:31:07,140 --> 00:31:17,020 take on a finite set of values, so we want to predict survived 520 00:31:17,020 --> 00:31:18,820 or didn't survive. 521 00:31:18,820 --> 00:31:23,310 It's no good to say we predict this person half survived, 522 00:31:23,310 --> 00:31:25,480 you know survived, but is brain dead or something. 523 00:31:25,480 --> 00:31:27,040 I don't know. 524 00:31:27,040 --> 00:31:29,500 That's not what we're trying to do. 525 00:31:29,500 --> 00:31:33,370 The problem with just using regular linear regression 526 00:31:33,370 --> 00:31:37,240 is a lot of the time you get nonsense predictions. 527 00:31:37,240 --> 00:31:41,050 Now you can claim, OK 0.5 is there, 528 00:31:41,050 --> 00:31:44,860 and it means they have a half probability of dying, 529 00:31:44,860 --> 00:31:47,320 not that half died. 530 00:31:47,320 --> 00:31:49,900 But in fact, if you look at what goes on, 531 00:31:49,900 --> 00:31:54,740 you could get more than 1 or less than 0 532 00:31:54,740 --> 00:31:57,670 out of linear regression, and that's 533 00:31:57,670 --> 00:32:01,130 nonsense when we're talking about probabilities. 534 00:32:01,130 --> 00:32:06,520 So we need a different method, and that's logistic regression. 535 00:32:06,520 --> 00:32:10,420 What logistic regression does is it 536 00:32:10,420 --> 00:32:14,330 finds what are called the weights for each feature. 537 00:32:14,330 --> 00:32:17,710 You may recall I complained when Professor Grimson used 538 00:32:17,710 --> 00:32:22,450 the word weights to mean something somewhat different. 539 00:32:22,450 --> 00:32:27,640 We take each feature, for example the gender, the cabin 540 00:32:27,640 --> 00:32:37,114 class, the age, and compute for that feature a weight 541 00:32:37,114 --> 00:32:39,030 that we're going to use in making predictions. 542 00:32:39,030 --> 00:32:42,380 So think of the weights as corresponding 543 00:32:42,380 --> 00:32:46,410 to the coefficients we get when we do a linear regression. 544 00:32:46,410 --> 00:32:51,400 So we have now a coefficient associated with each variable. 545 00:32:51,400 --> 00:32:53,710 We're going to take those coefficients, 546 00:32:53,710 --> 00:32:56,710 add them up, multiply them by something, 547 00:32:56,710 --> 00:32:59,200 and make a prediction.
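The lecture deliberately skips the math, but as a sketch of what "take the coefficients and make a prediction" amounts to, here is the standard logistic (sigmoid) formula that squashes a weighted sum of the features into a number strictly between 0 and 1. This is the textbook form, not the course's code, and the names here are illustrative.

    import math

    def logistic_probability(weights, feature_vector, intercept=0.0):
        """Turn a weighted sum of features into a probability in (0, 1)
        using the standard logistic function 1 / (1 + e^(-z))."""
        z = intercept + sum(w * x for w, x in zip(weights, feature_vector))
        return 1 / (1 + math.exp(-z))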
548 00:32:59,200 --> 00:33:02,820 A positive weight implies-- 549 00:33:02,820 --> 00:33:04,870 and I'll come back to this later-- 550 00:33:04,870 --> 00:33:08,530 it almost implies that the variable is positively 551 00:33:08,530 --> 00:33:11,840 correlated with the outcome. 552 00:33:11,840 --> 00:33:18,030 So we would, for example, say that 553 00:33:18,030 --> 00:33:20,670 having scales is positively correlated 554 00:33:20,670 --> 00:33:24,100 with being a reptile. 555 00:33:24,100 --> 00:33:27,820 A negative weight implies that the variable is negatively 556 00:33:27,820 --> 00:33:32,650 correlated with the outcome, so number of legs 557 00:33:32,650 --> 00:33:34,840 might have a negative weight. 558 00:33:34,840 --> 00:33:37,330 The more legs an animal has, the less likely 559 00:33:37,330 --> 00:33:40,150 it is to be a reptile. 560 00:33:40,150 --> 00:33:47,020 It's not absolute, it's just a correlation. 561 00:33:47,020 --> 00:33:49,390 The absolute magnitude is related 562 00:33:49,390 --> 00:33:52,230 to the strength of the correlation, 563 00:33:52,230 --> 00:33:54,230 so if it's big positive it means 564 00:33:54,230 --> 00:33:55,970 it's a really strong indicator. 565 00:33:55,970 --> 00:33:58,460 If it's big negative, it's a really strong 566 00:33:58,460 --> 00:33:59,893 negative indicator. 567 00:34:04,150 --> 00:34:07,960 And then we use an optimization process 568 00:34:07,960 --> 00:34:11,949 to compute these weights from the training data. 569 00:34:11,949 --> 00:34:13,659 It's a little bit complex. 570 00:34:13,659 --> 00:34:17,110 Its key is the way it uses the log function, hence 571 00:34:17,110 --> 00:34:21,610 the name logistic, but I'm not going to make you look at it. 572 00:34:24,270 --> 00:34:28,090 But I will show you how to use it. 573 00:34:28,090 --> 00:34:31,805 You start by importing something called sklearn.linear_model. 574 00:34:35,139 --> 00:34:42,300 Sklearn is a Python library, and in that is a class 575 00:34:42,300 --> 00:34:44,440 called logistic regression. 576 00:34:44,440 --> 00:34:47,330 It's the name of a class, and here are 577 00:34:47,330 --> 00:34:50,760 three methods of that class. 578 00:34:50,760 --> 00:34:56,610 Fit, which takes a sequence of feature vectors 579 00:34:56,610 --> 00:34:59,640 and a sequence of labels and returns 580 00:34:59,640 --> 00:35:05,180 an object of type logistic regression. 581 00:35:05,180 --> 00:35:09,960 So this is the place where the optimization is done. 582 00:35:09,960 --> 00:35:13,230 Now all the examples I'm going to show you, 583 00:35:13,230 --> 00:35:17,500 these two sequences will be-- 584 00:35:17,500 --> 00:35:18,220 well all right. 585 00:35:18,220 --> 00:35:20,860 So think of this as the sequence of feature vectors, 586 00:35:20,860 --> 00:35:25,870 one per passenger, and the labels associated with those. 587 00:35:25,870 --> 00:35:28,050 So this and this have to be the same length. 588 00:35:33,280 --> 00:35:37,470 That produces an object of this type, 589 00:35:37,470 --> 00:35:41,990 and then I can ask for the coefficients, which 590 00:35:41,990 --> 00:35:47,350 will return the weight of each variable, each feature. 591 00:35:47,350 --> 00:35:51,320 And then I can make a prediction, 592 00:35:51,320 --> 00:35:55,040 given a feature vector, return the probabilities 593 00:35:55,040 --> 00:35:59,120 of different labels. 594 00:35:59,120 --> 00:36:02,550 Let's look at it as an example. 595 00:36:02,550 --> 00:36:03,980 So first let's build the model.
596 00:36:06,870 --> 00:36:09,690 To build the model, we'll take the examples, the training 597 00:36:09,690 --> 00:36:13,410 data, and a flag that says whether we're going to print something. 598 00:36:13,410 --> 00:36:15,600 You'll notice from this slide I've 599 00:36:15,600 --> 00:36:18,020 elided the printed stuff. 600 00:36:18,020 --> 00:36:22,090 We'll come back in a later slide and look at what's in there. 601 00:36:22,090 --> 00:36:24,980 But for now I want to focus on actually building the model. 602 00:36:28,160 --> 00:36:32,270 I need to create two vectors, two lists in this case, 603 00:36:32,270 --> 00:36:34,940 the feature vectors and the labels. 604 00:36:34,940 --> 00:36:36,695 For e in examples, featureVecs.append(e.getFeatures()), 605 00:36:36,695 --> 00:36:40,870 labels.append(e.getLabel()). 606 00:36:40,870 --> 00:36:45,830 Couldn't be much simpler than that. 607 00:36:45,830 --> 00:36:50,360 Then, just because it wouldn't fit on a line on my slide, 608 00:36:50,360 --> 00:36:52,700 I've created this identifier called 609 00:36:52,700 --> 00:36:56,495 logistic regression, which is sklearn.linear_model 610 00:36:56,495 --> 00:37:00,010 .LogisticRegression. 611 00:37:00,010 --> 00:37:04,340 So this is the thing I imported, and this is a class, 612 00:37:04,340 --> 00:37:06,890 and now I'll get a model by first 613 00:37:06,890 --> 00:37:10,670 creating an instance of the class, logistic regression. 614 00:37:10,670 --> 00:37:13,070 Here I'm getting an instance, and then I'll 615 00:37:13,070 --> 00:37:16,730 call dot fit with that instance, passing 616 00:37:16,730 --> 00:37:19,410 it feature vecs and labels. 617 00:37:19,410 --> 00:37:21,660 I now have built a logistic regression 618 00:37:21,660 --> 00:37:25,260 model, which is simply a set of weights 619 00:37:25,260 --> 00:37:27,507 for each of the variables. 620 00:37:27,507 --> 00:37:28,215 This makes sense? 621 00:37:32,770 --> 00:37:35,590 Now we're going to apply the model, 622 00:37:35,590 --> 00:37:39,040 and I think this is the last piece of Python 623 00:37:39,040 --> 00:37:42,130 I'm going to introduce this semester, in case you're 624 00:37:42,130 --> 00:37:44,620 tired of learning about Python. 625 00:37:44,620 --> 00:37:48,050 And this is, at last, list comprehension. 626 00:37:48,050 --> 00:37:53,140 This is how I'm going to build my set of test feature vectors. 627 00:37:53,140 --> 00:37:56,470 So before we go and look at the code, 628 00:37:56,470 --> 00:38:00,690 let's look at how list comprehension works. 629 00:38:00,690 --> 00:38:04,380 In its simplest form, it says some expression 630 00:38:04,380 --> 00:38:06,840 for some identifier in some list, 631 00:38:06,840 --> 00:38:14,235 L. It creates a new list by evaluating this expression len(L) 632 00:38:14,235 --> 00:38:19,860 times with the ID in the expression replaced 633 00:38:19,860 --> 00:38:23,400 by each element of the list L. So let's 634 00:38:23,400 --> 00:38:25,500 look at a simple example. 635 00:38:25,500 --> 00:38:32,150 Here I'm saying L equals x times x for x in range 10. 636 00:38:32,150 --> 00:38:34,020 What's that going to do? 637 00:38:34,020 --> 00:38:37,654 It's going to, essentially, create a list. 638 00:38:37,654 --> 00:38:39,070 Think of it as a list, or at least 639 00:38:39,070 --> 00:38:43,620 a sequence of values, a range type actually in Python 3-- 640 00:38:43,620 --> 00:38:47,200 of values 0 to 9.
641 00:38:47,200 --> 00:38:51,080 It will then create a list of length 10, where 642 00:38:51,080 --> 00:38:54,260 the first element is going to be 0 times 0. 643 00:38:54,260 --> 00:38:58,630 The second element 1 times 1, etc. 644 00:38:58,630 --> 00:38:59,820 OK? 645 00:38:59,820 --> 00:39:01,560 So it's a simple way for me to create 646 00:39:01,560 --> 00:39:05,030 a list that looks like that. 647 00:39:05,030 --> 00:39:12,800 I can be fancier and say L equals x times x for x 648 00:39:12,800 --> 00:39:15,810 in range 10, and I add an if: 649 00:39:15,810 --> 00:39:20,080 if x mod 2 is equal to 0. 650 00:39:20,080 --> 00:39:22,540 Now instead of returning all-- 651 00:39:22,540 --> 00:39:25,880 building a list using each value in range 10, 652 00:39:25,880 --> 00:39:29,754 it will use only those values that satisfy that test. 653 00:39:34,880 --> 00:39:37,220 We can go look at what happens when we run that code. 654 00:39:51,700 --> 00:39:54,610 You can see the first list is 1 times 1, 2 times 655 00:39:54,610 --> 00:39:57,100 2, et cetera, and the second list 656 00:39:57,100 --> 00:40:00,820 is much shorter, because I'm only squaring even numbers. 657 00:40:07,060 --> 00:40:09,280 Well, you can see that list comprehension gives us 658 00:40:09,280 --> 00:40:13,940 a convenient compact way to do certain kinds of things. 659 00:40:13,940 --> 00:40:19,460 Like lambda expressions, they're easy to misuse. 660 00:40:19,460 --> 00:40:22,220 I hate reading code where I have list comprehensions that 661 00:40:22,220 --> 00:40:26,060 go over multiple lines on my screen, for example. 662 00:40:26,060 --> 00:40:29,750 So I use it quite a lot for small things like this. 663 00:40:29,750 --> 00:40:33,110 If it's very large, I find another way to do it. 664 00:40:48,410 --> 00:40:49,720 Now we can move forward. 665 00:40:58,790 --> 00:41:03,480 In applying the model, I first build my testing feature 666 00:41:03,480 --> 00:41:07,160 vecs, my e.getFeatures for e in test set, 667 00:41:07,160 --> 00:41:09,290 so that will give me the features associated 668 00:41:09,290 --> 00:41:11,570 with each element in the test set. 669 00:41:11,570 --> 00:41:14,930 I could obviously have written a for loop to do the same thing, 670 00:41:14,930 --> 00:41:18,250 but this was just a little cooler. 671 00:41:18,250 --> 00:41:22,690 Then we get model.predict for each of these. 672 00:41:22,690 --> 00:41:28,120 Model.predict_proba is nice in that I don't have to predict it 673 00:41:28,120 --> 00:41:30,340 for one example at a time. 674 00:41:30,340 --> 00:41:33,880 I can pass it a set of examples, and what I get back 675 00:41:33,880 --> 00:41:42,890 is a list of predictions, so that's just convenient. 676 00:41:42,890 --> 00:41:50,420 And then setting these to 0, and for i in range len of probs, 677 00:41:50,420 --> 00:41:53,280 with, here, a probability of 0.5. 678 00:41:53,280 --> 00:42:00,200 What that's saying is what I get out of logistic regression 679 00:42:00,200 --> 00:42:04,570 is a probability of something having a label. 680 00:42:04,570 --> 00:42:08,950 I then have to build a classifier by giving a threshold. 681 00:42:08,950 --> 00:42:11,650 And here what I've said, if the probability of it being true 682 00:42:11,650 --> 00:42:14,890 is over 0.5, call it true. 683 00:42:14,890 --> 00:42:17,650 So if the probability of survival is over 0.5, 684 00:42:17,650 --> 00:42:19,030 call it survived. 685 00:42:19,030 --> 00:42:22,600 If it's below, call it not survived.
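Putting the last few slides together, here is a sketch of the apply-model step just described: a list comprehension to gather the test feature vectors, predict_proba to get the probabilities, and a 0.5 cutoff to turn them into labels and tally the four counts. The getFeatures/getLabel method names follow the lecture's examples; the rest is a reconstruction, not the posted code.

    def apply_model(model, test_set, label, prob=0.5):
        """Count true/false positives/negatives for `model` on `test_set`."""
        # list comprehension: one feature vector per test example
        test_feature_vecs = [e.getFeatures() for e in test_set]
        # predict_proba returns one row of class probabilities per example
        probs = model.predict_proba(test_feature_vecs)
        true_pos, false_pos, true_neg, false_neg = 0, 0, 0, 0
        for i in range(len(probs)):
            # column 1 is assumed to be the probability of `label`, i.e. that
            # `label` is the second entry of model.classes_
            if probs[i][1] > prob:
                if test_set[i].getLabel() == label:
                    true_pos += 1
                else:
                    false_pos += 1
            else:
                if test_set[i].getLabel() != label:
                    true_neg += 1
                else:
                    false_neg += 1
        return true_pos, false_pos, true_neg, false_neg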
686 00:42:22,600 --> 00:42:27,430 We'll later see that, again, setting that probability 687 00:42:27,430 --> 00:42:31,630 is itself an interesting thing, but the default in most systems 688 00:42:31,630 --> 00:42:34,390 is half, for obvious reasons. 689 00:42:38,280 --> 00:42:41,970 I get my probabilities for each feature vector, 690 00:42:41,970 --> 00:42:44,820 and then for i in range len of probabilities, 691 00:42:44,820 --> 00:42:48,840 I'm just testing whether the predicted label is 692 00:42:48,840 --> 00:42:54,000 the same as the actual label, and updating true positives, 693 00:42:54,000 --> 00:42:56,940 false positives, true negatives, and false negatives 694 00:42:56,940 --> 00:42:59,518 accordingly. 695 00:42:59,518 --> 00:43:00,492 So far, so good? 696 00:43:05,860 --> 00:43:09,200 All right, let's put it all together. 697 00:43:09,200 --> 00:43:13,225 I'm defining something called LR, for logistic regression. 698 00:43:13,225 --> 00:43:17,720 It takes the training data, the test data, the probability, 699 00:43:17,720 --> 00:43:21,810 it builds a model, and then it gets the results 700 00:43:21,810 --> 00:43:24,520 by calling apply model with the label survived 701 00:43:24,520 --> 00:43:27,840 and whatever this prob was. 702 00:43:27,840 --> 00:43:30,430 Again, we'll do it for both leave one out 703 00:43:30,430 --> 00:43:34,950 and random splits, and again for 10 random splits. 704 00:44:03,790 --> 00:44:05,820 You'll notice it actually runs-- 705 00:44:05,820 --> 00:44:10,700 maybe you won't notice, but it does run faster than KNN. 706 00:44:10,700 --> 00:44:13,460 One of the nice things about logistic regression 707 00:44:13,460 --> 00:44:16,010 is building the model takes a while, 708 00:44:16,010 --> 00:44:18,590 but once you've got the model, applying it 709 00:44:18,590 --> 00:44:23,660 to a large number of variables-- feature vectors is fast. 710 00:44:23,660 --> 00:44:25,940 It's independent of the number of training examples, 711 00:44:25,940 --> 00:44:29,000 because we've got our weights. 712 00:44:29,000 --> 00:44:32,450 So solving the optimization problem, getting the weights, 713 00:44:32,450 --> 00:44:35,180 depends upon the number of training examples. 714 00:44:35,180 --> 00:44:39,350 Once we've got the weights, it's just evaluating a polynomial. 715 00:44:39,350 --> 00:44:42,986 It's very fast, so that's a nice advantage. 716 00:44:46,720 --> 00:44:47,595 If we look at those-- 717 00:44:55,170 --> 00:44:59,290 and we should probably compare them to our earlier KNN 718 00:44:59,290 --> 00:45:04,560 results, so KNN on the left, logistic regression 719 00:45:04,560 --> 00:45:06,290 on the right. 720 00:45:06,290 --> 00:45:12,000 And I guess if I look at it, it looks like logistic regression 721 00:45:12,000 --> 00:45:13,100 did a little bit better. 722 00:45:18,100 --> 00:45:20,580 That's not guaranteed, but it often 723 00:45:20,580 --> 00:45:25,172 does outperform because it's more subtle in what it does, 724 00:45:25,172 --> 00:45:26,880 in being able to assign different weights 725 00:45:26,880 --> 00:45:30,330 to different variables. 726 00:45:30,330 --> 00:45:31,400 It's a little bit better. 727 00:45:31,400 --> 00:45:36,800 That's probably a good thing, but there's 728 00:45:36,800 --> 00:45:40,040 another reason that's really important that people prefer 729 00:45:40,040 --> 00:45:42,680 logistic regression: it provides 730 00:45:42,680 --> 00:45:46,570 insights about the variables.
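To make the sklearn calls concrete before looking at the Titanic weights, here is a tiny self-contained example with made-up feature vectors and labels. Only fit, classes_, coef_, and predict_proba are the real library API; the data and its three features are invented toy values, not the Titanic encoding.

    import sklearn.linear_model

    # Toy data: three made-up features per example, two labels.
    feature_vecs = [[1, 0, 25], [0, 1, 60], [1, 1, 30],
                    [0, 0, 50], [1, 0, 18], [0, 1, 45]]
    labels = ['Survived', 'Died', 'Survived', 'Died', 'Survived', 'Died']

    # Build the model: create an instance of the class, then fit it.
    model = sklearn.linear_model.LogisticRegression()
    model = model.fit(feature_vecs, labels)

    print(model.classes_)                      # the labels the model learned
    print(model.coef_)                         # one weight per feature (one row for the binary case)
    print(model.predict_proba([[1, 0, 40]]))   # class probabilities for a new feature vector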
731 00:45:46,570 --> 00:45:48,245 We can look at the feature weights. 732 00:45:51,100 --> 00:45:56,130 This code does that, so remember we looked at build model 733 00:45:56,130 --> 00:45:58,390 and I left out the printing? 734 00:45:58,390 --> 00:46:01,630 Well here I'm leaving out everything except the printing. 735 00:46:01,630 --> 00:46:04,900 Same function, but leaving out everything except the printing. 736 00:46:07,410 --> 00:46:10,250 We can do model dot classes underbar, 737 00:46:10,250 --> 00:46:16,110 so model.classes_ gives you the classes. 738 00:46:16,110 --> 00:46:19,707 In this case, the classes are survived, didn't survive. 739 00:46:19,707 --> 00:46:20,790 I forget what I called it. 740 00:46:20,790 --> 00:46:22,200 We'll see. 741 00:46:22,200 --> 00:46:24,270 So I can see what the classes it's using 742 00:46:24,270 --> 00:46:30,510 are, and then for i in range len model dot coef underbar, 743 00:46:30,510 --> 00:46:32,910 these are giving the weights of each variable. 744 00:46:32,910 --> 00:46:36,656 The coefficients, I can print what they are. 745 00:46:39,530 --> 00:46:41,450 So let's run that and see what we get. 746 00:46:47,890 --> 00:46:50,940 We get a syntax error because I turned a comment 747 00:46:50,940 --> 00:46:51,940 into a line of code. 748 00:47:03,320 --> 00:47:08,650 Our model classes are died and survived, 749 00:47:08,650 --> 00:47:12,460 and for label survived-- 750 00:47:12,460 --> 00:47:15,100 what I've done, by the way, in the representation 751 00:47:15,100 --> 00:47:18,820 is I represented the cabin class as a binary variable. 752 00:47:18,820 --> 00:47:22,600 It's either 0 or 1, because it doesn't make sense 753 00:47:22,600 --> 00:47:26,896 to treat them as if they were really numbers because we don't 754 00:47:26,896 --> 00:47:28,270 know, for example, that the difference 755 00:47:28,270 --> 00:47:31,030 between first and second is the same as the difference 756 00:47:31,030 --> 00:47:33,050 between second and third. 757 00:47:33,050 --> 00:47:35,570 If we treated the class as a number-- we just said cabin class 758 00:47:35,570 --> 00:47:39,610 and used an integer-- implicitly the learning algorithm 759 00:47:39,610 --> 00:47:42,250 is going to assume that the difference between 1 and 2 760 00:47:42,250 --> 00:47:44,770 is the same as between 2 and 3. 761 00:47:44,770 --> 00:47:47,320 If you, for example, look at the prices of these cabins, 762 00:47:47,320 --> 00:47:50,690 you'll see that that's not true. 763 00:47:50,690 --> 00:47:53,120 The difference in an airplane between economy plus 764 00:47:53,120 --> 00:47:58,040 and economy is way smaller than between economy plus and first. 765 00:47:58,040 --> 00:48:00,840 Same thing on the Titanic. 766 00:48:00,840 --> 00:48:06,060 But what we see here is that for the label survived, 767 00:48:06,060 --> 00:48:08,340 pretty good sized positive weight 768 00:48:08,340 --> 00:48:10,320 for being in a first class cabin. 769 00:48:13,000 --> 00:48:14,560 Moderate for being in the second, 770 00:48:14,560 --> 00:48:18,130 and if you're in the third class, well, tough luck. 771 00:48:18,130 --> 00:48:20,590 So what we see here is that rich people did better 772 00:48:20,590 --> 00:48:22,180 than the poor people. 773 00:48:22,180 --> 00:48:25,135 Shocking. 774 00:48:25,135 --> 00:48:29,820 If we look at age, we'll see it's negatively correlated. 775 00:48:29,820 --> 00:48:32,010 What does this mean?
776 00:48:32,010 --> 00:48:34,110 It's not a huge weight, but it basically 777 00:48:34,110 --> 00:48:39,780 says that if you're older, the bigger your age, 778 00:48:39,780 --> 00:48:44,770 the less likely you are to have survived the disaster. 779 00:48:44,770 --> 00:48:47,860 And finally, it says it's really bad 780 00:48:47,860 --> 00:48:52,330 to be a male, that the men-- 781 00:48:52,330 --> 00:48:57,040 being a male was very negatively correlated with surviving. 782 00:48:57,040 --> 00:49:01,060 A nice thing we see here is we get these labels, which 783 00:49:01,060 --> 00:49:03,040 we can make sense of. 784 00:49:03,040 --> 00:49:05,080 One more slide and then I'm done. 785 00:49:09,890 --> 00:49:11,910 These values are slightly different, 786 00:49:11,910 --> 00:49:15,270 because of different randomization, a different example, 787 00:49:15,270 --> 00:49:17,820 but the main point I want to say is 788 00:49:17,820 --> 00:49:19,830 you have to be a little bit wary of reading 789 00:49:19,830 --> 00:49:22,290 too much into these weights. 790 00:49:22,290 --> 00:49:26,220 Because, not in this example, but in other examples-- 791 00:49:26,220 --> 00:49:30,580 well, also in these-- features are often correlated, 792 00:49:30,580 --> 00:49:36,210 and if they're correlated, you run-- 793 00:49:36,210 --> 00:49:37,620 actually it's 3:56. 794 00:49:37,620 --> 00:49:40,590 I'm going to explain the problem with this on Monday 795 00:49:40,590 --> 00:49:42,900 when I have time to do it properly. 796 00:49:42,900 --> 00:49:45,440 So I'll see you then.