1 00:00:00,790 --> 00:00:03,130 The following content is provided under a Creative 2 00:00:03,130 --> 00:00:04,550 Commons license. 3 00:00:04,550 --> 00:00:06,760 Your support will help MIT OpenCourseWare 4 00:00:06,760 --> 00:00:10,850 continue to offer high-quality educational resources for free. 5 00:00:10,850 --> 00:00:13,390 To make a donation, or to view additional materials 6 00:00:13,390 --> 00:00:17,320 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,320 --> 00:00:18,570 at ocw.mit.edu. 8 00:00:21,462 --> 00:00:22,920 JOHN GUTTAG: I'm a little reluctant 9 00:00:22,920 --> 00:00:25,880 to say good afternoon, given the weather, 10 00:00:25,880 --> 00:00:28,500 but I'll say it anyway. 11 00:00:28,500 --> 00:00:32,900 I guess now we all do know that we live in Boston. 12 00:00:32,900 --> 00:00:34,880 And I should say, I hope none of you 13 00:00:34,880 --> 00:00:39,740 were affected too much by the fire yesterday in Cambridge, 14 00:00:39,740 --> 00:00:42,650 but that seems to have been a pretty disastrous event 15 00:00:42,650 --> 00:00:44,000 for some. 16 00:00:44,000 --> 00:00:45,740 Anyway, here's the reading. 17 00:00:45,740 --> 00:00:48,840 This is a chapter in the book on clustering, 18 00:00:48,840 --> 00:00:52,610 a topic that Professor Grimson introduced last week. 19 00:00:52,610 --> 00:00:57,560 And I'm going to try and finish up with respect to this course 20 00:00:57,560 --> 00:01:00,080 today, though not with respect to everything 21 00:01:00,080 --> 00:01:02,780 there is to know about clustering. 22 00:01:02,780 --> 00:01:07,700 Quickly just reviewing where we were. 23 00:01:07,700 --> 00:01:10,640 We're in the unit of a course on machine learning, 24 00:01:10,640 --> 00:01:13,190 and we always follow the same paradigm. 25 00:01:13,190 --> 00:01:16,160 We observe some set of examples, which 26 00:01:16,160 --> 00:01:18,440 we call the training data. 27 00:01:18,440 --> 00:01:22,010 We try and infer something about the process 28 00:01:22,010 --> 00:01:25,450 that created those examples. 29 00:01:25,450 --> 00:01:28,390 And then we use inference techniques, different kinds 30 00:01:28,390 --> 00:01:30,760 of techniques, to make predictions 31 00:01:30,760 --> 00:01:33,820 about previously unseen data. 32 00:01:33,820 --> 00:01:36,830 We call that the test data. 33 00:01:36,830 --> 00:01:40,790 As Professor Grimson said, you can think of two broad classes. 34 00:01:40,790 --> 00:01:44,450 Supervised, where we have a set of examples and some label 35 00:01:44,450 --> 00:01:46,670 associated with the example-- 36 00:01:46,670 --> 00:01:50,600 Democrat, Republican, smart, dumb, 37 00:01:50,600 --> 00:01:54,770 whatever you want to associate with them-- 38 00:01:54,770 --> 00:01:57,920 and then we try and infer the labels. 39 00:01:57,920 --> 00:02:02,270 Or unsupervised, where we're given a set of feature vectors 40 00:02:02,270 --> 00:02:05,660 without labels, and then we attempt to group 41 00:02:05,660 --> 00:02:09,860 them into natural clusters. 42 00:02:09,860 --> 00:02:13,470 That's going to be today's topic, clustering. 43 00:02:13,470 --> 00:02:18,440 So clustering is an optimization problem. 44 00:02:18,440 --> 00:02:20,780 As we'll see later, supervised machine learning 45 00:02:20,780 --> 00:02:23,330 is also an optimization problem. 46 00:02:23,330 --> 00:02:26,660 Clustering's a rather simple one. 47 00:02:26,660 --> 00:02:31,180 We're going to start first with the notion of variability. 
48 00:02:31,180 --> 00:02:34,940 So this little c is a single cluster, 49 00:02:34,940 --> 00:02:38,750 and we're going to talk about the variability in that cluster 50 00:02:38,750 --> 00:02:45,440 of the sum of the distance between the mean of the cluster 51 00:02:45,440 --> 00:02:47,880 and each example in the cluster. 52 00:02:47,880 --> 00:02:50,920 And then we square it. 53 00:02:50,920 --> 00:02:51,800 OK? 54 00:02:51,800 --> 00:02:54,860 Pretty straightforward. 55 00:02:54,860 --> 00:02:56,510 For the moment, we can just assume 56 00:02:56,510 --> 00:02:59,720 that we're using Euclidean distance as our distance 57 00:02:59,720 --> 00:03:00,910 metric. 58 00:03:00,910 --> 00:03:04,080 Minkowski with p equals two. 59 00:03:04,080 --> 00:03:10,030 So variability should look pretty similar to something 60 00:03:10,030 --> 00:03:13,010 we've seen before, right? 61 00:03:13,010 --> 00:03:16,100 It's not quite variance, right, but it's very close. 62 00:03:16,100 --> 00:03:19,650 In a minute, we'll look at why it's different. 63 00:03:19,650 --> 00:03:23,160 And then we can look at the dissimilarity 64 00:03:23,160 --> 00:03:27,570 of a set of clusters, a group of clusters, which I'm writing 65 00:03:27,570 --> 00:03:30,600 as capital C, and that's just the sum 66 00:03:30,600 --> 00:03:32,190 of all the variabilities. 67 00:03:34,720 --> 00:03:40,150 Now, if I had divided variability 68 00:03:40,150 --> 00:03:45,514 by the size of the cluster, what would I have? 69 00:03:45,514 --> 00:03:46,680 Something we've seen before. 70 00:03:46,680 --> 00:03:49,410 What would that be? 71 00:03:49,410 --> 00:03:51,890 Somebody? 72 00:03:51,890 --> 00:03:55,070 Isn't that just the variance? 73 00:03:55,070 --> 00:03:57,910 So the question is, why am I not doing that? 74 00:03:57,910 --> 00:04:02,170 If up til now, we always wanted to talk about variance, 75 00:04:02,170 --> 00:04:05,310 why suddenly am I not doing it? 76 00:04:05,310 --> 00:04:07,800 Why do I define this notion of variability 77 00:04:07,800 --> 00:04:10,750 instead of good old variance? 78 00:04:10,750 --> 00:04:11,395 Any thoughts? 79 00:04:15,120 --> 00:04:18,300 What am I accomplishing by not dividing 80 00:04:18,300 --> 00:04:20,459 by the size of the cluster? 81 00:04:20,459 --> 00:04:22,350 Or what would happen if I did divide 82 00:04:22,350 --> 00:04:24,420 by the size of the cluster? 83 00:04:24,420 --> 00:04:25,258 Yes. 84 00:04:25,258 --> 00:04:26,711 AUDIENCE: You normalize it? 85 00:04:26,711 --> 00:04:27,710 JOHN GUTTAG: Absolutely. 86 00:04:27,710 --> 00:04:29,720 I'd normalize it. 87 00:04:29,720 --> 00:04:31,820 That's exactly what it would be doing. 88 00:04:31,820 --> 00:04:36,380 And what might be good or bad about normalizing it? 89 00:04:41,010 --> 00:04:44,280 What does it essentially mean to normalize? 90 00:04:44,280 --> 00:04:48,420 It means that the penalty for a big cluster 91 00:04:48,420 --> 00:04:51,540 with a lot of variance in it is no higher 92 00:04:51,540 --> 00:04:53,520 than the penalty of a tiny little cluster 93 00:04:53,520 --> 00:04:56,720 with a lot of variance in it. 94 00:04:56,720 --> 00:05:00,590 By not normalizing, what I'm saying is 95 00:05:00,590 --> 00:05:05,510 I want to penalize big, highly-diverse clusters 96 00:05:05,510 --> 00:05:09,370 more than small, highly-diverse clusters. 97 00:05:09,370 --> 00:05:09,870 OK? 98 00:05:09,870 --> 00:05:12,990 And if you think about it, that probably makes sense. 99 00:05:15,770 --> 00:05:18,470 Big and bad is worse than small and bad. 
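A minimal numpy sketch of the two definitions above, assuming each example is just a plain feature vector (the names variability and dissimilarity mirror the slide, not any particular course file):

```python
import numpy as np

def variability(cluster):
    """Sum of squared Euclidean distances from each example to the cluster's mean."""
    examples = np.asarray(cluster, dtype=float)   # one row per example
    mean = examples.mean(axis=0)                  # mean (centroid) of the cluster
    return float(((examples - mean) ** 2).sum()) # note: no division by cluster size

def dissimilarity(clusters):
    """Objective for a set of clusters: the sum of their variabilities."""
    return sum(variability(c) for c in clusters)

# A big, spread-out cluster is penalized more than a small, tight one.
print(dissimilarity([[[0, 0], [0, 1]], [[0, 0], [10, 10], [-10, 4]]]))
```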
100 00:05:21,500 --> 00:05:26,110 All right, so now we define the objective function. 101 00:05:26,110 --> 00:05:29,250 And can we say that the optimization problem 102 00:05:29,250 --> 00:05:34,470 we want to solve by clustering is simply finding a capital 103 00:05:34,470 --> 00:05:37,860 C that minimizes dissimilarity? 104 00:05:41,500 --> 00:05:43,460 Is that a reasonable definition? 105 00:05:46,743 --> 00:05:51,050 Well, hint-- no. 106 00:05:51,050 --> 00:05:54,680 What foolish thing could we do that would optimize 107 00:05:54,680 --> 00:05:56,510 that objective function? 108 00:05:56,510 --> 00:05:57,010 Yeah. 109 00:05:57,010 --> 00:05:58,676 AUDIENCE: You could have the same number 110 00:05:58,676 --> 00:05:59,720 of clusters as points? 111 00:05:59,720 --> 00:06:00,500 JOHN GUTTAG: Yeah. 112 00:06:00,500 --> 00:06:02,100 I can have the same number of clusters 113 00:06:02,100 --> 00:06:07,700 as points, assign each point to its own cluster, whoops. 114 00:06:07,700 --> 00:06:10,010 Ooh, almost a relay. 115 00:06:10,010 --> 00:06:14,520 The dissimilarity of each cluster would be 0. 116 00:06:14,520 --> 00:06:17,270 The variability would be 0, so the dissimilarity would be 0, 117 00:06:17,270 --> 00:06:19,630 and I just solved the problem. 118 00:06:19,630 --> 00:06:24,040 Well, that's clearly not a very useful thing to do. 119 00:06:24,040 --> 00:06:28,870 So, well, what do you think we do to get around that? 120 00:06:28,870 --> 00:06:29,370 Yeah. 121 00:06:29,370 --> 00:06:30,750 AUDIENCE: We apply a constraint? 122 00:06:30,750 --> 00:06:32,530 JOHN GUTTAG: We apply a constraint. 123 00:06:32,530 --> 00:06:33,030 Exactly. 124 00:06:35,830 --> 00:06:38,730 And so we have to pick some constraint. 125 00:06:42,970 --> 00:06:48,020 What would be a suitable constraint, for example? 126 00:06:48,020 --> 00:06:51,080 Well, maybe we'd say, OK, the clusters 127 00:06:51,080 --> 00:06:53,450 have to have some minimum distance between them. 128 00:06:55,960 --> 00:06:59,580 Or-- and this is the constraint we'll be using today-- 129 00:06:59,580 --> 00:07:02,740 we could constrain the number of clusters. 130 00:07:02,740 --> 00:07:07,160 Say, all right, I only want to have at most five clusters. 131 00:07:07,160 --> 00:07:11,680 Do the best you can to minimize dissimilarity, 132 00:07:11,680 --> 00:07:14,630 but you're not allowed to use more than five clusters. 133 00:07:14,630 --> 00:07:17,230 That's the most common constraint that 134 00:07:17,230 --> 00:07:20,550 gets placed in the problem. 135 00:07:20,550 --> 00:07:23,036 All right, we're going to look at two algorithms. 136 00:07:23,036 --> 00:07:24,910 Maybe I should say two methods, because there 137 00:07:24,910 --> 00:07:28,780 are multiple implementations of these methods. 138 00:07:28,780 --> 00:07:31,650 The first is called hierarchical clustering, 139 00:07:31,650 --> 00:07:33,750 and the second is called k-means. 140 00:07:33,750 --> 00:07:36,460 There should be an S on the word mean there. 141 00:07:36,460 --> 00:07:38,650 Sorry about that. 142 00:07:38,650 --> 00:07:41,000 All right, let's look at hierarchical clustering first. 143 00:07:44,330 --> 00:07:47,460 It's a strange algorithm. 144 00:07:47,460 --> 00:07:51,870 We start by assigning each item, each example, 145 00:07:51,870 --> 00:07:54,220 to its own cluster. 146 00:07:54,220 --> 00:07:57,610 So this is the trivial solution we talked about before. 
147 00:07:57,610 --> 00:07:59,850 So if you have N items, you now have N clusters, 148 00:07:59,850 --> 00:08:02,280 each containing just one item. 149 00:08:07,050 --> 00:08:12,870 In the next step, we find the two most similar clusters 150 00:08:12,870 --> 00:08:17,470 we have and merge them into a single cluster, 151 00:08:17,470 --> 00:08:19,300 so that now instead of N clusters, 152 00:08:19,300 --> 00:08:20,860 we have N minus 1 clusters. 153 00:08:26,210 --> 00:08:29,140 And we continue this process until all items 154 00:08:29,140 --> 00:08:34,010 are clustered into a single cluster of size N. 155 00:08:34,010 --> 00:08:36,980 Now of course, that's kind of silly, 156 00:08:36,980 --> 00:08:38,456 because if all I wanted was to put them 157 00:08:38,456 --> 00:08:39,830 all in a single cluster, 158 00:08:39,830 --> 00:08:40,829 I don't need to iterate. 159 00:08:40,829 --> 00:08:43,280 I just go wham, right? 160 00:08:43,280 --> 00:08:46,010 But what's interesting about hierarchical clustering 161 00:08:46,010 --> 00:08:50,770 is you stop it, typically, somewhere along the way. 162 00:08:50,770 --> 00:08:53,960 You produce something called a dendrogram. 163 00:08:53,960 --> 00:08:55,240 Let me write that down. 164 00:09:02,960 --> 00:09:08,920 At each step here, it shows you what you've merged thus far. 165 00:09:08,920 --> 00:09:11,330 We'll see an example of that shortly. 166 00:09:11,330 --> 00:09:14,170 And then you can have some stopping criteria. 167 00:09:14,170 --> 00:09:16,730 We'll talk about that. 168 00:09:16,730 --> 00:09:19,820 This is called agglomerative hierarchical 169 00:09:19,820 --> 00:09:23,000 clustering because we start with a bunch of things 170 00:09:23,000 --> 00:09:24,200 and we agglomerate them. 171 00:09:24,200 --> 00:09:28,050 That is to say, we put them together. 172 00:09:28,050 --> 00:09:28,920 All right? 173 00:09:28,920 --> 00:09:31,480 Make sense? 174 00:09:31,480 --> 00:09:34,060 Well, there's a catch. 175 00:09:34,060 --> 00:09:36,760 What do we mean by distance? 176 00:09:36,760 --> 00:09:42,160 And there are multiple plausible definitions of distance, 177 00:09:42,160 --> 00:09:44,470 and you would get a different answer depending 178 00:09:44,470 --> 00:09:45,985 upon which measure you used. 179 00:09:50,410 --> 00:09:53,350 These are called linkage metrics. 180 00:09:53,350 --> 00:09:58,040 The most common one used is probably single-linkage, 181 00:09:58,040 --> 00:10:01,930 and that says the distance between a pair of clusters 182 00:10:01,930 --> 00:10:06,130 is equal to the shortest distance from any member of one 183 00:10:06,130 --> 00:10:08,990 cluster to any member of the other cluster. 184 00:10:12,100 --> 00:10:17,580 So if I have two clusters, here and here, 185 00:10:17,580 --> 00:10:21,230 and they have bunches of points in them, 186 00:10:21,230 --> 00:10:23,510 single-linkage distance would say, well, 187 00:10:23,510 --> 00:10:27,260 let's use these two points which are the closest, 188 00:10:27,260 --> 00:10:29,780 and the distance between these two 189 00:10:29,780 --> 00:10:32,215 is the distance between the clusters. 190 00:10:37,090 --> 00:10:43,990 You can also use complete-linkage, 191 00:10:43,990 --> 00:10:47,140 and that says the distance between any two clusters 192 00:10:47,140 --> 00:10:50,170 is equal to the greatest distance from any member 193 00:10:50,170 --> 00:10:53,441 to any other member. 194 00:10:53,441 --> 00:10:53,940 OK?
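As a rough sketch, the single- and complete-linkage distances between two clusters could be written like this, treating each cluster as a list of feature vectors (the helper names here are assumptions, not the course code):

```python
import numpy as np
from itertools import product

def euclidean(x, y):
    """Euclidean (Minkowski p = 2) distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float)))

def single_linkage(c1, c2):
    """Shortest distance from any member of one cluster to any member of the other."""
    return min(euclidean(e1, e2) for e1, e2 in product(c1, c2))

def complete_linkage(c1, c2):
    """Greatest distance from any member of one cluster to any member of the other."""
    return max(euclidean(e1, e2) for e1, e2 in product(c1, c2))
```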
195 00:10:53,940 --> 00:10:56,150 So if we had the same picture we had before-- 196 00:11:01,860 --> 00:11:04,810 probably not the same picture, but it's a picture. 197 00:11:04,810 --> 00:11:07,450 Whoops. 198 00:11:07,450 --> 00:11:10,930 Then we would say, well, I guess complete-linkage is probably 199 00:11:10,930 --> 00:11:12,760 the distance, maybe, between those two. 200 00:11:19,078 --> 00:11:24,550 And finally, not surprisingly, you 201 00:11:24,550 --> 00:11:28,530 can take the average distance. 202 00:11:28,530 --> 00:11:31,050 These are all plausible metrics. 203 00:11:31,050 --> 00:11:36,450 They're all used and practiced for different kinds of results 204 00:11:36,450 --> 00:11:39,740 depending upon the application of the clustering. 205 00:11:42,740 --> 00:11:45,750 All right, let's look at an example. 206 00:11:45,750 --> 00:11:49,070 So what I have here is the air distance 207 00:11:49,070 --> 00:11:55,200 between six different cities, Boston, New York, Chicago, 208 00:11:55,200 --> 00:11:59,890 Denver, San Francisco, and Seattle. 209 00:11:59,890 --> 00:12:04,910 And now let's say we want to cluster these airports just 210 00:12:04,910 --> 00:12:07,470 based upon their distance. 211 00:12:07,470 --> 00:12:09,620 So we start. 212 00:12:09,620 --> 00:12:12,860 The first piece of our dendrogram says, 213 00:12:12,860 --> 00:12:15,080 well, all right, I have six cities, 214 00:12:15,080 --> 00:12:17,480 I have six clusters, each containing one city. 215 00:12:22,777 --> 00:12:23,985 All right, what happens next? 216 00:12:27,030 --> 00:12:30,550 What's the next level going to look like? 217 00:12:30,550 --> 00:12:31,050 Yeah? 218 00:12:31,050 --> 00:12:32,980 AUDIENCE: You're going from Boston [INAUDIBLE] 219 00:12:32,980 --> 00:12:35,620 JOHN GUTTAG: I'm going to join Boston and New York, as 220 00:12:35,620 --> 00:12:38,860 improbable as that sounds. 221 00:12:38,860 --> 00:12:42,130 All right, so that's the next level. 222 00:12:42,130 --> 00:12:45,640 And if for some reason I only wanted to have five clusters, 223 00:12:45,640 --> 00:12:48,890 well, I could stop here. 224 00:12:48,890 --> 00:12:50,330 Next, what happens? 225 00:12:53,260 --> 00:12:56,100 Well, I look at it, I say well, I'll 226 00:12:56,100 --> 00:12:58,790 join up Chicago with Boston and New York. 227 00:13:04,320 --> 00:13:04,820 All right. 228 00:13:04,820 --> 00:13:06,590 What do I get at the next level? 229 00:13:06,590 --> 00:13:07,150 Somebody? 230 00:13:07,150 --> 00:13:07,650 Yeah. 231 00:13:07,650 --> 00:13:12,150 AUDIENCE: Seattle [INAUDIBLE] 232 00:13:12,150 --> 00:13:14,110 JOHN GUTTAG: Doesn't look like it to me. 233 00:13:14,110 --> 00:13:21,130 If you look at San Francisco and Seattle, they are 808 miles, 234 00:13:21,130 --> 00:13:27,140 and Denver and San Francisco is 1,235. 235 00:13:27,140 --> 00:13:31,241 So I'd end up, in fact, joining San Francisco and Seattle. 236 00:13:31,241 --> 00:13:34,130 AUDIENCE: That's what I said. 237 00:13:34,130 --> 00:13:38,084 JOHN GUTTAG: Well, that explains why I need my hearing fixed. 238 00:13:38,084 --> 00:13:39,380 [LAUGHTER] 239 00:13:39,380 --> 00:13:40,490 All right. 240 00:13:40,490 --> 00:13:44,480 So I combine San Francisco and Seattle, 241 00:13:44,480 --> 00:13:47,110 and now it gets interesting. 242 00:13:47,110 --> 00:13:50,230 I have two choices with Denver. 243 00:13:50,230 --> 00:13:57,520 Obviously, there are only two choices, 244 00:13:57,520 --> 00:14:03,280 and which I choose depends upon which linkage criterion I use.
245 00:14:03,280 --> 00:14:07,030 If I'm using single-linkage, well, then Denver 246 00:14:07,030 --> 00:14:09,910 gets joined with Boston, New York, and Chicago, 247 00:14:09,910 --> 00:14:13,570 because it's closer to Chicago than it is to either San 248 00:14:13,570 --> 00:14:14,760 Francisco or Seattle. 249 00:14:17,420 --> 00:14:20,160 But if I use complete-linkage, it 250 00:14:20,160 --> 00:14:23,950 gets joined up with San Francisco and Seattle, 251 00:14:23,950 --> 00:14:31,060 because it is further from Boston than it is from, 252 00:14:31,060 --> 00:14:32,920 I guess it's San Francisco or Seattle. 253 00:14:32,920 --> 00:14:35,310 Whichever it is, right? 254 00:14:35,310 --> 00:14:37,920 So this is a place where you see what 255 00:14:37,920 --> 00:14:41,160 answer I get depends upon the linkage criterion. 256 00:14:41,160 --> 00:14:44,100 And then if I want, I can continue to the next step 257 00:14:44,100 --> 00:14:46,090 and just join them all. 258 00:14:46,090 --> 00:14:47,100 All right? 259 00:14:47,100 --> 00:14:50,670 That's hierarchical clustering. 260 00:14:50,670 --> 00:14:56,110 So it's good because you get this whole history of the 261 00:14:56,110 --> 00:14:59,320 dendrograms, and you get to look at it, 262 00:14:59,320 --> 00:15:02,600 say, well, all right, that looks pretty good. 263 00:15:02,600 --> 00:15:06,560 I'll stick with this clustering. 264 00:15:06,560 --> 00:15:09,600 It's deterministic. 265 00:15:09,600 --> 00:15:13,680 Given a linkage criterion, you always get the same answer. 266 00:15:13,680 --> 00:15:14,900 There's nothing random here. 267 00:15:17,500 --> 00:15:20,500 Notice, by the way, the answer might not 268 00:15:20,500 --> 00:15:23,680 be optimal with regards to that linkage criterion. 269 00:15:23,680 --> 00:15:26,480 Why not? 270 00:15:26,480 --> 00:15:29,132 What kind of algorithm is this? 271 00:15:29,132 --> 00:15:29,840 AUDIENCE: Greedy. 272 00:15:29,840 --> 00:15:32,420 JOHN GUTTAG: It's a greedy algorithm, exactly. 273 00:15:32,420 --> 00:15:34,940 And so I'm making locally optimal decisions 274 00:15:34,940 --> 00:15:38,510 at each point which may or may not be globally optimal. 275 00:15:43,160 --> 00:15:44,450 It's flexible. 276 00:15:44,450 --> 00:15:46,070 Choosing different linkage criteria, 277 00:15:46,070 --> 00:15:48,050 I get different results. 278 00:15:48,050 --> 00:15:53,660 But it's also potentially really, really slow. 279 00:15:53,660 --> 00:15:58,610 This is not something you want to do on a million examples. 280 00:15:58,610 --> 00:16:02,570 The naive algorithm, the one I just sort of showed you, 281 00:16:02,570 --> 00:16:05,730 is N cubed. 282 00:16:05,730 --> 00:16:10,120 N cubed is typically impractical. 283 00:16:10,120 --> 00:16:14,590 For some linkage criteria, for example, single-linkage, there 284 00:16:14,590 --> 00:16:18,680 exist very clever N squared algorithms. 285 00:16:18,680 --> 00:16:21,380 For others, you can't beat N cubed. 286 00:16:21,380 --> 00:16:27,420 But even N squared is really not very good. 287 00:16:27,420 --> 00:16:30,670 Which gets me to a much faster greedy algorithm called 288 00:16:30,670 --> 00:16:31,170 k-means. 289 00:16:33,740 --> 00:16:40,350 Now, the k in k-means is the number of clusters you want.
290 00:16:40,350 --> 00:16:42,510 So the catch with k-means is if you 291 00:16:42,510 --> 00:16:46,050 don't have any idea how many clusters you want, 292 00:16:46,050 --> 00:16:50,260 it's problematical, whereas hierarchical, you 293 00:16:50,260 --> 00:16:53,640 get to inspect it and see what you're getting. 294 00:16:53,640 --> 00:16:57,330 If you know how many you want, it's a good choice 295 00:16:57,330 --> 00:16:59,010 because it's much faster. 296 00:17:02,170 --> 00:17:07,319 All right, the algorithm, again, is very simple. 297 00:17:07,319 --> 00:17:11,089 This is the one that Professor Grimson briefly discussed. 298 00:17:11,089 --> 00:17:16,349 You randomly choose k examples as your initial centroids. 299 00:17:16,349 --> 00:17:19,970 Doesn't matter which of the examples you choose. 300 00:17:19,970 --> 00:17:24,020 Then you create k clusters by assigning each example 301 00:17:24,020 --> 00:17:31,440 to the closest centroid, compute k new centroids 302 00:17:31,440 --> 00:17:35,470 by averaging the examples in each cluster. 303 00:17:35,470 --> 00:17:40,950 So in the first iteration, the centroids are all examples 304 00:17:40,950 --> 00:17:42,460 that you started with. 305 00:17:42,460 --> 00:17:46,410 But after that, they're probably not examples, 306 00:17:46,410 --> 00:17:49,620 because you're now taking the average of two examples, which 307 00:17:49,620 --> 00:17:53,070 may not correspond to any example you have. 308 00:17:53,070 --> 00:17:56,810 Actually the average of N examples. 309 00:17:56,810 --> 00:17:59,120 And then you just keep doing this 310 00:17:59,120 --> 00:18:02,730 until the centroids don't move. 311 00:18:02,730 --> 00:18:03,230 Right? 312 00:18:03,230 --> 00:18:04,875 Once you go through one iteration 313 00:18:04,875 --> 00:18:06,500 where they don't move, there's no point 314 00:18:06,500 --> 00:18:10,100 in recomputing them again and again and again, 315 00:18:10,100 --> 00:18:12,440 so it is converged. 316 00:18:16,610 --> 00:18:20,730 So let's look at the complexity. 317 00:18:20,730 --> 00:18:23,810 Well, at the moment, we can't tell you 318 00:18:23,810 --> 00:18:25,970 how many iterations you're going to have, 319 00:18:25,970 --> 00:18:28,370 but what's the complexity of one iteration? 320 00:18:34,640 --> 00:18:38,890 Well, let's think about what you're doing here. 321 00:18:38,890 --> 00:18:43,240 You've got k centroids. 322 00:18:43,240 --> 00:18:46,570 Now I have to take each example and compare it 323 00:18:46,570 --> 00:18:50,020 to each-- in a naively, at least-- to each centroid 324 00:18:50,020 --> 00:18:52,750 to see which it's closest to. 325 00:18:52,750 --> 00:18:54,310 Right? 326 00:18:54,310 --> 00:19:01,510 So that's k comparisons per example. 327 00:19:01,510 --> 00:19:07,480 So that's k times n times d, where 328 00:19:07,480 --> 00:19:10,480 how much time each of these comparison takes, 329 00:19:10,480 --> 00:19:12,910 which is likely to depend upon the dimensionality 330 00:19:12,910 --> 00:19:14,740 of the features, right? 331 00:19:14,740 --> 00:19:17,310 Just the Euclidean distance, for example. 332 00:19:20,150 --> 00:19:25,600 But this is a way small number than N squared, typically. 333 00:19:25,600 --> 00:19:27,490 So each iteration is pretty quick, 334 00:19:27,490 --> 00:19:31,330 and in practice, as we'll see, this typically 335 00:19:31,330 --> 00:19:34,540 converges quite quickly, so you usually 336 00:19:34,540 --> 00:19:39,120 need a very small number of iterations. 
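A bare-bones sketch of the loop just described -- random initial centroids drawn from the examples, assign, recompute, stop when nothing moves -- might look like the following. It assumes examples are numpy feature vectors and is not the course's actual implementation:

```python
import random
import numpy as np

def kmeans(examples, k, max_iters=100):
    """Naive k-means over a list of equal-length numpy feature vectors."""
    examples = [np.asarray(e, dtype=float) for e in examples]
    centroids = random.sample(examples, k)          # k randomly chosen examples
    for _ in range(max_iters):
        # Assign each example to the closest centroid: k comparisons per example.
        clusters = [[] for _ in range(k)]
        for e in examples:
            dists = [np.linalg.norm(e - c) for c in centroids]
            clusters[int(np.argmin(dists))].append(e)
        # Recompute each centroid as the average of the examples assigned to it.
        # (An empty cluster just keeps its old centroid here; the lecture's code
        # handles that case differently, as discussed later.)
        new_centroids = [np.mean(c, axis=0) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if all(np.allclose(old, new) for old, new in zip(centroids, new_centroids)):
            break                                    # centroids did not move: converged
        centroids = new_centroids
    return clusters, centroids
```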
337 00:19:39,120 --> 00:19:41,580 So it is quite efficient, and then there 338 00:19:41,580 --> 00:19:43,830 are various ways you can optimize it 339 00:19:43,830 --> 00:19:45,900 to make it even more efficient. 340 00:19:45,900 --> 00:19:49,920 This is the most commonly-used clustering algorithm 341 00:19:49,920 --> 00:19:53,200 because it works really fast. 342 00:19:53,200 --> 00:19:55,220 Let's look at an example. 343 00:19:55,220 --> 00:19:58,880 So I've got a bunch of blue points here, 344 00:19:58,880 --> 00:20:02,090 and I actually wrote the code to do this. 345 00:20:02,090 --> 00:20:03,770 I'm not going to show you the code. 346 00:20:03,770 --> 00:20:13,020 And I chose four centroids at random, colored stars. 347 00:20:13,020 --> 00:20:18,390 A green one, a fuchsia-colored one, a red one, and a blue one. 348 00:20:21,410 --> 00:20:24,480 So maybe they're not the ones you would have chosen, 349 00:20:24,480 --> 00:20:25,380 but there they are. 350 00:20:28,030 --> 00:20:33,630 And I then, having chosen them, assign each point 351 00:20:33,630 --> 00:20:38,550 to one of those centroids, whichever one it's closest to. 352 00:20:38,550 --> 00:20:40,660 All right? 353 00:20:40,660 --> 00:20:41,290 Step one. 354 00:20:45,680 --> 00:20:50,350 And then I recompute the centroid. 355 00:20:50,350 --> 00:20:51,260 So let's go back. 356 00:20:53,780 --> 00:20:59,020 So we're here, and these are the initial centroids. 357 00:20:59,020 --> 00:21:03,280 Now, when I find the new centroids, 358 00:21:03,280 --> 00:21:06,130 if we look at where the red one is, 359 00:21:06,130 --> 00:21:10,540 the red one is this point, this point, and this point. 360 00:21:10,540 --> 00:21:14,170 Clearly, the new centroid is going to move, right? 361 00:21:14,170 --> 00:21:16,750 It's going to move somewhere along in here or something 362 00:21:16,750 --> 00:21:19,950 like that, right? 363 00:21:19,950 --> 00:21:24,154 So we'll get those new centroids. 364 00:21:24,154 --> 00:21:26,460 There it is. 365 00:21:26,460 --> 00:21:31,870 And now we'll re-assign points. 366 00:21:31,870 --> 00:21:38,190 And what we'll see is this point is now closer to the red star 367 00:21:38,190 --> 00:21:41,340 than it is to the fuchsia star, because we've 368 00:21:41,340 --> 00:21:43,920 moved the red star. 369 00:21:43,920 --> 00:21:44,970 Whoops. 370 00:21:44,970 --> 00:21:46,195 That one. 371 00:21:46,195 --> 00:21:47,070 Said the wrong thing. 372 00:21:47,070 --> 00:21:48,660 They were red to start with. 373 00:21:48,660 --> 00:21:53,490 This one is now suddenly closer to the purple, so-- 374 00:21:53,490 --> 00:21:54,150 and to the red. 375 00:21:54,150 --> 00:21:55,920 It will get recolored. 376 00:21:55,920 --> 00:21:57,350 We compute the new centroids. 377 00:21:59,970 --> 00:22:02,100 We're going to move something again. 378 00:22:02,100 --> 00:22:03,570 We continue. 379 00:22:03,570 --> 00:22:05,290 Points will move around. 380 00:22:05,290 --> 00:22:08,620 This time we move two points. 381 00:22:08,620 --> 00:22:09,820 Here we go again. 382 00:22:09,820 --> 00:22:11,980 Notice, again, the centroids don't 383 00:22:11,980 --> 00:22:14,090 correspond to actual examples. 384 00:22:14,090 --> 00:22:16,420 This one is close, but it's not really one of them. 385 00:22:19,210 --> 00:22:20,930 Move two more. 386 00:22:20,930 --> 00:22:24,040 Recompute centroids, and we're done. 
387 00:22:24,040 --> 00:22:29,300 So here we've converged, and I think it was five iterations, 388 00:22:29,300 --> 00:22:31,481 and nothing will move again. 389 00:22:31,481 --> 00:22:31,980 All right? 390 00:22:31,980 --> 00:22:34,354 Does that make sense to everybody? 391 00:22:34,354 --> 00:22:35,270 So it's pretty simple. 392 00:22:38,420 --> 00:22:39,770 What are the downsides? 393 00:22:39,770 --> 00:22:45,170 Well, choosing k foolishly can lead to strange results. 394 00:22:45,170 --> 00:22:49,100 So if I chose k equal to 3, looking 395 00:22:49,100 --> 00:22:51,470 at this particular arrangement of points, 396 00:22:51,470 --> 00:22:55,670 it's not obvious what "the right answer" is, right? 397 00:22:55,670 --> 00:22:58,130 Maybe it's making all of this one cluster. 398 00:22:58,130 --> 00:23:00,100 I don't know. 399 00:23:00,100 --> 00:23:02,890 But there are weird k's, and if you 400 00:23:02,890 --> 00:23:08,050 choose a k that is nonsensical with respect to your data, 401 00:23:08,050 --> 00:23:11,470 then your clustering will be nonsensical. 402 00:23:11,470 --> 00:23:13,240 So that's one problem we have to think about. 403 00:23:13,240 --> 00:23:16,330 How do we choose k? 404 00:23:16,330 --> 00:23:20,120 Another problem, and this is one somebody raised last time, 405 00:23:20,120 --> 00:23:24,560 is that the results can depend upon the initial centroids. 406 00:23:24,560 --> 00:23:29,330 Unlike hierarchical clustering, k-means is non-deterministic. 407 00:23:29,330 --> 00:23:34,460 Depending upon what random examples we choose, 408 00:23:34,460 --> 00:23:36,470 we can get a different number of iterations. 409 00:23:36,470 --> 00:23:40,190 If we choose them poorly, it could take longer to converge. 410 00:23:40,190 --> 00:23:44,110 More worrisome, you get a different answer. 411 00:23:44,110 --> 00:23:45,670 You're running this greedy algorithm, 412 00:23:45,670 --> 00:23:47,920 and you might actually get to a different place, 413 00:23:47,920 --> 00:23:49,720 depending upon which centroids you chose. 414 00:23:52,390 --> 00:23:54,210 So these are the two issues we have 415 00:23:54,210 --> 00:23:57,000 to think about dealing with. 416 00:23:57,000 --> 00:24:00,980 So let's first think about choosing k. 417 00:24:00,980 --> 00:24:04,400 What often happens is people choose 418 00:24:04,400 --> 00:24:07,820 k using a priori knowledge about the application. 419 00:24:10,670 --> 00:24:13,070 If I'm in medicine, I actually know 420 00:24:13,070 --> 00:24:15,080 that there are only five different kinds 421 00:24:15,080 --> 00:24:17,280 of bacteria in the world. 422 00:24:17,280 --> 00:24:19,110 That's true. 423 00:24:19,110 --> 00:24:22,930 I mean, there are subspecies, but five large categories. 424 00:24:22,930 --> 00:24:25,980 And if I had a bunch of bacteria I wanted to cluster, 425 00:24:25,980 --> 00:24:30,050 I may just set k equal to 5. 426 00:24:30,050 --> 00:24:32,390 Maybe I believe there are only two kinds of people 427 00:24:32,390 --> 00:24:35,585 in the world, those who are at MIT and those who are not. 428 00:24:35,585 --> 00:24:37,550 And so I'll choose k equal to 2. 429 00:24:40,200 --> 00:24:45,060 Often, we know enough about the application, we can choose k. 430 00:24:45,060 --> 00:24:49,110 As we'll see later, often we think we do, and we don't. 431 00:24:51,940 --> 00:24:56,160 A better approach is to search for a good k.
432 00:25:01,050 --> 00:25:03,900 So you can try different values of k 433 00:25:03,900 --> 00:25:08,050 and evaluate the quality of the result. 434 00:25:08,050 --> 00:25:09,925 Assume you have some metric, as to say yeah, 435 00:25:09,925 --> 00:25:13,290 I like this clustering, I don't like this clustering. 436 00:25:13,290 --> 00:25:16,410 And we'll talk about do that in detail. 437 00:25:16,410 --> 00:25:22,260 Or you can run hierarchical clustering on a subset of data. 438 00:25:22,260 --> 00:25:23,970 I've got a million points. 439 00:25:23,970 --> 00:25:27,060 All right, what I'm going to do is take a subset of 1,000 440 00:25:27,060 --> 00:25:28,630 of them or 10,000. 441 00:25:28,630 --> 00:25:31,550 Run hierarchical clustering. 442 00:25:31,550 --> 00:25:36,750 From that, get a sense of the structure underlying the data. 443 00:25:36,750 --> 00:25:41,650 Decide k should be 6, and then run k-means with k equals 6. 444 00:25:41,650 --> 00:25:42,940 People often do this. 445 00:25:42,940 --> 00:25:47,380 They run hierarchical clustering on a small subset of the data 446 00:25:47,380 --> 00:25:48,570 and then choose k. 447 00:25:51,860 --> 00:25:57,830 And we'll look-- but one we're going to look at is that one. 448 00:25:57,830 --> 00:26:00,810 What about unlucky centroids? 449 00:26:00,810 --> 00:26:05,640 So here I got the same points we started with. 450 00:26:05,640 --> 00:26:08,390 Different initial centroids. 451 00:26:08,390 --> 00:26:11,310 I've got a fuchsia one, a black one, 452 00:26:11,310 --> 00:26:16,130 and then I've got red and blue down here, 453 00:26:16,130 --> 00:26:21,780 which I happened to accidentally choose close to one another. 454 00:26:21,780 --> 00:26:24,960 Well, if I start with these centroids, 455 00:26:24,960 --> 00:26:27,300 certainly you would expect things 456 00:26:27,300 --> 00:26:29,470 to take longer to converge. 457 00:26:29,470 --> 00:26:31,580 But in fact, what happens is this-- 458 00:26:34,450 --> 00:26:40,060 I get this assignment of blue, this assignment of red, 459 00:26:40,060 --> 00:26:43,160 and I'm done. 460 00:26:43,160 --> 00:26:48,980 It converges on this, which probably is not 461 00:26:48,980 --> 00:26:51,410 what we wanted out of this. 462 00:26:51,410 --> 00:26:54,350 Maybe it is, but the fact that I converged 463 00:26:54,350 --> 00:26:57,500 on some very different place shows 464 00:26:57,500 --> 00:26:59,480 that it's a real weakness of the algorithm, 465 00:26:59,480 --> 00:27:02,420 that it's sensitive to the randomly-chosen initial 466 00:27:02,420 --> 00:27:05,738 conditions. 467 00:27:05,738 --> 00:27:11,000 Well, couple of things you can do about that. 468 00:27:11,000 --> 00:27:17,180 You could be clever and try and select good initial centroids. 469 00:27:17,180 --> 00:27:20,150 So people often will do that, and what they'll do is try 470 00:27:20,150 --> 00:27:24,740 and just make sure that they're distributed over the space. 471 00:27:24,740 --> 00:27:27,290 So they would look at some picture like this 472 00:27:27,290 --> 00:27:31,940 and say, well, let's just put my centroids at the corners 473 00:27:31,940 --> 00:27:35,570 or something like that so that they're far apart. 474 00:27:39,760 --> 00:27:42,960 Another approach is to try multiple sets 475 00:27:42,960 --> 00:27:46,280 of randomly-chosen centroids, and then 476 00:27:46,280 --> 00:27:47,825 just select the best results. 477 00:27:50,830 --> 00:27:55,980 And that's what this little algorithm on the screen does. 
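A rough Python rendering of that multiple-restart idea, assuming kmeans and dissimilarity helpers like the sketches above (this is not the slide's exact code):

```python
def try_kmeans(examples, k, num_trials):
    """Run k-means several times from random starts; keep the least-dissimilar result."""
    best_clusters, best_centroids = kmeans(examples, k)
    for _ in range(num_trials - 1):
        clusters, centroids = kmeans(examples, k)
        if dissimilarity(clusters) < dissimilarity(best_clusters):
            best_clusters, best_centroids = clusters, centroids
    return best_clusters, best_centroids
```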
478 00:27:55,980 --> 00:28:00,540 So I'll say best is equal to k-means of the points 479 00:28:00,540 --> 00:28:05,350 themselves, or something, then for t 480 00:28:05,350 --> 00:28:10,630 in range number of trials, I'll say C equals k-means of points, 481 00:28:10,630 --> 00:28:14,080 and I'll just keep track and choose the one with the least 482 00:28:14,080 --> 00:28:15,406 dissimilarity. 483 00:28:15,406 --> 00:28:16,780 The thing I'm trying to minimize. 484 00:28:16,780 --> 00:28:17,280 OK? 485 00:28:21,450 --> 00:28:24,910 The first one is got all the points in one cluster. 486 00:28:24,910 --> 00:28:27,460 So it's very dissimilar. 487 00:28:27,460 --> 00:28:29,050 And then I'll just keep generating 488 00:28:29,050 --> 00:28:31,210 for different k's and I'll choose 489 00:28:31,210 --> 00:28:34,700 the k that seems to be the best, that 490 00:28:34,700 --> 00:28:39,740 does the best job of minimizing my objective function. 491 00:28:39,740 --> 00:28:42,650 And this is a very common solution, by the way, 492 00:28:42,650 --> 00:28:46,010 for any randomized greedy algorithm. 493 00:28:46,010 --> 00:28:49,280 And there are a lot of randomized greedy algorithms 494 00:28:49,280 --> 00:28:53,270 that you just choose multiple initial conditions, 495 00:28:53,270 --> 00:28:55,580 try them all out and pick the best. 496 00:28:59,450 --> 00:29:00,830 All right, now I want to show you 497 00:29:00,830 --> 00:29:04,585 a slightly more real example. 498 00:29:07,530 --> 00:29:13,470 So this is a file we've got with medical patients, 499 00:29:13,470 --> 00:29:17,280 and we're going to try and cluster them and see 500 00:29:17,280 --> 00:29:19,170 whether the clusters tell us anything 501 00:29:19,170 --> 00:29:21,990 about the probability of them dying 502 00:29:21,990 --> 00:29:26,340 of a heart attack in, say, the next year or some period 503 00:29:26,340 --> 00:29:27,910 of time. 504 00:29:27,910 --> 00:29:30,570 So to simplify things, and this is something 505 00:29:30,570 --> 00:29:33,060 I have done with research, but we're looking 506 00:29:33,060 --> 00:29:35,550 at only four features here-- 507 00:29:35,550 --> 00:29:39,570 the heart rate in beats per minute, 508 00:29:39,570 --> 00:29:46,250 the number of previous heart attacks, the age, and something 509 00:29:46,250 --> 00:29:49,680 called ST elevation, a binary attribute. 510 00:29:49,680 --> 00:29:52,700 So the first three are obvious. 511 00:29:52,700 --> 00:29:57,510 If you take an ECG of somebody's heart, it looks like this. 512 00:29:57,510 --> 00:29:59,900 This is a normal one. 513 00:29:59,900 --> 00:30:01,850 They have the S, the T, and then there's 514 00:30:01,850 --> 00:30:06,480 this region between the S wave and the T wave. 515 00:30:06,480 --> 00:30:11,950 And if it's higher, hence elevated, that's a bad thing. 516 00:30:11,950 --> 00:30:13,890 And so this is about the first thing 517 00:30:13,890 --> 00:30:17,550 that they measure if someone is having cardiac problems. 518 00:30:17,550 --> 00:30:19,490 Do they have ST elevation? 519 00:30:22,370 --> 00:30:24,290 And then with each patient, we're 520 00:30:24,290 --> 00:30:28,270 going to have an outcome, whether they died, 521 00:30:28,270 --> 00:30:31,390 and it's related to the features, 522 00:30:31,390 --> 00:30:35,450 but it's probabilistic not deterministic. 
523 00:30:35,450 --> 00:30:39,920 So for example, an older person with multiple heart attacks 524 00:30:39,920 --> 00:30:42,470 is at higher risk than a young person who's 525 00:30:42,470 --> 00:30:44,692 never had a heart attack. 526 00:30:44,692 --> 00:30:46,400 That doesn't mean, though, that the older 527 00:30:46,400 --> 00:30:48,440 person will die first. 528 00:30:48,440 --> 00:30:49,715 It's just more probable. 529 00:30:54,290 --> 00:30:57,327 We're going to take this data, we're going to cluster it, 530 00:30:57,327 --> 00:30:58,910 and then we're going to look at what's 531 00:30:58,910 --> 00:31:02,970 called the purity of the clusters 532 00:31:02,970 --> 00:31:06,030 relative to the outcomes. 533 00:31:06,030 --> 00:31:11,380 So is the cluster, say, enriched by people who died? 534 00:31:11,380 --> 00:31:14,380 If you have one cluster and everyone in it died, 535 00:31:14,380 --> 00:31:17,410 then the clustering is clearly finding some structure 536 00:31:17,410 --> 00:31:18,490 related to the outcome. 537 00:31:23,990 --> 00:31:27,910 So the file is in the zip file I uploaded. 538 00:31:27,910 --> 00:31:30,235 It looks more or less like this. 539 00:31:30,235 --> 00:31:30,940 Right? 540 00:31:30,940 --> 00:31:33,040 So it's very straightforward. 541 00:31:33,040 --> 00:31:34,310 The outcomes are binary. 542 00:31:34,310 --> 00:31:36,940 1 is a positive outcome. 543 00:31:36,940 --> 00:31:39,220 Strangely enough in the medical jargon, 544 00:31:39,220 --> 00:31:42,220 a death is a positive outcome. 545 00:31:42,220 --> 00:31:44,800 I guess maybe if you're responsible for the medical 546 00:31:44,800 --> 00:31:46,350 bills, it's positive. 547 00:31:46,350 --> 00:31:50,410 If you're the patient, it's hard to think of it as a good thing. 548 00:31:50,410 --> 00:31:53,530 Nevertheless, that's the way that they talk. 549 00:31:53,530 --> 00:31:55,450 And the others are all there, right? 550 00:31:55,450 --> 00:31:59,710 Heart rate, other things. 551 00:31:59,710 --> 00:32:01,480 All right, let's look at some code. 552 00:32:04,160 --> 00:32:05,481 So I've extracted some code. 553 00:32:05,481 --> 00:32:06,980 I'm not going to show you all of it. 554 00:32:06,980 --> 00:32:10,910 There's quite a lot of it, as you'll see. 555 00:32:10,910 --> 00:32:14,450 So we'll start-- one of the files you've got 556 00:32:14,450 --> 00:32:17,180 is called cluster dot pi. 557 00:32:17,180 --> 00:32:18,890 I decided there was enough code, I 558 00:32:18,890 --> 00:32:21,020 didn't want to put it all in one file. 559 00:32:21,020 --> 00:32:22,860 I was getting confused. 560 00:32:22,860 --> 00:32:24,560 So I said, let me create a file that 561 00:32:24,560 --> 00:32:27,950 has some of the code and a different file 562 00:32:27,950 --> 00:32:30,110 that will then import it and use it. 563 00:32:30,110 --> 00:32:33,500 Cluster has things that are pretty much 564 00:32:33,500 --> 00:32:38,700 unrelated to this example, but just useful for clustering. 565 00:32:38,700 --> 00:32:44,970 So an example here has name, features, and label. 566 00:32:44,970 --> 00:32:47,740 And really, the only interesting thing in it-- 567 00:32:47,740 --> 00:32:50,880 and it's not that interesting-- is distance. 568 00:32:50,880 --> 00:32:54,990 And the fact that I'm using Minkowski with 2 569 00:32:54,990 --> 00:32:56,760 says we're using Euclidean distance. 570 00:33:02,290 --> 00:33:04,400 Class cluster. 571 00:33:04,400 --> 00:33:08,410 It's a lot more code to that one. 
572 00:33:08,410 --> 00:33:11,350 So we start with a non-empty list of examples. 573 00:33:11,350 --> 00:33:12,400 That's what init does. 574 00:33:12,400 --> 00:33:14,380 You can imagine what the code looks like, 575 00:33:14,380 --> 00:33:17,080 or you can look at it. 576 00:33:17,080 --> 00:33:25,580 Update is interesting in that it takes the cluster and examples 577 00:33:25,580 --> 00:33:35,550 and puts them into the cluster-- if you think of k-means, these are the examples 578 00:33:35,550 --> 00:33:38,640 that were closest to the cluster's previous centroid-- 579 00:33:38,640 --> 00:33:43,500 and then returns the amount the centroid has changed. 580 00:33:43,500 --> 00:33:45,700 So if the centroid has changed by 0, 581 00:33:45,700 --> 00:33:48,140 then you don't have to do anything, right? 582 00:33:48,140 --> 00:33:50,270 Creates the new cluster. 583 00:33:50,270 --> 00:33:54,050 And the most interesting thing is computeCentroid. 584 00:33:54,050 --> 00:33:55,430 And if you look at this code, you 585 00:33:55,430 --> 00:33:58,820 can see that I'm a slightly unreconstructed Python 2 586 00:33:58,820 --> 00:34:00,290 programmer. 587 00:34:00,290 --> 00:34:01,910 I just noticed this. 588 00:34:01,910 --> 00:34:04,610 I really shouldn't have written 0.0. 589 00:34:04,610 --> 00:34:08,420 I should have just written 0, but in Python 2, 590 00:34:08,420 --> 00:34:10,760 you had to write that 0.0. 591 00:34:10,760 --> 00:34:12,320 Sorry about that. 592 00:34:12,320 --> 00:34:15,449 Thought I'd fixed these. 593 00:34:15,449 --> 00:34:18,880 Anyway, so how do we compute the centroid? 594 00:34:18,880 --> 00:34:25,750 We start by creating an array of all 0s. 595 00:34:25,750 --> 00:34:30,350 The dimensionality is the number of features in the example. 596 00:34:30,350 --> 00:34:34,100 It's one of the methods from-- 597 00:34:34,100 --> 00:34:37,130 I didn't put up on the PowerPoint. 598 00:34:37,130 --> 00:34:40,310 And then for e in examples, I'm going 599 00:34:40,310 --> 00:34:47,790 to add to vals e.getFeatures, and then I'm 600 00:34:47,790 --> 00:34:52,860 just going to divide vals by the length of self.examples, 601 00:34:52,860 --> 00:34:54,480 the number of examples. 602 00:34:54,480 --> 00:34:59,480 So now you see why I made it a pylab array, or a numpy array 603 00:34:59,480 --> 00:35:02,180 rather than a list, so I could do 604 00:35:02,180 --> 00:35:07,890 nice things like divide the whole thing in one expression. 605 00:35:07,890 --> 00:35:10,350 As you do math, any kind of math things, 606 00:35:10,350 --> 00:35:14,010 you'll find these arrays are incredibly convenient. 607 00:35:14,010 --> 00:35:16,440 Rather than having to write recursive functions 608 00:35:16,440 --> 00:35:19,140 or do bunches of iterations, the fact 609 00:35:19,140 --> 00:35:23,820 that you can do it in one keystroke is incredibly nice. 610 00:35:23,820 --> 00:35:25,570 And then I'm going to return the centroid. 611 00:35:30,330 --> 00:35:33,565 Variability is exactly what we saw in the formula. 612 00:35:36,360 --> 00:35:39,690 And then just for fun, so you could see this, 613 00:35:39,690 --> 00:35:42,270 I used an iterator here. 614 00:35:42,270 --> 00:35:43,950 I don't know that any of you have used 615 00:35:43,950 --> 00:35:47,340 the yield statement in Python. 616 00:35:47,340 --> 00:35:48,480 I recommend it. 617 00:35:48,480 --> 00:35:50,500 It's very convenient.
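A rough sketch, under assumed names, of the centroid computation just described, together with a yield-based generator for walking over a cluster's examples (the actual course file differs in its details):

```python
import numpy as np

class Cluster(object):
    def __init__(self, examples):
        """examples: a non-empty list of objects with a getFeatures() method."""
        self.examples = examples
        self.centroid = self.computeCentroid()

    def computeCentroid(self):
        # Start with an array of zeros, one slot per feature...
        vals = np.array([0.0] * len(self.examples[0].getFeatures()))
        for e in self.examples:            # ...add up all the feature vectors...
            vals += e.getFeatures()
        return vals / len(self.examples)   # ...and divide by the number of examples.

    def members(self):
        # yield makes this a generator, so it can be iterated over like a list.
        for e in self.examples:
            yield e
```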
618 00:35:50,500 --> 00:35:52,740 One of the nice things about Python 619 00:35:52,740 --> 00:35:55,770 is almost anything that's built in, 620 00:35:55,770 --> 00:35:58,540 you can make your own version of it. 621 00:35:58,540 --> 00:36:04,470 And so once I've done this, if c is a cluster, 622 00:36:04,470 --> 00:36:11,320 I can now write something like for c in big C, 623 00:36:11,320 --> 00:36:17,740 and this will make it work just like iterating over a list. 624 00:36:17,740 --> 00:36:21,780 Right, so this makes it possible to iterate over it. 625 00:36:21,780 --> 00:36:24,360 If you haven't read about yield, you probably 626 00:36:24,360 --> 00:36:27,660 should read the probably about two paragraphs 627 00:36:27,660 --> 00:36:30,340 in the textbook explaining how it works, 628 00:36:30,340 --> 00:36:33,320 but it's very convenient. 629 00:36:33,320 --> 00:36:35,530 Dissimilarity we've already seen. 630 00:36:38,570 --> 00:36:41,870 All right, now we get to patients. 631 00:36:41,870 --> 00:36:48,300 This is in the file lec 12, lecture 12 dot py. 632 00:36:48,300 --> 00:36:51,810 In addition to importing the usual suspects of pylab 633 00:36:51,810 --> 00:36:57,260 and numpy, and probably it should import random too, 634 00:36:57,260 --> 00:37:01,550 it imports cluster, the one we just looked at. 635 00:37:04,160 --> 00:37:11,590 And so patient is a sub-type of cluster.Example. 636 00:37:11,590 --> 00:37:14,800 Then I'm going to define this interesting thing called 637 00:37:14,800 --> 00:37:18,330 scale attributes. 638 00:37:18,330 --> 00:37:21,720 So you might remember, in the last lecture 639 00:37:21,720 --> 00:37:25,680 when Professor Grimson was looking at these reptiles, 640 00:37:25,680 --> 00:37:28,770 he ran into this problem about alligators 641 00:37:28,770 --> 00:37:31,200 looking like chickens because they each have 642 00:37:31,200 --> 00:37:33,570 a large number of legs. 643 00:37:33,570 --> 00:37:37,330 And he said, well, what can we do to get around this? 644 00:37:37,330 --> 00:37:41,670 Well, we can represent the feature as a binary number. 645 00:37:41,670 --> 00:37:43,215 Has legs, doesn't have legs. 646 00:37:43,215 --> 00:37:45,210 0 or 1. 647 00:37:45,210 --> 00:37:47,940 And the problem he was dealing with 648 00:37:47,940 --> 00:37:51,860 is that when you have a feature vector 649 00:37:51,860 --> 00:37:55,910 and the dynamic range of some features 650 00:37:55,910 --> 00:37:59,210 is much greater than the others, they 651 00:37:59,210 --> 00:38:03,260 tend to dominate because the distances just look bigger when 652 00:38:03,260 --> 00:38:06,190 you get Euclidean distance. 653 00:38:06,190 --> 00:38:08,760 So for example, if we wanted to cluster the people 654 00:38:08,760 --> 00:38:13,980 in this room, and I had one feature that 655 00:38:13,980 --> 00:38:18,510 was, say, 1 for male and 0 for female, 656 00:38:18,510 --> 00:38:21,810 and another feature that was 1 for wears glasses, 657 00:38:21,810 --> 00:38:26,490 0 for doesn't wear glasses, and then a third feature which 658 00:38:26,490 --> 00:38:31,260 was weight, and I clustered them, 659 00:38:31,260 --> 00:38:33,240 well, weight would always completely 660 00:38:33,240 --> 00:38:36,690 dominate the Euclidean distance, right? 661 00:38:36,690 --> 00:38:39,030 Because the dynamic range of the weights in this 662 00:38:39,030 --> 00:38:45,450 room is much higher than the dynamic range of 0 to 1. 
663 00:38:45,450 --> 00:38:51,120 And so for the reptiles, he said, well, OK, we'll 664 00:38:51,120 --> 00:38:53,640 just make it a binary variable. 665 00:38:53,640 --> 00:38:55,410 But maybe we don't want to make weight 666 00:38:55,410 --> 00:38:58,170 a binary variable, because maybe it is something 667 00:38:58,170 --> 00:39:00,880 we want to take into account. 668 00:39:00,880 --> 00:39:04,350 So what we do is we scale it. 669 00:39:04,350 --> 00:39:09,090 So this is a method called z-scaling. 670 00:39:09,090 --> 00:39:14,280 More general than just making things 0 or 1. 671 00:39:14,280 --> 00:39:16,200 It's a simple code. 672 00:39:16,200 --> 00:39:22,240 It takes in all of the values of a specific feature 673 00:39:22,240 --> 00:39:26,030 and then performs some simple calculations, 674 00:39:26,030 --> 00:39:34,970 and when it's done, the resulting array it returns 675 00:39:34,970 --> 00:39:40,320 has a known mean and a known standard deviation. 676 00:39:40,320 --> 00:39:41,960 So what's the mean going to be? 677 00:39:41,960 --> 00:39:44,179 It's always going to be the same thing, independent 678 00:39:44,179 --> 00:39:45,095 of the initial values. 679 00:39:47,660 --> 00:39:48,920 Take a look at the code. 680 00:39:48,920 --> 00:39:50,510 Try and see if you can figure it out. 681 00:39:55,190 --> 00:39:57,970 Anybody want to take a guess at it? 682 00:39:57,970 --> 00:39:59,550 0. 683 00:39:59,550 --> 00:40:00,120 Right? 684 00:40:00,120 --> 00:40:04,160 So the mean will always be 0. 685 00:40:04,160 --> 00:40:07,040 And the standard deviation, a little harder to figure, 686 00:40:07,040 --> 00:40:08,330 but it will always be 1. 687 00:40:13,320 --> 00:40:13,820 OK? 688 00:40:13,820 --> 00:40:17,140 So it's done this scaling. 689 00:40:17,140 --> 00:40:22,160 This is a very common kind of scaling called z-scaling. 690 00:40:22,160 --> 00:40:25,150 The other way people scale is interpolate. 691 00:40:25,150 --> 00:40:29,440 They take the smallest value and call it 0, the biggest value, 692 00:40:29,440 --> 00:40:33,580 they call it 1, and then they do a linear interpolation 693 00:40:33,580 --> 00:40:36,230 of all the values between 0 and 1. 694 00:40:36,230 --> 00:40:39,570 So the range is 0 to 1. 695 00:40:39,570 --> 00:40:43,230 That's also very common. 696 00:40:43,230 --> 00:40:45,600 So this is a general way to get all 697 00:40:45,600 --> 00:40:48,836 of the features sort of in the same ballpark 698 00:40:48,836 --> 00:40:50,002 so that we can compare them. 699 00:40:53,100 --> 00:40:55,140 And we'll look at what happens when we scale 700 00:40:55,140 --> 00:40:57,480 and when we don't scale. 701 00:40:57,480 --> 00:41:01,200 And that's why my getData function has this parameter 702 00:41:01,200 --> 00:41:02,820 to scale. 703 00:41:02,820 --> 00:41:06,150 It either creates a set of examples with the attributes 704 00:41:06,150 --> 00:41:10,090 as initially or scaled. 705 00:41:10,090 --> 00:41:11,980 And then there's k-means. 706 00:41:11,980 --> 00:41:14,920 It's exactly the algorithm I showed you 707 00:41:14,920 --> 00:41:20,200 with one little wrinkle, which is this part. 708 00:41:20,200 --> 00:41:23,200 You don't want to end up with empty clusters. 709 00:41:23,200 --> 00:41:26,170 If I tell you I want four clusters, 710 00:41:26,170 --> 00:41:28,240 I don't mean I want three with examples 711 00:41:28,240 --> 00:41:30,390 and one that's empty, right? 712 00:41:30,390 --> 00:41:34,050 Because then I really don't have four clusters. 
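Before getting to that empty-cluster wrinkle, a minimal sketch of the two kinds of scaling just described (the course's own scaling code may differ):

```python
import numpy as np

def z_scale(vals):
    """Rescale so the result has mean 0 and standard deviation 1 (z-scaling)."""
    vals = np.asarray(vals, dtype=float)
    return (vals - vals.mean()) / vals.std()   # assumes the values are not all equal

def minmax_scale(vals):
    """Linearly interpolate so the smallest value maps to 0 and the largest to 1."""
    vals = np.asarray(vals, dtype=float)
    return (vals - vals.min()) / (vals.max() - vals.min())
```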
713 00:41:34,050 --> 00:41:36,840 And so this is one of multiple ways 714 00:41:36,840 --> 00:41:39,510 to avoid having empty clusters. 715 00:41:39,510 --> 00:41:41,470 Basically what I did here is say, 716 00:41:41,470 --> 00:41:44,640 well, I'm going to try a lot of different initial conditions. 717 00:41:44,640 --> 00:41:47,880 If one of them is so unlucky as to give me an empty cluster, 718 00:41:47,880 --> 00:41:51,550 I'm just going to skip it and go on to the next one 719 00:41:51,550 --> 00:41:55,892 by raising a value error, empty cluster. 720 00:41:55,892 --> 00:41:57,350 And if you look at the code, you'll 721 00:41:57,350 --> 00:42:00,450 see how this value error is used. 722 00:42:00,450 --> 00:42:02,690 And then try k-means. 723 00:42:02,690 --> 00:42:07,490 We'll call k-means numTrials times, each one getting 724 00:42:07,490 --> 00:42:11,060 a different set of initial centroids, 725 00:42:11,060 --> 00:42:13,550 and return the result with the lowest dissimilarity. 726 00:42:16,820 --> 00:42:23,090 Then I have various ways to examine the results. 727 00:42:23,090 --> 00:42:25,040 Nothing very interesting, and here's 728 00:42:25,040 --> 00:42:28,190 the key place where we're going to run the whole thing. 729 00:42:28,190 --> 00:42:31,970 We'll get the data, initially not scaling it, 730 00:42:31,970 --> 00:42:34,200 because remember, it defaults to true. 731 00:42:34,200 --> 00:42:38,770 Then initially, I'm only going to try one k. k equals 2. 732 00:42:38,770 --> 00:42:47,950 And we'll call testClustering with the patients. 733 00:42:47,950 --> 00:42:50,920 The number of clusters, k. 734 00:42:50,920 --> 00:42:53,770 I put in seed as a parameter here 735 00:42:53,770 --> 00:42:56,080 because I wanted to be able to play with it 736 00:42:56,080 --> 00:42:59,710 and make sure I got different things for 0 and 1 and 2 737 00:42:59,710 --> 00:43:01,630 just as a testing thing. 738 00:43:01,630 --> 00:43:06,230 And it's defaulting to five trials. 739 00:43:06,230 --> 00:43:12,480 And then we'll look at what testClustering 740 00:43:12,480 --> 00:43:17,100 returns: the fraction of positive examples 741 00:43:17,100 --> 00:43:19,780 for each cluster. 742 00:43:19,780 --> 00:43:21,730 OK? 743 00:43:21,730 --> 00:43:23,530 So let's see what happens when we run it. 744 00:43:39,690 --> 00:43:41,460 All right. 745 00:43:41,460 --> 00:43:43,710 So we got two clusters. 746 00:43:43,710 --> 00:43:49,590 Cluster of size 118 with .3305, and a cluster 747 00:43:49,590 --> 00:43:55,010 of size 132 with a positive fraction of .3333. 748 00:43:59,230 --> 00:44:03,230 Should we be happy? 749 00:44:03,230 --> 00:44:07,870 Does our clustering tell us anything, does it somehow 750 00:44:07,870 --> 00:44:13,220 correspond to the expected outcome for patients here? 751 00:44:13,220 --> 00:44:15,630 Probably not, right? 752 00:44:15,630 --> 00:44:18,600 Those numbers are pretty much indistinguishable 753 00:44:18,600 --> 00:44:20,280 statistically. 754 00:44:20,280 --> 00:44:23,070 And you'd have to guess that the fraction of positives 755 00:44:23,070 --> 00:44:26,544 in the whole population is around .33, right? 756 00:44:26,544 --> 00:44:27,960 That about a third of these people 757 00:44:27,960 --> 00:44:30,350 died of their heart attack. 758 00:44:30,350 --> 00:44:35,040 And I might as well have assigned them randomly 759 00:44:35,040 --> 00:44:36,584 to the two clusters, right?
760 00:44:36,584 --> 00:44:38,250 There's not much difference between this 761 00:44:38,250 --> 00:44:42,480 and what you would get with the random result. 762 00:44:42,480 --> 00:44:44,490 Well, why do we think that's true? 763 00:44:47,270 --> 00:44:49,550 Because I didn't scale, right? 764 00:44:49,550 --> 00:44:53,150 And so one of the issues we had to deal with 765 00:44:53,150 --> 00:44:56,760 is, well, age had a big dynamic range, 766 00:44:56,760 --> 00:45:02,300 and, say, ST elevation, which I told you was highly diagnostic, 767 00:45:02,300 --> 00:45:04,600 was either 0 or 1. 768 00:45:04,600 --> 00:45:06,280 And so probably everything is getting 769 00:45:06,280 --> 00:45:12,820 swamped by age or something else, right? 770 00:45:12,820 --> 00:45:17,350 All right, so we have an easy way to fix that. 771 00:45:17,350 --> 00:45:20,440 We'll just scale the data. 772 00:45:20,440 --> 00:45:21,670 Now let's see what we get. 773 00:45:26,660 --> 00:45:27,400 All right. 774 00:45:27,400 --> 00:45:31,140 That's interesting. 775 00:45:31,140 --> 00:45:33,090 With casting rule? 776 00:45:33,090 --> 00:45:35,600 Good grief. 777 00:45:35,600 --> 00:45:37,010 That caught me by surprise. 778 00:45:48,150 --> 00:45:51,360 Good thing I have the answers in PowerPoint to show you, 779 00:45:51,360 --> 00:45:53,236 because the code doesn't seem to be working. 780 00:46:00,190 --> 00:46:01,130 Try it once more. 781 00:46:05,310 --> 00:46:05,810 No. 782 00:46:05,810 --> 00:46:09,890 All right, well, in the interest of getting 783 00:46:09,890 --> 00:46:11,630 through this lecture on schedule, 784 00:46:11,630 --> 00:46:14,690 we'll go look at the results that we get-- 785 00:46:14,690 --> 00:46:16,291 I got last time I ran it. 786 00:46:20,281 --> 00:46:20,780 All right. 787 00:46:23,720 --> 00:46:32,110 When I scaled, what we see here is that now there is a pretty 788 00:46:32,110 --> 00:46:34,770 dramatic difference, right? 789 00:46:34,770 --> 00:46:37,170 One of the clusters has a much higher fraction 790 00:46:37,170 --> 00:46:43,030 of positive patients than others, 791 00:46:43,030 --> 00:46:46,910 but it's still a bit problematic. 792 00:46:46,910 --> 00:46:52,670 So this has pretty good specificity, 793 00:46:52,670 --> 00:46:57,275 or positive predictive value, but its sensitivity is lousy. 794 00:47:02,170 --> 00:47:06,640 Remember, a third of our initial population more or less, 795 00:47:06,640 --> 00:47:08,260 was positive. 796 00:47:08,260 --> 00:47:13,320 26 is way less than a third, so in fact I've 797 00:47:13,320 --> 00:47:18,690 got a class, a cluster, that is strongly enriched, 798 00:47:18,690 --> 00:47:23,250 but I'm still lumping most of the positive patients 799 00:47:23,250 --> 00:47:24,350 into the other cluster. 800 00:47:27,030 --> 00:47:31,790 And in fact, there are 83 positives. 801 00:47:31,790 --> 00:47:33,840 Wrote some code to do that. 802 00:47:33,840 --> 00:47:37,870 And so we see that of the 83 positives, 803 00:47:37,870 --> 00:47:41,800 only this class, which is 70% positive, 804 00:47:41,800 --> 00:47:44,710 only has 26 in it to start with it. 805 00:47:44,710 --> 00:47:48,980 So I'm clearly missing most of the positives. 806 00:47:48,980 --> 00:47:51,130 So why? 807 00:47:51,130 --> 00:47:54,640 Well, my hypothesis was that different subgroups 808 00:47:54,640 --> 00:47:58,852 of positive patients have different characteristics. 
809 00:48:01,590 --> 00:48:09,080 And so we could test this by trying other values of k 810 00:48:09,080 --> 00:48:11,570 to see with-- we would get more clusters. 811 00:48:11,570 --> 00:48:14,540 So here, I said, let's try k equals 2, 4, and 6. 812 00:48:18,090 --> 00:48:19,740 And here's what I got when I ran that. 813 00:48:23,870 --> 00:48:32,010 So what you'll notice here, as we get to, say, 4, that I have 814 00:48:32,010 --> 00:48:39,030 two clusters, this one and this one, 815 00:48:39,030 --> 00:48:43,230 which are heavily enriched with positive patients. 816 00:48:43,230 --> 00:48:49,530 26 as before in the first one, but 76 patients 817 00:48:49,530 --> 00:48:51,240 in the third one. 818 00:48:51,240 --> 00:48:55,560 So I'm now getting a much higher fraction of patients 819 00:48:55,560 --> 00:49:00,930 in one of the "risky" clusters. 820 00:49:00,930 --> 00:49:08,930 And I can continue to do that, but if I look at k equals 6, 821 00:49:08,930 --> 00:49:11,420 we now look at the positive clusters. 822 00:49:11,420 --> 00:49:15,560 There were three of them significantly positive. 823 00:49:15,560 --> 00:49:20,210 But I'm not really getting a lot more patients total, 824 00:49:20,210 --> 00:49:22,260 so maybe 4 is the right answer. 825 00:49:24,860 --> 00:49:29,470 So what you see here is that we have at least two parameters 826 00:49:29,470 --> 00:49:32,530 to play with, scaling and k. 827 00:49:32,530 --> 00:49:35,200 Even though I was only wanted a structure 828 00:49:35,200 --> 00:49:37,090 that would separate the risk-- 829 00:49:37,090 --> 00:49:39,640 high-risk patients from the lower-risk, 830 00:49:39,640 --> 00:49:45,140 which is why I started with 2, I later 831 00:49:45,140 --> 00:49:48,260 discovered that, in fact, there are multiple reasons 832 00:49:48,260 --> 00:49:50,390 for being high-risk. 833 00:49:50,390 --> 00:49:52,070 And so maybe one of these clusters 834 00:49:52,070 --> 00:49:54,800 is heavily enriched by old people. 835 00:49:54,800 --> 00:49:56,420 Maybe another one is heavily enriched 836 00:49:56,420 --> 00:50:00,500 by people who have had three heart attacks in the past, 837 00:50:00,500 --> 00:50:03,990 or ST elevation or some combination. 838 00:50:03,990 --> 00:50:05,540 And when I had only two clusters, 839 00:50:05,540 --> 00:50:08,640 I couldn't get that fine gradation. 840 00:50:08,640 --> 00:50:11,520 So this is what data scientists spend 841 00:50:11,520 --> 00:50:14,130 their time doing when they're doing clustering, 842 00:50:14,130 --> 00:50:17,970 is they actually have multiple parameters. 843 00:50:17,970 --> 00:50:19,770 They try different things out. 844 00:50:19,770 --> 00:50:22,020 They look at the results, and that's 845 00:50:22,020 --> 00:50:26,040 why you actually have to think to manipulate data rather 846 00:50:26,040 --> 00:50:28,860 than just push a button and wait for the answer. 847 00:50:28,860 --> 00:50:30,060 All right. 848 00:50:30,060 --> 00:50:34,350 More of this general topic on Wednesday 849 00:50:34,350 --> 00:50:37,440 when we're going to talk about classification. 850 00:50:37,440 --> 00:50:38,828 Thank you.