The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

AUDIENCE: OK. Number (2) -- (2.3). It says if the code [INAUDIBLE] 0 would be the [INAUDIBLE]. I thought it was, you're generating random values for that?

PROFESSOR: Yeah, you were, but if you look at what totes[0] is collecting -- so if you look at where it draws a random number, j is indexing into totes, right? So when j is 0, your standard deviation, which is also being indexed by j, is going to be 0. So you're always going to get the same value.

Any more questions? No? OK, that was easy.

So in lecture we were talking a lot about clustering. We've been talking about clustering for the past -- is it two lectures? And we had two different types of clustering methods. What were they?

AUDIENCE: Hierarchical and --

PROFESSOR: Hierarchical and k-means.

Can someone give me a rundown of what the steps are in hierarchical clustering?

AUDIENCE: Something that breaks everything down into one cluster [INAUDIBLE]

PROFESSOR: So let's say I have a bunch of data points. What would be the first step? You're going to first assign each point to a cluster, so each point gets its own cluster. And then the next step would be what?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Right, so you're going to find the two clusters that are closest to each other and merge them. So in this very contrived example, it would be these guys. And then you're going to keep doing that until you get to a certain number of clusters, right? So you merge these two, then you might merge these two, and you might merge these two, et cetera, et cetera, right?
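To make those steps concrete, here is a minimal sketch of that agglomerative procedure on plain 2D tuples, assuming single linkage and a fixed target number of clusters. The function names are made up for illustration; this is not the course's cluster code:

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points of any dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    # Cluster-to-cluster distance = closest pair of points across them.
    return min(euclidean(p, q) for p in c1 for q in c2)

def hierarchical_cluster(points, num_clusters):
    # Step 1: every point starts out as its own cluster.
    clusters = [[p] for p in points]
    # Step 2: repeatedly merge the two closest clusters
    # until only num_clusters remain.
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        # Remove the two old clusters and add the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

print(hierarchical_cluster([(0, 0), (0, 1), (5, 5), (5, 6), (9, 9)], 2))
```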
AUDIENCE: [INAUDIBLE]

PROFESSOR: So you're going to set the number of clusters that you want at the outset. So I guess, for the mammalian teeth example, the stopping criterion was two clusters, if I'm not mistaken.

So let's take a look at the code here that implements a hierarchical cluster. This is just some infrastructure code; it builds up the number of points. We have a cluster set class, which we'll go over in a second, and then for each point we're going to create a cluster object and add it to the cluster set.

Let's take a look at the cluster set. The cluster set has one attribute, the members attribute, and it just has a set of points -- or a set of clusters, actually. And the key method in here -- or the key methods -- are merge-1 and merge-n. Merge-n is what actually implements the clustering here. So you give it the distance metric that you're going to use for your points, the number of clusters that you want at the end of your clustering, the history tracker, and then you also tell it if you want to print out some debugging information -- which apparently is not used in this method. Oh, now it is, in merge-1.

Anyway, so while we have more clusters than the number of clusters we desire, we're going to keep iterating. And on each step we're going to call this function -- or method -- called merge-1 and just pass it the distance metric. And all merge-1 is going to do is: if there's only one cluster here, then it's just going to return None. If there are exactly two clusters, it's going to merge them. And if there are more than two clusters, it's going to find the two closest, according to the distance metric, and then merge those two. So the return value is going to be the two clusters it merged.

Let's look at the merge clusters code. All it does is take the two clusters and, for each point in both clusters, add it to a new list of points and create a new cluster from those points.
And then it removes the two clusters from members and adds the newly created cluster.

So then, the find closest method. What's this bit of code doing here?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Right, so we'll get to the metric in a second. So it initially looks at the first two members in the cluster set, sets minDistance to be their distance, and sets toMerge to be those two members. And then it iterates through every possible pair of clusters in this cluster set and finds the minimum distance according to the metric.

So let's look at the cluster class. All the cluster object -- or class -- does is hold a set of points. It knows the type of point that it's holding, because that becomes important when we talk about the different types of things that we want to cluster, and then it also has something called a centroid. All a centroid is, is just the middle of the cluster: you take all of the points and average their locations.

So these different functions just compute metrics about this particular cluster, right? So singleLinkageDist -- all this is going to do is find the minimum distance between every pair of points in the cluster. And what does maxLinkageDist do? I'm sorry, I'm mistaken: singleLinkageDist finds the minimum distance between a point in this cluster and a point in another cluster. I misspoke. So what does maxLinkageDist do?

AUDIENCE: [INAUDIBLE]

PROFESSOR: The opposite. I have to keep you talking or you'll fall asleep. And then, averageLinkageDist? Same thing. This is why having meaningful function names is important, because it helps you explain code.

So it also has this method in here called update, and what update does is it takes a new set of points and sets the points that this cluster has to be these new points. And then it computes a new centroid for this cluster. And the return value is the distance of the old centroid from the new centroid.
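Here is a hedged sketch of a cluster class along those lines, assuming points are plain numeric tuples (the course code wraps them in a Point class, and these method names are stand-ins):

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

class Cluster(object):
    def __init__(self, points):
        self.points = points
        self.centroid = self.compute_centroid()

    def compute_centroid(self):
        # Average each coordinate over all points in the cluster.
        dim = len(self.points[0])
        totals = [0.0] * dim
        for p in self.points:
            for i in range(dim):
                totals[i] += p[i]
        return tuple(t / len(self.points) for t in totals)

    def single_linkage_dist(self, other):
        # Minimum distance between a point here and a point in other.
        return min(euclidean(p, q) for p in self.points for q in other.points)

    def max_linkage_dist(self, other):
        # The opposite: maximum distance across the two clusters.
        return max(euclidean(p, q) for p in self.points for q in other.points)

    def update(self, points):
        # Replace this cluster's points, recompute the centroid,
        # and return how far the centroid moved.
        old_centroid = self.centroid
        self.points = points
        self.centroid = self.compute_centroid()
        return euclidean(old_centroid, self.centroid)

c = Cluster([(0, 0), (2, 2)])
print(c.centroid)                   # (1.0, 1.0)
print(c.update([(0, 0), (4, 4)]))   # centroid moves to (2.0, 2.0); prints ~1.414
```

The value update returns is exactly the "how far did the centroid move" delta that the k-means loop later keys off of.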
And this becomes important in some of the algorithms.

Then there's just some bookkeeping stuff here, like members will just give you all the points in this cluster. You all know what yield does, right?

AUDIENCE: [INAUDIBLE]

PROFESSOR: OK. So yield returns a generator object, which allows you to iterate over elements. So this was asked during the quiz review: what's the difference between range and xrange? Right. So if I use range, it actually returns a different type. So I can print out this list, right? In this case, it will print out the type of object it is. So this is accomplished using yield.

So if I wanted to write this myself, what this is going to do is return something called a generator object. And all it does is, instead of holding all the numbers in memory, it's going to return them one at a time to me. So when I use range here, it constructs a list and it has all of those integers in memory. If I use xrange, it's not going to hold all the integers in memory, but I can still iterate over them one at a time.

AUDIENCE: So within that function [INAUDIBLE] yield a bunch of times before the function, right?

PROFESSOR: Yeah.

AUDIENCE: How is that accomplished? Does it operate [INAUDIBLE] within the way you normally have functions [INAUDIBLE]?

PROFESSOR: Right. So what this tells Python is that when it sees a yield, it's sort of like a return, except it's telling Python that I want to come back to this location at some point. So a return just exits the function completely. What a yield does is it takes the value that is specified after yield and returns that value to the calling place in the program. But then, when it comes time to get a new value, it'll return back to where this yield exited. So a way of seeing this is, if I iterate over my xrange, each time it needs a new value, it's going to go back inside this function and grab it.
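The demo itself isn't captured in the transcript, so here is a sketch of what writing xrange yourself with yield might look like (Python 2 vintage, matching the course; in Python 3, range is already lazy):

```python
def my_xrange(n):
    # A generator function: calling it runs no code yet,
    # it just returns a generator object.
    i = 0
    while i < n:
        # yield hands back one value, then suspends here until
        # the caller asks for the next one.
        yield i
        i += 1

print(type(my_xrange(5)))   # <type 'generator'> in Python 2
for x in my_xrange(5):      # resumes the function body once per value
    print(x)
```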
So it looks like a function. But what it's actually doing is creating what's called a generator object, and it has these special methods for getting the next value. So it's some nice syntactic sugar. But it's pretty neat. That's what's going on with this yield statement here: instead of returning the entire list of points, or instead of doing it in some other way, all it's doing is yielding each point one at a time so that you can iterate over them.

So what else? Here's a method for computing the centroid. All we're going to do is total up where each point is and then take the average over all the points. Does that make sense? All right.

So the example we saw was mammal teeth. And the way that's accomplished in this set of code is we're going to define a subclass of a class, Point, called mammal. What Point does is it has a name for a given data point, it has a set of attributes, and then you can also give it some normalized attributes. If you don't give it the normalized attributes, it'll just use the original attributes. This becomes important when we do scaling over data, which we'll do shortly.

So there's nothing really special about it except for this distance function. It's just defining the Euclidean distance for a given multi-dimensional point. Everyone knows that if you have a point in two dimensions, an xy point, then the distance is just the square root of x-squared plus y-squared. It generalizes to higher dimensions, if you weren't already aware. So if I want to find the straight-line distance for a point in 3D, it's just going to be the square root of x-squared plus y-squared plus z-squared. That's all. And then, so on and so forth.
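In symbols, the distance being described for two $n$-dimensional points $p$ and $q$ is the standard Euclidean distance:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2},$$

which reduces to $\sqrt{x^2 + y^2}$ in 2D and $\sqrt{x^2 + y^2 + z^2}$ in 3D.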
So all the mammal class does is subclass Point, and it has this function, scaleFeatures. What scaleFeatures does is take a key. In this case, we have defined two ways of scaling this data, of scaling this point. We have the identity, which is just going to leave every point alone. And then we have this 1 over max, which is going to scale each attribute by the maximum value in this data set. And if we look at the data set, we know that our max value is 6. You could compute that automatically, but in this case we're using prior knowledge of the data set that we have.

So why don't we do a cluster? This is going to do a hierarchical cluster, right? And if I just specify the default parameters, all it's going to do is look for two clusters, use the identity scaling, and print out the history -- when it's performed the different merges. Unless I have extraneous code that I'm already running.

So what starts off first is we get a lot of merges with just these single-element clusters, right? I have a beaver with a groundhog, so I guess they're pretty similar in terms of teeth. We have a squirrel with a porcupine, a wolf with a bear. Eventually, though, we start finding clusters -- a wolf and a bear, I guess, are more similar, but they're also similar to a dog. So we're going to start merging multi-point clusters.

So we start seeing the beaver and groundhog cluster get merged with the squirrel and porcupine cluster. If you were to visualize this, the reason why it's called hierarchical clustering is -- which one did I say? Beaver, groundhog -- these guys have been merged into a cluster, right? They started out as their own clusters, and they've been merged into one cluster. And then the grey squirrel and the porcupine, same thing: they started off with their own clusters at the beginning, and they got merged. And now what this step is saying is that these two clusters get merged. So we're building this tree, or hierarchy. That's where the hierarchical comes from.
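Before moving on, here is a hedged sketch of that scaleFeatures idea. The names and the dictionary dispatch are mine, not the problem set's; 6 is the known maximum value in the mammal-teeth data, as mentioned above:

```python
def identity(attrs):
    # Leave every attribute alone.
    return attrs

def scale_by_max(attrs, max_val=6.0):
    # Scale each attribute by 1/max, using prior knowledge
    # that 6 is the largest value in this data set.
    return [a / max_val for a in attrs]

SCALINGS = {'identity': identity, '1/max': scale_by_max}

print(SCALINGS['1/max']([3, 1, 4, 2]))  # [0.5, 0.1666..., 0.6666..., 0.3333...]
```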
So we use hierarchical clustering a lot in other fields. In speech recognition, we can do a hierarchical clustering of speech sounds. So if I have, say, different vowels, and maybe a couple of consonants, I would expect to see, say, these kinds clustered together first. And so what I might see is, these would be fricatives, but then I might have some stops, like "t" and "b", that get merged first. So it's a way of making these generalized groupings at different levels.

I don't know. Does anyone have any real questions about hierarchical clustering? So should I move on to k-means? All right.

So what's the general idea with k-means? I start off with a set of data points. What's my first step?

AUDIENCE: Choose your total number of clusters?

PROFESSOR: Right, so I'm going to choose a k. So let's say for giggles we're going to choose k equals 3. And then, what's my next step?

AUDIENCE: Choose k's [INAUDIBLE]?

PROFESSOR: So we're going to pick k random points from our data set. All right, and then, what do I do?

AUDIENCE: Cluster?

PROFESSOR: Then you'd cluster. Yeah, all right. So after we've chosen our three centroids here, these become our clusters, right? And we're going to look at each point, and we're going to figure out which cluster it's closest to. So in this case, this is going to be a pretty easy clustering: all these points are going to belong here, all these points are going to belong here, and all these points are going to belong here, right? And then we're going to update our centroid for each of these clusters. And there's going to be a distance that the centroid moves each time we update it. So in this case, the centroid moved quite a bit, right? Then we're going to find the maximum distance that the centroid moved.
And if it's below a certain cutoff value, then we're going to say, I've got a good enough clustering. If it's above a certain cutoff value, then what I'm going to say is, this centroid moved quite a bit for this cluster, right? So I'm going to try another iteration. I'm going to say, for each one of these points, I'm now going to look and try to find the closest cluster that it belongs to based on these new centroids. And in this case, nothing's really going to change, so all of the deltas for all of the centroids are going to stay the same. So it's going to be below the cutoff value, and it's going to stop.

So what's an advantage of k-means over hierarchical clustering?

AUDIENCE: More efficient?

PROFESSOR: Yeah. So let's say that I have a million points. If I were to hierarchically cluster these, that means I'd start off with a million clusters. And in each iteration, I'm just going to reduce that by 1 -- down to, I don't know, let's say 3, OK? So on each iteration, we're just reducing it by 1, which, if that's all we were doing, would not be so hard. It doesn't take too long to count down from a million on a computer. But on each one of these steps, we have to compute the pairwise distance between each pair of clusters. So it's going to be about n times n minus 1 comparisons on each step which, in this first case, works out to a lot -- approximately, right? And it doesn't get much better as we go down. With k-means, what happens is, if we have a million points and we have k clusters, we just have to perform k times 1 million comparisons on each step, because for each point we need to find the closest centroid, approximately. So the upshot is that k-means winds up being a lot more efficient on each iteration, which is why, if you have a large number of points, you might want to choose k-means over hierarchical clustering.
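To put rough numbers on that (my arithmetic, not the lecture's): with n = 1,000,000 points, the first hierarchical merge already requires on the order of n(n-1)/2, or about 5 x 10^11, pairwise distances, while a single k-means iteration with k = 3 needs only k times n = 3 x 10^6 point-to-centroid distances. That's roughly five orders of magnitude fewer, and the gap persists on every later step.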
What's an advantage, though, of hierarchical clustering over k-means? Even though it's less efficient, what's another --

AUDIENCE: [INAUDIBLE]

PROFESSOR: What's that?

AUDIENCE: More thorough.

PROFESSOR: More thorough.

AUDIENCE: And you can get a lot of different levels that you can look at.

PROFESSOR: Yeah, you can get a lot of different levels. So you can look at the clusterings from different perspectives. But the key thing with --

AUDIENCE: You don't necessarily know how many clusters there actually are. Hierarchical clustering will tell you all of the [INAUDIBLE]. You can just go down the tree and --

PROFESSOR: Right, so you could go down different levels of the tree and pick however many clusters you want. But the big reason -- or one of the main advantages that hierarchical clustering has over k-means -- is that k-means is random. It's non-deterministic. Hierarchical clustering is deterministic: it's always going to give you the same result. With k-means, because your initial starting conditions are random -- because you're choosing k random points -- the end result will be different each time. And so when we do k-means clustering, this means that we don't necessarily want to do it just once. If we choose k equals 3, we might want to do five different k-means clusterings and take the best one. So that's one of the big points with k-means.

There's a degenerate condition with k-means. So if my stopping criterion is that the centroid doesn't move, what's a really easy way to make the centroid not move by choosing k? What was it? k equals n, right? So if I have n points and k equals n, then all of my points are going to be their own cluster. And every time I update, I'm never going to move my centroid.
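One way to put that degeneracy in symbols: if $\mu_j$ is the centroid of cluster $C_j$, a total clustering error can be defined as

$$E = \sum_{j=1}^{k} \sum_{p \in C_j} \lVert p - \mu_j \rVert^2.$$

With k = n, every point is its own centroid, so every term is 0 and E = 0 -- a perfect score from a useless clustering. (This is one common form; the problem set's version may normalize differently, as noted next.)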
So in your problem set, you're going to be asked to compute a standard error for each of the clusters and a total error for the entire clustering. What that is, is: I'm going to take the centroid for each cluster, and I'm going to find the distance from each point in the cluster to the centroid. And then I'm going to sum up all of those distances over the entire clustering. That's going to give me my error. I'm not sure if that equation's totally right -- there might be a bit of division in there. But the general idea, what I'm trying to emphasize, is that we can reduce this number just by increasing k. And if we make k equal to n, then this is going to be 0. So like I was saying with statistics, you never want to trust just one number. With k-means, you never want to trust just one clustering or one measurement of error. You want to look at it from multiple perspectives and vantage points.

So why don't we look at the code and try to match up all of that stuff with what you'll see on your problem set? The big function to look at for k-means is aptly named k-means. And it's going to take a set of points, a number of clusters, a cutoff value, a point type, and a variable named maxIters. So first step, we get our initial centroids: all we're going to do is sample our points randomly and choose k of them. Our clusters we're going to represent as a list, and for each of the points in the initial centroids, we're going to add a cluster with just that point. And then we get into our loop here. What this is saying is, while our biggest change in centroid is greater than the cutoff and we haven't exceeded the maximum number of iterations, we're going to keep trying to refine our clustering. So that brings up a point I actually failed to mention.
Why should we have this cutoff point, maxIters?

AUDIENCE: It'll go forever?

PROFESSOR: Yeah, there's a chance that, if our cutoff value is too small, or we have a point that's on a border and likes to jump between clusters and move the centroid just above the cutoff point, we'll never converge to our cutoff. And so we want to set up a secondary break. So we have this maxIters, which defaults to 100. With this setup, though, there are a couple of things you have to consider. One, you need to make sure that maxIters is not too small, because if it's too small, you're not going to converge. And you don't want to make it too large, because then your algorithm will just take forever to run, right? Likewise, you don't want to make your cutoff too small, either. So sometimes you have to play around with the algorithm to figure out what the best parameters are. And that's oftentimes more of an art than a hard science.

So anyway, continuing on. For each iteration, we're going to set up a new set of clusters, and we're going to set them initially to have no points in them. And then, for all the points in our data set, we are going to look for the smallest distance. So that means we're going to initially set our smallest distance to be the distance from the point to the first centroid, and then we're just going to iterate through all the centroids -- or through all the clusters -- and find the smallest distance. Is anyone lost by that? Make sense?

Once we find that, we're just going to add that point to the new clusters. And then we're going to go through our update. We're going to iterate through each of our clusters, and we are going to update the points in the cluster. So remember, the update method sets the points of the cluster to be this new set of points you've given it, and it also updates the centroid and returns the delta between the old centroid and the new centroid.
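Putting those pieces together, here is a minimal sketch of that k-means loop, reusing the euclidean helper and hypothetical Cluster class sketched earlier. Parameter names like max_iters stand in for the real maxIters:

```python
import random

def kmeans(points, k, cutoff, max_iters=100):
    # Step 1: choose k random points as the initial one-point clusters.
    clusters = [Cluster([p]) for p in random.sample(points, k)]

    num_iters = 0
    max_change = cutoff + 1.0  # force at least one pass through the loop
    while max_change > cutoff and num_iters < max_iters:
        # Fresh, empty point lists, one per cluster.
        new_points = [[] for _ in clusters]
        # Assign each point to the cluster with the nearest centroid.
        for p in points:
            best = min(range(k), key=lambda i: euclidean(p, clusters[i].centroid))
            new_points[best].append(p)
        # Update every cluster; track the biggest centroid move.
        max_change = 0.0
        for i, c in enumerate(clusters):
            if new_points[i]:  # in this sketch, an empty cluster keeps its old points
                max_change = max(max_change, c.update(new_points[i]))
        num_iters += 1
    return clusters
```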
So that's where this change variable is coming from. And then we're just going to look for the biggest change, right? And if, at some point in our clustering, the centroids have stabilized and our clusters are relatively stationary, then our max change will be small, and it'll wind up terminating the algorithm.

And all this function does is, once it's converged or it's gone through the maximum number of iterations, it's just going to find the maximum distance of a point to its centroid. So it's going to look for the point that has the maximum distance from its corresponding centroid, and that's going to be the coherence of this clustering. And then it's just going to return a tuple containing the clusters and the maximum distance.

So it's not a hard algorithm to understand, and it's pretty simple to implement. Are there any real questions about what's going on here?

So the example that he went over in lecture with k-means was this counties clustering example. We had a bunch of data about different counties in the US, and we just played around with clustering them and seeing what we got. So say we made five clusters, we clustered on all the features, and we wanted to see what the distribution would be for, say, incomes. What this function, test, is going to do is take a k, a cutoff, and a number of trials. So remember, we said that because k-means is non-deterministic, we're going to want to run it a number of times: maybe we get a bad initial set of points for our centroids, or for our clusters, and that gives us a bad clustering. So we're going to run it a number of times and try to prevent that from happening.

AUDIENCE: How do we [INAUDIBLE] multiple runs, because [INAUDIBLE] really different clustering happens [INAUDIBLE] after you run it a couple of times?

PROFESSOR: It can be tricky, to be honest. One technique you could use would be to have a training set and a development set. What I mean by that is, you perform a clustering on the training set.
And then you take the development set, and you figure out which clusters its points belong to. And then you measure the error of that development set. So once you've assigned these development points to the clusters, you measure the distance to the centroid, you sum up the squared distances, and you sum up over all the clusters. Then, if you do that a number of times, what you would do is choose the clustering that gave you the smallest error on your development set. And then you'd say, that's probably my best clustering for this data. So that's one way of doing it. There are multiple ways of skinning the cat -- I was trying to think of a good aphorism.

And actually, that's what's on your problem set. One of the problems on your problem set is to cluster based on a holdout set and see what the effect of the error is on this holdout set. So did that answer your question?

AUDIENCE: Yeah.

PROFESSOR: OK. And then, for choosing k, there are different methods for doing that, too. A lot of it is, you run a lot of experiments and you see what you get. This is where the research part comes in for a lot of applications. You can also try some other automatic methods, like entropy or other, more complicated measurements of error. But don't worry about those. For our purposes, if you get below the cutoff value, you run a number of iterations, and you've minimized your error on your test set, we'll be happy. We want you to be familiar with k-means, but not experts in it -- it's a useful tool for your kit.

Anyway, so all this code is going to do is run a number of trials, perform k-means clustering, and look for the clustering with the smallest maximum distance. So remember, the return value of k-means includes the maximum distance from a point to its centroid. We're going to define our best clustering as the clustering that gives us the smallest max distance.
Yeah, and that's all we're going to do for this bit of code here. We're going to find the average income in each cluster and draw a histogram. So I think this is actually done. So we have five clusters. And what they're showing us is that, if we take the average income of the different clusters, they're going to be centered at these average incomes. Let's see what some other plots look like.

This set of examples is using a point type called county. And county, like the mammal class, inherits from Point. And it defines a set of features that can be used -- or a set of, it calls them filters. But basically, if you pass it one of these filter names, like allEducation, noEducational, wealthOnly, noWealth, what this is doing is selecting this tuple of tuples. So if I say wealthOnly, it's going to use this tuple as a filter. And each element in this tuple of tuples is a tuple that has the name of the attribute and whether or not it should be used in the clustering. So if it has a 1, it should be used; if it has a 0, it shouldn't be used. If we look at how that's applied, it'll get a filterSpec, and then it'll just set an attribute called atrFilter. And if we see where that's used, it's going to iterate through all of the attributes, and if the given attribute has a 1, then it's going to include it in the set of features that are used in the distance computation.

So did that make sense at all? No? All right.

So the idea is to illustrate that if you use different features when you're doing your clustering, you'll probably attain different clusterings. In that first example I showed, we used all the features. But in this example, we are going to look at only wealth -- a sketch of how such a filter might work follows.
And that includes these features: if we look at this set of filters here, it's going to include the home value, the income, and the poverty. And then all the other attributes, like population change, it's not going to include, so those aren't going to be used in the clusterings. So this is going to change what our clusters look like. So if we look at what we have for our clusters -- well, that's not very clear. Yeah, so let's see what happens. I'm not sure that's really going to show us. Probably better if I show them all at once, right? So this will take a while.

Yes?

AUDIENCE: Actually, I had just one question [INAUDIBLE] but there is a method showing us [INAUDIBLE] iterative additionally?

PROFESSOR: Mm-hm.

AUDIENCE: [INAUDIBLE]

PROFESSOR: It's like iterValues or iterKeys, yeah.

AUDIENCE: I was wondering how to go about using that in actual code.

PROFESSOR: In actual code? So you know that if you have a dictionary, d, you can do d.keys(). So remember I was demoing that code, the difference between range and xrange? Same thing. d.keys() is going to return an actual list of all the keys, right? What d.iterkeys() returns is a generator object that gives you, one by one, each key in the dictionary. So a lot of times, you guys in your code will use something like for k in d.keys(): to iterate through all the keys in the dictionary. What the keys method does, though, is create an actual copy of that list of keys, right? So when you call d.keys(), if you have a lot of keys in your dictionary, it's going to go one by one through each of those keys, add it to a list, and then return it to you. What d.iterkeys() does is skip that going one by one and adding to a new list. It just gives you a generator object which, when you use it in a for loop, is going to yield the keys one at a time without creating a separate list.
That make sense?

AUDIENCE: So will it be more efficient?

PROFESSOR: Yeah, it's generally more efficient. And then there's also, I think, iterValues, which goes through each of the values. And then, I think there's iterItems. And I think, if I'm not mistaken, if you do something just like that -- for k, v in d.iteritems() -- this is going to iterate through tuples that contain each key and the values associated with that key. And it's equivalent to doing this. Make sense?

AUDIENCE: Yeah.

PROFESSOR: So where were we here? Oh, maybe I shouldn't have done this all at once. Why don't we just look at two? Why don't we take a look at what the average incomes -- what the clustering gives us for average incomes -- for education versus no education. That's probably not going to be a very good comparison, just doing five trials and two trials for each clustering. So k-means is the efficient one, which means that hierarchical would take a long, long time. There we go.

So these are the average incomes if we cluster with k equals 50 on education. And then there should be another one. I didn't create the new figure. So apparently there's a bug in the code. I wanted to show the two plots side by side so you could see the differences. Because what you should see is, we would see a different distribution in incomes among the clusters if we clustered based on no education versus education level. But my code is buggy and not working the way I expected it to. So I apologize.