1
00:00:00,080 --> 00:00:01,770
The following
content is provided

2
00:00:01,770 --> 00:00:04,010
under a Creative
Commons license.

3
00:00:04,010 --> 00:00:06,860
Your support will help MIT
OpenCourseWare continue

4
00:00:06,860 --> 00:00:10,720
to offer high quality
educational resources for free.

5
00:00:10,720 --> 00:00:13,340
To make a donation or
view additional materials

6
00:00:13,340 --> 00:00:17,207
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,207 --> 00:00:17,832
at ocw.mit.edu.

8
00:00:20,749 --> 00:00:23,040
VICTOR COSTAN: So I'm excited
about today's recitation,

9
00:00:23,040 --> 00:00:25,560
because if I do this
right and you guys get it,

10
00:00:25,560 --> 00:00:28,750
then I can mess up every
other recitation after it.

11
00:00:28,750 --> 00:00:31,720
And you'll still get
the gist of 6.006.

12
00:00:31,720 --> 00:00:34,700
So all I have to do
is get this working.

13
00:00:34,700 --> 00:00:36,760
So most of the time
in the real world

14
00:00:36,760 --> 00:00:39,580
you're probably not going to be
coming up with new algorithms

15
00:00:39,580 --> 00:00:42,080
to do something, but rather
you'll have some code

16
00:00:42,080 --> 00:00:43,960
and you want to make it faster.

17
00:00:43,960 --> 00:00:45,760
And the first step
in making it faster

18
00:00:45,760 --> 00:00:48,510
is you realize, how
does it do right now?

19
00:00:48,510 --> 00:00:51,210
How does it run, which lines
are slow, which lines are fast,

20
00:00:51,210 --> 00:00:53,030
and where you can
make improvements.

21
00:00:53,030 --> 00:00:56,190
So in lecture we talked
about the Python Cost Model

22
00:00:56,190 --> 00:00:59,100
which is what you use
to look at the code

23
00:00:59,100 --> 00:01:01,590
and figure out how much
time it takes to run.

24
00:01:01,590 --> 00:01:04,069
And we talked about
document distance,

25
00:01:04,069 --> 00:01:05,530
which is a problem
that we'll use

26
00:01:05,530 --> 00:01:08,180
to practice our analysis skills.

27
00:01:08,180 --> 00:01:09,860
And this entire
recitation is all

28
00:01:09,860 --> 00:01:12,610
about looking at versions
of document distance

29
00:01:12,610 --> 00:01:14,810
and analyzing them.

30
00:01:14,810 --> 00:01:16,860
So that's what we'll
do, look at Python code,

31
00:01:16,860 --> 00:01:19,010
look at Python code,
look at Python code.

32
00:01:19,010 --> 00:01:21,320
So you better have handouts,
because I can't project.

33
00:01:21,320 --> 00:01:25,637
OK, how many people remember
the document distance problem?

34
00:01:25,637 --> 00:01:27,345
You guys said you went
to lecture, right?

35
00:01:30,540 --> 00:01:33,470
OK, so very, very fast,
document distance.

36
00:01:33,470 --> 00:01:34,650
I have two documents.

37
00:01:37,720 --> 00:01:41,780
The fox is in the hat.

38
00:01:45,990 --> 00:01:49,930
And the fox is outside.

39
00:01:54,250 --> 00:01:56,820
Document 1, document 2.

40
00:01:56,820 --> 00:01:58,580
What's the first
thing I want to do?

41
00:01:58,580 --> 00:02:01,940
So there are three operations
that Eric mentioned in lecture.

42
00:02:01,940 --> 00:02:07,410
Operation one,
take each document,

43
00:02:07,410 --> 00:02:08,820
break it up into words.

44
00:02:08,820 --> 00:02:10,220
Right?

45
00:02:10,220 --> 00:02:12,610
This is a string.

46
00:02:12,610 --> 00:02:15,640
When I read it, then it becomes
word one, word two, word three,

47
00:02:15,640 --> 00:02:18,060
word four, so on and so forth.

48
00:02:18,060 --> 00:02:20,930
Operation two, build
document vectors

49
00:02:20,930 --> 00:02:22,840
out of the two documents.

50
00:02:22,840 --> 00:02:25,790
So the documents are D1 and D2.

51
00:02:29,080 --> 00:02:30,950
A document vector
is basically a list

52
00:02:30,950 --> 00:02:35,670
of the words in the documents
with a count of how many times

53
00:02:35,670 --> 00:02:37,870
each word appears
in the document.

54
00:02:37,870 --> 00:02:43,112
So let's build a document
vector for document one.

55
00:02:43,112 --> 00:02:44,570
I'm not going to
write it formally,

56
00:02:44,570 --> 00:02:47,120
so can anyone tell me
what it should look like,

57
00:02:47,120 --> 00:02:49,110
and I'll sort of write
it down as a list.

58
00:02:52,780 --> 00:02:55,440
So for all the words here,
I want to list the words

59
00:02:55,440 --> 00:02:57,640
and how many times they show up.

60
00:02:57,640 --> 00:03:01,376
Somebody, please.

61
00:03:01,376 --> 00:03:03,840
AUDIENCE: The is in there twice?

62
00:03:03,840 --> 00:03:04,970
VICTOR COSTAN: OK.

63
00:03:04,970 --> 00:03:07,735
The, twice.

64
00:03:07,735 --> 00:03:10,410
AUDIENCE: Fox, once.

65
00:03:10,410 --> 00:03:11,710
VICTOR COSTAN: One.

66
00:03:11,710 --> 00:03:13,340
AUDIENCE: Is, once.

67
00:03:13,340 --> 00:03:15,090
VICTOR COSTAN: Is, one.

68
00:03:15,090 --> 00:03:16,970
AUDIENCE: [INAUDIBLE] in once.

69
00:03:16,970 --> 00:03:18,200
VICTOR COSTAN: In, one.

70
00:03:18,200 --> 00:03:18,991
AUDIENCE: Hat once.

71
00:03:21,242 --> 00:03:22,200
VICTOR COSTAN: Awesome.

72
00:03:22,200 --> 00:03:24,760
Thank you very much.

73
00:03:24,760 --> 00:03:25,910
Second one.

74
00:03:25,910 --> 00:03:27,500
Another volunteer.

75
00:03:27,500 --> 00:03:28,270
Yes, go for it.

76
00:03:28,270 --> 00:03:30,231
AUDIENCE: The, once.

77
00:03:30,231 --> 00:03:31,230
VICTOR COSTAN: The, one.

78
00:03:31,230 --> 00:03:32,740
AUDIENCE: Fox, once.

79
00:03:32,740 --> 00:03:33,857
VICTOR COSTAN: Fox, one

80
00:03:33,857 --> 00:03:35,020
AUDIENCE: Is, one.

81
00:03:35,020 --> 00:03:36,708
VICTOR COSTAN: Is, one.

82
00:03:36,708 --> 00:03:38,604
AUDIENCE: Outside, one.

83
00:03:38,604 --> 00:03:41,890
VICTOR COSTAN: Outside, one.

84
00:03:41,890 --> 00:03:44,850
OK, so this is a
document vector.

85
00:03:44,850 --> 00:03:47,190
Notice two small details.

86
00:03:47,190 --> 00:03:49,880
Here, "the" is capitalized,
here it's not,

87
00:03:49,880 --> 00:03:52,570
and yet I bundle them together.

88
00:03:52,570 --> 00:03:54,160
I know my grammar,
so I put periods

89
00:03:54,160 --> 00:03:56,024
at the end of the
sentences, and yet they

90
00:03:56,024 --> 00:03:57,190
don't show up anywhere here.

91
00:03:57,190 --> 00:03:58,800
So we got rid of
the punctuation,

92
00:03:58,800 --> 00:04:00,185
and we made all words lowercase.

93
00:04:02,476 --> 00:04:04,100
These are details,
but they are details

94
00:04:04,100 --> 00:04:05,808
that you'll see in
the code, so if you're

95
00:04:05,808 --> 00:04:07,850
wondering why, this is why.
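
[Editor's sketch] A minimal illustration of the document vector built on the board, assuming the words have already been lowercased and stripped of punctuation. The name `build_document_vector` is hypothetical, and the dictionary-based counting is chosen for brevity; it is not necessarily how the handout does it.

```python
def build_document_vector(words):
    """Return [word, count] pairs in order of first appearance."""
    counts = {}
    for word in words:
        # Start at 0 the first time a word is seen, then add one.
        counts[word] = counts.get(word, 0) + 1
    return [[word, count] for word, count in counts.items()]


# The two example documents, already split into lowercase words.
doc1 = ["the", "fox", "is", "in", "the", "hat"]
doc2 = ["the", "fox", "is", "outside"]
vector1 = build_document_vector(doc1)
vector2 = build_document_vector(doc2)
```

Here `vector1` pairs "the" with 2 and every other word with 1, matching the list written on the board.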

96
00:04:07,850 --> 00:04:11,180
So step one, read the document,
make it a list of words.

97
00:04:11,180 --> 00:04:13,250
Step two, compute
the document vector.

98
00:04:13,250 --> 00:04:15,940
Step three, take the
two document vectors,

99
00:04:15,940 --> 00:04:17,610
and compute the angle.

100
00:04:17,610 --> 00:04:20,560
What is the angle of
two document vectors?

101
00:04:20,560 --> 00:04:21,980
Big ugly math formula.

102
00:04:21,980 --> 00:04:25,570
The only thing that's relevant
is that it takes these vectors

103
00:04:25,570 --> 00:04:27,860
and computes an inner product.

104
00:04:27,860 --> 00:04:33,580
So if we look at the code for
angle vector, or vector angle,

105
00:04:33,580 --> 00:04:37,400
you'll see that between the numerator
and denominator, lines two

106
00:04:37,400 --> 00:04:40,450
and three, it calls
inner product three times

107
00:04:40,450 --> 00:04:42,562
and then it does
some math with it.

108
00:04:42,562 --> 00:04:43,770
We don't care about the math.

109
00:04:43,770 --> 00:04:45,170
We assume the math is order one.

110
00:04:45,170 --> 00:04:48,514
We only care about
inner product.

111
00:04:48,514 --> 00:04:49,680
How does inner product work?

112
00:04:49,680 --> 00:04:52,210
Can anyone help me compute the
inner product for these guys?

113
00:04:56,960 --> 00:04:57,920
Yes?

114
00:04:57,920 --> 00:04:58,920
AUDIENCE: It's like
the dot product?

115
00:04:58,920 --> 00:04:59,672
VICTOR COSTAN: OK.

116
00:04:59,672 --> 00:05:03,445
AUDIENCE: So, if we take the
vectors and you multiply them,

117
00:05:03,445 --> 00:05:05,870
like, you're adding to
the components, right?

118
00:05:05,870 --> 00:05:08,300
Because they're so thick--

119
00:05:08,300 --> 00:05:10,482
VICTOR COSTAN: OK, this
is too complicated, then.

120
00:05:10,482 --> 00:05:11,940
I'm seriously
depressed, so give me

121
00:05:11,940 --> 00:05:15,600
some clear instructions
step by step.

122
00:05:15,600 --> 00:05:17,120
AUDIENCE: Like,
I know you divide

123
00:05:17,120 --> 00:05:18,860
by the length of
each of the vectors--

124
00:05:18,860 --> 00:05:20,280
VICTOR COSTAN: Let's
not worry about that.

125
00:05:20,280 --> 00:05:22,420
I have these vectors, and
I want an inner product.

126
00:05:22,420 --> 00:05:25,454
I don't care about the angle,
just the inner product.

127
00:05:25,454 --> 00:05:27,750
AUDIENCE: OK, well do 2
times 1 for the right.

128
00:05:27,750 --> 00:05:28,500
VICTOR COSTAN: OK.

129
00:05:28,500 --> 00:05:30,620
So I take the here,
shows up twice.

130
00:05:30,620 --> 00:05:32,690
I take the here, shows up once.

131
00:05:32,690 --> 00:05:33,970
2 times 1, right?

132
00:05:33,970 --> 00:05:34,640
AUDIENCE: Mhm.

133
00:05:34,640 --> 00:05:35,850
VICTOR COSTAN: OK.

134
00:05:35,850 --> 00:05:36,758
And then?

135
00:05:36,758 --> 00:05:38,927
AUDIENCE: I would
do the same for fox.

136
00:05:38,927 --> 00:05:41,510
VICTOR COSTAN: OK, fox shows up
here once, shows up here once,

137
00:05:41,510 --> 00:05:43,739
so what do I do?

138
00:05:43,739 --> 00:05:45,070
AUDIENCE: 1 times 1.

139
00:05:45,070 --> 00:05:47,380
VICTOR COSTAN: OK.

140
00:05:47,380 --> 00:05:49,240
AUDIENCE: And do
the same for is.

141
00:05:49,240 --> 00:05:50,816
VICTOR COSTAN: OK.

142
00:05:50,816 --> 00:05:53,400
AUDIENCE: And in should be 0.

143
00:05:53,400 --> 00:05:54,150
VICTOR COSTAN: OK.

144
00:05:54,150 --> 00:05:55,730
AUDIENCE: [INAUDIBLE] in.

145
00:05:55,730 --> 00:05:56,480
VICTOR COSTAN: OK.

146
00:05:56,480 --> 00:05:58,820
AUDIENCE: And then
outside would also be 0,

147
00:05:58,820 --> 00:06:00,750
and hat would also be 0.

148
00:06:00,750 --> 00:06:01,580
VICTOR COSTAN: OK.

149
00:06:01,580 --> 00:06:04,020
So it turns out you don't
have to go through both lists.

150
00:06:04,020 --> 00:06:06,330
It's sufficient to go
through one of the vectors

151
00:06:06,330 --> 00:06:08,336
and look up the words
in the other vector.

152
00:06:08,336 --> 00:06:10,710
Because if a word doesn't
show up in one of the vectors,

153
00:06:10,710 --> 00:06:12,670
its contribution
is going to be 0.

154
00:06:12,670 --> 00:06:16,670
So my algorithm is go
through each of the elements

155
00:06:16,670 --> 00:06:20,010
here, look up each of the words
there, look up at the word

156
00:06:20,010 --> 00:06:20,780
here.

157
00:06:20,780 --> 00:06:23,500
And if there's a
word here and here,

158
00:06:23,500 --> 00:06:26,130
take out the number of times
it shows up in each document,

159
00:06:26,130 --> 00:06:31,470
multiply them, and
then add everything up.

160
00:06:31,470 --> 00:06:33,440
So this is inner product.
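
[Editor's sketch] The inner product as just described: walk one vector, look each word up in the other, multiply the counts of matching words, and add everything up. The list-of-pairs representation and the nested-loop lookup are assumptions made for illustration, not a copy of the handout.

```python
def inner_product(vector1, vector2):
    """Sum count1 * count2 over the words present in both vectors."""
    total = 0
    for word1, count1 in vector1:
        # Look the word up in the other vector by scanning it.
        for word2, count2 in vector2:
            if word1 == word2:
                total += count1 * count2
    return total


d1 = [["the", 2], ["fox", 1], ["is", 1], ["in", 1], ["hat", 1]]
d2 = [["the", 1], ["fox", 1], ["is", 1], ["outside", 1]]
result = inner_product(d1, d2)  # 2*1 + 1*1 + 1*1 = 4
```

Words that appear in only one document ("in", "hat", "outside") never match, so they contribute 0, which is why scanning one vector is enough.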

161
00:06:33,440 --> 00:06:35,890
Everything else is good if
you're writing a search engine

162
00:06:35,890 --> 00:06:38,020
or if you're using the
scenario application,

163
00:06:38,020 --> 00:06:40,880
but we're not really
concerned with it.

164
00:06:40,880 --> 00:06:42,730
OK, so now we have
the three steps,

165
00:06:42,730 --> 00:06:45,630
read the document,
break it up into words,

166
00:06:45,630 --> 00:06:47,990
compute document vectors,
compute our inner product.

167
00:06:47,990 --> 00:06:49,870
So this is what we want to do.

168
00:06:49,870 --> 00:06:53,650
And document distance 1 does
it in a painfully slow way,

169
00:06:53,650 --> 00:06:57,010
and we're probably not going to
cover everything in recitation.

170
00:06:57,010 --> 00:06:59,460
But if you go all the way
up to document distance 8,

171
00:06:59,460 --> 00:07:00,610
that's really, really fast.

172
00:07:00,610 --> 00:07:03,960
It's 1,000 times faster.

173
00:07:03,960 --> 00:07:06,892
So this is our job for the day.

174
00:07:06,892 --> 00:07:07,850
Let's look at the code.

175
00:07:07,850 --> 00:07:11,440
Did anyone look at
the code beforehand?

176
00:07:11,440 --> 00:07:12,400
Nope.

177
00:07:12,400 --> 00:07:15,130
OK, so when I look at
a big piece of code,

178
00:07:15,130 --> 00:07:19,820
I like to look at
it from top down.

179
00:07:19,820 --> 00:07:22,080
So that means I start
at the main function,

180
00:07:22,080 --> 00:07:24,370
I see who is it calling,
I see what everything

181
00:07:24,370 --> 00:07:27,720
is trying to do, and then
I go into the sub functions

182
00:07:27,720 --> 00:07:30,300
and recurse and basically
do the same thing.

183
00:07:30,300 --> 00:07:32,580
So I build a tree of
who's calling what,

184
00:07:32,580 --> 00:07:34,720
and that helps me figure
out what's going on.

185
00:07:37,224 --> 00:07:38,265
So let's start with main.

186
00:07:44,230 --> 00:07:46,140
And let's look at main.

187
00:07:46,140 --> 00:07:49,150
Lines 1 through 6
look at the arguments.

188
00:07:49,150 --> 00:07:51,040
We don't really care.

189
00:07:51,040 --> 00:07:54,365
Line 7 and 8 call word
frequencies for file.

190
00:08:01,520 --> 00:08:04,770
I am abbreviating liberally.

191
00:08:04,770 --> 00:08:11,950
And then line 9
calls vector angle.

192
00:08:11,950 --> 00:08:17,210
So line 7 and 8 read
the two documents,

193
00:08:17,210 --> 00:08:20,490
do steps one and two, and
then 9 does step three.

194
00:08:31,010 --> 00:08:31,700
OK.

195
00:08:31,700 --> 00:08:33,220
Word frequencies for files.

196
00:08:33,220 --> 00:08:35,679
So the point of this
is to read a file

197
00:08:35,679 --> 00:08:41,100
and to produce a word
document vector out of it.

198
00:08:41,100 --> 00:08:45,370
And it does it in three steps.

199
00:08:45,370 --> 00:08:48,840
Reads the file, line two.

200
00:08:48,840 --> 00:08:52,300
Breaks up the file into
words, so operation one,

201
00:08:52,300 --> 00:08:56,170
this is line 3, and then line
4, it takes up the list of words

202
00:08:56,170 --> 00:08:58,790
and computes a document
vector out of it.

203
00:08:58,790 --> 00:09:01,110
I don't care about
reading files because I'll

204
00:09:01,110 --> 00:09:03,420
assume this is
somehow done for me.

205
00:09:03,420 --> 00:09:05,290
We care about the algorithms.

206
00:09:05,290 --> 00:09:07,050
So as far as I'm
concerned, this function

207
00:09:07,050 --> 00:09:08,700
is calling get words
from line list.

208
00:09:11,810 --> 00:09:16,695
Get words from line list,
and count frequency.

209
00:09:26,100 --> 00:09:29,130
And if we skip all the
way to vector angle--

210
00:09:29,130 --> 00:09:32,440
we already talked a little
bit about how all it does

211
00:09:32,440 --> 00:09:34,580
is it calls inner
product three times

212
00:09:34,580 --> 00:09:37,010
and then it does some
fancy math on it.

213
00:09:44,510 --> 00:09:46,590
So this is how the code
looks like big picture.
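
[Editor's sketch] One way the angle step could look, assuming the usual formula: the arccosine of the inner product of the two vectors divided by the square root of the product of each vector's inner product with itself. That gives the three inner product calls mentioned above. The function bodies and the list-of-pairs representation are illustrative assumptions.

```python
import math


def inner_product(vector1, vector2):
    """Sum count1 * count2 over the words present in both vectors."""
    total = 0
    for word1, count1 in vector1:
        for word2, count2 in vector2:
            if word1 == word2:
                total += count1 * count2
    return total


def vector_angle(vector1, vector2):
    """Angle in radians between two non-empty document vectors."""
    numerator = inner_product(vector1, vector2)
    denominator = math.sqrt(inner_product(vector1, vector1) *
                            inner_product(vector2, vector2))
    # Clamp to 1.0 so rounding error cannot push acos out of its domain.
    return math.acos(min(1.0, numerator / denominator))


d1 = [["the", 2], ["fox", 1], ["is", 1], ["in", 1], ["hat", 1]]
d2 = [["the", 1], ["fox", 1], ["is", 1], ["outside", 1]]
angle = vector_angle(d1, d2)
```

For the two example documents the ratio is 4 / sqrt(8 * 4), so the angle comes out to roughly pi / 4 radians.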

214
00:09:56,370 --> 00:09:58,530
OK, so to figure out the
running time for main,

215
00:09:58,530 --> 00:10:00,780
we need to figure out the
running time for these two

216
00:10:00,780 --> 00:10:03,314
functions and add
them up, right?

217
00:10:03,314 --> 00:10:04,980
To figure out the
running time for this,

218
00:10:04,980 --> 00:10:06,354
we need to figure
out the running

219
00:10:06,354 --> 00:10:09,710
time for these functions and
add them up, so on and so forth.

220
00:10:09,710 --> 00:10:12,960
So as you go through each of
the document distance versions,

221
00:10:12,960 --> 00:10:17,214
you want to keep a scorecard of the
implementation that shows you

222
00:10:17,214 --> 00:10:18,630
what the running
time is, and this

223
00:10:18,630 --> 00:10:20,772
helps you follow
what was improved

224
00:10:20,772 --> 00:10:21,730
in each implementation.

225
00:10:25,830 --> 00:10:29,430
So let's look at get
words from line list.

226
00:10:29,430 --> 00:10:31,940
What does it seem
like it's doing?

227
00:10:31,940 --> 00:10:34,920
Without reading the
get words from string,

228
00:10:34,920 --> 00:10:38,540
can anyone tell me what
it seems like it's doing?

229
00:10:38,540 --> 00:10:40,515
If you just read
lines 1 through 6.

230
00:10:43,896 --> 00:10:46,950
AUDIENCE: [INAUDIBLE]
through the list.

231
00:10:46,950 --> 00:10:47,700
VICTOR COSTAN: OK.

232
00:10:47,700 --> 00:10:49,940
So it's getting an input list.

233
00:10:49,940 --> 00:10:52,400
And if you look at word
frequencies for files

234
00:10:52,400 --> 00:10:56,430
at line 2, it names
a variable line list.

235
00:10:56,430 --> 00:11:00,360
So it seems like
what's happening is,

236
00:11:00,360 --> 00:11:02,590
reads a file into
a list of lines.

237
00:11:02,590 --> 00:11:07,140
And then that list of lines goes
to get words from line lists.

238
00:11:07,140 --> 00:11:12,590
So this is L in get
words from line lists.

239
00:11:12,590 --> 00:11:15,820
So it takes a list of lines
which is the entire document,

240
00:11:15,820 --> 00:11:16,320
and then?

241
00:11:20,900 --> 00:11:23,514
AUDIENCE: Basically it
removes the new lines.

242
00:11:23,514 --> 00:11:30,178
It sticks it into one giant list
rather than a list of lines,

243
00:11:30,178 --> 00:11:31,591
is that right?

244
00:11:31,591 --> 00:11:34,090
VICTOR COSTAN: Almost, so you
see the get words from string.

245
00:11:34,090 --> 00:11:35,798
Maybe we need to go
through the function,

246
00:11:35,798 --> 00:11:40,180
but do you see the get words
from string function name?

247
00:11:40,180 --> 00:11:42,330
So I will assume that
it does something

248
00:11:42,330 --> 00:11:44,510
with each of the words.

249
00:11:44,510 --> 00:11:49,090
And if the overall goal
is to get a list of words,

250
00:11:49,090 --> 00:11:52,200
then I would assume that what
that does is it takes a line

251
00:11:52,200 --> 00:11:54,200
and it breaks it up into words.

252
00:11:54,200 --> 00:11:55,950
Because this way, if
you take up each line

253
00:11:55,950 --> 00:11:57,980
and break it up into
words, then when

254
00:11:57,980 --> 00:11:59,970
we put all the words
together we get the words

255
00:11:59,970 --> 00:12:01,053
that make up the document.
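
[Editor's sketch] The shape of the function being described: break each line into words, then put all the words together. The `get_words_from_string` below is a simplified stand-in (an assumption, using a regular expression) so the example runs on its own, and the use of `extend` to combine the lists is also an assumption; the handout may combine them differently.

```python
import re


def get_words_from_string(line):
    # Stand-in: lowercase the line and keep runs of ASCII letters and digits.
    return re.findall(r"[a-z0-9]+", line.lower())


def get_words_from_line_list(lines):
    """Return all the words in a list of lines, in document order."""
    word_list = []
    for line in lines:
        word_list.extend(get_words_from_string(line))
    return word_list


lines = ["The fox is\n", "in the hat.\n"]
words = get_words_from_line_list(lines)
```

The result for these two lines is the six lowercase words of document 1, with the newline and the period gone.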

256
00:12:03,430 --> 00:12:04,210
Do people follow?

257
00:12:04,210 --> 00:12:04,870
Any questions?

258
00:12:04,870 --> 00:12:07,369
I like that people are nodding,
by the way, keep doing that.

259
00:12:07,369 --> 00:12:09,454
That helps me go
at the right speed.

260
00:12:09,454 --> 00:12:11,870
If you're not nodding, I'll
keep explaining the same thing

261
00:12:11,870 --> 00:12:12,710
over and over again.

262
00:12:18,870 --> 00:12:20,280
OK, so get words from string.

263
00:12:20,280 --> 00:12:24,590
Get words from string takes up
a single line, that's a string,

264
00:12:24,590 --> 00:12:27,060
and produces a list of words.

265
00:12:27,060 --> 00:12:29,920
And we saw in the
example there that it

266
00:12:29,920 --> 00:12:33,230
has to take care of a few
details such as making

267
00:12:33,230 --> 00:12:35,520
all the letters
lowercase and ignoring

268
00:12:35,520 --> 00:12:38,870
punctuation and skipping spaces.

269
00:12:38,870 --> 00:12:42,210
So let's look at this code and
figure out its running time.

270
00:12:42,210 --> 00:12:44,030
And the way we're going
to do that is we're

271
00:12:44,030 --> 00:12:46,840
going to look at
each line, and we're

272
00:12:46,840 --> 00:12:49,960
going to see what's
the cost for that line

273
00:12:49,960 --> 00:12:51,732
and how many times does it run.

274
00:12:51,732 --> 00:12:53,190
And once we have
those two numbers,

275
00:12:53,190 --> 00:12:56,700
we multiply them together
and we see how much time

276
00:12:56,700 --> 00:13:01,270
does the program spend
on that line in total.

277
00:13:01,270 --> 00:13:04,030
So I'm going to write
down line numbers here.

278
00:13:04,030 --> 00:13:09,110
9, 10, 11, 12, 13, 14, 15.

279
00:13:09,110 --> 00:13:10,118
All the way to 23.

280
00:13:18,180 --> 00:13:18,680
Too low.

281
00:13:26,930 --> 00:13:29,960
20, 21, 22, 23.

282
00:13:33,910 --> 00:13:35,670
OK, so let's start
with something easy,

283
00:13:35,670 --> 00:13:40,650
lines 9 and 10 How many
times are they run?

284
00:13:40,650 --> 00:13:41,730
AUDIENCE: Once.

285
00:13:41,730 --> 00:13:42,966
VICTOR COSTAN: OK.

286
00:13:42,966 --> 00:13:47,739
AUDIENCE: [INAUDIBLE]
Once in this method?

287
00:13:47,739 --> 00:13:48,530
VICTOR COSTAN: Yep.

288
00:13:48,530 --> 00:13:51,330
So I'm only looking
at this method.

289
00:13:51,330 --> 00:13:57,500
So assuming that the method
gets one line, and the line has,

290
00:13:57,500 --> 00:14:07,380
I don't know, say, one
line, N characters,

291
00:14:07,380 --> 00:14:08,830
and we need another
variable which

292
00:14:08,830 --> 00:14:11,140
we're going to figure out later.

293
00:14:11,140 --> 00:14:14,040
But for now, one
line, N characters.

294
00:14:14,040 --> 00:14:17,181
So how many times
does line 9 run?

295
00:14:17,181 --> 00:14:19,950
AUDIENCE: [INAUDIBLE]

296
00:14:19,950 --> 00:14:21,450
VICTOR COSTAN: OK.

297
00:14:21,450 --> 00:14:23,660
Runs once.

298
00:14:23,660 --> 00:14:25,528
How about line 10?

299
00:14:25,528 --> 00:14:26,355
AUDIENCE: Once.

300
00:14:26,355 --> 00:14:26,980
AUDIENCE: Once.

301
00:14:26,980 --> 00:14:27,950
VICTOR COSTAN: OK.

302
00:14:27,950 --> 00:14:29,240
What do they do?

303
00:14:29,240 --> 00:14:32,360
Create new lists and
assign them to variables.

304
00:14:32,360 --> 00:14:35,121
What's the cost for that?

305
00:14:35,121 --> 00:14:36,462
AUDIENCE: Constant [INAUDIBLE]

306
00:14:36,462 --> 00:14:38,420
VICTOR COSTAN:
Constant, excellent.

307
00:14:38,420 --> 00:14:40,140
So I'll be skipping
the order of so

308
00:14:40,140 --> 00:14:43,030
that I don't have to
write it 23 times.

309
00:14:43,030 --> 00:14:45,290
So 1, 1.

310
00:14:45,290 --> 00:14:48,730
OK, line 11.

311
00:14:48,730 --> 00:14:51,600
It iterates over all
the characters in a line.

312
00:14:51,600 --> 00:14:54,750
So how many times
is it going to run?

313
00:14:54,750 --> 00:14:56,250
AUDIENCE: Like, the line?

314
00:14:56,250 --> 00:14:57,526
VICTOR COSTAN: OK, which is?

315
00:14:57,526 --> 00:14:59,270
AUDIENCE: Line, N characters.

316
00:14:59,270 --> 00:15:01,510
VICTOR COSTAN: Awesome.

317
00:15:01,510 --> 00:15:07,290
And just the fact of
iterating takes constant time.

318
00:15:07,290 --> 00:15:09,220
I'm not sure we covered that.

319
00:15:09,220 --> 00:15:14,520
So for each character, test if
it's an alphanumeric character.

320
00:15:14,520 --> 00:15:17,430
Does anyone know what
alphanumeric means?

321
00:15:17,430 --> 00:15:19,050
AUDIENCE: It's a
letter and a number.

322
00:15:19,050 --> 00:15:21,300
VICTOR COSTAN: OK, so fancy
word for letter or number,

323
00:15:21,300 --> 00:15:23,970
A through Z, 0 through 9.

324
00:15:23,970 --> 00:15:25,920
So how much time
does it take to test

325
00:15:25,920 --> 00:15:29,200
if a character is alphanumeric?

326
00:15:29,200 --> 00:15:30,602
Guesses?

327
00:15:30,602 --> 00:15:31,940
AUDIENCE: Constant.

328
00:15:31,940 --> 00:15:34,300
VICTOR COSTAN: OK,
so constant time.

329
00:15:34,300 --> 00:15:37,160
You compare it to the
range A, Z and 0, 9.

330
00:15:37,160 --> 00:15:39,109
How many times am I doing it?

331
00:15:39,109 --> 00:15:39,984
AUDIENCE: [INAUDIBLE]

332
00:15:39,984 --> 00:15:42,022
AUDIENCE: [INAUDIBLE]

333
00:15:42,022 --> 00:15:43,480
VICTOR COSTAN:
Thank you guys, this

334
00:15:43,480 --> 00:15:45,650
is going much faster
than the last recitation.

335
00:15:45,650 --> 00:15:48,800
You guys are active, I like it.

336
00:15:48,800 --> 00:15:50,030
So, now for line 13.

337
00:15:50,030 --> 00:15:52,550
That only gets executed
when the character

338
00:15:52,550 --> 00:15:54,362
is an alphanumeric character.

339
00:15:54,362 --> 00:15:56,320
So we're going to have
to make some assumptions

340
00:15:56,320 --> 00:15:57,530
about the document.

341
00:15:57,530 --> 00:16:00,400
And to make my
life easier, we're

342
00:16:00,400 --> 00:16:02,380
going to make the
following assumption.

343
00:16:02,380 --> 00:16:05,576
If this is a natural
language like, say, English,

344
00:16:05,576 --> 00:16:07,450
words are going to be
a constant size, right?

345
00:16:07,450 --> 00:16:10,680
How many 500-character
words do you see in English?

346
00:16:10,680 --> 00:16:14,910
So let's say 5 to 10
characters per word.

347
00:16:14,910 --> 00:16:17,170
And since the
difference is so small,

348
00:16:17,170 --> 00:16:19,960
I'm going to say all the
words have the same size

349
00:16:19,960 --> 00:16:22,040
W. And if you want
to be more formal,

350
00:16:22,040 --> 00:16:24,800
you can replace word
length with average length,

351
00:16:24,800 --> 00:16:26,810
and the math works out.

352
00:16:26,810 --> 00:16:31,330
So each line has
a number of words,

353
00:16:31,330 --> 00:16:34,290
and the words are separated
by exactly one space,

354
00:16:34,290 --> 00:16:37,130
and the word has W characters.

355
00:16:37,130 --> 00:16:40,942
So how many words do
I have, by the way?

356
00:16:40,942 --> 00:16:44,330
AUDIENCE: N divided by W.

357
00:16:44,330 --> 00:16:45,640
VICTOR COSTAN: OK, good.

358
00:16:45,640 --> 00:16:47,740
Someone's paying
close attention.

359
00:16:47,740 --> 00:16:50,960
N divided by W plus 1.

360
00:16:50,960 --> 00:16:53,820
And the reason
that is, is a line

361
00:16:53,820 --> 00:16:56,340
would look like this, word,
space, word, space, word,

362
00:16:56,340 --> 00:16:56,840
space.

363
00:16:56,840 --> 00:16:59,850
So W, characters, one space,
W, characters, one space, W,

364
00:16:59,850 --> 00:17:01,110
characters, one space.

365
00:17:01,110 --> 00:17:04,294
That's why you have
W plus 1 there.

366
00:17:04,294 --> 00:17:05,960
When we look at
asymptotics it turns out

367
00:17:05,960 --> 00:17:08,910
that it doesn't really matter
because W's a constant,

368
00:17:08,910 --> 00:17:13,800
W plus 1 is a constant,
so order N words.

369
00:17:13,800 --> 00:17:17,930
But for now, let's keep track of
W's to seem a bit more formal.

370
00:17:17,930 --> 00:17:19,680
So line 13.

371
00:17:19,680 --> 00:17:21,180
How many times is
it going to run?

372
00:17:29,374 --> 00:17:31,629
AUDIENCE: W times
N over W plus one.

373
00:17:31,629 --> 00:17:32,670
VICTOR COSTAN: Excellent.
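
[Editor's sketch] A quick check of the count just given, under the same simplifying assumptions: every word has exactly W characters and is followed by exactly one space. The helper name `count_alnum_appends` and the values of W and the word count are made up for illustration.

```python
def count_alnum_appends(line):
    """Count the characters in `line` that pass the alphanumeric test."""
    return sum(1 for c in line if c.isalnum())


W = 5                               # assumed uniform word length
num_words = 4
line = ("a" * W + " ") * num_words  # word, space, word, space, ...
N = len(line)                       # 4 * (5 + 1) = 24 characters
appends = count_alnum_appends(line)
```

With these numbers the line has N / (W + 1) = 4 words, and the append on line 13 would run W * N / (W + 1) = 20 times.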

374
00:17:38,345 --> 00:17:39,220
Let me pull them out.

375
00:17:43,880 --> 00:17:46,160
How much time does it
take to run [INAUDIBLE].

376
00:17:49,872 --> 00:17:50,800
AUDIENCE: Constant?

377
00:17:50,800 --> 00:17:52,341
VICTOR COSTAN:
Constant time, append,

378
00:17:52,341 --> 00:17:54,240
covered in lecture,
constant time.

379
00:17:54,240 --> 00:17:57,680
So this is a bit tricky
because if you have an array

380
00:17:57,680 --> 00:18:00,680
implementation that's naive,
it's not constant time.

381
00:18:00,680 --> 00:18:03,320
But Python does some magic
called table doubling, which

382
00:18:03,320 --> 00:18:05,190
we'll cover later in the course.

383
00:18:05,190 --> 00:18:11,470
And this is why you can say
that append takes constant time.
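
[Editor's sketch] A toy version of the table doubling idea, only to show why appends average out to constant time. The class `DoublingArray` is hypothetical, and CPython's real list growth policy differs in its details; the point is just that copying work stays proportional to the number of appends.

```python
class DoublingArray:
    """Fixed-capacity storage that doubles when it fills up."""

    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.slots = [None]
        self.copies = 0  # how many element copies resizing has cost

    def append(self, item):
        if self.size == self.capacity:
            # Full: allocate twice the space and copy everything over.
            self.capacity *= 2
            new_slots = [None] * self.capacity
            for i in range(self.size):
                new_slots[i] = self.slots[i]
                self.copies += 1
            self.slots = new_slots
        self.slots[self.size] = item
        self.size += 1


n = 1000
array = DoublingArray()
for i in range(n):
    array.append(i)
```

After n appends the total number of copies stays below 2n, so the average cost per append is bounded by a constant even though an individual append occasionally copies the whole array.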

384
00:18:11,470 --> 00:18:12,230
OK.

385
00:18:12,230 --> 00:18:16,560
Else, so if the character
is not alphanumeric,

386
00:18:16,560 --> 00:18:20,050
then what's going on here?

387
00:18:20,050 --> 00:18:23,428
Can anyone see what's
happening there?

388
00:18:23,428 --> 00:18:26,410
AUDIENCE: If it's
like, [INAUDIBLE].

389
00:18:26,410 --> 00:18:29,278
VICTOR COSTAN: OK, so
let's say if it's a space.

390
00:18:29,278 --> 00:18:30,153
AUDIENCE: [INAUDIBLE]

391
00:18:33,450 --> 00:18:35,620
VICTOR COSTAN: Yeah,
this is the harder part.

392
00:18:35,620 --> 00:18:37,490
I think you need to
run this on an example

393
00:18:37,490 --> 00:18:39,950
to figure out what's going on.

394
00:18:39,950 --> 00:18:42,310
I have to run it on
an example in my head.

395
00:18:42,310 --> 00:18:46,610
So let's take this small example
here, the fox is outside.

396
00:18:46,610 --> 00:18:48,610
And this is a
single line, right?

397
00:18:48,610 --> 00:18:49,390
Nice and handy.

398
00:18:49,390 --> 00:18:52,180
So this can be the input
for get words from string.

399
00:18:52,180 --> 00:18:54,090
And let's see what happens.

400
00:18:54,090 --> 00:19:01,600
So first I start with word list
which is empty list, character,

401
00:19:01,600 --> 00:19:08,710
list, empty list.

402
00:19:08,710 --> 00:19:11,330
Take the first character,
it's alphanumeric,

403
00:19:11,330 --> 00:19:14,770
gets appended here, the second
character, alphanumeric,

404
00:19:14,770 --> 00:19:17,140
appended here, third
character, alphanumeric,

405
00:19:17,140 --> 00:19:19,010
gets appended here.

406
00:19:19,010 --> 00:19:20,980
Fourth character,
not alphanumeric,

407
00:19:20,980 --> 00:19:25,910
so I get to run
lines 15 through 18.

408
00:19:25,910 --> 00:19:26,920
OK, I did the easy part.

409
00:19:26,920 --> 00:19:28,700
Someone walk me
through the hard part.

410
00:19:28,700 --> 00:19:33,600
What happens in
lines 15 through 18?

411
00:19:33,600 --> 00:19:34,858
Yes.

412
00:19:34,858 --> 00:19:38,316
AUDIENCE: First, it takes
that list and joins it

413
00:19:38,316 --> 00:19:40,790
into a string. [INAUDIBLE]

414
00:19:40,790 --> 00:19:43,550
VICTOR COSTAN: OK, so this
is a list of characters.

415
00:19:43,550 --> 00:19:47,840
And join takes the list and
makes a string out of it.

416
00:19:47,840 --> 00:19:50,840
So I'll have the string the.

417
00:19:50,840 --> 00:19:53,390
OK, excellent.

418
00:19:53,390 --> 00:19:55,570
AUDIENCE: And it converts
it all to lower case.

419
00:19:55,570 --> 00:19:56,320
VICTOR COSTAN: OK.

420
00:20:00,208 --> 00:20:03,620
AUDIENCE: End up [INAUDIBLE]
that to the word list.

421
00:20:03,620 --> 00:20:05,620
VICTOR COSTAN: The word
list is up here, right?

422
00:20:05,620 --> 00:20:10,413
So this is going to have the.

423
00:20:10,413 --> 00:20:13,836
AUDIENCE: And then it clears
the character list, [INAUDIBLE].

424
00:20:18,855 --> 00:20:19,605
VICTOR COSTAN: OK.

425
00:20:23,670 --> 00:20:31,680
So now as I go through the
next word, I have F-O-X.

426
00:20:31,680 --> 00:20:34,000
Then this becomes the word,
and it gets added here.

427
00:20:40,680 --> 00:20:42,470
So on and so forth
for everything.

428
00:20:42,470 --> 00:20:46,840
Do people see how
this method works now?

429
00:20:46,840 --> 00:20:50,360
I'm not getting that
many nods, so questions.

430
00:20:50,360 --> 00:20:52,440
If I don't get nods,
I'll stop and you guys

431
00:20:52,440 --> 00:20:54,530
have to ask what
you're confused about.

432
00:20:54,530 --> 00:20:57,230
AUDIENCE: I think it's a
little tricky because instead

433
00:20:57,230 --> 00:21:00,052
of saying if it's not an
alphanumeric character,

434
00:21:00,052 --> 00:21:02,738
it's just like well, if
the length of the list

435
00:21:02,738 --> 00:21:04,792
is greater than 0, which
threw me off initially,

436
00:21:04,792 --> 00:21:07,610
but then I realized it
was just, like, omission.

437
00:21:07,610 --> 00:21:09,360
VICTOR COSTAN: OK, so
why does it do this?

438
00:21:09,360 --> 00:21:12,316
What is the point of the
length of the character list?

439
00:21:12,316 --> 00:21:15,600
AUDIENCE: So that
there are two spaces.

440
00:21:15,600 --> 00:21:18,080
VICTOR COSTAN: Excellent.

441
00:21:18,080 --> 00:21:23,270
So here I was nice and I had
one space, one space, one space.

442
00:21:23,270 --> 00:21:26,410
But if I'm sloppy when I'm
typing and I have two spaces

443
00:21:26,410 --> 00:21:31,950
here, then suppose this is
space, space-- kind of small,

444
00:21:31,950 --> 00:21:33,060
but pretend.

445
00:21:33,060 --> 00:21:34,550
Go with me here.

446
00:21:34,550 --> 00:21:37,020
So we got here.

447
00:21:37,020 --> 00:21:38,470
We got "the", "fox", "is".

448
00:21:41,720 --> 00:21:45,620
And then this list is
empty because line 18 just

449
00:21:45,620 --> 00:21:48,180
made it empty.

450
00:21:48,180 --> 00:21:50,520
If I run the code on
lines 15 through 18,

451
00:21:50,520 --> 00:21:53,930
it's going to add an
empty word up here.

452
00:21:53,930 --> 00:21:57,280
And empty words
aren't very useful.

453
00:21:57,280 --> 00:21:59,360
You'll see how many
times the documents have

454
00:21:59,360 --> 00:22:01,661
too many spaces in them, so
that doesn't really help.

455
00:22:01,661 --> 00:22:03,411
AUDIENCE: I mean, isn't
that not an issue,

456
00:22:03,411 --> 00:22:07,470
because you call if c.isalnum()
before you actually

457
00:22:07,470 --> 00:22:09,350
get to that.

458
00:22:09,350 --> 00:22:12,400
So you'd run through it
again, but you would still

459
00:22:12,400 --> 00:22:14,950
just skip over that.

460
00:22:14,950 --> 00:22:18,030
That would fail, I
mean it would not

461
00:22:18,030 --> 00:22:19,280
do anything for that equation.

462
00:22:19,280 --> 00:22:21,350
VICTOR COSTAN: So first space.

463
00:22:21,350 --> 00:22:22,750
c.isalnum() fails.

464
00:22:22,750 --> 00:22:24,717
I run lines 15 through 18.

465
00:22:24,717 --> 00:22:25,300
AUDIENCE: Yep.

466
00:22:25,300 --> 00:22:26,175
VICTOR COSTAN: Right?

467
00:22:26,175 --> 00:22:27,170
I have is here.

468
00:22:27,170 --> 00:22:29,095
This becomes empty.

469
00:22:29,095 --> 00:22:29,970
AUDIENCE: Yep.

470
00:22:29,970 --> 00:22:33,557
VICTOR COSTAN: Second space,
c.isalnum() fails again.

471
00:22:33,557 --> 00:22:34,140
AUDIENCE: Yep.

472
00:22:34,140 --> 00:22:36,470
VICTOR COSTAN: And if I
wouldn't have the length check,

473
00:22:36,470 --> 00:22:40,080
it would run lines
15 through 18 again.

474
00:22:40,080 --> 00:22:40,961
AUDIENCE: Oh, OK.

475
00:22:40,961 --> 00:22:43,910
[INAUDIBLE]

476
00:22:43,910 --> 00:22:46,950
VICTOR COSTAN: OK, so this is
what it's trying to prevent.

477
00:22:46,950 --> 00:22:49,284
So you can see that this code
looks complicated, right?

478
00:22:49,284 --> 00:22:51,450
It's trying to do a lot of
things, it's complicated,

479
00:22:51,450 --> 00:22:53,810
it's hard to analyze.

480
00:22:53,810 --> 00:22:55,100
Oh, well, let's go with it.

481
00:22:55,100 --> 00:22:59,590
Let's try to finish
it up quickly.

482
00:22:59,590 --> 00:23:02,570
So now that we know
what it does, let's

483
00:23:02,570 --> 00:23:04,960
try to figure out how
many times each line runs

484
00:23:04,960 --> 00:23:06,650
and what's the cost?

485
00:23:06,650 --> 00:23:07,980
Yes.

486
00:23:07,980 --> 00:23:12,830
AUDIENCE: So I think the
total cost is N times 1

487
00:23:12,830 --> 00:23:17,035
minus W over W plus 1.

488
00:23:17,035 --> 00:23:18,243
VICTOR COSTAN: Wait, so here?

489
00:23:18,243 --> 00:23:19,170
AUDIENCE: Yeah.

490
00:23:19,170 --> 00:23:25,031
VICTOR COSTAN: OK, so you're
saying N times 1 minus.

491
00:23:25,031 --> 00:23:25,531
OK.

492
00:23:28,810 --> 00:23:31,790
Why do you say that?

493
00:23:31,790 --> 00:23:33,183
I like it, but why?

494
00:23:33,183 --> 00:23:36,017
AUDIENCE: OK, it's because it's everything
that is in the character,

495
00:23:36,017 --> 00:23:38,471
and the line above
it was characters--

496
00:23:38,471 --> 00:23:39,220
VICTOR COSTAN: OK.

497
00:23:39,220 --> 00:23:40,220
AUDIENCE: --all
alphanumeric, [INAUDIBLE]

498
00:23:40,220 --> 00:23:41,970
VICTOR COSTAN: So
basically spaces, right?

499
00:23:41,970 --> 00:23:44,160
If we have word, space,
word, space, word, space,

500
00:23:44,160 --> 00:23:46,400
this happens for all the spaces.

501
00:23:46,400 --> 00:23:47,500
Cool.

502
00:23:47,500 --> 00:23:48,967
So this is good.

503
00:23:48,967 --> 00:23:50,425
I'm going to make
it a bit simpler.

504
00:23:54,120 --> 00:23:56,975
Same thing, it's just that it's
slightly less intimidating.

505
00:23:56,975 --> 00:23:59,040
AUDIENCE: Oh, yeah.

506
00:23:59,040 --> 00:24:00,650
VICTOR COSTAN: Cool, thank you.

507
00:24:00,650 --> 00:24:03,000
Very brave, come up first.

508
00:24:03,000 --> 00:24:05,050
What's the running
time for line 14?

509
00:24:05,050 --> 00:24:06,995
So, cost for running it once.

510
00:24:10,320 --> 00:24:12,110
AUDIENCE: Constant.

511
00:24:12,110 --> 00:24:12,610
VICTOR COSTAN: Excellent.

512
00:24:12,610 --> 00:24:14,850
VICTOR COSTAN: I like you guys.

513
00:24:14,850 --> 00:24:15,780
Nice.

514
00:24:15,780 --> 00:24:19,020
Line 15, how much
time does it take

515
00:24:19,020 --> 00:24:21,880
to take characters and
put them into a list?

516
00:24:21,880 --> 00:24:24,110
AUDIENCE: N?

517
00:24:24,110 --> 00:24:24,860
VICTOR COSTAN: N--

518
00:24:24,860 --> 00:24:25,120
AUDIENCE: [INAUDIBLE]

519
00:24:25,120 --> 00:24:27,340
VICTOR COSTAN: --where N is
the size of the list, right?

520
00:24:27,340 --> 00:24:27,800
AUDIENCE: Yeah.

521
00:24:27,800 --> 00:24:28,550
VICTOR COSTAN: OK.

522
00:24:28,550 --> 00:24:30,814
So what's the size
of the list now?

523
00:24:30,814 --> 00:24:33,154
AUDIENCE: [INAUDIBLE]

524
00:24:33,154 --> 00:24:34,090
AUDIENCE: [INAUDIBLE]

525
00:24:34,090 --> 00:24:36,430
VICTOR COSTAN: Yep.

526
00:24:36,430 --> 00:24:39,342
OK, so when you're using
more than one letter,

527
00:24:39,342 --> 00:24:41,550
the problem is you have to
pay attention to which one

528
00:24:41,550 --> 00:24:42,091
you're using.

529
00:24:42,091 --> 00:24:44,080
Because when we
teach algorithms,

530
00:24:44,080 --> 00:24:46,910
we say oh, this is N, this is
N squared, so on and so forth.

531
00:24:46,910 --> 00:24:49,010
You have to replace it
with the right letter.

532
00:24:49,010 --> 00:24:51,415
And I get confused about
this all the time, so--

533
00:24:51,415 --> 00:24:51,810
AUDIENCE: [INAUDIBLE]

534
00:24:51,810 --> 00:24:52,630
VICTOR COSTAN: --a
serious problem.

535
00:24:52,630 --> 00:24:53,614
AUDIENCE: --columns?

536
00:24:53,614 --> 00:24:56,080
What are the two columns?

537
00:24:56,080 --> 00:24:59,270
VICTOR COSTAN: So this is the
cost of running a line once,

538
00:24:59,270 --> 00:25:01,296
and this is how
many times it's run.

539
00:25:01,296 --> 00:25:02,130
AUDIENCE: Oh, OK.

540
00:25:02,130 --> 00:25:02,930
VICTOR COSTAN: Thanks
for the question.

541
00:25:02,930 --> 00:25:04,260
I should have said
that in the beginning.

542
00:25:04,260 --> 00:25:04,760
Thank you.

543
00:25:07,480 --> 00:25:09,230
OK, let's make this
a little bit faster

544
00:25:09,230 --> 00:25:12,490
and notice that
lines 15 through 18

545
00:25:12,490 --> 00:25:14,789
all run the same
number of times, right?

546
00:25:14,789 --> 00:25:16,580
They're in the if, and
there's nothing else

547
00:25:16,580 --> 00:25:19,670
that changes the
control flow there.

548
00:25:19,670 --> 00:25:28,320
So lines 15 through 18 are
O of N divided by W plus 1.

549
00:25:28,320 --> 00:25:29,740
All right, line 16.

550
00:25:29,740 --> 00:25:30,930
Take a word.

551
00:25:30,930 --> 00:25:33,600
So take a string and
make another string

552
00:25:33,600 --> 00:25:37,070
where each character is
the lowercase version.

553
00:25:37,070 --> 00:25:38,360
AUDIENCE: [INAUDIBLE]

554
00:25:38,360 --> 00:25:39,610
VICTOR COSTAN: OK, cool.

555
00:25:39,610 --> 00:25:41,643
Why W, intuitively?

556
00:25:41,643 --> 00:25:44,884
AUDIENCE: Because [INAUDIBLE]
has to check to make sure

557
00:25:44,884 --> 00:25:46,929
[INAUDIBLE]

558
00:25:46,929 --> 00:25:47,720
VICTOR COSTAN: Yep.

559
00:25:47,720 --> 00:25:48,750
AUDIENCE: [INAUDIBLE]

560
00:25:48,750 --> 00:25:50,999
VICTOR COSTAN: Yeah, so if
you have a 10,000 character

561
00:25:50,999 --> 00:25:53,190
string, you have to go
through 10,000 characters.

562
00:25:53,190 --> 00:25:55,210
Very good.

563
00:25:55,210 --> 00:25:58,032
Append, line 17.

564
00:25:58,032 --> 00:26:00,512
AUDIENCE: [INAUDIBLE]

565
00:26:00,512 --> 00:26:02,000
VICTOR COSTAN: Sweet.

566
00:26:02,000 --> 00:26:06,565
And line 18, we set the
character list to the empty list.

567
00:26:06,565 --> 00:26:07,531
AUDIENCE: [INAUDIBLE]

568
00:26:07,531 --> 00:26:13,170
VICTOR COSTAN: [INAUDIBLE]
OK, how many times

569
00:26:13,170 --> 00:26:18,505
do lines 19 through 23 run?

570
00:26:18,505 --> 00:26:19,400
AUDIENCE: Once.

571
00:26:19,400 --> 00:26:20,608
VICTOR COSTAN: At most, once.

572
00:26:24,540 --> 00:26:26,727
AUDIENCE: [INAUDIBLE]

573
00:26:26,727 --> 00:26:29,310
VICTOR COSTAN: Can anyone figure
out what's the point of them?

574
00:26:33,099 --> 00:26:36,649
AUDIENCE: Catch any
trailing [INAUDIBLE]

575
00:26:36,649 --> 00:26:37,482
VICTOR COSTAN: Good.

576
00:26:37,482 --> 00:26:40,404
If you ended on the
last letter of a word,

577
00:26:40,404 --> 00:26:42,360
you want to make sure
you catch that word.

578
00:26:42,360 --> 00:26:42,870
VICTOR COSTAN: All right.

579
00:26:42,870 --> 00:26:43,820
AUDIENCE: [INAUDIBLE]

580
00:26:43,820 --> 00:26:44,861
VICTOR COSTAN: Very good.

581
00:26:44,861 --> 00:26:46,030
So I find it here.

582
00:26:46,030 --> 00:26:48,980
Then after I'm done
with the loop at line 19

583
00:26:48,980 --> 00:26:52,895
the word list
would have "the", "fox", "is".

584
00:26:52,895 --> 00:26:54,270
And then the
character list would

585
00:26:54,270 --> 00:26:56,150
have the characters for "outside".

586
00:26:56,150 --> 00:26:58,970
If I return the word list,
woops, I just missed a word.

587
00:26:58,970 --> 00:27:07,430
So lines 20 through 22 are a
copy of lines 15 through 17,

588
00:27:07,430 --> 00:27:11,110
and they take care
of the last word.
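
The function walked through above can be sketched roughly as follows. This is a reconstruction from the discussion, not a copy of the code on the board: the identifier names and the board line numbers in the comments are inferred from what is said in the recitation and may not match the handout exactly.

```python
def get_words_from_string(line):
    """Split one line into lowercase alphanumeric words (sketch)."""
    word_list = []
    character_list = []
    for c in line:
        if c.isalnum():
            character_list.append(c)
        elif len(character_list) > 0:        # board line 14: skip repeated separators
            word = "".join(character_list)   # 15: cost proportional to word length
            word = word.lower()              # 16: cost proportional to word length
            word_list.append(word)           # 17: constant
            character_list = []              # 18: constant
    if len(character_list) > 0:              # 19: catch a trailing word
        word = "".join(character_list)       # 20
        word = word.lower()                  # 21
        word_list.append(word)               # 22
    return word_list                         # 23


# Two spaces between "fox" and "is" do not produce an empty word.
print(get_words_from_string("The fox  is outside"))
# → ['the', 'fox', 'is', 'outside']
```

Without the length check on the `elif`, the second space would append an empty string to the word list; without the final `if`, "outside" would be dropped.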

589
00:27:11,110 --> 00:27:14,720
So line 19 is an if, and it
takes the length of a list

590
00:27:14,720 --> 00:27:16,250
and compares it to a number.

591
00:27:16,250 --> 00:27:19,344
What's the cost of that?

592
00:27:19,344 --> 00:27:20,200
AUDIENCE: Constant.

593
00:27:20,200 --> 00:27:21,730
VICTOR COSTAN: OK, very good.

594
00:27:21,730 --> 00:27:24,980
Checking list length in
Python is constant time.

595
00:27:24,980 --> 00:27:27,340
We did that in lecture.

596
00:27:27,340 --> 00:27:29,490
How about lines 20 through 22?

597
00:27:32,262 --> 00:27:33,720
I just gave it
away, guys, come on.

598
00:27:33,720 --> 00:27:34,530
Someone--

599
00:27:34,530 --> 00:27:36,380
AUDIENCE: The same
as 15 through 17.

600
00:27:36,380 --> 00:27:38,890
VICTOR COSTAN: OK,
same as 15 through 17.

601
00:27:38,890 --> 00:27:41,460
W, W, 1.

602
00:27:41,460 --> 00:27:46,480
Line 23, return constant time.

603
00:27:46,480 --> 00:27:50,480
OK, so now we know how much
it takes to run a line once,

604
00:27:50,480 --> 00:27:52,640
how many times each line runs.

605
00:27:52,640 --> 00:27:55,360
So we're going to do a
dot product of these guys.

606
00:27:55,360 --> 00:27:57,920
See, dot products are useful.

607
00:27:57,920 --> 00:28:00,520
And if we do a dot
product of these guys,

608
00:28:00,520 --> 00:28:03,180
we're going to get the total
running time for the function.

609
00:28:03,180 --> 00:28:05,105
So let's compute
the partial terms.

610
00:28:05,105 --> 00:28:06,350
AUDIENCE: [INAUDIBLE]

611
00:28:06,350 --> 00:28:07,310
VICTOR COSTAN: I'm not
going to write them down.

612
00:28:07,310 --> 00:28:09,730
Let's just go through them
and figure out what they are.

613
00:28:09,730 --> 00:28:13,494
So you guys say them.

614
00:28:13,494 --> 00:28:17,964
AUDIENCE: 1, 1, N,
N, weird equation--

615
00:28:17,964 --> 00:28:19,380
VICTOR COSTAN: OK,
weird equation,

616
00:28:19,380 --> 00:28:21,177
what was the important part?

617
00:28:21,177 --> 00:28:22,010
[INTERPOSING VOICES]

618
00:28:22,010 --> 00:28:23,550
VICTOR COSTAN: Yeah,
the important part.

619
00:28:23,550 --> 00:28:24,841
The important part is N, right?

620
00:28:24,841 --> 00:28:27,784
This is some constant
times N, so N.

621
00:28:27,784 --> 00:28:35,670
AUDIENCE: N, N,
N, N, N, N, 1, 1.

622
00:28:35,670 --> 00:28:37,406
VICTOR COSTAN: Pay attention.

623
00:28:37,406 --> 00:28:39,280
AUDIENCE: 1, N.

624
00:28:39,280 --> 00:28:41,100
VICTOR COSTAN: Pay attention.

625
00:28:41,100 --> 00:28:42,495
It's not N, it's not 1.

626
00:28:42,495 --> 00:28:43,370
AUDIENCE: [INAUDIBLE]

627
00:28:43,370 --> 00:28:45,130
VICTOR COSTAN: OK,
actually it is 1, I guess,

628
00:28:45,130 --> 00:28:46,740
if you think that
W is a constant.

629
00:28:46,740 --> 00:28:47,364
Sorry.

630
00:28:47,364 --> 00:28:48,530
AUDIENCE: You're testing us.

631
00:28:48,530 --> 00:28:49,800
VICTOR COSTAN: OK.

632
00:28:49,800 --> 00:28:52,331
1, 1.

633
00:28:52,331 --> 00:28:54,580
VICTOR COSTAN: So I heard
two numbers, N and 1, right?

634
00:28:54,580 --> 00:28:59,770
So this is O of N plus
1, which is order N,

635
00:28:59,770 --> 00:29:04,370
because as N goes to infinity,
1 becomes really tiny.
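
Written out, the dot product of the two columns looks roughly like this. It is a sketch under the simplification used in the recitation: a line has N characters and every word is W characters long followed by one separator, so the body of the if runs about N / (W + 1) times.

```latex
\begin{aligned}
T(N) &= \underbrace{N \cdot O(1)}_{\text{loop, isalnum, append}}
      + \underbrace{\frac{N}{W+1}\,\bigl(O(W) + O(W) + O(1) + O(1)\bigr)}_{\text{lines 15--18}}
      + \underbrace{O(W) + O(1)}_{\text{lines 19--23, at most once}} \\
     &= O(N) + O\!\left(\frac{N\,W}{W+1}\right) + O(W) + O(1) \\
     &= O(N), \qquad \text{since } \frac{W}{W+1} < 1 \text{ and } W \le N .
\end{aligned}
```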

636
00:29:04,370 --> 00:29:07,660
OK, so this is how you
analyze a function.

637
00:29:07,660 --> 00:29:10,700
Big functions are horribly
painful to analyze because you

638
00:29:10,700 --> 00:29:14,760
have to look at each line and
do this kind of reasoning.

639
00:29:14,760 --> 00:29:16,640
And it's not even a top
level function here,

640
00:29:16,640 --> 00:29:19,340
so I don't even get to
write anything here yet.

641
00:29:19,340 --> 00:29:22,490
So get words from string
takes order N time

642
00:29:22,490 --> 00:29:24,980
where N is the length of a line.

643
00:29:24,980 --> 00:29:28,100
Let's look at get
words from line list.

644
00:29:28,100 --> 00:29:29,289
AUDIENCE: I have a question.

645
00:29:29,289 --> 00:29:30,080
VICTOR COSTAN: Yes.

646
00:29:30,080 --> 00:29:33,545
AUDIENCE: So [INAUDIBLE]
is W characters long?

647
00:29:33,545 --> 00:29:37,699
Like, does it matter
if the [INAUDIBLE]

648
00:29:37,699 --> 00:29:38,990
VICTOR COSTAN: Does it matter--

649
00:29:38,990 --> 00:29:41,470
AUDIENCE: [INAUDIBLE] make
that assumption of that?

650
00:29:41,470 --> 00:29:45,760
VICTOR COSTAN: So that I can
reason for lines 15 and 16.

651
00:29:45,760 --> 00:29:49,640
I can reason through them easily
if I have a constant length.

652
00:29:49,640 --> 00:29:52,410
It turns out that if you
have an average length,

653
00:29:52,410 --> 00:29:54,580
the results are
going to be the same.

654
00:29:54,580 --> 00:30:03,110
Like overall, if you look at the
running time as a sum of what's

655
00:30:03,110 --> 00:30:05,730
the running time for converting
all the words to lowercase

656
00:30:05,730 --> 00:30:07,490
and then appending
them to the list.

657
00:30:07,490 --> 00:30:10,140
The sum of those is
still going to be order N,

658
00:30:10,140 --> 00:30:12,230
but that takes a bit more
time to reason through

659
00:30:12,230 --> 00:30:13,200
so I took a shortcut.

660
00:30:17,202 --> 00:30:19,790
Are you a math
major, by the way?

661
00:30:19,790 --> 00:30:21,790
You're very rigorous.

662
00:30:21,790 --> 00:30:22,450
OK.

663
00:30:22,450 --> 00:30:24,550
So this is good, it's
always good to try

664
00:30:24,550 --> 00:30:26,150
to keep this in the
back of your head

665
00:30:26,150 --> 00:30:31,260
to make sure you
don't fall for a trap.

666
00:30:31,260 --> 00:30:33,790
So get words from
string order N,

667
00:30:33,790 --> 00:30:36,150
and we're trying to figure
out get words from line list.

668
00:30:36,150 --> 00:30:39,090
Any more questions
before I do that?

669
00:30:39,090 --> 00:30:42,530
Or does anyone want
to tell me I'm wrong?

670
00:30:42,530 --> 00:30:44,610
OK, good.

671
00:30:44,610 --> 00:30:47,320
So get words from line list.

672
00:30:47,320 --> 00:30:50,890
Lines 2 through 6.

673
00:30:50,890 --> 00:30:53,100
2, 3, 4, 5, 6.

674
00:30:55,690 --> 00:30:58,034
Line 2.

675
00:30:58,034 --> 00:30:59,860
AUDIENCE: 1.

676
00:30:59,860 --> 00:31:02,851
VICTOR COSTAN: OK, cost 1,
how many times does it run?

677
00:31:02,851 --> 00:31:03,476
AUDIENCE: Once.

678
00:31:03,476 --> 00:31:05,290
VICTOR COSTAN: Cool.

679
00:31:05,290 --> 00:31:07,990
Line 3.

680
00:31:07,990 --> 00:31:09,170
We need a new number, right?

681
00:31:09,170 --> 00:31:12,000
We need the number of
lines in a document.

682
00:31:12,000 --> 00:31:13,825
Let's say we have Z lines.

683
00:31:19,010 --> 00:31:25,710
So line 3 runs Z times,
and 4 and 5 are in a loop

684
00:31:25,710 --> 00:31:30,692
so they also run Z times.
What's the cost for line 4?

685
00:31:33,524 --> 00:31:34,204
AUDIENCE: 1.

686
00:31:34,204 --> 00:31:35,245
VICTOR COSTAN: Excellent.

687
00:31:38,870 --> 00:31:41,934
What's the cost for line 3?

688
00:31:41,934 --> 00:31:42,790
AUDIENCE: 1.

689
00:31:42,790 --> 00:31:44,950
VICTOR COSTAN: 1.

690
00:31:44,950 --> 00:31:46,590
And what is the cost for line 5?

691
00:31:54,398 --> 00:31:55,880
AUDIENCE: Looks constant.

692
00:31:55,880 --> 00:31:58,125
VICTOR COSTAN:
Looks constant, OK.

693
00:31:58,125 --> 00:31:59,000
AUDIENCE: [INAUDIBLE]

694
00:31:59,000 --> 00:32:03,030
VICTOR COSTAN: Does anyone
else think it looks constant?

695
00:32:03,030 --> 00:32:04,618
Yeah.

696
00:32:04,618 --> 00:32:06,100
AUDIENCE: It's a trap.

697
00:32:06,100 --> 00:32:07,450
VICTOR COSTAN: It's a trap.

698
00:32:07,450 --> 00:32:08,948
It's a trap.

699
00:32:08,948 --> 00:32:10,310
[INTERPOSING VOICES]

700
00:32:10,310 --> 00:32:11,810
AUDIENCE: --length
of the two lists.

701
00:32:11,810 --> 00:32:12,650
VICTOR COSTAN: OK.

702
00:32:12,650 --> 00:32:14,880
Good.

703
00:32:14,880 --> 00:32:17,080
You paid attention
in lecture, right?

704
00:32:17,080 --> 00:32:17,990
AUDIENCE: I try.

705
00:32:17,990 --> 00:32:19,810
VICTOR COSTAN: Nice.

706
00:32:19,810 --> 00:32:25,830
OK, so we have plus
as an operator,

707
00:32:25,830 --> 00:32:29,280
and suppose we work
with two lists.

708
00:32:29,280 --> 00:32:34,410
The first list is 1, 2, 3,
all the way through 1,000.

709
00:32:34,410 --> 00:32:39,380
And the second list is 1, 2, 3.

710
00:32:39,380 --> 00:32:42,010
So when you call
plus to combine them,

711
00:32:42,010 --> 00:32:46,170
if you say something
like C equals A plus B,

712
00:32:46,170 --> 00:32:49,160
you would expect that--
if this is A, by the way

713
00:32:49,160 --> 00:32:53,380
and this is B-- you would expect
that after you call this A is

714
00:32:53,380 --> 00:32:56,120
still this, B is
still this, and C

715
00:32:56,120 --> 00:32:58,740
is a list that
contains everything.

716
00:32:58,740 --> 00:33:04,070
So because of that, what plus
has to do is make a new list,

717
00:33:04,070 --> 00:33:07,350
append all the elements here,
append all the elements here.

718
00:33:07,350 --> 00:33:10,630
So the cost of this if this
list is 1,000 and this list is 3

719
00:33:10,630 --> 00:33:11,940
is 1,003.

720
00:33:11,940 --> 00:33:17,340
Or if you have two lists
of length L1 and L2,

721
00:33:17,340 --> 00:33:22,580
the cost is order of L1 plus L2.

722
00:33:22,580 --> 00:33:24,920
Now there's another Python
method called extend,

723
00:33:24,920 --> 00:33:28,432
which does what I think
you would expect plus

724
00:33:28,432 --> 00:33:29,640
to do in terms of efficiency.

725
00:33:33,020 --> 00:33:36,670
So what extend does is you
call it on one list,

726
00:33:36,670 --> 00:33:38,610
give it the other
list, and it's going

727
00:33:38,610 --> 00:33:40,260
to take each element
in the second list

728
00:33:40,260 --> 00:33:43,020
and append it to the first list.

729
00:33:43,020 --> 00:33:47,050
So for each element here, it
calls append on this list.

730
00:33:47,050 --> 00:33:48,746
So what's the running
time for extend?

731
00:33:48,746 --> 00:33:49,621
AUDIENCE: [INAUDIBLE]

732
00:33:52,920 --> 00:33:55,226
VICTOR COSTAN: OK, there are
too many directions and--

733
00:33:55,226 --> 00:33:56,200
AUDIENCE: Length
of the second list.

734
00:33:56,200 --> 00:33:58,366
VICTOR COSTAN: Length of
the second list, excellent.

735
00:33:58,366 --> 00:34:03,210
So two lists, L1,
L2, order of L2.

736
00:34:03,210 --> 00:34:05,360
So it doesn't matter if
this is 1,000 elements

737
00:34:05,360 --> 00:34:08,130
or a million elements,
appending three elements is

738
00:34:08,130 --> 00:34:11,739
going to take time
proportional to three.
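
The difference between plus and extend can be seen in a small example. The lists below mirror the ones described (1 through 1,000, and 1, 2, 3); the variable names are only illustrative.

```python
a = list(range(1, 1001))   # the 1,000-element list
b = [1, 2, 3]              # the 3-element list

# Plus builds a brand-new list and copies every element of both operands,
# so the cost is proportional to len(a) + len(b).
c = a + b
print(len(a), len(b), len(c))   # a and b are unchanged

# extend appends the elements of b onto the existing list in place,
# so the cost is proportional to len(b) only.
same_object = a
a.extend(b)
print(a is same_object, len(a))
```

After `c = a + b`, `a` still has 1,000 elements and `c` is a separate list of 1,003; after `a.extend(b)`, `a` is the same list object, now with 1,003 elements.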

739
00:34:11,739 --> 00:34:14,860
OK now, let's see
what's going on here.

740
00:34:14,860 --> 00:34:19,100
So we have Z lines and
N characters in a line.

741
00:34:22,520 --> 00:34:24,730
I think I want a nicer constant.

742
00:34:28,069 --> 00:34:29,360
No, let's go with this for now.

743
00:34:32,240 --> 00:34:34,650
AUDIENCE: [INAUDIBLE] lines.

744
00:34:34,650 --> 00:34:38,020
VICTOR COSTAN: So this
is the length of a word.

745
00:34:38,020 --> 00:34:40,020
Let's see, how many words
will I have in a line?

746
00:34:40,020 --> 00:34:47,530
Let's say I have K words in a
line, which is N divided by W.

747
00:34:47,530 --> 00:34:49,989
So I know that
get words from string

748
00:34:49,989 --> 00:34:55,219
returns a list of size K.
So if that is the case, then

749
00:34:55,219 --> 00:34:59,820
the first time line 5
runs, word list is empty.

750
00:34:59,820 --> 00:35:01,580
And it's going to
get K elements.

751
00:35:01,580 --> 00:35:05,310
The second time it runs,
word list has K elements

752
00:35:05,310 --> 00:35:06,530
and gets K more.

753
00:35:06,530 --> 00:35:09,590
Third time, it has 2K
elements, it gets K more.

754
00:35:09,590 --> 00:35:12,420
So the running time for
this looks like this.

755
00:35:12,420 --> 00:35:19,150
K plus 2K plus 3K
plus 4K all the way

756
00:35:19,150 --> 00:35:23,010
until when I'm at the last
line, if I have Z lines.

757
00:35:23,010 --> 00:35:27,720
I had Z minus 1 times
K elements in the list,

758
00:35:27,720 --> 00:35:30,000
because I have Z minus 1
lines and I put all the words

759
00:35:30,000 --> 00:35:35,080
in the list, and I'm
adding K more words.

760
00:35:35,080 --> 00:35:43,760
So total, Z times
K running time.

761
00:35:43,760 --> 00:35:46,010
So this is the total
running time for this guy.

762
00:35:46,010 --> 00:35:50,510
And this is not constant,
so it's complicated.

763
00:35:50,510 --> 00:35:52,910
What does the sum come
down to, asymptotically?

764
00:36:00,210 --> 00:36:04,990
AUDIENCE: Z plus 1,
times K, times Z over 2.

765
00:36:04,990 --> 00:36:05,740
VICTOR COSTAN: Ok.

766
00:36:05,740 --> 00:36:17,000
Z plus 1, times K, times Z over 2.

767
00:36:17,000 --> 00:36:19,770
So because I care
about asymptotics,

768
00:36:19,770 --> 00:36:31,180
this is order of Z
squared times K, right?

769
00:36:31,180 --> 00:36:33,820
So now, a more
natural number to work with

770
00:36:33,820 --> 00:36:36,500
would be the number of
words in a document.

771
00:36:36,500 --> 00:36:38,940
And the number of
words in a document

772
00:36:38,940 --> 00:36:50,150
is W, which is Z times K.
So Z is W divided by K.

773
00:36:50,150 --> 00:36:53,930
And if I substitute
this, I get that this

774
00:36:53,930 --> 00:37:05,170
is equal to O of W squared over
K. Now in a reasonable document

775
00:37:05,170 --> 00:37:08,840
that I see, there tends to
be a limited number of words

776
00:37:08,840 --> 00:37:12,860
per line because the document
has to fit on a page.

777
00:37:12,860 --> 00:37:15,580
So K's pretty much a constant.

778
00:37:15,580 --> 00:37:18,820
So this comes down to
order of W squared.
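
The sum above can be written out as a short derivation. It assumes, as in the discussion, Z lines with K words each, and it uses W for the total number of words in the document (not the word length used earlier).

```latex
\begin{aligned}
\sum_{i=1}^{Z} iK &= K + 2K + 3K + \dots + ZK
                   = K \cdot \frac{Z(Z+1)}{2}
                   = O(Z^{2} K) \\
W = ZK \;\Rightarrow\; Z = \frac{W}{K}
  \quad&\Rightarrow\quad O(Z^{2} K) = O\!\left(\frac{W^{2}}{K^{2}} \cdot K\right)
                   = O\!\left(\frac{W^{2}}{K}\right)
                   = O(W^{2}) \quad \text{for constant } K .
\end{aligned}
```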

779
00:37:21,790 --> 00:37:27,830
So if I go down here and look
at get words from line list,

780
00:37:27,830 --> 00:37:31,679
this is W squared, where
W is how many words I

781
00:37:31,679 --> 00:37:32,470
have in a document.

782
00:37:35,130 --> 00:37:38,830
How many of you guys
are still with me?

783
00:37:38,830 --> 00:37:39,770
Half.

784
00:37:39,770 --> 00:37:41,400
OK.

785
00:37:41,400 --> 00:37:43,460
Does anyone else want
to ask questions,

786
00:37:43,460 --> 00:37:46,360
so that you can
get back on track?

787
00:37:46,360 --> 00:37:48,424
Yes, no?

788
00:37:48,424 --> 00:37:49,873
AUDIENCE: It makes sense so far.

789
00:37:49,873 --> 00:37:50,914
VICTOR COSTAN: Thank you.

790
00:37:50,914 --> 00:37:52,455
AUDIENCE: I think
I didn't understand

791
00:37:52,455 --> 00:37:55,201
the part of [INAUDIBLE]

792
00:37:55,201 --> 00:37:55,950
VICTOR COSTAN: OK.

793
00:37:55,950 --> 00:37:58,060
Thank you.

794
00:37:58,060 --> 00:38:02,280
So let's see what's going
on in lines 2 through 5.

795
00:38:02,280 --> 00:38:09,360
So I have a word list, which
at the beginning is empty.

796
00:38:09,360 --> 00:38:12,640
Then in line 4, words
in line gets K words.

797
00:38:15,300 --> 00:38:21,840
And those K words in line
five are added to word list.

798
00:38:21,840 --> 00:38:25,420
So after that, word
list has K words.

799
00:38:25,420 --> 00:38:26,880
Then I run through
the loop again.

800
00:38:26,880 --> 00:38:29,880
Get the words from string
gives me K new words.

801
00:38:29,880 --> 00:38:33,770
They get added to the list,
which now has 2K words.

802
00:38:33,770 --> 00:38:35,470
Next time I get K
more words, they

803
00:38:35,470 --> 00:38:39,890
get added to the
list, which has 3K.

804
00:38:39,890 --> 00:38:41,530
So on and so forth
until the end.

805
00:38:41,530 --> 00:38:44,380
I have ugly numbers.

806
00:38:44,380 --> 00:38:50,820
Z minus 1 times K words
and I add the last K words.

807
00:38:53,570 --> 00:38:56,480
I'm getting confused here.

808
00:38:56,480 --> 00:38:59,690
And I get Z times K words.

809
00:38:59,690 --> 00:39:02,840
So the word list is eventually
going to have Z times K words,

810
00:39:02,840 --> 00:39:04,710
and it gets them K at a time.

811
00:39:04,710 --> 00:39:08,450
The thing that does this
addition is the plus operator.

812
00:39:08,450 --> 00:39:10,370
And the running time
for the plus operator

813
00:39:10,370 --> 00:39:14,100
is the size of the two lists,
so it's this plus this.

814
00:39:14,100 --> 00:39:17,440
So that's why the running time
is first K, then 2K, then 3K,

815
00:39:17,440 --> 00:39:23,387
then-- make sense now?

816
00:39:23,387 --> 00:39:23,970
AUDIENCE: Yes.

817
00:39:23,970 --> 00:39:26,290
VICTOR COSTAN: OK.

818
00:39:26,290 --> 00:39:30,620
So this is a subtle bug because
if you change plus to extend,

819
00:39:30,620 --> 00:39:33,050
you get docdist2,
which runs a lot faster.
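
The change from plus to extend can be sketched as the two variants below. This is an illustration rather than the course code: the function names are made up for the comparison, and `str.split` stands in for get words from string so that the example is self-contained.

```python
def get_words_from_line_list_plus(lines):
    """Quadratic variant: rebuilds word_list on every iteration."""
    word_list = []
    for line in lines:
        words_in_line = line.split()   # stand-in for get_words_from_string
        # Plus copies everything collected so far plus the K new words.
        word_list = word_list + words_in_line
    return word_list


def get_words_from_line_list_extend(lines):
    """Linear variant: only the K new words are touched per iteration."""
    word_list = []
    for line in lines:
        words_in_line = line.split()   # stand-in for get_words_from_string
        word_list.extend(words_in_line)
    return word_list


lines = ["the fox is outside"] * 1000
slow = get_words_from_line_list_plus(lines)
fast = get_words_from_line_list_extend(lines)
print(slow == fast, len(fast))
```

Both variants return the same list; only the amount of copying differs, which is why the bug is easy to miss.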

820
00:39:37,265 --> 00:39:37,765
OK.

821
00:39:42,270 --> 00:39:45,790
So for everything
else, we want to be

822
00:39:45,790 --> 00:39:47,330
able to do this
sort of analysis,

823
00:39:47,330 --> 00:39:49,027
but we want to do it faster.

824
00:39:49,027 --> 00:39:51,110
So you guys should look
through docdist1

825
00:39:51,110 --> 00:39:55,110
through eight and do the same
analysis for all the functions.

826
00:39:55,110 --> 00:39:58,760
And we're going to post
recitation notes where

827
00:39:58,760 --> 00:40:01,130
we tell you this is the
function that changed,

828
00:40:01,130 --> 00:40:02,642
and this is the
total running time.

829
00:40:02,642 --> 00:40:04,100
And you should go
through the lines

830
00:40:04,100 --> 00:40:07,610
and convince yourself that
this is the right running time.

831
00:40:07,610 --> 00:40:10,290
And you should do that until
it becomes second nature,

832
00:40:10,290 --> 00:40:12,002
because when you're
writing Python code,

833
00:40:12,002 --> 00:40:13,460
you want to have
this in your head.

834
00:40:13,460 --> 00:40:14,880
You don't want to
have to write it down,

835
00:40:14,880 --> 00:40:17,450
because if you have to write it
down, you're going to be lazy

836
00:40:17,450 --> 00:40:19,158
and you're not going
to do it, and you're

837
00:40:19,158 --> 00:40:20,850
going to use plus
instead of extend,

838
00:40:20,850 --> 00:40:23,280
and your code is going
to be horribly slow.

839
00:40:23,280 --> 00:40:25,114
So practice until this
gets in your head,

840
00:40:25,114 --> 00:40:27,530
and then you'll be able to see
the running time for things

841
00:40:27,530 --> 00:40:28,155
really quickly.

842
00:40:31,070 --> 00:40:35,820
OK, do we have time for
one more? Let me see.

843
00:40:35,820 --> 00:40:37,120
OK.

844
00:40:37,120 --> 00:40:39,310
Let's look at the running
time for inner products,

845
00:40:39,310 --> 00:40:40,780
because this is nice and easy.

846
00:40:44,700 --> 00:40:53,030
2, 3, 4, 5, 6, 7.

847
00:40:53,030 --> 00:40:57,900
2 is 1, 1, very nice and easy.

848
00:40:57,900 --> 00:41:05,200
3 looks at the first document
list and iterates through it.

849
00:41:05,200 --> 00:41:09,430
Iteration is constant time, but
if the first document vector

850
00:41:09,430 --> 00:41:15,100
has L1 elements, it's
going to run L1 times.

851
00:41:15,100 --> 00:41:18,270
How about line 4,
for word2, count2 in L2.

852
00:41:18,270 --> 00:41:26,330
This is iteration again, so it's
constant time to run it once,

853
00:41:26,330 --> 00:41:28,146
but how many times will it run?

854
00:41:28,146 --> 00:41:30,130
AUDIENCE: L2 times L1 times.

855
00:41:30,130 --> 00:41:33,420
VICTOR COSTAN: L2
times L1, excellent.

856
00:41:33,420 --> 00:41:35,440
So these two loops are
nested inside each other

857
00:41:35,440 --> 00:41:39,170
so that means that
lines 4 through 6

858
00:41:39,170 --> 00:41:44,060
are going to run once
every time line 3 iterates.

859
00:41:44,060 --> 00:41:45,590
So sorry, actually
line 4 is going

860
00:41:45,590 --> 00:41:49,110
to run once every
time line 3 iterates.

861
00:41:49,110 --> 00:41:53,130
And then everything
inside the second for

862
00:41:53,130 --> 00:41:56,980
is going to run
L1 times L2 times.

863
00:41:56,980 --> 00:42:02,705
So lines 5 and 6 are also
going to run L1, L2 times.

864
00:42:02,705 --> 00:42:06,430
L1, L2, L1, L2.

865
00:42:06,430 --> 00:42:11,716
How much time does it take
to do that if check there?

866
00:42:11,716 --> 00:42:13,040
AUDIENCE: [INAUDIBLE]

867
00:42:13,040 --> 00:42:15,040
VICTOR COSTAN: Why does
it take a constant time?

868
00:42:19,304 --> 00:42:21,345
AUDIENCE: I was going to
say, it wasn't constant,

869
00:42:21,345 --> 00:42:25,680
so don't you have to compare
each character in the word?

870
00:42:25,680 --> 00:42:26,680
VICTOR COSTAN: OK, good.

871
00:42:26,680 --> 00:42:28,360
So we have two words,
and equal, equal

872
00:42:28,360 --> 00:42:31,430
tells me are the words
equal or not, right?

873
00:42:31,430 --> 00:42:35,450
So the way you do that, is you
have words like the and fox.

874
00:42:35,450 --> 00:42:37,270
You go through each
character, and you

875
00:42:37,270 --> 00:42:40,640
stop whenever you see
different characters.

876
00:42:40,640 --> 00:42:46,440
But if you have something
like, if you have a fake word

877
00:42:46,440 --> 00:42:50,704
F-O-I and fox, then go
through the first character,

878
00:42:50,704 --> 00:42:53,370
they're equal, second character,
they're equal, third character,

879
00:42:53,370 --> 00:42:54,620
they're different.

880
00:42:54,620 --> 00:42:57,390
So if you have
length W words that

881
00:42:57,390 --> 00:42:59,220
are different only in
the last character,

882
00:42:59,220 --> 00:43:02,660
this is going to
be order W, right?

883
00:43:02,660 --> 00:43:04,210
So the real--

884
00:43:04,210 --> 00:43:05,620
AUDIENCE: [INAUDIBLE]

885
00:43:05,620 --> 00:43:08,750
VICTOR COSTAN: --yep, equals-equals
for strings is not constant.

886
00:43:08,750 --> 00:43:13,470
It takes W time where W
is the length of a word.

887
00:43:13,470 --> 00:43:15,662
Now here we said that
the length of a word

888
00:43:15,662 --> 00:43:17,620
is constant because we're
dealing with English.

889
00:43:17,620 --> 00:43:19,890
So you could tell me it is
constant because of that.

890
00:43:19,890 --> 00:43:22,181
But I would like to hear the
argument before I take it.

891
00:43:24,630 --> 00:43:26,050
How about line 6?

892
00:43:31,330 --> 00:43:32,770
AUDIENCE: Well,
if the plus equals

893
00:43:32,770 --> 00:43:36,140
is going to be the same
thing before when we were,

894
00:43:36,140 --> 00:43:39,270
every new time your
plus equals, so it's

895
00:43:39,270 --> 00:43:41,940
going to be like how the word
list before we were adding it,

896
00:43:41,940 --> 00:43:43,548
where we have to
create that object,

897
00:43:43,548 --> 00:43:45,524
and then add it to the length.

898
00:43:45,524 --> 00:43:46,018
I mean, it's going
to be length of sum.

899
00:43:46,018 --> 00:43:46,518
Sorry.

900
00:43:46,518 --> 00:43:48,488
And then you add in the new one.

901
00:43:48,488 --> 00:43:50,572
So every time it's going
to be increasing, correct?

902
00:43:50,572 --> 00:43:51,488
VICTOR COSTAN: Almost.

903
00:43:51,488 --> 00:43:52,557
It's a trap again.

904
00:43:52,557 --> 00:43:53,390
[INTERPOSING VOICES]

905
00:43:53,390 --> 00:43:55,020
VICTOR COSTAN: Yep.

906
00:43:55,020 --> 00:43:56,770
Yeah, so this time
they're not lists.

907
00:43:56,770 --> 00:44:00,460
So if you look at what's
going on inside there,

908
00:44:00,460 --> 00:44:03,840
you have count one
and count two are

909
00:44:03,840 --> 00:44:08,780
these numbers in the document
vector, so they're numbers.

910
00:44:08,780 --> 00:44:11,124
And then sum starts
out at 0, and then it

911
00:44:11,124 --> 00:44:12,040
keeps getting numbers.

912
00:44:12,040 --> 00:44:14,050
So sum is going to be a number.

913
00:44:14,050 --> 00:44:16,240
And multiplying numbers
is constant time,

914
00:44:16,240 --> 00:44:19,150
adding numbers is constant
time, so plus for numbers

915
00:44:19,150 --> 00:44:20,587
is order 1 indeed.

916
00:44:20,587 --> 00:44:22,420
AUDIENCE: You're
reassigning sum every time?

917
00:44:22,420 --> 00:44:24,003
VICTOR COSTAN: Which
is also constant.

918
00:44:24,003 --> 00:44:24,545
AUDIENCE: OK.

919
00:44:24,545 --> 00:44:26,711
VICTOR COSTAN: Because
you're copying a number over.

920
00:44:26,711 --> 00:44:28,660
So as long as you're
copying one element over,

921
00:44:28,660 --> 00:44:29,535
that's constant time.

922
00:44:29,535 --> 00:44:32,370
If you're adding two elements
together-- two elements,

923
00:44:32,370 --> 00:44:36,070
not two lists--
that's constant time.

924
00:44:36,070 --> 00:44:39,090
So this is constant.

925
00:44:39,090 --> 00:44:42,010
And the last line is return.

926
00:44:42,010 --> 00:44:43,750
So what's the running
time for this?

927
00:44:46,630 --> 00:44:49,040
AUDIENCE: L2 times L1.

928
00:44:49,040 --> 00:44:50,250
VICTOR COSTAN: Excellent.

929
00:44:50,250 --> 00:44:52,880
So I assume this is a constant.

930
00:44:52,880 --> 00:44:55,860
So this lets me say
this is 1, and then

931
00:44:55,860 --> 00:45:00,260
if we do the partial products
we get 1, L1, L1 L2,

932
00:45:00,260 --> 00:45:01,510
L1 L2, L1 L2.

933
00:45:01,510 --> 00:45:03,780
And if you add them
up, you get L1 L2.

934
00:45:06,380 --> 00:45:11,290
So this is going to be L1, L2.
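The line-by-line accounting can be laid out against code. The sketch below is a reconstruction in the shape the recitation describes (two nested loops over lists of word and count pairs), not necessarily the course's exact code; the accumulator is named total here rather than sum to avoid shadowing Python's built-in, and the comments assume word length is treated as a constant.

```python
def inner_product(vec1, vec2):
    """Inner product of two document vectors, each a list of (word, count) pairs."""
    total = 0.0                            # cost 1, runs once
    for word1, count1 in vec1:             # runs L1 times
        for word2, count2 in vec2:         # runs L1 * L2 times overall
            if word1 == word2:             # cost 1 per test, L1 * L2 tests
                total += count1 * count2   # cost 1: multiplying and adding numbers
    return total                           # cost 1, runs once


doc1 = [("the", 2), ("fox", 1)]
doc2 = [("fox", 3), ("dog", 1)]
print(inner_product(doc1, doc2))
```

Adding up the per-line costs gives 1 + L1 + 3 L1 L2 + 1, which is order L1 L2.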

935
00:45:11,290 --> 00:45:15,410
Vector angle calls inner
product three times, right?

936
00:45:15,410 --> 00:45:18,895
So what's its running time?

937
00:45:18,895 --> 00:45:19,877
AUDIENCE: L1, L2.

938
00:45:23,699 --> 00:45:24,740
VICTOR COSTAN: Excellent.

939
00:45:27,390 --> 00:45:29,090
Count frequency.

940
00:45:29,090 --> 00:45:31,130
You're going to have
to take my word for it

941
00:45:31,130 --> 00:45:36,870
that this is order of W squared.

942
00:45:36,870 --> 00:45:39,270
And if that's the case,
what's the running

943
00:45:39,270 --> 00:45:41,490
time for a word
frequency for file?

944
00:45:44,983 --> 00:45:45,981
AUDIENCE: W squared?

945
00:45:49,973 --> 00:45:50,980
VICTOR COSTAN: Cool.

946
00:45:50,980 --> 00:45:51,030
So.

947
00:45:51,030 --> 00:45:52,770
What's the running
time for main now?

948
00:45:55,835 --> 00:45:56,335
Last trick.

949
00:45:56,335 --> 00:45:57,000
AUDIENCE: [INAUDIBLE]

950
00:45:57,000 --> 00:45:58,833
VICTOR COSTAN: Yep, if
you just add them up,

951
00:45:58,833 --> 00:46:00,932
except there is one
last trick there.

952
00:46:00,932 --> 00:46:04,299
AUDIENCE: If W is
constant, [INAUDIBLE]

953
00:46:04,299 --> 00:46:05,261
VICTOR COSTAN: No.

954
00:46:05,261 --> 00:46:08,160
AUDIENCE: [INAUDIBLE]
W's constant, right?

955
00:46:08,160 --> 00:46:09,390
VICTOR COSTAN: No.

956
00:46:09,390 --> 00:46:11,793
So W is the number of
words in a document.

957
00:46:11,793 --> 00:46:12,660
AUDIENCE: Oh.

958
00:46:12,660 --> 00:46:14,440
VICTOR COSTAN: So it's huge.

959
00:46:14,440 --> 00:46:16,190
If that's constant,
then the whole problem

960
00:46:16,190 --> 00:46:18,065
should run in order one
time, and we're done.

961
00:46:18,065 --> 00:46:19,790
We're going home.

962
00:46:19,790 --> 00:46:23,940
AUDIENCE: W squared because
it beats out L1 and L2.

963
00:46:23,940 --> 00:46:25,460
VICTOR COSTAN: OK, so--

964
00:46:25,460 --> 00:46:26,110
AUDIENCE: L1--

965
00:46:26,110 --> 00:46:28,130
VICTOR COSTAN: --you're
going faster than me.

966
00:46:28,130 --> 00:46:31,190
You're going too fast,
but you're right.

967
00:46:31,190 --> 00:46:35,490
So word frequency for
file is called twice.

968
00:46:35,490 --> 00:46:38,220
The first document is
going to have W1 words.

969
00:46:38,220 --> 00:46:41,460
The second document is
going to have W2 words.

970
00:46:41,460 --> 00:46:44,470
So you can't just copy W
because this is called twice

971
00:46:44,470 --> 00:46:46,940
for different files.

972
00:46:46,940 --> 00:46:51,410
So this is order of
W1 squared plus W2

973
00:46:51,410 --> 00:46:54,290
squared, different documents.

974
00:46:59,870 --> 00:47:03,760
And then I have plus L1, L2.

975
00:47:07,550 --> 00:47:12,960
And you said that W1 and W2
dominate L1 and L2, right?

976
00:47:12,960 --> 00:47:16,120
Because W's the total number
of words in a document,

977
00:47:16,120 --> 00:47:19,640
whereas L is the
number of unique words,

978
00:47:19,640 --> 00:47:22,810
because it's the
length of the vector.

979
00:47:22,810 --> 00:47:24,500
So that is true,
but I'm not sure

980
00:47:24,500 --> 00:47:28,000
how to reduce this here
to make use of that.

981
00:47:28,000 --> 00:47:31,962
However, I made use of what you
said already when I wrote this.

982
00:47:35,740 --> 00:47:37,300
You see why?

983
00:47:37,300 --> 00:47:39,240
Can anyone else see why?

984
00:47:42,600 --> 00:47:52,460
So let's look at the vector
angle again, lines 2 and 3.

985
00:47:52,460 --> 00:47:58,330
So line 2, it calls inner
product with L1 and L2.

986
00:47:58,330 --> 00:48:00,670
But if you look at line
3, it calls inner product

987
00:48:00,670 --> 00:48:05,670
with L1, L1 and then L2, L2
So the total running time

988
00:48:05,670 --> 00:48:10,880
for vector angle is
actually L1, L2 plus L1

989
00:48:10,880 --> 00:48:12,540
squared plus L2 squared.
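The three calls can be seen in a short sketch. This assumes document vectors are lists of (word, count) pairs and that the angle is the arccosine of the normalized inner product; the helper is repeated so the block runs on its own, and the clamp before acos is an addition for floating-point safety, not something from the recitation.

```python
import math


def inner_product(vec1, vec2):
    """Quadratic inner product over lists of (word, count) pairs."""
    total = 0.0
    for word1, count1 in vec1:
        for word2, count2 in vec2:
            if word1 == word2:
                total += count1 * count2
    return total


def vector_angle(vec1, vec2):
    """Angle in radians between two non-empty document vectors."""
    numerator = inner_product(vec1, vec2)              # order L1 * L2
    denominator = math.sqrt(
        inner_product(vec1, vec1)                      # order L1 squared
        * inner_product(vec2, vec2)                    # order L2 squared
    )
    # Clamp to [0, 1] so rounding error cannot push acos out of its domain.
    ratio = max(0.0, min(1.0, numerator / denominator))
    return math.acos(ratio)


print(vector_angle([("a", 3), ("b", 4)], [("a", 3), ("b", 4)]))
```

With a 1,000-word vector and a 1-word vector, the call on (vec1, vec1) does about a million comparisons while the call on (vec1, vec2) does about a thousand, which is why the squared terms cannot be dropped.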

990
00:48:17,880 --> 00:48:20,550
So if the first
document has 1,000 words

991
00:48:20,550 --> 00:48:22,810
and the second
document has one word,

992
00:48:22,810 --> 00:48:25,680
computing the inner
product between L1 and L1

993
00:48:25,680 --> 00:48:27,830
is going to take a lot
more time than computing

994
00:48:27,830 --> 00:48:30,050
the inner product
between L1 and L2.

995
00:48:30,050 --> 00:48:32,910
So I can't leave
out these terms.

996
00:48:32,910 --> 00:48:34,440
They have to be here.

997
00:48:34,440 --> 00:48:37,130
However, when I
add them up here--

998
00:48:37,130 --> 00:48:41,270
if I would write W1 squared
plus W2 squared plus L1 squared

999
00:48:41,270 --> 00:48:44,650
plus L2 squared plus
this-- in that case,

1000
00:48:44,650 --> 00:48:47,340
I can use the fact that
W1 is bigger than L1,

1001
00:48:47,340 --> 00:48:50,735
and it cancels it out.

1002
00:48:50,735 --> 00:48:51,610
Does this make sense?

1003
00:48:51,610 --> 00:48:52,420
Did I lose people?

1004
00:48:55,490 --> 00:48:58,188
Ask questions, please.

1005
00:49:02,751 --> 00:49:04,584
AUDIENCE: But you can't
get rid of L1 and L2

1006
00:49:04,584 --> 00:49:07,577
and not an [INAUDIBLE].

1007
00:49:07,577 --> 00:49:08,660
VICTOR COSTAN: You can't--

1008
00:49:08,660 --> 00:49:09,710
AUDIENCE: [INAUDIBLE]

1009
00:49:09,710 --> 00:49:11,500
VICTOR COSTAN: Oh, so I
can't get rid of this term--

1010
00:49:11,500 --> 00:49:12,541
AUDIENCE: --those, right?

1011
00:49:12,541 --> 00:49:17,335
VICTOR COSTAN: So this should be the sum
of this and this, right?

1012
00:49:17,335 --> 00:49:18,200
AUDIENCE: Right.

1013
00:49:18,200 --> 00:49:22,080
VICTOR COSTAN: So it should
be W1 squared plus W2 squared

1014
00:49:22,080 --> 00:49:26,382
plus L1 squared plus
L2 squared plus L1, L2.

1015
00:49:26,382 --> 00:49:28,110
AUDIENCE: Right.

1016
00:49:28,110 --> 00:49:30,680
VICTOR COSTAN: L1 is strictly smaller than W1.

1017
00:49:30,680 --> 00:49:31,620
AUDIENCE: Yeah.

1018
00:49:31,620 --> 00:49:35,402
VICTOR COSTAN: Goes away, L2 smaller than
W2 goes away, and I get this.

1019
00:49:35,402 --> 00:49:36,326
Correct.

1020
00:49:36,326 --> 00:49:40,796
AUDIENCE: So L1L2 isn't smaller than
W [INAUDIBLE] squared?

1021
00:49:40,796 --> 00:49:41,670
VICTOR COSTAN: Is it?

1022
00:49:41,670 --> 00:49:43,086
If you know more
math than me, you

1023
00:49:43,086 --> 00:49:44,530
might be able to
prove that it is,

1024
00:49:44,530 --> 00:49:47,422
but I don't, so I'm just
leaving it in there.

1025
00:49:47,422 --> 00:49:47,963
AUDIENCE: OK.

1026
00:49:47,963 --> 00:49:49,367
VICTOR COSTAN: Yeah.

1027
00:49:49,367 --> 00:49:51,200
I think there is some
relation, but I really

1028
00:49:51,200 --> 00:49:53,940
don't remember what
it is, so let's

1029
00:49:53,940 --> 00:49:55,070
leave it like that for now.

1030
00:50:00,854 --> 00:50:02,770
Yeah, I think it should
be the case that these

1031
00:50:02,770 --> 00:50:06,250
are bigger than this,
but I'm not sure.

1032
00:50:06,250 --> 00:50:07,463
OK, yes.

1033
00:50:07,463 --> 00:50:12,200
AUDIENCE: How do you get
the line for vector angle?

1034
00:50:12,200 --> 00:50:15,020
VICTOR COSTAN: How do I get
the running time for it?

1035
00:50:15,020 --> 00:50:19,390
So vector angle gets
two vectors, right?

1036
00:50:19,390 --> 00:50:22,250
The vector for document one and
the vector for document two.

1037
00:50:22,250 --> 00:50:24,190
The length of the
first vector is L1.

1038
00:50:24,190 --> 00:50:26,590
The length of the
second vector is L2.

1039
00:50:26,590 --> 00:50:29,260
Now, line, where is it?

1040
00:50:32,550 --> 00:50:38,050
Line 2, for numerator calls
inner product with L1 and L2.

1041
00:50:38,050 --> 00:50:43,350
So we know that the running
time is L1, L2 up here.

1042
00:50:43,350 --> 00:50:46,080
Now the next line,
line 3 in vector angle,

1043
00:50:46,080 --> 00:50:49,990
calls inner product
with L1 and L1.

1044
00:50:49,990 --> 00:50:53,700
So the running time is L1
times L1 which is L1 squared.

1045
00:50:53,700 --> 00:50:54,892
OK.

1046
00:50:54,892 --> 00:50:56,600
AUDIENCE: Can we say
that because there's

1047
00:50:56,600 --> 00:51:02,814
a bounded number of words in the
English language, L1's bounded?

1048
00:51:02,814 --> 00:51:04,287
And as the length
of the document

1049
00:51:04,287 --> 00:51:08,215
gets really, really big,
that [INAUDIBLE] constant?

1050
00:51:11,180 --> 00:51:15,300
VICTOR COSTAN: Yeah, you
might be able to do that.

1051
00:51:15,300 --> 00:51:19,150
Yes, I think for the cases
that we give you, that is true.

1052
00:51:19,150 --> 00:51:21,036
Yeah, I never thought
of that, that's cool.

1053
00:51:21,036 --> 00:51:24,012
AUDIENCE: It doesn't work if
it's not a language, right?

1054
00:51:24,012 --> 00:51:25,580
If you just have gibberish?

1055
00:51:25,580 --> 00:51:32,760
VICTOR COSTAN: Yes, also, to
say that it's constant is useful

1056
00:51:32,760 --> 00:51:35,050
when the number of
words in English

1057
00:51:35,050 --> 00:51:37,660
is much smaller than
your input size.

1058
00:51:37,660 --> 00:51:40,180
So if, say, English
has 50,000 words

1059
00:51:40,180 --> 00:51:43,850
and your input is 3,000 words,
then the input is much smaller.

1060
00:51:43,850 --> 00:51:45,910
But if your input is a million words, which
a million words, which

1061
00:51:45,910 --> 00:51:48,330
I think is what
we use, then yeah,

1062
00:51:48,330 --> 00:51:49,709
it comes down to constant.

1063
00:51:49,709 --> 00:51:51,000
So yeah, that's a good insight.

1064
00:51:51,000 --> 00:51:51,791
That's really nice.

1065
00:51:54,572 --> 00:51:55,536
Anything else?

1066
00:52:02,780 --> 00:52:06,410
OK, so you get to go through
document distance 3 to 8.

1067
00:52:06,410 --> 00:52:08,690
We'll tell you what's
changed, and we'll

1068
00:52:08,690 --> 00:52:11,020
give you a chance to
help you analyze it.

1069
00:52:11,020 --> 00:52:13,850
But you have to analyze it,
then update the scorecard

1070
00:52:13,850 --> 00:52:19,000
for each algorithm to
see how things improve.

1071
00:52:19,000 --> 00:52:20,067
Thanks.