1
00:00:00,060 --> 00:00:01,770
The following
content is provided

2
00:00:01,770 --> 00:00:04,010
under a Creative
Commons license.

3
00:00:04,010 --> 00:00:06,860
Your support will help MIT
OpenCourseWare continue

4
00:00:06,860 --> 00:00:10,720
to offer high-quality
educational resources for free.

5
00:00:10,720 --> 00:00:13,340
To make a donation or
view additional materials

6
00:00:13,340 --> 00:00:17,226
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,226 --> 00:00:17,851
at ocw.mit.edu.

8
00:00:22,030 --> 00:00:25,270
PROFESSOR: So did
everyone turn in PSET1?

9
00:00:25,270 --> 00:00:25,770
Yes?

10
00:00:25,770 --> 00:00:27,220
Good.

11
00:00:27,220 --> 00:00:30,760
OK, so there is a PSET1
critique due in a few days.

12
00:00:30,760 --> 00:00:32,189
My advice?

13
00:00:32,189 --> 00:00:33,215
You already did PSET1.

14
00:00:33,215 --> 00:00:35,090
You remember what you
wrote out on the proof.

15
00:00:35,090 --> 00:00:36,506
Look at the solution.

16
00:00:36,506 --> 00:00:38,630
Write one paragraph today,
and you're done with it.

17
00:00:38,630 --> 00:00:40,800
Then you go focus on PSET2.

18
00:00:40,800 --> 00:00:42,334
If you leave it
off until Tuesday,

19
00:00:42,334 --> 00:00:44,250
you're going to have to
read your proof again,

20
00:00:44,250 --> 00:00:46,370
remember what you're thinking.

21
00:00:46,370 --> 00:00:47,577
It's a lot more work.

22
00:00:47,577 --> 00:00:49,160
Just do it now, get
it out of the way,

23
00:00:49,160 --> 00:00:51,466
and put PSET1 behind you.

24
00:00:51,466 --> 00:00:53,340
AUDIENCE: Is the critique
only for the proof?

25
00:00:53,340 --> 00:00:54,784
Or is this for all of them?

26
00:00:54,784 --> 00:00:55,450
PROFESSOR: Nope.

27
00:00:55,450 --> 00:00:56,399
Just the proof.

28
00:00:56,399 --> 00:00:58,440
So you have to compare
your proof with our proof?

29
00:01:01,128 --> 00:01:02,880
AUDIENCE: Is there an
assignment for that?

30
00:01:02,880 --> 00:01:04,876
Or do we just know to do it?

31
00:01:04,876 --> 00:01:05,610
PROFESSOR: Uh.

32
00:01:05,610 --> 00:01:07,475
PSET1 be Stellar.

33
00:01:09,990 --> 00:01:10,580
Oh, no, sorry.

34
00:01:10,580 --> 00:01:11,413
It's not in Stellar.

35
00:01:11,413 --> 00:01:13,580
It's on our new grading
site that just went out.

36
00:01:13,580 --> 00:01:15,330
So you have to go to
our new grading site,

37
00:01:15,330 --> 00:01:17,500
and you have to type
in your critique there.

38
00:01:17,500 --> 00:01:18,790
And it's one paragraph.

39
00:01:18,790 --> 00:01:20,220
You should aim
for one paragraph.

40
00:01:20,220 --> 00:01:21,760
If you're doing more
than that, then you're

41
00:01:21,760 --> 00:01:22,676
doing something wrong.

42
00:01:22,676 --> 00:01:25,300
And it's LATEX Plus math mode.

43
00:01:25,300 --> 00:01:29,460
So you can use math mode,
and that's about it.

44
00:01:29,460 --> 00:01:31,852
OK, any more questions
about the critique?

45
00:01:31,852 --> 00:01:32,560
It's a new thing.

46
00:01:32,560 --> 00:01:35,160
We care about it because it will
make our grading life easier.

47
00:01:35,160 --> 00:01:37,360
And because it'll force you to
look at the solutions and see

48
00:01:37,360 --> 00:01:39,130
what you understood
and what you didn't.

49
00:01:39,130 --> 00:01:40,060
So we care about it.

50
00:01:40,060 --> 00:01:41,960
Don't ignore it.

51
00:01:41,960 --> 00:01:43,638
Yes?

52
00:01:43,638 --> 00:01:45,304
AUDIENCE: Like, how
much is it weighted?

53
00:01:45,304 --> 00:01:48,294
How much does it count
toward the grade?

54
00:01:48,294 --> 00:01:49,960
PROFESSOR: If you
don't have a critique,

55
00:01:49,960 --> 00:01:53,394
we will most likely
give you a 0 for proof.

56
00:01:53,394 --> 00:01:54,310
AUDIENCE: [INAUDIBLE].

57
00:01:57,350 --> 00:02:00,070
PROFESSOR: If your proof is bad
and your critique of the proof

58
00:02:00,070 --> 00:02:02,070
is good, then you
might get something.

59
00:02:02,070 --> 00:02:04,370
If your proof is bad and
you have no critique--

60
00:02:04,370 --> 00:02:06,370
actually if your proof
is whatever it is and you

61
00:02:06,370 --> 00:02:07,619
have no critique, you get a 0.

62
00:02:07,619 --> 00:02:10,032
AUDIENCE: Yeah. [CHUCKLE]

63
00:02:10,032 --> 00:02:11,740
PROFESSOR: Any more
questions about that?

64
00:02:14,870 --> 00:02:17,080
OK.

65
00:02:17,080 --> 00:02:20,740
Who needs help remembering
what the document distance

66
00:02:20,740 --> 00:02:21,960
problem is?

67
00:02:24,910 --> 00:02:25,410
OK.

68
00:02:25,410 --> 00:02:28,350
Everyone who went to lecture
or to [INAUDIBLE] remembers.

69
00:02:28,350 --> 00:02:29,840
That's good.

70
00:02:29,840 --> 00:02:31,360
Who went to lecture last time?

71
00:02:34,720 --> 00:02:35,290
Cool.

72
00:02:35,290 --> 00:02:36,010
That's good.

73
00:02:36,010 --> 00:02:38,560
So we did insertion
sort and merge sort

74
00:02:38,560 --> 00:02:39,970
from a theoretical standpoint.

75
00:02:39,970 --> 00:02:42,851
Today we're going to look at
the code for the insertion sort

76
00:02:42,851 --> 00:02:45,100
and, if we have time, look
at the code for merge-sort,

77
00:02:45,100 --> 00:02:48,660
and use the same strategy as we
did last time to analyze them,

78
00:02:48,660 --> 00:02:51,290
look at the running time, make
sure the running time matches

79
00:02:51,290 --> 00:02:55,410
the theory, and see how
pseudocode turns into Python.

80
00:02:59,010 --> 00:03:01,690
So you all have your listings.

81
00:03:01,690 --> 00:03:05,790
Last time in document
distance, we covered Main,

82
00:03:05,790 --> 00:03:07,570
and we covered most
of the functions

83
00:03:07,570 --> 00:03:10,450
except for count frequency.

84
00:03:10,450 --> 00:03:15,200
Can anyone remind me what
the call graph looked like?

85
00:03:15,200 --> 00:03:16,700
So the call graph
is the tree that I

86
00:03:16,700 --> 00:03:18,990
had up on the left,
and it started at Main.

87
00:03:28,650 --> 00:03:29,150
Thank you.

88
00:03:31,810 --> 00:03:40,610
So Main calls word frequencies
for file, which in turn calls?

89
00:03:40,610 --> 00:03:42,470
AUDIENCE: Well, it's
probably line list.

90
00:03:42,470 --> 00:03:43,090
PROFESSOR: OK.

91
00:03:43,090 --> 00:03:45,042
AUDIENCE: And count frequency.

92
00:03:51,390 --> 00:03:54,150
PROFESSOR: So we pretend we
don't see the read file called.

93
00:03:54,150 --> 00:03:57,360
We assume that the data
is already in memory

94
00:03:57,360 --> 00:03:59,970
or that the call
takes time that's

95
00:03:59,970 --> 00:04:03,730
proportional to the running--
to the length of the file.

96
00:04:03,730 --> 00:04:07,036
And we only look at get-word
from line list and count

97
00:04:07,036 --> 00:04:07,535
frequency.

98
00:04:13,340 --> 00:04:14,390
OK.

99
00:04:14,390 --> 00:04:17,072
Who else does Main call?

100
00:04:17,072 --> 00:04:18,030
AUDIENCE: Vector angle?

101
00:04:25,760 --> 00:04:28,624
PROFESSOR: And the vector angle?

102
00:04:28,624 --> 00:04:29,624
AUDIENCE: Inner product.

103
00:04:38,017 --> 00:04:38,600
PROFESSOR: OK.

104
00:04:38,600 --> 00:04:43,370
Let's put up the constants
for-- for the document distance

105
00:04:43,370 --> 00:04:46,720
problem that we used last time.

106
00:04:46,720 --> 00:04:51,780
So we said that the
document has W words.

107
00:04:51,780 --> 00:04:54,170
And then when you take
that list of words

108
00:04:54,170 --> 00:04:57,330
and you turn it into
a distance vector,

109
00:04:57,330 --> 00:04:59,850
you will get assigned
to a document vector.

110
00:04:59,850 --> 00:05:04,180
You will get L elements, which
basically means L unique words.

111
00:05:04,180 --> 00:05:09,730
So L is the document
vector length.

112
00:05:09,730 --> 00:05:12,530
And we assume we're using a
natural language like English,

113
00:05:12,530 --> 00:05:15,090
so all the words
are bounded in size.

114
00:05:15,090 --> 00:05:17,470
Like 5 to 10
characters, for example.

115
00:05:17,470 --> 00:05:20,210
And to make our
life easier, we say

116
00:05:20,210 --> 00:05:30,430
all the words have the same
size W. So w is the word length.

117
00:05:30,430 --> 00:05:32,810
Using these numbers,
can anyone remind me

118
00:05:32,810 --> 00:05:36,190
what we said the costs
for these methods are?

119
00:05:36,190 --> 00:05:38,690
And we didn't analyze
count frequency,

120
00:05:38,690 --> 00:05:40,740
so it's OK to not
take my word for it

121
00:05:40,740 --> 00:05:42,800
and not tell me what
I said last time.

122
00:05:42,800 --> 00:05:46,010
But I would like numbers for
word frequencies from file,

123
00:05:46,010 --> 00:05:50,084
get word from line list, vector
angle, and inner product.

124
00:05:50,084 --> 00:05:51,915
AUDIENCE: Was that mu squared?

125
00:05:51,915 --> 00:05:53,160
That was last time.

126
00:05:53,160 --> 00:05:55,380
PROFESSOR: OK.

127
00:05:55,380 --> 00:05:57,150
Which-- which one?

128
00:05:57,150 --> 00:06:01,410
Does anyone-- does
anyone else want to try?

129
00:06:01,410 --> 00:06:03,780
Let's not do guessing.

130
00:06:03,780 --> 00:06:08,290
I'll pull them up
if nobody remembers.

131
00:06:08,290 --> 00:06:10,910
I spent an entire hour on
that, and you guys did too.

132
00:06:10,910 --> 00:06:11,574
It was painful.

133
00:06:11,574 --> 00:06:12,282
AUDIENCE: I know.

134
00:06:12,282 --> 00:06:14,924
But we had to, like, you
know, add them up, and then--

135
00:06:14,924 --> 00:06:15,590
PROFESSOR: Yeah.

136
00:06:15,590 --> 00:06:17,340
So we did a lot of
work for those numbers.

137
00:06:20,280 --> 00:06:24,350
So get word from line list
was order of W squared.

138
00:06:24,350 --> 00:06:26,490
Does anyone remember why?

139
00:06:26,490 --> 00:06:29,220
What made it take so much time?

140
00:06:34,020 --> 00:06:36,900
AUDIENCE: 'Cause you
append it to the word list.

141
00:06:36,900 --> 00:06:39,467
For, like, you add it to
the end to go through.

142
00:06:39,467 --> 00:06:40,050
PROFESSOR: OK.

143
00:06:43,820 --> 00:06:44,530
So you add it.

144
00:06:44,530 --> 00:06:45,920
So what?

145
00:06:45,920 --> 00:06:47,922
AUDIENCE: So, like,
every time you need it,

146
00:06:47,922 --> 00:06:52,260
you, like, do word list equals
word list plus words in line.

147
00:06:52,260 --> 00:06:53,070
PROFESSOR: OK.

148
00:06:53,070 --> 00:06:55,300
Excellent.

149
00:06:55,300 --> 00:06:58,200
So get words from
line list, line five.

150
00:06:58,200 --> 00:06:59,580
There's a plus there.

151
00:06:59,580 --> 00:07:03,630
And that plus sign messes
up the performance.

152
00:07:03,630 --> 00:07:06,890
So that's why it's w squared.

153
00:07:06,890 --> 00:07:08,800
Count frequency,
we didn't cover it,

154
00:07:08,800 --> 00:07:10,160
you had to take my word for it.

155
00:07:10,160 --> 00:07:12,400
So we will cover it now.

156
00:07:12,400 --> 00:07:15,145
And, inner product.

157
00:07:26,060 --> 00:07:29,610
So suppose you have two vectors
of line-- length L1 and L2.

158
00:07:29,610 --> 00:07:32,700
How much time does it take
to compute the inner product?

159
00:07:32,700 --> 00:07:34,492
AUDIENCE: L1, L2 time.

160
00:07:34,492 --> 00:07:35,075
PROFESSOR: OK.

161
00:07:38,880 --> 00:07:40,970
So how much time does
vector angle take?

162
00:07:48,335 --> 00:07:49,317
AUDIENCE: L1 L2 time.

163
00:07:53,750 --> 00:07:56,509
PROFESSOR: L1 L2.

164
00:07:56,509 --> 00:07:57,800
So it makes three calls, right?

165
00:07:57,800 --> 00:07:59,270
You get the two things.

166
00:07:59,270 --> 00:08:02,450
First, it computes the inner
product of the two lists.

167
00:08:02,450 --> 00:08:05,890
And then it has to compute the
inner product of each vector--

168
00:08:05,890 --> 00:08:08,359
of each document
vector with itself.

169
00:08:08,359 --> 00:08:10,525
Because that's what's on
the bottom of the fraction.

170
00:08:13,330 --> 00:08:15,210
So what's the running
time for that?

171
00:08:15,210 --> 00:08:18,710
AUDIENCE: Plus L1
squared, plus L2 squared.

172
00:08:18,710 --> 00:08:22,880
PROFESSOR: Plus L1
squared, plus L2 squared.

173
00:08:22,880 --> 00:08:25,380
So someone was really helpful
last time and asked me,

174
00:08:25,380 --> 00:08:27,600
can you make this
simpler with some math?

175
00:08:27,600 --> 00:08:30,080
And I said, I don't know
so I don't think I will.

176
00:08:30,080 --> 00:08:32,820
But I looked at-- my, I
looked at my high school math

177
00:08:32,820 --> 00:08:37,549
afterwards, and it turns out
that these-- so L1 squared

178
00:08:37,549 --> 00:08:40,830
plus L2 squared are guaranteed
to be greater than L1 L2

179
00:08:40,830 --> 00:08:43,009
as long as these
numbers are positive.

180
00:08:43,009 --> 00:08:44,550
And we're working
with document list,

181
00:08:44,550 --> 00:08:46,350
so they're always positive.

182
00:08:46,350 --> 00:08:49,155
So this will go away.

183
00:08:54,320 --> 00:09:00,080
So let's assume count frequency
is w squared or smaller.

184
00:09:00,080 --> 00:09:03,310
So the total running time for
word frequencies from file

185
00:09:03,310 --> 00:09:06,580
is w squared.

186
00:09:06,580 --> 00:09:08,300
What's the running
time for everything?

187
00:09:13,114 --> 00:09:14,780
AUDIENCE: Would just
be, like, w squared

188
00:09:14,780 --> 00:09:17,410
plus L1 squared plus L2 squared?

189
00:09:17,410 --> 00:09:18,160
PROFESSOR: Almost.

190
00:09:18,160 --> 00:09:20,330
AUDIENCE: Actually, it depends
on which one's greater, right?

191
00:09:20,330 --> 00:09:21,330
PROFESSOR: Well, almost.

192
00:09:21,330 --> 00:09:27,580
So here W works, assuming you
get one document with W words.

193
00:09:27,580 --> 00:09:30,230
But word frequencies from
files is called twice, once

194
00:09:30,230 --> 00:09:31,870
for each document.

195
00:09:31,870 --> 00:09:36,060
First document has W1 words,
second document has W2 words.

196
00:09:36,060 --> 00:09:38,974
So what's the running time?

197
00:09:38,974 --> 00:09:42,750
AUDIENCE: W1 squared
plus W2 squared.

198
00:09:42,750 --> 00:09:48,000
PROFESSOR: W1 squared
plus W2 squared.

199
00:09:48,000 --> 00:09:50,200
So I take this, and I add this.

200
00:09:50,200 --> 00:09:51,590
Right?

201
00:09:51,590 --> 00:09:55,380
Except when I add this, if
I want to add L1 squared,

202
00:09:55,380 --> 00:09:58,290
I know L1 is the number
of unique documents--

203
00:09:58,290 --> 00:10:00,600
of unique words in the document.

204
00:10:00,600 --> 00:10:04,450
And W1 is the total number
of words in the document.

205
00:10:04,450 --> 00:10:07,177
W1 guaranteed to be
greater or equal than L1,

206
00:10:07,177 --> 00:10:08,260
so it's going to dominate.

207
00:10:08,260 --> 00:10:09,440
I don't need to add it.

208
00:10:09,440 --> 00:10:11,360
Same for L2.

209
00:10:11,360 --> 00:10:12,178
This is it.

210
00:10:15,200 --> 00:10:15,700
OK.

211
00:10:15,700 --> 00:10:18,054
You guys don't seem to
remember the numbers for these.

212
00:10:18,054 --> 00:10:20,220
So that means I didn't
torture you enough last time.

213
00:10:20,220 --> 00:10:21,830
So let's do more.

214
00:10:21,830 --> 00:10:23,270
Let's look at count frequency.

215
00:10:23,270 --> 00:10:27,750
And let's compute
the cost for that.

216
00:10:36,830 --> 00:10:40,730
So what we did last time was, we
went through each line of code.

217
00:10:40,730 --> 00:10:42,820
We thought, how
much time does it

218
00:10:42,820 --> 00:10:44,550
take to execute the line once?

219
00:10:44,550 --> 00:10:46,710
And how many times
does the line run?

220
00:10:46,710 --> 00:10:49,819
And then we compute the product
of that, add everything up,

221
00:10:49,819 --> 00:10:51,360
and that's the cost
for the function.

222
00:10:55,940 --> 00:10:58,260
First off, before
I put numbers here,

223
00:10:58,260 --> 00:10:59,540
what does the method to do?

224
00:11:09,480 --> 00:11:12,040
AUDIENCE: It takes
a list of words--

225
00:11:12,040 --> 00:11:12,730
PROFESSOR: OK.

226
00:11:12,730 --> 00:11:16,150
AUDIENCE: For each
item in that list,

227
00:11:16,150 --> 00:11:17,870
checks to see if
it's-- you know,

228
00:11:17,870 --> 00:11:25,407
list of words that
it's-- counted, right?

229
00:11:25,407 --> 00:11:25,990
PROFESSOR: OK.

230
00:11:25,990 --> 00:11:28,290
So you're telling me
what the code does.

231
00:11:28,290 --> 00:11:29,010
AUDIENCE: Yeah.

232
00:11:29,010 --> 00:11:31,070
PROFESSOR: Try to
look at Main or try

233
00:11:31,070 --> 00:11:33,420
to look at word
frequencies for files.

234
00:11:33,420 --> 00:11:36,540
So look at it top-down, and tell
me what the purpose of it is.

235
00:11:36,540 --> 00:11:38,310
What's the goal?

236
00:11:38,310 --> 00:11:45,108
AUDIENCE: Making a list
of-- and each object

237
00:11:45,108 --> 00:11:47,400
is a list with a
word and a number.

238
00:11:47,400 --> 00:11:48,220
PROFESSOR: OK.

239
00:11:48,220 --> 00:11:49,100
Excellent.

240
00:11:49,100 --> 00:11:51,790
So big picture, I have
the first document.

241
00:11:51,790 --> 00:11:52,640
I read it in.

242
00:11:52,640 --> 00:11:54,106
I break it up into words.

243
00:11:54,106 --> 00:11:55,230
And I have a list of words.

244
00:11:55,230 --> 00:11:58,380
That's what word
frequencies for file

245
00:11:58,380 --> 00:12:00,920
gives me-- sorry, that's what
you get words from line list

246
00:12:00,920 --> 00:12:02,190
gives me.

247
00:12:02,190 --> 00:12:02,890
List of words.

248
00:12:08,720 --> 00:12:16,000
The fox is in the hat.

249
00:12:18,680 --> 00:12:21,600
And this gets passed
to count frequency,

250
00:12:21,600 --> 00:12:24,700
and count frequency
gives me, you said,

251
00:12:24,700 --> 00:12:27,430
an object, which is a list.

252
00:12:27,430 --> 00:12:30,180
Of lists, where each
of them has the word

253
00:12:30,180 --> 00:12:31,590
and how many times it shows up.

254
00:12:31,590 --> 00:12:39,510
So I would have "the" shows
up twice, "fox" shows up

255
00:12:39,510 --> 00:12:51,110
once, "is" shows up
once, "in" shows up once,

256
00:12:51,110 --> 00:12:56,110
and-- I need a shorter
example-- "hat" shows up once.

257
00:12:56,110 --> 00:12:58,155
So it takes this and
turns it into that.

258
00:13:01,840 --> 00:13:05,090
So on line 2, I have a
list L that's initialized.

259
00:13:05,090 --> 00:13:06,770
And then, at the
end, it's returned.

260
00:13:06,770 --> 00:13:09,840
So I'm going to guess that L
is going to look like this.

261
00:13:13,350 --> 00:13:17,050
Line 3, for new word in word
list, iterates over the input.

262
00:13:17,050 --> 00:13:19,430
So iterates over this.

263
00:13:19,430 --> 00:13:23,340
And then, line 4 checks
to see, for each new word,

264
00:13:23,340 --> 00:13:26,150
it looks at the list that
I have under construction.

265
00:13:26,150 --> 00:13:30,950
So exam-- for example, if
I ran through all the words

266
00:13:30,950 --> 00:13:33,360
and then I'm trying to
put in hat right now,

267
00:13:33,360 --> 00:13:35,250
I wouldn't have it here.

268
00:13:35,250 --> 00:13:39,890
What line 4 does is, it
looks at all the entries.

269
00:13:39,890 --> 00:13:42,710
And it says, if I
can find the words--

270
00:13:42,710 --> 00:13:45,180
so if I can find the
word hat somewhere here--

271
00:13:45,180 --> 00:13:47,090
then increment the number.

272
00:13:47,090 --> 00:13:49,642
If I can't, then
make a new entry

273
00:13:49,642 --> 00:13:52,100
and say that the word shows up
once, because it's the first

274
00:13:52,100 --> 00:13:52,600
I see it.

275
00:13:56,340 --> 00:13:57,590
So this is what the code does.

276
00:13:57,590 --> 00:14:01,640
Now let's see how
fast it does that.

277
00:14:01,640 --> 00:14:07,450
So line 2 initialize the
output to an empty list.

278
00:14:07,450 --> 00:14:10,071
What's the cost for that?

279
00:14:10,071 --> 00:14:10,988
AUDIENCE: [INAUDIBLE].

280
00:14:10,988 --> 00:14:11,862
PROFESSOR: Very good.

281
00:14:11,862 --> 00:14:12,580
How many times?

282
00:14:18,370 --> 00:14:19,810
For new word in word lists.

283
00:14:19,810 --> 00:14:20,310
Cost?

284
00:14:23,726 --> 00:14:26,680
AUDIENCE: [INAUDIBLE].

285
00:14:26,680 --> 00:14:29,070
I know the cost is 1.

286
00:14:29,070 --> 00:14:32,330
PROFESSOR: OK, so we are-- here,
it's a bit confusing, right?

287
00:14:32,330 --> 00:14:35,610
We're saying that, oh, there's
does iteration over a list.

288
00:14:35,610 --> 00:14:37,920
And each step of the
iteration is constant time,

289
00:14:37,920 --> 00:14:42,210
but the iteration
happens L times.

290
00:14:42,210 --> 00:14:43,510
I'm sorry, not L times.

291
00:14:43,510 --> 00:14:45,400
The length of the list times.

292
00:14:45,400 --> 00:14:48,120
How many-- how many
elements are in word lists?

293
00:14:53,090 --> 00:14:57,730
I heard a very low W, so
I will pretend I heard it.

294
00:14:57,730 --> 00:15:01,350
Or I hope I heard W. So
word list, the words I

295
00:15:01,350 --> 00:15:06,690
got from the document,
W. How about the if.

296
00:15:06,690 --> 00:15:13,570
So, it looks at the
word that I have-- oh.

297
00:15:13,570 --> 00:15:16,680
This code is confusing because
I forgot a line, right?

298
00:15:16,680 --> 00:15:20,580
Pretend that between line--
oh, no, sorry, I didn't.

299
00:15:20,580 --> 00:15:23,512
New word is-- new word
is assigned in line 3.

300
00:15:23,512 --> 00:15:29,030
So new word in line 3 is
compared to the first element

301
00:15:29,030 --> 00:15:31,280
of the entry that's
assigned in line 4.

302
00:15:31,280 --> 00:15:36,960
So hat is compared with
the, fox, so on, so forth.

303
00:15:36,960 --> 00:15:41,530
And if the comparison is
true, it runs line 6 and 7.

304
00:15:41,530 --> 00:15:45,230
And if not, it keeps looping.

305
00:15:45,230 --> 00:15:50,300
So line 5, the if how,
many times does it run?

306
00:15:50,300 --> 00:15:52,498
Just the comparison.

307
00:15:52,498 --> 00:15:53,810
AUDIENCE: W.

308
00:15:53,810 --> 00:15:55,107
PROFESSOR: W.

309
00:15:55,107 --> 00:15:56,148
AUDIENCE: Oh, no, no, no.

310
00:15:56,148 --> 00:15:57,072
No.

311
00:15:57,072 --> 00:15:58,920
I'm thinking of line 4.

312
00:15:58,920 --> 00:15:59,941
PROFESSOR: Oh.

313
00:15:59,941 --> 00:16:00,440
Yeah.

314
00:16:00,440 --> 00:16:02,260
I'm getting confused too.

315
00:16:02,260 --> 00:16:03,540
So let's start with line 4.

316
00:16:03,540 --> 00:16:04,520
Sorry.

317
00:16:04,520 --> 00:16:05,882
Shouldn't do line 5.

318
00:16:05,882 --> 00:16:08,059
AUDIENCE: --new word,
then you're not-- like,

319
00:16:08,059 --> 00:16:09,850
you're not going to
run through that again.

320
00:16:09,850 --> 00:16:10,516
PROFESSOR: Yeah.

321
00:16:10,516 --> 00:16:12,530
Let's worry about
that right afterwards.

322
00:16:12,530 --> 00:16:13,760
Let's do line 4 first.

323
00:16:13,760 --> 00:16:15,790
Sorry, I jumped over line 4.

324
00:16:15,790 --> 00:16:19,520
So, line 4 definitely
runs W times

325
00:16:19,520 --> 00:16:23,970
because it's inside the for
loop from line 3 to line 9.

326
00:16:23,970 --> 00:16:29,510
So everything here will
definitely run W times.

327
00:16:29,510 --> 00:16:31,910
But how many times
does it run overall?

328
00:16:31,910 --> 00:16:36,377
So, line 4 iterates over
all the entries here.

329
00:16:36,377 --> 00:16:37,710
How many times does that happen?

330
00:16:41,161 --> 00:16:47,570
AUDIENCE: 1 plus W
over-- times 10/2,

331
00:16:47,570 --> 00:16:54,588
because it's just worst case,
L-- the length of L increases

332
00:16:54,588 --> 00:16:56,830
by 1 every time.

333
00:16:56,830 --> 00:16:58,950
[INAUDIBLE]

334
00:16:58,950 --> 00:17:02,350
PROFESSOR: OK, so I like that
you started with worst case.

335
00:17:02,350 --> 00:17:04,130
Normally I would
say exactly that.

336
00:17:04,130 --> 00:17:07,619
Worst case or W. But we
had a different constant

337
00:17:07,619 --> 00:17:11,480
for the number of words
that you have in the end.

338
00:17:11,480 --> 00:17:16,230
So let's say something a little
bit better than W. Let's say,

339
00:17:16,230 --> 00:17:17,880
let's put the lower bound on it.

340
00:17:17,880 --> 00:17:20,036
So yeah, worst case, all
the words are different.

341
00:17:20,036 --> 00:17:21,619
But what if they're
not all different?

342
00:17:21,619 --> 00:17:25,632
And what if in the end
I know I have L words?

343
00:17:25,632 --> 00:17:27,653
CLASS: [INAUDIBLE].

344
00:17:27,653 --> 00:17:28,569
PROFESSOR: Worst case.

345
00:17:33,001 --> 00:17:33,500
Almost.

346
00:17:36,340 --> 00:17:39,940
So, I know I have a W
from the outer loop.

347
00:17:39,940 --> 00:17:42,680
For each word in the
outer loop, how many times

348
00:17:42,680 --> 00:17:46,800
does the inner loop execute?

349
00:17:46,800 --> 00:17:49,210
How many times do I have
to go through something

350
00:17:49,210 --> 00:17:50,250
in the inner list?

351
00:17:50,250 --> 00:17:53,090
So I know here I have
W words, suppose here I

352
00:17:53,090 --> 00:17:55,480
have L elements in the vector.

353
00:17:55,480 --> 00:17:58,195
For each one of
these, how many times

354
00:17:58,195 --> 00:18:02,440
do I have to go through--
So how many elements do I

355
00:18:02,440 --> 00:18:03,946
have to go through here?

356
00:18:03,946 --> 00:18:05,764
AUDIENCE: Depends on
where you are, though.

357
00:18:05,764 --> 00:18:07,930
For the first word, you
only have to go through one.

358
00:18:07,930 --> 00:18:08,230
PROFESSOR: Yep.

359
00:18:08,230 --> 00:18:08,530
But--

360
00:18:08,530 --> 00:18:09,590
AUDIENCE: For the second
word, you have to through--

361
00:18:09,590 --> 00:18:09,870
PROFESSOR: Yep.

362
00:18:09,870 --> 00:18:11,280
But I heard the worst case.

363
00:18:11,280 --> 00:18:12,780
And I like that,
because it's easier

364
00:18:12,780 --> 00:18:14,071
to reason about the worst case.

365
00:18:14,071 --> 00:18:17,480
And most of the time it's
sort of like the average case.

366
00:18:17,480 --> 00:18:21,200
AUDIENCE: So then,
length of list-- L.

367
00:18:21,200 --> 00:18:25,530
PROFESSOR: The length of the
list, and that's L. Worst case,

368
00:18:25,530 --> 00:18:27,976
the first words that I see
will be L different words.

369
00:18:27,976 --> 00:18:29,350
And then all the
words that I see

370
00:18:29,350 --> 00:18:32,840
will be the same as the
words that I saw before.

371
00:18:32,840 --> 00:18:36,080
So worst case, the list
will grow to L very fast,

372
00:18:36,080 --> 00:18:39,085
and then I'll keep
seeing L L L. And I'll

373
00:18:39,085 --> 00:18:40,710
ignore what was there
in the beginning,

374
00:18:40,710 --> 00:18:44,460
and I'll say L times.

375
00:18:44,460 --> 00:18:47,320
So I know the second list
is bounded by L in length,

376
00:18:47,320 --> 00:18:49,470
the first list is
bounded by W in length.

377
00:18:49,470 --> 00:18:53,960
So worst case this
runs L times W times.

378
00:18:53,960 --> 00:18:56,394
And what's the
cost of iterating?

379
00:18:56,394 --> 00:18:58,854
AUDIENCE: What is the
difference between L and W?

380
00:18:58,854 --> 00:19:01,965
L is the document vector length,
and W is the number of words.

381
00:19:01,965 --> 00:19:04,803
But isn't the number of
elements in the document vector

382
00:19:04,803 --> 00:19:07,799
the number of words?

383
00:19:07,799 --> 00:19:09,090
PROFESSOR: How about this case?

384
00:19:09,090 --> 00:19:09,810
What's W?

385
00:19:15,134 --> 00:19:16,110
AUDIENCE: 6, yeah.

386
00:19:20,140 --> 00:19:23,790
PROFESSOR: So L is the number
of unique words in a document.

387
00:19:23,790 --> 00:19:25,580
And I heard a
really cool argument

388
00:19:25,580 --> 00:19:28,160
that I liked last time.

389
00:19:28,160 --> 00:19:29,120
Does anyone remember?

390
00:19:29,120 --> 00:19:29,840
About L?

391
00:19:33,450 --> 00:19:37,160
If we're really dealing with a
natural language like English,

392
00:19:37,160 --> 00:19:39,636
how many words do
I have in English?

393
00:19:39,636 --> 00:19:41,135
AUDIENCE: Well, I
think, at the max,

394
00:19:41,135 --> 00:19:43,997
there's actually around
250,000, but a lot of them

395
00:19:43,997 --> 00:19:45,157
are not used anymore.

396
00:19:45,157 --> 00:19:45,740
PROFESSOR: OK.

397
00:19:45,740 --> 00:19:47,720
So 250,000, right?

398
00:19:47,720 --> 00:19:48,820
Max.

399
00:19:48,820 --> 00:19:50,480
So that's a constant.

400
00:19:50,480 --> 00:19:52,930
If I have a document
that contains

401
00:19:52,930 --> 00:19:56,060
all the writings of all the
authors that were ever done,

402
00:19:56,060 --> 00:20:02,771
and say that's a billion words,
L is still going to be 250,000.

403
00:20:02,771 --> 00:20:03,270
Right?

404
00:20:03,270 --> 00:20:04,882
So L can be very
different from W.

405
00:20:04,882 --> 00:20:06,965
That's why we're keeping
track of them separately.

406
00:20:11,510 --> 00:20:14,910
One W L times W. What's
the cost of iterating?

407
00:20:19,190 --> 00:20:21,530
So we know how many
times line 4 runs,

408
00:20:21,530 --> 00:20:26,930
but what's the cost of one
step of iterating in the list?

409
00:20:26,930 --> 00:20:27,430
1.

410
00:20:27,430 --> 00:20:27,930
Very good.

411
00:20:31,530 --> 00:20:32,036
Line 5.

412
00:20:32,036 --> 00:20:33,160
How many times does it run?

413
00:20:37,420 --> 00:20:39,230
AUDIENCE: W times L?

414
00:20:39,230 --> 00:20:39,990
PROFESSOR: Yep.

415
00:20:39,990 --> 00:20:41,580
Same as line 4, right?

416
00:20:41,580 --> 00:20:43,800
The if is run all the time.

417
00:20:43,800 --> 00:20:47,860
And lines 5 and 6 only run
sometimes, but-- sorry,

418
00:20:47,860 --> 00:20:53,750
line 6 and 7 only run sometimes,
but line 5 runs all the time.

419
00:20:53,750 --> 00:20:54,740
What is the cost?

420
00:21:01,642 --> 00:21:03,280
AUDIENCE: We can
say it's constant.

421
00:21:03,280 --> 00:21:04,780
PROFESSOR: We can
say it's constant.

422
00:21:04,780 --> 00:21:07,560
I like we can say
it's constant, but why

423
00:21:07,560 --> 00:21:10,020
is it that we can
say it's constant?

424
00:21:10,020 --> 00:21:12,480
Why don't I just say 1,
if-- this is an empty list.

425
00:21:12,480 --> 00:21:13,280
This is a number.

426
00:21:13,280 --> 00:21:14,340
AUDIENCE: Depends
on the word length.

427
00:21:14,340 --> 00:21:14,923
PROFESSOR: OK.

428
00:21:14,923 --> 00:21:17,065
It depends on the word
lis-- length, very good.

429
00:21:17,065 --> 00:21:18,690
AUDIENCE: And we're
assuming that words

430
00:21:18,690 --> 00:21:20,170
are all the same length.

431
00:21:20,170 --> 00:21:21,637
PROFESSOR: OK.

432
00:21:21,637 --> 00:21:22,220
AUDIENCE: And?

433
00:21:22,220 --> 00:21:26,190
So, 1 times L W. Little w.

434
00:21:26,190 --> 00:21:28,390
PROFESSOR: OK, very good.

435
00:21:28,390 --> 00:21:30,980
So we do assume that all the
words are the same length.

436
00:21:30,980 --> 00:21:34,920
But unless I tell you that
the length is really small,

437
00:21:34,920 --> 00:21:37,630
which I did, you can't say 1.

438
00:21:37,630 --> 00:21:41,000
So, when you said we can say
it's constant, it's right.

439
00:21:41,000 --> 00:21:42,770
We can say, but we
also have to say why,

440
00:21:42,770 --> 00:21:44,890
or at least think
why, that's the case.

441
00:21:44,890 --> 00:21:48,760
So it's W-- we're going to use
W here and when we copy it here,

442
00:21:48,760 --> 00:21:51,144
we're going to forget about it.

443
00:21:51,144 --> 00:21:52,852
AUDIENCE: Can you put
a top bar on the W,

444
00:21:52,852 --> 00:21:56,830
just so I can tell that
it's not the other W.

445
00:21:56,830 --> 00:22:07,950
PROFESSOR: OK But you have to
be responsible for reminding

446
00:22:07,950 --> 00:22:08,770
me to put that.

447
00:22:08,770 --> 00:22:10,160
AUDIENCE: OK.

448
00:22:10,160 --> 00:22:12,240
PROFESSOR: OK.

449
00:22:12,240 --> 00:22:15,470
So string comparison,
not constant.

450
00:22:15,470 --> 00:22:17,770
If I have two very
long strings that

451
00:22:17,770 --> 00:22:19,550
only differ in the
last character,

452
00:22:19,550 --> 00:22:21,550
I have to go through them
character by character

453
00:22:21,550 --> 00:22:24,810
by character until I find
the last character that's

454
00:22:24,810 --> 00:22:25,580
different.

455
00:22:25,580 --> 00:22:28,770
Because until I look at that,
the strings might be equal.

456
00:22:28,770 --> 00:22:30,926
So comparing two
long strings takes

457
00:22:30,926 --> 00:22:33,175
time that's proportional to
the length of the strings.

458
00:22:36,680 --> 00:22:40,130
OK, so line 5 costs
w, tricky part,

459
00:22:40,130 --> 00:22:44,420
runs L times W part--
L times W times.

460
00:22:44,420 --> 00:22:46,350
How about line 6 and 7?

461
00:22:52,860 --> 00:22:56,280
I didn't ask line 6, I
asked line 6 and 7 together,

462
00:22:56,280 --> 00:22:57,690
because there is a trick there.

463
00:23:05,690 --> 00:23:09,190
AUDIENCE: I think it's
constant for line 6, right?

464
00:23:09,190 --> 00:23:10,518
PROFESSOR: Why?

465
00:23:10,518 --> 00:23:12,858
AUDIENCE: 'Cause it's a number.

466
00:23:12,858 --> 00:23:14,715
And one place in the entry.

467
00:23:14,715 --> 00:23:16,280
We've already grabbed the entry.

468
00:23:16,280 --> 00:23:20,420
PROFESSOR: So the cost
of running it once is 1.

469
00:23:20,420 --> 00:23:21,700
Good.

470
00:23:21,700 --> 00:23:25,190
I can tell you that that's
the same case for 7.

471
00:23:25,190 --> 00:23:26,560
How many times do they run?

472
00:23:30,120 --> 00:23:32,490
This is the hard part.

473
00:23:32,490 --> 00:23:34,980
AUDIENCE: Wait, does break
line break out of one loop?

474
00:23:34,980 --> 00:23:35,605
PROFESSOR: Yep.

475
00:23:35,605 --> 00:23:39,940
Break breaks out of the
loop between lines 4 and 7.

476
00:23:39,940 --> 00:23:41,300
I have a question.

477
00:23:41,300 --> 00:23:46,350
AUDIENCE: Is L supposed
to be not in line with if?

478
00:23:46,350 --> 00:23:46,850
Yes.

479
00:23:46,850 --> 00:23:49,580
PROFESSOR: It is supposed
to be where it is.

480
00:23:49,580 --> 00:23:50,640
I will talk-- yes.

481
00:23:50,640 --> 00:23:53,880
So what happens is, that
else is in line with a for.

482
00:23:53,880 --> 00:23:57,430
And if the for loop
runs to completion,

483
00:23:57,430 --> 00:24:00,860
then it does get executed.

484
00:24:00,860 --> 00:24:04,097
If there's a break
somewhere inside the for,

485
00:24:04,097 --> 00:24:05,305
then it doesn't get executed.

486
00:24:08,140 --> 00:24:10,650
So the idea behind
that is, usually

487
00:24:10,650 --> 00:24:12,510
use this for finding stuff.

488
00:24:12,510 --> 00:24:15,760
So you iterate over a list,
and when you find something,

489
00:24:15,760 --> 00:24:17,516
break out of the loop.

490
00:24:17,516 --> 00:24:19,390
You did something, you
break out of the loop.

491
00:24:19,390 --> 00:24:21,435
If you didn't find it,
you can put an else

492
00:24:21,435 --> 00:24:23,167
and then say what code happens.

493
00:24:23,167 --> 00:24:25,000
And you don't have to
write code on your own

494
00:24:25,000 --> 00:24:29,420
to check if you broke
out of the loop.

495
00:24:29,420 --> 00:24:32,230
So if break executes,
then it's going

496
00:24:32,230 --> 00:24:34,570
to take us out of the loop.

497
00:24:34,570 --> 00:24:35,930
It's going to ignore the else.

498
00:24:35,930 --> 00:24:38,282
And it's going to run
the loop on line 3 again.

499
00:24:38,282 --> 00:24:39,865
So it's going to do
another iteration.

500
00:24:42,470 --> 00:24:44,280
So line 6 and 7, how many times?

501
00:24:44,280 --> 00:24:47,120
AUDIENCE: W minus L?

502
00:24:47,120 --> 00:24:48,450
PROFESSOR: W minus L.

503
00:24:48,450 --> 00:24:50,526
AUDIENCE: Like, if
it's the difference

504
00:24:50,526 --> 00:24:54,962
in number of words and
number of-- unique words.

505
00:24:54,962 --> 00:24:55,670
PROFESSOR: Smart.

506
00:24:55,670 --> 00:25:00,520
You gave the precise answer
right the first time.

507
00:25:00,520 --> 00:25:02,230
W minus L. Very good.

508
00:25:06,830 --> 00:25:13,960
And what happens once--
what happens once this runs?

509
00:25:13,960 --> 00:25:21,730
Why do I know-- why do I know
that the if won't be-- Oh.

510
00:25:21,730 --> 00:25:23,660
Sorry, I'm getting
myself confused.

511
00:25:23,660 --> 00:25:27,950
So it's going to run W
minus L times total, right?

512
00:25:27,950 --> 00:25:31,570
Total times overall
without this W thing here,

513
00:25:31,570 --> 00:25:33,547
so I should put
an arrow and say--

514
00:25:38,320 --> 00:25:41,070
Now suppose I
didn't notice this.

515
00:25:41,070 --> 00:25:43,510
Is there another way I
can get the decent bounds?

516
00:25:43,510 --> 00:25:45,165
So this is the right,
perfect answer.

517
00:25:45,165 --> 00:25:46,290
You have this, you're done.

518
00:25:46,290 --> 00:25:47,623
You don't need to think further.

519
00:25:47,623 --> 00:25:49,420
Suppose you don't have this.

520
00:25:49,420 --> 00:25:50,350
What else can you do?

521
00:25:53,479 --> 00:25:55,070
AUDIENCE: But we
don't have what?

522
00:25:55,070 --> 00:25:57,770
PROFESSOR: If I didn't
notice that, hey, there

523
00:25:57,770 --> 00:26:03,330
are L words-- There
are L new words,

524
00:26:03,330 --> 00:26:07,690
W words total, so W minus
L words repeat themselves.

525
00:26:07,690 --> 00:26:11,230
So this is how many times
the if is going to be true.

526
00:26:11,230 --> 00:26:13,330
If I didn't have
that, then I could

527
00:26:13,330 --> 00:26:16,750
see that line 7 breaks
out of the loop.

528
00:26:16,750 --> 00:26:19,314
So if that if runs
once, then we're done.

529
00:26:19,314 --> 00:26:20,230
We're out of the loop.

530
00:26:20,230 --> 00:26:22,430
It's not going to run again.

531
00:26:22,430 --> 00:26:26,210
So a bound that's not as
precise as the one you gave me

532
00:26:26,210 --> 00:26:31,265
is 1 times W, because it
runs, at most, once per loop.

533
00:26:40,030 --> 00:26:42,720
Does people see this?

534
00:26:42,720 --> 00:26:45,810
So this is an easy way
to cop out of thinking.

535
00:26:45,810 --> 00:26:48,230
And I don't like to think more
than necessary, because you

536
00:26:48,230 --> 00:26:50,100
have finite time on
a test or in life,

537
00:26:50,100 --> 00:26:52,720
and you don't want to spend
too much time on one thing.

538
00:26:55,640 --> 00:26:56,670
We covered the loop.

539
00:26:56,670 --> 00:26:59,010
Let's look at else and append.

540
00:26:59,010 --> 00:27:00,510
I already got a
helpful question,

541
00:27:00,510 --> 00:27:03,940
so I explained when
the else would run.

542
00:27:03,940 --> 00:27:05,240
The running time for else.

543
00:27:05,240 --> 00:27:07,570
Else is a flow--
control flow statement.

544
00:27:07,570 --> 00:27:10,244
It's like break,
so Python will keep

545
00:27:10,244 --> 00:27:11,910
track of whether a
loop completed or not

546
00:27:11,910 --> 00:27:12,899
in constant time.

547
00:27:12,899 --> 00:27:13,690
I'll give you that.

548
00:27:13,690 --> 00:27:16,750
That's in the cost model.

549
00:27:16,750 --> 00:27:21,170
How many times does
this else run, at most?

550
00:27:21,170 --> 00:27:22,840
OK, L. Good.

551
00:27:25,546 --> 00:27:26,865
Perfect.

552
00:27:26,865 --> 00:27:27,573
How about line 9?

553
00:27:36,012 --> 00:27:36,976
Loop stops here.

554
00:27:39,890 --> 00:27:42,200
How about line 9?

555
00:27:42,200 --> 00:27:43,500
How many times does it run?

556
00:27:43,500 --> 00:27:44,886
That's easy.

557
00:27:44,886 --> 00:27:46,290
AUDIENCE: [INAUDIBLE].

558
00:27:46,290 --> 00:27:47,800
PROFESSOR: Yep.

559
00:27:47,800 --> 00:27:48,820
And what does it do?

560
00:27:48,820 --> 00:27:49,590
It's an append.

561
00:27:49,590 --> 00:27:53,111
What's the cost for append?

562
00:27:53,111 --> 00:27:53,902
AUDIENCE: Constant.

563
00:27:57,380 --> 00:27:59,680
PROFESSOR: OK Line 10
runs how many times?

564
00:27:59,680 --> 00:28:02,554
What's the cost?

565
00:28:02,554 --> 00:28:03,054
AUDIENCE: 1?

566
00:28:07,307 --> 00:28:09,390
PROFESSOR: You guys didn't
listen to me last time.

567
00:28:09,390 --> 00:28:11,020
So I was saying you have
to look at the notes

568
00:28:11,020 --> 00:28:12,270
and you have to practice this.

569
00:28:12,270 --> 00:28:14,480
Because you have to have
this model in your mind.

570
00:28:14,480 --> 00:28:16,160
So that when you're
writing code,

571
00:28:16,160 --> 00:28:17,710
this has to happen
automatically.

572
00:28:17,710 --> 00:28:20,306
You shouldn't have to
think explicitly about it.

573
00:28:20,306 --> 00:28:22,180
Because if you do, you're
not going to do it.

574
00:28:22,180 --> 00:28:23,679
AUDIENCE: For the
else, shouldn't it

575
00:28:23,679 --> 00:28:28,612
be W tim-- I mean, it would
be called W times not L times?

576
00:28:28,612 --> 00:28:30,744
Because you want to look
at the outer loop and not

577
00:28:30,744 --> 00:28:32,060
the inner loop?

578
00:28:32,060 --> 00:28:35,942
So it can-- you call all--
once at the end of the total--

579
00:28:35,942 --> 00:28:37,080
the inner for.

580
00:28:37,080 --> 00:28:37,580
Right?

581
00:28:37,580 --> 00:28:43,590
So, so it could be-- happen W
times, maximum, not L times.

582
00:28:43,590 --> 00:28:47,960
'Cause the L is the for loop
that the else coincides with.

583
00:28:47,960 --> 00:28:50,502
And the else would
only happen once

584
00:28:50,502 --> 00:28:53,410
for every total
iteration of that.

585
00:28:53,410 --> 00:28:55,400
PROFESSOR: OK so
you're proposing W

586
00:28:55,400 --> 00:28:58,070
as the bound for
the else, right?

587
00:28:58,070 --> 00:28:59,643
Here.

588
00:28:59,643 --> 00:29:00,939
AUDIENCE: Yeah.

589
00:29:00,939 --> 00:29:02,980
PROFESSOR: So I could say,
hey, it runs, at most,

590
00:29:02,980 --> 00:29:04,140
once for outer loop.

591
00:29:04,140 --> 00:29:06,430
So it's, at most, W times.

592
00:29:06,430 --> 00:29:08,020
This is a nice, easy argument.

593
00:29:08,020 --> 00:29:09,680
We have a bound.

594
00:29:09,680 --> 00:29:10,670
L is a tighter bound.

595
00:29:10,670 --> 00:29:14,730
And when I got L,
what happened here

596
00:29:14,730 --> 00:29:17,630
was the same kind of
thinking that you did earlier

597
00:29:17,630 --> 00:29:18,960
to get this.

598
00:29:18,960 --> 00:29:21,580
So this bound is good,
this bound is tighter.

599
00:29:21,580 --> 00:29:24,260
This bound is good,
this bound is tighter.

600
00:29:24,260 --> 00:29:27,850
And the argument behind
this one is that, hey, this

601
00:29:27,850 --> 00:29:30,570
else only happens for new words.

602
00:29:30,570 --> 00:29:33,300
If there's no new-- if the
word that I looked at is old,

603
00:29:33,300 --> 00:29:36,900
then break is going to execute.

604
00:29:36,900 --> 00:29:39,040
And else is not
going to execute.

605
00:29:41,620 --> 00:29:45,330
So that's why I can say L.

606
00:29:45,330 --> 00:29:47,260
The beauty of
asymptotics is that I

607
00:29:47,260 --> 00:29:49,290
can use either of the
bounds and I'll still

608
00:29:49,290 --> 00:29:50,950
get the correct running time.

609
00:29:50,950 --> 00:29:53,510
So I'm not going to
fuss over it too much.

610
00:29:53,510 --> 00:29:56,290
I like the tighter ones, because
it means you guys are thinking.

611
00:29:56,290 --> 00:29:57,790
And you're looking
at the algorithm,

612
00:29:57,790 --> 00:29:59,330
and you're understanding it.

613
00:29:59,330 --> 00:30:00,830
But if you don't
have them, you'll

614
00:30:00,830 --> 00:30:02,329
still get the correct
running times.

615
00:30:02,329 --> 00:30:04,640
So I think that's nice.

616
00:30:04,640 --> 00:30:07,540
PROFESSOR: OK, let's get the
running time for everything.

617
00:30:07,540 --> 00:30:08,800
Can someone do it in one step?

618
00:30:13,890 --> 00:30:15,830
Then let's do it step by step.

619
00:30:15,830 --> 00:30:17,710
So let's compute
partial products.

620
00:30:17,710 --> 00:30:20,660
What are they?

621
00:30:20,660 --> 00:30:22,118
AUDIENCE: 1.

622
00:30:22,118 --> 00:30:25,520
W. LW.

623
00:30:25,520 --> 00:30:29,894
L, W, little w-- with the bar.

624
00:30:29,894 --> 00:30:31,840
AUDIENCE: (LAUGHING)
With the bar.

625
00:30:31,840 --> 00:30:32,631
PROFESSOR: Awesome.

626
00:30:32,631 --> 00:30:39,258
AUDIENCE: W minus L. W. L, L, 1.

627
00:30:42,130 --> 00:30:45,020
PROFESSOR: OK, so if I add them
up, this is all asymptotic.

628
00:30:45,020 --> 00:30:47,530
So the biggest
one will dominate.

629
00:30:47,530 --> 00:30:49,190
In general, I can
just take a max

630
00:30:49,190 --> 00:30:52,100
instead of doing
actual addition.

631
00:30:52,100 --> 00:30:54,296
So who dominates here?

632
00:30:54,296 --> 00:30:57,782
AUDIENCE: The fourth one down?

633
00:30:57,782 --> 00:30:59,780
PROFESSOR: Fourth-- OK.

634
00:30:59,780 --> 00:31:01,680
Yep.

635
00:31:01,680 --> 00:31:08,820
So line 5 is the biggest time
consumer in this algorithm.

636
00:31:08,820 --> 00:31:11,430
And I know it's W times w-bar.

637
00:31:11,430 --> 00:31:12,940
So now I'm going
to copy it here.

638
00:31:12,940 --> 00:31:16,575
What did I say I'll do
when I'm copying it here?

639
00:31:16,575 --> 00:31:18,030
AUDIENCE: [INAUDIBLE].

640
00:31:18,030 --> 00:31:18,720
PROFESSOR: Yep.

641
00:31:18,720 --> 00:31:22,640
So I'm assuming English
five- to ten-character words.

642
00:31:22,640 --> 00:31:26,320
W L. And W L is
smaller than w squared,

643
00:31:26,320 --> 00:31:28,780
so the assumption that
I had before is correct.

644
00:31:28,780 --> 00:31:30,790
I don't have to
change anything here.

645
00:31:30,790 --> 00:31:31,597
That is good.

646
00:31:38,560 --> 00:31:42,190
So we noticed last time,
and already forgot by now,

647
00:31:42,190 --> 00:31:46,230
that the biggest problem in
this whole implementation

648
00:31:46,230 --> 00:31:50,800
was the plus in get
words from line list.

649
00:31:50,800 --> 00:31:54,300
Suppose we forgot about it and
we have this big pile of code,

650
00:31:54,300 --> 00:31:56,100
how do I go about
making it faster?

651
00:31:59,080 --> 00:32:01,720
Method one, go
through every method.

652
00:32:01,720 --> 00:32:03,680
Do this.

653
00:32:03,680 --> 00:32:06,740
Compute the running times, and
see which one's the slowest.

654
00:32:06,740 --> 00:32:10,190
Does this scale to
1,000 lines of code?

655
00:32:10,190 --> 00:32:11,730
Not so much.

656
00:32:11,730 --> 00:32:14,700
We're going to be giving you
roughly 1,000 lines of code

657
00:32:14,700 --> 00:32:17,960
for PSET2, and we're going
to ask it to make it faster.

658
00:32:17,960 --> 00:32:19,560
Do want to understand
everything?

659
00:32:19,560 --> 00:32:22,350
No.

660
00:32:22,350 --> 00:32:24,590
Instead what you want
to do is run the code

661
00:32:24,590 --> 00:32:25,824
through a profiler.

662
00:32:25,824 --> 00:32:26,740
So we have a computer.

663
00:32:26,740 --> 00:32:28,770
The computer can
tell you which line

664
00:32:28,770 --> 00:32:30,877
takes the most time to run.

665
00:32:30,877 --> 00:32:32,710
So you don't have to
do it on pen and paper.

666
00:32:32,710 --> 00:32:34,940
Whenever we can automate, do so.

667
00:32:34,940 --> 00:32:36,830
So we'll teach you
how to run a profiler.

668
00:32:36,830 --> 00:32:38,110
It's in the notes.

669
00:32:38,110 --> 00:32:40,870
And if you look in
the code outputs right

670
00:32:40,870 --> 00:32:45,080
after [INAUDIBLE], you'll
see a profiler output.

671
00:32:45,080 --> 00:32:47,580
So what that tells you
is for each function,

672
00:32:47,580 --> 00:32:51,980
how much time does it take--
that's the total time?

673
00:32:51,980 --> 00:32:53,670
And there's the
cumulative time, which

674
00:32:53,670 --> 00:32:56,300
is how much does it take
together with its children?

675
00:32:59,680 --> 00:33:04,380
In this case for word
frequencies from file, order

676
00:33:04,380 --> 00:33:06,360
of W-- this is how
much time it takes

677
00:33:06,360 --> 00:33:07,860
including the
functions it's called.

678
00:33:07,860 --> 00:33:10,540
So this is the cumulative time.

679
00:33:10,540 --> 00:33:13,940
Cumulative time is useful if
I'm during runtime analysis.

680
00:33:13,940 --> 00:33:17,120
Is not so useful if
I'm looking at where's

681
00:33:17,120 --> 00:33:18,745
the slowness in my program?

682
00:33:18,745 --> 00:33:20,370
Because if you look
at cumulative time,

683
00:33:20,370 --> 00:33:24,510
you might see the slowness
in one of the functions that

684
00:33:24,510 --> 00:33:27,930
get-- that word frequencies
from file called.

685
00:33:27,930 --> 00:33:30,560
So the cumulative time
for this is really big,

686
00:33:30,560 --> 00:33:35,140
but the total time-- the
time that's spent inside it--

687
00:33:35,140 --> 00:33:36,824
is not that bad.

688
00:33:36,824 --> 00:33:38,240
Instead if you
look at total time,

689
00:33:38,240 --> 00:33:41,520
you'll see that the worst
function is-- surprise,

690
00:33:41,520 --> 00:33:44,670
surprise-- get words
from line list.

691
00:33:44,670 --> 00:33:48,150
So 5 lines to look at--
hey, there's a plus there.

692
00:33:48,150 --> 00:33:50,900
I remember from lecture
that plus copies over lists,

693
00:33:50,900 --> 00:33:54,700
and it's kind of slow, so maybe
I should use something else.

694
00:33:54,700 --> 00:33:57,340
Does d remember what
else we should use?

695
00:33:57,340 --> 00:34:00,290
We talked about that
last recitation.

696
00:34:00,290 --> 00:34:02,200
Extend.

697
00:34:02,200 --> 00:34:03,580
Document distance 2 .

698
00:34:03,580 --> 00:34:06,580
The only difference between
it an document distance 1

699
00:34:06,580 --> 00:34:09,639
is get words from
line list, line 5.

700
00:34:09,639 --> 00:34:12,580
That plus turned into an extend.

701
00:34:12,580 --> 00:34:16,330
One character turned to six,
so about eight keystrokes,

702
00:34:16,330 --> 00:34:18,440
30% runtime improvement.

703
00:34:18,440 --> 00:34:20,500
Very good return on investment.

704
00:34:20,500 --> 00:34:24,530
Everything else is
going to be harder.

705
00:34:24,530 --> 00:34:26,840
So let's look at that line,
because that line dominated

706
00:34:26,840 --> 00:34:29,770
the running time of get
words from line list.

707
00:34:29,770 --> 00:34:32,440
And think what's, then,
your running time for it

708
00:34:32,440 --> 00:34:33,449
now that I have extend?

709
00:35:05,440 --> 00:35:07,800
Does anyone remember what get
words from line list does?

710
00:35:16,130 --> 00:35:19,765
AUDIENCE: It gets the
words out of the document.

711
00:35:19,765 --> 00:35:20,390
PROFESSOR: OK .

712
00:35:20,390 --> 00:35:22,110
It gets the word
out of the document.

713
00:35:22,110 --> 00:35:25,720
So it reads a document that
looks like a regular text file,

714
00:35:25,720 --> 00:35:30,240
and it gets this out of it.

715
00:35:30,240 --> 00:35:33,190
The way it does that is
it goes through each line,

716
00:35:33,190 --> 00:35:36,640
reads the line as a string,
breaks up the string

717
00:35:36,640 --> 00:35:39,310
into a list of words,
and then combines

718
00:35:39,310 --> 00:35:42,240
all those lists
of words together.

719
00:35:42,240 --> 00:35:44,420
Get words from string,
line 4, returns

720
00:35:44,420 --> 00:35:47,114
the number of words in a line.

721
00:35:47,114 --> 00:35:49,030
Sorry, the list of words
in the line, and then

722
00:35:49,030 --> 00:35:51,510
extend combines
the lists together.

723
00:35:51,510 --> 00:35:56,140
Let's add some constants
that we had last time.

724
00:35:56,140 --> 00:36:08,220
I think we had that K is the
number of words per line.

725
00:36:08,220 --> 00:36:16,850
And Z is the number
of total lines.

726
00:36:16,850 --> 00:36:29,492
So this K is actually-- W
over Z. No, I don't like that.

727
00:36:29,492 --> 00:36:29,992
That's work.

728
00:36:32,890 --> 00:36:43,730
So Z is K W over
K. And we argued

729
00:36:43,730 --> 00:36:45,370
that we're not going
to talk about K

730
00:36:45,370 --> 00:36:48,490
too much, because a document
needs to fit on a screen

731
00:36:48,490 --> 00:36:50,820
or needs to fit on
a piece of paper.

732
00:36:50,820 --> 00:36:52,740
So the line length has
to be finite, right?

733
00:36:52,740 --> 00:36:56,069
Otherwise, if I have a document
that has 10,000 character

734
00:36:56,069 --> 00:36:57,860
lines, I can't even
write it on this board.

735
00:36:57,860 --> 00:36:59,940
Even though it's really long.

736
00:36:59,940 --> 00:37:03,300
So K, the number of words in
a line, is going to be finite.

737
00:37:03,300 --> 00:37:06,270
But we'll need it
for this analysis.

738
00:37:06,270 --> 00:37:10,650
So line 4 returns a list
with the words on a line.

739
00:37:10,650 --> 00:37:12,225
How many elements in that list?

740
00:37:19,010 --> 00:37:22,660
This returns a list
with how many elements?

741
00:37:28,408 --> 00:37:31,459
AUDIENCE: Could K
[INAUDIBLE] words on line.

742
00:37:31,459 --> 00:37:33,250
PROFESSOR: So even if
I ask easy questions,

743
00:37:33,250 --> 00:37:34,249
you guys have to answer.

744
00:37:34,249 --> 00:37:36,900
Because otherwise I'll
stall until you do.

745
00:37:36,900 --> 00:37:39,240
So 4 gives me a list
with K elements,

746
00:37:39,240 --> 00:37:43,060
and 5 appends that small
list to the big list of words

747
00:37:43,060 --> 00:37:45,040
in the entire document.

748
00:37:45,040 --> 00:37:47,310
Before I used plus and
that did something bad.

749
00:37:47,310 --> 00:37:48,460
Now I'm using extend.

750
00:37:48,460 --> 00:37:53,248
What's the cost of
one extend called?

751
00:37:53,248 --> 00:37:54,706
AUDIENCE: Constant?

752
00:37:54,706 --> 00:37:55,912
Order of K?

753
00:37:55,912 --> 00:37:58,120
PROFESSOR: I want to know
your Python implementation.

754
00:38:02,288 --> 00:38:04,708
AUDIENCE: If Python did
it like a linked list ,

755
00:38:04,708 --> 00:38:08,477
like doubly linked lists,
could be order constant?

756
00:38:08,477 --> 00:38:09,810
PROFESSOR: I like your question.

757
00:38:09,810 --> 00:38:13,720
So if Python lists were
actually linked lists--

758
00:38:13,720 --> 00:38:16,060
so if the name
wasn't confusing--

759
00:38:16,060 --> 00:38:18,320
then yes, merging two
lists would be constant.

760
00:38:18,320 --> 00:38:21,520
But then accessing one element
in the middle of a list

761
00:38:21,520 --> 00:38:24,160
would not be constant anymore.

762
00:38:24,160 --> 00:38:25,910
Say I want to access
element number

763
00:38:25,910 --> 00:38:29,160
200 in a list of
10,000 elements.

764
00:38:29,160 --> 00:38:31,165
I have to go through
200 elements.

765
00:38:31,165 --> 00:38:32,540
We didn't do linked
lists when we

766
00:38:32,540 --> 00:38:33,930
ran it, so don't worry about it.

767
00:38:33,930 --> 00:38:36,230
So they decided that it's
less confusing to have

768
00:38:36,230 --> 00:38:39,710
lists actually be arrays.

769
00:38:39,710 --> 00:38:42,500
So Python lists, array in CLRS.

770
00:38:45,175 --> 00:38:47,300
AUDIENCE: That can use
their storage contiguously?

771
00:38:47,300 --> 00:38:48,200
PROFESSOR: Yep.

772
00:38:48,200 --> 00:38:50,930
AUDIENCE: So when you copy,
why can't you copy a block?

773
00:38:50,930 --> 00:38:52,731
Why do you have to
access each other?

774
00:38:52,731 --> 00:38:54,720
Does that make sense?

775
00:38:54,720 --> 00:38:56,930
PROFESSOR: So, you
can copy a block,

776
00:38:56,930 --> 00:38:58,600
but in order to copy
a block, you still

777
00:38:58,600 --> 00:39:00,270
have to move everything.

778
00:39:00,270 --> 00:39:03,770
So if your block
is 10,000 elements,

779
00:39:03,770 --> 00:39:08,370
you still have to move 10,000
bytes times element size.

780
00:39:08,370 --> 00:39:11,490
And the CPU works on 4 bytes
at a time or 8 bytes at a time.

781
00:39:13,990 --> 00:39:14,490
OK.

782
00:39:14,490 --> 00:39:16,150
But this is the
right kind of thing

783
00:39:16,150 --> 00:39:17,910
to be thinking about when
you're doing the cost model.

784
00:39:17,910 --> 00:39:19,300
And this is what you
want to have in your head

785
00:39:19,300 --> 00:39:20,424
when you're writing Python.

786
00:39:20,424 --> 00:39:21,610
So, good.

787
00:39:21,610 --> 00:39:23,492
I like your question.

788
00:39:23,492 --> 00:39:24,950
I wanted to say
that at some point,

789
00:39:24,950 --> 00:39:26,366
but I didn't get
the occasion yet.

790
00:39:26,366 --> 00:39:28,050
Lists in Python are
not lists in CLRS.

791
00:39:30,561 --> 00:39:31,060
OK.

792
00:39:31,060 --> 00:39:35,030
So with that long
explanation, an extend

793
00:39:35,030 --> 00:39:37,590
is a list-- is a sequence
of appends, right?

794
00:39:37,590 --> 00:39:39,020
If you have two
lists and you want

795
00:39:39,020 --> 00:39:41,980
to extend the first
list to the second list,

796
00:39:41,980 --> 00:39:44,640
extend basically goes through
each element in the second list

797
00:39:44,640 --> 00:39:47,800
and calls append
on the first list.

798
00:39:47,800 --> 00:39:50,380
The list is length K, the
second list is length K,

799
00:39:50,380 --> 00:39:52,510
so K appends are
going to happen.

800
00:39:52,510 --> 00:39:54,200
And append is constant time.

801
00:39:54,200 --> 00:39:55,910
Total cost, K.

802
00:39:55,910 --> 00:39:58,000
Now many times does line 5 run?

803
00:40:01,227 --> 00:40:03,265
AUDIENCE: [INAUDIBLE].

804
00:40:03,265 --> 00:40:04,140
PROFESSOR: Very good.

805
00:40:07,140 --> 00:40:09,200
So what is the
total running cost

806
00:40:09,200 --> 00:40:12,470
of the algorithm if this
is the line that dominates?

807
00:40:12,470 --> 00:40:13,970
I don't want to do
every other line,

808
00:40:13,970 --> 00:40:16,250
so I'll promise that this
is the line that dominates.

809
00:40:16,250 --> 00:40:19,300
What is the total running time?

810
00:40:19,300 --> 00:40:20,377
AUDIENCE: [INAUDIBLE].

811
00:40:20,377 --> 00:40:20,960
PROFESSOR: OK.

812
00:40:20,960 --> 00:40:22,230
Very good.

813
00:40:22,230 --> 00:40:23,390
K times L. And that is?

814
00:40:27,530 --> 00:40:33,160
Oh, K. So I shouldn't have
said K times L, sorry.

815
00:40:33,160 --> 00:40:35,430
L is not the number of lines.

816
00:40:35,430 --> 00:40:40,250
L is the number here, so
Z is the number of lines.

817
00:40:40,250 --> 00:40:45,720
So it's K times Z. Which
means I'm using bad letters,

818
00:40:45,720 --> 00:40:47,120
so please bear with me.

819
00:40:47,120 --> 00:40:48,830
We'll forget about
them in a minute.

820
00:40:48,830 --> 00:40:50,830
So K times Z equals?

821
00:40:50,830 --> 00:40:53,470
AUDIENCE: W.

822
00:40:53,470 --> 00:40:57,660
PROFESSOR: W. So
what do I write here?

823
00:41:01,994 --> 00:41:02,910
AUDIENCE: [INAUDIBLE].

824
00:41:02,910 --> 00:41:03,530
PROFESSOR: OK.

825
00:41:03,530 --> 00:41:04,760
Good.

826
00:41:04,760 --> 00:41:05,830
Very good.

827
00:41:05,830 --> 00:41:08,800
What do I write here?

828
00:41:08,800 --> 00:41:11,862
Word frequencies from file.

829
00:41:11,862 --> 00:41:13,137
AUDIENCE: W L.

830
00:41:13,137 --> 00:41:13,720
PROFESSOR: OK.

831
00:41:16,460 --> 00:41:17,335
What do I write here?

832
00:41:21,407 --> 00:41:21,990
AUDIENCE: W 1.

833
00:41:35,150 --> 00:41:38,237
AUDIENCE: We have to-- put L
1 squared and L 2 squared back

834
00:41:38,237 --> 00:41:38,950
in.

835
00:41:38,950 --> 00:41:40,160
PROFESSOR: OK, do we?

836
00:41:40,160 --> 00:41:43,060
Well, the-- No.

837
00:41:43,060 --> 00:41:45,535
W, you want to always be
bigger than L. I hope.

838
00:41:45,535 --> 00:41:46,160
PROFESSOR: Yep.

839
00:41:46,160 --> 00:41:48,680
So if I put it
in, L 1 squared is

840
00:41:48,680 --> 00:41:53,570
L 1 L 1, which is
smaller than W 1 L 1.

841
00:41:53,570 --> 00:41:55,560
But I have to think about
it before doing that.

842
00:41:55,560 --> 00:41:58,260
I can't ignore this completely.

843
00:42:02,240 --> 00:42:04,210
So this is document distance 2.

844
00:42:04,210 --> 00:42:07,207
The asymptotic complexity
improved, the running time

845
00:42:07,207 --> 00:42:07,707
improved.

846
00:42:14,530 --> 00:42:16,410
Next thing that happens
to make this faster

847
00:42:16,410 --> 00:42:19,180
is-- I'm not giving you
the profiler output,

848
00:42:19,180 --> 00:42:23,040
but you have to take my word for
it that the longest methods are

849
00:42:23,040 --> 00:42:27,170
count frequency
and inner product.

850
00:42:27,170 --> 00:42:29,770
So what I'm going
to do first is I'm

851
00:42:29,770 --> 00:42:32,320
going to make inner
product faster.

852
00:42:32,320 --> 00:42:34,400
But in order to do
that, I have to make

853
00:42:34,400 --> 00:42:36,550
word frequencies
for file slower.

854
00:42:36,550 --> 00:42:38,050
And that is because
I happen to know

855
00:42:38,050 --> 00:42:44,030
an algorithm for inner product
that is a lot faster, if only

856
00:42:44,030 --> 00:42:47,575
you can promise me that in this
list, the words are ordered.

857
00:42:47,575 --> 00:42:48,450
The words are sorted.

858
00:42:51,210 --> 00:42:57,000
So what happens in that list
1 is, the moment I see a word,

859
00:42:57,000 --> 00:42:59,510
if it's not in the list
I add it at the end.

860
00:42:59,510 --> 00:43:01,820
So the words here show up
pretty much in the order

861
00:43:01,820 --> 00:43:03,710
that they show up in the file.

862
00:43:03,710 --> 00:43:05,680
Well, if instead I
could have something

863
00:43:05,680 --> 00:43:11,090
that looks like-- what is it?

864
00:43:11,090 --> 00:43:12,470
Fox.

865
00:43:12,470 --> 00:43:13,731
In.

866
00:43:13,731 --> 00:43:14,230
His.

867
00:43:16,890 --> 00:43:19,850
Hat is somewhere here.

868
00:43:19,850 --> 00:43:22,650
So if these words
would be in this order,

869
00:43:22,650 --> 00:43:25,210
together with the
word that I'm missing,

870
00:43:25,210 --> 00:43:27,470
then I can combine two lists.

871
00:43:27,470 --> 00:43:31,620
I can do an inner
product a lot faster.

872
00:43:31,620 --> 00:43:34,130
Let's see how that would work.

873
00:43:34,130 --> 00:43:37,360
And I'm already getting confused
my words, so let's do a trick.

874
00:43:37,360 --> 00:43:39,810
Let's say that instead of
words, we'll use numbers.

875
00:43:39,810 --> 00:43:41,570
So instead of saying
"the," I'm going

876
00:43:41,570 --> 00:43:45,060
to say the is the 50th
word in the dictionary.

877
00:43:45,060 --> 00:43:47,957
So I'm going to use number 50.

878
00:43:47,957 --> 00:43:49,290
Because I want to write numbers.

879
00:43:49,290 --> 00:43:50,940
The numbers are
easier to deal with.

880
00:43:50,940 --> 00:43:53,460
So say the first
document that I have--

881
00:43:53,460 --> 00:43:59,840
the fox is in the hat-- as
words number 3, 4, 6, 8, and 9.

882
00:43:59,840 --> 00:44:03,140
And they show up twice,
once, once, once.

883
00:44:06,360 --> 00:44:08,720
9, once.

884
00:44:08,720 --> 00:44:17,160
And say I'm trying to compute
the inner product of this

885
00:44:17,160 --> 00:44:19,370
with a document that
has word number 2

886
00:44:19,370 --> 00:44:23,290
showing up once, word number
3 showing up once, word number

887
00:44:23,290 --> 00:44:28,350
6 showing up once, word
number 7 showing up once,

888
00:44:28,350 --> 00:44:30,174
word number 8 showing up once.

889
00:44:33,737 --> 00:44:35,820
OK, the algorithm for inner
product that we talked

890
00:44:35,820 --> 00:44:39,050
about last time was, go
through each element in one

891
00:44:39,050 --> 00:44:42,710
of the vectors, find an
element with the same word

892
00:44:42,710 --> 00:44:45,410
in the second vector,
and if you can find it,

893
00:44:45,410 --> 00:44:47,370
then take the number of
times the words show up

894
00:44:47,370 --> 00:44:48,800
and multiply them.

895
00:44:48,800 --> 00:44:50,560
So here I have a 3, 2.

896
00:44:50,560 --> 00:44:54,205
I would find this element
here that has 3, 1.

897
00:44:54,205 --> 00:44:55,205
I have these everywhere.

898
00:44:58,030 --> 00:45:00,370
And I take the 2 and the
1 and I multiply them.

899
00:45:00,370 --> 00:45:03,920
So the 3s have to be the same,
then I think the 2 and the 1,

900
00:45:03,920 --> 00:45:05,500
and I multiply them.

901
00:45:05,500 --> 00:45:08,270
And for all the elements
where that's case,

902
00:45:08,270 --> 00:45:10,430
I add up the results
of the multiplication.

903
00:45:13,420 --> 00:45:16,930
So step one, go through
each element in a vector.

904
00:45:16,930 --> 00:45:20,095
That's not going to get
faster if this other vector is

905
00:45:20,095 --> 00:45:21,260
sorted, right?

906
00:45:21,260 --> 00:45:27,740
But the step of looking up the
second element can be sped up.

907
00:45:27,740 --> 00:45:31,430
The first and easiest way I
can speed this up is, hey,

908
00:45:31,430 --> 00:45:32,250
this is sorted.

909
00:45:32,250 --> 00:45:37,340
If I'm looking up three,
why not do a binary search?

910
00:45:37,340 --> 00:45:39,640
What would be the
cost if I do that?

911
00:45:39,640 --> 00:45:43,670
So here I have L1 element,
here I have L2 elements.

912
00:45:43,670 --> 00:45:47,400
What's the cost of doing
one binary search here?

913
00:45:47,400 --> 00:45:48,940
AUDIENCE: Log of L2?

914
00:45:48,940 --> 00:45:51,660
PROFESSOR: OK.

915
00:45:51,660 --> 00:45:55,201
And how many times do
I do a binary search?

916
00:45:55,201 --> 00:45:57,682
AUDIENCE: [INAUDIBLE].

917
00:45:57,682 --> 00:45:59,640
PROFESSOR: So if I go
through each element here

918
00:45:59,640 --> 00:46:02,139
and I do a binary search, which
is a nice and easy algorithm

919
00:46:02,139 --> 00:46:03,800
that I can explain
in 10 seconds,

920
00:46:03,800 --> 00:46:06,520
I'm already faster than L1 L2.

921
00:46:06,520 --> 00:46:08,680
So it's worth sorting.

922
00:46:08,680 --> 00:46:11,200
Now, the algorithm
that we use in class

923
00:46:11,200 --> 00:46:15,790
takes time proportional
to L1 plus L2.

924
00:46:15,790 --> 00:46:17,590
So that's even
trickier, and it's

925
00:46:17,590 --> 00:46:19,300
going to take a bit
more time to explain.

926
00:46:21,970 --> 00:46:24,250
Does-- did anyone understand
the algorithm for class

927
00:46:24,250 --> 00:46:25,690
and wants to help me explain it?

928
00:46:30,580 --> 00:46:32,790
Didn't think so.

929
00:46:32,790 --> 00:46:34,140
OK.

930
00:46:34,140 --> 00:46:39,230
So idea is that both of
these vectors are sorted.

931
00:46:39,230 --> 00:46:46,390
So if I have a 3 here and I
found my 3 here, next time when

932
00:46:46,390 --> 00:46:49,850
I get to 4, I know
for sure that 4 is not

933
00:46:49,850 --> 00:46:54,370
going to be anywhere here,
because this vector is sorted.

934
00:46:54,370 --> 00:46:57,550
Say I couldn't find
4, then I go to 6.

935
00:46:57,550 --> 00:47:00,850
If 6 is here, when I
have to look for 8,

936
00:47:00,850 --> 00:47:04,260
I know for sure that 8 is not
going to be anywhere up here.

937
00:47:07,210 --> 00:47:11,290
So what I do is, I have a
pointer here that remembers,

938
00:47:11,290 --> 00:47:15,684
where's the last element
that I have seen?

939
00:47:15,684 --> 00:47:16,975
Does this make sense to people?

940
00:47:19,890 --> 00:47:22,500
So when I start here
and I look at 3,

941
00:47:22,500 --> 00:47:26,300
I have a pointer here that says,
I didn't see anything here.

942
00:47:26,300 --> 00:47:28,170
I look at 2, it's not 3.

943
00:47:28,170 --> 00:47:29,050
It's smaller.

944
00:47:29,050 --> 00:47:29,940
I look at 3.

945
00:47:29,940 --> 00:47:30,950
I found it, good.

946
00:47:30,950 --> 00:47:32,720
I do my product.

947
00:47:32,720 --> 00:47:35,610
Then I go look for the
next element here, 4.

948
00:47:35,610 --> 00:47:36,845
I look at 6.

949
00:47:36,845 --> 00:47:38,840
6 is bigger than 4.

950
00:47:38,840 --> 00:47:42,980
So I know for sure that nothing
below it is going to be 4.

951
00:47:42,980 --> 00:47:46,140
So I can stop right here
and keep my pointer here.

952
00:47:46,140 --> 00:47:47,840
Then I go to 6 here.

953
00:47:47,840 --> 00:47:48,580
And I look here.

954
00:47:48,580 --> 00:47:49,300
Where did I stop?

955
00:47:49,300 --> 00:47:50,520
I stop here.

956
00:47:50,520 --> 00:47:52,030
This element matches this.

957
00:47:52,030 --> 00:47:53,750
I do my product.

958
00:47:53,750 --> 00:47:56,150
Keep it in.

959
00:47:56,150 --> 00:47:58,030
Now I go to 8 here.

960
00:47:58,030 --> 00:47:59,310
I was at 6.

961
00:47:59,310 --> 00:48:01,320
6 is smaller than 8.

962
00:48:01,320 --> 00:48:03,236
7 is smaller than 8.

963
00:48:03,236 --> 00:48:04,420
8 is equal to 8.

964
00:48:04,420 --> 00:48:05,630
I found something.

965
00:48:05,630 --> 00:48:06,850
Sweet.

966
00:48:06,850 --> 00:48:08,500
I do a product.

967
00:48:08,500 --> 00:48:09,730
And then I stop.

968
00:48:09,730 --> 00:48:12,320
And I look at the next element.

969
00:48:12,320 --> 00:48:15,830
I know for sure that nothing
here is going to be 9,

970
00:48:15,830 --> 00:48:17,610
so I can keep looking down.

971
00:48:17,610 --> 00:48:20,160
I hit the end of my list.

972
00:48:20,160 --> 00:48:21,310
OK.

973
00:48:21,310 --> 00:48:25,069
Whatever I have down here--
it's going 9, 10, 11, 12--

974
00:48:25,069 --> 00:48:27,360
is not going to be in this
list, because it was sorted.

975
00:48:27,360 --> 00:48:30,475
So I can stop.

976
00:48:30,475 --> 00:48:34,340
AUDIENCE: How do
you keep going--

977
00:48:34,340 --> 00:48:37,340
What algorithm so you use
to keep looking down on L2?

978
00:48:37,340 --> 00:48:38,204
PROFESSOR: Plus 1.

979
00:48:38,204 --> 00:48:39,120
AUDIENCE: So what if--

980
00:48:39,120 --> 00:48:40,328
PROFESSOR: I keep going down.

981
00:48:40,328 --> 00:48:46,360
AUDIENCE: What if the left
side was 3, 4, 6, 9243?

982
00:48:46,360 --> 00:48:47,370
PROFESSOR: 9,000, what?

983
00:48:47,370 --> 00:48:48,470
AUDIENCE: Just a big number.

984
00:48:48,470 --> 00:48:49,136
AUDIENCE: Right.

985
00:48:49,136 --> 00:48:52,714
Then you have to
increment by 1 each time--

986
00:48:52,714 --> 00:48:54,130
PROFESSOR: So I'm
not incrementing

987
00:48:54,130 --> 00:48:56,080
the number I'm looking at.

988
00:48:56,080 --> 00:48:57,710
Here, I looked at 6.

989
00:48:57,710 --> 00:48:59,610
Then I'm looking for 8.

990
00:48:59,610 --> 00:49:03,180
And after I found 8,
I'm looking for 19,000.

991
00:49:03,180 --> 00:49:06,800
So I'm going to go down,
either until I find 90,000

992
00:49:06,800 --> 00:49:07,710
or until I stop.

993
00:49:07,710 --> 00:49:10,382
AUDIENCE: So are you going
to go down one at a time,

994
00:49:10,382 --> 00:49:11,007
PROFESSOR: Yep.

995
00:49:11,007 --> 00:49:13,840
AUDIENCE: --why not
do a binary search?

996
00:49:13,840 --> 00:49:16,560
PROFESSOR: Because if
I do a binary search,

997
00:49:16,560 --> 00:49:20,550
the analysis that says it's
fast is not going to work.

998
00:49:20,550 --> 00:49:23,550
It turns out that this gives
me the optimal running time.

999
00:49:23,550 --> 00:49:28,830
If I do a binary search, suppose
I have a list that's like this.

1000
00:49:28,830 --> 00:49:33,509
1, 2, 3, 4, 5, all the
way down to 10,000.

1001
00:49:33,509 --> 00:49:36,750
Ugh, I can't write.

1002
00:49:36,750 --> 00:49:39,920
And I have another list
that's like this- 1, 2, 3, 4,

1003
00:49:39,920 --> 00:49:42,800
all the way down to 10,000.

1004
00:49:42,800 --> 00:49:44,720
1, 1.

1005
00:49:44,720 --> 00:49:49,830
I do a binary search,
takes log N. I look at 2.

1006
00:49:49,830 --> 00:49:53,230
I do a binary search, takes
almost log N. I look at 3,

1007
00:49:53,230 --> 00:49:56,352
do a binary search, takes
almost log N. So on, so forth.

1008
00:49:59,550 --> 00:50:00,705
So this is--

1009
00:50:00,705 --> 00:50:03,214
AUDIENCE: But if your
left list had been

1010
00:50:03,214 --> 00:50:04,380
PROFESSOR: Log N plus log N.

1011
00:50:04,380 --> 00:50:06,034
AUDIENCE: --10,000.

1012
00:50:06,034 --> 00:50:06,700
PROFESSOR: Yeah.

1013
00:50:06,700 --> 00:50:09,770
AUDIENCE: You'd
have taken N time.

1014
00:50:09,770 --> 00:50:10,630
PROFESSOR: Yes.

1015
00:50:10,630 --> 00:50:13,030
Well, this algorithm
takes N time,

1016
00:50:13,030 --> 00:50:15,800
even if I have to
list like this.

1017
00:50:15,800 --> 00:50:19,490
It takes 10,000
plus 10,000 time.

1018
00:50:19,490 --> 00:50:21,700
Whereas this algorithm will
take time that's actually

1019
00:50:21,700 --> 00:50:29,910
proportional to
10,000 log 10,000.

1020
00:50:29,910 --> 00:50:30,740
You believe me.

1021
00:50:30,740 --> 00:50:34,480
So, a way to look at this
is to do bounds and say,

1022
00:50:34,480 --> 00:50:37,140
for the elements
1 through 5,000,

1023
00:50:37,140 --> 00:50:41,470
it's going to do a binary search
for more than 5,000 elements.

1024
00:50:41,470 --> 00:50:44,170
So, the time-- the running
time is definitely bigger

1025
00:50:44,170 --> 00:50:48,366
than N over 2 log N over 2.

1026
00:50:51,840 --> 00:50:52,630
Constant.

1027
00:50:52,630 --> 00:51:00,710
This becomes log N
minus 1 and log N.

1028
00:51:00,710 --> 00:51:01,900
That is a good question.

1029
00:51:01,900 --> 00:51:04,340
I wondered about that the first
time I saw merge-sort too.

1030
00:51:04,340 --> 00:51:05,810
And I was thinking,
hey, I'm going

1031
00:51:05,810 --> 00:51:07,768
to do a binary search
here because it's faster,

1032
00:51:07,768 --> 00:51:10,510
and I'm going to make a faster
algorithm than anyone has ever

1033
00:51:10,510 --> 00:51:11,390
seen.

1034
00:51:11,390 --> 00:51:13,320
Well, if you do the
analysis, not so much.

1035
00:51:13,320 --> 00:51:14,660
But you need to think
about it, and you

1036
00:51:14,660 --> 00:51:16,660
need to know why that's
true or that's not true.

1037
00:51:16,660 --> 00:51:17,670
So I like your question.

1038
00:51:17,670 --> 00:51:20,140
Thank you.

1039
00:51:20,140 --> 00:51:22,520
So now let's get down so
the plain old merge-sort

1040
00:51:22,520 --> 00:51:26,860
that everyone-- sorry,
merge that everyone knows.

1041
00:51:26,860 --> 00:51:30,610
So if we go through
these one by one,

1042
00:51:30,610 --> 00:51:33,880
how many times am I going to
be advancing this pointer?

1043
00:51:33,880 --> 00:51:35,498
In total?

1044
00:51:35,498 --> 00:51:36,920
AUDIENCE: L2?

1045
00:51:36,920 --> 00:51:38,160
PROFESSOR: L2 times.

1046
00:51:38,160 --> 00:51:42,540
So this pointer can
only go down, right?

1047
00:51:42,540 --> 00:51:45,660
So worst case is going
to go down L2 times.

1048
00:51:45,660 --> 00:51:48,550
And then I'm done
with the list, return.

1049
00:51:48,550 --> 00:51:50,580
How many elements am I
going to look through?

1050
00:51:50,580 --> 00:51:53,301
So how many times does this
pointer going to advance?

1051
00:51:53,301 --> 00:51:53,842
AUDIENCE: L1?

1052
00:51:56,700 --> 00:51:59,080
PROFESSOR: This one.

1053
00:51:59,080 --> 00:52:00,550
AUDIENCE: But I thought, like--

1054
00:52:00,550 --> 00:52:03,490
AUDIENCE: Then you get
extra ones in between.

1055
00:52:03,490 --> 00:52:05,560
PROFESSOR: What if
L2 is bigger than L1?

1056
00:52:05,560 --> 00:52:06,780
What if--

1057
00:52:06,780 --> 00:52:07,660
AUDIENCE: Oh, right.

1058
00:52:07,660 --> 00:52:08,540
OK, never mind.

1059
00:52:08,540 --> 00:52:10,499
The reason I said
L2 was because--

1060
00:52:10,499 --> 00:52:13,040
PROFESSOR: You're thinking after
I'm going through this list,

1061
00:52:13,040 --> 00:52:13,700
I'm out, right?

1062
00:52:13,700 --> 00:52:15,325
AUDIENCE: Right,
that's what I thought.

1063
00:52:15,325 --> 00:52:17,640
PROFESSOR: So, your answer
works if this list, say, has

1064
00:52:17,640 --> 00:52:19,620
10 elements and this has 10,000.

1065
00:52:19,620 --> 00:52:22,230
And I go through this
one really quickly.

1066
00:52:22,230 --> 00:52:25,697
But if this list has 10 elements
and this list has 10,000,

1067
00:52:25,697 --> 00:52:27,280
and they both start
with 1 through 10,

1068
00:52:27,280 --> 00:52:31,660
I have to say a 1 because
that's a better bound.

1069
00:52:31,660 --> 00:52:33,596
So I have to say this.

1070
00:52:33,596 --> 00:52:35,345
AUDIENCE: Could you
say the opposite value

1071
00:52:35,345 --> 00:52:37,450
of the difference
between the two of them?

1072
00:52:37,450 --> 00:52:39,486
Because if you're going
to stop at 1 has 10

1073
00:52:39,486 --> 00:52:43,120
and the other one has
10,000, and let's say

1074
00:52:43,120 --> 00:52:46,579
that only the first
ten are actually equal,

1075
00:52:46,579 --> 00:52:48,870
then you're going to go
through that list, find all 10,

1076
00:52:48,870 --> 00:52:50,110
and you stop.

1077
00:52:50,110 --> 00:52:51,059
That would be 9,000--

1078
00:52:51,059 --> 00:52:53,350
PROFESSOR: I could say about
that if I'm looking at one

1079
00:52:53,350 --> 00:52:56,160
case, but the magic trick
is-- let's-- we're looking

1080
00:52:56,160 --> 00:52:57,680
at the worst case.

1081
00:52:57,680 --> 00:52:59,844
So worst case, if
I have 10 elements,

1082
00:52:59,844 --> 00:53:01,760
they'll be all the way
down in the other list.

1083
00:53:01,760 --> 00:53:03,010
Or they won't be there at all.

1084
00:53:03,010 --> 00:53:04,970
And I have to go down
through all the list.

1085
00:53:04,970 --> 00:53:07,670
AUDIENCE: OK.

1086
00:53:07,670 --> 00:53:09,640
PROFESSOR: So worst
case, L1 plus L2.

1087
00:53:15,000 --> 00:53:18,760
Let me see if I
have any time left.

1088
00:53:18,760 --> 00:53:19,720
Nope.

1089
00:53:19,720 --> 00:53:25,050
So, what I would like you to do
is go through insertion sort.

1090
00:53:25,050 --> 00:53:27,700
Insertion sort
matches the textbook.

1091
00:53:27,700 --> 00:53:30,540
Look at the definition in the
textbook, look at the code,

1092
00:53:30,540 --> 00:53:32,009
convince yourself it's the same.

1093
00:53:32,009 --> 00:53:33,800
Look at the running
time, convince yourself

1094
00:53:33,800 --> 00:53:35,280
it's N squared.

1095
00:53:35,280 --> 00:53:38,880
Then look at inner product
and convince yourself

1096
00:53:38,880 --> 00:53:41,360
that this is what it does.

1097
00:53:41,360 --> 00:53:44,820
Go through this line by
line, see where they match,

1098
00:53:44,820 --> 00:53:46,100
put the cost on.

1099
00:53:46,100 --> 00:53:49,040
Make sure that the
cost is L1 plus L2.

1100
00:53:49,040 --> 00:53:52,810
And last, go through
merge-sort and notice

1101
00:53:52,810 --> 00:53:56,660
that merge in [INAUDIBLE]
6 is exactly the same

1102
00:53:56,660 --> 00:53:57,880
as inner product.

1103
00:53:57,880 --> 00:54:00,020
So this pointer
magic that I did here

1104
00:54:00,020 --> 00:54:02,440
is exactly what's
happening inside merge.

1105
00:54:02,440 --> 00:54:03,570
And understand merge-sort.

1106
00:54:03,570 --> 00:54:05,630
Look at the textbook,
look at the notes,

1107
00:54:05,630 --> 00:54:08,270
and see how they match.

1108
00:54:08,270 --> 00:54:08,870
OK.

1109
00:54:08,870 --> 00:54:10,720
Thanks, guys.