1 00:00:00,060 --> 00:00:01,770 The following content is provided 2 00:00:01,770 --> 00:00:04,010 under a Creative Commons license. 3 00:00:04,010 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue 4 00:00:06,860 --> 00:00:10,720 to offer high-quality educational resources for free. 5 00:00:10,720 --> 00:00:13,340 To make a donation or view additional materials 6 00:00:13,340 --> 00:00:17,226 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,226 --> 00:00:17,851 at ocw.mit.edu. 8 00:00:22,030 --> 00:00:25,270 PROFESSOR: So did everyone turn in PSET1? 9 00:00:25,270 --> 00:00:25,770 Yes? 10 00:00:25,770 --> 00:00:27,220 Good. 11 00:00:27,220 --> 00:00:30,760 OK, so there is a PSET1 critique due in a few days. 12 00:00:30,760 --> 00:00:32,189 My advice? 13 00:00:32,189 --> 00:00:33,215 You already did PSET1. 14 00:00:33,215 --> 00:00:35,090 You remember what you wrote out on the proof. 15 00:00:35,090 --> 00:00:36,506 Look at the solution. 16 00:00:36,506 --> 00:00:38,630 Write one paragraph today, and you're done with it. 17 00:00:38,630 --> 00:00:40,800 Then you go focus on PSET2. 18 00:00:40,800 --> 00:00:42,334 If you leave it off until Tuesday, 19 00:00:42,334 --> 00:00:44,250 you're going to have to read your proof again, 20 00:00:44,250 --> 00:00:46,370 remember what you're thinking. 21 00:00:46,370 --> 00:00:47,577 It's a lot more work. 22 00:00:47,577 --> 00:00:49,160 Just do it now, get it out of the way, 23 00:00:49,160 --> 00:00:51,466 and put PSET1 behind you. 24 00:00:51,466 --> 00:00:53,340 AUDIENCE: Is the critique only for the proof? 25 00:00:53,340 --> 00:00:54,784 Or is this for all of them? 26 00:00:54,784 --> 00:00:55,450 PROFESSOR: Nope. 27 00:00:55,450 --> 00:00:56,399 Just the proof. 28 00:00:56,399 --> 00:00:58,440 So you have to compare your proof with our proof? 29 00:01:01,128 --> 00:01:02,880 AUDIENCE: Is there an assignment for that? 30 00:01:02,880 --> 00:01:04,876 Or do we just know to do it? 
31 00:01:04,876 --> 00:01:05,610 PROFESSOR: Uh. 32 00:01:05,610 --> 00:01:07,475 PSET1 be Stellar. 33 00:01:09,990 --> 00:01:10,580 Oh, no, sorry. 34 00:01:10,580 --> 00:01:11,413 It's not in Stellar. 35 00:01:11,413 --> 00:01:13,580 It's on our new grading site that just went out. 36 00:01:13,580 --> 00:01:15,330 So you have to go to our new grading site, 37 00:01:15,330 --> 00:01:17,500 and you have to type in your critique there. 38 00:01:17,500 --> 00:01:18,790 And it's one paragraph. 39 00:01:18,790 --> 00:01:20,220 You should aim for one paragraph. 40 00:01:20,220 --> 00:01:21,760 If you're doing more than that, then you're 41 00:01:21,760 --> 00:01:22,676 doing something wrong. 42 00:01:22,676 --> 00:01:25,300 And it's LATEX Plus math mode. 43 00:01:25,300 --> 00:01:29,460 So you can use math mode, and that's about it. 44 00:01:29,460 --> 00:01:31,852 OK, any more questions about the critique? 45 00:01:31,852 --> 00:01:32,560 It's a new thing. 46 00:01:32,560 --> 00:01:35,160 We care about it because it will make our grading life easier. 47 00:01:35,160 --> 00:01:37,360 And because it'll force you to look at the solutions and see 48 00:01:37,360 --> 00:01:39,130 what you understood and what you didn't. 49 00:01:39,130 --> 00:01:40,060 So we care about it. 50 00:01:40,060 --> 00:01:41,960 Don't ignore it. 51 00:01:41,960 --> 00:01:43,638 Yes? 52 00:01:43,638 --> 00:01:45,304 AUDIENCE: Like, how much is it weighted? 53 00:01:45,304 --> 00:01:48,294 How much does it count toward the grade? 54 00:01:48,294 --> 00:01:49,960 PROFESSOR: If you don't have a critique, 55 00:01:49,960 --> 00:01:53,394 we will most likely give you a 0 for proof. 56 00:01:53,394 --> 00:01:54,310 AUDIENCE: [INAUDIBLE]. 57 00:01:57,350 --> 00:02:00,070 PROFESSOR: If your proof is bad and your critique of the proof 58 00:02:00,070 --> 00:02:02,070 is good, then you might get something. 
59 00:02:02,070 --> 00:02:04,370 If your proof is bad and you have no critique-- 60 00:02:04,370 --> 00:02:06,370 actually if your proof is whatever it is and you 61 00:02:06,370 --> 00:02:07,619 have no critique, you get a 0. 62 00:02:07,619 --> 00:02:10,032 AUDIENCE: Yeah. [CHUCKLE] 63 00:02:10,032 --> 00:02:11,740 PROFESSOR: Any more questions about that? 64 00:02:14,870 --> 00:02:17,080 OK. 65 00:02:17,080 --> 00:02:20,740 Who needs help remembering what the document distance 66 00:02:20,740 --> 00:02:21,960 problem is? 67 00:02:24,910 --> 00:02:25,410 OK. 68 00:02:25,410 --> 00:02:28,350 Everyone who went to lecture or to [INAUDIBLE] remembers. 69 00:02:28,350 --> 00:02:29,840 That's good. 70 00:02:29,840 --> 00:02:31,360 Who went to lecture last time? 71 00:02:34,720 --> 00:02:35,290 Cool. 72 00:02:35,290 --> 00:02:36,010 That's good. 73 00:02:36,010 --> 00:02:38,560 So we did insertion sort and merge sort 74 00:02:38,560 --> 00:02:39,970 from a theoretical standpoint. 75 00:02:39,970 --> 00:02:42,851 Today we're going to look at the code for the insertion sort 76 00:02:42,851 --> 00:02:45,100 and, if we have time, look at the code for merge-sort, 77 00:02:45,100 --> 00:02:48,660 and use the same strategy as we did last time to analyze them, 78 00:02:48,660 --> 00:02:51,290 look at the running time, make sure the running time matches 79 00:02:51,290 --> 00:02:55,410 the theory, and see how pseudocode turns into Python. 80 00:02:59,010 --> 00:03:01,690 So you all have your listings. 81 00:03:01,690 --> 00:03:05,790 Last time in document distance, we covered Main, 82 00:03:05,790 --> 00:03:07,570 and we covered most of the functions 83 00:03:07,570 --> 00:03:10,450 except for count frequency. 84 00:03:10,450 --> 00:03:15,200 Can anyone remind me what the call graph looked like? 85 00:03:15,200 --> 00:03:16,700 So the call graph is the tree that I 86 00:03:16,700 --> 00:03:18,990 had up on the left, and it started at Main. 
87 00:03:28,650 --> 00:03:29,150 Thank you. 88 00:03:31,810 --> 00:03:40,610 So Main calls word frequencies for file, which in turn calls? 89 00:03:40,610 --> 00:03:42,470 AUDIENCE: Well, it's probably line list. 90 00:03:42,470 --> 00:03:43,090 PROFESSOR: OK. 91 00:03:43,090 --> 00:03:45,042 AUDIENCE: And count frequency. 92 00:03:51,390 --> 00:03:54,150 PROFESSOR: So we pretend we don't see the read file call. 93 00:03:54,150 --> 00:03:57,360 We assume that the data is already in memory 94 00:03:57,360 --> 00:03:59,970 or that the call takes time that's 95 00:03:59,970 --> 00:04:03,730 proportional to the running-- to the length of the file. 96 00:04:03,730 --> 00:04:07,036 And we only look at get words from line list and count 97 00:04:07,036 --> 00:04:07,535 frequency. 98 00:04:13,340 --> 00:04:14,390 OK. 99 00:04:14,390 --> 00:04:17,072 Who else does Main call? 100 00:04:17,072 --> 00:04:18,030 AUDIENCE: Vector angle? 101 00:04:25,760 --> 00:04:28,624 PROFESSOR: And the vector angle? 102 00:04:28,624 --> 00:04:29,624 AUDIENCE: Inner product. 103 00:04:38,017 --> 00:04:38,600 PROFESSOR: OK. 104 00:04:38,600 --> 00:04:43,370 Let's put up the constants for-- for the document distance 105 00:04:43,370 --> 00:04:46,720 problem that we used last time. 106 00:04:46,720 --> 00:04:51,780 So we said that the document has W words. 107 00:04:51,780 --> 00:04:54,170 And then when you take that list of words 108 00:04:54,170 --> 00:04:57,330 and you turn it into a document vector, 109 00:04:57,330 --> 00:04:59,850 you will get a document vector. 110 00:04:59,850 --> 00:05:04,180 You will get L elements, which basically means L unique words. 111 00:05:04,180 --> 00:05:09,730 So L is the document vector length. 112 00:05:09,730 --> 00:05:12,530 And we assume we're using a natural language like English, 113 00:05:12,530 --> 00:05:15,090 so all the words are bounded in size. 114 00:05:15,090 --> 00:05:17,470 Like 5 to 10 characters, for example.
115 00:05:17,470 --> 00:05:20,210 And to make our life easier, we say 116 00:05:20,210 --> 00:05:30,430 all the words have the same size w. So w is the word length. 117 00:05:30,430 --> 00:05:32,810 Using these numbers, can anyone remind me 118 00:05:32,810 --> 00:05:36,190 what we said the costs for these methods are? 119 00:05:36,190 --> 00:05:38,690 And we didn't analyze count frequency, 120 00:05:38,690 --> 00:05:40,740 so it's OK to not take my word for it 121 00:05:40,740 --> 00:05:42,800 and not tell me what I said last time. 122 00:05:42,800 --> 00:05:46,010 But I would like numbers for word frequencies from file, 123 00:05:46,010 --> 00:05:50,084 get words from line list, vector angle, and inner product. 124 00:05:50,084 --> 00:05:51,915 AUDIENCE: Was that W squared? 125 00:05:51,915 --> 00:05:53,160 That was last time. 126 00:05:53,160 --> 00:05:55,380 PROFESSOR: OK. 127 00:05:55,380 --> 00:05:57,150 Which-- which one? 128 00:05:57,150 --> 00:06:01,410 Does anyone-- does anyone else want to try? 129 00:06:01,410 --> 00:06:03,780 Let's not do guessing. 130 00:06:03,780 --> 00:06:08,290 I'll pull them up if nobody remembers. 131 00:06:08,290 --> 00:06:10,910 I spent an entire hour on that, and you guys did too. 132 00:06:10,910 --> 00:06:11,574 It was painful. 133 00:06:11,574 --> 00:06:12,282 AUDIENCE: I know. 134 00:06:12,282 --> 00:06:14,924 But we had to, like, you know, add them up, and then-- 135 00:06:14,924 --> 00:06:15,590 PROFESSOR: Yeah. 136 00:06:15,590 --> 00:06:17,340 So we did a lot of work for those numbers. 137 00:06:20,280 --> 00:06:24,350 So get words from line list was order of W squared. 138 00:06:24,350 --> 00:06:26,490 Does anyone remember why? 139 00:06:26,490 --> 00:06:29,220 What made it take so much time? 140 00:06:34,020 --> 00:06:36,900 AUDIENCE: 'Cause you append it to the word list. 141 00:06:36,900 --> 00:06:39,467 For, like, you add it to the end to go through. 142 00:06:39,467 --> 00:06:40,050 PROFESSOR: OK.
143 00:06:43,820 --> 00:06:44,530 So you add it. 144 00:06:44,530 --> 00:06:45,920 So what? 145 00:06:45,920 --> 00:06:47,922 AUDIENCE: So, like, every time you need it, 146 00:06:47,922 --> 00:06:52,260 you, like, do word list equals word list plus words in line. 147 00:06:52,260 --> 00:06:53,070 PROFESSOR: OK. 148 00:06:53,070 --> 00:06:55,300 Excellent. 149 00:06:55,300 --> 00:06:58,200 So get words from line list, line five. 150 00:06:58,200 --> 00:06:59,580 There's a plus there. 151 00:06:59,580 --> 00:07:03,630 And that plus sign messes up the performance. 152 00:07:03,630 --> 00:07:06,890 So that's why it's W squared. 153 00:07:06,890 --> 00:07:08,800 Count frequency, we didn't cover it, 154 00:07:08,800 --> 00:07:10,160 you had to take my word for it. 155 00:07:10,160 --> 00:07:12,400 So we will cover it now. 156 00:07:12,400 --> 00:07:15,145 And, inner product. 157 00:07:26,060 --> 00:07:29,610 So suppose you have two vectors of line-- length L1 and L2. 158 00:07:29,610 --> 00:07:32,700 How much time does it take to compute the inner product? 159 00:07:32,700 --> 00:07:34,492 AUDIENCE: L1 L2 time. 160 00:07:34,492 --> 00:07:35,075 PROFESSOR: OK. 161 00:07:38,880 --> 00:07:40,970 So how much time does vector angle take? 162 00:07:48,335 --> 00:07:49,317 AUDIENCE: L1 L2 time. 163 00:07:53,750 --> 00:07:56,509 PROFESSOR: L1 L2. 164 00:07:56,509 --> 00:07:57,800 So it makes three calls, right? 165 00:07:57,800 --> 00:07:59,270 You get the two things. 166 00:07:59,270 --> 00:08:02,450 First, it computes the inner product of the two lists. 167 00:08:02,450 --> 00:08:05,890 And then it has to compute the inner product of each vector-- 168 00:08:05,890 --> 00:08:08,359 of each document vector with itself. 169 00:08:08,359 --> 00:08:10,525 Because that's what's on the bottom of the fraction. 170 00:08:13,330 --> 00:08:15,210 So what's the running time for that? 171 00:08:15,210 --> 00:08:18,710 AUDIENCE: Plus L1 squared, plus L2 squared.
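The cost of that plus sign can be sketched in Python. This is an illustration, not the handout's code: the helper names `build_with_plus` and `build_with_extend`, the use of `str.split` to break a line into words, and the copy counters are all assumptions made for the example. The point is only that `word_list + words` builds a brand-new list each time, so the total number of element copies grows roughly like W squared, while extending in place copies each word once.

```python
def build_with_plus(lines):
    """Build the word list with `+` and count how many elements get copied."""
    word_list = []
    copies = 0
    for line in lines:
        words = line.split()  # assumption: whitespace splitting stands in for the real parser
        # list + list allocates a new list and copies both operands into it
        copies += len(word_list) + len(words)
        word_list = word_list + words
    return word_list, copies


def build_with_extend(lines):
    """Build the same list in place; only the new words are copied."""
    word_list = []
    copies = 0
    for line in lines:
        words = line.split()
        copies += len(words)
        word_list.extend(words)
    return word_list, copies


lines = ["the fox"] * 100  # 100 lines of 2 words each, so W = 200
list_plus, copies_plus = build_with_plus(lines)
list_extend, copies_extend = build_with_extend(lines)
print(copies_plus, copies_extend)
```

Both helpers return the same list; only the amount of copying differs, which is where the gap between quadratic and linear time comes from.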
172 00:08:18,710 --> 00:08:22,880 PROFESSOR: Plus L1 squared, plus L2 squared. 173 00:08:22,880 --> 00:08:25,380 So someone was really helpful last time and asked me, 174 00:08:25,380 --> 00:08:27,600 can you make this simpler with some math? 175 00:08:27,600 --> 00:08:30,080 And I said, I don't know so I don't think I will. 176 00:08:30,080 --> 00:08:32,820 But I looked at-- I looked at my high school math 177 00:08:32,820 --> 00:08:37,549 afterwards, and it turns out that these-- so L1 squared 178 00:08:37,549 --> 00:08:40,830 plus L2 squared are guaranteed to be greater than L1 L2 179 00:08:40,830 --> 00:08:43,009 as long as these numbers are positive. 180 00:08:43,009 --> 00:08:44,550 And we're working with document lists, 181 00:08:44,550 --> 00:08:46,350 so they're always positive. 182 00:08:46,350 --> 00:08:49,155 So this will go away. 183 00:08:54,320 --> 00:09:00,080 So let's assume count frequency is W squared or smaller. 184 00:09:00,080 --> 00:09:03,310 So the total running time for word frequencies from file 185 00:09:03,310 --> 00:09:06,580 is W squared. 186 00:09:06,580 --> 00:09:08,300 What's the running time for everything? 187 00:09:13,114 --> 00:09:14,780 AUDIENCE: Would just be, like, W squared 188 00:09:14,780 --> 00:09:17,410 plus L1 squared plus L2 squared? 189 00:09:17,410 --> 00:09:18,160 PROFESSOR: Almost. 190 00:09:18,160 --> 00:09:20,330 AUDIENCE: Actually, it depends on which one's greater, right? 191 00:09:20,330 --> 00:09:21,330 PROFESSOR: Well, almost. 192 00:09:21,330 --> 00:09:27,580 So here W works, assuming you get one document with W words. 193 00:09:27,580 --> 00:09:30,230 But word frequencies from file is called twice, once 194 00:09:30,230 --> 00:09:31,870 for each document. 195 00:09:31,870 --> 00:09:36,060 First document has W1 words, second document has W2 words. 196 00:09:36,060 --> 00:09:38,974 So what's the running time? 197 00:09:38,974 --> 00:09:42,750 AUDIENCE: W1 squared plus W2 squared.
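The three inner product calls made by vector angle can be sketched as below. This is a hedged reconstruction, not the handout listing: it assumes document vectors are lists of [word, count] pairs, that inner product uses two nested loops (which is what gives the L1 L2 cost), and that the angle is the arccosine of the usual cosine-similarity fraction. The clamp to 1.0 is an addition of this sketch to guard against floating-point rounding.

```python
import math


def inner_product(L1, L2):
    """Sum count1 * count2 over matching words; nested loops cost about len(L1) * len(L2)."""
    total = 0.0
    for word1, count1 in L1:
        for word2, count2 in L2:
            if word1 == word2:
                total += count1 * count2
    return total


def vector_angle(L1, L2):
    """Angle between two document vectors, using three inner_product calls."""
    numerator = inner_product(L1, L2)  # cost about L1 * L2
    # The two self inner products sit on the bottom of the fraction:
    # cost about L1 squared plus L2 squared.
    denominator = math.sqrt(inner_product(L1, L1) * inner_product(L2, L2))
    # Clamp (assumption of this sketch) so rounding cannot push the ratio above 1.
    return math.acos(min(1.0, numerator / denominator))


doc1 = [["the", 2], ["fox", 1]]
doc2 = [["the", 1], ["hat", 1]]
print(inner_product(doc1, doc2), vector_angle(doc1, doc2))
```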
198 00:09:42,750 --> 00:09:48,000 PROFESSOR: W1 squared plus W2 squared. 199 00:09:48,000 --> 00:09:50,200 So I take this, and I add this. 200 00:09:50,200 --> 00:09:51,590 Right? 201 00:09:51,590 --> 00:09:55,380 Except when I add this, if I want to add L1 squared, 202 00:09:55,380 --> 00:09:58,290 I know L1 is the number of unique documents-- 203 00:09:58,290 --> 00:10:00,600 of unique words in the document. 204 00:10:00,600 --> 00:10:04,450 And W1 is the total number of words in the document. 205 00:10:04,450 --> 00:10:07,177 W1 is guaranteed to be greater than or equal to L1, 206 00:10:07,177 --> 00:10:08,260 so it's going to dominate. 207 00:10:08,260 --> 00:10:09,440 I don't need to add it. 208 00:10:09,440 --> 00:10:11,360 Same for L2. 209 00:10:11,360 --> 00:10:12,178 This is it. 210 00:10:15,200 --> 00:10:15,700 OK. 211 00:10:15,700 --> 00:10:18,054 You guys don't seem to remember the numbers for these. 212 00:10:18,054 --> 00:10:20,220 So that means I didn't torture you enough last time. 213 00:10:20,220 --> 00:10:21,830 So let's do more. 214 00:10:21,830 --> 00:10:23,270 Let's look at count frequency. 215 00:10:23,270 --> 00:10:27,750 And let's compute the cost for that. 216 00:10:36,830 --> 00:10:40,730 So what we did last time was, we went through each line of code. 217 00:10:40,730 --> 00:10:42,820 We thought, how much time does it 218 00:10:42,820 --> 00:10:44,550 take to execute the line once? 219 00:10:44,550 --> 00:10:46,710 And how many times does the line run? 220 00:10:46,710 --> 00:10:49,819 And then we compute the product of that, add everything up, 221 00:10:49,819 --> 00:10:51,360 and that's the cost for the function. 222 00:10:55,940 --> 00:10:58,260 First off, before I put numbers here, 223 00:10:58,260 --> 00:10:59,540 what does the method do? 224 00:11:09,480 --> 00:11:12,040 AUDIENCE: It takes a list of words-- 225 00:11:12,040 --> 00:11:12,730 PROFESSOR: OK.
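The two simplifications used for the overall running time can be written out as follows. This is a sketch in the lecture's symbols, with $W_i$ the total number of words and $L_i$ the number of unique words in document $i$; constant factors are ignored and all quantities are assumed to be non-negative lengths.

```latex
\begin{align*}
(L_1 - L_2)^2 \ge 0
  &\;\Longrightarrow\; L_1^2 + L_2^2 \ge 2 L_1 L_2 \ge L_1 L_2, \\
O(L_1 L_2 + L_1^2 + L_2^2) &= O(L_1^2 + L_2^2), \\
L_i \le W_i
  &\;\Longrightarrow\; O(W_1^2 + W_2^2 + L_1^2 + L_2^2) = O(W_1^2 + W_2^2).
\end{align*}
```

The first line is why the L1 L2 term drops out of the vector angle cost, and the last line is why the L terms drop out once the two word frequencies from file calls are added in.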
226 00:11:12,730 --> 00:11:16,150 AUDIENCE: For each item in that list, 227 00:11:16,150 --> 00:11:17,870 checks to see if it's-- you know, 228 00:11:17,870 --> 00:11:25,407 list of words that it's-- counted, right? 229 00:11:25,407 --> 00:11:25,990 PROFESSOR: OK. 230 00:11:25,990 --> 00:11:28,290 So you're telling me what the code does. 231 00:11:28,290 --> 00:11:29,010 AUDIENCE: Yeah. 232 00:11:29,010 --> 00:11:31,070 PROFESSOR: Try to look at Main or try 233 00:11:31,070 --> 00:11:33,420 to look at word frequencies for files. 234 00:11:33,420 --> 00:11:36,540 So look at it top-down, and tell me what the purpose of it is. 235 00:11:36,540 --> 00:11:38,310 What's the goal? 236 00:11:38,310 --> 00:11:45,108 AUDIENCE: Making a list of-- and each object 237 00:11:45,108 --> 00:11:47,400 is a list with a word and a number. 238 00:11:47,400 --> 00:11:48,220 PROFESSOR: OK. 239 00:11:48,220 --> 00:11:49,100 Excellent. 240 00:11:49,100 --> 00:11:51,790 So big picture, I have the first document. 241 00:11:51,790 --> 00:11:52,640 I read it in. 242 00:11:52,640 --> 00:11:54,106 I break it up into words. 243 00:11:54,106 --> 00:11:55,230 And I have a list of words. 244 00:11:55,230 --> 00:11:58,380 That's what word frequencies for file 245 00:11:58,380 --> 00:12:00,920 gives me-- sorry, that's what you get words from line list 246 00:12:00,920 --> 00:12:02,190 gives me. 247 00:12:02,190 --> 00:12:02,890 List of words. 248 00:12:08,720 --> 00:12:16,000 The fox is in the hat. 249 00:12:18,680 --> 00:12:21,600 And this gets passed to count frequency, 250 00:12:21,600 --> 00:12:24,700 and count frequency gives me, you said, 251 00:12:24,700 --> 00:12:27,430 an object, which is a list. 252 00:12:27,430 --> 00:12:30,180 Of lists, where each of them has the word 253 00:12:30,180 --> 00:12:31,590 and how many times it shows up. 
254 00:12:31,590 --> 00:12:39,510 So I would have "the" shows up twice, "fox" shows up 255 00:12:39,510 --> 00:12:51,110 once, "is" shows up once, "in" shows up once, 256 00:12:51,110 --> 00:12:56,110 and-- I need a shorter example-- "hat" shows up once. 257 00:12:56,110 --> 00:12:58,155 So it takes this and turns it into that. 258 00:13:01,840 --> 00:13:05,090 So on line 2, I have a list L that's initialized. 259 00:13:05,090 --> 00:13:06,770 And then, at the end, it's returned. 260 00:13:06,770 --> 00:13:09,840 So I'm going to guess that L is going to look like this. 261 00:13:13,350 --> 00:13:17,050 Line 3, for new word in word list, iterates over the input. 262 00:13:17,050 --> 00:13:19,430 So iterates over this. 263 00:13:19,430 --> 00:13:23,340 And then, line 4 checks to see, for each new word, 264 00:13:23,340 --> 00:13:26,150 it looks at the list that I have under construction. 265 00:13:26,150 --> 00:13:30,950 So, for example, if I ran through all the words 266 00:13:30,950 --> 00:13:33,360 and then I'm trying to put in hat right now, 267 00:13:33,360 --> 00:13:35,250 I wouldn't have it here. 268 00:13:35,250 --> 00:13:39,890 What line 4 does is, it looks at all the entries. 269 00:13:39,890 --> 00:13:42,710 And it says, if I can find the words-- 270 00:13:42,710 --> 00:13:45,180 so if I can find the word hat somewhere here-- 271 00:13:45,180 --> 00:13:47,090 then increment the number. 272 00:13:47,090 --> 00:13:49,642 If I can't, then make a new entry 273 00:13:49,642 --> 00:13:52,100 and say that the word shows up once, because it's the first 274 00:13:52,100 --> 00:13:52,600 time I see it. 275 00:13:56,340 --> 00:13:57,590 So this is what the code does. 276 00:13:57,590 --> 00:14:01,640 Now let's see how fast it does that. 277 00:14:01,640 --> 00:14:07,450 So line 2 initializes the output to an empty list. 278 00:14:07,450 --> 00:14:10,071 What's the cost for that? 279 00:14:10,071 --> 00:14:10,988 AUDIENCE: [INAUDIBLE].
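For readers without the listing in hand, count frequency as described above can be sketched like this. It is reconstructed from the line-by-line description in the lecture (the handout itself is not part of this transcript), so the exact statements are an assumption; the trailing comments tie each statement to the line numbers the lecture uses.

```python
def count_frequency(word_list):
    L = []                                 # line 2: output list under construction
    for new_word in word_list:             # line 3: iterate over the input words
        for entry in L:                    # line 4: scan the entries built so far
            if new_word == entry[0]:       # line 5: string comparison against the entry's word
                entry[1] = entry[1] + 1    # line 6: seen before, bump the count
                break                      # line 7: stop scanning, skip the else
        else:                              # line 8: runs only if the scan finished without break
            L.append([new_word, 1])        # line 9: first time this word is seen
    return L


words = ["the", "fox", "is", "in", "the", "hat"]
result = count_frequency(words)
print(result)
```

On the board example, the six input words collapse into five entries, with "the" counted twice.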
280 00:14:10,988 --> 00:14:11,862 PROFESSOR: Very good. 281 00:14:11,862 --> 00:14:12,580 How many times? 282 00:14:18,370 --> 00:14:19,810 For new word in word list. 283 00:14:19,810 --> 00:14:20,310 Cost? 284 00:14:23,726 --> 00:14:26,680 AUDIENCE: [INAUDIBLE]. 285 00:14:26,680 --> 00:14:29,070 I know the cost is 1. 286 00:14:29,070 --> 00:14:32,330 PROFESSOR: OK, so we are-- here, it's a bit confusing, right? 287 00:14:32,330 --> 00:14:35,610 We're saying that, oh, there's this iteration over a list. 288 00:14:35,610 --> 00:14:37,920 And each step of the iteration is constant time, 289 00:14:37,920 --> 00:14:42,210 but the iteration happens L times. 290 00:14:42,210 --> 00:14:43,510 I'm sorry, not L times. 291 00:14:43,510 --> 00:14:45,400 The length of the list times. 292 00:14:45,400 --> 00:14:48,120 How many-- how many elements are in word list? 293 00:14:53,090 --> 00:14:57,730 I heard a very low W, so I will pretend I heard it. 294 00:14:57,730 --> 00:15:01,350 Or I hope I heard W. So word list, the words I 295 00:15:01,350 --> 00:15:06,690 got from the document, W. How about the if. 296 00:15:06,690 --> 00:15:13,570 So, it looks at the word that I have-- oh. 297 00:15:13,570 --> 00:15:16,680 This code is confusing because I forgot a line, right? 298 00:15:16,680 --> 00:15:20,580 Pretend that between line-- oh, no, sorry, I didn't. 299 00:15:20,580 --> 00:15:23,512 New word is-- new word is assigned in line 3. 300 00:15:23,512 --> 00:15:29,030 So new word in line 3 is compared to the first element 301 00:15:29,030 --> 00:15:31,280 of the entry that's assigned in line 4. 302 00:15:31,280 --> 00:15:36,960 So hat is compared with the, fox, so on, so forth. 303 00:15:36,960 --> 00:15:41,530 And if the comparison is true, it runs line 6 and 7. 304 00:15:41,530 --> 00:15:45,230 And if not, it keeps looping. 305 00:15:45,230 --> 00:15:50,300 So line 5, the if, how many times does it run? 306 00:15:50,300 --> 00:15:52,498 Just the comparison.
307 00:15:52,498 --> 00:15:53,810 AUDIENCE: W. 308 00:15:53,810 --> 00:15:55,107 PROFESSOR: W. 309 00:15:55,107 --> 00:15:56,148 AUDIENCE: Oh, no, no, no. 310 00:15:56,148 --> 00:15:57,072 No. 311 00:15:57,072 --> 00:15:58,920 I'm thinking of line 4. 312 00:15:58,920 --> 00:15:59,941 PROFESSOR: Oh. 313 00:15:59,941 --> 00:16:00,440 Yeah. 314 00:16:00,440 --> 00:16:02,260 I'm getting confused too. 315 00:16:02,260 --> 00:16:03,540 So let's start with line 4. 316 00:16:03,540 --> 00:16:04,520 Sorry. 317 00:16:04,520 --> 00:16:05,882 Shouldn't do line 5. 318 00:16:05,882 --> 00:16:08,059 AUDIENCE: --new word, then you're not-- like, 319 00:16:08,059 --> 00:16:09,850 you're not going to run through that again. 320 00:16:09,850 --> 00:16:10,516 PROFESSOR: Yeah. 321 00:16:10,516 --> 00:16:12,530 Let's worry about that right afterwards. 322 00:16:12,530 --> 00:16:13,760 Let's do line 4 first. 323 00:16:13,760 --> 00:16:15,790 Sorry, I jumped over line 4. 324 00:16:15,790 --> 00:16:19,520 So, line 4 definitely runs W times 325 00:16:19,520 --> 00:16:23,970 because it's inside the for loop from line 3 to line 9. 326 00:16:23,970 --> 00:16:29,510 So everything here will definitely run W times. 327 00:16:29,510 --> 00:16:31,910 But how many times does it run overall? 328 00:16:31,910 --> 00:16:36,377 So, line 4 iterates over all the entries here. 329 00:16:36,377 --> 00:16:37,710 How many times does that happen? 330 00:16:41,161 --> 00:16:47,570 AUDIENCE: 1 plus W over-- times 10/2, 331 00:16:47,570 --> 00:16:54,588 because it's just worst case, L-- the length of L increases 332 00:16:54,588 --> 00:16:56,830 by 1 every time. 333 00:16:56,830 --> 00:16:58,950 [INAUDIBLE] 334 00:16:58,950 --> 00:17:02,350 PROFESSOR: OK, so I like that you started with worst case. 335 00:17:02,350 --> 00:17:04,130 Normally I would say exactly that. 336 00:17:04,130 --> 00:17:07,619 Worst case or W. 
But we had a different constant 337 00:17:07,619 --> 00:17:11,480 for the number of words that you have in the end. 338 00:17:11,480 --> 00:17:16,230 So let's say something a little bit better than W. Let's say, 339 00:17:16,230 --> 00:17:17,880 let's put the lower bound on it. 340 00:17:17,880 --> 00:17:20,036 So yeah, worst case, all the words are different. 341 00:17:20,036 --> 00:17:21,619 But what if they're not all different? 342 00:17:21,619 --> 00:17:25,632 And what if in the end I know I have L words? 343 00:17:25,632 --> 00:17:27,653 CLASS: [INAUDIBLE]. 344 00:17:27,653 --> 00:17:28,569 PROFESSOR: Worst case. 345 00:17:33,001 --> 00:17:33,500 Almost. 346 00:17:36,340 --> 00:17:39,940 So, I know I have a W from the outer loop. 347 00:17:39,940 --> 00:17:42,680 For each word in the outer loop, how many times 348 00:17:42,680 --> 00:17:46,800 does the inner loop execute? 349 00:17:46,800 --> 00:17:49,210 How many times do I have to go through something 350 00:17:49,210 --> 00:17:50,250 in the inner list? 351 00:17:50,250 --> 00:17:53,090 So I know here I have W words, suppose here I 352 00:17:53,090 --> 00:17:55,480 have L elements in the vector. 353 00:17:55,480 --> 00:17:58,195 For each one of these, how many times 354 00:17:58,195 --> 00:18:02,440 do I have to go through-- So how many elements do I 355 00:18:02,440 --> 00:18:03,946 have to go through here? 356 00:18:03,946 --> 00:18:05,764 AUDIENCE: Depends on where you are, though. 357 00:18:05,764 --> 00:18:07,930 For the first word, you only have to go through one. 358 00:18:07,930 --> 00:18:08,230 PROFESSOR: Yep. 359 00:18:08,230 --> 00:18:08,530 But-- 360 00:18:08,530 --> 00:18:09,590 AUDIENCE: For the second word, you have to through-- 361 00:18:09,590 --> 00:18:09,870 PROFESSOR: Yep. 362 00:18:09,870 --> 00:18:11,280 But I heard the worst case. 363 00:18:11,280 --> 00:18:12,780 And I like that, because it's easier 364 00:18:12,780 --> 00:18:14,071 to reason about the worst case. 
365 00:18:14,071 --> 00:18:17,480 And most of the time it's sort of like the average case. 366 00:18:17,480 --> 00:18:21,200 AUDIENCE: So then, length of list-- L. 367 00:18:21,200 --> 00:18:25,530 PROFESSOR: The length of the list, and that's L. Worst case, 368 00:18:25,530 --> 00:18:27,976 the first words that I see will be L different words. 369 00:18:27,976 --> 00:18:29,350 And then all the words that I see 370 00:18:29,350 --> 00:18:32,840 will be the same as the words that I saw before. 371 00:18:32,840 --> 00:18:36,080 So worst case, the list will grow to L very fast, 372 00:18:36,080 --> 00:18:39,085 and then I'll keep seeing L L L. And I'll 373 00:18:39,085 --> 00:18:40,710 ignore what was there in the beginning, 374 00:18:40,710 --> 00:18:44,460 and I'll say L times. 375 00:18:44,460 --> 00:18:47,320 So I know the second list is bounded by L in length, 376 00:18:47,320 --> 00:18:49,470 the first list is bounded by W in length. 377 00:18:49,470 --> 00:18:53,960 So worst case this runs L times W times. 378 00:18:53,960 --> 00:18:56,394 And what's the cost of iterating? 379 00:18:56,394 --> 00:18:58,854 AUDIENCE: What is the difference between L and W? 380 00:18:58,854 --> 00:19:01,965 L is the document vector length, and W is the number of words. 381 00:19:01,965 --> 00:19:04,803 But isn't the number of elements in the document vector 382 00:19:04,803 --> 00:19:07,799 the number of words? 383 00:19:07,799 --> 00:19:09,090 PROFESSOR: How about this case? 384 00:19:09,090 --> 00:19:09,810 What's W? 385 00:19:15,134 --> 00:19:16,110 AUDIENCE: 6, yeah. 386 00:19:20,140 --> 00:19:23,790 PROFESSOR: So L is the number of unique words in a document. 387 00:19:23,790 --> 00:19:25,580 And I heard a really cool argument 388 00:19:25,580 --> 00:19:28,160 that I liked last time. 389 00:19:28,160 --> 00:19:29,120 Does anyone remember? 390 00:19:29,120 --> 00:19:29,840 About L? 
391 00:19:33,450 --> 00:19:37,160 If we're really dealing with a natural language like English, 392 00:19:37,160 --> 00:19:39,636 how many words do I have in English? 393 00:19:39,636 --> 00:19:41,135 AUDIENCE: Well, I think, at the max, 394 00:19:41,135 --> 00:19:43,997 there's actually around 250,000, but a lot of them 395 00:19:43,997 --> 00:19:45,157 are not used anymore. 396 00:19:45,157 --> 00:19:45,740 PROFESSOR: OK. 397 00:19:45,740 --> 00:19:47,720 So 250,000, right? 398 00:19:47,720 --> 00:19:48,820 Max. 399 00:19:48,820 --> 00:19:50,480 So that's a constant. 400 00:19:50,480 --> 00:19:52,930 If I have a document that contains 401 00:19:52,930 --> 00:19:56,060 all the writings of all the authors that were ever done, 402 00:19:56,060 --> 00:20:02,771 and say that's a billion words, L is still going to be 250,000. 403 00:20:02,771 --> 00:20:03,270 Right? 404 00:20:03,270 --> 00:20:04,882 So L can be very different from W. 405 00:20:04,882 --> 00:20:06,965 That's why we're keeping track of them separately. 406 00:20:11,510 --> 00:20:14,910 One W L times W. What's the cost of iterating? 407 00:20:19,190 --> 00:20:21,530 So we know how many times line 4 runs, 408 00:20:21,530 --> 00:20:26,930 but what's the cost of one step of iterating in the list? 409 00:20:26,930 --> 00:20:27,430 1. 410 00:20:27,430 --> 00:20:27,930 Very good. 411 00:20:31,530 --> 00:20:32,036 Line 5. 412 00:20:32,036 --> 00:20:33,160 How many times does it run? 413 00:20:37,420 --> 00:20:39,230 AUDIENCE: W times L? 414 00:20:39,230 --> 00:20:39,990 PROFESSOR: Yep. 415 00:20:39,990 --> 00:20:41,580 Same as line 4, right? 416 00:20:41,580 --> 00:20:43,800 The if is run all the time. 417 00:20:43,800 --> 00:20:47,860 And lines 5 and 6 only run sometimes, but-- sorry, 418 00:20:47,860 --> 00:20:53,750 line 6 and 7 only run sometimes, but line 5 runs all the time. 419 00:20:53,750 --> 00:20:54,740 What is the cost? 420 00:21:01,642 --> 00:21:03,280 AUDIENCE: We can say it's constant. 
421 00:21:03,280 --> 00:21:04,780 PROFESSOR: We can say it's constant. 422 00:21:04,780 --> 00:21:07,560 I like "we can say it's constant," but why 423 00:21:07,560 --> 00:21:10,020 is it that we can say it's constant? 424 00:21:10,020 --> 00:21:12,480 Why don't I just say 1, if-- this is an empty list. 425 00:21:12,480 --> 00:21:13,280 This is a number. 426 00:21:13,280 --> 00:21:14,340 AUDIENCE: Depends on the word length. 427 00:21:14,340 --> 00:21:14,923 PROFESSOR: OK. 428 00:21:14,923 --> 00:21:17,065 It depends on the word lis-- length, very good. 429 00:21:17,065 --> 00:21:18,690 AUDIENCE: And we're assuming that words 430 00:21:18,690 --> 00:21:20,170 are all the same length. 431 00:21:20,170 --> 00:21:21,637 PROFESSOR: OK. 432 00:21:21,637 --> 00:21:22,220 AUDIENCE: And? 433 00:21:22,220 --> 00:21:26,190 So, 1 times L W. Little w. 434 00:21:26,190 --> 00:21:28,390 PROFESSOR: OK, very good. 435 00:21:28,390 --> 00:21:30,980 So we do assume that all the words are the same length. 436 00:21:30,980 --> 00:21:34,920 But unless I tell you that the length is really small, 437 00:21:34,920 --> 00:21:37,630 which I did, you can't say 1. 438 00:21:37,630 --> 00:21:41,000 So, when you said we can say it's constant, it's right. 439 00:21:41,000 --> 00:21:42,770 We can say, but we also have to say why, 440 00:21:42,770 --> 00:21:44,890 or at least think why, that's the case. 441 00:21:44,890 --> 00:21:48,760 So it's w-- we're going to use w here, and when we copy it here, 442 00:21:48,760 --> 00:21:51,144 we're going to forget about it. 443 00:21:51,144 --> 00:21:52,852 AUDIENCE: Can you put a top bar on the w, 444 00:21:52,852 --> 00:21:56,830 just so I can tell that it's not the other W? 445 00:21:56,830 --> 00:22:07,950 PROFESSOR: OK. But you have to be responsible for reminding 446 00:22:07,950 --> 00:22:08,770 me to put that. 447 00:22:08,770 --> 00:22:10,160 AUDIENCE: OK. 448 00:22:10,160 --> 00:22:12,240 PROFESSOR: OK.
449 00:22:12,240 --> 00:22:15,470 So string comparison, not constant. 450 00:22:15,470 --> 00:22:17,770 If I have two very long strings that 451 00:22:17,770 --> 00:22:19,550 only differ in the last character, 452 00:22:19,550 --> 00:22:21,550 I have to go through them character by character 453 00:22:21,550 --> 00:22:24,810 by character until I find the last character that's 454 00:22:24,810 --> 00:22:25,580 different. 455 00:22:25,580 --> 00:22:28,770 Because until I look at that, the strings might be equal. 456 00:22:28,770 --> 00:22:30,926 So comparing two long strings takes 457 00:22:30,926 --> 00:22:33,175 time that's proportional to the length of the strings. 458 00:22:36,680 --> 00:22:40,130 OK, so line 5 costs w, tricky part, 459 00:22:40,130 --> 00:22:44,420 runs L times W part-- L times W times. 460 00:22:44,420 --> 00:22:46,350 How about line 6 and 7? 461 00:22:52,860 --> 00:22:56,280 I didn't ask line 6, I asked line 6 and 7 together, 462 00:22:56,280 --> 00:22:57,690 because there is a trick there. 463 00:23:05,690 --> 00:23:09,190 AUDIENCE: I think it's constant for line 6, right? 464 00:23:09,190 --> 00:23:10,518 PROFESSOR: Why? 465 00:23:10,518 --> 00:23:12,858 AUDIENCE: 'Cause it's a number. 466 00:23:12,858 --> 00:23:14,715 And one place in the entry. 467 00:23:14,715 --> 00:23:16,280 We've already grabbed the entry. 468 00:23:16,280 --> 00:23:20,420 PROFESSOR: So the cost of running it once is 1. 469 00:23:20,420 --> 00:23:21,700 Good. 470 00:23:21,700 --> 00:23:25,190 I can tell you that that's the same case for 7. 471 00:23:25,190 --> 00:23:26,560 How many times do they run? 472 00:23:30,120 --> 00:23:32,490 This is the hard part. 473 00:23:32,490 --> 00:23:34,980 AUDIENCE: Wait, does break line break out of one loop? 474 00:23:34,980 --> 00:23:35,605 PROFESSOR: Yep. 475 00:23:35,605 --> 00:23:39,940 Break breaks out of the loop between lines 4 and 7. 476 00:23:39,940 --> 00:23:41,300 I have a question. 
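The string comparison argument from a moment ago, that two long strings differing only in the last character force a scan of every character, can be modeled with a small sketch. The function `compare_strings` and its counter are hypothetical names made up for this illustration; Python's built-in `==` on strings is implemented natively and has its own shortcuts, so this models only the worst-case behavior being discussed, not the real implementation.

```python
def compare_strings(a, b):
    """Return (are_equal, characters_examined) for a character-by-character comparison."""
    if len(a) != len(b):
        # Different lengths can never be equal; no characters need to be examined.
        return False, 0
    examined = 0
    for ch_a, ch_b in zip(a, b):
        examined += 1
        if ch_a != ch_b:
            return False, examined  # first mismatch settles the answer
    return True, examined           # had to look at every character to be sure


# Two strings that differ only in the last character: all 10 characters are examined.
print(compare_strings("a" * 9 + "x", "a" * 9 + "y"))
# Short, bounded-length words keep this cost small, which is why w can be treated as a constant.
print(compare_strings("hat", "hat"))
```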
477 00:23:41,300 --> 00:23:46,350 AUDIENCE: Is else supposed to be not in line with if? 478 00:23:46,350 --> 00:23:46,850 Yes. 479 00:23:46,850 --> 00:23:49,580 PROFESSOR: It is supposed to be where it is. 480 00:23:49,580 --> 00:23:50,640 I will talk-- yes. 481 00:23:50,640 --> 00:23:53,880 So what happens is, that else is in line with a for. 482 00:23:53,880 --> 00:23:57,430 And if the for loop runs to completion, 483 00:23:57,430 --> 00:24:00,860 then it does get executed. 484 00:24:00,860 --> 00:24:04,097 If a break executes somewhere inside the for, 485 00:24:04,097 --> 00:24:05,305 then it doesn't get executed. 486 00:24:08,140 --> 00:24:10,650 So the idea behind that is, you usually 487 00:24:10,650 --> 00:24:12,510 use this for finding stuff. 488 00:24:12,510 --> 00:24:15,760 So you iterate over a list, and when you find something, 489 00:24:15,760 --> 00:24:17,516 break out of the loop. 490 00:24:17,516 --> 00:24:19,390 You did something, you break out of the loop. 491 00:24:19,390 --> 00:24:21,435 If you didn't find it, you can put an else 492 00:24:21,435 --> 00:24:23,167 and then say what code runs. 493 00:24:23,167 --> 00:24:25,000 And you don't have to write code on your own 494 00:24:25,000 --> 00:24:29,420 to check if you broke out of the loop. 495 00:24:29,420 --> 00:24:32,230 So if break executes, then it's going 496 00:24:32,230 --> 00:24:34,570 to take us out of the loop. 497 00:24:34,570 --> 00:24:35,930 It's going to ignore the else. 498 00:24:35,930 --> 00:24:38,282 And it's going to run the loop on line 3 again. 499 00:24:38,282 --> 00:24:39,865 So it's going to do another iteration. 500 00:24:42,470 --> 00:24:44,280 So line 6 and 7, how many times? 501 00:24:44,280 --> 00:24:47,120 AUDIENCE: W minus L? 502 00:24:47,120 --> 00:24:48,450 PROFESSOR: W minus L. 503 00:24:48,450 --> 00:24:50,526 AUDIENCE: Like, if it's the difference 504 00:24:50,526 --> 00:24:54,962 in number of words and number of-- unique words.
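The for/else behavior described above can be seen in a small sketch. The function name `find_index` and the sample words are illustrative assumptions, not code from the recitation.

```python
def find_index(items, target):
    """Return the index of target in items, or -1 if it is absent."""
    for i, item in enumerate(items):
        if item == target:
            found = i
            break              # leaving via break skips the else clause
    else:
        # Runs only when the loop ran to completion without a break.
        found = -1
    return found


print(find_index(["the", "fox", "hat"], "fox"))   # found, else is skipped
print(find_index(["the", "fox", "hat"], "cat"))   # not found, else runs
```

The else belongs to the for, not to the if, which is why it sits at the loop's indentation level.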
505 00:24:54,962 --> 00:24:55,670 PROFESSOR: Smart. 506 00:24:55,670 --> 00:25:00,520 You gave the precise answer right the first time. 507 00:25:00,520 --> 00:25:02,230 W minus L. Very good. 508 00:25:06,830 --> 00:25:13,960 And what happens once-- what happens once this runs? 509 00:25:13,960 --> 00:25:21,730 Why do I know-- why do I know that the if won't be-- Oh. 510 00:25:21,730 --> 00:25:23,660 Sorry, I'm getting myself confused. 511 00:25:23,660 --> 00:25:27,950 So it's going to run W minus L times total, right? 512 00:25:27,950 --> 00:25:31,570 Total times overall without this W thing here, 513 00:25:31,570 --> 00:25:33,547 so I should put an arrow and say-- 514 00:25:38,320 --> 00:25:41,070 Now suppose I didn't notice this. 515 00:25:41,070 --> 00:25:43,510 Is there another way I can get the decent bounds? 516 00:25:43,510 --> 00:25:45,165 So this is the right, perfect answer. 517 00:25:45,165 --> 00:25:46,290 You have this, you're done. 518 00:25:46,290 --> 00:25:47,623 You don't need to think further. 519 00:25:47,623 --> 00:25:49,420 Suppose you don't have this. 520 00:25:49,420 --> 00:25:50,350 What else can you do? 521 00:25:53,479 --> 00:25:55,070 AUDIENCE: But we don't have what? 522 00:25:55,070 --> 00:25:57,770 PROFESSOR: If I didn't notice that, hey, there 523 00:25:57,770 --> 00:26:03,330 are L words-- There are L new words, 524 00:26:03,330 --> 00:26:07,690 W words total, so W minus L words repeat themselves. 525 00:26:07,690 --> 00:26:11,230 So this is how many times the if is going to be true. 526 00:26:11,230 --> 00:26:13,330 If I didn't have that, then I could 527 00:26:13,330 --> 00:26:16,750 see that line 7 breaks out of the loop. 528 00:26:16,750 --> 00:26:19,314 So if that if runs once, then we're done. 529 00:26:19,314 --> 00:26:20,230 We're out of the loop. 530 00:26:20,230 --> 00:26:22,430 It's not going to run again. 
531 00:26:22,430 --> 00:26:26,210 So a bound that's not as precise as the one you gave me 532 00:26:26,210 --> 00:26:31,265 is 1 times W, because it runs, at most, once per iteration of the outer loop. 533 00:26:40,030 --> 00:26:42,720 Do people see this? 534 00:26:42,720 --> 00:26:45,810 So this is an easy way to cop out of thinking. 535 00:26:45,810 --> 00:26:48,230 And I don't like to think more than necessary, because you 536 00:26:48,230 --> 00:26:50,100 have finite time on a test or in life, 537 00:26:50,100 --> 00:26:52,720 and you don't want to spend too much time on one thing. 538 00:26:55,640 --> 00:26:56,670 We covered the loop. 539 00:26:56,670 --> 00:26:59,010 Let's look at else and append. 540 00:26:59,010 --> 00:27:00,510 I already got a helpful question, 541 00:27:00,510 --> 00:27:03,940 so I explained when the else would run. 542 00:27:03,940 --> 00:27:05,240 The running time for else. 543 00:27:05,240 --> 00:27:07,570 Else is a flow-- control flow statement. 544 00:27:07,570 --> 00:27:10,244 It's like break, so Python will keep 545 00:27:10,244 --> 00:27:11,910 track of whether a loop completed or not 546 00:27:11,910 --> 00:27:12,899 in constant time. 547 00:27:12,899 --> 00:27:13,690 I'll give you that. 548 00:27:13,690 --> 00:27:16,750 That's in the cost model. 549 00:27:16,750 --> 00:27:21,170 How many times does this else run, at most? 550 00:27:21,170 --> 00:27:22,840 OK, L. Good. 551 00:27:25,546 --> 00:27:26,865 Perfect. 552 00:27:26,865 --> 00:27:27,573 How about line 9? 553 00:27:36,012 --> 00:27:36,976 Loop stops here. 554 00:27:39,890 --> 00:27:42,200 How about line 9? 555 00:27:42,200 --> 00:27:43,500 How many times does it run? 556 00:27:43,500 --> 00:27:44,886 That's easy. 557 00:27:44,886 --> 00:27:46,290 AUDIENCE: [INAUDIBLE]. 558 00:27:46,290 --> 00:27:47,800 PROFESSOR: Yep. 559 00:27:47,800 --> 00:27:48,820 And what does it do? 560 00:27:48,820 --> 00:27:49,590 It's an append. 561 00:27:49,590 --> 00:27:53,111 What's the cost for append?
562 00:27:53,111 --> 00:27:53,902 AUDIENCE: Constant. 563 00:27:57,380 --> 00:27:59,680 PROFESSOR: OK Line 10 runs how many times? 564 00:27:59,680 --> 00:28:02,554 What's the cost? 565 00:28:02,554 --> 00:28:03,054 AUDIENCE: 1? 566 00:28:07,307 --> 00:28:09,390 PROFESSOR: You guys didn't listen to me last time. 567 00:28:09,390 --> 00:28:11,020 So I was saying you have to look at the notes 568 00:28:11,020 --> 00:28:12,270 and you have to practice this. 569 00:28:12,270 --> 00:28:14,480 Because you have to have this model in your mind. 570 00:28:14,480 --> 00:28:16,160 So that when you're writing code, 571 00:28:16,160 --> 00:28:17,710 this has to happen automatically. 572 00:28:17,710 --> 00:28:20,306 You shouldn't have to think explicitly about it. 573 00:28:20,306 --> 00:28:22,180 Because if you do, you're not going to do it. 574 00:28:22,180 --> 00:28:23,679 AUDIENCE: For the else, shouldn't it 575 00:28:23,679 --> 00:28:28,612 be W tim-- I mean, it would be called W times not L times? 576 00:28:28,612 --> 00:28:30,744 Because you want to look at the outer loop and not 577 00:28:30,744 --> 00:28:32,060 the inner loop? 578 00:28:32,060 --> 00:28:35,942 So it can-- you call all-- once at the end of the total-- 579 00:28:35,942 --> 00:28:37,080 the inner for. 580 00:28:37,080 --> 00:28:37,580 Right? 581 00:28:37,580 --> 00:28:43,590 So, so it could be-- happen W times, maximum, not L times. 582 00:28:43,590 --> 00:28:47,960 'Cause the L is the for loop that the else coincides with. 583 00:28:47,960 --> 00:28:50,502 And the else would only happen once 584 00:28:50,502 --> 00:28:53,410 for every total iteration of that. 585 00:28:53,410 --> 00:28:55,400 PROFESSOR: OK so you're proposing W 586 00:28:55,400 --> 00:28:58,070 as the bound for the else, right? 587 00:28:58,070 --> 00:28:59,643 Here. 588 00:28:59,643 --> 00:29:00,939 AUDIENCE: Yeah. 
589 00:29:00,939 --> 00:29:02,980 PROFESSOR: So I could say, hey, it runs, at most, 590 00:29:02,980 --> 00:29:04,140 once for outer loop. 591 00:29:04,140 --> 00:29:06,430 So it's, at most, W times. 592 00:29:06,430 --> 00:29:08,020 This is a nice, easy argument. 593 00:29:08,020 --> 00:29:09,680 We have a bound. 594 00:29:09,680 --> 00:29:10,670 L is a tighter bound. 595 00:29:10,670 --> 00:29:14,730 And when I got L, what happened here 596 00:29:14,730 --> 00:29:17,630 was the same kind of thinking that you did earlier 597 00:29:17,630 --> 00:29:18,960 to get this. 598 00:29:18,960 --> 00:29:21,580 So this bound is good, this bound is tighter. 599 00:29:21,580 --> 00:29:24,260 This bound is good, this bound is tighter. 600 00:29:24,260 --> 00:29:27,850 And the argument behind this one is that, hey, this 601 00:29:27,850 --> 00:29:30,570 else only happens for new words. 602 00:29:30,570 --> 00:29:33,300 If there's no new-- if the word that I looked at is old, 603 00:29:33,300 --> 00:29:36,900 then break is going to execute. 604 00:29:36,900 --> 00:29:39,040 And else is not going to execute. 605 00:29:41,620 --> 00:29:45,330 So that's why I can say L. 606 00:29:45,330 --> 00:29:47,260 The beauty of asymptotics is that I 607 00:29:47,260 --> 00:29:49,290 can use either of the bounds and I'll still 608 00:29:49,290 --> 00:29:50,950 get the correct running time. 609 00:29:50,950 --> 00:29:53,510 So I'm not going to fuss over it too much. 610 00:29:53,510 --> 00:29:56,290 I like the tighter ones, because it means you guys are thinking. 611 00:29:56,290 --> 00:29:57,790 And you're looking at the algorithm, 612 00:29:57,790 --> 00:29:59,330 and you're understanding it. 613 00:29:59,330 --> 00:30:00,830 But if you don't have them, you'll 614 00:30:00,830 --> 00:30:02,329 still get the correct running times. 615 00:30:02,329 --> 00:30:04,640 So I think that's nice. 616 00:30:04,640 --> 00:30:07,540 PROFESSOR: OK, let's get the running time for everything. 
617 00:30:07,540 --> 00:30:08,800 Can someone do it in one step? 618 00:30:13,890 --> 00:30:15,830 Then let's do it step by step. 619 00:30:15,830 --> 00:30:17,710 So let's compute partial products. 620 00:30:17,710 --> 00:30:20,660 What are they? 621 00:30:20,660 --> 00:30:22,118 AUDIENCE: 1. 622 00:30:22,118 --> 00:30:25,520 W. LW. 623 00:30:25,520 --> 00:30:29,894 L, W, little w-- with the bar. 624 00:30:29,894 --> 00:30:31,840 AUDIENCE: (LAUGHING) With the bar. 625 00:30:31,840 --> 00:30:32,631 PROFESSOR: Awesome. 626 00:30:32,631 --> 00:30:39,258 AUDIENCE: W minus L. W. L, L, 1. 627 00:30:42,130 --> 00:30:45,020 PROFESSOR: OK, so if I add them up, this is all asymptotic. 628 00:30:45,020 --> 00:30:47,530 So the biggest one will dominate. 629 00:30:47,530 --> 00:30:49,190 In general, I can just take a max 630 00:30:49,190 --> 00:30:52,100 instead of doing actual addition. 631 00:30:52,100 --> 00:30:54,296 So who dominates here? 632 00:30:54,296 --> 00:30:57,782 AUDIENCE: The fourth one down? 633 00:30:57,782 --> 00:30:59,780 PROFESSOR: Fourth-- OK. 634 00:30:59,780 --> 00:31:01,680 Yep. 635 00:31:01,680 --> 00:31:08,820 So line 5 is the biggest time consumer in this algorithm. 636 00:31:08,820 --> 00:31:11,430 And I know it's W times w-bar. 637 00:31:11,430 --> 00:31:12,940 So now I'm going to copy it here. 638 00:31:12,940 --> 00:31:16,575 What did I say I'll do when I'm copying it here? 639 00:31:16,575 --> 00:31:18,030 AUDIENCE: [INAUDIBLE]. 640 00:31:18,030 --> 00:31:18,720 PROFESSOR: Yep. 641 00:31:18,720 --> 00:31:22,640 So I'm assuming English five- to ten-character words. 642 00:31:22,640 --> 00:31:26,320 W L. And W L is smaller than w squared, 643 00:31:26,320 --> 00:31:28,780 so the assumption that I had before is correct. 644 00:31:28,780 --> 00:31:30,790 I don't have to change anything here. 645 00:31:30,790 --> 00:31:31,597 That is good. 
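The line-by-line counts discussed above can be checked with a small sketch. This is a reconstruction based only on the discussion (a nested loop with an if, an increment, a break, an else, and an append); the variable names and the two counters are assumptions added for illustration, so it should not be read as the exact course code.

```python
def count_frequency(word_list):
    """Return ([[word, count], ...], increments, appends)."""
    freq = []          # one [word, count] entry per distinct word (L entries)
    increments = 0     # how often the if-body (increment + break) runs
    appends = 0        # how often the else-body (append) runs
    for new_word in word_list:            # W iterations
        for entry in freq:                # at most L iterations each
            if new_word == entry[0]:      # string compare, cost ~ word length
                entry[1] += 1
                increments += 1
                break
        else:
            freq.append([new_word, 1])    # only reached for a new word
            appends += 1
    return freq, increments, appends


words = "the fox is in the hat the fox".split()   # W = 8, L = 5
freq, increments, appends = count_frequency(words)
print(freq)
print("if-body runs:", increments, "= W - L =", len(words) - len(freq))
print("else runs:", appends, "= L =", len(freq))
```

For this input the if-body runs W minus L times and the else runs L times, matching the tighter bounds; the looser bound of W for each is still correct and gives the same overall running time.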
646 00:31:38,560 --> 00:31:42,190 So we noticed last time, and already forgot by now, 647 00:31:42,190 --> 00:31:46,230 that the biggest problem in this whole implementation 648 00:31:46,230 --> 00:31:50,800 was the plus in get words from line list. 649 00:31:50,800 --> 00:31:54,300 Suppose we forgot about it and we have this big pile of code, 650 00:31:54,300 --> 00:31:56,100 how do I go about making it faster? 651 00:31:59,080 --> 00:32:01,720 Method one, go through every method. 652 00:32:01,720 --> 00:32:03,680 Do this. 653 00:32:03,680 --> 00:32:06,740 Compute the running times, and see which one's the slowest. 654 00:32:06,740 --> 00:32:10,190 Does this scale to 1,000 lines of code? 655 00:32:10,190 --> 00:32:11,730 Not so much. 656 00:32:11,730 --> 00:32:14,700 We're going to be giving you roughly 1,000 lines of code 657 00:32:14,700 --> 00:32:17,960 for PSET2, and we're going to ask you to make it faster. 658 00:32:17,960 --> 00:32:19,560 Do you want to understand everything? 659 00:32:19,560 --> 00:32:22,350 No. 660 00:32:22,350 --> 00:32:24,590 Instead what you want to do is run the code 661 00:32:24,590 --> 00:32:25,824 through a profiler. 662 00:32:25,824 --> 00:32:26,740 So we have a computer. 663 00:32:26,740 --> 00:32:28,770 The computer can tell you which line 664 00:32:28,770 --> 00:32:30,877 takes the most time to run. 665 00:32:30,877 --> 00:32:32,710 So you don't have to do it with pen and paper. 666 00:32:32,710 --> 00:32:34,940 Whenever we can automate, do so. 667 00:32:34,940 --> 00:32:36,830 So we'll teach you how to run a profiler. 668 00:32:36,830 --> 00:32:38,110 It's in the notes. 669 00:32:38,110 --> 00:32:40,870 And if you look in the code outputs right 670 00:32:40,870 --> 00:32:45,080 after [INAUDIBLE], you'll see a profiler output. 671 00:32:45,080 --> 00:32:47,580 So what that tells you is for each function, 672 00:32:47,580 --> 00:32:51,980 how much time does it take-- that's the total time.
673 00:32:51,980 --> 00:32:53,670 And there's the cumulative time, which 674 00:32:53,670 --> 00:32:56,300 is how much does it take together with its children? 675 00:32:59,680 --> 00:33:04,380 In this case for word frequencies from file, order 676 00:33:04,380 --> 00:33:06,360 of W-- this is how much time it takes 677 00:33:06,360 --> 00:33:07,860 including the functions it called. 678 00:33:07,860 --> 00:33:10,540 So this is the cumulative time. 679 00:33:10,540 --> 00:33:13,940 Cumulative time is useful if I'm doing runtime analysis. 680 00:33:13,940 --> 00:33:17,120 It's not so useful if I'm looking at where's 681 00:33:17,120 --> 00:33:18,745 the slowness in my program? 682 00:33:18,745 --> 00:33:20,370 Because if you look at cumulative time, 683 00:33:20,370 --> 00:33:24,510 the slowness you see might be in one of the functions that 684 00:33:24,510 --> 00:33:27,930 get-- that word frequencies from file called. 685 00:33:27,930 --> 00:33:30,560 So the cumulative time for this is really big, 686 00:33:30,560 --> 00:33:35,140 but the total time-- the time that's spent inside it-- 687 00:33:35,140 --> 00:33:36,824 is not that bad. 688 00:33:36,824 --> 00:33:38,240 Instead if you look at total time, 689 00:33:38,240 --> 00:33:41,520 you'll see that the worst function is-- surprise, 690 00:33:41,520 --> 00:33:44,670 surprise-- get words from line list. 691 00:33:44,670 --> 00:33:48,150 So 5 lines to look at-- hey, there's a plus there. 692 00:33:48,150 --> 00:33:50,900 I remember from lecture that plus copies over lists, 693 00:33:50,900 --> 00:33:54,700 and it's kind of slow, so maybe I should use something else. 694 00:33:54,700 --> 00:33:57,340 Does anyone remember what else we should use? 695 00:33:57,340 --> 00:34:00,290 We talked about that last recitation. 696 00:34:00,290 --> 00:34:02,200 Extend. 697 00:34:02,200 --> 00:34:03,580 Document distance 2.
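The total-versus-cumulative distinction above can be tried with Python's standard `cProfile` and `pstats` modules. This is a minimal sketch: `slow_concat` and `driver` are made-up stand-ins, not functions from the document distance code, and the notes may run the profiler differently.

```python
import cProfile
import io
import pstats


def slow_concat(n):
    """Build a list with +, which copies the whole list on every step."""
    out = []
    for i in range(n):
        out = out + [i]
    return out


def driver():
    """Does almost no work itself; nearly all its time is in slow_concat."""
    return len(slow_concat(2000))


profiler = cProfile.Profile()
profiler.enable()
driver()
profiler.disable()

buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
# "tottime" is time spent inside a function itself;
# "cumtime" also includes the functions it called.
stats.sort_stats("tottime").print_stats(5)
report = buf.getvalue()
print(report)
```

In the printed table, `driver` should show a large cumtime but a small tottime, while `slow_concat` carries the tottime, which is the pattern described for word frequencies from file and get words from line list.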
698 00:34:03,580 --> 00:34:06,580 The only difference between it and document distance 1 699 00:34:06,580 --> 00:34:09,639 is get words from line list, line 5. 700 00:34:09,639 --> 00:34:12,580 That plus turned into an extend. 701 00:34:12,580 --> 00:34:16,330 One character turned to six, so about eight keystrokes, 702 00:34:16,330 --> 00:34:18,440 30% runtime improvement. 703 00:34:18,440 --> 00:34:20,500 Very good return on investment. 704 00:34:20,500 --> 00:34:24,530 Everything else is going to be harder. 705 00:34:24,530 --> 00:34:26,840 So let's look at that line, because that line dominated 706 00:34:26,840 --> 00:34:29,770 the running time of get words from line list. 707 00:34:29,770 --> 00:34:32,440 And think, what's the running time for it 708 00:34:32,440 --> 00:34:33,449 now that I have extend? 709 00:35:05,440 --> 00:35:07,800 Does anyone remember what get words from line list does? 710 00:35:16,130 --> 00:35:19,765 AUDIENCE: It gets the words out of the document. 711 00:35:19,765 --> 00:35:20,390 PROFESSOR: OK. 712 00:35:20,390 --> 00:35:22,110 It gets the words out of the document. 713 00:35:22,110 --> 00:35:25,720 So it reads a document that looks like a regular text file, 714 00:35:25,720 --> 00:35:30,240 and it gets this out of it. 715 00:35:30,240 --> 00:35:33,190 The way it does that is it goes through each line, 716 00:35:33,190 --> 00:35:36,640 reads the line as a string, breaks up the string 717 00:35:36,640 --> 00:35:39,310 into a list of words, and then combines 718 00:35:39,310 --> 00:35:42,240 all those lists of words together. 719 00:35:42,240 --> 00:35:44,420 Get words from string, line 4, returns 720 00:35:44,420 --> 00:35:47,114 the number of words in a line. 721 00:35:47,114 --> 00:35:49,030 Sorry, the list of words in the line, and then 722 00:35:49,030 --> 00:35:51,510 extend combines the lists together. 723 00:35:51,510 --> 00:35:56,140 Let's add some constants that we had last time.
724 00:35:56,140 --> 00:36:08,220 I think we had that K is the number of words per line. 725 00:36:08,220 --> 00:36:16,850 And Z is the number of total lines. 726 00:36:16,850 --> 00:36:29,492 So this K is actually-- W over Z. No, I don't like that. 727 00:36:29,492 --> 00:36:29,992 That's work. 728 00:36:32,890 --> 00:36:43,730 So Z is W over K. And we argued 729 00:36:43,730 --> 00:36:45,370 that we're not going to talk about K 730 00:36:45,370 --> 00:36:48,490 too much, because a document needs to fit on a screen 731 00:36:48,490 --> 00:36:50,820 or needs to fit on a piece of paper. 732 00:36:50,820 --> 00:36:52,740 So the line length has to be finite, right? 733 00:36:52,740 --> 00:36:56,069 Otherwise, if I have a document that has 10,000-character 734 00:36:56,069 --> 00:36:57,860 lines, I can't even write it on this board. 735 00:36:57,860 --> 00:36:59,940 Even though it's really long. 736 00:36:59,940 --> 00:37:03,300 So K, the number of words in a line, is going to be finite. 737 00:37:03,300 --> 00:37:06,270 But we'll need it for this analysis. 738 00:37:06,270 --> 00:37:10,650 So line 4 returns a list with the words on a line. 739 00:37:10,650 --> 00:37:12,225 How many elements in that list? 740 00:37:19,010 --> 00:37:22,660 This returns a list with how many elements? 741 00:37:28,408 --> 00:37:31,459 AUDIENCE: Could K [INAUDIBLE] words on line. 742 00:37:31,459 --> 00:37:33,250 PROFESSOR: So even if I ask easy questions, 743 00:37:33,250 --> 00:37:34,249 you guys have to answer. 744 00:37:34,249 --> 00:37:36,900 Because otherwise I'll stall until you do. 745 00:37:36,900 --> 00:37:39,240 So 4 gives me a list with K elements, 746 00:37:39,240 --> 00:37:43,060 and 5 appends that small list to the big list of words 747 00:37:43,060 --> 00:37:45,040 in the entire document. 748 00:37:45,040 --> 00:37:47,310 Before, I used plus and that did something bad. 749 00:37:47,310 --> 00:37:48,460 Now I'm using extend.
750 00:37:48,460 --> 00:37:53,248 What's the cost of one extend call? 751 00:37:53,248 --> 00:37:54,706 AUDIENCE: Constant? 752 00:37:54,706 --> 00:37:55,912 Order of K? 753 00:37:55,912 --> 00:37:58,120 PROFESSOR: I want to know your Python implementation. 754 00:38:02,288 --> 00:38:04,708 AUDIENCE: If Python did it like a linked list, 755 00:38:04,708 --> 00:38:08,477 like doubly linked lists, could be order constant? 756 00:38:08,477 --> 00:38:09,810 PROFESSOR: I like your question. 757 00:38:09,810 --> 00:38:13,720 So if Python lists were actually linked lists-- 758 00:38:13,720 --> 00:38:16,060 so if the name wasn't confusing-- 759 00:38:16,060 --> 00:38:18,320 then yes, merging two lists would be constant. 760 00:38:18,320 --> 00:38:21,520 But then accessing one element in the middle of a list 761 00:38:21,520 --> 00:38:24,160 would not be constant anymore. 762 00:38:24,160 --> 00:38:25,910 Say I want to access element number 763 00:38:25,910 --> 00:38:29,160 200 in a list of 10,000 elements. 764 00:38:29,160 --> 00:38:31,165 I have to go through 200 elements. 765 00:38:31,165 --> 00:38:32,540 We didn't do linked lists when we 766 00:38:32,540 --> 00:38:33,930 ran it, so don't worry about it. 767 00:38:33,930 --> 00:38:36,230 So they decided that it's less confusing to have 768 00:38:36,230 --> 00:38:39,710 lists actually be arrays. 769 00:38:39,710 --> 00:38:42,500 So Python lists are arrays in CLRS. 770 00:38:45,175 --> 00:38:47,300 AUDIENCE: That can use their storage contiguously? 771 00:38:47,300 --> 00:38:48,200 PROFESSOR: Yep. 772 00:38:48,200 --> 00:38:50,930 AUDIENCE: So when you copy, why can't you copy a block? 773 00:38:50,930 --> 00:38:52,731 Why do you have to access each element? 774 00:38:52,731 --> 00:38:54,720 Does that make sense? 775 00:38:54,720 --> 00:38:56,930 PROFESSOR: So, you can copy a block, 776 00:38:56,930 --> 00:38:58,600 but in order to copy a block, you still 777 00:38:58,600 --> 00:39:00,270 have to move everything.
778 00:39:00,270 --> 00:39:03,770 So if your block is 10,000 elements, 779 00:39:03,770 --> 00:39:08,370 you still have to move 10,000 times the element size in bytes. 780 00:39:08,370 --> 00:39:11,490 And the CPU works on 4 bytes at a time or 8 bytes at a time. 781 00:39:13,990 --> 00:39:14,490 OK. 782 00:39:14,490 --> 00:39:16,150 But this is the right kind of thing 783 00:39:16,150 --> 00:39:17,910 to be thinking about when you're doing the cost model. 784 00:39:17,910 --> 00:39:19,300 And this is what you want to have in your head 785 00:39:19,300 --> 00:39:20,424 when you're writing Python. 786 00:39:20,424 --> 00:39:21,610 So, good. 787 00:39:21,610 --> 00:39:23,492 I like your question. 788 00:39:23,492 --> 00:39:24,950 I wanted to say that at some point, 789 00:39:24,950 --> 00:39:26,366 but I didn't get the occasion yet. 790 00:39:26,366 --> 00:39:28,050 Lists in Python are not lists in CLRS. 791 00:39:30,561 --> 00:39:31,060 OK. 792 00:39:31,060 --> 00:39:35,030 So with that long explanation, an extend 793 00:39:35,030 --> 00:39:37,590 is a list-- is a sequence of appends, right? 794 00:39:37,590 --> 00:39:39,020 If you have two lists and you want 795 00:39:39,020 --> 00:39:41,980 to extend the first list with the second list, 796 00:39:41,980 --> 00:39:44,640 extend basically goes through each element in the second list 797 00:39:44,640 --> 00:39:47,800 and calls append on the first list. 798 00:39:47,800 --> 00:39:50,380 The list is length K, the second list is length K, 799 00:39:50,380 --> 00:39:52,510 so K appends are going to happen. 800 00:39:52,510 --> 00:39:54,200 And append is constant time. 801 00:39:54,200 --> 00:39:55,910 Total cost, K. 802 00:39:55,910 --> 00:39:58,000 How many times does line 5 run? 803 00:40:01,227 --> 00:40:03,265 AUDIENCE: [INAUDIBLE]. 804 00:40:03,265 --> 00:40:04,140 PROFESSOR: Very good.
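The plus-versus-extend difference discussed above can be sketched side by side. The function names are hypothetical and `line.split()` stands in for get words from string, so this shows the pattern rather than the course code.

```python
def get_words_plus(lines):
    """Concatenate with +: each step builds a brand-new list."""
    words = []
    for line in lines:
        # Copies everything collected so far plus the K new words.
        words = words + line.split()
    return words


def get_words_extend(lines):
    """Grow in place with extend: each step only adds the K new words."""
    words = []
    for line in lines:
        words.extend(line.split())
    return words


lines = ["the fox is in the hat", "the hat is red", "a fox"]
print(get_words_plus(lines) == get_words_extend(lines))
```

Both versions return the same list. Under the cost model used here, the extend version does about K work on each of the Z lines, while the plus version recopies the growing list on every line, so its cost grows with the amount already collected.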
805 00:40:07,140 --> 00:40:09,200 So what is the total running cost 806 00:40:09,200 --> 00:40:12,470 of the algorithm if this is the line that dominates? 807 00:40:12,470 --> 00:40:13,970 I don't want to do every other line, 808 00:40:13,970 --> 00:40:16,250 so I'll promise that this is the line that dominates. 809 00:40:16,250 --> 00:40:19,300 What is the total running time? 810 00:40:19,300 --> 00:40:20,377 AUDIENCE: [INAUDIBLE]. 811 00:40:20,377 --> 00:40:20,960 PROFESSOR: OK. 812 00:40:20,960 --> 00:40:22,230 Very good. 813 00:40:22,230 --> 00:40:23,390 K times L. And that is? 814 00:40:27,530 --> 00:40:33,160 Oh, K. So I shouldn't have said K times L, sorry. 815 00:40:33,160 --> 00:40:35,430 L is not the number of lines. 816 00:40:35,430 --> 00:40:40,250 L is the number here, so Z is the number of lines. 817 00:40:40,250 --> 00:40:45,720 So it's K times Z. Which means I'm using bad letters, 818 00:40:45,720 --> 00:40:47,120 so please bear with me. 819 00:40:47,120 --> 00:40:48,830 We'll forget about them in a minute. 820 00:40:48,830 --> 00:40:50,830 So K times Z equals? 821 00:40:50,830 --> 00:40:53,470 AUDIENCE: W. 822 00:40:53,470 --> 00:40:57,660 PROFESSOR: W. So what do I write here? 823 00:41:01,994 --> 00:41:02,910 AUDIENCE: [INAUDIBLE]. 824 00:41:02,910 --> 00:41:03,530 PROFESSOR: OK. 825 00:41:03,530 --> 00:41:04,760 Good. 826 00:41:04,760 --> 00:41:05,830 Very good. 827 00:41:05,830 --> 00:41:08,800 What do I write here? 828 00:41:08,800 --> 00:41:11,862 Word frequencies from file. 829 00:41:11,862 --> 00:41:13,137 AUDIENCE: W L. 830 00:41:13,137 --> 00:41:13,720 PROFESSOR: OK. 831 00:41:16,460 --> 00:41:17,335 What do I write here? 832 00:41:21,407 --> 00:41:21,990 AUDIENCE: W 1. 833 00:41:35,150 --> 00:41:38,237 AUDIENCE: We have to-- put L 1 squared and L 2 squared back 834 00:41:38,237 --> 00:41:38,950 in. 835 00:41:38,950 --> 00:41:40,160 PROFESSOR: OK, do we? 836 00:41:40,160 --> 00:41:43,060 Well, the-- No. 
837 00:41:43,060 --> 00:41:45,535 W, you want to always be bigger than L. I hope. 838 00:41:45,535 --> 00:41:46,160 PROFESSOR: Yep. 839 00:41:46,160 --> 00:41:48,680 So if I put it in, L 1 squared is 840 00:41:48,680 --> 00:41:53,570 L 1 L 1, which is smaller than W 1 L 1. 841 00:41:53,570 --> 00:41:55,560 But I have to think about it before doing that. 842 00:41:55,560 --> 00:41:58,260 I can't ignore this completely. 843 00:42:02,240 --> 00:42:04,210 So this is document distance 2. 844 00:42:04,210 --> 00:42:07,207 The asymptotic complexity improved, the running time 845 00:42:07,207 --> 00:42:07,707 improved. 846 00:42:14,530 --> 00:42:16,410 Next thing that happens to make this faster 847 00:42:16,410 --> 00:42:19,180 is-- I'm not giving you the profiler output, 848 00:42:19,180 --> 00:42:23,040 but you have to take my word for it that the slowest methods are 849 00:42:23,040 --> 00:42:27,170 count frequency and inner product. 850 00:42:27,170 --> 00:42:29,770 So what I'm going to do first is I'm 851 00:42:29,770 --> 00:42:32,320 going to make inner product faster. 852 00:42:32,320 --> 00:42:34,400 But in order to do that, I have to make 853 00:42:34,400 --> 00:42:36,550 word frequencies from file slower. 854 00:42:36,550 --> 00:42:38,050 And that is because I happen to know 855 00:42:38,050 --> 00:42:44,030 an algorithm for inner product that is a lot faster, if only 856 00:42:44,030 --> 00:42:47,575 you can promise me that in this list, the words are ordered. 857 00:42:47,575 --> 00:42:48,450 The words are sorted. 858 00:42:51,210 --> 00:42:57,000 So what happens in that list 1 is, the moment I see a word, 859 00:42:57,000 --> 00:42:59,510 if it's not in the list I add it at the end. 860 00:42:59,510 --> 00:43:01,820 So the words here show up pretty much in the order 861 00:43:01,820 --> 00:43:03,710 that they show up in the file.
862 00:43:03,710 --> 00:43:05,680 Well, if instead I could have something 863 00:43:05,680 --> 00:43:11,090 that looks like-- what is it? 864 00:43:11,090 --> 00:43:12,470 Fox. 865 00:43:12,470 --> 00:43:13,731 In. 866 00:43:13,731 --> 00:43:14,230 Is. 867 00:43:16,890 --> 00:43:19,850 Hat is somewhere here. 868 00:43:19,850 --> 00:43:22,650 So if these words would be in this order, 869 00:43:22,650 --> 00:43:25,210 together with the word that I'm missing, 870 00:43:25,210 --> 00:43:27,470 then I can combine two lists. 871 00:43:27,470 --> 00:43:31,620 I can do an inner product a lot faster. 872 00:43:31,620 --> 00:43:34,130 Let's see how that would work. 873 00:43:34,130 --> 00:43:37,360 And I'm already getting confused by my words, so let's do a trick. 874 00:43:37,360 --> 00:43:39,810 Let's say that instead of words, we'll use numbers. 875 00:43:39,810 --> 00:43:41,570 So instead of saying "the," I'm going 876 00:43:41,570 --> 00:43:45,060 to say the is the 50th word in the dictionary. 877 00:43:45,060 --> 00:43:47,957 So I'm going to use number 50. 878 00:43:47,957 --> 00:43:49,290 Because I want to write numbers. 879 00:43:49,290 --> 00:43:50,940 The numbers are easier to deal with. 880 00:43:50,940 --> 00:43:53,460 So say the first document that I have-- 881 00:43:53,460 --> 00:43:59,840 the fox is in the hat-- has words number 3, 4, 6, 8, and 9. 882 00:43:59,840 --> 00:44:03,140 And they show up twice, once, once, once. 883 00:44:06,360 --> 00:44:08,720 9, once. 884 00:44:08,720 --> 00:44:17,160 And say I'm trying to compute the inner product of this 885 00:44:17,160 --> 00:44:19,370 with a document that has word number 2 886 00:44:19,370 --> 00:44:23,290 showing up once, word number 3 showing up once, word number 887 00:44:23,290 --> 00:44:28,350 6 showing up once, word number 7 showing up once, 888 00:44:28,350 --> 00:44:30,174 word number 8 showing up once.
889 00:44:33,737 --> 00:44:35,820 OK, the algorithm for inner product that we talked 890 00:44:35,820 --> 00:44:39,050 about last time was, go through each element in one 891 00:44:39,050 --> 00:44:42,710 of the vectors, find an element with the same word 892 00:44:42,710 --> 00:44:45,410 in the second vector, and if you can find it, 893 00:44:45,410 --> 00:44:47,370 then take the number of times the words show up 894 00:44:47,370 --> 00:44:48,800 and multiply them. 895 00:44:48,800 --> 00:44:50,560 So here I have a 3, 2. 896 00:44:50,560 --> 00:44:54,205 I would find this element here that has 3, 1. 897 00:44:54,205 --> 00:44:55,205 I have these everywhere. 898 00:44:58,030 --> 00:45:00,370 And I take the 2 and the 1 and I multiply them. 899 00:45:00,370 --> 00:45:03,920 So the 3s have to be the same, then I take the 2 and the 1, 900 00:45:03,920 --> 00:45:05,500 and I multiply them. 901 00:45:05,500 --> 00:45:08,270 And for all the elements where that's the case, 902 00:45:08,270 --> 00:45:10,430 I add up the results of the multiplication. 903 00:45:13,420 --> 00:45:16,930 So step one, go through each element in a vector. 904 00:45:16,930 --> 00:45:20,095 That's not going to get faster if this other vector is 905 00:45:20,095 --> 00:45:21,260 sorted, right? 906 00:45:21,260 --> 00:45:27,740 But the step of looking up the second element can be sped up. 907 00:45:27,740 --> 00:45:31,430 The first and easiest way I can speed this up is, hey, 908 00:45:31,430 --> 00:45:32,250 this is sorted. 909 00:45:32,250 --> 00:45:37,340 If I'm looking up three, why not do a binary search? 910 00:45:37,340 --> 00:45:39,640 What would be the cost if I do that? 911 00:45:39,640 --> 00:45:43,670 So here I have L1 elements, here I have L2 elements. 912 00:45:43,670 --> 00:45:47,400 What's the cost of doing one binary search here? 913 00:45:47,400 --> 00:45:48,940 AUDIENCE: Log of L2? 914 00:45:48,940 --> 00:45:51,660 PROFESSOR: OK.
915 00:45:51,660 --> 00:45:55,201 And how many times do I do a binary search? 916 00:45:55,201 --> 00:45:57,682 AUDIENCE: [INAUDIBLE]. 917 00:45:57,682 --> 00:45:59,640 PROFESSOR: So if I go through each element here 918 00:45:59,640 --> 00:46:02,139 and I do a binary search, which is a nice and easy algorithm 919 00:46:02,139 --> 00:46:03,800 that I can explain in 10 seconds, 920 00:46:03,800 --> 00:46:06,520 I'm already faster than L1 L2. 921 00:46:06,520 --> 00:46:08,680 So it's worth sorting. 922 00:46:08,680 --> 00:46:11,200 Now, the algorithm that we use in class 923 00:46:11,200 --> 00:46:15,790 takes time proportional to L1 plus L2. 924 00:46:15,790 --> 00:46:17,590 So that's even trickier, and it's 925 00:46:17,590 --> 00:46:19,300 going to take a bit more time to explain. 926 00:46:21,970 --> 00:46:24,250 Does-- did anyone understand the algorithm from class 927 00:46:24,250 --> 00:46:25,690 and wants to help me explain it? 928 00:46:30,580 --> 00:46:32,790 Didn't think so. 929 00:46:32,790 --> 00:46:34,140 OK. 930 00:46:34,140 --> 00:46:39,230 So the idea is that both of these vectors are sorted. 931 00:46:39,230 --> 00:46:46,390 So if I have a 3 here and I found my 3 here, next time when 932 00:46:46,390 --> 00:46:49,850 I get to 4, I know for sure that 4 is not 933 00:46:49,850 --> 00:46:54,370 going to be anywhere here, because this vector is sorted. 934 00:46:54,370 --> 00:46:57,550 Say I couldn't find 4, then I go to 6. 935 00:46:57,550 --> 00:47:00,850 If 6 is here, when I have to look for 8, 936 00:47:00,850 --> 00:47:04,260 I know for sure that 8 is not going to be anywhere up here. 937 00:47:07,210 --> 00:47:11,290 So what I do is, I have a pointer here that remembers, 938 00:47:11,290 --> 00:47:15,684 where's the last element that I have seen? 939 00:47:15,684 --> 00:47:16,975 Does this make sense to people?
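The binary-search lookup mentioned above can be sketched with the standard `bisect` module. The function name and the (word number, count) pair representation are assumptions for illustration; the two vectors are the numbered documents from the board example.

```python
from bisect import bisect_left


def inner_product_bsearch(v1, v2):
    """Inner product of two frequency vectors; v2 must be sorted by word.

    Each vector is a list of (word_number, count) pairs.
    """
    keys2 = [word for word, _ in v2]      # sorted keys of the second vector
    total = 0
    for word, count in v1:                # L1 iterations
        i = bisect_left(keys2, word)      # about log(L2) steps per lookup
        if i < len(keys2) and keys2[i] == word:
            total += count * v2[i][1]
    return total


doc1 = [(3, 2), (4, 1), (6, 1), (8, 1), (9, 1)]
doc2 = [(2, 1), (3, 1), (6, 1), (7, 1), (8, 1)]
print(inner_product_bsearch(doc1, doc2))
```

Words 3, 6, and 8 appear in both documents, so the products that get added are 2 times 1, 1 times 1, and 1 times 1. The loop does L1 lookups of about log L2 steps each.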
940 00:47:19,890 --> 00:47:22,500 So when I start here and I look at 3, 941 00:47:22,500 --> 00:47:26,300 I have a pointer here that says, I didn't see anything here. 942 00:47:26,300 --> 00:47:28,170 I look at 2, it's not 3. 943 00:47:28,170 --> 00:47:29,050 It's smaller. 944 00:47:29,050 --> 00:47:29,940 I look at 3. 945 00:47:29,940 --> 00:47:30,950 I found it, good. 946 00:47:30,950 --> 00:47:32,720 I do my product. 947 00:47:32,720 --> 00:47:35,610 Then I go look for the next element here, 4. 948 00:47:35,610 --> 00:47:36,845 I look at 6. 949 00:47:36,845 --> 00:47:38,840 6 is bigger than 4. 950 00:47:38,840 --> 00:47:42,980 So I know for sure that nothing below it is going to be 4. 951 00:47:42,980 --> 00:47:46,140 So I can stop right here and keep my pointer here. 952 00:47:46,140 --> 00:47:47,840 Then I go to 6 here. 953 00:47:47,840 --> 00:47:48,580 And I look here. 954 00:47:48,580 --> 00:47:49,300 Where did I stop? 955 00:47:49,300 --> 00:47:50,520 I stop here. 956 00:47:50,520 --> 00:47:52,030 This element matches this. 957 00:47:52,030 --> 00:47:53,750 I do my product. 958 00:47:53,750 --> 00:47:56,150 Keep it in. 959 00:47:56,150 --> 00:47:58,030 Now I go to 8 here. 960 00:47:58,030 --> 00:47:59,310 I was at 6. 961 00:47:59,310 --> 00:48:01,320 6 is smaller than 8. 962 00:48:01,320 --> 00:48:03,236 7 is smaller than 8. 963 00:48:03,236 --> 00:48:04,420 8 is equal to 8. 964 00:48:04,420 --> 00:48:05,630 I found something. 965 00:48:05,630 --> 00:48:06,850 Sweet. 966 00:48:06,850 --> 00:48:08,500 I do a product. 967 00:48:08,500 --> 00:48:09,730 And then I stop. 968 00:48:09,730 --> 00:48:12,320 And I look at the next element. 969 00:48:12,320 --> 00:48:15,830 I know for sure that nothing here is going to be 9, 970 00:48:15,830 --> 00:48:17,610 so I can keep looking down. 971 00:48:17,610 --> 00:48:20,160 I hit the end of my list. 972 00:48:20,160 --> 00:48:21,310 OK. 
973 00:48:21,310 --> 00:48:25,069 Whatever I have down here-- it's going 9, 10, 11, 12-- 974 00:48:25,069 --> 00:48:27,360 is not going to be in this list, because it was sorted. 975 00:48:27,360 --> 00:48:30,475 So I can stop. 976 00:48:30,475 --> 00:48:34,340 AUDIENCE: How do you keep going-- 977 00:48:34,340 --> 00:48:37,340 What algorithm do you use to keep looking down on L2? 978 00:48:37,340 --> 00:48:38,204 PROFESSOR: Plus 1. 979 00:48:38,204 --> 00:48:39,120 AUDIENCE: So what if-- 980 00:48:39,120 --> 00:48:40,328 PROFESSOR: I keep going down. 981 00:48:40,328 --> 00:48:46,360 AUDIENCE: What if the left side was 3, 4, 6, 9243? 982 00:48:46,360 --> 00:48:47,370 PROFESSOR: 9,000, what? 983 00:48:47,370 --> 00:48:48,470 AUDIENCE: Just a big number. 984 00:48:48,470 --> 00:48:49,136 AUDIENCE: Right. 985 00:48:49,136 --> 00:48:52,714 Then you have to increment by 1 each time-- 986 00:48:52,714 --> 00:48:54,130 PROFESSOR: So I'm not incrementing 987 00:48:54,130 --> 00:48:56,080 the number I'm looking at. 988 00:48:56,080 --> 00:48:57,710 Here, I looked at 6. 989 00:48:57,710 --> 00:48:59,610 Then I'm looking for 8. 990 00:48:59,610 --> 00:49:03,180 And after I found 8, I'm looking for 9,000. 991 00:49:03,180 --> 00:49:06,800 So I'm going to go down, either until I find 9,000 992 00:49:06,800 --> 00:49:07,710 or until I stop. 993 00:49:07,710 --> 00:49:10,382 AUDIENCE: So are you going to go down one at a time-- 994 00:49:10,382 --> 00:49:11,007 PROFESSOR: Yep. 995 00:49:11,007 --> 00:49:13,840 AUDIENCE: --why not do a binary search? 996 00:49:13,840 --> 00:49:16,560 PROFESSOR: Because if I do a binary search, 997 00:49:16,560 --> 00:49:20,550 the analysis that says it's fast is not going to work. 998 00:49:20,550 --> 00:49:23,550 It turns out that this gives me the optimal running time. 999 00:49:23,550 --> 00:49:28,830 If I do a binary search, suppose I have a list that's like this. 1000 00:49:28,830 --> 00:49:33,509 1, 2, 3, 4, 5, all the way down to 10,000.
1001 00:49:33,509 --> 00:49:36,750 Ugh, I can't write. 1002 00:49:36,750 --> 00:49:39,920 And I have another list that's like this-- 1, 2, 3, 4, 1003 00:49:39,920 --> 00:49:42,800 all the way down to 10,000. 1004 00:49:42,800 --> 00:49:44,720 1, 1. 1005 00:49:44,720 --> 00:49:49,830 I do a binary search, takes log N. I look at 2. 1006 00:49:49,830 --> 00:49:53,230 I do a binary search, takes almost log N. I look at 3, 1007 00:49:53,230 --> 00:49:56,352 do a binary search, takes almost log N. So on, so forth. 1008 00:49:59,550 --> 00:50:00,705 So this is-- 1009 00:50:00,705 --> 00:50:03,214 AUDIENCE: But if your left list had been-- 1010 00:50:03,214 --> 00:50:04,380 PROFESSOR: Log N plus log N. 1011 00:50:04,380 --> 00:50:06,034 AUDIENCE: --10,000. 1012 00:50:06,034 --> 00:50:06,700 PROFESSOR: Yeah. 1013 00:50:06,700 --> 00:50:09,770 AUDIENCE: You'd have taken N time. 1014 00:50:09,770 --> 00:50:10,630 PROFESSOR: Yes. 1015 00:50:10,630 --> 00:50:13,030 Well, this algorithm takes N time, 1016 00:50:13,030 --> 00:50:15,800 even if I have two lists like this. 1017 00:50:15,800 --> 00:50:19,490 It takes 10,000 plus 10,000 time. 1018 00:50:19,490 --> 00:50:21,700 Whereas this algorithm will take time that's actually 1019 00:50:21,700 --> 00:50:29,910 proportional to 10,000 log 10,000. 1020 00:50:29,910 --> 00:50:30,740 You believe me. 1021 00:50:30,740 --> 00:50:34,480 So, a way to look at this is to do bounds and say, 1022 00:50:34,480 --> 00:50:37,140 for the elements 1 through 5,000, 1023 00:50:37,140 --> 00:50:41,470 it's going to do a binary search for more than 5,000 elements. 1024 00:50:41,470 --> 00:50:44,170 So, the time-- the running time is definitely bigger 1025 00:50:44,170 --> 00:50:48,366 than N over 2 times log of N over 2. 1026 00:50:51,840 --> 00:50:52,630 The 1/2 is a constant. 1027 00:50:52,630 --> 00:51:00,710 The log becomes log N minus 1, so this is order N log N. 1028 00:51:00,710 --> 00:51:01,900 That is a good question.
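[Editor's note: the bound sketched at the board can be written out as below. This is a reconstruction under stated assumptions, not the board work itself: both lists have N elements, each binary search costs about the base-2 log of the range it searches, and the symbol T_binary(N) for the total cost is introduced here for illustration.]

```latex
% For each of the first N/2 elements, the search range in the other list
% still holds more than N/2 elements, so each search costs at least
% about log_2(N/2) steps.
\[
T_{\text{binary}}(N) \;\ge\; \frac{N}{2}\,\log_2\frac{N}{2}
  \;=\; \frac{N}{2}\,\bigl(\log_2 N - 1\bigr)
  \;=\; \Omega(N \log N)
\]
% The one-step-at-a-time pointer moves at most N times in each list:
\[
T_{\text{merge}}(N) \;\le\; N + N \;=\; 2N \;=\; O(N)
\]
```

[With N = 10,000, that is roughly 5,000 searches of at least about 12 steps each, against at most 20,000 pointer moves.]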
1029 00:51:01,900 --> 00:51:04,340 I wondered about that the first time I saw merge-sort too. 1030 00:51:04,340 --> 00:51:05,810 And I was thinking, hey, I'm going 1031 00:51:05,810 --> 00:51:07,768 to do a binary search here because it's faster, 1032 00:51:07,768 --> 00:51:10,510 and I'm going to make a faster algorithm than anyone has ever 1033 00:51:10,510 --> 00:51:11,390 seen. 1034 00:51:11,390 --> 00:51:13,320 Well, if you do the analysis, not so much. 1035 00:51:13,320 --> 00:51:14,660 But you need to think about it, and you 1036 00:51:14,660 --> 00:51:16,660 need to know why that's true or that's not true. 1037 00:51:16,660 --> 00:51:17,670 So I like your question. 1038 00:51:17,670 --> 00:51:20,140 Thank you. 1039 00:51:20,140 --> 00:51:22,520 So now let's get down to the plain old merge-sort 1040 00:51:22,520 --> 00:51:26,860 that everyone-- sorry, merge that everyone knows. 1041 00:51:26,860 --> 00:51:30,610 So if we go through these one by one, 1042 00:51:30,610 --> 00:51:33,880 how many times am I going to be advancing this pointer? 1043 00:51:33,880 --> 00:51:35,498 In total? 1044 00:51:35,498 --> 00:51:36,920 AUDIENCE: L2? 1045 00:51:36,920 --> 00:51:38,160 PROFESSOR: L2 times. 1046 00:51:38,160 --> 00:51:42,540 So this pointer can only go down, right? 1047 00:51:42,540 --> 00:51:45,660 So worst case, it's going to go down L2 times. 1048 00:51:45,660 --> 00:51:48,550 And then I'm done with the list, return. 1049 00:51:48,550 --> 00:51:50,580 How many elements am I going to look through? 1050 00:51:50,580 --> 00:51:53,301 So how many times is this pointer going to advance? 1051 00:51:53,301 --> 00:51:53,842 AUDIENCE: L1? 1052 00:51:56,700 --> 00:51:59,080 PROFESSOR: This one. 1053 00:51:59,080 --> 00:52:00,550 AUDIENCE: But I thought, like-- 1054 00:52:00,550 --> 00:52:03,490 AUDIENCE: Then you get extra ones in between. 1055 00:52:03,490 --> 00:52:05,560 PROFESSOR: What if L2 is bigger than L1?
1056 00:52:05,560 --> 00:52:06,780 What if-- 1057 00:52:06,780 --> 00:52:07,660 AUDIENCE: Oh, right. 1058 00:52:07,660 --> 00:52:08,540 OK, never mind. 1059 00:52:08,540 --> 00:52:10,499 The reason I said L2 was because-- 1060 00:52:10,499 --> 00:52:13,040 PROFESSOR: You're thinking after I'm going through this list, 1061 00:52:13,040 --> 00:52:13,700 I'm out, right? 1062 00:52:13,700 --> 00:52:15,325 AUDIENCE: Right, that's what I thought. 1063 00:52:15,325 --> 00:52:17,640 PROFESSOR: So, your answer works if this list, say, has 1064 00:52:17,640 --> 00:52:19,620 10 elements and this has 10,000. 1065 00:52:19,620 --> 00:52:22,230 And I go through this one really quickly. 1066 00:52:22,230 --> 00:52:25,697 But if this list has 10 elements and this list has 10,000, 1067 00:52:25,697 --> 00:52:27,280 and they both start with 1 through 10, 1068 00:52:27,280 --> 00:52:31,660 I have to say a 1 because that's a better bound. 1069 00:52:31,660 --> 00:52:33,596 So I have to say this. 1070 00:52:33,596 --> 00:52:35,345 AUDIENCE: Could you say the opposite value 1071 00:52:35,345 --> 00:52:37,450 of the difference between the two of them? 1072 00:52:37,450 --> 00:52:39,486 Because if you're going to stop when one has 10 1073 00:52:39,486 --> 00:52:43,120 and the other one has 10,000, and let's say 1074 00:52:43,120 --> 00:52:46,579 that only the first ten are actually equal, 1075 00:52:46,579 --> 00:52:48,870 then you're going to go through that list, find all 10, 1076 00:52:48,870 --> 00:52:50,110 and you stop. 1077 00:52:50,110 --> 00:52:51,059 That would be 9,000-- 1078 00:52:51,059 --> 00:52:53,350 PROFESSOR: I could say that if I'm looking at one 1079 00:52:53,350 --> 00:52:56,160 case, but the magic trick is-- let's-- we're looking 1080 00:52:56,160 --> 00:52:57,680 at the worst case. 1081 00:52:57,680 --> 00:52:59,844 So worst case, if I have 10 elements, 1082 00:52:59,844 --> 00:53:01,760 they'll be all the way down in the other list.
1083 00:53:01,760 --> 00:53:03,010 Or they won't be there at all. 1084 00:53:03,010 --> 00:53:04,970 And I have to go down through all the list. 1085 00:53:04,970 --> 00:53:07,670 AUDIENCE: OK. 1086 00:53:07,670 --> 00:53:09,640 PROFESSOR: So worst case, L1 plus L2. 1087 00:53:15,000 --> 00:53:18,760 Let me see if I have any time left. 1088 00:53:18,760 --> 00:53:19,720 Nope. 1089 00:53:19,720 --> 00:53:25,050 So, what I would like you to do is go through insertion sort. 1090 00:53:25,050 --> 00:53:27,700 Insertion sort matches the textbook. 1091 00:53:27,700 --> 00:53:30,540 Look at the definition in the textbook, look at the code, 1092 00:53:30,540 --> 00:53:32,009 convince yourself it's the same. 1093 00:53:32,009 --> 00:53:33,800 Look at the running time, convince yourself 1094 00:53:33,800 --> 00:53:35,280 it's N squared. 1095 00:53:35,280 --> 00:53:38,880 Then look at inner product and convince yourself 1096 00:53:38,880 --> 00:53:41,360 that this is what it does. 1097 00:53:41,360 --> 00:53:44,820 Go through this line by line, see where they match, 1098 00:53:44,820 --> 00:53:46,100 put the cost on. 1099 00:53:46,100 --> 00:53:49,040 Make sure that the cost is L1 plus L2. 1100 00:53:49,040 --> 00:53:52,810 And last, go through merge-sort and notice 1101 00:53:52,810 --> 00:53:56,660 that merge in [INAUDIBLE] 6 is exactly the same 1102 00:53:56,660 --> 00:53:57,880 as inner product. 1103 00:53:57,880 --> 00:54:00,020 So this pointer magic that I did here 1104 00:54:00,020 --> 00:54:02,440 is exactly what's happening inside merge. 1105 00:54:02,440 --> 00:54:03,570 And understand merge-sort. 1106 00:54:03,570 --> 00:54:05,630 Look at the textbook, look at the notes, 1107 00:54:05,630 --> 00:54:08,270 and see how they match. 1108 00:54:08,270 --> 00:54:08,870 OK. 1109 00:54:08,870 --> 00:54:10,720 Thanks, guys.
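[Editor's note: the pointer walk described in this recitation can be sketched in Python as follows. This is a minimal illustration, not the course's posted code. The function name `inner_product`, the `(word_id, count)` tuple layout, and the sample vectors are assumptions made here, and each vector is assumed to be sorted by word id with no repeated ids.]

```python
def inner_product(v1, v2):
    """Inner product of two sparse vectors given as lists of
    (word_id, count) pairs, each sorted by word_id (assumed layout)."""
    i = 0      # pointer into v1
    j = 0      # pointer into v2; it only ever moves forward
    total = 0
    while i < len(v1) and j < len(v2):
        id1, count1 = v1[i]
        id2, count2 = v2[j]
        if id1 == id2:
            # Same word in both vectors: multiply the counts.
            total += count1 * count2
            i += 1
            j += 1
        elif id1 < id2:
            # v2 is sorted, so id1 cannot appear later in v2: skip it.
            i += 1
        else:
            # id2 is too small to match anything left in v1: skip it.
            j += 1
    # One list is exhausted; nothing left in the other can match.
    return total


# Hypothetical vectors in the spirit of the board example.
left = [(3, 2), (4, 1), (6, 5), (8, 1), (9, 2)]
right = [(2, 7), (3, 1), (6, 2), (7, 4), (8, 3)]
print(inner_product(left, right))
```

[Every pass through the loop advances at least one of the two pointers and neither moves backward, so the loop runs at most L1 plus L2 times, which is the worst-case bound from the discussion above.]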