1 00:00:00,060 --> 00:00:01,770 The following content is provided 2 00:00:01,770 --> 00:00:04,010 under a Creative Commons license. 3 00:00:04,010 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue 4 00:00:06,860 --> 00:00:10,720 to offer high quality educational resources for free. 5 00:00:10,720 --> 00:00:13,320 To make a donation, or view additional materials 6 00:00:13,320 --> 00:00:17,207 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,207 --> 00:00:17,832 at ocw.mit.edu. 8 00:00:20,469 --> 00:00:22,510 PROFESSOR: The things we can talk about today, we 9 00:00:22,510 --> 00:00:24,000 can talk about this code. 10 00:00:24,000 --> 00:00:28,370 We can talk a little bit more about the hash functions. 11 00:00:28,370 --> 00:00:31,170 And we can talk a little bit more about amortization. 12 00:00:31,170 --> 00:00:32,519 What do you guys want to hear? 13 00:00:35,110 --> 00:00:36,980 AUDIENCE: Amortization. 14 00:00:36,980 --> 00:00:38,920 PROFESSOR: OK, so one vote for amortization. 15 00:00:38,920 --> 00:00:42,690 So who wants to look at the PSET code? 16 00:00:42,690 --> 00:00:46,810 Who wants to talk about hashes? 17 00:00:46,810 --> 00:00:50,726 Who wants to talk about amortization? 18 00:00:50,726 --> 00:00:51,850 Two, three, four, five, OK. 19 00:00:51,850 --> 00:00:53,810 So then let's try this. 20 00:00:53,810 --> 00:00:56,970 Let's look at the PSET code then talk about amortization a bit 21 00:00:56,970 --> 00:00:57,920 at the end. 22 00:00:57,920 --> 00:01:00,570 I do have to talk a little bit about hashes 23 00:01:00,570 --> 00:01:04,489 though, because I owe someone a question from last time. 24 00:01:04,489 --> 00:01:08,300 And the question was, we have rolling hashes, 25 00:01:08,300 --> 00:01:11,210 so the hashes look like this. 26 00:01:11,210 --> 00:01:16,170 K where K is a big number, modulo p. 27 00:01:16,170 --> 00:01:21,340 And we argued that it's really nice if p is a prime.
28 00:01:21,340 --> 00:01:27,130 And then the question was, what if instead p is 2 to the w, 29 00:01:27,130 --> 00:01:31,900 and is not prime, as long as the base that we're using 30 00:01:31,900 --> 00:01:36,070 is co-prime with p? 31 00:01:36,070 --> 00:01:37,930 Does this work? 32 00:01:37,930 --> 00:01:39,760 And the answer is-- I didn't want 33 00:01:39,760 --> 00:01:42,630 to say yes without making sure that I don't say something 34 00:01:42,630 --> 00:01:44,130 stupid-- but the answer is yes, this 35 00:01:44,130 --> 00:01:48,000 works just fine, because the way you compute a multiplicative 36 00:01:48,000 --> 00:01:50,310 inverse is you use something 37 00:01:50,310 --> 00:01:58,745 called the extended Euclid's method. 38 00:02:03,140 --> 00:02:08,350 And if we have b and p, then if we compute their GCD, 39 00:02:08,350 --> 00:02:17,470 that's the greatest common divisor-- so GCD is greatest-- 40 00:02:24,710 --> 00:02:32,580 If you use extended Euclid you get something like xb plus yp 41 00:02:32,580 --> 00:02:40,340 equals GCD of b and p. 42 00:02:40,340 --> 00:02:48,500 So if this is 1, then you have xb plus yp equals 1. 43 00:02:48,500 --> 00:02:53,250 And if you're working modulo p, whatever that is, 44 00:02:53,250 --> 00:02:59,410 then you have that xb is 1 mod p. 45 00:02:59,410 --> 00:03:03,100 So there's your multiplicative inverse. 46 00:03:03,100 --> 00:03:04,840 Well so now that's nice math, right? 47 00:03:04,840 --> 00:03:08,860 But that doesn't tell me why we are not using this. 48 00:03:08,860 --> 00:03:11,465 So the multiplicative inverse would work, 49 00:03:11,465 --> 00:03:12,840 but there's something else that's 50 00:03:12,840 --> 00:03:15,941 wrong with using 2 to the w. 51 00:03:18,710 --> 00:03:22,170 Will this give me a good hash function? 52 00:03:22,170 --> 00:03:24,300 OK, the fact that it's p might be confusing. 53 00:03:24,300 --> 00:03:34,032 So let's say h equals K mod 2 to the w.
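The extended Euclid argument in this exchange (find x and y with xb + yp = GCD(b, p), then read off x as the inverse when the GCD is 1) can be sketched in Python as follows. This is a minimal illustration, not code from the PSET: the function names and the sample base 257 are assumptions chosen only so that the base is co-prime with a power-of-two modulus.

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


def mod_inverse(b, p):
    """Return x with (x * b) % p == 1, or raise if b and p share a factor."""
    g, x, _ = extended_gcd(b, p)
    if g != 1:
        raise ValueError("b and p are not co-prime, so no inverse exists")
    return x % p


# Hypothetical values: an odd base with a power-of-two modulus.
p = 2 ** 32
b = 257
inv = mod_inverse(b, p)
print((inv * b) % p)  # prints 1
```

The inverse exists for any modulus as long as the GCD is 1, which is why primality of p is not what this particular step depends on.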
54 00:03:34,032 --> 00:03:38,540 And remember that the K is some digits in base b, right? 55 00:03:38,540 --> 00:03:41,322 It's a big number made out of digits in base b. 56 00:03:41,322 --> 00:03:52,390 So K is d1, d2, d3, all the way up until d length in base b. 57 00:03:56,310 --> 00:04:00,550 And I'll make things easier and say that b is 2 to the 8, 58 00:04:00,550 --> 00:04:03,810 because we're working with ASCII characters, or colors, 59 00:04:03,810 --> 00:04:07,330 or something that fits nicely in a byte. 60 00:04:07,330 --> 00:04:11,214 So what could go wrong with using this? 61 00:04:11,214 --> 00:04:14,430 AUDIENCE: Well if your series of-- if your K is bigger than 62 00:04:14,430 --> 00:04:17,111 2-- if K is bigger than 2 to the w-- 63 00:04:17,111 --> 00:04:18,444 PROFESSOR: It will be, for sure. 64 00:04:18,444 --> 00:04:20,694 AUDIENCE: Yes, that's the problem, because then you'll 65 00:04:20,694 --> 00:04:21,490 loop. 66 00:04:21,490 --> 00:04:24,060 You'll get the same hashes for-- 67 00:04:24,060 --> 00:04:25,260 PROFESSOR: Yeah, yeah. 68 00:04:25,260 --> 00:04:32,430 So you will get-- So hashing takes a lot of possible inputs 69 00:04:32,430 --> 00:04:37,790 and maps them to a relatively small set of outputs. 70 00:04:37,790 --> 00:04:41,230 Inputs, hash, output. 71 00:04:41,230 --> 00:04:42,650 And we argued last time that we're 72 00:04:42,650 --> 00:04:44,316 going to have collisions no matter what, 73 00:04:44,316 --> 00:04:47,540 because we have a ton of inputs and not that many outputs. 74 00:04:47,540 --> 00:04:49,250 For example, if we're hashing strings 75 00:04:49,250 --> 00:04:51,650 that are a million characters then this 76 00:04:51,650 --> 00:04:58,600 is going to be 2 to the 8 to the 1 million possible strings. 77 00:04:58,600 --> 00:05:00,740 And then the number of possible values 78 00:05:00,740 --> 00:05:04,890 is, if we're using the word size, 2 to the 32.
79 00:05:04,890 --> 00:05:07,045 There is no way we can design a function that 80 00:05:07,045 --> 00:05:09,500 will take this many inputs, map them to this many outputs, 81 00:05:09,500 --> 00:05:11,270 and not have collisions. 82 00:05:11,270 --> 00:05:12,560 But instead, what do we want? 83 00:05:12,560 --> 00:05:13,893 What makes a good hash function? 84 00:05:21,290 --> 00:05:25,770 Say my hash function is 0 for all the K's. 85 00:05:25,770 --> 00:05:27,393 Is that a good hash function? 86 00:05:27,393 --> 00:05:29,382 AUDIENCE: It's an excellent hash function. 87 00:05:29,382 --> 00:05:30,715 PROFESSOR: What's wrong with it? 88 00:05:30,715 --> 00:05:32,655 AUDIENCE: You would put everything in one 89 00:05:32,655 --> 00:05:36,050 so that it's searching, or it would take a long time? 90 00:05:36,050 --> 00:05:38,922 PROFESSOR: Yeah, searching takes a long time. 91 00:05:38,922 --> 00:05:40,630 And we don't do sorting with this yet. 92 00:05:40,630 --> 00:05:42,580 Searching takes a long time, string sub-matching 93 00:05:42,580 --> 00:05:44,121 will take a long time, it's horrible. 94 00:05:46,748 --> 00:05:49,690 AUDIENCE: So what would that distribute-- 95 00:05:49,690 --> 00:05:51,470 like [INAUDIBLE] over all-- 96 00:05:51,470 --> 00:05:53,220 PROFESSOR: All right, so we want something 97 00:05:53,220 --> 00:05:56,560 that looks sort of random. 98 00:05:56,560 --> 00:05:58,980 The ideal hash function takes an input, then 99 00:05:58,980 --> 00:06:02,127 gives it a random output, and then stays consistent. 100 00:06:02,127 --> 00:06:04,210 So when it sees an input again, it returns the same output. 101 00:06:10,030 --> 00:06:13,060 So I think distribute is the keyword here. 102 00:06:13,060 --> 00:06:15,960 What's wrong with this hash function? 103 00:06:20,110 --> 00:06:21,650 If it takes random data, it's going 104 00:06:21,650 --> 00:06:22,820 to distribute it randomly. 105 00:06:22,820 --> 00:06:24,153 That's true, so that's all good.
106 00:06:24,153 --> 00:06:27,260 But what data that we might see in real life 107 00:06:27,260 --> 00:06:30,786 will make it behave badly? 108 00:06:30,786 --> 00:06:34,690 AUDIENCE: The K is a series of characters, right? 109 00:06:34,690 --> 00:06:35,620 PROFESSOR: Maybe. 110 00:06:35,620 --> 00:06:37,120 AUDIENCE: It just could be anything. 111 00:06:37,120 --> 00:06:41,760 But we know for sure that L will be larger than w. 112 00:06:41,760 --> 00:06:43,270 PROFESSOR: Say L is a million. 113 00:06:43,270 --> 00:06:46,495 AUDIENCE: OK, well that sucks. 114 00:06:46,495 --> 00:06:47,245 PROFESSOR: Oh, no. 115 00:06:47,245 --> 00:06:48,920 That in itself, that doesn't suck. 116 00:06:48,920 --> 00:06:51,010 That's what lets us do sub-string matching really 117 00:06:51,010 --> 00:06:52,746 fast, even if we have large strings. 118 00:06:52,746 --> 00:06:54,370 AUDIENCE: --say for 2 to the w, though, 119 00:06:54,370 --> 00:06:56,909 because then it will be much larger, like the number of-- 120 00:06:56,909 --> 00:06:58,200 PROFESSOR: Yeah, but that's OK. 121 00:06:58,200 --> 00:07:03,410 So I'm OK with doing this as long as all the values here 122 00:07:03,410 --> 00:07:06,375 are distributed sort of uniformly here. 123 00:07:06,375 --> 00:07:07,000 So that's fine. 124 00:07:07,000 --> 00:07:07,900 AUDIENCE: OK. 125 00:07:07,900 --> 00:07:11,370 PROFESSOR: But there's-- I'm arguing that there are some 126 00:07:11,370 --> 00:07:14,660 values which will make this hash function behave badly. 127 00:07:14,660 --> 00:07:17,770 And that those values are so simple that we might see them 128 00:07:17,770 --> 00:07:18,390 in real life. 129 00:07:26,380 --> 00:07:29,410 OK, what if all these numbers are-- what 130 00:07:29,410 --> 00:07:32,670 if all the digits are even? 131 00:07:32,670 --> 00:07:35,820 So d is 0 mod 2. 132 00:07:40,150 --> 00:07:41,480 What happens to K?
133 00:07:45,506 --> 00:07:47,880 AUDIENCE: Well, you're saying that instead of 2 to the w, 134 00:07:47,880 --> 00:07:50,120 we're just using 2. 135 00:07:50,120 --> 00:07:52,392 PROFESSOR: So no, the modulo is 2 to the w. 136 00:07:52,392 --> 00:07:55,070 Say it's 2 to the 32. 137 00:07:55,070 --> 00:08:01,630 So d are the digits that make up my K. So what 138 00:08:01,630 --> 00:08:03,160 if the base is 2 to the 8? 139 00:08:03,160 --> 00:08:06,850 So I have digits from 0 to 255, 256 of them. 140 00:08:06,850 --> 00:08:08,990 And all the digits are 0 modulo 2. 141 00:08:14,380 --> 00:08:16,640 For my sub-string matching example, 142 00:08:16,640 --> 00:08:18,930 what if all the characters in the sub-string are even? 143 00:08:22,698 --> 00:08:24,650 AUDIENCE: [INAUDIBLE] 144 00:08:24,650 --> 00:08:25,900 PROFESSOR: Not the same thing. 145 00:08:25,900 --> 00:08:26,850 But there's a problem. 146 00:08:26,850 --> 00:08:30,240 They will hash to-- so if all the digits are 147 00:08:30,240 --> 00:08:36,640 0 modulo 2 then what about the K? 148 00:08:36,640 --> 00:08:39,015 AUDIENCE: [INAUDIBLE] 0 modulo 2-- 149 00:08:39,015 --> 00:08:39,640 PROFESSOR: Yep. 150 00:08:44,640 --> 00:08:47,805 So it's just like when you have numbers in base 10. 151 00:08:47,805 --> 00:08:50,520 10 happens to be divisible by 2. 152 00:08:50,520 --> 00:08:54,480 So if your last digit is even, then the entire number is even. 153 00:08:57,451 --> 00:08:58,450 That makes sense, right? 154 00:08:58,450 --> 00:08:59,820 That's math. 155 00:08:59,820 --> 00:09:02,230 Please nod, tell me that I'm making sense. 156 00:09:02,230 --> 00:09:06,010 OK, so here, the base is 256. 157 00:09:06,010 --> 00:09:07,310 And it's also divisible by 2. 158 00:09:07,310 --> 00:09:10,010 So if your last digit is divisible by 2, 159 00:09:10,010 --> 00:09:13,100 then the whole number is divisible by 2. 
160 00:09:13,100 --> 00:09:19,090 So then if I take this K modulo 2 to the 32 161 00:09:19,090 --> 00:09:21,180 then the hash is also going to be divisible by 2. 162 00:09:29,084 --> 00:09:32,048 AUDIENCE: Why does it matter if the hash is divisible by 2? 163 00:09:35,020 --> 00:09:39,550 PROFESSOR: So it matters because this 164 00:09:39,550 --> 00:09:41,430 is supposed to be my universe, right? 165 00:09:41,430 --> 00:09:43,490 These are supposed to be all the outputs. 166 00:09:43,490 --> 00:09:46,344 And I'm saying that if my inputs look like this, 167 00:09:46,344 --> 00:09:48,760 then the hash function will not distribute them uniformly. 168 00:09:48,760 --> 00:09:52,470 Instead, if this is my possible set of outputs, 169 00:09:52,470 --> 00:09:56,123 the hash function will always put outputs in this half. 170 00:09:56,123 --> 00:09:58,200 So the outputs will always be here. 171 00:09:58,200 --> 00:10:01,220 And these are the numbers that are divisible by 2. 172 00:10:01,220 --> 00:10:04,297 So these are even, and these are odd. 173 00:10:10,770 --> 00:10:12,140 And this area gets no love. 174 00:10:12,140 --> 00:10:16,070 Absolutely no number will hash here. 175 00:10:16,070 --> 00:10:16,820 So-- 176 00:10:16,820 --> 00:10:20,900 AUDIENCE: Wait, what about something with all odds? 177 00:10:20,900 --> 00:10:22,902 AUDIENCE: Something with all odd digits? 178 00:10:22,902 --> 00:10:23,860 AUDIENCE: Because you're asking-- 179 00:10:23,860 --> 00:10:25,443 AUDIENCE: You have all A's rather than 180 00:10:25,443 --> 00:10:28,597 all B's in your sub-string or in your string. 181 00:10:28,597 --> 00:10:31,002 AUDIENCE: Or because your last digit was odd. 182 00:10:31,002 --> 00:10:32,838 PROFESSOR: If all of our digits are odd 183 00:10:32,838 --> 00:10:34,590 then the last digit is odd. 184 00:10:34,590 --> 00:10:38,341 And then you'd also get something odd, right? 185 00:10:38,341 --> 00:10:38,966 AUDIENCE: Yeah. 
186 00:10:38,966 --> 00:10:40,445 AUDIENCE: So there's a pattern. 187 00:10:40,445 --> 00:10:44,390 But there's an even distribution. 188 00:10:44,390 --> 00:10:49,050 PROFESSOR: Well if your hash function is always odd, 189 00:10:49,050 --> 00:10:50,940 then it's not an even distribution. 190 00:10:50,940 --> 00:10:51,728 It's-- 191 00:10:51,728 --> 00:10:53,162 AUDIENCE: Wait, our hash function? 192 00:10:53,162 --> 00:10:55,074 I thought we were talking about-- 193 00:10:55,074 --> 00:10:58,420 AUDIENCE: Isn't it even if your K is even? 194 00:10:58,420 --> 00:11:00,332 And if it's odd [INAUDIBLE]? 195 00:11:00,332 --> 00:11:02,510 PROFESSOR: Yeah, so that's bad. 196 00:11:02,510 --> 00:11:05,520 Because if all your K's happen to be even-- 197 00:11:05,520 --> 00:11:09,130 say if you're doing the nucleotides, 198 00:11:09,130 --> 00:11:15,020 and the nucleotides are A, C, G, T. 199 00:11:15,020 --> 00:11:20,391 If they happen to be encoded as, say, 0, 2, 4, 6, 200 00:11:20,391 --> 00:11:21,390 then these are all even. 201 00:11:23,920 --> 00:11:25,620 So the hash function will always be even 202 00:11:25,620 --> 00:11:28,889 and I'm wasting the last bit. 203 00:11:28,889 --> 00:11:30,930 So if I'm building a hash table, half the entries 204 00:11:30,930 --> 00:11:31,580 will be wasted. 205 00:11:31,580 --> 00:11:33,080 They'll never get anything in there. 206 00:11:33,080 --> 00:11:35,390 I'm just wasting memory. 207 00:11:35,390 --> 00:11:38,090 AUDIENCE: So if you could guarantee 208 00:11:38,090 --> 00:11:43,887 that your inputs would be evenly distributed-- 209 00:11:43,887 --> 00:11:45,470 PROFESSOR: So if our inputs are random 210 00:11:45,470 --> 00:11:47,530 then the hash function-- most hash functions will 211 00:11:47,530 --> 00:11:49,900 do a good job of producing a random output. 212 00:11:49,900 --> 00:11:53,600 The problem is real-life inputs are not random.
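The wasted-bit argument above can be checked with a small experiment. The sketch below assumes the encoding mentioned in the discussion (A, C, G, T stored as 0, 2, 4, 6) and base 256; the sample strings, the helper name `big_number`, and the prime 1000003 are arbitrary choices made for illustration only.

```python
def big_number(digits, base=256):
    """Interpret a sequence of digits as one big number K in the given base."""
    k = 0
    for d in digits:
        k = k * base + d
    return k


# Assumed encoding from the discussion: A, C, G, T stored as 0, 2, 4, 6.
ENCODING = {"A": 0, "C": 2, "G": 4, "T": 6}
SAMPLES = ["ACGT", "TTGACA", "GATTACA", "CCCCCCCCCC"]

POWER_OF_TWO = 2 ** 32
PRIME = 1000003  # an arbitrary prime chosen for this sketch

keys = [big_number([ENCODING[c] for c in s]) for s in SAMPLES]
pow2_hashes = [k % POWER_OF_TWO for k in keys]
prime_hashes = [k % PRIME for k in keys]

# With the power-of-two modulus every hash keeps the parity of the last digit.
print(all(h % 2 == 0 for h in pow2_hashes))  # prints True
print(prime_hashes)
```

Because 2 divides both the base and the modulus, reducing K modulo 2 to the 32 can never change its lowest bit, so the odd half of the table stays empty for this input. A prime modulus does not preserve parity in this way.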
213 00:11:53,600 --> 00:11:55,970 For example, if you get-- aside from this-- 214 00:11:55,970 --> 00:11:59,960 if you get data from a camera, so if you get your color pixels 215 00:11:59,960 --> 00:12:02,880 from a camera, then because of noise those 216 00:12:02,880 --> 00:12:06,340 might always have the last few bits be the same thing. 217 00:12:09,190 --> 00:12:12,290 Also it seems like in real life-- [INAUDIBLE], 218 00:12:12,290 --> 00:12:13,585 in his book, argues about this. 219 00:12:13,585 --> 00:12:15,085 It seems like in real life there are 220 00:12:15,085 --> 00:12:17,380 a lot of sequences that look like that, that would make 221 00:12:17,380 --> 00:12:19,006 your hash function behave poorly. 222 00:12:23,470 --> 00:12:25,940 So again, the keyword is distribute. 223 00:12:25,940 --> 00:12:28,340 If some non-random property in the input 224 00:12:28,340 --> 00:12:30,946 is reflected in the output, then that's a bad hash function. 225 00:12:33,547 --> 00:12:35,130 AUDIENCE: Would you gain a lot of time 226 00:12:35,130 --> 00:12:38,460 from your mod operation? 227 00:12:38,460 --> 00:12:45,580 Because in mod 2 to the n you just truncate any bits 228 00:12:45,580 --> 00:12:47,920 to the left of the n. 229 00:12:47,920 --> 00:12:50,360 PROFESSOR: Yeah, so that's why we would do this, right? 230 00:12:50,360 --> 00:12:52,440 That's why we're even considering this case. 231 00:12:52,440 --> 00:12:54,930 AUDIENCE: Because that'd be really nice to be able to not-- 232 00:12:54,930 --> 00:12:57,705 PROFESSOR: So modulo is faster, but in return my hash function 233 00:12:57,705 --> 00:12:59,780 is crap here. 234 00:12:59,780 --> 00:13:03,260 So usually we prefer-- it turns out that in practice nicer hash 235 00:13:03,260 --> 00:13:08,650 functions give better speed improvements overall. 236 00:13:08,650 --> 00:13:11,992 So if you think of how a hash table is laid out in memory, 237 00:13:11,992 --> 00:13:13,450 you'll see that, because of caching
238 00:13:13,450 --> 00:13:16,225 and everything, it's better to take more time on the mod 239 00:13:16,225 --> 00:13:21,320 function and use up all your memory for the hash table. 240 00:13:21,320 --> 00:13:24,740 So this is why we don't use that and we use this. 241 00:13:24,740 --> 00:13:27,150 Not because of this argument. 242 00:13:27,150 --> 00:13:29,770 So a good question required a lot of talking and remembering 243 00:13:29,770 --> 00:13:31,620 what's a good hash function, what's a bad hash function. 244 00:13:31,620 --> 00:13:32,120 Thank you. 245 00:13:41,370 --> 00:13:43,850 OK, let's look at the code a little bit. 246 00:13:43,850 --> 00:13:45,520 Everyone looked at it, right? 247 00:13:45,520 --> 00:13:48,490 So this time we have modules. 248 00:13:48,490 --> 00:13:50,490 We don't have everything in one big file. 249 00:13:50,490 --> 00:13:53,240 Can someone tell me what are the modules we care about, and why? 250 00:13:55,580 --> 00:13:57,080 AUDIENCE: The problem with the ones 251 00:13:57,080 --> 00:13:59,070 we have to code ourselves. 252 00:13:59,070 --> 00:14:01,340 PROFESSOR: OK, let's start with that. 253 00:14:01,340 --> 00:14:05,830 AUDIENCE: Sub-sequence hashes-- interval sub-sequence hashes. 254 00:14:05,830 --> 00:14:08,530 PROFESSOR: OK, so these are all in DNA seq, right? 255 00:14:08,530 --> 00:14:16,560 So the module is-- so yeah, the PSET hopefully 256 00:14:16,560 --> 00:14:20,030 says that you need to upload this file because it's 257 00:14:20,030 --> 00:14:22,250 the only file you'll need to modify. 258 00:14:22,250 --> 00:14:25,650 So everything that we need to write is here. 259 00:14:25,650 --> 00:14:29,130 Now pretty much everything that's in that file 260 00:14:29,130 --> 00:14:30,150 needs to be modified. 261 00:14:30,150 --> 00:14:32,870 So I'm not going to list them out. 262 00:14:32,870 --> 00:14:35,526 What else do we want to read in that PSET?
263 00:14:35,526 --> 00:14:37,390 AUDIENCE: Rolling [INAUDIBLE] 264 00:14:37,390 --> 00:14:38,985 PROFESSOR: OK, where is rolling hash? 265 00:14:38,985 --> 00:14:40,151 AUDIENCE: In the [INAUDIBLE] 266 00:14:54,570 --> 00:14:58,195 PROFESSOR: So what's different between the API in rolling hash 267 00:14:58,195 --> 00:15:00,230 and the API that we talked about last time? 268 00:15:03,690 --> 00:15:04,340 Yes? 269 00:15:04,340 --> 00:15:08,732 AUDIENCE: Them having the [INAUDIBLE] pop, 270 00:15:08,732 --> 00:15:09,708 or it would skip. 271 00:15:09,708 --> 00:15:12,148 And that's something else [INAUDIBLE] just has a slide, 272 00:15:12,148 --> 00:15:14,588 it puts everything in one operation. 273 00:15:14,588 --> 00:15:19,160 PROFESSOR: All right, so we have append and skip. 274 00:15:22,079 --> 00:15:23,870 And we built some beautiful code with that. 275 00:15:23,870 --> 00:15:27,060 And we looked at some fancy math because of it. 276 00:15:27,060 --> 00:15:28,610 But it turns out that for this PSET 277 00:15:28,610 --> 00:15:31,740 we can get away with slide. 278 00:15:31,740 --> 00:15:35,270 And we started from slide and built these two methods 279 00:15:35,270 --> 00:15:35,800 last time. 280 00:15:35,800 --> 00:15:38,070 So I'm not going to explain slide again. 281 00:15:38,070 --> 00:15:39,680 It's exactly what we had in the code 282 00:15:39,680 --> 00:15:43,550 before we started breaking them up. 283 00:15:43,550 --> 00:15:45,760 OK so this is the rolling hash. 284 00:15:45,760 --> 00:15:47,470 It is good. 285 00:15:47,470 --> 00:15:48,810 Do we care about anything else? 286 00:16:00,099 --> 00:16:02,390 AUDIENCE: I guess you can look at the rest of the code, 287 00:16:02,390 --> 00:16:03,812 if you feel like it. 288 00:16:03,812 --> 00:16:05,770 PROFESSOR: You can look at the rest of the code 289 00:16:05,770 --> 00:16:07,350 if you feel like it, yep. 
290 00:16:07,350 --> 00:16:10,325 So I highlighted one file that might be useful, 291 00:16:10,325 --> 00:16:11,200 and that's Kfasta.py. 292 00:16:21,440 --> 00:16:23,540 That file has a FASTA sequence class, 293 00:16:23,540 --> 00:16:26,890 and that reads from a file and returns something. 294 00:16:26,890 --> 00:16:29,470 And the important thing is it doesn't return a list. 295 00:16:29,470 --> 00:16:32,200 If you remember the doc dists, doc dist 1 through doc 296 00:16:32,200 --> 00:16:35,460 dist 8 dot py, fun times. 297 00:16:35,460 --> 00:16:38,070 What we had there was we took the input file, 298 00:16:38,070 --> 00:16:40,199 and we read it all into a list. 299 00:16:40,199 --> 00:16:41,490 This time we're not doing that. 300 00:16:41,490 --> 00:16:45,250 We're writing, what, 20 lines of code instead of what 301 00:16:45,250 --> 00:16:47,900 could be five lines of code to read the input. 302 00:16:47,900 --> 00:16:49,332 Why is that? 303 00:16:49,332 --> 00:16:51,020 AUDIENCE: Less memory? 304 00:16:51,020 --> 00:16:52,600 PROFESSOR: Less memory, OK. 305 00:16:52,600 --> 00:16:55,000 So if we're doing it this way, chances 306 00:16:55,000 --> 00:16:59,140 are that if we tried to shove the whole input into memory, 307 00:16:59,140 --> 00:17:00,657 it wouldn't fit. 308 00:17:00,657 --> 00:17:02,240 And it would crash and you would get 0 309 00:17:02,240 --> 00:17:03,406 on the test because of that. 310 00:17:03,406 --> 00:17:05,109 So that's not good. 311 00:17:05,109 --> 00:17:06,890 So what do we use instead? 312 00:17:06,890 --> 00:17:09,609 Does anyone know what this thing is called? 313 00:17:09,609 --> 00:17:11,509 What this class is called? 314 00:17:11,509 --> 00:17:13,109 AUDIENCE: [INAUDIBLE] 315 00:17:13,109 --> 00:17:14,400 PROFESSOR: Iterator, very good. 316 00:17:27,873 --> 00:17:30,867 AUDIENCE: Why do they call it FASTA? 317 00:17:30,867 --> 00:17:33,745 Because it goes faster?
318 00:17:33,745 --> 00:17:36,210 PROFESSOR: I think the letters are a bio acronym. 319 00:17:36,210 --> 00:17:37,740 AUDIENCE: Oh, OK. 320 00:17:37,740 --> 00:17:42,224 PROFESSOR: Does anyone, does anyone do bio here? 321 00:17:42,224 --> 00:17:43,140 I've seen that before. 322 00:17:43,140 --> 00:17:44,319 So it's a bio thing. 323 00:17:44,319 --> 00:17:45,360 Let's not worry about it. 324 00:17:45,360 --> 00:17:45,901 AUDIENCE: OK. 325 00:17:49,464 --> 00:17:51,760 Or, you can use that for any type of file. 326 00:17:51,760 --> 00:17:54,290 Like, you don't have to use it just for bio files. 327 00:17:54,290 --> 00:17:56,090 PROFESSOR: Well, presumably it reads, 328 00:17:56,090 --> 00:17:58,680 it takes advantage of the format that they're stored in, 329 00:17:58,680 --> 00:18:02,320 and gives you a list instead of something else. 330 00:18:02,320 --> 00:18:04,270 So how does an iterator work? 331 00:18:04,270 --> 00:18:06,020 Suppose you're building your own iterator. 332 00:18:06,020 --> 00:18:08,076 What do you have to implement? 333 00:18:08,076 --> 00:18:09,510 AUDIENCE: Iterator [INAUDIBLE] 334 00:18:12,380 --> 00:18:15,760 PROFESSOR: OK, let's start with next, that's the fun one. 335 00:18:15,760 --> 00:18:16,924 What does next do? 336 00:18:19,726 --> 00:18:22,530 AUDIENCE: It's like pop. 337 00:18:22,530 --> 00:18:25,390 PROFESSOR: OK, so it's like pop in what way? 338 00:18:25,390 --> 00:18:27,180 AUDIENCE: It gives you the next character. 339 00:18:27,180 --> 00:18:27,763 PROFESSOR: OK. 340 00:18:34,380 --> 00:18:38,465 And what happens when you're at the end of the list? 341 00:18:38,465 --> 00:18:41,456 AUDIENCE: It stops. 342 00:18:41,456 --> 00:18:42,931 PROFESSOR: How do you stop? 343 00:18:42,931 --> 00:18:44,374 AUDIENCE: It raises an exception?
344 00:18:54,980 --> 00:18:57,430 PROFESSOR: So next will either return an element, 345 00:18:57,430 --> 00:18:59,740 that's the next element in the sequence 346 00:18:59,740 --> 00:19:00,870 that you're iterating over. 347 00:19:00,870 --> 00:19:04,410 Or it will raise a StopIteration exception 348 00:19:04,410 --> 00:19:07,720 to stop iteration, cool. 349 00:19:07,720 --> 00:19:11,100 So what's the other method? 350 00:19:11,100 --> 00:19:12,815 Someone said it before, say it again. 351 00:19:12,815 --> 00:19:13,440 AUDIENCE: Iter. 352 00:19:13,440 --> 00:19:14,106 PROFESSOR: Iter. 353 00:19:18,410 --> 00:19:19,810 What does this do in an iterator? 354 00:19:24,790 --> 00:19:26,227 AUDIENCE: It returns itself. 355 00:19:26,227 --> 00:19:27,560 PROFESSOR: All right, very good. 356 00:19:27,560 --> 00:19:31,308 In an iterator this is how you will implement it all the time. 357 00:19:35,300 --> 00:19:39,325 Does anyone know what's the point of iter? 358 00:19:39,325 --> 00:19:41,177 AUDIENCE: So you can return an iterator? 359 00:19:41,177 --> 00:19:44,110 Because that's what it told us to do in the PSET. 360 00:19:44,110 --> 00:19:47,050 PROFESSOR: OK, so iter returns an iterator. 361 00:19:47,050 --> 00:19:51,830 But it doesn't-- you don't have to start from an iterator. 362 00:19:51,830 --> 00:19:53,680 You can start from any object. 363 00:19:53,680 --> 00:19:56,120 And if it has a method iter, then it 364 00:19:56,120 --> 00:19:59,560 should give you an iterator that iterates over that object. 365 00:19:59,560 --> 00:20:05,570 So if you have something like a list-- 1, 2, 3, 4-- then 366 00:20:05,570 --> 00:20:08,670 if you call iter on this, you'll get an iterator for it, 367 00:20:08,670 --> 00:20:10,260 hopefully, right? 368 00:20:10,260 --> 00:20:15,930 And this is what Python uses when you say for i in. 369 00:20:19,540 --> 00:20:22,430 So behind the scenes, whatever object you give it here, 370 00:20:22,430 --> 00:20:23,890 gets an iter call.
371 00:20:23,890 --> 00:20:25,840 And then that produces an iterator. 372 00:20:25,840 --> 00:20:31,060 And then Python calls next until StopIteration happens. 373 00:20:31,060 --> 00:20:33,100 So you can write an iterator that 374 00:20:33,100 --> 00:20:34,550 almost behaves like a list. 375 00:20:37,150 --> 00:20:40,310 You can use it in these [INAUDIBLE] instructions, 376 00:20:40,310 --> 00:20:42,380 and it works as if it was a list, 377 00:20:42,380 --> 00:20:46,160 except it uses a lot less memory, because it computes 378 00:20:46,160 --> 00:20:47,140 the elements. 379 00:20:47,140 --> 00:20:49,267 Hopefully every time next is called, 380 00:20:49,267 --> 00:20:51,600 you're computing the next element that you're returning. 381 00:20:51,600 --> 00:20:53,808 If you're storing everything in a list then returning 382 00:20:53,808 --> 00:20:56,365 the elements that way, that's not a very smart iterator. 383 00:20:59,350 --> 00:21:02,150 OK let's look at the last page. 384 00:21:05,620 --> 00:21:08,100 So the last page has an iterator on top. 385 00:21:08,100 --> 00:21:10,450 And the iterator computes-- given a list, 386 00:21:10,450 --> 00:21:13,550 it computes the reverse of that list. 387 00:21:13,550 --> 00:21:16,101 And you can see that it doesn't reverse the list 388 00:21:16,101 --> 00:21:17,850 and then keep the reversed list in memory. 389 00:21:17,850 --> 00:21:20,430 Instead, every time you call next, 390 00:21:20,430 --> 00:21:22,790 it does some magic with the indexes-- 391 00:21:22,790 --> 00:21:24,410 I think the magic is called math-- 392 00:21:24,410 --> 00:21:28,020 and then it returns something for as long as it can. 393 00:21:28,020 --> 00:21:29,640 So this is how you implement reverse 394 00:21:29,640 --> 00:21:32,140 without producing a new list. 395 00:21:32,140 --> 00:21:37,360 If the original list, say, had n elements, then 396 00:21:37,360 --> 00:21:40,260 if you'd produce a new list, you'd consume order n memory.
397 00:21:40,260 --> 00:21:42,439 This thing consumes order 1 memory, 398 00:21:42,439 --> 00:21:44,480 and the running time is the same, asymptotically. 399 00:21:47,070 --> 00:21:48,340 OK, any question on iterators? 400 00:21:52,050 --> 00:21:57,690 AUDIENCE: So it's going from the very end, 401 00:21:57,690 --> 00:22:02,590 oh, to the very beginning, and then it's stepping back. 402 00:22:02,590 --> 00:22:04,360 PROFESSOR: So reverse, if I give it 403 00:22:04,360 --> 00:22:09,055 the list 1, 2, 3, 4, I want reverse to give it back 4, 3, 404 00:22:09,055 --> 00:22:10,826 2, 1. 405 00:22:10,826 --> 00:22:12,450 Except it's not going to return a list, 406 00:22:12,450 --> 00:22:15,210 it's going to return something that I can use here. 407 00:22:15,210 --> 00:22:18,290 AUDIENCE: Mm hm, ah, OK. 408 00:22:18,290 --> 00:22:19,348 PROFESSOR: OK, yes. 409 00:22:19,348 --> 00:22:21,838 AUDIENCE: Is it ever possible to, sort of, 410 00:22:21,838 --> 00:22:25,324 rewind the iterator to like, sort of, reset it? 411 00:22:25,324 --> 00:22:28,544 PROFESSOR: OK, is it? 412 00:22:28,544 --> 00:22:29,476 AUDIENCE: No. 413 00:22:29,476 --> 00:22:30,410 PROFESSOR: Nope. 414 00:22:30,410 --> 00:22:32,050 So Python iterators are simple. 415 00:22:32,050 --> 00:22:34,155 All you can do is go forward. 416 00:22:34,155 --> 00:22:34,999 AUDIENCE: OK. 417 00:22:34,999 --> 00:22:36,540 PROFESSOR: The reason that is good is 418 00:22:36,540 --> 00:22:38,330 because you can use them for streams. 419 00:22:38,330 --> 00:22:41,380 So if you get data from a file, or if you can get data 420 00:22:41,380 --> 00:22:44,350 from the network, you can wrap it in an iterator. 421 00:22:44,350 --> 00:22:47,140 If you wanted to support resume on data that you 422 00:22:47,140 --> 00:22:50,544 get from the network, you'd have to buffer all the data.
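The iterator protocol described above (an iter method that returns the iterator itself, a next method that either returns an element or raises StopIteration, and a for loop that calls both behind the scenes) can be sketched as below. This is not the code from the handout, which is not reproduced in this transcript; the class name `ReverseIterator` is an assumption. In Python 3 the method is spelled `__next__`, whereas Python 2 used `next`.

```python
class ReverseIterator:
    """Iterate over a list back to front without building a reversed copy."""

    def __init__(self, items):
        self.items = items
        self.index = len(items)  # one past the element returned next

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.index == 0:
            raise StopIteration
        self.index -= 1
        return self.items[self.index]


# Roughly what "for x in ReverseIterator(data)" does behind the scenes.
data = [1, 2, 3, 4]
it = iter(ReverseIterator(data))
result = []
while True:
    try:
        result.append(next(it))
    except StopIteration:
        break
print(result)  # prints [4, 3, 2, 1]
```

The only state kept is one index, which matches the order 1 memory claim. Once the iterator is exhausted it stays exhausted, so going over the data again means constructing a new `ReverseIterator`.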
423 00:22:50,544 --> 00:22:52,002 AUDIENCE: So you would have to call 424 00:22:52,002 --> 00:22:53,670 the iter about that again and-- 425 00:22:53,670 --> 00:22:54,710 PROFESSOR: Yeah. 426 00:22:54,710 --> 00:22:57,700 Yeah, if you want to rewind, get another iterator. 427 00:22:57,700 --> 00:23:00,530 OK, that's a good question, thank you. 428 00:23:00,530 --> 00:23:03,280 So these are iterators. 429 00:23:03,280 --> 00:23:06,450 Now we're going to go over some Python magic, which 430 00:23:06,450 --> 00:23:07,730 is called generators. 431 00:23:07,730 --> 00:23:10,050 So look at the iterator code, and then look 432 00:23:10,050 --> 00:23:11,980 at the equivalent code right below it. 433 00:23:14,890 --> 00:23:20,270 So 12 lines of Python turned into three lines of Python 434 00:23:20,270 --> 00:23:23,310 that do exactly the same thing. 435 00:23:23,310 --> 00:23:27,340 So the reverse method will return an object 436 00:23:27,340 --> 00:23:30,650 that is an iterator, and that you can use just 437 00:23:30,650 --> 00:23:33,620 like the iterator in the reverse class. 438 00:23:40,160 --> 00:23:43,480 Do people understand what that code does? 439 00:23:43,480 --> 00:23:48,180 If you do I'm so out of here, we're done. 440 00:23:48,180 --> 00:23:49,630 AUDIENCE: What does yield do? 441 00:23:49,630 --> 00:23:51,270 PROFESSOR: What does yield do? 442 00:23:51,270 --> 00:23:53,855 All right, that's the hard question, what does yield do? 443 00:23:53,855 --> 00:23:57,180 I will probably spend the rest of the session on the answer 444 00:23:57,180 --> 00:23:58,070 to that question. 445 00:23:58,070 --> 00:24:01,370 You're asking all the hard questions today, man. 446 00:24:01,370 --> 00:24:06,820 So yield, does anyone know conceptually what yield does? 447 00:24:06,820 --> 00:24:08,940 Not in detail, just what's it supposed 448 00:24:08,940 --> 00:24:10,840 to do so that the rest of the code works? 449 00:24:10,840 --> 00:24:11,639 Yes.
450 00:24:11,639 --> 00:24:13,180 AUDIENCE: If you're driving someplace 451 00:24:13,180 --> 00:24:16,150 and there's a yield sign, you pause. 452 00:24:16,150 --> 00:24:18,290 PROFESSOR: OK, Python yield. 453 00:24:18,290 --> 00:24:20,270 So I like the word pause in there. 454 00:24:20,270 --> 00:24:22,420 The word pause is useful. 455 00:24:22,420 --> 00:24:26,140 So say, instead of implementing this, 456 00:24:26,140 --> 00:24:28,330 say we're implementing sub-sequence hashes. 457 00:24:32,494 --> 00:24:35,392 AUDIENCE: It kind of spit something out, but keeps going. 458 00:24:35,392 --> 00:24:37,807 PROFESSOR: Yep. 459 00:24:37,807 --> 00:24:39,586 AUDIENCE: Returns [INAUDIBLE] 460 00:24:39,586 --> 00:24:41,460 PROFESSOR: OK, so suppose you're implementing 461 00:24:41,460 --> 00:24:42,420 sub-sequence hashes. 462 00:24:42,420 --> 00:24:47,678 What's the worst, worst possible way you could implement this? 463 00:24:47,678 --> 00:24:48,885 AUDIENCE: Return a list. 464 00:24:48,885 --> 00:24:50,990 PROFESSOR: OK, so the worst, worst way is to go all the way, 465 00:24:50,990 --> 00:24:52,900 brute force, don't use the rolling hashes, 466 00:24:52,900 --> 00:24:54,170 don't use anything. 467 00:24:54,170 --> 00:24:56,210 The next best way is to make a list, right? 468 00:24:56,210 --> 00:25:01,030 So you're going to start with an empty list. 469 00:25:01,030 --> 00:25:09,670 Then you're going to use the rolling hash in some way. 470 00:25:09,670 --> 00:25:16,540 And in some loop you're going to say list.append e. 471 00:25:16,540 --> 00:25:21,120 And then you're going to return the list. 472 00:25:21,120 --> 00:25:23,440 Does this make sense? 473 00:25:23,440 --> 00:25:26,265 OK, what's the problem with this code? 474 00:25:26,265 --> 00:25:28,570 AUDIENCE: You're going to have a huge list. 475 00:25:28,570 --> 00:25:30,510 PROFESSOR: Going to have a huge list.
476 00:25:30,510 --> 00:25:36,900 So the way we fix it with iterators is we remove this, 477 00:25:36,900 --> 00:25:46,240 we replace this with yield e, and we remove this. 478 00:25:46,240 --> 00:25:48,010 And now it's a generator. 479 00:25:48,010 --> 00:25:51,520 And now this consumes a constant amount of memory, 480 00:25:51,520 --> 00:25:54,400 instead of building a list. 481 00:25:54,400 --> 00:25:58,870 And as long as you only want an iterator out of this method, 482 00:25:58,870 --> 00:26:00,557 you'll get the right thing. 483 00:26:00,557 --> 00:26:02,640 Your code will still work in exactly the same way. 484 00:26:06,730 --> 00:26:09,460 OK, so the big question is what does this guy do, right? 485 00:26:09,460 --> 00:26:11,870 This is where the magic is. 486 00:26:11,870 --> 00:26:15,190 So I already said, as a first hint, 487 00:26:15,190 --> 00:26:18,920 that this guy will return an iterator. 488 00:26:21,730 --> 00:26:27,690 So can someone try to imagine they're Python, and see this? 489 00:26:27,690 --> 00:26:29,540 So suppose you're Python, you see this. 490 00:26:29,540 --> 00:26:32,530 What do you do? 491 00:26:32,530 --> 00:26:35,454 AUDIENCE: You wait for some sort of command of some sort, right? 492 00:26:35,454 --> 00:26:37,120 PROFESSOR: No, let's try something else. 493 00:26:37,120 --> 00:26:38,480 AUDIENCE: OK. 494 00:26:38,480 --> 00:26:42,880 PROFESSOR: So the execution of this pauses. 495 00:26:42,880 --> 00:26:43,620 What happens? 496 00:26:43,620 --> 00:26:47,260 So we're looping somewhere, we got a yield. 497 00:26:47,260 --> 00:26:49,261 We stop, what's the first thing we do? 498 00:26:49,261 --> 00:26:52,620 AUDIENCE: Spit out e. 499 00:26:52,620 --> 00:26:55,974 PROFESSOR: So you're saying you return e from this guy?
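The board transformation (drop the empty list, turn the append into a yield, drop the return) can be sketched on a stand-in computation. The squaring below is only a placeholder for "use the rolling hash in some way"; it is not the problem set's code, and the function names are made up for illustration.

```python
def squares_list(n):
    # List-building version: order n memory for the result.
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_gen(n):
    # Generator version: no list, the append became a yield, no return.
    for i in range(n):
        yield i * i
```

A caller that only loops over the result cannot tell the two apart, for example `for s in squares_gen(5): ...` behaves like `for s in squares_list(5): ...`.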
500 00:26:55,974 --> 00:26:58,780 AUDIENCE: [INAUDIBLE] out e [INAUDIBLE] 501 00:26:58,780 --> 00:27:01,473 PROFESSOR: So I want to return something-- 502 00:27:01,473 --> 00:27:03,950 I want to return something else from this. 503 00:27:03,950 --> 00:27:06,460 So I want to use this as if it was a list, yes? 504 00:27:06,460 --> 00:27:08,496 AUDIENCE: We store e somewhere. 505 00:27:08,496 --> 00:27:09,870 PROFESSOR: OK, store e somewhere. 506 00:27:09,870 --> 00:27:11,682 AUDIENCE: Do you return the pointer of e? 507 00:27:14,330 --> 00:27:17,240 PROFESSOR: Almost, so there's a word for the object 508 00:27:17,240 --> 00:27:19,260 that I'm returning. 509 00:27:19,260 --> 00:27:21,350 So I want to use it as if it was a list. 510 00:27:21,350 --> 00:27:25,180 So I want to pretend that I had returned list in this method, 511 00:27:25,180 --> 00:27:27,100 right? 512 00:27:27,100 --> 00:27:30,424 So what's the closest thing to a list that I can return. 513 00:27:30,424 --> 00:27:31,398 AUDIENCE: An iterator. 514 00:27:31,398 --> 00:27:33,920 PROFESSOR: An iterator, thank you, all right. 515 00:27:33,920 --> 00:27:36,665 So we will grab some information from here. 516 00:27:39,756 --> 00:27:42,190 We'll put it in a nice box. 517 00:27:42,190 --> 00:27:48,590 And that box will behave like an iterator. 518 00:27:48,590 --> 00:27:51,890 OK, so the first thing, someone said put e away, 519 00:27:51,890 --> 00:27:58,900 so that's when we call next we're going to spit that out. 520 00:27:58,900 --> 00:28:00,570 What else do I need to put away? 521 00:28:07,304 --> 00:28:08,760 AUDIENCE: [INAUDIBLE] 522 00:28:08,760 --> 00:28:10,750 PROFESSOR: Yep, so this is a lot of magic. 523 00:28:10,750 --> 00:28:13,830 This tiny box actually has a lot of magic in it. 524 00:28:13,830 --> 00:28:17,320 Because when I call next, I want to get e. 525 00:28:17,320 --> 00:28:22,310 But I want to come back here and keep going, right? 
526 00:28:22,310 --> 00:28:26,030 So I have my code that's using the iterator. 527 00:28:26,030 --> 00:28:28,860 And there's this code here, that's sort of 528 00:28:28,860 --> 00:28:30,330 in a frozen state. 529 00:28:30,330 --> 00:28:32,960 Did you guys see any movies where people are frozen up 530 00:28:32,960 --> 00:28:34,710 and then, in the future, they're unfrozen 531 00:28:34,710 --> 00:28:36,582 and they start moving again? 532 00:28:36,582 --> 00:28:38,350 AUDIENCE: [INAUDIBLE] movies. 533 00:28:38,350 --> 00:28:39,530 PROFESSOR: All right, cool. 534 00:28:39,530 --> 00:28:42,450 So this is like that, this takes up the whole function, 535 00:28:42,450 --> 00:28:45,540 freezes it up and puts it in a box here. 536 00:28:45,540 --> 00:28:50,350 And it returns an iterator that can use the box in the future. 537 00:28:50,350 --> 00:28:52,380 So when you call next, it gives e, 538 00:28:52,380 --> 00:28:54,740 which is the guy that you put in here. 539 00:28:54,740 --> 00:28:56,760 And then it takes the function out of the box, 540 00:28:56,760 --> 00:28:59,460 unfreezes it, and lets it run again 541 00:28:59,460 --> 00:29:02,000 until it hits yield again. 542 00:29:02,000 --> 00:29:04,060 Then what happens the next time it hits yield? 543 00:29:08,680 --> 00:29:10,980 So, you're looping, and you're yielding again. 544 00:29:10,980 --> 00:29:15,135 And say this time you're yielding. 545 00:29:15,135 --> 00:29:16,510 AUDIENCE: Just do the same thing? 546 00:29:16,510 --> 00:29:18,218 AUDIENCE: Do you put it in that iterator? 547 00:29:18,218 --> 00:29:19,600 Or do you make another iterator? 548 00:29:19,600 --> 00:29:21,040 PROFESSOR: Same iterator. 549 00:29:21,040 --> 00:29:24,260 So while this is looping, the code outside 550 00:29:24,260 --> 00:29:26,780 should get the values that it's yielding. 551 00:29:26,780 --> 00:29:29,210 So this has to behave as one iterator.
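The freeze and unfreeze picture can be checked by recording what runs when. One detail worth seeing in the interpreter: in Python the generator body does not start until the first call to `next`, and each `next` runs the body up to the following yield and hands back that value. The sketch below assumes Python 3; `traced` and `run_trace` are illustrative names.

```python
def traced(log):
    log.append("start")
    yield "e"
    log.append("resumed after e")
    yield "e_prime"
    log.append("finished")


def run_trace():
    log = []
    gen = traced(log)              # creates the frozen function; nothing runs yet
    log.append("created")
    log.append(next(gen))          # runs the body up to the first yield
    log.append(next(gen))          # unfreezes, runs up to the second yield
    log.append(next(gen, "done"))  # body finishes; the default replaces StopIteration
    return log
```

The log shows that local state survives between calls, which is the "all your state is saved" part of the analogy.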
552 00:29:29,210 --> 00:29:31,690 So the code is unfrozen, it's allowed 553 00:29:31,690 --> 00:29:33,840 to execute until it says yield again. 554 00:29:33,840 --> 00:29:36,180 And then it says yield with a new element. 555 00:29:36,180 --> 00:29:38,440 I put this guy in the box. 556 00:29:38,440 --> 00:29:43,390 Then I return the old guy as the return value for next. 557 00:29:43,390 --> 00:29:45,147 AUDIENCE: Oh. 558 00:29:45,147 --> 00:29:46,730 PROFESSOR: And then it's frozen again. 559 00:29:46,730 --> 00:29:50,049 So this guy's still in a frozen state. 560 00:29:50,049 --> 00:29:52,090 In the movies, I think you're only unfrozen once. 561 00:29:52,090 --> 00:29:53,170 And then you keep going, right? 562 00:29:53,170 --> 00:29:54,295 And there's a happy ending. 563 00:29:54,295 --> 00:29:56,000 Where here, every time you call yield 564 00:29:56,000 --> 00:29:59,560 you're frozen again, until someone calls next. 565 00:30:02,480 --> 00:30:03,480 Does this make sense? 566 00:30:03,480 --> 00:30:06,760 AUDIENCE: It's kind of like Groundhog Day. 567 00:30:06,760 --> 00:30:09,980 PROFESSOR: Yes, except you're allowed to go forward. 568 00:30:09,980 --> 00:30:12,141 So this keeps going forward. 569 00:30:12,141 --> 00:30:13,140 AUDIENCE: --up, thought. 570 00:30:13,140 --> 00:30:14,645 So it's looping. 571 00:30:14,645 --> 00:30:16,421 It's the same day, really. 572 00:30:16,421 --> 00:30:17,920 It's doing different things, though. 573 00:30:17,920 --> 00:30:18,586 PROFESSOR: Yeah. 574 00:30:18,586 --> 00:30:20,920 But all your state is saved. 575 00:30:20,920 --> 00:30:23,330 So there, some of the state is rolled back. 576 00:30:23,330 --> 00:30:25,072 Here all the state is saved. 577 00:30:25,072 --> 00:30:26,056 AUDIENCE: OK. 578 00:30:26,056 --> 00:30:29,410 PROFESSOR: OK, but if that analogy helps, keep it. 579 00:30:29,410 --> 00:30:32,700 AUDIENCE: When you call next, are you computing e 580 00:30:32,700 --> 00:30:33,895 or e prime to be returned? 
581 00:30:33,895 --> 00:30:35,520 PROFESSOR: So when you're calling next, 582 00:30:35,520 --> 00:30:38,952 you're computing e prime and returning e. 583 00:30:38,952 --> 00:30:41,657 AUDIENCE: So the value you get from next is pre-computed? 584 00:30:41,657 --> 00:30:43,490 PROFESSOR: So the value you get from next is 585 00:30:43,490 --> 00:30:46,386 what you yielded before. 586 00:30:46,386 --> 00:30:50,210 AUDIENCE: Wait, so you would just take some sequence hashes 587 00:30:50,210 --> 00:30:53,300 instance of that, and then just by putting in yield, 588 00:30:53,300 --> 00:30:55,410 now it's magically become an iterator 589 00:30:55,410 --> 00:30:57,090 and you can call that next on it? 590 00:30:57,090 --> 00:30:57,920 PROFESSOR: Yep. 591 00:30:57,920 --> 00:31:00,787 And inside, you don't have to know that it's an iterator. 592 00:31:00,787 --> 00:31:02,620 So you don't have a method next here, right? 593 00:31:02,620 --> 00:31:04,940 I don't implement next or iter here. 594 00:31:04,940 --> 00:31:08,230 I write this as if it's printing stuff to the output. 595 00:31:08,230 --> 00:31:10,640 You can think of yield as a print. 596 00:31:10,640 --> 00:31:12,500 If you wanted an iterator, then pretend 597 00:31:12,500 --> 00:31:14,740 you're printing what you want to iterate over. 598 00:31:14,740 --> 00:31:16,990 And instead of saying print you say yield. 599 00:31:16,990 --> 00:31:18,090 And then you use that. 600 00:31:20,710 --> 00:31:22,502 OK, now what happens when we're done? 601 00:31:22,502 --> 00:31:23,960 What happens when this loop is done 602 00:31:23,960 --> 00:31:25,410 and you return from this method? 603 00:31:25,410 --> 00:31:27,650 We said there's no return value. 604 00:31:27,650 --> 00:31:30,750 AUDIENCE: It raises a stop? 605 00:31:30,750 --> 00:31:32,430 PROFESSOR: So when we return, it's 606 00:31:32,430 --> 00:31:35,330 going to keep in-- have to remember that it's done, right?
607 00:31:39,330 --> 00:31:42,180 And the first time, it has some element here 608 00:31:42,180 --> 00:31:44,760 that it has to return. 609 00:31:44,760 --> 00:31:48,230 So every time you call yield we put a new element in the box, 610 00:31:48,230 --> 00:31:51,580 and return the old one. 611 00:31:51,580 --> 00:31:55,620 So now we would return the old one. 612 00:31:55,620 --> 00:32:01,730 We've returned e prime, take it out, and put done in the box. 613 00:32:01,730 --> 00:32:05,310 So in the future, if next is called again, 614 00:32:05,310 --> 00:32:11,510 raise stop iteration. 615 00:32:11,510 --> 00:32:13,680 No more freezing, unfreezing, because we're done. 616 00:32:13,680 --> 00:32:14,304 We're returned. 617 00:32:18,202 --> 00:32:21,470 AUDIENCE: So if you called next it would just give you nothing? 618 00:32:21,470 --> 00:32:23,762 PROFESSOR: It has to raise this exception. 619 00:32:23,762 --> 00:32:26,220 AUDIENCE: So you mean, like-- oh, so it-- oh, I see. 620 00:32:26,220 --> 00:32:28,257 It would give you red text then? 621 00:32:28,257 --> 00:32:30,007 PROFESSOR: If you called it directly, yes, 622 00:32:30,007 --> 00:32:32,140 it would give you red text. 623 00:32:32,140 --> 00:32:32,930 Yes? 624 00:32:32,930 --> 00:32:37,603 AUDIENCE: So this takes a sequence or a list, 625 00:32:37,603 --> 00:32:40,561 not another iterator, ever? 626 00:32:40,561 --> 00:32:42,045 PROFESSOR: This? 627 00:32:42,045 --> 00:32:43,530 What's this? 628 00:32:43,530 --> 00:32:44,520 This other code here? 629 00:32:44,520 --> 00:32:45,870 AUDIENCE: Yeah. 630 00:32:45,870 --> 00:32:46,995 PROFESSOR: Not necessarily. 631 00:32:46,995 --> 00:32:49,410 AUDIENCE: Or you could give it a procedure. 632 00:32:49,410 --> 00:32:50,910 PROFESSOR: I can give it an iterator 633 00:32:50,910 --> 00:32:53,050 if I'm iterating over it using for-in. 
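A small sketch of the "done" state, assuming Python 3 and made-up names: once the generator body has returned, every further call to `next` raises `StopIteration`, which is the red text you would see when calling it by hand.

```python
def two_items():
    yield "e"
    yield "e_prime"


def drain_by_hand():
    gen = two_items()
    seen = [next(gen), next(gen)]

    # The body has nothing left to yield, so it returns and next() raises.
    try:
        next(gen)
    except StopIteration:
        seen.append("stopped")

    # A finished generator stays finished: no more freezing or unfreezing.
    try:
        next(gen)
    except StopIteration:
        seen.append("still stopped")
    return seen
```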
634 00:32:53,050 --> 00:32:55,540 AUDIENCE: Like, for something in one iterator, yield 635 00:32:55,540 --> 00:32:57,532 that something, and then [INAUDIBLE] 636 00:32:57,532 --> 00:33:00,329 AUDIENCE: Oh, OK. 637 00:33:00,329 --> 00:33:01,870 PROFESSOR: Yeah, that's a good point. 638 00:33:01,870 --> 00:33:04,114 I'll get to that later, when we talk 639 00:33:04,114 --> 00:33:05,780 about how we're going to solve the PSET. 640 00:33:05,780 --> 00:33:07,570 No, we're not solving the PSET for you. 641 00:33:07,570 --> 00:33:09,661 But we'll talk about it a little bit. 642 00:33:09,661 --> 00:33:10,910 But yeah, that's a good point. 643 00:33:10,910 --> 00:33:13,210 So there's no reason why you can't 644 00:33:13,210 --> 00:33:16,250 have an argument here that's either a list or an iterator, 645 00:33:16,250 --> 00:33:18,070 and then you're iterating over it. 646 00:33:18,070 --> 00:33:20,345 And then you have nested generators. 647 00:33:20,345 --> 00:33:23,160 So you have generators returned in other generators, 648 00:33:23,160 --> 00:33:25,080 and you have a whole chain of things 649 00:33:25,080 --> 00:33:26,492 happening when you say next. 650 00:33:26,492 --> 00:33:28,325 AUDIENCE: Wait, so this is a generator then, 651 00:33:28,325 --> 00:33:32,330 because it produces-- well it is an iterator though? 652 00:33:32,330 --> 00:33:34,770 PROFESSOR: So a generator returns an iterator 653 00:33:34,770 --> 00:33:36,680 from this method. 654 00:33:36,680 --> 00:33:38,960 So a generator acts like an iterator, 655 00:33:38,960 --> 00:33:42,050 except when you call next, it unfreezes this code here, 656 00:33:42,050 --> 00:33:43,724 and it lets it run. 657 00:33:43,724 --> 00:33:46,045 AUDIENCE: But I mean, it's basically an iterator then? 658 00:33:46,045 --> 00:33:46,510 PROFESSOR: Yeah. 659 00:33:46,510 --> 00:33:48,695 AUDIENCE: But we're just calling it a generator because-- 660 00:33:48,695 --> 00:33:50,240 PROFESSOR: Because there's a lot more magic.
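A hedged sketch of the nested-generator chain mentioned above. The three stages are invented for illustration and have nothing to do with the problem set. Each stage accepts anything usable in for-in, so a list or another generator both work.

```python
def numbers(limit):
    for i in range(limit):
        yield i


def evens(source):
    # source may be a list, an iterator, or another generator
    for x in source:
        if x % 2 == 0:
            yield x


def doubled(source):
    for x in source:
        yield 2 * x


# Asking the outermost generator for a value wakes up each inner stage in turn.
pipeline = doubled(evens(numbers(7)))
first = next(pipeline)
rest = list(pipeline)
```

No intermediate list is ever built; each value travels through the whole chain before the next one is produced.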
661 00:33:50,240 --> 00:33:50,820 AUDIENCE: OK. 662 00:33:50,820 --> 00:33:53,980 PROFESSOR: So an iterator just says next and iter. 663 00:33:53,980 --> 00:33:57,250 This is all that an iterator is, nothing more. 664 00:33:57,250 --> 00:33:59,520 Any object that has these two methods is an iterator. 665 00:33:59,520 --> 00:34:01,940 AUDIENCE: Oh, OK. 666 00:34:01,940 --> 00:34:07,080 PROFESSOR: Now a generator is a piece of Python magic 667 00:34:07,080 --> 00:34:10,310 that lets you write shorter iterators. 668 00:34:10,310 --> 00:34:12,580 So three lines, as opposed to 13 lines. 669 00:34:12,580 --> 00:34:15,980 And we came up with a way to take 670 00:34:15,980 --> 00:34:18,110 code that would build a list, 671 00:34:18,110 --> 00:34:21,980 and easily turn it into code that uses a generator, 672 00:34:21,980 --> 00:34:24,717 and that uses constant memory instead of building that list. 673 00:34:24,717 --> 00:34:27,113 AUDIENCE: OK, now I know how an iterator functions. 674 00:34:27,113 --> 00:34:27,904 PROFESSOR: Exactly. 675 00:34:30,690 --> 00:34:33,820 OK, do generators make sense now? 676 00:34:33,820 --> 00:34:34,659 Yes. 677 00:34:34,659 --> 00:34:36,450 AUDIENCE: If you wanted to loop through all 678 00:34:36,450 --> 00:34:38,752 of the values in a generator, do you just 679 00:34:38,752 --> 00:34:40,480 wait until the exception's raised? 680 00:34:40,480 --> 00:34:42,880 Or should you, like, keep track of how many things 681 00:34:42,880 --> 00:34:44,902 are going to be in that generator? 682 00:34:44,902 --> 00:34:46,610 PROFESSOR: So, when you have a generator, 683 00:34:46,610 --> 00:34:48,777 you'd have no idea how many things there are. 684 00:34:48,777 --> 00:34:49,610 That's a good point. 685 00:34:49,610 --> 00:34:53,716 So you're wondering if I have an iterator, say any iterator, 686 00:34:53,716 --> 00:34:56,340 not necessarily a generator, how do I know how many things it's 687 00:34:56,340 --> 00:34:57,390 going to return, right?
688 00:34:57,390 --> 00:34:59,200 Do I have len? 689 00:34:59,200 --> 00:35:01,060 I do not have len. 690 00:35:01,060 --> 00:35:05,410 So an iterator does not have len. 691 00:35:05,410 --> 00:35:08,850 So you have to iterate through it. 692 00:35:08,850 --> 00:35:14,760 And most importantly, some iterators can never return. 693 00:35:14,760 --> 00:35:17,070 So you can have an iterator that streams data for you 694 00:35:17,070 --> 00:35:18,500 across the network. 695 00:35:18,500 --> 00:35:21,550 Or you can have an iterator that iterates over the Fibonacci 696 00:35:21,550 --> 00:35:22,572 numbers. 697 00:35:22,572 --> 00:35:24,030 That's an infinite sequence, right? 698 00:35:24,030 --> 00:35:25,030 It's never going to end. 699 00:35:25,030 --> 00:35:28,700 So len would not even be defined then. 700 00:35:28,700 --> 00:35:30,335 Good question, I like it. 701 00:35:30,335 --> 00:35:33,897 AUDIENCE: Is there an is-next method for either iterators 702 00:35:33,897 --> 00:35:35,090 or generators? 703 00:35:35,090 --> 00:35:36,456 PROFESSOR: Nope. 704 00:35:36,456 --> 00:35:39,497 This is what you get, if there is no in. 705 00:35:39,497 --> 00:35:40,930 AUDIENCE: If that is mature then-- 706 00:35:40,930 --> 00:35:42,020 PROFESSOR: Yeah. 707 00:35:42,020 --> 00:35:44,990 So in Java you have this belief that you 708 00:35:44,990 --> 00:35:46,477 shouldn't get exceptions. 709 00:35:46,477 --> 00:35:48,310 You should be able to check for them, right? 710 00:35:48,310 --> 00:35:49,860 So maybe that's why you're asking. 711 00:35:49,860 --> 00:35:52,565 So if people coming from Java know that any time a method 712 00:35:52,565 --> 00:35:53,940 raises an exception, there should 713 00:35:53,940 --> 00:35:57,110 be another method that tells you whether this first method is 714 00:35:57,110 --> 00:35:58,920 going to raise an exception or not. 715 00:35:58,920 --> 00:36:00,820 In Python the exception is just raised.
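The Fibonacci example mentioned above can be sketched as an infinite generator, assuming Python 3. Because it never finishes, the caller decides how much to take; `itertools.islice` from the standard library is one way to do that.

```python
from itertools import islice


def fibonacci():
    a, b = 0, 1
    while True:          # never returns, so StopIteration is never raised
        yield a
        a, b = b, a + b


# Take a finite prefix; the generator itself has no length to ask for.
first_ten = list(islice(fibonacci(), 10))

try:
    len(fibonacci())
    has_len = True
except TypeError:
    has_len = False
```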
716 00:36:00,820 --> 00:36:03,470 So exceptions are not a lot more expensive 717 00:36:03,470 --> 00:36:05,950 than regular instructions, because we're 718 00:36:05,950 --> 00:36:08,060 using an interpreted language, and it's already 719 00:36:08,060 --> 00:36:09,150 reasonably slow. 720 00:36:09,150 --> 00:36:12,280 So it can do exceptions for free, yay. 721 00:36:12,280 --> 00:36:13,340 So this is how it works. 722 00:36:13,340 --> 00:36:14,706 This is how for-in works. 723 00:36:14,706 --> 00:36:16,830 Every time you do a for-in, an exception is raised. 724 00:36:19,517 --> 00:36:21,350 AUDIENCE: We don't have to catch that, then? 725 00:36:21,350 --> 00:36:24,430 PROFESSOR: Nope, the for-in catches it for you. 726 00:36:24,430 --> 00:36:25,710 AUDIENCE: That's tricky stuff. 727 00:36:28,390 --> 00:36:30,820 PROFESSOR: But it's nice because then you 728 00:36:30,820 --> 00:36:33,360 can build any iterator that acts like a list. 729 00:36:33,360 --> 00:36:35,600 And then you can do even more fancy stuff, 730 00:36:35,600 --> 00:36:37,650 and build a generator. 731 00:36:37,650 --> 00:36:40,710 And you're using constant memory instead of order n memory 732 00:36:40,710 --> 00:36:45,140 for producing an order n size list. 733 00:36:45,140 --> 00:36:45,900 Yes? 734 00:36:45,900 --> 00:36:52,280 AUDIENCE: So if we get passed in an iterator 735 00:36:52,280 --> 00:36:57,696 and then just yielded what we passed in, yielded 736 00:36:57,696 --> 00:36:59,986 the iterator, would that just, essentially, 737 00:36:59,986 --> 00:37:04,054 delay everything by one? 738 00:37:04,054 --> 00:37:06,470 PROFESSOR: So you're yielding the iterator as next, right? 739 00:37:06,470 --> 00:37:07,240 AUDIENCE: What? 740 00:37:07,240 --> 00:37:07,520 Yeah. 741 00:37:07,520 --> 00:37:08,890 PROFESSOR: You want to yield the iterator as next.
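The earlier remark that for-in catches the exception can be made concrete by writing the loop out by hand. This is a sketch of what a for-in loop does, assuming Python 3; `for_each` is a hypothetical helper, not something from the lecture.

```python
def for_each(iterable, body):
    """Roughly what `for item in iterable: body(item)` does behind the scenes."""
    it = iter(iterable)            # ask the object for an iterator
    while True:
        try:
            item = next(it)        # may raise StopIteration
        except StopIteration:
            break                  # the loop swallows it and ends quietly
        body(item)


collected = []
for_each([1, 2, 3], collected.append)
```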
742 00:37:08,890 --> 00:37:10,760 Because if you yield the iterator object, 743 00:37:10,760 --> 00:37:12,676 you're going to return that object every time. 744 00:37:14,910 --> 00:37:16,802 So you're thinking of something that-- 745 00:37:16,802 --> 00:37:18,260 AUDIENCE: So you need to increase-- 746 00:37:18,260 --> 00:37:19,550 PROFESSOR: You'll yield up next, right? 747 00:37:19,550 --> 00:37:20,270 AUDIENCE: Right. 748 00:37:20,270 --> 00:37:22,290 PROFESSOR: You can have a method that says this is the method. 749 00:37:22,290 --> 00:37:23,664 And then you take in an iterator. 750 00:37:23,664 --> 00:37:25,780 And then you yield it up next. 751 00:37:25,780 --> 00:37:28,102 But then you'll, basically, get the same thing. 752 00:37:28,102 --> 00:37:29,143 AUDIENCE: The same thing. 753 00:37:29,143 --> 00:37:31,744 But is it delayed by one or no? 754 00:37:31,744 --> 00:37:32,454 PROFESSOR: Nope. 755 00:37:32,454 --> 00:37:35,070 No, so you have to work through this 756 00:37:35,070 --> 00:37:37,616 to convince yourself that it's not delayed. 757 00:37:37,616 --> 00:37:38,990 So if it would be delayed by one, 758 00:37:38,990 --> 00:37:40,330 what's the first thing that you're yielding. 759 00:37:40,330 --> 00:37:41,330 AUDIENCE: I don't know. 760 00:37:41,330 --> 00:37:42,865 PROFESSOR: Yeah, so no delay. 761 00:37:42,865 --> 00:37:43,770 AUDIENCE: OK. 762 00:37:43,770 --> 00:37:46,320 PROFESSOR: OK, cool. 763 00:37:46,320 --> 00:37:48,990 So let's see, what do we have to implement 764 00:37:48,990 --> 00:37:52,280 in DNA seq, sub-sequence hashes. 765 00:37:52,280 --> 00:37:54,830 Do people have an idea of how to implement that now? 766 00:37:58,530 --> 00:37:59,030 Yes? 767 00:37:59,030 --> 00:38:01,380 Does it make sense for everyone? 768 00:38:01,380 --> 00:38:04,230 So you build it as if you were building a list, 769 00:38:04,230 --> 00:38:07,570 and then you use yield to make it fast. 770 00:38:07,570 --> 00:38:09,400 And by fast I mean less memory. 
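The "delayed by one" question from a moment ago can be settled by running a pass-through generator. This sketch assumes Python 3 and that "yielding what we passed in" means yielding each item of the incoming iterator; `passthrough` is an illustrative name.

```python
def passthrough(source):
    # Yield each item of the incoming iterator as soon as it arrives.
    for item in source:
        yield item


gen = passthrough(iter(["a", "b", "c"]))
first = next(gen)        # the very first next() already produces the first item
remaining = list(gen)
```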
771 00:38:09,400 --> 00:38:12,305 OK, how about interval sub-sequence hashes? 772 00:38:15,580 --> 00:38:17,551 The one below. 773 00:38:17,551 --> 00:38:19,380 AUDIENCE: Is that just like rolling hash, 774 00:38:19,380 --> 00:38:22,826 except you, like, have a step in your range? 775 00:38:22,826 --> 00:38:25,920 PROFESSOR: OK, so it's like having a step in your range. 776 00:38:25,920 --> 00:38:27,240 So how can you do that? 777 00:38:27,240 --> 00:38:28,600 What's one way of doing it? 778 00:38:34,012 --> 00:38:35,488 AUDIENCE: [INAUDIBLE] hashes? 779 00:38:38,440 --> 00:38:41,160 PROFESSOR: Did anyone solve the PSET yet? 780 00:38:41,160 --> 00:38:44,840 Yes, OK how did you guys do it? 781 00:38:44,840 --> 00:38:46,180 Wait, no. 782 00:38:46,180 --> 00:38:49,100 That's a bad question because you guys can answer too much. 783 00:38:49,100 --> 00:38:54,200 So interval sub-sequence hashes versus sub-sequence hashes. 784 00:38:54,200 --> 00:38:55,780 Did you copy paste the code? 785 00:38:55,780 --> 00:38:57,520 AUDIENCE: Absolutely. 786 00:38:57,520 --> 00:39:00,205 PROFESSOR: OK, so one way of doing it is copy 787 00:39:00,205 --> 00:39:01,422 and pasting the code. 788 00:39:01,422 --> 00:39:03,630 The problem if you copy paste the code is then you're 789 00:39:03,630 --> 00:39:04,290 not DRY. 790 00:39:04,290 --> 00:39:09,360 There's this engineering thing-- DRY means do not 791 00:39:09,360 --> 00:39:10,630 repeat yourself. 792 00:39:10,630 --> 00:39:12,750 So if you're not DRY, if you copy paste, 793 00:39:12,750 --> 00:39:14,280 then suppose you find the bug later. 794 00:39:14,280 --> 00:39:16,530 Suppose you run the big test and it crashes somewhere. 795 00:39:16,530 --> 00:39:19,600 And you fix a bug in sub-sequence hashes. 796 00:39:19,600 --> 00:39:21,190 AUDIENCE: Oh, we're supposed to, like, 797 00:39:21,190 --> 00:39:24,175 call sub-sequence hashes from interval sub-sequence hashes, 798 00:39:24,175 --> 00:39:24,675 right? 
799 00:39:24,675 --> 00:39:27,656 PROFESSOR: That's another way of doing it that is DRY. 800 00:39:27,656 --> 00:39:29,530 So this way you're not copy pasting the code. 801 00:39:29,530 --> 00:39:32,272 AUDIENCE: We're inlining the code. 802 00:39:32,272 --> 00:39:34,560 PROFESSOR: You're inlining it manually, right? 803 00:39:34,560 --> 00:39:35,430 All right. 804 00:39:35,430 --> 00:39:38,672 So the problem, if you do this on a large scale, 805 00:39:38,672 --> 00:39:40,130 like when you go work somewhere, is 806 00:39:40,130 --> 00:39:43,010 that you end up with 20 copies of the same code. 807 00:39:43,010 --> 00:39:46,970 And then five of them have bug fixes and the other 15 808 00:39:46,970 --> 00:39:50,120 don't, because people forgot where they are. 809 00:39:50,120 --> 00:39:52,760 So ideally, try to keep your code DRY. 810 00:40:00,086 --> 00:40:03,466 AUDIENCE: So, basically, a list of tuples, right? 811 00:40:03,466 --> 00:40:04,924 PROFESSOR: OK, so a list of tuples. 812 00:40:08,270 --> 00:40:10,488 What does a tuple have? 813 00:40:10,488 --> 00:40:16,940 AUDIENCE: The index at which the sub-sequence operates? 814 00:40:16,940 --> 00:40:19,040 PROFESSOR: So two indexes, right? 815 00:40:19,040 --> 00:40:22,780 The index in the first sub-sequence, say-- 816 00:40:22,780 --> 00:40:24,590 AUDIENCE: [INAUDIBLE] 817 00:40:24,590 --> 00:40:31,110 PROFESSOR: OK, say i1 and then the index in a second sequence, 818 00:40:31,110 --> 00:40:32,926 for the same sub-sequence, right? 819 00:40:32,926 --> 00:40:40,686 And then i1, i2 prime, i1, i2 second, so on and so forth. 820 00:40:40,686 --> 00:40:42,060 So you have the same sub-sequence 821 00:40:42,060 --> 00:40:45,650 in the first sequence matches more things in the second one. 822 00:40:45,650 --> 00:40:48,594 This is how you're supposed to return them. 823 00:40:48,594 --> 00:40:50,550 AUDIENCE: Does the order matter? 824 00:40:50,550 --> 00:40:51,496 PROFESSOR: I hope not.
825 00:40:56,630 --> 00:40:59,700 OK, any questions on this? 826 00:40:59,700 --> 00:41:01,080 We went through generators fast. 827 00:41:01,080 --> 00:41:02,580 You guys are smart. 828 00:41:02,580 --> 00:41:03,080 Yes? 829 00:41:03,080 --> 00:41:05,480 AUDIENCE: Can you explain how the imaging works? 830 00:41:05,480 --> 00:41:08,840 Like, how they create the [INAUDIBLE] on tuples. 831 00:41:08,840 --> 00:41:09,800 PROFESSOR: No. 832 00:41:09,800 --> 00:41:11,720 [LAUGHTER] 833 00:41:11,720 --> 00:41:15,280 PROFESSOR: Sorry, I do not know. 834 00:41:15,280 --> 00:41:18,360 AUDIENCE: Wait, which part? 835 00:41:18,360 --> 00:41:20,415 AUDIENCE: So we yield the tuples. 836 00:41:20,415 --> 00:41:24,027 But I don't really get how they come up with the image from it. 837 00:41:24,027 --> 00:41:25,110 AUDIENCE: From the tuples? 838 00:41:25,110 --> 00:41:29,011 Oh, I mean, I guess they're probably values. 839 00:41:29,011 --> 00:41:30,386 AUDIENCE: Yeah, because I thought 840 00:41:30,386 --> 00:41:36,720 if you compared two strings of DNA that had the exact same, 841 00:41:36,720 --> 00:41:39,360 I thought it would be like a diagonal line down, 842 00:41:39,360 --> 00:41:42,980 not just a small black box. 843 00:41:42,980 --> 00:41:44,140 PROFESSOR: OK. 844 00:41:44,140 --> 00:41:46,516 AUDIENCE: So I don't think I'm understanding 845 00:41:46,516 --> 00:41:49,980 how they, like, image it. 846 00:41:49,980 --> 00:41:56,590 PROFESSOR: So you're supposed to get-- 847 00:41:56,590 --> 00:42:01,120 your image has some things here, and a match 848 00:42:01,120 --> 00:42:03,480 is going to give you a big diagonal line that's 849 00:42:03,480 --> 00:42:06,020 stronger than everything else, right? 850 00:42:06,020 --> 00:42:08,010 AUDIENCE: It's really fanned out. 851 00:42:08,010 --> 00:42:09,676 PROFESSOR: Well I don't have thin chalk. 
852 00:42:09,676 --> 00:42:12,830 AUDIENCE: No, no, there's like one really dark black box, 853 00:42:12,830 --> 00:42:15,310 that's like really black. 854 00:42:15,310 --> 00:42:18,510 So I thought that meant that all the tuples are there, 855 00:42:18,510 --> 00:42:20,218 and everything else is just kind of gray. 856 00:42:24,430 --> 00:42:26,090 PROFESSOR: Good question. 857 00:42:26,090 --> 00:42:27,580 I will have to think about that-- 858 00:42:27,580 --> 00:42:28,955 AUDIENCE: --supposed to be there. 859 00:42:28,955 --> 00:42:30,664 Is it like a notation thing, or-- 860 00:42:30,664 --> 00:42:33,080 PROFESSOR: I think that black box is supposed to be there. 861 00:42:33,080 --> 00:42:35,180 Did anyone try comparing two things that 862 00:42:35,180 --> 00:42:37,680 shouldn't match, like the dog and the monkey? 863 00:42:37,680 --> 00:42:39,488 AUDIENCE: Yeah. 864 00:42:39,488 --> 00:42:41,594 And the entire thing was like dark. 865 00:42:41,594 --> 00:42:42,260 PROFESSOR: Yeah. 866 00:42:42,260 --> 00:42:46,190 AUDIENCE: --against, like, two same DNAs everything 867 00:42:46,190 --> 00:42:47,019 was very light. 868 00:42:47,019 --> 00:42:50,674 And there was like a very, very light gray line. 869 00:42:50,674 --> 00:42:53,890 But I thought that would be like black. 870 00:42:53,890 --> 00:42:56,630 PROFESSOR: So I think how black it is means 871 00:42:56,630 --> 00:43:00,134 relative to all the sub-sequences, how long it is-- 872 00:43:00,134 --> 00:43:02,050 how long the sub-sequence you're recording is. 873 00:43:05,290 --> 00:43:06,290 Either that or how many. 874 00:43:06,290 --> 00:43:07,873 There is a function somewhere in there 875 00:43:07,873 --> 00:43:11,180 that computes the intensity of a pixel, that's 876 00:43:11,180 --> 00:43:13,085 square root of order 4 of something. 877 00:43:17,420 --> 00:43:21,261 OK, and I can look at that now and tell you. 878 00:43:21,261 --> 00:43:22,704 AUDIENCE: It's OK. 
879 00:43:22,704 --> 00:43:25,110 It's not super important. 880 00:43:25,110 --> 00:43:28,390 PROFESSOR: Or we can talk about amortized analysis for a bit. 881 00:43:28,390 --> 00:43:29,040 Yay! 882 00:43:29,040 --> 00:43:32,110 Let's talk about amortized analysis. 883 00:43:32,110 --> 00:43:35,660 So this is what you're supposed to get, that's what matters. 884 00:43:35,660 --> 00:43:36,535 AUDIENCE: [INAUDIBLE] 885 00:43:58,619 --> 00:44:00,160 PROFESSOR: OK, so amortized analysis, 886 00:44:00,160 --> 00:44:02,385 what's the example that we talked about in class? 887 00:44:05,380 --> 00:44:08,160 AUDIENCE: It's like list expansion? 888 00:44:08,160 --> 00:44:13,500 PROFESSOR: OK, so you have-- you have a list. 889 00:44:13,500 --> 00:44:16,073 And we know that the list is stored as an array, right? 890 00:44:19,530 --> 00:44:24,740 So this means that you can do indexing in constant time. 891 00:44:24,740 --> 00:44:26,900 So if you want to get the first element, order 1. 892 00:44:26,900 --> 00:44:30,680 If you want to get the millionth element, order 1. 893 00:44:30,680 --> 00:44:32,790 This is not true if you had a linked list instead. 894 00:44:32,790 --> 00:44:37,380 The millionth element would be order a million. 895 00:44:37,380 --> 00:44:39,706 So this is an array. 896 00:44:39,706 --> 00:44:41,300 What do we implement? 897 00:44:41,300 --> 00:44:43,530 What's the operation that we implement on this list? 898 00:44:46,278 --> 00:44:47,710 AUDIENCE: Insert-- 899 00:44:47,710 --> 00:44:49,635 PROFESSOR: Insert, append, push. 900 00:44:49,635 --> 00:44:54,480 Let's go for append, because that's what Python calls it. 901 00:44:54,480 --> 00:44:59,060 OK, so append puts an element at the end of the list, right? 902 00:44:59,060 --> 00:45:00,170 So how does append work? 903 00:45:03,780 --> 00:45:05,356 AUDIENCE: The array is not full. 904 00:45:05,356 --> 00:45:07,636 PROFESSOR: OK. 905 00:45:07,636 --> 00:45:11,040 So say I have some count variable here.
906 00:45:11,040 --> 00:45:23,060 So if the length of the array is bigger than count 907 00:45:23,060 --> 00:45:24,943 then what do I do? 908 00:45:24,943 --> 00:45:26,942 AUDIENCE: Then we can directly insert. 909 00:45:30,760 --> 00:45:32,426 And because we're looking up in an array 910 00:45:32,426 --> 00:45:34,730 and we're doing constant time. 911 00:45:34,730 --> 00:45:37,472 PROFESSOR: OK. 912 00:45:37,472 --> 00:45:39,430 AUDIENCE: And so an order amount of information 913 00:45:39,430 --> 00:45:42,230 in x [INAUDIBLE]? 914 00:45:42,230 --> 00:45:42,980 PROFESSOR: Sorry? 915 00:45:42,980 --> 00:45:44,870 AUDIENCE: Order amount of information of x [INAUDIBLE]? 916 00:45:44,870 --> 00:45:45,598 Or do we just-- 917 00:45:45,598 --> 00:45:47,389 PROFESSOR: Let's say this is our reference, 918 00:45:47,389 --> 00:45:48,347 so it's constant time. 919 00:45:51,700 --> 00:45:54,420 AUDIENCE: Otherwise we don't have enough room in our array. 920 00:45:54,420 --> 00:45:56,112 So we need to make it bigger. 921 00:45:56,112 --> 00:45:56,695 PROFESSOR: OK. 922 00:45:59,260 --> 00:46:06,030 So we have array 2 becomes new array 923 00:46:06,030 --> 00:46:14,810 of size 2 times count, right? 924 00:46:14,810 --> 00:46:16,587 Copy everything from-- 925 00:46:16,587 --> 00:46:17,920 AUDIENCE: --length of the array. 926 00:46:17,920 --> 00:46:18,962 I guess they're the same. 927 00:46:18,962 --> 00:46:20,420 PROFESSOR: I hope they're the same. 928 00:46:20,420 --> 00:46:21,270 AUDIENCE: It is. 929 00:46:21,270 --> 00:46:23,116 PROFESSOR: Yeah, I'd say that. 930 00:46:23,116 --> 00:46:35,420 So copy from array to-- let's do this-- to array 2. 931 00:46:35,420 --> 00:46:40,130 And then array 2 becomes array. 932 00:46:44,510 --> 00:46:46,710 And then this code here goes here, right? 933 00:46:46,710 --> 00:46:49,540 So there's a better way to write this if statement 934 00:46:49,540 --> 00:46:52,210 so the code isn't duplicated. 
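The append procedure described on the board can be sketched in Python roughly as follows. This is a minimal sketch, not the course's reference code: the class name DynamicArray, the initial capacity of 1, and the __getitem__ helper are assumptions made for illustration; the names array, array2, and count follow the discussion. The if statement is written so that the insertion code appears only once.

```python
class DynamicArray:
    """Illustrative list-on-top-of-array; names are assumptions, not course code."""

    def __init__(self):
        self.array = [None]  # fixed-size backing array, assumed initial capacity 1
        self.count = 0       # number of elements actually stored

    def append(self, x):
        if len(self.array) <= self.count:
            # No room left: allocate an array twice as big and copy everything over.
            array2 = [None] * (2 * self.count)
            for i in range(self.count):
                array2[i] = self.array[i]
            self.array = array2
        # There is room now (either from the start or after resizing): insert.
        self.array[self.count] = x
        self.count += 1

    def __getitem__(self, i):
        # Constant-time indexing, as with any array.
        if not 0 <= i < self.count:
            raise IndexError(i)
        return self.array[i]


d = DynamicArray()
for value in range(10):
    d.append(value)
```

Testing for the full case first and falling through to a single insertion is one way to avoid the duplicated code mentioned above.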
935 00:46:52,210 --> 00:46:57,180 OK, so if the length is bigger than how many elements I have, 936 00:46:57,180 --> 00:46:59,470 if I still have room in the array, what's the cost? 937 00:46:59,470 --> 00:47:02,360 What's the running time? 938 00:47:02,360 --> 00:47:04,201 Constant. 939 00:47:04,201 --> 00:47:05,589 Oh, let's put it on the left. 940 00:47:09,990 --> 00:47:13,168 OK, if I have to resize the array, what's the cost? 941 00:47:13,168 --> 00:47:14,043 AUDIENCE: [INAUDIBLE] 942 00:47:18,760 --> 00:47:21,630 PROFESSOR: So, if I did n operations, order n, right? 943 00:47:21,630 --> 00:47:23,980 N is the size of the array. 944 00:47:23,980 --> 00:47:25,970 If the only operation I have is append, 945 00:47:25,970 --> 00:47:28,030 then I can say n operations will cause 946 00:47:28,030 --> 00:47:31,120 the array to grow to size n. 947 00:47:31,120 --> 00:47:36,860 So n where n is the number of operations. 948 00:47:36,860 --> 00:47:39,636 AUDIENCE: You mean, like, re-adding to the-- 949 00:47:42,670 --> 00:47:45,340 PROFESSOR: So an operation is a data structure operation, 950 00:47:45,340 --> 00:47:49,930 like a query or an update. 951 00:47:49,930 --> 00:47:52,260 This is my update and this is my query. 952 00:47:57,210 --> 00:48:02,220 AUDIENCE: Wait, but like, it's order n though, because-- 953 00:48:02,220 --> 00:48:03,010 PROFESSOR: Yeah. 954 00:48:03,010 --> 00:48:04,301 AUDIENCE: I know, it's order n. 955 00:48:04,301 --> 00:48:06,510 But because we have like an array, 956 00:48:06,510 --> 00:48:08,010 and then you have to make a new one, 957 00:48:08,010 --> 00:48:10,699 and you have to move all those old items over, right? 958 00:48:10,699 --> 00:48:11,324 PROFESSOR: Yep. 959 00:48:11,324 --> 00:48:12,680 AUDIENCE: OK.
960 00:48:12,680 --> 00:48:17,745 But, I mean, sometimes like, if your actual array, 961 00:48:17,745 --> 00:48:21,192 if you expand it before-- like, let's say you notice 962 00:48:21,192 --> 00:48:23,566 you're getting full and you decide to like make it bigger 963 00:48:23,566 --> 00:48:27,182 at that point, is it still order n, 964 00:48:27,182 --> 00:48:28,994 as in the number of elements that are-- 965 00:48:28,994 --> 00:48:30,660 PROFESSOR: It depends on how you decide. 966 00:48:30,660 --> 00:48:33,760 There's a problem on the PSET that asks you about that. 967 00:48:33,760 --> 00:48:36,210 So, depending on when you make the decision 968 00:48:36,210 --> 00:48:39,064 and how you make the decision, the answer is either yes, 969 00:48:39,064 --> 00:48:40,480 you're still constant time, or no. 970 00:48:40,480 --> 00:48:43,360 So if you understand the amortized analysis then 971 00:48:43,360 --> 00:48:45,690 you can argue whether it still holds or not. 972 00:48:45,690 --> 00:48:47,180 If this breaks down at any point, 973 00:48:47,180 --> 00:48:48,790 it's not going to be constant time. 974 00:48:48,790 --> 00:48:49,290 Yes? 975 00:48:49,290 --> 00:48:51,870 AUDIENCE: So the only cost is really copying everything 976 00:48:51,870 --> 00:48:52,970 from the old array to the new array? 977 00:48:52,970 --> 00:48:53,410 PROFESSOR: Yes. 978 00:48:53,410 --> 00:48:55,330 AUDIENCE: Actually allocating that space is-- 979 00:48:55,330 --> 00:48:56,875 PROFESSOR: We assume that allocating the space 980 00:48:56,875 --> 00:48:57,630 is constant time. 981 00:48:57,630 --> 00:49:00,740 Good question, because you can't take that for granted, right? 982 00:49:00,740 --> 00:49:04,440 So we assume that this is order 1, copying is order n. 983 00:49:04,440 --> 00:49:08,970 And then the insertion is order 1, just like before. 984 00:49:08,970 --> 00:49:11,310 So allocating may not be constant.
985 00:49:11,310 --> 00:49:14,050 In real life, allocating is actually 986 00:49:14,050 --> 00:49:17,340 logarithmic either of the size that you're asking for 987 00:49:17,340 --> 00:49:20,670 or logarithmic of how many buffers you've allocated. 988 00:49:20,670 --> 00:49:22,900 And you can make a constant time allocator. 989 00:49:22,900 --> 00:49:26,990 But that's slower than a logarithmic allocator, 990 00:49:26,990 --> 00:49:29,800 because the constant factor behind it is so big. 991 00:49:29,800 --> 00:49:31,820 But even if this allocation would 992 00:49:31,820 --> 00:49:35,040 be order n, which would be terrible, 993 00:49:35,040 --> 00:49:37,440 it would still get absorbed here. 994 00:49:37,440 --> 00:49:40,550 So the overall model works no matter what the allocation is. 995 00:49:40,550 --> 00:49:43,370 It's reasonable, from a theoretical standpoint, 996 00:49:43,370 --> 00:49:47,210 to say that allocation is order 1, 997 00:49:47,210 --> 00:49:50,640 from a theoretical standpoint. 998 00:49:50,640 --> 00:49:53,950 So this is the real cost, copying the elements. 999 00:49:53,950 --> 00:49:57,910 And this makes an append order n worst case. 1000 00:49:57,910 --> 00:50:01,180 So if you look at this data structure then 1001 00:50:01,180 --> 00:50:04,260 suppose we want to compute the cost of an append. 1002 00:50:04,260 --> 00:50:12,040 So say we have code like this, for 1 to n. 1003 00:50:12,040 --> 00:50:17,630 First we have L be an empty list. 1004 00:50:17,630 --> 00:50:24,840 Then we want to compute the cost of this. 1005 00:50:27,800 --> 00:50:32,570 So if we do it without amortized analysis, 1006 00:50:32,570 --> 00:50:35,250 line by line analysis, just like we 1007 00:50:35,250 --> 00:50:38,380 learned in the first lecture, what's the cost of this, 1008 00:50:38,380 --> 00:50:40,870 making a new list? Constant. 1009 00:50:43,960 --> 00:50:46,060 What's the cost of one append? 1010 00:50:48,916 --> 00:50:49,868 AUDIENCE: Constant.
1011 00:50:49,868 --> 00:50:51,390 PROFESSOR: One append. 1012 00:50:51,390 --> 00:50:55,730 So an append can either branch here or branch here. 1013 00:50:58,260 --> 00:51:01,975 So what's the cost of one append? 1014 00:51:01,975 --> 00:51:04,016 AUDIENCE: It would be showing with an empty list? 1015 00:51:04,016 --> 00:51:04,932 AUDIENCE: Depends. 1016 00:51:04,932 --> 00:51:06,840 PROFESSOR: It depends. 1017 00:51:06,840 --> 00:51:07,547 So worst case. 1018 00:51:07,547 --> 00:51:08,880 We have to look at a worst case. 1019 00:51:08,880 --> 00:51:10,260 So this is line by line analysis. 1020 00:51:10,260 --> 00:51:13,836 We're going to get one number for this. 1021 00:51:13,836 --> 00:51:14,336 AUDIENCE: N. 1022 00:51:14,336 --> 00:51:15,200 AUDIENCE: An n. 1023 00:51:15,200 --> 00:51:15,825 PROFESSOR: Yep. 1024 00:51:19,520 --> 00:51:21,964 So in the worst case, the list will be full. 1025 00:51:21,964 --> 00:51:23,380 And you'll have to make a new one. 1026 00:51:23,380 --> 00:51:25,440 And then you're going on this branch of the if, 1027 00:51:25,440 --> 00:51:27,900 so the cost is order n. 1028 00:51:27,900 --> 00:51:34,120 So order n, worst case. 1029 00:51:34,120 --> 00:51:37,610 So the cost of one call is order n, worst case. 1030 00:51:37,610 --> 00:51:39,000 How many calls do we make? 1031 00:51:43,970 --> 00:51:47,010 So what is the total cost of this thing? 1032 00:51:47,010 --> 00:51:48,594 AUDIENCE: It's not actually n squared. 1033 00:51:48,594 --> 00:51:50,426 PROFESSOR: Yes, it's not actually n squared. 1034 00:51:50,426 --> 00:51:52,020 But if we do line by line analysis, 1035 00:51:52,020 --> 00:51:54,350 before we learn amortized analysis, 1036 00:51:54,350 --> 00:51:56,830 all we can say is it's order n squared. 1037 00:51:56,830 --> 00:51:59,340 And this is correct, it's not bigger than n squared, right? 1038 00:51:59,340 --> 00:52:00,990 So O is correct. 1039 00:52:00,990 --> 00:52:03,020 But it's not the tight bound.
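The loop being analyzed can be written out as below, with the line-by-line bounds from the discussion attached as comments. This is a sketch: the function name build_list and the choice to append the loop index are assumptions made for illustration, and the comments describe the array-doubling model from this recitation rather than measured behavior of Python's built-in list.

```python
def build_list(n):
    L = []                      # making a new list: O(1)
    for i in range(1, n + 1):   # n iterations
        L.append(i)             # one append: O(n) worst case (the resize branch)
    return L                    # line-by-line total: n * O(n) = O(n^2), correct but not tight


print(build_list(5))
```

Multiplying the worst case of one call by the number of calls is what produces the n squared bound; it is a valid upper bound that overcounts, because most appends never take the resize branch.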
1040 00:52:03,020 --> 00:52:05,882 So if we had a multiple choice, and you selected this, 1041 00:52:05,882 --> 00:52:08,090 you wouldn't get the score because we usually ask you 1042 00:52:08,090 --> 00:52:10,930 for the tightest bound that you can get. 1043 00:52:10,930 --> 00:52:13,050 OK, so line by line analysis. 1044 00:52:13,050 --> 00:52:14,880 We worked through that a lot in doc dist. 1045 00:52:14,880 --> 00:52:16,305 Doesn't work all the time. 1046 00:52:16,305 --> 00:52:17,680 When it doesn't work, we tell you 1047 00:52:17,680 --> 00:52:19,780 to use amortized analysis instead. 1048 00:52:19,780 --> 00:52:21,850 So what's the goal of amortized analysis? 1049 00:52:21,850 --> 00:52:23,020 What do we want? 1050 00:52:23,020 --> 00:52:26,510 You guys are yelling at me that this is not n squared, why? 1051 00:52:26,510 --> 00:52:27,385 I mean not why, what? 1052 00:52:27,385 --> 00:52:28,310 What is it instead? 1053 00:52:28,310 --> 00:52:30,522 What do we want from amortized analysis? 1054 00:52:30,522 --> 00:52:32,932 AUDIENCE: [INAUDIBLE] 1055 00:52:32,932 --> 00:52:36,306 AUDIENCE: It's a [INAUDIBLE] that's an n. 1056 00:52:36,306 --> 00:52:38,810 PROFESSOR: So we want amortized analysis 1057 00:52:38,810 --> 00:52:43,040 to say that this is order 1 amortized, and this is-- 1058 00:52:43,040 --> 00:52:44,816 [ALARM SOUNDING] 1059 00:52:44,816 --> 00:52:46,070 PROFESSOR: Am I out of time? 1060 00:52:46,070 --> 00:52:46,570 Yeah. 1061 00:52:49,200 --> 00:52:53,937 OK, so there's a difference between the worst 1062 00:52:53,937 --> 00:52:55,020 case and amortized, right? 1063 00:52:55,020 --> 00:53:03,210 We can argue that this is order 1 amortized. 1064 00:53:03,210 --> 00:53:05,100 And if this is order 1 amortized, 1065 00:53:05,100 --> 00:53:09,020 then this is order n amortized. 1066 00:53:09,020 --> 00:53:11,790 So does the difference between worst case and amortized 1067 00:53:11,790 --> 00:53:12,520 make sense now?
1068 00:53:15,650 --> 00:53:18,162 So this is what I want, the rest is fancy math. 1069 00:53:18,162 --> 00:53:19,870 If you forget the fancy math after you're 1070 00:53:19,870 --> 00:53:21,640 done with this class, that's OK. 1071 00:53:21,640 --> 00:53:24,030 If you remember that this is order 1 amortized, 1072 00:53:24,030 --> 00:53:26,362 and that's order n amortized, that's good. 1073 00:53:26,362 --> 00:53:28,070 That's all you need to know to write code 1074 00:53:28,070 --> 00:53:30,470 if you don't design algorithms. 1075 00:53:30,470 --> 00:53:34,307 So this is an important piece of knowledge on its own. 1076 00:53:34,307 --> 00:53:36,640 OK, so questions about the difference between worst case 1077 00:53:36,640 --> 00:53:37,300 and amortized? 1078 00:53:40,280 --> 00:53:41,660 OK, what does amortized mean? 1079 00:53:44,784 --> 00:53:46,310 AUDIENCE: Average. 1080 00:53:46,310 --> 00:53:48,750 PROFESSOR: Yep, averaged out over multiple operations. 1081 00:53:48,750 --> 00:53:51,100 So instead of doing line by line analysis, 1082 00:53:51,100 --> 00:53:54,470 we have to look at what happens over multiple operations, 1083 00:53:54,470 --> 00:53:56,760 right? 1084 00:53:56,760 --> 00:54:00,580 So there are two methods that I think are useful in CLRS. 1085 00:54:00,580 --> 00:54:02,330 There are three in total, but the last one 1086 00:54:02,330 --> 00:54:04,002 is horribly complicated. 1087 00:54:04,002 --> 00:54:05,960 So there's something called aggregate analysis. 1088 00:54:11,130 --> 00:54:14,450 And there's something called the cost based accounting. 1089 00:54:19,160 --> 00:54:22,200 So last time when we looked at the costs for append, 1090 00:54:22,200 --> 00:54:30,150 we argued that, hey, it's order 1 for a lot of times. 1091 00:54:30,150 --> 00:54:34,900 And then it's only order n for an operation 1092 00:54:34,900 --> 00:54:37,090 that's a power of 2. 
1093 00:54:37,090 --> 00:54:43,690 So if we're looking at the K-th append, 1094 00:54:43,690 --> 00:54:50,290 then this is order K for K equals 2 to the i. 1095 00:54:50,290 --> 00:54:52,640 And it's order 1 otherwise. 1096 00:54:58,880 --> 00:55:01,460 Right? 1097 00:55:01,460 --> 00:55:04,110 So if we sum up all these costs, we 1098 00:55:04,110 --> 00:55:20,400 get n times O of 1, plus the sum for i from 1 to log n of O of 2 to the i. 1099 00:55:20,400 --> 00:55:22,450 And this is clearly order n. 1100 00:55:25,430 --> 00:55:29,350 And if you do the math here, this is also order n. 1101 00:55:33,000 --> 00:55:34,480 So this is aggregate analysis. 1102 00:55:34,480 --> 00:55:36,600 This is what we taught you in lecture. 1103 00:55:40,800 --> 00:55:42,820 Does this make sense? 1104 00:55:42,820 --> 00:55:48,680 So the key here is that whenever we are increasing the array, 1105 00:55:48,680 --> 00:55:52,960 we're increasing it to 2 times its size. 1106 00:55:52,960 --> 00:55:55,450 And we start with a size of 1, count is 1. 1107 00:55:55,450 --> 00:55:58,040 We start with an array with 1 element. 1108 00:55:58,040 --> 00:55:59,790 So the size of the array will first 1109 00:55:59,790 --> 00:56:06,716 be 1, then 2, then 4, then 8, then 16, 32, 64, 128, 1110 00:56:06,716 --> 00:56:07,340 and so on and so forth. 1111 00:56:07,340 --> 00:56:09,550 It increases exponentially. 1112 00:56:09,550 --> 00:56:11,880 So on the first append I'll have to do a resize. 1113 00:56:11,880 --> 00:56:13,230 On the second one, resize. 1114 00:56:13,230 --> 00:56:14,380 Fourth one, resize. 1115 00:56:14,380 --> 00:56:17,530 Eighth, resize, so on and so forth. 1116 00:56:17,530 --> 00:56:20,480 So if I'm adding up the cost for n operations, 1117 00:56:20,480 --> 00:56:23,830 each operation is order 1 because I'm 1118 00:56:23,830 --> 00:56:25,060 inserting everywhere. 1119 00:56:25,060 --> 00:56:30,020 And then all these operations are all order n. 1120 00:56:30,020 --> 00:56:31,160 But there's few of them.
1121 00:56:31,160 --> 00:56:32,560 They're few and far between. 1122 00:56:32,560 --> 00:56:35,110 So if you write the sum this way, and you do the math, 1123 00:56:35,110 --> 00:56:37,120 you get that it's order n. 1124 00:56:37,120 --> 00:56:40,110 So aggregate analysis says, look at n operations 1125 00:56:40,110 --> 00:56:42,220 and add the costs up together. 1126 00:56:42,220 --> 00:56:44,840 And last time we had that good example of walking over a tree, 1127 00:56:44,840 --> 00:56:48,310 an in-order traversal where we drew arrows across edges. 1128 00:56:48,310 --> 00:56:50,450 So that's aggregate analysis. 1129 00:56:50,450 --> 00:56:52,710 And then you should look at the cost method in CLRS 1130 00:56:52,710 --> 00:56:56,030 because that's also useful sometimes. 1131 00:56:56,030 --> 00:56:57,500 Does this help? 1132 00:56:57,500 --> 00:56:59,660 Any questions? 1133 00:56:59,660 --> 00:57:00,910 No, everyone wants to go home. 1134 00:57:00,910 --> 00:57:01,680 AUDIENCE: Wait-- 1135 00:57:01,680 --> 00:57:02,430 PROFESSOR: Almost. 1136 00:57:02,430 --> 00:57:06,370 AUDIENCE: For log n, so you're starting from log n going to-- 1137 00:57:06,370 --> 00:57:09,174 PROFESSOR: So I'm starting from 1 going to log n. 1138 00:57:09,174 --> 00:57:13,610 AUDIENCE: Oh, oh, so [INAUDIBLE] after you're buffering. 1139 00:57:13,610 --> 00:57:15,750 PROFESSOR: So this is fancy math for saying only 1140 00:57:15,750 --> 00:57:18,504 add up powers of two. 1141 00:57:18,504 --> 00:57:20,670 So that's what I'm trying to say, add these guys up. 1142 00:57:20,670 --> 00:57:22,000 AUDIENCE: Well that's your step [INAUDIBLE]. 1143 00:57:22,000 --> 00:57:22,666 PROFESSOR: Yeah. 1144 00:57:22,666 --> 00:57:23,800 AUDIENCE: Oh, OK. 1145 00:57:23,800 --> 00:57:25,000 Oh, I like that. 1146 00:57:25,000 --> 00:57:25,600 OK. 1147 00:57:25,600 --> 00:57:26,710 PROFESSOR: OK.
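The aggregate argument can be checked numerically by simulating n appends and counting only the element copies made during resizes. This is a sketch under the assumptions discussed above (capacity starts at 1 and doubles when the array is full); the function name count_copies is made up for illustration, and the exact append indices at which resizes occur may be off by one from the numbering used on the board.

```python
def count_copies(n):
    """Total elements copied by resizes over n appends, with capacity doubling."""
    capacity = 1   # assumed starting size of the backing array
    count = 0      # elements stored so far
    copies = 0     # total copy work done by all resizes
    for _ in range(n):
        if count == capacity:
            copies += count   # a resize copies every existing element
            capacity *= 2
        count += 1            # the O(1) insertion itself
    return copies


for n in (10, 1000, 10**6):
    print(n, count_copies(n), count_copies(n) / n)
```

The copy totals are sums of powers of two (1 + 2 + 4 + ...), which stay below 2n, so the total work for n appends is n insertions plus fewer than 2n copies. That is order n overall, or order 1 amortized per append.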