The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: Hey, everybody. You ready to learn some algorithms? Yeah! Let's do it. I'm Erik Demaine. You can call me Erik.

In the last class, we sort of jumped into things. We studied peak finding and looked at a bunch of algorithms for peak finding, and on your problem set you've already seen a bunch more. And in this class, we're going to do some more algorithms. Don't worry, that will be at the end. We're going to talk about another problem, document distance, which will be a running example for a bunch of topics that we cover in this class. But before we go there, I wanted to take a step back and talk about: what actually is an algorithm? What is an algorithm allowed to do? And also deep philosophical questions like, what is time? What is the running time of an algorithm? How do we measure it? And what are the rules of the game?

For fun, I thought I would first mention where the word "algorithm" comes from. It comes from this guy, a little hard to spell: al-Khwarizmi, who is sort of the father of algebra. He wrote this book called "The Compendious Book on Calculation by Completion and Balancing" back in the day. And it was in particular about how to solve linear and quadratic equations—so, the beginning of algebra. I don't think he invented those techniques, but he was sort of the textbook writer who wrote down how people solved them. And you can think of how to solve those equations as early algorithms: first you take this number, you multiply by this, you add it, or you reduce to squares, whatever.
So that's where the word "algebra" comes from, and also where the word "algorithm" comes from. There aren't very many words with these roots. So there you go—some fun history.

What's an algorithm? I'll start with some informal definitions and then get to the point of this lecture, the idea of a model of computation, which is to formally specify what an algorithm is. I don't want to get super technical and formal here, but I want to give you some grounding so that when we write Python code, when we write pseudocode, we have some idea what things actually cost. This is a new lecture—we've never done this before in 6.006—but I think it's important.

So at a high level, you can think of an algorithm as just—I'm sure you've seen the definition before—a way to define a computation, a computational procedure, for solving some problem. Whereas computer code could just be running in the background all the time doing whatever, an algorithm we think of as having some input and generating some output, usually to solve some problem. You want to know, is this number prime? Whatever. Question?

AUDIENCE: Can you turn up the volume for your mic?

PROFESSOR: This microphone does not feed into the AV system, so I shall just talk louder, OK? And quiet on the set, please.

OK, so that's an algorithm. You take some input, you run it through, you compute some output. Of course, computer code can do this too. An algorithm is basically the mathematical analog of a computer program. So if you want to reason about what computer programs do, you translate them into the world of algorithms. And vice versa: if you want to solve some problem, first you usually develop an algorithm using mathematics, using this class, and then you convert it into computer code. This class is about that transition from one to the other. You can draw a picture of the analogs.
So an algorithm is the mathematical analog of a computer program. A computer program is built on top of—it's written in—a programming language. The mathematical analog of a programming language, what we write algorithms in, is usually pseudocode, which is basically another fancy word for structured English—good English, whatever you want to say. Of course, you could use another natural language. But the idea is, you need to express the algorithm in a way that people can understand and reason about formally. That's the "structured" part. Pseudocode means lots of different things. It's just an abstract way to write down a formal specification without necessarily being able to actually run it on a computer. Though there's a particular pseudocode in your textbook which you probably could run on a computer—a lot of it, anyway. But you don't have to use that version. It just has to make sense to the humans doing the mathematics.

OK, and then ultimately, the program runs on a computer. You all have computers, probably in your pockets. There's an analog of a computer in the mathematical world, and that is the model of computation. And that's sort of the focus of the first part of this lecture. A model of computation says what your computer is allowed to do—basically, what it can do in constant time. And that's what I want to talk about here.

So the model of computation specifies basically what operations you can do in an algorithm and how much they cost. This is the "what is time?" question. For each operation, we're going to specify how much time it costs. Then the algorithm does a bunch of operations. They're combined together with control flow—for loops, if statements, stuff like that—which we're not going to worry about too much, but obviously we'll use them a lot. What we count is how much each of the operations costs; you add them up, and that is the total cost of your algorithm.
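To make that adding-up idea concrete, here is a small illustrative sketch (not from the lecture—the function and the constants c1, c2 are just made up for the example): summing a list does a constant amount of work per element, so the costs add up to time proportional to the length of the input.

```python
def total(A):
    s = 0            # a constant number of operations
    for x in A:      # the loop body runs len(A) times
        s = s + x    # each iteration: a constant number of operations
    return s         # one more constant-time operation

# Total cost: c1 + c2 * len(A) operations for some constants c1, c2,
# i.e. Theta(len(A)) time in this kind of model.
```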
So in particular, we care in this class mostly about running time. Each operation has a time cost, you add those up, and that's the running time of the algorithm.

OK, so I'm going to cover two models of computation, which you can just think of as different ways of thinking. You've probably seen them in some sense as—what do you call them?—styles of programming: an object-oriented style of programming, a more assembly-like style of programming. There are lots of different styles of programming languages, which I'm not going to talk about here, but you've seen analogs if you've seen those before. And these models really give you a way of structuring your thinking about how you write an algorithm. They are the random access machine and the pointer machine.

So we'll start with the random access machine, also known as the RAM. Can someone tell me what else RAM stands for?

AUDIENCE: Random access memory?

PROFESSOR: Random access memory. So this is both confusing and also convenient, because RAM simultaneously stands for two things, and they mean almost the same thing, but not quite. So I guess that's more confusing than useful. But there you go. So we have random access memory—oh, look at that, fits perfectly. Random access memory is over here in real-computer land: that's like DRAM, SDRAM, whatever—the things you buy and stick into your motherboard, your GPU, or whatever. And over here is the mathematical analog. So here it's a RAM and here it's also a RAM; here it's a random access machine, there it's a random access memory. It's a technical detail. But the idea is, if you look at the RAM that's in your computer, it's basically a giant array, right? You can go from zero to, I don't know—a typical chip these days is like four gigs in one stick. So you can go from zero to four gigs.
You can access anything in the middle there in constant time. To access something, you need to know where it is. That's random access memory. So that's an array.

So I'll just draw a big picture. Here's an array. Now, RAM is usually organized by words. So each of these cells is a machine word, which we're going to keep in this model. And then there's address zero, address one, address two—this is the fifth word—and it just keeps going. You can think of this as infinite. Or the amount that you use, that's the space of your algorithm, if you care about storage space. So that's basically it.

OK, that's the memory side of things. How do we actually compute with it? It's very simple. We just say: in constant time, an algorithm can basically read in, or load, a constant number of words from memory; do a constant number of computations on them; and then write them out—usually called store. It needs to know where these words are; it accesses them by address. And I should write here that you also have a constant number of registers just hanging around. So you load some words into registers, you can do some computations on those registers, and then you can write them back, storing them in locations that are specified by your registers. If you've ever done assembly programming, this is what assembly programming is like. And it can be rather annoying to write algorithms in this model. But in some sense, it is reality—this is how we think about computers. If you ignore things like caches, this is an accurate model of computation: loading, computing, and storing all take roughly the same amount of time. They all take constant time. You can manipulate a whole word at a time.

Now, what exactly is a word? You know, on computers these days it's like 32 bits or 64 bits, but we like to be a little bit more abstract: a word is w bits. It's slightly annoying.
And for most of this class, we won't really worry about what w is. We'll assume that we're given as input a bunch of things which are words. So, for example, peak finding: we're given a matrix of numbers. We didn't really say whether they're integers or floats or what. We don't worry about that. We just think of them as words, and we assume that we can manipulate those words. In particular, given two numbers, we can compare them—which is bigger? And so we can determine whether a cell in the matrix is a peak by comparing it with its neighbors in constant time. We didn't say why it was constant time to do that, but now you kind of know: if those things are all words, and you can manipulate a constant number of words in constant time, you can tell whether a number is a peak in constant time.

One constraint: w should be at least log of the size of memory, because a word should be able to specify an index into this array. And we might use that someday. But basically, don't worry about it. Words are words. Words come in as inputs, you can manipulate them, and you don't have to worry about it for the most part. In unit four of this class, we're going to talk about what happens if we have really giant integers that don't fit in a word—how do we manipulate them, how do we add them and multiply them? So that's another topic. But for most of this class, we'll just assume everything we're given is one word, and it's easy to compute on.

So this is a realistic model, more or less, and it's a powerful one. But a lot of the time, a lot of code just doesn't use arrays—doesn't need them. Sometimes we need arrays, sometimes we don't. Sometimes you feel like a nut, sometimes you don't. So it's useful to think about somewhat more abstract models that are not quite as powerful but offer a simpler way of thinking about things. For example, in this model there's no dynamic memory allocation.
You probably know you could implement dynamic memory allocation, because real computers do it. But it's nice to think about a model where that's taken care of for you. It's kind of like a higher-level programming abstraction. So the one that's useful in this class is the pointer machine. This basically corresponds to object-oriented programming, in a very simple version.

So we have dynamically allocated objects, and an object has a constant number of fields. A field is going to be either a word—so you can think of this as, for example, storing an integer, one of the input objects, something you computed from it, a counter, all these sorts of things—or a pointer. And that's where the pointer machine gets its name. A pointer is something that points to another object, or has a special value null, also known as nil, also known as None in Python.

OK, how many people have heard about pointers before? Who hasn't? Willing to admit it? OK, only a few. That's good. You should have seen pointers. You may have heard them called references. Modern languages these days don't call them pointers because pointers are scary—there's a very subtle difference between them, and this model really is references. But for whatever reason, it's called a pointer machine. It doesn't matter.

The point is, you've seen linked lists, I hope. And linked lists have a bunch of fields in each node. Maybe you've got a pointer to the previous element, a pointer to the next element, and some value. So here's a very simple linked list. This is what you'd call a doubly linked list, because it has previous and next pointers. The next pointer of this node points to this node; the previous pointer points back to this node. The last next pointer points to null, and the first previous pointer points to null, let's say. So that's a two-node doubly linked list. We presume we have a pointer to the head of the list, maybe a pointer to the tail of the list, whatever.
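To make the board picture concrete, here is a sketch of such a node in Python (the class name and the particular values are just made up for illustration)—each object has a constant number of fields, each one a word or a pointer:

```python
class Node:
    """One pointer-machine object: a constant number of fields."""
    def __init__(self, val):
        self.val = val      # a word, e.g. an integer
        self.prev = None    # pointer (reference) to another Node, or None
        self.next = None    # pointer (reference) to another Node, or None

# The two-node doubly linked list from the picture (values are arbitrary):
a = Node(5)
b = Node(7)
a.next = b      # next pointer of the first node points to the second
b.prev = a      # previous pointer of the second node points back
head = a        # we keep a pointer to the head of the list
```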
So this is a structure in the pointer machine—it's a data structure. In Python, you might call this a named tuple, or it's just an object with three attributes, I guess they're called in Python. So here we have the value—that's a word, like an integer—and then some of the fields can be pointers that point to other nodes. You can create a new node, you can destroy a node; that's the dynamic memory allocation. In this model, pointers are pointers—you can't touch them.

Now, you can implement this model on a random access machine: a pointer becomes an index into that giant array. And that's more like the pointers in C, if you've ever written C programs, because then you can take a pointer, add one to it, and go to the next thing after that. In this model, you can just follow a pointer—that's all you can do. Following a pointer costs constant time. Changing one of these fields costs constant time. All the usual things you might imagine doing to these objects take constant time. So it's actually a weaker model than the RAM, because you could implement a pointer machine with a random access machine. But it offers a different way of thinking, and a lot of data structures are built this way. Cool.

So that's the theory side. What I'd like to talk about next is, in Python, what's a reasonable model of what's going on? These are old models—this one goes back to the '80s, and this one probably the '80s or '70s. So they've been around a long time; people have used them forever. Python is obviously much more recent, at least modern versions of Python. And it's the model of computation, in some sense, that we use in this class, because we're implementing everything in Python.
Python offers both a random access machine perspective, because it has arrays, and a pointer machine perspective, because it has references—because it has pointers. So you can do either one. But it also has a lot of operations. It doesn't just have load and store and follow-pointer. It's got things like sort and append and concatenation of two lists, and lots more. And each of those has a cost associated with it. Whereas the random access machine and the pointer machine are theoretical models designed to be super simple, so that it's clear everything you do takes constant time, in Python some of the operations you can do take a lot of time—some operations in Python take exponential time. And you've got to know, when you're writing your algorithms down—either thinking in a Python model or implementing your algorithms in actual Python—which operations are fast and which are slow. That's what I'd like to spend the next few minutes on. There are a lot of operations; I'm not going to cover all of them, but we'll cover more in recitation, and there's a whole bunch in my notes. I won't get to all of them.

So in Python, you can do random-access-style things. In Python, arrays are called lists, which is super confusing. But there you go: a list in Python is an array in the real world. It's a super cool array, of course, and you can think of it as a list. But in terms of implementation, it's implemented as an array. Question?

AUDIENCE: I thought that [INAUDIBLE].

PROFESSOR: You thought Python lists were linked lists. That's why it's so confusing—in fact, they are not. In, say, Scheme, back in the days when we taught Scheme, lists are linked lists. And it's very different. So I'll give an operation here: you have a list L, and you do something like L[i] = L[j] + 5. L is a list object. This takes constant time.
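As a tiny concrete version of that board example (the particular indices and values here are made up), in Python:

```python
L = [3, 1, 4, 1, 5, 9]   # a Python "list" -- really a dynamic array
i, j = 1, 4
L[i] = L[j] + 5          # two random accesses plus one store: constant time
```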
In a linked list, it would take linear time, because we've got to scan to position i, scan to position j, add 5, and store. But conveniently, in Python this takes constant time. And that's important to know. I know the terminology is super confusing, but blame the benevolent dictator for life.

On the other hand, you can do style two, the pointer machine, using object-oriented programming, obviously. I'll just mention that I'm not really worrying about methods here, because methods are just a way of thinking about things, not super important from a cost standpoint. If your object has a constant number of attributes—it can't have like a million attributes, or n attributes—then it fits into this pointer machine model. So if you have an object that only has, like, three things or ten things or whatever, that's a pointer machine, and you can think of manipulating that object as taking constant time. If you're screwing around with the object's dictionary and doing lots of crazy things, then you have to be careful about whether this remains true. But as long as you only have a reasonable number of attributes, this is all fair game.

So if you're implementing a linked list—Python, I checked, still does not have built-in linked lists, but they're pretty easy to build—you have a pointer, and you just say x = x.next. That takes constant time, because accessing a field in an object of constant size takes constant time. And we don't care what these constants are. That's the beauty of algorithms: we only care about scalability with n. There's no n here. This takes constant time; that takes constant time. No matter how big your linked list is, no matter how many objects you have, these are constant time.

OK, let's do some harder ones, though. In general, the idea is, take an operation like L.append. So you have a list,
and you want to append some item to the list. It's an array, though. So think about it. The way to figure out how much this costs is to think about how it's implemented in terms of these basic operations. These are sort of your core constant-time things, and most everything can be reduced to thinking about them. But sometimes it's less obvious.

L.append is a little tricky to think about, because basically you have an array of some size, and now you want to make an array one larger. The obvious way to do that is to allocate a new array and copy all the elements—that would take linear time. Python doesn't do that. What does it do? Stay tuned: it does something called table doubling. It's a very simple idea—you can almost guess it from the title. And if you go to lecture—is it eight or nine? Nine, sorry—you'll see how this can basically be done in constant time. There's a slight catch, but basically, think of it as a constant-time operation. This is why you should take this class: so you'll understand how Python works. It's using an algorithmic concept that was invented, I don't know, decades ago, but it's a simple thing that we need to solve lots of other problems. So it's cool—there are a lot of features in Python that use algorithms. And that's kind of why I'm telling you.

All right, let's do another one, a little easier. What if I want to concatenate two lists? You should know that in Python this is a non-destructive operation: you basically take a copy of L1 and L2 and concatenate them. Of course, they're arrays. The way to think about this is to re-implement it as Python code. It's the same thing as saying: L is initially empty, and then for every item x in L1, L.append(x)—and then the same for L2, as in the sketch below.
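Spelled out as ordinary Python (a sketch of the equivalent code, in the spirit of what gets written on the board—the function name is just for illustration):

```python
def concat(L1, L2):
    """What L1 + L2 does, cost-wise: build a brand-new list."""
    L = []                 # constant time
    for x in L1:
        L.append(x)        # amortized constant time each: Theta(len(L1)) total
    for x in L2:
        L.append(x)        # Theta(len(L2)) total
    return L

# Total running time: O(1 + len(L1) + len(L2)).
```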
A lot of the time in the documentation for Python, you'll see this sort of "here's what it means," especially for the fancier features—they give an equivalent in simple Python, if you will. This doesn't use any fancy operations that we haven't seen already. So now we know each append takes constant time, and so the amount of time here is basically order the length of L1, and the time here is order the length of L2. In total it's order—I'm going to be careful and say—1 plus the length of L1 plus the length of L2. The "1 plus" is just in case these are both zero; it still takes constant time to build the initial empty list.

OK, so there are a bunch of operations written up in these notes. I'm not going to go through all of them, because they're tedious, but for a lot of them you could just expand out code like this, and it's very easy to analyze. Whereas if you just look at "plus," you think, oh, plus is constant time—and plus is constant time if this is a word and this is a word, but these are entire data structures, so it's not constant time.

All right. There are more subtle, fun ones to think about. Like, if I want to know whether x is in the list, how does that happen? Any guesses? There's an operator in Python called in: x in L. How long do you think this takes? Altogether? Linear, yeah, linear time. In the worst case, you're going to have to scan through the whole list. Lists aren't necessarily sorted; we don't know anything about them. So you've got to just scan through and test, for every item, is x equal to that item? And it's even worse if == costs a lot—if x is some really complicated thing, you have to take that into account. OK, blah, blah, blah.

OK, another fun one. This is like a pop quiz: how long does it take to compute the length of a list?
Constant, yeah. Luckily. If you didn't know anything, you'd have to scan through the list and count the items. But in Python, lists are implemented with a counter built in—it always stores the length of the list at the beginning. So you just look it up; it's instantaneous. It's important, though. That can matter.

All right, let's do some more. What if I want to sort a list? How long does that take? n log n, where n is the length of the list. Technically, times the time to compare two items—but usually we're just sorting words, so a comparison is constant time. If you look at Python's sorting algorithm, it uses a comparison sort. This is the topic of lectures three, four, and seven. But in particular, in the very next lecture, we will see how this is done in n log n time. And that is using algorithms.

All right, let's go to dictionaries—in Python they're called dicts. These are a generalization of lists, in some sense. Instead of putting just an index here—an integer between 0 and the length minus 1—you can put an arbitrary key and store a value: D[key] = value, for example. How long does this take? I'm not going to ask you, because it's not obvious. In fact, this is one of the most important data structures in all of computer science. It's called a hash table, and it is the topic of lectures eight through ten. So stay tuned for how to do this in constant time—how to store an arbitrary key and get it back out in constant time. This is assuming the key is a single word. Yeah?

AUDIENCE: Does it first check to see whether the key is already in the dictionary?

PROFESSOR: Yeah, it will clobber any existing key. You can also test whether a key is in the dictionary—that also takes constant time. You can delete something from the dictionary.
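For reference, here are the single-key dictionary operations being discussed, with their costs as comments (a sketch—the keys and values are made up, and "constant time" here is with high probability, as explained in a moment):

```python
D = {}                  # an empty dict (a hash table)
D['algorithms'] = 6006  # insert or overwrite a key: constant time
'algorithms' in D       # membership test: constant time
x = D['algorithms']     # lookup by key: constant time
del D['algorithms']     # delete a key: constant time
```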
All the usual operations dealing with a single key in a dictionary are like that. Obviously, something like dictionary.update involves a lot of keys, so that doesn't take constant time. How long does it take? Well, you write out a for loop and count.

AUDIENCE: But how can you see whether [INAUDIBLE] dictionary in constant time?

PROFESSOR: How do you do this in constant time? Come to lectures eight through ten. I should say there's a slight catch, which is that this is constant time with high probability. It's a randomized algorithm. It doesn't always take constant time. It's always correct, but sometimes—very rarely—it takes a little more than constant time. I'm going to abbreviate this WHP. We'll see more of what that means mostly in 6.046, but we'll see a fair amount in 6.006 of how this works and how it's possible. It's a big area of research; a lot of people work on hashing. It's very cool and super useful. If you write any code these days, you use a dictionary. It's the way to solve problems.

I'm basically using Python as a platform to advertise the rest of the class, you may have noticed. Not every topic we cover in this class is already in Python, but a lot of them are. So we've got table doubling, we've got dictionaries, we've got sorting. Another one is longs, which are long integers in Python through version 2, and this is the topic of lecture 11. So for fun: if I have two integers x and y, and let's say one of them is this many words long and the other one is this many words long, how long do you think it takes to add them? Guesses?

AUDIENCE: [INAUDIBLE].

PROFESSOR: Plus? Times? Plus is the answer: the number of words in x plus the number of words in y. You can do it in that much time. If you think about the grade-school algorithm for adding really big multi-digit numbers, it only takes that much time. Multiplication is a little bit harder, though.
If you look at the grade-school algorithm, it's going to be the length of x times the length of y—quadratic time, not so good. The algorithm that's implemented in Python takes time (length of x plus length of y) to the power log base 2 of 3. By the way, I always write lg to mean log base 2—it only has two letters, so it's base 2. Log base 2 of 3 is about 1.6. So while the straightforward algorithm is basically (length of x plus length of y) squared, this one is (length of x plus length of y) to the 1.6 power—a little better than quadratic. And the Python developers found that was faster than grade-school multiplication, so that's what they implemented. That is something we will cover in lecture 11—how to do that. It's pretty cool. There are faster algorithms, but this is one that works quite practically.

One more: heapq. This is in the Python standard library and implements something called a heap, which will be in lecture four. So, coming soon to a classroom near you.

All right, enough advertisement. That gives you some idea of the model of computation. There's a whole bunch more in these notes, which are online—go check them out. And some of them we'll cover in recitation tomorrow.

Now that we're sort of comfortable with what costs what in Python, I want to do a real example. Last time, we did peak finding. We're going to do another example, which is called document distance. So let's do that. Any questions before we go on? All right.

The document distance problem is: I give you two documents—I'll call them D1 and D2—and I want to compute the distance between them. And the first question is, what does that mean? What is this distance function? Let me first tell you some motivations for computing document distance. Let's say you're Google and you're cataloging the entire web. You'd like to know when two web pages are basically identical.
Because then you store less, and you present it differently to the user: you say, well, there's this page, and there are lots of extra copies, but here's the canonical one. Or you're Wikipedia—and I don't know if you've ever looked, but there's a list of all mirrors of Wikipedia, there are like millions of them, and they find them by hand. You could do that using document distance: say, are these basically identical other than some junk at the beginning or the end?

Or you're teaching this class and you want to detect whether two problem sets are cheating—are they identical? We do this a lot. I'm not going to tell you what distance function we use, because that would defeat the point; it's not the one we cover in class. But we use automated tests for whether you're cheating.

I've got some more. Web search: let's say you're Google again, and you want to implement searching. I give you three words—I'm searching for "introduction to algorithms." You can think of "introduction to algorithms" as a very short document, and you want to test whether that document is similar to all the other documents on the web. The one that's most similar—the one that has the smallest distance—is maybe what you want to put at the top. That's obviously not what Google does, but it's part of what it does.

So that's why you might care. It's partly also just a toy problem that lets us illustrate a lot of the techniques we develop in this class.

All right, I'm going to think of a document as a sequence of words. Just to be a little bit more formal about what I mean by a document: a word is just going to be a string of alphanumeric characters—A through Z and zero through nine.
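Under that definition, one simple way to pull the words out of a document string looks like this (a sketch—the actual document distance code used in the course may do it differently):

```python
import re

def get_words(document):
    """Return the maximal runs of alphanumeric characters in the string."""
    return re.findall(r'[A-Za-z0-9]+', document)

# get_words("The cat, the hat!") == ['The', 'cat', 'the', 'hat']
```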
OK, so if I have a document—which you can also think of as a string—you basically delete all the white space and punctuation and all the other junk that's in there, and everything in between, those are the words. That's a simple definition of decomposing documents into words.

And now we can think about it: I want to know whether D1 and D2 are similar, and I've thought about my document as a collection of words. Maybe they're similar if they share a lot of words in common. So that's the idea: look at shared words and use that to define document distance. This is obviously only one way to define distance—it'll be the way we do it in this class—but there are lots of other possibilities.

So I'm going to think of a document as a sequence of words, but I could also think of it as a vector. If I have a document D and I have a word W, then D(W) is going to be the number of times that word occurs in the document—the number of occurrences of W in the document D. So it's a number, a non-negative integer. Could be zero, could be one, could be a million. I think of this as a giant vector, indexed by all words. For each word, there's some frequency: a lot of them are zero, and then some of them have some positive number of occurrences. You can think of every document as being one of these plots on a common set of axes. There are infinitely many words down here, so it's kind of a big axis, but it's one way to draw the picture.

OK, so for example, take two very important documents: "the cat" and "the dog." Everybody likes cats and dogs. These are two-word documents, and so we can draw them. Because there are only three different words here, we can draw them in three-dimensional space—beyond that, it's a little hard to draw. So let's say this axis is "the"; that makes it easier to draw.
819 00:37:56,550 --> 00:38:00,350 So there's going to be just zero here and one. 820 00:38:00,350 --> 00:38:06,310 For each of the axes, let's say this is dog and this is cat. 821 00:38:06,310 --> 00:38:10,510 OK, so the cat one has-- it has one cat and no dog. 822 00:38:10,510 --> 00:38:12,170 So it's here. 823 00:38:12,170 --> 00:38:15,480 It's a vector pointing out there. 824 00:38:15,480 --> 00:38:20,992 The dog one is basically pointing there. 825 00:38:20,992 --> 00:38:22,200 OK, so these are two vectors. 826 00:38:25,150 --> 00:38:27,640 So how do I measure how different two vectors are? 827 00:38:27,640 --> 00:38:30,235 Any suggestions from vector calculus? 828 00:38:33,864 --> 00:38:35,559 AUDIENCE: Inner product? 829 00:38:35,559 --> 00:38:36,600 PROFESSOR: Inner product? 830 00:38:36,600 --> 00:38:38,780 Yeah, that's a good suggestion. 831 00:38:38,780 --> 00:38:41,550 Any others? 832 00:38:41,550 --> 00:38:43,010 OK, we'll go with inner product. 833 00:38:43,010 --> 00:38:48,589 I like inner product, also known as dot product. 834 00:38:48,589 --> 00:38:49,630 Let me just define that quickly. 835 00:38:56,780 --> 00:38:58,790 So we could-- I'm going to call this D prime 836 00:38:58,790 --> 00:39:02,360 because it's not what we're going to end up with. 837 00:39:02,360 --> 00:39:06,160 We could think of this as the dot product of D1 and D2, 838 00:39:06,160 --> 00:39:17,380 also known as the sum over all words W of D1 of W times D2 of W. 839 00:39:17,380 --> 00:39:20,010 So for example, you take the dot product of these two guys. 840 00:39:20,010 --> 00:39:21,540 Those match. 841 00:39:21,540 --> 00:39:27,940 So you get one point there; cat and dog get multiplied by zero. 842 00:39:27,940 --> 00:39:30,600 So you don't get much there. 843 00:39:30,600 --> 00:39:33,410 So this is some measure of distance. 844 00:39:33,410 --> 00:39:38,220 But it's actually a measure of commonality. 845 00:39:38,220 --> 00:39:41,160 So it would be sort of inverse distance, sorry. 846 00:39:41,160 --> 00:39:43,209 If you have a high dot product, you 847 00:39:43,209 --> 00:39:44,500 have a lot of things in common. 848 00:39:44,500 --> 00:39:46,210 Because a lot of these terms 849 00:39:46,210 --> 00:39:47,840 weren't zero times something. 850 00:39:47,840 --> 00:39:50,460 It's actually a positive number times some positive number. 851 00:39:50,460 --> 00:39:53,197 If you have a lot of shared words, then that looks good. 852 00:39:53,197 --> 00:39:55,280 The trouble with this is if I have a long document-- 853 00:39:55,280 --> 00:39:59,210 say, a million words-- and it's 99% in common 854 00:39:59,210 --> 00:40:02,760 with another document that's a million words long, 855 00:40:02,760 --> 00:40:06,310 it's still-- it looks super similar. 856 00:40:06,310 --> 00:40:08,970 Actually, I need to do it the other way around. 857 00:40:08,970 --> 00:40:12,267 Let's say it's a million words long and half of the words 858 00:40:12,267 --> 00:40:12,850 are in common. 859 00:40:12,850 --> 00:40:15,190 So not that many, but a fair number. 860 00:40:15,190 --> 00:40:18,479 Then I have a score of like 500,000. 861 00:40:18,479 --> 00:40:21,020 And then I have two documents which are, say, 100 words long. 862 00:40:21,020 --> 00:40:22,540 And they're identical. 863 00:40:22,540 --> 00:40:25,670 Their score is maybe only 100. 864 00:40:25,670 --> 00:40:27,310 So even though they're identical, 865 00:40:27,310 --> 00:40:29,170 it's not quite scale invariant.
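To make the inner product concrete, here is a minimal sketch in Python, assuming each document has already been reduced to a word-to-count dictionary. The variable and function names are illustrative; this is not the course's docdist code.

# Word-frequency vectors for the two example documents "the cat" and "the dog".
d1 = {"the": 1, "cat": 1}
d2 = {"the": 1, "dog": 1}

def inner_product(f1, f2):
    # D1 . D2 = sum over all words W of D1(W) * D2(W).
    # Only words that appear in both documents contribute.
    return sum(count * f2.get(word, 0) for word, count in f1.items())

print(inner_product(d1, d2))  # 1 -- only "the" is shared; cat and dog get multiplied by zero
print(inner_product(d1, d1))  # 2 -- a document against itself

# The raw score is not scale invariant: two identical 100-word documents score
# about 100, while two million-word documents sharing half their words score
# around 500,000, so the raw inner product favors long documents.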
866 00:40:29,170 --> 00:40:31,630 So it's not quite a perfect measure. 867 00:40:31,630 --> 00:40:33,241 Any suggestions for how to fix this? 868 00:40:33,241 --> 00:40:34,740 This, I think, is a little trickier. 869 00:40:34,740 --> 00:40:35,722 Yeah? 870 00:40:35,722 --> 00:40:37,639 AUDIENCE: Divide by the length of the vectors? 871 00:40:37,639 --> 00:40:39,596 PROFESSOR: Divide by the length of the vectors. 872 00:40:39,596 --> 00:40:40,970 I think that's worth a pillow. 873 00:40:40,970 --> 00:40:43,280 Haven't done any pillows yet. 874 00:40:43,280 --> 00:40:44,770 Sorry about that. 875 00:40:44,770 --> 00:40:47,550 So, divide by the lengths of the vectors. 876 00:40:47,550 --> 00:40:49,447 That's good. 877 00:40:49,447 --> 00:40:51,030 I'm going to call this D double prime. 878 00:40:51,030 --> 00:40:54,190 Still not quite the right answer. 879 00:40:54,190 --> 00:40:56,240 Or not-- no, it's pretty good. 880 00:40:56,240 --> 00:40:58,060 It's pretty good. 881 00:40:58,060 --> 00:40:59,960 So here, the length of a vector 882 00:40:59,960 --> 00:41:02,030 is the usual vector norm-- the square root 883 00:41:02,030 --> 00:41:06,610 of its dot product with itself. This is pretty cool. 884 00:41:06,610 --> 00:41:10,610 But does anyone recognize this formula? 885 00:41:10,610 --> 00:41:12,200 Angle, yeah. 886 00:41:12,200 --> 00:41:14,650 It's a lot like the angle between the two vectors. 887 00:41:14,650 --> 00:41:18,670 It's just off by an arccos. 888 00:41:18,670 --> 00:41:21,229 This is the cosine of the angle between the two vectors. 889 00:41:21,229 --> 00:41:22,020 And I'm a geometer. 890 00:41:22,020 --> 00:41:23,020 I like geometry. 891 00:41:23,020 --> 00:41:25,110 So if you take arccos of that thing, 892 00:41:25,110 --> 00:41:27,520 that's a well-established distance metric. 893 00:41:27,520 --> 00:41:32,210 This goes back to '75, if you can believe it, 894 00:41:32,210 --> 00:41:34,850 back when people-- early days of information 895 00:41:34,850 --> 00:41:37,585 retrieval, way before the web-- people 896 00:41:37,585 --> 00:41:40,670 were already working on this stuff. 897 00:41:40,670 --> 00:41:43,990 So it's a natural measure of the angle between the two vectors. 898 00:41:43,990 --> 00:41:46,690 If it's 0, they're basically identical. 899 00:41:46,690 --> 00:41:49,670 If it's 90 degrees, they're really, really different. 900 00:41:49,670 --> 00:41:53,030 And so that gives you a nice way to compute document distance. 901 00:41:53,030 --> 00:41:55,670 The question is, how do we actually compute that measure? 902 00:41:55,670 --> 00:41:58,020 Now that we've come up with something that's reasonable, 903 00:41:58,020 --> 00:42:00,460 how do I actually find this value? 904 00:42:00,460 --> 00:42:03,930 I need to compute these vectors-- the number 905 00:42:03,930 --> 00:42:06,500 of occurrences of each word in the document. 906 00:42:06,500 --> 00:42:08,817 And I need to compute the dot product. 907 00:42:08,817 --> 00:42:09,900 And then I need to divide. 908 00:42:09,900 --> 00:42:10,691 That's really easy. 909 00:42:10,691 --> 00:42:13,260 So, dot product-- and I also need 910 00:42:13,260 --> 00:42:15,610 to decompose a document into a list of words. 911 00:42:15,610 --> 00:42:17,492 So there are three things I need to do. 912 00:42:17,492 --> 00:42:18,450 Let me write them down. 913 00:42:30,417 --> 00:42:31,375 So, a sort of algorithm. 914 00:42:36,580 --> 00:42:42,745 Step one: split a document into words.
915 00:42:46,130 --> 00:42:51,040 Second is compute word frequencies, 916 00:42:51,040 --> 00:42:54,080 how many times each word appears. 917 00:42:54,080 --> 00:42:55,860 These are the document vectors. 918 00:42:58,380 --> 00:43:02,020 And then the third step is to compute the dot product. 919 00:43:07,124 --> 00:43:09,290 Let me tell you a little bit about how each of those 920 00:43:09,290 --> 00:43:10,690 is done. 921 00:43:10,690 --> 00:43:14,860 Some of these will be covered more in future lectures. 922 00:43:14,860 --> 00:43:16,980 I want to give you an overview. 923 00:43:16,980 --> 00:43:19,800 There are a lot of ways to do each of these steps. 924 00:43:19,800 --> 00:43:21,900 If you look next to the lecture 925 00:43:21,900 --> 00:43:25,310 notes for this lecture, lecture two, there's a bunch of code 926 00:43:25,310 --> 00:43:28,640 and a bunch of data examples of documents-- 927 00:43:28,640 --> 00:43:30,530 big corpuses of text. 928 00:43:30,530 --> 00:43:32,070 And you can run them-- I think there are 929 00:43:32,070 --> 00:43:34,760 eight different versions of the algorithm. 930 00:43:34,760 --> 00:43:37,010 And let me give you-- actually, why don't I 931 00:43:37,010 --> 00:43:39,060 cut to the chase a little bit and tell you 932 00:43:39,060 --> 00:43:42,430 about the run times of these different implementations 933 00:43:42,430 --> 00:43:43,431 of this same algorithm. 934 00:43:43,431 --> 00:43:45,638 There are lots of versions of this algorithm. 935 00:43:45,638 --> 00:43:46,910 We implemented it a whole bunch of times. 936 00:43:46,910 --> 00:43:49,850 Every semester I teach this, I change them a little bit more, 937 00:43:49,850 --> 00:43:51,970 add a few more variants. 938 00:43:51,970 --> 00:43:55,260 So version one, on a particular pair 939 00:43:55,260 --> 00:44:00,700 of documents which is like a megabyte-- not very much text-- 940 00:44:00,700 --> 00:44:05,460 it takes 228.1 seconds-- super slow. 941 00:44:05,460 --> 00:44:06,790 Pathetic. 942 00:44:06,790 --> 00:44:09,410 Then we do a little bit of algorithmic tweaking. 943 00:44:09,410 --> 00:44:11,780 We get down to 164 seconds. 944 00:44:11,780 --> 00:44:14,740 Then we get to 123 seconds. 945 00:44:14,740 --> 00:44:17,340 Then we get down to 71 seconds. 946 00:44:17,340 --> 00:44:21,460 Then we get down to 18.3 seconds. 947 00:44:21,460 --> 00:44:25,130 And then we get to 11.5 seconds. 948 00:44:25,130 --> 00:44:28,270 Then we get to 1.8 seconds. 949 00:44:28,270 --> 00:44:31,760 Then we get to 0.2 seconds. 950 00:44:31,760 --> 00:44:33,630 So, a factor of 1,000. 951 00:44:33,630 --> 00:44:36,530 This is just in Python. 952 00:44:36,530 --> 00:44:38,730 Two tenths of a second to process a megabyte. 953 00:44:38,730 --> 00:44:39,410 It's all right. 954 00:44:39,410 --> 00:44:40,750 It's getting reasonable. 955 00:44:40,750 --> 00:44:41,969 This is not so reasonable. 956 00:44:41,969 --> 00:44:43,760 Some of these improvements are algorithmic. 957 00:44:43,760 --> 00:44:46,300 Some of them are just better coding. 958 00:44:46,300 --> 00:44:49,280 So they're improving the constant factors. 959 00:44:49,280 --> 00:44:52,710 But if you look at larger and larger texts, 960 00:44:52,710 --> 00:44:54,210 this will become even more dramatic. 961 00:44:54,210 --> 00:44:56,220 Because a lot of these were improvements 962 00:44:56,220 --> 00:44:59,790 from quadratic-time algorithms to linear and n log n algorithms. 963 00:44:59,790 --> 00:45:02,432 And so for a megabyte, yeah, it's a reasonable improvement.
964 00:45:02,432 --> 00:45:04,890 But if you look at a gigabyte, it'll be a huge improvement. 965 00:45:04,890 --> 00:45:06,099 There will be no comparison. 966 00:45:06,099 --> 00:45:07,640 In fact, there will be no comparison, 967 00:45:07,640 --> 00:45:09,098 because this one will never finish. 968 00:45:09,098 --> 00:45:11,330 The reason I ran such a small example 969 00:45:11,330 --> 00:45:13,540 is so I could have the patience to wait for this one. 970 00:45:13,540 --> 00:45:17,050 But this one you could run on the bigger examples. 971 00:45:17,050 --> 00:45:22,490 All right, so where do I want to go from here? 972 00:45:22,490 --> 00:45:24,380 Five minutes. 973 00:45:24,380 --> 00:45:26,560 I want to tell you about some of those improvements 974 00:45:26,560 --> 00:45:29,380 and some of the algorithms here. 975 00:45:29,380 --> 00:45:31,770 Let's start with this very simple one. 976 00:45:31,770 --> 00:45:36,325 How would you split a document into words in Python? 977 00:45:36,325 --> 00:45:36,825 Yeah? 978 00:45:36,825 --> 00:45:38,280 AUDIENCE: [INAUDIBLE]. 979 00:45:38,280 --> 00:45:40,935 Iterate through the document, [INAUDIBLE] the dictionary? 980 00:45:40,935 --> 00:45:42,560 PROFESSOR: Iterate through the-- that's 981 00:45:42,560 --> 00:45:44,490 actually how we do number two. 982 00:45:44,490 --> 00:45:46,900 OK, we can talk about that one. 983 00:45:46,900 --> 00:45:52,220 Iterate through the words in the document 984 00:45:52,220 --> 00:45:53,690 and put them in a dictionary. 985 00:45:53,690 --> 00:45:59,980 Let's say, count[word] += 1. 986 00:45:59,980 --> 00:46:02,320 This would work if count is something called a Counter-- 987 00:46:02,320 --> 00:46:05,440 a counting dictionary-- if you're a super Pythonista. 988 00:46:05,440 --> 00:46:07,940 Otherwise, you have to check, is the word in the dictionary? 989 00:46:07,940 --> 00:46:09,610 If not, set it to one. 990 00:46:09,610 --> 00:46:12,459 If it is there, add one to it. 991 00:46:12,459 --> 00:46:14,000 But I think you know what this means. 992 00:46:14,000 --> 00:46:15,859 This will count the number of words-- 993 00:46:15,859 --> 00:46:18,400 this will count the frequency of each word in the document. 994 00:46:18,400 --> 00:46:21,020 And because dictionaries run in constant time 995 00:46:21,020 --> 00:46:26,020 with high probability-- with high probability-- 996 00:46:26,020 --> 00:46:31,170 this will take order-- well, I'm cheating a little bit. 997 00:46:31,170 --> 00:46:32,630 Because words can be really long. 998 00:46:32,630 --> 00:46:35,480 And so to reduce a word down to a machine word 999 00:46:35,480 --> 00:46:38,810 could take time on the order of the length of the word. 1000 00:46:38,810 --> 00:46:40,210 To be a little more precise, this is 1001 00:46:40,210 --> 00:46:41,626 going to be the sum of the lengths 1002 00:46:41,626 --> 00:46:45,800 of the words in the document, which is also 1003 00:46:45,800 --> 00:46:48,464 known as the length of the document, basically. 1004 00:46:48,464 --> 00:46:49,130 So this is good. 1005 00:46:49,130 --> 00:46:51,565 This is linear time with high probability. 1006 00:46:54,500 --> 00:46:55,770 OK, that's a good algorithm. 1007 00:46:55,770 --> 00:47:02,390 That is introduced in algorithm four. 1008 00:47:02,390 --> 00:47:04,627 So we got a significant boost. 1009 00:47:04,627 --> 00:47:05,960 There are other ways to do this.
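Here is a minimal sketch of the counting step just described, under the assumption that the document has already been split into a list of words. collections.Counter is the counting dictionary alluded to above; the plain-dict version spells out the membership check. The function names are illustrative, not from the course's code.

from collections import Counter

def word_frequencies(words):
    # Step two: word -> number of occurrences.
    # Each dictionary operation is O(1) with high probability, so the whole
    # pass is linear in the number of words (and in the document length,
    # once you account for hashing each word).
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

def word_frequencies_counter(words):
    # The same thing using the standard library's counting dictionary.
    return Counter(words)

print(word_frequencies(["the", "cat", "the"]))  # {'the': 2, 'cat': 1}

The sorting-based alternative mentioned next gets the same grouping without hashing: once the words are sorted, identical words sit next to each other, so a single pass can count run lengths.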
1010 00:47:05,960 --> 00:47:09,190 For example, you could sort the words 1011 00:47:09,190 --> 00:47:10,930 and then run through the sorted list 1012 00:47:10,930 --> 00:47:13,430 and count how many you get in a row for each one. 1013 00:47:13,430 --> 00:47:15,925 If it's sorted, you can count-- I mean, 1014 00:47:15,925 --> 00:47:18,300 all the identical words are put right next to each other. 1015 00:47:18,300 --> 00:47:19,800 So it's easy to count them. 1016 00:47:19,800 --> 00:47:21,310 And that'll run almost as fast. 1017 00:47:21,310 --> 00:47:22,690 That was one of these algorithms. 1018 00:47:26,470 --> 00:47:29,350 OK, so that's a couple of different ways to do that. 1019 00:47:29,350 --> 00:47:30,740 Let's go back to this first step. 1020 00:47:30,740 --> 00:47:33,785 How would you split a document into words in the first place? 1021 00:47:33,785 --> 00:47:34,284 Yeah? 1022 00:47:34,284 --> 00:47:36,617 AUDIENCE: Search for the spaces and then [INAUDIBLE]. 1023 00:47:36,617 --> 00:47:39,150 PROFESSOR: Run through the string. 1024 00:47:39,150 --> 00:47:41,870 And every time you see anything that's not alphanumeric, 1025 00:47:41,870 --> 00:47:43,250 start a new word. 1026 00:47:43,250 --> 00:47:45,380 OK, that would run in linear time. 1027 00:47:45,380 --> 00:47:47,380 That's a good answer. 1028 00:47:47,380 --> 00:47:48,590 So it's not hard. 1029 00:47:48,590 --> 00:47:54,210 If you're a fancy Pythonista, you might do it like this. 1030 00:48:01,430 --> 00:48:02,485 Remember regexes. 1031 00:48:05,040 --> 00:48:07,090 This will find all the words in a document. 1032 00:48:07,090 --> 00:48:10,660 Trouble is, in general, re takes exponential time. 1033 00:48:10,660 --> 00:48:14,260 So if you think about algorithms, be very careful. 1034 00:48:14,260 --> 00:48:16,010 This probably will run in linear time, 1035 00:48:16,010 --> 00:48:19,800 but unless you know how re is implemented, 1036 00:48:19,800 --> 00:48:22,000 it's not obvious at all. 1037 00:48:22,000 --> 00:48:24,210 Don't do anything fancy with regular expressions. 1038 00:48:24,210 --> 00:48:26,543 If you don't know what this means, don't worry about it. 1039 00:48:26,543 --> 00:48:27,100 Don't use it. 1040 00:48:27,100 --> 00:48:28,990 If you know about it, be very careful in this class 1041 00:48:28,990 --> 00:48:29,656 when you use re. 1042 00:48:29,656 --> 00:48:31,664 Because it's not always linear time. 1043 00:48:31,664 --> 00:48:33,330 But there is an easy algorithm for this, 1044 00:48:33,330 --> 00:48:38,075 which is just to scan through and look for alphanumerics. 1045 00:48:38,075 --> 00:48:38,950 String them together. 1046 00:48:38,950 --> 00:48:39,449 It's good. 1047 00:48:39,449 --> 00:48:41,530 There are a few other algorithms here in the notes. 1048 00:48:41,530 --> 00:48:42,670 You should check them out. 1049 00:48:42,670 --> 00:48:46,930 And for fun, look at this code and see how small differences 1050 00:48:46,930 --> 00:48:49,190 make a dramatic difference in performance. 1051 00:48:49,190 --> 00:48:51,620 Next class will be about sorting.
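Putting the three steps together, here is a compact end-to-end sketch. It uses a fixed, simple regular expression for the alphanumeric scan and lowercases words so that "The" and "the" match; both of those are choices made for this illustration, and the code is not the course's docdist implementation.

import math
import re
from collections import Counter

def get_words(text):
    # Step 1: words are maximal runs of alphanumeric characters.
    # This fixed character-class pattern is simple enough to behave linearly
    # in practice; an explicit character-by-character scan would avoid relying
    # on the regex engine at all.
    return re.findall(r"[A-Za-z0-9]+", text.lower())

def frequencies(text):
    # Step 2: word -> number of occurrences.
    return Counter(get_words(text))

def inner_product(f1, f2):
    # Step 3: dot product of the two frequency vectors.
    return sum(count * f2.get(word, 0) for word, count in f1.items())

def document_distance(text1, text2):
    # Angle between the two document vectors, in radians:
    # 0 means the same mix of words, pi/2 means no words in common.
    f1, f2 = frequencies(text1), frequencies(text2)
    denominator = math.sqrt(inner_product(f1, f1) * inner_product(f2, f2))
    cosine = inner_product(f1, f2) / denominator
    return math.acos(min(1.0, cosine))  # clamp to guard against rounding just above 1

print(document_distance("the cat", "the dog"))   # about 1.047 radians (60 degrees)
print(document_distance("the cat", "The cat!"))  # 0.0 -- identical after splitting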