1 00:00:00,040 --> 00:00:02,490 The following content is provided under a Creative 2 00:00:02,490 --> 00:00:03,900 Commons license. 3 00:00:03,900 --> 00:00:06,940 Your support will help MIT OpenCourseWare continue to 4 00:00:06,940 --> 00:00:10,580 offer high quality educational resources for free. 5 00:00:10,580 --> 00:00:13,490 To make a donation or view additional materials from 6 00:00:13,490 --> 00:00:19,320 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:19,320 --> 00:00:21,200 ocw.mit.edu. 8 00:00:21,200 --> 00:00:27,010 PROFESSOR: OK so we're going to test it. 9 00:00:27,010 --> 00:00:28,930 Here's testOne. 10 00:00:28,930 --> 00:00:32,360 It just says for name in range(10), I'm going to build 11 00:00:32,360 --> 00:00:35,560 some nodes, put in a few edges, and then I'm going to 12 00:00:35,560 --> 00:00:37,160 print the graph. 13 00:00:37,160 --> 00:00:41,940 And all I really want to show you here, is that if we run it 14 00:00:41,940 --> 00:00:44,665 for digraph or we run it for graph, we'll 15 00:00:44,665 --> 00:00:45,915 get something different. 16 00:00:48,560 --> 00:00:52,550 Yes, I'm happy to save the source. 17 00:00:52,550 --> 00:00:59,733 Oh, and now the syntax. 18 00:01:03,200 --> 00:01:05,450 It was valid last time I looked, what have we done 19 00:01:05,450 --> 00:01:08,450 wrong here? 20 00:01:08,450 --> 00:01:10,900 Did I edit something badly? 21 00:01:10,900 --> 00:01:14,350 Sort of looks valid to me, doesn't it? 22 00:01:14,350 --> 00:01:15,600 We'll retype it. 23 00:01:18,900 --> 00:01:21,520 Yes, we'll save. 24 00:01:21,520 --> 00:01:22,960 Nope. 25 00:01:22,960 --> 00:01:26,450 Well, it's one of those days, isn't it? 26 00:01:26,450 --> 00:01:30,950 All right this is a good debugging exercise for us. 27 00:01:30,950 --> 00:01:32,875 Let's think about how we go and find this. 28 00:01:32,875 --> 00:01:35,710 I'm sure you've all had this sort of problem. 
29 00:01:35,710 --> 00:01:38,690 Well the first thing to do is I think I'll just comment out 30 00:01:38,690 --> 00:01:39,940 all of this. 31 00:01:46,270 --> 00:01:47,520 Let's see if I still get an error. 32 00:01:52,130 --> 00:01:52,640 I do. 33 00:01:52,640 --> 00:01:55,420 All right, well that suggests that, that wasn't the problem, 34 00:01:55,420 --> 00:01:56,700 so I can put it back. 35 00:01:59,590 --> 00:02:01,855 Now it shows a problem all the way down there. 36 00:02:05,280 --> 00:02:07,870 So let's see what's going on here. 37 00:02:07,870 --> 00:02:11,300 Or maybe, I'll just skip this, but doubtless I'll get in 38 00:02:11,300 --> 00:02:12,550 trouble if I do. 39 00:02:18,448 --> 00:02:22,440 So let's see. 40 00:02:22,440 --> 00:02:25,560 I must have just made a sloppy editing error somewhere this 41 00:02:25,560 --> 00:02:28,280 morning in commenting things out for the lecture. 42 00:02:36,710 --> 00:02:42,180 Well, I think what we're going to do, for the moment, is move 43 00:02:42,180 --> 00:02:45,260 on and hope it goes away. 44 00:02:45,260 --> 00:02:46,940 Now that seems silly. 45 00:02:46,940 --> 00:02:50,450 Sorry about this everybody. 46 00:02:50,450 --> 00:02:53,470 People should look with me, and someone may see it more 47 00:02:53,470 --> 00:02:55,432 quickly than I can. 48 00:02:55,432 --> 00:02:57,630 In fact, I'm hoping someone sees it more 49 00:02:57,630 --> 00:02:58,880 quickly than I can. 50 00:03:27,580 --> 00:03:30,761 We've now got a microphone, and the embarrassment of a 51 00:03:30,761 --> 00:03:32,870 code with syntax error, [UNINTELLIGIBLE]. 52 00:03:32,870 --> 00:03:34,130 MALE SPEAKER: You've got a sign. 53 00:03:34,130 --> 00:03:36,870 For some reason it wasn't on the schedule, so-- 54 00:03:36,870 --> 00:03:38,140 PROFESSOR: Well just because we've been 55 00:03:38,140 --> 00:03:39,790 teaching at 10 o'clock. 56 00:03:39,790 --> 00:03:40,203 MALE SPEAKER: Yeah, I know. 57 00:03:40,203 --> 00:03:41,400 Yeah. 
58 00:03:41,400 --> 00:03:42,160 [UNINTELLIGIBLE]. 59 00:03:42,160 --> 00:03:42,620 PROFESSOR: February. 60 00:03:42,620 --> 00:03:44,780 There's no reason to suspect that we would teach at 10 61 00:03:44,780 --> 00:03:45,430 o'clock today. 62 00:03:45,430 --> 00:03:45,906 MALE SPEAKER: I actually looked. 63 00:03:45,906 --> 00:03:47,156 I double checked. 64 00:03:53,050 --> 00:03:54,830 PROFESSOR: Well this is embarrassing, folks. 65 00:03:54,830 --> 00:03:56,980 And I wish one of you would bail me out by telling me 66 00:03:56,980 --> 00:03:58,345 where my syntax error is. 67 00:04:06,900 --> 00:04:09,250 Well random looks OK, right? 68 00:04:09,250 --> 00:04:10,500 Node looks OK. 69 00:04:19,230 --> 00:04:22,095 And it gets more complicated. 70 00:04:22,095 --> 00:04:23,275 MALE SPEAKER: Hey professor, can you turn on the 71 00:04:23,275 --> 00:04:24,310 transmitter? 72 00:04:24,310 --> 00:04:25,173 PROFESSOR: No. 73 00:04:25,173 --> 00:04:25,880 [LAUGHTER] 74 00:04:25,880 --> 00:04:27,130 PROFESSOR: OK, I'll turn on the transmitter. 75 00:04:29,760 --> 00:04:33,150 But I'm really focused on a different problem right now. 76 00:04:40,940 --> 00:04:42,190 Guys help! 77 00:04:44,120 --> 00:04:45,370 Where are my TAs? 78 00:04:53,510 --> 00:04:55,620 Why does it keep doing that to me? 79 00:04:55,620 --> 00:05:00,020 Maybe there's something just funny going on here. 80 00:05:00,020 --> 00:05:00,420 Pardon? 81 00:05:00,420 --> 00:05:02,560 AUDIENCE: Restart IDLE. 82 00:05:02,560 --> 00:05:03,300 PROFESSOR: Restarting IDLE? 83 00:05:03,300 --> 00:05:06,320 You think maybe that's the issue? 84 00:05:06,320 --> 00:05:07,570 We could try that. 85 00:05:32,920 --> 00:05:33,950 Ha! 86 00:05:33,950 --> 00:05:38,930 So it looks like IDLE was just in some ugly state. 87 00:05:38,930 --> 00:05:40,180 Let's hope. 88 00:05:56,880 --> 00:05:59,720 Yes, all right, so I didn't have a bug. 89 00:05:59,720 --> 00:06:01,900 It was just IDLE had a bug. 
90 00:06:01,900 --> 00:06:03,340 All right, phew. 91 00:06:03,340 --> 00:06:05,170 But we did squander ten minutes. 92 00:06:05,170 --> 00:06:06,200 Oh well. 93 00:06:06,200 --> 00:06:09,050 So we have the graph, and you can see when we look at a 94 00:06:09,050 --> 00:06:12,400 graph, we have an edge from 1 to 2, 95 00:06:12,400 --> 00:06:15,030 from 1 to 1, et cetera. 96 00:06:15,030 --> 00:06:16,890 And that's the digraph. 97 00:06:16,890 --> 00:06:20,600 When we look at the graph, we'll see that in fact we have 98 00:06:20,600 --> 00:06:23,110 a lot more edges, because everything goes in both 99 00:06:23,110 --> 00:06:25,110 directions. 100 00:06:25,110 --> 00:06:27,900 But that's what we expected-- 101 00:06:27,900 --> 00:06:29,320 nothing very interesting. 102 00:06:29,320 --> 00:06:32,760 All I want you to do is notice the difference here between 103 00:06:32,760 --> 00:06:35,730 graphs and digraphs. 104 00:06:35,730 --> 00:06:42,450 Now getting to the whole point, once we have this 105 00:06:42,450 --> 00:06:46,230 mechanism set up to think about graphs, we can now think 106 00:06:46,230 --> 00:06:49,010 about interesting problems and formulate 107 00:06:49,010 --> 00:06:50,420 them as graph problems. 108 00:06:50,420 --> 00:06:53,320 And I want to list a few of the interesting problems, and 109 00:06:53,320 --> 00:06:56,480 then we'll look at how to solve some of them. 110 00:06:56,480 --> 00:06:59,820 So probably the most common graph problem that people 111 00:06:59,820 --> 00:07:01,920 solve, is called the shortest path problem. 112 00:07:05,640 --> 00:07:08,590 We talked about this briefly last time. 113 00:07:08,590 --> 00:07:14,210 The notion here is for some pair of nodes, n1 and n2, we 114 00:07:14,210 --> 00:07:18,040 want to find the shortest sequence of edges that 115 00:07:18,040 --> 00:07:19,390 connects those two nodes. 116 00:07:22,160 --> 00:07:22,670 All right? 
117 00:07:22,670 --> 00:07:25,110 So that's very straightforward. 118 00:07:25,110 --> 00:07:38,390 Then there is the shortest weighted path, where instead 119 00:07:38,390 --> 00:07:42,530 of trying to find the shortest sequence of edges, we want to 120 00:07:42,530 --> 00:07:46,610 find the smallest total weight. 121 00:07:46,610 --> 00:07:50,690 So it may be that we traverse a few extra edges, but since 122 00:07:50,690 --> 00:07:52,700 they have smaller weights, we end up 123 00:07:52,700 --> 00:07:53,950 getting a shorter path. 124 00:07:56,550 --> 00:08:01,410 So we might take an indirect route to get the shortest path. 125 00:08:01,410 --> 00:08:06,100 This is probably the more common problem. 126 00:08:06,100 --> 00:08:09,360 So for example, that's the problem that Google Maps 127 00:08:09,360 --> 00:08:16,580 solves, when you ask it to give you driving directions. 128 00:08:16,580 --> 00:08:19,960 And you'll notice when you use something like Google Maps or 129 00:08:19,960 --> 00:08:26,990 MapQuest, you can tell it to minimize the time, in which 130 00:08:26,990 --> 00:08:31,320 case maybe it will route you on a freeway, where you can 131 00:08:31,320 --> 00:08:33,909 drive at 80 miles an hour, even though you drive a few 132 00:08:33,909 --> 00:08:35,390 extra miles. 133 00:08:35,390 --> 00:08:39,789 Or you can tell it to minimize the distance in which case it 134 00:08:39,789 --> 00:08:42,350 may take you on these crummy little surface roads where you 135 00:08:42,350 --> 00:08:46,180 have to drive slowly, but you'll cover fewer miles and 136 00:08:46,180 --> 00:08:48,410 use less gas. 137 00:08:48,410 --> 00:08:52,220 So you get to tell it which set of weights you care about, 138 00:08:52,220 --> 00:08:54,510 and then it finds you the shortest 139 00:08:54,510 --> 00:08:57,990 path, given those weights. 140 00:08:57,990 --> 00:09:00,430 We'll come back to this since we're going to look at some 141 00:09:00,430 --> 00:09:02,720 code to implement it. 
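The difference between fewest edges, fewest miles, and fewest minutes can be sketched on a tiny example. This is only an illustration, not the code used in the lecture: the road map, its node names, and its weights are all invented, and the brute-force enumeration of paths is chosen for clarity rather than speed.

```python
# Hypothetical road map: each edge is (neighbor, miles, minutes).
# All names and numbers here are invented for illustration.
roads = {
    "home":    [("surface", 3, 12), ("freeway", 6, 6)],
    "surface": [("work", 3, 12)],
    "freeway": [("ramp", 5, 5)],
    "ramp":    [("work", 1, 2)],
    "work":    [],
}

def all_paths(graph, start, end, path=()):
    """Yield every cycle-free path from start to end as a tuple of nodes."""
    path = path + (start,)
    if start == end:
        yield path
        return
    for neighbor, _miles, _minutes in graph[start]:
        if neighbor not in path:
            yield from all_paths(graph, neighbor, end, path)

def path_cost(graph, path, weight_index):
    """Sum one weight (0 = miles, 1 = minutes) along consecutive nodes."""
    total = 0
    for here, there in zip(path, path[1:]):
        for neighbor, *weights in graph[here]:
            if neighbor == there:
                total += weights[weight_index]
                break
    return total

paths = list(all_paths(roads, "home", "work"))
fewest_edges = min(paths, key=len)
fewest_miles = min(paths, key=lambda p: path_cost(roads, p, 0))
fewest_minutes = min(paths, key=lambda p: path_cost(roads, p, 1))

print(fewest_edges)    # ('home', 'surface', 'work')
print(fewest_miles)    # ('home', 'surface', 'work')
print(fewest_minutes)  # ('home', 'freeway', 'ramp', 'work')
```

In this made-up map the freeway route crosses more edges and covers more miles, yet it wins once the weights are minutes, which is the trade-off described above.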
142 00:09:02,720 --> 00:09:06,080 Another slightly more complicated problem to 143 00:09:06,080 --> 00:09:09,620 understand is finding cliques. 144 00:09:12,280 --> 00:09:17,460 So to find a clique, we're looking to find a set of 145 00:09:17,460 --> 00:09:34,800 nodes, such that there exists a path connecting 146 00:09:34,800 --> 00:09:38,210 each node in the set. 147 00:09:44,730 --> 00:09:50,180 So you can think of this as similar to, 148 00:09:50,180 --> 00:09:52,750 say a social clique-- 149 00:09:52,750 --> 00:09:55,100 who your friends are. 150 00:09:55,100 --> 00:09:59,780 It's a group of nodes or group of people that somehow can get 151 00:09:59,780 --> 00:10:01,970 to each other. 152 00:10:01,970 --> 00:10:04,590 It's not saying you can't get outside the clique. 153 00:10:04,590 --> 00:10:07,910 But it is guaranteeing that from any member of the clique, 154 00:10:07,910 --> 00:10:10,235 you can reach any other member of the clique. 155 00:10:18,330 --> 00:10:23,610 And so well, we'll look at some examples of where finding 156 00:10:23,610 --> 00:10:26,260 a clique is useful. 157 00:10:26,260 --> 00:10:36,420 And the final kind of problem I want to mention is the 158 00:10:36,420 --> 00:10:38,250 minimum cut problem-- 159 00:10:38,250 --> 00:10:41,870 often abbreviated mincut. 160 00:10:41,870 --> 00:10:50,300 So the problem here, is given a graph, and given two sets of 161 00:10:50,300 --> 00:11:11,650 nodes, you want to find the minimum number of edges such 162 00:11:11,650 --> 00:11:19,780 that if those edges are removed, the two sets are 163 00:11:19,780 --> 00:11:21,030 disconnected. 164 00:11:25,020 --> 00:11:25,490 i.e. 165 00:11:25,490 --> 00:11:28,360 you can't get from a member of one set to a member 166 00:11:28,360 --> 00:11:29,610 of the other set. 167 00:11:36,090 --> 00:11:39,100 This is often a question that gets asked. 
168 00:11:39,100 --> 00:11:43,760 For example, imagine that you were the government of Syria 169 00:11:43,760 --> 00:11:45,490 and you want to ensure that nobody can 170 00:11:45,490 --> 00:11:49,200 post a video on YouTube. 171 00:11:49,200 --> 00:11:52,520 You would take the set of nodes in Syria, and you would 172 00:11:52,520 --> 00:11:57,580 take the set of nodes probably outside Syria, and ask what's 173 00:11:57,580 --> 00:12:00,910 the minimum number of communication links you'd have 174 00:12:00,910 --> 00:12:05,230 to cut to ensure that you can't get from a node in Syria 175 00:12:05,230 --> 00:12:06,480 to a node outside Syria. 176 00:12:08,960 --> 00:12:14,330 People who do things like plan power lines worry about that. 177 00:12:14,330 --> 00:12:17,590 They want to say what's the minimum number of links such 178 00:12:17,590 --> 00:12:21,220 that if they're cut, you can't get any electricity from this 179 00:12:21,220 --> 00:12:24,330 power plant to say, this city. 180 00:12:24,330 --> 00:12:28,050 And they'll typically design their network with redundancy 181 00:12:28,050 --> 00:12:30,685 in it, so that the mincut is not too small. 182 00:12:33,550 --> 00:12:37,110 And so people frequently are worried about mincut problems, 183 00:12:37,110 --> 00:12:41,390 and trying to see what that is. 184 00:12:41,390 --> 00:12:46,470 All right, now let's look at a couple of examples, in 185 00:12:46,470 --> 00:12:48,520 slightly more detail. 186 00:12:48,520 --> 00:12:56,880 So what we see here is a pictorial representation of a 187 00:12:56,880 --> 00:12:59,950 weighted graph generated by the Centers for Disease 188 00:12:59,950 --> 00:13:05,740 Control, CDC, in Atlanta in 2003 when they were studying 189 00:13:05,740 --> 00:13:09,790 an outbreak of tuberculosis in the United States-- 190 00:13:09,790 --> 00:13:13,720 a virulent and bad infectious disease. 
191 00:13:13,720 --> 00:13:18,380 Each node, and you can see these little dots are the 192 00:13:18,380 --> 00:13:24,480 nodes, represents a person. 193 00:13:24,480 --> 00:13:28,940 And each node is labeled by a color, indicating whether the 194 00:13:28,940 --> 00:13:33,220 person has active tuberculosis, has tested 195 00:13:33,220 --> 00:13:38,340 positive for exposure, but doesn't have the disease, or 196 00:13:38,340 --> 00:13:43,530 tested negative for exposure, or not been tested. 197 00:13:43,530 --> 00:13:47,700 So you'll remember when we looked last time at class 198 00:13:47,700 --> 00:13:51,110 node, and asked why did I bother creating a class for 199 00:13:51,110 --> 00:13:54,260 something so simple, it was because I said well maybe we 200 00:13:54,260 --> 00:13:57,090 would add extra properties to a node. 201 00:13:57,090 --> 00:14:01,670 So now in some sense these colors would be easy to add. 202 00:14:01,670 --> 00:14:04,690 So I could add to class node-- 203 00:14:04,690 --> 00:14:06,930 well I could attribute color, and call it 204 00:14:06,930 --> 00:14:08,560 red or blue or green-- 205 00:14:08,560 --> 00:14:12,640 or more likely an attribute saying TB state, which would 206 00:14:12,640 --> 00:14:16,300 indicate active not active, et cetera. 207 00:14:16,300 --> 00:14:22,430 The edges, which you can see here, represent connections 208 00:14:22,430 --> 00:14:24,905 among pairs of people. 209 00:14:29,400 --> 00:14:31,830 What I didn't bother, you can't see on these pictures, 210 00:14:31,830 --> 00:14:34,980 is the edges are actually weighted. 211 00:14:34,980 --> 00:14:37,750 And the weights there are about how 212 00:14:37,750 --> 00:14:40,320 closely people are connected. 
213 00:14:40,320 --> 00:14:43,240 And there are really only two weights I think they used: 214 00:14:43,240 --> 00:14:47,790 close, someone who say lives in your house or works in the 215 00:14:47,790 --> 00:14:51,540 same office, or casual, a neighbor you might have 216 00:14:51,540 --> 00:14:54,210 encountered, but you wouldn't expect to necessarily 217 00:14:54,210 --> 00:14:57,110 see them every day. 218 00:14:57,110 --> 00:15:00,940 So I've taken a fairly complicated set of information 219 00:15:00,940 --> 00:15:05,290 and represented it as a graph. 220 00:15:05,290 --> 00:15:07,890 Now what are some of the interesting graph theoretic 221 00:15:07,890 --> 00:15:12,280 questions we might proceed to ask about this? 222 00:15:12,280 --> 00:15:17,050 So an important question they typically ask when diseases 223 00:15:17,050 --> 00:15:23,060 break out unexpectedly is, is there an index patient? 224 00:15:23,060 --> 00:15:26,570 The index patient is the patient who brought the 225 00:15:26,570 --> 00:15:29,250 disease into the community-- 226 00:15:29,250 --> 00:15:31,970 so somebody who visited some country, picked up 227 00:15:31,970 --> 00:15:36,130 tuberculosis, flew back to their neighborhood in the US 228 00:15:36,130 --> 00:15:38,470 and started spreading it. 229 00:15:41,160 --> 00:15:44,750 How would we formulate that as a graph question? 230 00:15:44,750 --> 00:15:48,020 Again, quite simply. 231 00:15:48,020 --> 00:16:05,630 We would say does there exist a node such that node has TB, 232 00:16:05,630 --> 00:16:11,230 or maybe not even that, no, let's simplify it. 233 00:16:11,230 --> 00:16:13,660 You might say, "or tested positive" because maybe you 234 00:16:13,660 --> 00:16:16,940 can communicate it without having it-- 235 00:16:16,940 --> 00:16:30,840 has TB and is connected to every node with TB. 236 00:16:35,730 --> 00:16:37,350 Now this doesn't guarantee that the 237 00:16:37,350 --> 00:16:39,600 patient is an index patient. 
238 00:16:39,600 --> 00:16:43,410 But if there is no such patient, no such node, then 239 00:16:43,410 --> 00:16:46,380 you know that there's not a single source of this disease 240 00:16:46,380 --> 00:16:47,630 in the community. 241 00:16:51,460 --> 00:16:56,180 How would we change the graph to model it in a more detailed 242 00:16:56,180 --> 00:16:59,640 way, and remember this is all about modeling, so that we 243 00:16:59,640 --> 00:17:03,995 could ask the question more precisely? 244 00:17:06,589 --> 00:17:11,240 Well we'd have to change to a more complex coloring scheme, 245 00:17:11,240 --> 00:17:15,400 if you will, in which we'd include the date of when 246 00:17:15,400 --> 00:17:20,440 somebody acquired the disease, or tested positive. 247 00:17:20,440 --> 00:17:22,990 And then we could ask those kinds of questions in a little 248 00:17:22,990 --> 00:17:24,700 bit more detail. 249 00:17:24,700 --> 00:17:29,050 But again once we've built the model, we can then go and ask 250 00:17:29,050 --> 00:17:32,080 a lot of interesting questions. 251 00:17:32,080 --> 00:17:34,370 By the way the answer to this question, for 252 00:17:34,370 --> 00:17:36,830 this graph, is almost. 253 00:17:42,390 --> 00:17:47,920 There is an index patient that's connected to every node 254 00:17:47,920 --> 00:17:51,285 in the graph, except for the nodes in this black circle. 255 00:17:53,980 --> 00:18:00,780 They are not connected to any index patient. 256 00:18:00,780 --> 00:18:04,690 So the CDC actually did that analysis, and they reached 257 00:18:04,690 --> 00:18:07,620 that conclusion that there didn't seem to be. 258 00:18:07,620 --> 00:18:11,550 And then later, it came to light, in fact, that this 259 00:18:11,550 --> 00:18:13,640 particular graph is missing an edge. 260 00:18:16,210 --> 00:18:19,250 Somebody had moved from neighborhood A to neighborhood 261 00:18:19,250 --> 00:18:22,670 B, and they had not kept track of that. 
262 00:18:22,670 --> 00:18:24,720 And if they had, they would have discovered there was a 263 00:18:24,720 --> 00:18:27,140 link that's missing-- an edge that's 264 00:18:27,140 --> 00:18:28,830 missing from this graph-- 265 00:18:28,830 --> 00:18:31,000 which in fact would've connected everybody to the 266 00:18:31,000 --> 00:18:34,230 index patient. 267 00:18:34,230 --> 00:18:35,380 It was an interesting question. 268 00:18:35,380 --> 00:18:38,600 They only found that, because they were puzzled about this 269 00:18:38,600 --> 00:18:41,270 tiny little black circle out here, and started 270 00:18:41,270 --> 00:18:44,290 investigating all the people in the black circle, and 271 00:18:44,290 --> 00:18:47,260 discovered that one of them had moved from 272 00:18:47,260 --> 00:18:50,350 one place to another. 273 00:18:50,350 --> 00:18:52,570 What's another question you might ask once you've built 274 00:18:52,570 --> 00:18:54,220 this model? 275 00:18:54,220 --> 00:18:58,230 Well suppose this is the current state of the world, 276 00:18:58,230 --> 00:19:03,340 and I want to reduce the spread of the disease, by 277 00:19:03,340 --> 00:19:06,950 vaccinating uninfected people so that they don't contract 278 00:19:06,950 --> 00:19:09,390 tuberculosis. 279 00:19:09,390 --> 00:19:13,780 But I have a minimum, it's expensive to do this, I only 280 00:19:13,780 --> 00:19:15,280 have so much vaccine. 281 00:19:15,280 --> 00:19:17,280 Who should get it? 282 00:19:17,280 --> 00:19:21,460 What's the graph theory problem that I would solve to 283 00:19:21,460 --> 00:19:24,540 address the question of what's the best way to allocate my 284 00:19:24,540 --> 00:19:25,790 limited supply of vaccine? 285 00:19:29,670 --> 00:19:30,820 Exactly. 286 00:19:30,820 --> 00:19:34,150 I, by the way, have much better candy now. 287 00:19:34,150 --> 00:19:39,850 So I think that's where the minimum cut came from. 288 00:19:39,850 --> 00:19:40,820 Well, all right. 
289 00:19:40,820 --> 00:19:42,650 It's better for eating. 290 00:19:42,650 --> 00:19:43,900 It's just worse for throwing. 291 00:19:47,030 --> 00:19:48,740 That's easier to throw. 292 00:19:48,740 --> 00:19:50,920 All right. 293 00:19:50,920 --> 00:19:53,610 It's the minimum cut problem. 294 00:19:53,610 --> 00:19:59,610 I take the people who are already infected, view them as 295 00:19:59,610 --> 00:20:01,450 one set of nodes. 296 00:20:01,450 --> 00:20:04,040 I take the people who are not infected, and view them as 297 00:20:04,040 --> 00:20:09,400 another set of nodes, find the edges that I need to cut to 298 00:20:09,400 --> 00:20:12,400 separate them, and then vaccinate somebody on one 299 00:20:12,400 --> 00:20:16,550 side of the edge, so that they don't contract the disease. 300 00:20:16,550 --> 00:20:21,200 So again a nice, easily formalized, problem. 301 00:20:21,200 --> 00:20:24,590 All right, so that's an example. 302 00:20:24,590 --> 00:20:27,040 Let's look at another example. 303 00:20:27,040 --> 00:20:30,730 Let's think about the shortest path problem here. 304 00:20:30,730 --> 00:20:34,120 And we'll do that by thinking about social networks. 305 00:20:34,120 --> 00:20:38,580 So I suspect that at least a few of you have used Facebook, 306 00:20:38,580 --> 00:20:41,730 and you have friends-- 307 00:20:41,730 --> 00:20:43,070 some of you more than others. 308 00:20:46,760 --> 00:20:47,810 I see people laughing. 309 00:20:47,810 --> 00:20:50,700 This is someone who probably has two friends, and is sad. 310 00:20:50,700 --> 00:20:55,030 I don't know, or 1,000 friends and is happy. 311 00:20:55,030 --> 00:20:55,630 Who knows-- 312 00:20:55,630 --> 00:20:56,880 I don't want to know please. 313 00:20:59,120 --> 00:21:00,240 And I'm not going to tell you how many 314 00:21:00,240 --> 00:21:03,250 friends I have either. 
315 00:21:03,250 --> 00:21:09,790 But you might ask the question, suppose you wanted 316 00:21:09,790 --> 00:21:11,750 to reach Donald Trump -- 317 00:21:11,750 --> 00:21:14,630 erstwhile Republican, vice presidential candidate, or 318 00:21:14,630 --> 00:21:16,510 presidential candidate. 319 00:21:16,510 --> 00:21:20,170 Say is there a connection from you to Donald Trump? 320 00:21:20,170 --> 00:21:22,140 Do you have a friend, who has a friend, who has a friend, 321 00:21:22,140 --> 00:21:24,150 who is a friend with Donald Trump? 322 00:21:24,150 --> 00:21:27,480 Or for Barack Obama, or anyone else you'd ask. 323 00:21:27,480 --> 00:21:30,600 Well what's the shortest path? 324 00:21:30,600 --> 00:21:34,790 How many friends do you have to go through? 325 00:21:34,790 --> 00:21:39,250 This is what's called the six degrees of separation problem. 326 00:21:39,250 --> 00:21:43,840 In the 1990s, the playwright John Guare published a play 327 00:21:43,840 --> 00:21:47,440 called Six Degrees of Separation, under the slightly 328 00:21:47,440 --> 00:21:50,780 dubious premise, that everybody in the world was 329 00:21:50,780 --> 00:21:52,580 connected to everybody else in the world 330 00:21:52,580 --> 00:21:55,120 with at most six hops. 331 00:21:55,120 --> 00:21:57,070 If you took all the people you knew, all the people they 332 00:21:57,070 --> 00:22:00,500 knew, et cetera, you could reach any person in the world 333 00:22:00,500 --> 00:22:02,100 in six phone calls-- 334 00:22:02,100 --> 00:22:04,040 say any person who has a phone. 335 00:22:04,040 --> 00:22:08,630 I don't know whether that's true, but this is the whole 336 00:22:08,630 --> 00:22:11,910 notion of a social network. 337 00:22:11,910 --> 00:22:14,880 So if we wanted to look at that in Facebook, we could 338 00:22:14,880 --> 00:22:18,700 either assume that the friend relation is symmetric-- 339 00:22:18,700 --> 00:22:22,260 if I'm your friend, you're my friend, which it is. 
340 00:22:22,260 --> 00:22:24,690 Or you could imagine a different model, in which it's 341 00:22:24,690 --> 00:22:26,090 asymmetric. 342 00:22:26,090 --> 00:22:28,670 If it's symmetric you have a graph. 343 00:22:28,670 --> 00:22:32,590 If it's asymmetric you have a directed graph. 344 00:22:32,590 --> 00:22:34,950 And then you just ask the question. 345 00:22:34,950 --> 00:22:37,830 What's the shortest path from you to 346 00:22:37,830 --> 00:22:40,150 whomever you care about? 347 00:22:40,150 --> 00:22:43,280 And you get that. 348 00:22:43,280 --> 00:22:46,030 You could imagine that Facebook already knows the 349 00:22:46,030 --> 00:22:49,540 answer to that question, but just won't tell you. 350 00:22:49,540 --> 00:22:52,880 But they'll sell it to somebody who has enough money. 351 00:22:52,880 --> 00:22:58,370 All right, So how does Facebook solve this problem? 352 00:22:58,370 --> 00:23:01,480 They have a very simple piece of code, which we'll now look 353 00:23:01,480 --> 00:23:03,300 at which solves the shortest path. 354 00:23:10,310 --> 00:23:11,560 So let's go back. 355 00:23:19,000 --> 00:23:21,870 So here's a recursive 356 00:23:21,870 --> 00:23:26,640 implementation of shortest path. 357 00:23:26,640 --> 00:23:28,365 Comment this out while I'm in the neighborhood. 358 00:23:33,890 --> 00:23:39,800 It takes the graph, a start node, an end node, toPrint, 359 00:23:39,800 --> 00:23:42,530 and this extra argument called visited. 360 00:23:42,530 --> 00:23:45,200 We'll see why that gets used. 361 00:23:45,200 --> 00:23:48,970 And we'll think about the algorithm. 362 00:23:48,970 --> 00:23:52,990 This particular algorithm is what's called a depth first 363 00:23:52,990 --> 00:23:55,350 search algorithm. 364 00:23:55,350 --> 00:23:57,710 It's a recursive depth first search. 365 00:23:57,710 --> 00:24:13,185 We've seen these before, often abbreviated DFS. 
366 00:24:15,730 --> 00:24:20,110 So if you think about having a graph of a bunch of nodes 367 00:24:20,110 --> 00:24:30,140 connected to one another, just for fun we'll say it does 368 00:24:30,140 --> 00:24:33,230 something like this. 369 00:24:33,230 --> 00:24:37,040 What depth first search does is it starts at the source 370 00:24:37,040 --> 00:24:43,030 node for the shortest path, let's call it this one, it 371 00:24:43,030 --> 00:24:48,080 first visits one child, then visits all the children of 372 00:24:48,080 --> 00:24:51,250 those children. 373 00:24:51,250 --> 00:24:54,160 This one has no children. 374 00:24:54,160 --> 00:24:59,200 Visits this child, picks one of its children, visits all of 375 00:24:59,200 --> 00:25:01,970 its children-- let's say it had another one here-- 376 00:25:01,970 --> 00:25:06,330 and goes on until it's done. 377 00:25:06,330 --> 00:25:12,530 And then it backtracks, comes back and takes the next child. 378 00:25:12,530 --> 00:25:14,865 Then we have to be a little bit careful about the circle. 379 00:25:17,440 --> 00:25:28,060 So to summarize it, first thing we have to say is the 380 00:25:28,060 --> 00:25:41,060 recursion ends, when start equals end. 381 00:25:41,060 --> 00:25:47,290 That is to say I've called it and I've asked is there a path 382 00:25:47,290 --> 00:25:49,770 from A to A, and the answer is yes, there is. 383 00:25:49,770 --> 00:25:52,150 I'm already there. 384 00:25:52,150 --> 00:25:54,940 Now you could argue, and in some formulations the answer 385 00:25:54,940 --> 00:25:58,590 is not necessarily, you'd say there's only a path if there's 386 00:25:58,590 --> 00:26:03,370 an edge from A to A. But I've chosen to make the simpler 387 00:26:03,370 --> 00:26:05,440 assertion that if you want to get to A, and you're already 388 00:26:05,440 --> 00:26:07,070 there, you're done. 389 00:26:07,070 --> 00:26:10,360 Kind of seems reasonable. 
390 00:26:10,360 --> 00:26:29,060 So then the recursive part, starts by choosing one child 391 00:26:29,060 --> 00:26:33,080 of the node you're currently at. 392 00:26:33,080 --> 00:26:38,180 And it keeps doing that until either it reaches a node with 393 00:26:38,180 --> 00:26:47,280 no children, or it reaches the node you're trying to get to, 394 00:26:47,280 --> 00:26:51,910 or, and here's an important part, it reaches a node it's 395 00:26:51,910 --> 00:26:53,980 already seen. 396 00:26:53,980 --> 00:26:56,210 And that's what visited is about. 397 00:26:56,210 --> 00:27:00,100 Because I want to make sure that when I explore this 398 00:27:00,100 --> 00:27:02,950 graph, I don't go from here to here to here to here to here 399 00:27:02,950 --> 00:27:06,350 to here ad nauseam, because I'm stuck in 400 00:27:06,350 --> 00:27:07,600 what's called a cycle. 401 00:27:14,170 --> 00:27:15,490 You have to avoid the cycles. 402 00:27:22,250 --> 00:27:28,110 Once it's got to a node that has no children, if that's not 403 00:27:28,110 --> 00:27:32,990 the node it's trying to get to, it backtracks and takes 404 00:27:32,990 --> 00:27:38,885 the next child of the node it was at. 405 00:27:46,840 --> 00:27:52,200 And in that way, it systematically explores all 406 00:27:52,200 --> 00:27:58,800 possible paths, and along the way, it chooses the best one. 407 00:27:58,800 --> 00:28:00,390 So we can look at the code here. 408 00:28:03,030 --> 00:28:05,970 I've just commented out something we'll look at later 409 00:28:05,970 --> 00:28:08,000 just as we try and instrument it to see 410 00:28:08,000 --> 00:28:11,380 how fast it's working. 411 00:28:11,380 --> 00:28:14,700 I've got a debugging statement just to say whether I'm going 412 00:28:14,700 --> 00:28:16,865 to print what I've been asked to do, in 413 00:28:16,865 --> 00:28:18,830 case it's not working. 414 00:28:18,830 --> 00:28:21,000 And then the real work starts. 
415 00:28:21,000 --> 00:28:25,060 I get the original path is just the node we're starting 416 00:28:25,060 --> 00:28:27,620 at, if start is end, I stop. 417 00:28:30,450 --> 00:28:33,680 If I get to here, I say shortest equals None, I 418 00:28:33,680 --> 00:28:35,240 haven't found any paths yet. 419 00:28:35,240 --> 00:28:38,000 So there is no shortest path. 420 00:28:38,000 --> 00:28:44,350 And then for node in the children of start, if I 421 00:28:44,350 --> 00:28:45,820 haven't already visited the node-- 422 00:28:45,820 --> 00:28:48,170 this is to avoid cycles-- 423 00:28:48,170 --> 00:28:52,910 I create a visited list that contains whatever it used to 424 00:28:52,910 --> 00:28:56,090 contain plus the node. 425 00:28:56,090 --> 00:29:00,080 Notice that I'm creating a new list here, rather than 426 00:29:00,080 --> 00:29:02,340 mutating the old list. 427 00:29:02,340 --> 00:29:05,930 And that's because when I unravel my recursion, and back 428 00:29:05,930 --> 00:29:10,170 track to where I was, I don't want to think I visited 429 00:29:10,170 --> 00:29:13,040 something I haven't visited, right? 430 00:29:13,040 --> 00:29:16,900 If I had only one list, and I mutated each time, then as I 431 00:29:16,900 --> 00:29:19,950 go down the recursion and back up the recursion, I'm always 432 00:29:19,950 --> 00:29:22,880 dealing with the same list. 433 00:29:22,880 --> 00:29:26,550 By getting a new list, I'm ensuring that I don't have 434 00:29:26,550 --> 00:29:29,480 that problem. 435 00:29:29,480 --> 00:29:33,270 Then I say the new path is whatever the shortest path is, 436 00:29:33,270 --> 00:29:37,340 from the node I'm currently at to the desired end node. 437 00:29:37,340 --> 00:29:40,580 And I use the current set of visited nodes to indicate 438 00:29:40,580 --> 00:29:43,760 where I've already been at this part of the recursion. 439 00:29:47,550 --> 00:29:51,600 If the new path is None, well, I didn't find one, so I continue. 
440 00:29:51,600 --> 00:29:55,060 Otherwise, I found a path, and now I just want to check is it 441 00:29:55,060 --> 00:30:00,910 better, or worse, or the same, as the previous shortest path. 442 00:30:00,910 --> 00:30:03,120 And then I'm done. 443 00:30:03,120 --> 00:30:05,390 Very straightforward. 444 00:30:05,390 --> 00:30:08,690 The only really tricky part was making sure that I kept 445 00:30:08,690 --> 00:30:14,250 track of visited properly, and didn't get stuck in cycles. 446 00:30:14,250 --> 00:30:17,720 OK let's run it. 447 00:30:17,720 --> 00:30:19,880 So here's testTwo -- 448 00:30:19,880 --> 00:30:23,310 builds the same kind of graph we've built before. 449 00:30:23,310 --> 00:30:26,135 And then it tries to find the shortest path. 450 00:30:28,700 --> 00:30:33,790 And I'm going to do it for the same input, essentially, the 451 00:30:33,790 --> 00:30:37,400 same add edge operations, but once when it's a graph and 452 00:30:37,400 --> 00:30:38,650 once when it's a digraph. 453 00:30:45,410 --> 00:30:47,785 So you'll notice that it found two different answers. 454 00:30:52,050 --> 00:30:56,860 When it was a graph, it could get from 0 to 4 in 455 00:30:56,860 --> 00:30:58,125 essentially one hop. 456 00:31:02,250 --> 00:31:08,690 But when it was a digraph, it took longer. 457 00:31:08,690 --> 00:31:12,050 It had to go from 0 to 2 to 3 to 4. 458 00:31:12,050 --> 00:31:13,790 And that's not surprising, because the 459 00:31:13,790 --> 00:31:15,135 graph has more edges. 460 00:31:17,890 --> 00:31:21,950 And in fact, what we saw is that in the graph there was an 461 00:31:21,950 --> 00:31:25,020 edge from 4 to 0, but there was no such edge in the 462 00:31:25,020 --> 00:31:26,990 directed graph. 463 00:31:26,990 --> 00:31:28,510 So again you'll get, 464 00:31:28,510 --> 00:31:33,560 unsurprisingly, different answers-- 465 00:31:33,560 --> 00:31:36,440 but very straightforwardly. 466 00:31:36,440 --> 00:31:40,000 Now let's try it on a bigger problem.
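The depth-first search walked through above can be sketched roughly as follows. This is a minimal sketch, not the lecture's code: it assumes the graph is a plain dictionary mapping each node to a list of its children, and the names `shortest_path`, `undirected`, and the sample `digraph` are hypothetical stand-ins chosen for illustration.

```python
def shortest_path(graph, start, end, visited=None):
    """Return a shortest path from start to end as a list of nodes, or None.

    Assumption: graph is a dict mapping each node to a list of its children.
    """
    if visited is None:
        visited = [start]
    if start == end:
        return [start]
    shortest = None  # no path found yet
    for node in graph.get(start, []):
        if node in visited:
            continue  # already on the current path: skip it to avoid cycles
        # Build a NEW list instead of mutating the old one, so that after
        # backtracking the caller's list does not contain nodes that were
        # only visited further down the recursion.
        new_visited = visited + [node]
        new_path = shortest_path(graph, node, end, new_visited)
        if new_path is None:
            continue  # no path to end through this child
        if shortest is None or len(new_path) < len(shortest):
            shortest = new_path
    if shortest is None:
        return None
    return [start] + shortest


def undirected(graph):
    """Return a copy of graph with every edge also added in reverse."""
    result = {node: list(children) for node, children in graph.items()}
    for node, children in graph.items():
        for child in children:
            if node not in result.setdefault(child, []):
                result[child].append(node)
    return result


# Hypothetical small directed graph, including a self-loop and cycles.
digraph = {0: [1, 2], 1: [2, 1, 0], 2: [3], 3: [4, 5], 4: [0], 5: []}

print(shortest_path(digraph, 0, 4))              # directed search
print(shortest_path(undirected(digraph), 0, 4))  # same edges, both directions
```

Running the same search on the directed and the undirected version of one edge set shows why the two answers can differ: the undirected graph can use the edge between 4 and 0 in either direction.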
467 00:31:51,010 --> 00:31:53,930 So I've called this big test. 468 00:31:53,930 --> 00:31:56,270 And what this does, is rather than my sitting there and 469 00:31:56,270 --> 00:32:01,080 typing a bunch of add edge commands, it just generates 470 00:32:01,080 --> 00:32:03,830 edges at random. 471 00:32:03,830 --> 00:32:07,580 So I tell it whether I want it to be a graph or digraph, and 472 00:32:07,580 --> 00:32:10,870 then I give it the number of nodes I want, and 473 00:32:10,870 --> 00:32:11,760 the number of edges. 474 00:32:11,760 --> 00:32:14,570 And it just generates, at random, a graph in this case 475 00:32:14,570 --> 00:32:19,430 with 25 nodes and 200 edges. 476 00:32:19,430 --> 00:32:21,550 So let's see what happens here. 477 00:32:28,670 --> 00:32:35,630 So it's printed out the graph, and now we're 478 00:32:35,630 --> 00:32:38,650 waiting a little bit. 479 00:32:38,650 --> 00:32:41,350 It will eventually finish, there. 480 00:32:41,350 --> 00:32:43,550 I can get from 0 to 4. 481 00:32:43,550 --> 00:32:46,520 It turns out there's a short path for this random graph 482 00:32:46,520 --> 00:32:48,800 from 0 to 14 to 4. 483 00:32:48,800 --> 00:32:50,980 It's not so surprising that there's a short path. 484 00:32:50,980 --> 00:32:55,180 Why is it not surprising that there's a pretty short path? 485 00:32:55,180 --> 00:32:56,400 It had a lot of edges, right? 486 00:32:56,400 --> 00:32:59,140 I had 200 edges in my graph. 487 00:32:59,140 --> 00:33:01,125 So things are pretty densely connected. 488 00:33:03,640 --> 00:33:05,125 Why did it take so long? 489 00:33:08,540 --> 00:33:14,310 Well remember what it's doing is exploring all the possible 490 00:33:14,310 --> 00:33:17,560 paths from 0 to 4, in this case. 491 00:33:21,220 --> 00:33:23,530 This is very much like what we saw when we looked at the 492 00:33:23,530 --> 00:33:25,530 knapsack problem, right?
493 00:33:25,530 --> 00:33:29,100 Where, there when we looked at the recursive implementation, 494 00:33:29,100 --> 00:33:32,670 we saw that well all right, generating all possibilities, 495 00:33:32,670 --> 00:33:35,310 there were an exponential number of possibilities there 496 00:33:35,310 --> 00:33:37,720 in the number of items. 497 00:33:37,720 --> 00:33:42,440 Here, depending upon the number of nodes and the number 498 00:33:42,440 --> 00:33:46,330 of edges, it's also large, and in fact, exponential. 499 00:33:50,390 --> 00:33:56,380 We could explore a lot of different paths, but let's see 500 00:33:56,380 --> 00:34:00,610 what's going on when we explore those. 501 00:34:00,610 --> 00:34:04,500 So what I'm going to do now, is go back 502 00:34:04,500 --> 00:34:05,760 to our small example. 503 00:34:08,380 --> 00:34:13,239 We'll run testTwo. That was the small one we looked at. 504 00:34:13,239 --> 00:34:18,090 But I'm going to set toPrint to true, and if you remember, 505 00:34:18,090 --> 00:34:22,920 what that code did is it told us what each recursive 506 00:34:22,920 --> 00:34:26,274 call was, what the start node was and what the end node was. 507 00:34:31,280 --> 00:34:34,429 So it found the same shortest path. 508 00:34:34,429 --> 00:34:37,090 That's a good thing, 0 to 4. 509 00:34:37,090 --> 00:34:39,989 But how did it do that? 510 00:34:39,989 --> 00:34:45,730 Well it first got called with the question of starting at 0 511 00:34:45,730 --> 00:34:48,000 find me a path to 4. 512 00:34:48,000 --> 00:34:51,300 It visited the first child of 0, which was 1. 513 00:34:51,300 --> 00:34:54,239 It said, all right see if you can find a path from 1 to 4. 514 00:34:56,940 --> 00:35:01,120 It then backtracked and sort of asked the same question, 515 00:35:01,120 --> 00:35:03,750 can I get from 2 to 4? 516 00:35:03,750 --> 00:35:04,720 From 0 to 4?
517 00:35:04,720 --> 00:35:07,130 And then it said well I can get from 0 to 2, let me try 2 518 00:35:07,130 --> 00:35:13,370 to 4, 3 to 4, 4 to 4, that's good. 519 00:35:13,370 --> 00:35:17,040 Get to 5 to 4, and then it tried to find 4 to 4 again. 520 00:35:17,040 --> 00:35:19,730 Here it tried to find 2 to 4 again. 521 00:35:19,730 --> 00:35:25,420 So what you can see, is as I do that backtracking, I'm 522 00:35:25,420 --> 00:35:30,750 solving the same problem multiple times. 523 00:35:30,750 --> 00:35:32,760 Why am I doing that? 524 00:35:32,760 --> 00:35:38,920 Because there may be multiple ways to get to the same node. 525 00:35:38,920 --> 00:35:47,950 So if, for example, I looked at this graph, what we would 526 00:35:47,950 --> 00:35:52,882 see is I would try and let's say I want to get to here, 527 00:35:52,882 --> 00:35:57,050 just for the sake of argument, I'd first say can I get to 528 00:35:57,050 --> 00:35:58,420 here from here. 529 00:35:58,420 --> 00:36:02,520 I'd try this, then I'd solve here to here. 530 00:36:02,520 --> 00:36:05,830 And I'd do that. 531 00:36:05,830 --> 00:36:10,290 I'd also go from here to here to here, and then for the 532 00:36:10,290 --> 00:36:14,670 second time, I'd try and solve the problem here to here. 533 00:36:14,670 --> 00:36:16,860 Now here since it's only one connection, 534 00:36:16,860 --> 00:36:18,620 it's a short thing. 535 00:36:18,620 --> 00:36:22,350 But you can see if I have multiple ways to get to the 536 00:36:22,350 --> 00:36:26,520 same intermediate node, each time I get there I'm going to 537 00:36:26,520 --> 00:36:30,070 solve a problem I have already solved-- 538 00:36:30,070 --> 00:36:32,570 how to get from that intermediate node to the final 539 00:36:32,570 --> 00:36:34,490 destination. 540 00:36:34,490 --> 00:36:37,310 So I'm doing work I've already done before. 541 00:36:41,390 --> 00:36:45,740 This is obviously troublesome. 
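To see the repeated work described above, one option is to record every recursive call and count how often each (start, end) question is asked. The sketch below is illustrative only: it assumes a dictionary-based graph, and `traced_search` and the sample `digraph` are hypothetical names, not the lecture's code.

```python
from collections import Counter


def traced_search(graph, start, end, visited, calls):
    """Depth-first shortest path that tallies each (start, end) call."""
    calls[(start, end)] += 1
    if start == end:
        return [start]
    shortest = None
    for node in graph[start]:
        if node not in visited:
            # visited + [node] creates a new list for the deeper call.
            new_path = traced_search(graph, node, end, visited + [node], calls)
            if new_path is not None and (
                shortest is None or len(new_path) < len(shortest)
            ):
                shortest = new_path
    return None if shortest is None else [start] + shortest


# Hypothetical small directed graph: node 2 is reachable both via 1 and
# directly from 0, so everything below node 2 gets solved twice.
digraph = {0: [1, 2], 1: [2, 1, 0], 2: [3], 3: [4, 5], 4: [0], 5: []}

calls = Counter()
path = traced_search(digraph, 0, 4, [0], calls)
repeated = {pair: n for pair, n in calls.items() if n > 1}

print("path:", path)
print("sub-problems solved more than once:", repeated)
```

Every entry in `repeated` is a sub-problem whose answer was recomputed, which is exactly the work a lookup table could save.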
542 00:36:45,740 --> 00:36:47,720 Nobody likes to solve a problem they've 543 00:36:47,720 --> 00:36:49,290 already solved before. 544 00:36:49,290 --> 00:36:50,990 So what do you think the solution is? 545 00:36:53,720 --> 00:36:55,500 How would you solve this sort of thing yourself? 546 00:36:58,460 --> 00:36:59,710 What would you do? 547 00:37:03,940 --> 00:37:06,320 Well what you'd-- yeah, thank you. 548 00:37:06,320 --> 00:37:07,240 This guy is hungry. 549 00:37:07,240 --> 00:37:08,017 Go ahead. 550 00:37:08,017 --> 00:37:11,052 AUDIENCE: Some way of storing information that you've 551 00:37:11,052 --> 00:37:12,220 already looked at. 552 00:37:12,220 --> 00:37:14,100 PROFESSOR: Exactly. 553 00:37:14,100 --> 00:37:23,330 What you try and do, is remember what you did before, 554 00:37:23,330 --> 00:37:26,320 and just look it up. 555 00:37:26,320 --> 00:37:30,730 This is a very common technique. 556 00:37:30,730 --> 00:37:32,790 It's called memoization. 557 00:37:41,350 --> 00:37:45,280 We use this to solve a lot of problems where you remember 558 00:37:45,280 --> 00:37:48,500 what the answer was, and rather than recalculating it, 559 00:37:48,500 --> 00:37:51,590 you just look it up. 560 00:37:51,590 --> 00:37:55,780 And that can, of course, be much faster. 561 00:37:55,780 --> 00:37:58,140 So it's a fancy way to say we're going 562 00:37:58,140 --> 00:37:59,390 to use a table look-up. 563 00:38:03,840 --> 00:38:09,750 This concept of memoization is at the heart of a very 564 00:38:09,750 --> 00:38:13,040 important programming technique called dynamic 565 00:38:13,040 --> 00:38:14,290 programming. 566 00:38:22,090 --> 00:38:24,870 In the algorithms class that's taught in this room 567 00:38:24,870 --> 00:38:28,080 immediately following this class, they have spent at 568 00:38:28,080 --> 00:38:32,120 least four lectures on the topic of dynamic programming. 
569 00:38:32,120 --> 00:38:34,910 But since you guys are much smarter than those guys taking 570 00:38:34,910 --> 00:38:38,530 that class, we're going to do it in about 20 minutes, in 571 00:38:38,530 --> 00:38:43,140 today and a little bit in the next lecture. 572 00:38:43,140 --> 00:38:46,935 All right, so let's look at an example. 573 00:38:50,490 --> 00:38:51,660 We'll look at a solution. 574 00:38:51,660 --> 00:39:02,420 So I've taken the recursive implementation we had before, 575 00:39:02,420 --> 00:39:08,880 and rewritten it just a little bit, to call dp, dynamic 576 00:39:08,880 --> 00:39:13,140 programming shortest path. 577 00:39:13,140 --> 00:39:17,500 And the most important thing to notice is I've given yet 578 00:39:17,500 --> 00:39:22,590 another argument to the function, and that's the memo, 579 00:39:22,590 --> 00:39:25,360 which is initially an empty dictionary. 580 00:39:28,150 --> 00:39:33,440 The rest of the algorithm proceeds as before, except 581 00:39:33,440 --> 00:39:41,330 what happens here is when I want to get from a path, the 582 00:39:41,330 --> 00:39:46,170 first question I ask is I say new path is equal to the memo 583 00:39:46,170 --> 00:39:48,540 of node to end. 584 00:39:48,540 --> 00:39:52,320 So when I get to one of these interior nodes, and I want to 585 00:39:52,320 --> 00:39:55,760 say what's the shortest path from here to here, the first 586 00:39:55,760 --> 00:39:59,980 question I ask is do I already know the answer? 587 00:39:59,980 --> 00:40:02,160 Is it already in my memo? 588 00:40:02,160 --> 00:40:08,130 If so, I just look it up, and I'm done. 589 00:40:08,130 --> 00:40:11,220 I found it, and I proceed as before. 590 00:40:11,220 --> 00:40:17,370 If it's not in the memo, well this look up will fail, and 591 00:40:17,370 --> 00:40:22,970 I'll enter the except clause, and I'll make a call again. 
592 00:40:22,970 --> 00:40:26,290 So this is a very conventional way of using try/except as a 593 00:40:26,290 --> 00:40:28,010 control structure. 594 00:40:28,010 --> 00:40:31,150 Failing to find in the memo is not an error, it just means I 595 00:40:31,150 --> 00:40:33,980 haven't yet stored it away. 596 00:40:33,980 --> 00:40:36,840 And as I go, I'll build up the memo, and then I'm done. 597 00:40:41,470 --> 00:40:45,660 So it's very simple. 598 00:40:45,660 --> 00:40:49,380 So I should ask the question. 599 00:40:49,380 --> 00:40:52,010 Does anyone need me to explain this again, or does it make 600 00:40:52,010 --> 00:40:53,720 sense what we're doing here with a memo? 601 00:40:56,990 --> 00:40:59,450 OK, I'm assuming it makes sense. 602 00:40:59,450 --> 00:41:01,370 Let's test it. 603 00:41:01,370 --> 00:41:04,330 And we'll first do a very simple test. 604 00:41:04,330 --> 00:41:06,560 We're just going to use the same little 605 00:41:06,560 --> 00:41:09,620 graph we used before. 606 00:41:09,620 --> 00:41:13,330 And I'm going to run shortest path, and dp_shortest path, 607 00:41:13,330 --> 00:41:16,110 and at least confirm that for one search I 608 00:41:16,110 --> 00:41:18,142 get the same answer. 609 00:41:18,142 --> 00:41:21,380 It's just smoke testing it to make sure that it's not a 610 00:41:21,380 --> 00:41:22,630 complete disaster. 611 00:41:25,380 --> 00:41:27,010 And we do. 612 00:41:27,010 --> 00:41:30,190 We get 0234, 0234. 613 00:41:30,190 --> 00:41:33,210 So at least for one thing, it's the same thing. 614 00:41:36,830 --> 00:41:39,500 Let's see about performance, because that's really what we 615 00:41:39,500 --> 00:41:40,750 got interested in. 616 00:41:43,090 --> 00:41:47,640 So we'll go back to our big test.
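The memo idea, with try/except used as the lookup, might look something like the sketch below, shown next to a plain search so the number of recursive calls can be compared. Several things here are assumptions rather than the lecture's code: the graph is a dictionary, the names `plain_shortest_path`, `dp_shortest_path`, and `layered_dag` are hypothetical, and the test graph is deliberately acyclic. That last assumption lets the sketch drop the visited list; on a graph with cycles, an answer cached for one visited list may not be valid for another, so the cache would need more care.

```python
def plain_shortest_path(graph, start, end, counter):
    """Depth-first shortest path on an acyclic graph; counter[0] counts calls."""
    counter[0] += 1
    if start == end:
        return [start]
    shortest = None
    for node in graph[start]:
        new_path = plain_shortest_path(graph, node, end, counter)
        if new_path is not None and (
            shortest is None or len(new_path) < len(shortest)
        ):
            shortest = new_path
    return None if shortest is None else [start] + shortest


def dp_shortest_path(graph, start, end, memo, counter):
    """Same search, but answers are stored in memo keyed by (node, end)."""
    counter[0] += 1
    if start == end:
        return [start]
    shortest = None
    for node in graph[start]:
        try:
            new_path = memo[node, end]  # do I already know the answer?
        except KeyError:
            # Not an error: it just has not been stored away yet.
            new_path = dp_shortest_path(graph, node, end, memo, counter)
            memo[node, end] = new_path
        if new_path is not None and (
            shortest is None or len(new_path) < len(shortest)
        ):
            shortest = new_path
    return None if shortest is None else [start] + shortest


def layered_dag(layers, width):
    """Acyclic graph: 's' -> layer 0 -> ... -> last layer -> 't'.

    Every node links to all nodes of the next layer, so the number of
    distinct paths from 's' to 't' is width ** layers.
    """
    graph = {"s": [(0, i) for i in range(width)], "t": []}
    for layer in range(layers):
        for i in range(width):
            if layer + 1 < layers:
                graph[(layer, i)] = [(layer + 1, j) for j in range(width)]
            else:
                graph[(layer, i)] = ["t"]
    return graph


graph = layered_dag(layers=10, width=2)
plain_calls, dp_calls = [0], [0]
plain_path = plain_shortest_path(graph, "s", "t", plain_calls)
dp_path = dp_shortest_path(graph, "s", "t", {}, dp_calls)

print("same path:", plain_path == dp_path)
print("plain calls:", plain_calls[0], "memoized calls:", dp_calls[0])
```

Catching only `KeyError` keeps the lookup failure separate from genuine errors, and passing a fresh `{}` for each search avoids sharing a memo between unrelated graphs.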
617 00:41:50,220 --> 00:41:58,720 And let's go back and for both of these, I'm going to 618 00:41:58,720 --> 00:42:04,980 uncomment, tracking this global variable, just keeping 619 00:42:04,980 --> 00:42:09,630 track of the number of calls, and we'll see whether we get a 620 00:42:09,630 --> 00:42:14,970 substantially different amount of recursion, in 621 00:42:14,970 --> 00:42:16,390 one versus the other. 622 00:42:38,260 --> 00:42:40,150 So it's built some random graph again. 623 00:42:44,620 --> 00:42:48,170 This is the non-dynamic programming one, which as we 624 00:42:48,170 --> 00:42:50,390 recall, takes a bit longer. 625 00:42:54,090 --> 00:42:56,960 I probably should have said-- all right, so it's pretty big 626 00:42:56,960 --> 00:42:58,310 difference. 627 00:42:58,310 --> 00:43:01,300 They found the same path, 0, 2, 3, 4. 628 00:43:01,300 --> 00:43:05,630 But you'll notice the straightforward depth first 629 00:43:05,630 --> 00:43:12,380 search took over 800,000 recursive calls, whereas the 630 00:43:12,380 --> 00:43:17,380 dynamic programming one took only an order of 2000-- 631 00:43:17,380 --> 00:43:18,630 a huge difference. 632 00:43:23,180 --> 00:43:26,340 If I ran it again, I might see a slightly smaller difference. 633 00:43:26,340 --> 00:43:28,070 I might even see a considerably larger 634 00:43:28,070 --> 00:43:29,410 difference. 635 00:43:29,410 --> 00:43:33,230 I've run this on some examples where the recursive search 636 00:43:33,230 --> 00:43:37,450 depth first took a million, and got through the dynamic 637 00:43:37,450 --> 00:43:39,960 programming in 50, 60. 638 00:43:39,960 --> 00:43:44,480 But what you can see is there's a huge improvement in 639 00:43:44,480 --> 00:43:47,370 going from one to the other. 640 00:43:47,370 --> 00:43:50,960 Dynamic programming was invented in the 1950s by 641 00:43:50,960 --> 00:43:52,210 someone named Richard Bellman.
642 00:43:55,290 --> 00:43:58,860 Many a student has wasted a lot of time trying to 643 00:43:58,860 --> 00:44:02,080 understand why it's called dynamic programming. 644 00:44:02,080 --> 00:44:05,630 And you or I could invent lots of theories. 645 00:44:05,630 --> 00:44:08,850 Relatively recently, I found out why it was called dynamic 646 00:44:08,850 --> 00:44:13,620 programming, and this is a quote from Bellman. 647 00:44:13,620 --> 00:44:15,860 "It was an attempt to hide what I was doing from 648 00:44:15,860 --> 00:44:18,160 government sponsors. 649 00:44:18,160 --> 00:44:20,640 The fact that I was really doing mathematics was 650 00:44:20,640 --> 00:44:24,240 something not even a congressman could object to." 651 00:44:24,240 --> 00:44:27,960 So he was doing this thing that was pretty evil, which 652 00:44:27,960 --> 00:44:31,510 was mathematics, which is what he thought this was-- the math 653 00:44:31,510 --> 00:44:33,330 of dynamic programming. 654 00:44:33,330 --> 00:44:35,600 And he just didn't want to admit it. 655 00:44:35,600 --> 00:44:38,340 So he made up a name out of nothing, and it fooled the 656 00:44:38,340 --> 00:44:40,550 government, and he got to do it. 657 00:44:40,550 --> 00:44:43,430 Now why do I teach you dynamic programming? 658 00:44:43,430 --> 00:44:45,690 And we're going to talk a little bit more about it, the 659 00:44:45,690 --> 00:44:47,730 next lecture. 660 00:44:47,730 --> 00:44:52,380 It's because it is one of the most important algorithms that 661 00:44:52,380 --> 00:44:54,980 we know today. 662 00:44:54,980 --> 00:45:01,330 It's used over and over again to provide practical, 663 00:45:01,330 --> 00:45:07,850 efficient solutions to optimization problems that on 664 00:45:07,850 --> 00:45:11,470 their surface appear intractable. 665 00:45:11,470 --> 00:45:13,610 They appear exponential. 666 00:45:13,610 --> 00:45:16,120 It says there is no good way to solve it. 
667 00:45:16,120 --> 00:45:21,830 In fact, if it has certain kinds of properties, it will 668 00:45:21,830 --> 00:45:24,620 always be amenable to solutions by dynamic 669 00:45:24,620 --> 00:45:28,360 programming, which will most of the time-- and I'll come 670 00:45:28,360 --> 00:45:30,390 back to the most of the time-- 671 00:45:30,390 --> 00:45:35,110 end up taking an exponential problem, and solving it really 672 00:45:35,110 --> 00:45:38,560 quickly, as we did here. 673 00:45:38,560 --> 00:45:41,240 I could have made this graph enormous, and dynamic 674 00:45:41,240 --> 00:45:45,590 programming would have given us a very fast solution to it. 675 00:45:45,590 --> 00:45:47,600 So when can we use dynamic programming? 676 00:45:50,270 --> 00:45:51,543 Not all the time. 677 00:45:54,420 --> 00:46:02,500 We can use it on problems that exhibit two properties. 678 00:46:02,500 --> 00:46:05,086 The problem must have optimal substructure. 679 00:46:15,070 --> 00:46:20,060 What this means is that you can find a globally optimal 680 00:46:20,060 --> 00:46:25,505 solution by combining locally optimal solutions. 681 00:46:54,400 --> 00:46:59,540 So we can again see that with our graph problem, that we can 682 00:46:59,540 --> 00:47:04,230 combine the solutions from nodes at a distance from the 683 00:47:04,230 --> 00:47:08,020 root node to get the solution of getting there 684 00:47:08,020 --> 00:47:09,840 from the root node. 685 00:47:09,840 --> 00:47:13,720 If I know I can get from A to B, and I can find the optimal 686 00:47:13,720 --> 00:47:17,350 solution from B to C, then I can use that to find the 687 00:47:17,350 --> 00:47:23,020 optimal solution from A to C. So it has optimal 688 00:47:23,020 --> 00:47:24,270 substructure. 689 00:47:25,990 --> 00:47:32,350 The other thing it has to have is overlapping sub-problems. 
690 00:47:32,350 --> 00:47:34,380 And that's the thing I emphasized earlier-- 691 00:47:41,800 --> 00:47:46,120 that finding the optimal solution involves finding 692 00:47:46,120 --> 00:47:51,930 the optimal solution to the same sub-problem multiple times. 693 00:47:51,930 --> 00:47:55,180 Otherwise, we could build this memo, but we'd never 694 00:47:55,180 --> 00:47:57,790 successfully look up anything in it. 695 00:47:57,790 --> 00:47:59,910 And so the algorithm would give us the right answer, but 696 00:47:59,910 --> 00:48:01,160 we'd get no speedup. 697 00:48:03,850 --> 00:48:07,710 So it's this property that we need to know that we'll get 698 00:48:07,710 --> 00:48:10,260 the correct answer-- 699 00:48:10,260 --> 00:48:12,980 that when we combine the local solutions, we'll get the right 700 00:48:12,980 --> 00:48:14,750 global solution. 701 00:48:14,750 --> 00:48:18,570 It's this property that gives us an indication of how much 702 00:48:18,570 --> 00:48:22,100 of a speedup we can expect to achieve. 703 00:48:22,100 --> 00:48:25,300 How many problems will we not have to solve, because we can 704 00:48:25,300 --> 00:48:27,390 look up the solution? 705 00:48:27,390 --> 00:48:29,270 We'll come back to this. 706 00:48:29,270 --> 00:48:33,260 And we'll see how it applies to another problem that you've 707 00:48:33,260 --> 00:48:36,590 already looked at, say the knapsack problem, to give us a 708 00:48:36,590 --> 00:48:40,700 fast solution to that, so that if you want an answer, go back 709 00:48:40,700 --> 00:48:44,360 to a previous problem set, and take the full database of 710 00:48:44,360 --> 00:48:47,540 classes, you'll be able to solve it quickly using dynamic 711 00:48:47,540 --> 00:48:48,690 programming. 712 00:48:48,690 --> 00:48:50,460 OK, see you next time.