1 00:00:00,090 --> 00:00:02,430 The following content is provided under a Creative 2 00:00:02,430 --> 00:00:03,810 Commons license. 3 00:00:03,810 --> 00:00:06,050 Your support will help MIT OpenCourseWare 4 00:00:06,050 --> 00:00:10,170 continue to offer high quality educational resources for free. 5 00:00:10,170 --> 00:00:12,690 To make a donation or to view additional materials 6 00:00:12,690 --> 00:00:16,606 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:16,606 --> 00:00:17,570 at ocw.mit.edu. 8 00:00:27,337 --> 00:00:28,670 ARMANDO SOLAR-LEZAMA: All right. 9 00:00:28,670 --> 00:00:29,970 So good morning, everyone. 10 00:00:29,970 --> 00:00:32,250 I'm Armando Solar-Lezama. 11 00:00:32,250 --> 00:00:37,190 I'm giving the lecture today on symbolic execution. 12 00:00:37,190 --> 00:00:41,740 How many of you here are familiar with what the term is 13 00:00:41,740 --> 00:00:45,220 or have heard about it before? 14 00:00:45,220 --> 00:00:47,230 We want to get a sense of audience. 15 00:00:47,230 --> 00:00:48,180 OK. 16 00:00:48,180 --> 00:00:51,080 So let's see. 17 00:00:58,295 --> 00:01:00,690 I dropped this machine a little too many times 18 00:01:00,690 --> 00:01:04,480 and it takes a while to boot up. 19 00:01:04,480 --> 00:01:10,040 So symbolic execution is really the workhorse 20 00:01:10,040 --> 00:01:14,420 of modern program analysis. 21 00:01:14,420 --> 00:01:17,580 It's one of those techniques that has really broken out 22 00:01:17,580 --> 00:01:21,630 of the research bubble and actually made it 23 00:01:21,630 --> 00:01:25,210 into a very large number of high impact applications. 
24 00:01:25,210 --> 00:01:29,400 For example, today at Microsoft there's 25 00:01:29,400 --> 00:01:35,000 a system called SAGE that runs on a lot of important Microsoft 26 00:01:35,000 --> 00:01:37,560 code ranging from PowerPoint to Windows 27 00:01:37,560 --> 00:01:40,520 to actually find security problems and security 28 00:01:40,520 --> 00:01:42,610 vulnerabilities. 29 00:01:42,610 --> 00:01:44,700 There are a lot of academic projects that 30 00:01:44,700 --> 00:01:48,160 have made a lot of real world impact 31 00:01:48,160 --> 00:01:51,720 by discovering important bugs in open source software, 32 00:01:51,720 --> 00:01:55,870 for example, by relying on symbolic execution. 33 00:01:55,870 --> 00:01:59,390 And the beauty of symbolic execution as a technique 34 00:01:59,390 --> 00:02:03,410 is that compared to testing, for example, 35 00:02:03,410 --> 00:02:04,980 it gives you the ability to reason 36 00:02:04,980 --> 00:02:07,420 about how your program is going to behave 37 00:02:07,420 --> 00:02:12,260 on a potentially infinite set of possible inputs. 38 00:02:12,260 --> 00:02:15,390 It allows you to explore spaces of inputs 39 00:02:15,390 --> 00:02:18,730 that would be completely unfeasible and impractical 40 00:02:18,730 --> 00:02:21,860 to explore by, say, random testing, 41 00:02:21,860 --> 00:02:25,080 or even by having a very large number of testers 42 00:02:25,080 --> 00:02:27,040 banging on the code. 43 00:02:27,040 --> 00:02:29,690 On the other hand, compared to more traditional 44 00:02:29,690 --> 00:02:32,580 static analysis techniques it has the advantage 45 00:02:32,580 --> 00:02:36,180 that when it discovers a problem it can actually 46 00:02:36,180 --> 00:02:39,860 produce for you an input and a trace 47 00:02:39,860 --> 00:02:42,280 that you can run on your real program 48 00:02:42,280 --> 00:02:44,990 and execute that program on that input. 49 00:02:44,990 --> 00:02:48,100 And you can actually tell that it is a real bug. 
50 00:02:48,100 --> 00:02:49,980 And you can actually go and debug it 51 00:02:49,980 --> 00:02:55,700 using traditional debugging mechanisms. 52 00:02:55,700 --> 00:02:58,970 And this is particularly valuable 53 00:02:58,970 --> 00:03:02,100 when you're in an industrial development environment 54 00:03:02,100 --> 00:03:04,830 where you probably don't have time 55 00:03:04,830 --> 00:03:08,880 to go looking after every little problem in your code. 56 00:03:08,880 --> 00:03:10,920 You really want to be able to tell 57 00:03:10,920 --> 00:03:12,880 the difference between real problems 58 00:03:12,880 --> 00:03:16,010 versus false positives, for example. 59 00:03:16,010 --> 00:03:21,150 So how does it work? 60 00:03:21,150 --> 00:03:23,510 So in order to really understand how 61 00:03:23,510 --> 00:03:28,260 it works it's useful to start by looking at just 62 00:03:28,260 --> 00:03:30,450 normal execution, right? 63 00:03:30,450 --> 00:03:32,140 If we think of symbolic execution 64 00:03:32,140 --> 00:03:36,500 as a generalization of traditional, plain execution, 65 00:03:36,500 --> 00:03:40,310 it makes sense to know what this looks like. 66 00:03:40,310 --> 00:03:44,420 So I'm going to be using this very, very simple program 67 00:03:44,420 --> 00:03:48,090 as an illustration for a lot of what I'm going 68 00:03:48,090 --> 00:03:49,800 to be talking about today. 69 00:03:49,800 --> 00:03:51,510 So what do we have here? 70 00:03:51,510 --> 00:03:54,460 Again, it's a very simple piece of code, just 71 00:03:54,460 --> 00:03:57,510 a couple of branches and here we have an assertion, 72 00:03:57,510 --> 00:03:58,280 assert false. 73 00:03:58,280 --> 00:04:01,570 And we want to know could that assertion ever be triggered. 74 00:04:01,570 --> 00:04:02,270 Is it possible? 75 00:04:02,270 --> 00:04:07,260 Is there some input where that will make that assertion fail? 
76 00:04:07,260 --> 00:04:09,510 And in this case because the assertion is just saying, 77 00:04:09,510 --> 00:04:11,780 assert false, what I'm really asking is, 78 00:04:11,780 --> 00:04:14,960 is there an input that can reach that point in the program? 79 00:04:14,960 --> 00:04:19,070 So one of the things I can do is I can try just testing. 80 00:04:19,070 --> 00:04:24,550 I can go in and run this code with a concrete input. 81 00:04:24,550 --> 00:04:25,050 Right? 82 00:04:25,050 --> 00:04:29,850 So let's say that I start with an input where x is 4 83 00:04:29,850 --> 00:04:31,820 and y is 4. 84 00:04:31,820 --> 00:04:35,110 And initially t is going to have the value 0 85 00:04:35,110 --> 00:04:36,310 right after I declare it. 86 00:04:36,310 --> 00:04:38,990 So before we go with normal execution, 87 00:04:38,990 --> 00:04:40,800 what are some of the important points here? 88 00:04:40,800 --> 00:04:44,884 The fact that we need some representation of the state 89 00:04:44,884 --> 00:04:45,800 of the program, right? 90 00:04:45,800 --> 00:04:48,680 Whether we're doing normal execution 91 00:04:48,680 --> 00:04:52,020 or whether we're doing symbolic execution, 92 00:04:52,020 --> 00:04:53,700 we need to have some way to characterize 93 00:04:53,700 --> 00:04:54,710 the state of the program. 94 00:04:54,710 --> 00:04:56,700 And in this case, this is such a simple program 95 00:04:56,700 --> 00:04:59,850 that it doesn't use the heap. 96 00:04:59,850 --> 00:05:01,050 It doesn't use the stack. 97 00:05:01,050 --> 00:05:03,210 There are no function calls. 98 00:05:03,210 --> 00:05:07,550 So the state can be fully characterized by these three 99 00:05:07,550 --> 00:05:10,130 variables together with knowledge of where 100 00:05:10,130 --> 00:05:12,200 in the program I'm at, right? 101 00:05:12,200 --> 00:05:15,920 So if I start executing with 4, 4, 102 00:05:15,920 --> 00:05:21,330 and 0, so when I get to this branch, is 4 greater than 4? 
103 00:05:21,330 --> 00:05:22,460 Clearly not. 104 00:05:22,460 --> 00:05:26,560 So then I'm going to be executing t equals y. 105 00:05:26,560 --> 00:05:29,850 So now after I do that t is no longer 0. 106 00:05:29,850 --> 00:05:32,230 It now has the value 4. 107 00:05:32,230 --> 00:05:32,730 Right? 108 00:05:32,730 --> 00:05:35,080 So that is now the state of my program. 109 00:05:35,080 --> 00:05:38,980 And then I can evaluate this branch. 110 00:05:38,980 --> 00:05:41,260 Is it the case that t is less than x? 111 00:05:43,850 --> 00:05:44,350 No. 112 00:05:44,350 --> 00:05:44,849 Right? 113 00:05:44,849 --> 00:05:46,440 So we dodged the bullet. 114 00:05:46,440 --> 00:05:49,490 We did not get an assertion failure. 115 00:05:49,490 --> 00:05:52,331 There was no problem in this particular execution. 116 00:05:52,331 --> 00:05:52,830 Right? 117 00:05:52,830 --> 00:05:55,580 But that doesn't really tell us anything 118 00:05:55,580 --> 00:05:57,010 about any other execution. 119 00:05:57,010 --> 00:05:59,540 All we know is that under the input 120 00:05:59,540 --> 00:06:03,580 x equals 4 and y equals 4, the program is not going to fail. 121 00:06:03,580 --> 00:06:06,790 But it tells us nothing about what's going to happen 122 00:06:06,790 --> 00:06:10,390 on the input 2, 1, for example. 123 00:06:10,390 --> 00:06:10,890 Right? 124 00:06:10,890 --> 00:06:13,700 And in this input you see that this input is actually 125 00:06:13,700 --> 00:06:17,350 going to follow a different path in the execution. 126 00:06:17,350 --> 00:06:22,020 This time we're actually going to see that t equals x. 127 00:06:22,020 --> 00:06:25,750 We're actually going to set t equal to x. 128 00:06:25,750 --> 00:06:29,710 So after executing this, t will be equal to 2, 129 00:06:29,710 --> 00:06:32,765 but is there any problem in this execution? 130 00:06:36,800 --> 00:06:39,500 Will there be an assertion failure on this input? 131 00:06:42,920 --> 00:06:44,050 Well, so let's see. 
132 00:06:44,050 --> 00:06:45,850 So if t is 2. 133 00:06:45,850 --> 00:06:47,970 And x is 2. 134 00:06:47,970 --> 00:06:50,420 Is t less than x? 135 00:06:50,420 --> 00:06:51,160 No. 136 00:06:51,160 --> 00:06:54,330 So it looks like we dodged a bullet again. 137 00:06:54,330 --> 00:06:54,830 Right? 138 00:06:54,830 --> 00:06:57,930 So here we have two concrete inputs. 139 00:06:57,930 --> 00:07:00,440 And they told us that on these two concrete inputs 140 00:07:00,440 --> 00:07:01,770 the program didn't fail. 141 00:07:01,770 --> 00:07:06,900 But that really doesn't tell us anything about any other input. 142 00:07:06,900 --> 00:07:10,080 And so the idea with symbolic execution 143 00:07:10,080 --> 00:07:13,950 is we want to go beyond these single input executions. 144 00:07:13,950 --> 00:07:17,480 And we want to be able to actually reason 145 00:07:17,480 --> 00:07:20,440 about the behavior of the program on very 146 00:07:20,440 --> 00:07:21,550 large sets of inputs. 147 00:07:21,550 --> 00:07:25,680 In some cases, infinite sets of possible inputs. 148 00:07:25,680 --> 00:07:28,830 And the basic idea is as follows. 149 00:07:28,830 --> 00:07:31,940 So for a program like this, just like 150 00:07:31,940 --> 00:07:33,940 before the state of the program is 151 00:07:33,940 --> 00:07:36,630 characterized by the value of these three 152 00:07:36,630 --> 00:07:37,500 different variables. 153 00:07:37,500 --> 00:07:41,140 Right? x, y, and t together with knowing where in the program 154 00:07:41,140 --> 00:07:42,380 I'm at. 155 00:07:42,380 --> 00:07:48,230 But now instead of concrete values for x and y 156 00:07:48,230 --> 00:07:51,920 what I'm going to have is a symbolic value, just 157 00:07:51,920 --> 00:07:52,530 a variable. 158 00:07:52,530 --> 00:07:57,760 A variable that allows me to give a name to this value 159 00:07:57,760 --> 00:08:00,450 that the user is going to provide at the input. 
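For reference, the two concrete runs traced above can be reproduced with a small sketch. The function below is a reconstruction of the slide's program from the spoken description (the slide code itself is not in the transcript); the name `example` and the string return values are assumptions made for illustration, and Python stands in for whatever language the slide used.

```python
def example(x: int, y: int) -> str:
    """Hypothetical reconstruction of the lecture's example program."""
    t = 0                      # t is 0 right after it is declared
    if x > y:
        t = x
    else:
        t = y
    if t < x:
        # Stands in for the slide's `assert false`.
        return "assertion failed"
    return "ok"


# The two concrete inputs discussed in the lecture.
for x, y in [(4, 4), (2, 1)]:
    print((x, y), example(x, y))
```

Both runs report no failure, which says nothing about any other input; that gap is what symbolic execution is meant to close.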
160 00:08:00,450 --> 00:08:03,540 So what that means is that the state of my program 161 00:08:03,540 --> 00:08:07,170 is no longer a mapping from variable names 162 00:08:07,170 --> 00:08:08,630 to concrete values. 163 00:08:08,630 --> 00:08:13,542 It's now a mapping from variable names to these symbolic values. 164 00:08:13,542 --> 00:08:15,250 And a symbolic value, you can essentially 165 00:08:15,250 --> 00:08:18,480 think of it as a formula. 166 00:08:18,480 --> 00:08:23,610 So in this case the formula for x is just x. 167 00:08:23,610 --> 00:08:25,440 And the formula for y is just y. 168 00:08:25,440 --> 00:08:27,590 And for t, it's actually the value 0. 169 00:08:27,590 --> 00:08:31,190 We know that for every input, doesn't matter what you do. 170 00:08:31,190 --> 00:08:35,400 The value of t after the first statement is going to be 0. 171 00:08:35,400 --> 00:08:39,510 But now here's where it gets interesting. 172 00:08:39,510 --> 00:08:42,179 So we get to this branch right here 173 00:08:42,179 --> 00:08:44,912 that says, if x is greater than y, 174 00:08:44,912 --> 00:08:46,370 we're going to go in one direction. 175 00:08:48,977 --> 00:08:50,560 If it's less than or equal to y, we're 176 00:08:50,560 --> 00:08:52,018 going to go in the other direction. 177 00:08:52,018 --> 00:08:55,600 Now do we know anything about x and y? 178 00:08:59,476 --> 00:09:00,600 What do we know about them? 179 00:09:05,600 --> 00:09:07,110 We know their type, at least. 180 00:09:07,110 --> 00:09:08,020 So that's a start. 181 00:09:08,020 --> 00:09:11,870 So we know that they're going to be ranging from min int 182 00:09:11,870 --> 00:09:16,287 to max int, but that's about all we know about them. 183 00:09:16,287 --> 00:09:17,870 And it turns out that this information 184 00:09:17,870 --> 00:09:22,070 that we know about them is not sufficient to tell us which 185 00:09:22,070 --> 00:09:23,630 direction this branch might go. 
186 00:09:23,630 --> 00:09:26,630 This branch could go either way. 187 00:09:26,630 --> 00:09:32,360 And so now there are many things we can do, 188 00:09:32,360 --> 00:09:35,870 but what's one possible thing that we could do at this point? 189 00:09:44,680 --> 00:09:46,018 Make a wild guess. 190 00:09:46,018 --> 00:09:46,935 AUDIENCE: [INAUDIBLE]. 191 00:09:46,935 --> 00:09:48,059 ARMANDO SOLAR-LEZAMA: Yeah. 192 00:09:48,059 --> 00:09:49,680 We could follow both branches. 193 00:09:49,680 --> 00:09:54,420 We could flip a coin and pick one branch and take that. 194 00:09:54,420 --> 00:09:56,730 So if we want to follow both branches 195 00:09:56,730 --> 00:09:58,990 we have to follow one and then the other one, right? 196 00:09:58,990 --> 00:10:04,381 So let's say we start with this branch. 197 00:10:04,381 --> 00:10:04,880 Right? 198 00:10:04,880 --> 00:10:07,250 So now we are at this branch. 199 00:10:07,250 --> 00:10:11,240 So what we know is that if we make it to this branch, 200 00:10:11,240 --> 00:10:17,740 in this branch t is now going to have the same value as x. 201 00:10:17,740 --> 00:10:20,277 And we don't know what that value is going to be, 202 00:10:20,277 --> 00:10:21,360 but we have a name for it. 203 00:10:21,360 --> 00:10:26,080 It's this script letter x. 204 00:10:26,080 --> 00:10:26,580 Right? 205 00:10:26,580 --> 00:10:31,370 So that's the value of t on that branch. 206 00:10:31,370 --> 00:10:36,330 If we were to take the opposite branch then what would happen? 207 00:10:36,330 --> 00:10:38,730 The value of t would be something different, right? 208 00:10:38,730 --> 00:10:45,790 In that branch, the value of t would be the symbolic value y. 209 00:10:45,790 --> 00:10:50,050 So that means that when we get to this point in the program, 210 00:10:50,050 --> 00:10:51,020 what is the value of t? 211 00:10:51,020 --> 00:10:53,040 Well, maybe it's x. 212 00:10:53,040 --> 00:10:54,440 And maybe it's y. 
213 00:10:54,440 --> 00:10:58,460 We don't know exactly which one it is, but why don't we 214 00:10:58,460 --> 00:10:59,150 give it a name? 215 00:10:59,150 --> 00:11:02,450 Let's call it t0. 216 00:11:02,450 --> 00:11:04,970 And what do we know about t0? 217 00:11:07,570 --> 00:11:10,855 What are the cases where t0 is going to be equal to x? 218 00:11:14,291 --> 00:11:15,650 AUDIENCE: [INAUDIBLE]. 219 00:11:15,650 --> 00:11:17,108 ARMANDO SOLAR-LEZAMA: That's right. 220 00:11:17,108 --> 00:11:21,380 So essentially what we know is that if x is greater than y, 221 00:11:21,380 --> 00:11:27,280 then this implies that it's x. 222 00:11:27,280 --> 00:11:37,460 And if x is less than or equal to y that implies that it's y, 223 00:11:37,460 --> 00:11:38,350 right? 224 00:11:38,350 --> 00:11:41,960 And so we have this value that we've defined. 225 00:11:41,960 --> 00:11:43,750 We'll call it t0. 226 00:11:43,750 --> 00:11:46,470 And it has these logical properties. 227 00:11:46,470 --> 00:11:53,150 So at this point in the program we actually 228 00:11:53,150 --> 00:11:56,800 have a name for the value of t. 229 00:11:56,800 --> 00:11:57,700 It's t0. 230 00:12:00,290 --> 00:12:00,960 Right? 231 00:12:00,960 --> 00:12:03,200 And so what did we do here? 232 00:12:03,200 --> 00:12:06,835 We took both branches of this if statement. 233 00:12:09,610 --> 00:12:12,090 And then we computed the symbolic value 234 00:12:12,090 --> 00:12:14,220 by looking at under what conditions 235 00:12:14,220 --> 00:12:17,170 am I going to take one branch, under what conditions am I 236 00:12:17,170 --> 00:12:19,360 going to take another branch? 237 00:12:19,360 --> 00:12:22,330 And then looking at what values am 238 00:12:22,330 --> 00:12:26,420 I going to be assigning to t on both of those branches? 239 00:12:26,420 --> 00:12:31,760 So now it comes to the point where we have to ask, 240 00:12:31,760 --> 00:12:33,130 can t be less than x? 241 00:12:33,130 --> 00:12:33,630 Right? 
242 00:12:33,630 --> 00:12:35,510 So what is the value of t? 243 00:12:35,510 --> 00:12:37,580 The value of t is now t0. 244 00:12:37,580 --> 00:12:41,040 So what we want to know is, is it 245 00:12:41,040 --> 00:12:47,090 possible for t0 to be less than x? 246 00:12:47,090 --> 00:12:47,590 Right? 247 00:12:47,590 --> 00:12:51,760 Now remember the first branch we hit 248 00:12:51,760 --> 00:12:53,930 we were asking a question about x and y. 249 00:12:53,930 --> 00:12:56,990 And we knew nothing about x and y. 250 00:12:56,990 --> 00:12:59,520 The only thing we knew about x and y 251 00:12:59,520 --> 00:13:02,100 was that they were of type int. 252 00:13:02,100 --> 00:13:06,620 But now with t0 we actually know a lot about t0. 253 00:13:06,620 --> 00:13:11,930 We know that t0 is going to be equal to x in some cases. 254 00:13:11,930 --> 00:13:14,640 And it's going to be equal to y in some cases. 255 00:13:14,640 --> 00:13:18,300 And so this now gives us a set of equations 256 00:13:18,300 --> 00:13:20,090 that we can solve for. 257 00:13:20,090 --> 00:13:26,060 So what we can say is, is it possible to satisfy 258 00:13:26,060 --> 00:13:31,110 t0 less than x knowing that t0 satisfies 259 00:13:31,110 --> 00:13:33,761 all of these properties? 260 00:13:33,761 --> 00:13:34,260 Right? 261 00:13:34,260 --> 00:13:38,270 So, in fact, we can actually express this 262 00:13:38,270 --> 00:13:44,550 as a constraint where we say, so is it possible to have t0 263 00:13:44,550 --> 00:13:45,990 less than x? 264 00:13:45,990 --> 00:13:55,720 And to have x greater than y implies t0 equals x. 265 00:13:55,720 --> 00:14:07,146 And x less than or equal to y imply t0 equal y. 266 00:14:10,010 --> 00:14:10,510 Right? 
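Written out as a single formula, the constraint just spoken on the board reads roughly as follows. This is a transcription of the spoken description, with plain x and y standing in for the script symbols; the board itself is not in the transcript.

```latex
\[
(t_0 < x) \;\land\; (x > y \Rightarrow t_0 = x) \;\land\; (x \le y \Rightarrow t_0 = y)
\]
```

The assertion is reachable exactly when some choice of x, y, and t_0 makes all three conjuncts true at once.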
267 00:14:10,510 --> 00:14:15,890 So what we have here is an equation that if that equation 268 00:14:15,890 --> 00:14:20,200 has a solution, if it's possible to find a value of t0, 269 00:14:20,200 --> 00:14:24,660 and a value of x, and a value of y that satisfies that equation, 270 00:14:24,660 --> 00:14:29,930 then we know that those values, when we plug them 271 00:14:29,930 --> 00:14:33,170 into our program, when the program executes, 272 00:14:33,170 --> 00:14:35,930 it will take this branch. 273 00:14:35,930 --> 00:14:40,090 And it will blow up when it hits a assert false. 274 00:14:42,721 --> 00:14:43,220 Right? 275 00:14:43,220 --> 00:14:45,080 So what did we do here? 276 00:14:45,080 --> 00:14:50,370 So we're executing this program, but instead 277 00:14:50,370 --> 00:14:57,560 of keeping our state as a mapping from variable names 278 00:14:57,560 --> 00:14:59,970 to values, what we're doing is we're 279 00:14:59,970 --> 00:15:03,970 keeping our program as a mapping from variable names 280 00:15:03,970 --> 00:15:07,310 to these symbolic values. 281 00:15:07,310 --> 00:15:09,230 Essentially, other variable names. 282 00:15:09,230 --> 00:15:11,830 And in this case our other variable names 283 00:15:11,830 --> 00:15:17,320 are the script x, script y, t0, and on top of that, 284 00:15:17,320 --> 00:15:20,110 we have a set of equations that tell us 285 00:15:20,110 --> 00:15:22,460 how those values are related. 286 00:15:22,460 --> 00:15:24,510 So we have an equation that tells us 287 00:15:24,510 --> 00:15:29,180 how t0 is related to x and y in this case. 288 00:15:29,180 --> 00:15:33,620 And solving for that equation allows 289 00:15:33,620 --> 00:15:37,380 us to answer the question of whether this branch can 290 00:15:37,380 --> 00:15:38,310 be taken or not. 291 00:15:38,310 --> 00:15:41,510 Now just looking at the equation, 292 00:15:41,510 --> 00:15:42,900 can this branch be taken or not? 293 00:15:45,570 --> 00:15:46,070 Right? 
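The bookkeeping described here, a store from program variables to symbolic values plus a set of constraints relating those values, can be pictured with a small sketch. The class name `SymState`, the capital letters `X` and `Y` for the script symbols, and the use of plain strings for formulas are all assumptions for illustration; a real tool would use proper expression objects.

```python
from dataclasses import dataclass, field


@dataclass
class SymState:
    # Program variable -> symbolic expression (kept as text for readability).
    store: dict
    # Facts that relate the symbolic values to each other.
    constraints: list = field(default_factory=list)


# State right after `t` is declared: inputs are symbols, t is the constant 0.
s = SymState(store={"x": "X", "y": "Y", "t": "0"})

# After merging both sides of the first branch, t gets the fresh name t0,
# and the constraints record what t0 is on each side.
s.store["t"] = "t0"
s.constraints += ["X > Y  ->  t0 == X", "X <= Y  ->  t0 == Y"]

# The question asked at the second branch: can `t < x` hold?
query = f"{s.store['t']} < {s.store['x']}"

print(s.store)
print(s.constraints)
print("query:", query)
```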
294 00:15:46,070 --> 00:15:49,450 So it looks like the branch cannot be taken. 295 00:15:49,450 --> 00:15:50,170 Why not? 296 00:15:50,170 --> 00:15:56,390 Because we're looking for cases where t0 is less than x, 297 00:15:56,390 --> 00:15:59,500 which means that if you're in this case, then clearly 298 00:15:59,500 --> 00:16:01,350 that's not going to be true. 299 00:16:01,350 --> 00:16:01,850 Right? 300 00:16:01,850 --> 00:16:04,480 So that means that when x is greater than y, 301 00:16:04,480 --> 00:16:08,280 then it cannot happen because t0 will be equal to x. 302 00:16:08,280 --> 00:16:11,720 And it cannot be equal to x and less than x at the same time. 303 00:16:11,720 --> 00:16:13,950 And what about in this case? 304 00:16:13,950 --> 00:16:15,180 Can it happen in this case? 305 00:16:15,180 --> 00:16:17,200 Can t0 be less than x in this case? 306 00:16:21,150 --> 00:16:22,590 No, it clearly cannot, right? 307 00:16:22,590 --> 00:16:29,180 Because in this case we know that x is less than y. 308 00:16:29,180 --> 00:16:31,790 And so if t0 is going to be less than x, 309 00:16:31,790 --> 00:16:34,070 then it would also be less than y. 310 00:16:34,070 --> 00:16:37,730 But we know that in that case t0 is exactly equal to y. 311 00:16:37,730 --> 00:16:42,730 And therefore, again, that case cannot be satisfied. 312 00:16:42,730 --> 00:16:47,080 So what we have here is an equation that has no solution. 313 00:16:47,080 --> 00:16:49,980 It doesn't matter what values you plug into this equation. 314 00:16:49,980 --> 00:16:54,990 You cannot solve it and that tells us that no matter what 315 00:16:54,990 --> 00:17:01,620 inputs we pass to this code, it will not go down this branch. 316 00:17:01,620 --> 00:17:07,460 Now notice that when making that argument here 317 00:17:07,460 --> 00:17:10,859 I was basically alluding to your intuition about integers, 318 00:17:10,859 --> 00:17:13,619 about mathematical integers. 
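The case analysis above can be spot-checked by brute force. The sketch below enumerates a small box of integer values and looks for any assignment that satisfies the whole constraint; the function names and the range of -5 to 5 are assumptions, and an exhaustive search over a tiny range is only a sanity check, not a proof and not what an actual solver does.

```python
from itertools import product


def formula(x: int, y: int, t0: int) -> bool:
    """t0 < x, together with the two implications that define t0."""
    def implies(a: bool, b: bool) -> bool:
        return (not a) or b

    return t0 < x and implies(x > y, t0 == x) and implies(x <= y, t0 == y)


def find_model(lo: int = -5, hi: int = 5):
    """Return the first satisfying assignment in the box, or None."""
    for x, y, t0 in product(range(lo, hi + 1), repeat=3):
        if formula(x, y, t0):
            return {"x": x, "y": y, "t0": t0}
    return None


print(find_model())
```

No assignment is found in the box, which matches the argument that the formula has no solution.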
319 00:17:13,619 --> 00:17:17,589 In practice we know that machine ints don't quite 320 00:17:17,589 --> 00:17:22,109 behave exactly the same way as mathematical ints. 321 00:17:22,109 --> 00:17:25,130 And there are some cases where laws 322 00:17:25,130 --> 00:17:27,430 that apply to mathematical ints don't actually 323 00:17:27,430 --> 00:17:29,822 apply to ints in programs. 324 00:17:29,822 --> 00:17:31,280 And so when reasoning about this we 325 00:17:31,280 --> 00:17:33,761 have to be very careful that when 326 00:17:33,761 --> 00:17:35,260 we're solving these equations, we're 327 00:17:35,260 --> 00:17:40,930 keeping in mind that these are not 328 00:17:40,930 --> 00:17:44,830 the integers as they were taught to us in elementary school. 329 00:17:44,830 --> 00:17:48,550 These are 32-bit integers that the machine uses. 330 00:17:48,550 --> 00:17:51,090 And there are many cases and many instances 331 00:17:51,090 --> 00:17:55,000 of bugs that arose because programmers were thinking 332 00:17:55,000 --> 00:17:58,770 about their code in terms of mathematical integers, 333 00:17:58,770 --> 00:18:02,450 and not realizing that there are things like overflows that 334 00:18:02,450 --> 00:18:04,330 can cause the program to behave differently 335 00:18:04,330 --> 00:18:06,470 for mathematical inputs. 336 00:18:06,470 --> 00:18:10,140 But the other thing is what I've described here 337 00:18:10,140 --> 00:18:16,230 is a purely intuitive argument. 338 00:18:16,230 --> 00:18:19,110 I walk you through the process of how to do this by hand, 339 00:18:19,110 --> 00:18:21,970 but that's by no means an algorithm. 340 00:18:21,970 --> 00:18:23,250 Right? 341 00:18:23,250 --> 00:18:26,070 The beauty of this idea of symbolic execution, 342 00:18:26,070 --> 00:18:28,920 however, is that it can be coded into an algorithm. 
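To make the caveat about machine integers concrete, here is a minimal sketch of one law that holds for mathematical integers but not for wrapping 32-bit ones: x + 1 > x. The helper `wrap32` is a hypothetical name, and it models two's-complement wraparound (the behavior of, say, Java's `int`); in C, signed overflow is undefined rather than guaranteed to wrap.

```python
def wrap32(n: int) -> int:
    """Interpret n as a signed 32-bit two's-complement integer."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


INT_MAX = 2**31 - 1
INT_MIN = -(2**31)

x = INT_MAX
print(x + 1 > x)          # unbounded (mathematical) integers
print(wrap32(x + 1) > x)  # 32-bit wraparound: x + 1 becomes INT_MIN
```

A solver that reasons over bit vectors instead of unbounded integers accounts for this kind of wraparound.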
343 00:18:28,920 --> 00:18:31,960 And it can be solved in a mechanical way, which 344 00:18:31,960 --> 00:18:36,190 allows you to do this not just for ten line programs, 345 00:18:36,190 --> 00:18:38,930 but actually for million line programs. 346 00:18:38,930 --> 00:18:41,281 And it allows you to actually take 347 00:18:41,281 --> 00:18:43,280 this reasoning, and the same intuitive reasoning 348 00:18:43,280 --> 00:18:48,090 that we used in this case to talk 349 00:18:48,090 --> 00:18:49,820 about what happens when we execute 350 00:18:49,820 --> 00:18:51,860 this program on different inputs. 351 00:18:51,860 --> 00:18:59,429 And scale that reasoning to very large programs. 352 00:18:59,429 --> 00:19:00,720 Are there any questions so far? 353 00:19:05,621 --> 00:19:06,120 Yes? 354 00:19:06,120 --> 00:19:07,745 AUDIENCE: What if a [INAUDIBLE] are not 355 00:19:07,745 --> 00:19:09,620 supposed to take an input? 356 00:19:09,620 --> 00:19:10,120 [INAUDIBLE] 357 00:19:15,639 --> 00:19:16,680 ARMANDO SOLAR-LEZAMA: Oh. 358 00:19:16,680 --> 00:19:17,920 That's a very good question. 359 00:19:17,920 --> 00:19:26,190 Right, so, for example, let's say 360 00:19:26,190 --> 00:19:36,100 we have the program that we have here, but instead 361 00:19:36,100 --> 00:19:46,130 of these being t equals x, here we will say t equals x minus 1. 362 00:19:46,130 --> 00:19:47,006 Right? 363 00:19:47,006 --> 00:19:48,630 So now all of a sudden, intuitively you 364 00:19:48,630 --> 00:19:52,580 can see that now this program could blow up, right? 365 00:19:52,580 --> 00:20:00,150 Because when the program takes this path then 366 00:20:00,150 --> 00:20:02,680 t will indeed be less than x. 367 00:20:02,680 --> 00:20:06,220 And you will indeed fail here. 368 00:20:06,220 --> 00:20:06,720 Right? 369 00:20:06,720 --> 00:20:10,040 So what will happen to a program like this? 370 00:20:10,040 --> 00:20:15,370 How will our symbolic state look like? 371 00:20:15,370 --> 00:20:15,870 Right? 
372 00:20:15,870 --> 00:20:22,710 So in this case, so t0, when x is greater than y, 373 00:20:22,710 --> 00:20:24,697 what is t0 now going to be equal to? 374 00:20:24,697 --> 00:20:26,030 It's not going to be equal to x. 375 00:20:26,030 --> 00:20:35,060 It's going to be equal to x minus 1, right? 376 00:20:35,060 --> 00:20:47,290 And so that means that, so, this condition now 377 00:20:47,290 --> 00:20:50,060 has a satisfying assignment. 378 00:20:50,060 --> 00:20:50,560 Right? 379 00:20:50,560 --> 00:20:56,600 Now this can fail, but what if you go to the developer 380 00:20:56,600 --> 00:21:03,320 and say, hey, this function can blow up 381 00:21:03,320 --> 00:21:06,710 whenever x is greater than y. 382 00:21:06,710 --> 00:21:11,150 And the developer looks at this and says, 383 00:21:11,150 --> 00:21:13,340 oh, I forgot to tell you. 384 00:21:13,340 --> 00:21:16,410 Actually, this function can never 385 00:21:16,410 --> 00:21:23,090 be called with parameters where x is greater than y. 386 00:21:23,090 --> 00:21:23,830 Right? 387 00:21:23,830 --> 00:21:27,110 That the client that calls this function is just 388 00:21:27,110 --> 00:21:29,140 a quick function that I wrote for something. 389 00:21:29,140 --> 00:21:32,060 And it has this branch for some historical purpose. 390 00:21:32,060 --> 00:21:34,140 But actually this function will never 391 00:21:34,140 --> 00:21:37,240 get called with x greater than y. 392 00:21:37,240 --> 00:21:39,150 You're like, well, now you tell me. 393 00:21:39,150 --> 00:21:39,870 Right? 394 00:21:39,870 --> 00:21:43,060 But the way we can think about this 395 00:21:43,060 --> 00:21:55,830 is that there is an assumption that x is going to be less than 396 00:21:55,830 --> 00:21:57,360 or equal to y, right? 397 00:21:57,360 --> 00:22:02,020 This is sometimes referred to as a precondition or a contract 398 00:22:02,020 --> 00:22:02,890 for this function. 
399 00:22:02,890 --> 00:22:04,639 The function is promising to do something, 400 00:22:04,639 --> 00:22:06,622 but only if you satisfy this assumption. 401 00:22:06,622 --> 00:22:09,080 And if you don't satisfy the assumption, the function says, 402 00:22:09,080 --> 00:22:11,026 I don't care what happens. 403 00:22:11,026 --> 00:22:12,400 I only promise that I'm not going 404 00:22:12,400 --> 00:22:15,390 to fail when this assumption is satisfied. 405 00:22:15,390 --> 00:22:17,370 And it's the responsibility of the caller 406 00:22:17,370 --> 00:22:20,790 to make sure that this condition is never violated, right? 407 00:22:20,790 --> 00:22:26,340 So how would we encode that constraint 408 00:22:26,340 --> 00:22:28,040 when we're solving for equations? 409 00:22:28,040 --> 00:22:30,280 Well, essentially what we have is 410 00:22:30,280 --> 00:22:31,780 we have this set of constraints that 411 00:22:31,780 --> 00:22:34,040 tell us whether this branch is feasible. 412 00:22:34,040 --> 00:22:37,100 And on top of the constraints that we already have 413 00:22:37,100 --> 00:22:45,530 we need to also make sure that the precondition, 414 00:22:45,530 --> 00:22:48,260 or the assumptions are satisfied. 415 00:22:48,260 --> 00:22:48,820 Right? 416 00:22:48,820 --> 00:22:53,210 And now we want to ask, OK, so can I 417 00:22:53,210 --> 00:22:56,780 find an x and a y that satisfy all of these constraints 418 00:22:56,780 --> 00:22:59,630 together with this constraint that I have on the input, 419 00:22:59,630 --> 00:23:01,540 with these properties that I know 420 00:23:01,540 --> 00:23:03,500 that the input must satisfy? 421 00:23:03,500 --> 00:23:06,810 And once again you can see that this constraint 422 00:23:06,810 --> 00:23:10,050 of x less than or equal to y is the difference 423 00:23:10,050 --> 00:23:13,940 between this constraint being satisfiable, 424 00:23:13,940 --> 00:23:18,780 and this constraint once again becoming unsatisfiable. 
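The effect of the precondition can be seen in a small sketch of the modified program, where the first branch assigns x minus 1. The name `find_failure`, the optional `precondition` parameter, and the search range are assumptions for illustration, and bounded enumeration again stands in for a real solver.

```python
from itertools import product


def find_failure(precondition=None, lo: int = -5, hi: int = 5):
    """Search a small box of inputs for one that reaches the assertion.

    Inputs that violate the (optional) precondition are skipped, which
    mirrors adding the assumption as an extra constraint.
    """
    for x, y in product(range(lo, hi + 1), repeat=2):
        if precondition is not None and not precondition(x, y):
            continue
        t = x - 1 if x > y else y   # modified program: t = x - 1
        if t < x:                   # the branch guarding `assert false`
            return (x, y)
    return None


# Without any assumption on the inputs, a failing input exists.
print(find_failure())

# With the precondition x <= y, no failing input is found in the box.
print(find_failure(precondition=lambda x, y: x <= y))
```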
425 00:23:18,780 --> 00:23:22,450 That's a very important issue when dealing with analysis, 426 00:23:22,450 --> 00:23:25,910 especially when you want to do this modularly 427 00:23:25,910 --> 00:23:27,990 at the level of individual functions at a time. 428 00:23:27,990 --> 00:23:32,220 It makes sense to know what the assumptions are 429 00:23:32,220 --> 00:23:34,412 that the programmer had in mind when 430 00:23:34,412 --> 00:23:36,620 writing this function, because if you don't know what 431 00:23:36,620 --> 00:23:39,760 those assumptions were you could say, yeah, here 432 00:23:39,760 --> 00:23:42,780 are some inputs where it's going to fail only for the programmer 433 00:23:42,780 --> 00:23:45,530 to dismiss that by saying, oh, but those inputs are not 434 00:23:45,530 --> 00:23:49,489 possible, or those inputs can never happen. 435 00:23:49,489 --> 00:23:50,155 Other questions? 436 00:23:57,570 --> 00:23:58,070 All right. 437 00:23:58,070 --> 00:24:03,210 So how do we do this in a more mechanical way? 438 00:24:03,210 --> 00:24:07,965 So there are two aspects to this problem. 439 00:24:07,965 --> 00:24:11,390 Aspect number one is how do you actually 440 00:24:11,390 --> 00:24:13,890 come up with these formulas? 441 00:24:13,890 --> 00:24:15,770 So in this case it was kind of intuitive 442 00:24:15,770 --> 00:24:17,174 how we came up with the formulas, 443 00:24:17,174 --> 00:24:19,090 where we were just working through it by hand, 444 00:24:19,090 --> 00:24:21,490 but how do you come up with these formulas 445 00:24:21,490 --> 00:24:23,390 in a mechanical way? 446 00:24:23,390 --> 00:24:27,660 And aspect number two is once you have the formulas, 447 00:24:27,660 --> 00:24:30,520 how do you actually solve them? 448 00:24:30,520 --> 00:24:34,140 How can you actually solve these formulas 449 00:24:34,140 --> 00:24:38,700 that describe whether your program fails or not? 
450 00:24:38,700 --> 00:24:43,970 And I'm actually going to start with that second question. 451 00:24:43,970 --> 00:24:48,350 Given that we're able to reduce our problem to these formulas 452 00:24:48,350 --> 00:24:54,280 that involve integer reasoning that involved 453 00:24:54,280 --> 00:24:55,910 in the case of programs generally 454 00:24:55,910 --> 00:24:57,721 you care about bit vector reasoning. 455 00:24:57,721 --> 00:25:00,220 [INAUDIBLE] programs, a lot of times, you care about arrays. 456 00:25:00,220 --> 00:25:01,920 You care about functions. 457 00:25:01,920 --> 00:25:04,180 And you end up with these giant formulas. 458 00:25:04,180 --> 00:25:08,540 How in the world do you actually solve them in a mechanical way? 459 00:25:08,540 --> 00:25:12,020 And a lot of the technology that we're talking about today, 460 00:25:12,020 --> 00:25:14,870 and the reason why we're actually talking about it 461 00:25:14,870 --> 00:25:20,280 as a practical tool, have to do with tremendous advances 462 00:25:20,280 --> 00:25:23,170 in solvers for logical questions. 463 00:25:23,170 --> 00:25:25,390 And in particular, there is a very important class 464 00:25:25,390 --> 00:25:31,300 of solvers called satisfiability modulo theory solvers, 465 00:25:31,300 --> 00:25:33,730 often abbreviated as SMT. 466 00:25:33,730 --> 00:25:35,230 But a lot of people in the community 467 00:25:35,230 --> 00:25:39,260 would argue that the name is not a particularly good name, 468 00:25:39,260 --> 00:25:41,820 but it's the one that everybody uses and it has stuck. 469 00:25:41,820 --> 00:25:45,220 What you need to know about these SMT solvers 470 00:25:45,220 --> 00:25:50,840 is that an SMT solver is an algorithm essentially 471 00:25:50,840 --> 00:25:54,670 that given a logical formula will give you 472 00:25:54,670 --> 00:25:56,080 one of two things. 
473 00:25:56,080 --> 00:25:58,430 It will give you either a satisfying assignment 474 00:25:58,430 --> 00:26:01,830 to the formula, or it will tell you 475 00:26:01,830 --> 00:26:04,990 that the formula is unsatisfiable. 476 00:26:04,990 --> 00:26:09,490 And that there is no possible assignment 477 00:26:09,490 --> 00:26:11,590 to the variables in that formula that 478 00:26:11,590 --> 00:26:14,790 will satisfy these constraints that you defined. 479 00:26:14,790 --> 00:26:18,820 Now in practice, if this sounds a little bit scary 480 00:26:18,820 --> 00:26:21,730 and a little bit like magic, it is a little bit scary. 481 00:26:21,730 --> 00:26:25,350 A lot of the problems that these SMT solvers have to solve 482 00:26:25,350 --> 00:26:28,310 are NP-complete in the best case. 483 00:26:28,310 --> 00:26:30,570 All right? The nice ones are NP-complete. 484 00:26:30,570 --> 00:26:34,310 The hard ones can get much hairier than that. 485 00:26:34,310 --> 00:26:41,040 So how can we have a system that relies as its primary building 486 00:26:41,040 --> 00:26:46,950 block on solving NP-complete or PSPACE-complete problems? 487 00:26:46,950 --> 00:26:51,020 And still have something that works in practice? 488 00:26:51,020 --> 00:26:54,570 And part of the answer is that for a lot of these solvers 489 00:26:54,570 --> 00:26:59,590 there is a third thing that they can tell you, 490 00:26:59,590 --> 00:27:01,440 which is, I don't know. 491 00:27:09,630 --> 00:27:14,530 And so part of the beauty of these solvers 492 00:27:14,530 --> 00:27:16,890 is that for practical problems, even 493 00:27:16,890 --> 00:27:19,900 for very, very large and complicated practical problems, 494 00:27:19,900 --> 00:27:22,660 they are still able to do better than simply telling you, 495 00:27:22,660 --> 00:27:23,410 I don't know.
496 00:27:23,410 --> 00:27:26,430 They are still able to give you either 497 00:27:26,430 --> 00:27:30,420 a guarantee that this set of constraints 498 00:27:30,420 --> 00:27:34,090 is unsatisfiable or an actual satisfying assignment that 499 00:27:34,090 --> 00:27:37,300 tells you exactly what the answer is. 500 00:27:40,770 --> 00:27:41,750 Yes? 501 00:27:41,750 --> 00:27:48,451 AUDIENCE: [INAUDIBLE] For example, [INAUDIBLE] 502 00:27:48,451 --> 00:27:50,325 specification I don't think you said anything 503 00:27:50,325 --> 00:27:54,000 about how many bits are used to store an integer. [INAUDIBLE] 504 00:28:00,907 --> 00:28:02,990 ARMANDO SOLAR-LEZAMA: That's a very good question. 505 00:28:02,990 --> 00:28:05,430 And that really has to do with how you 506 00:28:05,430 --> 00:28:07,810 define your constraints, right? 507 00:28:07,810 --> 00:28:18,940 So if you look at our simple example from the beginning, 508 00:28:18,940 --> 00:28:25,140 in this case, we assume that these were the integers as 509 00:28:25,140 --> 00:28:26,640 learned in elementary school. 510 00:28:26,640 --> 00:28:34,420 And that we completely decided to ignore overflow errors. 511 00:28:34,420 --> 00:28:35,920 If you care about overflow errors, 512 00:28:35,920 --> 00:28:39,510 if overflow errors are actually essential to the kind of bugs 513 00:28:39,510 --> 00:28:42,864 you're trying to find, this would not be a good way 514 00:28:42,864 --> 00:28:43,780 to set up the problem. 515 00:28:43,780 --> 00:28:49,780 What you need is to represent these not as integers, 516 00:28:49,780 --> 00:28:50,910 but as bit-vectors. 517 00:28:50,910 --> 00:28:52,940 And the moment you represent them as bit-vectors 518 00:28:52,940 --> 00:28:55,470 you have to have a bit width in mind. 519 00:28:55,470 --> 00:29:01,530 And this goes back to what this modulo theories 520 00:29:01,530 --> 00:29:03,670 aspect in the solver means.
521 00:29:03,670 --> 00:29:05,430 What this modulo theories aspect means 522 00:29:05,430 --> 00:29:08,700 is that the solver is actually extensible 523 00:29:08,700 --> 00:29:10,140 with different theories. 524 00:29:10,140 --> 00:29:15,960 The most popular theories are the theory of bit-vectors, which 525 00:29:15,960 --> 00:29:21,450 are fixed length bit-vectors. 526 00:29:21,450 --> 00:29:24,150 That means that if you're interpreting your formulas 527 00:29:24,150 --> 00:29:26,664 in this theory of fixed length bit-vectors 528 00:29:26,664 --> 00:29:28,580 you have to fix the length of the bit-vectors. 529 00:29:28,580 --> 00:29:31,380 And you have to explicitly specify 530 00:29:31,380 --> 00:29:36,760 that these are going to be 32-bit bit-vectors, or 8-bit 531 00:29:36,760 --> 00:29:39,326 bit-vectors, or 64-bit bit-vectors. 532 00:29:39,326 --> 00:29:42,284 AUDIENCE: So if you wanted to make the bit symbolic 533 00:29:42,284 --> 00:29:46,730 [INAUDIBLE], like this is an x bit, is that-- 534 00:29:46,730 --> 00:29:49,140 ARMANDO SOLAR-LEZAMA: So there's another theory which 535 00:29:49,140 --> 00:29:53,690 is called the theory of arrays. 536 00:29:53,690 --> 00:29:55,740 And we'll talk a little bit more about it, 537 00:29:55,740 --> 00:29:59,150 where unlike the bit-vector theory, 538 00:29:59,150 --> 00:30:02,410 which is designed to be for fixed length things 539 00:30:02,410 --> 00:30:07,360 the theory of arrays is meant to be for collections where 540 00:30:07,360 --> 00:30:10,110 you don't actually know the size a priori. 541 00:30:10,110 --> 00:30:13,040 Now in practice nobody uses the theory 542 00:30:13,040 --> 00:30:16,010 of arrays to model integers, for example, 543 00:30:16,010 --> 00:30:18,100 because it's too expensive. 544 00:30:18,100 --> 00:30:21,250 It becomes way more expensive to reason about 545 00:30:21,250 --> 00:30:23,070 when you don't know what the bound is.
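To see why the choice between mathematical integers and fixed length bit-vectors changes the answer, here is a short Python sketch. The helpers `to_signed` and `bv_add` are hypothetical names introduced for illustration; they mimic two's-complement arithmetic at a chosen bit width and are not part of any solver's API.

```python
def to_signed(value, width):
    """Interpret the low `width` bits of value as a two's-complement integer."""
    mask = (1 << width) - 1
    value &= mask
    # If the top bit is set, the value represents a negative number.
    return value - (1 << width) if value >> (width - 1) else value

def bv_add(a, b, width):
    """Add two numbers the way a fixed length bit-vector of `width` bits would."""
    return to_signed(a + b, width)

INT32_MAX = 2**31 - 1

# Over the mathematical integers, x + 1 > x holds for every x.
math_holds = (INT32_MAX + 1) > INT32_MAX

# Over 32-bit bit-vectors the addition wraps around, so the property fails.
bv32_holds = bv_add(INT32_MAX, 1, 32) > INT32_MAX

# The same wrap-around at 8 bits: 127 + 1 becomes -128.
bv8_result = bv_add(127, 1, 8)
```

A formula such as "x + 1 > x" is therefore valid in the theory of integers but has a counterexample in the theory of 32-bit bit-vectors, which is exactly the kind of overflow bug the integer encoding would hide.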
546 00:30:23,070 --> 00:30:25,840 So generally people use the fixed length theory 547 00:30:25,840 --> 00:30:30,910 of bit-vectors when reasoning about integers or characters 548 00:30:30,910 --> 00:30:33,050 even. 549 00:30:33,050 --> 00:30:41,760 Another very common theory is the theory of actual integer 550 00:30:41,760 --> 00:30:44,520 arithmetic, and in particular linear integer arithmetic. 551 00:30:44,520 --> 00:30:47,200 This is a theory that people like a lot because it 552 00:30:47,200 --> 00:30:50,650 can be reasoned about very, very efficiently, 553 00:30:50,650 --> 00:30:52,930 but it's not particularly good when 554 00:30:52,930 --> 00:30:55,960 you're reasoning about programs, because in general you really 555 00:30:55,960 --> 00:30:59,040 do care about overflow issues. 556 00:30:59,040 --> 00:31:03,680 But it's actually very widely used for many, many things. 557 00:31:03,680 --> 00:31:07,240 The other theory that you're likely to see people using 558 00:31:07,240 --> 00:31:13,535 is the theory of uninterpreted functions. 559 00:31:19,240 --> 00:31:22,060 So what does it mean, the theory of an uninterpreted function? 560 00:31:22,060 --> 00:31:27,200 It means that you have a formula where somewhere in your formula 561 00:31:27,200 --> 00:31:29,350 you know that you're calling a function, 562 00:31:29,350 --> 00:31:31,270 but you know nothing about that function 563 00:31:31,270 --> 00:31:39,200 other than the fact that it is a function, that if you give it 564 00:31:39,200 --> 00:31:42,870 the same inputs you get the same outputs in return.
565 00:31:42,870 --> 00:31:45,190 And it turns out this is very, very useful sometimes 566 00:31:45,190 --> 00:31:47,310 when trying to reason about things 567 00:31:47,310 --> 00:31:53,190 like if you have floating point code, modeling sines, cosines, 568 00:31:53,190 --> 00:31:56,025 square roots can be very messy and expensive, 569 00:31:56,025 --> 00:31:57,650 but you can say, look, I don't actually 570 00:31:57,650 --> 00:32:01,030 care about what the sine function does. 571 00:32:01,030 --> 00:32:03,200 I don't care about what its output is. 572 00:32:03,200 --> 00:32:05,600 All I know is that if I call the sine function 573 00:32:05,600 --> 00:32:07,390 in many different places with the same input 574 00:32:07,390 --> 00:32:08,830 I will get the same output. 575 00:32:08,830 --> 00:32:14,100 And that's enough for me to reason about my code. 576 00:32:14,100 --> 00:32:17,350 And so the most common ones you will 577 00:32:17,350 --> 00:32:21,140 see when analyzing real systems are 578 00:32:21,140 --> 00:32:24,510 bit-vectors to deal with integers, and longs, 579 00:32:24,510 --> 00:32:26,110 and pointers. 580 00:32:26,110 --> 00:32:30,990 Actually, pointers are often represented with integers 581 00:32:30,990 --> 00:32:35,760 because you're generally not going 582 00:32:35,760 --> 00:32:40,500 to be doing complicated bit twiddling on pointers. 583 00:32:40,500 --> 00:32:44,650 Sometimes you will and then you can't use integers anymore. 584 00:32:44,650 --> 00:32:46,210 So OK. 585 00:32:46,210 --> 00:32:48,470 So that's all well and good. 586 00:32:48,470 --> 00:32:52,650 That's what an SMT solver can do for you. 587 00:32:52,650 --> 00:32:54,900 How does it actually work? 588 00:32:54,900 --> 00:32:56,870 What's inside it that makes it work?
589 00:32:56,870 --> 00:33:01,820 And SMT solvers actually rely on our ability 590 00:33:01,820 --> 00:33:04,690 to solve SAT problems, on our ability 591 00:33:04,690 --> 00:33:10,350 to take problems involving just purely Boolean constraints 592 00:33:10,350 --> 00:33:13,650 and Boolean variables, and telling us 593 00:33:13,650 --> 00:33:16,680 whether there is an assignment to these Boolean variables 594 00:33:16,680 --> 00:33:20,370 that is satisfiable or not. 595 00:33:20,370 --> 00:33:24,400 And this is the kind of thing that for many, many years 596 00:33:24,400 --> 00:33:27,416 people in undergrad have been taught that actually this 597 00:33:27,416 --> 00:33:28,690 is an NP-complete problem. 598 00:33:28,690 --> 00:33:30,680 The moment something reduces to SAT 599 00:33:30,680 --> 00:33:33,220 you know you shouldn't do it, but it turns out 600 00:33:33,220 --> 00:33:35,960 that we actually have some very, very good SAT 601 00:33:35,960 --> 00:33:36,720 solvers out there. 602 00:33:36,720 --> 00:33:42,060 Probably most of you even built one as part of 6005. 603 00:33:42,060 --> 00:33:43,940 Am I right? 604 00:33:43,940 --> 00:33:46,200 Or some of you did. 605 00:33:46,200 --> 00:33:50,780 So I'll tell you the basic idea behind how SAT solvers work. 606 00:33:50,780 --> 00:33:56,140 And the basic idea is that you take all your constraints 607 00:33:56,140 --> 00:34:00,440 on your Boolean variables and you put them into a database. 608 00:34:00,440 --> 00:34:03,450 And what is a constraint? 609 00:34:03,450 --> 00:34:06,950 Is this too small or can people in the back read this? 610 00:34:09,662 --> 00:34:10,570 AUDIENCE: Too small. 611 00:34:10,570 --> 00:34:11,179 ARMANDO SOLAR-LEZAMA: Too small? 612 00:34:11,179 --> 00:34:11,679 OK. 613 00:34:15,900 --> 00:34:19,469 Let's see if we can make this bigger. 614 00:34:42,040 --> 00:34:45,331 Is this a little bit better? 615 00:34:45,331 --> 00:34:46,305 AUDIENCE: [INAUDIBLE]. 
616 00:34:46,305 --> 00:34:47,770 ARMANDO SOLAR-LEZAMA: OK. 617 00:34:47,770 --> 00:34:51,000 Well, here's what I'll do. 618 00:34:51,000 --> 00:34:54,030 I will annotate and I will narrate it as I go. 619 00:34:54,030 --> 00:34:55,810 And I'll post the slides later. 620 00:34:55,810 --> 00:34:57,660 So people can see what it says. 621 00:34:57,660 --> 00:35:01,650 So what we have here in a SAT problem 622 00:35:01,650 --> 00:35:06,770 is that we have all these variables that represent 623 00:35:06,770 --> 00:35:08,460 Boolean unknowns, right? 624 00:35:08,460 --> 00:35:11,620 We want to know is it possible for x 625 00:35:11,620 --> 00:35:15,170 to be true, and y to be true, and z to be true at the same time, 626 00:35:15,170 --> 00:35:15,820 for example. 627 00:35:15,820 --> 00:35:16,320 Right? 628 00:35:16,320 --> 00:35:18,330 And these are our unknowns. 629 00:35:18,330 --> 00:35:22,750 And all the constraints are in conjunctive normal form. 630 00:35:22,750 --> 00:35:24,590 What that means is all our constraints 631 00:35:24,590 --> 00:35:33,920 are of the form either x1 is true, or x2 is true, 632 00:35:33,920 --> 00:35:37,951 or x3 is true, for example. 633 00:35:37,951 --> 00:35:38,450 Right? 634 00:35:38,450 --> 00:35:42,200 So what we have is we have all our constraints in this form 635 00:35:42,200 --> 00:35:45,130 and some of them might say, well, either x1 is true, 636 00:35:45,130 --> 00:35:48,970 or x2 is false, or x3 is false. 637 00:35:48,970 --> 00:35:49,470 Right? 638 00:35:49,470 --> 00:35:50,880 So we have constraints. 639 00:35:50,880 --> 00:35:53,500 All our constraints are of this form. 640 00:35:53,500 --> 00:35:55,780 And you probably remember from discrete math 641 00:35:55,780 --> 00:35:59,700 that any Boolean formula can be represented 642 00:35:59,700 --> 00:36:01,264 in conjunctive normal form.
643 00:36:01,264 --> 00:36:03,680 And it has the added benefit that it's actually very, very 644 00:36:03,680 --> 00:36:08,370 easy to translate from arbitrary representations of a formula 645 00:36:08,370 --> 00:36:11,970 to this conjunctive normal form, which means whatever 646 00:36:11,970 --> 00:36:15,180 representation you're using to represent Boolean formulas, 647 00:36:15,180 --> 00:36:19,130 you can very easily convert it to this format. 648 00:36:19,130 --> 00:36:22,730 So what we have is we have a database 649 00:36:22,730 --> 00:36:25,230 with lots of constraints of this form. 650 00:36:25,230 --> 00:36:27,380 And what the SAT solver is going to do 651 00:36:27,380 --> 00:36:29,540 is it's going to pick one of these variables at random. 652 00:36:29,540 --> 00:36:31,950 Let's say it's going to pick x1. 653 00:36:31,950 --> 00:36:36,180 And it's going to say, why don't we set x1 to true? 654 00:36:36,180 --> 00:36:38,120 I don't know anything about this problem. 655 00:36:38,120 --> 00:36:41,130 Might as well try setting it to true. 656 00:36:41,130 --> 00:36:44,050 And then what will happen is you'll have some constraints 657 00:36:44,050 --> 00:36:48,390 that mention x1 and let's say that you have a constraint that 658 00:36:48,390 --> 00:36:53,160 says either x1 is false or x7 is true. 659 00:36:53,160 --> 00:36:53,660 Right? 660 00:36:53,660 --> 00:36:56,700 So if you know that x1 is true and you 661 00:36:56,700 --> 00:37:00,430 know that either x1 is false or x7 is true, 662 00:37:00,430 --> 00:37:04,105 what do you know about x7? 663 00:37:04,105 --> 00:37:05,145 AUDIENCE: [INAUDIBLE]. 664 00:37:05,145 --> 00:37:06,270 ARMANDO SOLAR-LEZAMA: Yeah. 665 00:37:06,270 --> 00:37:06,990 It has to be true. 666 00:37:06,990 --> 00:37:07,489 Right? 667 00:37:07,489 --> 00:37:09,000 Because otherwise this constraint 668 00:37:09,000 --> 00:37:10,660 would not be satisfied.
669 00:37:10,660 --> 00:37:16,420 And so now you've propagated this assignment from x1 to x7. 670 00:37:16,420 --> 00:37:19,370 And let's say now you pick some other random variable. 671 00:37:19,370 --> 00:37:22,090 You say, well, what about x5? 672 00:37:22,090 --> 00:37:24,140 Why don't we try x5 being true? 673 00:37:24,140 --> 00:37:24,640 Right? 674 00:37:24,640 --> 00:37:27,600 And now let's say that you have a constraint that says, 675 00:37:27,600 --> 00:37:41,850 well, either x7 is false, or x6 is true, or x5 is false. 676 00:37:41,850 --> 00:37:42,350 Right? 677 00:37:42,350 --> 00:37:48,500 So I have x5 being true and I have x7 being true. 678 00:37:48,500 --> 00:37:52,640 So that means x6 now has to be true. 679 00:37:52,640 --> 00:37:53,140 Right? 680 00:37:53,140 --> 00:37:56,760 Because otherwise this constraint would be violated. 681 00:37:56,760 --> 00:37:59,520 And so from that the system infers, OK. 682 00:37:59,520 --> 00:38:01,500 So x6 has to be true. 683 00:38:01,500 --> 00:38:04,680 And it keeps at this process essentially 684 00:38:04,680 --> 00:38:06,820 trying out assignments. 685 00:38:06,820 --> 00:38:09,290 And then looking at all the available clauses, 686 00:38:09,290 --> 00:38:10,750 and looking at, hey, are there are 687 00:38:10,750 --> 00:38:14,080 other things that are implied by the assignments 688 00:38:14,080 --> 00:38:16,090 that I have so far? 689 00:38:16,090 --> 00:38:20,190 And following those implications until one of two things 690 00:38:20,190 --> 00:38:20,690 happens. 691 00:38:20,690 --> 00:38:23,480 Either you keep following implications and trying 692 00:38:23,480 --> 00:38:26,490 random things and eventually you have set a value 693 00:38:26,490 --> 00:38:28,460 to every single variable without ever 694 00:38:28,460 --> 00:38:30,550 running into a contradiction. 695 00:38:30,550 --> 00:38:32,580 And then you're done. 696 00:38:32,580 --> 00:38:33,080 Right? 
697 00:38:33,080 --> 00:38:37,240 You found a satisfying assignment, or what can happen 698 00:38:37,240 --> 00:38:38,580 is you run into a contradiction. 699 00:38:38,580 --> 00:38:45,690 You run into a place where there was a clause that forced x4 700 00:38:45,690 --> 00:38:49,900 to be true, except there was another clause that forced x4 701 00:38:49,900 --> 00:38:50,950 to be false. 702 00:38:50,950 --> 00:38:55,080 And if there's one rule of Boolean algebra that everybody 703 00:38:55,080 --> 00:38:58,090 should know, is that you cannot have a variable be true and be 704 00:38:58,090 --> 00:38:59,860 false at the same time. 705 00:38:59,860 --> 00:39:00,360 Right? 706 00:39:00,360 --> 00:39:01,859 And so what that tells you is you've 707 00:39:01,859 --> 00:39:03,690 run into a contradiction. 708 00:39:03,690 --> 00:39:05,840 You clearly did something wrong in one 709 00:39:05,840 --> 00:39:08,200 of these random assignments that you were trying. 710 00:39:08,200 --> 00:39:10,680 So now let's analyze this contradiction. 711 00:39:10,680 --> 00:39:12,820 Let's figure out what were the assignments that 712 00:39:12,820 --> 00:39:16,790 led to this contradiction. 713 00:39:16,790 --> 00:39:20,690 And based on the assignments that led to that contradiction, 714 00:39:20,690 --> 00:39:25,010 let's come up with a new conflict clause that 715 00:39:25,010 --> 00:39:27,560 summarizes that contradiction. 716 00:39:27,560 --> 00:39:31,170 So in this case, what would happen 717 00:39:31,170 --> 00:39:38,180 is that you have x1 being false, and x5 being false. 718 00:39:38,180 --> 00:39:41,130 And x9 being false, right? 
719 00:39:41,130 --> 00:39:44,530 So essentially what this is saying is that based on what I 720 00:39:44,530 --> 00:39:46,840 learned from these random assignments I discovered that 721 00:39:46,840 --> 00:39:49,560 one of these things has to be true, 722 00:39:49,560 --> 00:39:53,440 that it cannot be the case that x1 is true, and x5 is true, 723 00:39:53,440 --> 00:39:55,990 and x9 is false. 724 00:39:55,990 --> 00:39:57,000 That cannot happen. 725 00:39:57,000 --> 00:40:00,240 And I know that cannot happen because when I tried that 726 00:40:00,240 --> 00:40:00,965 things blew up. 727 00:40:00,965 --> 00:40:03,050 I ended up with a contradiction. 728 00:40:03,050 --> 00:40:06,330 And so what the SAT solver is doing is trying random assignments, 729 00:40:06,330 --> 00:40:08,030 propagating them through. 730 00:40:08,030 --> 00:40:09,630 When it runs into contradictions it's 731 00:40:09,630 --> 00:40:12,600 analyzing the set of implications 732 00:40:12,600 --> 00:40:14,130 that led to that contradiction. 733 00:40:14,130 --> 00:40:17,690 And summarizing that in a new constraint that 734 00:40:17,690 --> 00:40:19,650 will make sure that it never runs 735 00:40:19,650 --> 00:40:21,980 into this contradiction again, that it never 736 00:40:21,980 --> 00:40:25,574 runs into this particular problem again. 737 00:40:25,574 --> 00:40:26,240 Other questions? 738 00:40:34,960 --> 00:40:35,460 OK. 739 00:40:35,460 --> 00:40:36,730 So so far so good. 740 00:40:36,730 --> 00:40:40,040 So we can really think of the SAT solver 741 00:40:40,040 --> 00:40:43,830 as just a black box that given a Boolean constraint 742 00:40:43,830 --> 00:40:47,380 it can either say, no, this Boolean constraint is 743 00:40:47,380 --> 00:40:51,130 unsatisfiable, or it can say, yeah, here's 744 00:40:51,130 --> 00:40:53,270 a satisfying assignment to that Boolean constraint. 745 00:40:53,270 --> 00:40:57,137 So SMT solvers are built on top of SAT solvers.
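The propagation steps narrated above can be sketched in a few lines of Python. This is a minimal illustration, not how a production SAT solver is written: clauses are lists of integers (a positive number `n` means "xn is true", a negative number means "xn is false"), the names `unit_propagate` and `solve` are hypothetical, and on a contradiction this sketch simply backtracks and tries the opposite guess instead of learning a conflict clause the way the solver described in the lecture does.

```python
def unit_propagate(clauses, assignment):
    """Extend assignment with forced values; return None on a contradiction."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue  # clause already satisfied
            unassigned = [l for l in clause if abs(l) not in assignment]
            if not unassigned:
                return None  # every literal is false: contradiction
            if len(unassigned) == 1:
                # Only one way left to satisfy this clause, so it is forced.
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment

def solve(clauses, assignment=None):
    """Guess a value, propagate, and backtrack on contradiction."""
    assignment = unit_propagate(clauses, assignment or {})
    if assignment is None:
        return None
    variables = {abs(l) for c in clauses for l in c}
    free = sorted(variables - set(assignment))
    if not free:
        return assignment  # every variable has a value and no clause failed
    for guess in (True, False):
        result = solve(clauses, {**assignment, free[0]: guess})
        if result is not None:
            return result
    return None

# The two clauses from the narration:
# (x1 false or x7 true) and (x7 false or x6 true or x5 false).
clauses = [[-1, 7], [-7, 6, -5]]
after = unit_propagate(clauses, {1: True, 5: True})
```

With x1 and x5 guessed true, `after` also contains x7 and x6 set to true, which is the same chain of implications worked through by hand above. Adding a clause such as `[-6]` (x6 must be false) makes that pair of guesses contradictory, so `unit_propagate` returns `None` and `solve` moves on to a different guess.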
746 00:40:57,137 --> 00:40:58,720 And what they're able to do is they're 747 00:40:58,720 --> 00:41:01,670 able to combine the power of the SAT solver 748 00:41:01,670 --> 00:41:08,130 to solve these NP-complete SAT problems with domain 749 00:41:08,130 --> 00:41:12,190 specific reasoning to reason about the different theories 750 00:41:12,190 --> 00:41:13,000 that are supported. 751 00:41:13,000 --> 00:41:15,460 So to give you an idea of how it works, 752 00:41:15,460 --> 00:41:18,226 and this is going to be a fairly high level, 753 00:41:18,226 --> 00:41:19,600 but to give you an idea of how it 754 00:41:19,600 --> 00:41:22,000 works let's say that you have a formula like this, right? 755 00:41:22,000 --> 00:41:25,635 So you say x is greater than 5 and y is less than 5. 756 00:41:28,890 --> 00:41:33,791 And either y is greater than x or y is greater than 2. 757 00:41:33,791 --> 00:41:34,290 Right? 758 00:41:34,290 --> 00:41:37,310 So is that satisfiable? 759 00:41:37,310 --> 00:41:39,490 Can we find a satisfying assignment for that? 760 00:41:39,490 --> 00:41:46,940 So what an SMT solver can do is separate out the part 761 00:41:46,940 --> 00:41:50,950 of this formula that requires domain reasoning, that 762 00:41:50,950 --> 00:41:52,930 requires reasoning in the theory, in this case, 763 00:41:52,930 --> 00:41:54,150 of integers. 764 00:41:54,150 --> 00:41:55,730 With the part of this formula that 765 00:41:55,730 --> 00:41:57,770 is just the Boolean structure. 766 00:41:57,770 --> 00:42:01,616 So if you separate the Boolean structure here, 767 00:42:01,616 --> 00:42:02,990 essentially what you're saying is 768 00:42:02,990 --> 00:42:09,034 that there's some formula, F1 and some formula F2, 769 00:42:09,034 --> 00:42:11,800 and either F3 or F4. 770 00:42:11,800 --> 00:42:12,300 Right? 771 00:42:12,300 --> 00:42:15,740 And now this is a purely Boolean problem, right? 
772 00:42:15,740 --> 00:42:18,060 It's just a problem of can I find a satisfying 773 00:42:18,060 --> 00:42:22,110 assignment for that? 774 00:42:22,110 --> 00:42:24,280 Is there a satisfying assignment for that? 775 00:42:24,280 --> 00:42:26,570 And, again, this is just a Boolean formula. 776 00:42:26,570 --> 00:42:30,385 Goes to a SAT solver and the SAT solver can say, yeah. 777 00:42:33,820 --> 00:42:36,010 I can find a satisfying assignment for this. 778 00:42:36,010 --> 00:42:39,220 And I can find a satisfying assignment 779 00:42:39,220 --> 00:42:43,740 by making this true, and this true, and this true. 780 00:42:43,740 --> 00:42:44,240 Right? 781 00:42:44,240 --> 00:42:48,010 It's a satisfying assignment for the Boolean formula. 782 00:42:48,010 --> 00:42:52,670 So now we have a question that we can go and ask 783 00:42:52,670 --> 00:42:54,160 the domain specific solver. 784 00:42:54,160 --> 00:42:59,700 In this case just a linear arithmetic solver. 785 00:42:59,700 --> 00:43:01,130 So we can go to the linear solver 786 00:43:01,130 --> 00:43:04,050 and say, hey, so the SAT solver claims 787 00:43:04,050 --> 00:43:06,990 that this is a reasonable assignment, that if I 788 00:43:06,990 --> 00:43:08,930 can make that assignment work, then 789 00:43:08,930 --> 00:43:10,890 my formula will be satisfied. 790 00:43:10,890 --> 00:43:17,160 So I can go and say, well F1 was actually this, and F2 was this, 791 00:43:17,160 --> 00:43:18,740 and F3 was this. 792 00:43:18,740 --> 00:43:22,290 So I can ask a theory solver, is it possible to get an x and a y 793 00:43:22,290 --> 00:43:26,030 such that x is greater than 5, y is less than 5, 794 00:43:26,030 --> 00:43:28,200 and y is greater than x? 795 00:43:28,200 --> 00:43:32,410 Right, so now this is a question purely about linear arithmetic. 796 00:43:32,410 --> 00:43:36,484 There's no Boolean logic involved. 797 00:43:36,484 --> 00:43:37,400 And what's the answer? 798 00:43:39,960 --> 00:43:40,460 No. 
799 00:43:40,460 --> 00:43:40,960 Right? 800 00:43:40,960 --> 00:43:44,210 And there are traditional methods 801 00:43:44,210 --> 00:43:47,960 to solve these kinds of linear problems. 802 00:43:47,960 --> 00:43:50,730 You could use the simplex method, for example, 803 00:43:50,730 --> 00:43:53,570 to solve systems of linear inequalities. 804 00:43:53,570 --> 00:43:55,070 There's lots of methods that you can 805 00:43:55,070 --> 00:43:57,530 use to solve systems of linear inequalities. 806 00:43:57,530 --> 00:44:00,770 The point is the theory solver knows about all of those. 807 00:44:00,770 --> 00:44:03,630 And the theory solver can say, no. 808 00:44:03,630 --> 00:44:04,670 This will not work. 809 00:44:04,670 --> 00:44:07,640 This is an assignment that will not work. 810 00:44:07,640 --> 00:44:13,510 And so the theory solver can now go back to the SAT solver 811 00:44:13,510 --> 00:44:15,740 and not just tell the SAT solver, hey, that thing 812 00:44:15,740 --> 00:44:18,300 that you did, that didn't work. 813 00:44:18,300 --> 00:44:20,920 But it can also give more of an explanation. 814 00:44:20,920 --> 00:44:24,370 So in this case, what you can conclude from the fact that 815 00:44:24,370 --> 00:44:26,880 this didn't work is that actually in addition 816 00:44:26,880 --> 00:44:31,360 to satisfying this formula you also want to satisfy the fact 817 00:44:31,360 --> 00:44:40,500 that I cannot have F1, and F2, and F3, right? 818 00:44:40,500 --> 00:44:42,810 My theory solver has told me that these three 819 00:44:42,810 --> 00:44:44,460 things are mutually exclusive. 820 00:44:44,460 --> 00:44:47,890 I cannot satisfy all three of them together.
821 00:44:47,890 --> 00:44:49,660 And so now that's a piece of information 822 00:44:49,660 --> 00:44:52,230 that I can go back to the SAT solver 823 00:44:52,230 --> 00:44:54,320 and ask the SAT solver, hey, can you 824 00:44:54,320 --> 00:44:57,000 give me a solution that satisfies 825 00:44:57,000 --> 00:44:59,440 not only the constraint that you had in the beginning, 826 00:44:59,440 --> 00:45:03,410 but also this new constraint that the theory 827 00:45:03,410 --> 00:45:05,091 solver discovered? 828 00:45:05,091 --> 00:45:05,590 Right? 829 00:45:05,590 --> 00:45:09,587 So now is there some other assignment that satisfies now 830 00:45:09,587 --> 00:45:10,670 both of these constraints? 831 00:45:18,950 --> 00:45:21,440 AUDIENCE: [INAUDIBLE]. 832 00:45:21,440 --> 00:45:23,070 ARMANDO SOLAR-LEZAMA: Yeah. 833 00:45:23,070 --> 00:45:25,870 So there's an assignment where this becomes false. 834 00:45:25,870 --> 00:45:27,415 And this becomes true. 835 00:45:27,415 --> 00:45:29,040 And that's an assignment that satisfies 836 00:45:29,040 --> 00:45:30,160 the constraint on the top. 837 00:45:30,160 --> 00:45:32,250 It satisfies the constraint on the bottom. 838 00:45:32,250 --> 00:45:34,480 And so once again that's an assignment 839 00:45:34,480 --> 00:45:37,856 that leads to a new constraint. 840 00:45:37,856 --> 00:45:39,230 So this constraint now goes away. 841 00:45:39,230 --> 00:45:40,900 We don't care about it any more. 842 00:45:40,900 --> 00:45:44,790 We have a new constraint that we can ask our theory solver, hey, 843 00:45:44,790 --> 00:45:46,520 is this possible? 844 00:45:46,520 --> 00:45:48,870 And in this case the theory solver says, yeah. 845 00:45:48,870 --> 00:45:50,310 That actually is possible. 846 00:45:50,310 --> 00:45:57,630 You can make y equal 3 and x equal 6. 847 00:45:57,630 --> 00:45:59,100 And it works. 848 00:45:59,100 --> 00:45:59,600 Right?
849 00:45:59,600 --> 00:46:02,820 And so now you have an assignment 850 00:46:02,820 --> 00:46:07,150 that satisfies the formula in the theory 851 00:46:07,150 --> 00:46:11,127 and that satisfies the Boolean structure 852 00:46:11,127 --> 00:46:12,085 behind this assignment. 853 00:46:12,085 --> 00:46:15,240 And with that the system can come back and tell you, yeah. 854 00:46:15,240 --> 00:46:19,660 Here's an assignment that satisfies all your constraints. 855 00:46:19,660 --> 00:46:21,870 And so it's this interaction back and forth 856 00:46:21,870 --> 00:46:25,660 between the theory solver and the SAT solver. 857 00:46:25,660 --> 00:46:27,610 And really the ability to be able to reason 858 00:46:27,610 --> 00:46:31,440 about very, very large and very complicated Boolean formulas. 859 00:46:31,440 --> 00:46:36,990 That's what makes symbolic execution possible. 860 00:46:36,990 --> 00:46:41,910 So now that we have that the next question is, 861 00:46:41,910 --> 00:46:52,620 so how do we go from a program to a constraint 862 00:46:52,620 --> 00:46:54,090 that we can give to an SMT solver? 863 00:46:54,090 --> 00:46:54,630 Yes? 864 00:46:54,630 --> 00:46:56,000 AUDIENCE: Sorry for going back. 865 00:46:56,000 --> 00:46:57,125 ARMANDO SOLAR-LEZAMA: Sure. 866 00:46:57,125 --> 00:46:58,622 AUDIENCE: [INAUDIBLE] previously. 867 00:46:58,622 --> 00:47:05,608 But could you run me again the whole issue of constructing 868 00:47:05,608 --> 00:47:07,105 the SMT statements? 869 00:47:07,105 --> 00:47:10,620 Is it an NP-complete or is it not? [INAUDIBLE]. 870 00:47:10,620 --> 00:47:12,670 ARMANDO SOLAR-LEZAMA: So the problems 871 00:47:12,670 --> 00:47:15,190 that the SMT solvers are solving, 872 00:47:15,190 --> 00:47:20,180 those are NP-complete problems in the best of cases. 
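The back and forth between the SAT solver and the theory solver on the example formula (x > 5, y < 5, and either y > x or y > 2) can be sketched as a small loop. Everything here is a stand-in chosen for illustration: `sat_solver` just enumerates truth values for F1 through F4, `theory_solver` searches a small integer range rather than running simplex, and a rejected assignment is blocked as a whole, which is cruder than the "not F1 and F2 and F3 together" clause described in the lecture.

```python
from itertools import product

# Theory atoms of the example formula, named as in the lecture.
ATOMS = {
    "F1": lambda x, y: x > 5,
    "F2": lambda x, y: y < 5,
    "F3": lambda x, y: y > x,
    "F4": lambda x, y: y > 2,
}

def skeleton(v):
    """Boolean structure of the formula: F1 and F2 and (F3 or F4)."""
    return v["F1"] and v["F2"] and (v["F3"] or v["F4"])

def sat_solver(blocked):
    """Stand-in SAT solver: enumerate truth assignments, skipping blocked ones."""
    names = list(ATOMS)
    for values in product([True, False], repeat=len(names)):
        v = dict(zip(names, values))
        if skeleton(v) and v not in blocked:
            return v
    return None

def theory_solver(v, lo=-10, hi=10):
    """Stand-in theory solver: bounded search for x, y matching each atom's truth value."""
    for x, y in product(range(lo, hi + 1), repeat=2):
        if all(ATOMS[n](x, y) == v[n] for n in ATOMS):
            return {"x": x, "y": y}
    return None

def smt_solve():
    blocked = []  # Boolean assignments the theory solver has rejected
    while True:
        v = sat_solver(blocked)
        if v is None:
            return None          # no Boolean assignment left: unsatisfiable
        model = theory_solver(v)
        if model is not None:
            return model         # Boolean structure and theory both satisfied
        blocked.append(v)        # feed the rejection back to the SAT solver

result = smt_solve()
```

In this sketch the first Boolean assignments, which make F3 (y > x) true, are rejected by the theory check, and the loop ends with `result` equal to x = 6 and y = 3, the same values given in the lecture. Because the theory check only searches a bounded range, a `None` from this sketch would not be a proof of unsatisfiability the way an answer from a real solver is.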
873 00:47:20,180 --> 00:47:24,270 So SAT itself is the canonical NP-complete problem, 874 00:47:24,270 --> 00:47:28,630 but a lot of solvers these days even include support 875 00:47:28,630 --> 00:47:34,590 for some theories that are outright undecidable. 876 00:47:34,590 --> 00:47:35,270 So-- 877 00:47:35,270 --> 00:47:39,050 AUDIENCE: So how do you approach that in your system? 878 00:47:39,050 --> 00:47:42,840 ARMANDO SOLAR-LEZAMA: Well, at the end of the day what you get 879 00:47:42,840 --> 00:47:48,590 is you're going to create a constraint from this program. 880 00:47:48,590 --> 00:47:51,890 You're going to give it to the SMT solver. 881 00:47:51,890 --> 00:47:54,120 And the fact that these are NP-complete problems, 882 00:47:54,120 --> 00:47:56,630 or the fact that they're undecidable, what it means 883 00:47:56,630 --> 00:48:03,570 is that if you're lucky, you will get an answer in seconds. 884 00:48:03,570 --> 00:48:06,770 And if you're not lucky, then it might 885 00:48:06,770 --> 00:48:09,670 take longer than the age of the universe for the thing 886 00:48:09,670 --> 00:48:11,009 to give you an answer. 887 00:48:11,009 --> 00:48:11,550 AUDIENCE: OK. 888 00:48:11,550 --> 00:48:14,841 How often do you run into cases where your system just 889 00:48:14,841 --> 00:48:18,746 flat-lines and says, sorry, I just can't figure this out yet? 890 00:48:18,746 --> 00:48:20,560 Has that ever happened or is that just-- 891 00:48:20,560 --> 00:48:21,070 ARMANDO SOLAR-LEZAMA: Yes. 892 00:48:21,070 --> 00:48:22,140 Yes, it does happen. 893 00:48:22,140 --> 00:48:24,666 And a big part of the engineering 894 00:48:24,666 --> 00:48:27,340 of these kind of tools is making sure 895 00:48:27,340 --> 00:48:30,420 that this happens as infrequently as possible. 896 00:48:30,420 --> 00:48:35,890 And part of what makes this work at all 897 00:48:35,890 --> 00:48:40,530 is that we're not solving random SAT problems.
898 00:48:40,530 --> 00:48:44,450 We're not solving completely random bit-vector problems. 899 00:48:44,450 --> 00:48:47,390 We're solving problems that have a certain structure to them 900 00:48:47,390 --> 00:48:50,760 that a person was able to look at 901 00:48:50,760 --> 00:48:53,750 and at least have some confidence that this worked, right? 902 00:48:53,750 --> 00:48:57,070 Build some argument in their head for why this worked. 903 00:48:57,070 --> 00:49:00,260 And so what the solvers are trying to do 904 00:49:00,260 --> 00:49:02,640 is essentially exploit that structure. 905 00:49:02,640 --> 00:49:05,260 And, for example, the description 906 00:49:05,260 --> 00:49:08,194 that I gave you of what the SAT solver is doing internally, 907 00:49:08,194 --> 00:49:10,110 that's taking advantage of the fact that, yes, 908 00:49:10,110 --> 00:49:13,390 your problem might have a million Boolean variables, 909 00:49:13,390 --> 00:49:15,280 but actually most of those variables 910 00:49:15,280 --> 00:49:18,430 are very tightly dependent on the values of each other. 911 00:49:18,430 --> 00:49:20,990 So the number of degrees of freedom in the problem 912 00:49:20,990 --> 00:49:23,730 is actually much smaller than what the million 913 00:49:23,730 --> 00:49:24,848 variables would suggest. 914 00:49:24,848 --> 00:49:27,056 AUDIENCE: So what you're saying is that this isn't an exam 915 00:49:27,056 --> 00:49:27,540 question. 916 00:49:27,540 --> 00:49:28,024 This is real life. 917 00:49:28,024 --> 00:49:29,476 And someone built this system. 918 00:49:29,476 --> 00:49:30,444 It was supposed to work and make sense. 919 00:49:30,444 --> 00:49:32,138 So it's probably not going to be one 920 00:49:32,138 --> 00:49:34,292 of those wildly bizarre theoretical [INAUDIBLE]. 921 00:49:34,292 --> 00:49:35,750 ARMANDO SOLAR-LEZAMA: That's right.
922 00:49:38,780 --> 00:49:40,760 And in practice what happens when 923 00:49:40,760 --> 00:49:43,020 you use these tools is that the thing you always do 924 00:49:43,020 --> 00:49:45,180 is set timeouts. 925 00:49:45,180 --> 00:49:49,864 So generally, what happens is because it's exponential, 926 00:49:49,864 --> 00:49:51,780 exponential doesn't mean that you can't do it. 927 00:49:51,780 --> 00:49:54,820 Exponential just means that there's a brick wall, 928 00:49:54,820 --> 00:49:57,700 that before that brick wall things will work, 929 00:49:57,700 --> 00:49:59,620 and in fact, they will work really fast. 930 00:49:59,620 --> 00:50:00,120 Right? 931 00:50:00,120 --> 00:50:01,660 The exponential works in both ways. 932 00:50:01,660 --> 00:50:04,480 Yes, when you're going out then things 933 00:50:04,480 --> 00:50:06,520 are growing very quickly, but when 934 00:50:06,520 --> 00:50:09,980 you're going toward smaller problems, or simpler problems, 935 00:50:09,980 --> 00:50:12,490 things are also getting faster very, very quickly. 936 00:50:12,490 --> 00:50:17,120 So in general what that means is that lots of problems 937 00:50:17,120 --> 00:50:19,190 finish very, very quickly. 938 00:50:19,190 --> 00:50:21,350 And then some problems time out. 939 00:50:21,350 --> 00:50:24,630 And the key is to engineer things in such a way 940 00:50:24,630 --> 00:50:28,990 that among the problems that finish quickly are actually 941 00:50:28,990 --> 00:50:30,960 problems of practical use. 942 00:50:30,960 --> 00:50:33,410 Or problems that will actually point you 943 00:50:33,410 --> 00:50:35,450 to security vulnerabilities in your system, 944 00:50:35,450 --> 00:50:39,560 will point you to bugs, will point you to a path 945 00:50:39,560 --> 00:50:41,390 that you maybe haven't explored before, 946 00:50:41,390 --> 00:50:43,560 or inputs that will take you down paths that you 947 00:50:43,560 --> 00:50:45,432 hadn't explored before. 948 00:50:45,432 --> 00:50:46,207 AUDIENCE: Thanks.
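[Editor's sketch] The habit of always setting a timeout can be illustrated with a small Python sketch. This is not how a real SMT solver works internally: the function `solve_with_timeout` and its parameters are hypothetical, and the brute-force enumeration is only an assumed stand-in for a solver query. What it shows is the three possible outcomes the lecture describes: an answer with a witness, a definite "no", or giving up when the budget runs out.

```python
import itertools
import time

def solve_with_timeout(constraint, num_vars, values, budget_seconds):
    """Brute-force stand-in for a solver call with a time budget (hypothetical helper).

    Returns ("sat", assignment), ("unsat", None), or ("unknown", None)
    when the budget runs out before the search finishes.
    """
    deadline = time.monotonic() + budget_seconds
    for candidate in itertools.product(values, repeat=num_vars):
        if time.monotonic() >= deadline:
            return ("unknown", None)      # timed out: no verdict either way
        if constraint(*candidate):
            return ("sat", candidate)     # found a concrete witness
    return ("unsat", None)                # exhausted the (finite) search space

# A small query finishes well inside its budget.
easy = solve_with_timeout(lambda x, y: x > y and x + y == 7, 2, range(16), 1.0)

# A contradictory query is refuted by exhausting the small space.
contradiction = solve_with_timeout(lambda x, y: x > y and x < y, 2, range(16), 1.0)

# With no budget at all, the caller gets "unknown" rather than a wrong answer.
gave_up = solve_with_timeout(lambda x, y: x > y, 2, range(16), 0.0)
```

The design point is that "unknown" is a distinct result: a timeout says nothing about whether the path is feasible, so a tool can move on to a different path instead of waiting.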
949 00:50:46,207 --> 00:50:47,790 ARMANDO SOLAR-LEZAMA: Other questions? 950 00:50:52,550 --> 00:50:53,460 All right. 951 00:50:53,460 --> 00:50:57,750 So we know how to go from a formula, 952 00:50:57,750 --> 00:51:01,690 from a set of constraints, to an answer that will either say, 953 00:51:01,690 --> 00:51:03,170 yes, this formula has a solution. 954 00:51:03,170 --> 00:51:08,060 And here's a solution, or no, this formula is unsatisfiable. 955 00:51:08,060 --> 00:51:10,950 There is no input that satisfies this. 956 00:51:10,950 --> 00:51:15,310 So now how do we get a formula from a program? 957 00:51:15,310 --> 00:51:18,970 So one of the things that you have 958 00:51:18,970 --> 00:51:20,730 when you're doing symbolic execution 959 00:51:20,730 --> 00:51:23,035 is that when you get to a branch you 960 00:51:23,035 --> 00:51:26,600 don't know which direction the branch is going to go. 961 00:51:26,600 --> 00:51:30,660 Now there are two things that you can do in that case. 962 00:51:30,660 --> 00:51:35,040 One is to do what we did in the earlier example, which is just 963 00:51:35,040 --> 00:51:37,960 to say, I'm going to take both branches at the same time. 964 00:51:37,960 --> 00:51:40,790 I'm going to collect what happens in both branches, 965 00:51:40,790 --> 00:51:42,270 merge at the end. 966 00:51:42,270 --> 00:51:46,100 That is a strategy that is often used 967 00:51:46,100 --> 00:51:50,710 when you're trying to get very strong guarantees in general. 968 00:51:50,710 --> 00:51:54,080 But it's a strategy that doesn't work too well 969 00:51:54,080 --> 00:51:56,060 with modern SMT solvers. 970 00:51:56,060 --> 00:52:02,674 So often people prefer to do one path at a time exploration. 971 00:52:02,674 --> 00:52:04,090 And what that means is that you're 972 00:52:04,090 --> 00:52:06,730 going to pick a path down your program. 973 00:52:06,730 --> 00:52:10,420 And then you're going to create a formula for that path.
974 00:52:10,420 --> 00:52:13,800 So you're going to ask, find me an input that goes down 975 00:52:13,800 --> 00:52:18,640 this path and that satisfies my constraint, 976 00:52:18,640 --> 00:52:21,880 or that violates my property, that 977 00:52:21,880 --> 00:52:26,370 goes out of bounds in my buffer, or that causes a null pointer 978 00:52:26,370 --> 00:52:27,840 error. 979 00:52:27,840 --> 00:52:29,860 And then if you can't find one then 980 00:52:29,860 --> 00:52:32,020 you try a different path and a different path. 981 00:52:32,020 --> 00:52:38,260 And you do these path explorations one at a time. 982 00:52:38,260 --> 00:52:42,000 So that's the strategy that we're going to talk about now. 983 00:52:42,000 --> 00:52:44,900 It's a little bit easier to describe how to do it. 984 00:52:44,900 --> 00:52:49,440 So let's say that we have a problem like this. 985 00:52:49,440 --> 00:52:51,690 So, by the way, I switched representations. 986 00:52:51,690 --> 00:52:54,170 So I'm no longer representing the program as a block of code; 987 00:52:54,170 --> 00:52:58,220 I'm representing it as a control flow graph. 988 00:52:58,220 --> 00:53:00,610 Is everybody here familiar with a control flow graph? 989 00:53:00,610 --> 00:53:03,930 Or is anybody here not familiar with a control flow graph? 990 00:53:03,930 --> 00:53:05,790 It's just a representation of a program that 991 00:53:05,790 --> 00:53:08,940 makes branches more explicit. 992 00:53:08,940 --> 00:53:11,420 So let's pick a path. 993 00:53:13,940 --> 00:53:17,610 And so let's say that we care about this path, right, 994 00:53:17,610 --> 00:53:19,790 a path that starts at the beginning 995 00:53:19,790 --> 00:53:23,310 and takes us all the way down to the point where 996 00:53:23,310 --> 00:53:27,090 we are asserting false. 997 00:53:27,090 --> 00:53:29,780 And we want to know, is this path feasible? 998 00:53:29,780 --> 00:53:32,990 Could the program go down this path?
999 00:53:32,990 --> 00:53:35,800 So as we're going down this program 1000 00:53:35,800 --> 00:53:37,660 we're going to keep two things. 1001 00:53:42,070 --> 00:53:43,870 We're going to keep an environment that 1002 00:53:43,870 --> 00:53:46,830 keeps track of the symbolic values 1003 00:53:46,830 --> 00:53:48,580 of the different variables. 1004 00:53:48,580 --> 00:53:52,700 And in addition to that, we're going to keep around 1005 00:53:52,700 --> 00:53:54,710 an environment for constraints. 1006 00:54:04,109 --> 00:54:05,650 And these constraints are essentially 1007 00:54:05,650 --> 00:54:08,150 going to keep track of all the relationships 1008 00:54:08,150 --> 00:54:12,000 between these variables as well as any assumptions, 1009 00:54:12,000 --> 00:54:13,480 whether they were assumptions that 1010 00:54:13,480 --> 00:54:15,830 were made at the beginning, or assumptions 1011 00:54:15,830 --> 00:54:18,320 that come from the branches that you are taking. 1012 00:54:18,320 --> 00:54:21,350 So in this case, when we start down this path 1013 00:54:21,350 --> 00:54:29,490 we get to t equals 0, so our state is x, y, and 0. 1014 00:54:29,490 --> 00:54:31,090 And so far we have no constraints 1015 00:54:31,090 --> 00:54:35,290 because we didn't have any constraint in the beginning. 1016 00:54:35,290 --> 00:54:39,224 So now we're going to take this branch 1017 00:54:39,224 --> 00:54:41,390 and, again, because we've made a decision that we're 1018 00:54:41,390 --> 00:54:45,950 going to go down the path to your right, 1019 00:54:45,950 --> 00:54:51,358 then we know that this path will only happen when? 1020 00:54:56,506 --> 00:54:57,450 AUDIENCE: [INAUDIBLE]. 1021 00:54:57,450 --> 00:54:58,908 ARMANDO SOLAR-LEZAMA: That's right. 1022 00:54:58,908 --> 00:55:04,970 So we get our first constraint that says, x is greater than y. 1023 00:55:04,970 --> 00:55:05,470 Right? 1024 00:55:05,470 --> 00:55:13,410 So now down here we're looking at t equals y. 
1025 00:55:13,410 --> 00:55:16,510 Now in this case because we're going only one path at a time 1026 00:55:16,510 --> 00:55:19,850 we don't actually need to introduce a new variable for t 1027 00:55:19,850 --> 00:55:20,520 necessarily. 1028 00:55:20,520 --> 00:55:22,340 We can just say, OK. 1029 00:55:22,340 --> 00:55:23,750 t is equal to y. 1030 00:55:23,750 --> 00:55:27,640 So that means that t is no longer 0. 1031 00:55:27,640 --> 00:55:31,130 It's now y. 1032 00:55:31,130 --> 00:55:31,740 Right? 1033 00:55:31,740 --> 00:55:32,840 And then keep going. 1034 00:55:32,840 --> 00:55:34,860 We get to this point. 1035 00:55:34,860 --> 00:55:37,990 Now we hit another branch. 1036 00:55:37,990 --> 00:55:39,490 What's a new assumption that we have 1037 00:55:39,490 --> 00:55:41,740 to make if we're assuming that we went down this path? 1038 00:55:49,340 --> 00:55:51,410 Just t less than y, right? 1039 00:55:51,410 --> 00:55:52,880 And what is t? 1040 00:55:56,340 --> 00:55:57,120 Right. 1041 00:55:57,120 --> 00:56:00,840 So in fact if we look up t, so t has the value y. 1042 00:56:00,840 --> 00:56:01,916 We look up y. 1043 00:56:01,916 --> 00:56:03,300 y also has the value of y. 1044 00:56:03,300 --> 00:56:09,290 So this constraint actually translates to y less than y. 1045 00:56:09,290 --> 00:56:11,320 So what does this tell us? 1046 00:56:11,320 --> 00:56:16,750 It tells us that in order to make it to this point, 1047 00:56:16,750 --> 00:56:20,280 in order to make it to the assert false, all of those things 1048 00:56:20,280 --> 00:56:21,340 have to hold. 1049 00:56:21,340 --> 00:56:22,730 Can they hold? 1050 00:56:22,730 --> 00:56:23,920 Clearly not. 1051 00:56:23,920 --> 00:56:24,670 Right? 1052 00:56:24,670 --> 00:56:28,350 y less than y alone is already sufficient for things 1053 00:56:28,350 --> 00:56:29,550 not to hold. 1054 00:56:29,550 --> 00:56:35,980 And so that tells us immediately that this is unsatisfiable.
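[Editor's sketch] The walk down this path can be written out as a few lines of Python, as a rough illustration only. The names `lookup`, `evaluate`, and `satisfiable` are hypothetical helpers, the symbolic inputs are called "X" and "Y" by assumption, and the brute-force search over a tiny range is a stand-in for the SMT solver query (unlike a real solver, it proves nothing outside that range). The point is the bookkeeping: one environment of symbolic values, one list of constraints.

```python
import itertools
import operator

def lookup(env, term):
    """Replace a program variable with its current symbolic value."""
    return env.get(term, term)

def evaluate(term, inputs):
    """Evaluate a symbolic term (an input name or an integer) on concrete inputs."""
    return inputs[term] if isinstance(term, str) else term

def satisfiable(constraints, names, values=range(-4, 5)):
    """Brute-force stand-in for an SMT solver; only searches a tiny range."""
    for candidate in itertools.product(values, repeat=len(names)):
        inputs = dict(zip(names, candidate))
        if all(op(evaluate(a, inputs), evaluate(b, inputs))
               for op, a, b in constraints):
            return inputs
    return None

# Symbolic state at the start of the path: x and y are unknown inputs, t is 0.
env = {"x": "X", "y": "Y", "t": 0}
constraints = []

# First branch taken on this path: x > y.
constraints.append((operator.gt, lookup(env, "x"), lookup(env, "y")))

# Assignment t = y: update the environment, no new variable needed.
env["t"] = lookup(env, "y")

# Second branch taken on this path: t < y, which becomes Y < Y after lookup.
constraints.append((operator.lt, lookup(env, "t"), lookup(env, "y")))

witness = satisfiable(constraints, ["X", "Y"])          # no input reaches the assert
first_branch_only = satisfiable(constraints[:1], ["X", "Y"])
```

Checking only the first constraint does produce an input, which is what makes early checks useful: the path only becomes infeasible once the second branch is added.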
1055 00:56:35,980 --> 00:56:39,940 And this is often known as a path condition. 1056 00:56:39,940 --> 00:56:42,030 This is a condition that has to be 1057 00:56:42,030 --> 00:56:47,020 true in order for the program to go down that path. 1058 00:56:47,020 --> 00:56:51,630 And so we know that this path condition cannot be satisfied. 1059 00:56:51,630 --> 00:56:54,650 And therefore, that it's impossible for the program 1060 00:56:54,650 --> 00:56:55,970 to take this path. 1061 00:56:55,970 --> 00:57:01,480 So this path is now completely eliminated. 1062 00:57:01,480 --> 00:57:05,680 We know that this path cannot be taken. 1063 00:57:05,680 --> 00:57:08,640 And, in fact, so these constraints we're 1064 00:57:08,640 --> 00:57:13,650 actually going to just keep them around as the condition itself. 1065 00:57:13,650 --> 00:57:14,150 All right? 1066 00:57:14,150 --> 00:57:17,860 So what about a different path? 1067 00:57:17,860 --> 00:57:21,840 So now we're trying this path. 1068 00:57:24,830 --> 00:57:29,140 So what would be the path condition for this? 1069 00:57:29,140 --> 00:57:35,920 So, again, our symbolic state starts with t equals 0, 1070 00:57:35,920 --> 00:57:39,270 and x and y equal to just the variables x and y. 1071 00:57:39,270 --> 00:57:43,060 And now what does the path constraint 1072 00:57:43,060 --> 00:57:44,610 look like in this case? 1073 00:57:44,610 --> 00:57:48,115 So by the time we get here what does the path condition look 1074 00:57:48,115 --> 00:57:48,615 like? 1075 00:57:50,984 --> 00:57:51,900 AUDIENCE: [INAUDIBLE]. 1076 00:57:53,818 --> 00:57:54,984 ARMANDO SOLAR-LEZAMA: Right. 1077 00:57:54,984 --> 00:57:59,860 So in this case [INAUDIBLE] this is true and this is false. 1078 00:57:59,860 --> 00:58:02,590 So in this case it says, OK. x is greater than y. 1079 00:58:06,010 --> 00:58:10,900 And we are setting t to be equal to x. 1080 00:58:10,900 --> 00:58:21,290 So then when we get here we have x is less than y.
1081 00:58:21,290 --> 00:58:21,790 Right? 1082 00:58:21,790 --> 00:58:24,830 And once again it's very clear that this path condition 1083 00:58:24,830 --> 00:58:26,940 is unsatisfiable. 1084 00:58:26,940 --> 00:58:27,440 Right? 1085 00:58:27,440 --> 00:58:30,960 We cannot have x greater than y and x less than y at the same 1086 00:58:30,960 --> 00:58:31,460 time. 1087 00:58:31,460 --> 00:58:33,970 There's no assignment to x that will satisfy 1088 00:58:33,970 --> 00:58:35,360 both of those constraints. 1089 00:58:35,360 --> 00:58:38,740 So what that tells us is, again, that this other path is also 1090 00:58:38,740 --> 00:58:40,030 unsatisfiable. 1091 00:58:40,030 --> 00:58:42,030 And now at this point we've actually 1092 00:58:42,030 --> 00:58:46,280 explored every possible path in our program that could lead us 1093 00:58:46,280 --> 00:58:47,040 to this condition. 1094 00:58:47,040 --> 00:58:50,200 So we can actually establish and certify 1095 00:58:50,200 --> 00:58:56,890 that there is no possible path that will lead to an assertion 1096 00:58:56,890 --> 00:58:57,710 failure. 1097 00:58:57,710 --> 00:58:58,539 Yes? 1098 00:58:58,539 --> 00:59:00,205 AUDIENCE: The way you just presented it, 1099 00:59:00,205 --> 00:59:03,995 it makes it look as if you would explore every possible branch. 1100 00:59:03,995 --> 00:59:06,120 I mean, one of the advantages of symbolic execution 1101 00:59:06,120 --> 00:59:07,953 is that you're trying to prevent [INAUDIBLE] 1102 00:59:07,953 --> 00:59:11,730 a need of exploring all possible [INAUDIBLE] exponential. 1103 00:59:11,730 --> 00:59:13,356 So how are you avoiding that over here? 1104 00:59:13,356 --> 00:59:15,730 ARMANDO SOLAR-LEZAMA: That's a very good question, right? 1105 00:59:15,730 --> 00:59:18,080 So in this case essentially what you have is 1106 00:59:18,080 --> 00:59:21,160 you have a trade-off between how symbolic and how concrete 1107 00:59:21,160 --> 00:59:22,101 you want to be.
1108 00:59:22,101 --> 00:59:22,600 Right? 1109 00:59:22,600 --> 00:59:26,990 So in this case we are not as symbolic as the first time 1110 00:59:26,990 --> 00:59:30,810 around when we were visiting both branches at the same time, 1111 00:59:30,810 --> 00:59:34,460 but in exchange for that our constraints became very, very 1112 00:59:34,460 --> 00:59:35,221 simple. 1113 00:59:35,221 --> 00:59:35,720 Right? 1114 00:59:35,720 --> 00:59:39,370 So the individual path by path constraints are very simple, 1115 00:59:39,370 --> 00:59:42,050 but you have to do this over, and over, and over again 1116 00:59:42,050 --> 00:59:44,310 to explore all the different branches. 1117 00:59:44,310 --> 00:59:46,930 And there are exponentially-- all the different paths. 1118 00:59:46,930 --> 00:59:50,580 And there are exponentially many paths in a program. 1119 00:59:50,580 --> 00:59:53,110 Now there are exponentially many paths, 1120 00:59:53,110 --> 00:59:55,540 but for every path in general, there's 1121 00:59:55,540 --> 00:59:58,580 also an exponentially large set of inputs 1122 00:59:58,580 --> 01:00:00,234 that could go down that path. 1123 01:00:00,234 --> 01:00:02,525 So this already gives you a big benefit because instead 1124 01:00:02,525 --> 01:00:05,220 of having to try every possible input you're only 1125 01:00:05,220 --> 01:00:08,220 trying every possible path. 1126 01:00:08,220 --> 01:00:10,430 But can you do better? 1127 01:00:10,430 --> 01:00:14,370 And this is one of the areas where there's 1128 01:00:14,370 --> 01:00:19,040 been a lot of experimentation in the area of symbolic execution. 1129 01:00:19,040 --> 01:00:22,700 When do you do path by path reasoning? 1130 01:00:22,700 --> 01:00:26,180 When do you do all paths at the same time? 1131 01:00:26,180 --> 01:00:28,550 And one of the things that you saw, for example, 1132 01:00:28,550 --> 01:00:31,750 in the KLEE 
paper is a set of heuristics, 1133 01:00:31,750 --> 01:00:33,550 and a set of strategies they used 1134 01:00:33,550 --> 01:00:35,360 to make the search tractable. 1135 01:00:35,360 --> 01:00:37,530 For example, one of the things that they do 1136 01:00:37,530 --> 01:00:40,890 is that they are exploring path by path, 1137 01:00:40,890 --> 01:00:43,300 but they're not exploring completely blindly. 1138 01:00:43,300 --> 01:00:47,960 And they are also checking the path conditions 1139 01:00:47,960 --> 01:00:49,670 after every step. 1140 01:00:49,670 --> 01:00:53,480 So that, for example, if here instead of just 1141 01:00:53,480 --> 01:01:02,110 assert false, if this were a very complex program tree, 1142 01:01:02,110 --> 01:01:03,440 control flow graph. 1143 01:01:03,440 --> 01:01:07,860 You don't wait until you get to the very end 1144 01:01:07,860 --> 01:01:10,330 to check whether the path is feasible. 1145 01:01:10,330 --> 01:01:13,870 The moment you get here you know that this path is unsatisfiable 1146 01:01:13,870 --> 01:01:16,330 and you never go down this direction. 1147 01:01:16,330 --> 01:01:18,950 You always go in the other direction. 1148 01:01:18,950 --> 01:01:24,670 So pruning the paths early helps cut down a lot 1149 01:01:24,670 --> 01:01:26,180 on the exponential blow-up. 1150 01:01:26,180 --> 01:01:28,590 And exploring the paths intelligently 1151 01:01:28,590 --> 01:01:32,510 helps a lot in preventing blow-up. 1152 01:01:32,510 --> 01:01:35,270 A lot of the practical tools that are used today, 1153 01:01:35,270 --> 01:01:36,770 some of the things that they will do 1154 01:01:36,770 --> 01:01:39,710 is they will actually start with some random testing 1155 01:01:39,710 --> 01:01:42,520 to get an initial set of paths. 1156 01:01:42,520 --> 01:01:45,660 And then they will start looking for paths in the neighborhood 1157 01:01:45,660 --> 01:01:46,900 of those paths.
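[Editor's sketch] Early pruning can be illustrated with a short Python sketch. This is not the search strategy of any particular tool: `feasible` and `explore` are hypothetical names, the three branch conditions are made up for the example, and the brute-force check over a small range is only an assumed stand-in for a solver query. The idea shown is that the path condition is checked at every branch, so an infeasible prefix cuts off everything beneath it.

```python
import itertools

def feasible(path_condition, values=range(-4, 5)):
    """Brute-force stand-in for a solver query over two inputs x and y."""
    return any(all(c(x, y) for c in path_condition)
               for x, y in itertools.product(values, repeat=2))

def explore(branches, path_condition=(), taken=()):
    """Depth-first, path-by-path exploration that prunes infeasible prefixes."""
    if not feasible(path_condition):
        return []                      # prune: nothing below here is reachable
    if not branches:
        return [taken]                 # reached the end of a feasible path
    cond, rest = branches[0], branches[1:]
    negated = lambda x, y, c=cond: not c(x, y)
    return (explore(rest, path_condition + (cond,), taken + (True,)) +
            explore(rest, path_condition + (negated,), taken + (False,)))

# Three consecutive branches (illustrative conditions, not from the lecture).
branches = [
    lambda x, y: x > y,
    lambda x, y: x < y,       # cannot be true together with the first
    lambda x, y: x == y + 1,
]

# Each result records which way every branch went along one feasible path.
paths = explore(branches)
```

With three branches there are eight syntactic paths, but only four survive. The prefix "x > y and x < y" is rejected at the second branch, so its two continuations are never visited at all.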
1158 01:01:46,900 --> 01:01:50,310 They will start asking questions like, hey, the random execution 1159 01:01:50,310 --> 01:01:51,430 went down this branch. 1160 01:01:51,430 --> 01:01:52,770 What if I flip this branch? 1161 01:01:52,770 --> 01:01:54,130 What if I flip this branch? 1162 01:01:54,130 --> 01:01:55,560 What if I flip this branch? 1163 01:01:55,560 --> 01:01:57,780 What happens in those paths? 1164 01:01:57,780 --> 01:01:59,750 Can be particularly useful, for example, 1165 01:01:59,750 --> 01:02:01,210 if we have a good test suite. 1166 01:02:01,210 --> 01:02:04,220 And you run your test suite and you find, OK, there 1167 01:02:04,220 --> 01:02:07,200 is this piece of code that nothing in my test suite 1168 01:02:07,200 --> 01:02:08,720 exercised. 1169 01:02:08,720 --> 01:02:12,600 So what you can do is you can take the path that got closest 1170 01:02:12,600 --> 01:02:15,510 to exercising that, and then ask, hey, 1171 01:02:15,510 --> 01:02:19,630 can I change this path so that it goes down this direction 1172 01:02:19,630 --> 01:02:20,930 instead? 1173 01:02:20,930 --> 01:02:25,970 And so in general, the moment you 1174 01:02:25,970 --> 01:02:28,690 try to do all paths simultaneously 1175 01:02:28,690 --> 01:02:31,420 the constraints start becoming intractable. 1176 01:02:31,420 --> 01:02:33,910 And it's the kind of thing that you 1177 01:02:33,910 --> 01:02:37,250 can do if you're doing one function at a time. 1178 01:02:37,250 --> 01:02:39,420 For example, if you're doing one function at a time 1179 01:02:39,420 --> 01:02:42,140 then it is generally feasible to explore all the paths 1180 01:02:42,140 --> 01:02:43,790 in a function together. 1181 01:02:43,790 --> 01:02:47,660 If you're trying to do larger units, then generally 1182 01:02:47,660 --> 01:02:50,105 you have to go with path by path exploration. 1183 01:02:53,392 --> 01:02:54,475 Are there other questions? 1184 01:02:56,880 --> 01:02:57,380 Yes? 
1185 01:02:57,380 --> 01:03:00,302 AUDIENCE: You referenced how [INAUDIBLE]. 1186 01:03:00,302 --> 01:03:02,250 How does it do that again? 1187 01:03:02,250 --> 01:03:04,920 What's the [INAUDIBLE]? 1188 01:03:04,920 --> 01:03:08,140 ARMANDO SOLAR-LEZAMA: So the most important one really is 1189 01:03:08,140 --> 01:03:13,600 this idea that for every branch, you check your constraints 1190 01:03:13,600 --> 01:03:17,490 to check whether that branch can actually go both ways, 1191 01:03:17,490 --> 01:03:23,670 because if it cannot go both ways then you save a lot by just 1192 01:03:23,670 --> 01:03:26,390 not going in the direction where it can't go. 1193 01:03:26,390 --> 01:03:28,780 Beyond that I don't remember the specific strategy 1194 01:03:28,780 --> 01:03:32,220 that they use for searching paths that are more 1195 01:03:32,220 --> 01:03:34,570 likely to give good results. 1196 01:03:37,760 --> 01:03:39,580 But pruning is really, really important. 1197 01:03:43,460 --> 01:03:44,930 OK. 1198 01:03:44,930 --> 01:03:48,560 So far though we've been talking mostly about toy code 1199 01:03:48,560 --> 01:03:53,360 in the sense that it's only integer variables, branches, 1200 01:03:53,360 --> 01:03:54,760 very simple stuff. 1201 01:03:54,760 --> 01:03:55,430 Right? 1202 01:03:55,430 --> 01:03:59,090 What happens when you have a program that 1203 01:03:59,090 --> 01:04:01,680 is more complicated? 1204 01:04:01,680 --> 01:04:05,790 And in particular, what happens when you have a program that 1205 01:04:05,790 --> 01:04:08,031 involves the heap? 1206 01:04:08,031 --> 01:04:08,530 Right?
1207 01:04:08,530 --> 01:04:11,580 So the heap has historically been 1208 01:04:11,580 --> 01:04:14,080 the bane of all program analysis. Analyses 1209 01:04:14,080 --> 01:04:18,180 that were so clean and so elegant in the days of Fortran 1210 01:04:18,180 --> 01:04:21,230 completely blow up when you try to run them on a C program 1211 01:04:21,230 --> 01:04:23,410 where you're allocating memory left and right. 1212 01:04:23,410 --> 01:04:25,280 And you have aliasing. 1213 01:04:25,280 --> 01:04:28,680 And you have all the messiness that 1214 01:04:28,680 --> 01:04:32,410 comes with dealing with program allocated memory. 1215 01:04:32,410 --> 01:04:34,660 And with pointers and pointer arithmetic. 1216 01:04:34,660 --> 01:04:37,840 And this is one of the areas where symbolic execution really 1217 01:04:37,840 --> 01:04:39,840 shines in the ability to actually reason 1218 01:04:39,840 --> 01:04:42,450 about these kinds of programs. 1219 01:04:42,450 --> 01:04:44,190 So how do we do it? 1220 01:04:44,190 --> 01:04:47,640 Right, so let's forget now for a moment about branches, 1221 01:04:47,640 --> 01:04:48,530 and control flow. 1222 01:04:48,530 --> 01:04:53,080 We have a trivially simple program here. 1223 01:04:53,080 --> 01:04:56,630 All it's doing is it's allocating some memory. 1224 01:04:56,630 --> 01:04:58,090 It's zeroing it out. 1225 01:04:58,090 --> 01:05:02,500 It's getting a new pointer y from the pointer x. 1226 01:05:02,500 --> 01:05:04,380 It's writing something into y. 1227 01:05:04,380 --> 01:05:08,140 And then it's checking, hey, is the value 1228 01:05:08,140 --> 01:05:12,070 stored at pointer y equal to the value stored at pointer x? 1229 01:05:12,070 --> 01:05:14,390 And just from your basic knowledge of C 1230 01:05:14,390 --> 01:05:16,920 you could see that, no.
1231 01:05:16,920 --> 01:05:22,081 Right, that this assertion is actually violated because x got 1232 01:05:22,081 --> 01:05:26,570 zeroed out and y has 25 in there, 1233 01:05:26,570 --> 01:05:30,210 but x is pointing to a different location. 1234 01:05:30,210 --> 01:05:33,030 Right? 1235 01:05:33,030 --> 01:05:35,000 So far so good. 1236 01:05:35,000 --> 01:05:37,570 The way we're going to model the heap and the way 1237 01:05:37,570 --> 01:05:41,140 the heap is modeled in a lot of these systems 1238 01:05:41,140 --> 01:05:45,070 is by not thinking of the heap as a heap, 1239 01:05:45,070 --> 01:05:48,150 but thinking of the heap the way 1240 01:05:48,150 --> 01:05:51,840 C likes for you to think of the heap, which is just 1241 01:05:51,840 --> 01:05:57,500 a giant address space, a giant array where you can put things 1242 01:05:57,500 --> 01:05:58,640 into. 1243 01:05:58,640 --> 01:06:00,800 So what does that mean? 1244 01:06:00,800 --> 01:06:03,340 It means that we can think of our program 1245 01:06:03,340 --> 01:06:07,780 as having this very big global array. 1246 01:06:07,780 --> 01:06:10,980 And we're just going to call it MEM for now. 1247 01:06:10,980 --> 01:06:11,480 Right? 1248 01:06:11,480 --> 01:06:13,530 And it's an array that essentially is going 1249 01:06:13,530 --> 01:06:17,630 to map addresses to values. 1250 01:06:17,630 --> 01:06:18,130 Right? 1251 01:06:18,130 --> 01:06:19,330 And what's an address? 1252 01:06:19,330 --> 01:06:25,710 Well, an address is just a 64-bit value. 1253 01:06:25,710 --> 01:06:30,040 And what comes out after you read something from an address? 1254 01:06:30,040 --> 01:06:31,750 It depends on how you're modeling memory. 1255 01:06:31,750 --> 01:06:36,620 If you're modeling it at the byte level, then what comes out 1256 01:06:36,620 --> 01:06:37,960 is a byte. 1257 01:06:37,960 --> 01:06:40,460 If you're modeling it at the word level then 1258 01:06:40,460 --> 01:06:42,880 what comes out of it is a word.
1259 01:06:42,880 --> 01:06:45,490 And depending on the kind of bugs that you're interested in, 1260 01:06:45,490 --> 01:06:47,920 and whether things like memory alignment 1261 01:06:47,920 --> 01:06:49,650 are an issue for you or not, you're 1262 01:06:49,650 --> 01:06:51,441 going to model it a little bit differently, 1263 01:06:51,441 --> 01:06:53,810 but generally memory is just an array 1264 01:06:53,810 --> 01:07:00,030 from an address to a value. 1265 01:07:00,030 --> 01:07:00,530 Right? 1266 01:07:00,530 --> 01:07:07,260 So an address is just an integer. 1267 01:07:07,260 --> 01:07:08,147 Right? 1268 01:07:08,147 --> 01:07:10,230 It's in some sense not that different from the way 1269 01:07:10,230 --> 01:07:11,550 C thinks of an address. 1270 01:07:11,550 --> 01:07:12,870 It's just an integer. 1271 01:07:12,870 --> 01:07:15,430 It's just a value. 1272 01:07:15,430 --> 01:07:18,740 It's just a 64-bit integer, or a 32-bit integer, 1273 01:07:18,740 --> 01:07:20,010 depending on your machine. 1274 01:07:20,010 --> 01:07:22,930 It's just a value that indexes into that memory. 1275 01:07:22,930 --> 01:07:24,990 And you can put things in memory, 1276 01:07:24,990 --> 01:07:27,490 read them from the memory. 1277 01:07:27,490 --> 01:07:30,860 So things like pointer arithmetic 1278 01:07:30,860 --> 01:07:33,304 just become integer arithmetic. 1279 01:07:33,304 --> 01:07:35,220 In practice there's a little bit of desugaring 1280 01:07:35,220 --> 01:07:43,020 that has to happen because in C the pointer arithmetic actually 1281 01:07:43,020 --> 01:07:45,290 knows about the types of the pointers. 1282 01:07:45,290 --> 01:07:50,030 And things will be incremented proportional to the size, 1283 01:07:50,030 --> 01:07:50,530 right? 1284 01:07:50,530 --> 01:08:00,100 So this would actually be x plus 10 times the size of int. 1285 01:08:00,100 --> 01:08:01,320 Right?
1286 01:08:01,320 --> 01:08:03,440 But what's really important is what 1287 01:08:03,440 --> 01:08:06,610 happens when you're reading and writing from memory. 1288 01:08:06,610 --> 01:08:11,590 So what used to be just a pointer dereference of y 1289 01:08:11,590 --> 01:08:17,109 to write 25, is now just I'm taking my memory array, 1290 01:08:17,109 --> 01:08:19,910 and I'm indexing it with y. 1291 01:08:19,910 --> 01:08:24,590 And I'm writing 25 to that memory location. 1292 01:08:24,590 --> 01:08:25,090 Right? 1293 01:08:25,090 --> 01:08:29,020 And this assertion now becomes, well, I 1294 01:08:29,020 --> 01:08:32,430 am reading from location y in memory. 1295 01:08:32,430 --> 01:08:35,100 And I am reading from location x in memory. 1296 01:08:35,100 --> 01:08:36,550 And I am comparing them. 1297 01:08:36,550 --> 01:08:40,010 And I'm checking whether they are the same or not. 1298 01:08:40,010 --> 01:08:41,510 It's a very, very simple reduction 1299 01:08:41,510 --> 01:08:46,880 to go from a program that uses the heap to a program that just uses 1300 01:08:46,880 --> 01:08:51,790 this giant global array that represents the memory. 1301 01:08:51,790 --> 01:08:53,649 And now what that means is that in order 1302 01:08:53,649 --> 01:08:55,764 to reason about programs that manipulate the heap 1303 01:08:55,764 --> 01:08:57,680 you don't really have to reason about programs 1304 01:08:57,680 --> 01:08:58,721 that manipulate the heap. 1305 01:08:58,721 --> 01:09:01,510 As long as you have the ability to reason about arrays, 1306 01:09:01,510 --> 01:09:02,399 you are good. 1307 01:09:02,399 --> 01:09:04,700 Now here's a simple question though. 1308 01:09:04,700 --> 01:09:07,430 What about the malloc? 1309 01:09:07,430 --> 01:09:11,479 So one thing you can do is you can say, well, malloc, 1310 01:09:11,479 --> 01:09:16,240 I can just take the C implementation of malloc 1311 01:09:16,240 --> 01:09:18,130 and actually implement malloc like that.
1312 01:09:18,130 --> 01:09:23,130 And keep track of all the pages that I have allocated 1313 01:09:23,130 --> 01:09:26,950 and keep track of everything that has been freed. 1314 01:09:26,950 --> 01:09:29,109 And keep a free list, and everything. 1315 01:09:29,109 --> 01:09:31,380 It turns out for a lot of purposes 1316 01:09:31,380 --> 01:09:33,310 and for a lot of classes of bugs, 1317 01:09:33,310 --> 01:09:35,185 you don't need malloc to be that complicated. 1318 01:09:35,185 --> 01:09:39,529 In fact, you can get away with a malloc that looks like this, 1319 01:09:39,529 --> 01:09:41,819 with a malloc that just says, I'm 1320 01:09:41,819 --> 01:09:49,330 going to keep a counter for the next free memory location. 1321 01:09:49,330 --> 01:09:55,560 And whenever somebody asks for an address, 1322 01:09:55,560 --> 01:09:57,730 I'm just going to give out this position 1323 01:09:57,730 --> 01:09:59,720 and then increment the position. 1324 01:09:59,720 --> 01:10:00,220 Right? 1325 01:10:02,920 --> 01:10:04,769 And then return rv, in this case. 1326 01:10:11,626 --> 01:10:14,042 So what is one of the things that this malloc is completely ignoring? 1327 01:10:17,754 --> 01:10:18,670 AUDIENCE: [INAUDIBLE]. 1328 01:10:18,670 --> 01:10:18,770 ARMANDO SOLAR-LEZAMA: Yeah. 1329 01:10:18,770 --> 01:10:19,670 Freeing, right? 1330 01:10:19,670 --> 01:10:21,939 This malloc says, yeah, forget about freeing. 1331 01:10:21,939 --> 01:10:22,730 There's no freeing. 1332 01:10:22,730 --> 01:10:26,650 We're just going to keep walking through our memory allocating 1333 01:10:26,650 --> 01:10:30,880 further, and further, and further and that will be it. 1334 01:10:30,880 --> 01:10:34,770 And we don't care about freeing anything. 1335 01:10:34,770 --> 01:10:36,710 It also doesn't really care about the fact 1336 01:10:36,710 --> 01:10:39,759 that well, actually, there are regions of memory where 1337 01:10:39,759 --> 01:10:40,800 you shouldn't be writing.
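[Editor's sketch] The memory-as-array model and the simplified malloc can be put together in a few lines of Python, as a rough concrete illustration. Several details are assumptions that the lecture does not give: a 4-byte int (`INT_SIZE`), an arbitrary starting address, a 20-int allocation, and a word-level memory modeled as a dictionary. Only the shape follows the description: a global MEM mapping addresses to values, and a malloc that hands out the next free position and never frees.

```python
INT_SIZE = 4          # assumed sizeof(int)
MEM = {}              # the "giant global array": address -> int-sized value
next_free = 0x1000    # arbitrary starting address for the bump allocator

def malloc(num_bytes):
    """Bump allocator: hand out the next free address, never reuse anything."""
    global next_free
    rv = next_free
    next_free += num_bytes
    return rv

# x = malloc(...); zero out the allocation (the size of 20 ints is an assumption).
x = malloc(20 * INT_SIZE)
for i in range(20):
    MEM[x + i * INT_SIZE] = 0

# y = x + 10 in C: pointer arithmetic desugars to integer arithmetic on addresses.
y = x + 10 * INT_SIZE

# *y = 25 becomes a write into MEM at index y.
MEM[y] = 25

# assert(*y == *x) becomes a comparison of two reads from MEM.
assertion_holds = MEM[y] == MEM[x]
```

Here the assertion fails, as in the lecture: location x still holds 0 while location y holds 25. Because nothing is ever freed or tracked as allocated, this model cannot report a use-after-free or an access to unallocated space.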
1338 01:10:40,800 --> 01:10:42,385 There are special addresses that have 1339 01:10:42,385 --> 01:10:44,960 special meaning that are reserved for the operating 1340 01:10:44,960 --> 01:10:45,540 system. 1341 01:10:45,540 --> 01:10:47,560 It doesn't model any of the things 1342 01:10:47,560 --> 01:10:50,580 that actually make writing a malloc function complicated, 1343 01:10:50,580 --> 01:10:54,380 but at a certain level of abstraction, 1344 01:10:54,380 --> 01:10:58,280 if you're trying to reason about some complicated code that 1345 01:10:58,280 --> 01:10:59,520 does pointer manipulation, 1346 01:10:59,520 --> 01:11:02,130 and you don't care about freeing memory, 1347 01:11:02,130 --> 01:11:04,600 but what you really care about is, am I 1348 01:11:04,600 --> 01:11:08,030 going to write past the end of some buffer, for example, 1349 01:11:08,030 --> 01:11:10,642 then this malloc might be good enough. 1350 01:11:10,642 --> 01:11:12,850 And this is actually something that happens very, very commonly 1351 01:11:12,850 --> 01:11:15,380 when you're doing symbolic execution of real code. 1352 01:11:15,380 --> 01:11:19,080 A very important step is the modeling 1353 01:11:19,080 --> 01:11:20,750 of your library functions. 1354 01:11:20,750 --> 01:11:22,800 And how you model your library functions 1355 01:11:22,800 --> 01:11:25,760 is going to have a huge impact on the one hand 1356 01:11:25,760 --> 01:11:30,110 on the performance and the scalability of the analysis, 1357 01:11:30,110 --> 01:11:32,160 but on the other hand, on the precision. 1358 01:11:32,160 --> 01:11:35,670 So if you have a Mickey Mouse model of malloc like this, 1359 01:11:35,670 --> 01:11:37,930 it's going to be very, very fast, 1360 01:11:37,930 --> 01:11:41,265 but there are going to be certain classes of bugs 1361 01:11:41,265 --> 01:11:43,060 that you won't be able to catch. 1362 01:11:43,060 --> 01:11:43,560 Right?
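[Editor's sketch] The counter-based malloc described above could be sketched along these lines. The class name `BumpAllocator`, the `size` parameter, and the starting address of 1 are assumptions made for illustration; the transcript only mentions a counter for the next free location and a returned value `rv`.

```python
class BumpAllocator:
    """Toy malloc model: one counter for the next free location, no free list."""

    def __init__(self, start=1):
        # Assumption: start at 1 so that address 0 can stand in for NULL.
        self.next_free = start

    def malloc(self, size):
        rv = self.next_free       # hand out the current position
        self.next_free += size    # then bump the counter past the new block
        return rv

    def free(self, addr):
        pass                      # freeing is deliberately not modeled
```

Because `free` does nothing and no allocation bounds are recorded, a model like this cannot flag a use-after-free or an access to unallocated space, which is the precision being traded for speed.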
1363 01:11:43,560 --> 01:11:45,630 So in this model, for example, I'm completely 1364 01:11:45,630 --> 01:11:46,840 ignoring the allocations. 1365 01:11:46,840 --> 01:11:48,840 So if I have a bug because somebody 1366 01:11:48,840 --> 01:11:51,940 is accessing unallocated space, 1367 01:11:51,940 --> 01:11:56,010 well, I'm not going to find it with this Mickey Mouse 1368 01:11:56,010 --> 01:11:58,860 model of malloc. 1369 01:11:58,860 --> 01:11:59,660 Right? 1370 01:11:59,660 --> 01:12:04,400 So it's always a balance between the precision of the analysis 1371 01:12:04,400 --> 01:12:10,400 versus the efficiency. 1372 01:12:10,400 --> 01:12:14,030 And the more complicated your models of standard functions 1373 01:12:14,030 --> 01:12:17,010 like malloc get, the less scalable 1374 01:12:17,010 --> 01:12:20,230 the analysis is going to be, but for certain classes of bugs 1375 01:12:20,230 --> 01:12:22,150 you will need those models. 1376 01:12:22,150 --> 01:12:25,510 And one of the big things in the KLEE paper 1377 01:12:25,510 --> 01:12:27,830 was really having reasonable models 1378 01:12:27,830 --> 01:12:31,440 for all the different libraries in C, 1379 01:12:31,440 --> 01:12:32,940 all the different libraries that are 1380 01:12:32,940 --> 01:12:35,350 needed in order to understand what a program is actually 1381 01:12:35,350 --> 01:12:35,850 doing. 1382 01:12:39,090 --> 01:12:40,177 So, OK. 1383 01:12:40,177 --> 01:12:42,510 So we've reduced the problem of reasoning about the heap 1384 01:12:42,510 --> 01:12:47,220 to a problem of reasoning about a program with arrays, 1385 01:12:47,220 --> 01:12:50,910 but I haven't actually told you how to reason 1386 01:12:50,910 --> 01:12:52,270 about a program with arrays. 1387 01:12:52,270 --> 01:12:55,390 And it turns out that most SMT solvers 1388 01:12:55,390 --> 01:12:58,060 support a theory of arrays.
1389 01:12:58,060 --> 01:13:01,826 And the idea is if a is an array, 1390 01:13:01,826 --> 01:13:03,950 there's some notation to say, well, take that array 1391 01:13:03,950 --> 01:13:07,070 and create a new array where location i has 1392 01:13:07,070 --> 01:13:10,571 been updated to value e. 1393 01:13:10,571 --> 01:13:11,070 All right? 1394 01:13:11,070 --> 01:13:14,820 So if I have array a and I do this update operation, 1395 01:13:14,820 --> 01:13:17,340 and then I try to read the value at k, 1396 01:13:17,340 --> 01:13:20,180 then the meaning is that the value at k 1397 01:13:20,180 --> 01:13:22,370 is going to be equal to the value at k 1398 01:13:22,370 --> 01:13:25,330 in a if k is different from i. 1399 01:13:25,330 --> 01:13:29,350 And it's going to be equal to e if k is equal to i, right? 1400 01:13:29,350 --> 01:13:31,290 That's what updating an array means. 1401 01:13:31,290 --> 01:13:33,890 That's what it means to take an old array 1402 01:13:33,890 --> 01:13:35,583 and update it to be a new array. 1403 01:13:40,320 --> 01:13:44,780 And the nice thing about this is that if you have a formula that 1404 01:13:44,780 --> 01:13:47,780 involves the theory of arrays, so, for example, 1405 01:13:47,780 --> 01:13:51,850 I started with the zero array that is just zeros everywhere. 1406 01:13:51,850 --> 01:13:59,210 And then I wrote 5 into location i, and 7 into location j. 1407 01:13:59,210 --> 01:14:00,850 And then I'm reading from k. 1408 01:14:00,850 --> 01:14:04,680 And then I'm checking whether that's equal to 5 or not. 1409 01:14:04,680 --> 01:14:10,110 Then that can be expanded by using this definition 1410 01:14:10,110 --> 01:14:14,450 to something that says, well, 1411 01:14:14,450 --> 01:14:19,290 if k is equal to i, and k is different from j, 1412 01:14:19,290 --> 01:14:21,650 then, yes, this is going to be equal to 5. 1413 01:14:24,570 --> 01:14:30,640 And otherwise this is not going to be equal to 5, right?
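[Editor's sketch] The update rule and the expansion of the example formula can be checked with a small model of array terms. This is not how an SMT solver works internally; it simply evaluates the read-over-write definition on concrete indices and compares it with the expanded condition. All names here (`const_array`, `store`, `select`, `lecture_formula`, `expanded`, `expansion_agrees`) are made up for the illustration.

```python
from itertools import product

def const_array(value):
    """Array term that holds `value` at every index."""
    return ("const", value)

def store(a, i, e):
    """New array term: same as `a`, except location i holds e."""
    return ("store", a, i, e)

def select(a, k):
    """Read location k, peeling updates from the most recent one inward."""
    while a[0] == "store":
        _, inner, i, e = a
        if k == i:
            return e      # k hits this update: the result is e
        a = inner         # k differs from i: look at the older array
    return a[1]           # reached the constant array

def lecture_formula(i, j, k):
    """Zero array, write 5 at i, then 7 at j, read k, compare with 5."""
    zero = const_array(0)
    return select(store(store(zero, i, 5), j, 7), k) == 5

def expanded(i, j, k):
    """The expansion: the read is 5 exactly when k == i and k != j."""
    return k == i and k != j

def expansion_agrees(domain=range(4)):
    """Brute-force check over a small index domain (an assumption)."""
    return all(lecture_formula(i, j, k) == expanded(i, j, k)
               for i, j, k in product(domain, repeat=3))
```

The case `i == j == k` is the interesting one: the later write of 7 overwrites the 5, so the read is not 5, which is why the expansion needs the `k != j` conjunct.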
1414 01:14:30,640 --> 01:14:33,850 And in practice SMT solvers don't just expand these 1415 01:14:33,850 --> 01:14:36,290 into lots of Boolean formulas. 1416 01:14:36,290 --> 01:14:37,950 They, again, use this back and forth 1417 01:14:37,950 --> 01:14:41,200 strategy between a SAT solver and an engine 1418 01:14:41,200 --> 01:14:45,380 that is able to reason about this theory of arrays in order 1419 01:14:45,380 --> 01:14:46,020 to do it. 1420 01:14:46,020 --> 01:14:48,060 But what's important is that by relying 1421 01:14:48,060 --> 01:14:51,680 on this theory of arrays, using the same strategy we 1422 01:14:51,680 --> 01:15:00,050 saw to generate formulas for integers, you can actually 1423 01:15:00,050 --> 01:15:03,990 generate formulas involving array logic, 1424 01:15:03,990 --> 01:15:08,720 and involving array updates, involving array accesses, 1425 01:15:08,720 --> 01:15:16,730 involving iteration over arrays. As long as you fix your path, 1426 01:15:16,730 --> 01:15:21,000 these formulas are very easy to generate. 1427 01:15:21,000 --> 01:15:22,440 If you don't fix your paths, if you 1428 01:15:22,440 --> 01:15:24,450 want to generate a formula that corresponds 1429 01:15:24,450 --> 01:15:29,080 to going through all paths, then it's also relatively easy. 1430 01:15:29,080 --> 01:15:32,310 The only thing is you have to deal with loops 1431 01:15:32,310 --> 01:15:34,910 in more of a special way. 1432 01:15:34,910 --> 01:15:35,479 Yes? 1433 01:15:35,479 --> 01:15:36,395 AUDIENCE: [INAUDIBLE]. 1434 01:15:43,340 --> 01:15:46,530 ARMANDO SOLAR-LEZAMA: I don't know. 1435 01:15:46,530 --> 01:15:48,870 So dictionaries and maps are actually 1436 01:15:48,870 --> 01:15:52,960 very easy to model using uninterpreted functions. 1437 01:15:52,960 --> 01:15:55,190 And, in fact, the theory of arrays 1438 01:15:55,190 --> 01:16:05,170 itself, it's just a special case of uninterpreted functions.
1439 01:16:05,170 --> 01:16:09,630 So more complicated things can be done 1440 01:16:09,630 --> 01:16:11,460 with uninterpreted functions. 1441 01:16:11,460 --> 01:16:16,820 In modern SMT solvers there is native support 1442 01:16:16,820 --> 01:16:20,657 for reasoning about sets and set operations, 1443 01:16:20,657 --> 01:16:22,740 which can be very, very useful if you're reasoning 1444 01:16:22,740 --> 01:16:28,390 about a program that involves lots of set computations, 1445 01:16:28,390 --> 01:16:30,410 for example. 1446 01:16:30,410 --> 01:16:33,750 When designing one of these tools 1447 01:16:33,750 --> 01:16:36,320 the modeling step is really important. 1448 01:16:36,320 --> 01:16:41,040 And it's not just how you model complicated program features 1449 01:16:41,040 --> 01:16:43,320 down to your theories, 1450 01:16:43,320 --> 01:16:47,850 so, for example, things like heaps down to arrays. 1451 01:16:47,850 --> 01:16:50,837 It's also the choice of what theories in the solver you 1452 01:16:50,837 --> 01:16:51,630 use. 1453 01:16:51,630 --> 01:16:56,470 And there's a large number of theories in the solver 1454 01:16:56,470 --> 01:17:02,260 with different trade-offs between how efficient they are 1455 01:17:02,260 --> 01:17:04,520 versus how expressive they are. 1456 01:17:04,520 --> 01:17:08,870 And, in general, most of the production tools 1457 01:17:08,870 --> 01:17:13,370 stick to the theory of bit-vectors 1458 01:17:13,370 --> 01:17:16,550 and they might use the theory of arrays 1459 01:17:16,550 --> 01:17:21,820 to model the heap if that is necessary.
1460 01:17:21,820 --> 01:17:24,220 Generally production tools try to shy away 1461 01:17:24,220 --> 01:17:27,380 from some of the more sophisticated theories, 1462 01:17:27,380 --> 01:17:31,560 like the theory of sets, just because by virtue 1463 01:17:31,560 --> 01:17:36,450 of being richer they also tend to be less scalable in some cases, 1464 01:17:36,450 --> 01:17:39,620 unless you're dealing with a program that really requires 1465 01:17:39,620 --> 01:17:44,920 exactly that kind of reasoning in order to work with. 1466 01:17:44,920 --> 01:17:47,841 Are there other questions? 1467 01:17:47,841 --> 01:17:48,340 Yes? 1468 01:17:48,340 --> 01:17:50,834 AUDIENCE: [INAUDIBLE] research in symbolic execution, 1469 01:17:50,834 --> 01:17:52,762 what are people focusing on and where 1470 01:17:52,762 --> 01:17:54,208 is there room for improvement? 1471 01:17:54,208 --> 01:17:56,620 [INAUDIBLE] applications. 1472 01:17:56,620 --> 01:18:00,040 ARMANDO SOLAR-LEZAMA: So one very active area of research 1473 01:18:00,040 --> 01:18:02,880 is around applications. 1474 01:18:02,880 --> 01:18:06,080 And looking at models that will allow 1475 01:18:06,080 --> 01:18:09,400 you to discover new classes of bugs. 1476 01:18:09,400 --> 01:18:15,200 So, for example, Nickolai, and Frans, and Xi Wang and I 1477 01:18:15,200 --> 01:18:19,330 had a paper, what was it, last year 1478 01:18:19,330 --> 01:18:23,810 where we were looking at using symbolic execution to identify 1479 01:18:23,810 --> 01:18:28,770 code in your program that a compiler might optimize away. 1480 01:18:28,770 --> 01:18:32,410 Security checks that might get optimized away by a compiler.
1481 01:18:32,410 --> 01:18:38,510 So it's very different from the question of will the program go 1482 01:18:38,510 --> 01:18:42,470 down this path or not, but there is a modeling step 1483 01:18:42,470 --> 01:18:45,300 to go from this high level conceptual question 1484 01:18:45,300 --> 01:18:47,750 of, is there code in my program 1485 01:18:47,750 --> 01:18:54,780 that can be compiled away, to an algorithm based 1486 01:18:54,780 --> 01:18:56,673 on symbolic execution that will rely 1487 01:18:56,673 --> 01:18:58,530 on the ability of symbolic execution 1488 01:18:58,530 --> 01:19:01,290 to easily tell you whether the program can go down 1489 01:19:01,290 --> 01:19:04,930 a particular path, or whether a particular path is feasible. 1490 01:19:04,930 --> 01:19:08,380 So applications is a big area, extending 1491 01:19:08,380 --> 01:19:12,080 to newer classes of bugs, growing 1492 01:19:12,080 --> 01:19:15,500 to new and different language features. 1493 01:19:15,500 --> 01:19:19,740 For example, one of the things that is still 1494 01:19:19,740 --> 01:19:22,840 fairly hard to model using symbolic execution 1495 01:19:22,840 --> 01:19:28,850 are very high level languages, like JavaScript or Python where 1496 01:19:28,850 --> 01:19:31,750 you have a lot of very dynamic language features, 1497 01:19:31,750 --> 01:19:37,910 but at the same time they are-- if any technique can 1498 01:19:37,910 --> 01:19:40,370 work for the symbolic execution, it's definitely very good. 1499 01:19:40,370 --> 01:19:44,640 And, in fact, we had some work a couple of years 1500 01:19:44,640 --> 01:19:46,780 ago using symbolic execution to reason 1501 01:19:46,780 --> 01:19:50,070 about errors in Python programming assignments, 1502 01:19:50,070 --> 01:19:51,890 for example. 1503 01:19:51,890 --> 01:19:52,623 Yes? 1504 01:19:52,623 --> 01:19:54,102 AUDIENCE: So [INAUDIBLE]. 1505 01:20:03,962 --> 01:20:04,948 How does [INAUDIBLE]?
1506 01:20:08,204 --> 01:20:09,370 ARMANDO SOLAR-LEZAMA: It is. 1507 01:20:09,370 --> 01:20:13,990 So in the case of symbolic execution part of the problem 1508 01:20:13,990 --> 01:20:19,130 is that your symbolic state, it's very hard to simply say, 1509 01:20:19,130 --> 01:20:21,340 OK, I executed this instruction, and then 1510 01:20:21,340 --> 01:20:23,430 this instruction, and then this instruction. 1511 01:20:23,430 --> 01:20:24,720 The sequence is not there. 1512 01:20:24,720 --> 01:20:28,180 There was some work a few years ago looking, for example, 1513 01:20:28,180 --> 01:20:31,970 at very small pieces of code, but very critical, 1514 01:20:31,970 --> 01:20:35,150 like a concurrent data structure in an operating 1515 01:20:35,150 --> 01:20:37,240 system, or a lock-free data structure, 1516 01:20:37,240 --> 01:20:43,190 and modeling the interactions between threads 1517 01:20:43,190 --> 01:20:47,984 by essentially saying, every time there is a variable that 1518 01:20:47,984 --> 01:20:49,900 could have been overwritten by something else, 1519 01:20:49,900 --> 01:20:54,000 you replace that value with just a fresh symbolic value that 1520 01:20:54,000 --> 01:20:55,946 says, I have no idea what this is. 1521 01:20:55,946 --> 01:20:57,320 And you generate constraints that 1522 01:20:57,320 --> 01:21:00,060 relate those symbolic values to symbolic values 1523 01:21:00,060 --> 01:21:01,520 in other threads. 1524 01:21:01,520 --> 01:21:03,320 And this has been used even to reason 1525 01:21:03,320 --> 01:21:08,840 about things like missing memory fences, for example. 1526 01:21:08,840 --> 01:21:13,565 And so it is possible, but the complexity grows quite a bit. 1527 01:21:13,565 --> 01:21:18,100 And it becomes the kind of thing that you can no longer do 1528 01:21:18,100 --> 01:21:22,240 at the scale of Microsoft Word, but you can do at the scale 1529 01:21:22,240 --> 01:21:26,087 of, say, a concurrent data structure, for example.
1530 01:21:26,087 --> 01:21:28,670 There has been other work though in the context of concurrency 1531 01:21:28,670 --> 01:21:31,200 looking at, for example, can I use symbolic execution 1532 01:21:31,200 --> 01:21:34,830 to reconstruct interleavings based 1533 01:21:34,830 --> 01:21:38,290 on knowledge of how the program behaved as it was running, 1534 01:21:38,290 --> 01:21:40,810 for example. 1535 01:21:40,810 --> 01:21:46,020 And so this opens a lot of possibilities, 1536 01:21:46,020 --> 01:21:49,220 having this capability to ask very concrete questions 1537 01:21:49,220 --> 01:21:52,660 about can my program run down this path. 1538 01:21:52,660 --> 01:21:54,440 Being able to have symbolic values 1539 01:21:54,440 --> 01:21:57,600 and ask questions, what values should these things have 1540 01:21:57,600 --> 01:22:00,200 in order for the program to do something, or in order 1541 01:22:00,200 --> 01:22:03,215 for something to happen, is a very powerful capability 1542 01:22:03,215 --> 01:22:04,590 and there's a lot of applications 1543 01:22:04,590 --> 01:22:10,660 that have been tried, but this is a fairly new piece 1544 01:22:10,660 --> 01:22:13,280 of technology as far as technology 1545 01:22:13,280 --> 01:22:15,203 for analyzing a program goes.