1
00:00:00,090 --> 00:00:02,430
The following content is
provided under a Creative

2
00:00:02,430 --> 00:00:03,810
Commons license.

3
00:00:03,810 --> 00:00:06,050
Your support will help
MIT OpenCourseWare

4
00:00:06,050 --> 00:00:10,170
continue to offer high quality
educational resources for free.

5
00:00:10,170 --> 00:00:12,690
To make a donation or to
view additional materials

6
00:00:12,690 --> 00:00:16,606
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:16,606 --> 00:00:17,570
at ocw.mit.edu.

8
00:00:27,337 --> 00:00:28,670
ARMANDO SOLAR-LEZAMA: All right.

9
00:00:28,670 --> 00:00:29,970
So good morning, everyone.

10
00:00:29,970 --> 00:00:32,250
I'm Armando Solar-Lezama.

11
00:00:32,250 --> 00:00:37,190
I'm giving the lecture
today on symbolic execution.

12
00:00:37,190 --> 00:00:41,740
How many of you here are
familiar with what the term is

13
00:00:41,740 --> 00:00:45,220
or have heard about it before?

14
00:00:45,220 --> 00:00:47,230
We want to get a
sense of audience.

15
00:00:47,230 --> 00:00:48,180
OK.

16
00:00:48,180 --> 00:00:51,080
So let's see.

17
00:00:58,295 --> 00:01:00,690
I dropped this machine
a little too many times

18
00:01:00,690 --> 00:01:04,480
and it takes a while to boot up.

19
00:01:04,480 --> 00:01:10,040
So symbolic execution
is really the workhorse

20
00:01:10,040 --> 00:01:14,420
of modern program analysis.

21
00:01:14,420 --> 00:01:17,580
It's one of those techniques
that has really broken out

22
00:01:17,580 --> 00:01:21,630
of the research bubble
and actually made it

23
00:01:21,630 --> 00:01:25,210
into a very large number of
high impact applications.

24
00:01:25,210 --> 00:01:29,400
For example, today
at Microsoft there's

25
00:01:29,400 --> 00:01:35,000
a system called SAGE that runs
on a lot of important Microsoft

26
00:01:35,000 --> 00:01:37,560
code ranging from
PowerPoint to Windows

27
00:01:37,560 --> 00:01:40,520
to actually find security
problems and security

28
00:01:40,520 --> 00:01:42,610
vulnerabilities.

29
00:01:42,610 --> 00:01:44,700
There are a lot of
academic projects that

30
00:01:44,700 --> 00:01:48,160
have made a lot of
real world impact

31
00:01:48,160 --> 00:01:51,720
by discovering important
bugs in open source software,

32
00:01:51,720 --> 00:01:55,870
for example, by relying
on symbolic execution.

33
00:01:55,870 --> 00:01:59,390
And the beauty of symbolic
execution as a technique

34
00:01:59,390 --> 00:02:03,410
is that compared to
testing, for example,

35
00:02:03,410 --> 00:02:04,980
it gives you the
ability to reason

36
00:02:04,980 --> 00:02:07,420
about how your program
is going to behave

37
00:02:07,420 --> 00:02:12,260
on a potentially infinite
set of possible inputs.

38
00:02:12,260 --> 00:02:15,390
It allows you to
explore spaces of inputs

39
00:02:15,390 --> 00:02:18,730
that would be completely
unfeasible and impractical

40
00:02:18,730 --> 00:02:21,860
to explore by, say,
random testing,

41
00:02:21,860 --> 00:02:25,080
or even by having a very
large number of testers

42
00:02:25,080 --> 00:02:27,040
banging on the code.

43
00:02:27,040 --> 00:02:29,690
On the other hand, compared
to more traditional

44
00:02:29,690 --> 00:02:32,580
static analysis techniques
it has the advantage

45
00:02:32,580 --> 00:02:36,180
that when it discovers a
problem it can actually

46
00:02:36,180 --> 00:02:39,860
produce for you an
input and a trace

47
00:02:39,860 --> 00:02:42,280
that you can run on
your real program

48
00:02:42,280 --> 00:02:44,990
and execute that
program on that input.

49
00:02:44,990 --> 00:02:48,100
And you can actually tell
that it is a real bug.

50
00:02:48,100 --> 00:02:49,980
And you can actually
go and debug it

51
00:02:49,980 --> 00:02:55,700
using traditional
debugging mechanisms.

52
00:02:55,700 --> 00:02:58,970
And this is
particularly valuable

53
00:02:58,970 --> 00:03:02,100
when you're in an industrial
development environment

54
00:03:02,100 --> 00:03:04,830
where you probably
don't have time

55
00:03:04,830 --> 00:03:08,880
to go looking after every
little problem in your code.

56
00:03:08,880 --> 00:03:10,920
You really want
to be able to tell

57
00:03:10,920 --> 00:03:12,880
the difference
between real problems

58
00:03:12,880 --> 00:03:16,010
versus false
positives, for example.

59
00:03:16,010 --> 00:03:21,150
So how does it work?

60
00:03:21,150 --> 00:03:23,510
So in order to
really understand how

61
00:03:23,510 --> 00:03:28,260
it works it's useful to
start by looking at just

62
00:03:28,260 --> 00:03:30,450
normal execution, right?

63
00:03:30,450 --> 00:03:32,140
If we think of
symbolic execution

64
00:03:32,140 --> 00:03:36,500
as a generalization of
traditional, plain execution,

65
00:03:36,500 --> 00:03:40,310
it makes sense to know
what this looks like.

66
00:03:40,310 --> 00:03:44,420
So I'm going to be using this
very, very simple program

67
00:03:44,420 --> 00:03:48,090
as an illustration for
a lot of what I'm going

68
00:03:48,090 --> 00:03:49,800
to be talking about today.

69
00:03:49,800 --> 00:03:51,510
So what do we have here?

70
00:03:51,510 --> 00:03:54,460
Again, it's a very simple
piece of code, just

71
00:03:54,460 --> 00:03:57,510
a couple of branches and
here we have an assertion,

72
00:03:57,510 --> 00:03:58,280
assert false.

73
00:03:58,280 --> 00:04:01,570
And we want to know could that
assertion ever be triggered.

74
00:04:01,570 --> 00:04:02,270
Is it possible?

75
00:04:02,270 --> 00:04:07,260
Is there some input that
will make that assertion fail?

76
00:04:07,260 --> 00:04:09,510
And in this case because the
assertion is just saying,

77
00:04:09,510 --> 00:04:11,780
assert false, what
I'm really asking is,

78
00:04:11,780 --> 00:04:14,960
is there an input that can
reach that point in the program?

79
00:04:14,960 --> 00:04:19,070
So one of the things I can
do is I can try just testing.

80
00:04:19,070 --> 00:04:24,550
I can go in and run this
code with a concrete input.

81
00:04:24,550 --> 00:04:25,050
Right?

82
00:04:25,050 --> 00:04:29,850
So let's say that I start
with an input where x is 4

83
00:04:29,850 --> 00:04:31,820
and y is 4.

84
00:04:31,820 --> 00:04:35,110
And initially t is going
to have the value 0

85
00:04:35,110 --> 00:04:36,310
right after I declare it.

86
00:04:36,310 --> 00:04:38,990
So before we go with
normal execution,

87
00:04:38,990 --> 00:04:40,800
what are some of the
important points here?

88
00:04:40,800 --> 00:04:44,884
The fact that we need some
representation of the state

89
00:04:44,884 --> 00:04:45,800
of the program, right?

90
00:04:45,800 --> 00:04:48,680
Whether we're doing
normal execution

91
00:04:48,680 --> 00:04:52,020
or whether we're doing
symbolic execution,

92
00:04:52,020 --> 00:04:53,700
we need to have some
way to characterize

93
00:04:53,700 --> 00:04:54,710
the state of the program.

94
00:04:54,710 --> 00:04:56,700
And in this case, this
is such a simple program

95
00:04:56,700 --> 00:04:59,850
that it doesn't use the heap.

96
00:04:59,850 --> 00:05:01,050
It doesn't use the stack.

97
00:05:01,050 --> 00:05:03,210
There are no function calls.

98
00:05:03,210 --> 00:05:07,550
So the state can be fully
characterized by these three

99
00:05:07,550 --> 00:05:10,130
variables together
with knowledge of where

100
00:05:10,130 --> 00:05:12,200
in the program I'm at, right?

101
00:05:12,200 --> 00:05:15,920
So if I start
executing with 4, 4,

102
00:05:15,920 --> 00:05:21,330
and 0, so when I get to this
branch, is 4 greater than 4?

103
00:05:21,330 --> 00:05:22,460
Clearly not.

104
00:05:22,460 --> 00:05:26,560
So then I'm going to be
executing t equals y.

105
00:05:26,560 --> 00:05:29,850
So now after I do
that t is no longer 0.

106
00:05:29,850 --> 00:05:32,230
It now has the value 4.

107
00:05:32,230 --> 00:05:32,730
Right?

108
00:05:32,730 --> 00:05:35,080
So that is now the
state of my program.

109
00:05:35,080 --> 00:05:38,980
And then I can
evaluate this branch.

110
00:05:38,980 --> 00:05:41,260
Is it the case that
t is less than x?

111
00:05:43,850 --> 00:05:44,350
No.

112
00:05:44,350 --> 00:05:44,849
Right?

113
00:05:44,849 --> 00:05:46,440
So we dodged the bullet.

114
00:05:46,440 --> 00:05:49,490
We did not get an
assertion failure.

115
00:05:49,490 --> 00:05:52,331
There was no problem in
this particular execution.

116
00:05:52,331 --> 00:05:52,830
Right?

117
00:05:52,830 --> 00:05:55,580
But that doesn't
really tell us anything

118
00:05:55,580 --> 00:05:57,010
about any other execution.

119
00:05:57,010 --> 00:05:59,540
All we know is that
under the input

120
00:05:59,540 --> 00:06:03,580
x equals 4 and y equals 4, the
program is not going to fail.

121
00:06:03,580 --> 00:06:06,790
But it tells us nothing
about what's going to happen

122
00:06:06,790 --> 00:06:10,390
on the input
2, 1, for example.

123
00:06:10,390 --> 00:06:10,890
Right?

124
00:06:10,890 --> 00:06:13,700
And in this input you see
that this input is actually

125
00:06:13,700 --> 00:06:17,350
going to follow a different
path in the execution.

126
00:06:17,350 --> 00:06:22,020
This time we're actually
going to see that t equals x.

127
00:06:22,020 --> 00:06:25,750
We're actually going
to set t equal to x.

128
00:06:25,750 --> 00:06:29,710
So after executing this
t will be equal to 2,

129
00:06:29,710 --> 00:06:32,765
but is there any problem
in this execution?

130
00:06:36,800 --> 00:06:39,500
Will there be an assertion
failure on this input?

131
00:06:42,920 --> 00:06:44,050
Well, so let's see.

132
00:06:44,050 --> 00:06:45,850
So if t is 2.

133
00:06:45,850 --> 00:06:47,970
And x is 2.

134
00:06:47,970 --> 00:06:50,420
Is t less than x?

135
00:06:50,420 --> 00:06:51,160
No.

136
00:06:51,160 --> 00:06:54,330
So it looks like we
dodged a bullet again.

137
00:06:54,330 --> 00:06:54,830
Right?

138
00:06:54,830 --> 00:06:57,930
So here we have two
concrete inputs.

139
00:06:57,930 --> 00:07:00,440
And they told us that on
these two concrete inputs

140
00:07:00,440 --> 00:07:01,770
the program didn't fail.

141
00:07:01,770 --> 00:07:06,900
But that really doesn't tell us
anything about any other input.
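
The two concrete runs above can be replayed with a minimal sketch. The slide's source code is not in the transcript, so the function below is a reconstruction from the narration (t starts at 0, takes x or y depending on the first branch, and the assertion sits behind t < x); the name `f` and the choice to return t are assumptions made for illustration.

```python
def f(x: int, y: int) -> int:
    """Reconstruction of the lecture's example; returns the final value of t."""
    t = 0              # t is 0 right after it is declared
    if x > y:
        t = x          # branch taken by the input (2, 1)
    else:
        t = y          # branch taken by the input (4, 4)
    if t < x:
        # Stands in for the slide's "assert false".
        raise AssertionError("assert false reached")
    return t


# Replay the two concrete inputs walked through in the lecture.
for x, y in [(4, 4), (2, 1)]:
    print((x, y), "-> t =", f(x, y))
```

Each run exercises exactly one path, which is why two passing tests say nothing about the remaining inputs.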

142
00:07:06,900 --> 00:07:10,080
And so the idea with
symbolic execution

143
00:07:10,080 --> 00:07:13,950
is we want to go beyond these
single input executions.

144
00:07:13,950 --> 00:07:17,480
And we want to be able
to actually reason

145
00:07:17,480 --> 00:07:20,440
about the behavior of
the program on very

146
00:07:20,440 --> 00:07:21,550
large sets of inputs.

147
00:07:21,550 --> 00:07:25,680
In some cases, infinite
sets of possible inputs.

148
00:07:25,680 --> 00:07:28,830
And the basic idea
is as follows.

149
00:07:28,830 --> 00:07:31,940
So for a program
like this, just like

150
00:07:31,940 --> 00:07:33,940
before the state
of the program is

151
00:07:33,940 --> 00:07:36,630
characterized by the
value of these three

152
00:07:36,630 --> 00:07:37,500
different variables.

153
00:07:37,500 --> 00:07:41,140
Right? x, y, and t together with
knowing where in the program

154
00:07:41,140 --> 00:07:42,380
I'm at.

155
00:07:42,380 --> 00:07:48,230
But now instead of
concrete values for x and y

156
00:07:48,230 --> 00:07:51,920
what I'm going to have
is a symbolic value, just

157
00:07:51,920 --> 00:07:52,530
a variable.

158
00:07:52,530 --> 00:07:57,760
A variable that allows me
to give a name to this value

159
00:07:57,760 --> 00:08:00,450
that the user is going
to provide at the input.

160
00:08:00,450 --> 00:08:03,540
So what that means is that
the state of my program

161
00:08:03,540 --> 00:08:07,170
is no longer a mapping
from variable names

162
00:08:07,170 --> 00:08:08,630
to concrete values.

163
00:08:08,630 --> 00:08:13,542
It's now a mapping from variable
names to these symbolic values.

164
00:08:13,542 --> 00:08:15,250
And a symbolic value,
you can essentially

165
00:08:15,250 --> 00:08:18,480
think of it as a formula.

166
00:08:18,480 --> 00:08:23,610
So in this case the
formula for x is just x.

167
00:08:23,610 --> 00:08:25,440
And the formula for y is just y.

168
00:08:25,440 --> 00:08:27,590
And for t, it's
actually the value 0.

169
00:08:27,590 --> 00:08:31,190
We know that for every input,
doesn't matter what you do.

170
00:08:31,190 --> 00:08:35,400
The value of t after the first
statement is going to be 0.

171
00:08:35,400 --> 00:08:39,510
But now here's where
it gets interesting.

172
00:08:39,510 --> 00:08:42,179
So we get to this
branch right here

173
00:08:42,179 --> 00:08:44,912
that says, if x
is greater than y,

174
00:08:44,912 --> 00:08:46,370
we're going to go
in one direction.

175
00:08:48,977 --> 00:08:50,560
If it's less than
or equal to y, we're

176
00:08:50,560 --> 00:08:52,018
going to go in the
other direction.

177
00:08:52,018 --> 00:08:55,600
Now do we know
anything about x and y?

178
00:08:59,476 --> 00:09:00,600
What do we know about them?

179
00:09:05,600 --> 00:09:07,110
We know their type, at least.

180
00:09:07,110 --> 00:09:08,020
So that's a start.

181
00:09:08,020 --> 00:09:11,870
So we know that they're going
to be ranging from min int

182
00:09:11,870 --> 00:09:16,287
to max int, but that's about
all we know about them.

183
00:09:16,287 --> 00:09:17,870
And it turns out
that this information

184
00:09:17,870 --> 00:09:22,070
that we know about them is not
sufficient to tell us which

185
00:09:22,070 --> 00:09:23,630
direction this branch might go.

186
00:09:23,630 --> 00:09:26,630
This branch could go either way.

187
00:09:26,630 --> 00:09:32,360
And so now there are many
things we can do,

188
00:09:32,360 --> 00:09:35,870
but what's one possible thing
that we could do at this point?

189
00:09:44,680 --> 00:09:46,018
Make a wild guess.

190
00:09:46,018 --> 00:09:46,935
AUDIENCE: [INAUDIBLE].

191
00:09:46,935 --> 00:09:48,059
ARMANDO SOLAR-LEZAMA: Yeah.

192
00:09:48,059 --> 00:09:49,680
We could follow both branches.

193
00:09:49,680 --> 00:09:54,420
We could flip a coin and pick
one branch and take that.

194
00:09:54,420 --> 00:09:56,730
So if we want to
follow both branches

195
00:09:56,730 --> 00:09:58,990
we have to follow one and
then the other one, right?

196
00:09:58,990 --> 00:10:04,381
So let's say we start
with this branch.

197
00:10:04,381 --> 00:10:04,880
Right?

198
00:10:04,880 --> 00:10:07,250
So now we are at this branch.

199
00:10:07,250 --> 00:10:11,240
So what we know is that if
we make it to this branch,

200
00:10:11,240 --> 00:10:17,740
in this branch t is now going
to have the same value as x.

201
00:10:17,740 --> 00:10:20,277
And we don't know what
that value is going to be,

202
00:10:20,277 --> 00:10:21,360
but we have a name for it.

203
00:10:21,360 --> 00:10:26,080
It's this script letter x.

204
00:10:26,080 --> 00:10:26,580
Right?

205
00:10:26,580 --> 00:10:31,370
So that's the value
of t on that branch.

206
00:10:31,370 --> 00:10:36,330
If we were to take the opposite
branch then what would happen?

207
00:10:36,330 --> 00:10:38,730
The value of t would be
something different, right?

208
00:10:38,730 --> 00:10:45,790
In that branch, the value of t
would be the symbolic value y.

209
00:10:45,790 --> 00:10:50,050
So that means that when we get
to this point in the program,

210
00:10:50,050 --> 00:10:51,020
what is the value of t?

211
00:10:51,020 --> 00:10:53,040
Well, maybe it's x.

212
00:10:53,040 --> 00:10:54,440
And maybe it's y.

213
00:10:54,440 --> 00:10:58,460
We don't know exactly which
one it is, but why don't we

214
00:10:58,460 --> 00:10:59,150
give it a name?

215
00:10:59,150 --> 00:11:02,450
Let's call it t0.

216
00:11:02,450 --> 00:11:04,970
And what do we know about t0?

217
00:11:07,570 --> 00:11:10,855
What are the cases where t0
is going to be equal to x?

218
00:11:14,291 --> 00:11:15,650
AUDIENCE: [INAUDIBLE].

219
00:11:15,650 --> 00:11:17,108
ARMANDO SOLAR-LEZAMA:
That's right.

220
00:11:17,108 --> 00:11:21,380
So essentially what we know is
that if x is greater than y,

221
00:11:21,380 --> 00:11:27,280
then this implies that it's x.

222
00:11:27,280 --> 00:11:37,460
And if x is less than or equal
to y that implies that it's y,

223
00:11:37,460 --> 00:11:38,350
right?

224
00:11:38,350 --> 00:11:41,960
And so we have this
value that we've defined.

225
00:11:41,960 --> 00:11:43,750
We'll call it t0.

226
00:11:43,750 --> 00:11:46,470
And it has these
logical properties.

227
00:11:46,470 --> 00:11:53,150
So at this point in
the program we actually

228
00:11:53,150 --> 00:11:56,800
have a name for the value of t.

229
00:11:56,800 --> 00:11:57,700
It's t0.

230
00:12:00,290 --> 00:12:00,960
Right?

231
00:12:00,960 --> 00:12:03,200
And so what did we do here?

232
00:12:03,200 --> 00:12:06,835
We took both branches
of this if statement.

233
00:12:09,610 --> 00:12:12,090
And then we computed
the symbolic value

234
00:12:12,090 --> 00:12:14,220
by looking at under
what conditions

235
00:12:14,220 --> 00:12:17,170
am I going to take one branch,
under what conditions am I

236
00:12:17,170 --> 00:12:19,360
going to take another branch?

237
00:12:19,360 --> 00:12:22,330
And then looking
at what values am

238
00:12:22,330 --> 00:12:26,420
I going to be assigning to
t on both of those branches?
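
One way to picture the bookkeeping described here is a small sketch in which the state maps program variables to symbolic values and carries a list of facts about them. The class name `SymbolicState`, the helper `merge_if`, and the use of plain strings as formulas are all assumptions made for illustration; a real tool would use proper expression trees.

```python
from dataclasses import dataclass, field


@dataclass
class SymbolicState:
    # Maps program variable names to symbolic values (strings here).
    values: dict
    # Facts known about the symbolic values, kept as readable strings.
    constraints: list = field(default_factory=list)


def merge_if(state, var, cond, neg_cond, then_val, else_val, fresh):
    """Give `var` a fresh name after an if/else and record when it equals each side."""
    state.values[var] = fresh
    state.constraints.append(f"({cond}) -> {fresh} == {then_val}")
    state.constraints.append(f"({neg_cond}) -> {fresh} == {else_val}")
    return state


# State right after "t = 0": x and y are unknown inputs, t is the constant 0.
state = SymbolicState(values={"x": "X", "y": "Y", "t": "0"})

# Follow both branches of "if x > y: t = x else: t = y" and merge the results.
merge_if(state, "t", "X > Y", "X <= Y", "X", "Y", fresh="t0")

print(state.values)
for c in state.constraints:
    print(c)
```

After the merge, t is named t0 and the two implications record under which condition it equals each input.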

239
00:12:26,420 --> 00:12:31,760
So now it comes to the
point where we have to ask,

240
00:12:31,760 --> 00:12:33,130
can t be less than x?

241
00:12:33,130 --> 00:12:33,630
Right?

242
00:12:33,630 --> 00:12:35,510
So what is the value of t?

243
00:12:35,510 --> 00:12:37,580
The value of t is now t0.

244
00:12:37,580 --> 00:12:41,040
So what we want
to know is, is it

245
00:12:41,040 --> 00:12:47,090
possible for t0
to be less than x?

246
00:12:47,090 --> 00:12:47,590
Right?

247
00:12:47,590 --> 00:12:51,760
Now remember the
first branch we hit

248
00:12:51,760 --> 00:12:53,930
we were asking a
question about x and y.

249
00:12:53,930 --> 00:12:56,990
And we knew nothing
about x and y.

250
00:12:56,990 --> 00:12:59,520
The only thing we
knew about x and y

251
00:12:59,520 --> 00:13:02,100
was that they were of type int.

252
00:13:02,100 --> 00:13:06,620
But now with t0 we actually
know a lot about t0.

253
00:13:06,620 --> 00:13:11,930
We know that t0 is going to
be equal to x in some cases.

254
00:13:11,930 --> 00:13:14,640
And it's going to be
equal to y in some cases.

255
00:13:14,640 --> 00:13:18,300
And so this now gives
us a set of equations

256
00:13:18,300 --> 00:13:20,090
that we can solve for.

257
00:13:20,090 --> 00:13:26,060
So what we can say is,
is it possible to satisfy

258
00:13:26,060 --> 00:13:31,110
t0 less than x knowing
that t0 satisfies

259
00:13:31,110 --> 00:13:33,761
all of these properties?

260
00:13:33,761 --> 00:13:34,260
Right?

261
00:13:34,260 --> 00:13:38,270
So, in fact, we can
actually express this

262
00:13:38,270 --> 00:13:44,550
as a constraint where we say,
so is it possible to have t0

263
00:13:44,550 --> 00:13:45,990
less than x?

264
00:13:45,990 --> 00:13:55,720
And to have x greater than
y implies t0 equals x.

265
00:13:55,720 --> 00:14:07,146
And x less than or equal
to y implies t0 equals y.

266
00:14:10,010 --> 00:14:10,510
Right?

267
00:14:10,510 --> 00:14:15,890
So what we have here is an
equation that if that equation

268
00:14:15,890 --> 00:14:20,200
has a solution, if it's
possible to find a value of t0,

269
00:14:20,200 --> 00:14:24,660
and a value of x, and a value of
y that satisfies that equation,

270
00:14:24,660 --> 00:14:29,930
then we know that those
values, when we plug them

271
00:14:29,930 --> 00:14:33,170
into our program, when
the program executes,

272
00:14:33,170 --> 00:14:35,930
it will take this branch.

273
00:14:35,930 --> 00:14:40,090
And it will blow up when
it hits the assert false.

274
00:14:42,721 --> 00:14:43,220
Right?

275
00:14:43,220 --> 00:14:45,080
So what did we do here?

276
00:14:45,080 --> 00:14:50,370
So we're executing this
program, but instead

277
00:14:50,370 --> 00:14:57,560
of keeping our state as a
mapping from variable names

278
00:14:57,560 --> 00:14:59,970
to values, what
we're doing is we're

279
00:14:59,970 --> 00:15:03,970
keeping our state as a
mapping from variable names

280
00:15:03,970 --> 00:15:07,310
to these symbolic values.

281
00:15:07,310 --> 00:15:09,230
Essentially, other
variable names.

282
00:15:09,230 --> 00:15:11,830
And in this case our
other variable names

283
00:15:11,830 --> 00:15:17,320
are the script x, script
y, t0, and on top of that,

284
00:15:17,320 --> 00:15:20,110
we have a set of
equations that tell us

285
00:15:20,110 --> 00:15:22,460
how those values are related.

286
00:15:22,460 --> 00:15:24,510
So we have an
equation that tells us

287
00:15:24,510 --> 00:15:29,180
how t0 is related to
x and y in this case.

288
00:15:29,180 --> 00:15:33,620
And solving for
that equation allows

289
00:15:33,620 --> 00:15:37,380
us to answer the question
of whether this branch can

290
00:15:37,380 --> 00:15:38,310
be taken or not.

291
00:15:38,310 --> 00:15:41,510
Now just looking
at the equation,

292
00:15:41,510 --> 00:15:42,900
can this branch be taken or not?

293
00:15:45,570 --> 00:15:46,070
Right?

294
00:15:46,070 --> 00:15:49,450
So it looks like the
branch cannot be taken.

295
00:15:49,450 --> 00:15:50,170
Why not?

296
00:15:50,170 --> 00:15:56,390
Because we're looking for
cases where t0 is less than x,

297
00:15:56,390 --> 00:15:59,500
which means that if you're
in this case, then clearly

298
00:15:59,500 --> 00:16:01,350
that's not going to be true.

299
00:16:01,350 --> 00:16:01,850
Right?

300
00:16:01,850 --> 00:16:04,480
So that means that when
x is greater than y,

301
00:16:04,480 --> 00:16:08,280
then it cannot happen because
t0 will be equal to x.

302
00:16:08,280 --> 00:16:11,720
And it cannot be equal to x and
less than x at the same time.

303
00:16:11,720 --> 00:16:13,950
And what about in this case?

304
00:16:13,950 --> 00:16:15,180
Can it happen in this case?

305
00:16:15,180 --> 00:16:17,200
Can t0 be less than
x in this case?

306
00:16:21,150 --> 00:16:22,590
No, it clearly cannot, right?

307
00:16:22,590 --> 00:16:29,180
Because in this case we know
that x is less than or equal to y.

308
00:16:29,180 --> 00:16:31,790
And so if t0 is going
to be less than x,

309
00:16:31,790 --> 00:16:34,070
then it would also
be less than y.

310
00:16:34,070 --> 00:16:37,730
But we know that in that case
t0 is exactly equal to y.

311
00:16:37,730 --> 00:16:42,730
And therefore, again, that
case cannot be satisfied.

312
00:16:42,730 --> 00:16:47,080
So what we have here is an
equation that has no solution.

313
00:16:47,080 --> 00:16:49,980
It doesn't matter what values
you plug into this equation.

314
00:16:49,980 --> 00:16:54,990
You cannot solve it and that
tells us that no matter what

315
00:16:54,990 --> 00:17:01,620
inputs we pass to this code, it
will not go down this branch.
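
The constraint discussed above can be checked with a rough sketch that simply tries every combination of values in a small range. This brute-force search is only a stand-in for a real solver, and the range bounds and function names are assumptions. Unlike the argument in the lecture, it proves nothing outside the range it enumerates.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Logical implication: a -> b."""
    return (not a) or b


def path_constraint(x: int, y: int, t0: int) -> bool:
    """t0 < x, together with the two facts known about t0."""
    return (
        t0 < x
        and implies(x > y, t0 == x)
        and implies(x <= y, t0 == y)
    )


def find_solution(lo: int = -8, hi: int = 8):
    """Return some (x, y, t0) in [lo, hi] satisfying the constraint, or None."""
    values = range(lo, hi + 1)
    for x, y, t0 in product(values, repeat=3):
        if path_constraint(x, y, t0):
            return (x, y, t0)
    return None


print(find_solution())
```

Within the searched range no assignment satisfies the constraint, which agrees with the case analysis in the lecture.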

316
00:17:01,620 --> 00:17:07,460
Now notice that when
making that argument here

317
00:17:07,460 --> 00:17:10,859
I was basically alluding to
your intuition about integers,

318
00:17:10,859 --> 00:17:13,619
about mathematical integers.

319
00:17:13,619 --> 00:17:17,589
In practice we know that
machine ints don't quite

320
00:17:17,589 --> 00:17:22,109
behave exactly the same
way as mathematical ints.

321
00:17:22,109 --> 00:17:25,130
And there are some
cases where laws

322
00:17:25,130 --> 00:17:27,430
that apply to mathematical
ints don't actually

323
00:17:27,430 --> 00:17:29,822
apply to ints in programs.

324
00:17:29,822 --> 00:17:31,280
And so when reasoning
about this we

325
00:17:31,280 --> 00:17:33,761
have to be very
careful that when

326
00:17:33,761 --> 00:17:35,260
we're solving these
equations, we're

327
00:17:35,260 --> 00:17:40,930
keeping in mind
that these are not

328
00:17:40,930 --> 00:17:44,830
the integers as they were taught
to us in elementary school.

329
00:17:44,830 --> 00:17:48,550
These are 32-bit integers
that the machine uses.

330
00:17:48,550 --> 00:17:51,090
And there are many
cases and many instances

331
00:17:51,090 --> 00:17:55,000
of bugs that arose because
programmers were thinking

332
00:17:55,000 --> 00:17:58,770
about their code in terms
of mathematical integers,

333
00:17:58,770 --> 00:18:02,450
and not realizing that there
are things like overflows that

334
00:18:02,450 --> 00:18:04,330
can cause the program
to behave differently

335
00:18:04,330 --> 00:18:06,470
than with mathematical integers.
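
To make the overflow point concrete, here is a small sketch that simulates signed 32-bit wraparound in Python, whose own integers never overflow. It assumes two's-complement wrapping, as with Java's `int`; in C, signed overflow is undefined behavior, so this models only one possible outcome. The helper name `wrap32` is made up for illustration.

```python
INT_MIN = -2**31
INT_MAX = 2**31 - 1


def wrap32(n: int) -> int:
    """Reduce a Python integer to a signed 32-bit two's-complement value."""
    return (n + 2**31) % 2**32 - 2**31


# For mathematical integers, x - 1 < x always holds.
# With 32-bit wraparound it fails at the smallest representable value.
x = INT_MIN
t = wrap32(x - 1)

print(t == INT_MAX)   # the subtraction wrapped around to the largest int
print(t < x)          # so "x - 1 < x" does not hold here
```

A solver that reasons over bit vectors rather than mathematical integers takes this kind of corner case into account.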

336
00:18:06,470 --> 00:18:10,140
But the other thing is
what I've described here

337
00:18:10,140 --> 00:18:16,230
is a purely intuitive argument.

338
00:18:16,230 --> 00:18:19,110
I walked you through the process
of how to do this by hand,

339
00:18:19,110 --> 00:18:21,970
but that's by no
means an algorithm.

340
00:18:21,970 --> 00:18:23,250
Right?

341
00:18:23,250 --> 00:18:26,070
The beauty of this idea
of symbolic execution,

342
00:18:26,070 --> 00:18:28,920
however, is that it can be
coded into an algorithm.

343
00:18:28,920 --> 00:18:31,960
And it can be solved in
a mechanical way, which

344
00:18:31,960 --> 00:18:36,190
allows you to do this not
just for ten line programs,

345
00:18:36,190 --> 00:18:38,930
but actually for
million line programs.

346
00:18:38,930 --> 00:18:41,281
And it allows you
to actually take

347
00:18:41,281 --> 00:18:43,280
this reasoning, and the
same intuitive reasoning

348
00:18:43,280 --> 00:18:48,090
that we used in
this case to talk

349
00:18:48,090 --> 00:18:49,820
about what happens
when we execute

350
00:18:49,820 --> 00:18:51,860
this program on
different inputs.

351
00:18:51,860 --> 00:18:59,429
And scale that reasoning
to very large programs.

352
00:18:59,429 --> 00:19:00,720
Are there any questions so far?

353
00:19:05,621 --> 00:19:06,120
Yes?

354
00:19:06,120 --> 00:19:07,745
AUDIENCE: What if a
[INAUDIBLE] are not

355
00:19:07,745 --> 00:19:09,620
supposed to take an input?

356
00:19:09,620 --> 00:19:10,120
[INAUDIBLE]

357
00:19:15,639 --> 00:19:16,680
ARMANDO SOLAR-LEZAMA: Oh.

358
00:19:16,680 --> 00:19:17,920
That's a very good question.

359
00:19:17,920 --> 00:19:26,190
Right, so, for
example, let's say

360
00:19:26,190 --> 00:19:36,100
we have the program that
we have here, but instead

361
00:19:36,100 --> 00:19:46,130
of this being t equals x, here
we will say t equals x minus 1.

362
00:19:46,130 --> 00:19:47,006
Right?

363
00:19:47,006 --> 00:19:48,630
So now all of a
sudden, intuitively you

364
00:19:48,630 --> 00:19:52,580
can see that now this
program could blow up, right?

365
00:19:52,580 --> 00:20:00,150
Because when the program
takes this path then

366
00:20:00,150 --> 00:20:02,680
t will indeed be less than x.

367
00:20:02,680 --> 00:20:06,220
And you will indeed fail here.

368
00:20:06,220 --> 00:20:06,720
Right?

369
00:20:06,720 --> 00:20:10,040
So what will happen to
a program like this?

370
00:20:10,040 --> 00:20:15,370
What will our symbolic
state look like?

371
00:20:15,370 --> 00:20:15,870
Right?

372
00:20:15,870 --> 00:20:22,710
So in this case, so t0,
when x is greater than y,

373
00:20:22,710 --> 00:20:24,697
what is t0 now going
to be equal to?

374
00:20:24,697 --> 00:20:26,030
It's not going to be equal to x.

375
00:20:26,030 --> 00:20:35,060
It's going to be equal
to x minus 1, right?

376
00:20:35,060 --> 00:20:47,290
And so that means that,
so, this condition now

377
00:20:47,290 --> 00:20:50,060
has a satisfying assignment.

378
00:20:50,060 --> 00:20:50,560
Right?

379
00:20:50,560 --> 00:20:56,600
Now this can fail, but what
if you go to the developer

380
00:20:56,600 --> 00:21:03,320
and say, hey, this
function can blow up

381
00:21:03,320 --> 00:21:06,710
whenever x is greater than y.

382
00:21:06,710 --> 00:21:11,150
And the developer
looks at this and says,

383
00:21:11,150 --> 00:21:13,340
oh, I forgot to tell you.

384
00:21:13,340 --> 00:21:16,410
Actually, this
function can never

385
00:21:16,410 --> 00:21:23,090
be called with parameters
where x is greater than y.

386
00:21:23,090 --> 00:21:23,830
Right?

387
00:21:23,830 --> 00:21:27,110
That the client that calls
this function is just

388
00:21:27,110 --> 00:21:29,140
a quick function that
I wrote for something.

389
00:21:29,140 --> 00:21:32,060
And it has this branch for
some historical purpose.

390
00:21:32,060 --> 00:21:34,140
But actually this
function will never

391
00:21:34,140 --> 00:21:37,240
get called with
x greater than y.

392
00:21:37,240 --> 00:21:39,150
You're like, well,
now you tell me.

393
00:21:39,150 --> 00:21:39,870
Right?

394
00:21:39,870 --> 00:21:43,060
But the way we can
think about this

395
00:21:43,060 --> 00:21:55,830
is that there is an assumption
that x is going to be less than

396
00:21:55,830 --> 00:21:57,360
or equal to y, right?

397
00:21:57,360 --> 00:22:02,020
This is sometimes referred to
as a precondition or a contract

398
00:22:02,020 --> 00:22:02,890
for this function.

399
00:22:02,890 --> 00:22:04,639
The function is promising
to do something,

400
00:22:04,639 --> 00:22:06,622
but only if you satisfy
this assumption.

401
00:22:06,622 --> 00:22:09,080
And if you don't satisfy the
assumption, the function says,

402
00:22:09,080 --> 00:22:11,026
I don't care what happens.

403
00:22:11,026 --> 00:22:12,400
I only promise
that I'm not going

404
00:22:12,400 --> 00:22:15,390
to fail when this
assumption is satisfied.

405
00:22:15,390 --> 00:22:17,370
And it's the
responsibility of the caller

406
00:22:17,370 --> 00:22:20,790
to make sure that this condition
is never violated, right?

407
00:22:20,790 --> 00:22:26,340
So how would we
encode that constraint

408
00:22:26,340 --> 00:22:28,040
when we're solving
for equations?

409
00:22:28,040 --> 00:22:30,280
Well, essentially
what we have is

410
00:22:30,280 --> 00:22:31,780
we have this set
of constraints that

411
00:22:31,780 --> 00:22:34,040
tell us whether this
branch is feasible.

412
00:22:34,040 --> 00:22:37,100
And on top of the constraints
that we already have

413
00:22:37,100 --> 00:22:45,530
we need to also make sure
that the precondition,

414
00:22:45,530 --> 00:22:48,260
or the assumptions
are satisfied.

415
00:22:48,260 --> 00:22:48,820
Right?

416
00:22:48,820 --> 00:22:53,210
And now we want to
ask, OK, so can I

417
00:22:53,210 --> 00:22:56,780
find an x and a y that satisfy
all of these constraints

418
00:22:56,780 --> 00:22:59,630
together with this constraint
that I have on the input,

419
00:22:59,630 --> 00:23:01,540
with these properties
that I know

420
00:23:01,540 --> 00:23:03,500
that the input must satisfy?

421
00:23:03,500 --> 00:23:06,810
And once again you can
see that this constraint

422
00:23:06,810 --> 00:23:10,050
of x less than or equal
to y is the difference

423
00:23:10,050 --> 00:23:13,940
between this constraint
being satisfiable,

424
00:23:13,940 --> 00:23:18,780
and this constraint once
again becoming unsatisfiable.
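
The effect of the precondition can be seen in a short sketch of the modified program, where the first branch assigns x minus 1. As before, exhaustive search over a small range stands in for a solver and treats the values as mathematical integers, ignoring overflow. The function name and the range bounds are assumptions.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Logical implication: a -> b."""
    return (not a) or b


def failing_inputs(assume_precondition: bool, lo: int = -4, hi: int = 4):
    """List the (x, y) pairs in [lo, hi] that reach the assertion."""
    found = []
    values = range(lo, hi + 1)
    for x, y, t0 in product(values, repeat=3):
        constraint = (
            t0 < x
            and implies(x > y, t0 == x - 1)   # modified branch: t = x - 1
            and implies(x <= y, t0 == y)
        )
        if assume_precondition:
            # The caller promises x <= y, so add it to the constraints.
            constraint = constraint and x <= y
        if constraint:
            found.append((x, y))
    return found


print(len(failing_inputs(assume_precondition=False)))  # some inputs fail
print(failing_inputs(assume_precondition=True))        # none do
```

Every failing input found without the precondition has x greater than y, so adding the caller's promise removes all of them.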

425
00:23:18,780 --> 00:23:22,450
That's a very important issue
when dealing with analysis,

426
00:23:22,450 --> 00:23:25,910
especially when you want
to do this modularly

427
00:23:25,910 --> 00:23:27,990
at the level of individual
functions at a time.

428
00:23:27,990 --> 00:23:32,220
It makes sense to know
what the assumptions are

429
00:23:32,220 --> 00:23:34,412
that the programmer
had in mind when

430
00:23:34,412 --> 00:23:36,620
writing this function,
because if you don't know what

431
00:23:36,620 --> 00:23:39,760
those assumptions were
you could say, yeah, here

432
00:23:39,760 --> 00:23:42,780
are some inputs where it's going
to fail only for the programmer

433
00:23:42,780 --> 00:23:45,530
to dismiss that by saying,
oh, but those inputs are not

434
00:23:45,530 --> 00:23:49,489
possible, or those
inputs can never happen.

435
00:23:49,489 --> 00:23:50,155
Other questions?

436
00:23:57,570 --> 00:23:58,070
All right.

437
00:23:58,070 --> 00:24:03,210
So how do we do this in
a more mechanical way?

438
00:24:03,210 --> 00:24:07,965
So there are two
aspects to this problem.

439
00:24:07,965 --> 00:24:11,390
Aspect number one is
how do you actually

440
00:24:11,390 --> 00:24:13,890
come up with these formulas?

441
00:24:13,890 --> 00:24:15,770
So in this case it
was kind of intuitive

442
00:24:15,770 --> 00:24:17,174
how we came up
with the formulas,

443
00:24:17,174 --> 00:24:19,090
where we were just working
through it by hand,

444
00:24:19,090 --> 00:24:21,490
but how do you come
up with these formulas

445
00:24:21,490 --> 00:24:23,390
in a mechanical way?

446
00:24:23,390 --> 00:24:27,660
And aspect number two is
once you have the formulas,

447
00:24:27,660 --> 00:24:30,520
how do you actually solve them?

448
00:24:30,520 --> 00:24:34,140
How can you actually
solve these formulas

449
00:24:34,140 --> 00:24:38,700
that describe whether
your program fails or not?

450
00:24:38,700 --> 00:24:43,970
And I'm actually going to start
with that second question.

451
00:24:43,970 --> 00:24:48,350
Given that we're able to reduce
our problem to these formulas

452
00:24:48,350 --> 00:24:54,280
that involve integer
reasoning that involved

453
00:24:54,280 --> 00:24:55,910
in the case of
programs generally

454
00:24:55,910 --> 00:24:57,721
you care about bit
vector reasoning.

455
00:24:57,721 --> 00:25:00,220
[INAUDIBLE] programs, a lot of
times, you care about arrays.

456
00:25:00,220 --> 00:25:01,920
You care about functions.

457
00:25:01,920 --> 00:25:04,180
And you end up with
these giant formulas.

458
00:25:04,180 --> 00:25:08,540
How in the world do you actually
solve them in a mechanical way?

459
00:25:08,540 --> 00:25:12,020
And a lot of the technology
that we're talking about today,

460
00:25:12,020 --> 00:25:14,870
and the reason why we're
actually talking about it

461
00:25:14,870 --> 00:25:20,280
as a practical tool, have to
do with tremendous advances

462
00:25:20,280 --> 00:25:23,170
in solvers for
logical questions.

463
00:25:23,170 --> 00:25:25,390
And in particular, there
is a very important class

464
00:25:25,390 --> 00:25:31,300
of solvers called satisfiability
modulo theories solvers,

465
00:25:31,300 --> 00:25:33,730
often abbreviated as SMT.

466
00:25:33,730 --> 00:25:35,230
But a lot of people
in the community

467
00:25:35,230 --> 00:25:39,260
would argue that the name is
not a particularly good name,

468
00:25:39,260 --> 00:25:41,820
but it's the one that everybody
uses and it has stuck.

469
00:25:41,820 --> 00:25:45,220
What you need to know
about these SMT solvers

470
00:25:45,220 --> 00:25:50,840
is that an SMT solver is
an algorithm essentially

471
00:25:50,840 --> 00:25:54,670
that given a logical
formula will give you

472
00:25:54,670 --> 00:25:56,080
one of two things.

473
00:25:56,080 --> 00:25:58,430
It will give you either
a satisfying assignment

474
00:25:58,430 --> 00:26:01,830
to the formula, or
it will tell you

475
00:26:01,830 --> 00:26:04,990
that the formula
is unsatisfiable.

476
00:26:04,990 --> 00:26:09,490
And that there is no
possible assignment

477
00:26:09,490 --> 00:26:11,590
to the variables in
that formula that

478
00:26:11,590 --> 00:26:14,790
will satisfy these
constraints that you defined.

479
00:26:14,790 --> 00:26:18,820
Now in practice, if this
sounds a little bit scary

480
00:26:18,820 --> 00:26:21,730
and a little bit like magic,
it is a little bit scary.

481
00:26:21,730 --> 00:26:25,350
A lot of the problems that
these SMT solvers have to solve

482
00:26:25,350 --> 00:26:28,310
are NP-complete
in the best case.

483
00:26:28,310 --> 00:26:30,570
All right? The nice
ones are NP-complete.

484
00:26:30,570 --> 00:26:34,310
The hard ones can get
much hairier than that.

485
00:26:34,310 --> 00:26:41,040
So how can we have a system that
relies as its primary building

486
00:26:41,040 --> 00:26:46,950
block on solving NP-complete or
PSPACE-complete problems?

487
00:26:46,950 --> 00:26:51,020
And still have something
that works in practice?

488
00:26:51,020 --> 00:26:54,570
And part of the answer is that
for a lot of these solvers

489
00:26:54,570 --> 00:26:59,590
there is a third thing
that they can tell you,

490
00:26:59,590 --> 00:27:01,440
which is, I don't know.

491
00:27:09,630 --> 00:27:14,530
And so part of the
beauty of these solvers

492
00:27:14,530 --> 00:27:16,890
is that for practical
problems, even

493
00:27:16,890 --> 00:27:19,900
for very, very large and
complicated practical problems,

494
00:27:19,900 --> 00:27:22,660
they are still able to do
better than simply telling you,

495
00:27:22,660 --> 00:27:23,410
I don't know.

496
00:27:23,410 --> 00:27:26,430
They are still able
to give you either

497
00:27:26,430 --> 00:27:30,420
a guarantee that this
set of constraints

498
00:27:30,420 --> 00:27:34,090
is unsatisfiable or an actual
satisfying assignment that

499
00:27:34,090 --> 00:27:37,300
tells you exactly
what the answer is.
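The three possible answers can be sketched with a toy solver. This is only an illustration, assuming a brute-force search with a hypothetical `budget` parameter; real SMT solvers do not enumerate assignments this way, and the name `bounded_solve` is made up for this sketch.

```python
from itertools import product


def bounded_solve(formula, num_vars, budget):
    """Try Boolean assignments until one satisfies `formula` or the budget runs out.

    Returns ("sat", assignment), ("unsat", None), or ("unknown", None).
    """
    tried = 0
    for assignment in product([False, True], repeat=num_vars):
        if tried >= budget:
            # Gave up before covering the whole search space.
            return "unknown", None
        tried += 1
        if formula(assignment):
            return "sat", assignment
    # Every assignment was checked and none worked.
    return "unsat", None
```

With a budget that covers all assignments the sketch answers sat or unsat; with a smaller budget it may have to answer unknown, which is the escape hatch described above.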

500
00:27:40,770 --> 00:27:41,750
Yes?

501
00:27:41,750 --> 00:27:48,451
AUDIENCE: [INAUDIBLE]
For example, [INAUDIBLE]

502
00:27:48,451 --> 00:27:50,325
specification I don't
think you said anything

503
00:27:50,325 --> 00:27:54,000
about how many bits are used to
store an integer. [INAUDIBLE]

504
00:28:00,907 --> 00:28:02,990
ARMANDO SOLAR-LEZAMA:
That's a very good question.

505
00:28:02,990 --> 00:28:05,430
And that really has
to do with how you

506
00:28:05,430 --> 00:28:07,810
define your constraints, right?

507
00:28:07,810 --> 00:28:18,940
So if you look at our simple
example from the beginning,

508
00:28:18,940 --> 00:28:25,140
in this case, we assume that
these were the integers as

509
00:28:25,140 --> 00:28:26,640
learned in elementary school.

510
00:28:26,640 --> 00:28:34,420
And that we completely decided
to ignore overflow errors.

511
00:28:34,420 --> 00:28:35,920
If you care about
overflow errors,

512
00:28:35,920 --> 00:28:39,510
if overflow errors are actually
essential to the kind of bugs

513
00:28:39,510 --> 00:28:42,864
you're trying to find, this
would not be a good way

514
00:28:42,864 --> 00:28:43,780
to set up the problem.

515
00:28:43,780 --> 00:28:49,780
What you need is to represent
these not as integers,

516
00:28:49,780 --> 00:28:50,910
but as bit-vectors.

517
00:28:50,910 --> 00:28:52,940
And the moment you represent
them as bit vectors

518
00:28:52,940 --> 00:28:55,470
you have to have a
bit width in mind.
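A small sketch of why the choice matters: the property "x + 1 > x" holds for every mathematical integer, but not for fixed-width unsigned values. The helper names `bv_add` and `counterexamples_to_increment` are hypothetical, and exhaustive enumeration is only practical here because the widths are tiny.

```python
def bv_add(a, b, width):
    """Add two unsigned values and wrap the result to `width` bits."""
    return (a + b) & ((1 << width) - 1)


def counterexamples_to_increment(width):
    """Return every unsigned `width`-bit x for which x + 1 > x is false."""
    return [x for x in range(1 << width) if not bv_add(x, 1, width) > x]
```

For an 8-bit width the only counterexample is the largest value, where the addition wraps around to zero. Over unbounded integers there is no counterexample at all, so a bug that depends on overflow would be invisible in that model.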

519
00:28:55,470 --> 00:29:01,530
And this goes back to
what this modulo theory

520
00:29:01,530 --> 00:29:03,670
aspect in the solver means.

521
00:29:03,670 --> 00:29:05,430
What this modulo
theory aspect means

522
00:29:05,430 --> 00:29:08,700
is that the solver is
actually extensible

523
00:29:08,700 --> 00:29:10,140
with different theories.

524
00:29:10,140 --> 00:29:15,960
The most popular theories are
the theory of bit-vectors, which

525
00:29:15,960 --> 00:29:21,450
are fixed length bit-vectors.

526
00:29:21,450 --> 00:29:24,150
That means that if you're
interpreting your formulas

527
00:29:24,150 --> 00:29:26,664
in this theory of fixed
length bit-vectors

528
00:29:26,664 --> 00:29:28,580
you have to fix the
length of the bit-vectors.

529
00:29:28,580 --> 00:29:31,380
And you have to
explicitly specify

530
00:29:31,380 --> 00:29:36,760
that these are going to be
32-bit bit-vectors, or 8 bit

531
00:29:36,760 --> 00:29:39,326
bit-vectors, or
64-bit bit-vectors.

532
00:29:39,326 --> 00:29:42,284
AUDIENCE: So if you wanted
to make the bit symbolic

533
00:29:42,284 --> 00:29:46,730
[INAUDIBLE], like this
is an x bit, is that--

534
00:29:46,730 --> 00:29:49,140
ARMANDO SOLAR-LEZAMA: So
there's another theory which

535
00:29:49,140 --> 00:29:53,690
is called the theory of arrays.

536
00:29:53,690 --> 00:29:55,740
And we'll talk a little
bit more about it,

537
00:29:55,740 --> 00:29:59,150
where unlike the
bit vector theory,

538
00:29:59,150 --> 00:30:02,410
which is designed to be
for fixed length things

539
00:30:02,410 --> 00:30:07,360
the theory of arrays is meant
to be for collections where

540
00:30:07,360 --> 00:30:10,110
you don't actually
know the size a priori.

541
00:30:10,110 --> 00:30:13,040
Now in practice
nobody uses the theory

542
00:30:13,040 --> 00:30:16,010
of arrays to model
integers, for example,

543
00:30:16,010 --> 00:30:18,100
because it's too expensive.

544
00:30:18,100 --> 00:30:21,250
It becomes way more
expensive to reason about

545
00:30:21,250 --> 00:30:23,070
when you don't know
what the bound is.

546
00:30:23,070 --> 00:30:25,840
So generally people
use the theory of fixed-length

547
00:30:25,840 --> 00:30:30,910
bit-vectors when reasoning
about integers or characters

548
00:30:30,910 --> 00:30:33,050
even.

549
00:30:33,050 --> 00:30:41,760
Another very common theory is
the theory of actual integer

550
00:30:41,760 --> 00:30:44,520
arithmetic, and in particular
linear integer arithmetic.

551
00:30:44,520 --> 00:30:47,200
This is a theory that
people like a lot because it

552
00:30:47,200 --> 00:30:50,650
can be reasoned about
very, very efficiently,

553
00:30:50,650 --> 00:30:52,930
but it's not
particularly good when

554
00:30:52,930 --> 00:30:55,960
you're reasoning about programs,
because in general you really

555
00:30:55,960 --> 00:30:59,040
do care about overflow issues.

556
00:30:59,040 --> 00:31:03,680
But it's actually very widely
used for many, many things.

557
00:31:03,680 --> 00:31:07,240
The other theory that you're
likely to see people using

558
00:31:07,240 --> 00:31:13,535
is the theory of
uninterpreted functions.

559
00:31:19,240 --> 00:31:22,060
So what does it mean, the theory
of an uninterpreted function?

560
00:31:22,060 --> 00:31:27,200
It means that you have a formula
where somewhere in your formula

561
00:31:27,200 --> 00:31:29,350
you know that you're
calling a function,

562
00:31:29,350 --> 00:31:31,270
but you know nothing
about that function

563
00:31:31,270 --> 00:31:39,200
other than the fact that it is
a function, that if you give it

564
00:31:39,200 --> 00:31:42,870
the same inputs you get
the same outputs in return.

565
00:31:42,870 --> 00:31:45,190
And it turns out this is
very, very useful sometimes

566
00:31:45,190 --> 00:31:47,310
when trying to
reason about things

567
00:31:47,310 --> 00:31:53,190
like if you have floating point
code, modeling sines, cosines,

568
00:31:53,190 --> 00:31:56,025
square roots can be very
messy and expensive,

569
00:31:56,025 --> 00:31:57,650
but you can say,
look, I don't actually

570
00:31:57,650 --> 00:32:01,030
care about what the
sine function does.

571
00:32:01,030 --> 00:32:03,200
I don't care about
what its output is.

572
00:32:03,200 --> 00:32:05,600
All I know is that if I
call the sine function

573
00:32:05,600 --> 00:32:07,390
in many different
places with the same input

574
00:32:07,390 --> 00:32:08,830
I will get the same output.

575
00:32:08,830 --> 00:32:14,100
And that's enough for me
to reason about my code.
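One way to picture an uninterpreted function is as a table that hands out a fresh, otherwise unconstrained symbol for each distinct argument and always returns the same symbol for the same argument. The class below is a hypothetical sketch of that one property only; it is not how any particular solver implements the theory, and two different symbols are not claimed to be unequal, merely unknown.

```python
import itertools


class UninterpretedFunction:
    """Models only functional consistency: same inputs give the same output."""

    def __init__(self, name):
        self.name = name
        self._results = {}                # arguments -> symbolic result
        self._fresh = itertools.count()   # supply of fresh symbol indices

    def __call__(self, *args):
        if args not in self._results:
            # First time we see these arguments: invent a new symbol.
            self._results[args] = f"{self.name}_{next(self._fresh)}"
        return self._results[args]
```

Calling a hypothetical `sine = UninterpretedFunction("sine")` twice on the same symbolic argument yields the same symbol both times, which is the only fact the analysis gets to use about it.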

576
00:32:14,100 --> 00:32:17,350
And so the most
common ones you will

577
00:32:17,350 --> 00:32:21,140
see when analyzing
real systems are

578
00:32:21,140 --> 00:32:24,510
bit-vectors to deal
with integers, and longs,

579
00:32:24,510 --> 00:32:26,110
and pointers.

580
00:32:26,110 --> 00:32:30,990
Actually, pointers are often
represented with integers

581
00:32:30,990 --> 00:32:35,760
because you're
generally not going

582
00:32:35,760 --> 00:32:40,500
to be doing complicated
bit twiddling on pointers.

583
00:32:40,500 --> 00:32:44,650
Sometimes you will and then
you can't use integers anymore.

584
00:32:44,650 --> 00:32:46,210
So OK.

585
00:32:46,210 --> 00:32:48,470
So that's all well and good.

586
00:32:48,470 --> 00:32:52,650
That's what an SMT
solver can do for you.

587
00:32:52,650 --> 00:32:54,900
How does it actually work?

588
00:32:54,900 --> 00:32:56,870
What's inside it
that makes it work?

589
00:32:56,870 --> 00:33:01,820
And SMT solvers actually
rely on our ability

590
00:33:01,820 --> 00:33:04,690
to solve SAT problems,
on our ability

591
00:33:04,690 --> 00:33:10,350
to take problems involving
just purely Boolean constraints

592
00:33:10,350 --> 00:33:13,650
and Boolean variables,
and telling us

593
00:33:13,650 --> 00:33:16,680
whether there is an assignment
to these Boolean variables

594
00:33:16,680 --> 00:33:20,370
that is satisfiable or not.

595
00:33:20,370 --> 00:33:24,400
And this is the kind of thing
that for many, many years

596
00:33:24,400 --> 00:33:27,416
people in undergrad have been
taught that actually this

597
00:33:27,416 --> 00:33:28,690
is an NP-complete problem.

598
00:33:28,690 --> 00:33:30,680
The moment something
reduces to SAT

599
00:33:30,680 --> 00:33:33,220
you know you shouldn't
do it, but it turns out

600
00:33:33,220 --> 00:33:35,960
that we actually have
some very, very good SAT

601
00:33:35,960 --> 00:33:36,720
solvers out there.

602
00:33:36,720 --> 00:33:42,060
Probably most of you even
built one as part of 6.005.

603
00:33:42,060 --> 00:33:43,940
Am I right?

604
00:33:43,940 --> 00:33:46,200
Or some of you did.

605
00:33:46,200 --> 00:33:50,780
So I'll tell you the basic idea
behind how SAT solvers work.

606
00:33:50,780 --> 00:33:56,140
And the basic idea is that
you take all your constraints

607
00:33:56,140 --> 00:34:00,440
on your Boolean variables and
you put them into a database.

608
00:34:00,440 --> 00:34:03,450
And what is a constraint?

609
00:34:03,450 --> 00:34:06,950
Is this too small or can
people in the back read this?

610
00:34:09,662 --> 00:34:10,570
AUDIENCE: Too small.

611
00:34:10,570 --> 00:34:11,179
ARMANDO SOLAR-LEZAMA: Too small?

612
00:34:11,179 --> 00:34:11,679
OK.

613
00:34:15,900 --> 00:34:19,469
Let's see if we can
make this bigger.

614
00:34:42,040 --> 00:34:45,331
Is this a little bit better?

615
00:34:45,331 --> 00:34:46,305
AUDIENCE: [INAUDIBLE].

616
00:34:46,305 --> 00:34:47,770
ARMANDO SOLAR-LEZAMA: OK.

617
00:34:47,770 --> 00:34:51,000
Well, here's what I'll do.

618
00:34:51,000 --> 00:34:54,030
I will annotate and I
will narrate it as I go.

619
00:34:54,030 --> 00:34:55,810
And I'll post the slides later.

620
00:34:55,810 --> 00:34:57,660
So people can see what it says.

621
00:34:57,660 --> 00:35:01,650
So what we have
here in a SAT problem

622
00:35:01,650 --> 00:35:06,770
is that we have all these
variables that represent

623
00:35:06,770 --> 00:35:08,460
Boolean unknowns, right?

624
00:35:08,460 --> 00:35:11,620
We want to know is
it possible for x

625
00:35:11,620 --> 00:35:15,170
to be true, and y to be true,
and z to be true at the same time,

626
00:35:15,170 --> 00:35:15,820
for example.

627
00:35:15,820 --> 00:35:16,320
Right?

628
00:35:16,320 --> 00:35:18,330
And these are our unknowns.

629
00:35:18,330 --> 00:35:22,750
And all the constraints are
in conjunctive normal form.

630
00:35:22,750 --> 00:35:24,590
What that means is
all our constraints

631
00:35:24,590 --> 00:35:33,920
are of the form either x1
is true, or x2 is true,

632
00:35:33,920 --> 00:35:37,951
or x3 is true, for example.

633
00:35:37,951 --> 00:35:38,450
Right?

634
00:35:38,450 --> 00:35:42,200
So what we have is we have all
our constraints in this form

635
00:35:42,200 --> 00:35:45,130
and some of them might say,
well, either x1 is true,

636
00:35:45,130 --> 00:35:48,970
or x2 is false, or x3 is false.

637
00:35:48,970 --> 00:35:49,470
Right?

638
00:35:49,470 --> 00:35:50,880
So we have constraints.

639
00:35:50,880 --> 00:35:53,500
All our constraints
are of this form.

640
00:35:53,500 --> 00:35:55,780
And you probably remember
from discrete math

641
00:35:55,780 --> 00:35:59,700
that any Boolean formula
can be represented

642
00:35:59,700 --> 00:36:01,264
in conjunctive normal form.

643
00:36:01,264 --> 00:36:03,680
And it has the added benefit
that it's actually very, very

644
00:36:03,680 --> 00:36:08,370
easy to translate from arbitrary
representations of a formula

645
00:36:08,370 --> 00:36:11,970
to these conjunctive normal form
formula, which means whatever

646
00:36:11,970 --> 00:36:15,180
representation you're using
to represent Boolean formulas,

647
00:36:15,180 --> 00:36:19,130
you can very easily
convert it to this format.
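A minimal sketch of such a clause database, assuming the common convention of writing a variable as a positive integer and its negation as the negative integer. The names `literal_is_true`, `satisfies`, and `cnf` are illustrative only.

```python
def literal_is_true(literal, assignment):
    """A positive literal k means x_k is true; a negative one means x_k is false."""
    value = assignment[abs(literal)]
    return value if literal > 0 else not value


def satisfies(cnf, assignment):
    """Every clause must contain at least one true literal."""
    return all(
        any(literal_is_true(lit, assignment) for lit in clause)
        for clause in cnf
    )


# (x1 or x2 or x3) and (x1 or not x2 or not x3), the two clauses from above.
cnf = [[1, 2, 3], [1, -2, -3]]
```

Setting x1 to true satisfies both clauses regardless of the others, while setting x1 false with x2 and x3 both true violates the second clause.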

648
00:36:19,130 --> 00:36:22,730
So what we have is
we have a database

649
00:36:22,730 --> 00:36:25,230
with lots of constraints
of this form.

650
00:36:25,230 --> 00:36:27,380
And what the SAT solver
is going to do

651
00:36:27,380 --> 00:36:29,540
is going to pick one of
these variables at random.

652
00:36:29,540 --> 00:36:31,950
Let's say it's going to pick x1.

653
00:36:31,950 --> 00:36:36,180
And it's going to say, why
don't we set x1 to true?

654
00:36:36,180 --> 00:36:38,120
I don't know anything
about this problem.

655
00:36:38,120 --> 00:36:41,130
Might as well try
setting it to true.

656
00:36:41,130 --> 00:36:44,050
And then what will happen is
you'll have some constraints

657
00:36:44,050 --> 00:36:48,390
that mention x1 and let's say
that you have a constraint that

658
00:36:48,390 --> 00:36:53,160
says either x1 is
false or x7 is true.

659
00:36:53,160 --> 00:36:53,660
Right?

660
00:36:53,660 --> 00:36:56,700
So if you know that
x1 is true and you

661
00:36:56,700 --> 00:37:00,430
know that either x1 is
false or x7 is true,

662
00:37:00,430 --> 00:37:04,105
what do you know about x7?

663
00:37:04,105 --> 00:37:05,145
AUDIENCE: [INAUDIBLE].

664
00:37:05,145 --> 00:37:06,270
ARMANDO SOLAR-LEZAMA: Yeah.

665
00:37:06,270 --> 00:37:06,990
It has to be true.

666
00:37:06,990 --> 00:37:07,489
Right?

667
00:37:07,489 --> 00:37:09,000
Because otherwise
this constraint

668
00:37:09,000 --> 00:37:10,660
would not be satisfied.

669
00:37:10,660 --> 00:37:16,420
And so now you've propagated
this assignment from x1 to x7.

670
00:37:16,420 --> 00:37:19,370
And let's say now you pick
some other random variable.

671
00:37:19,370 --> 00:37:22,090
You say, well, what about x5?

672
00:37:22,090 --> 00:37:24,140
Why don't we try x5 being true?

673
00:37:24,140 --> 00:37:24,640
Right?

674
00:37:24,640 --> 00:37:27,600
And now let's say that you
have a constraint that says,

675
00:37:27,600 --> 00:37:41,850
well, either x7 is false, or
x6 is true, or x5 is false.

676
00:37:41,850 --> 00:37:42,350
Right?

677
00:37:42,350 --> 00:37:48,500
So I have x5 being true
and I have x7 being true.

678
00:37:48,500 --> 00:37:52,640
So that means x6
now has to be true.

679
00:37:52,640 --> 00:37:53,140
Right?

680
00:37:53,140 --> 00:37:56,760
Because otherwise this
constraint would be violated.

681
00:37:56,760 --> 00:37:59,520
And so from that the
system infers, OK.

682
00:37:59,520 --> 00:38:01,500
So x6 has to be true.
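The propagation step just described can be sketched as follows, assuming clauses are lists of signed integers (negative means negated) and an assignment is a dict from variable number to truth value. The function name `unit_propagate` is an assumption of this sketch, and real solvers use far more efficient data structures.

```python
def unit_propagate(cnf, assignment):
    """Repeatedly assign variables forced by clauses with one literal left.

    Returns the extended assignment, or None if some clause is falsified.
    """
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in cnf:
            unassigned = []
            satisfied = False
            for lit in clause:
                var = abs(lit)
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == (lit > 0):
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return None  # every literal is false: contradiction
            if len(unassigned) == 1:
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0  # forced value
                changed = True
    return assignment


# (not x1 or x7) and (not x7 or x6 or not x5), as in the example above.
clauses = [[-1, 7], [-7, 6, -5]]
```

Starting from x1 true, the sketch infers x7; adding the guess x5 true then also forces x6, matching the walkthrough.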

683
00:38:01,500 --> 00:38:04,680
And it keeps at this
process essentially

684
00:38:04,680 --> 00:38:06,820
trying out assignments.

685
00:38:06,820 --> 00:38:09,290
And then looking at all
the available clauses,

686
00:38:09,290 --> 00:38:10,750
and looking at,
hey, are there

687
00:38:10,750 --> 00:38:14,080
other things that are
implied by the assignments

688
00:38:14,080 --> 00:38:16,090
that I have so far?

689
00:38:16,090 --> 00:38:20,190
And following those implications
until one of two things

690
00:38:20,190 --> 00:38:20,690
happens.

691
00:38:20,690 --> 00:38:23,480
Either you keep following
implications and trying

692
00:38:23,480 --> 00:38:26,490
random things and eventually
you have set a value

693
00:38:26,490 --> 00:38:28,460
to every single
variable without ever

694
00:38:28,460 --> 00:38:30,550
running into a contradiction.

695
00:38:30,550 --> 00:38:32,580
And then you're done.

696
00:38:32,580 --> 00:38:33,080
Right?

697
00:38:33,080 --> 00:38:37,240
You found a satisfying
assignment, or what can happen

698
00:38:37,240 --> 00:38:38,580
is you run into a contradiction.

699
00:38:38,580 --> 00:38:45,690
You run into a place where there
was a clause that forced x4

700
00:38:45,690 --> 00:38:49,900
to be true, except there was
another clause that forced x4

701
00:38:49,900 --> 00:38:50,950
to be false.

702
00:38:50,950 --> 00:38:55,080
And if there's one rule of
Boolean algebra that everybody

703
00:38:55,080 --> 00:38:58,090
should know, is that you cannot
have a variable be true and be

704
00:38:58,090 --> 00:38:59,860
false at the same time.

705
00:38:59,860 --> 00:39:00,360
Right?

706
00:39:00,360 --> 00:39:01,859
And so what that
tells you is you've

707
00:39:01,859 --> 00:39:03,690
run into a contradiction.

708
00:39:03,690 --> 00:39:05,840
You clearly did
something wrong in one

709
00:39:05,840 --> 00:39:08,200
of these random assignments
that you were trying.

710
00:39:08,200 --> 00:39:10,680
So now let's analyze
this contradiction.

711
00:39:10,680 --> 00:39:12,820
Let's figure out what
were the assignments that

712
00:39:12,820 --> 00:39:16,790
led to this contradiction.

713
00:39:16,790 --> 00:39:20,690
And based on the assignments
that led to that contradiction,

714
00:39:20,690 --> 00:39:25,010
let's come up with a
new conflict clause that

715
00:39:25,010 --> 00:39:27,560
summarizes that contradiction.

716
00:39:27,560 --> 00:39:31,170
So in this case,
what would happen

717
00:39:31,170 --> 00:39:38,180
is that you have x1 being
false, and x5 being false.

718
00:39:38,180 --> 00:39:41,130
And x9 being true, right?

719
00:39:41,130 --> 00:39:44,530
So essentially what this is
saying is that based on what I

720
00:39:44,530 --> 00:39:46,840
learned from these random
assignments I discovered that

721
00:39:46,840 --> 00:39:49,560
one of these things
has to be true,

722
00:39:49,560 --> 00:39:53,440
that it cannot be the case that
x1 is true, and x5 is true,

723
00:39:53,440 --> 00:39:55,990
and x9 is false.

724
00:39:55,990 --> 00:39:57,000
That cannot happen.

725
00:39:57,000 --> 00:40:00,240
And I know that cannot happen
because when I tried that

726
00:40:00,240 --> 00:40:00,965
things blew up.

727
00:40:00,965 --> 00:40:03,050
I ended up with a contradiction.

728
00:40:03,050 --> 00:40:06,330
And so what the SAT solver is doing
is trying random assignments,

729
00:40:06,330 --> 00:40:08,030
propagating them through.

730
00:40:08,030 --> 00:40:09,630
When it runs into
contradictions it's

731
00:40:09,630 --> 00:40:12,600
analyzing the set
of implications

732
00:40:12,600 --> 00:40:14,130
that led to that contradiction.

733
00:40:14,130 --> 00:40:17,690
And summarising that in
a new constraint that

734
00:40:17,690 --> 00:40:19,650
will make sure
that it never runs

735
00:40:19,650 --> 00:40:21,980
into this contradiction
again, that it never

736
00:40:21,980 --> 00:40:25,574
runs into this
particular problem again.
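A simplified sketch of learning from a contradiction: negate the guesses that led to it and add that as a new clause. This assumes the crudest possible analysis, which blames every guess; production solvers analyze the chain of implications to learn smaller clauses. The name `conflict_clause` and the signed-integer literal convention are assumptions of this sketch.

```python
def conflict_clause(decisions):
    """Build a clause that forbids repeating exactly these guesses.

    `decisions` maps variable number -> guessed truth value. The clause
    contains the opposite literal for each guess, so at least one guess
    must change the next time around.
    """
    return [-var if value else var for var, value in decisions.items()]


def clause_holds(clause, assignment):
    """True if at least one literal of the clause is true under `assignment`."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)
```

For the guesses x1 true, x5 true, x9 false, the learned clause says "x1 is false, or x5 is false, or x9 is true", which is violated by exactly that combination and by nothing else.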

737
00:40:25,574 --> 00:40:26,240
Other questions?

738
00:40:34,960 --> 00:40:35,460
OK.

739
00:40:35,460 --> 00:40:36,730
So so far so good.

740
00:40:36,730 --> 00:40:40,040
So we can really
think of the SAT solver

741
00:40:40,040 --> 00:40:43,830
as just a black box that
given a Boolean constraint

742
00:40:43,830 --> 00:40:47,380
it can either say, no,
this Boolean constraint is

743
00:40:47,380 --> 00:40:51,130
unsatisfiable, or it
can say, yeah, here's

744
00:40:51,130 --> 00:40:53,270
a satisfying assignment to
that Boolean constraint.

745
00:40:53,270 --> 00:40:57,137
So SMT solvers are built
on top of SAT solvers.

746
00:40:57,137 --> 00:40:58,720
And what they're
able to do is they're

747
00:40:58,720 --> 00:41:01,670
able to combine the
power of the SAT solver

748
00:41:01,670 --> 00:41:08,130
to solve these NP-complete
SAT problems with domain

749
00:41:08,130 --> 00:41:12,190
specific reasoning to reason
about the different theories

750
00:41:12,190 --> 00:41:13,000
that are supported.

751
00:41:13,000 --> 00:41:15,460
So to give you an
idea of how it works,

752
00:41:15,460 --> 00:41:18,226
and this is going to
be a fairly high level,

753
00:41:18,226 --> 00:41:19,600
but to give you
an idea of how it

754
00:41:19,600 --> 00:41:22,000
works let's say that you have
a formula like this, right?

755
00:41:22,000 --> 00:41:25,635
So you say x is greater
than 5 and y is less than 5.

756
00:41:28,890 --> 00:41:33,791
And either y is greater than
x or y is greater than 2.

757
00:41:33,791 --> 00:41:34,290
Right?

758
00:41:34,290 --> 00:41:37,310
So is that satisfiable?

759
00:41:37,310 --> 00:41:39,490
Can we find a satisfying
assignment for that?

760
00:41:39,490 --> 00:41:46,940
So what an SMT solver can
do is separate out the part

761
00:41:46,940 --> 00:41:50,950
of this formula that requires
domain reasoning, that

762
00:41:50,950 --> 00:41:52,930
requires reasoning in
the theory, in this case,

763
00:41:52,930 --> 00:41:54,150
of integers,

764
00:41:54,150 --> 00:41:55,730
from the part of
this formula that

765
00:41:55,730 --> 00:41:57,770
is just the Boolean structure.

766
00:41:57,770 --> 00:42:01,616
So if you separate the
Boolean structure here,

767
00:42:01,616 --> 00:42:02,990
essentially what
you're saying is

768
00:42:02,990 --> 00:42:09,034
that there's some formula,
F1 and some formula F2,

769
00:42:09,034 --> 00:42:11,800
and either F3 or F4.

770
00:42:11,800 --> 00:42:12,300
Right?

771
00:42:12,300 --> 00:42:15,740
And now this is a purely
Boolean problem, right?

772
00:42:15,740 --> 00:42:18,060
It's just a problem of
can I find a satisfying

773
00:42:18,060 --> 00:42:22,110
assignment for that?

774
00:42:22,110 --> 00:42:24,280
Is there a satisfying
assignment for that?

775
00:42:24,280 --> 00:42:26,570
And, again, this is
just a Boolean formula.

776
00:42:26,570 --> 00:42:30,385
Goes to a SAT solver and the
SAT solver can say, yeah.

777
00:42:33,820 --> 00:42:36,010
I can find a satisfying
assignment for this.

778
00:42:36,010 --> 00:42:39,220
And I can find a
satisfying assignment

779
00:42:39,220 --> 00:42:43,740
by making this true, and
this true, and this true.

780
00:42:43,740 --> 00:42:44,240
Right?

781
00:42:44,240 --> 00:42:48,010
It's a satisfying assignment
for the Boolean formula.

782
00:42:48,010 --> 00:42:52,670
So now we have a question
that we can go and ask

783
00:42:52,670 --> 00:42:54,160
the domain specific solver.

784
00:42:54,160 --> 00:42:59,700
In this case just a
linear arithmetic solver.

785
00:42:59,700 --> 00:43:01,130
So we can go to
the linear solver

786
00:43:01,130 --> 00:43:04,050
and say, hey, so the
SAT solver claims

787
00:43:04,050 --> 00:43:06,990
that this is a reasonable
assignment, that if I

788
00:43:06,990 --> 00:43:08,930
can make that
assignment work, then

789
00:43:08,930 --> 00:43:10,890
my formula will be satisfied.

790
00:43:10,890 --> 00:43:17,160
So I can go and say, well F1 was
actually this, and F2 was this,

791
00:43:17,160 --> 00:43:18,740
and F3 was this.

792
00:43:18,740 --> 00:43:22,290
So I can ask a theory solver, is
it possible to get an x and a y

793
00:43:22,290 --> 00:43:26,030
such that x is greater
than 5, y is less than 5,

794
00:43:26,030 --> 00:43:28,200
and y is greater than x?

795
00:43:28,200 --> 00:43:32,410
Right, so now this is a question
purely about linear arithmetic.

796
00:43:32,410 --> 00:43:36,484
There's no Boolean
logic involved.

797
00:43:36,484 --> 00:43:37,400
And what's the answer?

798
00:43:39,960 --> 00:43:40,460
No.

799
00:43:40,460 --> 00:43:40,960
Right?

800
00:43:40,960 --> 00:43:44,210
And there are
traditional methods

801
00:43:44,210 --> 00:43:47,960
to solve these kinds
of linear problems.

802
00:43:47,960 --> 00:43:50,730
You could use the simplex
method, for example,

803
00:43:50,730 --> 00:43:53,570
to solve systems of
linear inequalities.

804
00:43:53,570 --> 00:43:55,070
There's lots of
methods that you can

805
00:43:55,070 --> 00:43:57,530
use to solve systems
of linear inequalities.

806
00:43:57,530 --> 00:44:00,770
The point is the theory solver
knows about all of those.

807
00:44:00,770 --> 00:44:03,630
And the theory
solver can say, no.

808
00:44:03,630 --> 00:44:04,670
This will not work.

809
00:44:04,670 --> 00:44:07,640
This is an assignment
that will not work.

810
00:44:07,640 --> 00:44:13,510
And so the theory solver can
now go back to the SAT solver

811
00:44:13,510 --> 00:44:15,740
and not just tell the SAT
solver, hey, that thing

812
00:44:15,740 --> 00:44:18,300
that you did, that didn't work.

813
00:44:18,300 --> 00:44:20,920
But it can also give
more of an explanation.

814
00:44:20,920 --> 00:44:24,370
So in this case, what you can
conclude from the fact that

815
00:44:24,370 --> 00:44:26,880
this didn't work is that
actually in addition

816
00:44:26,880 --> 00:44:31,360
to satisfying this formula you
also want to satisfy the fact

817
00:44:31,360 --> 00:44:40,500
that I cannot have F1,
and F2, and F3, right?

818
00:44:40,500 --> 00:44:42,810
My theory solver has
told me that these three

819
00:44:42,810 --> 00:44:44,460
things are mutually exclusive.

820
00:44:44,460 --> 00:44:47,890
I cannot satisfy all
three of them together.

821
00:44:47,890 --> 00:44:49,660
And so now that's a
piece of information

822
00:44:49,660 --> 00:44:52,230
that I can go back
to the SAT solver

823
00:44:52,230 --> 00:44:54,320
and ask the SAT
solver, hey, can you

824
00:44:54,320 --> 00:44:57,000
give me a solution
that satisfies

825
00:44:57,000 --> 00:44:59,440
not only the constraint that
you had in the beginning,

826
00:44:59,440 --> 00:45:03,410
but also this new
constraint that the theory

827
00:45:03,410 --> 00:45:05,091
solver discovered?

828
00:45:05,091 --> 00:45:05,590
Right?

829
00:45:05,590 --> 00:45:09,587
So now is there some other
assignment that satisfies now

830
00:45:09,587 --> 00:45:10,670
both of these constraints?

831
00:45:18,950 --> 00:45:21,440
AUDIENCE: [INAUDIBLE].

832
00:45:21,440 --> 00:45:23,070
ARMANDO SOLAR-LEZAMA: Yeah.

833
00:45:23,070 --> 00:45:25,870
So there's an assignment
where this becomes false.

834
00:45:25,870 --> 00:45:27,415
And this becomes true.

835
00:45:27,415 --> 00:45:29,040
And that's an assignment
that satisfies

836
00:45:29,040 --> 00:45:30,160
the constraint on the top.

837
00:45:30,160 --> 00:45:32,250
It satisfies the
constraint on the bottom.

838
00:45:32,250 --> 00:45:34,480
And so once again
that's an assignment

839
00:45:34,480 --> 00:45:37,856
that leads to a new constraint.

840
00:45:37,856 --> 00:45:39,230
So this constraint
now goes away.

841
00:45:39,230 --> 00:45:40,900
We don't care about it any more.

842
00:45:40,900 --> 00:45:44,790
We have a new constraint that we
can ask our theory solver, hey,

843
00:45:44,790 --> 00:45:46,520
is this possible?

844
00:45:46,520 --> 00:45:48,870
And in this case the
theory solver says, yeah.

845
00:45:48,870 --> 00:45:50,310
That actually is possible.

846
00:45:50,310 --> 00:45:57,630
You can make y equal
3 and x equal 6.

847
00:45:57,630 --> 00:45:59,100
And it works.

848
00:45:59,100 --> 00:45:59,600
Right?

849
00:45:59,600 --> 00:46:02,820
And so now you
have an assignment

850
00:46:02,820 --> 00:46:07,150
that satisfies the
formula in the theory

851
00:46:07,150 --> 00:46:11,127
and that satisfies
the Boolean structure

852
00:46:11,127 --> 00:46:12,085
behind this assignment.

853
00:46:12,085 --> 00:46:15,240
And with that the system can
come back and tell you, yeah.

854
00:46:15,240 --> 00:46:19,660
Here's an assignment that
satisfies all your constraints.
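The back and forth on this example can be sketched end to end. Everything here is a stand-in under stated assumptions: `sat_solve` is a brute-force enumeration rather than a real SAT solver, `theory_solve` searches a small box of integers (the `bound` parameter is hypothetical) rather than using simplex, and the learned clause blocks the whole Boolean assignment rather than the smaller explanation a real theory solver would return.

```python
from itertools import product

# Atoms of the formula F1 and F2 and (F3 or F4).
ATOMS = {
    1: lambda x, y: x > 5,   # F1
    2: lambda x, y: y < 5,   # F2
    3: lambda x, y: y > x,   # F3
    4: lambda x, y: y > 2,   # F4
}


def sat_solve(cnf, num_vars):
    """Stand-in SAT solver: first assignment satisfying every clause, or None."""
    for values in product([True, False], repeat=num_vars):
        assignment = dict(zip(range(1, num_vars + 1), values))
        if all(any(assignment[abs(l)] == (l > 0) for l in c) for c in cnf):
            return assignment
    return None


def theory_solve(assignment, bound=10):
    """Stand-in theory solver: look for integers x, y matching the atom values."""
    for x, y in product(range(-bound, bound + 1), repeat=2):
        if all(ATOMS[v](x, y) == truth for v, truth in assignment.items()):
            return x, y
    return None


def smt_solve():
    cnf = [[1], [2], [3, 4]]  # Boolean structure only
    while True:
        assignment = sat_solve(cnf, 4)
        if assignment is None:
            return None  # no Boolean assignment left to try
        model = theory_solve(assignment)
        if model is not None:
            return model
        # The theory rejected this assignment: forbid it and ask again.
        cnf.append([-v if t else v for v, t in assignment.items()])
```

Under these assumptions the loop rejects the assignments in which y > x is required and then settles on one where y > x is false and y > 2 is true.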

855
00:46:19,660 --> 00:46:21,870
And so it's this
interaction back and forth

856
00:46:21,870 --> 00:46:25,660
between the theory solver
and the SAT solver.

857
00:46:25,660 --> 00:46:27,610
And really the ability
to be able to reason

858
00:46:27,610 --> 00:46:31,440
about very, very large and very
complicated Boolean formulas.

859
00:46:31,440 --> 00:46:36,990
That's what makes symbolic
execution possible.

860
00:46:36,990 --> 00:46:41,910
So now that we have that
the next question is,

861
00:46:41,910 --> 00:46:52,620
so how do we go from a
program to a constraint

862
00:46:52,620 --> 00:46:54,090
that we can give
to an SMT solver?

863
00:46:54,090 --> 00:46:54,630
Yes?

864
00:46:54,630 --> 00:46:56,000
AUDIENCE: Sorry for going back.

865
00:46:56,000 --> 00:46:57,125
ARMANDO SOLAR-LEZAMA: Sure.

866
00:46:57,125 --> 00:46:58,622
AUDIENCE: [INAUDIBLE]
previously.

867
00:46:58,622 --> 00:47:05,608
But could you run me again through the
whole issue of constructing

868
00:47:05,608 --> 00:47:07,105
the SMT statements?

869
00:47:07,105 --> 00:47:10,620
Is it NP-complete or
is it not? [INAUDIBLE].

870
00:47:10,620 --> 00:47:12,670
ARMANDO SOLAR-LEZAMA:
So the problems

871
00:47:12,670 --> 00:47:15,190
that the SMT
solvers are solving,

872
00:47:15,190 --> 00:47:20,180
those are NP-complete
problems in the best of cases.

873
00:47:20,180 --> 00:47:24,270
So SAT itself is the
canonical NP-complete problem,

874
00:47:24,270 --> 00:47:28,630
but a lot of solvers these
days even include support

875
00:47:28,630 --> 00:47:34,590
for some theories that
are outright undecidable.

876
00:47:34,590 --> 00:47:35,270
So--

877
00:47:35,270 --> 00:47:39,050
AUDIENCE: So how do you
approach that in your system?

878
00:47:39,050 --> 00:47:42,840
ARMANDO SOLAR-LEZAMA: Well, at
the end of the day what you get

879
00:47:42,840 --> 00:47:48,590
is you're going to create a
constraint from this program.

880
00:47:48,590 --> 00:47:51,890
You're going to give
it to the SMT solver.

881
00:47:51,890 --> 00:47:54,120
And the fact that these
are NP-complete problems,

882
00:47:54,120 --> 00:47:56,630
or the fact that they're
undecidable, what it means

883
00:47:56,630 --> 00:48:03,570
is that if you're lucky, you
will get an answer in seconds.

884
00:48:03,570 --> 00:48:06,770
And if you're not
lucky, then it might

885
00:48:06,770 --> 00:48:09,670
take longer than the age of
the universe for the thing

886
00:48:09,670 --> 00:48:11,009
to give you an answer.

887
00:48:11,009 --> 00:48:11,550
AUDIENCE: OK.

888
00:48:11,550 --> 00:48:14,841
How often do you run into
cases where your system just

889
00:48:14,841 --> 00:48:18,746
flat-lines and says, sorry, I
just can't figure this out yet?

890
00:48:18,746 --> 00:48:20,560
Has that ever happened
or is that just--

891
00:48:20,560 --> 00:48:21,070
ARMANDO SOLAR-LEZAMA: Yes.

892
00:48:21,070 --> 00:48:22,140
Yes, it does happen.

893
00:48:22,140 --> 00:48:24,666
And a big part of
the engineering

894
00:48:24,666 --> 00:48:27,340
of these kind of
tools is making sure

895
00:48:27,340 --> 00:48:30,420
that this happens as
infrequently as possible.

896
00:48:30,420 --> 00:48:35,890
And part of what makes
this work at all

897
00:48:35,890 --> 00:48:40,530
is that we're not solving
random SAT problems.

898
00:48:40,530 --> 00:48:44,450
We're not solving completely
random bit-vector problems.

899
00:48:44,450 --> 00:48:47,390
We're solving problems that
have a certain structure to them

900
00:48:47,390 --> 00:48:50,760
that a person was
able to look at it

901
00:48:50,760 --> 00:48:53,750
and at least have some confidence
that this worked, right?

902
00:48:53,750 --> 00:48:57,070
Build some argument in their
head for why this worked.

903
00:48:57,070 --> 00:49:00,260
And so what the solvers
are trying to do

904
00:49:00,260 --> 00:49:02,640
is essentially to exploit
that structure.

905
00:49:02,640 --> 00:49:05,260
And, for
example, the description

906
00:49:05,260 --> 00:49:08,194
that I gave you of what the
SAT solver is doing internally,

907
00:49:08,194 --> 00:49:10,110
that's taking advantage
of the fact that, yes.

908
00:49:10,110 --> 00:49:13,390
Your problem might have a
million Boolean variables,

909
00:49:13,390 --> 00:49:15,280
but actually most
of those variables

910
00:49:15,280 --> 00:49:18,430
are very tightly dependent
on the values of each other.

911
00:49:18,430 --> 00:49:20,990
So the number of degrees
of freedom in the problem

912
00:49:20,990 --> 00:49:23,730
is actually much smaller
than what the million

913
00:49:23,730 --> 00:49:24,848
variables would suggest.

914
00:49:24,848 --> 00:49:27,056
AUDIENCE: So what you're saying
is that this isn't an exam

915
00:49:27,056 --> 00:49:27,540
question.

916
00:49:27,540 --> 00:49:28,024
This is real life.

917
00:49:28,024 --> 00:49:29,476
And someone built this system.

918
00:49:29,476 --> 00:49:30,444
It was supposed to
work and make sense.

919
00:49:30,444 --> 00:49:32,138
So it's probably
not going to be one

920
00:49:32,138 --> 00:49:34,292
of those wildly bizarre
theoretical [INAUDIBLE].

921
00:49:34,292 --> 00:49:35,750
ARMANDO SOLAR-LEZAMA:
That's right.

922
00:49:38,780 --> 00:49:40,760
And in practice what
happens when

923
00:49:40,760 --> 00:49:43,020
you use this tool is that the
thing you always do

924
00:49:43,020 --> 00:49:45,180
is set timeouts.

925
00:49:45,180 --> 00:49:49,864
So generally, what happens
is because it's exponential,

926
00:49:49,864 --> 00:49:51,780
exponential doesn't mean
that you can't do it.

927
00:49:51,780 --> 00:49:54,820
Exponential just means
that there's a brick wall,

928
00:49:54,820 --> 00:49:57,700
that before that brick
wall things will work,

929
00:49:57,700 --> 00:49:59,620
and in fact, they
will work really fast.

930
00:49:59,620 --> 00:50:00,120
Right?

931
00:50:00,120 --> 00:50:01,660
The exponential
works in both ways.

932
00:50:01,660 --> 00:50:04,480
Yes, when you're
going out then things

933
00:50:04,480 --> 00:50:06,520
are growing very
quickly, but when

934
00:50:06,520 --> 00:50:09,980
you're going toward smaller
problems, or simpler problems

935
00:50:09,980 --> 00:50:12,490
things are also getting
faster very, very quickly.

936
00:50:12,490 --> 00:50:17,120
So in general what that means
is that lots of problems

937
00:50:17,120 --> 00:50:19,190
finish very, very quickly.

938
00:50:19,190 --> 00:50:21,350
And then some problems time out.

939
00:50:21,350 --> 00:50:24,630
And the key is to engineer
things in such a way

940
00:50:24,630 --> 00:50:28,990
that among the problems that
finish quickly are actually

941
00:50:28,990 --> 00:50:30,960
problems of practical use.

942
00:50:30,960 --> 00:50:33,410
Or problems that will
actually point you

943
00:50:33,410 --> 00:50:35,450
to security vulnerabilities
in your system,

944
00:50:35,450 --> 00:50:39,560
will point you to bugs,
will point you to a path

945
00:50:39,560 --> 00:50:41,390
that you maybe haven't
explored before,

946
00:50:41,390 --> 00:50:43,560
or inputs that will take
you down paths that you

947
00:50:43,560 --> 00:50:45,432
hadn't explored before.
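
The habit of always setting a timeout can be sketched as follows. This is a minimal Python illustration, not the interface of any real solver: `solve_with_timeout` and its brute-force search over `search_space` are hypothetical stand-ins for an SMT query, and the three-way result (`"sat"`, `"unsat"`, `"unknown"`) is an assumption about how a tool might report giving up.

```python
import time


def solve_with_timeout(constraint, search_space, timeout_s):
    """Try candidates until one satisfies the constraint, the search space
    is exhausted, or the time budget runs out.

    Returns ("sat", model), ("unsat", None), or ("unknown", None).
    """
    deadline = time.monotonic() + timeout_s
    for candidate in search_space:
        if time.monotonic() > deadline:
            return "unknown", None  # gave up: neither proved nor refuted
        if constraint(candidate):
            return "sat", candidate
    return "unsat", None  # every candidate was tried and rejected
```

Easy queries return almost immediately, while a query over an unbounded search space simply reports `"unknown"` once the budget is spent, so the surrounding tool can move on to another path.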

948
00:50:45,432 --> 00:50:46,207
AUDIENCE: Thanks.

949
00:50:46,207 --> 00:50:47,790
ARMANDO SOLAR-LEZAMA:
Other questions?

950
00:50:52,550 --> 00:50:53,460
All right.

951
00:50:53,460 --> 00:50:57,750
So we know how to
go from a formula,

952
00:50:57,750 --> 00:51:01,690
from a set of constraints, to
an answer that will either say,

953
00:51:01,690 --> 00:51:03,170
yes, this formula
has a solution.

954
00:51:03,170 --> 00:51:08,060
And here's a solution, or no,
this formula is unsatisfiable.

955
00:51:08,060 --> 00:51:10,950
There is no input
that satisfies this.

956
00:51:10,950 --> 00:51:15,310
So now how do we get a
formula from a program?

957
00:51:15,310 --> 00:51:18,970
So one of the
things that you have

958
00:51:18,970 --> 00:51:20,730
when you're doing
symbolic execution

959
00:51:20,730 --> 00:51:23,035
is that you get
to a branch and you

960
00:51:23,035 --> 00:51:26,600
don't know which direction
the branch is going to go.

961
00:51:26,600 --> 00:51:30,660
Now there are two possibilities
that you can do in that case.

962
00:51:30,660 --> 00:51:35,040
One is to do what we did in the
earlier example, which is just

963
00:51:35,040 --> 00:51:37,960
to say, I'm going to take both
branches at the same time.

964
00:51:37,960 --> 00:51:40,790
I'm going to collect what
happens in both branches,

965
00:51:40,790 --> 00:51:42,270
merge at the end.

966
00:51:42,270 --> 00:51:46,100
That is a strategy
that is often used

967
00:51:46,100 --> 00:51:50,710
when you're trying to get very
strong guarantees in general.

968
00:51:50,710 --> 00:51:54,080
But it's a strategy that
doesn't work too well

969
00:51:54,080 --> 00:51:56,060
with modern SMT solvers.

970
00:51:56,060 --> 00:52:02,674
So often people prefer to do
one path at a time exploration.

971
00:52:02,674 --> 00:52:04,090
And what that means
is that you're

972
00:52:04,090 --> 00:52:06,730
going to pick a path
down your program.

973
00:52:06,730 --> 00:52:10,420
And then you're going to
create a formula for that path.

974
00:52:10,420 --> 00:52:13,800
So you're going to ask, find
me an input that goes down

975
00:52:13,800 --> 00:52:18,640
this path and that
satisfies my constraint,

976
00:52:18,640 --> 00:52:21,880
or that violates
my property, that

977
00:52:21,880 --> 00:52:26,370
goes out of bounds in my buffer,
or that causes a null pointer

978
00:52:26,370 --> 00:52:27,840
error.

979
00:52:27,840 --> 00:52:29,860
And then if you
can't find one then

980
00:52:29,860 --> 00:52:32,020
you try a different path
and a different path.

981
00:52:32,020 --> 00:52:38,260
And you do these path
explorations one at a time.

982
00:52:38,260 --> 00:52:42,000
So that's the strategy that
we're going to talk about now.

983
00:52:42,000 --> 00:52:44,900
It's a little bit easier
to describe how to do it.

984
00:52:44,900 --> 00:52:49,440
So let's say that we
have a problem like this.

985
00:52:49,440 --> 00:52:51,690
So, by the way, I
switched representations.

986
00:52:51,690 --> 00:52:54,170
So I'm now representing the
program not as a block of code

987
00:52:54,170 --> 00:52:58,220
but as
a control flow graph.

988
00:52:58,220 --> 00:53:00,610
Is everybody here familiar
with a control flow graph?

989
00:53:00,610 --> 00:53:03,930
Or is anybody here not familiar
with a control flow graph?

990
00:53:03,930 --> 00:53:05,790
It's just a representation
of a program that

991
00:53:05,790 --> 00:53:08,940
makes branches more explicit.

992
00:53:08,940 --> 00:53:11,420
So let's pick a path.

993
00:53:13,940 --> 00:53:17,610
And so let's say that we
care about this path, right,

994
00:53:17,610 --> 00:53:19,790
a path that starts
at the beginning

995
00:53:19,790 --> 00:53:23,310
and takes us all the way
down to the point where

996
00:53:23,310 --> 00:53:27,090
we are asserting false.

997
00:53:27,090 --> 00:53:29,780
And we want to know,
is this path feasible?

998
00:53:29,780 --> 00:53:32,990
Could the program
go down this path?

999
00:53:32,990 --> 00:53:35,800
So as we're going
down this program

1000
00:53:35,800 --> 00:53:37,660
we're going to keep two things.

1001
00:53:42,070 --> 00:53:43,870
We're going to keep
an environment that

1002
00:53:43,870 --> 00:53:46,830
keeps track of the
symbolic values

1003
00:53:46,830 --> 00:53:48,580
of the different variables.

1004
00:53:48,580 --> 00:53:52,700
And in addition to that,
we're going to keep around

1005
00:53:52,700 --> 00:53:54,710
an environment for constraints.

1006
00:54:04,109 --> 00:54:05,650
And these constraints
are essentially

1007
00:54:05,650 --> 00:54:08,150
going to keep track of
all the relationships

1008
00:54:08,150 --> 00:54:12,000
between these variables as
well as any assumptions,

1009
00:54:12,000 --> 00:54:13,480
whether they were
assumptions that

1010
00:54:13,480 --> 00:54:15,830
were made at the
beginning, or assumptions

1011
00:54:15,830 --> 00:54:18,320
that come from the branches
that you are taking.

1012
00:54:18,320 --> 00:54:21,350
So in this case, when
we start down this path

1013
00:54:21,350 --> 00:54:29,490
we get to t equals 0, so
our state is x, y, and 0.

1014
00:54:29,490 --> 00:54:31,090
And so far we have
no constraints

1015
00:54:31,090 --> 00:54:35,290
because we didn't have any
constraint in the beginning.

1016
00:54:35,290 --> 00:54:39,224
So now we're going
to take this branch

1017
00:54:39,224 --> 00:54:41,390
and, again, because we've
made a decision that we're

1018
00:54:41,390 --> 00:54:45,950
going to go down the
path to your right,

1019
00:54:45,950 --> 00:54:51,358
then we know that this
path will only happen when?

1020
00:54:56,506 --> 00:54:57,450
AUDIENCE: [INAUDIBLE].

1021
00:54:57,450 --> 00:54:58,908
ARMANDO SOLAR-LEZAMA:
That's right.

1022
00:54:58,908 --> 00:55:04,970
So we get our first constraint
that says, x is greater than y.

1023
00:55:04,970 --> 00:55:05,470
Right?

1024
00:55:05,470 --> 00:55:13,410
So now down here we're
looking at t equals y.

1025
00:55:13,410 --> 00:55:16,510
Now in this case because we're
going only one path at a time

1026
00:55:16,510 --> 00:55:19,850
we don't actually need to
introduce a new variable for t

1027
00:55:19,850 --> 00:55:20,520
necessarily.

1028
00:55:20,520 --> 00:55:22,340
We can just say, OK.

1029
00:55:22,340 --> 00:55:23,750
t is equal to y.

1030
00:55:23,750 --> 00:55:27,640
So that means that
t is no longer 0.

1031
00:55:27,640 --> 00:55:31,130
It's now y.

1032
00:55:31,130 --> 00:55:31,740
Right?

1033
00:55:31,740 --> 00:55:32,840
And then keep going.

1034
00:55:32,840 --> 00:55:34,860
We get to this point.

1035
00:55:34,860 --> 00:55:37,990
Now we hit another branch.

1036
00:55:37,990 --> 00:55:39,490
What's a new
assumption that we have

1037
00:55:39,490 --> 00:55:41,740
to make if we're assuming
that we went down this path?

1038
00:55:49,340 --> 00:55:51,410
Just t less than y, right?

1039
00:55:51,410 --> 00:55:52,880
And what is t?

1040
00:55:56,340 --> 00:55:57,120
Right.

1041
00:55:57,120 --> 00:56:00,840
So in fact if we look up
t, so t has the value y.

1042
00:56:00,840 --> 00:56:01,916
We look up y.

1043
00:56:01,916 --> 00:56:03,300
y also has the value of y.

1044
00:56:03,300 --> 00:56:09,290
So this constraint actually
translates to y less than y.

1045
00:56:09,290 --> 00:56:11,320
So what does this tell us?

1046
00:56:11,320 --> 00:56:16,750
It tells us that in order
to make it to this point,

1047
00:56:16,750 --> 00:56:20,280
in order to make it to the assert
false, all of those things

1048
00:56:20,280 --> 00:56:21,340
have to hold.

1049
00:56:21,340 --> 00:56:22,730
Can they hold?

1050
00:56:22,730 --> 00:56:23,920
Clearly not.

1051
00:56:23,920 --> 00:56:24,670
Right?

1052
00:56:24,670 --> 00:56:28,350
y less than y alone is
already sufficient for things

1053
00:56:28,350 --> 00:56:29,550
not to hold.

1054
00:56:29,550 --> 00:56:35,980
And so that tells us immediately
that this is unsatisfiable.

1055
00:56:35,980 --> 00:56:39,940
And this is often known
as a path condition.

1056
00:56:39,940 --> 00:56:42,030
This is a condition
that has to be

1057
00:56:42,030 --> 00:56:47,020
true in order for the
program to go down that path.

1058
00:56:47,020 --> 00:56:51,630
And so we know that this path
condition cannot be satisfied.

1059
00:56:51,630 --> 00:56:54,650
And therefore, that it's
impossible for the program

1060
00:56:54,650 --> 00:56:55,970
to take this path.

1061
00:56:55,970 --> 00:57:01,480
So this path is now
completely eliminated.

1062
00:57:01,480 --> 00:57:05,680
We know that this
path cannot be taken.

1063
00:57:05,680 --> 00:57:08,640
And, in fact, so
this constraint we're

1064
00:57:08,640 --> 00:57:13,650
actually going to just keep them
around as the condition itself.

1065
00:57:13,650 --> 00:57:14,150
All right?

1066
00:57:14,150 --> 00:57:17,860
So what about a different path?

1067
00:57:17,860 --> 00:57:21,840
So now we're trying this path.

1068
00:57:24,830 --> 00:57:29,140
So what would be the
path condition for this?

1069
00:57:29,140 --> 00:57:35,920
So, again, our symbolic
state starts with t equals 0,

1070
00:57:35,920 --> 00:57:39,270
and x and y equal to just
the variables x and y.

1071
00:57:39,270 --> 00:57:43,060
And now what does
the path constraint

1072
00:57:43,060 --> 00:57:44,610
look like in this case?

1073
00:57:44,610 --> 00:57:48,115
So by the time we get here what
does the path condition look

1074
00:57:48,115 --> 00:57:48,615
like?

1075
00:57:50,984 --> 00:57:51,900
AUDIENCE: [INAUDIBLE].

1076
00:57:53,818 --> 00:57:54,984
ARMANDO SOLAR-LEZAMA: Right.

1077
00:57:54,984 --> 00:57:59,860
So in this case [INAUDIBLE]
this is true and this is false.

1078
00:57:59,860 --> 00:58:02,590
So in this case it says,
OK. x is greater than y.

1079
00:58:06,010 --> 00:58:10,900
And we are setting
t to be equal to x.

1080
00:58:10,900 --> 00:58:21,290
So then when we get here
we have x is less than y.

1081
00:58:21,290 --> 00:58:21,790
Right?

1082
00:58:21,790 --> 00:58:24,830
And once again it's very
clear that this path condition

1083
00:58:24,830 --> 00:58:26,940
is unsatisfiable.

1084
00:58:26,940 --> 00:58:27,440
Right?

1085
00:58:27,440 --> 00:58:30,960
We cannot have x greater than
y and x less than y at the same

1086
00:58:30,960 --> 00:58:31,460
time.

1087
00:58:31,460 --> 00:58:33,970
There's no assignment
to x that will satisfy

1088
00:58:33,970 --> 00:58:35,360
both of those constraints.

1089
00:58:35,360 --> 00:58:38,740
So what that tells us is, again,
that this other path is also

1090
00:58:38,740 --> 00:58:40,030
unsatisfiable.

1091
00:58:40,030 --> 00:58:42,030
And now at this
point we've actually

1092
00:58:42,030 --> 00:58:46,280
explored every possible path in
our program that could lead us

1093
00:58:46,280 --> 00:58:47,040
to this condition.

1094
00:58:47,040 --> 00:58:50,200
So we can actually
establish and certify

1095
00:58:50,200 --> 00:58:56,890
that there is no possible path
that will lead to an assertion

1096
00:58:56,890 --> 00:58:57,710
failure.
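
The one-path-at-a-time walk-through can be sketched in Python. The slide itself is not in the transcript, so the program shape below (`t = 0; if x > y: t = x else: t = y; if t < y: assert false`) is an assumption, and `feasible` is a brute-force stand-in for an SMT query over a small range of integers, not a proof procedure. The sketch keeps the two pieces of state described above: an environment mapping variables to symbolic values, and a path condition.

```python
from itertools import product

OPS = {
    ">": lambda a, b: a > b,
    "<=": lambda a, b: a <= b,
    "<": lambda a, b: a < b,
}


def evaluate(expr, inputs):
    """A symbolic value is either an integer constant or an input name."""
    return expr if isinstance(expr, int) else inputs[expr]


def feasible(path_condition, bound=8):
    """Stand-in for the solver: look for small x, y satisfying every constraint."""
    for x, y in product(range(-bound, bound + 1), repeat=2):
        inputs = {"x": x, "y": y}
        if all(OPS[op](evaluate(lhs, inputs), evaluate(rhs, inputs))
               for op, lhs, rhs in path_condition):
            return inputs
    return None


def explore(take_then_branch):
    """Build the path condition for one path that reaches `assert false`."""
    env = {"x": "x", "y": "y", "t": 0}  # symbolic state after t = 0
    pc = []                              # no constraints at the start
    if take_then_branch:
        pc.append((">", env["x"], env["y"]))
        env["t"] = env["x"]              # t now holds the symbolic value x
    else:
        pc.append(("<=", env["x"], env["y"]))
        env["t"] = env["y"]              # t now holds the symbolic value y
    # Reaching the assertion requires t < y; look t up in the environment.
    pc.append(("<", env["t"], env["y"]))
    return pc
```

Under these assumptions `explore(False)` ends with the constraint `y < y`, and `explore(True)` combines `x > y` with `x < y`, so `feasible` finds no input for either path.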

1097
00:58:57,710 --> 00:58:58,539
Yes?

1098
00:58:58,539 --> 00:59:00,205
AUDIENCE: The way you
just presented it,

1099
00:59:00,205 --> 00:59:03,995
it makes it look as if you would
explore every possible branch.

1100
00:59:03,995 --> 00:59:06,120
I mean, one of the advantages
of symbolic execution

1101
00:59:06,120 --> 00:59:07,953
is that you're trying
to prevent [INAUDIBLE]

1102
00:59:07,953 --> 00:59:11,730
a need of exploring all possible
[INAUDIBLE] exponential.

1103
00:59:11,730 --> 00:59:13,356
So how are you avoiding
that over here?

1104
00:59:13,356 --> 00:59:15,730
ARMANDO SOLAR-LEZAMA: That's
a very good question, right?

1105
00:59:15,730 --> 00:59:18,080
So in this case essentially
what you have is

1106
00:59:18,080 --> 00:59:21,160
you have a trade off between
how symbolic and how concrete

1107
00:59:21,160 --> 00:59:22,101
you want to be.

1108
00:59:22,101 --> 00:59:22,600
Right?

1109
00:59:22,600 --> 00:59:26,990
So in this case we are not
as symbolic as the first time

1110
00:59:26,990 --> 00:59:30,810
around when we were visiting
both branches at the same time,

1111
00:59:30,810 --> 00:59:34,460
but in exchange for that our
constraints became very, very

1112
00:59:34,460 --> 00:59:35,221
simple.

1113
00:59:35,221 --> 00:59:35,720
Right?

1114
00:59:35,720 --> 00:59:39,370
So the individual path by path
constraints are very simple,

1115
00:59:39,370 --> 00:59:42,050
but you have to do this over,
and over, and over again

1116
00:59:42,050 --> 00:59:44,310
to explore all the
different branches.

1117
00:59:44,310 --> 00:59:46,930
And there are exponentially--
all the different paths.

1118
00:59:46,930 --> 00:59:50,580
And there are exponentially
many paths in a program.

1119
00:59:50,580 --> 00:59:53,110
Now there are
exponentially many paths,

1120
00:59:53,110 --> 00:59:55,540
but for every path
in general, there's

1121
00:59:55,540 --> 00:59:58,580
also an exponentially
large set of inputs

1122
00:59:58,580 --> 01:00:00,234
that could go down that path.

1123
01:00:00,234 --> 01:00:02,525
So this already gives you a
big benefit because instead

1124
01:00:02,525 --> 01:00:05,220
of having to try every
possible input you're only

1125
01:00:05,220 --> 01:00:08,220
trying every possible path.

1126
01:00:08,220 --> 01:00:10,430
But can you do better?

1127
01:00:10,430 --> 01:00:14,370
And this is one of the
areas where there's

1128
01:00:14,370 --> 01:00:19,040
been a lot of experimentation in
the area of symbolic execution.

1129
01:00:19,040 --> 01:00:22,700
When do you do path
by path reasoning?

1130
01:00:22,700 --> 01:00:26,180
When do you do all
paths at the same time?

1131
01:00:26,180 --> 01:00:28,550
And one of the things
that you saw, for example,

1132
01:00:28,550 --> 01:00:31,750
in the KLEE paper
is a set of heuristics,

1133
01:00:31,750 --> 01:00:33,550
and a set of
strategies they used

1134
01:00:33,550 --> 01:00:35,360
to make the search tractable.

1135
01:00:35,360 --> 01:00:37,530
For example, one of
the things that they do

1136
01:00:37,530 --> 01:00:40,890
is that they are
exploring path by path,

1137
01:00:40,890 --> 01:00:43,300
but they're not exploring
completely blindly.

1138
01:00:43,300 --> 01:00:47,960
And they are also checking
the path conditions

1139
01:00:47,960 --> 01:00:49,670
after every step.

1140
01:00:49,670 --> 01:00:53,480
So that, for example,
if here instead of just

1141
01:00:53,480 --> 01:01:02,110
assert false, if this were
a very complex program tree,

1142
01:01:02,110 --> 01:01:03,440
control flow graph.

1143
01:01:03,440 --> 01:01:07,860
You don't wait until
you get to the very end

1144
01:01:07,860 --> 01:01:10,330
to check whether the
path is feasible.

1145
01:01:10,330 --> 01:01:13,870
The moment you get here you know
that this path is unsatisfiable

1146
01:01:13,870 --> 01:01:16,330
and you never go
down this direction.

1147
01:01:16,330 --> 01:01:18,950
You always go in
the other direction.

1148
01:01:18,950 --> 01:01:24,670
So pruning the paths
early helps cut down a lot

1149
01:01:24,670 --> 01:01:26,180
on the exponential blowup.

1150
01:01:26,180 --> 01:01:28,590
And exploring the
paths intelligently

1151
01:01:28,590 --> 01:01:32,510
helps a lot in
preventing blow up.
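
The effect of checking the path condition at every branch, rather than only at the end, can be sketched like this. It is a toy Python illustration under stated assumptions: the four branch conditions in `BRANCHES` are invented for the sketch, and `feasible` is a brute-force stand-in for a solver query.

```python
from itertools import product


def feasible(constraints, bound=4):
    """Stand-in for the solver: is there a small (x, y) satisfying everything?"""
    return any(all(c(x, y) for c in constraints)
               for x, y in product(range(-bound, bound + 1), repeat=2))


# Hypothetical chain of branches; taking the first two both "true" is impossible.
BRANCHES = [
    lambda x, y: x > y,
    lambda x, y: x < y,
    lambda x, y: x > 0,
    lambda x, y: y > 0,
]


def negate(cond):
    return lambda x, y: not cond(x, y)


def count_paths(prune, depth=0, pc=()):
    """Count the nodes of the exploration tree that get visited."""
    if prune and not feasible(pc):
        return 1  # visit this node, but never descend below it
    if depth == len(BRANCHES):
        return 1  # end of a complete path
    cond = BRANCHES[depth]
    return (1
            + count_paths(prune, depth + 1, pc + (cond,))
            + count_paths(prune, depth + 1, pc + (negate(cond),)))
```

With `prune=False` the full tree of branch outcomes is visited; with `prune=True` the subtree below the contradictory prefix is skipped, so fewer nodes are visited. In a larger program the skipped subtree could itself be exponentially large.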

1152
01:01:32,510 --> 01:01:35,270
A lot of the practical
tools that are used today,

1153
01:01:35,270 --> 01:01:36,770
some of the things
that they will do

1154
01:01:36,770 --> 01:01:39,710
is they will actually start
with some random testing

1155
01:01:39,710 --> 01:01:42,520
to get an initial set of paths.

1156
01:01:42,520 --> 01:01:45,660
And then they will start looking
for paths in the neighborhood

1157
01:01:45,660 --> 01:01:46,900
of those paths.

1158
01:01:46,900 --> 01:01:50,310
They will start asking questions
like, hey, the random execution

1159
01:01:50,310 --> 01:01:51,430
went down this branch.

1160
01:01:51,430 --> 01:01:52,770
What if I flip this branch?

1161
01:01:52,770 --> 01:01:54,130
What if I flip this branch?

1162
01:01:54,130 --> 01:01:55,560
What if I flip this branch?

1163
01:01:55,560 --> 01:01:57,780
What happens in those paths?

1164
01:01:57,780 --> 01:01:59,750
This can be particularly
useful, for example,

1165
01:01:59,750 --> 01:02:01,210
if we have a good test suite.

1166
01:02:01,210 --> 01:02:04,220
And you run your test suite
and you find, OK, there

1167
01:02:04,220 --> 01:02:07,200
is this piece of code that
nothing in my test suite

1168
01:02:07,200 --> 01:02:08,720
exercised.

1169
01:02:08,720 --> 01:02:12,600
So what you can do is you can
take the path that got closest

1170
01:02:12,600 --> 01:02:15,510
to exercising that,
and then ask, hey,

1171
01:02:15,510 --> 01:02:19,630
can I change this path so that
it goes down this direction

1172
01:02:19,630 --> 01:02:20,930
instead?
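
The "what if I flip this branch?" question can be sketched in Python as follows. Everything here is hypothetical: `program` is an invented toy with two branches, `CONDITIONS` restates those branches symbolically by hand, and `solve` is a brute-force stand-in for a solver. The sketch runs the program on a concrete input, records which way each branch went, negates the last decision, and asks for an input that follows the new path.

```python
from itertools import product


def program(x, y):
    """Toy program under test; returns the (branch_id, outcome) pairs it took."""
    trace = []
    taken = x > 10
    trace.append((0, taken))
    if taken:
        taken = y == x + 1
        trace.append((1, taken))
    return trace


# Branch conditions written out by hand, keyed by branch id.
CONDITIONS = {
    0: lambda x, y: x > 10,
    1: lambda x, y: y == x + 1,
}


def solve(constraints, bound=20):
    """Stand-in for the solver: find small inputs giving each branch the wanted outcome."""
    for x, y in product(range(-bound, bound + 1), repeat=2):
        if all(CONDITIONS[branch](x, y) == want for branch, want in constraints):
            return (x, y)
    return None


def flip_last_branch(inputs):
    """Keep the observed path prefix, negate the final branch, and solve for new inputs."""
    trace = program(*inputs)
    prefix, (branch, outcome) = trace[:-1], trace[-1]
    return solve(prefix + [(branch, not outcome)])
```

Starting from an arbitrary input such as `(0, 0)`, calling `flip_last_branch` twice in a row yields an input that takes both branches, which random testing alone would be unlikely to hit because of the `y == x + 1` check.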

1173
01:02:20,930 --> 01:02:25,970
And so in general,
the moment you

1174
01:02:25,970 --> 01:02:28,690
try to do all paths
simultaneously

1175
01:02:28,690 --> 01:02:31,420
the constraints start
becoming intractable.

1176
01:02:31,420 --> 01:02:33,910
And it's the kind
of thing that you

1177
01:02:33,910 --> 01:02:37,250
can do if you're doing
one function at a time.

1178
01:02:37,250 --> 01:02:39,420
For example, if you're
doing one function at a time

1179
01:02:39,420 --> 01:02:42,140
then it is generally feasible
to explore all the paths

1180
01:02:42,140 --> 01:02:43,790
in a function together.

1181
01:02:43,790 --> 01:02:47,660
If you're trying to do
larger units, then generally

1182
01:02:47,660 --> 01:02:50,105
you have to go with path
by path exploration.

1183
01:02:53,392 --> 01:02:54,475
Are there other questions?

1184
01:02:56,880 --> 01:02:57,380
Yes?

1185
01:02:57,380 --> 01:03:00,302
AUDIENCE: You referenced
how [INAUDIBLE].

1186
01:03:00,302 --> 01:03:02,250
How does it do that again?

1187
01:03:02,250 --> 01:03:04,920
What's the [INAUDIBLE]?

1188
01:03:04,920 --> 01:03:08,140
ARMANDO SOLAR-LEZAMA: So the
most important one really is

1189
01:03:08,140 --> 01:03:13,600
this idea that for every branch,
you check your constraints

1190
01:03:13,600 --> 01:03:17,490
to check whether that branch
can actually go both ways,

1191
01:03:17,490 --> 01:03:23,670
because if it cannot go both
ways then you save a lot by

1192
01:03:23,670 --> 01:03:26,390
not going in the direction
where it can't go.

1193
01:03:26,390 --> 01:03:28,780
Beyond that I don't remember
the specific strategy

1194
01:03:28,780 --> 01:03:32,220
that they use for searching
paths that are more

1195
01:03:32,220 --> 01:03:34,570
likely to give good results.

1196
01:03:37,760 --> 01:03:39,580
But pruning is really,
really important.

1197
01:03:43,460 --> 01:03:44,930
OK.

1198
01:03:44,930 --> 01:03:48,560
So far though we've been
talking mostly about toy code

1199
01:03:48,560 --> 01:03:53,360
in the sense that it's only
integer variables, branches,

1200
01:03:53,360 --> 01:03:54,760
very simple stuff.

1201
01:03:54,760 --> 01:03:55,430
Right?

1202
01:03:55,430 --> 01:03:59,090
What happens when you
have a program that

1203
01:03:59,090 --> 01:04:01,680
is more complicated?

1204
01:04:01,680 --> 01:04:05,790
And in particular, what happens
when you have a program that

1205
01:04:05,790 --> 01:04:08,031
involves the heap?

1206
01:04:08,031 --> 01:04:08,530
Right?

1207
01:04:08,530 --> 01:04:11,580
So the heap has
historically been

1208
01:04:11,580 --> 01:04:14,080
the bane of all program
analysis. Analyses

1209
01:04:14,080 --> 01:04:18,180
that were so clean and so
elegant in the days of Fortran,

1210
01:04:18,180 --> 01:04:21,230
completely blow up when you
try to run them on a C program

1211
01:04:21,230 --> 01:04:23,410
where you're allocating
memory left and right.

1212
01:04:23,410 --> 01:04:25,280
And you have aliasing.

1213
01:04:25,280 --> 01:04:28,680
And you have all
the messiness that

1214
01:04:28,680 --> 01:04:32,410
comes with dealing with
program allocated memory.

1215
01:04:32,410 --> 01:04:34,660
And with pointers and
pointer arithmetic.

1216
01:04:34,660 --> 01:04:37,840
And this is one of the areas
where symbolic execution really

1217
01:04:37,840 --> 01:04:39,840
shines in the ability
to actually reason

1218
01:04:39,840 --> 01:04:42,450
about these kinds of programs.

1219
01:04:42,450 --> 01:04:44,190
So how do we do it?

1220
01:04:44,190 --> 01:04:47,640
Right, so let's forget now
for a moment about branches,

1221
01:04:47,640 --> 01:04:48,530
and control flow.

1222
01:04:48,530 --> 01:04:53,080
We have a trivially
simple program here.

1223
01:04:53,080 --> 01:04:56,630
All it's doing is it's
allocating some memory.

1224
01:04:56,630 --> 01:04:58,090
It's zeroing it out.

1225
01:04:58,090 --> 01:05:02,500
It's getting a new pointer
y from the pointer x.

1226
01:05:02,500 --> 01:05:04,380
It's writing something into y.

1227
01:05:04,380 --> 01:05:08,140
And then it's checking,
hey, is the value

1228
01:05:08,140 --> 01:05:12,070
stored at pointer y equal to
the value stored at pointer x?

1229
01:05:12,070 --> 01:05:14,390
And just from your
basic knowledge of C

1230
01:05:14,390 --> 01:05:16,920
you could see that, no.

1231
01:05:16,920 --> 01:05:22,081
Right, that this assertion is
actually violated because x got

1232
01:05:22,081 --> 01:05:26,570
zeroed out and y
has 25 in there,

1233
01:05:26,570 --> 01:05:30,210
but x is pointing to
a different location.

1234
01:05:30,210 --> 01:05:33,030
Right?

1235
01:05:33,030 --> 01:05:35,000
So far so good.

1236
01:05:35,000 --> 01:05:37,570
The way we're going to
model the heap and the way

1237
01:05:37,570 --> 01:05:41,140
the heap is modeled in
a lot of these systems

1238
01:05:41,140 --> 01:05:45,070
is by not thinking of
the heap as a heap,

1239
01:05:45,070 --> 01:05:48,150
but by thinking of
the heap the way

1240
01:05:48,150 --> 01:05:51,840
C likes for you to think
of the heap, which is just

1241
01:05:51,840 --> 01:05:57,500
a giant address space, a giant
array where you can put things

1242
01:05:57,500 --> 01:05:58,640
into.

1243
01:05:58,640 --> 01:06:00,800
So what does that mean?

1244
01:06:00,800 --> 01:06:03,340
It means that we can
think of our program

1245
01:06:03,340 --> 01:06:07,780
as having this very
big global array.

1246
01:06:07,780 --> 01:06:10,980
And we're just going
to call it MEM for now.

1247
01:06:10,980 --> 01:06:11,480
Right?

1248
01:06:11,480 --> 01:06:13,530
And it's an array that
essentially is going

1249
01:06:13,530 --> 01:06:17,630
to map addresses to values.

1250
01:06:17,630 --> 01:06:18,130
Right?

1251
01:06:18,130 --> 01:06:19,330
And what's an address?

1252
01:06:19,330 --> 01:06:25,710
Well, an address is
just a 64-bit value.

1253
01:06:25,710 --> 01:06:30,040
And what comes out when you read
something from an address?

1254
01:06:30,040 --> 01:06:31,750
It depends on how
you're modeling memory.

1255
01:06:31,750 --> 01:06:36,620
If you're modeling it at the
byte level, then what comes out

1256
01:06:36,620 --> 01:06:37,960
is a byte.

1257
01:06:37,960 --> 01:06:40,460
If you're modeling it
at the word level then

1258
01:06:40,460 --> 01:06:42,880
what comes out of it is a word.

1259
01:06:42,880 --> 01:06:45,490
And depending on the kind of
bugs that you're interested in,

1260
01:06:45,490 --> 01:06:47,920
and whether things
like memory alignment

1261
01:06:47,920 --> 01:06:49,650
are an issue for
you or not, you're

1262
01:06:49,650 --> 01:06:51,441
going to model it a
little bit differently,

1263
01:06:51,441 --> 01:06:53,810
but generally memory
is just an array

1264
01:06:53,810 --> 01:07:00,030
from an address to a value.

1265
01:07:00,030 --> 01:07:00,530
Right?

1266
01:07:00,530 --> 01:07:07,260
So an address is
just an integer.

1267
01:07:07,260 --> 01:07:08,147
Right?

1268
01:07:08,147 --> 01:07:10,230
It's in some sense not
that different from the way

1269
01:07:10,230 --> 01:07:11,550
C thinks of an address.

1270
01:07:11,550 --> 01:07:12,870
It's just an integer.

1271
01:07:12,870 --> 01:07:15,430
It's just a value.

1272
01:07:15,430 --> 01:07:18,740
It's just a 64-bit integer,
or a 32-bit integer,

1273
01:07:18,740 --> 01:07:20,010
depending on your machine.

1274
01:07:20,010 --> 01:07:22,930
It's just a value that
indexes into that memory.

1275
01:07:22,930 --> 01:07:24,990
And that you can put
things in memory,

1276
01:07:24,990 --> 01:07:27,490
read them from the memory.

1277
01:07:27,490 --> 01:07:30,860
So things like
pointer arithmetic

1278
01:07:30,860 --> 01:07:33,304
just become integer arithmetic.

1279
01:07:33,304 --> 01:07:35,220
In practice there's a
little bit of desugaring

1280
01:07:35,220 --> 01:07:43,020
that has to happen because in C
the pointer arithmetic actually

1281
01:07:43,020 --> 01:07:45,290
knows about the types
of the pointers.

1282
01:07:45,290 --> 01:07:50,030
And things will be incremented
proportional to the size,

1283
01:07:50,030 --> 01:07:50,530
right?

1284
01:07:50,530 --> 01:08:00,100
So this would actually be x
plus 10 times the size of int.

1285
01:08:00,100 --> 01:08:01,320
Right?

1286
01:08:01,320 --> 01:08:03,440
But what's really
important is what

1287
01:08:03,440 --> 01:08:06,610
happens when you're reading
and writing from memory.

1288
01:08:06,610 --> 01:08:11,590
So what used to be just a
pointer dereference of y

1289
01:08:11,590 --> 01:08:17,109
to write 25, is now just
I'm taking my memory array,

1290
01:08:17,109 --> 01:08:19,910
and I'm indexing it with y.

1291
01:08:19,910 --> 01:08:24,590
And I'm writing 25 to
that memory location.

1292
01:08:24,590 --> 01:08:25,090
Right?

1293
01:08:25,090 --> 01:08:29,020
And this assertion
now becomes, well, I

1294
01:08:29,020 --> 01:08:32,430
am reading from
location y in memory.

1295
01:08:32,430 --> 01:08:35,100
And I am reading from
location x in memory.

1296
01:08:35,100 --> 01:08:36,550
And I am comparing them.

1297
01:08:36,550 --> 01:08:40,010
And I'm checking whether
they are the same or not.

1298
01:08:40,010 --> 01:08:41,510
It's a very, very
simple reduction

1299
01:08:41,510 --> 01:08:46,880
to go from a program that uses the
heap to a program that just uses

1300
01:08:46,880 --> 01:08:51,790
this giant global array
that represents the memory.
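
A minimal sketch of this reduction, written in Python rather than C. The names `mem` and `SIZEOF_INT`, the concrete address 1000, and the exact form of the assertion are assumptions made for illustration; they are not taken from the lecture slides.

```python
# Hypothetical model: the whole heap is one global array, here a dict
# from integer address to value (addresses never written read as 0).
from collections import defaultdict

SIZEOF_INT = 4             # assumed size of int, in bytes
mem = defaultdict(int)     # the giant global array standing in for memory

x = 1000                   # an address is just an integer
y = x + 10 * SIZEOF_INT    # C's `y = x + 10` on an int*, after desugaring

mem[y] = 25                # C's `*y = 25` becomes an array write
same = (mem[y] == mem[x])  # comparing `*y` and `*x` becomes two array reads
```

Once the program is in this form, the only thing the analysis has to reason about is reads and writes on `mem`.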

1301
01:08:51,790 --> 01:08:53,649
And now what that
means is that in order

1302
01:08:53,649 --> 01:08:55,764
to reason about programs
that manipulate the heap

1303
01:08:55,764 --> 01:08:57,680
you don't really have
to reason about programs

1304
01:08:57,680 --> 01:08:58,721
that manipulate the heap.

1305
01:08:58,721 --> 01:09:01,510
As long as you have the
ability to reason about arrays,

1306
01:09:01,510 --> 01:09:02,399
you are good.

1307
01:09:02,399 --> 01:09:04,700
Now here's a simple
question though.

1308
01:09:04,700 --> 01:09:07,430
What about the malloc?

1309
01:09:07,430 --> 01:09:11,479
So one thing you can do is
you can say, well, malloc,

1310
01:09:11,479 --> 01:09:16,240
I can just take the C
implementation of malloc

1311
01:09:16,240 --> 01:09:18,130
and actually implement
malloc like that.

1312
01:09:18,130 --> 01:09:23,130
And keep track of all the
pages that I have allocated

1313
01:09:23,130 --> 01:09:26,950
and keep track of everything
that has been freed.

1314
01:09:26,950 --> 01:09:29,109
And keep a free
list, and everything.

1315
01:09:29,109 --> 01:09:31,380
It turns out for
a lot of purposes

1316
01:09:31,380 --> 01:09:33,310
and for a lot of
classes of bugs,

1317
01:09:33,310 --> 01:09:35,185
you don't need malloc
to be that complicated.

1318
01:09:35,185 --> 01:09:39,529
In fact, you can get away with
a malloc that looks like this,

1319
01:09:39,529 --> 01:09:41,819
with a malloc that
just says, I'm

1320
01:09:41,819 --> 01:09:49,330
going to keep a counter for
the next free memory location.

1321
01:09:49,330 --> 01:09:55,560
And whenever somebody
asks for an address,

1322
01:09:55,560 --> 01:09:57,730
that address I'm just
going to give this position

1323
01:09:57,730 --> 01:09:59,720
and then increment the position.

1324
01:09:59,720 --> 01:10:00,220
Right?

1325
01:10:02,920 --> 01:10:04,769
And then return
rv, in this case.

1326
01:10:11,626 --> 01:10:14,042
So what is one of the things that this
malloc is completely ignoring?

1327
01:10:17,754 --> 01:10:18,670
AUDIENCE: [INAUDIBLE].

1328
01:10:18,670 --> 01:10:18,770
ARMANDO SOLAR-LEZAMA: Yeah.

1329
01:10:18,770 --> 01:10:19,670
Freeing, right?

1330
01:10:19,670 --> 01:10:21,939
This malloc says, yeah,
forget about freeing.

1331
01:10:21,939 --> 01:10:22,730
There's no freeing.

1332
01:10:22,730 --> 01:10:26,650
We're just going to keep walking
through our memory allocating

1333
01:10:26,650 --> 01:10:30,880
further, and further, and
further and that will be it.

1334
01:10:30,880 --> 01:10:34,770
And we don't care
about freeing anything.
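
A minimal sketch of the simplified malloc described here, in Python. The names `next_free` and `rv` follow the spoken description; starting the counter at 1 (so that 0 can stand for NULL) is an assumption.

```python
# Hypothetical bump-allocator model of malloc: no free list, no free().
next_free = 1  # counter for the next free memory location (0 reserved, by assumption)

def malloc(n):
    """Hand out the current position, then advance it by n locations."""
    global next_free
    rv = next_free
    next_free += n
    return rv

a = malloc(16)  # first block
b = malloc(8)   # second block starts right after the first
```

Because nothing is ever freed or reused, two calls never return overlapping blocks, which is enough to ask whether a write runs past the end of a buffer.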

1335
01:10:34,770 --> 01:10:36,710
It also doesn't really
care about the fact

1336
01:10:36,710 --> 01:10:39,759
that well, actually, there
are regions of memory where

1337
01:10:39,759 --> 01:10:40,800
you shouldn't be writing.

1338
01:10:40,800 --> 01:10:42,385
There are special
addresses that have

1339
01:10:42,385 --> 01:10:44,960
special meaning that are
reserved for the operating

1340
01:10:44,960 --> 01:10:45,540
system.

1341
01:10:45,540 --> 01:10:47,560
It doesn't model
any of the things

1342
01:10:47,560 --> 01:10:50,580
that actually make writing a
malloc function complicated,

1343
01:10:50,580 --> 01:10:54,380
but at a certain
level of abstraction,

1344
01:10:54,380 --> 01:10:58,280
if you're trying to reason
about some complicated code that

1345
01:10:58,280 --> 01:10:59,520
does pointer manipulation.

1346
01:10:59,520 --> 01:11:02,130
And you don't care
about freeing memory,

1347
01:11:02,130 --> 01:11:04,600
but what you really
care about is, am I

1348
01:11:04,600 --> 01:11:08,030
going to write past the end
of some buffer, for example.

1349
01:11:08,030 --> 01:11:10,642
Then this malloc
might be good enough.

1350
01:11:10,642 --> 01:11:12,850
And this is actually something that
happens very, very commonly

1351
01:11:12,850 --> 01:11:15,380
when you're doing symbolic
execution of real code.

1352
01:11:15,380 --> 01:11:19,080
A very important
step is the modeling

1353
01:11:19,080 --> 01:11:20,750
of your library functions.

1354
01:11:20,750 --> 01:11:22,800
And how you model
your library functions

1355
01:11:22,800 --> 01:11:25,760
is going to have a huge
impact on the one hand

1356
01:11:25,760 --> 01:11:30,110
on the performance and the
scalability of the analysis,

1357
01:11:30,110 --> 01:11:32,160
but on the other hand,
on the precision.

1358
01:11:32,160 --> 01:11:35,670
So if you have a Mickey Mouse
model of malloc like this,

1359
01:11:35,670 --> 01:11:37,930
it's going to be
very, very fast,

1360
01:11:37,930 --> 01:11:41,265
but there are going to be
certain classes of bugs

1361
01:11:41,265 --> 01:11:43,060
that you won't be able to catch.

1362
01:11:43,060 --> 01:11:43,560
Right?

1363
01:11:43,560 --> 01:11:45,630
So in this model, for
example, I'm completely

1364
01:11:45,630 --> 01:11:46,840
ignoring the allocations.

1365
01:11:46,840 --> 01:11:48,840
So if I have a bug
because somebody

1366
01:11:48,840 --> 01:11:51,940
is accessing unallocated space.

1367
01:11:51,940 --> 01:11:56,010
Well, I'm not going to find
it with this Mickey Mouse

1368
01:11:56,010 --> 01:11:58,860
model of malloc.

1369
01:11:58,860 --> 01:11:59,660
Right?

1370
01:11:59,660 --> 01:12:04,400
So it's always a balance between
the precision of the analysis

1371
01:12:04,400 --> 01:12:10,400
versus the efficiency.

1372
01:12:10,400 --> 01:12:14,030
And the more complicated your
models of standard functions

1373
01:12:14,030 --> 01:12:17,010
like malloc get,
the less scalable

1374
01:12:17,010 --> 01:12:20,230
the analysis is going to be,
but for certain classes of bugs

1375
01:12:20,230 --> 01:12:22,150
you will need those models.

1376
01:12:22,150 --> 01:12:25,510
And one of the big things
in the KLEE paper

1377
01:12:25,510 --> 01:12:27,830
was really having
reasonable models

1378
01:12:27,830 --> 01:12:31,440
for all the different
libraries in C,

1379
01:12:31,440 --> 01:12:32,940
all the different
libraries that are

1380
01:12:32,940 --> 01:12:35,350
needed in order to understand
what a program is actually

1381
01:12:35,350 --> 01:12:35,850
doing.

1382
01:12:39,090 --> 01:12:40,177
So, OK.

1383
01:12:40,177 --> 01:12:42,510
So we've reduced the problem
of reasoning about the heap

1384
01:12:42,510 --> 01:12:47,220
to a problem of reasoning
about a program with arrays,

1385
01:12:47,220 --> 01:12:50,910
but I haven't actually
told you how to reason

1386
01:12:50,910 --> 01:12:52,270
about a program with arrays.

1387
01:12:52,270 --> 01:12:55,390
And it turns out
that most SMT solvers

1388
01:12:55,390 --> 01:12:58,060
support a theory of arrays.

1389
01:12:58,060 --> 01:13:01,826
And the idea is
if a is an array,

1390
01:13:01,826 --> 01:13:03,950
there's some notation to
say, well, take that array

1391
01:13:03,950 --> 01:13:07,070
and create a new array
where location i has

1392
01:13:07,070 --> 01:13:10,571
been updated to value e.

1393
01:13:10,571 --> 01:13:11,070
All right?

1394
01:13:11,070 --> 01:13:14,820
So if I have array a and I
do this update operation,

1395
01:13:14,820 --> 01:13:17,340
and then I try to
read the value at k,

1396
01:13:17,340 --> 01:13:20,180
then the meaning
is that the value at k

1397
01:13:20,180 --> 01:13:22,370
is going to be
equal to the value at k

1398
01:13:22,370 --> 01:13:25,330
in a if k is different from i.

1399
01:13:25,330 --> 01:13:29,350
And it's going to be equal to
e if k is equal to i, right?

1400
01:13:29,350 --> 01:13:31,290
That's what updating
an array means.

1401
01:13:31,290 --> 01:13:33,890
That's what it means
to take an old array

1402
01:13:33,890 --> 01:13:35,583
and update it to be a new array.
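
The update rule just described can be sketched in plain Python, with an array modeled as a function from index to value. The names `store`, `select`, and `zero` are assumptions chosen for this illustration; they are not a solver's API.

```python
def store(a, i, e):
    """Return a new array equal to a, except that location i holds e."""
    def updated(k):
        # Reading k gives e if k == i, and the old value a(k) otherwise.
        return e if k == i else a(k)
    return updated

def select(a, k):
    """Read location k of array a."""
    return a(k)

zero = lambda k: 0       # the array that is zero everywhere
b = store(zero, 3, 42)   # zero is left unchanged; b is a new array
```

Note that `store` never modifies the old array; it builds a new one, which matches the description of taking an old array and producing an updated one.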

1403
01:13:40,320 --> 01:13:44,780
And the nice thing about this is
that if you have a formula that

1404
01:13:44,780 --> 01:13:47,780
involves the theory of
arrays, so, for example,

1405
01:13:47,780 --> 01:13:51,850
I started with the zero array
that is just zeros everywhere.

1406
01:13:51,850 --> 01:13:59,210
And then I wrote 5 into location
i, and 7 into location j.

1407
01:13:59,210 --> 01:14:00,850
And then I'm reading from k.

1408
01:14:00,850 --> 01:14:04,680
And then I'm checking whether
that's equal to 5 or not.

1409
01:14:04,680 --> 01:14:10,110
Then that can be expanded
by using this definition

1410
01:14:10,110 --> 01:14:14,450
to something that says,
well, if k is equal to i

1411
01:14:14,450 --> 01:14:19,290
then if k is equal to i,
and k is different from j,

1412
01:14:19,290 --> 01:14:21,650
then, yes, this is
going to be equal to 5.

1413
01:14:24,570 --> 01:14:30,640
And otherwise this is not
going to be equal to 5, right?
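
As a sanity check on this expansion, the following Python sketch compares the array formula with the array-free condition over a small range of indices. The function names and the range 0 to 3 are assumptions for illustration; a real SMT solver reasons symbolically rather than by enumeration.

```python
from itertools import product

def read_after_writes(i, j, k):
    """Read location k after writing 5 at i, then 7 at j, into the zero array."""
    mem = {}              # zero array: missing keys read as 0
    mem[i] = 5
    mem[j] = 7
    return mem.get(k, 0)

def expanded(i, j, k):
    """Array-free condition under which the read equals 5."""
    return k == i and k != j

# Compare both sides on every combination of small index values.
equivalent = all(
    (read_after_writes(i, j, k) == 5) == expanded(i, j, k)
    for i, j, k in product(range(4), repeat=3)
)
```

The case i equal to j is the one that is easy to miss: the later write of 7 overwrites the 5, so the read cannot be 5 there.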

1414
01:14:30,640 --> 01:14:33,850
And in practice SMT solvers
don't just expand these

1415
01:14:33,850 --> 01:14:36,290
into lots of Boolean formulas.

1416
01:14:36,290 --> 01:14:37,950
They, again, use
this back and forth

1417
01:14:37,950 --> 01:14:41,200
strategy between a SAT
solver and an engine

1418
01:14:41,200 --> 01:14:45,380
that is able to reason about
this theory of arrays in order

1419
01:14:45,380 --> 01:14:46,020
to do it.

1420
01:14:46,020 --> 01:14:48,060
But what's important
is that by relying

1421
01:14:48,060 --> 01:14:51,680
on this theory of arrays,
using the same strategy we

1422
01:14:51,680 --> 01:15:00,050
saw to generate formulas for
integers you can actually

1423
01:15:00,050 --> 01:15:03,990
generate formulas
involving array logic,

1424
01:15:03,990 --> 01:15:08,720
and involving array updates,
involving array accesses,

1425
01:15:08,720 --> 01:15:16,730
involving iteration over arrays
as long as you fix your path,

1426
01:15:16,730 --> 01:15:21,000
these formulas are
very easy to generate.

1427
01:15:21,000 --> 01:15:22,440
If you don't fix
your paths, if you

1428
01:15:22,440 --> 01:15:24,450
want to generate a
formula that corresponds

1429
01:15:24,450 --> 01:15:29,080
to going through all paths,
then it's also relatively easy.

1430
01:15:29,080 --> 01:15:32,310
The only thing is you
have to deal with loops

1431
01:15:32,310 --> 01:15:34,910
in more of a special way.

1432
01:15:34,910 --> 01:15:35,479
Yes?

1433
01:15:35,479 --> 01:15:36,395
AUDIENCE: [INAUDIBLE].

1434
01:15:43,340 --> 01:15:46,530
ARMANDO SOLAR-LEZAMA:
I don't know.

1435
01:15:46,530 --> 01:15:48,870
So dictionaries and
maps are actually

1436
01:15:48,870 --> 01:15:52,960
very easy to model using
uninterpreted functions.

1437
01:15:52,960 --> 01:15:55,190
And, in fact, the
theory of arrays

1438
01:15:55,190 --> 01:16:05,170
itself, it's just a special
case of uninterpreted functions.

1439
01:16:05,170 --> 01:16:09,630
So more complicated
things can be done

1440
01:16:09,630 --> 01:16:11,460
with uninterpreted functions.

1441
01:16:11,460 --> 01:16:16,820
In modern SMT solvers
there is native support

1442
01:16:16,820 --> 01:16:20,657
for reasoning about
sets and set operations,

1443
01:16:20,657 --> 01:16:22,740
which can be very, very
useful if you're reasoning

1444
01:16:22,740 --> 01:16:28,390
about a program that involves
lots of set computations,

1445
01:16:28,390 --> 01:16:30,410
for example.

1446
01:16:30,410 --> 01:16:33,750
When designing
one of these tools

1447
01:16:33,750 --> 01:16:36,320
the modeling step
is really important.

1448
01:16:36,320 --> 01:16:41,040
And it's not just how you model
complicated program features

1449
01:16:41,040 --> 01:16:43,320
down to your theories.

1450
01:16:43,320 --> 01:16:47,850
So, for example, things
like heaps down to arrays.

1451
01:16:47,850 --> 01:16:50,837
It's also the choice of what
theories and what solver you

1452
01:16:50,837 --> 01:16:51,630
use.

1453
01:16:51,630 --> 01:16:56,470
And there's a large number
of theories and solvers

1454
01:16:56,470 --> 01:17:02,260
with different trade offs
between how efficient they are

1455
01:17:02,260 --> 01:17:04,520
versus how expressive they are.

1456
01:17:04,520 --> 01:17:08,870
And, in general, most
of the production tools

1457
01:17:08,870 --> 01:17:13,370
stick to the theory
of bit-vectors

1458
01:17:13,370 --> 01:17:16,550
and they might use
the theory of arrays

1459
01:17:16,550 --> 01:17:21,820
to model the heap if
that is necessary.

1460
01:17:21,820 --> 01:17:24,220
Generally production
tools try to shy away

1461
01:17:24,220 --> 01:17:27,380
from some of the more
sophisticated theories,

1462
01:17:27,380 --> 01:17:31,560
like the theory of sets
just because by virtue

1463
01:17:31,560 --> 01:17:36,450
of being richer they also tend to
be less scalable in some cases,

1464
01:17:36,450 --> 01:17:39,620
unless you're dealing with a
program that really requires

1465
01:17:39,620 --> 01:17:44,920
exactly that kind of reasoning
in order to work with.

1466
01:17:44,920 --> 01:17:47,841
Are there other questions?

1467
01:17:47,841 --> 01:17:48,340
Yes?

1468
01:17:48,340 --> 01:17:50,834
AUDIENCE: [INAUDIBLE] research
in symbolic execution,

1469
01:17:50,834 --> 01:17:52,762
what are people
focusing on and where

1470
01:17:52,762 --> 01:17:54,208
is there room for improvement?

1471
01:17:54,208 --> 01:17:56,620
[INAUDIBLE] applications.

1472
01:17:56,620 --> 01:18:00,040
ARMANDO SOLAR-LEZAMA: So one
very active area of research

1473
01:18:00,040 --> 01:18:02,880
is around applications.

1474
01:18:02,880 --> 01:18:06,080
And looking at models
that will allow

1475
01:18:06,080 --> 01:18:09,400
you to discover new
classes of bugs.

1476
01:18:09,400 --> 01:18:15,200
So, for example, Nickolai,
and Frans, and Xi Wang and I

1477
01:18:15,200 --> 01:18:19,330
had a paper, what
was it, last year

1478
01:18:19,330 --> 01:18:23,810
when we were looking at using
symbolic execution to identify

1479
01:18:23,810 --> 01:18:28,770
code in your program that a
compiler might optimize away.

1480
01:18:28,770 --> 01:18:32,410
Security checks that might get
optimized away by a compiler.

1481
01:18:32,410 --> 01:18:38,510
So it's very different from the
question of will the program go

1482
01:18:38,510 --> 01:18:42,470
down this path or not, but
there is a modeling step

1483
01:18:42,470 --> 01:18:45,300
to go from this high
level conceptual question

1484
01:18:45,300 --> 01:18:47,750
of, is there
code in my program

1485
01:18:47,750 --> 01:18:54,780
that can be compiled away
to an algorithm based

1486
01:18:54,780 --> 01:18:56,673
on symbolic execution
that will rely

1487
01:18:56,673 --> 01:18:58,530
on the ability of
symbolic execution

1488
01:18:58,530 --> 01:19:01,290
to easily tell you whether
the program can go down

1489
01:19:01,290 --> 01:19:04,930
a particular path, or whether
a particular path is feasible.

1490
01:19:04,930 --> 01:19:08,380
So applications is a
big area, extending

1491
01:19:08,380 --> 01:19:12,080
to newer classes
of bugs, growing

1492
01:19:12,080 --> 01:19:15,500
to new and different
language features.

1493
01:19:15,500 --> 01:19:19,740
For example, one of the
things that is still

1494
01:19:19,740 --> 01:19:22,840
fairly hard to model
using symbolic execution

1495
01:19:22,840 --> 01:19:28,850
are very high level languages,
like JavaScript or Python where

1496
01:19:28,850 --> 01:19:31,750
you have a lot of very
dynamic language features,

1497
01:19:31,750 --> 01:19:37,910
but at the same time they
are-- if any technique can

1498
01:19:37,910 --> 01:19:40,370
work for the symbolic execution,
it's definitely very good.

1499
01:19:40,370 --> 01:19:44,640
And, in fact, we had some
work a couple of years

1500
01:19:44,640 --> 01:19:46,780
ago using symbolic
execution to reason

1501
01:19:46,780 --> 01:19:50,070
about errors in Python
programming assignments,

1502
01:19:50,070 --> 01:19:51,890
for example.

1503
01:19:51,890 --> 01:19:52,623
Yes?

1504
01:19:52,623 --> 01:19:54,102
AUDIENCE: So [INAUDIBLE].

1505
01:20:03,962 --> 01:20:04,948
How does [INAUDIBLE]?

1506
01:20:08,204 --> 01:20:09,370
ARMANDO SOLAR-LEZAMA: It is.

1507
01:20:09,370 --> 01:20:13,990
So in the case of symbolic
execution part of the problem

1508
01:20:13,990 --> 01:20:19,130
is that your symbolic state,
it's very hard to simply say,

1509
01:20:19,130 --> 01:20:21,340
OK, I executed this
instruction, and then

1510
01:20:21,340 --> 01:20:23,430
this instruction, and
then this instruction.

1511
01:20:23,430 --> 01:20:24,720
The sequence is not there.

1512
01:20:24,720 --> 01:20:28,180
There was some work a few
years ago looking, for example,

1513
01:20:28,180 --> 01:20:31,970
at very small pieces of
code, but very critical,

1514
01:20:31,970 --> 01:20:35,150
like a concurrent data
structure in an operating

1515
01:20:35,150 --> 01:20:37,240
system, or lock-free
data structure

1516
01:20:37,240 --> 01:20:43,190
and modeling the
interactions between threads

1517
01:20:43,190 --> 01:20:47,984
by essentially saying, every
time there is a variable that

1518
01:20:47,984 --> 01:20:49,900
could have been overwritten
by something else,

1519
01:20:49,900 --> 01:20:54,000
you replace that value with
just a fresh symbolic value that

1520
01:20:54,000 --> 01:20:55,946
says, I have no
idea what this is.

1521
01:20:55,946 --> 01:20:57,320
And you generate
constraints that

1522
01:20:57,320 --> 01:21:00,060
relate those symbolic
values to symbolic values

1523
01:21:00,060 --> 01:21:01,520
in other threads.

1524
01:21:01,520 --> 01:21:03,320
And this has been
used even to reason

1525
01:21:03,320 --> 01:21:08,840
about things like missing
memory fences, for example.

1526
01:21:08,840 --> 01:21:13,565
And so it is possible, but the
complexity grows quite a bit.

1527
01:21:13,565 --> 01:21:18,100
And it becomes the kind of thing
that you can no longer do

1528
01:21:18,100 --> 01:21:22,240
at the scale of Microsoft Word,
but you can do at the scale

1529
01:21:22,240 --> 01:21:26,087
of, say, a concurrent data
structure, for example.

1530
01:21:26,087 --> 01:21:28,670
There has been other work, though,
in the context of concurrency

1531
01:21:28,670 --> 01:21:31,200
looking at, for example,
can I use symbolic execution

1532
01:21:31,200 --> 01:21:34,830
to reconstruct
interleavings based

1533
01:21:34,830 --> 01:21:38,290
on knowledge of how the program
behaved as it was running,

1534
01:21:38,290 --> 01:21:40,810
for example.

1535
01:21:40,810 --> 01:21:46,020
And so this opens a
lot of possibilities,

1536
01:21:46,020 --> 01:21:49,220
having this capability to
ask very concrete questions

1537
01:21:49,220 --> 01:21:52,660
about can my program
run down this path.

1538
01:21:52,660 --> 01:21:54,440
Being able to have
symbolic values

1539
01:21:54,440 --> 01:21:57,600
and ask questions, what values
should these things have

1540
01:21:57,600 --> 01:22:00,200
in order for the program to
do something, or in order

1541
01:22:00,200 --> 01:22:03,215
for something to happen is a
very powerful capability

1542
01:22:03,215 --> 01:22:04,590
and there's a lot
of applications

1543
01:22:04,590 --> 01:22:10,660
that have been tried, but
this is a fairly new piece

1544
01:22:10,660 --> 01:22:13,280
of technology as
far as technology

1545
01:22:13,280 --> 01:22:15,203
for analyzing a program goes.