1 00:00:01,580 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,340 Commons license. 3 00:00:05,340 --> 00:00:07,550 Your support will help MIT OpenCourseWare 4 00:00:07,550 --> 00:00:11,640 continue to offer high quality educational resources for free. 5 00:0011,640 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,110 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,110 --> 00:00:19,340 at ocw.MIT.edu. 8 00:00:23,110 --> 00:00:25,150 PROFESSOR 1: So as he gives an introduction, 9 00:00:25,150 --> 00:00:27,670 myself along with Catherine, David, [INAUDIBLE], 10 00:00:27,670 --> 00:00:29,336 and Charlotte [? are going to present ?] 11 00:00:29,336 --> 00:00:31,735 to you guys Probabilistic and Infinite Horizon Planning. 12 00:00:31,735 --> 00:00:34,360 To give you a brief overview of what we're going to talk about, 13 00:00:34,360 --> 00:00:36,740 we're going to start with the quadrotor motivating example. 14 00:00:36,740 --> 00:00:39,005 We're going to move into planning with Markov decision 15 00:00:39,005 --> 00:00:42,070 processes, give you a little bit about value iteration 16 00:00:42,070 --> 00:00:44,240 before discussing heuristic-guided solvers. 17 00:00:44,240 --> 00:00:47,130 And we're going to go into the more stochastic case, partially 18 00:00:47,130 --> 00:00:48,715 observable Markov decision processes, 19 00:00:48,715 --> 00:00:52,890 and operating in belief space. 20 00:00:52,890 --> 00:00:56,560 So we very often now see quadrotor motion planning 21 00:00:56,560 --> 00:01:00,029 as a problem, given, for example, with the Amazon 22 00:01:00,029 --> 00:01:00,820 fulfillment center. 23 00:01:00,820 --> 00:01:03,085 We start with a goal configuration, start 24 00:01:03,085 --> 00:01:05,562 configuration, set of actions we can take, 25 00:01:05,562 --> 00:01:09,090 and some type of reward function or cost function.
26 00:01:09,090 --> 00:01:11,665 So for instance, if we have a quadrotor starting 27 00:01:11,665 --> 00:01:13,040 at the Amazon fulfillment center, 28 00:01:13,040 --> 00:01:14,849 and we want to get to 77 Mass Ave, 29 00:01:14,849 --> 00:01:17,390 and let's say we want to take the shortest path to get there, 30 00:01:17,390 --> 00:01:19,244 we would follow the red dashed line. 31 00:01:19,244 --> 00:01:21,900 But, as we can see, it comes very close to these obstacles. 32 00:01:21,900 --> 00:01:23,792 So we're looking at a very high risk 33 00:01:23,792 --> 00:01:25,000 of mission failure, crashes. 34 00:01:25,000 --> 00:01:27,010 If there's any uncertainty in its path, 35 00:01:27,010 --> 00:01:29,440 we're going to have a problem. 36 00:01:29,440 --> 00:01:32,535 So one of the ways that we can compensate for that is we 37 00:01:32,535 --> 00:01:34,305 can also plan a green path, which 38 00:01:34,305 --> 00:01:37,632 is adjusted to give us a little bit of space. 39 00:01:37,632 --> 00:01:42,078 Are there any more questions? 40 00:01:42,078 --> 00:01:45,300 So as you can see, the level of uncertainty 41 00:01:45,300 --> 00:01:47,720 allows us to determine how easy or difficult the problem's 42 00:01:47,720 --> 00:01:48,620 going to be to solve. 43 00:01:48,620 --> 00:01:50,940 On the easier side we have deterministic dynamics 44 00:01:50,940 --> 00:01:52,760 and deterministic sensors. 45 00:01:52,760 --> 00:01:54,180 In this case our actions are going 46 00:01:54,180 --> 00:01:56,420 to be executed as commanded, and our sensors 47 00:01:56,420 --> 00:01:58,170 are going to tell us exactly where we are. 48 00:01:58,170 --> 00:02:00,336 This will be something like dead reckoning validated 49 00:02:00,336 --> 00:02:02,970 through sensing, very redundant. 50 00:02:02,970 --> 00:02:05,350 If we move to a slightly more difficult case, 51 00:02:05,350 --> 00:02:06,940 we have deterministic dynamics.
52 00:02:06,940 --> 00:02:08,740 Our commands are still being executed, 53 00:02:08,740 --> 00:02:10,259 but maybe we have a noisy camera. 54 00:02:10,259 --> 00:02:12,050 So we have some uncertainty in the sensors. 55 00:02:12,050 --> 00:02:15,445 These are the cases where we would see dead reckoning, 56 00:02:15,445 --> 00:02:17,320 but we would compensate with Kalman filtering 57 00:02:17,320 --> 00:02:19,350 to get rid of that noise level. 58 00:02:19,350 --> 00:02:20,160 Now, down at the-- 59 00:02:20,160 --> 00:02:20,660 yes? 60 00:02:20,660 --> 00:02:22,952 AUDIENCE: Sorry, what is dead reckoning? 61 00:02:22,952 --> 00:02:24,550 PROFESSOR 1: It's essentially-- 62 00:02:24,550 --> 00:02:26,749 you are saying I want to execute this action, 63 00:02:26,749 --> 00:02:28,290 and it's going to execute it exactly. 64 00:02:28,290 --> 00:02:28,831 AUDIENCE: OK. 65 00:02:31,199 --> 00:02:32,740 PROFESSOR 1: Then towards the bottom, 66 00:02:32,740 --> 00:02:34,930 we have stochastic dynamics and deterministic sensors. 67 00:02:34,930 --> 00:02:36,846 So in this case maybe there's some uncertainty 68 00:02:36,846 --> 00:02:37,790 in our actions. 69 00:02:37,790 --> 00:02:39,750 But we can validate through sensing. 70 00:02:39,750 --> 00:02:42,340 This is where we're going to spend the next section 71 00:02:42,340 --> 00:02:45,070 on Markov decision processes. 72 00:02:45,070 --> 00:02:46,695 And then briefly, later in the lecture, 73 00:02:46,695 --> 00:02:49,069 we're going to talk about this most difficult case, which 74 00:02:49,069 --> 00:02:51,415 is the stochastic dynamics and stochastic sensors. 75 00:02:51,415 --> 00:02:55,189 Our execution, maybe, maybe not; our sensors, maybe a little 76 00:02:55,189 --> 00:02:55,730 bit of noise. 77 00:02:55,730 --> 00:02:58,080 And this is where we see partially observable 78 00:02:58,080 --> 00:03:01,175 Markov decision processes.
79 00:03:01,175 --> 00:03:03,300 PROFESSOR 2: Can we make just a brief clarification 80 00:03:03,300 --> 00:03:04,220 of the dead reckoning? 81 00:03:04,220 --> 00:03:05,760 So dead reckoning is where you estimate 82 00:03:05,760 --> 00:03:07,426 your position using a probabilistic model, 83 00:03:07,426 --> 00:03:09,290 but you don't use any observations. 84 00:03:09,290 --> 00:03:10,700 That's the thing. 85 00:03:13,684 --> 00:03:14,350 PROFESSOR 1: OK. 86 00:03:14,350 --> 00:03:16,475 So we talked a little bit about action uncertainty, 87 00:03:16,475 --> 00:03:17,850 where we're going to focus. 88 00:03:17,850 --> 00:03:19,870 And this is a case where, for instance, 89 00:03:19,870 --> 00:03:24,710 even if you tell a quadrotor to stay in its place 90 00:03:24,710 --> 00:03:26,320 and a gust of wind comes by, it's not 91 00:03:26,320 --> 00:03:27,610 going to stay in that same place, right? 92 00:03:27,610 --> 00:03:29,443 It's going to have a little bit of movement. 93 00:03:29,443 --> 00:03:31,697 And we can see, this causes a disparity, 94 00:03:31,697 --> 00:03:33,280 where you have a commanded trajectory 95 00:03:33,280 --> 00:03:37,220 and your actual trajectory, and then you need to associate the two. 96 00:03:37,220 --> 00:03:39,730 So in order to compensate for that, 97 00:03:39,730 --> 00:03:41,800 we want to model things with that uncertainty 98 00:03:41,800 --> 00:03:44,090 or else we have these riskier situations where 99 00:03:44,090 --> 00:03:45,480 we command to follow some line. 100 00:03:45,480 --> 00:03:46,480 We don't incorporate the uncertainty, 101 00:03:46,480 --> 00:03:47,680 and we see a crash. 102 00:03:51,159 --> 00:03:54,720 So this allows us to introduce Planning with Markov Decision 103 00:03:54,720 --> 00:03:55,680 Processes. 104 00:03:58,560 --> 00:04:01,980 So MDPs have a set of states-- 105 00:04:01,980 --> 00:04:04,370 some actions you can take-- a transition model.
106 00:04:04,370 --> 00:04:06,010 So essentially, what's the probability 107 00:04:06,010 --> 00:04:09,442 of reaching some state if you take an action? 108 00:04:09,442 --> 00:04:12,215 An immediate reward function and a discount factor. 109 00:04:12,215 --> 00:04:13,590 This discount factor is important 110 00:04:13,590 --> 00:04:15,007 because it allows us to prioritize 111 00:04:15,007 --> 00:04:16,590 gaining an immediate reward as opposed 112 00:04:16,590 --> 00:04:17,839 to an uncertain future reward. 113 00:04:17,839 --> 00:04:21,250 So the concept of, a bird in the hand is worth two in the bush. 114 00:04:21,250 --> 00:04:23,140 Now we want to find an optimum policy 115 00:04:23,140 --> 00:04:24,850 that will essentially map an action-- 116 00:04:24,850 --> 00:04:27,120 the best action-- to each state. 117 00:04:27,120 --> 00:04:28,750 And what we hope to get from this 118 00:04:28,750 --> 00:04:31,040 is a maximized expected lifetime reward. 119 00:04:31,040 --> 00:04:33,820 So we want to maximize the cumulative reward we get over 120 00:04:33,820 --> 00:04:37,054 time. 121 00:04:37,054 --> 00:04:38,580 So let's walk through an example. 122 00:04:38,580 --> 00:04:41,925 If we have a quadrotor with a perfect sensor, and let's 123 00:04:41,925 --> 00:04:44,240 put it in this environment [INAUDIBLE] 7x7 grid. 124 00:04:44,240 --> 00:04:48,553 Our set of states are obviously [INAUDIBLE] space in them. 125 00:04:48,553 --> 00:04:51,199 Can anybody tell me what some of the actions might be? 126 00:04:51,199 --> 00:04:55,944 AUDIENCE: [INAUDIBLE] 127 00:04:55,944 --> 00:04:57,110 PROFESSOR 1: (LAUGHING) Yep. 128 00:04:57,110 --> 00:04:58,150 Up, down, left, right. 129 00:04:58,150 --> 00:05:00,150 In this case, we call them North, South, East, West, 130 00:05:00,150 --> 00:05:00,660 or Null. 131 00:05:03,202 --> 00:05:04,910 The next thing we need for this example-- 132 00:05:04,910 --> 00:05:07,260 we arbitrarily gave ourselves a transition probability.
133 00:05:07,260 --> 00:05:09,800 So we said that you have a 50% chance of following 134 00:05:09,800 --> 00:05:12,185 your commanded action with a 25% chance of moving 135 00:05:12,185 --> 00:05:14,596 to the left or the right. 136 00:05:14,596 --> 00:05:16,675 Next we have the reward function. 137 00:05:16,675 --> 00:05:18,050 And again, we arbitrarily decided 138 00:05:18,050 --> 00:05:19,870 that we want it to be a reward. 139 00:05:19,870 --> 00:05:24,131 If you get to the state (6,5), the increment [INAUDIBLE] 140 00:05:24,131 --> 00:05:27,610 AUDIENCE: Alicia, you had [INAUDIBLE] left or the right, 141 00:05:27,610 --> 00:05:32,654 is that clockwise or counterclockwise? 142 00:05:32,654 --> 00:05:34,590 Let's say you had planned to go to the right, 143 00:05:34,590 --> 00:05:39,430 would that mean you have a 75% [INAUDIBLE] or 25% chance of 144 00:05:39,430 --> 00:05:39,930 [INAUDIBLE]? 145 00:05:39,930 --> 00:05:42,080 What if you just waited [INAUDIBLE]? 146 00:05:42,080 --> 00:05:44,080 PROFESSOR 1: We said clockwise, counterclockwise 147 00:05:44,080 --> 00:05:46,132 from the intended direction of action. 148 00:05:46,132 --> 00:05:46,873 AUDIENCE: Thanks. 149 00:05:46,873 --> 00:05:48,110 PROFESSOR 1: Uh-huh. 150 00:05:48,110 --> 00:05:52,170 And finally, we give ourselves a discount factor of 0.9. 151 00:05:52,170 --> 00:05:54,800 So let's assume for a second that we 152 00:05:54,800 --> 00:05:56,080 have our optimal policy. 153 00:05:56,080 --> 00:05:58,720 And let's say that our optimal policy says, from this state, 154 00:05:58,720 --> 00:06:00,100 we want to take the action North. 155 00:06:00,100 --> 00:06:00,520 Right? 156 00:06:00,520 --> 00:06:02,061 As we discussed, we have a 50% chance 157 00:06:02,061 --> 00:06:06,260 of going North and a 25% chance of going to the left or right. 158 00:06:06,260 --> 00:06:10,650 So after that time step, these are the possible states 159 00:06:10,650 --> 00:06:11,820 that we could end up in.
160 00:06:11,820 --> 00:06:12,930 Right? 161 00:06:12,930 --> 00:06:15,757 So now let's assume for a second that we 162 00:06:15,757 --> 00:06:17,080 can take our next action. 163 00:06:17,080 --> 00:06:19,150 And our next action says, go North. 164 00:06:19,150 --> 00:06:21,410 Again, we have the same probability distribution. 165 00:06:21,410 --> 00:06:23,220 And these are the states we could end up 166 00:06:23,220 --> 00:06:24,221 in after two time steps. 167 00:06:24,221 --> 00:06:26,428 We can see that this starts getting very complicated, 168 00:06:26,428 --> 00:06:27,060 right? 169 00:06:27,060 --> 00:06:30,340 And there are increasing amounts of uncertainty. 170 00:06:30,340 --> 00:06:33,300 So does anybody have any ideas on how we could 171 00:06:33,300 --> 00:06:35,192 collapse this distribution? 172 00:06:35,192 --> 00:06:37,150 Keeping in mind that our sensors, at this point, 173 00:06:37,150 --> 00:06:40,075 are deterministic. 174 00:06:40,075 --> 00:06:40,803 Yep. 175 00:06:40,803 --> 00:06:41,886 AUDIENCE: Fly to a corner? 176 00:06:41,886 --> 00:06:42,844 PROFESSOR 1: I'm sorry. 177 00:06:42,844 --> 00:06:44,010 AUDIENCE: Fly to a corner? 178 00:06:44,010 --> 00:06:45,416 PROFESSOR 1: We could do that. 179 00:06:45,416 --> 00:06:47,624 AUDIENCE: You just sense how far away the red box is. 180 00:06:47,624 --> 00:06:50,440 PROFESSOR 1: We could do that. 181 00:06:50,440 --> 00:06:51,985 AUDIENCE: Quick comment. 182 00:06:51,985 --> 00:06:53,982 So the blue states don't actually sum to one. 183 00:06:53,982 --> 00:06:54,940 PROFESSOR 1: I'm sorry. 184 00:06:54,940 --> 00:06:57,255 AUDIENCE: So the problem is they don't actually sum to one. 185 00:06:57,255 --> 00:06:57,880 AUDIENCE: Yeah. 186 00:06:57,880 --> 00:07:00,680 The issue is if you went to (1,3) 187 00:07:00,680 --> 00:07:02,958 and then you transitioned to the left, that's where 188 00:07:02,958 --> 00:07:05,974 the 0.0625 is coming from.
189 00:07:05,974 --> 00:07:08,390 AUDIENCE: [INAUDIBLE] then, shouldn't those numbers always 190 00:07:08,390 --> 00:07:09,100 sum to 1-- 191 00:07:09,100 --> 00:07:10,730 your probability distribution should always sum to 1? 192 00:07:10,730 --> 00:07:11,438 PROFESSOR 1: Yep. 193 00:07:11,438 --> 00:07:12,880 AUDIENCE: So it's just a point. 194 00:07:12,880 --> 00:07:14,609 It's just off the screen. 195 00:07:14,609 --> 00:07:16,900 PROFESSOR 1: We would add another section to the screen 196 00:07:16,900 --> 00:07:19,540 and just move the grid over. 197 00:07:19,540 --> 00:07:22,299 We just cut it off for graphics. 198 00:07:22,299 --> 00:07:22,840 AUDIENCE: OK. 199 00:07:25,130 --> 00:07:25,880 PROFESSOR 1: Yeah. 200 00:07:25,880 --> 00:07:29,740 So those are all great points. 201 00:07:29,740 --> 00:07:32,400 The easiest way to do it is just take an observation. 202 00:07:32,400 --> 00:07:34,689 So at this point we say, after our first time step, 203 00:07:34,689 --> 00:07:36,980 we weren't sure which of these three states we were in. 204 00:07:36,980 --> 00:07:38,438 So we took an observation and said, 205 00:07:38,438 --> 00:07:41,290 wait, we're actually here with complete certainty. 206 00:07:41,290 --> 00:07:43,350 So to make this a little bit clearer, 207 00:07:43,350 --> 00:07:44,515 we're going to look at it from a tree view. 208 00:07:44,515 --> 00:07:44,830 Right? 209 00:07:44,830 --> 00:07:45,910 We said that we started at a state. 210 00:07:45,910 --> 00:07:46,770 We took an action. 211 00:07:46,770 --> 00:07:49,895 And these are the possible states we could have ended up in. 212 00:07:49,895 --> 00:07:52,940 We're going to collapse this by taking this observation. 213 00:07:52,940 --> 00:07:54,605 And now we have complete certainty here. 214 00:07:54,605 --> 00:07:58,250 And we take our next action and see 215 00:07:58,250 --> 00:08:00,880 that we have moved out here.
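The predict-then-observe cycle in this exchange can be sketched in a few lines, and the two-step belief even reproduces the 0.0625 corner mass mentioned above. This is our own illustrative code, not anything from the lecture materials; the function names and the go-North slip model are assumptions, and we keep the off-screen states rather than cutting them off, so probabilities always sum to one.

```python
# Hypothetical sketch of belief propagation with a perfect sensor.
# Assumed action model: commanding North moves up with probability 0.5
# and slips one cell east or west with probability 0.25 each.

def transition_north(state):
    """One commanded-North step: {next_state: probability}."""
    x, y = state
    return {(x, y + 1): 0.5, (x + 1, y): 0.25, (x - 1, y): 0.25}

def predict(belief):
    """Propagate a belief {state: prob} one step through the dynamics."""
    new_belief = {}
    for state, p in belief.items():
        for nxt, p_t in transition_north(state).items():
            new_belief[nxt] = new_belief.get(nxt, 0.0) + p * p_t
    return new_belief

def observe(belief, observed_state):
    """A deterministic sensor collapses the whole belief to one state."""
    return {observed_state: 1.0}
```

Two predictions from a known start spread the belief over six states, including corners with mass 0.25 x 0.25 = 0.0625; a single observation then collapses the belief back to a point, which is why the distribution never grows without bound.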
216 00:08:00,880 --> 00:08:04,500 So this allows us to basically ignore the history of states. 217 00:08:04,500 --> 00:08:09,020 We have the same percentage probability at each time step. 218 00:08:09,020 --> 00:08:12,240 This will be really useful in completely collapsing 219 00:08:12,240 --> 00:08:16,290 the distribution every single time you take an observation. 220 00:08:16,290 --> 00:08:19,170 Anybody have any questions on that? 221 00:08:19,170 --> 00:08:21,090 OK. 222 00:08:21,090 --> 00:08:24,030 So let's go back now and figure out how we came up 223 00:08:24,030 --> 00:08:25,340 with this optimal policy. 224 00:08:25,340 --> 00:08:28,156 The way we do that is through dynamic programming. 225 00:08:28,156 --> 00:08:29,280 There are two different ways. 226 00:08:29,280 --> 00:08:30,825 You can do it either through value iteration or policy 227 00:08:30,825 --> 00:08:31,325 iteration. 228 00:08:31,325 --> 00:08:33,836 For this lecture, we're going to focus on value iteration. 229 00:08:37,780 --> 00:08:39,590 So let's take this same example we had. 230 00:08:39,590 --> 00:08:42,159 We still want to maximize the expected reward. 231 00:08:42,159 --> 00:08:44,250 And so to start, we're going to initialize 232 00:08:44,250 --> 00:08:46,326 the values of each state to 0. 233 00:08:49,284 --> 00:08:51,750 Let's, for example, start at t equals 0. 234 00:08:51,750 --> 00:08:53,265 We're going to focus on state (6,5). 235 00:08:53,265 --> 00:08:54,640 And we're going to say that we're 236 00:08:54,640 --> 00:08:59,002 going to take the Null action to start with. 237 00:08:59,002 --> 00:09:00,710 From there you can see, with the probability 238 00:09:00,710 --> 00:09:04,840 distribution for (6,5), we have a 50% chance of staying. 239 00:09:04,840 --> 00:09:07,829 We have a 25% chance of going to (5,5). 240 00:09:07,829 --> 00:09:12,230 And a 25% chance of going to (7,5).
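The slip model just described -- a 50% chance of the commanded motion and a 25% chance of slipping to either side of it -- can be sketched directly. This is our own illustrative code (the names are ours, not the lecture's); in particular, we keep off-grid successors as ordinary states, matching the "just move the grid over" remark from the earlier question.

```python
# Hypothetical sketch of the lecture's slip model.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0), "Null": (0, 0)}

# Slip directions: clockwise/counterclockwise of the commanded direction.
# For Null we use the east/west neighbors, matching the (5,5)/(7,5) example.
SLIPS = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S"),
         "Null": ("E", "W")}

def transition(state, action):
    """Return {next_state: probability}: 50% commanded move, 25% per slip.

    Off-grid successors are kept as ordinary states ("just move the grid
    over"), so the probabilities always sum to one.
    """
    x, y = state
    dist = {}
    for direction, p in [(action, 0.5), (SLIPS[action][0], 0.25),
                         (SLIPS[action][1], 0.25)]:
        dx, dy = MOVES[direction]
        nxt = (x + dx, y + dy)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist
```

For the Null action at (6,5) this reproduces the distribution just stated: 0.5 on (6,5), and 0.25 each on (5,5) and (7,5).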
241 00:09:12,230 --> 00:09:14,090 The next item of information is the values 242 00:09:14,090 --> 00:09:16,090 at each of those states that we could end up in. 243 00:09:16,090 --> 00:09:19,502 And currently, they're all set to 0. 244 00:09:19,502 --> 00:09:21,570 And finally, the most important part 245 00:09:21,570 --> 00:09:23,665 here will be with the reward [INAUDIBLE] 246 00:09:23,665 --> 00:09:26,100 (6,5), taking the Null action, regardless of what state 247 00:09:26,100 --> 00:09:28,771 you end up in, is going to be 1, as we defined in our initial set 248 00:09:28,771 --> 00:09:29,271 up. 249 00:09:31,774 --> 00:09:35,270 So let's see how we would calculate the next time step 250 00:09:35,270 --> 00:09:37,269 value of the state. 251 00:09:37,269 --> 00:09:39,060 You'd start by taking the probabilities, right? 252 00:09:42,026 --> 00:09:43,900 And from there, we would add the reward here. 253 00:09:43,900 --> 00:09:45,580 And because we said that the reward does 254 00:09:45,580 --> 00:09:47,746 not depend on the state we end up in, [INAUDIBLE] 255 00:09:47,746 --> 00:09:50,845 should be across all three probabilities. 256 00:09:50,845 --> 00:09:54,200 From there, we're going to add in the discounted value 257 00:09:54,200 --> 00:09:58,475 that we had from before. 258 00:09:58,475 --> 00:10:00,970 So a way to look at this in a more generalized form 259 00:10:00,970 --> 00:10:03,474 is to say that across all the states that you can end up in, 260 00:10:03,474 --> 00:10:05,640 you're going to look at the probability of ending up 261 00:10:05,640 --> 00:10:06,245 in that state. 262 00:10:06,245 --> 00:10:07,911 You're going to multiply that by the sum 263 00:10:07,911 --> 00:10:10,788 of the reward and the discounted lifetime value that it has. 264 00:10:14,600 --> 00:10:18,010 So we want to make sure that we're getting the best 265 00:10:18,010 --> 00:10:18,720 possible values.
266 00:10:18,720 --> 00:10:21,090 So we need to incorporate all the other actions that we 267 00:10:21,090 --> 00:10:24,064 can take from that state. 268 00:10:24,064 --> 00:10:25,480 So what's going to happen is we're 269 00:10:25,480 --> 00:10:26,660 going to take that general formula, 270 00:10:26,660 --> 00:10:29,076 we're going to repeat it over all of the possible actions. 271 00:10:29,076 --> 00:10:31,350 And then we're going to take the maximum of that. 272 00:10:31,350 --> 00:10:34,830 So, for this example, this state is very easy. 273 00:10:34,830 --> 00:10:37,490 All of the actions are the same for this case. 274 00:10:37,490 --> 00:10:39,465 So we go fast, and we say we get a value of 1. 275 00:10:39,465 --> 00:10:41,186 And we update it as shown in the graph. 276 00:10:45,170 --> 00:10:48,210 This gives us what's called the Value [INAUDIBLE] Backup-- 277 00:10:48,210 --> 00:10:49,560 or Update-- equation. 278 00:10:49,560 --> 00:10:50,610 This will be really important because it 279 00:10:50,610 --> 00:10:52,193 reaches across the entire state space 280 00:10:52,193 --> 00:10:54,073 and allows us to provide a history. 281 00:10:56,780 --> 00:10:58,170 So what this would end up looking 282 00:10:58,170 --> 00:11:00,003 like is we're going to iteratively calculate 283 00:11:00,003 --> 00:11:02,370 the values across the entire state space. So at t0, 284 00:11:02,370 --> 00:11:05,320 we determine that all the values are 0. 285 00:11:05,320 --> 00:11:09,630 At t1, (6,5) gets a value of 1, and at t2, we 286 00:11:09,630 --> 00:11:13,106 see that value start to propagate out. 287 00:11:13,106 --> 00:11:14,094 Make sense so far? 288 00:11:17,044 --> 00:11:19,460 So the way this works is you would repeat those iterations 289 00:11:19,460 --> 00:11:21,835 until your changes in value become what we would consider 290 00:11:21,835 --> 00:11:24,043 small enough, which would indicate your approximation 291 00:11:24,043 --> 00:11:25,620 is close enough to the real value.
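The backup just described -- probability times the sum of reward and discounted value, maximized over actions -- can be written out as a small value-iteration loop. The following is a sketch under our own assumptions (a 7x7 grid indexed 0 to 6, walls that clamp off-grid moves back onto the grid, a reward of 1 for the Null action at (6,5), and discount 0.9); it is illustrative code, not the lecture's actual implementation.

```python
# Hypothetical sketch of value iteration on the 7x7 grid example.
GAMMA = 0.9
STATES = [(x, y) for x in range(7) for y in range(7)]
ACTIONS = ["N", "S", "E", "W", "Null"]
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0), "Null": (0, 0)}
SLIPS = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S"),
         "Null": ("E", "W")}

def clamp(x, y):
    # Boundary handling is our assumption: off-grid moves stay on the edge.
    return (min(max(x, 0), 6), min(max(y, 0), 6))

def transition(s, a):
    """50% commanded move, 25% per perpendicular slip."""
    dist = {}
    for d, p in [(a, 0.5), (SLIPS[a][0], 0.25), (SLIPS[a][1], 0.25)]:
        dx, dy = MOVES[d]
        nxt = clamp(s[0] + dx, s[1] + dy)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

def reward(s, a):
    # Reward 1 for Null at (6,5), regardless of the state you end up in.
    return 1.0 if s == (6, 5) and a == "Null" else 0.0

def bellman_backup(V, s):
    # max over actions of: sum_s' T(s,a,s') * (R(s,a) + gamma * V(s'))
    return max(sum(p * (reward(s, a) + GAMMA * V[nxt])
                   for nxt, p in transition(s, a).items())
               for a in ACTIONS)

def value_iteration(eps=1e-6):
    """Sweep the whole state space until values change less than eps."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: bellman_backup(V, s) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new

def extract_policy(V):
    """Greedy policy: the argmax action inside the Bellman equation."""
    return {s: max(ACTIONS, key=lambda a: sum(
                p * (reward(s, a) + GAMMA * V[nxt])
                for nxt, p in transition(s, a).items()))
            for s in STATES}
```

The first sweep assigns value 1 to (6,5) and leaves everything else at 0, matching the t0/t1 grids described above, and later sweeps propagate that value outward.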
292 00:11:25,620 --> 00:11:27,989 From there, you would extract the optimal policy 293 00:11:27,989 --> 00:11:29,030 from the lifetime values. 294 00:11:29,030 --> 00:11:31,155 So you see the [INAUDIBLE] in the Bellman equation. 295 00:11:31,155 --> 00:11:33,130 And you're now just taking the action from it. 296 00:11:33,130 --> 00:11:36,140 And you would map those actions to your states. 297 00:11:36,140 --> 00:11:38,792 An example of what this might look like propagating out-- 298 00:11:38,792 --> 00:11:42,310 if you say blue was the reward and red is an obstacle, 299 00:11:42,310 --> 00:11:42,970 et cetera-- 300 00:11:42,970 --> 00:11:44,970 you can see, as that value propagates out, 301 00:11:44,970 --> 00:11:48,425 you start seeing your policy by the arrows. 302 00:11:48,425 --> 00:11:50,381 Anybody have any questions? 303 00:11:53,804 --> 00:11:56,370 So one last thing to mention about this, though, 304 00:11:56,370 --> 00:11:57,870 is the complexity for each iteration 305 00:11:57,870 --> 00:12:00,180 is dependent on the size of the state space squared 306 00:12:00,180 --> 00:12:02,112 and the number of actions you can take. 307 00:12:02,112 --> 00:12:04,470 So you can imagine, as your state space expands or you 308 00:12:04,470 --> 00:12:07,690 gather more actions, it gets very complex, 309 00:12:07,690 --> 00:12:09,810 in which case value iteration becomes 310 00:12:09,810 --> 00:12:12,480 very time-intensive and costly. 311 00:12:12,480 --> 00:12:16,897 So this allows us to move into the Heuristic-Guided solvers. 312 00:12:16,897 --> 00:12:17,730 AUDIENCE: Thank you. 313 00:12:17,730 --> 00:12:20,383 [INAUDIBLE] transition, one quick question. 314 00:12:20,383 --> 00:12:23,625 Can I just get a show of hands-- how many of you 315 00:12:23,625 --> 00:12:26,247 have learned value iteration before versus how many of you 316 00:12:26,247 --> 00:12:27,580 have seen it for the first time? 317 00:12:27,580 --> 00:12:30,770 So how many have seen it before?
318 00:12:30,770 --> 00:12:31,270 OK. 319 00:12:31,270 --> 00:12:34,546 And then, how many is this their first time? 320 00:12:34,546 --> 00:12:37,312 [INAUDIBLE] 321 00:12:37,312 --> 00:12:39,186 PROFESSOR 3: Any questions on value iteration 322 00:12:39,186 --> 00:12:40,120 before we jump in? 323 00:12:40,120 --> 00:12:42,642 It's going to be an essential part of how we're going 324 00:12:42,642 --> 00:12:43,912 to do [INAUDIBLE] solvers. 325 00:12:43,912 --> 00:12:45,816 Anyone who hasn't [INAUDIBLE]. 326 00:12:49,640 --> 00:12:52,380 So the most important thing we said about value iteration 327 00:12:52,380 --> 00:12:53,990 is that it's super slow. 328 00:12:53,990 --> 00:12:56,880 It's going to have to go over every possible state 329 00:12:56,880 --> 00:12:59,660 and every possible action that it can take. 330 00:12:59,660 --> 00:13:01,270 Our state space is multi-dimensional, 331 00:13:01,270 --> 00:13:03,130 and we can take a lot of different actions. 332 00:13:03,130 --> 00:13:08,150 That's going to be really costly and really hard to [INAUDIBLE]. 333 00:13:08,150 --> 00:13:11,820 So the approach we're probably going to want to take 334 00:13:11,820 --> 00:13:16,210 is using some sort of best-first search approach. 335 00:13:16,210 --> 00:13:18,600 Who can tell me some example algorithms 336 00:13:18,600 --> 00:13:22,504 that already do that? 337 00:13:22,504 --> 00:13:23,480 AUDIENCE: A star. 338 00:13:23,480 --> 00:13:24,460 PROFESSOR 3: A star. 339 00:13:24,460 --> 00:13:26,069 So that's very good. 340 00:13:26,069 --> 00:13:28,610 And that's exactly what we're going to base our stuff off of. 341 00:13:28,610 --> 00:13:33,105 A star is for deterministic graph search. 342 00:13:33,105 --> 00:13:35,510 If we have a graph, we can use a heuristic, 343 00:13:35,510 --> 00:13:38,120 and we walk down it and search the space 344 00:13:38,120 --> 00:13:41,520 that we're most interested in until [INAUDIBLE] variable.
345 00:13:41,520 --> 00:13:45,850 So we're going to introduce two new items, focusing on the last one. 346 00:13:45,850 --> 00:13:48,580 AO star is an algorithm we can use 347 00:13:48,580 --> 00:13:51,450 to search for graphs that have this "and" problem. 348 00:13:51,450 --> 00:13:57,520 AO stands for And Or graphs, as opposed to the simple graphs 349 00:13:57,520 --> 00:13:59,140 that we have [INAUDIBLE]. 350 00:13:59,140 --> 00:14:03,360 The And is a way to express probabilistic coupling 351 00:14:03,360 --> 00:14:04,532 between edges. 352 00:14:04,532 --> 00:14:06,240 So if we explore one thing, we might have 353 00:14:06,240 --> 00:14:07,364 to explore other branches. 354 00:14:07,364 --> 00:14:09,890 We'll discuss that a little bit more in the next slide. 355 00:14:09,890 --> 00:14:12,070 LAO is a bit more of a generalization. 356 00:14:12,070 --> 00:14:16,510 What it does is allow us to search loopy graphs 357 00:14:16,510 --> 00:14:18,560 and deal with the probabilistic coupling. 358 00:14:18,560 --> 00:14:23,090 And it allows us to search and find the best path 359 00:14:23,090 --> 00:14:25,930 with a heuristic-guided algorithm over an infinite time 360 00:14:25,930 --> 00:14:29,550 horizon, possibly revisiting states, using the tools 361 00:14:29,550 --> 00:14:30,352 we just had-- 362 00:14:30,352 --> 00:14:32,518 value iteration-- to understand what the next best 363 00:14:32,518 --> 00:14:34,840 thing to do is. 364 00:14:34,840 --> 00:14:38,620 So we'll talk about exactly how to get our MDPs that we just 365 00:14:38,620 --> 00:14:42,660 saw, these little arrow examples, to these And/Or graphs. 366 00:14:42,660 --> 00:14:43,990 We'll talk about two cases. 367 00:14:43,990 --> 00:14:47,590 One simple one where we could apply the AO star algorithm 368 00:14:47,590 --> 00:14:49,450 is where you have a quadrotor.
369 00:14:49,450 --> 00:14:51,890 And if you command an action North, 370 00:14:51,890 --> 00:14:53,890 you have a high probability of going North, 371 00:14:53,890 --> 00:14:56,390 but you might go East. 372 00:14:56,390 --> 00:14:57,250 And vice versa. 373 00:14:57,250 --> 00:14:59,470 If you command East, you might go North. 374 00:14:59,470 --> 00:15:04,066 This is expressed here with the growing tree. 375 00:15:04,066 --> 00:15:06,000 And these And edges that connect it. 376 00:15:06,000 --> 00:15:09,500 So despite commanding an action from (0,0), 377 00:15:09,500 --> 00:15:12,201 we might end up in (1,0) or (0,1). 378 00:15:12,201 --> 00:15:12,700 Right? 379 00:15:12,700 --> 00:15:14,700 As we propagate outward, you can see 380 00:15:14,700 --> 00:15:16,116 how we're never going to loop back 381 00:15:16,116 --> 00:15:17,324 to a state we've been to. 382 00:15:17,324 --> 00:15:20,228 We're going to be moving in the Northeast direction constantly. 383 00:15:20,228 --> 00:15:22,770 Our [INAUDIBLE] and our probability distribution 384 00:15:22,770 --> 00:15:25,186 across this tree expands. 385 00:15:25,186 --> 00:15:28,090 And that's the coupling of the edges. 386 00:15:28,090 --> 00:15:30,930 Because as we explore down the tree, 387 00:15:30,930 --> 00:15:32,982 we have to explore all the edges together. 388 00:15:32,982 --> 00:15:34,773 Anyone have any questions on just this kind 389 00:15:34,773 --> 00:15:37,050 of conversion formulation? 390 00:15:37,050 --> 00:15:38,550 AUDIENCE: I have a broader question. 391 00:15:38,550 --> 00:15:43,010 We mentioned that MDPs deal with finite states. 392 00:15:43,010 --> 00:15:46,070 Do we always just discretize a continuous world 393 00:15:46,070 --> 00:15:47,360 into a set of planning states? 394 00:15:47,360 --> 00:15:48,240 PROFESSOR 3: Yes. 395 00:15:48,240 --> 00:15:51,710 That's our prerequisite for searching over the state space.
396 00:15:51,710 --> 00:15:53,574 We can do that as finely as possible, 397 00:15:53,574 --> 00:15:58,240 but yes, discretization is there. 398 00:15:58,240 --> 00:16:00,340 So now let's look at this case. 399 00:16:00,340 --> 00:16:03,680 Instead of the Northeast case, let's say it's not deterministic, 400 00:16:03,680 --> 00:16:05,910 when we command an action North or South, 401 00:16:05,910 --> 00:16:07,680 that we go North or South. 402 00:16:07,680 --> 00:16:10,740 Here we see this loopy structure begin to emerge. 403 00:16:10,740 --> 00:16:12,360 We see that we might-- 404 00:16:12,360 --> 00:16:16,100 on our first action commanding North, we might go to plus 1. 405 00:16:16,100 --> 00:16:18,372 And then commanding North again, but we have our 406 00:16:18,372 --> 00:16:20,080 And edge and our probability distribution 407 00:16:20,080 --> 00:16:23,150 is [? back ?] to 0. 408 00:16:23,150 --> 00:16:24,775 What this creates is this loopy structure. 409 00:16:24,775 --> 00:16:26,983 And this is exactly what we're going to be exploring. 410 00:16:26,983 --> 00:16:28,480 This is a very real world scenario 411 00:16:28,480 --> 00:16:32,030 where it's very likely we might return to somewhere 412 00:16:32,030 --> 00:16:35,400 we just came from, just because of the uncertain dynamics 413 00:16:35,400 --> 00:16:36,400 we have. 414 00:16:36,400 --> 00:16:39,840 This is the type of problem [INAUDIBLE]. 415 00:16:39,840 --> 00:16:43,582 So we're going to use the LAO to start out with. 416 00:16:43,582 --> 00:16:45,540 We're going to talk about the three main things 417 00:16:45,540 --> 00:16:48,480 that [INAUDIBLE] has a heuristic-guided envelope. 418 00:16:48,480 --> 00:16:51,410 And what that means is that we have our large state space 419 00:16:51,410 --> 00:16:51,910 here. 420 00:16:51,910 --> 00:16:54,670 But we're only going to look at a portion of it. 421 00:16:54,670 --> 00:16:57,150 This greyed-out portion.
422 00:16:57,150 --> 00:16:59,360 We're only going to look at the portion that 423 00:16:59,360 --> 00:17:02,320 is interesting to us-- the portion that provides us 424 00:17:02,320 --> 00:17:04,819 with the biggest rewards-- the portion that's reachable 425 00:17:04,819 --> 00:17:07,089 if we follow an optimal policy. 426 00:17:07,089 --> 00:17:10,609 We figure this out with some admissible heuristic. 427 00:17:10,609 --> 00:17:13,819 We'll estimate our rewards just like A star. 428 00:17:13,819 --> 00:17:17,089 The idea here was that we'll keep the problem small 429 00:17:17,089 --> 00:17:21,339 so we don't have to run value iteration over a giant state 430 00:17:21,339 --> 00:17:22,329 space. 431 00:17:22,329 --> 00:17:26,050 What we'll do next is, as the state space expands, 432 00:17:26,050 --> 00:17:28,146 we get a bigger picture of the states 433 00:17:28,146 --> 00:17:29,884 that we're interested in. 434 00:17:29,884 --> 00:17:31,800 We're going to do [? an audio ?] [INAUDIBLE]. 435 00:17:31,800 --> 00:17:34,510 And then we're going to figure out what the best action is. 436 00:17:34,510 --> 00:17:37,980 And in the ideal case, the states 437 00:17:37,980 --> 00:17:40,149 that we would never reach using an optimal policy 438 00:17:40,149 --> 00:17:41,440 are never going to be explored. 439 00:17:41,440 --> 00:17:44,340 Because our policy is going to say, no, don't go over there. 440 00:17:44,340 --> 00:17:45,481 That's a dead end. 441 00:17:45,481 --> 00:17:47,920 Or that's getting you farther away from [INAUDIBLE]. 442 00:17:47,920 --> 00:17:50,170 So we're going to be searching in a very specific part 443 00:17:50,170 --> 00:17:53,390 of the state space that is useful to explore-- 444 00:17:53,390 --> 00:17:56,502 that will get us closer and give us a higher reward. 445 00:18:05,420 --> 00:18:08,670 What's important here, the L stands for Loops.
446 00:18:08,670 --> 00:18:10,770 It's an extension of the AO star algorithm, 447 00:18:10,770 --> 00:18:12,769 which is itself an extension of the A star 448 00:18:12,769 --> 00:18:13,270 algorithm. 449 00:18:13,270 --> 00:18:15,110 It can handle infinite horizon problems, 450 00:18:15,110 --> 00:18:17,975 where transitions can loop in different ways. 451 00:18:17,975 --> 00:18:21,940 And it really models the real world scenarios we're interested in. 452 00:18:21,940 --> 00:18:23,850 Any questions so far on the broad scope 453 00:18:23,850 --> 00:18:25,526 of what AO star is going to do? 454 00:18:25,526 --> 00:18:26,026 Yeah. 455 00:18:26,026 --> 00:18:29,634 AUDIENCE: Can you put the [INAUDIBLE] what you're doing 456 00:18:29,634 --> 00:18:30,300 over here, but-- 457 00:18:30,300 --> 00:18:31,050 PROFESSOR 3: Sure. 458 00:18:31,050 --> 00:18:32,705 The AO stands for And Or graphs. 459 00:18:32,705 --> 00:18:37,576 So those are graphs where we have edges that are coupled. 460 00:18:37,576 --> 00:18:39,200 If you take this transition, 461 00:18:39,200 --> 00:18:41,671 you might end up here or you might end up there. 462 00:18:41,671 --> 00:18:44,455 So that's the notion of this probabilistic 463 00:18:44,455 --> 00:18:53,400 coupling that we'll see in action as we [INAUDIBLE] 464 00:18:53,400 --> 00:18:57,320 we're going to do, we're going to input this MDP, or And Or 465 00:18:57,320 --> 00:19:01,000 graph, with transition probabilities, a reward 466 00:19:01,000 --> 00:19:02,275 function, and heuristics. 467 00:19:02,275 --> 00:19:04,000 These are all things that we defined 468 00:19:04,000 --> 00:19:06,330 prior to figuring out a plan. 469 00:19:06,330 --> 00:19:07,800 What we're going to come out with 470 00:19:07,800 --> 00:19:10,292 is an optimal policy for every reachable state. 471 00:19:10,292 --> 00:19:12,500 It's a little different than what value iteration gives, 472 00:19:12,500 --> 00:19:15,790 which is an optimal policy from every possible state.
473 00:19:15,790 --> 00:19:18,840 But we argue that that's all we need. 474 00:19:18,840 --> 00:19:22,060 If we know where we're starting and we follow 475 00:19:22,060 --> 00:19:23,560 our optimal policy, we're only going 476 00:19:23,560 --> 00:19:25,760 to explore a certain portion of the state space. 477 00:19:25,760 --> 00:19:28,120 And we're going to explore that together. 478 00:19:28,120 --> 00:19:30,198 [INAUDIBLE] plan a little bit more 479 00:19:30,198 --> 00:19:35,362 efficiently than iterating for a high [INAUDIBLE] heuristic. 480 00:19:35,362 --> 00:19:39,340 Any questions on that? 481 00:19:39,340 --> 00:19:41,445 So we'll define some terms that we're 482 00:19:41,445 --> 00:19:42,810 going to use throughout this. 483 00:19:42,810 --> 00:19:44,700 We've already talked about our state space. 484 00:19:44,700 --> 00:19:46,650 This is just a small portion of it 485 00:19:46,650 --> 00:19:48,483 that we're going to work with as we 486 00:19:48,483 --> 00:19:49,850 walk through this example. 487 00:19:49,850 --> 00:19:52,225 Next we're going to define something called our envelope. 488 00:19:52,225 --> 00:19:54,224 And that's the sub-portion of our states 489 00:19:54,224 --> 00:19:55,640 that we're going to be looking at. 490 00:19:55,640 --> 00:19:59,429 We're going to initialize that to just S0 [INAUDIBLE].. 491 00:19:59,429 --> 00:20:01,970 But as we progress through the algorithm, it's going to grow. 492 00:20:01,970 --> 00:20:04,770 It's going to grow only in the areas that we're interested in. 493 00:20:07,650 --> 00:20:10,430 A subset of this envelope is the terminal states. 494 00:20:10,430 --> 00:20:12,150 Now these aren't goal states, these 495 00:20:12,150 --> 00:20:14,760 are just [INAUDIBLE] of our space 496 00:20:14,760 --> 00:20:16,176 that we've added to our envelope. 497 00:20:16,176 --> 00:20:19,904 And this is what it would be in the expanded [INAUDIBLE]..
498 00:20:19,904 --> 00:20:22,329 And it's just the nodes that haven't been expanded yet. 499 00:20:22,329 --> 00:20:24,120 Here we haven't drawn everything, 500 00:20:24,120 --> 00:20:26,890 but you can imagine the state space goes out 501 00:20:26,890 --> 00:20:29,375 further because [INAUDIBLE]. 502 00:20:29,375 --> 00:20:31,180 You can imagine that this goes out further. 503 00:20:31,180 --> 00:20:33,070 And we're keeping track of the nodes 504 00:20:33,070 --> 00:20:37,380 that we haven't expanded yet. 505 00:20:37,380 --> 00:20:41,350 Or likewise, we've showed that we initialize [INAUDIBLE].. 506 00:20:44,290 --> 00:20:47,292 The other few things that we're going to define-- we've 507 00:20:47,292 --> 00:20:49,864 already defined the states that are in our envelope. 508 00:20:49,864 --> 00:20:51,640 That's the blue or the red. 509 00:20:51,640 --> 00:20:55,790 We're going to define a heuristic-based reward function, R 510 00:20:55,790 --> 00:21:00,890 E, and a transition probability matrix, or set of matrices, 511 00:21:00,890 --> 00:21:04,850 that we're going to run our optimal policy search on. 512 00:21:04,850 --> 00:21:07,910 What's important here is that our reward function 513 00:21:07,910 --> 00:21:09,670 and transition probabilities are slightly 514 00:21:09,670 --> 00:21:12,003 altered to account for the fact that we haven't explored 515 00:21:12,003 --> 00:21:13,680 the entire state space. 516 00:21:13,680 --> 00:21:17,490 We see here that if a node is in S T-- in other words, 517 00:21:17,490 --> 00:21:21,350 it's one of those terminal nodes-- 518 00:21:21,350 --> 00:21:23,650 we're going to say we can't transition out 519 00:21:23,650 --> 00:21:25,940 of it, because we don't know what's beyond it so far. 520 00:21:25,940 --> 00:21:28,675 And we're going to set its reward to be the heuristic.
521 00:21:28,675 --> 00:21:30,870 Whatever we think the reward is 522 00:21:30,870 --> 00:21:35,310 going to be when we begin to explore and go further. 523 00:21:35,310 --> 00:21:39,270 Like I said, we're just going to feed this into a policy search 524 00:21:39,270 --> 00:21:41,650 just like we discussed with value iteration 525 00:21:41,650 --> 00:21:43,080 on the sub problem. 526 00:21:43,080 --> 00:21:46,310 And we're going to search for an optimal policy [INAUDIBLE].. 527 00:21:46,310 --> 00:21:47,446 So far so good? 528 00:21:50,991 --> 00:21:52,074 These are the general steps. 529 00:21:52,074 --> 00:21:53,900 This is very text-y, but we're going to definitely walk 530 00:21:53,900 --> 00:21:54,983 through every single step. 531 00:21:54,983 --> 00:21:57,610 We're going to do two full iterations of the algorithm. 532 00:21:57,610 --> 00:22:00,482 So like I said, we're going to create RE and TE. 533 00:22:00,482 --> 00:22:02,990 That's our reward function and transition probabilities, using 534 00:22:02,990 --> 00:22:05,224 the definitions we showed. 535 00:22:05,224 --> 00:22:06,640 We're going to use value iteration 536 00:22:06,640 --> 00:22:08,630 to find the optimal policy of the sub space 537 00:22:08,630 --> 00:22:11,190 that we're interested in now. 538 00:22:11,190 --> 00:22:14,030 And then this is probably the most important step. 539 00:22:14,030 --> 00:22:18,020 Knowing this optimal policy, we take a look at what new states 540 00:22:18,020 --> 00:22:19,580 that aren't in our terminal states-- 541 00:22:19,580 --> 00:22:21,344 nodes that we haven't explored yet-- 542 00:22:21,344 --> 00:22:24,150 what new states we might visit now. 543 00:22:24,150 --> 00:22:27,526 So let's say we have our policy, and it says we'll go north. 544 00:22:27,526 --> 00:22:29,900 And we know that we haven't explored the north state yet.
545 00:22:29,900 --> 00:22:31,400 We know this is the state that we're 546 00:22:31,400 --> 00:22:34,287 going to reach following what we consider now to be 547 00:22:34,287 --> 00:22:37,041 already an optimal [INAUDIBLE]. 548 00:22:37,041 --> 00:22:39,040 So those are the states we're going to expand next. 549 00:22:39,040 --> 00:22:40,300 We're going to do some bookkeeping, 550 00:22:40,300 --> 00:22:42,280 adding them to the terminal states [INAUDIBLE] 551 00:22:42,280 --> 00:22:44,764 once we expand it, then adding them to our envelope. 552 00:22:44,764 --> 00:22:46,430 What's important here is that we're only 553 00:22:46,430 --> 00:22:49,570 going to add states that we haven't visited yet to our envelope. 554 00:22:49,570 --> 00:22:51,480 And this is basically the little hack 555 00:22:51,480 --> 00:22:54,750 that allows us to deal with loopy graphs. 556 00:22:54,750 --> 00:22:57,880 We're not going to continually explore nodes 557 00:22:57,880 --> 00:23:00,400 that we might reach a second time, probabilistically. 558 00:23:00,400 --> 00:23:02,353 We're going to let value iteration handle 559 00:23:02,353 --> 00:23:05,092 that [INAUDIBLE] 560 00:23:05,092 --> 00:23:11,068 AUDIENCE: [INAUDIBLE] states are expanded. 561 00:23:11,068 --> 00:23:13,060 I'm the one who got confused on that. 562 00:23:13,060 --> 00:23:15,052 Are you saying that you're going to just repeat 563 00:23:15,052 --> 00:23:18,538 until there aren't any more terminals to look at? 564 00:23:18,538 --> 00:23:22,191 And if that's the case, how is that possible if you have 565 00:23:22,191 --> 00:23:24,701 an infinite horizon [INAUDIBLE] 566 00:23:24,701 --> 00:23:25,450 PROFESSOR 3: Sure. 567 00:23:25,450 --> 00:23:28,520 So if you can imagine-- and we'll talk about termination 568 00:23:28,520 --> 00:23:29,602 at the end.
569 00:23:29,602 --> 00:23:33,540 But you can imagine that as we have these terminal states, 570 00:23:33,540 --> 00:23:35,860 but you have a policy that guides you 571 00:23:35,860 --> 00:23:38,358 to a part of the state space that we've already expanded. 572 00:23:38,358 --> 00:23:39,733 Imagine you've reached your goal. 573 00:23:39,733 --> 00:23:41,680 Your optimal policy is going to say, stay put. 574 00:23:41,680 --> 00:23:42,487 Right? 575 00:23:42,487 --> 00:23:44,320 And it's not going to say, move North again. 576 00:23:44,320 --> 00:23:46,094 AUDIENCE: It's just the goal [INAUDIBLE] 577 00:23:46,094 --> 00:23:47,760 PROFESSOR 3: Essentially, the goal state 578 00:23:47,760 --> 00:23:49,923 is definitely an example in the more extreme case 579 00:23:49,923 --> 00:23:51,964 that there's nothing else you can do that's going 580 00:23:51,964 --> 00:23:54,293 to get you closer to our goal. 581 00:23:54,293 --> 00:23:59,020 The idea is that your policy on your sub space 582 00:23:59,020 --> 00:24:00,520 never tells you to go to a terminal. 583 00:24:00,520 --> 00:24:04,096 Nobody can [INAUDIBLE] inherently worse than 584 00:24:04,096 --> 00:24:06,008 [INAUDIBLE] 585 00:24:06,008 --> 00:24:07,420 AUDIENCE: Planning optimal policy 586 00:24:07,420 --> 00:24:09,862 means running value iteration entirely. 587 00:24:09,862 --> 00:24:10,570 PROFESSOR 3: Yes. 588 00:24:10,570 --> 00:24:13,036 We were going to treat it essentially as a black box. 589 00:24:13,036 --> 00:24:15,014 But the trick here is that we're doing it 590 00:24:15,014 --> 00:24:17,360 on a smaller portion of space of a different world. 591 00:24:20,780 --> 00:24:21,280 All right. 592 00:24:21,280 --> 00:24:23,060 I'm going to put the steps up there. 593 00:24:23,060 --> 00:24:25,530 Hopefully you can see this. 594 00:24:25,530 --> 00:24:28,143 But for now, we'll just walk through 595 00:24:28,143 --> 00:24:30,380 from the beginning of what we're going to do. 
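[EDITOR'S NOTE: The construction the lecturers describe — value iteration run as a black box on a restricted envelope, where unexpanded terminal nodes absorb and pay out the heuristic — can be sketched as follows. This is an illustrative sketch, not the lecturers' code; all names are made up.]

```python
# Sketch of the "envelope MDP" fed to value iteration. Terminal (unexpanded)
# nodes become absorbing states whose reward is the admissible heuristic h,
# so the planner never has to look past them.

def make_envelope_mdp(envelope, terminal, T, R, h):
    """Return reward and transition functions restricted to the envelope.

    envelope: set of states added so far
    terminal: subset of envelope not yet expanded
    T: T[s][a] -> dict mapping successor state to probability (full MDP)
    R: R[s][a] -> instantaneous reward (full MDP)
    h: h[s] -> heuristic estimate of future reward from s
    """
    def R_E(s, a):
        # Unexpanded node: we don't know what lies beyond, so the
        # heuristic stands in for all future reward.
        return h[s] if s in terminal else R[s][a]

    def T_E(s, a):
        # No transitions out of terminal nodes: they absorb probability 1.
        return {s: 1.0} if s in terminal else T[s][a]

    return R_E, T_E
```

Value iteration can then be run unchanged on (R_E, T_E), since the envelope MDP is just a smaller, self-contained MDP.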
596 00:24:30,380 --> 00:24:34,890 So we said that our envelope and our terminal nodes 597 00:24:34,890 --> 00:24:37,310 are just S0 to start. 598 00:24:37,310 --> 00:24:40,230 So very simply, we use these definitions 599 00:24:40,230 --> 00:24:42,640 and say, OK, the transition probability 600 00:24:42,640 --> 00:24:45,531 from S0 to any node right now is 0. 601 00:24:45,531 --> 00:24:46,914 Because it's in that terminal set. 602 00:24:46,914 --> 00:24:50,670 [INAUDIBLE] That's just so that we don't transition out of it. 603 00:24:50,670 --> 00:24:54,240 We can develop the policy based on only this portion 604 00:24:54,240 --> 00:24:55,302 of the space [INAUDIBLE]. 605 00:24:55,302 --> 00:24:56,760 And our reward function, we're just 606 00:24:56,760 --> 00:24:59,040 going to set it to be the heuristic [INAUDIBLE].. 607 00:24:59,040 --> 00:25:01,040 Let's say that's 20. 608 00:25:01,040 --> 00:25:05,234 And we move on from there. 609 00:25:05,234 --> 00:25:07,992 [INAUDIBLE] fed to value iteration. 610 00:25:07,992 --> 00:25:09,450 And we'll run this value iteration. 611 00:25:09,450 --> 00:25:11,540 So this is a very basic case. 612 00:25:11,540 --> 00:25:14,035 We're just going to-- we're using this to kind of build up 613 00:25:14,035 --> 00:25:16,170 the machinery, understand what we're doing. 614 00:25:16,170 --> 00:25:19,700 It's a very basic case where the only node we have is S0. 615 00:25:19,700 --> 00:25:20,960 We can't transition out of it. 616 00:25:20,960 --> 00:25:23,142 All we have is some heuristic. 617 00:25:23,142 --> 00:25:24,960 So the only thing we can do is nothing. 618 00:25:24,960 --> 00:25:28,310 So the action we're going to take from S0 is nothing. 619 00:25:28,310 --> 00:25:30,780 Very simple case just to get us warmed up and understand 620 00:25:30,780 --> 00:25:33,540 what's happening.
621 00:25:33,540 --> 00:25:37,258 So using this policy, and knowing that S0 is in our terminal 622 00:25:37,258 --> 00:25:40,271 node set, 623 00:25:40,271 --> 00:25:41,770 we're going to take the intersection 624 00:25:41,770 --> 00:25:43,370 between this terminal node set 625 00:25:43,370 --> 00:25:46,450 and the nodes that we might reach following our policy. 626 00:25:46,450 --> 00:25:47,790 And we know that we're at S0. 627 00:25:47,790 --> 00:25:53,235 And we know that the action that our optimal policy says to take 628 00:25:53,235 --> 00:25:53,930 is nothing. 629 00:25:53,930 --> 00:25:55,280 So we know we're there. 630 00:25:55,280 --> 00:25:57,680 And we know it's in our terminal node set. 631 00:25:57,680 --> 00:26:01,655 So that's the only thing that we could reach so far 632 00:26:01,655 --> 00:26:03,080 using our optimal policy. 633 00:26:03,080 --> 00:26:04,840 So far so good? 634 00:26:04,840 --> 00:26:05,340 Clear? 635 00:26:09,340 --> 00:26:11,600 So that's exactly what we're going to expand. 636 00:26:11,600 --> 00:26:14,633 Just expand S0; A, B, and C get added 637 00:26:14,633 --> 00:26:16,339 to our terminal nodes. 638 00:26:16,339 --> 00:26:19,702 So that's up there, the symbols [INAUDIBLE] 639 00:26:19,702 --> 00:26:21,940 from our terminal nodes, added as children. 640 00:26:21,940 --> 00:26:24,430 Then we added the children to the end. 641 00:26:29,390 --> 00:26:29,890 Here we go. 642 00:26:29,890 --> 00:26:31,390 We're going to do a little bit more. 643 00:26:31,390 --> 00:26:33,370 We're going to do the same thing again. 644 00:26:33,370 --> 00:26:36,274 But now we obviously have more nodes and a bigger sub space 645 00:26:36,274 --> 00:26:38,854 to explore. 646 00:26:38,854 --> 00:26:43,200 So using these definitions, we see our reward function 647 00:26:43,200 --> 00:26:47,146 is a tuple of the node that we start from. 648 00:26:47,146 --> 00:26:48,670 And so this is S0.
649 00:26:48,670 --> 00:26:50,200 And one of these three actions. 650 00:26:50,200 --> 00:26:52,510 And we'll take A1, A2, and A3. 651 00:26:52,510 --> 00:26:56,350 Then we have our instantaneous rewards, 6, 4, and 8, 652 00:26:56,350 --> 00:27:00,650 as being our rewards from doing those actions. 653 00:27:00,650 --> 00:27:03,327 From A, B, and C, no matter what action you take, 654 00:27:03,327 --> 00:27:04,698 it's just the heuristic. 655 00:27:04,698 --> 00:27:07,900 That's part of it. 656 00:27:07,900 --> 00:27:10,430 And likewise, we're going to take a look at this transition 657 00:27:10,430 --> 00:27:11,430 probability [INAUDIBLE]. 658 00:27:11,430 --> 00:27:13,638 This is for [INAUDIBLE] transitioning from a state S0 659 00:27:13,638 --> 00:27:17,510 for saying that if we take actions A1, A2, and A3, 660 00:27:17,510 --> 00:27:21,742 what's the probability of ending up in nodes A, B, and C? 661 00:27:21,742 --> 00:27:23,632 If you look so far here, we're going 662 00:27:23,632 --> 00:27:25,090 to look at something deterministic. 663 00:27:25,090 --> 00:27:26,617 If we take an action, we'll end up 664 00:27:26,617 --> 00:27:28,075 where we say we're going to end up. 665 00:27:28,075 --> 00:27:31,060 And we'll see how this algorithm collapses down to A star 666 00:27:31,060 --> 00:27:35,084 if everything's deterministic. 667 00:27:35,084 --> 00:27:38,100 We're also obviously going to look at the probabilistic case 668 00:27:38,100 --> 00:27:40,700 where we say [INAUDIBLE] small probability that it 669 00:27:40,700 --> 00:27:42,007 might end up with [INAUDIBLE]. 670 00:27:42,007 --> 00:27:43,590 That's going to necessitate that we're 671 00:27:43,590 --> 00:27:46,730 going to have to look at and expand B together with A 672 00:27:46,730 --> 00:27:50,015 if we were to decide we want to try [INAUDIBLE].. 673 00:27:50,015 --> 00:27:53,076 And likewise, if we try to take action A3, 674 00:27:53,076 --> 00:27:54,700 we have to expand all three nodes.
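[EDITOR'S NOTE: The one-step lookahead from S0 in the deterministic case can be made concrete. The instantaneous rewards 6, 4, and 8 come from the example; the heuristic values at A, B, and C are not stated in the lecture, so the numbers below are made up, chosen so that A1 comes out best, matching the policy discussed.]

```python
# Illustrative greedy choice from S0: Q(S0, a) = R(S0, a) + gamma * h(succ(a)).
rewards = {"A1": 6, "A2": 4, "A3": 8}          # R(S0, a), from the example
successor = {"A1": "A", "A2": "B", "A3": "C"}  # deterministic transitions
h = {"A": 20, "B": 10, "C": 5}                 # ASSUMED heuristic values
gamma = 1.0                                    # no discounting, for simplicity

q = {a: rewards[a] + gamma * h[successor[a]] for a in rewards}
best_action = max(q, key=q.get)  # A1: 6 + 20 = 26 beats 4 + 10 and 8 + 5
```

Under these assumed heuristics, even though A3 has the largest instantaneous reward, A1 wins once the heuristic estimate of future reward is included.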
675 00:27:54,700 --> 00:27:57,410 So the tighter the probabilistic coupling, the more of the space 676 00:27:57,410 --> 00:27:58,701 we're going to have to explore. 677 00:28:01,440 --> 00:28:05,346 So just off this, assuming that you can take action A1, A2, 678 00:28:05,346 --> 00:28:09,990 or A3, can someone tell me what the policy is, given the rewards 679 00:28:09,990 --> 00:28:12,200 here? 680 00:28:12,200 --> 00:28:17,160 And we're interested in a policy from S0-- what we're going to do. 681 00:28:17,160 --> 00:28:19,500 Take a look at the rewards and judge 682 00:28:19,500 --> 00:28:22,738 what the best action to take is from a purely deterministic 683 00:28:22,738 --> 00:28:23,238 sense. 684 00:28:28,562 --> 00:28:32,918 AUDIENCE: [INAUDIBLE] do you add [INAUDIBLE] 685 00:28:32,918 --> 00:28:34,370 or do you just [INAUDIBLE]? 686 00:28:34,370 --> 00:28:37,030 You have the reward that you have gotten so far. 687 00:28:45,130 --> 00:28:47,580 PROFESSOR 3: Does that help you answer the question? 688 00:28:47,580 --> 00:28:48,371 AUDIENCE: Oh, yeah. 689 00:28:48,371 --> 00:28:50,059 [INAUDIBLE] student. 690 00:28:50,059 --> 00:28:51,600 PROFESSOR 3: So the policy preference 691 00:28:51,600 --> 00:28:55,206 says, from S0, take action A1. 692 00:28:55,206 --> 00:28:57,666 And that's going to stay there. 693 00:28:57,666 --> 00:29:02,010 [INAUDIBLE] from A to C, there's no action to take [INAUDIBLE].. 694 00:29:02,010 --> 00:29:04,800 What that means is that the nodes that we might reach 695 00:29:04,800 --> 00:29:08,630 that are in our terminal state set, using our policy, 696 00:29:08,630 --> 00:29:09,900 is node A. All right. 697 00:29:09,900 --> 00:29:11,430 So that's the node we're going to expand. 698 00:29:11,430 --> 00:29:13,800 And this is where you really see that all we've done 699 00:29:13,800 --> 00:29:16,620 is collapse down to A star. 700 00:29:16,620 --> 00:29:17,910 A star would say, OK.
701 00:29:17,910 --> 00:29:23,350 What's the best node using some heuristic. 702 00:29:23,350 --> 00:29:26,264 Action a1 one takes us to that best node, 703 00:29:26,264 --> 00:29:29,680 and we're going to expand just that. 704 00:29:29,680 --> 00:29:31,640 So when everything is deterministic, 705 00:29:31,640 --> 00:29:34,110 basically this algorithm collapses down to A star. 706 00:29:37,056 --> 00:29:38,824 Nothing super interesting. 707 00:29:38,824 --> 00:29:41,026 The interesting case does come up 708 00:29:41,026 --> 00:29:42,900 when we start doing more probablistic ones. 709 00:29:42,900 --> 00:29:44,420 That's where nodes are probabilistically 710 00:29:44,420 --> 00:29:45,878 helpful to the scanned graph sense. 711 00:29:48,550 --> 00:29:50,125 So now we have our policy. 712 00:29:50,125 --> 00:29:52,250 The policies are politely going to remain the same, 713 00:29:52,250 --> 00:29:58,250 because they have very little probability on the edge actions 714 00:29:58,250 --> 00:30:03,250 that we might accidentally hit up. 715 00:30:03,250 --> 00:30:06,250 What would I want to look at is what [INAUDIBLE] 716 00:30:06,250 --> 00:30:09,220 and how reachable following our optimal policy. 717 00:30:15,160 --> 00:30:18,806 So we talked about that if we were to actually read. 718 00:30:18,806 --> 00:30:22,012 We know some probability of ending up in [INAUDIBLE] nodes. 719 00:30:22,012 --> 00:30:26,280 So taking action A3 makes C, 2, and A all reachable. 720 00:30:26,280 --> 00:30:28,650 What's reachable in taking action A1? 721 00:30:36,111 --> 00:30:36,861 AUDIENCE: A and B. 722 00:30:36,861 --> 00:30:39,620 PROFESSOR 3: [INAUDIBLE] And that's 723 00:30:39,620 --> 00:30:42,700 where we get the notion on this probabilistic algorithm 724 00:30:42,700 --> 00:30:45,517 that we're going expand things together. 
725 00:30:45,517 --> 00:30:46,975 Explore the part of the state space 726 00:30:46,975 --> 00:30:50,040 that we reach both via our optimal policy 727 00:30:50,040 --> 00:30:54,400 and that we might accidentally end up in if we [INAUDIBLE] 728 00:30:54,400 --> 00:30:54,982 the policy. 729 00:30:54,982 --> 00:30:57,870 And this is what guarantees that we'll have an action 730 00:30:57,870 --> 00:31:00,815 to take from any state that we intended to go to 731 00:31:00,815 --> 00:31:03,440 or that we might end up [INAUDIBLE].. 732 00:31:03,440 --> 00:31:05,210 That's how our state space expands, our envelope 733 00:31:05,210 --> 00:31:09,030 expands to encompass only the reachable and interesting 734 00:31:09,030 --> 00:31:11,240 states that we want to look at. 735 00:31:11,240 --> 00:31:14,360 It's not as simple as just examining 736 00:31:14,360 --> 00:31:17,850 the heuristic, like with A star, 737 00:31:17,850 --> 00:31:19,850 because we have this probabilistic [INAUDIBLE].. 738 00:31:19,850 --> 00:31:24,740 You can see that if you have a tighter coupling, then 739 00:31:24,740 --> 00:31:28,970 you don't get to exploit this optimization as much. 740 00:31:28,970 --> 00:31:31,220 For example, with A3 we would have had to expand 741 00:31:31,220 --> 00:31:33,400 more, as opposed to A1. 742 00:31:33,400 --> 00:31:38,930 We can just stick to A and B and ignore expanding C for now. 743 00:31:38,930 --> 00:31:39,736 [INAUDIBLE] 744 00:31:42,670 --> 00:31:44,950 AUDIENCE: Our [INAUDIBLE] We know that when people 745 00:31:44,950 --> 00:31:48,370 do the statistic [INAUDIBLE]. 746 00:31:48,370 --> 00:31:51,340 So in this case, if you take action A because of A and B, 747 00:31:51,340 --> 00:32:01,880 [INAUDIBLE] 748 00:32:01,880 --> 00:32:03,540 PROFESSOR 3: So in the complete sense 749 00:32:03,540 --> 00:32:05,650 of running this algorithm, we shouldn't prune it. 750 00:32:05,650 --> 00:32:07,190 Because what if we do?
751 00:32:07,190 --> 00:32:10,030 If we have that 2% chance, [INAUDIBLE].. 752 00:32:10,030 --> 00:32:12,430 We need to have a policy for correcting. 753 00:32:12,430 --> 00:32:14,520 So we end up at B. You can imagine 754 00:32:14,520 --> 00:32:17,502 that these are going to try to push back to whatever path 755 00:32:17,502 --> 00:32:19,430 that A was looking for. 756 00:32:19,430 --> 00:32:21,720 I suppose if the probability is low enough, 757 00:32:21,720 --> 00:32:24,850 you can have some cutoff percentage where 758 00:32:24,850 --> 00:32:26,590 you've decoupled [INAUDIBLE]-- 759 00:32:26,590 --> 00:32:28,210 PROFESSOR 2: So just a quick point. 760 00:32:28,210 --> 00:32:30,979 So I think that's an excellent point and an excellent answer. 761 00:32:30,979 --> 00:32:32,520 We're going to talk about that next week-- 762 00:32:32,520 --> 00:32:34,620 exactly what's going to happen and whether you 763 00:32:34,620 --> 00:32:37,120 can prune lower probability [INAUDIBLE]. The paper 764 00:32:37,120 --> 00:32:39,642 right here-- we'll be lecturing on that. 765 00:32:39,642 --> 00:32:40,580 Good question. 766 00:32:40,580 --> 00:32:43,380 PROFESSOR 3: That also gets us into the sense 767 00:32:43,380 --> 00:32:46,150 that if every state in the world were probabilistically 768 00:32:46,150 --> 00:32:50,370 coupled-- let's say we had some transporter, to go with the Star 769 00:32:50,370 --> 00:32:52,495 Trek examples. 770 00:32:52,495 --> 00:32:56,610 If we had this transporter that non-deterministically put us 771 00:32:56,610 --> 00:32:58,425 in any state in the world, we have 772 00:32:58,425 --> 00:33:00,550 to explore the whole world, because we could end up 773 00:33:00,550 --> 00:33:01,410 [INAUDIBLE].
774 00:33:01,410 --> 00:33:04,410 So luckily that's not the case yet and we can take advantage 775 00:33:04,410 --> 00:33:07,950 of the fact that we ended up most likely where 776 00:33:07,950 --> 00:33:14,510 we commanded, with some probability of [INAUDIBLE] 777 00:33:14,510 --> 00:33:16,382 This is exactly what we just talked about. 778 00:33:16,382 --> 00:33:19,450 We coupled these nodes with this And edge. 779 00:33:19,450 --> 00:33:21,230 And we expand those, too. 780 00:33:21,230 --> 00:33:23,570 Is everyone understanding the whole intuition, 781 00:33:23,570 --> 00:33:26,550 and the logic for why both of them have the [INAUDIBLE]? 782 00:33:26,550 --> 00:33:29,116 Even if there's only a small probability that we end up 783 00:33:29,116 --> 00:33:30,112 [INAUDIBLE]? 784 00:33:35,092 --> 00:33:36,950 So this is what we're going to repeat 785 00:33:36,950 --> 00:33:38,990 until all the states are expanded. 786 00:33:38,990 --> 00:33:41,070 You can imagine that the next time we 787 00:33:41,070 --> 00:33:43,330 run our value iteration, we're now 788 00:33:43,330 --> 00:33:45,446 running it on all of these colored nodes-- 789 00:33:45,446 --> 00:33:47,644 both the blue and the red. 790 00:33:47,644 --> 00:33:49,060 We can imagine that next time, now 791 00:33:49,060 --> 00:33:51,018 that we have a little bit more information about what 792 00:33:51,018 --> 00:33:53,477 lies beyond A and B, that our policy might say, oh, 793 00:33:53,477 --> 00:33:54,060 you know what? 794 00:34:00,350 --> 00:34:00,350 Actually, from S0, action A3 was the best to take. 795 00:34:00,350 --> 00:34:03,378 What that does is say, OK, the reachable set 796 00:34:03,378 --> 00:34:06,230 is A, B, and C. And we expand those nodes.
797 00:34:06,230 --> 00:34:08,011 Well, we've already expanded A and B, 798 00:34:08,011 --> 00:34:09,969 so we move into the next part of the sub space, 799 00:34:09,969 --> 00:34:12,150 and as we gain more information, 800 00:34:12,150 --> 00:34:17,789 we run value iteration and we can expand accordingly. 801 00:34:17,789 --> 00:34:19,060 Steve asked a good question 802 00:34:19,060 --> 00:34:22,270 when we did our dry run, about whether there is a way 803 00:34:22,270 --> 00:34:24,059 to save on the computation you did 804 00:34:24,059 --> 00:34:27,360 prior to the value iteration and add these [INAUDIBLE] states. 805 00:34:27,360 --> 00:34:29,909 And we've looked at it a little bit. 806 00:34:29,909 --> 00:34:31,380 I've seen some stuff. 807 00:34:31,380 --> 00:34:36,609 But I haven't found a paper that specifically deals with it. 808 00:34:36,609 --> 00:34:38,730 You can imagine how you've already 809 00:34:38,730 --> 00:34:42,450 run value iteration on your previous iteration [INAUDIBLE] 810 00:34:42,450 --> 00:34:46,239 and you add the new terminal edges which you'd expand on. 811 00:34:46,239 --> 00:34:48,100 And you run them again until it stabilizes. 812 00:34:48,100 --> 00:34:49,808 And that way you've saved the computation 813 00:34:49,808 --> 00:34:54,446 of having to run value iteration multiple times [INAUDIBLE] 814 00:34:54,446 --> 00:34:56,941 state space. 815 00:34:56,941 --> 00:34:58,700 AUDIENCE: So I'm trying to think of how 816 00:34:58,700 --> 00:35:01,854 this is different from something like [INAUDIBLE] 817 00:35:01,854 --> 00:35:03,306 for [INAUDIBLE]. 818 00:35:07,662 --> 00:35:11,092 PROFESSOR 3: I think I don't know enough about that. 819 00:35:11,092 --> 00:35:14,770 But my basic understanding says that what's useful 820 00:35:14,770 --> 00:35:18,890 here is it's this explicit [INAUDIBLE].. 821 00:35:18,890 --> 00:35:20,714 I don't know how much [INAUDIBLE]..
822 00:35:23,581 --> 00:35:26,080 AUDIENCE: And also, as long as your heuristic is admissible, 823 00:35:26,080 --> 00:35:33,910 it's guaranteed [INAUDIBLE] not all [INAUDIBLE] algorithms. 824 00:35:33,910 --> 00:35:37,190 Just like A star is optimal, as long as you've got 825 00:35:37,190 --> 00:35:38,636 a consistent [INAUDIBLE]. 826 00:35:43,092 --> 00:35:44,050 PROFESSOR 3: All right. 827 00:35:44,050 --> 00:35:46,820 So that's definitely the idea here. 828 00:35:46,820 --> 00:35:48,680 We coupled these and only explored 829 00:35:48,680 --> 00:35:50,512 the portion of the state space 830 00:35:50,512 --> 00:35:55,384 that we'll reach following an optimal policy [INAUDIBLE].. 831 00:35:55,384 --> 00:35:57,750 So we'll quickly talk about termination. 832 00:35:57,750 --> 00:36:00,770 We've touched on most of this. 833 00:36:00,770 --> 00:36:03,560 So it's most likely, when there are 834 00:36:03,560 --> 00:36:06,340 no more states to expand, that we've reached our goal. 835 00:36:06,340 --> 00:36:09,090 It's when our policy that we run on our entire envelope 836 00:36:09,090 --> 00:36:12,110 from value iteration doesn't say that we should 837 00:36:12,110 --> 00:36:14,625 go to any more terminal [INAUDIBLE] 838 00:36:14,625 --> 00:36:17,484 that we haven't looked at yet, that we haven't seen yet. 839 00:36:17,484 --> 00:36:19,400 We've said that those are only the things that 840 00:36:19,400 --> 00:36:21,310 are reachable and needed. 841 00:36:21,310 --> 00:36:24,550 Both, reachable because we're following the optimal policy, 842 00:36:24,550 --> 00:36:27,950 and needed, only if we might accidentally 843 00:36:27,950 --> 00:36:30,030 end up there probabilistically. 844 00:36:30,030 --> 00:36:33,000 This gives us the rigorous sense that we get a policy 845 00:36:33,000 --> 00:36:37,390 on the entire state space that we can end up in following 846 00:36:37,390 --> 00:36:38,830 the optimal policy.
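[EDITOR'S NOTE: The overall loop and the termination condition just described can be sketched at a high level. This mirrors the steps on the slides, not any particular implementation; solve_mdp stands in for value iteration run as a black box on the envelope, reachable_terminals for the check of which unexpanded states the current policy can land us in, and all names are illustrative.]

```python
# High-level sketch of the LAO-star-style loop: solve the envelope MDP,
# find the unexpanded states reachable under the resulting policy, expand
# them, and repeat. Terminates when no terminal state is reachable under
# the policy (e.g. the policy keeps us inside the expanded region).

def lao_star(s0, expand, solve_mdp, reachable_terminals):
    envelope = {s0}
    terminal = {s0}
    while True:
        policy = solve_mdp(envelope, terminal)      # value iteration, black box
        frontier = reachable_terminals(policy, terminal)
        if not frontier:
            # No unexpanded state reachable under the policy: done.
            return policy
        for s in frontier:                          # expand, with bookkeeping
            terminal.discard(s)
            for child in expand(s):
                if child not in envelope:           # only add unseen states --
                    envelope.add(child)             # this handles loopy graphs
                    terminal.add(child)
```

In the toy usage below, the "policy" simply maps each state to its successor, which keeps the reachability check trivial; a real implementation would map states to actions and follow the transition model.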
847 00:36:41,710 --> 00:36:44,924 The third bullet here touches on if we 848 00:36:44,924 --> 00:36:47,340 don't expand the states that are probabilistically coupled 849 00:36:47,340 --> 00:36:49,620 and we do accidentally end up there, 850 00:36:49,620 --> 00:36:52,840 we risk getting lost and not having a policy. 851 00:36:52,840 --> 00:36:56,080 We can compute this all off line and have a plan before we even 852 00:36:56,080 --> 00:36:58,456 start moving, to know exactly where we 853 00:36:58,456 --> 00:37:06,690 want to go even if our dynamics aren't [INAUDIBLE].. 854 00:37:06,690 --> 00:37:10,100 We've come back to this. 855 00:37:10,100 --> 00:37:11,880 This was our motivating example. 856 00:37:11,880 --> 00:37:15,040 And so we show that these real platforms can 857 00:37:15,040 --> 00:37:16,696 be modeled stochastically and then 858 00:37:16,696 --> 00:37:18,360 we can pretty easily deal with that. 859 00:37:18,360 --> 00:37:21,660 Search our state space and deal with those probabilities 860 00:37:21,660 --> 00:37:24,646 and expand the nodes that we might end up in. 861 00:37:24,646 --> 00:37:26,092 Right? 862 00:37:26,092 --> 00:37:27,710 And the heuristic allows us to not 863 00:37:27,710 --> 00:37:29,995 have to explore these areas of state space. 864 00:37:29,995 --> 00:37:32,180 We never actually end up there. 865 00:37:32,180 --> 00:37:35,495 We'll always be commanding toward 77. 866 00:37:35,495 --> 00:37:37,370 We're never going to try to command backward. 867 00:37:37,370 --> 00:37:39,450 Sure, if there's a gust of wind and we have some probability 868 00:37:39,450 --> 00:37:39,950 there. 869 00:37:39,950 --> 00:37:42,350 But you can imagine that we're going to only explore 870 00:37:42,350 --> 00:37:43,896 a small portion of this. 871 00:37:43,896 --> 00:37:46,830 Because we'll always be trying to correct to get back 872 00:37:46,830 --> 00:37:48,590 to the top four blocks.
873 00:37:48,590 --> 00:37:51,638 And using our reward function, we 874 00:37:51,638 --> 00:37:54,515 get to determine if we want to fly a quick path 875 00:37:54,515 --> 00:37:56,280 or if we want to fly a safer path. 876 00:37:56,280 --> 00:37:59,256 For example, our time times our probability 877 00:37:59,256 --> 00:38:02,172 of [INAUDIBLE]-- we want to perhaps reduce that. 878 00:38:06,060 --> 00:38:08,004 All right. 879 00:38:08,004 --> 00:38:12,455 Are there any questions about planning with MDPs, 880 00:38:12,455 --> 00:38:14,570 anything like that? 881 00:38:14,570 --> 00:38:15,500 I love this stuff. 882 00:38:15,500 --> 00:38:19,148 So the more questions, the more I get to talk. 883 00:38:19,148 --> 00:38:20,597 Fine. 884 00:38:20,597 --> 00:38:23,910 What I'm going to be talking about for the rest 885 00:38:23,910 --> 00:38:28,080 of this lecture is extending beyond MDPs 886 00:38:28,080 --> 00:38:30,390 to a broader class of problems called 887 00:38:30,390 --> 00:38:35,010 POMDPs, Partially Observable Markov Decision Processes. 888 00:38:35,010 --> 00:38:36,419 I love this stuff. 889 00:38:36,419 --> 00:38:37,460 I think it's really cool. 890 00:38:37,460 --> 00:38:39,030 They're really fun problems. 891 00:38:39,030 --> 00:38:41,660 We're going to talk about why they're so much harder to plan 892 00:38:41,660 --> 00:38:44,380 with, to execute-- but why they're 893 00:38:44,380 --> 00:38:47,270 important to at least know about so that you can model 894 00:38:47,270 --> 00:38:48,740 real world problems with them. 895 00:38:48,740 --> 00:38:51,570 And then we're going to delve into a case study 896 00:38:51,570 --> 00:38:55,180 of a specific POMDP solver. 897 00:38:55,180 --> 00:38:58,630 We're not going to go into as much detail as we did for MDPs, 898 00:38:58,630 --> 00:39:00,630 but we're going to look at what powerful results 899 00:39:00,630 --> 00:39:03,058 we can get by planning with POMDPs. 
900 00:39:05,986 --> 00:39:08,450 PROFESSOR 4: So first, I want to place 901 00:39:08,450 --> 00:39:10,770 this in the context of the overall talk. 902 00:39:10,770 --> 00:39:11,270 Right? 903 00:39:11,270 --> 00:39:13,750 We have this spectrum of uncertainty. 904 00:39:13,750 --> 00:39:16,420 And coupled with uncertainty is the difficulty 905 00:39:16,420 --> 00:39:19,830 of planning, of solving, of executing a problem. 906 00:39:19,830 --> 00:39:22,170 And we've killed these first two cases. 907 00:39:22,170 --> 00:39:23,630 That was really easy. 908 00:39:23,630 --> 00:39:27,150 And then we just discussed the bottom [INAUDIBLE], MDPs. 909 00:39:27,150 --> 00:39:29,535 What I'm going to talk about is the case 910 00:39:29,535 --> 00:39:34,080 where both your dynamics and your sensors are stochastic. 911 00:39:34,080 --> 00:39:35,440 Why is that important? 912 00:39:35,440 --> 00:39:37,690 It's because when we first saw this 913 00:39:37,690 --> 00:39:39,890 slide-- our motivating example slide, 914 00:39:39,890 --> 00:39:41,990 we only saw the left hand side. 915 00:39:41,990 --> 00:39:43,630 We said, our actions are uncertain. 916 00:39:43,630 --> 00:39:46,700 But good news, we have a perfect sensor-- 917 00:39:46,700 --> 00:39:48,080 a perfect camera. 918 00:39:48,080 --> 00:39:49,410 But that's unrealistic. 919 00:39:49,410 --> 00:39:51,430 I think we have all, to some extent, 920 00:39:51,430 --> 00:39:56,280 experienced the fact that no sensor is totally perfect. 921 00:39:56,280 --> 00:39:59,090 Your camera might have fluctuating pixel values. 922 00:39:59,090 --> 00:40:03,530 Your laser range finder is never going to read out exactly 923 00:40:03,530 --> 00:40:05,381 the right number all the time. 924 00:40:05,381 --> 00:40:07,630 You can have a camera in different lighting conditions 925 00:40:07,630 --> 00:40:09,600 that will behave differently. 926 00:40:09,600 --> 00:40:11,960 You might not be able to observe your full state. 
927 00:40:11,960 --> 00:40:14,790 That's, in a way, an imperfect sensor, right? 928 00:40:14,790 --> 00:40:17,440 If I'm in this room, I have imperfect eyes. 929 00:40:17,440 --> 00:40:19,520 I can't map out all of MIT's campus 930 00:40:19,520 --> 00:40:21,473 because I'm blocked by walls. 931 00:40:21,473 --> 00:40:23,980 How can you deal with the fact that you can't see all 932 00:40:23,980 --> 00:40:26,220 your obstacles all the time? 933 00:40:26,220 --> 00:40:28,740 We've already talked about some cases-- 934 00:40:28,740 --> 00:40:30,910 that there are some algorithms that can help us 935 00:40:30,910 --> 00:40:32,750 with that, like D Star Lite. 936 00:40:32,750 --> 00:40:35,846 But can you reason about these things probabilistically? 937 00:40:35,846 --> 00:40:39,463 And then finally, you might be in a non-unique environment 938 00:40:39,463 --> 00:40:43,430 where you cannot resolve your state with certainty no matter 939 00:40:43,430 --> 00:40:45,170 how good your sensors are. 940 00:40:45,170 --> 00:40:47,700 Imagine you're in a building with two identical hallways. 941 00:40:47,700 --> 00:40:50,360 You're dropped off in one of them. 942 00:40:50,360 --> 00:40:53,250 How can you figure out where you are? 943 00:40:53,250 --> 00:40:55,916 You can't unless you start exploring. 944 00:40:55,916 --> 00:40:59,420 And so we've got to deal with this uncertainty, 945 00:40:59,420 --> 00:41:03,760 right? It's part of every single problem. 946 00:41:03,760 --> 00:41:05,880 When observational uncertainty is small, 947 00:41:05,880 --> 00:41:07,160 you can maybe ignore it. 948 00:41:07,160 --> 00:41:09,770 But it's there. 949 00:41:09,770 --> 00:41:12,730 And so we're going to formulate this as a POMDP, 950 00:41:12,730 --> 00:41:16,150 a Partially Observable Markov Decision Process. 951 00:41:16,150 --> 00:41:19,230 And this next slide is just like the MDP slide. 952 00:41:19,230 --> 00:41:21,000 Hairy, but important. 953 00:41:21,000 --> 00:41:22,380 Right? 
954 00:41:22,380 --> 00:41:25,640 We can formulate a POMDP, which has seven elements. 955 00:41:25,640 --> 00:41:27,270 An MDP has five. 956 00:41:27,270 --> 00:41:30,655 Most of those are carried over here. 957 00:41:30,655 --> 00:41:33,215 We've got our set of states where we can be. 958 00:41:33,215 --> 00:41:35,880 We've got a set of actions that we can do. 959 00:41:35,880 --> 00:41:38,210 We've got our transition model which says, 960 00:41:38,210 --> 00:41:40,010 given that I started in one state 961 00:41:40,010 --> 00:41:41,810 and then I took an action, what's 962 00:41:41,810 --> 00:41:44,025 the probability I end up somewhere else? 963 00:41:44,025 --> 00:41:46,160 And like David was talking about, 964 00:41:46,160 --> 00:41:49,220 hopefully that distribution is pretty local-- 965 00:41:49,220 --> 00:41:51,720 we're not teleporting all over the world. 966 00:41:51,720 --> 00:41:52,970 We've got our reward function. 967 00:41:52,970 --> 00:41:56,240 This is exactly the same as for MDPs. 968 00:41:56,240 --> 00:41:59,561 And we've got our discount factor down here. 969 00:41:59,561 --> 00:42:03,070 The key difference of a POMDP is these two elements. 970 00:42:03,070 --> 00:42:05,550 We've got a set of possible observations 971 00:42:05,550 --> 00:42:08,930 and a probabilistic model for the probability 972 00:42:08,930 --> 00:42:12,663 of making an observation given your state and the action you 973 00:42:12,663 --> 00:42:14,800 just took. 974 00:42:14,800 --> 00:42:18,460 Now this is important-- I think it matches up really 975 00:42:18,460 --> 00:42:20,250 well with real world sensors, having 976 00:42:20,250 --> 00:42:21,776 this probabilistic model. 
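[As a concrete anchor for the seven-element tuple just listed, here is one minimal way to write it down. This is a sketch of the definition from the lecture, not any particular library's API; the field names are my own.]

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

@dataclass
class POMDP:
    states: Sequence[Hashable]        # S: where we can be
    actions: Sequence[Hashable]       # A: what we can do
    transition: Callable              # T(s, a, s') -> probability of landing in s'
    reward: Callable                  # R(s, a) -> immediate reward
    gamma: float                      # discount factor
    observations: Sequence[Hashable]  # Omega: what the sensor can report
    obs_model: Callable               # O(o, s', a) -> probability of seeing o

# An MDP is the same object minus the last two fields.
```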
977 00:42:21,776 --> 00:42:24,335 If you have a laser range finder, for example, 978 00:42:24,335 --> 00:42:26,950 and you're standing one foot away from the wall-- 979 00:42:26,950 --> 00:42:29,090 now a perfect sensor would always say, 980 00:42:29,090 --> 00:42:30,720 you're one foot away from the wall. 981 00:42:30,720 --> 00:42:33,060 You're one foot away from the wall. 982 00:42:33,060 --> 00:42:34,550 Every single reading is constant. 983 00:42:34,550 --> 00:42:38,180 But realistically, there might be Gaussian noise, for example. 984 00:42:38,180 --> 00:42:39,768 Or in a more extreme case, it says, 985 00:42:39,768 --> 00:42:41,143 you're one foot away from the wall. 986 00:42:41,143 --> 00:42:41,809 You're two feet. 987 00:42:41,809 --> 00:42:42,795 You're right there. 988 00:42:42,795 --> 00:42:44,765 There's this distribution. 989 00:42:44,765 --> 00:42:48,890 And so you would ideally characterize this distribution. 990 00:42:48,890 --> 00:42:50,450 And you plug that into this model 991 00:42:50,450 --> 00:42:54,170 and that formulates your POMDP. 992 00:42:54,170 --> 00:42:57,800 This sounds really hairy, but if you work through just a sample 993 00:42:57,800 --> 00:43:01,910 iteration of living in a POMDP world, that's not too bad. 994 00:43:01,910 --> 00:43:04,863 You start at some state, S. You take an action, 995 00:43:04,863 --> 00:43:06,985 A. With some probability, you're going 996 00:43:06,985 --> 00:43:09,022 to end up in a bunch of different states based 997 00:43:09,022 --> 00:43:10,710 on your transition model. 998 00:43:10,710 --> 00:43:14,679 At that point, we can use the lessons we learned from MDP 999 00:43:14,679 --> 00:43:16,595 land where we said, when we make observations, 1000 00:43:16,595 --> 00:43:20,840 we reduce our uncertainty. We collapse into a single state. 1001 00:43:20,840 --> 00:43:23,120 So we say, let's make an observation. 
1002 00:43:23,120 --> 00:43:25,760 But this time, observations aren't guaranteed 1003 00:43:25,760 --> 00:43:27,317 to resolve all our uncertainty. 1004 00:43:27,317 --> 00:43:28,400 So we make an observation. 1005 00:43:28,400 --> 00:43:31,840 And that observation is probabilistic 1006 00:43:31,840 --> 00:43:35,550 based on our current state and the action we just took. 1007 00:43:35,550 --> 00:43:38,750 And again, obviously, it depends on your current state. 1008 00:43:38,750 --> 00:43:40,710 Because if you're one foot away from a wall, 1009 00:43:40,710 --> 00:43:43,610 hopefully you'll get a different characterization 1010 00:43:43,610 --> 00:43:46,790 of observations than if you're 20 feet away from the wall. 1011 00:43:46,790 --> 00:43:50,480 Otherwise, your sensor is totally useless. 1012 00:43:50,480 --> 00:43:52,777 Are there any questions about this formulation? 1013 00:43:52,777 --> 00:43:53,360 AUDIENCE: Yep. 1014 00:43:53,360 --> 00:43:54,270 Quick question. 1015 00:43:54,270 --> 00:43:56,976 So when we take an observation and then 1016 00:43:56,976 --> 00:43:58,940 try to infer which state we're in, 1017 00:43:58,940 --> 00:44:01,620 is it just a clustering problem? 1018 00:44:01,620 --> 00:44:06,140 For instance, the multi-cluster Gaussian [INAUDIBLE] models. 1019 00:44:06,140 --> 00:44:09,594 So class A, class B, class C, which are states, 1020 00:44:09,594 --> 00:44:11,010 then taking an observation there's 1021 00:44:11,010 --> 00:44:14,362 a high probability [INAUDIBLE]. This 1022 00:44:14,362 --> 00:44:16,870 is what we're trying to do for each observation over here. 1023 00:44:16,870 --> 00:44:21,300 So we're trying to find clustering [INAUDIBLE]. 1024 00:44:21,300 --> 00:44:23,857 PROFESSOR 4: So you could, I imagine, 1025 00:44:23,857 --> 00:44:25,940 implement an algorithm where, yeah, every time you 1026 00:44:25,940 --> 00:44:28,460 make an observation, you then try to say, all right. 
1027 00:44:28,460 --> 00:44:31,420 What's my most likely estimate, or maybe my [INAUDIBLE] 1028 00:44:31,420 --> 00:44:33,740 least cost estimate? 1029 00:44:33,740 --> 00:44:36,200 But inherent with that is the risk 1030 00:44:36,200 --> 00:44:38,720 that you're discarding a lot of information, right? 1031 00:44:38,720 --> 00:44:40,725 Because you're going to generate a probability 1032 00:44:40,725 --> 00:44:44,200 distribution over your state. 1033 00:44:44,200 --> 00:44:46,470 And so, yes, you can say, I'm going 1034 00:44:46,470 --> 00:44:48,584 to stick with the maximum likelihood estimate. 1035 00:44:48,584 --> 00:44:50,390 But if you can, you should probably 1036 00:44:50,390 --> 00:44:51,925 try to maintain that distribution 1037 00:44:51,925 --> 00:44:54,060 as long as possible. 1038 00:44:54,060 --> 00:44:54,560 OK. 1039 00:44:54,560 --> 00:44:57,570 And we'll see that this is really 1040 00:44:57,570 --> 00:44:59,180 computationally expensive unless you 1041 00:44:59,180 --> 00:45:00,946 start making some assumptions. 1042 00:45:00,946 --> 00:45:04,280 And in the case study we're going to look into, 1043 00:45:04,280 --> 00:45:06,040 that's exactly what [INAUDIBLE]. 1044 00:45:06,040 --> 00:45:08,740 But I've seen a lot in the literature 1045 00:45:08,740 --> 00:45:11,540 that as much as you can, you want 1046 00:45:11,540 --> 00:45:14,864 to maintain these distributions for improved accuracy. 1047 00:45:14,864 --> 00:45:16,748 Any other questions? 1048 00:45:20,030 --> 00:45:20,530 All right. 1049 00:45:20,530 --> 00:45:24,070 Well, we're going to compare now the execution of a POMDP 1050 00:45:24,070 --> 00:45:27,668 to the execution of an MDP. 1051 00:45:27,668 --> 00:45:31,480 We started out-- we're living in the same real world. 1052 00:45:31,480 --> 00:45:32,980 We've got our same transition model. 1053 00:45:32,980 --> 00:45:34,680 Everything is peachy. 1054 00:45:34,680 --> 00:45:37,090 We take action one, and we want to go North. 
1055 00:45:37,090 --> 00:45:40,100 We have this distribution that we generated over the states. 1056 00:45:40,100 --> 00:45:42,380 And at this point, what did we say? 1057 00:45:42,380 --> 00:45:43,770 We said, we hate the fact that we 1058 00:45:43,770 --> 00:45:45,020 have to deal with three cases. 1059 00:45:45,020 --> 00:45:46,580 Three is two too many. 1060 00:45:46,580 --> 00:45:50,520 So let's make an observation to collapse this distribution. 1061 00:45:50,520 --> 00:45:53,730 Now I've described a lot about noisy sensors, 1062 00:45:53,730 --> 00:45:55,990 right, where basically it's a true measurement 1063 00:45:55,990 --> 00:45:59,030 plus some noise, maybe a Gaussian distribution. 1064 00:45:59,030 --> 00:46:02,360 There's another partially observable sensor 1065 00:46:02,360 --> 00:46:05,390 you can have in the POMDP which really 1066 00:46:05,390 --> 00:46:07,850 feeds into the name, Partially Observable. 1067 00:46:07,850 --> 00:46:11,520 What if you can only observe part of your state? 1068 00:46:11,520 --> 00:46:14,960 For example, if you're living in an x-y grid, 1069 00:46:14,960 --> 00:46:18,520 maybe you can only observe your y dimension. 1070 00:46:18,520 --> 00:46:20,550 This matches up, in a real world example, 1071 00:46:20,550 --> 00:46:22,630 to a quadrotor flying down a hallway. 1072 00:46:22,630 --> 00:46:26,032 Catherine was working on a DARPA project 1073 00:46:26,032 --> 00:46:28,290 with quadrotors flying down hallways. 1074 00:46:28,290 --> 00:46:30,660 If the hallway is too long, your laser range finder 1075 00:46:30,660 --> 00:46:32,340 isn't going to be able to determine 1076 00:46:32,340 --> 00:46:34,040 where you are along one axis. 1077 00:46:34,040 --> 00:46:36,840 But it can tell where you are along another. 1078 00:46:36,840 --> 00:46:38,420 And so that's what I've said. 1079 00:46:38,420 --> 00:46:41,270 I've said, pretend this quadrotor has that sensor. 1080 00:46:41,270 --> 00:46:43,900 We can only observe its y component. 
1081 00:46:43,900 --> 00:46:46,480 And it says, my y component is 3. 1082 00:46:46,480 --> 00:46:49,170 I have no idea what my x component is. 1083 00:46:49,170 --> 00:46:50,200 Well, this sucks. 1084 00:46:50,200 --> 00:46:50,970 Right? 1085 00:46:50,970 --> 00:46:52,900 Because we got rid of this state, 1086 00:46:52,900 --> 00:46:58,536 but then we couldn't decide, are we at state (1,3) or (3,3)? 1087 00:46:58,536 --> 00:46:59,910 There's no way of resolving this. 1088 00:46:59,910 --> 00:47:00,870 So we can re-normalize. 1089 00:47:00,870 --> 00:47:05,260 We can add in the effect from the observation probabilities 1090 00:47:05,260 --> 00:47:07,830 saying, maybe, in fact, I'm far more 1091 00:47:07,830 --> 00:47:12,550 likely to observe a y component of 3 if I'm at (1,3). 1092 00:47:12,550 --> 00:47:14,790 But in this case, we say it's equally likely 1093 00:47:14,790 --> 00:47:18,284 to make that observation for those two states. 1094 00:47:18,284 --> 00:47:21,050 And so now we've got to deal with these two cases. 1095 00:47:21,050 --> 00:47:23,550 And so we can take our next action. 1096 00:47:23,550 --> 00:47:25,600 Instead of resetting to a single state, 1097 00:47:25,600 --> 00:47:28,751 we've got to keep growing this tree. 1098 00:47:28,751 --> 00:47:31,040 And what's the key difference between this 1099 00:47:31,040 --> 00:47:33,430 and when we were executing our MDP? 1100 00:47:33,430 --> 00:47:35,670 It's that we didn't manage to collapse back 1101 00:47:35,670 --> 00:47:36,580 to a single state. 1102 00:47:36,580 --> 00:47:39,861 We didn't manage to reset the problem. 1103 00:47:39,861 --> 00:47:42,630 And this is annoying. 1104 00:47:42,630 --> 00:47:45,390 Because you can't execute a policy and say, 1105 00:47:45,390 --> 00:47:47,340 I'm certain that this is my configuration. 1106 00:47:47,340 --> 00:47:50,385 So your policy can't map from exact states 1107 00:47:50,385 --> 00:47:55,416 to actions, because you never know your exact state. 
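[The renormalization step just walked through is a plain Bayes update on the belief. Here is a minimal sketch of it, using a grid example like the one on the slide-- the specific cells and probabilities below are my own reconstruction, not taken from the slide itself.]

```python
def belief_update(belief, obs, obs_prob):
    """Bayes rule over a discrete belief: b'(s) is proportional to P(obs | s) * b(s)."""
    unnormalized = {s: obs_prob(obs, s) * p for s, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under current belief")
    return {s: p / total for s, p in unnormalized.items()}

# Belief over (x, y) cells after the noisy move; the sensor reads only y.
belief = {(1, 3): 0.4, (3, 3): 0.4, (2, 4): 0.2}

def reads_y(obs, state):
    return 1.0 if state[1] == obs else 0.0   # perfect y sensor, blind in x

posterior = belief_update(belief, 3, reads_y)
# (2, 4) is ruled out; (1, 3) and (3, 3) stay equally likely at 0.5 each.
```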
1108 00:47:55,416 --> 00:47:56,290 Does this make sense? 1109 00:47:56,290 --> 00:47:59,700 Has everyone lost hope in planning? 1110 00:47:59,700 --> 00:48:01,610 Yeah. 1111 00:48:01,610 --> 00:48:04,840 AUDIENCE: So in here because-- 1112 00:48:04,840 --> 00:48:06,920 so from the left to the right, you're 1113 00:48:06,920 --> 00:48:09,530 basically mapping from one belief state to another belief 1114 00:48:09,530 --> 00:48:10,030 state. 1115 00:48:10,030 --> 00:48:11,620 So it's like a one arrow thing. 1116 00:48:11,620 --> 00:48:15,040 But then from that second layer, 1117 00:48:15,040 --> 00:48:16,750 you should have two arrows. 1118 00:48:16,750 --> 00:48:20,800 One with probability 0.5 that observes 3, 1119 00:48:20,800 --> 00:48:23,680 and it gives rise to that belief state you just showed. 1120 00:48:23,680 --> 00:48:27,734 And one with probability 0.5 where the sensor reads 4. 1121 00:48:27,734 --> 00:48:29,192 Because if you have a nonzero probability 1122 00:48:29,192 --> 00:48:31,140 of being at (2,4), you might well just 1123 00:48:31,140 --> 00:48:33,160 observe 4 as your y. 1124 00:48:33,160 --> 00:48:36,670 So in this case, I'm not sure if you were just 1125 00:48:36,670 --> 00:48:41,205 trying to show one branch of your POMDP planning, 1126 00:48:41,205 --> 00:48:42,580 but basically what you have to do 1127 00:48:42,580 --> 00:48:44,790 is you would go like this to the second layer. 1128 00:48:44,790 --> 00:48:48,990 You would have a branch with a 0.5 probability on either side. 1129 00:48:48,990 --> 00:48:52,205 One giving you 3, one giving you 4. 1130 00:48:52,205 --> 00:48:54,580 For the one that is shown, you've got that belief state. 1131 00:48:54,580 --> 00:48:58,810 For the other one, you get the state (2,4) with probability 1. 1132 00:48:58,810 --> 00:49:00,220 PROFESSOR 4: Yeah. 1133 00:49:00,220 --> 00:49:04,667 So you were perfectly describing planning, right? 
1134 00:49:04,667 --> 00:49:06,125 I should have made this more clear. 1135 00:49:06,125 --> 00:49:06,970 This isn't planning. 1136 00:49:06,970 --> 00:49:08,160 This is executing. 1137 00:49:08,160 --> 00:49:10,720 We have this policy that we're going to execute. 1138 00:49:10,720 --> 00:49:12,220 And so if we were planning, we would 1139 00:49:12,220 --> 00:49:17,320 have to consider all these branches and say, well, yeah. 1140 00:49:17,320 --> 00:49:19,830 There's a 50% chance I'll end up here, in which case 1141 00:49:19,830 --> 00:49:21,164 my y value is going to read 4, and you 1142 00:49:21,164 --> 00:49:22,413 have to grow this whole thing. 1143 00:49:22,413 --> 00:49:23,335 I'm saying, no. 1144 00:49:23,335 --> 00:49:26,210 This is real time execution. 1145 00:49:26,210 --> 00:49:27,940 Yeah. 1146 00:49:27,940 --> 00:49:28,885 Great question. 1147 00:49:28,885 --> 00:49:32,210 Any others? 1148 00:49:32,210 --> 00:49:35,290 Well, this is a great time to transition to, well, 1149 00:49:35,290 --> 00:49:38,760 we can't just magically be handed these policies. 1150 00:49:38,760 --> 00:49:40,436 How do we actually generate them? 1151 00:49:40,436 --> 00:49:42,310 How do we start planning in the belief space? 1152 00:49:42,310 --> 00:49:46,290 The belief space is the space of distributions 1153 00:49:46,290 --> 00:49:49,132 over possible configurations. 1154 00:49:49,132 --> 00:49:52,870 So I'm going to talk about a general class of algorithms. 1155 00:49:52,870 --> 00:49:56,450 A lot of planners in POMDP land and the belief space 1156 00:49:56,450 --> 00:50:01,025 plan with probabilistic roadmaps-- PRMs. 1157 00:50:01,025 --> 00:50:04,670 The goal is to generate a policy that maps from a belief 1158 00:50:04,670 --> 00:50:07,180 state to an action. 1159 00:50:07,180 --> 00:50:09,030 And I'm going to go into a little more 1160 00:50:09,030 --> 00:50:12,160 detail about what a belief state is. 
1161 00:50:12,160 --> 00:50:14,800 But the general algorithm in this graphic 1162 00:50:14,800 --> 00:50:16,370 illustrates these four steps. 1163 00:50:16,370 --> 00:50:20,570 We're going to sample points from the configuration space, 1164 00:50:20,570 --> 00:50:22,362 as if everything was deterministic. 1165 00:50:22,362 --> 00:50:25,110 We're going to connect those points to nearby points. 1166 00:50:25,110 --> 00:50:27,260 And define nearby however you want. 1167 00:50:27,260 --> 00:50:29,660 It could be your closest neighbors, all neighbors 1168 00:50:29,660 --> 00:50:31,460 within a radius, whatever. 1169 00:50:31,460 --> 00:50:35,390 As long as those edges don't collide with obstacles. 1170 00:50:35,390 --> 00:50:37,615 Once you've done that, somehow-- 1171 00:50:37,615 --> 00:50:39,565 and there's some magic in this-- 1172 00:50:39,565 --> 00:50:43,190 you're going to transform your configuration space 1173 00:50:43,190 --> 00:50:45,860 probabilistic roadmap to a probabilistic road 1174 00:50:45,860 --> 00:50:47,350 map in the belief space. 1175 00:50:47,350 --> 00:50:48,210 Great. 1176 00:50:48,210 --> 00:50:49,668 Once you've done that, you can just 1177 00:50:49,668 --> 00:50:52,990 do shortest path depending on whatever cost function you use. 1178 00:50:52,990 --> 00:50:57,560 And what's really cool is you get different paths for when 1179 00:50:57,560 --> 00:50:59,420 you stay in the configuration space 1180 00:50:59,420 --> 00:51:01,100 and when you go to the belief space. 1181 00:51:01,100 --> 00:51:05,900 So the green path, in that bottom right figure, 1182 00:51:05,900 --> 00:51:09,050 seems a lot longer than the red path. 1183 00:51:09,050 --> 00:51:12,120 And the reason is the green path was planned in the belief space 1184 00:51:12,120 --> 00:51:16,250 and followed a lot of landmarks that the quadrotor could 1185 00:51:16,250 --> 00:51:17,330 take measurements off of. 
1186 00:51:17,330 --> 00:51:19,780 So it was really confident about its position 1187 00:51:19,780 --> 00:51:22,590 whereas the red path is the shorter path that had a higher 1188 00:51:22,590 --> 00:51:24,330 likelihood of a collision. 1189 00:51:24,330 --> 00:51:29,620 And we're going to look into that figure more later. 1190 00:51:29,620 --> 00:51:32,500 I'm going to segment this algorithm into two parts. 1191 00:51:32,500 --> 00:51:34,780 The first part-- these first two steps-- 1192 00:51:34,780 --> 00:51:37,080 it's just probabilistic road maps. 1193 00:51:37,080 --> 00:51:40,360 Who here has heard about probabilistic road maps? 1194 00:51:40,360 --> 00:51:42,720 Raise your hand. 1195 00:51:42,720 --> 00:51:43,220 OK. 1196 00:51:43,220 --> 00:51:44,070 50/50. 1197 00:51:44,070 --> 00:51:46,750 I'm so excited for the 50% who haven't heard. 1198 00:51:46,750 --> 00:51:49,306 One of the top algorithms, in my opinion. 1199 00:51:49,306 --> 00:51:51,180 It's really simple, and it's really powerful. 1200 00:51:53,980 --> 00:51:57,810 Here's basically almost a complete implementation 1201 00:51:57,810 --> 00:51:59,180 of probabilistic road maps. 1202 00:51:59,180 --> 00:52:03,030 It's pseudocode so don't copy paste, but it's almost there. 1203 00:52:03,030 --> 00:52:05,670 You're going to construct a graph. 1204 00:52:05,670 --> 00:52:07,382 You're going to add your start and goal. 1205 00:52:07,382 --> 00:52:09,590 Your start configuration being the green dot and the 1206 00:52:09,590 --> 00:52:11,519 goal configuration, the red dot. 1207 00:52:11,519 --> 00:52:13,560 And then you're just going to keep sampling nodes 1208 00:52:13,560 --> 00:52:14,700 from the free space. 1209 00:52:14,700 --> 00:52:17,970 You say, how about (2,3)? 1210 00:52:17,970 --> 00:52:19,770 You're going to add that to your graph. 1211 00:52:19,770 --> 00:52:22,700 You're going to connect that node to a bunch of other nodes 1212 00:52:22,700 --> 00:52:23,214 nearby. 
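[The construction being walked through-- seed the graph with start and goal, sample free configurations, wire each sample to its nearest neighbors with collision-free edges-- can be sketched like this. It is a toy version, not the slide's pseudocode; `sample_free` and `collision_free` are assumed placeholders for whatever the environment provides.]

```python
import math

def build_prm(start, goal, sample_free, collision_free, n_samples=200, k=7):
    """Probabilistic roadmap: adjacency lists over sampled free configurations."""
    nodes = [start, goal] + [sample_free() for _ in range(n_samples)]
    edges = {p: [] for p in nodes}
    for p in nodes:
        neighbors = sorted((q for q in nodes if q != p),
                           key=lambda q: math.dist(p, q))[:k]
        for q in neighbors:
            if collision_free(p, q):       # drop edges that cross obstacles
                edges[p].append(q)
    return edges
```

[Running any shortest-path search over `edges` then gives the plan; a fuller implementation would also make the edges symmetric and keep sampling until start and goal end up connected.]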
1213 00:52:23,214 --> 00:52:24,630 And then you're just going to keep 1214 00:52:24,630 --> 00:52:28,069 sampling until maybe you have enough nodes 1215 00:52:28,069 --> 00:52:29,860 or maybe until you have your complete path. 1216 00:52:29,860 --> 00:52:34,080 That should be happening there. 1217 00:52:34,080 --> 00:52:36,262 And then once you've got this whole graph, 1218 00:52:36,262 --> 00:52:38,220 you can just find the shortest path along that. 1219 00:52:38,220 --> 00:52:39,803 And there are some really cool results 1220 00:52:39,803 --> 00:52:43,960 that if you sample in a good way, then, asymptotically-- 1221 00:52:43,960 --> 00:52:45,990 as you start sampling more, you're 1222 00:52:45,990 --> 00:52:51,560 going to approach the best path in a completely 1223 00:52:51,560 --> 00:52:53,065 continuous space. 1224 00:52:53,065 --> 00:52:55,090 The power of probabilistic road maps 1225 00:52:55,090 --> 00:52:57,800 and a bunch of randomized algorithms 1226 00:52:57,800 --> 00:53:00,500 though is that they scale pretty well to high dimensions. 1227 00:53:00,500 --> 00:53:03,890 So you don't need to actually consider the continuous space. 1228 00:53:03,890 --> 00:53:06,526 You can just sample [INAUDIBLE]. 1229 00:53:06,526 --> 00:53:09,802 Are there any questions about probabilistic road maps? 1230 00:53:12,570 --> 00:53:13,150 Really cool. 1231 00:53:13,150 --> 00:53:15,670 If you're interested and you just heard about PRMs, 1232 00:53:15,670 --> 00:53:18,580 you probably haven't heard about RRTs. 1233 00:53:18,580 --> 00:53:20,957 Those are also really cool. 1234 00:53:20,957 --> 00:53:22,833 AUDIENCE: Just a quick [INAUDIBLE] question. 1235 00:53:22,833 --> 00:53:26,050 So for any node, [INAUDIBLE] at this uniform example 1236 00:53:26,050 --> 00:53:26,960 [INAUDIBLE]. 1237 00:53:30,429 --> 00:53:31,220 PROFESSOR 4: Sorry. 1238 00:53:31,220 --> 00:53:33,154 When you're choosing where to place the dot? 
1239 00:53:33,154 --> 00:53:34,070 Or what to connect to? 1240 00:53:34,070 --> 00:53:35,778 AUDIENCE: [INAUDIBLE] there's a dot there 1241 00:53:35,778 --> 00:53:37,400 on the side [INAUDIBLE] right? 1242 00:53:37,400 --> 00:53:40,015 So when I do the sampling, I just [INAUDIBLE] this node. 1243 00:53:40,015 --> 00:53:42,420 I uniformly choose one of them [INAUDIBLE]. 1244 00:53:45,490 --> 00:53:49,290 PROFESSOR 4: So in general, probabilistic roadmaps, 1245 00:53:49,290 --> 00:53:51,430 you can throw in whatever sampler you want. 1246 00:53:51,430 --> 00:53:53,470 The way this particular one-- 1247 00:53:53,470 --> 00:53:57,400 the way I implement this is you sample points uniformly 1248 00:53:57,400 --> 00:53:59,160 from the entire space. 1249 00:53:59,160 --> 00:54:01,890 If it's inside an obstacle, you remove it. 1250 00:54:01,890 --> 00:54:04,090 Once you place it-- 1251 00:54:04,090 --> 00:54:07,150 this was connect to the K closest-- 1252 00:54:07,150 --> 00:54:10,750 I think K is 7 in this case. 1253 00:54:10,750 --> 00:54:14,020 And if an edge collides 1254 00:54:14,020 --> 00:54:16,520 with an obstacle, you remove it. 1255 00:54:16,520 --> 00:54:20,490 I'd be happy to go into more detail on PRMs later. 1256 00:54:20,490 --> 00:54:21,740 All right. 1257 00:54:21,740 --> 00:54:23,960 But that's not enough. 1258 00:54:23,960 --> 00:54:27,680 Because what we described was PRMs in the configuration 1259 00:54:27,680 --> 00:54:29,950 space, but what we need to do is somehow 1260 00:54:29,950 --> 00:54:32,440 elevate a PRM from the configuration space 1261 00:54:32,440 --> 00:54:34,675 to a belief space. 1262 00:54:34,675 --> 00:54:36,860 And this is really hard. 1263 00:54:36,860 --> 00:54:40,570 We don't have access to these raw configurations. 1264 00:54:40,570 --> 00:54:43,570 Let's imagine we were in this really simple world 1265 00:54:43,570 --> 00:54:46,120 where the quadrotor could be in three possible states. 
1266 00:54:46,120 --> 00:54:47,830 One, two, three. 1267 00:54:47,830 --> 00:54:48,910 Really easy, right? 1268 00:54:48,910 --> 00:54:50,234 Sample a bunch of points. 1269 00:54:50,234 --> 00:54:52,150 They're going to end up in one, two, or three. 1270 00:54:52,150 --> 00:54:55,720 You could pretty quickly cover the entire space. 1271 00:54:55,720 --> 00:54:57,790 But this simple configuration space, 1272 00:54:57,790 --> 00:55:02,040 when transformed to the belief space, becomes infinite. 1273 00:55:02,040 --> 00:55:05,495 You have infinite possible distributions to consider. 1274 00:55:05,495 --> 00:55:07,060 There is the distribution where you 1275 00:55:07,060 --> 00:55:08,800 have 100% probability 1276 00:55:08,800 --> 00:55:13,902 that you're in state 1, 100% probability 1277 00:55:13,902 --> 00:55:14,860 that you're in state 2, 1278 00:55:14,860 --> 00:55:16,870 100% probability that you're in state 3. 1279 00:55:16,870 --> 00:55:20,100 And then everything in between. 1280 00:55:20,100 --> 00:55:22,490 We went from three to infinite. 1281 00:55:22,490 --> 00:55:24,490 This is not boding well. 1282 00:55:24,490 --> 00:55:26,132 And even if you start saying, well, I'm 1283 00:55:26,132 --> 00:55:28,007 not going to consider the whole distribution. 1284 00:55:28,007 --> 00:55:31,080 I just care about the mean and the variance, 1285 00:55:31,080 --> 00:55:34,850 it's still not pretty, right? 1286 00:55:34,850 --> 00:55:42,455 Well, this is where we have to start making approximations. 1287 00:55:42,455 --> 00:55:44,900 And this is where you start getting differences 1288 00:55:44,900 --> 00:55:48,475 in POMDP planners-- where they make assumptions, 1289 00:55:48,475 --> 00:55:51,480 where they make approximations. 1290 00:55:51,480 --> 00:55:55,170 So these images are from the belief roadmap 1291 00:55:55,170 --> 00:55:59,090 paper from the Robust Robotics group a couple of years ago. 
1292 00:55:59,090 --> 00:56:01,790 But I'm going to talk about a different planner soon. 1293 00:56:01,790 --> 00:56:06,500 But the idea behind a lot of these planners 1294 00:56:06,500 --> 00:56:09,680 is maybe we can start saying what these distributions are 1295 00:56:09,680 --> 00:56:12,510 going to look like based on our models. 1296 00:56:12,510 --> 00:56:16,650 So in our planning problem, if we know we started at (2,3) 1297 00:56:16,650 --> 00:56:19,295 and we know our transition distribution, 1298 00:56:19,295 --> 00:56:22,729 we can start saying, well, this is my probability distribution. 1299 00:56:22,729 --> 00:56:24,270 And then when I make an observation, 1300 00:56:24,270 --> 00:56:27,165 I can build distributions off that. 1301 00:56:27,165 --> 00:56:31,000 And so, if you could exhaustively 1302 00:56:31,000 --> 00:56:33,785 propagate these distributions forward, that would be great. 1303 00:56:33,785 --> 00:56:36,180 But it's unrealistic. 1304 00:56:36,180 --> 00:56:41,440 And I just want to point out the visual way 1305 00:56:41,440 --> 00:56:42,815 to represent these distributions. 1306 00:56:42,815 --> 00:56:46,210 A really nice way of seeing it: in the deterministic world, 1307 00:56:46,210 --> 00:56:47,950 you have these dots and edges. 1308 00:56:47,950 --> 00:56:51,320 In the probabilistic world, these circles, 1309 00:56:51,320 --> 00:56:54,380 these ellipsoids, represent uncertainty. 1310 00:56:54,380 --> 00:56:56,690 Typically, it's one standard deviation or three 1311 00:56:56,690 --> 00:56:58,560 standard deviations away. 1312 00:56:58,560 --> 00:57:01,976 And so you can start building into the map 1313 00:57:01,976 --> 00:57:03,850 these distributions and the variances 1314 00:57:03,850 --> 00:57:06,910 I can expect at these nodes. 1315 00:57:06,910 --> 00:57:09,742 Are there any questions about this stuff? 1316 00:57:13,060 --> 00:57:13,670 All right. 
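[Propagating a belief forward and then folding in a measurement, as described above, is the predict/update cycle of a Bayes filter. Under the common Gaussian assumption it reduces to a couple of lines; this is a generic one-dimensional sketch of that cycle, not the belief roadmap paper's implementation.]

```python
def predict(mean, var, a, q):
    """Push a Gaussian belief through dynamics x' = a*x + noise: variance grows by q."""
    return a * mean, a * a * var + q

def update(mean, var, z, r):
    """Fold in a noisy measurement z with variance r: variance shrinks."""
    gain = var / (var + r)                     # Kalman gain
    return mean + gain * (z - mean), (1 - gain) * var

m, v = predict(0.0, 1.0, a=1.0, q=0.5)         # uncertainty ellipse grows: v = 1.5
m, v = update(m, v, z=1.0, r=0.5)              # observation collapses it: v = 0.375
```

[The growing and shrinking variance is exactly what the uncertainty ellipsoids on the slide visualize at each node.]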
1317 00:57:13,670 --> 00:57:17,680 Well, we're going to delve now into a specific case study. 1318 00:57:17,680 --> 00:57:19,620 Feedback-Based Information State Roadmaps-- 1319 00:57:19,620 --> 00:57:21,324 FIRM. 1320 00:57:21,324 --> 00:57:23,740 From now on, that's the only way I'm going to refer to it. 1321 00:57:26,690 --> 00:57:28,560 The idea behind this is you're going 1322 00:57:28,560 --> 00:57:33,260 to sample mean configurations from your configuration space. 1323 00:57:33,260 --> 00:57:38,355 Then you want to build an LQR-- that's 1324 00:57:38,355 --> 00:57:40,910 a Linear Quadratic Regulator controller-- 1325 00:57:40,910 --> 00:57:44,910 around these mean points. 1326 00:57:44,910 --> 00:57:47,810 And that will generate what variance you can tolerate. 1327 00:57:47,810 --> 00:57:53,470 So LQR controllers, if you don't know, they're really nice. 1328 00:57:53,470 --> 00:57:55,955 Around a small region around a point, 1329 00:57:55,955 --> 00:58:00,243 they can drive a quadrotor, for example, to that point. 1330 00:58:00,243 --> 00:58:04,170 And so if you build these LQR controllers around points, 1331 00:58:04,170 --> 00:58:06,120 you can say, all right, anytime I end up 1332 00:58:06,120 --> 00:58:09,500 in this cloud in the belief space-- 1333 00:58:09,500 --> 00:58:14,810 so any sort of distribution-- it can bring it back to that mean. 1334 00:58:14,810 --> 00:58:17,840 And so now, what we've done is we've generated all these points; 1335 00:58:17,840 --> 00:58:19,290 we just need to connect them. 1336 00:58:19,290 --> 00:58:24,050 And the idea is if you have a feedback-based controller 1337 00:58:24,050 --> 00:58:27,740 that can get from one cloud to another from anywhere in the cloud, 1338 00:58:27,740 --> 00:58:31,760 then you can get from point to point. 1339 00:58:31,760 --> 00:58:32,762 All right. 1340 00:58:32,762 --> 00:58:34,220 This is how you generate the graph. 
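[Editor's note: the stabilizing role of the node controllers can be sketched with a standard discrete-time LQR on a toy double integrator. This is not the model from the FIRM paper; the dynamics and cost weights are made up, and the point is only the Riccati recursion and the closed loop pulling a nearby state back to the sampled mean.]

```python
import numpy as np

# Toy double-integrator dynamics (position, velocity), dt = 0.1.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[0.01]])    # illustrative weights

# Backward Riccati recursion to get the LQR gain K.
P = Q.copy()
for _ in range(200):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)

# Start offset from the sampled mean; closed loop u = -K x pulls it back.
x = np.array([0.5, -0.2])
for _ in range(100):
    x = (A - B @ K) @ x

print(np.linalg.norm(x))   # driven close to the mean (near zero)
```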
1341 00:58:34,220 --> 00:58:36,255 And what's cool is the way they formulated 1342 00:58:36,255 --> 00:58:39,360 the problem is they said, well, set up 1343 00:58:39,360 --> 00:58:41,190 the cost of executing an edge. 1344 00:58:41,190 --> 00:58:43,530 We switched from rewards to costs 1345 00:58:43,530 --> 00:58:46,700 because we're pessimists now. 1346 00:58:46,700 --> 00:58:49,110 Well, the cost is going to be a linear combination 1347 00:58:49,110 --> 00:58:53,610 of the expected time to execute that edge and the uncertainty 1348 00:58:53,610 --> 00:58:55,262 along that edge. 1349 00:58:55,262 --> 00:58:58,020 Now, this is actually really cool to play with. 1350 00:58:58,020 --> 00:59:04,475 What do you think would happen if I set beta to 0 in this? 1351 00:59:04,475 --> 00:59:07,829 Like, what sort of [INAUDIBLE] would you get? 1352 00:59:07,829 --> 00:59:09,870 I will cold call because I know a few names here. 1353 00:59:12,410 --> 00:59:12,910 Anybody? 1354 00:59:12,910 --> 00:59:17,181 I don't want to cold call someone who may [INAUDIBLE]. 1355 00:59:17,181 --> 00:59:19,180 AUDIENCE: You're going to get the shortest path. 1356 00:59:19,180 --> 00:59:21,179 PROFESSOR 4: Shortest path, that's right, right? 1357 00:59:21,179 --> 00:59:22,600 This term goes away. 1358 00:59:22,600 --> 00:59:24,615 The cost is a function of the time-- 1359 00:59:24,615 --> 00:59:26,030 shortest path. 1360 00:59:26,030 --> 00:59:29,220 Now, what's cool is one day I was messing around 1361 00:59:29,220 --> 00:59:29,940 with this code. 1362 00:59:29,940 --> 00:59:33,750 I'm like, I wonder what happens if I set alpha to 0, right? 1363 00:59:33,750 --> 00:59:37,180 So your cost is purely a function of uncertainty. 1364 00:59:37,180 --> 00:59:39,900 It turns out what the quadrotor does is it just 1365 00:59:39,900 --> 00:59:41,990 hangs out where it starts. 1366 00:59:41,990 --> 00:59:44,400 It says, I'm in no hurry. 1367 00:59:44,400 --> 00:59:45,285 I know where I am. 
1368 00:59:45,285 --> 00:59:46,866 I'm just going to stay here. 1369 00:59:49,380 --> 00:59:51,212 But I find this amazing, right? 1370 00:59:51,212 --> 00:59:52,920 Because this almost models behavior even. 1371 00:59:52,920 --> 00:59:54,660 You could start saying, do I want 1372 00:59:54,660 --> 00:59:56,990 to be a risky quadrotor or a safe one? 1373 00:59:56,990 --> 00:59:59,650 Like, how important is it for me to get somewhere on time 1374 00:59:59,650 --> 01:00:00,765 or be safe? 1375 01:00:00,765 --> 01:00:02,940 And it's just those two parameters. 1376 01:00:02,940 --> 01:00:07,250 Or you could even make them sum to 1, like alpha and 1 minus alpha. 1377 01:00:07,250 --> 01:00:10,170 I think this stuff is really cool. 1378 01:00:10,170 --> 01:00:12,780 The one detail that I'm really going to go into for FIRM 1379 01:00:12,780 --> 01:00:15,900 is the cost equation, which is based on the Bellman backup 1380 01:00:15,900 --> 01:00:17,700 equation that we had. 1381 01:00:17,700 --> 01:00:21,560 The cost to go, right, the expected cost from a belief 1382 01:00:21,560 --> 01:00:25,355 state [INAUDIBLE] is-- 1383 01:00:25,355 --> 01:00:27,146 well, you're going to take the best action. 1384 01:00:27,146 --> 01:00:29,354 So you're going to take the min of this whole thing. 1385 01:00:29,354 --> 01:00:33,400 You're going to say, it's the cost of executing 1386 01:00:33,400 --> 01:00:38,570 a specific action plus the cost of colliding with something, 1387 01:00:38,570 --> 01:00:40,750 an obstacle, times the probability of colliding, 1388 01:00:40,750 --> 01:00:41,250 right? 1389 01:00:41,250 --> 01:00:43,570 That's this term. 1390 01:00:43,570 --> 01:00:47,240 And then you've got to say, well, OK, and then 1391 01:00:47,240 --> 01:00:49,642 once I reach a state, what's the cost from there? 1392 01:00:49,642 --> 01:00:51,600 And then I can weight that by the probability 1393 01:00:51,600 --> 01:00:53,970 of ending up in that state. 
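[Editor's note: both pieces just described, the alpha/beta edge cost and the Bellman-style cost-to-go backup, fit in a short sketch. Every number, edge name, and action here is invented; this mirrors only the structure of the equations on the slide, not the FIRM implementation.]

```python
# First, the edge cost: cost(edge) = alpha * expected_time + beta * uncertainty.
edges = {
    "short_but_risky": {"time": 5.0, "uncertainty": 9.0},
    "long_but_safe":   {"time": 12.0, "uncertainty": 1.0},
}

def best_edge(alpha, beta):
    return min(edges, key=lambda e: alpha * edges[e]["time"]
                                    + beta * edges[e]["uncertainty"])

print(best_edge(alpha=1.0, beta=0.0))   # beta = 0: pure shortest path
print(best_edge(alpha=0.0, beta=1.0))   # alpha = 0: minimize uncertainty only

# Second, one Bellman backup of the cost-to-go:
#   J(b) = min_a [ C(b,a) + c_collide * P(collide | b,a)
#                  + sum_m P(b_m | b,a) * J(b_m) ]
C_COLLIDE = 100.0
J = {"goal": 0.0, "mid": 4.0}           # current cost-to-go estimates

actions = {
    "risky": {"cost": 2.0, "p_collide": 0.10,
              "outcomes": {"goal": 0.9, "mid": 0.1}},
    "safe":  {"cost": 5.0, "p_collide": 0.01,
              "outcomes": {"goal": 0.8, "mid": 0.2}},
}

def backup(actions, J):
    return min(a["cost"] + C_COLLIDE * a["p_collide"]
               + sum(p * J[b] for b, p in a["outcomes"].items())
               for a in actions.values())

print(backup(actions, J))               # here the "safe" action wins: 6.8
```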
1394 01:00:53,970 --> 01:00:58,580 Does this equation make sense to people? 1395 01:00:58,580 --> 01:01:01,390 There are a lot of symbols, and honestly, I hate notation, 1396 01:01:01,390 --> 01:01:03,050 but it works. 1397 01:01:03,050 --> 01:01:04,660 You just plug in-- the cost is I'm 1398 01:01:04,660 --> 01:01:05,910 going to take the best action. 1399 01:01:05,910 --> 01:01:07,326 It's going to be the cost of using 1400 01:01:07,326 --> 01:01:09,390 that action, the cost of colliding times 1401 01:01:09,390 --> 01:01:12,540 the probability of colliding given that I used that action 1402 01:01:12,540 --> 01:01:14,720 and I started where I started, and then 1403 01:01:14,720 --> 01:01:17,980 the cost from where I end up. 1404 01:01:17,980 --> 01:01:20,570 And since you end up in a probability distribution, 1405 01:01:20,570 --> 01:01:23,300 we need to consider all these cases. 1406 01:01:23,300 --> 01:01:24,580 Yeah? 1407 01:01:24,580 --> 01:01:26,904 AUDIENCE: When you define like an action from one place 1408 01:01:26,904 --> 01:01:28,445 to another, do you always think of it 1409 01:01:28,445 --> 01:01:30,070 as starting from the mean that you sampled 1410 01:01:30,070 --> 01:01:33,280 or from anywhere [INAUDIBLE]? 1411 01:01:33,280 --> 01:01:35,520 PROFESSOR 4: So it's from the-- 1412 01:01:35,520 --> 01:01:38,510 so formally, it was from the belief state, 1413 01:01:38,510 --> 01:01:42,700 which is the mean, yeah, plus the variance once it stabilized 1414 01:01:42,700 --> 01:01:44,682 to that point. 1415 01:01:44,682 --> 01:01:47,140 And the way that variance is generated, I should have said, 1416 01:01:47,140 --> 01:01:50,000 is, you're going to have models of these quadrotors. 1417 01:01:50,000 --> 01:01:55,695 And so I spent a good amount of time in the ACL with Jonathan How, 1418 01:01:55,695 --> 01:01:56,585 in Course 16. 1419 01:01:56,585 --> 01:01:58,920 And you just let the quadrotor hover. 
1420 01:01:58,920 --> 01:02:01,570 You measure its position for a long time. 1421 01:02:01,570 --> 01:02:06,848 And you get a distribution over where it goes. 1422 01:02:06,848 --> 01:02:08,849 AUDIENCE: So what does the letter M stand for? 1423 01:02:08,849 --> 01:02:10,265 PROFESSOR 4: What is the letter M? 1424 01:02:10,265 --> 01:02:10,640 AUDIENCE: Yeah. 1425 01:02:10,640 --> 01:02:11,431 PROFESSOR 4: Right. 1426 01:02:11,431 --> 01:02:14,260 So you're summing over-- these are the belief states that you 1427 01:02:14,260 --> 01:02:16,390 could end up in, right? 1428 01:02:16,390 --> 01:02:18,190 So if everything were deterministic, 1429 01:02:18,190 --> 01:02:20,280 you would just ignore the sum. 1430 01:02:20,280 --> 01:02:21,950 It's where you end up. 1431 01:02:21,950 --> 01:02:25,386 Realistically, you could end up in some other state 1432 01:02:25,386 --> 01:02:27,497 we haven't considered. 1433 01:02:27,497 --> 01:02:28,955 AUDIENCE: So just a quick question. 1434 01:02:28,955 --> 01:02:32,090 So those, if we're operating on a Gaussian space, 1435 01:02:32,090 --> 01:02:33,590 then you have Gaussian observations. 1436 01:02:33,590 --> 01:02:36,980 So those are sums over the observation samples that 1437 01:02:36,980 --> 01:02:39,030 are generated when [INAUDIBLE]. 1438 01:02:39,030 --> 01:02:43,350 So that is a finite sum over the possibly infinite observation 1439 01:02:43,350 --> 01:02:44,270 states we might have. 1440 01:02:44,270 --> 01:02:45,060 PROFESSOR 4: Yeah. 1441 01:02:45,060 --> 01:02:47,600 So there's definitely [INAUDIBLE] 1442 01:02:47,600 --> 01:02:50,720 in terms of where you could end up. 1443 01:02:50,720 --> 01:02:53,690 And even the action space, there's 1444 01:02:53,690 --> 01:02:56,440 a set of feedback controllers that you're allowed. 
1445 01:02:56,440 --> 01:03:00,224 I think the observation, if you modeled it as a Gaussian, 1446 01:03:00,224 --> 01:03:01,640 if you made some nice assumptions, 1447 01:03:01,640 --> 01:03:03,709 it can be tractable as continuous. 1448 01:03:03,709 --> 01:03:04,500 AUDIENCE: Oh, yeah. 1449 01:03:04,500 --> 01:03:05,249 I was [INAUDIBLE]. 1450 01:03:05,249 --> 01:03:08,410 So for the start, so if you start with Gaussian noise, 1451 01:03:08,410 --> 01:03:10,460 then you have a linear model. 1452 01:03:10,460 --> 01:03:12,910 Then your prediction's going to be Gaussian. 1453 01:03:12,910 --> 01:03:13,747 PROFESSOR 4: Yeah. 1454 01:03:13,747 --> 01:03:15,830 AUDIENCE: But then you have Gaussian observations, 1455 01:03:15,830 --> 01:03:18,560 which is great because then your update is Gaussian. 1456 01:03:18,560 --> 01:03:21,140 But the observation space is infinite. 1457 01:03:21,140 --> 01:03:24,440 So basically not only do you have to sample positions, 1458 01:03:24,440 --> 01:03:26,655 but you also have to sample potential observations 1459 01:03:26,655 --> 01:03:28,370 that you might get as you go along. 1460 01:03:28,900 --> 01:03:29,650 PROFESSOR 4: Yeah. 1461 01:03:29,650 --> 01:03:31,650 AUDIENCE: So that you can basically-- 1462 01:03:31,650 --> 01:03:32,410 and it's great. 1463 01:03:32,410 --> 01:03:35,990 But you end up basically reducing a possibly infinite 1464 01:03:35,990 --> 01:03:39,330 branching, which is your Gaussian, to kind of 1465 01:03:39,330 --> 01:03:41,289 like a Monte Carlo tree search. 1466 01:03:41,289 --> 01:03:43,330 You generate a whole bunch of meaningful samples, 1467 01:03:43,330 --> 01:03:45,480 and those are the ones that you consider. 1468 01:03:45,480 --> 01:03:47,416 So is that where the sum is coming from? 1469 01:03:47,416 --> 01:03:48,624 PROFESSOR 4: Yeah, basically. 
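[Editor's note: the exchange above, that a linear model plus Gaussian noise keeps both the prediction and the measurement update Gaussian, is exactly a Kalman filter step. A 1-D sketch, with all noise values made up for illustration:]

```python
# 1-D Kalman filter step: the belief is just a (mean, variance) pair.
a, q, r = 1.0, 0.04, 0.09     # dynamics gain, process var, measurement var
mean, var = 0.0, 0.01

# Prediction: uncertainty grows by the process noise.
mean, var = a * mean, a * var * a + q

# Update with a measurement z: uncertainty shrinks, mean moves toward z.
z = 0.3
k = var / (var + r)           # Kalman gain
mean = mean + k * (z - mean)
var = (1 - k) * var

print(mean, var)
```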
1470 01:03:48,624 --> 01:03:52,686 And a fun fact, the theorem that I read and trusted 1471 01:03:52,686 --> 01:03:55,515 was that if you just sample randomly 1472 01:03:55,515 --> 01:03:57,450 right from your configuration space, 1473 01:03:57,450 --> 01:04:00,130 you have zero probability of constructing 1474 01:04:00,130 --> 01:04:03,340 a graph that, without any assumptions, will be connected. 1475 01:04:03,340 --> 01:04:05,350 Whereas for PRMs, you sample enough 1476 01:04:05,350 --> 01:04:08,461 and things will turn out nicely; that's not the case in the belief 1477 01:04:08,461 --> 01:04:08,960 space. 1478 01:04:08,960 --> 01:04:11,728 That's why we need to make these assumptions. 1479 01:04:11,728 --> 01:04:13,120 We're killing it. 1480 01:04:13,120 --> 01:04:14,697 We know POMDPs. 1481 01:04:14,697 --> 01:04:17,030 Now we get to look at some really fun graphics generated 1482 01:04:17,030 --> 01:04:20,622 from real flights using FIRM. 1483 01:04:24,090 --> 01:04:28,204 The big takeaway is that FIRM prefers safer paths. 1484 01:04:28,204 --> 01:04:30,120 We've got two images that look really similar. 1485 01:04:30,120 --> 01:04:32,080 We're going to talk about one and then show 1486 01:04:32,080 --> 01:04:34,560 why they're slightly different. 1487 01:04:34,560 --> 01:04:38,289 The test flight that we put this quadrotor under was we said, 1488 01:04:38,289 --> 01:04:40,080 we're going to start at this configuration. 1489 01:04:40,080 --> 01:04:42,980 We're going to go to this configuration. 1490 01:04:42,980 --> 01:04:45,870 And there's this big, blue obstacle. 1491 01:04:45,870 --> 01:04:48,695 And there are landmarks that you can take measurements off of. 1492 01:04:48,695 --> 01:04:51,060 So those are these red dots. 1493 01:04:51,060 --> 01:04:54,830 We want to compare two planners, right: a PRM and a FIRM 1494 01:04:54,830 --> 01:04:56,270 planner. 1495 01:04:56,270 --> 01:04:59,180 The PRM planner said, right, I just want to minimize time. 
1496 01:04:59,180 --> 01:05:01,710 And so the first path it found is to stay 1497 01:05:01,710 --> 01:05:03,390 to the left of the obstacle. 1498 01:05:03,390 --> 01:05:06,592 But that's actually a really narrow path 1499 01:05:06,592 --> 01:05:08,250 between the obstacle and the wall. 1500 01:05:08,250 --> 01:05:09,910 On the other hand, the FIRM planner 1501 01:05:09,910 --> 01:05:13,890 said, right, I want to minimize the linear combination of time 1502 01:05:13,890 --> 01:05:15,590 and uncertainty. 1503 01:05:15,590 --> 01:05:17,370 And so if you ramp up the weight 1504 01:05:17,370 --> 01:05:20,460 for uncertainty, at some point the path sort of 1505 01:05:20,460 --> 01:05:22,060 pops over the obstacle. 1506 01:05:22,060 --> 01:05:23,165 It's really cool. 1507 01:05:23,165 --> 01:05:24,125 It snaps. 1508 01:05:24,125 --> 01:05:25,500 And then you get this safer path. 1509 01:05:25,500 --> 01:05:27,252 But that's longer. 1510 01:05:27,252 --> 01:05:28,996 And how do we know it's safer? 1511 01:05:28,996 --> 01:05:31,890 Because these ellipsoids that we've drawn 1512 01:05:31,890 --> 01:05:35,730 are the 3D versions of what we saw for PRM, elevated 1513 01:05:35,730 --> 01:05:37,280 to the belief space version. 1514 01:05:37,280 --> 01:05:39,840 It's the uncertainty that the quadrotor 1515 01:05:39,840 --> 01:05:42,350 has over its true state. 1516 01:05:42,350 --> 01:05:44,250 Why is the PRM plan-- 1517 01:05:44,250 --> 01:05:45,890 why does it have such big ellipsoids? 1518 01:05:45,890 --> 01:05:48,090 It's because when it's behind the obstacle, 1519 01:05:48,090 --> 01:05:50,050 it can't make any of these measurements 1520 01:05:50,050 --> 01:05:51,990 because it can't see the landmarks. 1521 01:05:51,990 --> 01:05:55,490 So it's basically using dead reckoning. 1522 01:05:55,490 --> 01:05:59,580 And the transition model, right, is not deterministic. 1523 01:05:59,580 --> 01:06:01,045 So its uncertainty grows. 
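[Editor's note: the dead-reckoning effect described here, where no visible landmarks means no measurement updates so the covariance only grows, can be sketched in a few lines. The process noise value is invented.]

```python
import numpy as np

# Pure dead reckoning behind the obstacle: predict only, never update.
Q = 0.05 * np.eye(2)          # per-step process noise (made up)
cov = 0.01 * np.eye(2)        # initial position covariance

traces = []
for step in range(10):        # 10 steps with no landmark in view
    cov = cov + Q             # prediction step only
    traces.append(np.trace(cov))

print(traces[0], traces[-1])  # the uncertainty grows monotonically
```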
1524 01:06:01,045 --> 01:06:04,030 And so we can see these ellipsoids are bigger for PRM 1525 01:06:04,030 --> 01:06:04,909 than FIRM. 1526 01:06:04,909 --> 01:06:06,450 And then the reason I have two images 1527 01:06:06,450 --> 01:06:08,870 is they are slightly different. 1528 01:06:08,870 --> 01:06:12,220 It's that these landmarks were fictitious landmarks 1529 01:06:12,220 --> 01:06:13,585 that we just made up, 1530 01:06:13,585 --> 01:06:16,440 and we would generate fake measurements off of them. 1531 01:06:16,440 --> 01:06:20,277 And so we could tune the noise of the landmarks. 1532 01:06:20,277 --> 01:06:21,860 Maybe a little bit of cheating, but it 1533 01:06:21,860 --> 01:06:25,350 allowed us to increase the noise from-- 1534 01:06:25,350 --> 01:06:29,790 I think the number was 0.05 to 0.15. 1535 01:06:29,790 --> 01:06:32,554 You can see the uncertainty ellipsoids from PRM 1536 01:06:32,554 --> 01:06:36,290 grow when we increased the noise, whereas for FIRM, they 1537 01:06:36,290 --> 01:06:38,110 stay about the same. 1538 01:06:38,110 --> 01:06:39,750 And importantly, these ellipsoids 1539 01:06:39,750 --> 01:06:43,140 grow enough that they start overlapping with the obstacles. 1540 01:06:43,140 --> 01:06:46,995 That represents a very high probability of collision. 1541 01:06:46,995 --> 01:06:48,495 Whereas for FIRM, the way it managed 1542 01:06:48,495 --> 01:06:52,335 to keep these ellipsoids so small is we 1543 01:06:52,335 --> 01:06:55,800 kept the cost function the same, right-- the same weight 1544 01:06:55,800 --> 01:06:58,000 on uncertainty, same weight on the time, right? 1545 01:06:58,000 --> 01:07:00,400 But the uncertainty was so much higher. 1546 01:07:00,400 --> 01:07:04,195 And so it decided, well, I can sacrifice a little bit of time. 1547 01:07:04,195 --> 01:07:05,320 I can take the slower path. 
1548 01:07:05,320 --> 01:07:07,525 I can just hang out by the landmark, really 1549 01:07:07,525 --> 01:07:10,920 make sure I know where I am before continuing. 1550 01:07:10,920 --> 01:07:13,300 And so the path-- the duration of the flight 1551 01:07:13,300 --> 01:07:15,354 took a lot longer for FIRM, as time 1552 01:07:15,354 --> 01:07:17,020 would increase but the uncertainty would 1553 01:07:17,020 --> 01:07:19,304 stay about constant. 1554 01:07:19,304 --> 01:07:23,168 Do people understand these graphics? 1555 01:07:23,168 --> 01:07:24,500 Great. 1556 01:07:24,500 --> 01:07:25,640 All right. 1557 01:07:25,640 --> 01:07:28,550 We've got a graph that just represents 1558 01:07:28,550 --> 01:07:30,820 a little more formally that growing uncertainty. 1559 01:07:30,820 --> 01:07:35,640 As noise increases, the variance along a single dimension-- z, 1560 01:07:35,640 --> 01:07:41,810 y, and x-- for PRM, which is that, and FIRM. 1561 01:07:41,810 --> 01:07:44,775 The variance for PRM is always higher, and it grows. 1562 01:07:44,775 --> 01:07:48,630 For FIRM, it's lower, and it stays about constant. 1563 01:07:48,630 --> 01:07:51,360 That's the big takeaway. 1564 01:07:51,360 --> 01:07:54,440 FIRM minimizes this uncertainty. 1565 01:07:54,440 --> 01:07:58,320 And then the final image from these results that I 1566 01:07:58,320 --> 01:08:00,510 want to show is in simulation. 1567 01:08:00,510 --> 01:08:03,200 They said, well, let's actually measure how often it crashes. 1568 01:08:03,200 --> 01:08:05,070 We didn't want to do this in the real world 1569 01:08:05,070 --> 01:08:08,790 because we don't want to crash the quadrotor that many times. 1570 01:08:08,790 --> 01:08:11,440 The gist of it is comparing a reactive planner 1571 01:08:11,440 --> 01:08:14,670 and a deterministic one [INAUDIBLE]. 1572 01:08:14,670 --> 01:08:18,920 As noise increases-- noise was simulated with wind strength-- 1573 01:08:18,920 --> 01:08:21,740 the number of crashes increases for PRM, basically. 
1574 01:08:21,740 --> 01:08:25,410 And for FIRM, it stays constant and low. 1575 01:08:25,410 --> 01:08:27,540 The reason there are two lines is 1576 01:08:27,540 --> 01:08:30,500 there were two planners with different time horizons. 1577 01:08:30,500 --> 01:08:33,717 The important thing is FIRM is low and constant. 1578 01:08:33,717 --> 01:08:34,711 PRM grows. 1579 01:08:40,680 --> 01:08:44,889 We've now talked through all the probabilistic planning you'll 1580 01:08:44,889 --> 01:08:46,023 ever need to know, right? 1581 01:08:46,023 --> 01:08:47,210 No, not quite. 1582 01:08:47,210 --> 01:08:49,820 But we have covered a lot of stuff. 1583 01:08:49,820 --> 01:08:51,170 What are the big takeaways? 1584 01:08:51,170 --> 01:08:54,660 We've learned that real-world problems are stochastic, right? 1585 01:08:54,660 --> 01:08:57,396 Quadrotors are not these perfect machines 1586 01:08:57,396 --> 01:08:58,819 that we wish they were. 1587 01:08:58,819 --> 01:09:02,000 So it's important to model them as stochastic. 1588 01:09:02,000 --> 01:09:04,430 The problem is once you start modeling them as stochastic, 1589 01:09:04,430 --> 01:09:07,430 it becomes a lot harder to solve. 1590 01:09:07,430 --> 01:09:09,050 But if you make some assumptions, 1591 01:09:09,050 --> 01:09:11,069 or even if you don't, if you just get smart 1592 01:09:11,069 --> 01:09:15,199 and you use heuristics, you can resolve this complexity. 1593 01:09:15,199 --> 01:09:17,324 And so I hope you remember these three points, 1594 01:09:17,324 --> 01:09:20,474 you remember this graphic or that graphic, the idea 1595 01:09:20,474 --> 01:09:22,399 that if you take uncertainty into account, 1596 01:09:22,399 --> 01:09:25,684 you get fundamentally different paths [INAUDIBLE]. 1597 01:09:25,684 --> 01:09:28,770 And that can be a good thing. 1598 01:09:28,770 --> 01:09:30,400 What questions do you have, anything? 1599 01:09:30,400 --> 01:09:30,899 Yeah? 
1600 01:09:30,899 --> 01:09:33,700 AUDIENCE: [INAUDIBLE] so far people 1601 01:09:33,700 --> 01:09:36,279 have [INAUDIBLE] problems. 1602 01:09:36,279 --> 01:09:39,479 So how do the same, you know, [INAUDIBLE]? 1603 01:09:39,479 --> 01:09:43,029 So for instance, we suggest that this is the maximum risk 1604 01:09:43,029 --> 01:09:44,040 we want to take. 1605 01:09:44,040 --> 01:09:46,780 Is it possible to integrate chance constraints 1606 01:09:46,780 --> 01:09:49,029 into your optimization problem and solve it? 1607 01:09:49,029 --> 01:09:51,279 Or [INAUDIBLE]? 1608 01:09:51,279 --> 01:09:56,800 PROFESSOR 4: So I imagine one thing that I would like to test 1609 01:09:56,800 --> 01:10:05,165 is if we go back to this really crude version of our cost 1610 01:10:05,165 --> 01:10:07,380 equation, we can imagine saying that if we 1611 01:10:07,380 --> 01:10:10,590 want to come up with a bound for uncertainty 1612 01:10:10,590 --> 01:10:12,380 that we can tolerate, you could maybe 1613 01:10:12,380 --> 01:10:15,135 set an intercept for this to be that bound 1614 01:10:15,135 --> 01:10:17,650 and then just ramp this up. 1615 01:10:17,650 --> 01:10:21,305 AUDIENCE: Oh, [INAUDIBLE] multiplier [INAUDIBLE]. 1616 01:10:21,305 --> 01:10:22,680 PROFESSOR 4: Something like that. 1617 01:10:22,680 --> 01:10:23,080 AUDIENCE: [INAUDIBLE]. 1618 01:10:23,080 --> 01:10:24,288 GUEST SPEAKER: Yeah, exactly. 1619 01:10:24,288 --> 01:10:26,530 So it only comes into effect. 1620 01:10:26,530 --> 01:10:29,170 Possibly what I said is a terrible idea. 1621 01:10:29,170 --> 01:10:30,807 But it's-- especially in simulation, 1622 01:10:30,807 --> 01:10:32,640 you can just try it out and see if it works. 1623 01:10:32,640 --> 01:10:35,830 AUDIENCE: And just another question, 1624 01:10:35,830 --> 01:10:39,310 you said propagating probability distributions on the networks 1625 01:10:39,310 --> 01:10:41,040 is hard. 
1626 01:10:41,040 --> 01:10:41,810 [INAUDIBLE] 1627 01:10:45,590 --> 01:10:49,435 So therefore we can assume an illustrated area, and then 1628 01:10:49,435 --> 01:10:52,252 [INAUDIBLE] normal distribution, they 1629 01:10:52,252 --> 01:10:53,770 are conjugate to each other. 1630 01:10:53,770 --> 01:10:56,500 Therefore you can [INAUDIBLE] observation 1631 01:10:56,500 --> 01:11:00,860 and then update [INAUDIBLE] and form a distribution, right? 1632 01:11:00,860 --> 01:11:01,610 PROFESSOR 4: Sure. 1633 01:11:01,610 --> 01:11:08,080 So I think that might be making some assumptions that we might 1634 01:11:08,080 --> 01:11:10,580 not be willing to make about the nature of the distributions 1635 01:11:10,580 --> 01:11:12,290 that you're trying to propagate. 1636 01:11:12,290 --> 01:11:15,690 I need to think about this some more. 1637 01:11:15,690 --> 01:11:18,920 But I think intuitively, we can understand that any time you're 1638 01:11:18,920 --> 01:11:22,580 propagating a distribution versus a single discrete value, 1639 01:11:22,580 --> 01:11:25,040 it's definitely not going to be easier. 1640 01:11:25,040 --> 01:11:31,120 And so as your distributions become more complex-- 1641 01:11:31,120 --> 01:11:35,280 perhaps if you're modeling a real-world stochastic sensor-- 1642 01:11:35,280 --> 01:11:38,120 you might not be able to perform these efficient updates using 1643 01:11:38,120 --> 01:11:39,842 conjugate priors. 1644 01:11:42,520 --> 01:11:44,140 Any other questions? 1645 01:11:44,140 --> 01:11:45,480 Otherwise we-- yeah? 1646 01:11:45,480 --> 01:11:47,646 AUDIENCE: So all these FIRM examples you're showing, 1647 01:11:47,646 --> 01:11:49,690 this is all planning done offline. 1648 01:11:49,690 --> 01:11:51,932 [INAUDIBLE] offline, right? 1649 01:11:51,932 --> 01:11:55,359 Have you expanded this at all to like an online case? 1650 01:11:55,359 --> 01:11:56,900 Or does that all require re-planning? 1651 01:11:56,900 --> 01:11:58,350 And how long does it take? 
1652 01:11:58,350 --> 01:11:59,141 PROFESSOR 4: Right. 1653 01:11:59,141 --> 01:12:02,530 So it's not good, I can tell you that much. 1654 01:12:02,530 --> 01:12:05,360 What's nice is FIRM generates a policy. 1655 01:12:05,360 --> 01:12:08,600 So if you construct your PRM to say, 1656 01:12:08,600 --> 01:12:10,340 don't just find the shortest path, 1657 01:12:10,340 --> 01:12:12,910 keep sampling points until you're 1658 01:12:12,910 --> 01:12:14,445 confident that you're always going 1659 01:12:14,445 --> 01:12:16,276 to end up near some point, 1660 01:12:16,276 --> 01:12:18,186 you can construct a policy and then 1661 01:12:18,186 --> 01:12:21,380 just, online, look up which belief 1662 01:12:21,380 --> 01:12:22,600 matches most closely. 1663 01:12:25,110 --> 01:12:31,016 It took for-- I think we sampled 600 nodes in like-- 1664 01:12:31,016 --> 01:12:33,560 what are the dimensions on this? 1665 01:12:33,560 --> 01:12:38,880 So 2 to 3 meter by 8 by 3 meter thing-- 1666 01:12:38,880 --> 01:12:40,610 it took about 15 minutes. 1667 01:12:40,610 --> 01:12:41,670 It was really slow. 1668 01:12:41,670 --> 01:12:43,850 Now, granted, it was on a virtual machine. 1669 01:12:43,850 --> 01:12:47,600 But it's not something we want to re-plan on the fly. 1670 01:12:47,600 --> 01:12:50,710 With PRM, it's typically a lot faster. 1671 01:12:50,710 --> 01:12:52,730 AUDIENCE: How much faster in that case? 1672 01:12:52,730 --> 01:12:54,930 PROFESSOR 4: So when we just did PRM, 1673 01:12:54,930 --> 01:12:58,735 I think it was one minute or something, at least 1674 01:12:58,735 --> 01:12:59,750 an order of magnitude. 1675 01:12:59,750 --> 01:13:04,390 And you could use other planners as well, like RRTs. 1676 01:13:04,390 --> 01:13:07,270 And there are RRT versions of-- 1677 01:13:07,270 --> 01:13:09,596 there are POMDP versions of RRTs. 1678 01:13:09,596 --> 01:13:11,825 But [INAUDIBLE]. 
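[Editor's note: the online lookup just described, where FIRM precomputes an action for each sampled belief node and at run time you execute the action of the nearest node, can be sketched as follows. The nodes, beliefs, and action names are all invented for illustration.]

```python
import numpy as np

# Precomputed (offline) policy: one action per sampled belief node.
nodes = {
    "n0": np.array([0.9, 0.1]),
    "n1": np.array([0.2, 0.8]),
}
policy = {"n0": "go_left", "n1": "go_right"}

def act(belief):
    # Online step: find the node whose belief is closest, execute its action.
    nearest = min(nodes, key=lambda n: np.linalg.norm(nodes[n] - belief))
    return policy[nearest]

print(act(np.array([0.7, 0.3])))   # closest to n0, so "go_left"
```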
1679 01:13:11,825 --> 01:13:13,950 All right, I'm going to turn it over to Steve so we 1680 01:13:13,950 --> 01:13:16,387 can learn about this project. 1681 01:13:16,387 --> 01:13:18,383 [APPLAUSE] 1682 01:13:53,860 --> 01:13:56,170 STEVE: So I'm just going to say a few good words 1683 01:13:56,170 --> 01:13:57,280 about the Grand Challenge. 1684 01:13:57,280 --> 01:13:59,275 So again, the final assignment will 1685 01:13:59,275 --> 01:14:01,730 be released tonight for this. 1686 01:14:01,730 --> 01:14:03,449 And as Professor Williams mentioned, 1687 01:14:03,449 --> 01:14:05,740 it's going to be descoped a bit from what we originally 1688 01:14:05,740 --> 01:14:07,880 had in the syllabus due to time constraints. 1689 01:14:07,880 --> 01:14:10,880 But here's a preview of what you'll be doing. 1690 01:14:10,880 --> 01:14:14,920 So for an overview, it will be a class-wide collaboration. 1691 01:14:14,920 --> 01:14:16,970 So what we're going to do, you guys 1692 01:14:16,970 --> 01:14:19,160 are going to stay in your advanced lecture teams 1693 01:14:19,160 --> 01:14:20,990 and each basically apply what you've 1694 01:14:20,990 --> 01:14:23,190 done, the great work you've done on that, 1695 01:14:23,190 --> 01:14:26,740 onto our hardware robot that we have. So it's not 1696 01:14:26,740 --> 01:14:28,414 a competition where each team will be 1697 01:14:28,414 --> 01:14:31,184 competing against each other. 1698 01:14:31,184 --> 01:14:33,970 And so again, we're descoping it. 1699 01:14:33,970 --> 01:14:36,220 The syllabus originally said it was 20% of your grade. 1700 01:14:36,220 --> 01:14:39,950 It's probably going to be more like 10% or 15% in the end. 1701 01:14:39,950 --> 01:14:42,720 So this is the robot you guys will be using. 1702 01:14:42,720 --> 01:14:45,029 It's called an [INAUDIBLE] robot. 1703 01:14:45,029 --> 01:14:46,820 Usually it has sort of a robotic arm on it. 
1704 01:14:46,820 --> 01:14:48,611 But we're actually not going to be using it 1705 01:14:48,611 --> 01:14:50,848 for this [INAUDIBLE]. 1706 01:14:50,848 --> 01:14:53,310 This base is made by a company from Spain, Orbonix. 1707 01:14:53,310 --> 01:14:55,330 It's a pretty cool robot. 1708 01:14:55,330 --> 01:14:58,870 One cool thing about it is that it's actually omnidirectional. 1709 01:14:58,870 --> 01:15:01,010 So there are wheels on the wheels here. 1710 01:15:01,010 --> 01:15:03,186 And what that means is that it can drive sideways, 1711 01:15:03,186 --> 01:15:04,690 left to right. 1712 01:15:04,690 --> 01:15:07,280 It would make parallel parking your car very easy. 1713 01:15:07,280 --> 01:15:09,086 [LAUGHTER] 1714 01:15:09,086 --> 01:15:11,210 We're probably not going to be driving it this fast 1715 01:15:11,210 --> 01:15:13,620 because it's actually super heavy. 1716 01:15:13,620 --> 01:15:15,410 And we don't want anyone to-- yeah, 1717 01:15:15,410 --> 01:15:17,550 we forgot to screw in that [INAUDIBLE]. 1718 01:15:17,550 --> 01:15:18,420 [LAUGHTER] 1719 01:15:18,420 --> 01:15:20,740 It will be screwed in during the competition. 1720 01:15:20,740 --> 01:15:22,660 But it's a pretty fun robot. 1721 01:15:22,660 --> 01:15:26,822 So you guys will be working on this. 1722 01:15:26,822 --> 01:15:30,420 And so the actual challenge itself, 1723 01:15:30,420 --> 01:15:32,920 as we've mentioned many times throughout this class, 1724 01:15:32,920 --> 01:15:35,797 will be a modified orienteering challenge. 1725 01:15:35,797 --> 01:15:38,172 So there's going to be a few different challenge stations 1726 01:15:38,172 --> 01:15:41,150 that you have to drive to. 1727 01:15:41,150 --> 01:15:44,614 And at the stations, there'll be small computational challenges. 1728 01:15:44,614 --> 01:15:46,030 And those computational challenges 1729 01:15:46,030 --> 01:15:51,530 will come from a subset of the advanced lecture teams, not all of them. 
1730 01:15:51,530 --> 01:15:54,695 And the goal is to complete as many of those as you can 1731 01:15:54,695 --> 01:15:58,030 and try to do that as quickly as possible. 1732 01:15:58,030 --> 01:16:00,370 And it's also going to be held indoors in our lab space, 1733 01:16:00,370 --> 01:16:01,430 where we just showed you. 1734 01:16:01,430 --> 01:16:02,971 So that way if it rains, we can still 1735 01:16:02,971 --> 01:16:05,680 have the Grand Challenge at the end of the semester. 1736 01:16:05,680 --> 01:16:12,310 So this is sort of how it's set up as of last night, actually. 1737 01:16:12,310 --> 01:16:13,966 So basically the robot will be able 1738 01:16:13,966 --> 01:16:15,930 to drive around in a small little LEGO maze 1739 01:16:15,930 --> 01:16:17,456 that we set up. 1740 01:16:17,456 --> 01:16:19,456 And there are going to be sort of different things 1741 01:16:19,456 --> 01:16:22,710 that you have to do in different places. 1742 01:16:22,710 --> 01:16:24,922 So what are you actually going to be doing? 1743 01:16:24,922 --> 01:16:26,880 What assignment are you guys going to be doing? 1744 01:16:26,880 --> 01:16:28,580 Well, it's actually a little flexible. 1745 01:16:28,580 --> 01:16:30,470 Since each of you did different things 1746 01:16:30,470 --> 01:16:33,380 for your advanced lectures, each team 1747 01:16:33,380 --> 01:16:35,842 is going to have a bit of a different assignment applied 1748 01:16:35,842 --> 01:16:36,575 to this. 1749 01:16:36,575 --> 01:16:38,033 I have some proposed ideas that I'm 1750 01:16:38,033 --> 01:16:40,740 going to talk about in the next slide here. 1751 01:16:40,740 --> 01:16:43,250 But the big thing is that these are just ideas. 1752 01:16:43,250 --> 01:16:46,092 You guys have a lot of flexibility in this. 1753 01:16:46,092 --> 01:16:48,050 You'll probably be working with the [INAUDIBLE] 1754 01:16:48,050 --> 01:16:51,490 a lot to have access to the robot. 
1755 01:16:51,490 --> 01:16:53,683 We can arrange extra office hours for you guys 1756 01:16:53,683 --> 01:16:57,034 to come use the hardware to test things. 1757 01:16:57,034 --> 01:16:58,700 So we'll be arranging all of that 1758 01:16:58,700 --> 01:17:00,540 sort of on an as-needed basis. 1759 01:17:00,540 --> 01:17:02,000 If you want to-- 1760 01:17:02,000 --> 01:17:06,370 basically it'll sort of depend on your team. 1761 01:17:06,370 --> 01:17:08,870 AUDIENCE: And the people who'll be helping out are you, us-- 1762 01:17:08,870 --> 01:17:11,490 STEVE: Yes, me and Tiago, who gave a lecture 1763 01:17:11,490 --> 01:17:13,477 earlier in the semester, and possibly also 1764 01:17:13,477 --> 01:17:14,810 a few other people from our lab. 1765 01:17:14,810 --> 01:17:18,680 But we'll be the main contact points for it. 1766 01:17:18,680 --> 01:17:20,520 So of course it should go without saying, 1767 01:17:20,520 --> 01:17:23,120 but all the team members should contribute equally 1768 01:17:23,120 --> 01:17:24,360 within your team. 1769 01:17:24,360 --> 01:17:27,253 So it'll maybe be less structured than the advanced lecture. 1770 01:17:27,253 --> 01:17:29,544 But just make sure that everyone's contributing equally 1771 01:17:29,544 --> 01:17:32,310 in the assignment. 1772 01:17:32,310 --> 01:17:35,480 And it's going to involve using this thing called the Robot 1773 01:17:35,480 --> 01:17:39,260 Operating System, or ROS, which is basically a software 1774 01:17:39,260 --> 01:17:42,719 framework for communication that's used a lot in robotics. 1775 01:17:42,719 --> 01:17:44,510 Just a quick show of hands, how many of you 1776 01:17:44,510 --> 01:17:46,305 have used ROS before? 1777 01:17:46,305 --> 01:17:48,350 Oh, wow, so a lot of you have used ROS. 1778 01:17:48,350 --> 01:17:50,660 How many have heard of ROS, if not used it? 1779 01:17:50,660 --> 01:17:51,860 OK, so a lot of people. 1780 01:17:51,860 --> 01:17:54,540 So that's a good starting point. 
1781 01:17:54,540 --> 01:17:58,220 So here are the proposed tasks for each group. 1782 01:17:58,220 --> 01:18:00,630 And of course, all of these are up for change. 1783 01:18:00,630 --> 01:18:03,075 If you guys want to change it, let me know. 1784 01:18:03,075 --> 01:18:04,450 Basically, so for some of the groups, 1785 01:18:04,450 --> 01:18:07,020 it's very clear how it immediately 1786 01:18:07,020 --> 01:18:09,250 applies to the Grand Challenge. 1787 01:18:09,250 --> 01:18:12,650 So incremental path planning-- well, we have a mobile robot, 1788 01:18:12,650 --> 01:18:15,030 so maybe we can actually run that on the robot 1789 01:18:15,030 --> 01:18:18,630 and get it to change its plan if something gets in the way. 1790 01:18:18,630 --> 01:18:20,840 The semantic localization group-- 1791 01:18:20,840 --> 01:18:23,920 obviously very applicable to the Grand Challenge. 1792 01:18:23,920 --> 01:18:25,910 You need to know where you are. 1793 01:18:25,910 --> 01:18:28,070 So the robot that we have now can actually 1794 01:18:28,070 --> 01:18:30,590 do normal metric localization. 1795 01:18:30,590 --> 01:18:32,152 So it can sort of know where it is. 1796 01:18:32,152 --> 01:18:33,610 But what will be interesting to see 1797 01:18:33,610 --> 01:18:36,150 is to compare the semantic localization to that one. 1798 01:18:36,150 --> 01:18:39,340 But how would you do semantic localization? 1799 01:18:39,340 --> 01:18:40,430 Well, we can use a camera. 1800 01:18:40,430 --> 01:18:43,110 And we can use the visual learning 1801 01:18:43,110 --> 01:18:45,484 through deep classification-- the visual classification 1802 01:18:45,484 --> 01:18:46,650 through deep learning group. 1803 01:18:46,650 --> 01:18:48,525 So I think these two groups would 1804 01:18:48,525 --> 01:18:50,920 have a really nice synergy and a really cool way 1805 01:18:50,920 --> 01:18:52,940 to work together. 
1806 01:18:52,940 --> 01:18:55,581 So the MCTS group is very cool, but I 1807 01:18:55,581 --> 01:18:57,747 had a little trouble thinking about exactly how that 1808 01:18:57,747 --> 01:18:58,340 would apply. 1809 01:18:58,340 --> 01:19:00,410 So maybe-- you look like you have an idea maybe. 1810 01:19:00,410 --> 01:19:02,180 AUDIENCE: So to clarify, all these 1811 01:19:02,180 --> 01:19:03,560 are separate runs of the robot? 1812 01:19:03,560 --> 01:19:05,185 STEVE: So we're actually probably going 1813 01:19:05,185 --> 01:19:06,660 to run all of them-- 1814 01:19:06,660 --> 01:19:09,820 so the grid is-- this would probably be one run of the robot. 1815 01:19:09,820 --> 01:19:13,180 We're going to decouple these so that if one or a subset of these 1816 01:19:13,180 --> 01:19:16,490 don't work super well, the other groups will still 1817 01:19:16,490 --> 01:19:17,180 be able to run. 1818 01:19:17,180 --> 01:19:20,780 So we're carefully planning that out too. 1819 01:19:20,780 --> 01:19:24,410 So the MCTS group, I was thinking, maybe 1820 01:19:24,410 --> 01:19:26,886 could solve one of those computational challenges-- 1821 01:19:26,886 --> 01:19:29,266 that could be a really nice way to do it. 1822 01:19:29,266 --> 01:19:30,890 Maybe you have to play against a human. 1823 01:19:30,890 --> 01:19:31,806 And if it wins, great. 1824 01:19:31,806 --> 01:19:33,590 You can go faster or you get more points. 1825 01:19:33,590 --> 01:19:36,855 So you could implement it on a different game 1826 01:19:36,855 --> 01:19:39,934 other than Connect Four, or possibly sort of wrap 1827 01:19:39,934 --> 01:19:41,814 [INAUDIBLE] around that to call it. 1828 01:19:41,814 --> 01:19:43,480 So the reachability group I think would also 1829 01:19:43,480 --> 01:19:46,110 be a great place to do one of these challenge stations. 1830 01:19:46,110 --> 01:19:48,050 So we could give you guys a puzzle, 1831 01:19:48,050 --> 01:19:51,334 say maybe it's some sort of maze-like state space. 
1832 01:19:51,334 --> 01:19:53,750 And you have to see, could we even reach the goal here? 1833 01:19:53,750 --> 01:19:55,625 And if you get the answer right, well, you can 1834 01:19:55,625 --> 01:19:58,420 move on to the next stage or get more points or something 1835 01:19:58,420 --> 01:19:58,920 like that. 1836 01:19:58,920 --> 01:20:00,919 AUDIENCE: But again, these are just suggestions. 1837 01:20:00,919 --> 01:20:02,836 So for example, they can use the reachability 1838 01:20:02,836 --> 01:20:04,195 for doing motion planning. 1839 01:20:04,195 --> 01:20:06,710 STEVE: Yeah, so if you guys have other suggestions on how 1840 01:20:06,710 --> 01:20:08,720 to implement your team's stuff in the Grand Challenge, 1841 01:20:08,720 --> 01:20:10,760 definitely send me an email, preferably today 1842 01:20:10,760 --> 01:20:12,810 or as soon as you think of things. 1843 01:20:12,810 --> 01:20:13,810 And we can change these. 1844 01:20:13,810 --> 01:20:18,470 These are just suggestions for right now. 1845 01:20:18,470 --> 01:20:21,390 The last two are sort of more planning related. 1846 01:20:21,390 --> 01:20:26,090 So planning with temporal logic, which was Monday's lecture, 1847 01:20:26,090 --> 01:20:28,190 I thought would be a cool way to sort of control 1848 01:20:28,190 --> 01:20:30,020 the robot's high-level actions. 1849 01:20:30,020 --> 01:20:32,800 So maybe that could involve modeling 1850 01:20:32,800 --> 01:20:35,135 our Grand Challenge with PDDL. 1851 01:20:35,135 --> 01:20:37,740 And maybe with linear temporal logic goals, 1852 01:20:37,740 --> 01:20:39,430 models that you get [INAUDIBLE]. 1853 01:20:39,430 --> 01:20:42,082 Then you could compile that, call a planner on it, 1854 01:20:42,082 --> 01:20:44,460 and execute that plan on the actual robot, 1855 01:20:44,460 --> 01:20:46,995 do [INAUDIBLE] high-level actions of the robot. 1856 01:20:46,995 --> 01:20:47,870 That's a possibility. 
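[The reachability puzzle described above -- "could we even reach the goal in this maze-like state space?" -- comes down to a graph search. A minimal sketch is shown below; the grid encoding (strings with '#' as walls) and the function name are assumptions for illustration, since the lecture doesn't specify a puzzle format.]

```python
from collections import deque

def reachable(grid, start, goal):
    """Breadth-first search over a grid maze.

    grid: list of equal-length strings, '#' = wall, anything else = free.
    start, goal: (row, col) tuples.
    Returns True if goal can be reached from start.
    """
    rows, cols = len(grid), len(grid[0])
    frontier = deque([start])
    seen = {start}
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == goal:
            return True
        # Expand the four grid neighbors that are in bounds and not walls.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#'
                    and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False
```

[For example, `reachable([".#", "#."], (0, 0), (1, 1))` is False, because both neighbors of the start cell are walls.]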
1857 01:20:47,870 --> 01:20:51,390 And for today's group, the infinite horizon 1858 01:20:51,390 --> 01:20:55,631 probabilistic planning, maybe you could actually do something 1859 01:20:55,631 --> 01:20:56,130 similar. 1860 01:20:56,130 --> 01:20:59,110 But instead of modeling the domain as PDDL, 1861 01:20:59,110 --> 01:21:02,055 model it as an MDP and solve it with LAO*, 1862 01:21:02,055 --> 01:21:05,180 and get sort of a policy on how to control the robot. 1863 01:21:05,180 --> 01:21:07,135 Maybe we can change it a little bit so 1864 01:21:07,135 --> 01:21:10,925 that certain squares can be more risky than others, or so on. 1865 01:21:10,925 --> 01:21:13,050 So again, there's flexibility in all of these here. 1866 01:21:13,050 --> 01:21:16,208 So these are just some of the suggestions. 1867 01:21:16,208 --> 01:21:18,085 Does anyone have any questions before we-- 1868 01:21:18,085 --> 01:21:19,065 this is all I have. 1869 01:21:19,065 --> 01:21:20,273 So anyone have any questions? 1870 01:21:20,273 --> 01:21:20,773 Yeah? 1871 01:21:21,991 --> 01:21:24,201 AUDIENCE: How are we working on [INAUDIBLE]? 1872 01:21:24,201 --> 01:21:26,656 Are you [INAUDIBLE]? 1873 01:21:26,656 --> 01:21:28,530 STEVE: So it's basically up to you 1874 01:21:28,530 --> 01:21:31,350 guys to really divide up the work amongst yourselves. 1875 01:21:31,350 --> 01:21:33,780 It's really different for every team. 1876 01:21:33,780 --> 01:21:35,074 So we're-- 1877 01:21:35,074 --> 01:21:35,740 AUDIENCE: Sorry. 1878 01:21:35,740 --> 01:21:37,406 So each team is developing [INAUDIBLE]. 1879 01:21:37,406 --> 01:21:38,241 STEVE: Right, yeah. 1880 01:21:38,241 --> 01:21:40,532 Each team is really in their own separate package. 1881 01:21:40,532 --> 01:21:43,110 For example, these two teams, there's 1882 01:21:43,110 --> 01:21:45,210 probably going to be a common interface where 1883 01:21:45,210 --> 01:21:47,419 the output of this one goes to the input of that one. 
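[The suggestion above -- model the maze as an MDP with some squares riskier than others, and solve for a policy -- can be sketched with plain value iteration. LAO* is the heuristic-search solver named in the lecture; value iteration stands in here only because it fits in a few lines. The grid size, slip probability, and cost numbers are all made-up parameters for illustration.]

```python
def value_iteration(rows=3, cols=3, goal=(2, 2), risky=frozenset({(1, 1)}),
                    slip=0.2, step_cost=1.0, risk_cost=10.0,
                    gamma=0.95, tol=1e-6):
    """Solve a tiny grid MDP for a minimum-expected-cost policy.

    States are grid cells. Each move succeeds with probability 1 - slip
    and slips (robot stays put) with probability slip. Entering a cell
    in `risky` incurs an extra cost, so the optimal policy detours.
    Returns (V, policy): cost-to-go per state, best move per state.
    """
    states = [(r, c) for r in range(rows) for c in range(cols)]
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def q_value(V, s, move):
        nr, nc = s[0] + move[0], s[1] + move[1]
        if not (0 <= nr < rows and 0 <= nc < cols):
            return None  # move would leave the grid
        cost = step_cost + (risk_cost if (nr, nc) in risky else 0.0)
        # Expected cost: succeed with prob 1-slip, stay put with prob slip.
        return cost + gamma * ((1 - slip) * V[(nr, nc)] + slip * V[s])

    V = {s: 0.0 for s in states}
    while True:  # Bellman backups until values stop changing
        delta = 0.0
        for s in states:
            if s == goal:
                continue  # goal is absorbing with zero cost
            best = min(q for m in moves if (q := q_value(V, s, m)) is not None)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break

    # Extract the greedy policy: the lowest-expected-cost move per state.
    policy = {}
    for s in states:
        if s == goal:
            continue
        qs = [(q, m) for m in moves if (q := q_value(V, s, m)) is not None]
        policy[s] = min(qs)[1]
    return V, policy
```

[With the defaults, the policy routes around the risky center square rather than through it -- the same behavior you would want from the Grand Challenge robot when some squares are dangerous.]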
1884 01:21:47,419 --> 01:21:49,085 So for that one, we're going to give you 1885 01:21:49,085 --> 01:21:50,710 sort of the interface [INAUDIBLE] 1886 01:21:50,710 --> 01:21:52,082 message type package for that. 1887 01:21:52,082 --> 01:21:54,690 But other than that, like for dividing up the work, 1888 01:21:54,690 --> 01:21:56,600 you guys [INAUDIBLE] that. 1889 01:21:56,600 --> 01:22:00,280 PROFESSOR 2: So your pain will be integration 1890 01:22:00,280 --> 01:22:05,238 if you're trying to [INAUDIBLE] pieces the group is doing. 1891 01:22:05,238 --> 01:22:08,230 That, we think, is a thing that uses software engineering 1892 01:22:08,230 --> 01:22:10,860 skills [INAUDIBLE]. 1893 01:22:10,860 --> 01:22:14,010 It can be really unpleasant to do and take a lot of time. 1894 01:22:14,010 --> 01:22:15,970 So we don't want you to have that experience. 1895 01:22:15,970 --> 01:22:19,300 So that's why you can... 1896 01:22:19,300 --> 01:22:21,081 What we wanted you to do 1897 01:22:21,081 --> 01:22:23,330 is to be able to get your own capability [INAUDIBLE]. 1898 01:22:25,860 --> 01:22:29,230 If you guys choose to integrate with other teams 1899 01:22:29,230 --> 01:22:31,290 because you're really excited about that, 1900 01:22:31,290 --> 01:22:33,020 and because it looks like the people 1901 01:22:33,020 --> 01:22:34,914 that you're working with [INAUDIBLE], 1902 01:22:34,914 --> 01:22:37,910 then that's purely your choice. 1903 01:22:37,910 --> 01:22:38,810 Is that fair enough? 1904 01:22:38,810 --> 01:22:39,310 STEVE: Sure. 1905 01:22:39,310 --> 01:22:40,010 It's fine. 1906 01:22:40,010 --> 01:22:41,210 Any more questions? 1907 01:22:44,910 --> 01:22:45,410 OK. 1908 01:22:45,410 --> 01:22:46,310 Sounds great. 1909 01:22:46,310 --> 01:22:49,360 [APPLAUSE]