1 00:00:01,580 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,340 Commons license. 3 00:00:05,340 --> 00:00:07,550 Your support will help MIT OpenCourseWare 4 00:00:07,550 --> 00:00:11,640 continue to offer high quality educational resources for free. 5 00:0011,640 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,110 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,110 --> 00:00:19,340 at ocw.MIT.edu. 8 00:00:23,110 --> 00:00:25,150 PROFESSOR 1: So as he gives an introduction, 9 00:00:25,150 --> 00:00:27,670 myself along with Catherine, David, [INAUDIBLE], 10 00:00:27,670 --> 00:00:29,336 and Charlotte [? are going to present ?] 11 00:00:29,336 --> 00:00:31,735 to you guys Probabilistic and Infinite Horizon Planning. 12 00:00:31,735 --> 00:00:34,360 To give you a brief overview of what we're going to talk about, 13 00:00:34,360 --> 00:00:36,740 we're going to start with the quadrotor motivating example. 14 00:00:36,740 --> 00:00:39,005 We're going to move into planning with Markov decision 15 00:00:39,005 --> 00:00:42,070 processes, give you a little bit about value iteration 16 00:00:42,070 --> 00:00:44,240 before discussing heuristic-guided solvers. 17 00:00:44,240 --> 00:00:47,130 And we're going to go into the more stochastic case, partially 18 00:00:47,130 --> 00:00:48,715 observable Markov decision processes, 19 00:00:48,715 --> 00:00:52,890 and operating in belief space. 20 00:00:52,890 --> 00:00:56,560 So we very often now see quadrotor motion planning 21 00:00:56,560 --> 00:01:00,029 as a problem, given, for example, with the Amazon 22 00:01:00,029 --> 00:01:00,820 fulfillment center. 23 00:01:00,820 --> 00:01:03,085 We start with a goal configuration, start 24 00:01:03,085 --> 00:01:05,562 configuration, set of actions we can take, 25 00:01:05,562 --> 00:01:09,090 and some type of reward function or cost function.
26 00:01:09,090 --> 00:01:11,665 So for instance, if we have a quadrotor starting 27 00:01:11,665 --> 00:01:13,040 at the Amazon fulfillment center, 28 00:01:13,040 --> 00:01:14,849 and we want to get to 77 Mass Ave, 29 00:01:14,849 --> 00:01:17,390 and let's say we want to take the shortest path to get there, 30 00:01:17,390 --> 00:01:19,244 we would follow the red dashed line. 31 00:01:19,244 --> 00:01:21,900 But, as we can see, it comes very close to these obstacles. 32 00:01:21,900 --> 00:01:23,792 So we're looking at a very high risk 33 00:01:23,792 --> 00:01:25,000 of mission failure, crashes. 34 00:01:25,000 --> 00:01:27,010 If there's any uncertainty in its path, 35 00:01:27,010 --> 00:01:29,440 we're going to have a problem. 36 00:01:29,440 --> 00:01:32,535 So one of the ways that we can compensate for that is we 37 00:01:32,535 --> 00:01:34,305 can also plan a green path, which 38 00:01:34,305 --> 00:01:37,632 is adjusted to give us a little bit of space. 39 00:01:37,632 --> 00:01:42,078 Are there any more questions? 40 00:01:42,078 --> 00:01:45,300 So as you can see, the level of uncertainty 41 00:01:45,300 --> 00:01:47,720 allows us to determine how easy or difficult the problem's 42 00:01:47,720 --> 00:01:48,620 going to be to solve. 43 00:01:48,620 --> 00:01:50,940 On the easier side we have deterministic dynamics 44 00:01:50,940 --> 00:01:52,760 and deterministic sensors. 45 00:01:52,760 --> 00:01:54,180 In this case our actions are going 46 00:01:54,180 --> 00:01:56,420 to be executed as commanded, and our sensors 47 00:01:56,420 --> 00:01:58,170 are going to tell us exactly where we are. 48 00:01:58,170 --> 00:02:00,336 This will be something like dead reckoning validated 49 00:02:00,336 --> 00:02:02,970 through sensing, very redundant. 50 00:02:02,970 --> 00:02:05,350 If we move to a slightly more difficult case, 51 00:02:05,350 --> 00:02:06,940 we have deterministic dynamics.
52 00:02:06,940 --> 00:02:08,740 Our commands are still being executed, 53 00:02:08,740 --> 00:02:10,259 but maybe we have a noisy camera. 54 00:02:10,259 --> 00:02:12,050 So we have some uncertainty in the sensors. 55 00:02:12,050 --> 00:02:15,445 These are the cases where we would see dead reckoning, 56 00:02:15,445 --> 00:02:17,320 but we would compensate with Kalman filtering 57 00:02:17,320 --> 00:02:19,350 to get rid of that noise level. 58 00:02:19,350 --> 00:02:20,160 Now, down at the-- 59 00:02:20,160 --> 00:02:20,660 yes? 60 00:02:20,660 --> 00:02:22,952 AUDIENCE: Sorry, what is dead reckoning? 61 00:02:22,952 --> 00:02:24,550 PROFESSOR 1: It's essentially-- 62 00:02:24,550 --> 00:02:26,749 you are saying I want to execute this action, 63 00:02:26,749 --> 00:02:28,290 and it's going to execute it exactly. 64 00:02:28,290 --> 00:02:28,831 AUDIENCE: OK. 65 00:02:31,199 --> 00:02:32,740 PROFESSOR 1: Then towards the bottom, 66 00:02:32,740 --> 00:02:34,930 we have stochastic dynamics and deterministic sensors. 67 00:02:34,930 --> 00:02:36,846 So in this case maybe there's some uncertainty 68 00:02:36,846 --> 00:02:37,790 in our actions. 69 00:02:37,790 --> 00:02:39,750 But we can validate through sensing. 70 00:02:39,750 --> 00:02:42,340 This is where we're going to spend the next section 71 00:02:42,340 --> 00:02:45,070 on Markov decision processes. 72 00:02:45,070 --> 00:02:46,695 And then briefly, later in the lecture, 73 00:02:46,695 --> 00:02:49,069 we're going to talk about this most difficult case, which 74 00:02:49,069 --> 00:02:51,415 is the stochastic dynamics and stochastic sensors. 75 00:02:51,415 --> 00:02:55,189 Our execution, maybe, maybe not; our sensors, maybe a little 76 00:02:55,189 --> 00:02:55,730 bit of noise. 77 00:02:55,730 --> 00:02:58,080 And this is where we see partially observable 78 00:02:58,080 --> 00:03:01,175 Markov decision processes.
79 00:03:01,175 --> 00:03:03,300 PROFESSOR 2: Can we make just a brief clarification 80 00:03:03,300 --> 00:03:04,220 of the dead reckoning? 81 00:03:04,220 --> 00:03:05,760 So dead reckoning is where you estimate 82 00:03:05,760 --> 00:03:07,426 your position using a probabilistic model, 83 00:03:07,426 --> 00:03:09,290 but you don't use any observations. 84 00:03:09,290 --> 00:03:10,700 That's the thing. 85 00:03:13,684 --> 00:03:14,350 PROFESSOR 1: OK. 86 00:03:14,350 --> 00:03:16,475 So we talked a little bit about action uncertainty, 87 00:03:16,475 --> 00:03:17,850 where we're going to focus. 88 00:03:17,850 --> 00:03:19,870 And this is a case where, for instance, 89 00:03:19,870 --> 00:03:24,710 even if you tell a quadrotor to stay in its place 90 00:03:24,710 --> 00:03:26,320 and a gust of wind comes by, it's not 91 00:03:26,320 --> 00:03:27,610 going to stay in that same place, right? 92 00:03:27,610 --> 00:03:29,443 It's going to have a little bit of movement. 93 00:03:29,443 --> 00:03:31,697 And we can see, this causes a disparity, 94 00:03:31,697 --> 00:03:33,280 where you have a commanded trajectory 95 00:03:33,280 --> 00:03:37,220 and your actual trajectory, and then you need to associate the two. 96 00:03:37,220 --> 00:03:39,730 So in order to compensate for that, 97 00:03:39,730 --> 00:03:41,800 we want to model things with that uncertainty 98 00:03:41,800 --> 00:03:44,090 or else we have these riskier situations where 99 00:03:44,090 --> 00:03:45,480 we command to follow some line. 100 00:03:45,480 --> 00:03:46,480 We don't incorporate the uncertainty, 101 00:03:46,480 --> 00:03:47,680 and we see a crash. 102 00:03:51,159 --> 00:03:54,720 So this allows us to introduce Planning with Markov Decision 103 00:03:54,720 --> 00:03:55,680 Processes. 104 00:03:58,560 --> 00:04:01,980 So MDPs have a set of states-- 105 00:04:01,980 --> 00:04:04,370 some actions you can take-- a transition model.
106 00:04:04,370 --> 00:04:06,010 So essentially, what's the probability 107 00:04:06,010 --> 00:04:09,442 of reaching some state if you take an action? 108 00:04:09,442 --> 00:04:12,215 An immediate reward function and a discount factor. 109 00:04:12,215 --> 00:04:13,590 This discount factor is important 110 00:04:13,590 --> 00:04:15,007 because it allows us to prioritize 111 00:04:15,007 --> 00:04:16,590 gaining an immediate reward as opposed 112 00:04:16,590 --> 00:04:17,839 to an uncertain future reward. 113 00:04:17,839 --> 00:04:21,250 So the concept of, a bird in the hand is worth two in the bush. 114 00:04:21,250 --> 00:04:23,140 Now we want to find an optimum policy 115 00:04:23,140 --> 00:04:24,850 that will essentially map an action-- 116 00:04:24,850 --> 00:04:27,120 the best action-- to each state. 117 00:04:27,120 --> 00:04:28,750 And what we hope to get from this 118 00:04:28,750 --> 00:04:31,040 is a maximized expected lifetime reward. 119 00:04:31,040 --> 00:04:33,820 So we want to maximize the cumulative reward we get over 120 00:04:33,820 --> 00:04:37,054 time. 121 00:04:37,054 --> 00:04:38,580 So let's walk through an example. 122 00:04:38,580 --> 00:04:41,925 If we have a quadrotor with a perfect sensor, and let's 123 00:04:41,925 --> 00:04:44,240 put it in this environment [INAUDIBLE] 7x7 grid. 124 00:04:44,240 --> 00:04:48,553 Our set of states are obviously [INAUDIBLE] space in them. 125 00:04:48,553 --> 00:04:51,199 Can anybody tell me what some of the actions might be? 126 00:04:51,199 --> 00:04:55,944 AUDIENCE: [INAUDIBLE] 127 00:04:55,944 --> 00:04:57,110 PROFESSOR 1: (LAUGHING) Yep. 128 00:04:57,110 --> 00:04:58,150 Up, down, left, right. 129 00:04:58,150 --> 00:05:00,150 In this case, we call them North, South, East, West, 130 00:05:00,150 --> 00:05:00,660 or Null. 131 00:05:03,202 --> 00:05:04,910 The next thing we need for this example-- 132 00:05:04,910 --> 00:05:07,260 we arbitrarily gave ourselves a transition probability.
133 00:05:07,260 --> 00:05:09,800 So we said that you have a 50% chance of following 134 00:05:09,800 --> 00:05:12,185 your commanded action with a 25% chance of moving 135 00:05:12,185 --> 00:05:14,596 to the left or the right. 136 00:05:14,596 --> 00:05:16,675 Next we have the reward function. 137 00:05:16,675 --> 00:05:18,050 And again, we arbitrarily decided 138 00:05:18,050 --> 00:05:19,870 that we want it to be a reward. 139 00:05:19,870 --> 00:05:24,131 If you get to the state (6,5), the increment [INAUDIBLE] 140 00:05:24,131 --> 00:05:27,610 AUDIENCE: Alicia, you had [INAUDIBLE] left or the right, 141 00:05:27,610 --> 00:05:32,654 is that clockwise or counterclockwise? 142 00:05:32,654 --> 00:05:34,590 Let's say you had planned to go to the right, 143 00:05:34,590 --> 00:05:39,430 would that mean you have a 75% [INAUDIBLE] or 25% chance of 144 00:05:39,430 --> 00:05:39,930 [INAUDIBLE]? 145 00:05:39,930 --> 00:05:42,080 What if you just waited [INAUDIBLE]? 146 00:05:42,080 --> 00:05:44,080 PROFESSOR 1: We said clockwise, counterclockwise 147 00:05:44,080 --> 00:05:46,132 from the intended direction of action. 148 00:05:46,132 --> 00:05:46,873 AUDIENCE: Thanks. 149 00:05:46,873 --> 00:05:48,110 PROFESSOR 1: Uh-huh. 150 00:05:48,110 --> 00:05:52,170 And finally, we give ourselves a discount factor of 0.9. 151 00:05:52,170 --> 00:05:54,800 So let's assume for a second that we 152 00:05:54,800 --> 00:05:56,080 have our optimal policy. 153 00:05:56,080 --> 00:05:58,720 And let's say that our optimal policy says, from this state, 154 00:05:58,720 --> 00:06:00,100 we want to take the action North. 155 00:06:00,100 --> 00:06:00,520 Right? 156 00:06:00,520 --> 00:06:02,061 As we discussed, we have a 50% chance 157 00:06:02,061 --> 00:06:06,260 of going North and a 25% chance of going to the left or right. 158 00:06:06,260 --> 00:06:10,650 So after that time step, these are the possible states 159 00:06:10,650 --> 00:06:11,820 that we could end up in.
160 00:06:11,820 --> 00:06:12,930 Right? 161 00:06:12,930 --> 00:06:15,757 So now let's assume for a second that we 162 00:06:15,757 --> 00:06:17,080 can take our next action. 163 00:06:17,080 --> 00:06:19,150 And our next action says, go North. 164 00:06:19,150 --> 00:06:21,410 Again, we have the same probability distribution. 165 00:06:21,410 --> 00:06:23,220 And these are the states we could end up 166 00:06:23,220 --> 00:06:24,221 in after two time steps. 167 00:06:24,221 --> 00:06:26,428 We can see that this starts getting very complicated, 168 00:06:26,428 --> 00:06:27,060 right? 169 00:06:27,060 --> 00:06:30,340 And there are increasing amounts of uncertainty. 170 00:06:30,340 --> 00:06:33,300 So does anybody have any ideas on how we could 171 00:06:33,300 --> 00:06:35,192 collapse this distribution? 172 00:06:35,192 --> 00:06:37,150 Keeping in mind that our sensors, at this point, 173 00:06:37,150 --> 00:06:40,075 are deterministic. 174 00:06:40,075 --> 00:06:40,803 Yep. 175 00:06:40,803 --> 00:06:41,886 AUDIENCE: Fly to a corner? 176 00:06:41,886 --> 00:06:42,844 PROFESSOR 1: I'm sorry. 177 00:06:42,844 --> 00:06:44,010 AUDIENCE: Fly to a corner? 178 00:06:44,010 --> 00:06:45,416 PROFESSOR 1: We could do that. 179 00:06:45,416 --> 00:06:47,624 AUDIENCE: You just sense how far away the red box is. 180 00:06:47,624 --> 00:06:50,440 PROFESSOR 1: We could do that. 181 00:06:50,440 --> 00:06:51,985 AUDIENCE: Quick comment. 182 00:06:51,985 --> 00:06:53,982 So the blue states don't actually sum to one. 183 00:06:53,982 --> 00:06:54,940 PROFESSOR 1: I'm sorry. 184 00:06:54,940 --> 00:06:57,255 AUDIENCE: So the problem is they don't actually sum to one. 185 00:06:57,255 --> 00:06:57,880 AUDIENCE: Yeah. 186 00:06:57,880 --> 00:07:00,680 The issue is if you went to (1,3) 187 00:07:00,680 --> 00:07:02,958 and then you transitioned to the left, that's where 188 00:07:02,958 --> 00:07:05,974 the 0.0625 is coming from.
189 00:07:05,974 --> 00:07:08,390 AUDIENCE: [INAUDIBLE] then, shouldn't those numbers always 190 00:07:08,390 --> 00:07:09,100 sum to 1-- 191 00:07:09,100 --> 00:07:10,730 your probability distribution should always sum to 1? 192 00:07:10,730 --> 00:07:11,438 PROFESSOR 1: Yep. 193 00:07:11,438 --> 00:07:12,880 AUDIENCE: So it's just a point. 194 00:07:12,880 --> 00:07:14,609 It's just off the screen. 195 00:07:14,609 --> 00:07:16,900 PROFESSOR 1: We would add another section to the screen 196 00:07:16,900 --> 00:07:19,540 and just move the grid over. 197 00:07:19,540 --> 00:07:22,299 We just cut it off for graphics. 198 00:07:22,299 --> 00:07:22,840 AUDIENCE: OK. 199 00:07:25,130 --> 00:07:25,880 PROFESSOR 1: Yeah. 200 00:07:25,880 --> 00:07:29,740 So those are all great points. 201 00:07:29,740 --> 00:07:32,400 The easiest way to do it is just take an observation. 202 00:07:32,400 --> 00:07:34,689 So at this point we say, after our first time step, 203 00:07:34,689 --> 00:07:36,980 we weren't sure which of these three states we were in. 204 00:07:36,980 --> 00:07:38,438 So we took an observation and said, 205 00:07:38,438 --> 00:07:41,290 wait, we're actually here with complete certainty. 206 00:07:41,290 --> 00:07:43,350 So to make this a little bit clearer, 207 00:07:43,350 --> 00:07:44,515 we're going to look at it from a tree view. 208 00:07:44,515 --> 00:07:44,830 Right? 209 00:07:44,830 --> 00:07:45,910 We said that we started at a state. 210 00:07:45,910 --> 00:07:46,770 We took an action. 211 00:07:46,770 --> 00:07:49,895 And these are the possible states we could have ended up in. 212 00:07:49,895 --> 00:07:52,940 We're going to collapse this by taking this observation. 213 00:07:52,940 --> 00:07:54,605 And now we have complete certainty here. 214 00:07:54,605 --> 00:07:58,250 And we take our next action and see 215 00:07:58,250 --> 00:08:00,880 that we have moved out here.
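The predict-then-observe cycle in this exchange can be sketched in a few lines, and the two-step belief even reproduces the 0.0625 corner mass mentioned above. This is our own illustrative code, not anything from the lecture materials; the function names and the go-North slip model are assumptions, and we keep the off-screen states rather than cutting them off, so probabilities always sum to one.

```python
# Hypothetical sketch of belief propagation with a perfect sensor.
# Assumed action model: commanding North moves up with probability 0.5
# and slips one cell east or west with probability 0.25 each.

def transition_north(state):
    """One commanded-North step: {next_state: probability}."""
    x, y = state
    return {(x, y + 1): 0.5, (x + 1, y): 0.25, (x - 1, y): 0.25}

def predict(belief):
    """Propagate a belief {state: prob} one step through the dynamics."""
    new_belief = {}
    for state, p in belief.items():
        for nxt, p_t in transition_north(state).items():
            new_belief[nxt] = new_belief.get(nxt, 0.0) + p * p_t
    return new_belief

def observe(belief, observed_state):
    """A deterministic sensor collapses the whole belief to one state."""
    return {observed_state: 1.0}
```

Two predictions from a known start spread the belief over six states, including corners with mass 0.25 x 0.25 = 0.0625; a single observation then collapses the belief back to a point, which is why the distribution never grows without bound.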
216 00:08:00,880 --> 00:08:04,500 So this allows us to basically ignore the history of states. 217 00:08:04,500 --> 00:08:09,020 We have the same percentage probability at each time step. 218 00:08:09,020 --> 00:08:12,240 This will be really useful in completely collapsing 219 00:08:12,240 --> 00:08:16,290 the distribution every single time you take an observation. 220 00:08:16,290 --> 00:08:19,170 Anybody have any questions on that? 221 00:08:19,170 --> 00:08:21,090 OK. 222 00:08:21,090 --> 00:08:24,030 So let's go back now and figure out how we came up 223 00:08:24,030 --> 00:08:25,340 with this optimal policy. 224 00:08:25,340 --> 00:08:28,156 The way we do that is through dynamic programming. 225 00:08:28,156 --> 00:08:29,280 There are two different ways. 226 00:08:29,280 --> 00:08:30,825 You can do it either through value iteration or policy 227 00:08:30,825 --> 00:08:31,325 iteration. 228 00:08:31,325 --> 00:08:33,836 For this lecture, we're going to focus on value iteration. 229 00:08:37,780 --> 00:08:39,590 So let's take this same example we had. 230 00:08:39,590 --> 00:08:42,159 We still want to maximize the expected reward. 231 00:08:42,159 --> 00:08:44,250 And so to start, we're going to initialize 232 00:08:44,250 --> 00:08:46,326 the values of each state to 0. 233 00:08:49,284 --> 00:08:51,750 Let's, for example, start at t equals 0. 234 00:08:51,750 --> 00:08:53,265 We're going to focus on state (6,5). 235 00:08:53,265 --> 00:08:54,640 And we're going to say that we're 236 00:08:54,640 --> 00:08:59,002 going to take the Null action to start with. 237 00:08:59,002 --> 00:09:00,710 From there you can see, with the probability 238 00:09:00,710 --> 00:09:04,840 distribution for (6,5), we have a 50% chance of staying. 239 00:09:04,840 --> 00:09:07,829 We have a 25% chance of going to (5,5). 240 00:09:07,829 --> 00:09:12,230 And a 25% chance of going to (7,5).
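The slip model just described -- a 50% chance of the commanded motion and a 25% chance of slipping to either side of it -- can be sketched directly. This is our own illustrative code (the names are ours, not the lecture's); in particular, we keep off-grid successors as ordinary states, matching the "just move the grid over" remark from the earlier question.

```python
# Hypothetical sketch of the lecture's slip model.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0), "Null": (0, 0)}

# Slip directions: clockwise/counterclockwise of the commanded direction.
# For Null we use the east/west neighbors, matching the (5,5)/(7,5) example.
SLIPS = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S"),
         "Null": ("E", "W")}

def transition(state, action):
    """Return {next_state: probability}: 50% commanded move, 25% per slip.

    Off-grid successors are kept as ordinary states ("just move the grid
    over"), so the probabilities always sum to one.
    """
    x, y = state
    dist = {}
    for direction, p in [(action, 0.5), (SLIPS[action][0], 0.25),
                         (SLIPS[action][1], 0.25)]:
        dx, dy = MOVES[direction]
        nxt = (x + dx, y + dy)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist
```

For the Null action at (6,5) this reproduces the distribution just stated: 0.5 on (6,5), and 0.25 each on (5,5) and (7,5).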
241 00:09:12,230 --> 00:09:14,090 The next item of information is the values 242 00:09:14,090 --> 00:09:16,090 at each of those states that we could end up in. 243 00:09:16,090 --> 00:09:19,502 And currently, they're all set to 0. 244 00:09:19,502 --> 00:09:21,570 And finally, the most important part 245 00:09:21,570 --> 00:09:23,665 here will be with the reward [INAUDIBLE] 246 00:09:23,665 --> 00:09:26,100 (6,5), taking the Null action, regardless of what state 247 00:09:26,100 --> 00:09:28,771 you end up in, is going to be 1, as we defined in our initial set 248 00:09:28,771 --> 00:09:29,271 up. 249 00:09:31,774 --> 00:09:35,270 So let's see how we would calculate the next time step 250 00:09:35,270 --> 00:09:37,269 value of the state. 251 00:09:37,269 --> 00:09:39,060 You'd start by taking the probabilities, right? 252 00:09:42,026 --> 00:09:43,900 And from there, we would add the reward here. 253 00:09:43,900 --> 00:09:45,580 And because we said that the reward does 254 00:09:45,580 --> 00:09:47,746 not depend on the state we end up in, [INAUDIBLE] 255 00:09:47,746 --> 00:09:50,845 should be across all three probabilities. 256 00:09:50,845 --> 00:09:54,200 From there, we're going to add in the discounted value 257 00:09:54,200 --> 00:09:58,475 that we had from before. 258 00:09:58,475 --> 00:10:00,970 So a way to look at this in a more generalized form 259 00:10:00,970 --> 00:10:03,474 is to say that across all the states that you can end up in, 260 00:10:03,474 --> 00:10:05,640 you're going to look at the probability of ending up 261 00:10:05,640 --> 00:10:06,245 in that state. 262 00:10:06,245 --> 00:10:07,911 You're going to multiply that by the sum 263 00:10:07,911 --> 00:10:10,788 of the reward and the discounted lifetime value that it has. 264 00:10:14,600 --> 00:10:18,010 So we want to make sure that we're getting the best 265 00:10:18,010 --> 00:10:18,720 possible values.
266 00:10:18,720 --> 00:10:21,090 So we need to incorporate all the other actions that we 267 00:10:21,090 --> 00:10:24,064 can take from that state. 268 00:10:24,064 --> 00:10:25,480 So what's going to happen is we're 269 00:10:25,480 --> 00:10:26,660 going to take that general formula, 270 00:10:26,660 --> 00:10:29,076 we're going to repeat it over all of the possible actions. 271 00:10:29,076 --> 00:10:31,350 And then we're going to take the maximum of that. 272 00:10:31,350 --> 00:10:34,830 So, for this example, this state is very easy. 273 00:10:34,830 --> 00:10:37,490 All of the actions are the same for this case. 274 00:10:37,490 --> 00:10:39,465 So we go fast, and we say we get a value of 1. 275 00:10:39,465 --> 00:10:41,186 And we update it as shown in the graph. 276 00:10:45,170 --> 00:10:48,210 This gives us what's called the Value [INAUDIBLE] Backup-- 277 00:10:48,210 --> 00:10:49,560 or Update-- equation. 278 00:10:49,560 --> 00:10:50,610 This will be really important because it 279 00:10:50,610 --> 00:10:52,193 reaches across the entire state space 280 00:10:52,193 --> 00:10:54,073 and allows us to provide a history. 281 00:10:56,780 --> 00:10:58,170 So what this would end up looking 282 00:10:58,170 --> 00:11:00,003 like is we're going to iteratively calculate 283 00:11:00,003 --> 00:11:02,370 the values across the entire state space. So at t0, 284 00:11:02,370 --> 00:11:05,320 we determine that all the values are 0. 285 00:11:05,320 --> 00:11:09,630 At t1, (6,5) gets a value of 1, and at t2, we 286 00:11:09,630 --> 00:11:13,106 see that value start to propagate out. 287 00:11:13,106 --> 00:11:14,094 Make sense so far? 288 00:11:17,044 --> 00:11:19,460 So the way this works is you would repeat those iterations 289 00:11:19,460 --> 00:11:21,835 until your changes in value become what we would consider 290 00:11:21,835 --> 00:11:24,043 small enough, which would indicate your approximation 291 00:11:24,043 --> 00:11:25,620 is close enough to the real value.
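The backup just described -- probability times the sum of reward and discounted value, maximized over actions -- can be written out as a small value-iteration loop. The following is a sketch under our own assumptions (a 7x7 grid indexed 0 to 6, walls that clamp off-grid moves back onto the grid, a reward of 1 for the Null action at (6,5), and discount 0.9); it is illustrative code, not the lecture's actual implementation.

```python
# Hypothetical sketch of value iteration on the 7x7 grid example.
GAMMA = 0.9
STATES = [(x, y) for x in range(7) for y in range(7)]
ACTIONS = ["N", "S", "E", "W", "Null"]
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0), "Null": (0, 0)}
SLIPS = {"N": ("E", "W"), "S": ("E", "W"), "E": ("N", "S"), "W": ("N", "S"),
         "Null": ("E", "W")}

def clamp(x, y):
    # Boundary handling is our assumption: off-grid moves stay on the edge.
    return (min(max(x, 0), 6), min(max(y, 0), 6))

def transition(s, a):
    """50% commanded move, 25% per perpendicular slip."""
    dist = {}
    for d, p in [(a, 0.5), (SLIPS[a][0], 0.25), (SLIPS[a][1], 0.25)]:
        dx, dy = MOVES[d]
        nxt = clamp(s[0] + dx, s[1] + dy)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

def reward(s, a):
    # Reward 1 for Null at (6,5), regardless of the state you end up in.
    return 1.0 if s == (6, 5) and a == "Null" else 0.0

def bellman_backup(V, s):
    # max over actions of: sum_s' T(s,a,s') * (R(s,a) + gamma * V(s'))
    return max(sum(p * (reward(s, a) + GAMMA * V[nxt])
                   for nxt, p in transition(s, a).items())
               for a in ACTIONS)

def value_iteration(eps=1e-6):
    """Sweep the whole state space until values change less than eps."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: bellman_backup(V, s) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < eps:
            return V_new
        V = V_new

def extract_policy(V):
    """Greedy policy: the argmax action inside the Bellman equation."""
    return {s: max(ACTIONS, key=lambda a: sum(
                p * (reward(s, a) + GAMMA * V[nxt])
                for nxt, p in transition(s, a).items()))
            for s in STATES}
```

The first sweep assigns value 1 to (6,5) and leaves everything else at 0, matching the t0/t1 grids described above, and later sweeps propagate that value outward.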
292 00:11:25,620 --> 00:11:27,989 From there, you would extract the optimal policy 293 00:11:27,989 --> 00:11:29,030 from the lifetime values. 294 00:11:29,030 --> 00:11:31,155 So you see the [INAUDIBLE] in the Bellman equation. 295 00:11:31,155 --> 00:11:33,130 And you're now just taking the action from it. 296 00:11:33,130 --> 00:11:36,140 And you would map those actions to your states. 297 00:11:36,140 --> 00:11:38,792 An example of what this might look like propagating out-- 298 00:11:38,792 --> 00:11:42,310 if you say blue was the reward and red is an obstacle, 299 00:11:42,310 --> 00:11:42,970 et cetera-- 300 00:11:42,970 --> 00:11:44,970 you can see, as that value propagates out, 301 00:11:44,970 --> 00:11:48,425 you start seeing your policy by the arrows. 302 00:11:48,425 --> 00:11:50,381 Anybody have any questions? 303 00:11:53,804 --> 00:11:56,370 So one last thing to mention about this, though, 304 00:11:56,370 --> 00:11:57,870 is the complexity for each iteration 305 00:11:57,870 --> 00:12:00,180 is dependent on the size of the state space squared 306 00:12:00,180 --> 00:12:02,112 and the number of actions you can take. 307 00:12:02,112 --> 00:12:04,470 So you can imagine, as your state space expands or you 308 00:12:04,470 --> 00:12:07,690 gather more actions, it gets very complex, 309 00:12:07,690 --> 00:12:09,810 in which case value iteration becomes 310 00:12:09,810 --> 00:12:12,480 very time-intensive and costly. 311 00:12:12,480 --> 00:12:16,897 So this allows us to move into the Heuristic-Guided solvers. 312 00:12:16,897 --> 00:12:17,730 AUDIENCE: Thank you. 313 00:12:17,730 --> 00:12:20,383 [INAUDIBLE] transition, one quick question. 314 00:12:20,383 --> 00:12:23,625 Can I just get a show of hands-- how many of you 315 00:12:23,625 --> 00:12:26,247 have learned value iteration before versus how many of you 316 00:12:26,247 --> 00:12:27,580 have seen it for the first time? 317 00:12:27,580 --> 00:12:30,770 So how many have seen it before?
318 00:12:30,770 --> 00:12:31,270 OK. 319 00:12:31,270 --> 00:12:34,546 And then, how many is this their first time? 320 00:12:34,546 --> 00:12:37,312 [INAUDIBLE] 321 00:12:37,312 --> 00:12:39,186 PROFESSOR 3: Any questions on value iteration 322 00:12:39,186 --> 00:12:40,120 before we jump in? 323 00:12:40,120 --> 00:12:42,642 It's going to be an essential part of how we're going 324 00:12:42,642 --> 00:12:43,912 to do [INAUDIBLE] solvers. 325 00:12:43,912 --> 00:12:45,816 Anyone who hasn't [INAUDIBLE]. 326 00:12:49,640 --> 00:12:52,380 So the most important thing we said about value iteration 327 00:12:52,380 --> 00:12:53,990 is that it's super slow. 328 00:12:53,990 --> 00:12:56,880 It's going to have to go over every possible state 329 00:12:56,880 --> 00:12:59,660 and every possible action that it can take. 330 00:12:59,660 --> 00:13:01,270 Our state space is multi-dimensional, 331 00:13:01,270 --> 00:13:03,130 and we can take a lot of different actions. 332 00:13:03,130 --> 00:13:08,150 That's going to be really costly and really hard to [INAUDIBLE]. 333 00:13:08,150 --> 00:13:11,820 So the approach we're probably going to want to take 334 00:13:11,820 --> 00:13:16,210 is using some sort of best-first search approach. 335 00:13:16,210 --> 00:13:18,600 Who can tell me some example algorithms 336 00:13:18,600 --> 00:13:22,504 that already do that? 337 00:13:22,504 --> 00:13:23,480 AUDIENCE: A star. 338 00:13:23,480 --> 00:13:24,460 PROFESSOR 3: A star. 339 00:13:24,460 --> 00:13:26,069 So that's very good. 340 00:13:26,069 --> 00:13:28,610 And that's exactly what we're going to base our stuff off of. 341 00:13:28,610 --> 00:13:33,105 A star is for deterministic graph search. 342 00:13:33,105 --> 00:13:35,510 If we have a graph, we can use a heuristic, 343 00:13:35,510 --> 00:13:38,120 and we walk down it and search the space 344 00:13:38,120 --> 00:13:41,520 that we're most interested in until [INAUDIBLE] variable.
345 00:13:41,520 --> 00:13:45,850 So we're going to introduce two new items, focusing on the last one. 346 00:13:45,850 --> 00:13:48,580 AO star is an algorithm we can use 347 00:13:48,580 --> 00:13:51,450 to search for graphs that have this "and" problem. 348 00:13:51,450 --> 00:13:57,520 AO stands for And Or graphs, as opposed to the simple graphs 349 00:13:57,520 --> 00:13:59,140 that we have [INAUDIBLE]. 350 00:13:59,140 --> 00:14:03,360 The And is a way to express probabilistic coupling 351 00:14:03,360 --> 00:14:04,532 between edges. 352 00:14:04,532 --> 00:14:06,240 So if we explore one thing, we might have 353 00:14:06,240 --> 00:14:07,364 to explore other branches. 354 00:14:07,364 --> 00:14:09,890 We'll discuss that a little bit more in the next slide. 355 00:14:09,890 --> 00:14:12,070 LAO is a bit more of a generalization. 356 00:14:12,070 --> 00:14:16,510 What it does is allow us to search loopy graphs 357 00:14:16,510 --> 00:14:18,560 and deal with the probabilistic coupling. 358 00:14:18,560 --> 00:14:23,090 And it allows us to search and find the best path 359 00:14:23,090 --> 00:14:25,930 with a heuristic-guided algorithm over an infinite time 360 00:14:25,930 --> 00:14:29,550 horizon, possibly revisiting states, using the tools 361 00:14:29,550 --> 00:14:30,352 we just had-- 362 00:14:30,352 --> 00:14:32,518 value iteration-- to understand what the next best 363 00:14:32,518 --> 00:14:34,840 thing to do is. 364 00:14:34,840 --> 00:14:38,620 So we'll talk about exactly how to get our MDPs that we just 365 00:14:38,620 --> 00:14:42,660 saw, these little arrow examples, to these And/Or graphs. 366 00:14:42,660 --> 00:14:43,990 We'll talk about two cases. 367 00:14:43,990 --> 00:14:47,590 One simple one where we could apply the AO star algorithm 368 00:14:47,590 --> 00:14:49,450 is where you have a quadrotor.
369 00:14:49,450 --> 00:14:51,890 And if you command an action North, 370 00:14:51,890 --> 00:14:53,890 you have a high probability of going North, 371 00:14:53,890 --> 00:14:56,390 but you might go East. 372 00:14:56,390 --> 00:14:57,250 And vice versa. 373 00:14:57,250 --> 00:14:59,470 If you command East, you might go North. 374 00:14:59,470 --> 00:15:04,066 This is expressed here with the growing tree. 375 00:15:04,066 --> 00:15:06,000 And these And edges that connect it. 376 00:15:06,000 --> 00:15:09,500 So despite commanding an action from (0,0), 377 00:15:09,500 --> 00:15:12,201 we might end up in (1,0) or (0,1). 378 00:15:12,201 --> 00:15:12,700 Right? 379 00:15:12,700 --> 00:15:14,700 As we propagate outward, you can see 380 00:15:14,700 --> 00:15:16,116 how we're never going to loop back 381 00:15:16,116 --> 00:15:17,324 to a state we've been to. 382 00:15:17,324 --> 00:15:20,228 We're going to be moving in the Northeast direction constantly. 383 00:15:20,228 --> 00:15:22,770 Our [INAUDIBLE] and our probability distribution 384 00:15:22,770 --> 00:15:25,186 across this tree expands. 385 00:15:25,186 --> 00:15:28,090 And that's the coupling of the edges. 386 00:15:28,090 --> 00:15:30,930 Because as we explore down the tree, 387 00:15:30,930 --> 00:15:32,982 we have to explore all the edges together. 388 00:15:32,982 --> 00:15:34,773 Anyone have any questions on just this kind 389 00:15:34,773 --> 00:15:37,050 of conversion formulation? 390 00:15:37,050 --> 00:15:38,550 AUDIENCE: I have a broader question. 391 00:15:38,550 --> 00:15:43,010 We mentioned that MDPs deal with finite states. 392 00:15:43,010 --> 00:15:46,070 Do we always just discretize a continuous world 393 00:15:46,070 --> 00:15:47,360 into a set of planning states? 394 00:15:47,360 --> 00:15:48,240 PROFESSOR 3: Yes. 395 00:15:48,240 --> 00:15:51,710 That's our prerequisite for searching over the state space.
396 00:15:51,710 --> 00:15:53,574 We can do that as finely as possible, 397 00:15:53,574 --> 00:15:58,240 but yes, discretization is there. 398 00:15:58,240 --> 00:16:00,340 So now let's look at this case. 399 00:16:00,340 --> 00:16:03,680 Instead of the Northeast case, let's say it's not deterministic, 400 00:16:03,680 --> 00:16:05,910 when we command an action North or South, 401 00:16:05,910 --> 00:16:07,680 that we go North or South. 402 00:16:07,680 --> 00:16:10,740 Here we see this loopy structure begin to emerge. 403 00:16:10,740 --> 00:16:12,360 We see that we might-- 404 00:16:12,360 --> 00:16:16,100 on our first action commanding North, we might go to plus 1. 405 00:16:16,100 --> 00:16:18,372 And then commanding North again, but we have our 406 00:16:18,372 --> 00:16:20,080 And edge and our probability distribution 407 00:16:20,080 --> 00:16:23,150 is [? back ?] to 0. 408 00:16:23,150 --> 00:16:24,775 What this creates is this loopy structure. 409 00:16:24,775 --> 00:16:26,983 And this is exactly what we're going to be exploring. 410 00:16:26,983 --> 00:16:28,480 This is a very real world scenario 411 00:16:28,480 --> 00:16:32,030 where it's very likely we might return to somewhere 412 00:16:32,030 --> 00:16:35,400 we just came from, just because of the uncertain dynamics 413 00:16:35,400 --> 00:16:36,400 we have. 414 00:16:36,400 --> 00:16:39,840 This is the type of problem [INAUDIBLE]. 415 00:16:39,840 --> 00:16:43,582 So we're going to use the LAO to start out with. 416 00:16:43,582 --> 00:16:45,540 We're going to talk about the three main things 417 00:16:45,540 --> 00:16:48,480 that [INAUDIBLE] has a heuristic-guided envelope. 418 00:16:48,480 --> 00:16:51,410 And what that means is that we have our large state space 419 00:16:51,410 --> 00:16:51,910 here. 420 00:16:51,910 --> 00:16:54,670 But we're only going to look at a portion of it. 421 00:16:54,670 --> 00:16:57,150 This greyed-out portion.
422 00:16:57,150 --> 00:16:59,360 We're only going to look at the portion that 423 00:16:59,360 --> 00:17:02,320 is interesting to us-- the portion that provides us 424 00:17:02,320 --> 00:17:04,819 with the biggest rewards-- the portion that's reachable 425 00:17:04,819 --> 00:17:07,089 if we follow an optimal policy. 426 00:17:07,089 --> 00:17:10,609 We figure this out with some admissible heuristic. 427 00:17:10,609 --> 00:17:13,819 We'll estimate our rewards just like A star. 428 00:17:13,819 --> 00:17:17,089 The idea here was that we'll keep the problem small 429 00:17:17,089 --> 00:17:21,339 so we don't have to run value iteration over a giant state 430 00:17:21,339 --> 00:17:22,329 space. 431 00:17:22,329 --> 00:17:26,050 What we'll do next is, as the state space expands, 432 00:17:26,050 --> 00:17:28,146 we get a bigger picture of the states 433 00:17:28,146 --> 00:17:29,884 that we're interested in. 434 00:17:29,884 --> 00:17:31,800 We're going to do [? an audio ?] [INAUDIBLE]. 435 00:17:31,800 --> 00:17:34,510 And then we're going to figure out what the best action is. 436 00:17:34,510 --> 00:17:37,980 And in the ideal case, the states 437 00:17:37,980 --> 00:17:40,149 that we would never reach using an optimal policy 438 00:17:40,149 --> 00:17:41,440 are never going to be explored. 439 00:17:41,440 --> 00:17:44,340 Because our policy is going to say, no, don't go over there. 440 00:17:44,340 --> 00:17:45,481 That's a dead end. 441 00:17:45,481 --> 00:17:47,920 Or that's getting you farther away from [INAUDIBLE]. 442 00:17:47,920 --> 00:17:50,170 So we're going to be searching in a very specific part 443 00:17:50,170 --> 00:17:53,390 of the state space that is useful to explore-- 444 00:17:53,390 --> 00:17:56,502 that will get us closer and give us a higher reward. 445 00:18:05,420 --> 00:18:08,670 What's important here, the L stands for Loops.
446 00:18:08,670 --> 00:18:10,770 It's an extension of the AO star algorithm, 447 00:18:10,770 --> 00:18:12,769 which is itself an extension of the A star 448 00:18:12,769 --> 00:18:13,270 algorithm. 449 00:18:13,270 --> 00:18:15,110 It can handle infinite horizon problems, 450 00:18:15,110 --> 00:18:17,975 where transitions can loop in different ways. 451 00:18:17,975 --> 00:18:21,940 And it really models the real world scenarios we're interested in. 452 00:18:21,940 --> 00:18:23,850 Any questions so far on the broad scope 453 00:18:23,850 --> 00:18:25,526 of what AO star is going to do? 454 00:18:25,526 --> 00:18:26,026 Yeah. 455 00:18:26,026 --> 00:18:29,634 AUDIENCE: Can you put the [INAUDIBLE] what you're doing 456 00:18:29,634 --> 00:18:30,300 over here, but-- 457 00:18:30,300 --> 00:18:31,050 PROFESSOR 3: Sure. 458 00:18:31,050 --> 00:18:32,705 The AO stands for And Or graphs. 459 00:18:32,705 --> 00:18:37,576 So those are graphs where we have edges that are coupled. 460 00:18:37,576 --> 00:18:39,200 If you take this transition, 461 00:18:39,200 --> 00:18:41,671 you might end up here or you might end up there. 462 00:18:41,671 --> 00:18:44,455 So that's the notion of this probabilistic 463 00:18:44,455 --> 00:18:53,400 coupling that we'll see in action as we [INAUDIBLE] 464 00:18:53,400 --> 00:18:57,320 we're going to do, we're going to input this MDP, or And Or 465 00:18:57,320 --> 00:19:01,000 graph, with transition probabilities, a reward 466 00:19:01,000 --> 00:19:02,275 function, and heuristics. 467 00:19:02,275 --> 00:19:04,000 These are all things that we defined 468 00:19:04,000 --> 00:19:06,330 prior to figuring out a plan. 469 00:19:06,330 --> 00:19:07,800 What we're going to come out with 470 00:19:07,800 --> 00:19:10,292 is an optimal policy for every reachable state. 471 00:19:10,292 --> 00:19:12,500 It's a little different than what value iteration gives, 472 00:19:12,500 --> 00:19:15,790 which is an optimal policy from every possible state.
473 00:19:15,790 --> 00:19:18,840 But we argue that that's all we need. 474 00:19:18,840 --> 00:19:22,060 If we know where we're starting and we follow 475 00:19:22,060 --> 00:19:23,560 our optimal policy, we're only going 476 00:19:23,560 --> 00:19:25,760 to explore a certain portion of the state space. 477 00:19:25,760 --> 00:19:28,120 And we're going to explore that together. 478 00:19:28,120 --> 00:19:30,198 [INAUDIBLE] plan a little bit more 479 00:19:30,198 --> 00:19:35,362 efficiently than iterating for a high [INAUDIBLE] heuristic. 480 00:19:35,362 --> 00:19:39,340 Any questions on that? 481 00:19:39,340 --> 00:19:41,445 So we'll define some terms that we're 482 00:19:41,445 --> 00:19:42,810 going to use throughout this. 483 00:19:42,810 --> 00:19:44,700 We've already talked about our state space. 484 00:19:44,700 --> 00:19:46,650 This is just a small portion of it 485 00:19:46,650 --> 00:19:48,483 that we're going to work with as we 486 00:19:48,483 --> 00:19:49,850 walk through this example. 487 00:19:49,850 --> 00:19:52,225 Next we're going to define something called our envelope. 488 00:19:52,225 --> 00:19:54,224 And that's the sub-portion of our states 489 00:19:54,224 --> 00:19:55,640 that we're going to be looking at. 490 00:19:55,640 --> 00:19:59,429 We're going to initialize that to just S0 [INAUDIBLE].. 491 00:19:59,429 --> 00:20:01,970 But as we progress through the algorithm, it's going to grow. 492 00:20:01,970 --> 00:20:04,770 It's going to grow only in the areas that we're interested in. 493 00:20:07,650 --> 00:20:10,430 A subset of this envelope is the terminal states. 494 00:20:10,430 --> 00:20:12,150 Now these aren't goal states, these 495 00:20:12,150 --> 00:20:14,760 are just [INAUDIBLE] of our space 496 00:20:14,760 --> 00:20:16,176 that we've added to our envelope. 497 00:20:16,176 --> 00:20:19,904 And this is what it would be in the expanded [INAUDIBLE]..
498 00:20:19,904 --> 00:20:22,329 And it's just the nodes that haven't been expanded yet. 499 00:20:22,329 --> 00:20:24,120 Here we haven't drawn everything, 500 00:20:24,120 --> 00:20:26,890 but you can imagine the state space goes out 501 00:20:26,890 --> 00:20:29,375 further because [INAUDIBLE]. 502 00:20:29,375 --> 00:20:31,180 You can imagine that this goes out further. 503 00:20:31,180 --> 00:20:33,070 And we're keeping track of the nodes 504 00:20:33,070 --> 00:20:37,380 that we haven't expanded yet. 505 00:20:37,380 --> 00:20:41,350 Or likewise, we've showed that we initialize [INAUDIBLE].. 506 00:20:44,290 --> 00:20:47,292 The other few things that we're going to define-- we've 507 00:20:47,292 --> 00:20:49,864 already defined the states that are in our envelope. 508 00:20:49,864 --> 00:20:51,640 That's the blue or the red. 509 00:20:51,640 --> 00:20:55,790 We're going to define a heuristic-based reward function, R 510 00:20:55,790 --> 00:21:00,890 E, and a transition probability matrix, or set of matrices, 511 00:21:00,890 --> 00:21:04,850 that we're going to run our optimal policy search on. 512 00:21:04,850 --> 00:21:07,910 What's important here is that our reward function 513 00:21:07,910 --> 00:21:09,670 and transition probabilities are slightly 514 00:21:09,670 --> 00:21:12,003 altered to account for the fact that we haven't explored 515 00:21:12,003 --> 00:21:13,680 the entire state space. 516 00:21:13,680 --> 00:21:17,490 We see here that if a node is in S T-- in other words, 517 00:21:17,490 --> 00:21:21,350 it's one of those terminal nodes-- 518 00:21:21,350 --> 00:21:23,650 we're going to say we can't transition out 519 00:21:23,650 --> 00:21:25,940 of it, because we don't know what's beyond it so far. 520 00:21:25,940 --> 00:21:28,675 And we're going to set its reward to be the heuristic.
521 00:21:28,675 --> 00:21:30,870 Whatever we think the reward is 522 00:21:30,870 --> 00:21:35,310 going to be when we begin to explore and go further. 523 00:21:35,310 --> 00:21:39,270 Like I said, we're just going to feed this into a policy search 524 00:21:39,270 --> 00:21:41,650 just like we discussed with value iteration 525 00:21:41,650 --> 00:21:43,080 on the sub problem. 526 00:21:43,080 --> 00:21:46,310 And we're going to search for an optimal policy [INAUDIBLE].. 527 00:21:46,310 --> 00:21:47,446 So far so good? 528 00:21:50,991 --> 00:21:52,074 These are the general steps. 529 00:21:52,074 --> 00:21:53,900 This is very text-y, but we're going to definitely walk 530 00:21:53,900 --> 00:21:54,983 through every single step. 531 00:21:54,983 --> 00:21:57,610 We're going to do two full iterations of the algorithm. 532 00:21:57,610 --> 00:22:00,482 So like I said, we're going to create RE and TE. 533 00:22:00,482 --> 00:22:02,990 That's our reward function and transition probabilities, using 534 00:22:02,990 --> 00:22:05,224 the definitions we showed. 535 00:22:05,224 --> 00:22:06,640 We're going to use value iteration 536 00:22:06,640 --> 00:22:08,630 to find the optimal policy of the sub space 537 00:22:08,630 --> 00:22:11,190 that we're interested in now. 538 00:22:11,190 --> 00:22:14,030 And then this is probably the most important step. 539 00:22:14,030 --> 00:22:18,020 Knowing this optimal policy, we take a look at what new states 540 00:22:18,020 --> 00:22:19,580 that aren't in our terminal states-- 541 00:22:19,580 --> 00:22:21,344 nodes that we haven't explored yet-- 542 00:22:21,344 --> 00:22:24,150 what new states we might visit now. 543 00:22:24,150 --> 00:22:27,526 So let's say we have our policy, and it says we'll go north. 544 00:22:27,526 --> 00:22:29,900 And we know that we haven't explored the north state yet.
545 00:22:29,900 --> 00:22:31,400 We know this is the state that we're 546 00:22:31,400 --> 00:22:34,287 going to reach following what we consider now to be 547 00:22:34,287 --> 00:22:37,041 already an optimal [INAUDIBLE]. 548 00:22:37,041 --> 00:22:39,040 So those are the states we're going to expand next. 549 00:22:39,040 --> 00:22:40,300 We're going to do some bookkeeping, 550 00:22:40,300 --> 00:22:42,280 adding them to the terminal states [INAUDIBLE] 551 00:22:42,280 --> 00:22:44,764 once we expand it, then adding them to our envelope. 552 00:22:44,764 --> 00:22:46,430 What's important here is that we're only 553 00:22:46,430 --> 00:22:49,570 going to add states that we haven't visited yet to our envelope. 554 00:22:49,570 --> 00:22:51,480 And this is basically the little hack 555 00:22:51,480 --> 00:22:54,750 that allows us to deal with loopy graphs. 556 00:22:54,750 --> 00:22:57,880 We're not going to continually explore nodes 557 00:22:57,880 --> 00:23:00,400 that we might reach a second time, probabilistically. 558 00:23:00,400 --> 00:23:02,353 We're going to let value iteration handle 559 00:23:02,353 --> 00:23:05,092 that [INAUDIBLE] 560 00:23:05,092 --> 00:23:11,068 AUDIENCE: [INAUDIBLE] states are expanded. 561 00:23:11,068 --> 00:23:13,060 I'm the one who got confused on that. 562 00:23:13,060 --> 00:23:15,052 Are you saying that you're going to just repeat 563 00:23:15,052 --> 00:23:18,538 until there aren't any more terminals to look at? 564 00:23:18,538 --> 00:23:22,191 And if that's the case, how is that possible if you have 565 00:23:22,191 --> 00:23:24,701 an infinite horizon [INAUDIBLE] 566 00:23:24,701 --> 00:23:25,450 PROFESSOR 3: Sure. 567 00:23:25,450 --> 00:23:28,520 So if you can imagine-- and we'll talk about termination 568 00:23:28,520 --> 00:23:29,602 at the end.
569 00:23:29,602 --> 00:23:33,540 But you can imagine that as we have these terminal states, 570 00:23:33,540 --> 00:23:35,860 but you have a policy that guides you 571 00:23:35,860 --> 00:23:38,358 to a part of the state space that we've already expanded. 572 00:23:38,358 --> 00:23:39,733 Imagine you've reached your goal. 573 00:23:39,733 --> 00:23:41,680 Your optimal policy is going to say, stay put. 574 00:23:41,680 --> 00:23:42,487 Right? 575 00:23:42,487 --> 00:23:44,320 And it's not going to say, move North again. 576 00:23:44,320 --> 00:23:46,094 AUDIENCE: It's just the goal [INAUDIBLE] 577 00:23:46,094 --> 00:23:47,760 PROFESSOR 3: Essentially, the goal state 578 00:23:47,760 --> 00:23:49,923 is definitely an example in the more extreme case 579 00:23:49,923 --> 00:23:51,964 that there's nothing else you can do that's going 580 00:23:51,964 --> 00:23:54,293 to get you closer to our goal. 581 00:23:54,293 --> 00:23:59,020 The idea is that your policy on your sub space 582 00:23:59,020 --> 00:24:00,520 never tells you to go to a terminal. 583 00:24:00,520 --> 00:24:04,096 Nobody can [INAUDIBLE] inherently worse than 584 00:24:04,096 --> 00:24:06,008 [INAUDIBLE] 585 00:24:06,008 --> 00:24:07,420 AUDIENCE: Planning optimal policy 586 00:24:07,420 --> 00:24:09,862 means running value iteration entirely. 587 00:24:09,862 --> 00:24:10,570 PROFESSOR 3: Yes. 588 00:24:10,570 --> 00:24:13,036 We were going to treat it essentially as a black box. 589 00:24:13,036 --> 00:24:15,014 But the trick here is that we're doing it 590 00:24:15,014 --> 00:24:17,360 on a smaller portion of space of a different world. 591 00:24:20,780 --> 00:24:21,280 All right. 592 00:24:21,280 --> 00:24:23,060 I'm going to put the steps up there. 593 00:24:23,060 --> 00:24:25,530 Hopefully you can see this. 594 00:24:25,530 --> 00:24:28,143 But for now, we'll just walk through 595 00:24:28,143 --> 00:24:30,380 from the beginning of what we're going to do. 
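[EDITOR'S NOTE: The construction the lecturers describe — value iteration run as a black box on a restricted envelope, where unexpanded terminal nodes absorb and pay out the heuristic — can be sketched as follows. This is an illustrative sketch, not the lecturers' code; all names are made up.]

```python
# Sketch of the "envelope MDP" fed to value iteration. Terminal (unexpanded)
# nodes become absorbing states whose reward is the admissible heuristic h,
# so the planner never has to look past them.

def make_envelope_mdp(envelope, terminal, T, R, h):
    """Return reward and transition functions restricted to the envelope.

    envelope: set of states added so far
    terminal: subset of envelope not yet expanded
    T: T[s][a] -> dict mapping successor state to probability (full MDP)
    R: R[s][a] -> instantaneous reward (full MDP)
    h: h[s] -> heuristic estimate of future reward from s
    """
    def R_E(s, a):
        # Unexpanded node: we don't know what lies beyond, so the
        # heuristic stands in for all future reward.
        return h[s] if s in terminal else R[s][a]

    def T_E(s, a):
        # No transitions out of terminal nodes: they absorb probability 1.
        return {s: 1.0} if s in terminal else T[s][a]

    return R_E, T_E
```

Value iteration can then be run unchanged on (R_E, T_E), since the envelope MDP is just a smaller, self-contained MDP.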
596 00:24:30,380 --> 00:24:34,890 So we said that our envelope and our terminal nodes 597 00:24:34,890 --> 00:24:37,310 are just S0 to start. 598 00:24:37,310 --> 00:24:40,230 So very simply, we use these definitions 599 00:24:40,230 --> 00:24:42,640 and say, OK, the transition probability 600 00:24:42,640 --> 00:24:45,531 from S0 to any node right now is 0. 601 00:24:45,531 --> 00:24:46,914 Because it's in that terminal set. 602 00:24:46,914 --> 00:24:50,670 [INAUDIBLE] That's just so that we don't transition out of it. 603 00:24:50,670 --> 00:24:54,240 We can develop the policy based on only this portion 604 00:24:54,240 --> 00:24:55,302 of the space [INAUDIBLE]. 605 00:24:55,302 --> 00:24:56,760 And our reward function, we're just 606 00:24:56,760 --> 00:24:59,040 going to set it to be the heuristic [INAUDIBLE].. 607 00:24:59,040 --> 00:25:01,040 Let's say that's 20. 608 00:25:01,040 --> 00:25:05,234 And we move on from there. 609 00:25:05,234 --> 00:25:07,992 [INAUDIBLE] fed to value iteration. 610 00:25:07,992 --> 00:25:09,450 And we'll run this value iteration. 611 00:25:09,450 --> 00:25:11,540 So this is a very basic case. 612 00:25:11,540 --> 00:25:14,035 We're just going to-- we're using this to kind of build up 613 00:25:14,035 --> 00:25:16,170 the machinery, understand what we're doing. 614 00:25:16,170 --> 00:25:19,700 It's a very basic case where the only node we have is S0. 615 00:25:19,700 --> 00:25:20,960 We can't transition out of it. 616 00:25:20,960 --> 00:25:23,142 All we have is some heuristic. 617 00:25:23,142 --> 00:25:24,960 So the only thing we can do is nothing. 618 00:25:24,960 --> 00:25:28,310 So the action we're going to take from S0 is nothing. 619 00:25:28,310 --> 00:25:30,780 Very simple case just to get us warmed up and understand 620 00:25:30,780 --> 00:25:33,540 what's happening.
621 00:25:33,540 --> 00:25:37,258 So using this policy, and knowing that S0 is in our terminal 622 00:25:37,258 --> 00:25:40,271 node set, 623 00:25:40,271 --> 00:25:41,770 we're going to take the intersection 624 00:25:41,770 --> 00:25:43,370 between this terminal node set 625 00:25:43,370 --> 00:25:46,450 and the nodes that we might reach following our policy. 626 00:25:46,450 --> 00:25:47,790 And we know that we're at S0. 627 00:25:47,790 --> 00:25:53,235 And we know that the action that our optimal policy says to take 628 00:25:53,235 --> 00:25:53,930 is nothing. 629 00:25:53,930 --> 00:25:55,280 So we know we're there. 630 00:25:55,280 --> 00:25:57,680 And we know it's in our terminal node set. 631 00:25:57,680 --> 00:26:01,655 So that's the only thing that we could reach so far 632 00:26:01,655 --> 00:26:03,080 using our optimal policy. 633 00:26:03,080 --> 00:26:04,840 So far so good? 634 00:26:04,840 --> 00:26:05,340 Clear? 635 00:26:09,340 --> 00:26:11,600 So that's exactly what we're going to expand. 636 00:26:11,600 --> 00:26:14,633 Just expand S0; A, B, and C get added 637 00:26:14,633 --> 00:26:16,339 to our terminal nodes. 638 00:26:16,339 --> 00:26:19,702 So that's up there, the symbols [INAUDIBLE] 639 00:26:19,702 --> 00:26:21,940 from our terminal nodes, added as children. 640 00:26:21,940 --> 00:26:24,430 Then we added the children to the end. 641 00:26:29,390 --> 00:26:29,890 Here we go. 642 00:26:29,890 --> 00:26:31,390 We're going to do a little bit more. 643 00:26:31,390 --> 00:26:33,370 We're going to do the same thing again. 644 00:26:33,370 --> 00:26:36,274 But now we obviously have more nodes and a bigger sub space 645 00:26:36,274 --> 00:26:38,854 to explore. 646 00:26:38,854 --> 00:26:43,200 So using these definitions, we see our reward function 647 00:26:43,200 --> 00:26:47,146 is a tuple of the node that we start from. 648 00:26:47,146 --> 00:26:48,670 And so this is S0.
649 00:26:48,670 --> 00:26:50,200 And one of these three actions. 650 00:26:50,200 --> 00:26:52,510 And we'll take A1, A2, and A3. 651 00:26:52,510 --> 00:26:56,350 Then we have our instantaneous rewards, 6, 4, and 8, 652 00:26:56,350 --> 00:27:00,650 as being our rewards from doing those actions. 653 00:27:00,650 --> 00:27:03,327 From A, B, and C, no matter what action you take, 654 00:27:03,327 --> 00:27:04,698 it's just the heuristic. 655 00:27:04,698 --> 00:27:07,900 That's part of it. 656 00:27:07,900 --> 00:27:10,430 And likewise, we're going to take a look at this transition 657 00:27:10,430 --> 00:27:11,430 probability [INAUDIBLE]. 658 00:27:11,430 --> 00:27:13,638 This is for [INAUDIBLE] transitioning from a state S0 659 00:27:13,638 --> 00:27:17,510 for saying that if we take actions A1, A2, and A3, 660 00:27:17,510 --> 00:27:21,742 what's the probability of ending up in nodes A, B, and C? 661 00:27:21,742 --> 00:27:23,632 If you look so far here, we're going 662 00:27:23,632 --> 00:27:25,090 to look at something deterministic. 663 00:27:25,090 --> 00:27:26,617 If we take an action, we'll end up 664 00:27:26,617 --> 00:27:28,075 where we say we're going to end up. 665 00:27:28,075 --> 00:27:31,060 And we'll see how this algorithm collapses down to A star 666 00:27:31,060 --> 00:27:35,084 if everything's deterministic. 667 00:27:35,084 --> 00:27:38,100 We're also obviously going to look at the probabilistic case 668 00:27:38,100 --> 00:27:40,700 where we say [INAUDIBLE] small probability that it 669 00:27:40,700 --> 00:27:42,007 might end up with [INAUDIBLE]. 670 00:27:42,007 --> 00:27:43,590 That's going to necessitate that we're 671 00:27:43,590 --> 00:27:46,730 going to have to look at and expand B together with A 672 00:27:46,730 --> 00:27:50,015 if we were to decide we want to try [INAUDIBLE].. 673 00:27:50,015 --> 00:27:53,076 And likewise, if we try to take action A3, 674 00:27:53,076 --> 00:27:54,700 we have to expand all three nodes.
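[EDITOR'S NOTE: The one-step lookahead from S0 in the deterministic case can be made concrete. The instantaneous rewards 6, 4, and 8 come from the example; the heuristic values at A, B, and C are not stated in the lecture, so the numbers below are made up, chosen so that A1 comes out best, matching the policy discussed.]

```python
# Illustrative greedy choice from S0: Q(S0, a) = R(S0, a) + gamma * h(succ(a)).
rewards = {"A1": 6, "A2": 4, "A3": 8}          # R(S0, a), from the example
successor = {"A1": "A", "A2": "B", "A3": "C"}  # deterministic transitions
h = {"A": 20, "B": 10, "C": 5}                 # ASSUMED heuristic values
gamma = 1.0                                    # no discounting, for simplicity

q = {a: rewards[a] + gamma * h[successor[a]] for a in rewards}
best_action = max(q, key=q.get)  # A1: 6 + 20 = 26 beats 4 + 10 and 8 + 5
```

Under these assumed heuristics, even though A3 has the largest instantaneous reward, A1 wins once the heuristic estimate of future reward is included.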
675 00:27:54,700 --> 00:27:57,410 So the tighter the probabilistic coupling, the more of the space 676 00:27:57,410 --> 00:27:58,701 we're going to have to explore. 677 00:28:01,440 --> 00:28:05,346 So just off this, assuming that you can take action A1, A2, 678 00:28:05,346 --> 00:28:09,990 or A3, can someone tell me what the policy is, given the rewards 679 00:28:09,990 --> 00:28:12,200 here? 680 00:28:12,200 --> 00:28:17,160 And we're interested in a policy from S0-- what we're going to do. 681 00:28:17,160 --> 00:28:19,500 Take a look at the rewards and judge 682 00:28:19,500 --> 00:28:22,738 what the best action to take is from a purely deterministic 683 00:28:22,738 --> 00:28:23,238 sense. 684 00:28:28,562 --> 00:28:32,918 AUDIENCE: [INAUDIBLE] do you add [INAUDIBLE] 685 00:28:32,918 --> 00:28:34,370 or do you just [INAUDIBLE]? 686 00:28:34,370 --> 00:28:37,030 You have the reward that you have gotten so far. 687 00:28:45,130 --> 00:28:47,580 PROFESSOR 3: Does that help you answer the question? 688 00:28:47,580 --> 00:28:48,371 AUDIENCE: Oh, yeah. 689 00:28:48,371 --> 00:28:50,059 [INAUDIBLE] student. 690 00:28:50,059 --> 00:28:51,600 PROFESSOR 3: So the policy preference 691 00:28:51,600 --> 00:28:55,206 says, from S0, take action A1. 692 00:28:55,206 --> 00:28:57,666 And that's going to stay there. 693 00:28:57,666 --> 00:29:02,010 [INAUDIBLE] from A to C, there's no action to take [INAUDIBLE].. 694 00:29:02,010 --> 00:29:04,800 What that means is that the nodes that we might reach 695 00:29:04,800 --> 00:29:08,630 that are in our terminal state set, using our policy, 696 00:29:08,630 --> 00:29:09,900 is node A. All right. 697 00:29:09,900 --> 00:29:11,430 So that's the node we're going to expand. 698 00:29:11,430 --> 00:29:13,800 And this is where you really see that all we've done 699 00:29:13,800 --> 00:29:16,620 is collapse down to A star. 700 00:29:16,620 --> 00:29:17,910 A star would say, OK.
701 00:29:17,910 --> 00:29:23,350 What's the best node using some heuristic. 702 00:29:23,350 --> 00:29:26,264 Action a1 one takes us to that best node, 703 00:29:26,264 --> 00:29:29,680 and we're going to expand just that. 704 00:29:29,680 --> 00:29:31,640 So when everything is deterministic, 705 00:29:31,640 --> 00:29:34,110 basically this algorithm collapses down to A star. 706 00:29:37,056 --> 00:29:38,824 Nothing super interesting. 707 00:29:38,824 --> 00:29:41,026 The interesting case does come up 708 00:29:41,026 --> 00:29:42,900 when we start doing more probablistic ones. 709 00:29:42,900 --> 00:29:44,420 That's where nodes are probabilistically 710 00:29:44,420 --> 00:29:45,878 helpful to the scanned graph sense. 711 00:29:48,550 --> 00:29:50,125 So now we have our policy. 712 00:29:50,125 --> 00:29:52,250 The policies are politely going to remain the same, 713 00:29:52,250 --> 00:29:58,250 because they have very little probability on the edge actions 714 00:29:58,250 --> 00:30:03,250 that we might accidentally hit up. 715 00:30:03,250 --> 00:30:06,250 What would I want to look at is what [INAUDIBLE] 716 00:30:06,250 --> 00:30:09,220 and how reachable following our optimal policy. 717 00:30:15,160 --> 00:30:18,806 So we talked about that if we were to actually read. 718 00:30:18,806 --> 00:30:22,012 We know some probability of ending up in [INAUDIBLE] nodes. 719 00:30:22,012 --> 00:30:26,280 So taking action A3 makes C, 2, and A all reachable. 720 00:30:26,280 --> 00:30:28,650 What's reachable in taking action A1? 721 00:30:36,111 --> 00:30:36,861 AUDIENCE: A and B. 722 00:30:36,861 --> 00:30:39,620 PROFESSOR 3: [INAUDIBLE] And that's 723 00:30:39,620 --> 00:30:42,700 where we get the notion on this probabilistic algorithm 724 00:30:42,700 --> 00:30:45,517 that we're going expand things together. 
725 00:30:45,517 --> 00:30:46,975 Explore the part of the state space 726 00:30:46,975 --> 00:30:50,040 that we reach both via our optimal policy 727 00:30:50,040 --> 00:30:54,400 and that we might accidentally end up in if we [INAUDIBLE] 728 00:30:54,400 --> 00:30:54,982 the policy. 729 00:30:54,982 --> 00:30:57,870 And this is what guarantees that we'll have an action 730 00:30:57,870 --> 00:31:00,815 to take from any state that we intended to go to 731 00:31:00,815 --> 00:31:03,440 or that we might end up [INAUDIBLE].. 732 00:31:03,440 --> 00:31:05,210 That's how our state space expands, our envelope 733 00:31:05,210 --> 00:31:09,030 expands to encompass only the reachable and interesting 734 00:31:09,030 --> 00:31:11,240 states that we want to look at. 735 00:31:11,240 --> 00:31:14,360 It's not as simple as just examining 736 00:31:14,360 --> 00:31:17,850 the heuristic, like with A star, 737 00:31:17,850 --> 00:31:19,850 because we have this probabilistic [INAUDIBLE].. 738 00:31:19,850 --> 00:31:24,740 You can see that if you have a tighter coupling, then 739 00:31:24,740 --> 00:31:28,970 you don't get to exploit this optimization as much. 740 00:31:28,970 --> 00:31:31,220 For example, with A3 we would have had to expand 741 00:31:31,220 --> 00:31:33,400 more, as opposed to A1. 742 00:31:33,400 --> 00:31:38,930 We can just stick to A and B and ignore expanding C for now. 743 00:31:38,930 --> 00:31:39,736 [INAUDIBLE] 744 00:31:42,670 --> 00:31:44,950 AUDIENCE: Our [INAUDIBLE] We know that when people 745 00:31:44,950 --> 00:31:48,370 do the statistic [INAUDIBLE]. 746 00:31:48,370 --> 00:31:51,340 So in this case, if you take action A because of A and B, 747 00:31:51,340 --> 00:32:01,880 [INAUDIBLE] 748 00:32:01,880 --> 00:32:03,540 PROFESSOR 3: So in the complete sense 749 00:32:03,540 --> 00:32:05,650 of running this algorithm, we shouldn't prune it. 750 00:32:05,650 --> 00:32:07,190 Because what if we do?
751 00:32:07,190 --> 00:32:10,030 If we have that 2% chance, [INAUDIBLE].. 752 00:32:10,030 --> 00:32:12,430 We need to have a policy for correcting. 753 00:32:12,430 --> 00:32:14,520 So we end up at B. You can imagine 754 00:32:14,520 --> 00:32:17,502 that these are going to try to push back to whatever path 755 00:32:17,502 --> 00:32:19,430 that A was looking for. 756 00:32:19,430 --> 00:32:21,720 I suppose if the probability is low enough, 757 00:32:21,720 --> 00:32:24,850 you can have some cutoff percentage where 758 00:32:24,850 --> 00:32:26,590 you've decoupled [INAUDIBLE]-- 759 00:32:26,590 --> 00:32:28,210 PROFESSOR 2: So just a quick point. 760 00:32:28,210 --> 00:32:30,979 So I think that's an excellent point and an excellent answer. 761 00:32:30,979 --> 00:32:32,520 We're going to talk about that next week-- 762 00:32:32,520 --> 00:32:34,620 exactly what's going to happen and whether you 763 00:32:34,620 --> 00:32:37,120 can prune lower probability [INAUDIBLE]. The paper 764 00:32:37,120 --> 00:32:39,642 right here-- we'll be lecturing on that. 765 00:32:39,642 --> 00:32:40,580 Good question. 766 00:32:40,580 --> 00:32:43,380 PROFESSOR 3: That also gets us into the sense 767 00:32:43,380 --> 00:32:46,150 that if every state in the world were probabilistically 768 00:32:46,150 --> 00:32:50,370 coupled-- let's say we had some transporter, to go with the Star 769 00:32:50,370 --> 00:32:52,495 Trek examples. 770 00:32:52,495 --> 00:32:56,610 If we had this transporter that non-deterministically put us 771 00:32:56,610 --> 00:32:58,425 in any state in the world, we have 772 00:32:58,425 --> 00:33:00,550 to explore the whole world, because we could end up 773 00:33:00,550 --> 00:33:01,410 [INAUDIBLE].
774 00:33:01,410 --> 00:33:04,410 So luckily that's not the case yet and we can take advantage 775 00:33:04,410 --> 00:33:07,950 of the fact that we ended up most likely where 776 00:33:07,950 --> 00:33:14,510 we commanded, with some probability of [INAUDIBLE] 777 00:33:14,510 --> 00:33:16,382 This is exactly what we just talked about. 778 00:33:16,382 --> 00:33:19,450 We coupled these nodes with this And edge. 779 00:33:19,450 --> 00:33:21,230 And we expand those, too. 780 00:33:21,230 --> 00:33:23,570 Is everyone understanding the whole intuition, 781 00:33:23,570 --> 00:33:26,550 and the logic for why both of them have the [INAUDIBLE]? 782 00:33:26,550 --> 00:33:29,116 Even if there's only a small probability that we end up 783 00:33:29,116 --> 00:33:30,112 [INAUDIBLE]? 784 00:33:35,092 --> 00:33:36,950 So this is what we're going to repeat 785 00:33:36,950 --> 00:33:38,990 until all the states are expanded. 786 00:33:38,990 --> 00:33:41,070 You can imagine that the next time we 787 00:33:41,070 --> 00:33:43,330 run our value iteration, we're now 788 00:33:43,330 --> 00:33:45,446 running it on all of these colored nodes-- 789 00:33:45,446 --> 00:33:47,644 both the blue and the red. 790 00:33:47,644 --> 00:33:49,060 We can imagine that next time, now 791 00:33:49,060 --> 00:33:51,018 that we have a little bit more information about what 792 00:33:51,018 --> 00:33:53,477 lies beyond A and B, that our policy might say, oh, 793 00:33:53,477 --> 00:33:54,060 you know what? 794 00:34:00,350 --> 00:34:00,350 Actually, from S0, action A3 was the best to take. 795 00:34:00,350 --> 00:34:03,378 What that does is say, OK, the reachable set 796 00:34:03,378 --> 00:34:06,230 is A, B, and C. And we expand those nodes.
797 00:34:06,230 --> 00:34:08,011 Well, we've already expanded A and B, 798 00:34:08,011 --> 00:34:09,969 so we move into the next part of the sub space, 799 00:34:09,969 --> 00:34:12,150 and as we gain more information, 800 00:34:12,150 --> 00:34:17,789 we run value iteration and we can expand accordingly. 801 00:34:17,789 --> 00:34:19,060 Steve asked a good question 802 00:34:19,060 --> 00:34:22,270 when we did our dry run, about whether there is a way 803 00:34:22,270 --> 00:34:24,059 to save on the computation you did 804 00:34:24,059 --> 00:34:27,360 prior to the value iteration and add these [INAUDIBLE] states. 805 00:34:27,360 --> 00:34:29,909 And we've looked at it a little bit. 806 00:34:29,909 --> 00:34:31,380 I've seen some stuff. 807 00:34:31,380 --> 00:34:36,609 But I haven't found a paper that specifically deals with it. 808 00:34:36,609 --> 00:34:38,730 You can imagine how you've already 809 00:34:38,730 --> 00:34:42,450 run value iteration on your previous iteration [INAUDIBLE] 810 00:34:42,450 --> 00:34:46,239 and you add the new terminal edges which you'd expand on. 811 00:34:46,239 --> 00:34:48,100 And you run them again until it stabilizes. 812 00:34:48,100 --> 00:34:49,808 And that way you've saved the computation 813 00:34:49,808 --> 00:34:54,446 of having to run value iteration multiple times [INAUDIBLE] 814 00:34:54,446 --> 00:34:56,941 state space. 815 00:34:56,941 --> 00:34:58,700 AUDIENCE: So I'm trying to think of how 816 00:34:58,700 --> 00:35:01,854 this is different from something like [INAUDIBLE] 817 00:35:01,854 --> 00:35:03,306 for [INAUDIBLE]. 818 00:35:07,662 --> 00:35:11,092 PROFESSOR 3: I think I don't know enough about that. 819 00:35:11,092 --> 00:35:14,770 But my basic understanding says that what's useful 820 00:35:14,770 --> 00:35:18,890 here is it's this explicit [INAUDIBLE].. 821 00:35:18,890 --> 00:35:20,714 I don't know how much [INAUDIBLE]..
822 00:35:23,581 --> 00:35:26,080 AUDIENCE: And also, as long as your heuristic is admissible, 823 00:35:26,080 --> 00:35:33,910 it's guaranteed [INAUDIBLE] not all [INAUDIBLE] algorithms. 824 00:35:33,910 --> 00:35:37,190 Just like A star is optimal, as long as you've got 825 00:35:37,190 --> 00:35:38,636 a consistent [INAUDIBLE]. 826 00:35:43,092 --> 00:35:44,050 PROFESSOR 3: All right. 827 00:35:44,050 --> 00:35:46,820 So that's definitely the idea here. 828 00:35:46,820 --> 00:35:48,680 We coupled these and only explored 829 00:35:48,680 --> 00:35:50,512 the portion of the state space 830 00:35:50,512 --> 00:35:55,384 that we'll reach following an optimal policy [INAUDIBLE].. 831 00:35:55,384 --> 00:35:57,750 So we'll quickly talk about termination. 832 00:35:57,750 --> 00:36:00,770 We've touched on most of this. 833 00:36:00,770 --> 00:36:03,560 So it's most likely, when there are 834 00:36:03,560 --> 00:36:06,340 no more states to expand, that we've reached our goal. 835 00:36:06,340 --> 00:36:09,090 It's when our policy that we run on our entire envelope 836 00:36:09,090 --> 00:36:12,110 from value iteration doesn't say that we should 837 00:36:12,110 --> 00:36:14,625 go to any more terminal [INAUDIBLE] 838 00:36:14,625 --> 00:36:17,484 that we haven't looked at yet, that we haven't seen yet. 839 00:36:17,484 --> 00:36:19,400 We've said that those are only the things that 840 00:36:19,400 --> 00:36:21,310 are reachable and needed. 841 00:36:21,310 --> 00:36:24,550 Both, reachable because we're following the optimal policy, 842 00:36:24,550 --> 00:36:27,950 and needed, only if we might accidentally 843 00:36:27,950 --> 00:36:30,030 end up there probabilistically. 844 00:36:30,030 --> 00:36:33,000 This gives us the rigorous sense that we get a policy 845 00:36:33,000 --> 00:36:37,390 on the entire state space that we can end up in following 846 00:36:37,390 --> 00:36:38,830 the optimal policy.
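[EDITOR'S NOTE: The overall loop and the termination condition just described can be sketched at a high level. This mirrors the steps on the slides, not any particular implementation; solve_mdp stands in for value iteration run as a black box on the envelope, reachable_terminals for the check of which unexpanded states the current policy can land us in, and all names are illustrative.]

```python
# High-level sketch of the LAO-star-style loop: solve the envelope MDP,
# find the unexpanded states reachable under the resulting policy, expand
# them, and repeat. Terminates when no terminal state is reachable under
# the policy (e.g. the policy keeps us inside the expanded region).

def lao_star(s0, expand, solve_mdp, reachable_terminals):
    envelope = {s0}
    terminal = {s0}
    while True:
        policy = solve_mdp(envelope, terminal)      # value iteration, black box
        frontier = reachable_terminals(policy, terminal)
        if not frontier:
            # No unexpanded state reachable under the policy: done.
            return policy
        for s in frontier:                          # expand, with bookkeeping
            terminal.discard(s)
            for child in expand(s):
                if child not in envelope:           # only add unseen states --
                    envelope.add(child)             # this handles loopy graphs
                    terminal.add(child)
```

In the toy usage below, the "policy" simply maps each state to its successor, which keeps the reachability check trivial; a real implementation would map states to actions and follow the transition model.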
847 00:36:41,710 --> 00:36:44,924 The third bullet here touches on if we 848 00:36:44,924 --> 00:36:47,340 don't expand the states that are probabilistically coupled 849 00:36:47,340 --> 00:36:49,620 and we do accidentally end up there, 850 00:36:49,620 --> 00:36:52,840 we risk getting lost and not having a policy. 851 00:36:52,840 --> 00:36:56,080 We can compute this all off line and have a plan before we even 852 00:36:56,080 --> 00:36:58,456 start moving, to know exactly where we 853 00:36:58,456 --> 00:37:06,690 want to go even if our dynamics aren't [INAUDIBLE].. 854 00:37:06,690 --> 00:37:10,100 We've come back to this. 855 00:37:10,100 --> 00:37:11,880 This was our motivating example. 856 00:37:11,880 --> 00:37:15,040 And so we show that these real platforms can 857 00:37:15,040 --> 00:37:16,696 be modeled stochastically and then 858 00:37:16,696 --> 00:37:18,360 we can pretty easily deal with that. 859 00:37:18,360 --> 00:37:21,660 Search our state space and deal with those probabilities 860 00:37:21,660 --> 00:37:24,646 and expand the nodes that we might end up in. 861 00:37:24,646 --> 00:37:26,092 Right? 862 00:37:26,092 --> 00:37:27,710 And the heuristic allows us to not 863 00:37:27,710 --> 00:37:29,995 have to explore these areas of state space. 864 00:37:29,995 --> 00:37:32,180 We never actually end up there. 865 00:37:32,180 --> 00:37:35,495 We'll always be commanding toward 77. 866 00:37:35,495 --> 00:37:37,370 We're never going to try to command backward. 867 00:37:37,370 --> 00:37:39,450 Sure, if there's a gust of wind and we have some probability 868 00:37:39,450 --> 00:37:39,950 there. 869 00:37:39,950 --> 00:37:42,350 But you can imagine that we're going to only explore 870 00:37:42,350 --> 00:37:43,896 a small portion of this. 871 00:37:43,896 --> 00:37:46,830 Because we'll always be trying to correct to get back 872 00:37:46,830 --> 00:37:48,590 to the top four blocks.
873 00:37:48,590 --> 00:37:51,638 And using our reward function, we 874 00:37:51,638 --> 00:37:54,515 get to determine if we want to fly a quick path 875 00:37:54,515 --> 00:37:56,280 or if we want to fly a safer path. 876 00:37:56,280 --> 00:37:59,256 For example, our time times our probability 877 00:37:59,256 --> 00:38:02,172 of [INAUDIBLE]-- we want to perhaps reduce that. 878 00:38:06,060 --> 00:38:08,004 All right. 879 00:38:08,004 --> 00:38:12,455 Are there any questions about planning with MDPs, 880 00:38:12,455 --> 00:38:14,570 anything like that? 881 00:38:14,570 --> 00:38:15,500 I love this stuff. 882 00:38:15,500 --> 00:38:19,148 So the more questions, the more I get to talk. 883 00:38:19,148 --> 00:38:20,597 Fine. 884 00:38:20,597 --> 00:38:23,910 What I'm going to be talking about for the rest 885 00:38:23,910 --> 00:38:28,080 of this lecture is extending beyond MDPs 886 00:38:28,080 --> 00:38:30,390 to a broader class of problems called 887 00:38:30,390 --> 00:38:35,010 POMDPs, Partially Observable Markov Decision Processes. 888 00:38:35,010 --> 00:38:36,419 I love this stuff. 889 00:38:36,419 --> 00:38:37,460 I think it's really cool. 890 00:38:37,460 --> 00:38:39,030 They're really fun problems. 891 00:38:39,030 --> 00:38:41,660 We're going to talk about why they're so much harder to plan 892 00:38:41,660 --> 00:38:44,380 with, to execute-- but why they're 893 00:38:44,380 --> 00:38:47,270 important to at least know about so that you can model 894 00:38:47,270 --> 00:38:48,740 real world problems with them. 895 00:38:48,740 --> 00:38:51,570 And then we're going to delve into a case study 896 00:38:51,570 --> 00:38:55,180 of a specific POMDP solver. 897 00:38:55,180 --> 00:38:58,630 We're not going to go into as much detail as we did for MDPs, 898 00:38:58,630 --> 00:39:00,630 but we're going to look at what powerful results 899 00:39:00,630 --> 00:39:03,058 we can get by planning with POMDPs. 
900 00:39:05,986 --> 00:39:08,450 PROFESSOR 4: So first, I want to place 901 00:39:08,450 --> 00:39:10,770 this in the context of the overall talk. 902 00:39:10,770 --> 00:39:11,270 Right? 903 00:39:11,270 --> 00:39:13,750 We have this spectrum of uncertainty. 904 00:39:13,750 --> 00:39:16,420 And coupled with uncertainty is the difficulty 905 00:39:16,420 --> 00:39:19,830 of planning, of solving, of executing a problem. 906 00:39:19,830 --> 00:39:22,170 And we've killed these first two cases. 907 00:39:22,170 --> 00:39:23,630 That was really easy. 908 00:39:23,630 --> 00:39:27,150 And then we just discussed the bottom [INAUDIBLE], MDPs. 909 00:39:27,150 --> 00:39:29,535 What I'm going to talk about is the case 910 00:39:29,535 --> 00:39:34,080 where both your dynamics and your sensors are stochastic. 911 00:39:34,080 --> 00:39:35,440 Why is that important? 912 00:39:35,440 --> 00:39:37,690 It's because when we first saw this 913 00:39:37,690 --> 00:39:39,890 slide-- our motivating example slide, 914 00:39:39,890 --> 00:39:41,990 we only saw the left hand side. 915 00:39:41,990 --> 00:39:43,630 We said, our actions are uncertain. 916 00:39:43,630 --> 00:39:46,700 But good news, we have a perfect sensor-- 917 00:39:46,700 --> 00:39:48,080 a perfect camera. 918 00:39:48,080 --> 00:39:49,410 But that's unrealistic. 919 00:39:49,410 --> 00:39:51,430 I think we have all, to some extent, 920 00:39:51,430 --> 00:39:56,280 experienced the fact that no sensor is totally perfect. 921 00:39:56,280 --> 00:39:59,090 Your camera might have fluctuating pixel values. 922 00:39:59,090 --> 00:40:03,530 Your laser range finder is never going to read out exactly 923 00:40:03,530 --> 00:40:05,381 the right number all the time. 924 00:40:05,381 --> 00:40:07,630 You can have a camera in different lighting conditions 925 00:40:07,630 --> 00:40:09,600 that will behave differently. 926 00:40:09,600 --> 00:40:11,960 You might not be able to observe your full state. 
927 00:40:11,960 --> 00:40:14,790 That's, in a way, an imperfect sensor, right? 928 00:40:14,790 --> 00:40:17,440 If I'm in this room, I have imperfect eyes. 929 00:40:17,440 --> 00:40:19,520 I can't map out all of MIT's campus 930 00:40:19,520 --> 00:40:21,473 because I'm blocked by walls. 931 00:40:21,473 --> 00:40:23,980 How can you deal with the fact that you can't see all 932 00:40:23,980 --> 00:40:26,220 your obstacles all the time? 933 00:40:26,220 --> 00:40:28,740 We've already talked about some cases-- 934 00:40:28,740 --> 00:40:30,910 that there are some algorithms that can help us 935 00:40:30,910 --> 00:40:32,750 with that, like D Star Lite. 936 00:40:32,750 --> 00:40:35,846 But can you reason about these things probabilistically? 937 00:40:35,846 --> 00:40:39,463 And then finally, you might be in a non-unique environment 938 00:40:39,463 --> 00:40:43,430 where you cannot resolve your state with certainty no matter 939 00:40:43,430 --> 00:40:45,170 how good your sensors are. 940 00:40:45,170 --> 00:40:47,700 Imagine you're in a building with two identical hallways. 941 00:40:47,700 --> 00:40:50,360 You're dropped off in one of them. 942 00:40:50,360 --> 00:40:53,250 How can you figure out where you are? 943 00:40:53,250 --> 00:40:55,916 You can't unless you start exploring. 944 00:40:55,916 --> 00:40:59,420 And so we've got to deal with this uncertainty, 945 00:40:59,420 --> 00:41:03,760 right? It's part of every single problem. 946 00:41:03,760 --> 00:41:05,880 When observational uncertainty is small, 947 00:41:05,880 --> 00:41:07,160 you can maybe ignore it. 948 00:41:07,160 --> 00:41:09,770 But it's there. 949 00:41:09,770 --> 00:41:12,730 And so we're going to formulate this as a POMDP, 950 00:41:12,730 --> 00:41:16,150 a Partially Observable Markov Decision Process. 951 00:41:16,150 --> 00:41:19,230 And this next slide is just like the MDP slide. 952 00:41:19,230 --> 00:41:21,000 Hairy, but important. 953 00:41:21,000 --> 00:41:22,380 Right? 
954 00:41:22,380 --> 00:41:25,640 We can formulate a POMDP, which has seven elements. 955 00:41:25,640 --> 00:41:27,270 An MDP has five. 956 00:41:27,270 --> 00:41:30,655 Most of those are carried over here. 957 00:41:30,655 --> 00:41:33,215 We've got our set of states where we can be. 958 00:41:33,215 --> 00:41:35,880 We've got a set of actions that we can do. 959 00:41:35,880 --> 00:41:38,210 We've got our transition model which says, 960 00:41:38,210 --> 00:41:40,010 given that I started in one state 961 00:41:40,010 --> 00:41:41,810 and then I took an action, what's 962 00:41:41,810 --> 00:41:44,025 the probability I end up somewhere else? 963 00:41:44,025 --> 00:41:46,160 And like David was talking about, 964 00:41:46,160 --> 00:41:49,220 hopefully that distribution is pretty local-- 965 00:41:49,220 --> 00:41:51,720 we're not teleporting all over the world. 966 00:41:51,720 --> 00:41:52,970 We've got our reward function. 967 00:41:52,970 --> 00:41:56,240 This is exactly the same as for MDPs. 968 00:41:56,240 --> 00:41:59,561 And we've got our discount factor down here. 969 00:41:59,561 --> 00:42:03,070 The key difference of a POMDP is these two elements. 970 00:42:03,070 --> 00:42:05,550 We've got a set of possible observations 971 00:42:05,550 --> 00:42:08,930 and a probabilistic model for the probability 972 00:42:08,930 --> 00:42:12,663 of making an observation given your state and the action you 973 00:42:12,663 --> 00:42:14,800 just took. 974 00:42:14,800 --> 00:42:18,460 Now this is important-- I think it matches up really 975 00:42:18,460 --> 00:42:20,250 well with real world sensors, having 976 00:42:20,250 --> 00:42:21,776 this probabilistic model. 
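[As a concrete anchor for the seven-element tuple just listed, here is one minimal way to write it down. This is a sketch of the definition from the lecture, not any particular library's API; the field names are my own.]

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Sequence

@dataclass
class POMDP:
    states: Sequence[Hashable]        # S: where we can be
    actions: Sequence[Hashable]       # A: what we can do
    transition: Callable              # T(s, a, s') -> probability of landing in s'
    reward: Callable                  # R(s, a) -> immediate reward
    gamma: float                      # discount factor
    observations: Sequence[Hashable]  # Omega: what the sensor can report
    obs_model: Callable               # O(o, s', a) -> probability of seeing o

# An MDP is the same object minus the last two fields.
```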
977 00:42:21,776 --> 00:42:24,335 If you have a laser range finder, for example, 978 00:42:24,335 --> 00:42:26,950 and you're standing one foot away from the wall-- 979 00:42:26,950 --> 00:42:29,090 now a perfect sensor would always say, 980 00:42:29,090 --> 00:42:30,720 you're one foot away from the wall. 981 00:42:30,720 --> 00:42:33,060 You're one foot away from the wall. 982 00:42:33,060 --> 00:42:34,550 Every single reading is constant. 983 00:42:34,550 --> 00:42:38,180 But realistically, there might be Gaussian noise, for example. 984 00:42:38,180 --> 00:42:39,768 Or in a more extreme case, it says, 985 00:42:39,768 --> 00:42:41,143 you're one foot away from the wall. 986 00:42:41,143 --> 00:42:41,809 You're two feet. 987 00:42:41,809 --> 00:42:42,795 You're right there. 988 00:42:42,795 --> 00:42:44,765 There's this distribution. 989 00:42:44,765 --> 00:42:48,890 And so you would ideally characterize this distribution. 990 00:42:48,890 --> 00:42:50,450 And you plug that into this model 991 00:42:50,450 --> 00:42:54,170 and that formulates your POMDP. 992 00:42:54,170 --> 00:42:57,800 This sounds really hairy, but if you work through just a sample 993 00:42:57,800 --> 00:43:01,910 iteration of living in a POMDP world, that's not too bad. 994 00:43:01,910 --> 00:43:04,863 You start at some state, S. You take an action, 995 00:43:04,863 --> 00:43:06,985 A. With some probability, you're going 996 00:43:06,985 --> 00:43:09,022 to end up in a bunch of different states based 997 00:43:09,022 --> 00:43:10,710 on your transition model. 998 00:43:10,710 --> 00:43:14,679 At that point, we can use the lessons we learned from MDP 999 00:43:14,679 --> 00:43:16,595 land where we said, when we make observations, 1000 00:43:16,595 --> 00:43:20,840 we reduce our uncertainty. We collapse into a single state. 1001 00:43:20,840 --> 00:43:23,120 So we say, let's make an observation. 
1002 00:43:23,120 --> 00:43:25,760 But this time, observations aren't guaranteed 1003 00:43:25,760 --> 00:43:27,317 to resolve all our uncertainty. 1004 00:43:27,317 --> 00:43:28,400 So we make an observation. 1005 00:43:28,400 --> 00:43:31,840 And that observation is probabilistic 1006 00:43:31,840 --> 00:43:35,550 based on our current state and the action we just took. 1007 00:43:35,550 --> 00:43:38,750 And again, obviously, it depends on your current state. 1008 00:43:38,750 --> 00:43:40,710 Because if you're one foot away from a wall, 1009 00:43:40,710 --> 00:43:43,610 hopefully you'll get a different characterization 1010 00:43:43,610 --> 00:43:46,790 of observations than if you're 20 feet away from the wall. 1011 00:43:46,790 --> 00:43:50,480 Otherwise, your sensor is totally useless. 1012 00:43:50,480 --> 00:43:52,777 Are there any questions about this formulation? 1013 00:43:52,777 --> 00:43:53,360 AUDIENCE: Yep. 1014 00:43:53,360 --> 00:43:54,270 Quick question. 1015 00:43:54,270 --> 00:43:56,976 So when we take an observation and then 1016 00:43:56,976 --> 00:43:58,940 try to infer which state we're in, 1017 00:43:58,940 --> 00:44:01,620 is it just a clustering problem? 1018 00:44:01,620 --> 00:44:06,140 For instance, the multi-cluster Gaussian [INAUDIBLE] models. 1019 00:44:06,140 --> 00:44:09,594 So class A, class B, class C, which are states, 1020 00:44:09,594 --> 00:44:11,010 then taking an observation there's 1021 00:44:11,010 --> 00:44:14,362 a high probability [INAUDIBLE]. This 1022 00:44:14,362 --> 00:44:16,870 is what we're trying to do for each observation over here. 1023 00:44:16,870 --> 00:44:21,300 So we're trying to find clustering [INAUDIBLE]. 1024 00:44:21,300 --> 00:44:23,857 PROFESSOR 4: So you could, I imagine, 1025 00:44:23,857 --> 00:44:25,940 implement an algorithm where, yeah, every time you 1026 00:44:25,940 --> 00:44:28,460 make an observation, you then try to say, all right. 
1027 00:44:28,460 --> 00:44:31,420 What's my most likely estimate, or maybe my [INAUDIBLE] 1028 00:44:31,420 --> 00:44:33,740 least cost estimate? 1029 00:44:33,740 --> 00:44:36,200 But inherent with that is the risk 1030 00:44:36,200 --> 00:44:38,720 that you're discarding a lot of information, right? 1031 00:44:38,720 --> 00:44:40,725 Because you're going to generate a probability 1032 00:44:40,725 --> 00:44:44,200 distribution over your state. 1033 00:44:44,200 --> 00:44:46,470 And so, yes, you can say, I'm going 1034 00:44:46,470 --> 00:44:48,584 to stick with the maximum likelihood estimate. 1035 00:44:48,584 --> 00:44:50,390 But if you can, you should probably 1036 00:44:50,390 --> 00:44:51,925 try to maintain that distribution 1037 00:44:51,925 --> 00:44:54,060 as long as possible. 1038 00:44:54,060 --> 00:44:54,560 OK. 1039 00:44:54,560 --> 00:44:57,570 And we'll see that this is really 1040 00:44:57,570 --> 00:44:59,180 computationally expensive unless you 1041 00:44:59,180 --> 00:45:00,946 start making some assumptions. 1042 00:45:00,946 --> 00:45:04,280 And in the case study we're going to look into, 1043 00:45:04,280 --> 00:45:06,040 that's exactly what [INAUDIBLE]. 1044 00:45:06,040 --> 00:45:08,740 But I've seen a lot in the literature 1045 00:45:08,740 --> 00:45:11,540 that as much as you can, you want 1046 00:45:11,540 --> 00:45:14,864 to maintain these distributions for improved accuracy. 1047 00:45:14,864 --> 00:45:16,748 Any other questions? 1048 00:45:20,030 --> 00:45:20,530 All right. 1049 00:45:20,530 --> 00:45:24,070 Well, we're going to compare now the execution of a POMDP 1050 00:45:24,070 --> 00:45:27,668 to the execution of an MDP. 1051 00:45:27,668 --> 00:45:31,480 We started out-- we're living in the same real world. 1052 00:45:31,480 --> 00:45:32,980 We've got our same transition model. 1053 00:45:32,980 --> 00:45:34,680 Everything is peachy. 1054 00:45:34,680 --> 00:45:37,090 We take action one, and we want to go North. 
1055 00:45:37,090 --> 00:45:40,100 We have this distribution that we generated over the states. 1056 00:45:40,100 --> 00:45:42,380 And at this point, what did we say? 1057 00:45:42,380 --> 00:45:43,770 We said, we hate the fact that we 1058 00:45:43,770 --> 00:45:45,020 have to deal with three cases. 1059 00:45:45,020 --> 00:45:46,580 Three is two too many. 1060 00:45:46,580 --> 00:45:50,520 So let's make an observation to collapse this distribution. 1061 00:45:50,520 --> 00:45:53,730 Now I've described a lot about noisy sensors, 1062 00:45:53,730 --> 00:45:55,990 right, where basically it's a true measurement 1063 00:45:55,990 --> 00:45:59,030 plus some noise, maybe a Gaussian distribution. 1064 00:45:59,030 --> 00:46:02,360 There's another partially observable sensor 1065 00:46:02,360 --> 00:46:05,390 you can have in the POMDP which really 1066 00:46:05,390 --> 00:46:07,850 feeds into the name, Partially Observable. 1067 00:46:07,850 --> 00:46:11,520 What if you can only observe part of your state? 1068 00:46:11,520 --> 00:46:14,960 For example, if you're living in an x-y grid, 1069 00:46:14,960 --> 00:46:18,520 maybe you can only observe your y dimension. 1070 00:46:18,520 --> 00:46:20,550 This matches up, in a real world example, 1071 00:46:20,550 --> 00:46:22,630 to a quadrotor flying down a hallway. 1072 00:46:22,630 --> 00:46:26,032 Catherine was working on a DARPA project 1073 00:46:26,032 --> 00:46:28,290 with quadrotors flying down hallways. 1074 00:46:28,290 --> 00:46:30,660 If the hallway is too long, your laser range finder 1075 00:46:30,660 --> 00:46:32,340 isn't going to be able to determine 1076 00:46:32,340 --> 00:46:34,040 where you are along one axis. 1077 00:46:34,040 --> 00:46:36,840 But it can tell where you are along another. 1078 00:46:36,840 --> 00:46:38,420 And so that's what I've said. 1079 00:46:38,420 --> 00:46:41,270 I've said, pretend this quadrotor has that sensor. 1080 00:46:41,270 --> 00:46:43,900 We can only observe its y component. 
1081 00:46:43,900 --> 00:46:46,480 And it says, my y component is 3. 1082 00:46:46,480 --> 00:46:49,170 I have no idea what my x component is. 1083 00:46:49,170 --> 00:46:50,200 Well, this sucks. 1084 00:46:50,200 --> 00:46:50,970 Right? 1085 00:46:50,970 --> 00:46:52,900 Because we got rid of this state, 1086 00:46:52,900 --> 00:46:58,536 but then we couldn't decide, are we at state (1,3) or (3,3)? 1087 00:46:58,536 --> 00:46:59,910 There's no way of resolving this. 1088 00:46:59,910 --> 00:47:00,870 So we can re-normalize. 1089 00:47:00,870 --> 00:47:05,260 We can add in the effect from the observation probabilities 1090 00:47:05,260 --> 00:47:07,830 saying, maybe, in fact, I'm far more 1091 00:47:07,830 --> 00:47:12,550 likely to observe a y component of 3 if I'm at (1,3). 1092 00:47:12,550 --> 00:47:14,790 But in this case, we say it's equally likely 1093 00:47:14,790 --> 00:47:18,284 to make that observation for those two states. 1094 00:47:18,284 --> 00:47:21,050 And so now we've got to deal with these two cases. 1095 00:47:21,050 --> 00:47:23,550 And so we can take our next action. 1096 00:47:23,550 --> 00:47:25,600 Instead of resetting to a single state, 1097 00:47:25,600 --> 00:47:28,751 we've got to keep growing this tree. 1098 00:47:28,751 --> 00:47:31,040 And what's the key difference between this 1099 00:47:31,040 --> 00:47:33,430 and when we were executing our MDP? 1100 00:47:33,430 --> 00:47:35,670 It's that we didn't manage to collapse back 1101 00:47:35,670 --> 00:47:36,580 to a single state. 1102 00:47:36,580 --> 00:47:39,861 We didn't manage to reset the problem. 1103 00:47:39,861 --> 00:47:42,630 And this is annoying. 1104 00:47:42,630 --> 00:47:45,390 Because you can't execute a policy and say, 1105 00:47:45,390 --> 00:47:47,340 I'm certain that this is my configuration. 1106 00:47:47,340 --> 00:47:50,385 So your policy can't map from exact states 1107 00:47:50,385 --> 00:47:55,416 to actions, because you never know your exact state. 
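[The renormalization step just walked through is a plain Bayes update on the belief. Here is a minimal sketch of it, using a grid example like the one on the slide-- the specific cells and probabilities below are my own reconstruction, not taken from the slide itself.]

```python
def belief_update(belief, obs, obs_prob):
    """Bayes rule over a discrete belief: b'(s) is proportional to P(obs | s) * b(s)."""
    unnormalized = {s: obs_prob(obs, s) * p for s, p in belief.items()}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under current belief")
    return {s: p / total for s, p in unnormalized.items()}

# Belief over (x, y) cells after the noisy move; the sensor reads only y.
belief = {(1, 3): 0.4, (3, 3): 0.4, (2, 4): 0.2}

def reads_y(obs, state):
    return 1.0 if state[1] == obs else 0.0   # perfect y sensor, blind in x

posterior = belief_update(belief, 3, reads_y)
# (2, 4) is ruled out; (1, 3) and (3, 3) stay equally likely at 0.5 each.
```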
1108 00:47:55,416 --> 00:47:56,290 Does this make sense? 1109 00:47:56,290 --> 00:47:59,700 Has everyone lost hope in planning? 1110 00:47:59,700 --> 00:48:01,610 Yeah. 1111 00:48:01,610 --> 00:48:04,840 AUDIENCE: So in here because-- 1112 00:48:04,840 --> 00:48:06,920 so from the left to the right, you're 1113 00:48:06,920 --> 00:48:09,530 basically mapping from one belief state to another belief 1114 00:48:09,530 --> 00:48:10,030 state. 1115 00:48:10,030 --> 00:48:11,620 So it's like a one arrow thing. 1116 00:48:11,620 --> 00:48:15,040 But then from that second layer, 1117 00:48:15,040 --> 00:48:16,750 you should have two arrows. 1118 00:48:16,750 --> 00:48:20,800 One with probability 0.5 that observes 3, 1119 00:48:20,800 --> 00:48:23,680 and it gives rise to that belief state you just showed. 1120 00:48:23,680 --> 00:48:27,734 And one with probability 0.5 where the sensor reads 4. 1121 00:48:27,734 --> 00:48:29,192 Because if you have a nonzero probability 1122 00:48:29,192 --> 00:48:31,140 of being at (2,4), you might well just 1123 00:48:31,140 --> 00:48:33,160 observe 4 as your y. 1124 00:48:33,160 --> 00:48:36,670 So in this case, I'm not sure if you were just 1125 00:48:36,670 --> 00:48:41,205 trying to show one branch of your POMDP planning, 1126 00:48:41,205 --> 00:48:42,580 but basically what you have to do 1127 00:48:42,580 --> 00:48:44,790 is you would go like this to the second layer. 1128 00:48:44,790 --> 00:48:48,990 You would have a branch with a 0.5 probability on either side. 1129 00:48:48,990 --> 00:48:52,205 One giving you 3, one giving you 4. 1130 00:48:52,205 --> 00:48:54,580 For the one that is shown, you've got that belief state. 1131 00:48:54,580 --> 00:48:58,810 For the other one, you get the state (2,4) with probability 1. 1132 00:48:58,810 --> 00:49:00,220 PROFESSOR 4: Yeah. 1133 00:49:00,220 --> 00:49:04,667 So you were perfectly describing planning, right? 
1134 00:49:04,667 --> 00:49:06,125 I should have made this more clear. 1135 00:49:06,125 --> 00:49:06,970 This isn't planning. 1136 00:49:06,970 --> 00:49:08,160 This is executing. 1137 00:49:08,160 --> 00:49:10,720 We have this policy that we're going to execute. 1138 00:49:10,720 --> 00:49:12,220 And so if we were planning, we would 1139 00:49:12,220 --> 00:49:17,320 have to consider all these branches and say, well, yeah. 1140 00:49:17,320 --> 00:49:19,830 There's a 50% chance I'll end up here, in which case 1141 00:49:19,830 --> 00:49:21,164 my y value is going to read 4, and you 1142 00:49:21,164 --> 00:49:22,413 have to grow this whole thing. 1143 00:49:22,413 --> 00:49:23,335 I'm saying, no. 1144 00:49:23,335 --> 00:49:26,210 This is real time execution. 1145 00:49:26,210 --> 00:49:27,940 Yeah. 1146 00:49:27,940 --> 00:49:28,885 Great question. 1147 00:49:28,885 --> 00:49:32,210 Any others? 1148 00:49:32,210 --> 00:49:35,290 Well, this is a great time to transition to, well, 1149 00:49:35,290 --> 00:49:38,760 we can't just magically be handed these policies. 1150 00:49:38,760 --> 00:49:40,436 How do we actually generate them? 1151 00:49:40,436 --> 00:49:42,310 How do we start planning in the belief space? 1152 00:49:42,310 --> 00:49:46,290 The belief space is the space of distributions 1153 00:49:46,290 --> 00:49:49,132 over possible configurations. 1154 00:49:49,132 --> 00:49:52,870 So I'm going to talk about a general class of algorithms. 1155 00:49:52,870 --> 00:49:56,450 A lot of planners in POMDP land and the belief space 1156 00:49:56,450 --> 00:50:01,025 plan with probabilistic roadmaps-- PRMs. 1157 00:50:01,025 --> 00:50:04,670 The goal is to generate a policy that maps from a belief 1158 00:50:04,670 --> 00:50:07,180 state to an action. 1159 00:50:07,180 --> 00:50:09,030 And I'm going to go into a little more 1160 00:50:09,030 --> 00:50:12,160 detail about what a belief state is. 
1161 00:50:12,160 --> 00:50:14,800 But the general algorithm in this graphic 1162 00:50:14,800 --> 00:50:16,370 illustrates these four steps. 1163 00:50:16,370 --> 00:50:20,570 We're going to sample points from the configuration space, 1164 00:50:20,570 --> 00:50:22,362 as if everything was deterministic. 1165 00:50:22,362 --> 00:50:25,110 We're going to connect those points to nearby points. 1166 00:50:25,110 --> 00:50:27,260 And define nearby however you want. 1167 00:50:27,260 --> 00:50:29,660 It could be your closest neighbors, all neighbors 1168 00:50:29,660 --> 00:50:31,460 within a radius, whatever. 1169 00:50:31,460 --> 00:50:35,390 As long as those edges don't collide with obstacles. 1170 00:50:35,390 --> 00:50:37,615 Once you've done that, somehow-- 1171 00:50:37,615 --> 00:50:39,565 and there's some magic in this-- 1172 00:50:39,565 --> 00:50:43,190 you're going to transform your configuration space 1173 00:50:43,190 --> 00:50:45,860 probabilistic roadmap to a probabilistic road 1174 00:50:45,860 --> 00:50:47,350 map in the belief space. 1175 00:50:47,350 --> 00:50:48,210 Great. 1176 00:50:48,210 --> 00:50:49,668 Once you've done that, you can just 1177 00:50:49,668 --> 00:50:52,990 do shortest path depending on whatever cost function you use. 1178 00:50:52,990 --> 00:50:57,560 And what's really cool is you get different paths for when 1179 00:50:57,560 --> 00:50:59,420 you stay in the configuration space 1180 00:50:59,420 --> 00:51:01,100 and when you go to the belief space. 1181 00:51:01,100 --> 00:51:05,900 So the green path, in that bottom right figure, 1182 00:51:05,900 --> 00:51:09,050 seems a lot longer than the red path. 1183 00:51:09,050 --> 00:51:12,120 And the reason is the green path was planned in the belief space 1184 00:51:12,120 --> 00:51:16,250 and followed a lot of landmarks that the quadrotor could 1185 00:51:16,250 --> 00:51:17,330 take measurements off of. 
1186 00:51:17,330 --> 00:51:19,780 So it was really confident about its position 1187 00:51:19,780 --> 00:51:22,590 whereas the red path is the shorter path that had a higher 1188 00:51:22,590 --> 00:51:24,330 likelihood of a collision. 1189 00:51:24,330 --> 00:51:29,620 And we're going to look into that figure more later. 1190 00:51:29,620 --> 00:51:32,500 I'm going to segment this algorithm into two parts. 1191 00:51:32,500 --> 00:51:34,780 The first part-- these first two steps-- 1192 00:51:34,780 --> 00:51:37,080 it's just probabilistic road maps. 1193 00:51:37,080 --> 00:51:40,360 Who here has heard about probabilistic road maps? 1194 00:51:40,360 --> 00:51:42,720 Raise your hand. 1195 00:51:42,720 --> 00:51:43,220 OK. 1196 00:51:43,220 --> 00:51:44,070 50/50. 1197 00:51:44,070 --> 00:51:46,750 I'm so excited for the 50% who haven't heard. 1198 00:51:46,750 --> 00:51:49,306 One of the top algorithms, in my opinion. 1199 00:51:49,306 --> 00:51:51,180 It's really simple, and it's really powerful. 1200 00:51:53,980 --> 00:51:57,810 Here's basically almost a complete implementation 1201 00:51:57,810 --> 00:51:59,180 of probabilistic road maps. 1202 00:51:59,180 --> 00:52:03,030 It's pseudocode so don't copy paste, but it's almost there. 1203 00:52:03,030 --> 00:52:05,670 You're going to construct a graph. 1204 00:52:05,670 --> 00:52:07,382 You're going to add your start and goal. 1205 00:52:07,382 --> 00:52:09,590 Your start configuration being the green dot and the 1206 00:52:09,590 --> 00:52:11,519 goal configuration, the red dot. 1207 00:52:11,519 --> 00:52:13,560 And then you're just going to keep sampling nodes 1208 00:52:13,560 --> 00:52:14,700 from the free space. 1209 00:52:14,700 --> 00:52:17,970 You say, how about (2,3)? 1210 00:52:17,970 --> 00:52:19,770 You're going to add that to your graph. 1211 00:52:19,770 --> 00:52:22,700 You're going to connect that node to a bunch of other nodes 1212 00:52:22,700 --> 00:52:23,214 nearby. 
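[The construction being walked through-- seed the graph with start and goal, sample free configurations, wire each sample to its nearest neighbors with collision-free edges-- can be sketched like this. It is a toy version, not the slide's pseudocode; `sample_free` and `collision_free` are assumed placeholders for whatever the environment provides.]

```python
import math

def build_prm(start, goal, sample_free, collision_free, n_samples=200, k=7):
    """Probabilistic roadmap: adjacency lists over sampled free configurations."""
    nodes = [start, goal] + [sample_free() for _ in range(n_samples)]
    edges = {p: [] for p in nodes}
    for p in nodes:
        neighbors = sorted((q for q in nodes if q != p),
                           key=lambda q: math.dist(p, q))[:k]
        for q in neighbors:
            if collision_free(p, q):       # drop edges that cross obstacles
                edges[p].append(q)
    return edges
```

[Running any shortest-path search over `edges` then gives the plan; a fuller implementation would also make the edges symmetric and keep sampling until start and goal end up connected.]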
1213 00:52:23,214 --> 00:52:24,630 And then you're just going to keep 1214 00:52:24,630 --> 00:52:28,069 sampling until maybe you have enough nodes 1215 00:52:28,069 --> 00:52:29,860 or maybe until you have your complete path. 1216 00:52:29,860 --> 00:52:34,080 That should be happening there. 1217 00:52:34,080 --> 00:52:36,262 And then once you've got this whole graph, 1218 00:52:36,262 --> 00:52:38,220 you can just find the shortest path along that. 1219 00:52:38,220 --> 00:52:39,803 And there are some really cool results 1220 00:52:39,803 --> 00:52:43,960 that if you sample in a good way, then, asymptotically-- 1221 00:52:43,960 --> 00:52:45,990 as you start sampling more, you're 1222 00:52:45,990 --> 00:52:51,560 going to approach the best path in a completely 1223 00:52:51,560 --> 00:52:53,065 continuous space. 1224 00:52:53,065 --> 00:52:55,090 The power of probabilistic road maps 1225 00:52:55,090 --> 00:52:57,800 and a bunch of randomized algorithms 1226 00:52:57,800 --> 00:53:00,500 though is that they scale pretty well to high dimensions. 1227 00:53:00,500 --> 00:53:03,890 So you don't need to actually consider the continuous space. 1228 00:53:03,890 --> 00:53:06,526 You can just sample [INAUDIBLE]. 1229 00:53:06,526 --> 00:53:09,802 Are there any questions about probabilistic road maps? 1230 00:53:12,570 --> 00:53:13,150 Really cool. 1231 00:53:13,150 --> 00:53:15,670 If you're interested and you just heard about PRMs, 1232 00:53:15,670 --> 00:53:18,580 you probably haven't heard about RRTs. 1233 00:53:18,580 --> 00:53:20,957 Those are also really cool. 1234 00:53:20,957 --> 00:53:22,833 AUDIENCE: Just a quick [INAUDIBLE] question. 1235 00:53:22,833 --> 00:53:26,050 So for any node, [INAUDIBLE] at this uniform example 1236 00:53:26,050 --> 00:53:26,960 [INAUDIBLE]. 1237 00:53:30,429 --> 00:53:31,220 PROFESSOR 4: Sorry. 1238 00:53:31,220 --> 00:53:33,154 When you're choosing where to place the dot? 
1239 00:53:33,154 --> 00:53:34,070 Or what to connect to? 1240 00:53:34,070 --> 00:53:35,778 AUDIENCE: [INAUDIBLE] there's a dot there 1241 00:53:35,778 --> 00:53:37,400 on the side [INAUDIBLE] right? 1242 00:53:37,400 --> 00:53:40,015 So when I do the sampling, I just [INAUDIBLE] this node. 1243 00:53:40,015 --> 00:53:42,420 I uniformly choose one of them [INAUDIBLE]. 1244 00:53:45,490 --> 00:53:49,290 PROFESSOR 4: So in general, probabilistic roadmaps, 1245 00:53:49,290 --> 00:53:51,430 you can throw in whatever sampler you want. 1246 00:53:51,430 --> 00:53:53,470 The way this particular one-- 1247 00:53:53,470 --> 00:53:57,400 the way I implement this is you sample points uniformly 1248 00:53:57,400 --> 00:53:59,160 from the entire space. 1249 00:53:59,160 --> 00:54:01,890 If it's inside an obstacle, you remove it. 1250 00:54:01,890 --> 00:54:04,090 Once you place it-- 1251 00:54:04,090 --> 00:54:07,150 this was connect to the K closest-- 1252 00:54:07,150 --> 00:54:10,750 I think K is 7 in this case. 1253 00:54:10,750 --> 00:54:14,020 And if an edge collides 1254 00:54:14,020 --> 00:54:16,520 with an obstacle, you remove it. 1255 00:54:16,520 --> 00:54:20,490 I'd be happy to go into more detail on PRMs later. 1256 00:54:20,490 --> 00:54:21,740 All right. 1257 00:54:21,740 --> 00:54:23,960 But that's not enough. 1258 00:54:23,960 --> 00:54:27,680 Because what we described was PRMs in the configuration 1259 00:54:27,680 --> 00:54:29,950 space, but what we need to do is somehow 1260 00:54:29,950 --> 00:54:32,440 elevate a PRM from the configuration space 1261 00:54:32,440 --> 00:54:34,675 to a belief space. 1262 00:54:34,675 --> 00:54:36,860 And this is really hard. 1263 00:54:36,860 --> 00:54:40,570 We don't have access to these raw configurations. 1264 00:54:40,570 --> 00:54:43,570 Let's imagine we were in this really simple world 1265 00:54:43,570 --> 00:54:46,120 where the quadrotor could be in three possible states. 
1266 00:54:46,120 --> 00:54:47,830 One, two, three. 1267 00:54:47,830 --> 00:54:48,910 Really easy, right? 1268 00:54:48,910 --> 00:54:50,234 Sample a bunch of points. 1269 00:54:50,234 --> 00:54:52,150 They're going to end up in one, two, or three. 1270 00:54:52,150 --> 00:54:55,720 You could pretty quickly cover the entire space. 1271 00:54:55,720 --> 00:54:57,790 But this simple configuration space, 1272 00:54:57,790 --> 00:55:02,040 when transformed to the belief space, becomes infinite. 1273 00:55:02,040 --> 00:55:05,495 You have infinite possible distributions to consider. 1274 00:55:05,495 --> 00:55:07,060 There is the distribution where you 1275 00:55:07,060 --> 00:55:08,800 have 100% probability 1276 00:55:08,800 --> 00:55:13,902 that you're in state 1, 100% probability 1277 00:55:13,902 --> 00:55:14,860 that you're in state 2, 1278 00:55:14,860 --> 00:55:16,870 100% probability that you're in state 3. 1279 00:55:16,870 --> 00:55:20,100 And then everything in between. 1280 00:55:20,100 --> 00:55:22,490 We went from three to infinite. 1281 00:55:22,490 --> 00:55:24,490 This is not boding well. 1282 00:55:24,490 --> 00:55:26,132 And even if you start saying, well, I'm 1283 00:55:26,132 --> 00:55:28,007 not going to consider the whole distribution. 1284 00:55:28,007 --> 00:55:31,080 I just care about the mean and the variance, 1285 00:55:31,080 --> 00:55:34,850 it's still not pretty, right? 1286 00:55:34,850 --> 00:55:42,455 Well, this is where we have to start making approximations. 1287 00:55:42,455 --> 00:55:44,900 And this is where you start getting differences 1288 00:55:44,900 --> 00:55:48,475 in POMDP planners-- where they make assumptions, 1289 00:55:48,475 --> 00:55:51,480 where they make approximations. 1290 00:55:51,480 --> 00:55:55,170 So these images are from the belief roadmap 1291 00:55:55,170 --> 00:55:59,090 paper from the Robust Robotics group a couple of years ago. 
1292 00:55:59,090 --> 00:56:01,790 But I'm going to talk about a different planner soon. 1293 00:56:01,790 --> 00:56:06,500 But the idea behind a lot of these planners 1294 00:56:06,500 --> 00:56:09,680 is maybe we can start saying what these distributions are 1295 00:56:09,680 --> 00:56:12,510 going to look like based on our models. 1296 00:56:12,510 --> 00:56:16,650 So in our planning problem, if we know we started at (2,3) 1297 00:56:16,650 --> 00:56:19,295 and we know our transition distribution, 1298 00:56:19,295 --> 00:56:22,729 we can start saying, well, this is my probability distribution. 1299 00:56:22,729 --> 00:56:24,270 And then when I make an observation, 1300 00:56:24,270 --> 00:56:27,165 I can build distributions off that. 1301 00:56:27,165 --> 00:56:31,000 And so, if you could exhaustively 1302 00:56:31,000 --> 00:56:33,785 propagate these distributions forward, that would be great. 1303 00:56:33,785 --> 00:56:36,180 But it's unrealistic. 1304 00:56:36,180 --> 00:56:41,440 And I just want to point out the visual way 1305 00:56:41,440 --> 00:56:42,815 to represent these distributions. 1306 00:56:42,815 --> 00:56:46,210 A really nice way of seeing it: in the deterministic world, 1307 00:56:46,210 --> 00:56:47,950 you have these dots and edges. 1308 00:56:47,950 --> 00:56:51,320 In the probabilistic world, these circles, 1309 00:56:51,320 --> 00:56:54,380 these ellipsoids, represent uncertainty. 1310 00:56:54,380 --> 00:56:56,690 Typically, it's one standard deviation or three 1311 00:56:56,690 --> 00:56:58,560 standard deviations away. 1312 00:56:58,560 --> 00:57:01,976 And so you can start building into the map 1313 00:57:01,976 --> 00:57:03,850 these distributions and the variances 1314 00:57:03,850 --> 00:57:06,910 I can expect at these nodes. 1315 00:57:06,910 --> 00:57:09,742 Are there any questions about this stuff? 1316 00:57:13,060 --> 00:57:13,670 All right. 
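[Propagating a belief forward and then folding in a measurement, as described above, is the predict/update cycle of a Bayes filter. Under the common Gaussian assumption it reduces to a couple of lines; this is a generic one-dimensional sketch of that cycle, not the belief roadmap paper's implementation.]

```python
def predict(mean, var, a, q):
    """Push a Gaussian belief through dynamics x' = a*x + noise: variance grows by q."""
    return a * mean, a * a * var + q

def update(mean, var, z, r):
    """Fold in a noisy measurement z with variance r: variance shrinks."""
    gain = var / (var + r)                     # Kalman gain
    return mean + gain * (z - mean), (1 - gain) * var

m, v = predict(0.0, 1.0, a=1.0, q=0.5)         # uncertainty ellipse grows: v = 1.5
m, v = update(m, v, z=1.0, r=0.5)              # observation collapses it: v = 0.375
```

[The growing and shrinking variance is exactly what the uncertainty ellipsoids on the slide visualize at each node.]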
1317 00:57:13,670 --> 00:57:17,680 Well, we're going to delve now into a specific case study. 1318 00:57:17,680 --> 00:57:19,620 Feedback-Based Information State Roadmaps-- 1319 00:57:19,620 --> 00:57:21,324 FIRM. 1320 00:57:21,324 --> 00:57:23,740 From now on, that's the only way I'm going to refer to it. 1321 00:57:26,690 --> 00:57:28,560 The idea behind this is you're going 1322 00:57:28,560 --> 00:57:33,260 to sample mean configurations from your configuration space. 1323 00:57:33,260 --> 00:57:38,355 Then you want to build an LQR-- that's 1324 00:57:38,355 --> 00:57:40,910 a Linear Quadratic Regulator controller-- 1325 00:57:40,910 --> 00:57:44,910 around these mean points. 1326 00:57:44,910 --> 00:57:47,810 And that will generate what variance you can tolerate. 1327 00:57:47,810 --> 00:57:53,470 So LQR controllers, if you don't know, they're really nice. 1328 00:57:53,470 --> 00:57:55,955 Around a small region around a point, 1329 00:57:55,955 --> 00:58:00,243 they can drive a quadrotor, for example, to that point. 1330 00:58:00,243 --> 00:58:04,170 And so if you build these LQR controllers around points, 1331 00:58:04,170 --> 00:58:06,120 you can say, all right, anytime I end up 1332 00:58:06,120 --> 00:58:09,500 in this cloud in the belief space-- 1333 00:58:09,500 --> 00:58:14,810 so any sort of distribution-- it can bring it back to that mean. 1334 00:58:14,810 --> 00:58:17,840 And so now, what we've done is we've generated all these points; 1335 00:58:17,840 --> 00:58:19,290 we just need to connect them. 1336 00:58:19,290 --> 00:58:24,050 And the idea is if you have a feedback-based controller 1337 00:58:24,050 --> 00:58:27,740 that can get from one cloud to another from anywhere in the cloud, 1338 00:58:27,740 --> 00:58:31,760 then you can get from point to point. 1339 00:58:31,760 --> 00:58:32,762 All right. 1340 00:58:32,762 --> 00:58:34,220 This is how you generate the graph. 
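[Editor's note: the stabilizing role of the node controllers can be sketched with a standard discrete-time LQR on a toy double integrator. This is not the model from the FIRM paper; the dynamics and cost weights are made up, and the point is only the Riccati recursion and the closed loop pulling a nearby state back to the sampled mean.]

```python
import numpy as np

# Toy double-integrator dynamics (position, velocity), dt = 0.1.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[0.01]])    # illustrative weights

# Backward Riccati recursion to get the LQR gain K.
P = Q.copy()
for _ in range(200):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)

# Start offset from the sampled mean; closed loop u = -K x pulls it back.
x = np.array([0.5, -0.2])
for _ in range(100):
    x = (A - B @ K) @ x

print(np.linalg.norm(x))   # driven close to the mean (near zero)
```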
1341 00:58:34,220 --> 00:58:36,255 And what's cool is the way they formulated 1342 00:58:36,255 --> 00:58:39,360 the problem is they said, well, set up 1343 00:58:39,360 --> 00:58:41,190 the cost of executing an edge. 1344 00:58:41,190 --> 00:58:43,530 We switched from rewards to costs 1345 00:58:43,530 --> 00:58:46,700 because we're pessimists now. 1346 00:58:46,700 --> 00:58:49,110 Well, the cost is going to be a linear combination 1347 00:58:49,110 --> 00:58:53,610 of the expected time to execute that edge and the uncertainty 1348 00:58:53,610 --> 00:58:55,262 along that edge. 1349 00:58:55,262 --> 00:58:58,020 Now, this is actually really cool to play with. 1350 00:58:58,020 --> 00:59:04,475 What do you think would happen if I set beta to 0 in this? 1351 00:59:04,475 --> 00:59:07,829 Like, what sort of [INAUDIBLE] would you get? 1352 00:59:07,829 --> 00:59:09,870 I will cold call because I know a few names here. 1353 00:59:12,410 --> 00:59:12,910 Anybody? 1354 00:59:12,910 --> 00:59:17,181 I don't want to cold call someone who may [INAUDIBLE]. 1355 00:59:17,181 --> 00:59:19,180 AUDIENCE: You're going to get the shortest path. 1356 00:59:19,180 --> 00:59:21,179 PROFESSOR 4: Shortest path, that's right, right? 1357 00:59:21,179 --> 00:59:22,600 This term goes away. 1358 00:59:22,600 --> 00:59:24,615 The cost is a function of the time-- 1359 00:59:24,615 --> 00:59:26,030 shortest path. 1360 00:59:26,030 --> 00:59:29,220 Now, what's cool is one day I was messing around 1361 00:59:29,220 --> 00:59:29,940 with this code. 1362 00:59:29,940 --> 00:59:33,750 I'm like, I wonder what happens if I set alpha to 0, right? 1363 00:59:33,750 --> 00:59:37,180 So your cost is purely a function of uncertainty. 1364 00:59:37,180 --> 00:59:39,900 It turns out what the quadrotor does is it just 1365 00:59:39,900 --> 00:59:41,990 hangs out where it starts. 1366 00:59:41,990 --> 00:59:44,400 It says, I'm in no hurry. 1367 00:59:44,400 --> 00:59:45,285 I know where I am. 
1368 00:59:45,285 --> 00:59:46,866 I'm just going to stay here. 1369 00:59:49,380 --> 00:59:51,212 But I find this amazing, right? 1370 00:59:51,212 --> 00:59:52,920 Because this almost models behavior even. 1371 00:59:52,920 --> 00:59:54,660 You could start saying, do I want 1372 00:59:54,660 --> 00:59:56,990 to be a risky quadrotor or a safe one? 1373 00:59:56,990 --> 00:59:59,650 Like, how important is it for me to get somewhere on time 1374 00:59:59,650 --> 01:00:00,765 or be safe? 1375 01:00:00,765 --> 01:00:02,940 And it's just those two parameters. 1376 01:00:02,940 --> 01:00:07,250 Or you could even make them sum to 1, like alpha and 1 minus alpha. 1377 01:00:07,250 --> 01:00:10,170 I think this stuff is really cool. 1378 01:00:10,170 --> 01:00:12,780 The one detail that I'm really going to go into for FIRM 1379 01:00:12,780 --> 01:00:15,900 is the cost equation, which is based on the Bellman backup 1380 01:00:15,900 --> 01:00:17,700 equation that we had. 1381 01:00:17,700 --> 01:00:21,560 The cost to go, right, the expected cost from a belief 1382 01:00:21,560 --> 01:00:25,355 state [INAUDIBLE] is-- 1383 01:00:25,355 --> 01:00:27,146 well, you're going to take the best action. 1384 01:00:27,146 --> 01:00:29,354 So you're going to take the min of this whole thing. 1385 01:00:29,354 --> 01:00:33,400 You're going to say, it's the cost of executing 1386 01:00:33,400 --> 01:00:38,570 a specific action plus the cost of colliding with something, 1387 01:00:38,570 --> 01:00:40,750 an obstacle, times the probability of colliding, 1388 01:00:40,750 --> 01:00:41,250 right? 1389 01:00:41,250 --> 01:00:43,570 That's this term. 1390 01:00:43,570 --> 01:00:47,240 And then you've got to say, well, OK, and then 1391 01:00:47,240 --> 01:00:49,642 once I reach a state, what's the cost from there? 1392 01:00:49,642 --> 01:00:51,600 And then I can weight that by the probability 1393 01:00:51,600 --> 01:00:53,970 of ending up in that state. 
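[Editor's note: both pieces just described, the alpha/beta edge cost and the Bellman-style cost-to-go backup, fit in a short sketch. Every number, edge name, and action here is invented; this mirrors only the structure of the equations on the slide, not the FIRM implementation.]

```python
# First, the edge cost: cost(edge) = alpha * expected_time + beta * uncertainty.
edges = {
    "short_but_risky": {"time": 5.0, "uncertainty": 9.0},
    "long_but_safe":   {"time": 12.0, "uncertainty": 1.0},
}

def best_edge(alpha, beta):
    return min(edges, key=lambda e: alpha * edges[e]["time"]
                                    + beta * edges[e]["uncertainty"])

print(best_edge(alpha=1.0, beta=0.0))   # beta = 0: pure shortest path
print(best_edge(alpha=0.0, beta=1.0))   # alpha = 0: minimize uncertainty only

# Second, one Bellman backup of the cost-to-go:
#   J(b) = min_a [ C(b,a) + c_collide * P(collide | b,a)
#                  + sum_m P(b_m | b,a) * J(b_m) ]
C_COLLIDE = 100.0
J = {"goal": 0.0, "mid": 4.0}           # current cost-to-go estimates

actions = {
    "risky": {"cost": 2.0, "p_collide": 0.10,
              "outcomes": {"goal": 0.9, "mid": 0.1}},
    "safe":  {"cost": 5.0, "p_collide": 0.01,
              "outcomes": {"goal": 0.8, "mid": 0.2}},
}

def backup(actions, J):
    return min(a["cost"] + C_COLLIDE * a["p_collide"]
               + sum(p * J[b] for b, p in a["outcomes"].items())
               for a in actions.values())

print(backup(actions, J))               # here the "safe" action wins: 6.8
```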
1394 01:00:53,970 --> 01:00:58,580 Does this equation make sense to people? 1395 01:00:58,580 --> 01:01:01,390 There are a lot of symbols, and honestly, I hate notation, 1396 01:01:01,390 --> 01:01:03,050 but it works. 1397 01:01:03,050 --> 01:01:04,660 You just plug in-- the cost is I'm 1398 01:01:04,660 --> 01:01:05,910 going to take the best action. 1399 01:01:05,910 --> 01:01:07,326 It's going to be the cost of using 1400 01:01:07,326 --> 01:01:09,390 that action, the cost of colliding times 1401 01:01:09,390 --> 01:01:12,540 the probability of colliding given that I used that action 1402 01:01:12,540 --> 01:01:14,720 and I started where I started, and then 1403 01:01:14,720 --> 01:01:17,980 the cost from where I end up. 1404 01:01:17,980 --> 01:01:20,570 And since you end up in a probability distribution, 1405 01:01:20,570 --> 01:01:23,300 we need to consider all these cases. 1406 01:01:23,300 --> 01:01:24,580 Yeah? 1407 01:01:24,580 --> 01:01:26,904 AUDIENCE: When you define like an action from one place 1408 01:01:26,904 --> 01:01:28,445 to another, do you always think of it 1409 01:01:28,445 --> 01:01:30,070 as starting from the mean that you sampled 1410 01:01:30,070 --> 01:01:33,280 or from anywhere [INAUDIBLE]? 1411 01:01:33,280 --> 01:01:35,520 PROFESSOR 4: So it's from the-- 1412 01:01:35,520 --> 01:01:38,510 so formally, it was from the belief state, 1413 01:01:38,510 --> 01:01:42,700 which is the mean, yeah, plus the variance once it stabilized 1414 01:01:42,700 --> 01:01:44,682 to that point. 1415 01:01:44,682 --> 01:01:47,140 And the way that variance is generated, I should have said, 1416 01:01:47,140 --> 01:01:50,000 is, you're going to have models of these quadrotors. 1417 01:01:50,000 --> 01:01:55,695 And so I spent a good amount of time in the ACL with Jonathan How, 1418 01:01:55,695 --> 01:01:56,585 in Course 16. 1419 01:01:56,585 --> 01:01:58,920 And you just let the quadrotor hover. 
1420 01:01:58,920 --> 01:02:01,570 You measure its position for a long time. 1421 01:02:01,570 --> 01:02:06,848 And you get a distribution over where it goes. 1422 01:02:06,848 --> 01:02:08,849 AUDIENCE: So what does the letter M stand for? 1423 01:02:08,849 --> 01:02:10,265 PROFESSOR 4: What is the letter M? 1424 01:02:10,265 --> 01:02:10,640 AUDIENCE: Yeah. 1425 01:02:10,640 --> 01:02:11,431 PROFESSOR 4: Right. 1426 01:02:11,431 --> 01:02:14,260 So you're summing over-- these are the belief states that you 1427 01:02:14,260 --> 01:02:16,390 could end up in, right? 1428 01:02:16,390 --> 01:02:18,190 So if everything were deterministic, 1429 01:02:18,190 --> 01:02:20,280 you would just ignore the sum. 1430 01:02:20,280 --> 01:02:21,950 It's where you end up. 1431 01:02:21,950 --> 01:02:25,386 Realistically, you could end up in some other state 1432 01:02:25,386 --> 01:02:27,497 we haven't considered. 1433 01:02:27,497 --> 01:02:28,955 AUDIENCE: So just a quick question. 1434 01:02:28,955 --> 01:02:32,090 So those, if we're operating on a Gaussian space, 1435 01:02:32,090 --> 01:02:33,590 then you have Gaussian observations. 1436 01:02:33,590 --> 01:02:36,980 So those are sums over the observation samples that 1437 01:02:36,980 --> 01:02:39,030 are generated when [INAUDIBLE]. 1438 01:02:39,030 --> 01:02:43,350 So that is a finite sum over the possibly infinite observation 1439 01:02:43,350 --> 01:02:44,270 states we might have. 1440 01:02:44,270 --> 01:02:45,060 PROFESSOR 4: Yeah. 1441 01:02:45,060 --> 01:02:47,600 So there's definitely [INAUDIBLE] 1442 01:02:47,600 --> 01:02:50,720 in terms of where you could end up. 1443 01:02:50,720 --> 01:02:53,690 And even the action space, there's 1444 01:02:53,690 --> 01:02:56,440 a set of feedback controllers that you're allowed. 
1445 01:02:56,440 --> 01:03:00,224 I think the observation, if you modeled it as a Gaussian, 1446 01:03:00,224 --> 01:03:01,640 if you made some nice assumptions, 1447 01:03:01,640 --> 01:03:03,709 it can be tractable as continuous. 1448 01:03:03,709 --> 01:03:04,500 AUDIENCE: Oh, yeah. 1449 01:03:04,500 --> 01:03:05,249 I was [INAUDIBLE]. 1450 01:03:05,249 --> 01:03:08,410 So for the start, so if you start with Gaussian noise, 1451 01:03:08,410 --> 01:03:10,460 then you have a linear model. 1452 01:03:10,460 --> 01:03:12,910 Then your prediction's going to be Gaussian. 1453 01:03:12,910 --> 01:03:13,747 PROFESSOR 4: Yeah. 1454 01:03:13,747 --> 01:03:15,830 AUDIENCE: But then you have Gaussian observations, 1455 01:03:15,830 --> 01:03:18,560 which is great because then your update is Gaussian. 1456 01:03:18,560 --> 01:03:21,140 But the observation space is infinite. 1457 01:03:21,140 --> 01:03:24,440 So basically not only do you have to sample positions, 1458 01:03:24,440 --> 01:03:26,655 but you also have to sample potential observations 1459 01:03:26,655 --> 01:03:28,370 that you might get as you go along. 1460 01:03:28,900 --> 01:03:29,650 PROFESSOR 4: Yeah. 1461 01:03:29,650 --> 01:03:31,650 AUDIENCE: So that you can basically-- 1462 01:03:31,650 --> 01:03:32,410 and it's great. 1463 01:03:32,410 --> 01:03:35,990 But you end up basically reducing a possibly infinite 1464 01:03:35,990 --> 01:03:39,330 branching, which is your Gaussian, to kind of 1465 01:03:39,330 --> 01:03:41,289 like a Monte Carlo tree search. 1466 01:03:41,289 --> 01:03:43,330 You generate a whole bunch of meaningful samples, 1467 01:03:43,330 --> 01:03:45,480 and those are the ones that you consider. 1468 01:03:45,480 --> 01:03:47,416 So is that where the sum is coming from? 1469 01:03:47,416 --> 01:03:48,624 PROFESSOR 4: Yeah, basically. 
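[Editor's note: the exchange above, that a linear model plus Gaussian noise keeps both the prediction and the measurement update Gaussian, is exactly a Kalman filter step. A 1-D sketch, with all noise values made up for illustration:]

```python
# 1-D Kalman filter step: the belief is just a (mean, variance) pair.
a, q, r = 1.0, 0.04, 0.09     # dynamics gain, process var, measurement var
mean, var = 0.0, 0.01

# Prediction: uncertainty grows by the process noise.
mean, var = a * mean, a * var * a + q

# Update with a measurement z: uncertainty shrinks, mean moves toward z.
z = 0.3
k = var / (var + r)           # Kalman gain
mean = mean + k * (z - mean)
var = (1 - k) * var

print(mean, var)
```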
1470 01:03:48,624 --> 01:03:52,686 And a fun fact, the theorem that I read and trusted 1471 01:03:52,686 --> 01:03:55,515 was that if you just sample randomly 1472 01:03:55,515 --> 01:03:57,450 right from your configuration space, 1473 01:03:57,450 --> 01:04:00,130 you have zero probability of constructing 1474 01:04:00,130 --> 01:04:03,340 a graph that, without any assumptions, will be connected. 1475 01:04:03,340 --> 01:04:05,350 Whereas for PRMs, you sample enough 1476 01:04:05,350 --> 01:04:08,461 and things will turn out nicely; that's not the case in the belief 1477 01:04:08,461 --> 01:04:08,960 space. 1478 01:04:08,960 --> 01:04:11,728 That's why we need to make these assumptions. 1479 01:04:11,728 --> 01:04:13,120 We're killing it. 1480 01:04:13,120 --> 01:04:14,697 We know POMDPs. 1481 01:04:14,697 --> 01:04:17,030 Now we get to look at some really fun graphics generated 1482 01:04:17,030 --> 01:04:20,622 from real flights using FIRM. 1483 01:04:24,090 --> 01:04:28,204 The big takeaway is that FIRM prefers safer paths. 1484 01:04:28,204 --> 01:04:30,120 We've got two images that look really similar. 1485 01:04:30,120 --> 01:04:32,080 We're going to talk about one and then show 1486 01:04:32,080 --> 01:04:34,560 why they're slightly different. 1487 01:04:34,560 --> 01:04:38,289 The test flight that we put this quadrotor under was we said, 1488 01:04:38,289 --> 01:04:40,080 we're going to start at this configuration. 1489 01:04:40,080 --> 01:04:42,980 We're going to go to this configuration. 1490 01:04:42,980 --> 01:04:45,870 And there's this big, blue obstacle. 1491 01:04:45,870 --> 01:04:48,695 And there are landmarks that you can take measurements off of. 1492 01:04:48,695 --> 01:04:51,060 So those are these red dots. 1493 01:04:51,060 --> 01:04:54,830 We want to compare two planners, right: a PRM and a FIRM 1494 01:04:54,830 --> 01:04:56,270 planner. 1495 01:04:56,270 --> 01:04:59,180 The PRM planner said, right, I just want to minimize time. 
1496 01:04:59,180 --> 01:05:01,710 And so the first path it found is to stay 1497 01:05:01,710 --> 01:05:03,390 to the left of the obstacle. 1498 01:05:03,390 --> 01:05:06,592 But that's actually a really narrow path 1499 01:05:06,592 --> 01:05:08,250 between the obstacle and the wall. 1500 01:05:08,250 --> 01:05:09,910 On the other hand, the FIRM planner 1501 01:05:09,910 --> 01:05:13,890 said, right, I want to minimize the linear combination of time 1502 01:05:13,890 --> 01:05:15,590 and uncertainty. 1503 01:05:15,590 --> 01:05:17,370 And so if you ramp up the weight 1504 01:05:17,370 --> 01:05:20,460 for uncertainty, at some point the path sort of 1505 01:05:20,460 --> 01:05:22,060 pops over the obstacle. 1506 01:05:22,060 --> 01:05:23,165 It's really cool. 1507 01:05:23,165 --> 01:05:24,125 It snaps. 1508 01:05:24,125 --> 01:05:25,500 And then you get this safer path. 1509 01:05:25,500 --> 01:05:27,252 But that's longer. 1510 01:05:27,252 --> 01:05:28,996 And how do we know it's safer? 1511 01:05:28,996 --> 01:05:31,890 Because these ellipsoids that we've drawn 1512 01:05:31,890 --> 01:05:35,730 are the 3D versions of what we saw for PRM, elevated 1513 01:05:35,730 --> 01:05:37,280 to the belief space version. 1514 01:05:37,280 --> 01:05:39,840 It's the uncertainty that the quadrotor 1515 01:05:39,840 --> 01:05:42,350 has over its true state. 1516 01:05:42,350 --> 01:05:44,250 Why is the PRM plan-- 1517 01:05:44,250 --> 01:05:45,890 why does it have such big ellipsoids? 1518 01:05:45,890 --> 01:05:48,090 It's because when it's behind the obstacle, 1519 01:05:48,090 --> 01:05:50,050 it can't make any of these measurements 1520 01:05:50,050 --> 01:05:51,990 because it can't see the landmarks. 1521 01:05:51,990 --> 01:05:55,490 So it's basically using dead reckoning. 1522 01:05:55,490 --> 01:05:59,580 And the transition model, right, is not deterministic. 1523 01:05:59,580 --> 01:06:01,045 So its uncertainty grows. 
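[Editor's note: the dead-reckoning effect described here, where no visible landmarks means no measurement updates so the covariance only grows, can be sketched in a few lines. The process noise value is invented.]

```python
import numpy as np

# Pure dead reckoning behind the obstacle: predict only, never update.
Q = 0.05 * np.eye(2)          # per-step process noise (made up)
cov = 0.01 * np.eye(2)        # initial position covariance

traces = []
for step in range(10):        # 10 steps with no landmark in view
    cov = cov + Q             # prediction step only
    traces.append(np.trace(cov))

print(traces[0], traces[-1])  # the uncertainty grows monotonically
```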
1524 01:06:01,045 --> 01:06:04,030 And so we can see these ellipsoids are bigger for PRM 1525 01:06:04,030 --> 01:06:04,909 than FIRM. 1526 01:06:04,909 --> 01:06:06,450 And then the reason I have two images 1527 01:06:06,450 --> 01:06:08,870 is they are slightly different. 1528 01:06:08,870 --> 01:06:12,220 It's that these landmarks were fictitious landmarks 1529 01:06:12,220 --> 01:06:13,585 that we just made up, 1530 01:06:13,585 --> 01:06:16,440 and we would generate fake measurements off of them. 1531 01:06:16,440 --> 01:06:20,277 And so we could tune the noise of the landmarks. 1532 01:06:20,277 --> 01:06:21,860 Maybe a little bit of cheating, but it 1533 01:06:21,860 --> 01:06:25,350 allowed us to increase the noise from-- 1534 01:06:25,350 --> 01:06:29,790 I think the number was 0.05 to 0.15. 1535 01:06:29,790 --> 01:06:32,554 You can see the uncertainty ellipsoids from PRM 1536 01:06:32,554 --> 01:06:36,290 grow when we increased the noise, whereas for FIRM, they 1537 01:06:36,290 --> 01:06:38,110 stay about the same. 1538 01:06:38,110 --> 01:06:39,750 And importantly, these ellipsoids 1539 01:06:39,750 --> 01:06:43,140 grow enough that they start overlapping with the obstacles. 1540 01:06:43,140 --> 01:06:46,995 That represents a very high probability of collision. 1541 01:06:46,995 --> 01:06:48,495 Whereas for FIRM, the way it managed 1542 01:06:48,495 --> 01:06:52,335 to keep these ellipsoids so small is we 1543 01:06:52,335 --> 01:06:55,800 kept the cost function the same, right-- the same weight 1544 01:06:55,800 --> 01:06:58,000 on uncertainty, same weight on the time, right? 1545 01:06:58,000 --> 01:07:00,400 But the uncertainty was so much higher. 1546 01:07:00,400 --> 01:07:04,195 And so it decided, well, I can sacrifice a little bit of time. 1547 01:07:04,195 --> 01:07:05,320 I can take the slower path. 
1548 01:07:05,320 --> 01:07:07,525 I can just hang out by the landmark, really 1549 01:07:07,525 --> 01:07:10,920 make sure I know where I am before continuing. 1550 01:07:10,920 --> 01:07:13,300 And so the path-- the duration of the flight 1551 01:07:13,300 --> 01:07:15,354 took a lot longer for FIRM, as time 1552 01:07:15,354 --> 01:07:17,020 would increase but the uncertainty would 1553 01:07:17,020 --> 01:07:19,304 stay about constant. 1554 01:07:19,304 --> 01:07:23,168 Do people understand these graphics? 1555 01:07:23,168 --> 01:07:24,500 Great. 1556 01:07:24,500 --> 01:07:25,640 All right. 1557 01:07:25,640 --> 01:07:28,550 We've got a graph that just represents 1558 01:07:28,550 --> 01:07:30,820 a little more formally that growing uncertainty. 1559 01:07:30,820 --> 01:07:35,640 As noise increases, the variance along a single dimension-- z, 1560 01:07:35,640 --> 01:07:41,810 y, and x-- for PRM, which is that, and FIRM. 1561 01:07:41,810 --> 01:07:44,775 The variance for PRM is always higher, and it grows. 1562 01:07:44,775 --> 01:07:48,630 For FIRM, it's lower, and it stays about constant. 1563 01:07:48,630 --> 01:07:51,360 That's the big takeaway. 1564 01:07:51,360 --> 01:07:54,440 FIRM minimizes this uncertainty. 1565 01:07:54,440 --> 01:07:58,320 And then the final image from these results that I 1566 01:07:58,320 --> 01:08:00,510 want to show is in simulation. 1567 01:08:00,510 --> 01:08:03,200 They said, well, let's actually measure how often it crashes. 1568 01:08:03,200 --> 01:08:05,070 We didn't want to do this in the real world 1569 01:08:05,070 --> 01:08:08,790 because we don't want to crash the quadrotor that many times. 1570 01:08:08,790 --> 01:08:11,440 The gist of it is comparing a reactive planner 1571 01:08:11,440 --> 01:08:14,670 and a deterministic one [INAUDIBLE]. 1572 01:08:14,670 --> 01:08:18,920 As noise increases-- noise was simulated with wind strength-- 1573 01:08:18,920 --> 01:08:21,740 the number of crashes increases for PRM, basically. 
1574 01:08:21,740 --> 01:08:25,410 And for FIRM, it stays constant and low. 1575 01:08:25,410 --> 01:08:27,540 The reason there are two lines is 1576 01:08:27,540 --> 01:08:30,500 there were two planners with different time horizons. 1577 01:08:30,500 --> 01:08:33,717 The important thing is FIRM is low and constant. 1578 01:08:33,717 --> 01:08:34,711 PRM grows. 1579 01:08:40,680 --> 01:08:44,889 We've now talked through all the probabilistic planning you'll 1580 01:08:44,889 --> 01:08:46,023 ever need to know, right? 1581 01:08:46,023 --> 01:08:47,210 No, not quite. 1582 01:08:47,210 --> 01:08:49,820 But we have covered a lot of stuff. 1583 01:08:49,820 --> 01:08:51,170 What are the big takeaways? 1584 01:08:51,170 --> 01:08:54,660 We've learned that real-world problems are stochastic, right? 1585 01:08:54,660 --> 01:08:57,396 Quadrotors are not these perfect machines 1586 01:08:57,396 --> 01:08:58,819 that we wish they were. 1587 01:08:58,819 --> 01:09:02,000 So it's important to model them as stochastic. 1588 01:09:02,000 --> 01:09:04,430 The problem is once you start modeling them as stochastic, 1589 01:09:04,430 --> 01:09:07,430 it becomes a lot harder to solve. 1590 01:09:07,430 --> 01:09:09,050 But if you make some assumptions, 1591 01:09:09,050 --> 01:09:11,069 or even if you don't, if you just get smart 1592 01:09:11,069 --> 01:09:15,199 and you use heuristics, you can resolve this complexity. 1593 01:09:15,199 --> 01:09:17,324 And so I hope you remember these three points, 1594 01:09:17,324 --> 01:09:20,474 you remember this graphic or that graphic, the idea 1595 01:09:20,474 --> 01:09:22,399 that if you take uncertainty into account, 1596 01:09:22,399 --> 01:09:25,684 you get fundamentally different paths [INAUDIBLE]. 1597 01:09:25,684 --> 01:09:28,770 And that can be a good thing. 1598 01:09:28,770 --> 01:09:30,400 What questions do you have, anything? 1599 01:09:30,400 --> 01:09:30,899 Yeah? 
1600 01:09:30,899 --> 01:09:33,700 AUDIENCE: [INAUDIBLE] so far people 1601 01:09:33,700 --> 01:09:36,279 have [INAUDIBLE] problems. 1602 01:09:36,279 --> 01:09:39,479 So how do the same, you know, [INAUDIBLE]? 1603 01:09:39,479 --> 01:09:43,029 So for instance, we suggest that this is the maximum risk 1604 01:09:43,029 --> 01:09:44,040 we want to take. 1605 01:09:44,040 --> 01:09:46,780 Is it possible to integrate chance constraints 1606 01:09:46,780 --> 01:09:49,029 into your optimization problem and solve it? 1607 01:09:49,029 --> 01:09:51,279 Or [INAUDIBLE]? 1608 01:09:51,279 --> 01:09:56,800 PROFESSOR 4: So I imagine one thing that I would like to test 1609 01:09:56,800 --> 01:10:05,165 is if we go back to this really crude version of our cost 1610 01:10:05,165 --> 01:10:07,380 equation, we can imagine saying that if we 1611 01:10:07,380 --> 01:10:10,590 want to come up with a bound for uncertainty 1612 01:10:10,590 --> 01:10:12,380 that we can tolerate, you could maybe 1613 01:10:12,380 --> 01:10:15,135 set an intercept for this to be that bound 1614 01:10:15,135 --> 01:10:17,650 and then just ramp this up. 1615 01:10:17,650 --> 01:10:21,305 AUDIENCE: Oh, [INAUDIBLE] multiplier [INAUDIBLE]. 1616 01:10:21,305 --> 01:10:22,680 PROFESSOR 4: Something like that. 1617 01:10:22,680 --> 01:10:23,080 AUDIENCE: [INAUDIBLE]. 1618 01:10:23,080 --> 01:10:24,288 GUEST SPEAKER: Yeah, exactly. 1619 01:10:24,288 --> 01:10:26,530 So it only comes into effect. 1620 01:10:26,530 --> 01:10:29,170 Possibly what I said is a terrible idea. 1621 01:10:29,170 --> 01:10:30,807 But it's-- especially in simulation, 1622 01:10:30,807 --> 01:10:32,640 you can just try it out and see if it works. 1623 01:10:32,640 --> 01:10:35,830 AUDIENCE: And just another question, 1624 01:10:35,830 --> 01:10:39,310 you said propagating probability distributions on the networks 1625 01:10:39,310 --> 01:10:41,040 is hard. 
1626 01:10:41,040 --> 01:10:41,810 [INAUDIBLE] 1627 01:10:45,590 --> 01:10:49,435 So therefore we can assume an illustrated area, and then 1628 01:10:49,435 --> 01:10:52,252 [INAUDIBLE] normal distribution, they 1629 01:10:52,252 --> 01:10:53,770 are conjugate to each other. 1630 01:10:53,770 --> 01:10:56,500 Therefore you can [INAUDIBLE] observation 1631 01:10:56,500 --> 01:11:00,860 and then update [INAUDIBLE] and form a distribution, right? 1632 01:11:00,860 --> 01:11:01,610 PROFESSOR 4: Sure. 1633 01:11:01,610 --> 01:11:08,080 So I think that might be making some assumptions that we might 1634 01:11:08,080 --> 01:11:10,580 not be willing to make about the nature of the distributions 1635 01:11:10,580 --> 01:11:12,290 that you're trying to propagate. 1636 01:11:12,290 --> 01:11:15,690 I need to think about this some more. 1637 01:11:15,690 --> 01:11:18,920 But I think intuitively, we can understand that any time you're 1638 01:11:18,920 --> 01:11:22,580 propagating a distribution versus a single discrete value, 1639 01:11:22,580 --> 01:11:25,040 it's definitely not going to be easier. 1640 01:11:25,040 --> 01:11:31,120 And so as your distributions become more complex-- 1641 01:11:31,120 --> 01:11:35,280 perhaps if you're modeling a real-world stochastic sensor-- 1642 01:11:35,280 --> 01:11:38,120 you might not be able to perform these efficient updates using 1643 01:11:38,120 --> 01:11:39,842 conjugate priors. 1644 01:11:42,520 --> 01:11:44,140 Any other questions? 1645 01:11:44,140 --> 01:11:45,480 Otherwise we-- yeah? 1646 01:11:45,480 --> 01:11:47,646 AUDIENCE: So all these FIRM examples you're showing, 1647 01:11:47,646 --> 01:11:49,690 this is all planning done offline. 1648 01:11:49,690 --> 01:11:51,932 [INAUDIBLE] offline, right? 1649 01:11:51,932 --> 01:11:55,359 Have you expanded this at all to like an online case? 1650 01:11:55,359 --> 01:11:56,900 Or does that all require re-planning? 1651 01:11:56,900 --> 01:11:58,350 And how long does it take? 
1652 01:11:58,350 --> 01:11:59,141 PROFESSOR 4: Right. 1653 01:11:59,141 --> 01:12:02,530 So it's not good, I can tell you that much. 1654 01:12:02,530 --> 01:12:05,360 What's nice is FIRM generates a policy. 1655 01:12:05,360 --> 01:12:08,600 So if you construct your PRM to say, 1656 01:12:08,600 --> 01:12:10,340 don't just find the shortest path, 1657 01:12:10,340 --> 01:12:12,910 keep sampling points until you're 1658 01:12:12,910 --> 01:12:14,445 confident that you're always going 1659 01:12:14,445 --> 01:12:16,276 to end up near some point, 1660 01:12:16,276 --> 01:12:18,186 you can construct a policy and then 1661 01:12:18,186 --> 01:12:21,380 just, online, look up which belief 1662 01:12:21,380 --> 01:12:22,600 matches most closely. 1663 01:12:25,110 --> 01:12:31,016 It took for-- I think we sampled 600 nodes in like-- 1664 01:12:31,016 --> 01:12:33,560 what are the dimensions on this? 1665 01:12:33,560 --> 01:12:38,880 So 2 to 3 meter by 8 by 3 meter thing-- 1666 01:12:38,880 --> 01:12:40,610 it took about 15 minutes. 1667 01:12:40,610 --> 01:12:41,670 It was really slow. 1668 01:12:41,670 --> 01:12:43,850 Now, granted, it was on a virtual machine. 1669 01:12:43,850 --> 01:12:47,600 But it's not something we want to re-plan on the fly. 1670 01:12:47,600 --> 01:12:50,710 With PRM, it's typically a lot faster. 1671 01:12:50,710 --> 01:12:52,730 AUDIENCE: How much faster in that case? 1672 01:12:52,730 --> 01:12:54,930 PROFESSOR 4: So when we just did PRM, 1673 01:12:54,930 --> 01:12:58,735 I think it was one minute or something, at least 1674 01:12:58,735 --> 01:12:59,750 an order of magnitude. 1675 01:12:59,750 --> 01:13:04,390 And you could use other planners as well, like RRTs. 1676 01:13:04,390 --> 01:13:07,270 And there are RRT versions of-- 1677 01:13:07,270 --> 01:13:09,596 there are POMDP versions of RRTs. 1678 01:13:09,596 --> 01:13:11,825 But [INAUDIBLE]. 
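[Editor's note: the online lookup just described, where FIRM precomputes an action for each sampled belief node and at run time you execute the action of the nearest node, can be sketched as follows. The nodes, beliefs, and action names are all invented for illustration.]

```python
import numpy as np

# Precomputed (offline) policy: one action per sampled belief node.
nodes = {
    "n0": np.array([0.9, 0.1]),
    "n1": np.array([0.2, 0.8]),
}
policy = {"n0": "go_left", "n1": "go_right"}

def act(belief):
    # Online step: find the node whose belief is closest, execute its action.
    nearest = min(nodes, key=lambda n: np.linalg.norm(nodes[n] - belief))
    return policy[nearest]

print(act(np.array([0.7, 0.3])))   # closest to n0, so "go_left"
```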
1679 01:13:11,825 --> 01:13:13,950 All right, I'm going to turn it over to Steve so we 1680 01:13:13,950 --> 01:13:16,387 can learn about this project. 1681 01:13:16,387 --> 01:13:18,383 [APPLAUSE] 1682 01:13:53,860 --> 01:13:56,170 STEVE: So I'm just going to say a few good words 1683 01:13:56,170 --> 01:13:57,280 about the Grand Challenge. 1684 01:13:57,280 --> 01:13:59,275 So again, the final assignment will 1685 01:13:59,275 --> 01:14:01,730 be released tonight for this. 1686 01:14:01,730 --> 01:14:03,449 And as Professor Williams mentioned, 1687 01:14:03,449 --> 01:14:05,740 it's going to be descoped a bit from what we originally 1688 01:14:05,740 --> 01:14:07,880 had in the syllabus due to time constraints. 1689 01:14:07,880 --> 01:14:10,880 But here's a preview of what you'll be doing. 1690 01:14:10,880 --> 01:14:14,920 So for an overview, it will be a class-wide collaboration. 1691 01:14:14,920 --> 01:14:16,970 So what we're going to do, you guys 1692 01:14:16,970 --> 01:14:19,160 are going to stay in your advanced lecture teams 1693 01:14:19,160 --> 01:14:20,990 and each basically apply what you've 1694 01:14:20,990 --> 01:14:23,190 done, the great work you've done on that, 1695 01:14:23,190 --> 01:14:26,740 onto our hardware robot that we have. So it's not 1696 01:14:26,740 --> 01:14:28,414 a competition where each team will be 1697 01:14:28,414 --> 01:14:31,184 competing against each other. 1698 01:14:31,184 --> 01:14:33,970 And so again, we're descoping it. 1699 01:14:33,970 --> 01:14:36,220 The syllabus originally said it was 20% of your grade. 1700 01:14:36,220 --> 01:14:39,950 It's probably going to be more like 10% or 15% in the end. 1701 01:14:39,950 --> 01:14:42,720 So this is the robot you guys will be using. 1702 01:14:42,720 --> 01:14:45,029 It's called an [INAUDIBLE] robot. 1703 01:14:45,029 --> 01:14:46,820 Usually it has sort of a robotic arm on it. 
1704 01:14:46,820 --> 01:14:48,611 But we're actually not going to be using it 1705 01:14:48,611 --> 01:14:50,848 for this [INAUDIBLE]. 1706 01:14:50,848 --> 01:14:53,310 This base is made by a company from Spain, Orbonix. 1707 01:14:53,310 --> 01:14:55,330 It's a pretty cool robot. 1708 01:14:55,330 --> 01:14:58,870 One cool thing about it is that it's actually omnidirectional. 1709 01:14:58,870 --> 01:15:01,010 So there are wheels on the wheels here. 1710 01:15:01,010 --> 01:15:03,186 And what that means is that it can drive sideways, 1711 01:15:03,186 --> 01:15:04,690 left to right. 1712 01:15:04,690 --> 01:15:07,280 It would make parallel parking your car very easy. 1713 01:15:07,280 --> 01:15:09,086 [LAUGHTER] 1714 01:15:09,086 --> 01:15:11,210 We're probably not going to be driving it this fast 1715 01:15:11,210 --> 01:15:13,620 because it's actually super heavy. 1716 01:15:13,620 --> 01:15:15,410 And we don't want anyone to-- yeah, 1717 01:15:15,410 --> 01:15:17,550 we forgot to screw in that [INAUDIBLE]. 1718 01:15:17,550 --> 01:15:18,420 [LAUGHTER] 1719 01:15:18,420 --> 01:15:20,740 It will be screwed in during the competition. 1720 01:15:20,740 --> 01:15:22,660 But it's a pretty fun robot. 1721 01:15:22,660 --> 01:15:26,822 So you guys will be working on this. 1722 01:15:26,822 --> 01:15:30,420 And so the actual challenge itself, 1723 01:15:30,420 --> 01:15:32,920 as we've mentioned many times throughout this class, 1724 01:15:32,920 --> 01:15:35,797 will be a modified orienteering challenge. 1725 01:15:35,797 --> 01:15:38,172 So there's going to be a few different challenge stations 1726 01:15:38,172 --> 01:15:41,150 that you have to drive to. 1727 01:15:41,150 --> 01:15:44,614 And at the stations, there'll be small computational challenges. 1728 01:15:44,614 --> 01:15:46,030 And those computational challenges 1729 01:15:46,030 --> 01:15:51,530 will come from a subset of the advanced lecture teams, not all of them. 
1730 01:15:51,530 --> 01:15:54,695 And the goal is to complete as many of those as you can 1731 01:15:54,695 --> 01:15:58,030 and try to do that as quickly as possible. 1732 01:15:58,030 --> 01:16:00,370 And it's also going to be held indoors in our lab space, 1733 01:16:00,370 --> 01:16:01,430 where we just showed you. 1734 01:16:01,430 --> 01:16:02,971 So that way if it rains, we can still 1735 01:16:02,971 --> 01:16:05,680 have the Grand Challenge at the end of the semester. 1736 01:16:05,680 --> 01:16:12,310 So this is sort of how it's set up as of last night, actually. 1737 01:16:12,310 --> 01:16:13,966 So basically the robot will be able 1738 01:16:13,966 --> 01:16:15,930 to drive around in a small little LEGO maze 1739 01:16:15,930 --> 01:16:17,456 that we set up. 1740 01:16:17,456 --> 01:16:19,456 And there are going to be sort of different things 1741 01:16:19,456 --> 01:16:22,710 that you have to do in different places. 1742 01:16:22,710 --> 01:16:24,922 So what are you actually going to be doing? 1743 01:16:24,922 --> 01:16:26,880 What assignment are you guys going to be doing? 1744 01:16:26,880 --> 01:16:28,580 Well, it's actually a little flexible. 1745 01:16:28,580 --> 01:16:30,470 Since each of you did different things 1746 01:16:30,470 --> 01:16:33,380 for your advanced lectures, each team 1747 01:16:33,380 --> 01:16:35,842 is going to have a bit of a different assignment applied 1748 01:16:35,842 --> 01:16:36,575 to this. 1749 01:16:36,575 --> 01:16:38,033 I have some proposed ideas that I'm 1750 01:16:38,033 --> 01:16:40,740 going to talk about in the next slide here. 1751 01:16:40,740 --> 01:16:43,250 But the big thing is that these are just ideas. 1752 01:16:43,250 --> 01:16:46,092 You guys have a lot of flexibility in this. 1753 01:16:46,092 --> 01:16:48,050 You'll probably be working with the [INAUDIBLE] 1754 01:16:48,050 --> 01:16:51,490 a lot to have access to the robot. 
1755 01:16:51,490 --> 01:16:53,683 We can arrange extra office hours for you guys 1756 01:16:53,683 --> 01:16:57,034 to come use the hardware to test things. 1757 01:16:57,034 --> 01:16:58,700 So we'll be arranging all of that 1758 01:16:58,700 --> 01:17:00,540 sort of on an as-needed basis. 1759 01:17:00,540 --> 01:17:02,000 If you want to-- 1760 01:17:02,000 --> 01:17:06,370 basically it'll sort of depend on your team. 1761 01:17:06,370 --> 01:17:08,870 AUDIENCE: And the people who'll be helping out are you, us-- 1762 01:17:08,870 --> 01:17:11,490 STEVE: Yes, me and Tiago, who gave a lecture 1763 01:17:11,490 --> 01:17:13,477 earlier in the semester, and possibly also 1764 01:17:13,477 --> 01:17:14,810 a few other people from our lab. 1765 01:17:14,810 --> 01:17:18,680 But we'll be the main contact points for it. 1766 01:17:18,680 --> 01:17:20,520 So of course it should go without saying, 1767 01:17:20,520 --> 01:17:23,120 but all the team members should contribute equally 1768 01:17:23,120 --> 01:17:24,360 within your team. 1769 01:17:24,360 --> 01:17:27,253 So it'll maybe be less structured than the advanced lecture. 1770 01:17:27,253 --> 01:17:29,544 But just make sure that everyone's contributing equally 1771 01:17:29,544 --> 01:17:32,310 in the assignment. 1772 01:17:32,310 --> 01:17:35,480 And it's going to involve using this thing called the Robot 1773 01:17:35,480 --> 01:17:39,260 Operating System, or ROS, which is basically a software 1774 01:17:39,260 --> 01:17:42,719 framework for communication that's used a lot in robotics. 1775 01:17:42,719 --> 01:17:44,510 Just a quick show of hands, how many of you 1776 01:17:44,510 --> 01:17:46,305 have used ROS before? 1777 01:17:46,305 --> 01:17:48,350 Oh, wow, so a lot of you have used ROS. 1778 01:17:48,350 --> 01:17:50,660 How many have heard of ROS, if not used it? 1779 01:17:50,660 --> 01:17:51,860 OK, so a lot of people. 1780 01:17:51,860 --> 01:17:54,540 So that's a good starting point. 
1781 01:17:54,540 --> 01:17:58,220 So here are the proposed tasks for each group. 1782 01:17:58,220 --> 01:18:00,630 And of course, all of these are up for change. 1783 01:18:00,630 --> 01:18:03,075 If you guys want to change it, let me know. 1784 01:18:03,075 --> 01:18:04,450 Basically, so for some of the groups, 1785 01:18:04,450 --> 01:18:07,020 it's very clear how it immediately 1786 01:18:07,020 --> 01:18:09,250 applies to the Grand Challenge. 1787 01:18:09,250 --> 01:18:12,650 So incremental path planning-- well, we have a mobile robot, 1788 01:18:12,650 --> 01:18:15,030 so maybe we can actually run that on the robot 1789 01:18:15,030 --> 01:18:18,630 and get it to change its plan if something gets in the way. 1790 01:18:18,630 --> 01:18:20,840 The semantic localization group-- 1791 01:18:20,840 --> 01:18:23,920 obviously very applicable to the Grand Challenge. 1792 01:18:23,920 --> 01:18:25,910 You need to know where you are. 1793 01:18:25,910 --> 01:18:28,070 So the robot that we have now can actually 1794 01:18:28,070 --> 01:18:30,590 do normal metric localization. 1795 01:18:30,590 --> 01:18:32,152 So it can sort of know where it is. 1796 01:18:32,152 --> 01:18:33,610 But what will be interesting to see 1797 01:18:33,610 --> 01:18:36,150 is to compare the semantic localization to that one. 1798 01:18:36,150 --> 01:18:39,340 But how would you do semantic localization? 1799 01:18:39,340 --> 01:18:40,430 Well, we can use a camera. 1800 01:18:40,430 --> 01:18:43,110 And we can use the visual learning 1801 01:18:43,110 --> 01:18:45,484 through deep classification-- the visual classification 1802 01:18:45,484 --> 01:18:46,650 through deep learning group. 1803 01:18:46,650 --> 01:18:48,525 So I think these two groups would 1804 01:18:48,525 --> 01:18:50,920 have a really nice synergy and a really cool way 1805 01:18:50,920 --> 01:18:52,940 to work together. 
1806 01:18:52,940 --> 01:18:55,581 So the MCTS group is very cool, but I 1807 01:18:55,581 --> 01:18:57,747 had a little trouble thinking about exactly how that 1808 01:18:57,747 --> 01:18:58,340 would apply. 1809 01:18:58,340 --> 01:19:00,410 So maybe-- you look like you have an idea maybe. 1810 01:19:00,410 --> 01:19:02,180 AUDIENCE: So to clarify, all these 1811 01:19:02,180 --> 01:19:03,560 are separate runs of the robot? 1812 01:19:03,560 --> 01:19:05,185 STEVE: So we're actually probably going 1813 01:19:05,185 --> 01:19:06,660 to run all of them-- 1814 01:19:06,660 --> 01:19:09,820 so the grid is-- this would probably be one run of the robot. 1815 01:19:09,820 --> 01:19:13,180 We're going to decouple these so that if one or a subset of these 1816 01:19:13,180 --> 01:19:16,490 don't work super well, the other groups will still 1817 01:19:16,490 --> 01:19:17,180 be able to run. 1818 01:19:17,180 --> 01:19:20,780 So we're carefully planning that out too. 1819 01:19:20,780 --> 01:19:24,410 So the MCTS group, I was thinking, maybe 1820 01:19:24,410 --> 01:19:26,886 could solve one of those computational challenges-- 1821 01:19:26,886 --> 01:19:29,266 that could be a really nice way to do it. 1822 01:19:29,266 --> 01:19:30,890 Maybe you have to play against a human. 1823 01:19:30,890 --> 01:19:31,806 And if it wins, great. 1824 01:19:31,806 --> 01:19:33,590 You can go faster or you get more points. 1825 01:19:33,590 --> 01:19:36,855 So you could implement it on a different game 1826 01:19:36,855 --> 01:19:39,934 other than Connect Four, or possibly sort of wrap 1827 01:19:39,934 --> 01:19:41,814 [INAUDIBLE] around that to call it. 1828 01:19:41,814 --> 01:19:43,480 So the reachability group I think would also 1829 01:19:43,480 --> 01:19:46,110 be a great place to do one of these challenge stations. 1830 01:19:46,110 --> 01:19:48,050 So we could give you guys a puzzle, 1831 01:19:48,050 --> 01:19:51,334 say maybe it's some sort of maze-like state space. 
1832 01:19:51,334 --> 01:19:53,750 And you have to see, could we even reach the goal here? 1833 01:19:53,750 --> 01:19:55,625 And if you get the answer right, well, you can 1834 01:19:55,625 --> 01:19:58,420 move on to the next stage or get more points or something 1835 01:19:58,420 --> 01:19:58,920 like that. 1836 01:19:58,920 --> 01:20:00,919 AUDIENCE: But again, these are just suggestions. 1837 01:20:00,919 --> 01:20:02,836 So for example, they can use the reachability 1838 01:20:02,836 --> 01:20:04,195 for doing motion planning. 1839 01:20:04,195 --> 01:20:06,710 STEVE: Yeah, so if you guys have other suggestions on how 1840 01:20:06,710 --> 01:20:08,720 to implement your team's stuff in the Grand Challenge, 1841 01:20:08,720 --> 01:20:10,760 definitely send me an email, preferably today 1842 01:20:10,760 --> 01:20:12,810 or as soon as you think of things. 1843 01:20:12,810 --> 01:20:13,810 And we can change these. 1844 01:20:13,810 --> 01:20:18,470 These are just suggestions for right now. 1845 01:20:18,470 --> 01:20:21,390 The last two are sort of more planning related. 1846 01:20:21,390 --> 01:20:26,090 So planning with temporal logic, which was Monday's lecture, 1847 01:20:26,090 --> 01:20:28,190 I thought would be a cool way to sort of control 1848 01:20:28,190 --> 01:20:30,020 the robot's high-level actions. 1849 01:20:30,020 --> 01:20:32,800 So maybe that could involve modeling 1850 01:20:32,800 --> 01:20:35,135 our Grand Challenge with PDDL. 1851 01:20:35,135 --> 01:20:37,740 And maybe with linear temporal logic goals, 1852 01:20:37,740 --> 01:20:39,430 models that you get [INAUDIBLE]. 1853 01:20:39,430 --> 01:20:42,082 Then you could compile that, call a planner on it, 1854 01:20:42,082 --> 01:20:44,460 and execute that plan on the actual robot, 1855 01:20:44,460 --> 01:20:46,995 do [INAUDIBLE] high-level actions of the robot. 1856 01:20:46,995 --> 01:20:47,870 That's a possibility. 
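[The reachability puzzle described above -- "could we even reach the goal in this maze-like state space?" -- comes down to a graph search. A minimal sketch is shown below; the grid encoding (strings with '#' as walls) and the function name are assumptions for illustration, since the lecture doesn't specify a puzzle format.]

```python
from collections import deque

def reachable(grid, start, goal):
    """Breadth-first search over a grid maze.

    grid: list of equal-length strings, '#' = wall, anything else = free.
    start, goal: (row, col) tuples.
    Returns True if goal can be reached from start.
    """
    rows, cols = len(grid), len(grid[0])
    frontier = deque([start])
    seen = {start}
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == goal:
            return True
        # Expand the four grid neighbors that are in bounds and not walls.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#'
                    and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append((nr, nc))
    return False
```

[For example, `reachable([".#", "#."], (0, 0), (1, 1))` is False, because both neighbors of the start cell are walls.]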
1857 01:20:47,870 --> 01:20:51,390 And for today's group, the infinite horizon 1858 01:20:51,390 --> 01:20:55,631 probabilistic planning, maybe you could actually do something 1859 01:20:55,631 --> 01:20:56,130 similar. 1860 01:20:56,130 --> 01:20:59,110 But instead of modeling the domain as PDDL, 1861 01:20:59,110 --> 01:21:02,055 model it as an MDP and solve it with LAO*, 1862 01:21:02,055 --> 01:21:05,180 and get sort of a policy on how to control the robot. 1863 01:21:05,180 --> 01:21:07,135 Maybe we can change it a little bit so 1864 01:21:07,135 --> 01:21:10,925 that certain squares can be more risky than others, or so on. 1865 01:21:10,925 --> 01:21:13,050 So again, there's flexibility in all of these here. 1866 01:21:13,050 --> 01:21:16,208 So these are just some of the suggestions. 1867 01:21:16,208 --> 01:21:18,085 Does anyone have any questions before we-- 1868 01:21:18,085 --> 01:21:19,065 this is all I have. 1869 01:21:19,065 --> 01:21:20,273 So anyone have any questions? 1870 01:21:20,273 --> 01:21:20,773 Yeah? 1871 01:21:21,991 --> 01:21:24,201 AUDIENCE: How are we working on [INAUDIBLE]? 1872 01:21:24,201 --> 01:21:26,656 Are you [INAUDIBLE]? 1873 01:21:26,656 --> 01:21:28,530 STEVE: So it's basically up to you 1874 01:21:28,530 --> 01:21:31,350 guys to really divide up the work amongst yourselves. 1875 01:21:31,350 --> 01:21:33,780 It's really different for every team. 1876 01:21:33,780 --> 01:21:35,074 So we're-- 1877 01:21:35,074 --> 01:21:35,740 AUDIENCE: Sorry. 1878 01:21:35,740 --> 01:21:37,406 So each team is developing [INAUDIBLE]. 1879 01:21:37,406 --> 01:21:38,241 STEVE: Right, yeah. 1880 01:21:38,241 --> 01:21:40,532 Each team is really in their own separate package. 1881 01:21:40,532 --> 01:21:43,110 For example, these two teams, there's 1882 01:21:43,110 --> 01:21:45,210 probably going to be a common interface where 1883 01:21:45,210 --> 01:21:47,419 the output of this one goes to the input of that one. 
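[The suggestion above -- model the maze as an MDP with some squares riskier than others, and solve for a policy -- can be sketched with plain value iteration. LAO* is the heuristic-search solver named in the lecture; value iteration stands in here only because it fits in a few lines. The grid size, slip probability, and cost numbers are all made-up parameters for illustration.]

```python
def value_iteration(rows=3, cols=3, goal=(2, 2), risky=frozenset({(1, 1)}),
                    slip=0.2, step_cost=1.0, risk_cost=10.0,
                    gamma=0.95, tol=1e-6):
    """Solve a tiny grid MDP for a minimum-expected-cost policy.

    States are grid cells. Each move succeeds with probability 1 - slip
    and slips (robot stays put) with probability slip. Entering a cell
    in `risky` incurs an extra cost, so the optimal policy detours.
    Returns (V, policy): cost-to-go per state, best move per state.
    """
    states = [(r, c) for r in range(rows) for c in range(cols)]
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]

    def q_value(V, s, move):
        nr, nc = s[0] + move[0], s[1] + move[1]
        if not (0 <= nr < rows and 0 <= nc < cols):
            return None  # move would leave the grid
        cost = step_cost + (risk_cost if (nr, nc) in risky else 0.0)
        # Expected cost: succeed with prob 1-slip, stay put with prob slip.
        return cost + gamma * ((1 - slip) * V[(nr, nc)] + slip * V[s])

    V = {s: 0.0 for s in states}
    while True:  # Bellman backups until values stop changing
        delta = 0.0
        for s in states:
            if s == goal:
                continue  # goal is absorbing with zero cost
            best = min(q for m in moves if (q := q_value(V, s, m)) is not None)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break

    # Extract the greedy policy: the lowest-expected-cost move per state.
    policy = {}
    for s in states:
        if s == goal:
            continue
        qs = [(q, m) for m in moves if (q := q_value(V, s, m)) is not None]
        policy[s] = min(qs)[1]
    return V, policy
```

[With the defaults, the policy routes around the risky center square rather than through it -- the same behavior you would want from the Grand Challenge robot when some squares are dangerous.]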
1884 01:21:47,419 --> 01:21:49,085 So for that one, we're going to give you 1885 01:21:49,085 --> 01:21:50,710 sort of the interface [INAUDIBLE] 1886 01:21:50,710 --> 01:21:52,082 message type package for that. 1887 01:21:52,082 --> 01:21:54,690 But other than that, like for dividing up the work, 1888 01:21:54,690 --> 01:21:56,600 you guys [INAUDIBLE] that. 1889 01:21:56,600 --> 01:22:00,280 PROFESSOR 2: So your pain will be integration 1890 01:22:00,280 --> 01:22:05,238 if you're trying to [INAUDIBLE] pieces the group is doing. 1891 01:22:05,238 --> 01:22:08,230 That, we think, is a thing that uses software engineering 1892 01:22:08,230 --> 01:22:10,860 skills [INAUDIBLE]. 1893 01:22:10,860 --> 01:22:14,010 It can be really unpleasant to do and take a lot of time. 1894 01:22:14,010 --> 01:22:15,970 So we don't want you to have that experience. 1895 01:22:15,970 --> 01:22:19,300 So that's why you can... 1896 01:22:19,300 --> 01:22:21,081 What we wanted you to do 1897 01:22:21,081 --> 01:22:23,330 is to be able to get your own capability [INAUDIBLE]. 1898 01:22:25,860 --> 01:22:29,230 If you guys choose to integrate with other teams 1899 01:22:29,230 --> 01:22:31,290 because you're really excited about that, 1900 01:22:31,290 --> 01:22:33,020 and because it looks like the people 1901 01:22:33,020 --> 01:22:34,914 that you're working with [INAUDIBLE], 1902 01:22:34,914 --> 01:22:37,910 then that's purely your choice. 1903 01:22:37,910 --> 01:22:38,810 Is that fair enough? 1904 01:22:38,810 --> 01:22:39,310 STEVE: Sure. 1905 01:22:39,310 --> 01:22:40,010 It's fine. 1906 01:22:40,010 --> 01:22:41,210 Any more questions? 1907 01:22:44,910 --> 01:22:45,410 OK. 1908 01:22:45,410 --> 01:22:46,310 Sounds great. 1909 01:22:46,310 --> 01:22:49,360 [APPLAUSE]