1 00:00:01,580 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,340 Commons license. 3 00:00:05,340 --> 00:00:07,550 Your support will help MIT OpenCourseWare 4 00:00:07,550 --> 00:00:11,640 continue to offer high quality educational resources for free. 5 00:00:11,640 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,110 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,110 --> 00:00:19,090 at ocw.mit.edu. 8 00:00:22,420 --> 00:00:24,370 PROFESSOR WILLIAMS: OK, so today's lecture-- 9 00:00:27,010 --> 00:00:31,380 we're going to be talking about probabilistic planning later, 10 00:00:31,380 --> 00:00:33,310 and in these cases, where you're planning 11 00:00:33,310 --> 00:00:36,570 in large state spaces, it is very difficult. 12 00:00:36,570 --> 00:00:38,740 You do the MDP planning. 13 00:00:38,740 --> 00:00:41,690 It could be stress that activity planning, or the like. 14 00:00:41,690 --> 00:00:43,250 But you have to be able to figure out 15 00:00:43,250 --> 00:00:44,950 how to deal with these state spaces. 16 00:00:44,950 --> 00:00:48,160 So Monte Carlo tree search is one of the techniques 17 00:00:48,160 --> 00:00:51,040 that people have identified, over the last five years, 18 00:00:51,040 --> 00:00:54,667 as having an amazing performance improvement over other kinds 19 00:00:54,667 --> 00:00:56,120 of sample-based approaches. 20 00:00:56,120 --> 00:00:58,510 So MCTS is very interesting from that standpoint. 21 00:00:58,510 --> 00:01:00,845 And then if we [? link it to ?] 
the last lecture, 22 00:01:00,845 --> 00:01:02,980 then the combination of something, 23 00:01:02,980 --> 00:01:07,370 we just learned about [INAUDIBLE] and combine it with search, 24 00:01:07,370 --> 00:01:11,035 is very powerful, in this case, through the state-of-the-art 25 00:01:11,035 --> 00:01:15,472 techniques for that, as much as tree search [INAUDIBLE] later 26 00:01:15,472 --> 00:01:20,610 [INAUDIBLE] 27 00:01:20,610 --> 00:01:22,110 PROFESSOR 2: Good morning, everyone. 28 00:01:22,110 --> 00:01:24,411 As Professor Williams just said, we 29 00:01:24,411 --> 00:01:26,910 are going to be talking about Monte Carlo tree search today. 30 00:01:26,910 --> 00:01:30,102 My name is Eann and I'll be leading 31 00:01:30,102 --> 00:01:32,310 the introduction and motivation of this presentation. 32 00:01:32,310 --> 00:01:34,420 By the end of this presentation, you 33 00:01:34,420 --> 00:01:36,890 will know, first, why we care about Monte Carlo tree 34 00:01:36,890 --> 00:01:37,390 search. 35 00:01:37,390 --> 00:01:39,930 As Professor Williams said, there are so many algorithms 36 00:01:39,930 --> 00:01:40,680 out there. 37 00:01:40,680 --> 00:01:43,440 Why do we care about this specific one? 38 00:01:43,440 --> 00:01:46,260 And second, we'll be going through the pros 39 00:01:46,260 --> 00:01:49,620 and cons of MCTS, as well as the algorithm itself. 40 00:01:49,620 --> 00:01:52,440 And then lastly, we will have a pretty cool demo 41 00:01:52,440 --> 00:01:55,650 on how it's applied to Super Mario Brothers and the latest 42 00:01:55,650 --> 00:02:01,350 AlphaGo AI that beat the second best Go 43 00:02:01,350 --> 00:02:03,520 player in the world. 44 00:02:03,520 --> 00:02:05,820 So the outline for today's presentation 45 00:02:05,820 --> 00:02:08,220 is, first, we're going to talk about pre-MCTS algorithms. 
46 00:02:08,220 --> 00:02:11,130 There are other algorithms that currently exist out there, 47 00:02:11,130 --> 00:02:15,600 and we'll cover just a few of them to lead into why we do care about MCTS 48 00:02:15,600 --> 00:02:18,510 and why these other algorithms fail. 49 00:02:18,510 --> 00:02:20,850 And second, we'll talk about Monte Carlo tree search 50 00:02:20,850 --> 00:02:21,890 itself with Yo. 51 00:02:21,890 --> 00:02:25,210 And lastly, Nick will tell you more about the applications 52 00:02:25,210 --> 00:02:27,300 of Monte Carlo tree search. 53 00:02:27,300 --> 00:02:31,800 So the motivation of these kinds of algorithms 54 00:02:31,800 --> 00:02:33,440 is we want to be able to play games 55 00:02:33,440 --> 00:02:36,972 and we want to be able to create programs to play these games, 56 00:02:36,972 --> 00:02:38,430 but we want to play them optimally. 57 00:02:38,430 --> 00:02:40,880 We want to be able to win, but we also 58 00:02:40,880 --> 00:02:43,890 want to be able to do this in a reasonable amount of time. 59 00:02:43,890 --> 00:02:45,965 So these three constraints lead 60 00:02:45,965 --> 00:02:47,680 to different kinds of algorithms, 61 00:02:47,680 --> 00:02:50,600 and different algorithms with different complexities 62 00:02:50,600 --> 00:02:53,420 and time, or times to search. 63 00:02:53,420 --> 00:02:55,025 And so that's why today we're going 64 00:02:55,025 --> 00:02:57,140 to be talking about Monte Carlo tree search. 65 00:02:57,140 --> 00:03:00,614 And you'll figure out in a few slides why we do care. 66 00:03:00,614 --> 00:03:02,900 So these are the types of games we have. 67 00:03:02,900 --> 00:03:04,300 You have this chart where there are 68 00:03:04,300 --> 00:03:07,940 fully observable games, partially observable games, 69 00:03:07,940 --> 00:03:10,450 deterministic games, and games of chance. 
70 00:03:10,450 --> 00:03:13,240 And so today, the games that we care about 71 00:03:13,240 --> 00:03:16,790 are the games that are fully observable and deterministic. 72 00:03:16,790 --> 00:03:21,470 And these games are games like chess and checkers and Go. 73 00:03:21,470 --> 00:03:23,580 And we'll also be talking about another example 74 00:03:23,580 --> 00:03:25,730 with Tic-tac-toe. 75 00:03:25,730 --> 00:03:29,280 So these pre-MCTS algorithms apply to 76 00:03:29,280 --> 00:03:32,660 deterministic, fully observable games, like we said earlier. 77 00:03:32,660 --> 00:03:36,510 And the idea of this, and the nice thing about these games, 78 00:03:36,510 --> 00:03:38,760 is that they have perfect information, 79 00:03:38,760 --> 00:03:41,180 and that you have all of the states 80 00:03:41,180 --> 00:03:45,650 that you need and there's no opportunity for chance. 81 00:03:45,650 --> 00:03:47,615 And so the idea is that we can construct 82 00:03:47,615 --> 00:03:50,240 a tree that contains all possible outcomes 83 00:03:50,240 --> 00:03:52,990 because everything is fully determined. 84 00:03:52,990 --> 00:03:55,250 And so one of these algorithms, to address this, 85 00:03:55,250 --> 00:03:58,960 is the algorithm Minimax, which you might have heard of before. 86 00:03:58,960 --> 00:04:00,600 And the idea of Minimax is to minimize 87 00:04:00,600 --> 00:04:02,317 the maximum possible loss. 88 00:04:02,317 --> 00:04:04,150 That sounds a little weird in the beginning, 89 00:04:04,150 --> 00:04:06,540 but if you take a look at this tree, 90 00:04:06,540 --> 00:04:08,440 this red dot, for example, is the computer. 91 00:04:08,440 --> 00:04:11,730 And so in the computer's eyes, it wants to beat its opponent. 92 00:04:11,730 --> 00:04:14,570 And we're assuming the opponent wants to win also, 93 00:04:14,570 --> 00:04:16,815 so they're playing their best game as well. 
94 00:04:16,815 --> 00:04:21,990 And so the computer wants to maximize its points, 95 00:04:21,990 --> 00:04:25,820 but also knowing that the opponent, or the human, 96 00:04:25,820 --> 00:04:29,870 wants to maximize their own win as well. 97 00:04:29,870 --> 00:04:31,730 And so in the computer's eyes, it 98 00:04:31,730 --> 00:04:34,000 wants to minimize the maximum possible loss. 99 00:04:34,000 --> 00:04:37,038 Does that make sense to everyone? 100 00:04:37,038 --> 00:04:38,012 Yes? 101 00:04:38,012 --> 00:04:39,480 OK. 102 00:04:39,480 --> 00:04:41,310 And so in the example of Minimax, 103 00:04:41,310 --> 00:04:42,810 we're going to start with a connect, 104 00:04:42,810 --> 00:04:45,450 or a Tic-tac-toe board, where the computer is 105 00:04:45,450 --> 00:04:49,230 this board right here, and the blue Tic-tac-toe boards 106 00:04:49,230 --> 00:04:52,236 are the states that the computer finally chooses. 107 00:04:52,236 --> 00:04:55,030 It's anticipating the moves a human could play. 108 00:04:57,880 --> 00:04:59,820 So if you take a look up here, here's 109 00:04:59,820 --> 00:05:02,292 the current state of the board. 110 00:05:02,292 --> 00:05:03,700 The current state of the board. 111 00:05:03,700 --> 00:05:09,380 And the possible options for the human are this guy, this guy. 112 00:05:09,380 --> 00:05:09,880 Nope. 113 00:05:09,880 --> 00:05:11,820 Possible options for the computer, 114 00:05:11,820 --> 00:05:13,540 we have three different options. 115 00:05:13,540 --> 00:05:16,200 And so you'll notice that this is clearly the obvious winner. 116 00:05:16,200 --> 00:05:18,180 But in the case of Minimax, it goes 117 00:05:18,180 --> 00:05:19,710 through the entire tree, which is 118 00:05:19,710 --> 00:05:21,270 different from depth-first search. 
119 00:05:21,270 --> 00:05:24,520 It goes through the entire tree until it finds the winning 120 00:05:24,520 --> 00:05:30,460 move, the one that minimizes the maximum possible points 121 00:05:30,460 --> 00:05:31,800 its opponent could win. 122 00:05:31,800 --> 00:05:34,080 So is there a way we can make this better? 123 00:05:34,080 --> 00:05:34,620 Yes. 124 00:05:34,620 --> 00:05:36,720 I'm sure you've heard about pruning, 125 00:05:36,720 --> 00:05:39,900 where, in our human intuition, it makes sense. 126 00:05:39,900 --> 00:05:41,790 Well, why don't we just stop when we win, 127 00:05:41,790 --> 00:05:43,940 or when we know we're going to have 128 00:05:43,940 --> 00:05:47,116 a game that allows us to win? 129 00:05:47,116 --> 00:05:49,940 And so this idea is the idea of simple pruning. 130 00:05:49,940 --> 00:05:54,250 And so when we combine Minimax and simple pruning, we have-- 131 00:05:54,250 --> 00:05:54,750 anyone know? 132 00:05:57,612 --> 00:05:58,570 AUDIENCE: Alpha-beta. 133 00:05:58,570 --> 00:05:59,278 PROFESSOR 3: Yes. 134 00:05:59,278 --> 00:06:02,915 Our 6.034 head TA knows about this. 135 00:06:02,915 --> 00:06:05,800 We have alpha-beta pruning, where we prune away any 136 00:06:05,800 --> 00:06:09,090 branches that cannot influence the final decision. 137 00:06:09,090 --> 00:06:13,100 So in other words, you wouldn't keep exploring the tree 138 00:06:13,100 --> 00:06:15,595 if you already knew that a previous move would 139 00:06:15,595 --> 00:06:16,890 allow you to win. 140 00:06:16,890 --> 00:06:19,630 And so in alpha-beta pruning, 141 00:06:19,630 --> 00:06:21,150 we have an alpha and a beta. 142 00:06:21,150 --> 00:06:24,740 And so the details aren't important 143 00:06:24,740 --> 00:06:26,850 for you to know right now, but the idea 144 00:06:26,850 --> 00:06:29,490 is that we stop whenever we know we don't 145 00:06:29,490 --> 00:06:31,930 need to go on any further. 
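The Minimax-with-pruning idea described above can be sketched in a few lines. This is only an illustrative sketch, not code from the lecture: the game tree is assumed to be a nested Python list whose leaves are scores from the computer's point of view, and the function name `alphabeta` and the optional `visited` list (used to record which leaves were actually examined) are hypothetical.

```python
import math

def alphabeta(node, maximizing=True, alpha=-math.inf, beta=math.inf, visited=None):
    """Return the Minimax value of a nested-list game tree.

    Leaves are numbers (scores for the computer); inner nodes are lists of
    children. Branches that cannot change the final decision are skipped.
    """
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)  # record that this leaf was actually evaluated
        return node

    best = -math.inf if maximizing else math.inf
    for child in node:
        value = alphabeta(child, not maximizing, alpha, beta, visited)
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)  # best score the computer can already guarantee
        else:
            best = min(best, value)
            beta = min(beta, best)    # best score the opponent can already force
        if alpha >= beta:
            break  # remaining siblings cannot influence the decision
    return best
```

For the tree `[[3, 5], [2, 9], [0, 1]]`, with the computer moving first and the opponent replying, the leaves 9 and 1 are never examined: once the opponent has a reply worth 2 or 0, that branch is already worse for the computer than the first branch, which guarantees 3.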
146 00:06:31,930 --> 00:06:34,434 So in games like Tic-tac-toe 147 00:06:34,434 --> 00:06:37,380 and Connect 4 and chess, we have a relatively low 148 00:06:37,380 --> 00:06:38,800 branching factor. 149 00:06:38,800 --> 00:06:41,130 So in the case of Tic-tac-toe, we have 2 150 00:06:41,130 --> 00:06:43,720 to the fourth branching factor. 151 00:06:43,720 --> 00:06:46,230 But what if we have really large branching factors, 152 00:06:46,230 --> 00:06:47,640 like in Go? 153 00:06:47,640 --> 00:06:50,440 In Go, we have 2 to the 250. 154 00:06:50,440 --> 00:06:53,760 Do you see that Minimax, or even alpha-beta pruning, 155 00:06:53,760 --> 00:06:57,140 would be an optimal algorithm for this? 156 00:06:57,140 --> 00:06:59,169 The answer is? 157 00:06:59,169 --> 00:06:59,710 AUDIENCE: No. 158 00:06:59,710 --> 00:07:00,376 PROFESSOR 3: No. 159 00:07:00,376 --> 00:07:04,370 And this leads us to our next section. 160 00:07:04,370 --> 00:07:08,210 Yo is going to talk about how we can use the Monte Carlo 161 00:07:08,210 --> 00:07:11,210 tree search algorithm for games with really high 162 00:07:11,210 --> 00:07:16,120 branching factors, and use the random extension to allow us 163 00:07:16,120 --> 00:07:21,490 to see, ultimately, how AlphaGo, which is Google's AI, 164 00:07:21,490 --> 00:07:25,843 was able to beat the leading Go player in the world. 165 00:07:29,140 --> 00:07:31,024 PROFESSOR 3: All right, guys. 166 00:07:31,024 --> 00:07:34,000 So this is the part where we re-explain 167 00:07:34,000 --> 00:07:35,410 the algorithm itself. 
168 00:07:35,410 --> 00:07:37,240 And before we dive into this, I want 169 00:07:37,240 --> 00:07:38,860 to make something really clear, which 170 00:07:38,860 --> 00:07:41,470 is that because these are technical details 171 00:07:41,470 --> 00:07:43,700 and because we actually want you to understand them, 172 00:07:43,700 --> 00:07:45,760 and because I definitely didn't understand this the first three 173 00:07:45,760 --> 00:07:46,920 times I read the paper, 174 00:07:46,920 --> 00:07:49,420 I really want you to feel free to ask any questions 175 00:07:49,420 --> 00:07:53,590 on your mind, with the knowledge that, in my experience, 176 00:07:53,590 --> 00:07:56,492 it is very rare that someone asks a question in class that's 177 00:07:56,492 --> 00:08:00,350 [INAUDIBLE] OK, so really, whenever you have one. 178 00:08:00,350 --> 00:08:01,630 OK. 179 00:08:01,630 --> 00:08:04,130 So why are we doing this? 180 00:08:04,130 --> 00:08:06,860 Well, the idea behind MCTS is 181 00:08:06,860 --> 00:08:09,160 that we want to selectively build up 182 00:08:09,160 --> 00:08:10,910 different parts of the tree. 183 00:08:10,910 --> 00:08:16,630 So the depth-first search way, the exhaustive search, 184 00:08:16,630 --> 00:08:19,270 would have us exploring the entire tree, 185 00:08:19,270 --> 00:08:21,480 and our depth is limited by looking 186 00:08:21,480 --> 00:08:23,630 at all the possible nodes of that level. 187 00:08:23,630 --> 00:08:25,270 But what we want is we want-- 188 00:08:25,270 --> 00:08:28,350 because the amount of computation required for that 189 00:08:28,350 --> 00:08:30,080 explodes really quickly 190 00:08:30,080 --> 00:08:32,373 with the number of moves that you're basically 191 00:08:32,373 --> 00:08:33,789 looking into the future, we want 192 00:08:33,789 --> 00:08:37,495 to be able to search selectively in certain parts of the tree. 
193 00:08:37,495 --> 00:08:41,230 And so for example, if there are less promising parts over here, 194 00:08:41,230 --> 00:08:44,290 then we care less about looking into the future of those areas. 195 00:08:44,290 --> 00:08:46,030 But if we have a certain move-- 196 00:08:46,030 --> 00:08:48,050 in chess, for example, there's a certain move 197 00:08:48,050 --> 00:08:49,670 where in two moves, you're going to be able to take 198 00:08:49,670 --> 00:08:50,545 the opponent's queen. 199 00:08:50,545 --> 00:08:52,412 You really want to search that region 200 00:08:52,412 --> 00:08:53,870 and figure out whether that's going 201 00:08:53,870 --> 00:08:58,130 to end up being a significantly positive move for me. 202 00:08:58,130 --> 00:09:00,230 And so the whole goal of our algorithm 203 00:09:00,230 --> 00:09:02,977 is going to be growing this asymmetric tree. 204 00:09:02,977 --> 00:09:03,810 How does that sound? 205 00:09:06,820 --> 00:09:08,700 OK, great. 206 00:09:08,700 --> 00:09:11,210 So how do we actually do this? 207 00:09:11,210 --> 00:09:13,200 We're going to go over a high-level outline, 208 00:09:13,200 --> 00:09:14,800 but before we do that, let's talk 209 00:09:14,800 --> 00:09:16,400 about our tree, which you're going 210 00:09:16,400 --> 00:09:17,483 to get very familiar with. 211 00:09:20,250 --> 00:09:24,710 Can people see that this is red and this is blue? 212 00:09:24,710 --> 00:09:28,850 So this is our game state when we start our game. 213 00:09:28,850 --> 00:09:32,570 We can be given a Tic-tac-toe board with a [INAUDIBLE] place, 214 00:09:32,570 --> 00:09:35,780 a game of chess with the [? pieces ?] configured a certain way. 215 00:09:35,780 --> 00:09:38,420 And so our player, which is the computer, 216 00:09:38,420 --> 00:09:41,070 has three separate moves that it can take. 217 00:09:41,070 --> 00:09:43,560 And so each of those moves is represented by a node. 
218 00:09:43,560 --> 00:09:48,170 And each of those moves has response moves by the opponent. 219 00:09:48,170 --> 00:09:50,870 So you can imagine that if one of these 220 00:09:50,870 --> 00:09:53,730 is a Tic-tac-toe board with just a circle, that one of these 221 00:09:53,730 --> 00:09:57,440 is with that circle and an X placed right by it. 222 00:09:57,440 --> 00:10:00,620 And as you go down this tree, 223 00:10:00,620 --> 00:10:02,840 you start understanding basically, 224 00:10:02,840 --> 00:10:06,260 it's the way that humans think about playing these games. 225 00:10:06,260 --> 00:10:10,160 If I go here, then what if they go there, 226 00:10:10,160 --> 00:10:12,280 and then what if I go right here. 227 00:10:12,280 --> 00:10:14,990 You try to think through the set of future moves 228 00:10:14,990 --> 00:10:17,930 and try to evaluate whether your move will 229 00:10:17,930 --> 00:10:20,799 be good in the long-term sense. 230 00:10:20,799 --> 00:10:23,090 The way that we are going to expand our tree, as we said, 231 00:10:23,090 --> 00:10:26,464 to create an asymmetric tree is first of all, 232 00:10:26,464 --> 00:10:28,130 we're going to descend through the tree. 233 00:10:28,130 --> 00:10:30,296 We're going to start at the top and we're basically going to 234 00:10:30,296 --> 00:10:34,560 jump down some sequence of branches until we figure out 235 00:10:34,560 --> 00:10:38,750 where we're going to place our new node, which seems 236 00:10:38,750 --> 00:10:39,920 like a key operation here. 237 00:10:39,920 --> 00:10:42,018 To create an asymmetric tree it's all about how 238 00:10:42,018 --> 00:10:43,707 you [INAUDIBLE]. 239 00:10:43,707 --> 00:10:45,290 For example, in this case, we're going 240 00:10:45,290 --> 00:10:48,580 to pick this sequence of nodes. 241 00:10:48,580 --> 00:10:51,596 And once we get to the bottom and find that location, 242 00:10:51,596 --> 00:10:53,680 we're going to create a new node. 
243 00:10:53,680 --> 00:10:55,750 It's not very hard. 244 00:10:55,750 --> 00:10:59,690 Then we're going to simulate a game from this new node. 245 00:10:59,690 --> 00:11:03,260 And this is the key part of MCTS. 246 00:11:03,260 --> 00:11:06,296 Once you get to a new location, what 247 00:11:06,296 --> 00:11:07,670 you're going to be doing then, is 248 00:11:07,670 --> 00:11:10,465 you're going to be simulating a game from that new location. 249 00:11:10,465 --> 00:11:11,840 We're going to talk about how you 250 00:11:11,840 --> 00:11:17,300 go about simulating a game from this more advanced game state 251 00:11:17,300 --> 00:11:18,907 than what we started out with. 252 00:11:18,907 --> 00:11:20,957 Does anyone have any questions right now? 253 00:11:20,957 --> 00:11:23,040 We will be going in depth into all of these steps, 254 00:11:23,040 --> 00:11:24,556 but just in a high-level sense. 255 00:11:24,556 --> 00:11:25,420 AUDIENCE: Just a quick question. 256 00:11:25,420 --> 00:11:25,660 PROFESSOR 3: Yeah. 257 00:11:25,660 --> 00:11:27,035 AUDIENCE: To create the new node, 258 00:11:27,035 --> 00:11:29,617 is it probabilistic, just creating a new node as the most 259 00:11:29,617 --> 00:11:30,450 probable [INAUDIBLE] 260 00:11:30,450 --> 00:11:31,300 PROFESSOR 3: No, no. 261 00:11:31,300 --> 00:11:32,590 You're creating some new node. 262 00:11:32,590 --> 00:11:34,140 We'll talk about how we pick that new node, 263 00:11:34,140 --> 00:11:36,806 but we're just making a new node and we're not thinking anything 264 00:11:36,806 --> 00:11:37,780 about probability. 265 00:11:37,780 --> 00:11:40,030 The next thing is that we're going to update the tree. 
266 00:11:40,030 --> 00:11:43,195 So whatever the value of the simulation delta was-- 267 00:11:43,195 --> 00:11:50,360 delta, remember-- we're going to propagate that up and basically 268 00:11:50,360 --> 00:11:52,550 add that to all of the nodes that 269 00:11:52,550 --> 00:11:54,416 are parents of that node in the tree 270 00:11:54,416 --> 00:11:56,332 and update some information that goes in there 271 00:11:56,332 --> 00:11:58,090 and that they're storing. 272 00:11:58,090 --> 00:12:00,980 This is going to be good because it's going to mean that-- 273 00:12:00,980 --> 00:12:02,975 it's a lot like in search algorithms where 274 00:12:02,975 --> 00:12:05,360 you have trees, in that the entirety of the tree 275 00:12:05,360 --> 00:12:07,713 remains up to date with the information from every given 276 00:12:07,713 --> 00:12:08,642 simulation. 277 00:12:08,642 --> 00:12:10,100 And we're just going to repeat this 278 00:12:10,100 --> 00:12:11,390 over and over and over again. 279 00:12:11,390 --> 00:12:13,640 And slowly, our tree will grow out 280 00:12:13,640 --> 00:12:15,946 until whenever we feel like stopping. 281 00:12:15,946 --> 00:12:17,570 This is actually one of the nice things 282 00:12:17,570 --> 00:12:22,220 about MCTS, is that whenever we decide that we're out 283 00:12:22,220 --> 00:12:25,510 of time, like for example, if you're in a competition playing 284 00:12:25,510 --> 00:12:29,060 a champion Go player, you can stop the simulation. 285 00:12:29,060 --> 00:12:30,710 And then all you have to do is pick 286 00:12:30,710 --> 00:12:34,220 the best of the first moves 287 00:12:34,220 --> 00:12:35,780 that you're going to make. 288 00:12:35,780 --> 00:12:38,510 Because at the end of the day, after you've 289 00:12:38,510 --> 00:12:41,010 done all the simulation, we're still right here. 290 00:12:41,010 --> 00:12:43,820 And we're still only picking between the moves that go 291 00:12:43,820 --> 00:12:45,850 immediately from where we started. 
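The update step and the final choice of a first move can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenters' code: the `Node` class, its fields `n` and `w`, and the function names are hypothetical, the simulation result `delta` is assumed to be 1 for a win by player one and 0 otherwise, and choosing the most-visited first move is just one common convention (the lecture only says to pick the best first move).

```python
class Node:
    """One game state in the search tree (hypothetical minimal structure)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.n = 0    # number of simulated games that passed through this node
        self.w = 0.0  # accumulated simulation results (wins for player one)
        if parent is not None:
            parent.children.append(self)


def backpropagate(node, delta):
    """Add one simulation result to the new node and every node above it."""
    while node is not None:
        node.n += 1
        node.w += delta
        node = node.parent


def best_first_move(root):
    """When time runs out, choose among the root's immediate children only.

    Assumption: the most-visited child is taken as the best move.
    """
    return max(root.children, key=lambda child: child.n)
```

Because `backpropagate` walks all the way up to the root, the search can be stopped after any iteration and the statistics at the first level of the tree are already usable.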
292 00:12:45,850 --> 00:12:47,260 Yeah. 293 00:12:47,260 --> 00:12:50,080 AUDIENCE: Could this [INAUDIBLE] good tree? 294 00:12:50,080 --> 00:12:52,290 And then on some initial region of interest, 295 00:12:52,290 --> 00:12:56,151 or is it arbitrary how you get to create it? 296 00:12:56,151 --> 00:12:57,900 PROFESSOR 3: We'll go through how you pick 297 00:12:57,900 --> 00:13:00,410 where to descend right now. 298 00:13:00,410 --> 00:13:04,030 I guess, it's any possible move that starts 299 00:13:04,030 --> 00:13:06,412 at your starting game state. 300 00:13:06,412 --> 00:13:10,480 Does that make-- great. 301 00:13:10,480 --> 00:13:12,970 Before we move on to the algorithm itself, 302 00:13:12,970 --> 00:13:17,360 let's talk about what we store in each one of these nodes. 303 00:13:17,360 --> 00:13:19,400 So now we've added these numbers. 304 00:13:19,400 --> 00:13:22,510 And what these numbers represent is that n sub k, 305 00:13:22,510 --> 00:13:25,730 the value on the right, is the number of games 306 00:13:25,730 --> 00:13:28,500 that have been played that involve a certain node. 307 00:13:28,500 --> 00:13:31,070 So for example, if I look at this node, 308 00:13:31,070 --> 00:13:33,410 that means that four games have been 309 00:13:33,410 --> 00:13:34,737 played that involve this node. 310 00:13:34,737 --> 00:13:36,820 A game that has been played that involves the node 311 00:13:36,820 --> 00:13:38,570 just means that one of the states 312 00:13:38,570 --> 00:13:40,940 of the board at some point in the game 313 00:13:40,940 --> 00:13:45,480 was the state of the board that this represents. 314 00:13:45,480 --> 00:13:48,400 For example, if I have a game that was played here, 315 00:13:48,400 --> 00:13:50,275 if I know that I've played this once, 316 00:13:50,275 --> 00:13:51,650 then that guarantees to me that I 317 00:13:51,650 --> 00:13:53,191 played this game once because this is 318 00:13:53,191 --> 00:13:55,444 a precursor state to this one. 
319 00:13:55,444 --> 00:13:56,920 Make sense? 320 00:13:56,920 --> 00:13:57,904 Yeah. 321 00:13:57,904 --> 00:14:00,734 AUDIENCE: How can the two n's below that node not 322 00:14:00,734 --> 00:14:03,000 add up to a value of [INAUDIBLE] 323 00:14:03,000 --> 00:14:05,960 PROFESSOR 3: That will come when we start expanding our game. 324 00:14:05,960 --> 00:14:07,180 But that's a great question. 325 00:14:07,180 --> 00:14:10,270 And intuitively speaking, it should. 326 00:14:10,270 --> 00:14:12,940 AUDIENCE: You're saying you're storing data from past games 327 00:14:12,940 --> 00:14:13,742 about what we've-- 328 00:14:13,742 --> 00:14:14,450 PROFESSOR 3: Yes. 329 00:14:14,450 --> 00:14:15,944 AUDIENCE: --done before. 330 00:14:15,944 --> 00:14:18,360 AUDIENCE: Past games outside of the [? current ?] simulation? 331 00:14:18,360 --> 00:14:19,360 PROFESSOR 3: No, no, no. 332 00:14:19,360 --> 00:14:21,850 Past games in the [? current ?] simulation. 333 00:14:21,850 --> 00:14:23,590 And then the other value is the number 334 00:14:23,590 --> 00:14:26,724 of wins associated with a certain node. 335 00:14:26,724 --> 00:14:28,890 And these are going to be wins for player one, which 336 00:14:28,890 --> 00:14:30,494 is red in this case. 337 00:14:30,494 --> 00:14:32,410 It would get confusing if we put both of them, 338 00:14:32,410 --> 00:14:34,120 but they're complementary. 339 00:14:34,120 --> 00:14:37,020 So for example, three out of the four times 340 00:14:37,020 --> 00:14:42,317 that the red player visited this node, they won in that node. 341 00:14:42,317 --> 00:14:44,650 And these are the two numbers that we're going to store. 342 00:14:44,650 --> 00:14:46,066 And we're going to see why they're 343 00:14:46,066 --> 00:14:48,760 significant to store later. 344 00:14:48,760 --> 00:14:52,629 So first, descending, the key part of our algorithm 345 00:14:52,629 --> 00:14:53,670 that we're talking about. 
346 00:14:53,670 --> 00:14:55,900 And when descending, there are these two 347 00:14:55,900 --> 00:14:59,260 counterbalanced desires that we have. 348 00:14:59,260 --> 00:15:03,670 The first of them is that we want to explore really 349 00:15:03,670 --> 00:15:05,410 deeply into our tree. 350 00:15:05,410 --> 00:15:08,650 We want to think about, OK, if they do this then I'll do this. 351 00:15:08,650 --> 00:15:11,427 And then, well, then I'll do that, and so on and so forth. 352 00:15:11,427 --> 00:15:13,510 And we want to think through a long-term strategy. 353 00:15:13,510 --> 00:15:16,870 But at the same time, we don't want to get caught in that. 354 00:15:16,870 --> 00:15:18,700 We want to make sure that we're not 355 00:15:18,700 --> 00:15:22,750 missing a really promising other move that we weren't even 356 00:15:22,750 --> 00:15:24,670 considering because we were really going down 357 00:15:24,670 --> 00:15:27,410 this certain rabbit hole of the move 358 00:15:27,410 --> 00:15:28,840 that we had thought about before. 359 00:15:28,840 --> 00:15:33,260 This is illustrated by the x case [INAUDIBLE] SMBC. 360 00:15:33,260 --> 00:15:37,222 The SMBC comic about academia and how, when someone tells you 361 00:15:37,222 --> 00:15:38,680 that a lot of really great work has 362 00:15:38,680 --> 00:15:40,346 been done in an area, that means nothing 363 00:15:40,346 --> 00:15:44,082 about how promising the future will be. 364 00:15:44,082 --> 00:15:45,790 It's all about exploitation and exploration. 365 00:15:45,790 --> 00:15:47,831 And the way that we're going to balance exploitation 366 00:15:47,831 --> 00:15:49,520 and exploration in order to create 367 00:15:49,520 --> 00:15:54,083 our really nice asymmetric tree is the following formula. 368 00:15:54,083 --> 00:15:57,610 And it's fine if that looks really confusing and messy. 369 00:15:57,610 --> 00:16:03,220 But actually, it breaks down quite nicely into two parts. 
370 00:16:03,220 --> 00:16:04,860 This formula is known as the UCB. 371 00:16:04,860 --> 00:16:07,600 You don't need to know why it's the Upper Confidence Bound. 372 00:16:07,600 --> 00:16:09,231 Let's just talk about what's inside it. 373 00:16:09,231 --> 00:16:11,230 So first of all, you have this term on the left. 374 00:16:11,230 --> 00:16:14,590 And this term on the left is the exploitation term. 375 00:16:14,590 --> 00:16:18,030 It's basically proportional to the likelihood, 376 00:16:18,030 --> 00:16:21,050 the expected number of times that you're going to win, 377 00:16:21,050 --> 00:16:23,272 given that you are in a certain node 378 00:16:23,272 --> 00:16:24,730 and that you are a certain player. 379 00:16:27,334 --> 00:16:29,000 It's basically the quality of your state 380 00:16:29,000 --> 00:16:30,310 on some abstract level. 381 00:16:30,310 --> 00:16:32,260 If we knew this perfectly, then we 382 00:16:32,260 --> 00:16:33,760 would be doing great because that's 383 00:16:33,760 --> 00:16:37,780 the thing we're looking for on some grand level, the expected 384 00:16:37,780 --> 00:16:39,910 likelihood of winning from a certain state. 385 00:16:39,910 --> 00:16:42,192 On the other hand, you have this exploration term. 386 00:16:42,192 --> 00:16:44,150 And you may not be able to read the font there. 387 00:16:44,150 --> 00:16:45,700 But what this is basically saying 388 00:16:45,700 --> 00:16:49,150 is that it looks at the number of games 389 00:16:49,150 --> 00:16:54,580 that I have been played through, and at the number of games 390 00:16:54,580 --> 00:16:56,470 that my parent has been played through. 391 00:16:56,470 --> 00:17:00,460 And it tries to preserve those numbers at a certain ratio, 392 00:17:00,460 --> 00:17:01,910 at a log ratio. 
393 00:17:01,910 --> 00:17:06,849 And what that effectively means, is that the number of times 394 00:17:06,849 --> 00:17:08,200 that I have been-- 395 00:17:08,200 --> 00:17:10,490 if I have been visited relatively few times, 396 00:17:10,490 --> 00:17:14,180 so the denominator is small, 397 00:17:14,180 --> 00:17:16,740 whereas my parent has been visited many times, which 398 00:17:16,740 --> 00:17:19,040 means that my siblings have gotten much more attention, 399 00:17:19,040 --> 00:17:23,140 then the likelihood that I will be visited again actually 400 00:17:23,140 --> 00:17:24,380 increases. 401 00:17:24,380 --> 00:17:27,480 So this is biased on the one hand, 402 00:17:27,480 --> 00:17:29,450 towards nodes that are really promising, 403 00:17:29,450 --> 00:17:32,200 and on the other hand, towards nodes 404 00:17:32,200 --> 00:17:34,663 that haven't been explored yet, where there's a gold mine 405 00:17:34,663 --> 00:17:36,996 and all you need to do is dig a little bit, potentially. 406 00:17:39,650 --> 00:17:42,300 We don't actually have an analytical expression for this. 407 00:17:42,300 --> 00:17:45,140 But we can approximate it because you 408 00:17:45,140 --> 00:17:48,150 can think that the expected value from a certain node 409 00:17:48,150 --> 00:17:51,860 is, roughly speaking, approximately the ratio of wins 410 00:17:51,860 --> 00:17:54,080 at that node to the number of times 411 00:17:54,080 --> 00:17:55,898 that that node has been visited at all. 412 00:17:59,560 --> 00:18:01,820 Let's talk about actually applying this statement. 413 00:18:01,820 --> 00:18:04,153 Because what the statement is going to give you, is it's 414 00:18:04,153 --> 00:18:06,790 going to give you some number for here and some number 415 00:18:06,790 --> 00:18:09,140 here, and some number for here, and so on. 416 00:18:09,140 --> 00:18:10,890 When we start descending through the tree, 417 00:18:10,890 --> 00:18:12,830 we're going to start at the top node. 
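The slide with the formula is not reproduced in this transcript. As a hedged reconstruction, the two-part score described above matches the widely used UCB1 form below, where $w_k$ (wins recorded at node $k$), $n_k$ (games played through node $k$), $n_{p(k)}$ (games played through its parent), and the tunable constant $c$ are notation assumed here rather than taken from the slide:

```latex
\mathrm{UCB}(k) \;=\;
\underbrace{\frac{w_k}{n_k}}_{\text{approximate win rate}}
\;+\;
\underbrace{c\,\sqrt{\frac{\ln n_{p(k)}}{n_k}}}_{\text{exploration term}}
```

The first term is the win-ratio approximation mentioned above. The second grows when the parent has been visited many times but node $k$ only a few, which is the "log ratio" behavior the speaker describes.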
418 00:18:12,830 --> 00:18:15,520 And then we're going to look at the three 419 00:18:15,520 --> 00:18:17,500 children of that node. 420 00:18:17,500 --> 00:18:19,290 And we're going to compute this UCB 421 00:18:19,290 --> 00:18:21,560 value for each of these children and pick 422 00:18:21,560 --> 00:18:23,780 whichever one is the highest. 423 00:18:23,780 --> 00:18:27,650 So just as a thought for a moment, 424 00:18:27,650 --> 00:18:28,850 what if we ignore this one? 425 00:18:28,850 --> 00:18:31,600 And what if we're just computing the UCB of these two? 426 00:18:31,600 --> 00:18:35,890 Does anyone have any intuition on whether the UCB would 427 00:18:35,890 --> 00:18:39,088 be higher for this node or for this node? 428 00:18:39,088 --> 00:18:40,430 AUDIENCE: The left node. 429 00:18:40,430 --> 00:18:42,170 PROFESSOR 3: The left node? 430 00:18:42,170 --> 00:18:43,040 OK. 431 00:18:43,040 --> 00:18:44,270 So why is that? 432 00:18:44,270 --> 00:18:46,460 AUDIENCE: It has a win [INAUDIBLE] 433 00:18:46,460 --> 00:18:47,210 PROFESSOR 3: Yeah. 434 00:18:47,210 --> 00:18:47,967 It has a win. 435 00:18:47,967 --> 00:18:49,800 AUDIENCE: And they both have a [INAUDIBLE]. 436 00:18:49,800 --> 00:18:50,675 PROFESSOR 3: Exactly. 437 00:18:50,675 --> 00:18:53,540 And so clearly, you think the exploration term is the same 438 00:18:53,540 --> 00:18:56,040 because you know it's not that one child has been loved less 439 00:18:56,040 --> 00:18:57,950 than the other, but the exploitation term 440 00:18:57,950 --> 00:18:59,404 is going to be different. 441 00:18:59,404 --> 00:19:01,320 And so it's definitely going to pick this one. 442 00:19:01,320 --> 00:19:02,850 In this case, what we're going to say 443 00:19:02,850 --> 00:19:05,475 is actually that this is so much more promising than the others 444 00:19:05,475 --> 00:19:07,885 that it's actually going to pick this left node. 445 00:19:07,885 --> 00:19:10,290 And so it's going to expand, and it's going to look down. 
446 00:19:10,290 --> 00:19:11,665 And then when it looks down, it's 447 00:19:11,665 --> 00:19:13,150 going to compare between these two. 448 00:19:13,150 --> 00:19:17,290 And this time, remember that this is the opponent. 449 00:19:17,290 --> 00:19:22,590 The opponent wants to minimize the number of wins that we have. 450 00:19:22,590 --> 00:19:24,250 Which means that our opponent is going 451 00:19:24,250 --> 00:19:29,980 to want to pick the one that we're less likely to win in 452 00:19:29,980 --> 00:19:31,710 and they're more likely to win in. 453 00:19:31,710 --> 00:19:34,570 This is the idea of minimax, minimizing how well 454 00:19:34,570 --> 00:19:36,520 my enemy does in this game. 455 00:19:40,190 --> 00:19:41,910 Although again, the exploration term 456 00:19:41,910 --> 00:19:44,935 might counterbalance it a little bit because, technically, this 457 00:19:44,935 --> 00:19:48,024 has been explored more. 458 00:19:48,024 --> 00:19:49,940 We're going to pick the one on the left again. 459 00:19:49,940 --> 00:19:51,700 And we're going to get to that location 460 00:19:51,700 --> 00:19:54,480 that we got to originally. 461 00:19:54,480 --> 00:19:57,750 Now when we're comparing between these two, 462 00:19:57,750 --> 00:19:59,896 between a node that has been visited once 463 00:19:59,896 --> 00:20:01,520 and a node that has never been visited, 464 00:20:01,520 --> 00:20:06,121 can anyone guess which one of these it is going to pick? 465 00:20:06,121 --> 00:20:06,620 Yeah. 466 00:20:06,620 --> 00:20:08,185 AUDIENCE: Never has been visited. 467 00:20:08,185 --> 00:20:09,310 PROFESSOR 3: Yeah, exactly. 468 00:20:09,310 --> 00:20:11,690 Because this number is zero. 469 00:20:11,690 --> 00:20:14,056 And so if the parent has ever been 470 00:20:14,056 --> 00:20:16,535 visited but the node hasn't, this is going to be infinite 471 00:20:16,535 --> 00:20:18,909 and it's going to have to pick the node that it has never 472 00:20:18,909 --> 00:20:20,512 seen before.
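The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not the lecture's own code: it assumes the common UCB1 form (win ratio plus a constant times the square root of ln(parent visits) over node visits), and the function name `ucb_score`, the constant `c`, and the `opponent_to_move` flag are hypothetical names chosen for the example.

```python
import math

def ucb_score(wins, visits, parent_visits, c=math.sqrt(2), opponent_to_move=False):
    """UCB-style score for one child (assumed UCB1 form, see note above)."""
    if visits == 0:
        # A never-visited child gets an infinite score, so it is always tried first.
        return math.inf
    win_rate = wins / visits
    if opponent_to_move:
        # When the opponent chooses, score the child by the complement of our win ratio.
        win_rate = 1.0 - win_rate
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return win_rate + exploration
```

With equal visit counts the exploration term is identical for both children, so the child with more wins scores higher, which matches the audience's answer about the left node.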
473 00:20:20,512 --> 00:20:22,262 So that's how we descend through the tree. 474 00:20:22,262 --> 00:20:23,886 Does anyone have any questions on that? 475 00:20:23,886 --> 00:20:25,070 Really, it's totally fine. 476 00:20:25,070 --> 00:20:27,440 We're going to be talking about this for a while. 477 00:20:27,440 --> 00:20:28,056 Yeah. 478 00:20:28,056 --> 00:20:31,287 AUDIENCE: With the left node that has the four for n sub k, 479 00:20:31,287 --> 00:20:36,344 wouldn't that be three because there's two and one below? 480 00:20:36,344 --> 00:20:37,760 PROFESSOR 3: No because of the way 481 00:20:37,760 --> 00:20:39,468 that we're going to be updating the tree. 482 00:20:39,468 --> 00:20:41,490 Next, we'll talk about some [INAUDIBLE].. 483 00:20:41,490 --> 00:20:42,698 AUDIENCE: I like the concept. 484 00:20:42,698 --> 00:20:44,742 But if it's a deterministic game, why couldn't it 485 00:20:44,742 --> 00:20:46,499 hold its [INAUDIBLE] pretty strictly? 486 00:20:46,499 --> 00:20:48,040 PROFESSOR 3: That's a great question. 487 00:20:48,040 --> 00:20:50,606 That's really up to computer memory limits. 488 00:20:50,606 --> 00:20:54,280 As I think that Leah mentioned, the number of states 489 00:20:54,280 --> 00:20:55,794 in the game of Go-- 490 00:20:55,794 --> 00:20:57,835 it's a 19 by 19 board, and you can play something 491 00:20:57,835 --> 00:20:58,500 at every state. 492 00:20:58,500 --> 00:21:00,150 It's only like 2 to the-- 493 00:21:00,150 --> 00:21:01,150 PROFESSOR 2: [INAUDIBLE] 494 00:21:01,150 --> 00:21:01,340 PROFESSOR 3: What? 495 00:21:01,340 --> 00:21:02,048 PROFESSOR 2: 250. 496 00:21:02,048 --> 00:21:04,460 PROFESSOR 3: 250. 497 00:21:04,460 --> 00:21:07,000 You could never explore the entire search tree. 498 00:21:07,000 --> 00:21:09,180 AUDIENCE: [INAUDIBLE] over the first few layers 499 00:21:09,180 --> 00:21:12,010 or are we going polite.
500 00:21:12,010 --> 00:21:14,340 We try to do this real time where you could 501 00:21:14,340 --> 00:21:15,710 have done something offline. 502 00:21:15,710 --> 00:21:17,330 PROFESSOR 3: It's definitely true. 503 00:21:17,330 --> 00:21:18,440 If you know a state that you're going 504 00:21:18,440 --> 00:21:20,814 to arrive at ahead of time, then you can totally do that. 505 00:21:20,814 --> 00:21:22,420 But in a game that's large enough 506 00:21:22,420 --> 00:21:25,660 that to do that for all the possible states 507 00:21:25,660 --> 00:21:29,050 would take that much more time and take that much more memory. 508 00:21:29,050 --> 00:21:30,970 It doesn't end up making that much sense. 509 00:21:30,970 --> 00:21:32,550 Also, something to point out here, 510 00:21:32,550 --> 00:21:34,841 is that for most of the games that we're talking about, 511 00:21:34,841 --> 00:21:38,730 simulating a run through of the game is really fast. 512 00:21:38,730 --> 00:21:40,460 So if you think about it-- 513 00:21:40,460 --> 00:21:43,170 let's actually get to that in the next piece. 514 00:21:43,170 --> 00:21:44,890 But the point is that building up 515 00:21:44,890 --> 00:21:46,885 this many levels of a tree for a computer 516 00:21:46,885 --> 00:21:50,780 takes probably on the order of less than a millisecond. 517 00:21:50,780 --> 00:21:55,410 So doing this for a really, really huge tree, 518 00:21:55,410 --> 00:21:58,504 it's peanuts because they're such simple operations. 519 00:21:58,504 --> 00:22:00,670 But it will get expensive when we start building up 520 00:22:00,670 --> 00:22:04,650 the tree to serious depths. 521 00:22:04,650 --> 00:22:08,425 AUDIENCE: But a game like Go, how many nodes would you have? 522 00:22:08,425 --> 00:22:10,300 PROFESSOR 3: On each level, in the beginning, 523 00:22:10,300 --> 00:22:12,280 we have something on the order of 400 nodes.
524 00:22:12,280 --> 00:22:14,580 And we have a depth of about, I think 525 00:22:14,580 --> 00:22:17,542 most games have up to 250 steps, or something like that. 526 00:22:17,542 --> 00:22:19,750 AUDIENCE: So just to build, if you go in there blank, 527 00:22:19,750 --> 00:22:21,958 without any nodes built, you have to in the computer, 528 00:22:21,958 --> 00:22:23,939 like you said, it hasn't visited a node, 529 00:22:23,939 --> 00:22:26,450 it has to go there before it descends further. 530 00:22:26,450 --> 00:22:27,782 Basically, like breadth first. 531 00:22:27,782 --> 00:22:30,240 PROFESSOR 3: It's sort of like breadth first but not quite. 532 00:22:30,240 --> 00:22:31,823 There's an important distinction here, 533 00:22:31,823 --> 00:22:37,387 which is that it doesn't have to build up this or this node. 534 00:22:37,387 --> 00:22:39,220 It doesn't have to build up all of the nodes 535 00:22:39,220 --> 00:22:40,430 at a certain level. 536 00:22:40,430 --> 00:22:44,970 All it has to do is, if it branches down to a certain sub 537 00:22:44,970 --> 00:22:48,050 region, then it can't descend in that sub region 538 00:22:48,050 --> 00:22:51,160 below one of its siblings without having at least looked 539 00:22:51,160 --> 00:22:52,410 once at all its siblings. 540 00:22:52,410 --> 00:22:55,190 After it looks once it can do whatever it wants. 541 00:22:55,190 --> 00:22:57,130 And the point is, that it doesn't 542 00:22:57,130 --> 00:22:59,440 mean the tree has to be kept at an even level. 543 00:22:59,440 --> 00:23:02,551 All it means is that the tree, in order 544 00:23:02,551 --> 00:23:04,300 to descend on a specific part of the tree, 545 00:23:04,300 --> 00:23:10,220 it has to have at least visited direct neighbors once before. 546 00:23:10,220 --> 00:23:12,400 Any more questions on this before-- 547 00:23:12,400 --> 00:23:12,940 Yeah.
548 00:23:12,940 --> 00:23:14,850 AUDIENCE: What's the advantage necessarily 549 00:23:14,850 --> 00:23:16,779 of having to visit every single? 550 00:23:21,821 --> 00:23:23,320 PROFESSOR 3: The advantage of having 551 00:23:23,320 --> 00:23:25,740 to visit every single-- the way that I think of it, 552 00:23:25,740 --> 00:23:28,470 is that you don't want to be missing out 553 00:23:28,470 --> 00:23:32,860 on potentially being interested in some of the things 554 00:23:32,860 --> 00:23:35,380 and not others. 555 00:23:35,380 --> 00:23:41,690 It comes back to the exploration versus expectation distinction. 556 00:23:41,690 --> 00:23:46,050 We do want to descend into the region of the tree that 557 00:23:46,050 --> 00:23:47,200 is really valuable to us. 558 00:23:47,200 --> 00:23:50,280 But at least have explored a little bit, 559 00:23:50,280 --> 00:23:51,760 at least maintaining some baseline, 560 00:23:51,760 --> 00:23:53,820 which really isn't that costly compared 561 00:23:53,820 --> 00:23:55,120 to the size of the tree. 562 00:23:55,120 --> 00:23:59,444 400 moves is not that bad compared with 400 to the 250. 563 00:23:59,444 --> 00:24:01,110 AUDIENCE: Are these simulations, they're 564 00:24:01,110 --> 00:24:02,180 just random simulations? 565 00:24:02,180 --> 00:24:03,835 PROFESSOR 3: We're going to talk about that in a minute. 566 00:24:03,835 --> 00:24:05,626 Any more questions before I move onto that? 567 00:24:08,790 --> 00:24:10,280 Next step is expanding. 568 00:24:10,280 --> 00:24:11,280 And this is very simple. 569 00:24:11,280 --> 00:24:15,619 You just create a node and you set the two initial values. 570 00:24:15,619 --> 00:24:17,160 And the initial values are the number 571 00:24:17,160 --> 00:24:18,840 of times it's been visited is zero, 572 00:24:18,840 --> 00:24:20,720 and the number of times that someone has won from there 573 00:24:20,720 --> 00:24:21,220 is zero.
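As a rough sketch of the expansion step just described, the following Python creates a child node with both counters set to zero. The `Node` class, its field names, and the `expand` helper are hypothetical and exist only for this illustration.

```python
class Node:
    """Minimal tree node holding the two counters mentioned in the lecture."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.wins = 0    # number of simulated wins recorded through this node
        self.visits = 0  # number of times this node has been visited

def expand(parent):
    """Create a fresh child under `parent` with both counters at zero."""
    child = Node(parent=parent)
    parent.children.append(child)
    return child
```

A new node carries no information until a simulation result is propagated back through it.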
574 00:24:21,220 --> 00:24:25,020 AUDIENCE: [INAUDIBLE] So the easy part is solving it. 575 00:24:25,020 --> 00:24:27,180 PROFESSOR 3: Now, simulating. 576 00:24:27,180 --> 00:24:29,320 Simulating is really hard. 577 00:24:29,320 --> 00:24:31,470 You can imagine that if you get to a single node 578 00:24:31,470 --> 00:24:33,480 and you've never seen that node before, 579 00:24:33,480 --> 00:24:36,270 and you don't know what to do from this node onward, 580 00:24:36,270 --> 00:24:39,484 that if we knew how the game was going to play out, 581 00:24:39,484 --> 00:24:41,150 that is exactly what we're searching for, 582 00:24:41,150 --> 00:24:42,360 and we would be done. 583 00:24:42,360 --> 00:24:43,320 But we don't. 584 00:24:43,320 --> 00:24:47,770 And in fact, we have no idea how to go about simulating 585 00:24:47,770 --> 00:24:49,790 a realistic game, and a game that 586 00:24:49,790 --> 00:24:51,990 will tell us something meaningful about the quality 587 00:24:51,990 --> 00:24:53,410 of a certain state. 588 00:24:53,410 --> 00:24:56,180 And so, as you correctly guessed, 589 00:24:56,180 --> 00:24:58,560 we're going to do it randomly. 590 00:24:58,560 --> 00:25:00,380 We're going to be at a certain state. 591 00:25:00,380 --> 00:25:01,960 And then from that state, we're just 592 00:25:01,960 --> 00:25:04,530 going to pick random nodes for each of the players 593 00:25:04,530 --> 00:25:07,280 until the game ends. 594 00:25:07,280 --> 00:25:11,990 And if we, as player one, win then we're going to add one. 595 00:25:11,990 --> 00:25:13,980 Then we're going to say delta equals plus one. 596 00:25:13,980 --> 00:25:18,140 And if we don't win, or if we tie or lose, 597 00:25:18,140 --> 00:25:20,427 then we're going to call it a zero. 598 00:25:20,427 --> 00:25:22,760 You can see in this graph, we're descending randomly and not 599 00:25:22,760 --> 00:25:23,510 thinking about it.
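To make the random simulation concrete, here is a small Python sketch of a random playout on a tic-tac-toe board. The board encoding (a list of nine cells holding "X", "O", or None) and the names `winner` and `random_playout` are assumptions made for this example, not something taken from the lecture.

```python
import random

# The eight winning lines on a 3x3 board, indexed 0..8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def random_playout(board, to_move, me, rng=random):
    """Play uniformly random moves to the end; return delta = 1 if `me` wins, else 0."""
    board = list(board)  # copy so the caller's board is left untouched
    while winner(board) is None and None in board:
        empty = [i for i, cell in enumerate(board) if cell is None]
        board[rng.choice(empty)] = to_move
        to_move = "O" if to_move == "X" else "X"
    # Ties and losses both count as zero, as described above.
    return 1 if winner(board) == me else 0
```

Each playout only fills in empty squares and checks eight lines per move, which is why a single simulation is so cheap.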
600 00:25:23,510 --> 00:25:25,370 And it turns out that this is actually great 601 00:25:25,370 --> 00:25:28,570 because it's really, really computationally efficient. 602 00:25:28,570 --> 00:25:31,860 If you have a board, even if it has 400 open squares, 603 00:25:31,860 --> 00:25:33,810 populating it by a bunch of random moves 604 00:25:33,810 --> 00:25:35,860 doesn't take you very long, on the order 605 00:25:35,860 --> 00:25:38,276 of not that many machine cycles. 606 00:25:38,276 --> 00:25:40,390 AUDIENCE: Is that why you don't store-- 607 00:25:40,390 --> 00:25:44,332 if you go down a tree randomly, you already have a simulation. 608 00:25:44,332 --> 00:25:46,560 So the node's going to get to someplace. 609 00:25:46,560 --> 00:25:49,060 But you don't store it because it would lose the randomness? 610 00:25:49,060 --> 00:25:51,920 PROFESSOR 3: You're totally right, actually, in this case. 611 00:25:51,920 --> 00:25:54,420 I've thought through this, and I can't come up with a reason 612 00:25:54,420 --> 00:25:55,780 why you wouldn't store it, those 613 00:25:55,780 --> 00:25:58,363 temporary values that you find all the way down the tree. 614 00:25:58,363 --> 00:26:01,610 But they don't in most of the literature [INAUDIBLE] 615 00:26:01,610 --> 00:26:03,574 But you're totally right about that. 616 00:26:03,574 --> 00:26:06,270 Does everyone understand that distinction? 617 00:26:06,270 --> 00:26:08,460 The fact that we only hold onto the result 618 00:26:08,460 --> 00:26:10,110 here and don't theoretically make 619 00:26:10,110 --> 00:26:13,320 nodes for every place down in the tree just because we could, 620 00:26:13,320 --> 00:26:15,000 just because we've seen them before. 621 00:26:15,000 --> 00:26:17,166 We don't, and it doesn't really matter in this case. 622 00:26:17,166 --> 00:26:19,762 But it's theoretically a slight speed up that you could do. 623 00:26:19,762 --> 00:26:22,420 AUDIENCE: But you reduce that question to generalities?
624 00:26:22,420 --> 00:26:25,950 PROFESSOR 3: Yeah, a little bit. 625 00:26:25,950 --> 00:26:29,940 So we can look at an example of simulating out a random game. 626 00:26:29,940 --> 00:26:32,610 We get some intuition for why a random game would 627 00:26:32,610 --> 00:26:35,760 be correlated with how good your board position is. 628 00:26:35,760 --> 00:26:38,470 For example, here we have a tic-tac-toe game. 629 00:26:38,470 --> 00:26:40,210 Circle is going to move next. 630 00:26:40,210 --> 00:26:42,540 But as hopefully you can see, because you have played 631 00:26:42,540 --> 00:26:46,120 tic-tac-toe before, this is not a particularly promising board 632 00:26:46,120 --> 00:26:47,990 for circle. 633 00:26:47,990 --> 00:26:51,500 Because no matter what circle does, 634 00:26:51,500 --> 00:26:54,802 if x is an intelligent player x can win right now. 635 00:26:54,802 --> 00:26:56,510 It has two different options for winning. 636 00:26:56,510 --> 00:26:59,333 And so, if you simulated this forward randomly, what you'll 637 00:26:59,333 --> 00:27:01,856 get is that 2/3 of the time, x will in fact win, 638 00:27:01,856 --> 00:27:03,230 even if the players aren't really 639 00:27:03,230 --> 00:27:04,410 thinking of it ahead of time. 640 00:27:04,410 --> 00:27:04,909 Yeah. 641 00:27:04,909 --> 00:27:07,170 AUDIENCE: Then why not do n simulations 642 00:27:07,170 --> 00:27:09,860 at a node instead of just a single simulation? 643 00:27:09,860 --> 00:27:10,510 PROFESSOR 3: You totally can do that. 644 00:27:10,510 --> 00:27:12,470 That's in fact, something that makes sense to do 645 00:27:12,470 --> 00:27:13,740 and that some people do.
646 00:27:13,740 --> 00:27:16,110 Although what you'll find somewhat soon, 647 00:27:16,110 --> 00:27:18,780 is that considering that we're going down the tree, 648 00:27:18,780 --> 00:27:20,520 and that sometime soon we're going 649 00:27:20,520 --> 00:27:22,170 to explore all of its children, there's 650 00:27:22,170 --> 00:27:24,930 a good question of why you'd do n simulations now 651 00:27:24,930 --> 00:27:28,080 when you could just descend through the tree n times 652 00:27:28,080 --> 00:27:31,030 and thereby do n simulations by going through the thing 653 00:27:31,030 --> 00:27:34,210 and also building out the children? 654 00:27:34,210 --> 00:27:35,650 This case is-- yeah. 655 00:27:35,650 --> 00:27:37,360 AUDIENCE: This gives more importance 656 00:27:37,360 --> 00:27:38,440 to why you do randomness. 657 00:27:38,440 --> 00:27:40,605 Because if you're doing random simulations 658 00:27:40,605 --> 00:27:42,696 you would ignore the possibility of the best one. 659 00:27:42,696 --> 00:27:45,255 When you first ran a simulation here was that o wins. 660 00:27:45,255 --> 00:27:47,090 If I ignore this node-- 661 00:27:47,090 --> 00:27:48,230 PROFESSOR 3: Absolutely. 662 00:27:48,230 --> 00:27:52,530 Which is why it matters that we do this so many times that we 663 00:27:52,530 --> 00:27:55,515 drown out all the noise that is associated with playing 664 00:27:55,515 --> 00:27:57,010 a game out randomly. 665 00:27:57,010 --> 00:27:58,930 Let's talk about that.
666 00:27:58,930 --> 00:28:02,010 If there's a lot of distance between where we are right now 667 00:28:02,010 --> 00:28:03,600 and our end result-- 668 00:28:03,600 --> 00:28:05,100 For example, in this game, if I were 669 00:28:05,100 --> 00:28:08,522 to tell you how good is this board position, if you are one 670 00:28:08,522 --> 00:28:10,730 of those people who played out every game of tic-tac-toe, 671 00:28:10,730 --> 00:28:12,900 you'll know that this is great if you want it to be 672 00:28:12,900 --> 00:28:15,660 [INAUDIBLE] 673 00:28:15,660 --> 00:28:17,820 Anyway, the point is, that it is not 674 00:28:17,820 --> 00:28:20,550 easy to do if you are doing random simulations from where 675 00:28:20,550 --> 00:28:21,730 you start. 676 00:28:21,730 --> 00:28:24,500 The correlation between your friend's board state 677 00:28:24,500 --> 00:28:27,989 and the quality of that state actually drops precipitously. 678 00:28:27,989 --> 00:28:29,780 And this for me is one of the hardest parts 679 00:28:29,780 --> 00:28:31,940 to study about Monte Carlo Tree Search. 680 00:28:31,940 --> 00:28:33,890 Although, as Nick will explain to you, 681 00:28:33,890 --> 00:28:36,270 it actually works quite well. 682 00:28:36,270 --> 00:28:38,840 And one of the reasons that it works quite well in practice 683 00:28:38,840 --> 00:28:40,215 for more complicated applications 684 00:28:40,215 --> 00:28:42,320 is they do away with the assumption 685 00:28:42,320 --> 00:28:43,480 of random simulation. 686 00:28:43,480 --> 00:28:45,032 Because even though random simulation 687 00:28:45,032 --> 00:28:47,240 does allow you to explore all the states, if you have 688 00:28:47,240 --> 00:28:50,600 some idea of where a reasonable quality approach would be, 689 00:28:50,600 --> 00:28:54,510 then using that, as long as it's not that much more expensive 690 00:28:54,510 --> 00:28:56,770 computationally, can help you with your simulation.
691 00:28:56,770 --> 00:28:59,140 Right now we're still talking about total randomness. 692 00:28:59,140 --> 00:29:00,640 How are people doing with that idea? 693 00:29:04,205 --> 00:29:06,330 Now we're going to update the tree with the results 694 00:29:06,330 --> 00:29:07,320 of our simulation. 695 00:29:07,320 --> 00:29:10,330 So given that we had some result delta, 696 00:29:10,330 --> 00:29:12,140 we're going to go up through the parents. 697 00:29:12,140 --> 00:29:13,960 And for each parent we're going to add 698 00:29:13,960 --> 00:29:15,780 that the game has been played there once, 699 00:29:15,780 --> 00:29:20,790 and that the result of that simulation 700 00:29:20,790 --> 00:29:24,460 gets added if it was a one. 701 00:29:24,460 --> 00:29:27,300 So for example, if there was a win in this game, 702 00:29:27,300 --> 00:29:30,520 then this becomes one, one because now it's won once 703 00:29:30,520 --> 00:29:32,190 and it's been visited once. 704 00:29:32,190 --> 00:29:34,630 And these two get incremented by one, 705 00:29:34,630 --> 00:29:37,280 and these two get incremented by one. 706 00:29:37,280 --> 00:29:41,060 That in itself comprises a complete iteration, 707 00:29:41,060 --> 00:29:44,610 the complete single iteration of running Monte Carlo Tree 708 00:29:44,610 --> 00:29:49,950 Search, which means that now we can keep doing this 709 00:29:49,950 --> 00:29:52,620 over and over again, building up the tree 710 00:29:52,620 --> 00:29:55,350 and slowly making it deeper, and making it deeper 711 00:29:55,350 --> 00:29:56,740 in selective areas. 712 00:29:56,740 --> 00:29:59,450 And having these numbers increase and increase.
713 00:29:59,450 --> 00:30:01,080 And be more and more proportional 714 00:30:01,080 --> 00:30:05,430 to the actual expected value of the quality of the state, 715 00:30:05,430 --> 00:30:06,080 until-- 716 00:30:06,080 --> 00:30:08,226 does anyone have any questions about this idea?-- 717 00:30:11,740 --> 00:30:12,550 until we terminate. 718 00:30:12,550 --> 00:30:15,040 And we have to come up with a way to terminate it. 719 00:30:15,040 --> 00:30:18,670 Now again, we said we're going to pick what the best child is 720 00:30:18,670 --> 00:30:21,850 going to be, what the best immediate move from the start 721 00:30:21,850 --> 00:30:24,335 state is going to be. 722 00:30:24,335 --> 00:30:26,650 That's the move that we're actually going to play. 723 00:30:26,650 --> 00:30:29,010 And so, how do we determine what the best is? 724 00:30:29,010 --> 00:30:33,240 Well, the trivial solution is just the highest 725 00:30:33,240 --> 00:30:36,790 expected win given k. 726 00:30:36,790 --> 00:30:38,550 What that, in our case, is going to be 727 00:30:38,550 --> 00:30:41,190 is the ratio of number of times that I've 728 00:30:41,190 --> 00:30:44,250 won from a given early state to the number of times 729 00:30:44,250 --> 00:30:45,880 that I visited. 730 00:30:45,880 --> 00:30:48,745 However, this doesn't actually work as well as we might hope. 731 00:30:48,745 --> 00:30:50,530 Let's suppose the following scenario, 732 00:30:50,530 --> 00:30:54,020 which is that you have the tic-tac-toe game like this. 733 00:30:54,020 --> 00:30:57,220 And you have been exploring the tree for a while. 734 00:30:57,220 --> 00:31:00,520 And you're really mostly looking at these two nodes. 735 00:31:00,520 --> 00:31:04,390 One of these nodes, if you think it through, 736 00:31:04,390 --> 00:31:06,070 this node is quite promising and you've 737 00:31:06,070 --> 00:31:07,339 been exploring it for a while. 738 00:31:07,339 --> 00:31:09,130 There is a winning strategy from this node.
739 00:31:09,130 --> 00:31:11,260 It's that circle goes here, and then x goes here, 740 00:31:11,260 --> 00:31:13,964 and then circle loses because x has two options to win. 741 00:31:16,694 --> 00:31:18,610 However, if you explore this a bunch of times, 742 00:31:18,610 --> 00:31:20,480 and for some reason, due to the randomness, 743 00:31:20,480 --> 00:31:21,970 this is at 11 out of 20. 744 00:31:21,970 --> 00:31:25,687 Whereas this state, which is inherently inferior, 745 00:31:25,687 --> 00:31:28,020 is at three out of five because of a bunch of randomness 746 00:31:28,020 --> 00:31:30,380 and because it hasn't been explored as much. 747 00:31:30,380 --> 00:31:32,390 And if we had looked at this one as exhaustively 748 00:31:32,390 --> 00:31:35,631 as we had at this one, you probably 749 00:31:35,631 --> 00:31:37,880 would actually say that this state is actually better. 750 00:31:37,880 --> 00:31:40,900 And so, you can create an alternative criterion, 751 00:31:40,900 --> 00:31:43,920 which is that it's the highest expected win 752 00:31:43,920 --> 00:31:46,060 value of one of the children. 753 00:31:46,060 --> 00:31:49,602 But also, that value has to be the node that 754 00:31:49,602 --> 00:31:51,310 has been most visited so that they aren't 755 00:31:51,310 --> 00:31:54,590 explored by different amounts. 756 00:31:54,590 --> 00:31:56,080 What this sacrifices, however, is 757 00:31:56,080 --> 00:32:01,050 that this means that we can't terminate on demand. 758 00:32:01,050 --> 00:32:02,870 This is not always going to be true, 759 00:32:02,870 --> 00:32:05,161 and therefore, we're going to have to let the algorithm 760 00:32:05,161 --> 00:32:07,362 run until that's true for some start state, which 761 00:32:07,362 --> 00:32:09,320 means that maybe it's not a criterion that we want 762 00:32:09,320 --> 00:32:11,886 to apply even though we know that it would be wise to do so.
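The two ways of choosing the final move described above can be compared with a short Python sketch. It assumes the children of the start state are given as a dictionary mapping a move label to a (wins, visits) pair; the labels and the function names `best_by_win_rate` and `best_if_also_most_visited` are hypothetical.

```python
def best_by_win_rate(children):
    """Pick the move with the highest wins/visits ratio (the simple criterion)."""
    visited = {move: wv for move, wv in children.items() if wv[1] > 0}
    return max(visited, key=lambda move: visited[move][0] / visited[move][1])

def best_if_also_most_visited(children):
    """Return the best move only if it is also the most visited one.

    Returns None when the two criteria disagree, meaning the search
    would need to keep running before it can terminate.
    """
    by_rate = best_by_win_rate(children)
    by_visits = max(children, key=lambda move: children[move][1])
    return by_rate if by_rate == by_visits else None
```

On the numbers from the lecture, 11 out of 20 against 3 out of 5, the simple criterion picks the less explored move, while the stricter one reports that the search is not yet ready to stop.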
763 00:32:11,886 --> 00:32:13,260 Are there any questions about how 764 00:32:13,260 --> 00:32:15,222 we pick the terminating guide? 765 00:32:19,280 --> 00:32:20,435 That was the whole thing. 766 00:32:20,435 --> 00:32:22,560 And now we're going to do it lots and lots of times 767 00:32:22,560 --> 00:32:25,780 until you guys are sick of Monte Carlo Tree Search. 768 00:32:25,780 --> 00:32:26,970 So this is our tree. 769 00:32:26,970 --> 00:32:29,826 It's more or less what we've had before. 770 00:32:29,826 --> 00:32:31,200 The first thing we're going to do 771 00:32:31,200 --> 00:32:32,616 is we're going to look at the top. 772 00:32:32,616 --> 00:32:35,250 And then we're going to pick one of these children. 773 00:32:35,250 --> 00:32:37,260 Now let's say that we looked at this, 774 00:32:37,260 --> 00:32:39,210 and it turns out that the one on the left is really valuable. 775 00:32:39,210 --> 00:32:40,060 I think it's the one. 776 00:32:40,060 --> 00:32:40,583 Nope, yeah. 777 00:32:40,583 --> 00:32:41,083 Never mind. 778 00:32:41,083 --> 00:32:42,319 It's wrong. 779 00:32:42,319 --> 00:32:43,860 The one on the left has been explored 780 00:32:43,860 --> 00:32:44,818 a whole bunch of times. 781 00:32:44,818 --> 00:32:47,730 Remember, this term starts becoming larger 782 00:32:47,730 --> 00:32:49,980 than the ones that haven't been visited as much. 783 00:32:49,980 --> 00:32:53,390 And so we're going to descend from this one. 784 00:32:53,390 --> 00:32:57,175 And now we're going to descend, and we have these two options. 785 00:32:57,175 --> 00:33:00,612 Given what you know, would you expect 786 00:33:00,612 --> 00:33:02,070 that what this is going to pick is going 787 00:33:02,070 --> 00:33:04,153 to be the one on the right or the one on the left? 788 00:33:04,153 --> 00:33:05,295 AUDIENCE: [INAUDIBLE] 789 00:33:05,295 --> 00:33:06,250 PROFESSOR 3: On the right because it's never 790 00:33:06,250 --> 00:33:06,880 been visited before.
791 00:33:06,880 --> 00:33:08,380 And so, this term is going to explode. 792 00:33:08,380 --> 00:33:10,060 And so, we're going to build a node there. 793 00:33:10,060 --> 00:33:11,726 And then we're going to simulate a game. 794 00:33:11,726 --> 00:33:15,954 And the result is a win, which is bad for this player. 795 00:33:15,954 --> 00:33:18,370 That means that he probably didn't want to make that move. 796 00:33:18,370 --> 00:33:20,820 And so we're going to propagate that value up. 797 00:33:20,820 --> 00:33:24,420 And we're going to start the algorithm again. 798 00:33:24,420 --> 00:33:26,430 And it's going to compare between these three. 799 00:33:26,430 --> 00:33:31,500 And now it's going to pick the one on the left. 800 00:33:34,686 --> 00:33:36,310 Now that it picked the one on the left, 801 00:33:36,310 --> 00:33:39,420 it's going to compare between these two states. 802 00:33:39,420 --> 00:33:43,933 Which of the two is going to have a higher expansion factor? 803 00:33:43,933 --> 00:33:46,397 AUDIENCE: The left. 804 00:33:46,397 --> 00:33:47,980 AUDIENCE: Don't you invert it, though, 805 00:33:47,980 --> 00:33:49,280 because this is the opponent. 806 00:33:49,280 --> 00:33:50,480 PROFESSOR 3: Exactly. 807 00:33:50,480 --> 00:33:52,331 Because two out of three is actually better. 808 00:33:52,331 --> 00:33:54,247 Because it's one out of three for the opponent 809 00:33:54,247 --> 00:33:55,530 that's currently making the move. 810 00:33:55,530 --> 00:33:57,440 So the one on the left is going to have a higher expansion 811 00:33:57,440 --> 00:33:58,485 factor, and the one on the right is 812 00:33:58,485 --> 00:33:59,560 going to have a higher exploration factor. 813 00:33:59,560 --> 00:34:01,412 Does that make sense for people? 814 00:34:01,412 --> 00:34:05,114 It's OK if it doesn't.
815 00:34:05,114 --> 00:34:07,280 So we're actually going to pick the one on the right 816 00:34:07,280 --> 00:34:09,969 because the other one has been visited three times and has gotten a lot more 817 00:34:09,969 --> 00:34:11,994 of its mother's love than that one. 818 00:34:11,994 --> 00:34:14,129 Anyone else need a drink? 819 00:34:14,129 --> 00:34:15,462 We're going to expand that node. 820 00:34:15,462 --> 00:34:16,370 It doesn't matter. 821 00:34:16,370 --> 00:34:18,502 They are both equally likely to be expanded. 822 00:34:18,502 --> 00:34:20,918 We're going to simulate forward, and it's going to be one. 823 00:34:20,918 --> 00:34:24,639 Which means that that was probably a wise countermove. 824 00:34:24,639 --> 00:34:25,220 Yeah. 825 00:34:25,220 --> 00:34:26,969 AUDIENCE: So when it's the opponent's turn 826 00:34:26,969 --> 00:34:29,300 versus your turn, the exploration factor 827 00:34:29,300 --> 00:34:33,562 is the same but we complement the expansion factor, right? 828 00:34:33,562 --> 00:34:34,270 PROFESSOR 3: Yes. 829 00:34:34,270 --> 00:34:36,739 So the key here being that this takes 830 00:34:36,739 --> 00:34:39,162 in both the state that you're talking about 831 00:34:39,162 --> 00:34:40,870 and the player that you're talking about. 832 00:34:40,870 --> 00:34:42,239 AUDIENCE: But regardless of the player, 833 00:34:42,239 --> 00:34:44,570 the exploration factor will always be like this is. 834 00:34:44,570 --> 00:34:46,945 PROFESSOR 3: Because it's only the number of visits. 835 00:34:46,945 --> 00:34:49,716 It has nothing to do with results of exploration. 836 00:34:53,176 --> 00:34:55,584 AUDIENCE: If you win and you have the plus one, 837 00:34:55,584 --> 00:34:57,375 double plus one, and you've propagated out, 838 00:34:57,375 --> 00:35:00,485 but I'm wondering-- 839 00:35:00,485 --> 00:35:03,602 so if the opponent wins do you also propagate 840 00:35:03,602 --> 00:35:07,950 out the win increment itself?
841 00:35:07,950 --> 00:35:09,574 If the opponent's winning, wouldn't you 842 00:35:09,574 --> 00:35:11,572 want to [INAUDIBLE] node here? 843 00:35:11,572 --> 00:35:13,530 PROFESSOR 3: If the opponent wins then what you 844 00:35:13,530 --> 00:35:14,780 do is you propagate up a zero. 845 00:35:14,780 --> 00:35:20,390 Which means that wk is not incremented, but nk is. 846 00:35:23,695 --> 00:35:26,580 Have we seen a zero yet? 847 00:35:26,580 --> 00:35:28,440 There's one soon. 848 00:35:28,440 --> 00:35:31,900 But the idea is that rather than subtract or anything, 849 00:35:31,900 --> 00:35:34,010 all you do is propagate up the result of the game, 850 00:35:34,010 --> 00:35:37,820 which in this case is zero. 851 00:35:37,820 --> 00:35:39,360 Which means that all of those states 852 00:35:39,360 --> 00:35:41,820 seems to become more valuable to the blue and less valuable 853 00:35:41,820 --> 00:35:42,830 to the red. 854 00:35:42,830 --> 00:35:46,159 Because these numbers are lower than the other ones were. 855 00:35:46,159 --> 00:35:46,700 AUDIENCE: OK. 856 00:35:50,750 --> 00:35:52,250 PROFESSOR 3: So we propagate this up 857 00:35:52,250 --> 00:35:54,580 and this becomes better. 858 00:35:54,580 --> 00:35:56,930 What we've done here is we've figured out 859 00:35:56,930 --> 00:36:00,057 a theoretical countermove to blue moving here. 860 00:36:00,057 --> 00:36:02,140 That's how you should think about this whole tree. 861 00:36:02,140 --> 00:36:04,540 It's really a lot like the way the humans think 862 00:36:04,540 --> 00:36:05,440 about these things. 863 00:36:05,440 --> 00:36:07,950 If I do this, then what if they do this? 864 00:36:07,950 --> 00:36:09,140 Well, then I'll do this. 865 00:36:09,140 --> 00:36:14,142 And I see that I'm successful when I do that. 866 00:36:14,142 --> 00:36:16,580 We're going to look again at the top. 
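The update rule discussed above, where a zero result still increments the visit count but not the win count, can be sketched as follows. The `Node` class and the `backpropagate` function are hypothetical names used only for this illustration.

```python
class Node:
    """Minimal tree node with a parent link and the two counters."""

    def __init__(self, parent=None):
        self.parent = parent
        self.wins = 0    # w_k: simulated wins recorded through this node
        self.visits = 0  # n_k: number of simulations that passed through it

def backpropagate(node, delta):
    """Walk from the simulated node up to the root, updating every ancestor.

    delta is 1 for a win and 0 otherwise, so a zero result still
    increments visits but leaves wins unchanged.
    """
    while node is not None:
        node.visits += 1
        node.wins += delta
        node = node.parent
```

Nothing is ever subtracted: a loss simply lowers the win ratio along the path because only the denominator grows.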
867 00:36:16,580 --> 00:36:18,765 And we're going to pick the one on the left 868 00:36:18,765 --> 00:36:20,015 because it's really promising. 869 00:36:20,015 --> 00:36:21,687 Five out of six is a good number. 870 00:36:21,687 --> 00:36:23,270 And we're going to look at both sides. 871 00:36:23,270 --> 00:36:25,571 And which one is blue going to pick now? 872 00:36:25,571 --> 00:36:27,321 Well, it's going to pick the one that it's 873 00:36:27,321 --> 00:36:29,090 going to be more successful in, which is two out of three. 874 00:36:29,090 --> 00:36:31,246 I realize that this is actually not the kind of thing 875 00:36:31,246 --> 00:36:32,746 where I could necessarily ask people 876 00:36:32,746 --> 00:36:37,680 because I'm the one who's decided which node to stop. 877 00:36:37,680 --> 00:36:39,360 Then we go down here. 878 00:36:39,360 --> 00:36:41,430 And there's an equal likelihood of picking 879 00:36:41,430 --> 00:36:42,347 either of those nodes. 880 00:36:42,347 --> 00:36:44,054 And so we're going to pick one at random. 881 00:36:44,054 --> 00:36:45,530 So that's going to be the left one. 882 00:36:45,530 --> 00:36:47,227 And we're going to create an empty node. 883 00:36:47,227 --> 00:36:48,560 Then we're going to play it out. 884 00:36:48,560 --> 00:36:50,660 And it was a success for blue, which 885 00:36:50,660 --> 00:36:54,080 is amazing because what this means now is that suddenly, 886 00:36:54,080 --> 00:36:57,180 in this tree of this really good move that red could make 887 00:36:57,180 --> 00:36:59,090 that blue wasn't finding a response to, suddenly 888 00:36:59,090 --> 00:37:02,280 there's hope because we're going to propagate this back. 889 00:37:02,280 --> 00:37:03,830 And that means that blue actually 890 00:37:03,830 --> 00:37:06,772 has a response move to that sequence of red's moves. 891 00:37:06,772 --> 00:37:08,380 And so it's going to propagate up.
892 00:37:08,380 --> 00:37:10,915 And this state's going to be more promising to blue and less 893 00:37:10,915 --> 00:37:12,350 promising for red. 894 00:37:12,350 --> 00:37:14,230 That region of the tree that we had dug into 895 00:37:14,230 --> 00:37:17,164 is a little less promising. 896 00:37:17,164 --> 00:37:18,330 We're going to look back up. 897 00:37:18,330 --> 00:37:19,788 And this time, instead, we're going 898 00:37:19,788 --> 00:37:22,880 to evaluate the thing that is both promising 899 00:37:22,880 --> 00:37:25,910 from the exploitation factor, and also 900 00:37:25,910 --> 00:37:27,800 promising because we haven't looked 901 00:37:27,800 --> 00:37:29,930 at it very much [INAUDIBLE] exploration factor. 902 00:37:29,930 --> 00:37:31,513 We're going to pick between these two. 903 00:37:31,513 --> 00:37:33,589 Which one is going to be picked here? 904 00:37:33,589 --> 00:37:37,392 AUDIENCE: [INAUDIBLE] 905 00:37:37,392 --> 00:37:39,683 PROFESSOR 3: Because the exploration factor is the same 906 00:37:39,683 --> 00:37:44,080 but the exploitation factor is higher for the one on the left. 907 00:37:44,080 --> 00:37:45,700 And it's going to show us a node. 908 00:37:45,700 --> 00:37:48,649 And the result is going to be a win for red, which 909 00:37:48,649 --> 00:37:51,190 means that red has found a good countermove to the thing that 910 00:37:51,190 --> 00:37:52,760 was previously promising for blue. 911 00:37:52,760 --> 00:37:53,926 And we propagate it back up. 912 00:37:53,926 --> 00:37:57,610 And finally, we're going to pick the one furthest on the right. 913 00:37:57,610 --> 00:37:59,360 Because even though it's terrible for red, 914 00:37:59,360 --> 00:38:01,443 and even though it's never won when it's tried it, 915 00:38:01,443 --> 00:38:04,285 it has to obey this idea of exploration 916 00:38:04,285 --> 00:38:06,910 to find out whether maybe there isn't something possible there.
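The trade-off described here, between a child's win ratio and how rarely it has been visited, can be sketched with the standard UCB1 score. This is a minimal illustration, not the lecture's own code: the function names and the constant c = sqrt(2) are assumptions chosen for the example.

```python
import math

def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    """Exploitation term (win ratio) plus an exploration bonus for one child."""
    if visits == 0:
        return math.inf  # a child that has never been visited is tried first
    exploitation = wins / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

def select_child(children, parent_visits):
    """children is a list of (wins, visits) pairs; return the index with the best score."""
    scores = [ucb1(w, n, parent_visits) for w, n in children]
    return max(range(len(children)), key=scores.__getitem__)
```

With equal visit counts the exploration bonus is identical, so the child with the higher win ratio is chosen. A child that has never won but has been visited only once can still outscore a five-out-of-six child, which is the behavior of the far-right node in the walkthrough.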
917 00:38:06,910 --> 00:38:09,110 So it explores, and it goes down, 918 00:38:09,110 --> 00:38:10,870 and it has to pick the one on the right. 919 00:38:10,870 --> 00:38:12,090 And so it does. 920 00:38:12,090 --> 00:38:13,420 And it plays this game out. 921 00:38:13,420 --> 00:38:16,180 And it's a loss, again. 922 00:38:16,180 --> 00:38:19,212 Which goes to show you that blue 923 00:38:19,212 --> 00:38:20,670 has found yet another superior move 924 00:38:20,670 --> 00:38:22,570 to this really bad move of red, where 925 00:38:22,570 --> 00:38:24,886 probably this move of red, if this is a game of chess, 926 00:38:24,886 --> 00:38:26,260 is like putting my queen directly 927 00:38:26,260 --> 00:38:27,926 in front of the opponent's row of pawns, 928 00:38:27,926 --> 00:38:29,010 and I just leave it there. 929 00:38:29,010 --> 00:38:31,175 There's nothing good that's ever going to come of it 930 00:38:31,175 --> 00:38:33,070 but we have to explore it just to find out 931 00:38:33,070 --> 00:38:36,290 whether there isn't some magical way that I should protect. 932 00:38:36,290 --> 00:38:39,250 And as you can see, we've built up this tree 933 00:38:39,250 --> 00:38:40,460 over and over and over again. 934 00:38:40,460 --> 00:38:41,970 And it's starting to look asymmetric. 935 00:38:41,970 --> 00:38:43,845 And we're starting to see that there's really 936 00:38:43,845 --> 00:38:47,170 this disparity between exploring the regions that are promising in 937 00:38:47,170 --> 00:38:49,420 this tree and exploring the regions that are not 938 00:38:49,420 --> 00:38:52,500 and that don't really matter to us very much. 939 00:38:52,500 --> 00:38:55,940 And this is exactly what we wanted from Monte Carlo tree search. 940 00:38:55,940 --> 00:38:58,475 That was why we started the whole endeavor 941 00:38:58,475 --> 00:39:00,030 in the first place. 942 00:39:00,030 --> 00:39:02,530 The next thing I'm going to talk about is the pros and cons.
943 00:39:02,530 --> 00:39:03,905 But before I do that, does anyone 944 00:39:03,905 --> 00:39:06,771 have any more questions about the algorithm? 945 00:39:06,771 --> 00:39:07,270 Yeah. 946 00:39:07,270 --> 00:39:09,966 AUDIENCE: It's still not clear how we're getting nodes 947 00:39:09,966 --> 00:39:11,322 with different denominators-- 948 00:39:11,322 --> 00:39:13,500 [INAUDIBLE] 949 00:39:13,500 --> 00:39:16,011 PROFESSOR 3: The reason for that is because of the way 950 00:39:16,011 --> 00:39:17,260 that we're simulating through. 951 00:39:17,260 --> 00:39:19,970 We're actually not holding onto to the results 952 00:39:19,970 --> 00:39:23,130 of the simulation as we're going farther down the tree 953 00:39:23,130 --> 00:39:25,200 than the lowest node we expand. 954 00:39:25,200 --> 00:39:27,540 For example, when you simulate from here, 955 00:39:27,540 --> 00:39:31,030 you're going to propagate that value here and here, and so on. 956 00:39:31,030 --> 00:39:32,800 But then when we expand below, even 957 00:39:32,800 --> 00:39:35,110 if in the course of this guy's simulation 958 00:39:35,110 --> 00:39:36,300 it happened to go through one of the states 959 00:39:36,300 --> 00:39:38,160 that we expanded below, it will not 960 00:39:38,160 --> 00:39:40,140 have incremented the values of that state 961 00:39:40,140 --> 00:39:42,836 because we weren't keeping track of it. 962 00:39:42,836 --> 00:39:44,210 Theoretically, if we were to keep 963 00:39:44,210 --> 00:39:47,050 track of all of the simulations that we have in fact run, 964 00:39:47,050 --> 00:39:51,480 the numbers beneath these things would be higher. 
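The bookkeeping described in this answer can be sketched as follows. This is an illustrative simplification that stores wins from a single player's perspective, as the walkthrough appears to do; the Node class and its field names are assumptions, not the presenters' code.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.wins = 0    # w: playouts through this node that ended in a win
        self.visits = 0  # n: playouts through this node, win or lose
        self.children = []

    def add_child(self):
        child = Node(parent=self)
        self.children.append(child)
        return child

def backpropagate(node, result):
    """Walk from the newly expanded node up to the root; result is 1 (win) or 0 (loss)."""
    while node is not None:
        node.visits += 1      # n is always incremented
        node.wins += result   # w is incremented only when the result is 1
        node = node.parent

# One playout when node `a` is created, then one playout for each child added later.
root = Node()
a = root.add_child()
backpropagate(a, 1)
b = a.add_child()
backpropagate(b, 0)
c = a.add_child()
backpropagate(c, 1)
```

Only nodes already in the tree are updated, so states visited inside a random playout never receive counts. That is why `a.visits` ends up one more than the sum of its children's visits: the extra playout is the one run when `a` itself was created.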
965 00:39:51,480 --> 00:39:54,114 AUDIENCE: If you've already run a simulation from that-- 966 00:39:54,114 --> 00:39:55,530 if you've already run a simulation 967 00:39:55,530 --> 00:39:58,080 from that red node when you first built it, 968 00:39:58,080 --> 00:40:02,480 and then when you created those two ones, each of those have 969 00:40:02,480 --> 00:40:03,156 [INAUDIBLE] 970 00:40:03,156 --> 00:40:03,822 PROFESSOR 3: OK. 971 00:40:03,822 --> 00:40:04,520 I see. 972 00:40:04,520 --> 00:40:06,228 AUDIENCE: So would the denominator always 973 00:40:06,228 --> 00:40:08,100 be one more than the sum of the children? 974 00:40:08,100 --> 00:40:10,960 PROFESSOR 3: Yeah, in [INAUDIBLE] Yeah. 975 00:40:13,581 --> 00:40:15,330 AUDIENCE: I understand how you built that. 976 00:40:18,304 --> 00:40:20,900 Is there a rule of thumb, like it's time to choose a move? 977 00:40:20,900 --> 00:40:23,150 And it seems like you have very low numbers here 978 00:40:23,150 --> 00:40:25,080 to make a [INAUDIBLE] 979 00:40:25,080 --> 00:40:27,000 Is there a rule of thumb on giving games 980 00:40:27,000 --> 00:40:29,550 like it's 2 to the 4 or 2 to the 350, whatever it is. 981 00:40:29,550 --> 00:40:32,335 What kind of numbers do you need for that first row 982 00:40:32,335 --> 00:40:35,582 before you [INAUDIBLE]? 983 00:40:35,582 --> 00:40:38,010 PROFESSOR 3: What we'll get to soon is that there isn't one. 984 00:40:38,010 --> 00:40:40,750 That's one of the problems with MCTS. 985 00:40:40,750 --> 00:40:44,210 But in terms of which of the moves you will choose, 986 00:40:44,210 --> 00:40:48,350 there are actually variants of MCTS that suggest that you more 987 00:40:48,350 --> 00:40:51,480 selectively age or insert new children based 988 00:40:51,480 --> 00:40:56,430 on something more than just the blind look right now.
989 00:40:56,430 --> 00:41:00,320 In terms of, if I'm here and it's creating my next children 990 00:41:00,320 --> 00:41:02,805 as the equivalent, then there are some intelligent guesses 991 00:41:02,805 --> 00:41:04,430 that you can make in terms of which one 992 00:41:04,430 --> 00:41:05,674 you should score first. 993 00:41:05,674 --> 00:41:07,340 Although it doesn't particularly matter. 994 00:41:07,340 --> 00:41:09,540 AUDIENCE: I'm just saying computational time 995 00:41:09,540 --> 00:41:11,400 being what it is, you might say, OK, 996 00:41:11,400 --> 00:41:13,860 if this is the timeline of this game I can expect 997 00:41:13,860 --> 00:41:16,274 to do a million simulations, which will give me 998 00:41:16,274 --> 00:41:18,440 if there's 400 nodes, I'm going to have so much use. 999 00:41:18,440 --> 00:41:21,310 In other words, is that enough time 1000 00:41:21,310 --> 00:41:22,962 to say that I can play through a game? 1001 00:41:22,962 --> 00:41:24,920 I couldn't play through a game with 400 options 1002 00:41:24,920 --> 00:41:26,982 if I've gotten five out of seven [INAUDIBLE] 1003 00:41:26,982 --> 00:41:28,190 three out of four [INAUDIBLE] 1004 00:41:28,190 --> 00:41:29,190 PROFESSOR 3: Absolutely. 1005 00:41:29,190 --> 00:41:30,810 And I would say that so far as I know, 1006 00:41:30,810 --> 00:41:32,260 that's something that's basically 1007 00:41:32,260 --> 00:41:33,096 determined experimentally. 1008 00:41:33,096 --> 00:41:34,770 They don't have good bounds on it. 1009 00:41:34,770 --> 00:41:35,645 [INAUDIBLE] 1010 00:41:35,645 --> 00:41:37,020 So let's get on the first comment 1011 00:41:37,020 --> 00:41:39,070 because that is a computer element. 1012 00:41:39,070 --> 00:41:41,962 So why should you use this algorithm?
1013 00:41:41,962 --> 00:41:43,920 Even though we've seen tremendous breakthroughs 1014 00:41:43,920 --> 00:41:45,607 in this algorithm, and you're going 1015 00:41:45,607 --> 00:41:47,440 to have to ignore everything that I tell you 1016 00:41:47,440 --> 00:41:49,020 and remember that this does actually 1017 00:41:49,020 --> 00:41:51,540 work quite well in certain scenarios. 1018 00:41:51,540 --> 00:41:53,920 Should we use it or not? 1019 00:41:53,920 --> 00:41:56,225 The pros are that it actually does the thing 1020 00:41:56,225 --> 00:41:57,141 that we want it to do. 1021 00:41:57,141 --> 00:41:58,515 It grows the tree asymmetrically. 1022 00:41:58,515 --> 00:42:00,380 It means that we do not have to explore. 1023 00:42:00,380 --> 00:42:02,340 And it doesn't explode exponentially 1024 00:42:02,340 --> 00:42:06,049 with the number of moves that we're looking into the future. 1025 00:42:06,049 --> 00:42:08,590 And that it selectively grows the tree towards the areas that 1026 00:42:08,590 --> 00:42:11,050 are most promising. 1027 00:42:11,050 --> 00:42:13,010 The other huge benefit, if you'll 1028 00:42:13,010 --> 00:42:15,290 notice from what we've just talked through, 1029 00:42:15,290 --> 00:42:17,220 is that it never relies on anything 1030 00:42:17,220 --> 00:42:19,120 other than the strict rules of the game. 1031 00:42:19,120 --> 00:42:21,555 What that means is that the only way the game is 1032 00:42:21,555 --> 00:42:23,580 factored in is that the game is what tells us 1033 00:42:23,580 --> 00:42:26,120 what the next moves we can take from a given state are, 1034 00:42:26,120 --> 00:42:32,310 and whether a given state is a victory or a defeat. 1035 00:42:32,310 --> 00:42:35,070 And that's kind of amazing because we 1036 00:42:35,070 --> 00:42:37,650 had no external heuristic information about this game.
1037 00:42:37,650 --> 00:42:39,850 Which means that if I took a completely new game 1038 00:42:39,850 --> 00:42:42,720 that someone had just invented, and I plugged MCTS into it, 1039 00:42:42,720 --> 00:42:47,720 MCTS would be a slightly or somewhat competitive player 1040 00:42:47,720 --> 00:42:50,600 for this game, which is a powerful idea. 1041 00:42:50,600 --> 00:42:52,350 It leads to our next two pros. 1042 00:42:52,350 --> 00:42:56,220 The first of which is that it's very easy to adapt to new games 1043 00:42:56,220 --> 00:42:58,850 that it hasn't seen before, or even that people 1044 00:42:58,850 --> 00:43:02,160 haven't seen before. 1045 00:43:02,160 --> 00:43:03,720 This is clearly valuable. 1046 00:43:03,720 --> 00:43:05,100 But the other nice thing about it 1047 00:43:05,100 --> 00:43:07,020 is that even though heuristics are not 1048 00:43:07,020 --> 00:43:11,810 required to make MCTS work [INAUDIBLE], 1049 00:43:11,810 --> 00:43:12,840 it can work [INAUDIBLE]. 1050 00:43:12,840 --> 00:43:14,340 There are a number of [? advanced ?] 1051 00:43:14,340 --> 00:43:16,340 places in the algorithm that you can actually 1052 00:43:16,340 --> 00:43:17,630 incorporate heuristics into. 1053 00:43:17,630 --> 00:43:20,880 Nick is going to talk about how AlphaGo uses this very heavily. 1054 00:43:20,880 --> 00:43:22,460 AlphaGo is not vanilla MCTS. 1055 00:43:22,460 --> 00:43:24,270 It has a lot of external information 1056 00:43:24,270 --> 00:43:26,430 that's built into the way that it works. 1057 00:43:26,430 --> 00:43:29,841 But MCTS is a framework-- there are heuristics you 1058 00:43:29,841 --> 00:43:31,257 can apply in the simulation, there 1059 00:43:31,257 --> 00:43:33,420 are heuristics you can apply in the UCB in the way 1060 00:43:33,420 --> 00:43:35,550 that we choose the next node. 1061 00:43:35,550 --> 00:43:37,210 There are places that it can fit in. 1062 00:43:37,210 --> 00:43:39,376 And this serves as a nice infrastructure to do so.
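To make the "rules only, heuristics optional" point concrete, here is a minimal sketch. The toy game (players alternate removing 1 or 2 stones, and whoever takes the last stone wins), every function name, and the example heuristic are hypothetical, chosen only for illustration. The playout needs just two things from the game: the legal moves from a state and whether a state is won. A heuristic can be slotted in as the playout policy without changing anything else.

```python
import random

# Hypothetical toy game. State = (stones_left, player_to_move).
def legal_moves(state):
    stones, _ = state
    return [m for m in (1, 2) if m <= stones]

def apply_move(state, move):
    stones, player = state
    return (stones - move, 1 - player)

def winner(state):
    """Return the winning player if the game is over, else None."""
    stones, player = state
    return 1 - player if stones == 0 else None  # the previous mover took the last stone

def rollout(state, policy=None, rng=random):
    """Play to the end. policy(state, moves) picks a move; the default is uniformly random."""
    while winner(state) is None:
        moves = legal_moves(state)
        move = policy(state, moves) if policy else rng.choice(moves)
        state = apply_move(state, move)
    return winner(state)

def greedy_policy(state, moves):
    # Illustrative heuristic: leave the opponent a multiple of 3 stones when possible.
    stones, _ = state
    for m in moves:
        if (stones - m) % 3 == 0:
            return m
    return moves[0]
```

Calling `rollout(state)` uses nothing beyond the rules, while `rollout(state, policy=greedy_policy)` is the same loop with domain knowledge added, at the cost of tying the playout to this particular game.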
1063 00:43:41,320 --> 00:43:45,150 The other benefit is that it's an on demand algorithm, which 1064 00:43:45,150 --> 00:43:47,660 is particularly valuable when you're under some sort of time 1065 00:43:47,660 --> 00:43:49,909 pressure, when you're competing against someone that's 1066 00:43:49,909 --> 00:43:53,100 a mathematician, or when something is about to explode 1067 00:43:53,100 --> 00:43:57,240 and you have to make a decision on which reactor to shut down. 1068 00:43:57,240 --> 00:44:00,180 And lastly-- or not lastly, actually, it's 1069 00:44:00,180 --> 00:44:02,590 complete, which is really nice because you 1070 00:44:02,590 --> 00:44:04,740 know that if you run this game for long enough 1071 00:44:04,740 --> 00:44:08,270 it's going to start looking a lot like a BFS tree. 1072 00:44:08,270 --> 00:44:09,936 No, it's actually going to start looking 1073 00:44:09,936 --> 00:44:14,820 like an alpha-beta tree, which is what it converges to. 1074 00:44:14,820 --> 00:44:16,650 It's a nice property to have. 1075 00:44:16,650 --> 00:44:18,470 Although, this property does slightly 1076 00:44:18,470 --> 00:44:20,595 get compromised if you remove the red in this idea, 1077 00:44:20,595 --> 00:44:24,690 and if you only simulate these [INAUDIBLE]. 1078 00:44:24,690 --> 00:44:25,594 Yeah. 1079 00:44:25,594 --> 00:44:27,370 PROFESSOR: You made an interesting comment 1080 00:44:27,370 --> 00:44:29,410 when you said, oh, it looks like an alpha-beta tree. 1081 00:44:29,410 --> 00:44:32,290 So it looked like a mini-max tree. 1082 00:44:32,290 --> 00:44:35,190 But have they also incorporated notions 1083 00:44:35,190 --> 00:44:37,530 of pruning in the MCTS, which would make 1084 00:44:37,530 --> 00:44:38,947 it look like an alpha-beta tree? 1085 00:44:38,947 --> 00:44:40,780 PROFESSOR 3: Sorry, you're completely right. 1086 00:44:40,780 --> 00:44:42,990 It does look like a mini-max tree.
1087 00:44:42,990 --> 00:44:45,380 I think I've seen variants where they do pruning, 1088 00:44:45,380 --> 00:44:46,963 but I haven't looked into it as much. 1089 00:44:46,963 --> 00:44:48,690 But I would imagine that they would 1090 00:44:48,690 --> 00:44:50,500 converge to whatever you know pruning 1091 00:44:50,500 --> 00:44:52,020 a certain tree [INAUDIBLE]. 1092 00:44:52,020 --> 00:44:54,780 AUDIENCE: But people have explored incorporating pruning 1093 00:44:54,780 --> 00:44:55,380 into MCTS? 1094 00:44:55,380 --> 00:44:57,170 PROFESSOR 3: I think so. 1095 00:44:57,170 --> 00:45:01,350 I can't say [INAUDIBLE] And then lastly, it's 1096 00:45:01,350 --> 00:45:02,680 really parallelizable. 1097 00:45:02,680 --> 00:45:05,610 You'll notice, none of the regions of this tree, 1098 00:45:05,610 --> 00:45:08,005 other than the original choice, ever 1099 00:45:08,005 --> 00:45:09,380 have to interact with each other. 1100 00:45:09,380 --> 00:45:12,030 So if you have 200 processors and you decide, 1101 00:45:12,030 --> 00:45:15,169 OK, I'm going to break up this tree in the first 200 decisions 1102 00:45:15,169 --> 00:45:16,710 and then have each one of those flesh 1103 00:45:16,710 --> 00:45:20,600 out one of those decisions, that actually means that they can 1104 00:45:20,600 --> 00:45:22,400 all combine information right at the end 1105 00:45:22,400 --> 00:45:24,025 and make a decision [INAUDIBLE],, which 1106 00:45:24,025 --> 00:45:29,280 is a really nice, powerful principle as you [INAUDIBLE].. 1107 00:45:29,280 --> 00:45:31,290 It does have its fair share of problems. 1108 00:45:31,290 --> 00:45:34,950 The first problem being that it does breakdown 1109 00:45:34,950 --> 00:45:38,290 under extreme tree depth. 
1110 00:45:38,290 --> 00:45:41,340 The main reason for this being that as you increase 1111 00:45:41,340 --> 00:45:45,150 more moves between you and the end of the game, 1112 00:45:45,150 --> 00:45:47,250 you're increasing the probability-- 1113 00:45:47,250 --> 00:45:49,604 you are decreasing the correlation between your game 1114 00:45:49,604 --> 00:45:51,270 state and whether a random playout would 1115 00:45:51,270 --> 00:45:54,750 suggest that you're in a good position or a bad position. 1116 00:45:54,750 --> 00:45:56,397 The same goes for branching factors. 1117 00:45:56,397 --> 00:45:58,605 One of the things that people sometimes talk about 1118 00:45:58,605 --> 00:46:03,930 is that MCTS AIs cannot play first-person shooters 1119 00:46:03,930 --> 00:46:07,590 because the distance between the number of things that you can 1120 00:46:07,590 --> 00:46:11,460 do at every given moment, and what would be a successful 1121 00:46:11,460 --> 00:46:14,200 approach in the long term after making many, many, 1122 00:46:14,200 --> 00:46:16,360 many moves that each have many branching factors, 1123 00:46:16,360 --> 00:46:20,937 is that it never begins to explore the size of the search tree. 1124 00:46:20,937 --> 00:46:22,770 For the most part, it's not really coming up 1125 00:46:22,770 --> 00:46:24,460 with a long term policy. 1126 00:46:24,460 --> 00:46:27,736 It's really thinking about what are the next sequence of moves 1127 00:46:27,736 --> 00:46:31,190 that I should [INAUDIBLE]. 1128 00:46:31,190 --> 00:46:34,000 Another problem is that it requires 1129 00:46:34,000 --> 00:46:38,530 simulation to be very easy and very repeatable. 1130 00:46:38,530 --> 00:46:42,820 So for example, if we wanted to tell our AI, 1131 00:46:42,820 --> 00:46:44,920 how do I take over Ontario? 1132 00:46:44,920 --> 00:46:46,630 There's not a particularly good way 1133 00:46:46,630 --> 00:46:49,480 that you can simulate taking over Ontario.
1134 00:46:49,480 --> 00:46:50,995 If you try it once, you're not going 1135 00:46:50,995 --> 00:46:52,810 to have an opportunity to try it again, 1136 00:46:52,810 --> 00:46:56,470 at least with the same set of configurations. 1137 00:46:56,470 --> 00:46:59,030 And actually, one of the things that we really took advantage 1138 00:46:59,030 --> 00:47:01,238 of is that random simulation happens really quickly, 1139 00:47:01,238 --> 00:47:02,865 on the order of microseconds. 1140 00:47:02,865 --> 00:47:07,494 On the other hand, the bigger your computational 1141 00:47:07,494 --> 00:47:08,910 resources that you have access to, 1142 00:47:08,910 --> 00:47:10,300 the better the algorithm works. 1143 00:47:10,300 --> 00:47:12,799 That means that I can't run it off my Mac particularly well. 1144 00:47:12,799 --> 00:47:15,670 It would be like large games. 1145 00:47:15,670 --> 00:47:18,195 It relies on this tenuous assumption of random play 1146 00:47:18,195 --> 00:47:21,257 being weakly correlated with the quality of our game state. 1147 00:47:21,257 --> 00:47:23,132 And this is one of the first assumptions that 1148 00:47:23,132 --> 00:47:25,548 is going to be thrown out the window for a lot of the more 1149 00:47:25,548 --> 00:47:27,880 advanced MCTS approaches, which are going to have 1150 00:47:27,880 --> 00:47:29,857 more intelligent playouts. 1151 00:47:29,857 --> 00:47:31,940 But those are going to lose some of the generality 1152 00:47:31,940 --> 00:47:35,380 that we had before. 1153 00:47:35,380 --> 00:47:38,719 Something that goes off of that is that MCTS is a framework. 1154 00:47:38,719 --> 00:47:41,260 But in order to actually make it effective for a lot of games 1155 00:47:41,260 --> 00:47:44,251 it does require a lot of tuning, in the sense that there 1156 00:47:44,251 --> 00:47:45,500 are a whole bunch of variants.
1157 00:47:45,500 --> 00:47:47,140 And that you need to be able to implement whatever 1158 00:47:47,140 --> 00:47:48,745 flavor is best suited for you. 1159 00:47:48,745 --> 00:47:51,290 Which means that it's not quite as nice and black boxy 1160 00:47:51,290 --> 00:47:54,890 as we would want it to be as far as give it the rules 1161 00:47:54,890 --> 00:47:58,270 and have it magically come up with a strategy [INAUDIBLE].. 1162 00:47:58,270 --> 00:48:00,160 And then lastly, as you mentioned, 1163 00:48:00,160 --> 00:48:03,280 there is not a great amount of literature right now 1164 00:48:03,280 --> 00:48:06,080 about the properties of MCTS and its convergence, 1165 00:48:06,080 --> 00:48:09,040 and what the actual proportion of time 1166 00:48:09,040 --> 00:48:11,950 to quality of your solution is. 1167 00:48:11,950 --> 00:48:15,610 This is true of all modern machine learning things, 1168 00:48:15,610 --> 00:48:18,261 is that there is certainly a lot more work that could be done. 1169 00:48:18,261 --> 00:48:19,760 But right now, that's a gap in terms 1170 00:48:19,760 --> 00:48:23,577 of using this for a simulation that's supposed to be reliable. 1171 00:48:23,577 --> 00:48:27,630 Anyone have any questions on the Pros and Cons? 1172 00:48:27,630 --> 00:48:29,940 Before we jump dive into applications, 1173 00:48:29,940 --> 00:48:32,320 let's talk through a few examples 1174 00:48:32,320 --> 00:48:34,770 of what games could be solved and could not 1175 00:48:34,770 --> 00:48:36,750 be solved by MCTS. 1176 00:48:36,750 --> 00:48:38,842 Do you guys think that checkers is a game that 1177 00:48:38,842 --> 00:48:40,610 could be solved by MCTS? 1178 00:48:40,610 --> 00:48:41,369 AUDIENCE: Yes. 1179 00:48:41,369 --> 00:48:43,160 PROFESSOR 3: It's completely deterministic. 1180 00:48:43,160 --> 00:48:43,826 It's two-player. 1181 00:48:43,826 --> 00:48:46,770 It satisfies all of the criteria that we've laid out before. 
1182 00:48:46,770 --> 00:48:48,240 Checkers is definitely a game that 1183 00:48:48,240 --> 00:48:51,240 can and has been solved by MCTS, although not solved 1184 00:48:51,240 --> 00:48:53,760 to the extent that you can defeat the thing that actually 1185 00:48:53,760 --> 00:48:57,270 has the solution [INAUDIBLE]. 1186 00:48:57,270 --> 00:48:58,860 How about "Settlers of Catan?" 1187 00:48:58,860 --> 00:49:00,234 This one's a little bit trickier. 1188 00:49:00,234 --> 00:49:02,680 Do you guys think that MCTS is likely to be able to play 1189 00:49:02,680 --> 00:49:04,650 "Settlers of Catan?" 1190 00:49:04,650 --> 00:49:07,844 If not, let's throw out reason why or why not it would be 1191 00:49:07,844 --> 00:49:09,080 [INAUDIBLE]. 1192 00:49:09,080 --> 00:49:09,580 Yeah. 1193 00:49:09,580 --> 00:49:11,800 AUDIENCE: No because there's randomness. 1194 00:49:11,800 --> 00:49:14,050 PROFESSOR 3: So yes, that is absolutely the criticism. 1195 00:49:14,050 --> 00:49:16,640 And that's why we can't apply it vanilla. 1196 00:49:16,640 --> 00:49:18,820 I put this on here as a trick question, 1197 00:49:18,820 --> 00:49:20,500 though, because it turns out that MCTS 1198 00:49:20,500 --> 00:49:22,460 is robust to randomness. 1199 00:49:22,460 --> 00:49:23,990 That you can actually play-- 1200 00:49:23,990 --> 00:49:25,614 and I realize that's just me and we do. 1201 00:49:25,614 --> 00:49:26,470 [LAUGHTER] 1202 00:49:26,470 --> 00:49:29,349 You can actually play through games. 1203 00:49:29,349 --> 00:49:30,765 If you think about the simulation, 1204 00:49:30,765 --> 00:49:32,680 the simulation is actually applicable 1205 00:49:32,680 --> 00:49:35,692 even if the game is not deterministic 1206 00:49:35,692 --> 00:49:37,650 because it does give you a sense of the quality 1207 00:49:37,650 --> 00:49:38,870 of your position. 
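The claim that random playouts remain informative in a game with chance can be illustrated with a small sketch. The dice race below, including all names and parameters, is a hypothetical stand-in rather than anything from "Settlers": each player rolls a six-sided die on their turn and advances, and the first to reach the target wins. Averaging many playouts estimates how good a position is even though no single playout is deterministic.

```python
import random

def playout(positions, to_move, target, rng):
    """Hypothetical dice race: roll a die, advance, first to reach target wins."""
    positions = list(positions)
    while True:
        positions[to_move] += rng.randint(1, 6)
        if positions[to_move] >= target:
            return to_move
        to_move = 1 - to_move

def estimate_win_rate(positions, to_move, target, player, n=1000, seed=0):
    """Fraction of n random playouts from this state that `player` wins."""
    rng = random.Random(seed)
    wins = sum(playout(positions, to_move, target, rng) == player for _ in range(n))
    return wins / n
```

In an MCTS setting, each such playout result would be propagated up the tree exactly as in the deterministic case; the randomness of the dice is simply averaged over along with the randomness of the move choices.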
1208 00:49:38,870 --> 00:49:42,830 And the MCTS-based AI to play "Settlers" 1209 00:49:42,830 --> 00:49:47,946 is, I think, at least 49% competitive with the best AI 1210 00:49:47,946 --> 00:49:50,860 to play, at least in the autonomous non-scale space. 1211 00:49:50,860 --> 00:49:53,780 So it does work. 1212 00:49:53,780 --> 00:49:57,710 Let's talk about the War Operation Plan Response, or WOPR. 1213 00:49:57,710 --> 00:50:00,831 Who here has seen the movie "War Games?" 1214 00:50:00,831 --> 00:50:01,330 OK. 1215 00:50:01,330 --> 00:50:04,030 Well, there should be more of you. 1216 00:50:04,030 --> 00:50:06,730 The idea of "War Games" is that one 1217 00:50:06,730 --> 00:50:09,640 of the core characters in this world 1218 00:50:09,640 --> 00:50:11,380 is this computer that has been put 1219 00:50:11,380 --> 00:50:15,130 in charge of the national defense strategy with respect 1220 00:50:15,130 --> 00:50:16,700 to Russia. 1221 00:50:16,700 --> 00:50:19,092 And that it needs to think through the possible future 1222 00:50:19,092 --> 00:50:21,550 scenarios and decide whether it's going to launch the nukes 1223 00:50:21,550 --> 00:50:23,170 or not. 1224 00:50:23,170 --> 00:50:27,810 Do you think that WOPR can be MCTS-based? 1225 00:50:27,810 --> 00:50:29,010 AUDIENCE: No. 1226 00:50:29,010 --> 00:50:29,900 PROFESSOR 3: No. 1227 00:50:29,900 --> 00:50:32,191 AUDIENCE: It could, it just wouldn't be very good. 1228 00:50:32,191 --> 00:50:33,190 PROFESSOR 3: Absolutely. 1229 00:50:33,190 --> 00:50:34,606 Once you fire the nukes you're not 1230 00:50:34,606 --> 00:50:36,060 going to get another chance. 1231 00:50:36,060 --> 00:50:37,600 So you can't particularly simulate 1232 00:50:37,600 --> 00:50:39,620 through what the possible scenarios are going to be like. 1233 00:50:39,620 --> 00:50:39,910 Yeah. 1234 00:50:39,910 --> 00:50:41,160 AUDIENCE: So what if you had-- 1235 00:50:41,160 --> 00:50:43,390 I agree you can't simulate it in the real world.
1236 00:50:43,390 --> 00:50:45,790 But what if you had a really good model 1237 00:50:45,790 --> 00:50:47,710 and you just simulated based on that model? 1238 00:50:51,074 --> 00:50:52,990 PROFESSOR 3: In that case, it probably depends 1239 00:50:52,990 --> 00:50:55,536 on the quality of your model. 1240 00:50:55,536 --> 00:50:59,236 If you have a good model for how World War III is going to 1241 00:50:59,236 --> 00:50:59,736 [INAUDIBLE]. 1242 00:50:59,736 --> 00:51:02,850 [LAUGHTER] 1243 00:51:02,850 --> 00:51:05,170 AUDIENCE: It is the case that the military does 1244 00:51:05,170 --> 00:51:10,200 have simulators and they do war games in simulation. 1245 00:51:10,200 --> 00:51:12,070 PROFESSOR 3: Yes, that's true. 1246 00:51:12,070 --> 00:51:14,655 They could certainly try it and run MCTS if they wanted. 1247 00:51:14,655 --> 00:51:16,560 And that's what happened in the movie. 1248 00:51:16,560 --> 00:51:18,667 [INTERPOSING VOICES] 1249 00:51:18,667 --> 00:51:20,542 AUDIENCE: And there you're putting your money 1250 00:51:20,542 --> 00:51:22,430 in the simulation not in the-- 1251 00:51:22,430 --> 00:51:24,910 AUDIENCE: It's like having an MCTS play SOCOM or something 1252 00:51:24,910 --> 00:51:25,410 like that. 1253 00:51:25,410 --> 00:51:26,160 PROFESSOR 3: Yeah. 1254 00:51:26,160 --> 00:51:29,142 It's definitely about putting money into the simulation 1255 00:51:29,142 --> 00:51:30,600 and you get really good simulation. 1256 00:51:30,600 --> 00:51:33,400 If you have a really good simulations then you 1257 00:51:33,400 --> 00:51:35,490 [INAUDIBLE] to play WOPR. 1258 00:51:35,490 --> 00:51:36,066 Yeah. 1259 00:51:36,066 --> 00:51:37,816 AUDIENCE: Back to "Settlers" for a second. 1260 00:51:37,816 --> 00:51:40,600 I'm curious if there's a way for the whole player training 1261 00:51:40,600 --> 00:51:42,970 resources thing, or would it have 1262 00:51:42,970 --> 00:51:47,216 to be only purely like using the ports. 
1263 00:51:47,216 --> 00:51:48,760 PROFESSOR 3: That's a good question. 1264 00:51:48,760 --> 00:51:53,620 I haven't looked closely at whether they do that or not. 1265 00:51:53,620 --> 00:51:55,400 If it's playing a two-player game, 1266 00:51:55,400 --> 00:51:58,790 then I would imagine that they wouldn't because you don't 1267 00:51:58,790 --> 00:52:00,290 really trade in a two-player game. 1268 00:52:00,290 --> 00:52:01,748 But if they weren't, I bet that you 1269 00:52:01,748 --> 00:52:03,257 can incorporate it with WOPR. 1270 00:52:03,257 --> 00:52:05,090 AUDIENCE: Is it limited to two-player games? 1271 00:52:05,090 --> 00:52:06,070 PROFESSOR 3: No, not at all. 1272 00:52:06,070 --> 00:52:07,240 In fact, there are lots of approaches 1273 00:52:07,240 --> 00:52:08,890 that do only one-player games, where 1274 00:52:08,890 --> 00:52:11,482 you think of what's the best move that you can make. 1275 00:52:11,482 --> 00:52:12,190 AUDIENCE: I know. 1276 00:52:12,190 --> 00:52:15,215 But I mean, couldn't MCTS handle three- or four-player games? 1277 00:52:15,215 --> 00:52:16,840 PROFESSOR 3: Yeah, it absolutely could. 1278 00:52:16,840 --> 00:52:19,622 I'm not sure how they computed their head-to-head. 1279 00:52:19,622 --> 00:52:21,730 That might be completely flat cursors. 1280 00:52:21,730 --> 00:52:24,734 I'm not even sure how the settlers interact. 1281 00:52:24,734 --> 00:52:25,234 Yeah. 1282 00:52:25,234 --> 00:52:26,359 AUDIENCE: A quick question. 1283 00:52:26,359 --> 00:52:29,460 So at first you know if I reduce the chess board to only 4 1284 00:52:29,460 --> 00:52:32,530 by 4 or 5 by 5, and I run MCTS versus 1285 00:52:32,530 --> 00:52:35,050 the traditional algorithm that AlphaGo offered as a tree. 1286 00:52:35,050 --> 00:52:38,222 Do you think MCTS will prefer theory and perform 1287 00:52:38,222 --> 00:52:40,219 this computational requirement.
1288 00:52:40,219 --> 00:52:42,510 PROFESSOR 3: The thing about the way that Deep Blue is, 1289 00:52:42,510 --> 00:52:44,570 which is the AI that beat Kasparov, 1290 00:52:44,570 --> 00:52:47,370 the chess grandmaster, 1291 00:52:47,370 --> 00:52:49,970 is that it has a tremendous amount of heuristic 1292 00:52:49,970 --> 00:52:50,587 information. 1293 00:52:50,587 --> 00:52:52,170 There's a lot of external stuff that's 1294 00:52:52,170 --> 00:52:54,310 incorporated into the system that makes it 1295 00:52:54,310 --> 00:52:57,250 able to explore the best paths. 1296 00:52:57,250 --> 00:52:59,500 What I would say is that knowledge-less 1297 00:52:59,500 --> 00:53:03,730 MCTS based on randomness, would take a very long 1298 00:53:03,730 --> 00:53:07,850 computational time to even become competitive with those 1299 00:53:07,850 --> 00:53:10,382 kinds of algorithms, and probably feasibly never would. 1300 00:53:10,382 --> 00:53:12,340 But if you incorporated heuristic information, 1301 00:53:12,340 --> 00:53:15,320 I think that there's a bunch of hope in terms of getting MCTS 1302 00:53:15,320 --> 00:53:16,600 to start performing better. 1303 00:53:16,600 --> 00:53:18,850 And you can look at what next I'm going to talk about, 1304 00:53:18,850 --> 00:53:19,460 AlphaGo. 1305 00:53:19,460 --> 00:53:22,040 It takes inspiration for how we go about incorporating 1306 00:53:22,040 --> 00:53:22,940 these heuristics. 1307 00:53:22,940 --> 00:53:27,147 AUDIENCE: So only the heuristic you [INAUDIBLE] 1308 00:53:27,147 --> 00:53:28,980 PROFESSOR 3: It definitely seems like if you 1309 00:53:28,980 --> 00:53:33,330 have a really good heuristic model for what 1310 00:53:33,330 --> 00:53:38,530 good states in the game are, that if it's a smaller search 1311 00:53:38,530 --> 00:53:42,420 space, that some other models could perform better.
1312 00:53:42,420 --> 00:53:44,546 Although, I'm probably going to eat my foot here 1313 00:53:44,546 --> 00:53:47,390 because this is going to be on OCW some massive amount, 1314 00:53:47,390 --> 00:53:49,936 massive chess playing algorithms. 1315 00:53:49,936 --> 00:53:53,429 Eat my shoe not my foot. 1316 00:53:53,429 --> 00:53:55,430 [LAUGHTER] 1317 00:53:55,430 --> 00:53:58,100 One last game. 1318 00:53:58,100 --> 00:54:00,364 Does anyone know what this game is? 1319 00:54:00,364 --> 00:54:01,280 AUDIENCE: "Total War?" 1320 00:54:01,280 --> 00:54:02,380 PROFESSOR 3: Yes. 1321 00:54:02,380 --> 00:54:03,060 Nice. 1322 00:54:03,060 --> 00:54:04,850 This is "Rome, Total War II." 1323 00:54:04,850 --> 00:54:09,890 It's a simulator for this tremendous real time strategy 1324 00:54:09,890 --> 00:54:13,100 game, where you play, I think, the Roman Empire. 1325 00:54:13,100 --> 00:54:17,540 And you're controlling armies and huge infrastructure systems 1326 00:54:17,540 --> 00:54:20,980 that move and conquer states and continents, 1327 00:54:20,980 --> 00:54:24,530 and meet in the field, and manage resources, and do 1328 00:54:24,530 --> 00:54:26,860 all of these incredible diplomacy feats. 1329 00:54:26,860 --> 00:54:29,347 And so do you think that this game can be solved by MCTS? 1330 00:54:29,347 --> 00:54:29,930 AUDIENCE: Yes. 1331 00:54:29,930 --> 00:54:32,409 AUDIENCE: Yes. 1332 00:54:32,409 --> 00:54:33,450 PROFESSOR 3: Lets say no. 1333 00:54:33,450 --> 00:54:34,658 But I guess I put it on here. 1334 00:54:34,658 --> 00:54:36,870 So that's good on you. 1335 00:54:36,870 --> 00:54:40,925 The way that the AI in "Rome, Total War II" is built 1336 00:54:40,925 --> 00:54:43,200 is that it's built on an MCTS structure. 1337 00:54:43,200 --> 00:54:45,980 And it in fact does do resource allocation 1338 00:54:45,980 --> 00:54:47,780 and a lot of its political maneuvers 1339 00:54:47,780 --> 00:54:49,439 based on Monte Carlo Tree Search moves. 
1340 00:54:49,439 --> 00:54:51,355 There are a bunch of reasons that they explain 1341 00:54:51,355 --> 00:54:53,390 in the game for why they do this, 1342 00:54:53,390 --> 00:54:54,961 or in papers released about the game. 1343 00:54:54,961 --> 00:54:56,835 But one of the nice ones is that it's random, 1344 00:54:56,835 --> 00:54:58,293 which means that you're never going 1345 00:54:58,293 --> 00:55:01,280 to play against the same kind of AI twice because every time 1346 00:55:01,280 --> 00:55:02,750 the set of decisions that it's going to think about 1347 00:55:02,750 --> 00:55:03,736 is completely different. 1348 00:55:03,736 --> 00:55:04,608 AUDIENCE: I have a quick question. 1349 00:55:04,608 --> 00:55:05,358 PROFESSOR 3: Yeah. 1350 00:55:05,358 --> 00:55:07,646 AUDIENCE: So if I want to model any game with MCTS, 1351 00:55:07,646 --> 00:55:10,996 does it have to be that the actions in playing a game 1352 00:55:10,996 --> 00:55:14,272 has to be able to discretize. 1353 00:55:14,272 --> 00:55:14,980 PROFESSOR 3: Yes. 1354 00:55:14,980 --> 00:55:17,755 So far as I know, I haven't seen many continuous variants 1355 00:55:17,755 --> 00:55:19,520 in MCTS. 1356 00:55:19,520 --> 00:55:22,680 And so, I think that it is about choosing these reactions, which 1357 00:55:22,680 --> 00:55:26,130 on it's most narrow level does actually bring it down to here. 1358 00:55:26,130 --> 00:55:27,630 I think one of the reasons that this 1359 00:55:27,630 --> 00:55:30,046 is nice is that there are so many different decisions that 1360 00:55:30,046 --> 00:55:32,610 could be made that MCTS is really the only approach that 1361 00:55:32,610 --> 00:55:35,525 could even begin to handle the massive branching factor that's 1362 00:55:35,525 --> 00:55:37,810 associated with the game Rome, Total War. 1363 00:55:37,810 --> 00:55:38,579 Yeah. 
1364 00:55:38,579 --> 00:55:40,162 AUDIENCE: This is also the consequence 1365 00:55:40,162 --> 00:55:43,407 of this year you get the play off when this game comes. 1366 00:55:43,407 --> 00:55:44,740 PROFESSOR 3: That's interesting. 1367 00:55:44,740 --> 00:55:46,600 That's probably totally it. 1368 00:55:46,600 --> 00:55:47,170 That's cool. 1369 00:55:50,640 --> 00:55:54,710 That's everything about how the algorithm actually works. 1370 00:55:54,710 --> 00:55:56,134 I'm going to pass it off to Nick, 1371 00:55:56,134 --> 00:55:58,550 and he's going to talk to us about some actual limitations 1372 00:55:58,550 --> 00:56:00,317 for this game [INAUDIBLE]. 1373 00:56:04,221 --> 00:56:05,596 PROFESSOR 3: So as Yo said, 1374 00:56:05,596 --> 00:56:07,980 I'm going to start diving into some applications here. 1375 00:56:07,980 --> 00:56:12,180 And not only applications but also some modifications 1376 00:56:12,180 --> 00:56:13,460 or augmentations of MCTS. 1377 00:56:13,460 --> 00:56:16,590 It should hopefully clarify some of the side questions 1378 00:56:16,590 --> 00:56:21,610 you all have been having on slight tweaks to MCTS. 1379 00:56:21,610 --> 00:56:23,470 Now let's get started. 1380 00:56:23,470 --> 00:56:24,230 Wait for it. 1381 00:56:24,230 --> 00:56:25,267 Now let's get started. 1382 00:56:25,267 --> 00:56:25,766 [LAUGHTER] 1383 00:56:25,766 --> 00:56:27,600 Part III, applications. 1384 00:56:27,600 --> 00:56:29,025 First thing we're going to look at 1385 00:56:29,025 --> 00:56:31,110 is an MCTS-based "Mario" controller. 1386 00:56:31,110 --> 00:56:35,290 And "Mario" might seem like some weird thing to test AI on, 1387 00:56:35,290 --> 00:56:38,320 but there actually is a "Super Mario Bros" AI benchmark, 1388 00:56:38,320 --> 00:56:39,930 which is used to test a lot of AIs 1389 00:56:39,930 --> 00:56:42,280 on how well they can play this platformer.
1390 00:56:42,280 --> 00:56:45,290 In case any of you don't know what "Super Mario 1391 00:56:45,290 --> 00:56:47,420 Bros" is, this is a screenshot. 1392 00:56:47,420 --> 00:56:49,170 Basically, you control this one character. 1393 00:56:49,170 --> 00:56:52,780 It's a single-player game. 1394 00:56:52,780 --> 00:56:55,920 The ultimate goal is to reach this flag at the end. 1395 00:56:55,920 --> 00:56:58,180 But along the way there's enemies, 1396 00:56:58,180 --> 00:57:01,046 there's some bonus shrooms you can get. 1397 00:57:01,046 --> 00:57:03,870 If you break open some boxes you might get coins, 1398 00:57:03,870 --> 00:57:06,360 things like that. 1399 00:57:06,360 --> 00:57:09,642 But first, let's just highlight some of the modifications that 1400 00:57:09,642 --> 00:57:12,100 need to be made, or some of the differences between vanilla 1401 00:57:12,100 --> 00:57:16,590 MCTS and an MCTS that's going to be able to work for "Mario." 1402 00:57:16,590 --> 00:57:18,296 First thing is that it's single-player. 1403 00:57:18,296 --> 00:57:21,000 The second is, we use a slightly different simulation 1404 00:57:21,000 --> 00:57:25,130 strategy than the initial just vanilla simulation. 1405 00:57:25,130 --> 00:57:27,760 And someone actually hinted at doing more than one simulation 1406 00:57:27,760 --> 00:57:32,280 because you, you're watching us to n simulations, I think. 1407 00:57:32,280 --> 00:57:33,810 We'll touch on that. 1408 00:57:33,810 --> 00:57:36,630 Then this also introduces what I would consider 1409 00:57:36,630 --> 00:57:38,840 to be domain knowledge. 1410 00:57:38,840 --> 00:57:42,854 Then finally, there's a 50 to 40 millisecond computation time. 1411 00:57:42,854 --> 00:57:45,270 And that has to do with the frames per second of the game. 
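The per-frame computation budget mentioned above can be sketched as an "anytime" loop: keep running search iterations until the clock runs out, then act on whatever the tree says so far. This is only an illustration, not the controller's actual code. The name `run_with_budget`, the `step` callback, and the 40 ms default are assumptions made for the example.

```python
import time


def run_with_budget(step, budget_ms=40.0):
    """Call step() repeatedly until budget_ms milliseconds have elapsed.

    step is a hypothetical callback standing in for one MCTS iteration
    (select, expand, simulate, backpropagate). Returns the number of
    iterations that fit in the budget.
    """
    deadline = time.monotonic() + budget_ms / 1000.0
    iterations = 0
    while True:
        step()  # always do at least one iteration
        iterations += 1
        if time.monotonic() >= deadline:
            break
    return iterations


if __name__ == "__main__":
    calls = []
    done = run_with_budget(lambda: calls.append(1), budget_ms=5.0)
    print("iterations completed:", done)
```

A shorter budget simply means fewer iterations before the controller has to commit to an action for the current frame.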
1412 00:57:45,270 --> 00:57:48,240 So you would think that "Mario" is a continuous game, 1413 00:57:48,240 --> 00:57:50,950 but if we discretize time into these chunks, 1414 00:57:50,950 --> 00:57:54,380 then we can use MCTS. 1415 00:57:54,380 --> 00:57:56,570 Now let's just think about how we could possibly 1416 00:57:56,570 --> 00:57:57,840 formulate this problem. 1417 00:57:57,840 --> 00:58:00,630 Can anyone think of what each of these nodes 1418 00:58:00,630 --> 00:58:02,586 would be if we're playing "Super Mario"? 1419 00:58:02,586 --> 00:58:04,169 AUDIENCE: Jump. 1420 00:58:04,169 --> 00:58:04,960 PROFESSOR 3: Sorry? 1421 00:58:04,960 --> 00:58:05,585 AUDIENCE: Jump. 1422 00:58:05,585 --> 00:58:07,960 It would be like, first node you're going to jump. 1423 00:58:07,960 --> 00:58:11,600 PROFESSOR 3: That might be a way to formulate it. 1424 00:58:11,600 --> 00:58:13,522 But I think that could get-- 1425 00:58:13,522 --> 00:58:16,099 AUDIENCE: Oh, it's not your control at inputs [INAUDIBLE]. 1426 00:58:16,099 --> 00:58:16,890 PROFESSOR 3: Right. 1427 00:58:16,890 --> 00:58:22,110 So the node itself isn't going to be an action. 1428 00:58:22,110 --> 00:58:23,490 AUDIENCE: Equal frames. 1429 00:58:23,490 --> 00:58:24,698 PROFESSOR 3: Yeah, basically. 1430 00:58:24,698 --> 00:58:27,320 So it's going to be the state of a game, what 1431 00:58:27,320 --> 00:58:28,490 we'll call a state. 1432 00:58:28,490 --> 00:58:30,360 So it's basically just a screen grab. 1433 00:58:30,360 --> 00:58:31,990 And in this case, it's 1434 00:58:31,990 --> 00:58:35,190 a 15 by 19 grid screen grab of the game. 1435 00:58:35,190 --> 00:58:37,083 And it will have information about-- it 1436 00:58:37,083 --> 00:58:40,240 knows Mario's position, it knows the enemies' positions, position 1437 00:58:40,240 --> 00:58:42,600 of the blocks, et cetera.
1438 00:58:42,600 --> 00:58:45,370 And then, as Yo was saying, in MCTS 1439 00:58:45,370 --> 00:58:47,660 we have values associated with our nodes. 1440 00:58:47,660 --> 00:58:49,240 And so it will also have a value. 1441 00:58:49,240 --> 00:58:52,890 But we'll get into the value in the next slide 1442 00:58:52,890 --> 00:58:56,820 because I can't really fit it all in here. 1443 00:58:56,820 --> 00:58:58,930 With that being said for our node, that 1444 00:58:58,930 --> 00:59:02,490 being the state of the game, what makes sense for the edge? 1445 00:59:02,490 --> 00:59:03,420 Does anyone know? 1446 00:59:03,420 --> 00:59:05,990 How do we transition from one state to another state? 1447 00:59:05,990 --> 00:59:07,220 AUDIENCE: Jump. 1448 00:59:07,220 --> 00:59:07,710 PROFESSOR 3: Yeah, exactly. 1449 00:59:07,710 --> 00:59:09,560 So this is where the jump and all the action 1450 00:59:09,560 --> 00:59:10,268 have been played. 1451 00:59:10,268 --> 00:59:11,970 So the actions that you take-- 1452 00:59:11,970 --> 00:59:13,230 I didn't list all the actions. 1453 00:59:13,230 --> 00:59:16,502 You can also have a jump left, jump right, all those things. 1454 00:59:16,502 --> 00:59:17,960 But basically, the actions are what 1455 00:59:17,960 --> 00:59:19,209 takes you from state to state. 1456 00:59:19,209 --> 00:59:22,792 So I just drew out what a node might 1457 00:59:22,792 --> 00:59:24,375 look like if you used the jump action. 1458 00:59:24,375 --> 00:59:27,078 You might have Mario go up in the sky. 1459 00:59:27,078 --> 00:59:28,572 Are there questions? 1460 00:59:28,572 --> 00:59:30,790 AUDIENCE: Does it just run the rest of it? 1461 00:59:30,790 --> 00:59:34,520 Because that little thing's moving as they move on? 1462 00:59:34,520 --> 00:59:37,260 PROFESSOR 3: Well, it's not moving in this moment in time. 1463 00:59:37,260 --> 00:59:39,799 We're discretizing time right now. 
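The formulation described above, where nodes hold game states and edges are actions, can be sketched as a small data structure. This is a toy illustration under stated assumptions: the `Node` class, the `ACTIONS` tuple, and `toy_step` are hypothetical names, and the state here is just an `(x, y)` position rather than the grid snapshot used by the real controller.

```python
from dataclasses import dataclass, field

# Illustrative action set; the real controller's action list is not given here.
ACTIONS = ("left", "right", "jump", "jump_left", "jump_right")


@dataclass
class Node:
    state: object                      # snapshot of the game at one time step
    parent: "Node | None" = field(default=None, repr=False, compare=False)
    children: dict = field(default_factory=dict)  # action -> child Node (the edges)
    visits: int = 0                    # number of simulations through this node
    total_score: float = 0.0           # sum of simulation scores seen here

    def expand(self, action, step_fn):
        """Create the child reached by taking `action` for one time step."""
        child = Node(state=step_fn(self.state, action), parent=self)
        self.children[action] = child
        return child


def toy_step(state, action):
    """Hypothetical one-frame transition: state is just Mario's (x, y)."""
    x, y = state
    dx = {"left": -1, "right": 1, "jump_left": -1, "jump_right": 1}.get(action, 0)
    dy = 1 if action.startswith("jump") else 0
    return (x + dx, y + dy)


if __name__ == "__main__":
    root = Node(state=(0, 0))
    child = root.expand("jump", toy_step)
    print(child.state)  # Mario has moved up one cell
```

In a fuller model, the transition function would also advance enemies using their known speed and direction, as discussed in the exchange above.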
1464 00:59:39,799 --> 00:59:41,840 AUDIENCE: But I'm saying, if your action is jump, 1465 00:59:41,840 --> 00:59:46,047 just you would have 1,000 nodes because if you did 1466 00:59:46,047 --> 00:59:48,130 plan out where that thing's moving, left or right, 1467 00:59:48,130 --> 00:59:48,710 then it could be-- 1468 00:59:48,710 --> 00:59:49,751 PROFESSOR 3: Yeah, right. 1469 00:59:49,751 --> 00:59:52,450 So in each state we have the enemy position. 1470 00:59:52,450 --> 00:59:54,110 And we know the speed and direction. 1471 00:59:54,110 --> 00:59:57,290 And so we know when we go from this node to one time step 1472 00:59:57,290 --> 01:00:00,694 later, we'll know where the enemy's moving. 1473 01:00:00,694 --> 01:00:04,530 Any other questions? 1474 01:00:04,530 --> 01:00:06,550 Moving on. 1475 01:00:06,550 --> 01:00:07,290 Sorry. 1476 01:00:07,290 --> 01:00:09,080 Let me just preface this part real quick. 1477 01:00:09,080 --> 01:00:11,970 So in our other simulations, at the end of the simulation 1478 01:00:11,970 --> 01:00:14,700 we would get either a one or a zero, if we'd won tic-tac-toe 1479 01:00:14,700 --> 01:00:17,910 or we lost tic-tac-toe. 1480 01:00:17,910 --> 01:00:21,600 But that won't really work too well here because there's 1481 01:00:21,600 --> 01:00:23,850 a lot of other factors that go into play when 1482 01:00:23,850 --> 01:00:25,410 you're playing "Mario." 1483 01:00:25,410 --> 01:00:28,450 Also, if you're doing a simulation, more than likely, 1484 01:00:28,450 --> 01:00:30,210 you're going to end up hitting an enemy 1485 01:00:30,210 --> 01:00:32,380 and dying or falling into a gap and dying. 1486 01:00:32,380 --> 01:00:34,830 So a lot of these simulations might all return zero. 1487 01:00:34,830 --> 01:00:38,440 And that is, you can't really distinguish between them. 1488 01:00:38,440 --> 01:00:41,670 So this is why I say, this version of MCTS 1489 01:00:41,670 --> 01:00:44,370 introduces what I would consider to be domain knowledge. 
1490 01:00:44,370 --> 01:00:46,860 Basically, they're assigning scores 1491 01:00:46,860 --> 01:00:50,680 to potential things that could happen along the way. 1492 01:00:50,680 --> 01:00:55,630 And this is basically telling the AI that collecting a flower 1493 01:00:55,630 --> 01:00:57,936 is a little bit better than collecting a mushroom. 1494 01:00:57,936 --> 01:00:59,760 It's telling it that getting hurt is bad. 1495 01:00:59,760 --> 01:01:02,130 Right off the bat, all these things in the score 1496 01:01:02,130 --> 01:01:05,100 are giving the AI some domain knowledge about "Super Mario 1497 01:01:05,100 --> 01:01:07,901 Bros," that's helping it calculate the simulation 1498 01:01:07,901 --> 01:01:08,400 results. 1499 01:01:10,985 --> 01:01:14,280 As it says here, it's just doing a multi-objective weighted sum 1500 01:01:14,280 --> 01:01:15,450 of all these things. 1501 01:01:15,450 --> 01:01:17,825 Throughout the simulation it's just adding up your score. 1502 01:01:17,825 --> 01:01:20,742 And then that's the score that is going to be propagated. 1503 01:01:20,742 --> 01:01:24,479 Are there questions about the score? 1504 01:01:24,479 --> 01:01:26,520 AUDIENCE: You said that it adds up all these guys 1505 01:01:26,520 --> 01:01:27,730 and it propagates it over. 1506 01:01:27,730 --> 01:01:33,086 Is it possible to just propagate the multi-part sum [INAUDIBLE] 1507 01:01:33,086 --> 01:01:37,655 as opposed to propagating one value that you create? 1508 01:01:37,655 --> 01:01:39,600 Are you essentially propagating all-- 1509 01:01:39,600 --> 01:01:42,375 what's this?-- 15 values upwards at every node, or are 1510 01:01:42,375 --> 01:01:43,500 you propagating one value-- 1511 01:01:43,500 --> 01:01:44,220 PROFESSOR 3: Well, it's one value. 1512 01:01:44,220 --> 01:01:45,000 It's the collective-- 1513 01:01:45,000 --> 01:01:46,315 AUDIENCE: Then you make them add it together 1514 01:01:46,315 --> 01:01:48,065 and you got each one of them a sub factor.
1515 01:01:50,164 --> 01:01:52,330 PROFESSOR 3: Then also, just one thing to note here, 1516 01:01:52,330 --> 01:01:55,230 is distance, you get 0.1. 1517 01:01:55,230 --> 01:01:58,650 And these are all parameters that have been tuned. 1518 01:01:58,650 --> 01:02:01,470 In the initial version, distance was, I think, 1519 01:02:01,470 --> 01:02:05,035 a reward of five, but probably realized 1520 01:02:05,035 --> 01:02:08,500 that that made Mario skip past a lot of coins and things. 1521 01:02:08,500 --> 01:02:11,050 And so he tweaked the score for that. 1522 01:02:11,050 --> 01:02:13,100 And also, time left is two. 1523 01:02:13,100 --> 01:02:14,460 So there's some weight there. 1524 01:02:14,460 --> 01:02:16,700 You want to get to the very end of the game. 1525 01:02:16,700 --> 01:02:18,720 AUDIENCE: If you're pushing up this score, 1526 01:02:18,720 --> 01:02:20,850 it's no longer a win over losses. 1527 01:02:20,850 --> 01:02:22,920 So it's not w over n. 1528 01:02:22,920 --> 01:02:23,950 What is it affecting? 1529 01:02:23,950 --> 01:02:25,764 PROFESSOR 3: You can just use the score. 1530 01:02:25,764 --> 01:02:26,930 AUDIENCE: The score is the-- 1531 01:02:26,930 --> 01:02:27,860 PROFESSOR 3: Yeah. 1532 01:02:27,860 --> 01:02:31,620 In MCTS you have this idea of when 1533 01:02:31,620 --> 01:02:36,540 you're propagating your q value, you could have that to be zero, 1534 01:02:36,540 --> 01:02:37,080 one. 1535 01:02:37,080 --> 01:02:39,665 AUDIENCE: It's like the sum of all the scores and the nodes 1536 01:02:39,665 --> 01:02:41,290 below over the number of games you win. 1537 01:02:41,290 --> 01:02:42,150 PROFESSOR 3: So basically, what you 1538 01:02:42,150 --> 01:02:44,700 would be getting when you divide by the number of simulations 1539 01:02:44,700 --> 01:02:47,980 is your average score at that node. 1540 01:02:47,980 --> 01:02:50,330 AUDIENCE: OK. 
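The scoring and backpropagation described above can be sketched as follows: each simulation is reduced to one number by a weighted sum, that number is added to every node on the path, and dividing by the visit count gives the average score at a node. Only the `distance` (0.1) and `time_left` (2) weights come from the lecture. The other weights, and the names `WEIGHTS`, `rollout_score`, `Stats`, and `backpropagate`, are hypothetical placeholders.

```python
WEIGHTS = {
    "distance": 0.1,   # quoted in the lecture
    "time_left": 2.0,  # quoted in the lecture
    "coins": 1.0,      # hypothetical placeholder value
    "hurts": -5.0,     # hypothetical placeholder value
}


def rollout_score(events, weights=WEIGHTS):
    """Collapse one simulation's event counts into a single weighted sum."""
    return sum(weights[name] * amount for name, amount in events.items())


class Stats:
    """Per-node statistics: visit count and running total of scores."""

    def __init__(self):
        self.visits = 0
        self.total = 0.0

    @property
    def mean(self):
        # Average score over all simulations that passed through this node.
        return self.total / self.visits if self.visits else 0.0


def backpropagate(path, score):
    """Add one simulation's score to every node on the root-to-leaf path."""
    for stats in path:
        stats.visits += 1
        stats.total += score


if __name__ == "__main__":
    root, leaf = Stats(), Stats()
    backpropagate([root, leaf], rollout_score({"distance": 100, "time_left": 3}))
    backpropagate([root], 4.0)
    print(root.mean, leaf.mean)
```

Because a single scalar is propagated, changing the weights changes which branches look attractive without changing the search procedure itself.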
1541 01:02:50,330 --> 01:02:53,550 AUDIENCE: When you have killsByFire and [INAUDIBLE] 1542 01:02:53,550 --> 01:02:56,490 like that, if you have a positive value, 1543 01:02:56,490 --> 01:02:58,562 then isn't it good to be killed by fire, 1544 01:02:58,562 --> 01:02:59,520 or something like that? 1545 01:02:59,520 --> 01:03:01,436 PROFESSOR 3: This is killing an enemy by fire. 1546 01:03:01,436 --> 01:03:04,350 Like Mario could collect a certain flower or mushroom? 1547 01:03:04,350 --> 01:03:07,060 I think flower, then you have a fire breath and you 1548 01:03:07,060 --> 01:03:07,560 [INAUDIBLE]. 1549 01:03:07,560 --> 01:03:10,864 AUDIENCE: So that's Mario's status if Mario never dies? 1550 01:03:10,864 --> 01:03:11,530 PROFESSOR 3: No. 1551 01:03:11,530 --> 01:03:13,120 Mario's status is-- 1552 01:03:13,120 --> 01:03:15,086 I believe, Mario's status is the fact 1553 01:03:15,086 --> 01:03:17,710 that you could upgrade Mario by collecting [INAUDIBLE] mushroom 1554 01:03:17,710 --> 01:03:20,154 from a fire Mario. 1555 01:03:20,154 --> 01:03:21,570 So that gives you a lot of points. 1556 01:03:21,570 --> 01:03:22,945 Because if you become fire Mario, 1557 01:03:22,945 --> 01:03:26,795 then you're more likely to not die by running into enemies 1558 01:03:26,795 --> 01:03:28,434 because you have fire-spewing-- 1559 01:03:28,434 --> 01:03:30,225 AUDIENCE: You said they spent a lot of time 1560 01:03:30,225 --> 01:03:31,900 tuning these parameters. 1561 01:03:31,900 --> 01:03:34,170 Isn't it generally, though, just an optimization 1562 01:03:34,170 --> 01:03:36,540 framework if that's some formula? 1563 01:03:36,540 --> 01:03:38,215 So they tuned the parameters just 1564 01:03:38,215 --> 01:03:41,370 to make it behave the way that we think is nice. 1565 01:03:41,370 --> 01:03:43,480 But if you change the values, it'll 1566 01:03:43,480 --> 01:03:45,120 do the right thing for that equation. 1567 01:03:45,120 --> 01:03:45,870 PROFESSOR 3: Yeah.
1568 01:03:45,870 --> 01:03:46,320 AUDIENCE: OK. 1569 01:03:46,320 --> 01:03:47,028 PROFESSOR 3: Yes. 1570 01:03:47,028 --> 01:03:50,830 But they were tuning this to make it play how they wanted. 1571 01:03:50,830 --> 01:03:56,069 AUDIENCE: [INAUDIBLE] can't just be a reflection of [INAUDIBLE] 1572 01:03:56,069 --> 01:03:57,360 PROFESSOR 3: That's a strategy. 1573 01:03:57,360 --> 01:03:59,380 If you choose that, I don't see why not. 1574 01:03:59,380 --> 01:04:02,352 That might affect certain things. 1575 01:04:02,352 --> 01:04:04,560 Obviously, you can change these to whatever you want. 1576 01:04:04,560 --> 01:04:06,730 It'll slightly tweak which simulations 1577 01:04:06,730 --> 01:04:10,240 as to working better, in terms of changing which nodes you 1578 01:04:10,240 --> 01:04:12,640 end up choosing [INAUDIBLE]. 1579 01:04:12,640 --> 01:04:15,670 So we move on. 1580 01:04:15,670 --> 01:04:17,820 So we know about scoring simulations. 1581 01:04:17,820 --> 01:04:20,160 Now we're going to look at exactly the simulation type 1582 01:04:20,160 --> 01:04:23,580 that's used to play this MCTS controller. 1583 01:04:23,580 --> 01:04:25,500 So the regular version that Yo talked about 1584 01:04:25,500 --> 01:04:28,290 is just choosing a random node at each level 1585 01:04:28,290 --> 01:04:30,300 in your simulation. 1586 01:04:30,300 --> 01:04:31,840 But there are some other strategies. 1587 01:04:31,840 --> 01:04:32,965 And someone brought one up. 1588 01:04:32,965 --> 01:04:34,500 The first is, look at best of n. 1589 01:04:34,500 --> 01:04:39,280 So in this one, you choose three random nodes at each level, 1590 01:04:39,280 --> 01:04:43,122 except that you stick with the best of those three. 1591 01:04:43,122 --> 01:04:45,080 Choose three random nodes, stick with this one. 1592 01:04:45,080 --> 01:04:45,871 Go to the next one. 
1593 01:04:45,871 --> 01:04:48,671 You would choose three random nodes, take the best one, 1594 01:04:48,671 --> 01:04:49,920 and then go to the next level. 1595 01:04:49,920 --> 01:04:53,216 You are able to do that in this game because 1596 01:04:53,216 --> 01:04:54,590 of the way the scoring works, you 1597 01:04:54,590 --> 01:04:56,923 don't have to get to the end of the game for your score. 1598 01:04:56,923 --> 01:05:00,425 You actually could collect a coin along the way. 1599 01:05:00,425 --> 01:05:01,860 If this is jump, and it gets 1600 01:05:01,860 --> 01:05:03,610 you a coin, versus moving left or right, which 1601 01:05:03,610 --> 01:05:04,720 doesn't give you any points, 1602 01:05:04,720 --> 01:05:07,310 then this is the node that would give you the highest score, 1603 01:05:07,310 --> 01:05:09,376 so I would choose that one, et cetera. 1604 01:05:09,376 --> 01:05:11,250 And then the final one, which is the one that 1605 01:05:11,250 --> 01:05:12,791 is actually used for this controller, 1606 01:05:12,791 --> 01:05:14,700 is multi-simulation. 1607 01:05:14,700 --> 01:05:17,043 This was brought up by him. 1608 01:05:17,043 --> 01:05:18,005 I don't know your name. 1609 01:05:18,005 --> 01:05:18,505 Sorry. 1610 01:05:18,505 --> 01:05:21,410 But basically, you run multiple random simulations 1611 01:05:21,410 --> 01:05:22,035 from your node. 1612 01:05:22,035 --> 01:05:24,990 And then you propagate up whichever of those simulations 1613 01:05:24,990 --> 01:05:26,580 gives you the highest value. 1614 01:05:26,580 --> 01:05:28,600 And the reason to do multiple simulations 1615 01:05:28,600 --> 01:05:33,767 is to attempt to increase the accuracy of your simulations. 1616 01:05:33,767 --> 01:05:35,141 If you just do one simulation you 1617 01:05:35,141 --> 01:05:36,307 might just get really lucky. 1618 01:05:36,307 --> 01:05:40,490 But if you do three then you can take the highest value 1619 01:05:40,490 --> 01:05:42,450 and use that as your value.
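The three simulation strategies described above (plain random, best of n, and multi-simulation) can be sketched side by side. This is a toy model, not the controller's code: the function names, the integer state, and the `step_fn` / `reward_fn` callbacks are assumptions made for the example. Note that taking the maximum over several rollouts gives an optimistic estimate by construction.

```python
import random


def random_rollout(state, step_fn, reward_fn, actions, depth, rng):
    """Plain simulation: one random action per level, rewards summed."""
    total = 0.0
    for _ in range(depth):
        state = step_fn(state, rng.choice(actions))
        total += reward_fn(state)
    return total


def best_of_n_rollout(state, step_fn, reward_fn, actions, depth, n, rng):
    """At each level, sample n random successors and keep the best-scoring one."""
    total = 0.0
    for _ in range(depth):
        candidates = [step_fn(state, rng.choice(actions)) for _ in range(n)]
        state = max(candidates, key=reward_fn)
        total += reward_fn(state)
    return total


def multi_simulation(state, step_fn, reward_fn, actions, depth, n, rng):
    """Run n independent random rollouts and return the highest total."""
    return max(
        random_rollout(state, step_fn, reward_fn, actions, depth, rng)
        for _ in range(n)
    )


if __name__ == "__main__":
    step = lambda s, a: s + a                     # toy state: an integer position
    reward = lambda s: 1.0 if s > 0 else 0.0      # toy reward: 1 point while s > 0
    rng = random.Random(0)
    print(multi_simulation(0, step, reward, (-1, 1), depth=10, n=3, rng=rng))
```

Best of n only works when intermediate states can be scored, which is the point made above about collecting a coin along the way.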
1620 01:05:42,450 --> 01:05:45,424 Since the whole point of this is to try to make moves that get you 1621 01:05:45,424 --> 01:05:47,330 the highest values, then that will 1622 01:05:47,330 --> 01:05:50,700 make your random simulation value more accurate. 1623 01:05:50,700 --> 01:05:52,915 Are there questions about multi-simulation? 1624 01:05:52,915 --> 01:05:55,040 AUDIENCE: So what do you think about the simulation 1625 01:05:55,040 --> 01:05:58,940 [INAUDIBLE] how many [INAUDIBLE] 1626 01:05:58,940 --> 01:06:00,990 PROFESSOR 3: So there's a trade-off here. 1627 01:06:00,990 --> 01:06:04,655 The more simulations you do the more accurate-- 1628 01:06:04,655 --> 01:06:06,280 the more representative your simulation 1629 01:06:06,280 --> 01:06:08,120 will be at the end of the game. 1630 01:06:10,720 --> 01:06:13,600 You could run two to the whatever simulations 1631 01:06:13,600 --> 01:06:16,390 to try to get every single possible action 1632 01:06:16,390 --> 01:06:17,700 and then take the max of that. 1633 01:06:17,700 --> 01:06:19,450 And that would give you the maximum value. 1634 01:06:19,450 --> 01:06:20,791 That would be ideal. 1635 01:06:20,791 --> 01:06:22,290 But obviously, that takes more time. 1636 01:06:22,290 --> 01:06:24,248 So there's a trade-off between computation time 1637 01:06:24,248 --> 01:06:25,990 and the number of simulations you run. 1638 01:06:25,990 --> 01:06:27,820 And that's just something that they probably just 1639 01:06:27,820 --> 01:06:28,611 played around with. 1640 01:06:28,611 --> 01:06:37,472 AUDIENCE: Do you use [INAUDIBLE] have 1641 01:06:37,472 --> 01:06:41,422 to finish the decision losing a couple of minutes or 10 minutes 1642 01:06:41,422 --> 01:06:43,380 or they're going to take your [INAUDIBLE] away. 1643 01:06:43,380 --> 01:06:46,340 PROFESSOR 3: In this competition there are different computation 1644 01:06:46,340 --> 01:06:48,480 time budgets that you get.
1645 01:06:48,480 --> 01:06:51,140 And I believe the reason for the different computation time 1646 01:06:51,140 --> 01:06:53,091 budgets is the frame per second of the game. 1647 01:06:58,390 --> 01:07:00,380 I told you all about the setup, we 1648 01:07:00,380 --> 01:07:03,800 went over, the scoring, the nodes, what the advantages are, 1649 01:07:03,800 --> 01:07:05,935 what the simulation strategy is used. 1650 01:07:05,935 --> 01:07:08,490 So you probably want to see it in action. 1651 01:07:08,490 --> 01:07:11,150 So this is always a risky move trying to get video to play. 1652 01:07:11,150 --> 01:07:12,830 AUDIENCE: It's actually in the back up. 1653 01:07:12,830 --> 01:07:14,000 Hit Escape. 1654 01:07:14,000 --> 01:07:14,800 PROFESSOR 3: OK. 1655 01:07:14,800 --> 01:07:15,660 Got it. 1656 01:07:15,660 --> 01:07:17,075 AUDIENCE: And now, I guess, we-- 1657 01:07:17,075 --> 01:07:18,765 PROFESSOR 3: And drag it over again? 1658 01:07:18,765 --> 01:07:19,390 AUDIENCE: Yeah. 1659 01:07:24,667 --> 01:07:26,250 PROFESSOR 3: Running this full screen. 1660 01:07:26,250 --> 01:07:28,760 AUDIENCE: Hit the [INAUDIBLE] 1661 01:07:28,760 --> 01:07:33,330 PROFESSOR 3: [INAUDIBLE] All right. 1662 01:07:33,330 --> 01:07:35,887 Here's this MCTS-based "Mario" playing controller. 1663 01:07:35,887 --> 01:07:37,470 You can see he's actually wrecking, so 1664 01:07:37,470 --> 01:07:39,240 doing some serious damage here. 1665 01:07:39,240 --> 01:07:42,840 But those lines that you see, the reason they're 1666 01:07:42,840 --> 01:07:46,616 different colors it's not showing different players, 1667 01:07:46,616 --> 01:07:47,532 or anything like that. 1668 01:07:47,532 --> 01:07:49,157 It's just using different colors so you 1669 01:07:49,157 --> 01:07:52,065 can see the different layers of this tree search. 1670 01:07:52,065 --> 01:07:53,940 You can see he actually went backwards there. 
1671 01:07:53,940 --> 01:07:55,520 And that's because in a simulation, 1672 01:07:55,520 --> 01:07:58,670 when one of the backward ones landed on an enemy-- 1673 01:07:58,670 --> 01:08:01,710 and in fact gets you points from our scoring system versus 1674 01:08:01,710 --> 01:08:04,376 if you had just gone forward you would have gotten some distance 1675 01:08:04,376 --> 01:08:06,110 points but not-- 1676 01:08:06,110 --> 01:08:08,740 also, he is just [INAUDIBLE] 1677 01:08:08,740 --> 01:08:12,754 The simulation is quickly being able to figure out that he 1678 01:08:12,754 --> 01:08:13,920 can jump on all his enemies. 1679 01:08:13,920 --> 01:08:16,340 So he's just wrecking all these guys. 1680 01:08:16,340 --> 01:08:19,350 Getting lots of points here, collecting the coin, et cetera. 1681 01:08:19,350 --> 01:08:20,760 You get the idea. 1682 01:08:20,760 --> 01:08:22,230 It's pretty awesome to watch. 1683 01:08:22,230 --> 01:08:23,979 There's that flower we were talking about. 1684 01:08:23,979 --> 01:08:28,532 So now he's actually a fire-spewing Mario demon. 1685 01:08:28,532 --> 01:08:30,990 He's doing some serious damage with that. 1686 01:08:30,990 --> 01:08:31,979 Stepping on missiles. 1687 01:08:31,979 --> 01:08:34,689 I didn't even know you could step on the missiles. 1688 01:08:34,689 --> 01:08:36,930 All right. 1689 01:08:36,930 --> 01:08:38,370 You could watch this for a while. 1690 01:08:38,370 --> 01:08:41,599 But we'll exit now. 1691 01:08:41,599 --> 01:08:44,430 It looks super promising in this video. 1692 01:08:44,430 --> 01:08:46,450 I don't know how close max stuff. 1693 01:08:46,450 --> 01:08:48,590 AUDIENCE: Just click on back [INAUDIBLE] 1694 01:08:48,590 --> 01:08:50,300 PROFESSOR 3: There it is. 1695 01:08:50,300 --> 01:08:51,819 OK. 1696 01:08:51,819 --> 01:08:54,300 The demo looks really cool, looks really promising. 
1697 01:08:54,300 --> 01:08:57,330 Let's take a look at the charts here because we all 1698 01:08:57,330 --> 01:09:00,240 want some quantitative stuff. 1699 01:09:00,240 --> 01:09:01,286 This is the chart. 1700 01:09:01,286 --> 01:09:02,410 The score is on the y-axis. 1701 01:09:02,410 --> 01:09:05,060 The bottom is computation budget, which is something 1702 01:09:05,060 --> 01:09:06,439 that you were talking about. 1703 01:09:06,439 --> 01:09:11,760 I just want to highlight to make this a little more 1704 01:09:11,760 --> 01:09:13,319 visually appealing here. 1705 01:09:13,319 --> 01:09:16,870 All of these things that I highlighted, 1706 01:09:16,870 --> 01:09:17,939 they're labelled as UCT. 1707 01:09:17,939 --> 01:09:20,387 That's Upper Confidence Bound Tree. 1708 01:09:20,387 --> 01:09:22,470 Remember, Yo talked about upper confidence bounds. 1709 01:09:22,470 --> 01:09:24,990 That's essentially what's used in MCTS 1710 01:09:24,990 --> 01:09:26,262 for guiding your tree search. 1711 01:09:26,262 --> 01:09:27,470 So these are all the methods. 1712 01:09:27,470 --> 01:09:31,200 But then UCT multi, which is this purple square, that's 1713 01:09:31,200 --> 01:09:34,930 saying it's using MCTS but it's doing the multiple simulations. 1714 01:09:34,930 --> 01:09:41,090 And you can see this multi plus carry is also in the top. 1715 01:09:41,090 --> 01:09:43,510 Both these use the multi-simulation technique. 1716 01:09:43,510 --> 01:09:47,279 And then the plus carry is they added an extra scoring 1717 01:09:47,279 --> 01:09:49,800 mechanism for carries. 1718 01:09:49,800 --> 01:09:52,670 I believe that's probably like carrying a shell. 1719 01:09:52,670 --> 01:09:54,715 That made it do better. 1720 01:09:54,715 --> 01:09:56,340 Then these ones that aren't highlighted 1721 01:09:56,340 --> 01:10:01,130 are using plain A*, and then a refined version of A*.
1722 01:10:01,130 --> 01:10:03,630 With increasing time, they do increase their scores, 1723 01:10:03,630 --> 01:10:07,950 but they're even worse than just your UCT 1724 01:10:07,950 --> 01:10:13,424 with just random simulation, no multi-simulations. 1725 01:10:13,424 --> 01:10:15,340 We're running low on time, which is not ideal. 1726 01:10:15,340 --> 01:10:19,810 But another thing that I want to point out is down at the bottom 1727 01:10:19,810 --> 01:10:23,830 here, these are the multi-simulations. 1728 01:10:23,830 --> 01:10:27,540 They have the lowest maximal search depth, which 1729 01:10:27,540 --> 01:10:30,846 at first would seem like, what? 1730 01:10:30,846 --> 01:10:33,840 I have the lowest search depth but my score is the highest? 1731 01:10:33,840 --> 01:10:35,730 But that comes back to what you 1732 01:10:35,730 --> 01:10:38,220 were saying about the trade-off between the simulations 1733 01:10:38,220 --> 01:10:41,220 and the amount of time it takes. 1734 01:10:41,220 --> 01:10:43,530 So because I'm doing multiple simulations, 1735 01:10:43,530 --> 01:10:46,110 I'm taking more time at each node. 1736 01:10:46,110 --> 01:10:49,770 But that's giving me a more accurate value assessment. 1737 01:10:49,770 --> 01:10:52,300 So that lets me choose my actions more carefully, 1738 01:10:52,300 --> 01:10:53,907 or with more information. 1739 01:10:53,907 --> 01:10:56,240 And so that's what's able to give me these better scores. 1740 01:10:59,590 --> 01:11:00,400 That's all "Mario." 1741 01:11:00,400 --> 01:11:02,010 So we're going to move on to AlphaGo. 1742 01:11:02,010 --> 01:11:04,590 Are there any questions about "Mario" before I go to AlphaGo? 1743 01:11:04,590 --> 01:11:05,090 Yeah. 1744 01:11:05,090 --> 01:11:07,300 AUDIENCE: What's the table [INAUDIBLE] inference? 1745 01:11:07,300 --> 01:11:08,800 PROFESSOR 3: That's a good question.
1746 01:11:08,800 --> 01:11:12,420 I have a feeling it's because if you're doing best of n, 1747 01:11:12,420 --> 01:11:16,480 that's really heavily relying on your scoring metrics. 1748 01:11:19,360 --> 01:11:21,786 Let's say at one step if I jump and collect 1749 01:11:21,786 --> 01:11:23,660 a coin versus if I go left or right and play, 1750 01:11:23,660 --> 01:11:25,326 I'll get more points if I get that coin. 1751 01:11:25,326 --> 01:11:27,690 But maybe, a missile is going to hit me in the face 1752 01:11:27,690 --> 01:11:28,690 if I do that. 1753 01:11:28,690 --> 01:11:30,940 It gets rid of some of the-- it's 1754 01:11:30,940 --> 01:11:32,670 forcing you to do certain moves. 1755 01:11:32,670 --> 01:11:34,873 AUDIENCE: Is the A* heuristic using the same 1756 01:11:34,873 --> 01:11:37,718 value, the same value that you're getting 1757 01:11:37,718 --> 01:11:38,861 by your simulation? 1758 01:11:38,861 --> 01:11:39,610 PROFESSOR 3: Yeah. 1759 01:11:39,610 --> 01:11:42,820 I'm not exactly sure what the A* heuristic is. 1760 01:11:42,820 --> 01:11:48,520 The whole reason that A* is difficult is because coming up 1761 01:11:48,520 --> 01:11:51,900 with heuristics for these types of games is hard. 1762 01:11:51,900 --> 01:11:54,670 But this is not his version of A*. 1763 01:11:54,670 --> 01:11:57,190 I believe this is the A* that was used by-- 1764 01:11:57,190 --> 01:12:00,890 I forget the name of the guy-- but he won the AI competition 1765 01:12:00,890 --> 01:12:04,060 a couple of years ago. 1766 01:12:04,060 --> 01:12:06,390 I'm going to try to move on to AlphaGo. 1767 01:12:06,390 --> 01:12:09,985 Does someone know how many minutes I have left? 1768 01:12:09,985 --> 01:12:10,610 AUDIENCE: Four. 1769 01:12:10,610 --> 01:12:11,276 PROFESSOR 3: OK. 1770 01:12:11,276 --> 01:12:13,005 We're going to power through. 1771 01:12:13,005 --> 01:12:14,124 Here's AlphaGo. 1772 01:12:14,124 --> 01:12:15,540 Hopefully, you all know the rules.
1773 01:12:15,540 --> 01:12:17,490 Just in case, I'll just go through a quick-- 1774 01:12:17,490 --> 01:12:18,220 19 by 19. 1775 01:12:18,220 --> 01:12:20,300 You alternate black stones and white stones. 1776 01:12:20,300 --> 01:12:23,480 You collect enemy stones by completely surrounding them. 1777 01:12:23,480 --> 01:12:25,740 You can surround a single stone or groups of stones. 1778 01:12:25,740 --> 01:12:28,365 And your score is your territory plus the number 1779 01:12:28,365 --> 01:12:29,115 of captured pieces. 1780 01:12:29,115 --> 01:12:31,573 So your territory is just the area that you're surrounding, 1781 01:12:31,573 --> 01:12:34,506 and then you just add the stones you've collected. 1782 01:12:34,506 --> 01:12:35,880 The rules aren't super important. 1783 01:12:35,880 --> 01:12:39,399 The main emphasis is there are very few rules, so you 1784 01:12:39,399 --> 01:12:40,690 would think it's really simple. 1785 01:12:40,690 --> 01:12:43,105 But the complexity of the game is quite extreme. 1786 01:12:45,660 --> 01:12:50,290 At each turn you have about 250 options that you can play. 1787 01:12:50,290 --> 01:12:52,440 Each Go game lasts about 150 turns. 1788 01:12:52,440 --> 01:12:54,750 So that gives you a total of 10 to the 761 games, 1789 01:12:54,750 --> 01:12:56,370 approximately. 1790 01:12:56,370 --> 01:12:58,470 And to put that in comparison, here's chess. 1791 01:12:58,470 --> 01:12:59,820 You can read those numbers. 1792 01:12:59,820 --> 01:13:01,400 Chess is also pretty complex. 1793 01:13:01,400 --> 01:13:03,610 But there are 35 options per turn. 1794 01:13:03,610 --> 01:13:06,432 Deep Blue. 1795 01:13:06,432 --> 01:13:08,890 I think you were talking about building out the whole tree. 1796 01:13:08,890 --> 01:13:12,100 So Deep Blue would build out the tree for six levels.
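[Editor's sketch] The size comparison in this passage can be checked with a few lines of arithmetic. This is a rough sketch using only the figures quoted in the talk (about 250 moves per turn and 150 turns for Go, 35 moves per turn for chess, a six-level lookahead for Deep Blue); the function names are made up for illustration. The simple product 250^150 comes out near 10^360, so the 10^761 figure quoted above presumably comes from a different estimate that is not derived here.

```python
import math

# Figures quoted in the talk (approximate).
GO_BRANCHING, GO_TURNS = 250, 150
CHESS_BRANCHING = 35
LOOKAHEAD_LEVELS = 6


def log10_sequences(branching: int, depth: int) -> float:
    """Return log10(branching ** depth), the number of move sequences of that depth."""
    return depth * math.log10(branching)


def leaves(branching: int, depth: int) -> int:
    """Number of leaf positions in a full tree of the given depth."""
    return branching ** depth


if __name__ == "__main__":
    exponent = log10_sequences(GO_BRANCHING, GO_TURNS)
    print(f"Go, 150 turns of 250 moves: about 10^{exponent:.0f} sequences")
    print(f"Chess, six levels: {leaves(CHESS_BRANCHING, LOOKAHEAD_LEVELS):,} leaves")
    print(f"Go, six levels:    {leaves(GO_BRANCHING, LOOKAHEAD_LEVELS):,} leaves")
```

Even at a depth of only six levels, the Go tree has roughly 130,000 times as many leaves as the chess tree, which is why a fixed-depth full expansion stops being practical.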
1797 01:13:12,100 --> 01:13:14,545 And then use this hard-core, 1798 01:13:14,545 --> 01:13:17,745 chess-master-inputted heuristic evaluation that it 1799 01:13:17,745 --> 01:13:20,230 used to find the best move. 1800 01:13:20,230 --> 01:13:22,712 Except with Go, you have 250 options, 1801 01:13:22,712 --> 01:13:26,670 which already is adding a lot more complexity. 1802 01:13:26,670 --> 01:13:30,970 So that strategy won't work quite as nicely. 1803 01:13:30,970 --> 01:13:31,870 What do we do? 1804 01:13:31,870 --> 01:13:34,075 We use a modified version of MCTS. 1805 01:13:34,075 --> 01:13:35,210 Well, it's not what we do. 1806 01:13:35,210 --> 01:13:39,220 That's what Google's DeepMind team did with Go. 1807 01:13:39,220 --> 01:13:41,900 They combined neural networks with MCTS. 1808 01:13:41,900 --> 01:13:45,430 Coincidentally, we learned about neural networks last class. 1809 01:13:45,430 --> 01:13:47,900 Probably not a coincidence. 1810 01:13:47,900 --> 01:13:49,400 PROFESSOR 3: It's not a coincidence. 1811 01:13:49,400 --> 01:13:51,500 PROFESSOR 3: There were two policy networks 1812 01:13:51,500 --> 01:13:53,430 in AlphaGo, and one value network. 1813 01:13:53,430 --> 01:13:55,580 And another big coincidence here, 1814 01:13:55,580 --> 01:13:57,475 the two policy networks are actually 1815 01:13:57,475 --> 01:14:00,140 CNNs, which we learned specifically about last class, 1816 01:14:00,140 --> 01:14:01,390 convolutional neural nets. 1817 01:14:01,390 --> 01:14:04,515 And the reason for that is the input 1818 01:14:04,515 --> 01:14:07,995 to the policy neural networks is an image of the game. 1819 01:14:07,995 --> 01:14:10,120 And remember, convolutional neural nets work really 1820 01:14:10,120 --> 01:14:12,170 well with images. 1821 01:14:12,170 --> 01:14:15,520 What it outputs, though, is a probability distribution 1822 01:14:15,520 --> 01:14:16,770 over the legal moves.
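[Editor's sketch] A probability distribution over legal moves can be pictured as a softmax restricted to the legal points, followed by a random draw. This is only an illustration under stated assumptions: the raw `scores` dictionary stands in for whatever the network actually computes, and `policy_distribution` and `sample_move` are hypothetical helper names, not part of AlphaGo.

```python
import math
import random


def policy_distribution(scores, legal_moves):
    """Softmax over the scores of legal moves only; illegal moves get no probability."""
    peak = max(scores[m] for m in legal_moves)  # subtract the max for numerical stability
    weights = {m: math.exp(scores[m] - peak) for m in legal_moves}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}


def sample_move(distribution, rng):
    """Draw one move at random, in proportion to its probability."""
    moves = list(distribution)
    return rng.choices(moves, weights=[distribution[m] for m in moves], k=1)[0]


if __name__ == "__main__":
    # Hypothetical raw scores for three board points; "C3" is occupied, so it is not legal.
    scores = {"A1": 2.0, "B2": 1.0, "C3": 5.0}
    dist = policy_distribution(scores, legal_moves=["A1", "B2"])
    print(dist)
    print(sample_move(dist, random.Random(0)))
```

Because the move is drawn rather than taken as the maximum, a higher-probability move is chosen more often but never with certainty.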
1823 01:14:16,770 --> 01:14:20,740 And the idea is that if a move has a higher probability, 1824 01:14:20,740 --> 01:14:23,830 it will be a more promising move for you to take. 1825 01:14:23,830 --> 01:14:27,195 But another key point is that it's not deterministic. 1826 01:14:27,195 --> 01:14:28,820 It's not telling you to take this move. 1827 01:14:28,820 --> 01:14:32,310 It's just assigning a higher probability to this move. 1828 01:14:32,310 --> 01:14:34,990 And this network was generated by doing supervised learning 1829 01:14:34,990 --> 01:14:39,390 on 30 million positions from human expert games. 1830 01:14:39,390 --> 01:14:42,640 Apparently, there's a giant database of Go expert games. 1831 01:14:42,640 --> 01:14:44,260 So that came in handy. 1832 01:14:44,260 --> 01:14:46,870 And there were two different networks trained. 1833 01:14:46,870 --> 01:14:49,810 One of them was a slow policy, the other was a fast policy. 1834 01:14:49,810 --> 01:14:54,980 The slow was able to predict an expert move with 57% accuracy, 1835 01:14:54,980 --> 01:14:57,250 which to me was mind blowing. 1836 01:14:57,250 --> 01:15:00,460 Using this neural network, 57% of the time 1837 01:15:00,460 --> 01:15:04,260 it could pinpoint where the expert would place his move. 1838 01:15:04,260 --> 01:15:05,780 That took 3,000 microseconds. 1839 01:15:05,780 --> 01:15:08,995 Versus the fast policy, which suffered a bit in the accuracy, 1840 01:15:08,995 --> 01:15:11,259 but it's 1,500 times faster. 1841 01:15:11,259 --> 01:15:12,675 And we'll see where they used each 1842 01:15:12,675 --> 01:15:15,580 of these different policies later on. 1843 01:15:15,580 --> 01:15:20,680 But it could predict the expert move with 57% accuracy. 1844 01:15:20,680 --> 01:15:22,472 The AlphaGo team was like, that's not our goal. 1845 01:15:22,472 --> 01:15:24,138 We don't want to predict an expert move. 1846 01:15:24,138 --> 01:15:25,630 We want to predict a winning move.
1847 01:15:25,630 --> 01:15:28,180 And so to do that, they took their policy network, 1848 01:15:28,180 --> 01:15:30,170 and then they would use reinforcement learning. 1849 01:15:30,170 --> 01:15:32,950 That's where you play the network against iterations 1850 01:15:32,950 --> 01:15:35,830 of itself in order to home in on a better policy that's 1851 01:15:35,830 --> 01:15:39,560 geared towards winning moves. 1852 01:15:39,560 --> 01:15:42,889 Then they tested this against Pachi, which uses-- 1853 01:15:42,889 --> 01:15:44,555 for the camera, I have no idea if that's 1854 01:15:44,555 --> 01:15:45,340 how you pronounce Pachi. 1855 01:15:45,340 --> 01:15:46,295 It might be Patchey. 1856 01:15:46,295 --> 01:15:47,320 I'm not sure. 1857 01:15:47,320 --> 01:15:52,330 But there are 100,000 MCTS simulations at each turn. 1858 01:15:52,330 --> 01:15:55,420 So this is purely MCTS. 1859 01:15:55,420 --> 01:15:59,842 When it was playing just the AlphaGo policy network, 1860 01:15:59,842 --> 01:16:03,700 the policy network won 85% of the games. 1861 01:16:03,700 --> 01:16:06,780 So without any sort of tree search or anything involved, 1862 01:16:06,780 --> 01:16:08,680 it won 85%, which is pretty great. 1863 01:16:08,680 --> 01:16:11,810 And that suggests that maybe intuition wins 1864 01:16:11,810 --> 01:16:13,880 over long reflections in Go. 1865 01:16:13,880 --> 01:16:16,535 And interestingly, if you talk to expert Go players 1866 01:16:16,535 --> 01:16:19,340 and you ask them why they did a certain move, they'll just say, 1867 01:16:19,340 --> 01:16:22,840 It felt good, or I had a hunch about this. 1868 01:16:22,840 --> 01:16:26,660 That's indicative there. 1869 01:16:26,660 --> 01:16:28,330 Hopefully, I'm not going over time. 1870 01:16:28,330 --> 01:16:29,970 Sorry. 1871 01:16:29,970 --> 01:16:31,505 Those are the two policy networks. 1872 01:16:31,505 --> 01:16:32,713 There's also a value network.
1873 01:16:32,713 --> 01:16:35,450 What the value network does is it takes in a board, 1874 01:16:35,450 --> 01:16:40,140 and it'll give you a value, like how good is this board? 1875 01:16:40,140 --> 01:16:42,260 It'll give you a win probability number. 1876 01:16:42,260 --> 01:16:45,592 So 77%, it would say, 77% of the time you 1877 01:16:45,592 --> 01:16:47,362 should win from this board. 1878 01:16:47,362 --> 01:16:49,820 That's similar to the evaluation that comes from Deep Blue. 1879 01:16:49,820 --> 01:16:53,570 But rather than a Go master coming in and telling you, 1880 01:16:53,570 --> 01:16:55,730 well, if these are connected in this way, 1881 01:16:55,730 --> 01:16:57,730 and down here we have this certain thing, 1882 01:16:57,730 --> 01:17:00,060 then here's the score we should expect-- 1883 01:17:00,060 --> 01:17:01,579 in chess, they had chess masters, 1884 01:17:01,579 --> 01:17:03,620 like if the knight is here and the queen is here, 1885 01:17:03,620 --> 01:17:04,840 all these specific things-- 1886 01:17:04,840 --> 01:17:07,649 this was actually learned from the reinforcement learning that 1887 01:17:07,649 --> 01:17:09,440 was happening when the policy networks were 1888 01:17:09,440 --> 01:17:10,240 playing each other. 1889 01:17:10,240 --> 01:17:12,890 The value network was learning about those positions 1890 01:17:12,890 --> 01:17:14,294 during that time. 1891 01:17:14,294 --> 01:17:16,210 And the predictions get better towards the end 1892 01:17:16,210 --> 01:17:21,590 of the game, which I think Yo mentioned in his talk. 1893 01:17:21,590 --> 01:17:23,651 So how do you combine all these into MCTS? 1894 01:17:23,651 --> 01:17:25,359 The slow policy network, if you remember, 1895 01:17:25,359 --> 01:17:27,830 is slower but should give us stronger moves. 1896 01:17:27,830 --> 01:17:29,780 It is used to guide our tree search in order 1897 01:17:29,780 --> 01:17:33,322 to help us decide which nodes to expand next.
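[Editor's sketch] The talk does not give the selection formula, so the following only illustrates the general idea of a policy prior guiding the search: each move's score is its average backed-up value plus an exploration bonus that grows with the prior and shrinks as the move is visited. The class `Edge`, the function `select_move`, and the constant `c_explore` are hypothetical names, and the exact bonus shape here is an assumption, not AlphaGo's published rule.

```python
import math


class Edge:
    """Search statistics for one move out of a tree node."""

    def __init__(self, prior, visits=0, value_sum=0.0):
        self.prior = prior          # probability the slow policy network gave this move
        self.visits = visits        # how many times the search has tried this move
        self.value_sum = value_sum  # sum of the values backed up through this move

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0


def select_move(edges, c_explore=1.0):
    """Pick the move with the best mean value plus prior-weighted exploration bonus."""
    total_visits = sum(edge.visits for edge in edges.values())

    def score(edge):
        # The +1 under the root keeps the bonus nonzero before any visits (an assumption).
        bonus = c_explore * edge.prior * math.sqrt(total_visits + 1) / (1 + edge.visits)
        return edge.mean_value() + bonus

    return max(edges, key=lambda move: score(edges[move]))


if __name__ == "__main__":
    fresh = {"a": Edge(prior=0.7), "b": Edge(prior=0.3)}
    print(select_move(fresh))  # the prior decides while nothing has been visited
    tried = {"a": Edge(prior=0.7, visits=50, value_sum=5.0), "b": Edge(prior=0.3)}
    print(select_move(tried))  # a heavily visited, weak move gives way to the other
```

The design point is that the prior narrows where the search looks first, while visit counts and backed-up values can still override it later.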
1898 01:17:33,322 --> 01:17:35,840 When we expand that node to get the value, 1899 01:17:35,840 --> 01:17:38,390 the value of the state comes from the simulation, like before, 1900 01:17:38,390 --> 01:17:41,000 like normal MCTS, except it's not 1901 01:17:41,000 --> 01:17:42,560 a completely random simulation. 1902 01:17:42,560 --> 01:17:45,200 We use our fast policy network to give us a more educated 1903 01:17:45,200 --> 01:17:46,117 simulation here. 1904 01:17:46,117 --> 01:17:47,700 But we're using the fast one, obviously, 1905 01:17:47,700 --> 01:17:49,990 to save some computation time. 1906 01:17:49,990 --> 01:17:53,810 It's giving us probably a more indicative random simulation 1907 01:17:53,810 --> 01:17:55,920 of what's going to actually happen. 1908 01:17:55,920 --> 01:17:58,830 And then we also combine that with our value network output. 1909 01:17:58,830 --> 01:18:01,070 So we run our value network on this node, as well. 1910 01:18:01,070 --> 01:18:02,695 And we add that to our simulation value 1911 01:18:02,695 --> 01:18:03,800 and we propagate it. 1912 01:18:03,800 --> 01:18:06,440 Interestingly, the AlphaGo team tested out 1913 01:18:06,440 --> 01:18:09,140 just using the fast policy simulation value 1914 01:18:09,140 --> 01:18:11,180 and scrapping the value network. 1915 01:18:11,180 --> 01:18:13,160 And they also just used the value network 1916 01:18:13,160 --> 01:18:14,720 and scrapped the simulation value. 1917 01:18:14,720 --> 01:18:17,030 And those both performed worse than if it had both. 1918 01:18:17,030 --> 01:18:19,625 And another interesting point here 1919 01:18:19,625 --> 01:18:22,630 is that these two factors in our value 1920 01:18:22,630 --> 01:18:24,130 have about the same weight. 1921 01:18:24,130 --> 01:18:27,970 They were both about equally important. 1922 01:18:27,970 --> 01:18:29,635 I think I'll get into that later.
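[Editor's sketch] The leaf evaluation described here, a value-network estimate combined with a fast-policy simulation result at about equal weight and then propagated, can be sketched as a weighted average followed by a backup. The names `leaf_value`, `backpropagate`, and `mix` are hypothetical, and nodes are plain dictionaries for brevity. The default `mix = 0.5` is an assumption chosen to match "about the same weight". Flipping the sign of the value between the two players' turns is left out to keep the sketch short.

```python
def leaf_value(network_value, rollout_outcome, mix=0.5):
    """Blend the value network's estimate with the fast-policy rollout result."""
    return (1.0 - mix) * network_value + mix * rollout_outcome


def backpropagate(path, value):
    """Add the leaf value to every node visited on the way down to the leaf."""
    for node in path:
        node["visits"] += 1
        node["value_sum"] += value


if __name__ == "__main__":
    root = {"visits": 0, "value_sum": 0.0}
    child = {"visits": 0, "value_sum": 0.0}
    # Hypothetical inputs: the network estimates a 0.77 win probability
    # and the fast rollout from this leaf ended in a win (1.0).
    value = leaf_value(network_value=0.77, rollout_outcome=1.0)
    backpropagate([root, child], value)
    print(value, root, child)
```

Setting `mix` to 0 or 1 reproduces the two ablations mentioned in the talk: value network only, or simulation only.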
1923 01:18:29,635 --> 01:18:30,410 But first-- 1924 01:18:30,410 --> 01:18:32,160 AUDIENCE: Can I just ask a quick question? 1925 01:18:32,160 --> 01:18:33,980 PROFESSOR 3: Yeah. 1926 01:18:33,980 --> 01:18:36,230 AUDIENCE: So when you said the policy network is used, 1927 01:18:36,230 --> 01:18:38,240 is that used when you're navigating through the tree 1928 01:18:38,240 --> 01:18:40,350 to get to a leaf, or is the policy network 1929 01:18:40,350 --> 01:18:43,470 being used to do the simulation once you're 1930 01:18:43,470 --> 01:18:46,340 at the leaf, or both? 1931 01:18:46,340 --> 01:18:49,177 PROFESSOR 3: The slow policy is used for this part. 1932 01:18:49,177 --> 01:18:51,176 Then the fast policy is used for the simulation. 1933 01:18:51,176 --> 01:18:54,455 Because the slow policy does take 1,500 times faster than-- 1934 01:18:54,455 --> 01:18:58,052 or the slow takes 1,500 times longer than the fast policy. 1935 01:18:58,052 --> 01:19:00,010 You don't want to use that in your simulations. 1936 01:19:00,010 --> 01:19:01,927 That would just take way too long. 1937 01:19:01,927 --> 01:19:03,510 It's basically just a way of making it 1938 01:19:03,510 --> 01:19:05,260 so our simulation isn't completely random. 1939 01:19:05,260 --> 01:19:06,755 It has some educated moves. 1940 01:19:09,490 --> 01:19:11,392 Why use policy and value network synergy? 1941 01:19:11,392 --> 01:19:13,100 Why can't we just use the policy network? 1942 01:19:13,100 --> 01:19:15,070 Why can't we just use the value network? 1943 01:19:15,070 --> 01:19:16,920 If we have the value network alone, 1944 01:19:16,920 --> 01:19:18,572 we'll actually-- here's a side point. 1945 01:19:18,572 --> 01:19:20,030 Remember, the value network learned 1946 01:19:20,030 --> 01:19:21,320 from the policy network. 1947 01:19:21,320 --> 01:19:23,140 And then also, later on, the policy network 1948 01:19:23,140 --> 01:19:26,070 is improved by our values. 1949 01:19:26,070 --> 01:19:27,400 They work hand-in-hand.
1950 01:19:27,400 --> 01:19:29,114 But if we had the value network alone, 1951 01:19:29,114 --> 01:19:30,780 when we're deciding on the next move, 1952 01:19:30,780 --> 01:19:33,113 we're going to have to evaluate every single move, which 1953 01:19:33,113 --> 01:19:34,510 would take forever. 1954 01:19:34,510 --> 01:19:36,010 And so, what the policy network does 1955 01:19:36,010 --> 01:19:41,040 is project the best move with a probability distribution. 1956 01:19:41,040 --> 01:19:43,110 And it narrows our search space. 1957 01:19:43,110 --> 01:19:45,010 And then, if we had the policy network alone, 1958 01:19:45,010 --> 01:19:48,366 we'd be unable to compare nodes in different parts of our tree. 1959 01:19:48,366 --> 01:19:50,450 The policy network is able to tell us 1960 01:19:50,450 --> 01:19:52,370 a distribution over which move we should 1961 01:19:52,370 --> 01:19:54,230 take from a certain node. 1962 01:19:54,230 --> 01:19:57,350 But then, if I ask it if I'm in a better position 1963 01:19:57,350 --> 01:19:59,724 here than in some other place, it won't know. 1964 01:19:59,724 --> 01:20:01,390 That's where the value network comes in. 1965 01:20:01,390 --> 01:20:06,140 It will give us an estimated number, a value assigned 1966 01:20:06,140 --> 01:20:08,570 as an overall evaluation of that node. 1967 01:20:08,570 --> 01:20:10,760 And then these values are later used 1968 01:20:10,760 --> 01:20:12,860 to direct our tree search based 1969 01:20:12,860 --> 01:20:16,360 on updating the policy once it realizes, 1970 01:20:16,360 --> 01:20:19,470 oh, I thought this would be a good path but the value is 1971 01:20:19,470 --> 01:20:23,240 this, so update all that. 1972 01:20:23,240 --> 01:20:25,440 Then why do we combine neural networks with MCTS? 1973 01:20:25,440 --> 01:20:27,500 Remember, the policy network alone 1974 01:20:27,500 --> 01:20:31,000 played against Pachi, which was purely MCTS, 1975 01:20:31,000 --> 01:20:33,000 and it did pretty well.
1976 01:20:33,000 --> 01:20:37,220 So how does MCTS improve our policy network? 1977 01:20:37,220 --> 01:20:42,055 Remember, MCTS did win 15% of those games. 1978 01:20:42,055 --> 01:20:44,900 So already, that makes you think there's something there 1979 01:20:44,900 --> 01:20:47,145 that maybe the policy network is missing. 1980 01:20:47,145 --> 01:20:49,220 Also, the policy network is just a prediction. 1981 01:20:49,220 --> 01:20:51,410 So by using this tree structure, we're 1982 01:20:51,410 --> 01:20:57,730 able to use these Monte Carlo rollouts to adjust our policy 1983 01:20:57,730 --> 01:21:01,520 to move towards nodes that are actually evaluated to be good. 1984 01:21:01,520 --> 01:21:03,960 And then, how do neural networks improve MCTS? 1985 01:21:03,960 --> 01:21:06,280 The point should probably be clear by now. 1986 01:21:06,280 --> 01:21:09,930 We're able to more intelligently lead our tree exploration. 1987 01:21:09,930 --> 01:21:13,420 Our simulations are more reflective of actual games. 1988 01:21:13,420 --> 01:21:17,530 And the value network and our simulation value 1989 01:21:17,530 --> 01:21:21,400 are complementary, which I've mentioned before. 1990 01:21:21,400 --> 01:21:25,150 And just to highlight that, basically, the value network 1991 01:21:25,150 --> 01:21:27,910 is going to give us a value that reflects 1992 01:21:27,910 --> 01:21:30,680 what would happen if we played the slow policy the whole time. 1993 01:21:30,680 --> 01:21:35,170 And the simulation is what happens if we use the faster policy. 1994 01:21:35,170 --> 01:21:38,070 So they are complementary. 1995 01:21:38,070 --> 01:21:39,710 And I know I'm over time. 1996 01:21:39,710 --> 01:21:44,390 So I just wanted to skim through the stats real quick. 1997 01:21:44,390 --> 01:21:47,395 Distributed AlphaGo won 77% of the games 1998 01:21:47,395 --> 01:21:49,039 against regular AlphaGo. 1999 01:21:49,039 --> 01:21:51,080 So it's the only thing that beat regular AlphaGo.
2000 01:21:51,080 --> 01:21:54,250 And then distributed AlphaGo won 100% of the games 2001 01:21:54,250 --> 01:21:55,130 against all these. 2002 01:21:55,130 --> 01:21:57,720 In a rematch against Pachi, now that we've added MCTS 2003 01:21:57,720 --> 01:21:59,886 to our policy network and we have our value network, 2004 01:21:59,886 --> 01:22:03,170 we slaughtered Pachi 100%. 2005 01:22:03,170 --> 01:22:05,460 Then we decided to see how we fare against humans. 2006 01:22:05,460 --> 01:22:08,540 And by we, I mean not me, I mean Google. 2007 01:22:08,540 --> 01:22:11,190 And they won 4 to 1. 2008 01:22:11,190 --> 01:22:14,680 And Lee Sedol's rating was 3,520. 2009 01:22:14,680 --> 01:22:17,880 Now AlphaGo's rating is estimated to be about 3,586. 2010 01:22:17,880 --> 01:22:19,960 So you're like, whoo, we beat the best dude. 2011 01:22:19,960 --> 01:22:22,180 Except we didn't because there's another dude 2012 01:22:22,180 --> 01:22:31,320 who has an even higher score, apparently, 3,621. 2013 01:22:31,320 --> 01:22:32,970 This should be the last part. 2014 01:22:32,970 --> 01:22:34,750 Here's this timeline. 2015 01:22:34,750 --> 01:22:39,410 Basically, tic-tac-toe, checkers were conquered in '50. 2016 01:22:39,410 --> 01:22:42,155 About 40 years later, we conquered checkers, chess. 2017 01:22:42,155 --> 01:22:45,800 Then we scroll down to 2015, which is when 2018 01:22:45,800 --> 01:22:48,065 AlphaGo was able to beat Fan Hui, who 2019 01:22:48,065 --> 01:22:51,340 was a two-dan player, which is considered lower down 2020 01:22:51,340 --> 01:22:54,425 in the tier of professional Go. 2021 01:22:54,425 --> 01:22:56,470 But then, Lee Sedol was a nine-dan player. 2022 01:22:56,470 --> 01:23:00,187 And AlphaGo was able to beat him literally last month. 2023 01:23:00,187 --> 01:23:01,520 PROFESSOR WILLIAMS: So good job. 2024 01:23:01,520 --> 01:23:02,520 PROFESSOR 3: We're done. 2025 01:23:02,520 --> 01:23:03,800 [APPLAUSE]