1
00:00:01,580 --> 00:00:03,920
The following content is
provided under a Creative

2
00:00:03,920 --> 00:00:05,340
Commons license.

3
00:00:05,340 --> 00:00:07,550
Your support will help
MIT OpenCourseWare

4
00:00:07,550 --> 00:00:11,640
continue to offer high quality
educational resources for free.

5
00:00:11,640 --> 00:00:14,180
To make a donation or to
view additional materials

6
00:00:14,180 --> 00:00:18,110
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:18,110 --> 00:00:19,090
at ocw.mit.edu.

8
00:00:22,420 --> 00:00:24,370
PROFESSOR WILLIAMS: OK,
so today's lecture--

9
00:00:27,010 --> 00:00:31,380
we're going to be talking about
probabilistic planning later,

10
00:00:31,380 --> 00:00:33,310
and in these cases
where you're planning

11
00:00:33,310 --> 00:00:36,570
in large state spaces
is very difficult.

12
00:00:36,570 --> 00:00:38,740
You do the MDP planning.

13
00:00:38,740 --> 00:00:41,690
It could be [? stress that ?]
activity planning, or the like.

14
00:00:41,690 --> 00:00:43,250
But you have to be
able to figure out

15
00:00:43,250 --> 00:00:44,950
how to deal with
these state spaces.

16
00:00:44,950 --> 00:00:48,160
So Monte Carlo tree search
is one of the techniques

17
00:00:48,160 --> 00:00:51,040
that people have identified,
over the last five years,

18
00:00:51,040 --> 00:00:54,667
as having an amazing performance
improvement over other kinds

19
00:00:54,667 --> 00:00:56,120
of sample-based approaches.

20
00:00:56,120 --> 00:00:58,510
So it is very interesting
from that standpoint.

21
00:00:58,510 --> 00:01:00,845
And then if we [? link it to ?]
the last lecture,

22
00:01:00,845 --> 00:01:02,980
then the combination
of something,

23
00:01:02,980 --> 00:01:07,370
we just learned about [INAUDIBLE]
and combine it with search,

24
00:01:07,370 --> 00:01:11,035
is very powerful, in this case,
through the state-of-the-art

25
00:01:11,035 --> 00:01:15,472
techniques for that, as much as
tree search [INAUDIBLE] later

26
00:01:15,472 --> 00:01:20,610
[INAUDIBLE]

27
00:01:20,610 --> 00:01:22,110
PROFESSOR 2: Good
morning, everyone.

28
00:01:22,110 --> 00:01:24,411
As Professor Williams
just said, we

29
00:01:24,411 --> 00:01:26,910
are going to be talking about
Monte Carlo tree search today.

30
00:01:26,910 --> 00:01:30,102
My name is Eann
and I'll be leading

31
00:01:30,102 --> 00:01:32,310
the introduction and motivation
of this presentation.

32
00:01:32,310 --> 00:01:34,420
By the end of this
presentation, you

33
00:01:34,420 --> 00:01:36,890
will know not only why we
care about Monte Carlo tree

34
00:01:36,890 --> 00:01:37,390
search.

35
00:01:37,390 --> 00:01:39,930
As Professor Williams said,
there's so many algorithms

36
00:01:39,930 --> 00:01:40,680
out there.

37
00:01:40,680 --> 00:01:43,440
Why do we care about
this specific one?

38
00:01:43,440 --> 00:01:46,260
And second, we'll be
going through the pros

39
00:01:46,260 --> 00:01:49,620
and cons of MCTS, as well
as the algorithm itself.

40
00:01:49,620 --> 00:01:52,440
And then lastly, we will
have a pretty cool demo

41
00:01:52,440 --> 00:01:55,650
on how it's applied to Super
Mario Brothers and the latest

42
00:01:55,650 --> 00:02:01,350
AlphaGo AI that beat
the second best leading Go

43
00:02:01,350 --> 00:02:03,520
player in the world.

44
00:02:03,520 --> 00:02:05,820
So the outline for
today's presentation

45
00:02:05,820 --> 00:02:08,220
is, first, we're going to talk
about pre-MCTS algorithms.

46
00:02:08,220 --> 00:02:11,130
There are other algorithms
that currently exist out there,

47
00:02:11,130 --> 00:02:15,600
and just a few of them to lead
into why we do care about MCTS

48
00:02:15,600 --> 00:02:18,510
and why these other
algorithms fail.

49
00:02:18,510 --> 00:02:20,850
And second, we'll talk about
Monte Carlo tree search

50
00:02:20,850 --> 00:02:21,890
itself with Yo.

51
00:02:21,890 --> 00:02:25,210
And lastly, Nick will tell you
more about the applications

52
00:02:25,210 --> 00:02:27,300
of Monte Carlo tree search.

53
00:02:27,300 --> 00:02:31,800
So the motivation of
these kinds of algorithms

54
00:02:31,800 --> 00:02:33,440
is we want to be
able to play games

55
00:02:33,440 --> 00:02:36,972
and we want to be able to create
programs to play these games,

56
00:02:36,972 --> 00:02:38,430
but we want to play
them optimally.

57
00:02:38,430 --> 00:02:40,880
We want to be able
to win, but we also

58
00:02:40,880 --> 00:02:43,890
want to be able to do this in
a reasonable amount of time.

59
00:02:43,890 --> 00:02:45,965
So these three constraints
themselves lead

60
00:02:45,965 --> 00:02:47,680
to different kinds
of algorithms,

61
00:02:47,680 --> 00:02:50,600
and different algorithms
with different complexities

62
00:02:50,600 --> 00:02:53,420
and time, or times to search.

63
00:02:53,420 --> 00:02:55,025
And so that's why
today we're going

64
00:02:55,025 --> 00:02:57,140
to be talking about Monte
Carlo tree search.

65
00:02:57,140 --> 00:03:00,614
And you'll figure out in a
few slides why we do care.

66
00:03:00,614 --> 00:03:02,900
So these are the types
of games we have.

67
00:03:02,900 --> 00:03:04,300
You have this
chart where there's

68
00:03:04,300 --> 00:03:07,940
fully observable games,
partially observable games,

69
00:03:07,940 --> 00:03:10,450
deterministic, and
games of chance.

70
00:03:10,450 --> 00:03:13,240
And so today, the games
that we care about

71
00:03:13,240 --> 00:03:16,790
are the games that are fully
observable and deterministic.

72
00:03:16,790 --> 00:03:21,470
And these games are games like
chess and checkers and Go.

73
00:03:21,470 --> 00:03:23,580
And we'll also be talking
about another example

74
00:03:23,580 --> 00:03:25,730
with Tic-tac-toe.

75
00:03:25,730 --> 00:03:29,280
So these pre-MCTS
algorithms apply to

76
00:03:29,280 --> 00:03:32,660
deterministic, fully observable
games, like we said earlier.

77
00:03:32,660 --> 00:03:36,510
And the idea of this, and the
nice thing about these games,

78
00:03:36,510 --> 00:03:38,760
is that they have
perfect information,

79
00:03:38,760 --> 00:03:41,180
and that you have
all of the states

80
00:03:41,180 --> 00:03:45,650
that you need and there's
no opportunity for chance.

81
00:03:45,650 --> 00:03:47,615
And so the idea is
that we can construct

82
00:03:47,615 --> 00:03:50,240
a tree that contains
all possible outcomes

83
00:03:50,240 --> 00:03:52,990
because everything
is fully determined.

84
00:03:52,990 --> 00:03:55,250
And so one of these
algorithms, to address this,

85
00:03:55,250 --> 00:03:58,960
is the algorithm Minimax, which
you might have heard of before.

86
00:03:58,960 --> 00:04:00,600
And the idea of
Minimax is to minimize

87
00:04:00,600 --> 00:04:02,317
the maximum possible loss.

88
00:04:02,317 --> 00:04:04,150
That sounds a little
weird in the beginning,

89
00:04:04,150 --> 00:04:06,540
but if you take a
look at this tree,

90
00:04:06,540 --> 00:04:08,440
this red dot, for
example, is the computer.

91
00:04:08,440 --> 00:04:11,730
And so in the computer's eyes,
it wants to beat its opponent.

92
00:04:11,730 --> 00:04:14,570
And we're assuming the
opponent wants to win also,

93
00:04:14,570 --> 00:04:16,815
so they're playing
their best game as well.

94
00:04:16,815 --> 00:04:21,990
And so the computer wants to
maximize his or her points,

95
00:04:21,990 --> 00:04:25,820
but also knowing that the
opponent, or the human,

96
00:04:25,820 --> 00:04:29,870
wants to maximize
their own win as well.

97
00:04:29,870 --> 00:04:31,730
And so in the
computer's eyes, it

98
00:04:31,730 --> 00:04:34,000
wants to minimize the
maximum possible loss.

99
00:04:34,000 --> 00:04:37,038
Does that make
sense to everyone?

100
00:04:37,038 --> 00:04:38,012
Yes?

101
00:04:38,012 --> 00:04:39,480
OK.

102
00:04:39,480 --> 00:04:41,310
And so in the
example of Minimax,

103
00:04:41,310 --> 00:04:42,810
we're going to start
with a connect,

104
00:04:42,810 --> 00:04:45,450
or a Tic-tac-toe board,
where the computer is

105
00:04:45,450 --> 00:04:49,230
this board right here, and
the blue Tic-tac-toe boards

106
00:04:49,230 --> 00:04:52,236
are the states that the
computer finally chooses.

107
00:04:52,236 --> 00:04:55,030
It's anticipating the
moves a human could play.

108
00:04:57,880 --> 00:04:59,820
So if you take a
look up here, here's

109
00:04:59,820 --> 00:05:02,292
the current state of the board.

110
00:05:02,292 --> 00:05:03,700
The current state of the board.

111
00:05:03,700 --> 00:05:09,380
And the possible options for the
human are this guy, this guy.

112
00:05:09,380 --> 00:05:09,880
Nope.

113
00:05:09,880 --> 00:05:11,820
Possible options
for the computer,

114
00:05:11,820 --> 00:05:13,540
we have three different options.

115
00:05:13,540 --> 00:05:16,200
And so you'll notice that this
is clearly the obvious winner.

116
00:05:16,200 --> 00:05:18,180
But in the case
of Minimax, it goes

117
00:05:18,180 --> 00:05:19,710
through the entire
tree, which is

118
00:05:19,710 --> 00:05:21,270
different from
depth-first search.

119
00:05:21,270 --> 00:05:24,520
It goes through the entire
tree until it finds the winning

120
00:05:24,520 --> 00:05:30,460
move, the one that minimizes
the maximum possible points

121
00:05:30,460 --> 00:05:31,800
the opponent could win.
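
The Minimax search described above can be sketched in a few lines. This is only an illustrative sketch, not code from the lecture: the board encoding (a 9-character string of 'X', 'O' and spaces), the scoring (+1, 0, -1 from X's point of view) and the names `winner`, `minimax` and `best_move` are all assumptions made for the example.

```python
from functools import lru_cache

# All eight three-in-a-row lines on a 3x3 board indexed 0..8.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def minimax(board, player):
    """Value of `board` for X (+1 win, 0 draw, -1 loss), `player` to move."""
    w = winner(board)
    if w is not None:
        return 1 if w == 'X' else -1
    if ' ' not in board:
        return 0  # full board, nobody won: draw
    nxt = 'O' if player == 'X' else 'X'
    values = [minimax(board[:i] + player + board[i + 1:], nxt)
              for i, cell in enumerate(board) if cell == ' ']
    # X maximizes its value; O, assumed to play its best game, minimizes it.
    return max(values) if player == 'X' else min(values)


def best_move(board, player):
    """Index of the empty square that Minimax prefers for `player`."""
    nxt = 'O' if player == 'X' else 'X'
    moves = [i for i, cell in enumerate(board) if cell == ' ']

    def score(i):
        return minimax(board[:i] + player + board[i + 1:], nxt)

    return max(moves, key=score) if player == 'X' else min(moves, key=score)
```

Calling `minimax(' ' * 9, 'X')` walks the whole game tree, which is feasible here only because Tic-tac-toe is tiny.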

122
00:05:31,800 --> 00:05:34,080
So is there a way we
can make this better?

123
00:05:34,080 --> 00:05:34,620
Yes.

124
00:05:34,620 --> 00:05:36,720
I'm sure you've
heard about pruning,

125
00:05:36,720 --> 00:05:39,900
where, in our human
intuition, it makes sense.

126
00:05:39,900 --> 00:05:41,790
Well, why don't we
just stop when we win,

127
00:05:41,790 --> 00:05:43,940
or when we know
we're going to have

128
00:05:43,940 --> 00:05:47,116
a game that allows us to win?

129
00:05:47,116 --> 00:05:49,940
And so this idea is the
idea of simple pruning.

130
00:05:49,940 --> 00:05:54,250
And so when we combine Minimax
and simple pruning, we have--

131
00:05:54,250 --> 00:05:54,750
anyone know?

132
00:05:57,612 --> 00:05:58,570
AUDIENCE: Alpha, beta.

133
00:05:58,570 --> 00:05:59,278
PROFESSOR 3: Yes.

134
00:05:59,278 --> 00:06:02,915
Our 6.034 head TA
knows about this.

135
00:06:02,915 --> 00:06:05,800
We have alpha-beta pruning,
where we prune away any

136
00:06:05,800 --> 00:06:09,090
branches that cannot
influence the final decision.

137
00:06:09,090 --> 00:06:13,100
So in other words, you wouldn't
keep exploring the tree

138
00:06:13,100 --> 00:06:15,595
if you already knew that
a previous move would

139
00:06:15,595 --> 00:06:16,890
allow you to win.

140
00:06:16,890 --> 00:06:19,630
And so this idea in
alpha-beta pruning,

141
00:06:19,630 --> 00:06:21,150
we have an alpha and a beta.

142
00:06:21,150 --> 00:06:24,740
And so the details
aren't important

143
00:06:24,740 --> 00:06:26,850
for you to know right
now, but the idea

144
00:06:26,850 --> 00:06:29,490
is that we stop whenever
we know we don't

145
00:06:29,490 --> 00:06:31,930
need to go on any further.
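
The pruning idea described above can be sketched as follows. This is a hedged illustration rather than the lecture's own code: the game tree is assumed to be a nested Python list whose leaves are numeric scores for the maximizing player, and the `visited` list is an extra added only to show which leaves were actually evaluated.

```python
import math


def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value of `node`, skipping branches that cannot change the result."""
    if visited is None:
        visited = []
    if not isinstance(node, list):   # leaf: a numeric score
        visited.append(node)
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)    # best score the maximizer can force so far
            if alpha >= beta:
                break                    # the minimizer would never allow this branch
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)          # best score the minimizer can force so far
        if alpha >= beta:
            break                        # the maximizer already has something better
    return value


# Example: after the left subtree yields 3, the leaf 2 in the right subtree
# shows that subtree is worth at most 2, so the leaf 9 is never evaluated.
seen = []
root_value = alphabeta([[3, 5], [2, 9]], True, visited=seen)
```

In this small example `root_value` is 3 and `seen` holds only three of the four leaves.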

146
00:06:31,930 --> 00:06:34,434
So in games like
Tic-tac-toe

147
00:06:34,434 --> 00:06:37,380
and Connect 4 and chess,
we have a relatively low

148
00:06:37,380 --> 00:06:38,800
branching factor.

149
00:06:38,800 --> 00:06:41,130
So in the case of
Tic-tac-toe, we have a

150
00:06:41,130 --> 00:06:43,720
branching factor of about 4.

151
00:06:43,720 --> 00:06:46,230
But what if we have really
large branching factors,

152
00:06:46,230 --> 00:06:47,640
like Go?

153
00:06:47,640 --> 00:06:50,440
In Go, we
have about 250.

154
00:06:50,440 --> 00:06:53,760
Do you think that Minimax,
or even alpha-beta pruning,

155
00:06:53,760 --> 00:06:57,140
would be an optimal
algorithm for this?

156
00:06:57,140 --> 00:06:59,169
The answer is?

157
00:06:59,169 --> 00:06:59,710
AUDIENCE: No.

158
00:06:59,710 --> 00:07:00,376
PROFESSOR 3: No.

159
00:07:00,376 --> 00:07:04,370
And this leads us
to our next section.

160
00:07:04,370 --> 00:07:08,210
Our goal is to talk about
how we can use the Monte Carlo

161
00:07:08,210 --> 00:07:11,210
tree search algorithm for
games with really high

162
00:07:11,210 --> 00:07:16,120
branching factors, and using
the random extension to allow us

163
00:07:16,120 --> 00:07:21,490
to see, ultimately, how
AlphaGo, which is Google's AI,

164
00:07:21,490 --> 00:07:25,843
was able to beat the leading
Go player in the world.

165
00:07:29,140 --> 00:07:31,024
PROFESSOR 3: All right, guys.

166
00:07:31,024 --> 00:07:34,000
So this is the part
where we really explain

167
00:07:34,000 --> 00:07:35,410
the algorithm itself.

168
00:07:35,410 --> 00:07:37,240
And before we dive
into this, I want

169
00:07:37,240 --> 00:07:38,860
to make something
really clear, which

170
00:07:38,860 --> 00:07:41,470
is that because these
are technical details

171
00:07:41,470 --> 00:07:43,700
and because we actually
want you to understand them,

172
00:07:43,700 --> 00:07:45,760
and because I definitely didn't
understand this the first three

173
00:07:45,760 --> 00:07:46,920
times I read the paper.

174
00:07:46,920 --> 00:07:49,420
I really want you to feel
free to ask any questions

175
00:07:49,420 --> 00:07:53,590
on your mind, with the knowledge
that, in my experience,

176
00:07:53,590 --> 00:07:56,492
it is very rare that someone
asks a question in class that's

177
00:07:56,492 --> 00:08:00,350
[INAUDIBLE] OK, so really,
whenever you have one.

178
00:08:00,350 --> 00:08:01,630
OK.

179
00:08:01,630 --> 00:08:04,130
So why are we doing this?

180
00:08:04,130 --> 00:08:06,860
Well, the ideal
goal behind MCTS is

181
00:08:06,860 --> 00:08:09,160
that we want to
selectively build up

182
00:08:09,160 --> 00:08:10,910
different parts of the tree.

183
00:08:10,910 --> 00:08:16,630
So the depth-first search
way, the exhaustive search,

184
00:08:16,630 --> 00:08:19,270
would have us exploring
the entire game tree,

185
00:08:19,270 --> 00:08:21,480
and that our depth
is limited by looking

186
00:08:21,480 --> 00:08:23,630
at all the possible
nodes of that level.

187
00:08:23,630 --> 00:08:25,270
But what we want is we want--

188
00:08:25,270 --> 00:08:28,350
because the amount of
computation required for that

189
00:08:28,350 --> 00:08:30,080
explodes really quickly

190
00:08:30,080 --> 00:08:32,373
with the number of moves
that you're basically

191
00:08:32,373 --> 00:08:33,789
looking into the
future. We want

192
00:08:33,789 --> 00:08:37,495
to be able to search selectively
in certain parts of the tree.

193
00:08:37,495 --> 00:08:41,230
And so for example, if there are
less promising parts over here,

194
00:08:41,230 --> 00:08:44,290
then we care less about looking
into the future of those areas.

195
00:08:44,290 --> 00:08:46,030
But if we have a certain move--

196
00:08:46,030 --> 00:08:48,050
in chess, for example,
there's a certain move

197
00:08:48,050 --> 00:08:49,670
where in two moves, you're
going to be able to take

198
00:08:49,670 --> 00:08:50,545
the opponent's queen.

199
00:08:50,545 --> 00:08:52,412
You really want
to search that region

200
00:08:52,412 --> 00:08:53,870
and figure out
whether that's going

201
00:08:53,870 --> 00:08:58,130
to end up being a significantly
positive move for me.

202
00:08:58,130 --> 00:09:00,230
And so the whole
goal of our algorithm

203
00:09:00,230 --> 00:09:02,977
is going to be growing
this asymmetric tree.

204
00:09:02,977 --> 00:09:03,810
How does that sound?

205
00:09:06,820 --> 00:09:08,700
OK, great.

206
00:09:08,700 --> 00:09:11,210
So how do we actually do this?

207
00:09:11,210 --> 00:09:13,200
We're going to go over
a high-level outline,

208
00:09:13,200 --> 00:09:14,800
but before we do
that, let's talk

209
00:09:14,800 --> 00:09:16,400
about our tree,
which you're going

210
00:09:16,400 --> 00:09:17,483
to get very familiar with.

211
00:09:20,250 --> 00:09:24,710
Can people see that this
is red and this is blue?

212
00:09:24,710 --> 00:09:28,850
So this is our game state
when we start our game.

213
00:09:28,850 --> 00:09:32,570
We can be given a Tic-tac-toe
board with a [INAUDIBLE] place,

214
00:09:32,570 --> 00:09:35,780
a game of chess with the pieces
configured a certain way.

215
00:09:35,780 --> 00:09:38,420
And so our player,
which is the computer,

216
00:09:38,420 --> 00:09:41,070
has three separate
moves that it can take.

217
00:09:41,070 --> 00:09:43,560
And so each of those moves
are represented by a node.

218
00:09:43,560 --> 00:09:48,170
And each of those moves have
response moves by the opponent.

219
00:09:48,170 --> 00:09:50,870
So you can imagine
that if one of these

220
00:09:50,870 --> 00:09:53,730
is a Tic-tac-toe board with
just a circle, that one of these

221
00:09:53,730 --> 00:09:57,440
is with that circle and
an X placed right by it.

222
00:09:57,440 --> 00:10:00,620
And as you go down
this tree,

223
00:10:00,620 --> 00:10:02,840
you start understanding
basically,

224
00:10:02,840 --> 00:10:06,260
it's the way that humans think
about playing these games.

225
00:10:06,260 --> 00:10:10,160
If I go here, then
what if they go there,

226
00:10:10,160 --> 00:10:12,280
and then what if
I go right here.

227
00:10:12,280 --> 00:10:14,990
You try to think through
the set of future moves

228
00:10:14,990 --> 00:10:17,930
and try to evaluate
whether your move will

229
00:10:17,930 --> 00:10:20,799
be good in the long term sense.

230
00:10:20,799 --> 00:10:23,090
The way that we are going to
expand our tree, as we said,

231
00:10:23,090 --> 00:10:26,464
to create an asymmetric
tree is first of all,

232
00:10:26,464 --> 00:10:28,130
we're going to descend
through the tree.

233
00:10:28,130 --> 00:10:30,296
We're going to start at the
top and, basically,

234
00:10:30,296 --> 00:10:34,560
jump down some sequence of
branches until we figure out

235
00:10:34,560 --> 00:10:38,750
where we're going to place
our new node, which seems

236
00:10:38,750 --> 00:10:39,920
like a key operation here.

237
00:10:39,920 --> 00:10:42,018
To create an asymmetric
tree it's all about how

238
00:10:42,018 --> 00:10:43,707
you [INAUDIBLE].

239
00:10:43,707 --> 00:10:45,290
For example, in this
case, we're going

240
00:10:45,290 --> 00:10:48,580
to pick this sequence of nodes.

241
00:10:48,580 --> 00:10:51,596
And once we get to the bottom
and find a new location,

242
00:10:51,596 --> 00:10:53,680
we're going to
create a new node.

243
00:10:53,680 --> 00:10:55,750
It's not very hard.

244
00:10:55,750 --> 00:10:59,690
Then we're going to simulate
a game from this new node.

245
00:10:59,690 --> 00:11:03,260
And this is the
key part of MCTS.

246
00:11:03,260 --> 00:11:06,296
Once you get to a
new location, what

247
00:11:06,296 --> 00:11:07,670
you're going to
be doing then, is

248
00:11:07,670 --> 00:11:10,465
you're going to be simulating
a game from that new location.

249
00:11:10,465 --> 00:11:11,840
We're going to
talk about how you

250
00:11:11,840 --> 00:11:17,300
go about simulating a game from
this more advanced game state

251
00:11:17,300 --> 00:11:18,907
than what we started out with.

252
00:11:18,907 --> 00:11:20,957
Does anyone have any
questions right now?

253
00:11:20,957 --> 00:11:23,040
We will be going in depth
into all of these steps,

254
00:11:23,040 --> 00:11:24,556
but just in a high level sense.

255
00:11:24,556 --> 00:11:25,420
AUDIENCE: Just a quick question.

256
00:11:25,420 --> 00:11:25,660
PROFESSOR 3: Yeah.

257
00:11:25,660 --> 00:11:27,035
AUDIENCE: To create
the new node,

258
00:11:27,035 --> 00:11:29,617
is it probabilistic, just
creating a new node as the most

259
00:11:29,617 --> 00:11:30,450
probable [INAUDIBLE]

260
00:11:30,450 --> 00:11:31,300
PROFESSOR 3: No, no.

261
00:11:31,300 --> 00:11:32,590
You're creating some new node.

262
00:11:32,590 --> 00:11:34,140
We'll talk about how
we pick that new node,

263
00:11:34,140 --> 00:11:36,806
but we're just making a new node
and we're not thinking anything

264
00:11:36,806 --> 00:11:37,780
about probability.

265
00:11:37,780 --> 00:11:40,030
The next thing is that we're
going to update the tree.

266
00:11:40,030 --> 00:11:43,195
So whatever the value of
the simulation delta was--

267
00:11:43,195 --> 00:11:50,360
delta, remember-- we're going to
propagate that up and basically

268
00:11:50,360 --> 00:11:52,550
add that to all
of the nodes that

269
00:11:52,550 --> 00:11:54,416
are parents of
that node in the tree

270
00:11:54,416 --> 00:11:56,332
and update some information
that goes in there

271
00:11:56,332 --> 00:11:58,090
and that they're storing.

272
00:11:58,090 --> 00:12:00,980
This is going to be good because
it's going to mean that--

273
00:12:00,980 --> 00:12:02,975
it's a lot like in
search algorithms where

274
00:12:02,975 --> 00:12:05,360
you have trees that then
the entirety of the tree

275
00:12:05,360 --> 00:12:07,713
remains up to date with the
information from every given

276
00:12:07,713 --> 00:12:08,642
simulation.

277
00:12:08,642 --> 00:12:10,100
And we're just
going to repeat this

278
00:12:10,100 --> 00:12:11,390
over and over and over again.

279
00:12:11,390 --> 00:12:13,640
And slowly, our
tree will grow out

280
00:12:13,640 --> 00:12:15,946
until whenever we
feel like stopping.

281
00:12:15,946 --> 00:12:17,570
This is actually one
of the nice things

282
00:12:17,570 --> 00:12:22,220
about MCTS, is that whenever
we decide that we're out

283
00:12:22,220 --> 00:12:25,510
of time, like for example, if
you're in a competition playing

284
00:12:25,510 --> 00:12:29,060
a champion Go player, you
can stop the simulation.

285
00:12:29,060 --> 00:12:30,710
And then all you
have to do is pick

286
00:12:30,710 --> 00:12:34,220
between one of the
best first moves

287
00:12:34,220 --> 00:12:35,780
that you're going to make.

288
00:12:35,780 --> 00:12:38,510
Because an the end of
the day, after you're

289
00:12:38,510 --> 00:12:41,010
doing all the simulation,
we're still right here.

290
00:12:41,010 --> 00:12:43,820
And we're still only picking
between the moves that go

291
00:12:43,820 --> 00:12:45,850
immediately where we started.
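
The four steps outlined above (descend, create a node, simulate, update) can be put together in a short sketch. Everything below is an assumption made for illustration and not the lecture's code: the game is a small Nim variant (take 1 to 3 stones, whoever takes the last stone wins), the names `Node`, `legal_moves`, `ucb`, `rollout` and `mcts` are hypothetical, the descent uses the UCB formula that the lecture introduces afterwards with an assumed constant of sqrt(2), and wins are stored for the player who moved into each node rather than for one fixed player as on the slide.

```python
import math
import random


def legal_moves(stones):
    """Moves available with `stones` left: take 1, 2 or 3, never more than remain."""
    return [m for m in (1, 2, 3) if m <= stones]


class Node:
    def __init__(self, stones, mover, parent=None, move=None):
        self.stones = stones               # stones left after `move` was played
        self.mover = mover                 # player (0 or 1) who just moved
        self.parent = parent
        self.move = move
        self.children = []
        self.untried = legal_moves(stones)
        self.wins = 0                      # wins for `mover` in games through this node
        self.visits = 0                    # games played through this node


def ucb(child, c=math.sqrt(2)):
    """Win ratio plus an exploration bonus for rarely visited children."""
    return (child.wins / child.visits
            + c * math.sqrt(math.log(child.parent.visits) / child.visits))


def rollout(stones, mover):
    """Play uniformly random moves to the end and return the winning player."""
    while stones > 0:
        mover = 1 - mover
        stones -= random.choice(legal_moves(stones))
    return mover                           # whoever took the last stone


def mcts(stones, player, iterations=500):
    """Return (chosen move, root node) for `player` to move; needs stones >= 1."""
    root = Node(stones, mover=1 - player)
    for _ in range(iterations):
        node = root
        # 1. Descend while every move at this node already has a child.
        while not node.untried and node.children:
            node = max(node.children, key=ucb)
        # 2. Create one new node, unless the game is already over here.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, 1 - node.mover, node, move)
            node.children.append(child)
            node = child
        # 3. Simulate a game from the new node.
        won_by = rollout(node.stones, node.mover)
        # 4. Update every node on the path back to the root.
        while node is not None:
            node.visits += 1
            if won_by == node.mover:
                node.wins += 1
            node = node.parent
    # When time is up, choose among the root's immediate moves only.
    return max(root.children, key=lambda ch: ch.visits).move, root
```

Picking the most visited first move at the end is one common choice; picking the best win ratio is another. Either way the loop can be stopped after any number of iterations.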

292
00:12:45,850 --> 00:12:47,260
Yeah.

293
00:12:47,260 --> 00:12:50,080
AUDIENCE: Could this
[INAUDIBLE] good tree?

294
00:12:50,080 --> 00:12:52,290
And then on some initial
region of interest,

295
00:12:52,290 --> 00:12:56,151
or is it arbitrary how
you get to create it?

296
00:12:56,151 --> 00:12:57,900
PROFESSOR 3: We'll go
through how you pick

297
00:12:57,900 --> 00:13:00,410
where to descend right now.

298
00:13:00,410 --> 00:13:04,030
I guess, it's any
possible move that starts

299
00:13:04,030 --> 00:13:06,412
at your starting game state.

300
00:13:06,412 --> 00:13:10,480
Does that make-- great.

301
00:13:10,480 --> 00:13:12,970
Before we move on to
the algorithm itself,

302
00:13:12,970 --> 00:13:17,360
let's talk about what we store
in each one of these nodes.

303
00:13:17,360 --> 00:13:19,400
So now we've added
these numbers.

304
00:13:19,400 --> 00:13:22,510
And what these numbers
represent is that nk,

305
00:13:22,510 --> 00:13:25,730
as in the value on the
right, is the number of games

306
00:13:25,730 --> 00:13:28,500
that have been played that
involve a certain node.

307
00:13:28,500 --> 00:13:31,070
So for example, if
I look this node,

308
00:13:31,070 --> 00:13:33,410
that means that
four games have been

309
00:13:33,410 --> 00:13:34,737
played that involve this node.

310
00:13:34,737 --> 00:13:36,820
A game that has been played
that involves the node

311
00:13:36,820 --> 00:13:38,570
just means that
one of the states

312
00:13:38,570 --> 00:13:40,940
of the board at some
point in the game

313
00:13:40,940 --> 00:13:45,480
was the state of the board
that this represents.

314
00:13:45,480 --> 00:13:48,400
For example, if I have a
game that was played here,

315
00:13:48,400 --> 00:13:50,275
if I know that I've
played this once,

316
00:13:50,275 --> 00:13:51,650
then that guarantees
to me that I

317
00:13:51,650 --> 00:13:53,191
played this game
once because this is

318
00:13:53,191 --> 00:13:55,444
a precursor state to this one.

319
00:13:55,444 --> 00:13:56,920
Make sense?

320
00:13:56,920 --> 00:13:57,904
Yeah.

321
00:13:57,904 --> 00:14:00,734
AUDIENCE: How can the two
n's below that node not

322
00:14:00,734 --> 00:14:03,000
add up to a value of [INAUDIBLE]

323
00:14:03,000 --> 00:14:05,960
PROFESSOR 3: That will come when
we start expanding our game.

324
00:14:05,960 --> 00:14:07,180
But that's a great question.

325
00:14:07,180 --> 00:14:10,270
And intuitively
speaking, it should.

326
00:14:10,270 --> 00:14:12,940
AUDIENCE: You're saying you're
storing data from past games

327
00:14:12,940 --> 00:14:13,742
about what we've--

328
00:14:13,742 --> 00:14:14,450
PROFESSOR 3: Yes.

329
00:14:14,450 --> 00:14:15,944
AUDIENCE: --done before.

330
00:14:15,944 --> 00:14:18,360
AUDIENCE: Is it past games outside
of the script simulation?

331
00:14:18,360 --> 00:14:19,360
PROFESSOR 3: No, no, no.

332
00:14:19,360 --> 00:14:21,850
Past games in the
script simulation.

333
00:14:21,850 --> 00:14:23,590
And then the other
value is the number

334
00:14:23,590 --> 00:14:26,724
of wins associated
with a certain node.

335
00:14:26,724 --> 00:14:28,890
And these are going to be
wins for player one, which

336
00:14:28,890 --> 00:14:30,494
is red in this case.

337
00:14:30,494 --> 00:14:32,410
It would get confusing
if we put both of them,

338
00:14:32,410 --> 00:14:34,120
but they're complementary.

339
00:14:34,120 --> 00:14:37,020
So for example, three
out of the four times

340
00:14:37,020 --> 00:14:42,317
that the red player visited this
node, they won in that node.

341
00:14:42,317 --> 00:14:44,650
And these are the two numbers
that we're going to store.

342
00:14:44,650 --> 00:14:46,066
And we're going
to see why they're

343
00:14:46,066 --> 00:14:48,760
significant to store later.

344
00:14:48,760 --> 00:14:52,629
So first, descending is the
key part of our algorithm

345
00:14:52,629 --> 00:14:53,670
that we're talking about.

346
00:14:53,670 --> 00:14:55,900
And when descending,
there are these two

347
00:14:55,900 --> 00:14:59,260
counterbalanced
desires that we have.

348
00:14:59,260 --> 00:15:03,670
The first of them is that
we want to explore really

349
00:15:03,670 --> 00:15:05,410
deeply into our tree.

350
00:15:05,410 --> 00:15:08,650
We want to think about, OK, if
they do this then I'll do this.

351
00:15:08,650 --> 00:15:11,427
And then, well, then I'll do
that, and so on and so forth.

352
00:15:11,427 --> 00:15:13,510
And we want to think through
a long term strategy.

353
00:15:13,510 --> 00:15:16,870
But at the same time, we don't
want to get caught in that.

354
00:15:16,870 --> 00:15:18,700
We want to make
sure that we're not

355
00:15:18,700 --> 00:15:22,750
missing a really promising
other move that we weren't even

356
00:15:22,750 --> 00:15:24,670
considering because we
were really going down

357
00:15:24,670 --> 00:15:27,410
this certain rabbit
hole of the move

358
00:15:27,410 --> 00:15:28,840
that we had thought
about before.

359
00:15:28,840 --> 00:15:33,260
This is illustrated by the
xkcd [INAUDIBLE] SMBC.

360
00:15:33,260 --> 00:15:37,222
The SMBC comic about academia
and how, when someone tells you

361
00:15:37,222 --> 00:15:38,680
that a lot of really
great work has

362
00:15:38,680 --> 00:15:40,346
been done in an area,
that means nothing

363
00:15:40,346 --> 00:15:44,082
about how promising
the future will be.

364
00:15:44,082 --> 00:15:45,790
It's all about expansion
and exploration.

365
00:15:45,790 --> 00:15:47,831
And the way that we're
going to balance expansion

366
00:15:47,831 --> 00:15:49,520
and exploration
in order to create

367
00:15:49,520 --> 00:15:54,083
our really nice asymmetric
tree is the following formula.

368
00:15:54,083 --> 00:15:57,610
And it's fine if that looks
really confusing and messy.

369
00:15:57,610 --> 00:16:03,220
But actually, it breaks down
quite nicely into two parts.

370
00:16:03,220 --> 00:16:04,860
This formula is
known as the UCB.

371
00:16:04,860 --> 00:16:07,600
You don't need to know why it's
the Upper Confidence Bound.

372
00:16:07,600 --> 00:16:09,231
Let's just talk about
what's inside it.

373
00:16:09,231 --> 00:16:11,230
So first of all, you have
this term on the left.

374
00:16:11,230 --> 00:16:14,590
And this term on the left
is the exploitation term.

375
00:16:14,590 --> 00:16:18,030
It's basically proportional
to the likelihood

376
00:16:18,030 --> 00:16:21,050
that the expected number of
times that you're going to win,

377
00:16:21,050 --> 00:16:23,272
given that you are
in a certain node

378
00:16:23,272 --> 00:16:24,730
and that you were
a certain player.

379
00:16:27,334 --> 00:16:29,000
It's basically the
quality of your state

380
00:16:29,000 --> 00:16:30,310
in some abstract level.

381
00:16:30,310 --> 00:16:32,260
If we knew this
perfectly, then we

382
00:16:32,260 --> 00:16:33,760
would be doing
great because that's

383
00:16:33,760 --> 00:16:37,780
the thing we're looking for on
some grand level, the expected

384
00:16:37,780 --> 00:16:39,910
likelihood of winning
from a certain state.

385
00:16:39,910 --> 00:16:42,192
On the other hand, you
have this exploration term.

386
00:16:42,192 --> 00:16:44,150
And you may not be able
to read the font there.

387
00:16:44,150 --> 00:16:45,700
But what this is
basically saying

388
00:16:45,700 --> 00:16:49,150
is that it looks at
the number of games

389
00:16:49,150 --> 00:16:54,580
that I have been played through,
and at the number of games

390
00:16:54,580 --> 00:16:56,470
that my parent has
been played through.

391
00:16:56,470 --> 00:17:00,460
And it tries to preserve those
numbers at a certain ratio,

392
00:17:00,460 --> 00:17:01,910
at a log ratio.

393
00:17:01,910 --> 00:17:06,849
And what that effectively means,
is that the number of times

394
00:17:06,849 --> 00:17:08,200
that I have been--

395
00:17:08,200 --> 00:17:10,490
if I have been visited
relatively few times,

396
00:17:10,490 --> 00:17:14,180
and the denominator is small.

397
00:17:14,180 --> 00:17:16,740
Whereas my parent has been
visited many times, which

398
00:17:16,740 --> 00:17:19,040
means that my siblings have
gotten much more attention,

399
00:17:19,040 --> 00:17:23,140
then the likelihood that I
will be visited again actually

400
00:17:23,140 --> 00:17:24,380
increases.

401
00:17:24,380 --> 00:17:27,480
So this is biased
on the one hand,

402
00:17:27,480 --> 00:17:29,450
towards nodes that
are really promising,

403
00:17:29,450 --> 00:17:32,200
and on the other
hand, towards nodes

404
00:17:32,200 --> 00:17:34,663
that haven't been explored
yet, where there's a gold mine

405
00:17:34,663 --> 00:17:36,996
and all you need to do is dig
a little bit, potentially.

406
00:17:39,650 --> 00:17:42,300
We don't actually have an
analytical expression for this.

407
00:17:42,300 --> 00:17:45,140
But we can approximate
it because you

408
00:17:45,140 --> 00:17:48,150
can think that the expected
value from a certain node

409
00:17:48,150 --> 00:17:51,860
is, roughly speaking,
approximately the ratio of wins

410
00:17:51,860 --> 00:17:54,080
at that node to
the number of times

411
00:17:54,080 --> 00:17:55,898
that that node has
been visited at all.
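
The two parts of the formula can be written out as a small function. This is a sketch under assumptions: the transcript does not show the slide, so the standard UCB1 form is assumed here, with the win ratio as the first term, a square root of log(parent visits) over the node's own visits as the second, and a hypothetical constant `c` defaulting to sqrt(2). The name `ucb_score` is made up for the example.

```python
import math


def ucb_score(wins, visits, parent_visits, c=math.sqrt(2)):
    """Assumed UCB1 score of a node from its win and visit counts."""
    if visits == 0:
        return math.inf                    # an unvisited node is always tried first
    win_ratio = wins / visits              # approximates the expected value of the node
    bonus = c * math.sqrt(math.log(parent_visits) / visits)  # grows when neglected
    return win_ratio + bonus


# Same number of visits, more wins: the first term decides.
more_wins = ucb_score(3, 4, 10) > ucb_score(1, 4, 10)
# Same win ratio, fewer visits: the second term decides.
less_visited = ucb_score(1, 2, 10) > ucb_score(2, 4, 10)
```

Both comparisons come out true, which matches the description of a score biased toward promising nodes on the one hand and toward rarely visited nodes on the other.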

412
00:17:59,560 --> 00:18:01,820
Let's talk about actually
applying this statement.

413
00:18:01,820 --> 00:18:04,153
Because what the statement
is going to give you, is it's

414
00:18:04,153 --> 00:18:06,790
going to give you some number
for here and some number

415
00:18:06,790 --> 00:18:09,140
here, and some number
for here, and so on.

416
00:18:09,140 --> 00:18:10,890
When we start descending
through the tree,

417
00:18:10,890 --> 00:18:12,830
we're going to start
at the top node.

418
00:18:12,830 --> 00:18:15,520
And then we're going
to look at the three

419
00:18:15,520 --> 00:18:17,500
children of that node.

420
00:18:17,500 --> 00:18:19,290
And we're going to
compute this UCB

421
00:18:19,290 --> 00:18:21,560
value for each of
these children and pick

422
00:18:21,560 --> 00:18:23,780
whichever one is the highest.

423
00:18:23,780 --> 00:18:27,650
So just as a thought
for a moment,

424
00:18:27,650 --> 00:18:28,850
what if we ignore this one?

425
00:18:28,850 --> 00:18:31,600
And what if we're just
computing the UCB of these two?

426
00:18:31,600 --> 00:18:35,890
Does anyone have any intuition
on whether the UCB would

427
00:18:35,890 --> 00:18:39,088
be higher for this
node or for this node?

428
00:18:39,088 --> 00:18:40,430
AUDIENCE: The left node.

429
00:18:40,430 --> 00:18:42,170
PROFESSOR 3: The left node?

430
00:18:42,170 --> 00:18:43,040
OK.

431
00:18:43,040 --> 00:18:44,270
So why is that?

432
00:18:44,270 --> 00:18:46,460
AUDIENCE: It has
a win [INAUDIBLE]

433
00:18:46,460 --> 00:18:47,210
PROFESSOR 3: Yeah.

434
00:18:47,210 --> 00:18:47,967
It has a win.

435
00:18:47,967 --> 00:18:49,800
AUDIENCE: And they both
have a [INAUDIBLE]..

436
00:18:49,800 --> 00:18:50,675
PROFESSOR 3: Exactly.

437
00:18:50,675 --> 00:18:53,540
And so clearly, you think the
exploration term is the same

438
00:18:53,540 --> 00:18:56,040
because you know it's not that
one child has been loved less

439
00:18:56,040 --> 00:18:57,950
than the other, but
the exploitation term

440
00:18:57,950 --> 00:18:59,404
is going to be different.

441
00:18:59,404 --> 00:19:01,320
And so it's definitely
going to pick this one.

442
00:19:01,320 --> 00:19:02,850
In this case, what
we're going to say

443
00:19:02,850 --> 00:19:05,475
is actually that this is so much
more promising than the others

444
00:19:05,475 --> 00:19:07,885
that it's actually going
to pick this left node.

445
00:19:07,885 --> 00:19:10,290
And so it's going to expand,
and it's going to look down.

446
00:19:10,290 --> 00:19:11,665
And then when it
looks down, it's

447
00:19:11,665 --> 00:19:13,150
going to compare
between these two.

448
00:19:13,150 --> 00:19:17,290
And this time, remember,
that this is the opponent.

449
00:19:17,290 --> 00:19:22,590
The opponent wants to minimize the
number of wins that we have.

450
00:19:22,590 --> 00:19:24,250
Which means that our
opponent is going

451
00:19:24,250 --> 00:19:29,980
to want to pick the one that
we're less likely to win in

452
00:19:29,980 --> 00:19:31,710
and they're more
likely to win in.

453
00:19:31,710 --> 00:19:34,570
This is the idea of
minimax, minimizing how well

454
00:19:34,570 --> 00:19:36,520
my enemy does in this game.

455
00:19:40,190 --> 00:19:41,910
Although again,
the exploration term

456
00:19:41,910 --> 00:19:44,935
might counterbalance it a little
bit because, technically, this

457
00:19:44,935 --> 00:19:48,024
has been explored more.

458
00:19:48,024 --> 00:19:49,940
We're going to pick the
one on the left again.

459
00:19:49,940 --> 00:19:51,700
And we're going to
get to that location

460
00:19:51,700 --> 00:19:54,480
that we got to originally.

461
00:19:54,480 --> 00:19:57,750
Now when we're comparing
between these two,

462
00:19:57,750 --> 00:19:59,896
between a node that
has been visited once

463
00:19:59,896 --> 00:20:01,520
and a node that has
never been visited,

464
00:20:01,520 --> 00:20:06,121
can anyone guess which one
of these it is going to pick?

465
00:20:06,121 --> 00:20:06,620
Yeah.

466
00:20:06,620 --> 00:20:08,185
AUDIENCE: Never
has been visited.

467
00:20:08,185 --> 00:20:09,310
PROFESSOR 3: Yeah, exactly.

468
00:20:09,310 --> 00:20:11,690
Because this number is zero.

469
00:20:11,690 --> 00:20:14,056
And so if the
parent has ever been

470
00:20:14,056 --> 00:20:16,535
visited but the node hasn't,
this is going to be infinite

471
00:20:16,535 --> 00:20:18,909
and it's going to have to pick
the node that it has never

472
00:20:18,909 --> 00:20:20,512
seen before.

473
00:20:20,512 --> 00:20:22,262
So that's how we descend
through the tree.
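
[Illustration] The descent rule described above can be sketched as follows. This is a minimal sketch that assumes the standard UCB1 form (win ratio plus a visit-count bonus). The function names and the constant `c` are hypothetical; the exact expression on the lecture slide is not reproduced in this transcript.

```python
import math

def ucb(wins, visits, parent_visits, c=math.sqrt(2)):
    """Upper confidence bound for one child (c is an assumed constant)."""
    if visits == 0:
        # Zero in the denominator: treat the value as infinite, so a
        # never-visited child is always picked first.
        return math.inf
    exploitation = wins / visits  # approximate expected value
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

def pick_child(children, parent_visits):
    """children is a list of (wins, visits) pairs; returns the best index."""
    scores = [ucb(w, n, parent_visits) for w, n in children]
    return scores.index(max(scores))

# Two children visited equally often: the one with a win is chosen.
print(pick_child([(1, 2), (0, 2)], 4))  # 0
# A visited child versus a never-visited one: the unvisited one is chosen.
print(pick_child([(1, 1), (0, 0)], 2))  # 1
```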

474
00:20:22,262 --> 00:20:23,886
Does anyone have any
questions on that?

475
00:20:23,886 --> 00:20:25,070
Really, it's totally fine.

476
00:20:25,070 --> 00:20:27,440
We're going to be talking
about this for a while.

477
00:20:27,440 --> 00:20:28,056
Yeah.

478
00:20:28,056 --> 00:20:31,287
AUDIENCE: With the left node
that has the four for n sub k,

479
00:20:31,287 --> 00:20:36,344
wouldn't that be three because
there's two and one below?

480
00:20:36,344 --> 00:20:37,760
PROFESSOR 3: No
because of the way

481
00:20:37,760 --> 00:20:39,468
that we're going to
be updating the tree.

482
00:20:39,468 --> 00:20:41,490
Next, we'll talk about
some [INAUDIBLE]..

483
00:20:41,490 --> 00:20:42,698
AUDIENCE: I like the concept.

484
00:20:42,698 --> 00:20:44,742
But if it's a deterministic
game, why couldn't it

485
00:20:44,742 --> 00:20:46,499
hold its [INAUDIBLE]
pretty strictly?

486
00:20:46,499 --> 00:20:48,040
PROFESSOR 3: That's
a great question.

487
00:20:48,040 --> 00:20:50,606
That's really up to
computer memory limits.

488
00:20:50,606 --> 00:20:54,280
As I think that Leah
mentioned, the number of states

489
00:20:54,280 --> 00:20:55,794
in the game of Go--

490
00:20:55,794 --> 00:20:57,835
it's a 19 by 19 board,
and you can play something

491
00:20:57,835 --> 00:20:58,500
at every state.

492
00:20:58,500 --> 00:21:00,150
It's only like 2 to the--

493
00:21:00,150 --> 00:21:01,150
PROFESSOR 2: [INAUDIBLE]

494
00:21:01,150 --> 00:21:01,340
PROFESSOR 3: What?

495
00:21:01,340 --> 00:21:02,048
PROFESSOR 2: 250.

496
00:21:02,048 --> 00:21:04,460
PROFESSOR 3: 250.

497
00:21:04,460 --> 00:21:07,000
You could never explore
the entire search tree.

498
00:21:07,000 --> 00:21:09,180
AUDIENCE: [INAUDIBLE]
over the first few layers

499
00:21:09,180 --> 00:21:12,010
or are we going polite.

500
00:21:12,010 --> 00:21:14,340
We try to do this real
time where you could

501
00:21:14,340 --> 00:21:15,710
have done something offline.

502
00:21:15,710 --> 00:21:17,330
PROFESSOR 3: It's
definitely true.

503
00:21:17,330 --> 00:21:18,440
If you know a state
that you're going

504
00:21:18,440 --> 00:21:20,814
to arrive at ahead of time,
then you can totally do that.

505
00:21:20,814 --> 00:21:22,420
But in a game
that's large enough

506
00:21:22,420 --> 00:21:25,660
that to do that for
all the possible states

507
00:21:25,660 --> 00:21:29,050
would take that much more time
and take that much more memory.

508
00:21:29,050 --> 00:21:30,970
It doesn't end up
making that much sense.

509
00:21:30,970 --> 00:21:32,550
Also, something
to point out here,

510
00:21:32,550 --> 00:21:34,841
is that for most of the games
that we're talking about,

511
00:21:34,841 --> 00:21:38,730
simulating a run through
of the game is really fast.

512
00:21:38,730 --> 00:21:40,460
So if you think about it--

513
00:21:40,460 --> 00:21:43,170
let's actually get to
that in the next piece.

514
00:21:43,170 --> 00:21:44,890
But the point is
that building up

515
00:21:44,890 --> 00:21:46,885
this many levels of
a tree for a computer

516
00:21:46,885 --> 00:21:50,780
takes probably on the order
of less than a millisecond.

517
00:21:50,780 --> 00:21:55,410
So doing this for a
really, really huge tree,

518
00:21:55,410 --> 00:21:58,504
it's peanuts because they're
such simple operations.

519
00:21:58,504 --> 00:22:00,670
But it will get expensive
when we start building up

520
00:22:00,670 --> 00:22:04,650
the tree to serious depths.

521
00:22:04,650 --> 00:22:08,425
AUDIENCE: But a game like Go,
how many nodes would you have?

522
00:22:08,425 --> 00:22:10,300
PROFESSOR 3: On each
level, in the beginning,

523
00:22:10,300 --> 00:22:12,280
we have something on
the order of 400 nodes.

524
00:22:12,280 --> 00:22:14,580
And we have a depth
of about, I think

525
00:22:14,580 --> 00:22:17,542
most games have up to 250
steps, or something like that.

526
00:22:17,542 --> 00:22:19,750
AUDIENCE: So just to build,
if you go in there blank,

527
00:22:19,750 --> 00:22:21,958
without any nodes built,
you have to in the computer,

528
00:22:21,958 --> 00:22:23,939
like you said, it
hasn't visited a node,

529
00:22:23,939 --> 00:22:26,450
it has to go there before
it descends further.

530
00:22:26,450 --> 00:22:27,782
Basically, like breadth first.

531
00:22:27,782 --> 00:22:30,240
PROFESSOR 3: It's sort of like
breadth first but not quite.

532
00:22:30,240 --> 00:22:31,823
There's an important
distinction here,

533
00:22:31,823 --> 00:22:37,387
which is that it doesn't have
to build up this or this node.

534
00:22:37,387 --> 00:22:39,220
It doesn't have to build
up all of the nodes

535
00:22:39,220 --> 00:22:40,430
at a certain level.

536
00:22:40,430 --> 00:22:44,970
All it has to do is, if it
branches down to a certain sub

537
00:22:44,970 --> 00:22:48,050
region, then it can't
descend in that sub region

538
00:22:48,050 --> 00:22:51,160
below one of its siblings
without having at least looked

539
00:22:51,160 --> 00:22:52,410
once at all its siblings.

540
00:22:52,410 --> 00:22:55,190
After it looks once it
can do whatever it wants.

541
00:22:55,190 --> 00:22:57,130
And the point is,
that it doesn't

542
00:22:57,130 --> 00:22:59,440
mean the tree has to be
kept at an even level.

543
00:22:59,440 --> 00:23:02,551
All it means is that
the tree, in order

544
00:23:02,551 --> 00:23:04,300
to descend on a specific
part of the tree,

545
00:23:04,300 --> 00:23:10,220
it has to have at least visited
direct neighbors once before.

546
00:23:10,220 --> 00:23:12,400
Any more questions
on this before--

547
00:23:12,400 --> 00:23:12,940
Yeah.

548
00:23:12,940 --> 00:23:14,850
AUDIENCE: What's the
advantage necessarily

549
00:23:14,850 --> 00:23:16,779
of having to visit every single?

550
00:23:21,821 --> 00:23:23,320
PROFESSOR 3: The
advantage of having

551
00:23:23,320 --> 00:23:25,740
to visit every single--
the way that I think of it,

552
00:23:25,740 --> 00:23:28,470
is that you don't
want to be missing out

553
00:23:28,470 --> 00:23:32,860
on potentially being interested
in some of the things

554
00:23:32,860 --> 00:23:35,380
and not others.

555
00:23:35,380 --> 00:23:41,690
It comes back to the exploration
versus exploitation distinction.

556
00:23:41,690 --> 00:23:46,050
We do want to descend into
the region of the tree that

557
00:23:46,050 --> 00:23:47,200
is really valuable to us.

558
00:23:47,200 --> 00:23:50,280
But at least have
explored a little bit,

559
00:23:50,280 --> 00:23:51,760
at least maintaining
some baseline,

560
00:23:51,760 --> 00:23:53,820
which really isn't
that costly compared

561
00:23:53,820 --> 00:23:55,120
to the size of the tree.

562
00:23:55,120 --> 00:23:59,444
400 moves is not that bad
compared with 400 to the 250.

563
00:23:59,444 --> 00:24:01,110
AUDIENCE: Are these
simulations, they're

564
00:24:01,110 --> 00:24:02,180
just random simulations?

565
00:24:02,180 --> 00:24:03,835
PROFESSOR 3: We're going to
talk about that in a minute.

566
00:24:03,835 --> 00:24:05,626
Any more questions
before I move on to that?

567
00:24:08,790 --> 00:24:10,280
Next step is expanding.

568
00:24:10,280 --> 00:24:11,280
And this is very simple.

569
00:24:11,280 --> 00:24:15,619
You just create a node and you
set the two initial values.

570
00:24:15,619 --> 00:24:17,160
And the initial
values are the number

571
00:24:17,160 --> 00:24:18,840
of times it's been
visited is zero,

572
00:24:18,840 --> 00:24:20,720
and the number of times that
someone has won from there

573
00:24:20,720 --> 00:24:21,220
is zero.

574
00:24:21,220 --> 00:24:25,020
AUDIENCE: [INAUDIBLE] So
the easy part is solving it.

575
00:24:25,020 --> 00:24:27,180
PROFESSOR 3: Now, simulating.

576
00:24:27,180 --> 00:24:29,320
Simulating is really hard.

577
00:24:29,320 --> 00:24:31,470
You can imagine that if
you get to a single node

578
00:24:31,470 --> 00:24:33,480
and you've never seen
that node before,

579
00:24:33,480 --> 00:24:36,270
and you don't know what to
do from this node onward,

580
00:24:36,270 --> 00:24:39,484
that if we knew how the
game was going to play out,

581
00:24:39,484 --> 00:24:41,150
that is exactly what
were searching for,

582
00:24:41,150 --> 00:24:42,360
and we would be done.

583
00:24:42,360 --> 00:24:43,320
But we don't.

584
00:24:43,320 --> 00:24:47,770
And in fact, we have no idea
how to go about simulating

585
00:24:47,770 --> 00:24:49,790
a realistic game,
and a game that

586
00:24:49,790 --> 00:24:51,990
will tell us something
meaningful about the quality

587
00:24:51,990 --> 00:24:53,410
of a certain state.

588
00:24:53,410 --> 00:24:56,180
And so, as you
correctly guessed,

589
00:24:56,180 --> 00:24:58,560
we're going to do it randomly.

590
00:24:58,560 --> 00:25:00,380
We're going to be
at a certain state.

591
00:25:00,380 --> 00:25:01,960
And then from that
state, we're just

592
00:25:01,960 --> 00:25:04,530
going to pick random moves
for each of the players

593
00:25:04,530 --> 00:25:07,280
until the game ends.

594
00:25:07,280 --> 00:25:11,990
And if we, as player one, win
then we're going to add one.

595
00:25:11,990 --> 00:25:13,980
Then we're going to say
delta equals plus one.

596
00:25:13,980 --> 00:25:18,140
And if we don't win,
or if we tie or lose,

597
00:25:18,140 --> 00:25:20,427
then we're going
to call it a zero.
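
[Illustration] A random playout of this kind can be sketched on a small stand-in game. This sketch assumes tic-tac-toe with a 9-character board string and treats "X" as player one; the names `LINES`, `winner`, and `simulate` are hypothetical.

```python
import random

# The eight winning lines of a 3x3 board, indexed 0..8 row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if a line is complete, otherwise None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def simulate(board, to_move, us="X", rng=random):
    """Play uniformly random moves until the game ends.

    Returns 1 if `us` wins, and 0 for a tie or a loss.
    """
    board = list(board)
    while winner(board) is None and " " in board:
        empty = [i for i, s in enumerate(board) if s == " "]
        board[rng.choice(empty)] = to_move
        to_move = "O" if to_move == "X" else "X"
    return 1 if winner(board) == us else 0

# One playout from the empty board; the result is 0 or 1.
result = simulate(" " * 9, to_move="X")
```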

598
00:25:20,427 --> 00:25:22,760
You can see in this graph, we're
descending randomly and not

599
00:25:22,760 --> 00:25:23,510
thinking about it.

600
00:25:23,510 --> 00:25:25,370
And it turns out that
this is actually great

601
00:25:25,370 --> 00:25:28,570
because it's really, really
computationally efficient.

602
00:25:28,570 --> 00:25:31,860
If you have a board, even
if it has 400 open squares,

603
00:25:31,860 --> 00:25:33,810
populating it by a
bunch of random moves

604
00:25:33,810 --> 00:25:35,860
doesn't take you very
long, on the order

605
00:25:35,860 --> 00:25:38,276
of not that many machine cycles.

606
00:25:38,276 --> 00:25:40,390
AUDIENCE: So why
don't you store--

607
00:25:40,390 --> 00:25:44,332
if you go down a tree randomly,
you already have a simulation.

608
00:25:44,332 --> 00:25:46,560
So the node's going
to get to someplace.

609
00:25:46,560 --> 00:25:49,060
But you don't store it because
it would lose the randomness?

610
00:25:49,060 --> 00:25:51,920
PROFESSOR 3: You're totally
right, actually, in this case.

611
00:25:51,920 --> 00:25:54,420
I've thought through this, and
I can't come up with a reason

612
00:25:54,420 --> 00:25:55,780
why you wouldn't
store them, those

613
00:25:55,780 --> 00:25:58,363
temporary values that you
find all the way down the tree.

614
00:25:58,363 --> 00:26:01,610
But they don't in most of
the literature [INAUDIBLE]

615
00:26:01,610 --> 00:26:03,574
But you're totally
right about that.

616
00:26:03,574 --> 00:26:06,270
Does everyone understand
that distinction?

617
00:26:06,270 --> 00:26:08,460
The fact that we only
hold onto the result

618
00:26:08,460 --> 00:26:10,110
here and don't
theoretically make

619
00:26:10,110 --> 00:26:13,320
nodes for every place down in
the tree just because we could,

620
00:26:13,320 --> 00:26:15,000
just because we've
seen them before.

621
00:26:15,000 --> 00:26:17,166
We don't, and it doesn't
really matter in this case.

622
00:26:17,166 --> 00:26:19,762
But it's theoretically a slight
speed up that you could do.

623
00:26:19,762 --> 00:26:22,420
AUDIENCE: But you reduce that
question to generalities?

624
00:26:22,420 --> 00:26:25,950
PROFESSOR 3: Yeah, a little bit.

625
00:26:25,950 --> 00:26:29,940
So we can look at an example of
simulating out a random game.

626
00:26:29,940 --> 00:26:32,610
We get some intuition for
why a random game would

627
00:26:32,610 --> 00:26:35,760
be correlated with how good
your board position is.

628
00:26:35,760 --> 00:26:38,470
For example, here we
have a tic-tac-toe game.

629
00:26:38,470 --> 00:26:40,210
Circle is going to move next.

630
00:26:40,210 --> 00:26:42,540
But as hopefully you can
see, because you have played

631
00:26:42,540 --> 00:26:46,120
tic-tac-toe before, this is not a
particularly promising board

632
00:26:46,120 --> 00:26:47,990
for circle.

633
00:26:47,990 --> 00:26:51,500
Because no matter
what circle does,

634
00:26:51,500 --> 00:26:54,802
if x is an intelligent
player x can win right now.

635
00:26:54,802 --> 00:26:56,510
It has two different
options for winning.

636
00:26:56,510 --> 00:26:59,333
And so, if you simulated this
forward randomly, what you'll

637
00:26:59,333 --> 00:27:01,856
get is that 2/3 of the
time, x will in fact win,

638
00:27:01,856 --> 00:27:03,230
even if the players
aren't really

639
00:27:03,230 --> 00:27:04,410
thinking of it ahead of time.

640
00:27:04,410 --> 00:27:04,909
Yeah.

641
00:27:04,909 --> 00:27:07,170
AUDIENCE: Then why
not do n simulations

642
00:27:07,170 --> 00:27:09,860
at a node instead of
just a single simulation?

643
00:27:09,860 --> 00:27:10,510
PROFESSOR 3: You
totally can do that.

644
00:27:10,510 --> 00:27:12,470
That's in fact, something
that makes sense to do

645
00:27:12,470 --> 00:27:13,740
and that some people do.

646
00:27:13,740 --> 00:27:16,110
Although what you'll
find somewhat soon,

647
00:27:16,110 --> 00:27:18,780
is that considering that
we're going down the tree,

648
00:27:18,780 --> 00:27:20,520
and that sometime
soon we're going

649
00:27:20,520 --> 00:27:22,170
to explore all of
its children, there's

650
00:27:22,170 --> 00:27:24,930
a good question of why
you'd do n simulations now

651
00:27:24,930 --> 00:27:28,080
when you could just descend
through the tree n times

652
00:27:28,080 --> 00:27:31,030
and thereby do n simulations
by going through the thing

653
00:27:31,030 --> 00:27:34,210
and also building
out the children?

654
00:27:34,210 --> 00:27:35,650
This case is-- yeah.

655
00:27:35,650 --> 00:27:37,360
AUDIENCE: This gives
more importance

656
00:27:37,360 --> 00:27:38,440
to why you do randomness.

657
00:27:38,440 --> 00:27:40,605
Because if you're doing
random simulations

658
00:27:40,605 --> 00:27:42,696
you would ignore the
possibility of the best one.

659
00:27:42,696 --> 00:27:45,255
When you first ran a simulation
here was that o wins.

660
00:27:45,255 --> 00:27:47,090
If I ignore this node--

661
00:27:47,090 --> 00:27:48,230
PROFESSOR 3: Absolutely.

662
00:27:48,230 --> 00:27:52,530
Which is why it matters that we
do this so many times that we

663
00:27:52,530 --> 00:27:55,515
drown out all the noise that
is associated with playing

664
00:27:55,515 --> 00:27:57,010
a game out randomly.

665
00:27:57,010 --> 00:27:58,930
Let's talk about that.

666
00:27:58,930 --> 00:28:02,010
If there's a lot of distance
between where we are right now

667
00:28:02,010 --> 00:28:03,600
and our end result--

668
00:28:03,600 --> 00:28:05,100
For example, in
this game, if I were

669
00:28:05,100 --> 00:28:08,522
to tell you how good is this
board position, if you are one

670
00:28:08,522 --> 00:28:10,730
of those people who played
out every game of tic-tac-toe,

671
00:28:10,730 --> 00:28:12,900
you'll know that this is
great if you want it to be

672
00:28:12,900 --> 00:28:15,660
[INAUDIBLE]

673
00:28:15,660 --> 00:28:17,820
Anyway, the point
is, that is not

674
00:28:17,820 --> 00:28:20,550
easy to do if you are doing
random simulations from where

675
00:28:20,550 --> 00:28:21,730
you start.

676
00:28:21,730 --> 00:28:24,500
The correlation between
your friend's board state

677
00:28:24,500 --> 00:28:27,989
and the quality of that state
actually drops precipitously.

678
00:28:27,989 --> 00:28:29,780
And this for me is one
of the hardest parts

679
00:28:29,780 --> 00:28:31,940
to study about Monte
Carlo Tree Search.

680
00:28:31,940 --> 00:28:33,890
Although, as Nick
will explain to you,

681
00:28:33,890 --> 00:28:36,270
it actually works quite well.

682
00:28:36,270 --> 00:28:38,840
And one of the reasons that it
works quite well in practice

683
00:28:38,840 --> 00:28:40,215
for more complicated
applications

684
00:28:40,215 --> 00:28:42,320
is that they do away
with the assumption

685
00:28:42,320 --> 00:28:43,480
of random simulation.

686
00:28:43,480 --> 00:28:45,032
Because even though
random simulation

687
00:28:45,032 --> 00:28:47,240
does allow you to explore
all the states, if you have

688
00:28:47,240 --> 00:28:50,600
some idea of where a reasonable
quality approach would be,

689
00:28:50,600 --> 00:28:54,510
then using that, as long as it's
not that much more expensive

690
00:28:54,510 --> 00:28:56,770
computationally, can help
you with your simulation.

691
00:28:56,770 --> 00:28:59,140
Right now we're still talking
about total randomness.

692
00:28:59,140 --> 00:29:00,640
How are people doing
with that idea?

693
00:29:04,205 --> 00:29:06,330
Now we're going to update
the tree with the results

694
00:29:06,330 --> 00:29:07,320
of our simulation.

695
00:29:07,320 --> 00:29:10,330
So given that we had
some result delta,

696
00:29:10,330 --> 00:29:12,140
we're going to walk
back up through the parents.

697
00:29:12,140 --> 00:29:13,960
And for each parent
we're going to add

698
00:29:13,960 --> 00:29:15,780
that the game has been
played there once,

699
00:29:15,780 --> 00:29:20,790
and that the result
of that simulation

700
00:29:20,790 --> 00:29:24,460
gets added if it was a one.

701
00:29:24,460 --> 00:29:27,300
So for example, if there
was a win in this game,

702
00:29:27,300 --> 00:29:30,520
then this becomes one, one because now it's won once
because now it's won once

703
00:29:30,520 --> 00:29:32,190
and it's been visited once.

704
00:29:32,190 --> 00:29:34,630
And these two get
incremented by one,

705
00:29:34,630 --> 00:29:37,280
and these two get
incremented by one.
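
[Illustration] The update step can be sketched as a walk from the new node back to the root. This is a minimal sketch; the `Node` class and the name `backpropagate` are hypothetical, and `delta` stands for the 0-or-1 simulation result.

```python
class Node:
    """Bare-bones node: a parent pointer plus the two counters."""
    def __init__(self, parent=None):
        self.parent = parent
        self.wins = 0
        self.visits = 0

def backpropagate(node, delta):
    """Add one visit, and delta wins, to the node and all its ancestors."""
    while node is not None:
        node.visits += 1
        node.wins += delta  # delta is 1 for a win, 0 for a tie or a loss
        node = node.parent

root = Node()
child = Node(parent=root)
leaf = Node(parent=child)

backpropagate(leaf, 1)  # a simulated win
print(leaf.wins, leaf.visits, root.wins, root.visits)  # 1 1 1 1
backpropagate(leaf, 0)  # a simulated loss: only the visit counts grow
print(leaf.wins, leaf.visits, root.wins, root.visits)  # 1 2 1 2
```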

706
00:29:37,280 --> 00:29:41,060
That in itself comprises
a complete iteration,

707
00:29:41,060 --> 00:29:44,610
the complete single iteration
of running Monte Carlo Tree

708
00:29:44,610 --> 00:29:49,950
Search, which means that
now we can keep doing this

709
00:29:49,950 --> 00:29:52,620
over and over again,
building up the tree

710
00:29:52,620 --> 00:29:55,350
and slowly making it
deeper, and making it deeper

711
00:29:55,350 --> 00:29:56,740
in selective areas.

712
00:29:56,740 --> 00:29:59,450
And having these numbers
increase and increase.

713
00:29:59,450 --> 00:30:01,080
And be more and
more proportional

714
00:30:01,080 --> 00:30:05,430
to the actual expected value
of the quality of the state,

715
00:30:05,430 --> 00:30:06,080
until--

716
00:30:06,080 --> 00:30:08,226
does anyone have any
questions about this idea?--

717
00:30:11,740 --> 00:30:12,550
until we terminate.

718
00:30:12,550 --> 00:30:15,040
And we have to come up
with a way to terminate it.

719
00:30:15,040 --> 00:30:18,670
Now again, we said we're going
to pick what the best child is

720
00:30:18,670 --> 00:30:21,850
going to be, what the best
immediate move from the start

721
00:30:21,850 --> 00:30:24,335
state is going to be.

722
00:30:24,335 --> 00:30:26,650
That's the move that we're
actually going to play.

723
00:30:26,650 --> 00:30:29,010
And so, how do we
determine what the best is?

724
00:30:29,010 --> 00:30:33,240
Well, the trivial solution
is just the highest

725
00:30:33,240 --> 00:30:36,790
expected win given k.

726
00:30:36,790 --> 00:30:38,550
What that, in our
case, is going to be

727
00:30:38,550 --> 00:30:41,190
is the ratio of number
of times that I've

728
00:30:41,190 --> 00:30:44,250
won from a given early
state to the number of times

729
00:30:44,250 --> 00:30:45,880
that I visited.

730
00:30:45,880 --> 00:30:48,745
However, this doesn't actually
work as well as we might hope.

731
00:30:48,745 --> 00:30:50,530
Let's suppose the
following scenario,

732
00:30:50,530 --> 00:30:54,020
which is that you have the
tic-tac-toe game like this.

733
00:30:54,020 --> 00:30:57,220
And you have been exploring
the tree for a while.

734
00:30:57,220 --> 00:31:00,520
And you're really mostly
looking at these two nodes.

735
00:31:00,520 --> 00:31:04,390
One of these nodes, if
you think it through,

736
00:31:04,390 --> 00:31:06,070
this node is quite
promising and you've

737
00:31:06,070 --> 00:31:07,339
been exploring it for a while.

738
00:31:07,339 --> 00:31:09,130
There is a winning
strategy from this node.

739
00:31:09,130 --> 00:31:11,260
It's that circle goes
here, and then x goes here,

740
00:31:11,260 --> 00:31:13,964
and then circle loses because
x has two options to win.

741
00:31:16,694 --> 00:31:18,610
However, if you explore
this a bunch of times,

742
00:31:18,610 --> 00:31:20,480
and for some reason,
due to the randomness,

743
00:31:20,480 --> 00:31:21,970
this is at 11 out of 20.

744
00:31:21,970 --> 00:31:25,687
Whereas this state, which
is inherently inferior,

745
00:31:25,687 --> 00:31:28,020
is at three out of five because
of a bunch of randomness

746
00:31:28,020 --> 00:31:30,380
and because it hasn't
been explored as much.

747
00:31:30,380 --> 00:31:32,390
And if we had looked at
this one as exhaustively

748
00:31:32,390 --> 00:31:35,631
as we had at this one,
then you probably

749
00:31:35,631 --> 00:31:37,880
would actually say that this
state is actually better.

750
00:31:37,880 --> 00:31:40,900
And so, you can create
an alternative criterion,

751
00:31:40,900 --> 00:31:43,920
which is that it's the
highest expected win

752
00:31:43,920 --> 00:31:46,060
value of one of the children.

753
00:31:46,060 --> 00:31:49,602
But also, that value
has to be the node that

754
00:31:49,602 --> 00:31:51,310
has been most visited
so that they aren't

755
00:31:51,310 --> 00:31:54,590
explored by different amounts.

756
00:31:54,590 --> 00:31:56,080
What this sacrifices, however, is
is however, is

757
00:31:56,080 --> 00:32:01,050
that this means that we
can't terminate on demand.

758
00:32:01,050 --> 00:32:02,870
This is not always
going to be true,

759
00:32:02,870 --> 00:32:05,161
and therefore, we're going
to have to let the algorithm

760
00:32:05,161 --> 00:32:07,362
run until that's true for
some start state, which

761
00:32:07,362 --> 00:32:09,320
means that maybe it's not
a criterion that we want

762
00:32:09,320 --> 00:32:11,886
to apply even though we know
that it would be wise to do so.
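
[Illustration] The two ways of choosing the final move can be compared on the 11-out-of-20 versus 3-out-of-5 example. This is a minimal sketch; the function names and the dictionary layout are hypothetical, and every child is assumed to have been visited at least once.

```python
def best_by_ratio(children):
    """children maps move -> (wins, visits); pick the highest win ratio."""
    return max(children, key=lambda m: children[m][0] / children[m][1])

def best_if_settled(children):
    """Return the best-ratio move only if it is also the most visited.

    Returns None when the two disagree, meaning the search should keep
    running before committing to a move.
    """
    by_ratio = best_by_ratio(children)
    by_visits = max(children, key=lambda m: children[m][1])
    return by_ratio if by_ratio == by_visits else None

stats = {"well_explored": (11, 20), "barely_explored": (3, 5)}
print(best_by_ratio(stats))    # barely_explored (0.60 beats 0.55)
print(best_if_settled(stats))  # None: not ready to terminate
```

The second rule is why termination on demand is lost: the search may have to continue until the best-ratio child is also the most-visited one.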

763
00:32:11,886 --> 00:32:13,260
Are there any
questions about how

764
00:32:13,260 --> 00:32:15,222
we pick the terminating guide?

765
00:32:19,280 --> 00:32:20,435
That was the whole thing.

766
00:32:20,435 --> 00:32:22,560
And now we're going to do
it lots and lots of times

767
00:32:22,560 --> 00:32:25,780
until you guys are sick of
Monte Carlo Tree Search.

768
00:32:25,780 --> 00:32:26,970
So this is our tree.

769
00:32:26,970 --> 00:32:29,826
It's more or less
what we've had before.

770
00:32:29,826 --> 00:32:31,200
The first thing
we're going to do

771
00:32:31,200 --> 00:32:32,616
is we're going to
look at the top.

772
00:32:32,616 --> 00:32:35,250
And then we're going to
pick one of these children.

773
00:32:35,250 --> 00:32:37,260
Now let's say that
we looked at this,

774
00:32:37,260 --> 00:32:39,210
and it turns out that the one
on the left is really valuable.

775
00:32:39,210 --> 00:32:40,060
I think it's the one.

776
00:32:40,060 --> 00:32:40,583
Nope, yeah.

777
00:32:40,583 --> 00:32:41,083
Never mind.

778
00:32:41,083 --> 00:32:42,319
It's wrong.

779
00:32:42,319 --> 00:32:43,860
The one on the left
has been explored

780
00:32:43,860 --> 00:32:44,818
a whole bunch of times.

781
00:32:44,818 --> 00:32:47,730
Remember, this term
starts becoming larger

782
00:32:47,730 --> 00:32:49,980
than the ones that haven't
been visited as much.

783
00:32:49,980 --> 00:32:53,390
And so we're going to
descend from this one.

784
00:32:53,390 --> 00:32:57,175
And now we're going to descend,
and we have these two options.

785
00:32:57,175 --> 00:33:00,612
Given what you know,
would you expect

786
00:33:00,612 --> 00:33:02,070
that the one it's going
to pick is going

787
00:33:02,070 --> 00:33:04,153
to be the one on the right
or the one on the left?

788
00:33:04,153 --> 00:33:05,295
AUDIENCE: [INAUDIBLE]

789
00:33:05,295 --> 00:33:06,250
PROFESSOR 3: On the
right because it's never

790
00:33:06,250 --> 00:33:06,880
been visited before.

791
00:33:06,880 --> 00:33:08,380
And so, this term
is going to explode.

792
00:33:08,380 --> 00:33:10,060
And so, we're going
to build a node there.

793
00:33:10,060 --> 00:33:11,726
And then we're going
to simulate a game.

794
00:33:11,726 --> 00:33:15,954
And the result is a win,
which is bad for this player.

795
00:33:15,954 --> 00:33:18,370
That means that he probably
didn't want to make that move.

796
00:33:18,370 --> 00:33:20,820
And so we're going to
propagate that value up.

797
00:33:20,820 --> 00:33:24,420
And we're going to start
the algorithm again.

798
00:33:24,420 --> 00:33:26,430
And it's going to compare
between these three.

799
00:33:26,430 --> 00:33:31,500
And now it's going to
pick the one on the left.

800
00:33:34,686 --> 00:33:36,310
Now that it picked
the one on the left,

801
00:33:36,310 --> 00:33:39,420
it's going to compare
between these two states.

802
00:33:39,420 --> 00:33:43,933
Which of the two is going to
have a higher exploitation factor?

803
00:33:43,933 --> 00:33:46,397
AUDIENCE: The left.

804
00:33:46,397 --> 00:33:47,980
AUDIENCE: Don't you
invert it, though,

805
00:33:47,980 --> 00:33:49,280
because this is the opponent.

806
00:33:49,280 --> 00:33:50,480
PROFESSOR 3: Exactly.

807
00:33:50,480 --> 00:33:52,331
Because two out of three
is actually better.

808
00:33:52,331 --> 00:33:54,247
Because it's one out of
three for the opponent

809
00:33:54,247 --> 00:33:55,530
that's currently
making the move.

810
00:33:55,530 --> 00:33:57,440
So the one on the left is going
to have a higher exploitation

811
00:33:57,440 --> 00:33:58,485
factor, and the
one on the right is

812
00:33:58,485 --> 00:33:59,560
going to have a higher
exploration factor.

813
00:33:59,560 --> 00:34:01,412
Does that make sense for people?

814
00:34:01,412 --> 00:34:05,114
It's OK if it doesn't.

815
00:34:05,114 --> 00:34:07,280
So we're actually going to
pick the one on the right

816
00:34:07,280 --> 00:34:09,969
because the other one has
been visited three times and has had more

817
00:34:09,969 --> 00:34:11,994
of its mother's
love than that one.

818
00:34:11,994 --> 00:34:14,129
Anyone else need a drink?

819
00:34:14,129 --> 00:34:15,462
We're going to expand that node.

820
00:34:15,462 --> 00:34:16,370
It doesn't matter.

821
00:34:16,370 --> 00:34:18,502
They are both equally
likely to be expanded.

822
00:34:18,502 --> 00:34:20,918
We're going to simulate forward,
and it's going to be one.

823
00:34:20,918 --> 00:34:24,639
Which means that that was
probably a wise countermove.

824
00:34:24,639 --> 00:34:25,220
Yeah.

825
00:34:25,220 --> 00:34:26,969
AUDIENCE: So when it's
the opponent's turn

826
00:34:26,969 --> 00:34:29,300
versus your turn, the
exploration factor

827
00:34:29,300 --> 00:34:33,562
is the same but we complement
the exploitation factor, right?

828
00:34:33,562 --> 00:34:34,270
PROFESSOR 3: Yes.

829
00:34:34,270 --> 00:34:36,739
So the key here
being that this takes

830
00:34:36,739 --> 00:34:39,162
in both the state that
you're talking about

831
00:34:39,162 --> 00:34:40,870
and the player that
you're talking about.

832
00:34:40,870 --> 00:34:42,239
AUDIENCE: But regardless
of the player,

833
00:34:42,239 --> 00:34:44,570
the exploration factor will
always be like this is.

834
00:34:44,570 --> 00:34:46,945
PROFESSOR 3: Because it's only
the number of visits.

835
00:34:46,945 --> 00:34:49,716
It has nothing to do with
results of exploration.
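
[Illustration] The player-dependent score can be sketched as below. This sketch assumes the standard UCB1 form and assumes wins are always counted from our own perspective; `ucb_for_player`, `our_turn`, and `c` are hypothetical names. Because ties are recorded as zero, taking one minus the ratio counts ties in the opponent's favor, which is a simplification.

```python
import math

def ucb_for_player(wins, visits, parent_visits, our_turn, c=math.sqrt(2)):
    """UCB score of a child, seen from whoever is choosing the move."""
    if visits == 0:
        return math.inf
    ratio = wins / visits  # our win ratio at this child
    # The opponent prefers children where we win less often.
    exploitation = ratio if our_turn else 1.0 - ratio
    # The visit bonus depends only on the counts, not on who is moving.
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

ours = ucb_for_player(2, 3, 4, our_turn=True)     # uses 2/3
theirs = ucb_for_player(2, 3, 4, our_turn=False)  # uses 1/3
print(math.isclose(ours - 2 / 3, theirs - 1 / 3))  # True: same bonus
```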

836
00:34:53,176 --> 00:34:55,584
AUDIENCE: If you win and
you have the plus one,

837
00:34:55,584 --> 00:34:57,375
double plus one, and
you've propagated out,

838
00:34:57,375 --> 00:35:00,485
but I'm wondering--

839
00:35:00,485 --> 00:35:03,602
so if the opponent wins
do you also propagate

840
00:35:03,602 --> 00:35:07,950
out the win increment itself?

841
00:35:07,950 --> 00:35:09,574
If the opponent's
winning, wouldn't you

842
00:35:09,574 --> 00:35:11,572
want to [INAUDIBLE] node here?

843
00:35:11,572 --> 00:35:13,530
PROFESSOR 3: If the
opponent wins then what you

844
00:35:13,530 --> 00:35:14,780
do is you propagate up a zero.

845
00:35:14,780 --> 00:35:20,390
Which means that wk is not
incremented, but nk is.

846
00:35:23,695 --> 00:35:26,580
Have we seen a zero yet?

847
00:35:26,580 --> 00:35:28,440
There's one soon.

848
00:35:28,440 --> 00:35:31,900
But the idea is that rather
than subtract or anything,

849
00:35:31,900 --> 00:35:34,010
all you do is propagate
up the result of the game,

850
00:35:34,010 --> 00:35:37,820
which in this case is zero.

851
00:35:37,820 --> 00:35:39,360
Which means that
all of those states

852
00:35:39,360 --> 00:35:41,820
seem to become more valuable
to blue and less valuable

853
00:35:41,820 --> 00:35:42,830
to red.

854
00:35:42,830 --> 00:35:46,159
Because these numbers are
lower than the other ones were.

855
00:35:46,159 --> 00:35:46,700
AUDIENCE: OK.

856
00:35:50,750 --> 00:35:52,250
PROFESSOR 3: So we
propagate this up

857
00:35:52,250 --> 00:35:54,580
and this becomes better.

858
00:35:54,580 --> 00:35:56,930
What we've done here
is we've figured out

859
00:35:56,930 --> 00:36:00,057
a theoretical countermove
to blue moving here.

860
00:36:00,057 --> 00:36:02,140
That's how you should think
about this whole tree.

861
00:36:02,140 --> 00:36:04,540
It's really a lot like
the way humans think

862
00:36:04,540 --> 00:36:05,440
about these things.

863
00:36:05,440 --> 00:36:07,950
If I do this, then
what if they do this?

864
00:36:07,950 --> 00:36:09,140
Well, then I'll do this.

865
00:36:09,140 --> 00:36:14,142
And I see that I'm
successful when I do that.

866
00:36:14,142 --> 00:36:16,580
We're going to look
again at the top.

867
00:36:16,580 --> 00:36:18,765
And we're going to pick
the one on the left

868
00:36:18,765 --> 00:36:20,015
because it's really promising.

869
00:36:20,015 --> 00:36:21,687
Five out of six
is a good number.

870
00:36:21,687 --> 00:36:23,270
And we're going to
look at both sides.

871
00:36:23,270 --> 00:36:25,571
And which one is blue
going to pick now?

872
00:36:25,571 --> 00:36:27,321
Well, it's going to
pick the one that it's

873
00:36:27,321 --> 00:36:29,090
going to be more successful
in, which is two out of three.

874
00:36:29,090 --> 00:36:31,246
I realize that this is
actually not the kind of thing

875
00:36:31,246 --> 00:36:32,746
where I could
necessarily ask people

876
00:36:32,746 --> 00:36:37,680
because I'm the one who's
decided which node to stop.

877
00:36:37,680 --> 00:36:39,360
Then we go down here.

878
00:36:39,360 --> 00:36:41,430
And there's an equal
likelihood of picking

879
00:36:41,430 --> 00:36:42,347
either of those nodes.

880
00:36:42,347 --> 00:36:44,054
And so we're going to
pick one at random.

881
00:36:44,054 --> 00:36:45,530
So that's going to
be the left one.

882
00:36:45,530 --> 00:36:47,227
And we're going to
create an empty node.

883
00:36:47,227 --> 00:36:48,560
Then we're going to play it out.

884
00:36:48,560 --> 00:36:50,660
And it was a success
for blue, which

885
00:36:50,660 --> 00:36:54,080
is amazing because what this
means now is that suddenly,

886
00:36:54,080 --> 00:36:57,180
in this tree of this really
good move that red could make

887
00:36:57,180 --> 00:36:59,090
that blue couldn't find a
response to, suddenly

888
00:36:59,090 --> 00:37:02,280
there's hope because we're
going to propagate this back.

889
00:37:02,280 --> 00:37:03,830
And that means
that blue actually

890
00:37:03,830 --> 00:37:06,772
has a response move to that
sequence of red's moves.

891
00:37:06,772 --> 00:37:08,380
And so it's going
to propagate up.

892
00:37:08,380 --> 00:37:10,915
And this state's going to be
more promising to blue and less

893
00:37:10,915 --> 00:37:12,350
promising to red.

894
00:37:12,350 --> 00:37:14,230
That region of the tree
that we had dug into

895
00:37:14,230 --> 00:37:17,164
is a little less promising.

896
00:37:17,164 --> 00:37:18,330
We're going to look back up.

897
00:37:18,330 --> 00:37:19,788
And this time,
instead, we're going

898
00:37:19,788 --> 00:37:22,880
to evaluate the thing
that is both promising

899
00:37:22,880 --> 00:37:25,910
from the expansion
factor, and also

900
00:37:25,910 --> 00:37:27,800
promising because
we haven't looked

901
00:37:27,800 --> 00:37:29,930
at it very much [INAUDIBLE]
exploration factor.

902
00:37:29,930 --> 00:37:31,513
We're going to pick
between these two.

903
00:37:31,513 --> 00:37:33,589
Which one is going
to be picked here?

904
00:37:33,589 --> 00:37:37,392
AUDIENCE: [INAUDIBLE]

905
00:37:37,392 --> 00:37:39,683
PROFESSOR 3: Because the
exploration factor is the same

906
00:37:39,683 --> 00:37:44,080
but the expansion factor is
higher for the one on the left.

907
00:37:44,080 --> 00:37:45,700
And it's going to
show us a node.

908
00:37:45,700 --> 00:37:48,649
And the result is going to
be a win for red, which

909
00:37:48,649 --> 00:37:51,190
means that red has found a good
countermove to the thing that

910
00:37:51,190 --> 00:37:52,760
was previously
promising for blue.

911
00:37:52,760 --> 00:37:53,926
And we propagate it back up.

912
00:37:53,926 --> 00:37:57,610
And finally, we're going to pick
the one furthest on the right.

913
00:37:57,610 --> 00:37:59,360
Because even though
it's terrible for red,

914
00:37:59,360 --> 00:38:01,443
and even though it's never
won when it's tried it,

915
00:38:01,443 --> 00:38:04,285
it has to obey this idea
of the exploration mode

916
00:38:04,285 --> 00:38:06,910
to find out whether maybe there
isn't something possible there.

917
00:38:06,910 --> 00:38:09,110
So it explores,
and it goes down,

918
00:38:09,110 --> 00:38:10,870
and it has to pick
the one on the right.

919
00:38:10,870 --> 00:38:12,090
And so it does.

920
00:38:12,090 --> 00:38:13,420
And it plays this game out.

921
00:38:13,420 --> 00:38:16,180
And it's a loss, again.

922
00:38:16,180 --> 00:38:19,212
Which goes to show
you that blue

923
00:38:19,212 --> 00:38:20,670
has found yet
another superior move

924
00:38:20,670 --> 00:38:22,570
to this really bad
move of red, where

925
00:38:22,570 --> 00:38:24,886
probably this move of red,
if this is a game of chess,

926
00:38:24,886 --> 00:38:26,260
is like putting
my queen directly

927
00:38:26,260 --> 00:38:27,926
in front of the
opponent's row of pawns,

928
00:38:27,926 --> 00:38:29,010
and I just leave it there.

929
00:38:29,010 --> 00:38:31,175
There's nothing good that's
ever going to come of it

930
00:38:31,175 --> 00:38:33,070
but we have to explore
it just to find out

931
00:38:33,070 --> 00:38:36,290
whether there isn't some magical
way that I should protect.

932
00:38:36,290 --> 00:38:39,250
And as you can see,
we've built up this tree

933
00:38:39,250 --> 00:38:40,460
over and over and over again.

934
00:38:40,460 --> 00:38:41,970
And it's starting
to look asymmetric.

935
00:38:41,970 --> 00:38:43,845
And we're starting to
see that there's really

936
00:38:43,845 --> 00:38:47,170
this disparity between exploring
the regions that are promising in

937
00:38:47,170 --> 00:38:49,420
this tree and exploring
the regions that are not

938
00:38:49,420 --> 00:38:52,500
and that don't really
matter to us very much.

939
00:38:52,500 --> 00:38:55,940
And that this is exactly what we
wanted from Monte Carlo trees.

940
00:38:55,940 --> 00:38:58,475
That was why we started
the whole endeavor

941
00:38:58,475 --> 00:39:00,030
in the first place.

942
00:39:00,030 --> 00:39:02,530
The next thing I'm going to
talk about is the pros and cons.

943
00:39:02,530 --> 00:39:03,905
But before I do
that, does anyone

944
00:39:03,905 --> 00:39:06,771
have any more questions
about the algorithm?

945
00:39:06,771 --> 00:39:07,270
Yeah.

946
00:39:07,270 --> 00:39:09,966
AUDIENCE: It's still not
clear how we're getting nodes

947
00:39:09,966 --> 00:39:11,322
with different denominators--

948
00:39:11,322 --> 00:39:13,500
[INAUDIBLE]

949
00:39:13,500 --> 00:39:16,011
PROFESSOR 3: The reason for
that is because of the way

950
00:39:16,011 --> 00:39:17,260
that we're simulating through.

951
00:39:17,260 --> 00:39:19,970
We're actually not holding
on to the results

952
00:39:19,970 --> 00:39:23,130
of the simulation as we're
going farther down the tree

953
00:39:23,130 --> 00:39:25,200
than the lowest node we expand.

954
00:39:25,200 --> 00:39:27,540
For example, when you
simulate from here,

955
00:39:27,540 --> 00:39:31,030
you're going to propagate that
value here and here, and so on.

956
00:39:31,030 --> 00:39:32,800
But then when we
expand below, even

957
00:39:32,800 --> 00:39:35,110
if in the course of
this guy's simulation

958
00:39:35,110 --> 00:39:36,300
it happened to go
through one of the states

959
00:39:36,300 --> 00:39:38,160
that we expanded
below, it will not

960
00:39:38,160 --> 00:39:40,140
have incremented the
values of that state

961
00:39:40,140 --> 00:39:42,836
because we weren't
keeping track of it.

962
00:39:42,836 --> 00:39:44,210
Theoretically, if
we were to keep

963
00:39:44,210 --> 00:39:47,050
track of all of the simulations
that we have in fact run,

964
00:39:47,050 --> 00:39:51,480
the numbers beneath these
things would be higher.

965
00:39:51,480 --> 00:39:54,114
AUDIENCE: If you've already
run a simulation from that--

966
00:39:54,114 --> 00:39:55,530
if you've already
run a simulation

967
00:39:55,530 --> 00:39:58,080
from that red node when
you first built it,

968
00:39:58,080 --> 00:40:02,480
and then when you created those
two ones, each of those have

969
00:40:02,480 --> 00:40:03,156
[INAUDIBLE]

970
00:40:03,156 --> 00:40:03,822
PROFESSOR 3: OK.

971
00:40:03,822 --> 00:40:04,520
I see.

972
00:40:04,520 --> 00:40:06,228
AUDIENCE: So would
the denominator always

973
00:40:06,228 --> 00:40:08,100
be one more than the
sum of the children?

974
00:40:08,100 --> 00:40:10,960
PROFESSOR 3: Yeah,
in [INAUDIBLE] Yeah.

975
00:40:13,581 --> 00:40:15,330
AUDIENCE: I understand
how you built that.

976
00:40:18,304 --> 00:40:20,900
Is there a rule of thumb, like
it's time to choose a move?

977
00:40:20,900 --> 00:40:23,150
And it seems like you
have very low numbers here

978
00:40:23,150 --> 00:40:25,080
to make a [INAUDIBLE]

979
00:40:25,080 --> 00:40:27,000
Is there a rule of
thumb on giving games

980
00:40:27,000 --> 00:40:29,550
like it's 2 to the 4 or 2
to the 350, whatever it is.

981
00:40:29,550 --> 00:40:32,335
What kind of numbers do
you need for that first row

982
00:40:32,335 --> 00:40:35,582
before you [INAUDIBLE]?

983
00:40:35,582 --> 00:40:38,010
PROFESSOR 3: What we'll get
to soon is that there isn't one.

984
00:40:38,010 --> 00:40:40,750
That's one of the
problems with MCTS.

985
00:40:40,750 --> 00:40:44,210
But in terms of which of
the moves you will choose,

986
00:40:44,210 --> 00:40:48,350
there are actually variants of
MCTS that suggest that you more

987
00:40:48,350 --> 00:40:51,480
selectively age or
insert new children based

988
00:40:51,480 --> 00:40:56,430
on something more than just
the blind look right now.

989
00:40:56,430 --> 00:41:00,320
In terms of, if I'm here and
it's creating my next children

990
00:41:00,320 --> 00:41:02,805
as the equivalent, then there
are some intelligent guesses

991
00:41:02,805 --> 00:41:04,430
that you can make in
terms of which one

992
00:41:04,430 --> 00:41:05,674
you should score first.

993
00:41:05,674 --> 00:41:07,340
Although it doesn't
particularly matter.

994
00:41:07,340 --> 00:41:09,540
AUDIENCE: I'm just
saying computational time

995
00:41:09,540 --> 00:41:11,400
being what it is,
you might say, OK,

996
00:41:11,400 --> 00:41:13,860
if this is the timeline
of this game I can expect

997
00:41:13,860 --> 00:41:16,274
to do a million simulations,
which will give me

998
00:41:16,274 --> 00:41:18,440
if there's 400 nodes, I'm
going to have so much use.

999
00:41:18,440 --> 00:41:21,310
In other words, is
that enough time

1000
00:41:21,310 --> 00:41:22,962
to say that I can
play through a game?

1001
00:41:22,962 --> 00:41:24,920
I couldn't play through
a game with 400 options

1002
00:41:24,920 --> 00:41:26,982
if I've gotten five out
of seven [INAUDIBLE]

1003
00:41:26,982 --> 00:41:28,190
three out of four [INAUDIBLE]

1004
00:41:28,190 --> 00:41:29,190
PROFESSOR 3: Absolutely.

1005
00:41:29,190 --> 00:41:30,810
And I would say that
so far as I know,

1006
00:41:30,810 --> 00:41:32,260
that's something
that's basically very

1007
00:41:32,260 --> 00:41:33,096
high experimentally.

1008
00:41:33,096 --> 00:41:34,770
They don't have
good bounds on it.

1009
00:41:34,770 --> 00:41:35,645
[INAUDIBLE]

1010
00:41:35,645 --> 00:41:37,020
So let's get on
the first comment

1011
00:41:37,020 --> 00:41:39,070
because that is a
computer element.

1012
00:41:39,070 --> 00:41:41,962
So why should you
use this algorithm?

1013
00:41:41,962 --> 00:41:43,920
Even though we've seen
tremendous breakthroughs

1014
00:41:43,920 --> 00:41:45,607
in this algorithm,
and you're going

1015
00:41:45,607 --> 00:41:47,440
to have to ignore
everything that I tell you

1016
00:41:47,440 --> 00:41:49,020
and remember that
this does actually

1017
00:41:49,020 --> 00:41:51,540
work quite well in
certain scenarios.

1018
00:41:51,540 --> 00:41:53,920
Should we use it or not?

1019
00:41:53,920 --> 00:41:56,225
The pros are that it
actually does the thing

1020
00:41:56,225 --> 00:41:57,141
that we want it to do.

1021
00:41:57,141 --> 00:41:58,515
It grows the tree
asymmetrically.

1022
00:41:58,515 --> 00:42:00,380
It means that we do
not have to explore.

1023
00:42:00,380 --> 00:42:02,340
And it doesn't
explode exponentially

1024
00:42:02,340 --> 00:42:06,049
with the number of moves that
we're looking into the future.

1025
00:42:06,049 --> 00:42:08,590
And that it selectively grows
the tree towards the areas that

1026
00:42:08,590 --> 00:42:11,050
are most promising.

1027
00:42:11,050 --> 00:42:13,010
The other huge
benefit, if you'll

1028
00:42:13,010 --> 00:42:15,290
notice from what we've
just talked through,

1029
00:42:15,290 --> 00:42:17,220
is that it never
relies on anything

1030
00:42:17,220 --> 00:42:19,120
other than the strict
rules of the game.

1031
00:42:19,120 --> 00:42:21,555
What that means is that the
only way that the game's

1032
00:42:21,555 --> 00:42:23,580
factored in is that the
game is what tells us

1033
00:42:23,580 --> 00:42:26,120
what the next moves we can
take from a given state are,

1034
00:42:26,120 --> 00:42:32,310
and whether a given state
is a victory or a defeat.

1035
00:42:32,310 --> 00:42:35,070
And that's kind of
amazing because we

1036
00:42:35,070 --> 00:42:37,650
had no external heuristic
information about this game.

1037
00:42:37,650 --> 00:42:39,850
Which means that if I
took a completely new game

1038
00:42:39,850 --> 00:42:42,720
that someone had just invented,
and I plugged MCTS into it,

1039
00:42:42,720 --> 00:42:47,720
MCTS would be a slightly or
somewhat competitive player

1040
00:42:47,720 --> 00:42:50,600
for this game, which
is a powerful idea.

1041
00:42:50,600 --> 00:42:52,350
It leads to our next two pros.

1042
00:42:52,350 --> 00:42:56,220
The first of which is that it's
very easy to adapt to new games

1043
00:42:56,220 --> 00:42:58,850
that it hasn't seen
before, or even that people

1044
00:42:58,850 --> 00:43:02,160
haven't seen before.

1045
00:43:02,160 --> 00:43:03,720
This is clearly valuable.

1046
00:43:03,720 --> 00:43:05,100
But the other nice
thing about it

1047
00:43:05,100 --> 00:43:07,020
is that even though
heuristics are not

1048
00:43:07,020 --> 00:43:11,810
required to make MCTS
work [INAUDIBLE],

1049
00:43:11,810 --> 00:43:12,840
it can work [INAUDIBLE].

1050
00:43:12,840 --> 00:43:14,340
There are a number
of [? advanced ?]

1051
00:43:14,340 --> 00:43:16,340
places in the algorithm
that you can actually

1052
00:43:16,340 --> 00:43:17,630
incorporate heuristics into.

1053
00:43:17,630 --> 00:43:20,880
Nick is going to talk about how
AlphaGo uses this very heavily.

1054
00:43:20,880 --> 00:43:22,460
AlphaGo is not vanilla MCTS.

1055
00:43:22,460 --> 00:43:24,270
It has a lot of
external information

1056
00:43:24,270 --> 00:43:26,430
that's built into the
way that it works.

1057
00:43:26,430 --> 00:43:29,841
But MCTS is a framework-- you
can imagine there are heuristics you

1058
00:43:29,841 --> 00:43:31,257
can apply in the
simulation, there

1059
00:43:31,257 --> 00:43:33,420
are heuristics you can
apply in the UCB in the way

1060
00:43:33,420 --> 00:43:35,550
that we choose the next node.

1061
00:43:35,550 --> 00:43:37,210
There are places
that it can fit in.

1062
00:43:37,210 --> 00:43:39,376
And this serves as a nice
infrastructure to do so.

1063
00:43:41,320 --> 00:43:45,150
The other benefit is that it's
an on demand algorithm, which

1064
00:43:45,150 --> 00:43:47,660
is particularly valuable when
you're under some sort of time

1065
00:43:47,660 --> 00:43:49,909
pressure, when you're competing
against someone that's

1066
00:43:49,909 --> 00:43:53,100
a mathematician, or when
something is about to explode

1067
00:43:53,100 --> 00:43:57,240
and you have to make a decision
on which reactor to shut down.

1068
00:43:57,240 --> 00:44:00,180
And lastly-- or not
lastly, actually, it's

1069
00:44:00,180 --> 00:44:02,590
complete, which is
really nice because you

1070
00:44:02,590 --> 00:44:04,740
know that if you run
this game for long enough

1071
00:44:04,740 --> 00:44:08,270
it's going to start looking
a lot like a BFS tree.

1072
00:44:08,270 --> 00:44:09,936
No, it's actually
going to start looking

1073
00:44:09,936 --> 00:44:14,820
like an alpha-beta tree, which
is what it converges to.

1074
00:44:14,820 --> 00:44:16,650
It's a nice property to have.

1075
00:44:16,650 --> 00:44:18,470
Although, this
property does slightly

1076
00:44:18,470 --> 00:44:20,595
get compromised if you
remove the red in this idea,

1077
00:44:20,595 --> 00:44:24,690
and if you only simulate
these [INAUDIBLE].

1078
00:44:24,690 --> 00:44:25,594
Yeah.

1079
00:44:25,594 --> 00:44:27,370
PROFESSOR: You made
an interesting comment

1080
00:44:27,370 --> 00:44:29,410
when you said, oh, it
looks like an alpha-beta tree.

1081
00:44:29,410 --> 00:44:32,290
So it looked like
a mini-max tree.

1082
00:44:32,290 --> 00:44:35,190
But have they also
incorporated notions

1083
00:44:35,190 --> 00:44:37,530
of pruning in the
MCTS, which would make

1084
00:44:37,530 --> 00:44:38,947
it look like an alpha-beta tree?

1085
00:44:38,947 --> 00:44:40,780
PROFESSOR 3: Sorry,
you're completely right.

1086
00:44:40,780 --> 00:44:42,990
It does look like
a mini-max tree.

1087
00:44:42,990 --> 00:44:45,380
I think I've seen variants
where they do pruning,

1088
00:44:45,380 --> 00:44:46,963
but I haven't looked
into it as much.

1089
00:44:46,963 --> 00:44:48,690
But I would imagine
that they would

1090
00:44:48,690 --> 00:44:50,500
converge to whatever
you know pruning

1091
00:44:50,500 --> 00:44:52,020
a certain tree [INAUDIBLE].

1092
00:44:52,020 --> 00:44:54,780
AUDIENCE: But people have
explored incorporating pruning

1093
00:44:54,780 --> 00:44:55,380
into MCTS?

1094
00:44:55,380 --> 00:44:57,170
PROFESSOR 3: I think so.

1095
00:44:57,170 --> 00:45:01,350
I can't say [INAUDIBLE]
And then lastly, it's

1096
00:45:01,350 --> 00:45:02,680
really parallelizable.

1097
00:45:02,680 --> 00:45:05,610
You'll notice, none of
the regions of this tree,

1098
00:45:05,610 --> 00:45:08,005
other than the
original choice, ever

1099
00:45:08,005 --> 00:45:09,380
have to interact
with each other.

1100
00:45:09,380 --> 00:45:12,030
So if you have 200
processors and you decide,

1101
00:45:12,030 --> 00:45:15,169
OK, I'm going to break up this
tree in the first 200 decisions

1102
00:45:15,169 --> 00:45:16,710
and then have each
one of those flesh

1103
00:45:16,710 --> 00:45:20,600
out one of those decisions, that
actually means that they can

1104
00:45:20,600 --> 00:45:22,400
all combine information
right at the end

1105
00:45:22,400 --> 00:45:24,025
and make a decision
[INAUDIBLE], which

1106
00:45:24,025 --> 00:45:29,280
is a really nice, powerful
principle as you [INAUDIBLE].

1107
00:45:29,280 --> 00:45:31,290
It does have its fair
share of problems.

1108
00:45:31,290 --> 00:45:34,950
The first problem being
that it does break down

1109
00:45:34,950 --> 00:45:38,290
under extreme tree depth.

1110
00:45:38,290 --> 00:45:41,340
The main reason for this
being that as you increase

1111
00:45:41,340 --> 00:45:45,150
more moves between you
and the end of the game,

1112
00:45:45,150 --> 00:45:47,250
you're increasing
the probability--

1113
00:45:47,250 --> 00:45:49,604
you are decreasing the
correlation between your game

1114
00:45:49,604 --> 00:45:51,270
state and whether a
random playoff would

1115
00:45:51,270 --> 00:45:54,750
suggest that you're in a good
position or a bad position.

1116
00:45:54,750 --> 00:45:56,397
The same goes for
branching factors.

1117
00:45:56,397 --> 00:45:58,605
One of the things that people
sometimes talk about is

1118
00:45:58,605 --> 00:46:03,930
that MCTS AIs cannot
play first-person shooters

1119
00:46:03,930 --> 00:46:07,590
because the distance between the
number of things that you can

1120
00:46:07,590 --> 00:46:11,460
do at every given moment, and
what would be a successful

1121
00:46:11,460 --> 00:46:14,200
approach in the long term
after meeting many, many,

1122
00:46:14,200 --> 00:46:16,360
many moves that each have
many branching factors,

1123
00:46:16,360 --> 00:46:20,937
is that it never begins to explore
the size of the search tree.

1124
00:46:20,937 --> 00:46:22,770
For the most part, it's
not really coming up

1125
00:46:22,770 --> 00:46:24,460
with a long term policy.

1126
00:46:24,460 --> 00:46:27,736
It's really thinking about what
are the next sequence of moves

1127
00:46:27,736 --> 00:46:31,190
that I should [INAUDIBLE].

1128
00:46:31,190 --> 00:46:34,000
Another problem is
that it requires

1129
00:46:34,000 --> 00:46:38,530
simulation to be very
easy and very repeatable.

1130
00:46:38,530 --> 00:46:42,820
So for example, if we
wanted to tell our AI,

1131
00:46:42,820 --> 00:46:44,920
how do I take over Ontario?

1132
00:46:44,920 --> 00:46:46,630
There's not a
particularly good way

1133
00:46:46,630 --> 00:46:49,480
that you can simulate
taking over Ontario.

1134
00:46:49,480 --> 00:46:50,995
If you try it once,
you're not going

1135
00:46:50,995 --> 00:46:52,810
to have an opportunity
to try it again,

1136
00:46:52,810 --> 00:46:56,470
at least with the same
set of configurations.

1137
00:46:56,470 --> 00:46:59,030
And actually, one of the things
that we really took advantage

1138
00:46:59,030 --> 00:47:01,238
of is that random simulation happens really quickly,
happens really quickly,

1139
00:47:01,238 --> 00:47:02,865
on the order of microseconds.

1140
00:47:02,865 --> 00:47:07,494
On the other hand, the
bigger your computational

1141
00:47:07,494 --> 00:47:08,910
resources that you
have access to,

1142
00:47:08,910 --> 00:47:10,300
the better the algorithm works.

1143
00:47:10,300 --> 00:47:12,799
That means that I can't run it
off my Mac particularly well.

1144
00:47:12,799 --> 00:47:15,670
It would be like large games.

1145
00:47:15,670 --> 00:47:18,195
It relies on this tenuous
assumption of random play

1146
00:47:18,195 --> 00:47:21,257
being weakly correlated with the
quality of our game state.

1147
00:47:21,257 --> 00:47:23,132
And this is one of the
first assumptions that

1148
00:47:23,132 --> 00:47:25,548
is going to be thrown out the
window for a lot of the more

1149
00:47:25,548 --> 00:47:27,880
advanced MCTS approaches,
which are going to have

1150
00:47:27,880 --> 00:47:29,857
more intelligent playouts.

1151
00:47:29,857 --> 00:47:31,940
But those are going to
lose some of the generality

1152
00:47:31,940 --> 00:47:35,380
that we had before.

1153
00:47:35,380 --> 00:47:38,719
Something that goes off of that
is that MCTS is a framework.

1154
00:47:38,719 --> 00:47:41,260
But in order to actually make
it effective for a lot of games

1155
00:47:41,260 --> 00:47:44,251
it does require a lot of
tuning, in the sense that there

1156
00:47:44,251 --> 00:47:45,500
are a whole bunch of variants.

1157
00:47:45,500 --> 00:47:47,140
And that you need to be
able to implement whatever

1158
00:47:47,140 --> 00:47:48,745
flavor is best suited for you.

1159
00:47:48,745 --> 00:47:51,290
Which means that it's not
quite as nice and black boxy

1160
00:47:51,290 --> 00:47:54,890
as we would want it to be
as far as give it the rules

1161
00:47:54,890 --> 00:47:58,270
and have it magically come up
with a strategy [INAUDIBLE].

1162
00:47:58,270 --> 00:48:00,160
And then lastly,
as you mentioned,

1163
00:48:00,160 --> 00:48:03,280
there is not a great amount
of literature right now

1164
00:48:03,280 --> 00:48:06,080
about the properties of
MCTS and its convergence,

1165
00:48:06,080 --> 00:48:09,040
and what the actual
proportion of time

1166
00:48:09,040 --> 00:48:11,950
to quality of your solution is.

1167
00:48:11,950 --> 00:48:15,610
This is true of all modern
machine learning things,

1168
00:48:15,610 --> 00:48:18,261
is that there is certainly a lot
more work that could be done.

1169
00:48:18,261 --> 00:48:19,760
But right now,
that's a gap in terms

1170
00:48:19,760 --> 00:48:23,577
of using this for a simulation
that's supposed to be reliable.

1171
00:48:23,577 --> 00:48:27,630
Anyone have any questions
on the Pros and Cons?

1172
00:48:27,630 --> 00:48:29,940
Before we dive
into applications,

1173
00:48:29,940 --> 00:48:32,320
let's talk through
a few examples

1174
00:48:32,320 --> 00:48:34,770
of what games could be
solved and could not

1175
00:48:34,770 --> 00:48:36,750
be solved by MCTS.

1176
00:48:36,750 --> 00:48:38,842
Do you guys think that
checkers is a game that

1177
00:48:38,842 --> 00:48:40,610
could be solved by MCTS?

1178
00:48:40,610 --> 00:48:41,369
AUDIENCE: Yes.

1179
00:48:41,369 --> 00:48:43,160
PROFESSOR 3: It's
completely deterministic.

1180
00:48:43,160 --> 00:48:43,826
It's two-player.

1181
00:48:43,826 --> 00:48:46,770
It satisfies all of the criteria
that we've laid out before.

1182
00:48:46,770 --> 00:48:48,240
Checkers is
definitely a game that

1183
00:48:48,240 --> 00:48:51,240
can be and has been solved by
MCTS, although not solved

1184
00:48:51,240 --> 00:48:53,760
to the extent that you can
defeat the thing that actually

1185
00:48:53,760 --> 00:48:57,270
has the solution [INAUDIBLE].

1186
00:48:57,270 --> 00:48:58,860
How about "Settlers of Catan?"

1187
00:48:58,860 --> 00:49:00,234
This one's a little
bit trickier.

1188
00:49:00,234 --> 00:49:02,680
Do you guys think that MCTS
is likely to be able to play

1189
00:49:02,680 --> 00:49:04,650
"Settlers of Catan?"

1190
00:49:04,650 --> 00:49:07,844
If not, let's throw out reasons
why or why not it would be

1191
00:49:07,844 --> 00:49:09,080
[INAUDIBLE].

1192
00:49:09,080 --> 00:49:09,580
Yeah.

1193
00:49:09,580 --> 00:49:11,800
AUDIENCE: No because
there's randomness.

1194
00:49:11,800 --> 00:49:14,050
PROFESSOR 3: So yes, that
is absolutely the criticism.

1195
00:49:14,050 --> 00:49:16,640
And that's why we
can't apply it vanilla.

1196
00:49:16,640 --> 00:49:18,820
I put this on here
as a trick question,

1197
00:49:18,820 --> 00:49:20,500
though, because it
turns out that MCTS

1198
00:49:20,500 --> 00:49:22,460
is robust to randomness.

1199
00:49:22,460 --> 00:49:23,990
That you can actually play--

1200
00:49:23,990 --> 00:49:25,614
and I realize that's
just me and we do.

1201
00:49:25,614 --> 00:49:26,470
[LAUGHTER]

1202
00:49:26,470 --> 00:49:29,349
You can actually
play through games.

1203
00:49:29,349 --> 00:49:30,765
If you think about
the simulation,

1204
00:49:30,765 --> 00:49:32,680
the simulation is
actually applicable

1205
00:49:32,680 --> 00:49:35,692
even if the game is
not deterministic

1206
00:49:35,692 --> 00:49:37,650
because it does give you
a sense of the quality

1207
00:49:37,650 --> 00:49:38,870
of your position.

1208
00:49:38,870 --> 00:49:42,830
And the MCTS-based
AI to play "Settlers"

1209
00:49:42,830 --> 00:49:47,946
is, I think, at least 49%
competitive with the best AI

1210
00:49:47,946 --> 00:49:50,860
to play, at least in the
autonomous non-scale space.

1211
00:49:50,860 --> 00:49:53,780
So it does work.

1212
00:49:53,780 --> 00:49:57,710
Let's talk about the War
Operation Plan Response.

1213
00:49:57,710 --> 00:50:00,831
Who here has seen the
movie "War Games?"

1214
00:50:00,831 --> 00:50:01,330
OK.

1215
00:50:01,330 --> 00:50:04,030
Well, it should be more of you.

1216
00:50:04,030 --> 00:50:06,730
The idea of "War
Games" is that one

1217
00:50:06,730 --> 00:50:09,640
of the core characters
in this world

1218
00:50:09,640 --> 00:50:11,380
is this computer
that has been put

1219
00:50:11,380 --> 00:50:15,130
in charge of the national
defense strategy with respect

1220
00:50:15,130 --> 00:50:16,700
to Russia.

1221
00:50:16,700 --> 00:50:19,092
And that it needs to think
through the possible future

1222
00:50:19,092 --> 00:50:21,550
scenarios and decide whether
it's going to launch the nukes

1223
00:50:21,550 --> 00:50:23,170
or not.

1224
00:50:23,170 --> 00:50:27,810
Do you think that WOPR
can be MCTS-based?

1225
00:50:27,810 --> 00:50:29,010
AUDIENCE: No.

1226
00:50:29,010 --> 00:50:29,900
PROFESSOR 3: No.

1227
00:50:29,900 --> 00:50:32,191
AUDIENCE: It could, it
just wouldn't be very good.

1228
00:50:32,191 --> 00:50:33,190
PROFESSOR 3: Absolutely.

1229
00:50:33,190 --> 00:50:34,606
Once you fire the
nukes you're not

1230
00:50:34,606 --> 00:50:36,060
going to get another chance.

1231
00:50:36,060 --> 00:50:37,600
So you can't
particularly simulate

1232
00:50:37,600 --> 00:50:39,620
through what the possible
scenarios are going to be like.

1233
00:50:39,620 --> 00:50:39,910
Yeah.

1234
00:50:39,910 --> 00:50:41,160
AUDIENCE: So what if you had--

1235
00:50:41,160 --> 00:50:43,390
I agree you can't simulate
it in the real world.

1236
00:50:43,390 --> 00:50:45,790
But what if you had
a really good model

1237
00:50:45,790 --> 00:50:47,710
and you just simulated
based on that model?

1238
00:50:51,074 --> 00:50:52,990
PROFESSOR 3: In that
case, it probably depends

1239
00:50:52,990 --> 00:50:55,536
on the quality of your model.

1240
00:50:55,536 --> 00:50:59,236
If you have a good model for
how World War III is going to

1241
00:50:59,236 --> 00:50:59,736
[INAUDIBLE].

1242
00:50:59,736 --> 00:51:02,850
[LAUGHTER]

1243
00:51:02,850 --> 00:51:05,170
AUDIENCE: It is the case
that the military does

1244
00:51:05,170 --> 00:51:10,200
have simulators and they
do war games in simulation.

1245
00:51:10,200 --> 00:51:12,070
PROFESSOR 3: Yes, that's true.

1246
00:51:12,070 --> 00:51:14,655
They could certainly try it
and run MCTS if they wanted.

1247
00:51:14,655 --> 00:51:16,560
And that's what
happened in the movie.

1248
00:51:16,560 --> 00:51:18,667
[INTERPOSING VOICES]

1249
00:51:18,667 --> 00:51:20,542
AUDIENCE: And there
you're putting your money

1250
00:51:20,542 --> 00:51:22,430
in the simulation not in the--

1251
00:51:22,430 --> 00:51:24,910
AUDIENCE: It's like having an
MCTS play SOCOM or something

1252
00:51:24,910 --> 00:51:25,410
like that.

1253
00:51:25,410 --> 00:51:26,160
PROFESSOR 3: Yeah.

1254
00:51:26,160 --> 00:51:29,142
It's definitely about putting
money into the simulation

1255
00:51:29,142 --> 00:51:30,600
and getting a really
good simulation.

1256
00:51:30,600 --> 00:51:33,400
If you have a really
good simulation, then you

1257
00:51:33,400 --> 00:51:35,490
[INAUDIBLE] to play WOPR.

1258
00:51:35,490 --> 00:51:36,066
Yeah.

1259
00:51:36,066 --> 00:51:37,816
AUDIENCE: Back to
"Settlers" for a second.

1260
00:51:37,816 --> 00:51:40,600
I'm curious if there's a way
for the whole player trading

1261
00:51:40,600 --> 00:51:42,970
resources thing,
or would it have

1262
00:51:42,970 --> 00:51:47,216
to be only purely
like using the ports.

1263
00:51:47,216 --> 00:51:48,760
PROFESSOR 3: That's
a good question.

1264
00:51:48,760 --> 00:51:53,620
I haven't looked closely at
whether they do that or not.

1265
00:51:53,620 --> 00:51:55,400
If it's playing a
two-player game,

1266
00:51:55,400 --> 00:51:58,790
then I would imagine that they
wouldn't because you don't

1267
00:51:58,790 --> 00:52:00,290
really trade in a two-player game.

1268
00:52:00,290 --> 00:52:01,748
But if they weren't,
I bet that you

1269
00:52:01,748 --> 00:52:03,257
can incorporate it with WOPR.

1270
00:52:03,257 --> 00:52:05,090
AUDIENCE: Is it limited
to two-player games?

1271
00:52:05,090 --> 00:52:06,070
PROFESSOR 3: No, not at all.

1272
00:52:06,070 --> 00:52:07,240
In fact, there are
lots of approaches

1273
00:52:07,240 --> 00:52:08,890
that do only
one-player games, where

1274
00:52:08,890 --> 00:52:11,482
you think of what's the best
move that you can make.

1275
00:52:11,482 --> 00:52:12,190
AUDIENCE: I know.

1276
00:52:12,190 --> 00:52:15,215
But I mean, couldn't MCTS handle
three- or four-player games?

1277
00:52:15,215 --> 00:52:16,840
PROFESSOR 3: Yeah,
it absolutely could.

1278
00:52:16,840 --> 00:52:19,622
I'm not sure how they
computed their head-to-head.

1279
00:52:19,622 --> 00:52:21,730
That might be
completely flat cursors.

1280
00:52:21,730 --> 00:52:24,734
I'm not even sure how
the settlers interact.

1281
00:52:24,734 --> 00:52:25,234
Yeah.

1282
00:52:25,234 --> 00:52:26,359
AUDIENCE: A quick question.

1283
00:52:26,359 --> 00:52:29,460
So at first you know if I
reduce the chess board to only 4

1284
00:52:29,460 --> 00:52:32,530
by 4 or 5 by 5, and
I run MCTS versus

1285
00:52:32,530 --> 00:52:35,050
the traditional algorithm, like
an alpha-beta search tree.

1286
00:52:35,050 --> 00:52:38,222
Do you think MCTS will
prefer theory and perform

1287
00:52:38,222 --> 00:52:40,219
this computational requirement.

1288
00:52:40,219 --> 00:52:42,510
PROFESSOR 3: The thing about
the way that Deep Blue is,

1289
00:52:42,510 --> 00:52:44,570
which is the AI that
beat Kasparov,

1290
00:52:44,570 --> 00:52:47,370
the chess
grandmaster,

1291
00:52:47,370 --> 00:52:49,970
is that it has a tremendous
amount of heuristic

1292
00:52:49,970 --> 00:52:50,587
information.

1293
00:52:50,587 --> 00:52:52,170
There's a lot of
external stuff that's

1294
00:52:52,170 --> 00:52:54,310
incorporated into the
system that makes it

1295
00:52:54,310 --> 00:52:57,250
able to explore the best paths.

1296
00:52:57,250 --> 00:52:59,500
What I would say is
that knowledgeless

1297
00:52:59,500 --> 00:53:03,730
MCTS based on randomness,
would take a very long

1298
00:53:03,730 --> 00:53:07,850
computational time to even
become competitive with those

1299
00:53:07,850 --> 00:53:10,382
kinds of algorithms, and
probably feasibly never would.

1300
00:53:10,382 --> 00:53:12,340
But if you incorporated
heuristic information,

1301
00:53:12,340 --> 00:53:15,320
I think that there's a bunch of
hope in terms of getting MCTS

1302
00:53:15,320 --> 00:53:16,600
to start performing better.

1303
00:53:16,600 --> 00:53:18,850
And you can look at what
next I'm going to talk about,

1304
00:53:18,850 --> 00:53:19,460
AlphaGo.

1305
00:53:19,460 --> 00:53:22,040
It takes inspiration for how
we go about incorporating

1306
00:53:22,040 --> 00:53:22,940
these heuristics.

1307
00:53:22,940 --> 00:53:27,147
AUDIENCE: So only the
heuristic you [INAUDIBLE]

1308
00:53:27,147 --> 00:53:28,980
PROFESSOR 3: It definitely
seems like if you

1309
00:53:28,980 --> 00:53:33,330
have a really good
heuristic model for what

1310
00:53:33,330 --> 00:53:38,530
good states in the game are,
that if it's a smaller search

1311
00:53:38,530 --> 00:53:42,420
space, that some other
models could perform better.

1312
00:53:42,420 --> 00:53:44,546
Although, I'm probably
going to eat my foot here

1313
00:53:44,546 --> 00:53:47,390
because this is going to be
on OCW some massive amount,

1314
00:53:47,390 --> 00:53:49,936
massive chess
playing algorithms.

1315
00:53:49,936 --> 00:53:53,429
Eat my shoe not my foot.

1316
00:53:53,429 --> 00:53:55,430
[LAUGHTER]

1317
00:53:55,430 --> 00:53:58,100
One last game.

1318
00:53:58,100 --> 00:54:00,364
Does anyone know
what this game is?

1319
00:54:00,364 --> 00:54:01,280
AUDIENCE: "Total War?"

1320
00:54:01,280 --> 00:54:02,380
PROFESSOR 3: Yes.

1321
00:54:02,380 --> 00:54:03,060
Nice.

1322
00:54:03,060 --> 00:54:04,850
This is "Rome, Total War II."

1323
00:54:04,850 --> 00:54:09,890
It's a simulator for this
tremendous real time strategy

1324
00:54:09,890 --> 00:54:13,100
game, where you play, I
think, the Roman Empire.

1325
00:54:13,100 --> 00:54:17,540
And you're controlling armies
and huge infrastructure systems

1326
00:54:17,540 --> 00:54:20,980
that move and conquer
states and continents,

1327
00:54:20,980 --> 00:54:24,530
and meet in the field, and
manage resources, and do

1328
00:54:24,530 --> 00:54:26,860
all of these incredible
diplomacy feats.

1329
00:54:26,860 --> 00:54:29,347
And so do you think that this
game can be solved by MCTS?

1330
00:54:29,347 --> 00:54:29,930
AUDIENCE: Yes.

1331
00:54:29,930 --> 00:54:32,409
AUDIENCE: Yes.

1332
00:54:32,409 --> 00:54:33,450
PROFESSOR 3: Let's say no.

1333
00:54:33,450 --> 00:54:34,658
But I guess I put it on here.

1334
00:54:34,658 --> 00:54:36,870
So that's good on you.

1335
00:54:36,870 --> 00:54:40,925
The way that the AI in
"Rome, Total War II" is built

1336
00:54:40,925 --> 00:54:43,200
is that it's built
on an MCTS structure.

1337
00:54:43,200 --> 00:54:45,980
And it in fact does
do resource allocation

1338
00:54:45,980 --> 00:54:47,780
and a lot of its
political maneuvers

1339
00:54:47,780 --> 00:54:49,439
based on Monte Carlo
Tree Search moves.

1340
00:54:49,439 --> 00:54:51,355
There are a bunch of
reasons that they explain

1341
00:54:51,355 --> 00:54:53,390
in the game for
why they do this,

1342
00:54:53,390 --> 00:54:54,961
or in papers released
about the game.

1343
00:54:54,961 --> 00:54:56,835
But one of the nice ones
is that it's random,

1344
00:54:56,835 --> 00:54:58,293
which means that
you're never going

1345
00:54:58,293 --> 00:55:01,280
to play against the same kind
of AI twice because every time

1346
00:55:01,280 --> 00:55:02,750
the set of decisions that
it's going to think about

1347
00:55:02,750 --> 00:55:03,736
is completely different.

1348
00:55:03,736 --> 00:55:04,608
AUDIENCE: I have
a quick question.

1349
00:55:04,608 --> 00:55:05,358
PROFESSOR 3: Yeah.

1350
00:55:05,358 --> 00:55:07,646
AUDIENCE: So if I want to
model any game with MCTS,

1351
00:55:07,646 --> 00:55:10,996
does it have to be that the
actions in playing a game

1352
00:55:10,996 --> 00:55:14,272
have to be discretizable?

1353
00:55:14,272 --> 00:55:14,980
PROFESSOR 3: Yes.

1354
00:55:14,980 --> 00:55:17,755
So far as I know, I haven't
seen many continuous variants

1355
00:55:17,755 --> 00:55:19,520
in MCTS.

1356
00:55:19,520 --> 00:55:22,680
And so, I think that it is about
choosing discrete actions, which

1357
00:55:22,680 --> 00:55:26,130
on its most narrow level does
actually bring it down to here.

1358
00:55:26,130 --> 00:55:27,630
I think one of the
reasons that this

1359
00:55:27,630 --> 00:55:30,046
is nice is that there are so
many different decisions that

1360
00:55:30,046 --> 00:55:32,610
could be made that MCTS is
really the only approach that

1361
00:55:32,610 --> 00:55:35,525
could even begin to handle the
massive branching factor that's

1362
00:55:35,525 --> 00:55:37,810
associated with the
game Rome, Total War.

1363
00:55:37,810 --> 00:55:38,579
Yeah.

1364
00:55:38,579 --> 00:55:40,162
AUDIENCE: This is
also the consequence

1365
00:55:40,162 --> 00:55:43,407
of this year you get the play
off when this game comes.

1366
00:55:43,407 --> 00:55:44,740
PROFESSOR 3: That's interesting.

1367
00:55:44,740 --> 00:55:46,600
That's probably totally it.

1368
00:55:46,600 --> 00:55:47,170
That's cool.

1369
00:55:50,640 --> 00:55:54,710
That's everything about how
the algorithm actually works.

1370
00:55:54,710 --> 00:55:56,134
I'm going to pass
it off to Nick,

1371
00:55:56,134 --> 00:55:58,550
and he's going to talk to us
about some actual limitations

1372
00:55:58,550 --> 00:56:00,317
for this game [INAUDIBLE].

1373
00:56:04,221 --> 00:56:05,596
PROFESSOR 3: So
as you have said,

1374
00:56:05,596 --> 00:56:07,980
I'm going to start diving
into some applications here.

1375
00:56:07,980 --> 00:56:12,180
And not only applications
but also some modifications

1376
00:56:12,180 --> 00:56:13,460
or augmentations of MCTS.

1377
00:56:13,460 --> 00:56:16,590
It should hopefully clarify
some of the side questions

1378
00:56:16,590 --> 00:56:21,610
you all have been having
on slight tweaks to MCTS.

1379
00:56:21,610 --> 00:56:23,470
Now let's get started.

1380
00:56:23,470 --> 00:56:24,230
Wait for it.

1381
00:56:24,230 --> 00:56:25,267
Now let's get started.

1382
00:56:25,267 --> 00:56:25,766
[LAUGHTER]

1383
00:56:25,766 --> 00:56:27,600
Part III, applications.

1384
00:56:27,600 --> 00:56:29,025
First thing we're
going to look at

1385
00:56:29,025 --> 00:56:31,110
is an MCTS-based
"Mario" controller.

1386
00:56:31,110 --> 00:56:35,290
And "Mario" might seem like
some weird thing to test AI on,

1387
00:56:35,290 --> 00:56:38,320
but there actually is a "Super
Mario Bros" AI benchmark,

1388
00:56:38,320 --> 00:56:39,930
which is used to
test a lot of AI

1389
00:56:39,930 --> 00:56:42,280
on how well they could
play this platformer.

1390
00:56:42,280 --> 00:56:45,290
In case any of you don't
know what "Super Mario

1391
00:56:45,290 --> 00:56:47,420
Bros" is, this is a screenshot.

1392
00:56:47,420 --> 00:56:49,170
Basically, you control
this one character.

1393
00:56:49,170 --> 00:56:52,780
It's a single-player game.

1394
00:56:52,780 --> 00:56:55,920
The ultimate goal is to
reach this flag at the end.

1395
00:56:55,920 --> 00:56:58,180
But along the way
there's enemies,

1396
00:56:58,180 --> 00:57:01,046
there's some bonus
shrooms you can get.

1397
00:57:01,046 --> 00:57:03,870
If you break open some
boxes you might get coins,

1398
00:57:03,870 --> 00:57:06,360
things like that.

1399
00:57:06,360 --> 00:57:09,642
But first, let's just highlight
some of the modifications that

1400
00:57:09,642 --> 00:57:12,100
need to be made, or some of
the differences between vanilla

1401
00:57:12,100 --> 00:57:16,590
MCTS and an MCTS that's going
to be able to work for "Mario."

1402
00:57:16,590 --> 00:57:18,296
First thing is that
it's single-player.

1403
00:57:18,296 --> 00:57:21,000
The second is, we use a
slightly different simulation

1404
00:57:21,000 --> 00:57:25,130
strategy than the initial
just vanilla simulation.

1405
00:57:25,130 --> 00:57:27,760
And someone actually hinted at
doing more than one simulation

1406
00:57:27,760 --> 00:57:32,280
because you, you're watching
us to n simulations, I think.

1407
00:57:32,280 --> 00:57:33,810
We'll touch on that.

1408
00:57:33,810 --> 00:57:36,630
Then this also introduces
what I would consider

1409
00:57:36,630 --> 00:57:38,840
to be domain knowledge.

1410
00:57:38,840 --> 00:57:42,854
Then finally, there's a 50 to
40 millisecond computation time.

1411
00:57:42,854 --> 00:57:45,270
And that has to do with the
frames per second of the game.

1412
00:57:45,270 --> 00:57:48,240
So you would think that
"Mario" is a continuous game,

1413
00:57:48,240 --> 00:57:50,950
but if we discretize
time into these chunks,

1414
00:57:50,950 --> 00:57:54,380
then we can use MCTS.
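
A minimal sketch of what a per-frame computation budget could look like, assuming a hypothetical `run_iteration` callback that performs one MCTS iteration (select, expand, simulate, backpropagate). The 40 ms default echoes the upper figure quoted above; the function name and structure are illustrative, not the controller's actual code.

```python
import time

def search_within_budget(run_iteration, budget_ms=40):
    """Run MCTS iterations until the per-frame time budget is used up.

    run_iteration is a hypothetical zero-argument callback that performs
    one full MCTS iteration. Returns the number of iterations completed.
    """
    deadline = time.perf_counter() + budget_ms / 1000.0
    iterations = 0
    while True:
        run_iteration()
        iterations += 1
        # Stop once the budget for this game frame is exhausted.
        if time.perf_counter() >= deadline:
            break
    return iterations
```

Because the loop always completes at least one iteration, the controller has some estimate to act on even when the budget is very small.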

1415
00:57:54,380 --> 00:57:56,570
Now let's just think about
how we could possibly

1416
00:57:56,570 --> 00:57:57,840
formulate this problem.

1417
00:57:57,840 --> 00:58:00,630
Can anyone think of
what each of these nodes

1418
00:58:00,630 --> 00:58:02,586
would be if we're
playing "Super Mario?"

1419
00:58:02,586 --> 00:58:04,169
AUDIENCE: Jump.

1420
00:58:04,169 --> 00:58:04,960
PROFESSOR 3: Sorry?

1421
00:58:04,960 --> 00:58:05,585
AUDIENCE: Jump.

1422
00:58:05,585 --> 00:58:07,960
It would be like, first
node you're going to jump.

1423
00:58:07,960 --> 00:58:11,600
PROFESSOR 3: That might
be a way to formulate it.

1424
00:58:11,600 --> 00:58:13,522
But I think that could get--

1425
00:58:13,522 --> 00:58:16,099
AUDIENCE: Oh, it's not your
control inputs [INAUDIBLE].

1426
00:58:16,099 --> 00:58:16,890
PROFESSOR 3: Right.

1427
00:58:16,890 --> 00:58:22,110
So the node itself isn't
going to be an action.

1428
00:58:22,110 --> 00:58:23,490
AUDIENCE: Equal frames.

1429
00:58:23,490 --> 00:58:24,698
PROFESSOR 3: Yeah, basically.

1430
00:58:24,698 --> 00:58:27,320
So it's going to be the
state of a game, what

1431
00:58:27,320 --> 00:58:28,490
we'll call a state.

1432
00:58:28,490 --> 00:58:30,360
So it's basically
just a screen grab.

1433
00:58:30,360 --> 00:58:31,990
And in this case,
it's

1434
00:58:31,990 --> 00:58:35,190
a 15 by 19 grid screen
grab of the game.

1435
00:58:35,190 --> 00:58:37,083
And it will have
information about-- it

1436
00:58:37,083 --> 00:58:40,240
knows Mario's position, it knows
the enemy's position, position

1437
00:58:40,240 --> 00:58:42,600
of the blocks, et cetera.

1438
00:58:42,600 --> 00:58:45,370
And then, as Yo
was saying, in MCTS

1439
00:58:45,370 --> 00:58:47,660
we have values associated
with our nodes.

1440
00:58:47,660 --> 00:58:49,240
And so it will
also have a value.

1441
00:58:49,240 --> 00:58:52,890
But we'll get into the
value in the next slide

1442
00:58:52,890 --> 00:58:56,820
because I can't really
fit it all in here.

1443
00:58:56,820 --> 00:58:58,930
With that being said
for our node, that

1444
00:58:58,930 --> 00:59:02,490
being the state of the game,
what makes sense for the edge?

1445
00:59:02,490 --> 00:59:03,420
Does anyone know?

1446
00:59:03,420 --> 00:59:05,990
How do we transition from
one state to another state?

1447
00:59:05,990 --> 00:59:07,220
AUDIENCE: Jump.

1448
00:59:07,220 --> 00:59:07,710
PROFESSOR 3: Yeah, exactly.

1449
00:59:07,710 --> 00:59:09,560
So this is where the
jump and all the actions

1450
00:59:09,560 --> 00:59:10,268
come into play.

1451
00:59:10,268 --> 00:59:11,970
So the actions that you take--

1452
00:59:11,970 --> 00:59:13,230
I didn't list all the actions.

1453
00:59:13,230 --> 00:59:16,502
You can also have a jump left,
jump right, all those things.

1454
00:59:16,502 --> 00:59:17,960
But basically, the
actions are what

1455
00:59:17,960 --> 00:59:19,209
takes you from state to state.

1456
00:59:19,209 --> 00:59:22,792
So I just drew out
what a node might

1457
00:59:22,792 --> 00:59:24,375
look like if you
used the jump action.

1458
00:59:24,375 --> 00:59:27,078
You might have Mario
go up in the sky.

1459
00:59:27,078 --> 00:59:28,572
Are there questions?

1460
00:59:28,572 --> 00:59:30,790
AUDIENCE: Does it just
run the rest of it?

1461
00:59:30,790 --> 00:59:34,520
Because that little thing's
moving as they move on?

1462
00:59:34,520 --> 00:59:37,260
PROFESSOR 3: Well, it's not
moving in this moment in time.

1463
00:59:37,260 --> 00:59:39,799
We're discretizing
time right now.

1464
00:59:39,799 --> 00:59:41,840
AUDIENCE: But I'm saying,
if your action is jump,

1465
00:59:41,840 --> 00:59:46,047
just you would have 1,000
nodes because if you did

1466
00:59:46,047 --> 00:59:48,130
plan out where that thing's
moving, left or right,

1467
00:59:48,130 --> 00:59:48,710
then it could be--

1468
00:59:48,710 --> 00:59:49,751
PROFESSOR 3: Yeah, right.

1469
00:59:49,751 --> 00:59:52,450
So in each state we
have the enemy position.

1470
00:59:52,450 --> 00:59:54,110
And we know the
speed and direction.

1471
00:59:54,110 --> 00:59:57,290
And so we know when we go from
this node to one time step

1472
00:59:57,290 --> 01:00:00,694
later, we'll know where
the enemy's moving.
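
The formulation above (nodes are game states, edges are actions, enemies advance deterministically each time step) can be sketched as follows. This is an illustration under stated assumptions: the `State` fields, the action names, and the one-cell-per-step movement are all hypothetical simplifications, not the benchmark's real state representation.

```python
from dataclasses import dataclass, field

# Hypothetical subset of the action set; the real controller has more.
ACTIONS = ("left", "right", "jump", "jump_left", "jump_right")

@dataclass
class State:
    mario_x: int
    mario_y: int
    enemy_x: int
    enemy_dx: int  # enemy speed and direction per time step

@dataclass
class Node:
    state: State
    children: dict = field(default_factory=dict)  # action -> child Node

def step(state, action):
    """Advance one discrete time step: Mario acts, the enemy keeps moving."""
    dx = {"left": -1, "right": 1, "jump_left": -1, "jump_right": 1}.get(action, 0)
    dy = 1 if action.startswith("jump") else 0
    return State(state.mario_x + dx, state.mario_y + dy,
                 state.enemy_x + state.enemy_dx, state.enemy_dx)

def expand(node, action):
    """Add the child reached by taking `action` (the edge) from `node`."""
    child = Node(step(node.state, action))
    node.children[action] = child
    return child
```

Since the enemy's speed and direction are part of the state, each action leads to exactly one successor state, which is why a jump does not fan out into many nodes.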

1473
01:00:00,694 --> 01:00:04,530
Any other questions?

1474
01:00:04,530 --> 01:00:06,550
Moving on.

1475
01:00:06,550 --> 01:00:07,290
Sorry.

1476
01:00:07,290 --> 01:00:09,080
Let me just preface
this part real quick.

1477
01:00:09,080 --> 01:00:11,970
So in our other simulations,
at the end of the simulation

1478
01:00:11,970 --> 01:00:14,700
we would get either a one or a
zero, if we'd won tic-tac-toe

1479
01:00:14,700 --> 01:00:17,910
or we lost tic-tac-toe.

1480
01:00:17,910 --> 01:00:21,600
But that won't really work
too well here because there's

1481
01:00:21,600 --> 01:00:23,850
a lot of other factors
that go into play when

1482
01:00:23,850 --> 01:00:25,410
you're playing "Mario."

1483
01:00:25,410 --> 01:00:28,450
Also, if you're doing a
simulation, more than likely,

1484
01:00:28,450 --> 01:00:30,210
you're going to end
up hitting an enemy

1485
01:00:30,210 --> 01:00:32,380
and dying or falling
into a gap and dying.

1486
01:00:32,380 --> 01:00:34,830
So a lot of these simulations
might all return zero.

1487
01:00:34,830 --> 01:00:38,440
And that is, you can't really
distinguish between them.

1488
01:00:38,440 --> 01:00:41,670
So this is why I say,
this version of MCTS

1489
01:00:41,670 --> 01:00:44,370
introduces what I would
consider to be domain knowledge.

1490
01:00:44,370 --> 01:00:46,860
Basically, they're
assigning scores

1491
01:00:46,860 --> 01:00:50,680
to potential things that
could happen along the way.

1492
01:00:50,680 --> 01:00:55,630
And this is basically telling
the AI that collecting a flower

1493
01:00:55,630 --> 01:00:57,936
is a little bit better
than collecting a mushroom.

1494
01:00:57,936 --> 01:00:59,760
It's telling it that
getting hurt is bad.

1495
01:00:59,760 --> 01:01:02,130
Right off the bat, all
these things in the score

1496
01:01:02,130 --> 01:01:05,100
are giving the AI some domain
knowledge about "Super Mario

1497
01:01:05,100 --> 01:01:07,901
Bros," that is helping
it calculate the simulation

1498
01:01:07,901 --> 01:01:08,400
results.

1499
01:01:10,985 --> 01:01:14,280
As it says here, it's just doing
a multi-objective weighted sum

1500
01:01:14,280 --> 01:01:15,450
of all these things.

1501
01:01:15,450 --> 01:01:17,825
Throughout the simulation it's
just adding up your score.

1502
01:01:17,825 --> 01:01:20,742
And then that's the score that
is going to be propagated.
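
A minimal sketch of the weighted-sum score, assuming hypothetical event names. Only the distance weight (0.1) and the time-left weight (2) are values quoted in this lecture; every other weight below is a made-up placeholder, chosen only so that a flower is worth slightly more than a mushroom and getting hurt is negative.

```python
# Weights per event. "distance" and "time_left" are the values quoted in
# the lecture; all other entries are hypothetical placeholders.
WEIGHTS = {
    "distance": 0.1,
    "time_left": 2.0,
    "flower": 64.0,    # hypothetical
    "mushroom": 58.0,  # hypothetical, slightly below flower
    "coin": 16.0,      # hypothetical
    "hurt": -42.0,     # hypothetical penalty
}

def simulation_score(events):
    """Weighted sum over event counts collected during one simulation.

    events maps an event name to how often it happened (or, for
    distance and time_left, to the measured amount).
    """
    return sum(WEIGHTS[name] * amount for name, amount in events.items())
```

The result is a single number per simulation, which is what gets propagated back up the tree.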

1503
01:01:20,742 --> 01:01:24,479
Are there questions
about the score?

1504
01:01:24,479 --> 01:01:26,520
AUDIENCE: You said that
it adds up all these guys

1505
01:01:26,520 --> 01:01:27,730
and it propagates it over.

1506
01:01:27,730 --> 01:01:33,086
Is it possible to just propagate
the multi-part sum [INAUDIBLE]

1507
01:01:33,086 --> 01:01:37,655
as opposed to propagating
one value that you create?

1508
01:01:37,655 --> 01:01:39,600
Are you essentially
propagating all--

1509
01:01:39,600 --> 01:01:42,375
what's this?-- 15 values
upwards at every node, or are

1510
01:01:42,375 --> 01:01:43,500
you propagating one value--

1511
01:01:43,500 --> 01:01:44,220
PROFESSOR 3: Well,
it's one value.

1512
01:01:44,220 --> 01:01:45,000
It's the collective--

1513
01:01:45,000 --> 01:01:46,315
AUDIENCE: Then you make
them add it together

1514
01:01:46,315 --> 01:01:48,065
and you got each one
of them a sub factor.

1515
01:01:50,164 --> 01:01:52,330
PROFESSOR 3: Then also,
just one thing to note here,

1516
01:01:52,330 --> 01:01:55,230
is distance, you get 0.1.

1517
01:01:55,230 --> 01:01:58,650
And these are all parameters
that have been tuned.

1518
01:01:58,650 --> 01:02:01,470
In the initial version,
distance was, I think,

1519
01:02:01,470 --> 01:02:05,035
a reward of five,
but they probably realized

1520
01:02:05,035 --> 01:02:08,500
that that made Mario skip past
a lot of coins and things.

1521
01:02:08,500 --> 01:02:11,050
And so he tweaked
the score for that.

1522
01:02:11,050 --> 01:02:13,100
And also, time left is two.

1523
01:02:13,100 --> 01:02:14,460
So there's some weight there.

1524
01:02:14,460 --> 01:02:16,700
You want to get to the
very end of the game.

1525
01:02:16,700 --> 01:02:18,720
AUDIENCE: If you're
pushing up this score,

1526
01:02:18,720 --> 01:02:20,850
it's no longer a
win over losses.

1527
01:02:20,850 --> 01:02:22,920
So it's not w over n.

1528
01:02:22,920 --> 01:02:23,950
What is it affecting?

1529
01:02:23,950 --> 01:02:25,764
PROFESSOR 3: You can
just use the score.

1530
01:02:25,764 --> 01:02:26,930
AUDIENCE: The score is the--

1531
01:02:26,930 --> 01:02:27,860
PROFESSOR 3: Yeah.

1532
01:02:27,860 --> 01:02:31,620
In MCTS you have
this idea of when

1533
01:02:31,620 --> 01:02:36,540
you're propagating your q value,
you could have that to be zero,

1534
01:02:36,540 --> 01:02:37,080
one.

1535
01:02:37,080 --> 01:02:39,665
AUDIENCE: It's like the sum of
all the scores and the nodes

1536
01:02:39,665 --> 01:02:41,290
below over the number
of games you win.

1537
01:02:41,290 --> 01:02:42,150
PROFESSOR 3: So
basically, what you

1538
01:02:42,150 --> 01:02:44,700
would be getting when you divide
by the number of simulations

1539
01:02:44,700 --> 01:02:47,980
is your average
score at that node.
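
As a sketch of this exchange: instead of wins over visits, each node can keep a running total of simulation scores and a visit count, so the value used during selection is the average score. The class and function names here are hypothetical.

```python
class NodeStats:
    """Per-node statistics when simulations return scores, not win/loss."""

    def __init__(self):
        self.total_score = 0.0
        self.visits = 0

    def update(self, score):
        self.total_score += score
        self.visits += 1

    def mean(self):
        # Average simulation score seen through this node (0 if unvisited).
        return self.total_score / self.visits if self.visits else 0.0

def backpropagate(path, score):
    """Add one simulation's score to every node on the path to the root."""
    for node in path:
        node.update(score)
```

With 0/1 outcomes this reduces to the usual win rate, so the win/loss case is a special case of the same bookkeeping.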

1540
01:02:47,980 --> 01:02:50,330
AUDIENCE: OK.

1541
01:02:50,330 --> 01:02:53,550
AUDIENCE: When you have
killsByFire and [INAUDIBLE]

1542
01:02:53,550 --> 01:02:56,490
like that, if you
have a positive value,

1543
01:02:56,490 --> 01:02:58,562
then isn't it good
to be killed by fire,

1544
01:02:58,562 --> 01:02:59,520
or something like that?

1545
01:02:59,520 --> 01:03:01,436
PROFESSOR 3: This is
killing an enemy by fire.

1546
01:03:01,436 --> 01:03:04,350
Like Mario could collect a
certain flower or mushroom?

1547
01:03:04,350 --> 01:03:07,060
I think flower, then you
have a fire breath and you

1548
01:03:07,060 --> 01:03:07,560
[INAUDIBLE].

1549
01:03:07,560 --> 01:03:10,864
AUDIENCE: So that's Mario's
status if Mario never dies?

1550
01:03:10,864 --> 01:03:11,530
PROFESSOR 3: No.

1551
01:03:11,530 --> 01:03:13,120
Mario's status is--

1552
01:03:13,120 --> 01:03:15,086
I believe, Mario's
status is the fact

1553
01:03:15,086 --> 01:03:17,710
that you could upgrade Mario by
collecting [INAUDIBLE] mushroom

1554
01:03:17,710 --> 01:03:20,154
from a fire Mario.

1555
01:03:20,154 --> 01:03:21,570
So that gives you
a lot of points.

1556
01:03:21,570 --> 01:03:22,945
Because if you
become fire Mario,

1557
01:03:22,945 --> 01:03:26,795
then you're more likely to not
die by running into enemies

1558
01:03:26,795 --> 01:03:28,434
because you have fire-spewing--

1559
01:03:28,434 --> 01:03:30,225
AUDIENCE: You said they
spent a lot of time

1560
01:03:30,225 --> 01:03:31,900
tuning these parameters.

1561
01:03:31,900 --> 01:03:34,170
Isn't it generally, though,
just an optimization

1562
01:03:34,170 --> 01:03:36,540
framework if that's
some formula?

1563
01:03:36,540 --> 01:03:38,215
So they tuned the
parameters just

1564
01:03:38,215 --> 01:03:41,370
to make it behave the way
that we think is nice.

1565
01:03:41,370 --> 01:03:43,480
But if you change
the values, they'll

1566
01:03:43,480 --> 01:03:45,120
do the right thing
for that equation.

1567
01:03:45,120 --> 01:03:45,870
PROFESSOR 3: Yeah.

1568
01:03:45,870 --> 01:03:46,320
AUDIENCE: OK.

1569
01:03:46,320 --> 01:03:47,028
PROFESSOR 3: Yes.

1570
01:03:47,028 --> 01:03:50,830
But they were tuning this to
make it play how they wanted.

1571
01:03:50,830 --> 01:03:56,069
AUDIENCE: [INAUDIBLE] can't just
be a reflection of [INAUDIBLE]

1572
01:03:56,069 --> 01:03:57,360
PROFESSOR 3: That's a strategy.

1573
01:03:57,360 --> 01:03:59,380
If you choose that,
I don't see why not.

1574
01:03:59,380 --> 01:04:02,352
That might affect
certain things.

1575
01:04:02,352 --> 01:04:04,560
Obviously, you can change
these to whatever you want.

1576
01:04:04,560 --> 01:04:06,730
It'll slightly tweak
which simulations

1577
01:04:06,730 --> 01:04:10,240
as to working better, in terms
of changing which nodes you

1578
01:04:10,240 --> 01:04:12,640
end up choosing [INAUDIBLE].

1579
01:04:12,640 --> 01:04:15,670
So we move on.

1580
01:04:15,670 --> 01:04:17,820
So we know about
scoring simulations.

1581
01:04:17,820 --> 01:04:20,160
Now we're going to look at
exactly the simulation type

1582
01:04:20,160 --> 01:04:23,580
that's used to play
this MCTS controller.

1583
01:04:23,580 --> 01:04:25,500
So the regular version
that Yo talked about

1584
01:04:25,500 --> 01:04:28,290
is just choosing a
random node at each level

1585
01:04:28,290 --> 01:04:30,300
in your simulation.

1586
01:04:30,300 --> 01:04:31,840
But there are some
other strategies.

1587
01:04:31,840 --> 01:04:32,965
And someone brought one up.

1588
01:04:32,965 --> 01:04:34,500
The first is, look at best of n.

1589
01:04:34,500 --> 01:04:39,280
So in this one, you choose three
random nodes at each level,

1590
01:04:39,280 --> 01:04:43,122
except that you stick with
the best of those three.

1591
01:04:43,122 --> 01:04:45,080
Choose three random nodes,
stick with this one.

1592
01:04:45,080 --> 01:04:45,871
Go to the next one.

1593
01:04:45,871 --> 01:04:48,671
You would choose three random
nodes, take the best one,

1594
01:04:48,671 --> 01:04:49,920
and then go to the next level.

1595
01:04:49,920 --> 01:04:53,216
You are able to do that
in this game because

1596
01:04:53,216 --> 01:04:54,590
of the way the
scoring works, you

1597
01:04:54,590 --> 01:04:56,923
don't have to get to the end
of the game for your score.

1598
01:04:56,923 --> 01:05:00,425
You actually could collect
a coin along the way.

1599
01:05:00,425 --> 01:05:01,860
If this is jump,
and then it gets

1600
01:05:01,860 --> 01:05:03,610
you a coin versus
moving left and right.

1601
01:05:03,610 --> 01:05:04,720
That doesn't give
you any points.

1602
01:05:04,720 --> 01:05:07,310
Then this is the node that would
give you the highest scores,

1603
01:05:07,310 --> 01:05:09,376
so I would choose
that one, et cetera.
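
A rough sketch of the best-of-n simulation strategy, assuming hypothetical `step(state, action)` and `reward(state, action)` functions supplied by the caller. It samples n random actions at each level (with replacement, which is a simplification) and follows the one with the highest immediate reward.

```python
import random

def best_of_n_rollout(state, actions, step, reward, n=3, depth=10, rng=random):
    """Simulate `depth` steps, picking the best of n random actions each time.

    step(state, action) returns the next state and reward(state, action)
    returns the immediate score; both are hypothetical callbacks.
    Returns the accumulated score of the rollout.
    """
    total = 0.0
    for _ in range(depth):
        # Sample n candidate actions and keep the most rewarding one.
        candidates = [rng.choice(actions) for _ in range(n)]
        best = max(candidates, key=lambda a: reward(state, a))
        total += reward(state, best)
        state = step(state, best)
    return total
```

This only works because the score gives feedback along the way; with a pure win/loss outcome there would be nothing to compare the candidates on until the game ends.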

1604
01:05:09,376 --> 01:05:11,250
And then the final one,
which is the one that

1605
01:05:11,250 --> 01:05:12,791
is actually used
for this controller,

1606
01:05:12,791 --> 01:05:14,700
is multi-simulation.

1607
01:05:14,700 --> 01:05:17,043
This was brought up by him.

1608
01:05:17,043 --> 01:05:18,005
I don't know your name.

1609
01:05:18,005 --> 01:05:18,505
Sorry.

1610
01:05:18,505 --> 01:05:21,410
But basically, you run
multiple random simulations

1611
01:05:21,410 --> 01:05:22,035
from your node.

1612
01:05:22,035 --> 01:05:24,990
And then you propagate up
whichever of those simulations

1613
01:05:24,990 --> 01:05:26,580
gives you the highest value.

1614
01:05:26,580 --> 01:05:28,600
And the reason to do
multiple simulations

1615
01:05:28,600 --> 01:05:33,767
is to attempt to increase the
accuracy of your simulations.

1616
01:05:33,767 --> 01:05:35,141
If you just do
one simulation you

1617
01:05:35,141 --> 01:05:36,307
might just get really lucky.

1618
01:05:36,307 --> 01:05:40,490
But if you do three then you
can take the highest value

1619
01:05:40,490 --> 01:05:42,450
and use that as your value.

1620
01:05:42,450 --> 01:05:45,424
Since the whole point of this is
to try to make moves that get you

1621
01:05:45,424 --> 01:05:47,330
the highest values,
then that will

1622
01:05:47,330 --> 01:05:50,700
make your random simulation
value more accurate.
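
Multi-simulation as described here can be sketched in a few lines, assuming a hypothetical zero-argument `rollout` function that runs one random simulation from the current node and returns its score.

```python
def multi_simulation(rollout, n_sims=3):
    """Run several random simulations and keep the highest score.

    rollout is a hypothetical zero-argument callable that performs one
    random simulation from the current node and returns its score.
    """
    return max(rollout() for _ in range(n_sims))
```

Note that taking the maximum gives an optimistic estimate: it approximates what the best line of play from this node could achieve, not the average outcome of random play. The more simulations, the closer it gets to the true maximum, at the cost of computation time.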

1623
01:05:50,700 --> 01:05:52,915
Are there questions
about multi-simulation?

1624
01:05:52,915 --> 01:05:55,040
AUDIENCE: So what do you
think about the simulation

1625
01:05:55,040 --> 01:05:58,940
[INAUDIBLE] how many [INAUDIBLE]

1626
01:05:58,940 --> 01:06:00,990
PROFESSOR 3: So there's
a trade off here.

1627
01:06:00,990 --> 01:06:04,655
The more simulations you
do the more accurate--

1628
01:06:04,655 --> 01:06:06,280
the more representative
your simulation

1629
01:06:06,280 --> 01:06:08,120
will be at the end of the game.

1630
01:06:10,720 --> 01:06:13,600
You could run two to
the whatever simulations

1631
01:06:13,600 --> 01:06:16,390
to try to get every
single possible action

1632
01:06:16,390 --> 01:06:17,700
and then take the max of that.

1633
01:06:17,700 --> 01:06:19,450
And that would give
you the maximum value.

1634
01:06:19,450 --> 01:06:20,791
That would be ideal.

1635
01:06:20,791 --> 01:06:22,290
But obviously, that
takes more time.

1636
01:06:22,290 --> 01:06:24,248
So there's a trade off
between computation time

1637
01:06:24,248 --> 01:06:25,990
and the number of
simulations you run.

1638
01:06:25,990 --> 01:06:27,820
And that's just something
that they probably just

1639
01:06:27,820 --> 01:06:28,611
played around with.

1640
01:06:28,611 --> 01:06:37,472
AUDIENCE: Do you
use [INAUDIBLE] have

1641
01:06:37,472 --> 01:06:41,422
to finish the decision losing a
couple of minutes or 10 minutes

1642
01:06:41,422 --> 01:06:43,380
or they're going to take
your [INAUDIBLE] away.

1643
01:06:43,380 --> 01:06:46,340
PROFESSOR 3: In this competition
there is different computation

1644
01:06:46,340 --> 01:06:48,480
time budgets that you get.

1645
01:06:48,480 --> 01:06:51,140
And I believe the reason for
the different computation time

1646
01:06:51,140 --> 01:06:53,091
budgets is the frame
per second of the game.

1647
01:06:58,390 --> 01:07:00,380
I told you all
about the setup, we

1648
01:07:00,380 --> 01:07:03,800
went over, the scoring, the
nodes, what the edges are,

1649
01:07:03,800 --> 01:07:05,935
what simulation
strategy is used.

1650
01:07:05,935 --> 01:07:08,490
So you probably want
to see it in action.

1651
01:07:08,490 --> 01:07:11,150
So this is always a risky move
trying to get video to play.

1652
01:07:11,150 --> 01:07:12,830
AUDIENCE: It's actually
in the back up.

1653
01:07:12,830 --> 01:07:14,000
Hit Escape.

1654
01:07:14,000 --> 01:07:14,800
PROFESSOR 3: OK.

1655
01:07:14,800 --> 01:07:15,660
Got it.

1656
01:07:15,660 --> 01:07:17,075
AUDIENCE: And now, I guess, we--

1657
01:07:17,075 --> 01:07:18,765
PROFESSOR 3: And
drag it over again?

1658
01:07:18,765 --> 01:07:19,390
AUDIENCE: Yeah.

1659
01:07:24,667 --> 01:07:26,250
PROFESSOR 3: Running
this full screen.

1660
01:07:26,250 --> 01:07:28,760
AUDIENCE: Hit the [INAUDIBLE]

1661
01:07:28,760 --> 01:07:33,330
PROFESSOR 3:
[INAUDIBLE] All right.

1662
01:07:33,330 --> 01:07:35,887
Here's this MCTS-based
"Mario" playing controller.

1663
01:07:35,887 --> 01:07:37,470
You can see he's
actually wrecking, so

1664
01:07:37,470 --> 01:07:39,240
doing some serious damage here.

1665
01:07:39,240 --> 01:07:42,840
But those lines that you
see, the reason they're

1666
01:07:42,840 --> 01:07:46,616
different colors it's not
showing different players,

1667
01:07:46,616 --> 01:07:47,532
or anything like that.

1668
01:07:47,532 --> 01:07:49,157
It's just using
different colors so you

1669
01:07:49,157 --> 01:07:52,065
can see the different
layers of this tree search.

1670
01:07:52,065 --> 01:07:53,940
You can see he actually
went backwards there.

1671
01:07:53,940 --> 01:07:55,520
And that's because
in a simulation,

1672
01:07:55,520 --> 01:07:58,670
when one of the backward
ones landed on an enemy--

1673
01:07:58,670 --> 01:08:01,710
and in fact gets you points
from our scoring system versus

1674
01:08:01,710 --> 01:08:04,376
if you had just gone forward you
would have gotten some distance

1675
01:08:04,376 --> 01:08:06,110
points but not--

1676
01:08:06,110 --> 01:08:08,740
also, he is just [INAUDIBLE]

1677
01:08:08,740 --> 01:08:12,754
The simulation is quickly being
able to figure out that he

1678
01:08:12,754 --> 01:08:13,920
can jump on all his enemies.

1679
01:08:13,920 --> 01:08:16,340
So he's just wrecking
all these guys.

1680
01:08:16,340 --> 01:08:19,350
Getting lots of points here,
collecting the coin, et cetera.

1681
01:08:19,350 --> 01:08:20,760
You get the idea.

1682
01:08:20,760 --> 01:08:22,230
It's pretty awesome to watch.

1683
01:08:22,230 --> 01:08:23,979
There's that flower
we were talking about.

1684
01:08:23,979 --> 01:08:28,532
So now he's actually a
fire-spewing Mario demon.

1685
01:08:28,532 --> 01:08:30,990
He's doing some serious
damage with that.

1686
01:08:30,990 --> 01:08:31,979
Stepping on missiles.

1687
01:08:31,979 --> 01:08:34,689
I didn't even know you
could step on the missiles.

1688
01:08:34,689 --> 01:08:36,930
All right.

1689
01:08:36,930 --> 01:08:38,370
You could watch
this for a while.

1690
01:08:38,370 --> 01:08:41,599
But we'll exit now.

1691
01:08:41,599 --> 01:08:44,430
It looks super
promising in this video.

1692
01:08:44,430 --> 01:08:46,450
I don't know how to
close Mac stuff.

1693
01:08:46,450 --> 01:08:48,590
AUDIENCE: Just click
on back [INAUDIBLE]

1694
01:08:48,590 --> 01:08:50,300
PROFESSOR 3: There it is.

1695
01:08:50,300 --> 01:08:51,819
OK.

1696
01:08:51,819 --> 01:08:54,300
The demo looks really cool,
looks really promising.

1697
01:08:54,300 --> 01:08:57,330
Let's take a look at the
charts here because we all

1698
01:08:57,330 --> 01:09:00,240
want some quantitative stuff.

1699
01:09:00,240 --> 01:09:01,286
This is the chart.

1700
01:09:01,286 --> 01:09:02,410
The score is on the y-axis.

1701
01:09:02,410 --> 01:09:05,060
The bottom is computation
budget, which is something

1702
01:09:05,060 --> 01:09:06,439
that you were talking about.

1703
01:09:06,439 --> 01:09:11,760
I just want to highlight
to make this a little more

1704
01:09:11,760 --> 01:09:13,319
visually appealing here.

1705
01:09:13,319 --> 01:09:16,870
All of these things
that I highlighted,

1706
01:09:16,870 --> 01:09:17,939
it's labelled as UCT.

1707
01:09:17,939 --> 01:09:20,387
That's Upper
Confidence Bound Tree.

1708
01:09:20,387 --> 01:09:22,470
Remember, Yo talked about
upper confidence bounds.

1709
01:09:22,470 --> 01:09:24,990
That's essentially
what's used in MCTS

1710
01:09:24,990 --> 01:09:26,262
for guiding your tree search.

1711
01:09:26,262 --> 01:09:27,470
So these are all the methods.

1712
01:09:27,470 --> 01:09:31,200
But then UCT multi, which is
this purple square, that's

1713
01:09:31,200 --> 01:09:34,930
saying it's using MCTS but it's
doing the multiple simulations.

1714
01:09:34,930 --> 01:09:41,090
And you can see this multi
plus carry is also at the top.

1715
01:09:41,090 --> 01:09:43,510
Both these use the
multi-simulation technique.

1716
01:09:43,510 --> 01:09:47,279
And then the plus carry is
they added an extra scoring

1717
01:09:47,279 --> 01:09:49,800
mechanism for carries.

1718
01:09:49,800 --> 01:09:52,670
I believe that's probably
like carrying a shell.

1719
01:09:52,670 --> 01:09:54,715
That made it do better.

1720
01:09:54,715 --> 01:09:56,340
Then these ones that
aren't highlighted

1721
01:09:56,340 --> 01:10:01,130
are using plain A*, and then
a refined version of A*.

1722
01:10:01,130 --> 01:10:03,630
With increasing time,
they do increase scores,

1723
01:10:03,630 --> 01:10:07,950
but they're even worse
than just your UCT

1724
01:10:07,950 --> 01:10:13,424
with just random simulation,
no multi-simulations.

1725
01:10:13,424 --> 01:10:15,340
We're running low on
time, which is not ideal.

1726
01:10:15,340 --> 01:10:19,810
But another thing that I want to
point out is down at the bottom

1727
01:10:19,810 --> 01:10:23,830
here, these are the
multi-simulations.

1728
01:10:23,830 --> 01:10:27,540
They have the lowest
maximal search depth, which

1729
01:10:27,540 --> 01:10:30,846
at first would seem like, what?

1730
01:10:30,846 --> 01:10:33,840
I have the lowest search depth
but my score is the most?

1731
01:10:33,840 --> 01:10:35,730
But that comes
into play with what you

1732
01:10:35,730 --> 01:10:38,220
were saying about the trade-off
between the simulations

1733
01:10:38,220 --> 01:10:41,220
and the amount of time it takes.

1734
01:10:41,220 --> 01:10:43,530
So because I'm doing
multiple simulations,

1735
01:10:43,530 --> 01:10:46,110
I'm taking more
time at each node.

1736
01:10:46,110 --> 01:10:49,770
But that's giving me a more
accurate value assessment.

1737
01:10:49,770 --> 01:10:52,300
So that lets me choose
my actions more carefully,

1738
01:10:52,300 --> 01:10:53,907
or with more information.

1739
01:10:53,907 --> 01:10:56,240
And so that's what's able to
give me these better scores.
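
A small sketch of why several simulations per node give a more accurate value estimate. It assumes a made-up rollout that returns 1 with probability 0.6 and 0 otherwise; the rollout model, the counts, and the names are illustrative, not the Mario scoring system.

```python
import random
import statistics

def rollout(rng):
    """Hypothetical noisy simulation whose true expected value is 0.6."""
    return 1.0 if rng.random() < 0.6 else 0.0

def estimate(n_rollouts, rng):
    """Average n_rollouts simulations from the same node."""
    return sum(rollout(rng) for _ in range(n_rollouts)) / n_rollouts

rng = random.Random(0)
single = [estimate(1, rng) for _ in range(2000)]  # one simulation per node
multi = [estimate(8, rng) for _ in range(2000)]   # eight simulations per node

# The spread of the estimates shrinks when more simulations are averaged.
print(statistics.pstdev(single), statistics.pstdev(multi))
```

The price is that each node costs eight rollouts instead of one, so the tree cannot go as deep within the same budget.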

1740
01:10:59,590 --> 01:11:00,400
That's all "Mario."

1741
01:11:00,400 --> 01:11:02,010
So we're going to be
moving on to AlphaGo.

1742
01:11:02,010 --> 01:11:04,590
Are there any questions about
"Mario" before I go to AlphaGo?

1743
01:11:04,590 --> 01:11:05,090
Yeah.

1744
01:11:05,090 --> 01:11:07,300
AUDIENCE: What's the table
[INAUDIBLE] inference?

1745
01:11:07,300 --> 01:11:08,800
PROFESSOR 3: That's
a good question.

1746
01:11:08,800 --> 01:11:12,420
I have a feeling it's because
if you're doing best of n,

1747
01:11:12,420 --> 01:11:16,480
that's really heavily relying
on your scoring metrics.

1748
01:11:19,360 --> 01:11:21,786
Let's say at one step
if I jump and collect

1749
01:11:21,786 --> 01:11:23,660
a coin versus if I go
left or right and play,

1750
01:11:23,660 --> 01:11:25,326
I'll get more points
if I get that coin.

1751
01:11:25,326 --> 01:11:27,690
But maybe, a missile is
going to hit me in the face

1752
01:11:27,690 --> 01:11:28,690
if I do that.

1753
01:11:28,690 --> 01:11:30,940
It gets rid of
some of the-- it's

1754
01:11:30,940 --> 01:11:32,670
forcing you to do certain moves.

1755
01:11:32,670 --> 01:11:34,873
AUDIENCE: Is the A*
heuristic using the same

1756
01:11:34,873 --> 01:11:37,718
value, the same value
that you're getting

1757
01:11:37,718 --> 01:11:38,861
by your simulation?

1758
01:11:38,861 --> 01:11:39,610
PROFESSOR 3: Yeah.

1759
01:11:39,610 --> 01:11:42,820
I'm not exactly sure what
the A* heuristic is.

1760
01:11:42,820 --> 01:11:48,520
The whole reason that A* is
difficult is because coming up

1761
01:11:48,520 --> 01:11:51,900
with heuristics for
these types of games is.

1762
01:11:51,900 --> 01:11:54,670
But this is not his
version of A*.

1763
01:11:54,670 --> 01:11:57,190
I believe this is the
A* that was used by--

1764
01:11:57,190 --> 01:12:00,890
I forget the name of the guy--
but he won the AI competition

1765
01:12:00,890 --> 01:12:04,060
a couple of years ago.

1766
01:12:04,060 --> 01:12:06,390
I'm going to try to
move on to AlphaGo.

1767
01:12:06,390 --> 01:12:09,985
Does someone have how
many minutes I have left?

1768
01:12:09,985 --> 01:12:10,610
AUDIENCE: Four.

1769
01:12:10,610 --> 01:12:11,276
PROFESSOR 3: OK.

1770
01:12:11,276 --> 01:12:13,005
We're going to power through.

1771
01:12:13,005 --> 01:12:14,124
Here's AlphaGo.

1772
01:12:14,124 --> 01:12:15,540
Hopefully, you all
know the rules.

1773
01:12:15,540 --> 01:12:17,490
Just in case, I'll just
go through a quick--

1774
01:12:17,490 --> 01:12:18,220
19 by 19.

1775
01:12:18,220 --> 01:12:20,300
You alternate black
stones and white stones.

1776
01:12:20,300 --> 01:12:23,480
You collect enemy stones by
completely surrounding them.

1777
01:12:23,480 --> 01:12:25,740
You can surround a single
stone or groups of stones.

1778
01:12:25,740 --> 01:12:28,365
And your score is your
territory plus the number

1779
01:12:28,365 --> 01:12:29,115
of captured pieces.

1780
01:12:29,115 --> 01:12:31,573
So your territory is just the
area that you're surrounding,

1781
01:12:31,573 --> 01:12:34,506
and then you just add the
stones you've collected.

1782
01:12:34,506 --> 01:12:35,880
The rules aren't
super important.

1783
01:12:35,880 --> 01:12:39,399
The main emphasis is there's
very few rules so you

1784
01:12:39,399 --> 01:12:40,690
would think it's really simple.

1785
01:12:40,690 --> 01:12:43,105
But the complexity of the
game is quite extreme.

1786
01:12:45,660 --> 01:12:50,290
At each turn you have about
250 options that you can play.

1787
01:12:50,290 --> 01:12:52,440
Each Go game lasts
about 150 turns.

1788
01:12:52,440 --> 01:12:54,750
So that gives you a total
of 10 to the 761 games,

1789
01:12:54,750 --> 01:12:56,370
approximately.
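
A quick check of the arithmetic, treating the game tree as roughly (options per turn) raised to the power (number of turns). The 80-turn length for chess is an assumption that is not stated in the lecture. By this simple count, 250 options over 150 turns comes to roughly 10 to the 360, so the quoted 10 to the 761 presumably rests on a different way of counting games than this product.

```python
import math

def log10_game_tree(branching, depth):
    """Base-10 exponent of branching ** depth, computed without huge integers."""
    return depth * math.log10(branching)

go = log10_game_tree(250, 150)   # figures quoted in the lecture
chess = log10_game_tree(35, 80)  # 35 is from the lecture, 80 turns is assumed

print(f"Go: about 10^{go:.1f}; chess: about 10^{chess:.1f}")
```

Either way, the gap between the two exponents is what makes exhaustive tree building impractical for Go.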

1790
01:12:56,370 --> 01:12:58,470
And to put that in
comparison, here's chess.

1791
01:12:58,470 --> 01:12:59,820
You can read those numbers.

1792
01:12:59,820 --> 01:13:01,400
Chess is also pretty complex.

1793
01:13:01,400 --> 01:13:03,610
But there's 35
options per turn.

1794
01:13:03,610 --> 01:13:06,432
Deep Blue.

1795
01:13:06,432 --> 01:13:08,890
I think you were talking about
building out the whole tree.

1796
01:13:08,890 --> 01:13:12,100
So Deep Blue would build
out the tree for six levels.

1797
01:13:12,100 --> 01:13:14,545
And then use this
hard core chess

1798
01:13:14,545 --> 01:13:17,745
master inputted heuristic
evaluation that it

1799
01:13:17,745 --> 01:13:20,230
used to find the best move.

1800
01:13:20,230 --> 01:13:22,712
Except with Go, you
have 250 options,

1801
01:13:22,712 --> 01:13:26,670
which already is adding
a lot more complexity.

1802
01:13:26,670 --> 01:13:30,970
So that strategy won't
work quite as nicely.

1803
01:13:30,970 --> 01:13:31,870
What do we do?

1804
01:13:31,870 --> 01:13:34,075
We use a modified
version of MCTS.

1805
01:13:34,075 --> 01:13:35,210
Well, it's not what we do.

1806
01:13:35,210 --> 01:13:39,220
That's what Google's
DeepMind team did with Go.

1807
01:13:39,220 --> 01:13:41,900
They combined neural
networks with MCTS.

1808
01:13:41,900 --> 01:13:45,430
Coincidentally, we learned about
neural networks last class.

1809
01:13:45,430 --> 01:13:47,900
Probably not a coincidence.

1810
01:13:47,900 --> 01:13:49,400
PROFESSOR 3: It's
not a coincidence.

1811
01:13:49,400 --> 01:13:51,500
PROFESSOR 3: There were
two policy networks

1812
01:13:51,500 --> 01:13:53,430
in AlphaGo, and
one value network.

1813
01:13:53,430 --> 01:13:55,580
And another big
coincidence here,

1814
01:13:55,580 --> 01:13:57,475
the two policy
networks are actually

1815
01:13:57,475 --> 01:14:00,140
CNNs, which we learned
specifically about last class,

1816
01:14:00,140 --> 01:14:01,390
convolutional neural nets.

1817
01:14:01,390 --> 01:14:04,515
And the reason for
that is the input

1818
01:14:04,515 --> 01:14:07,995
to the policy neural networks
is an image of the game.

1819
01:14:07,995 --> 01:14:10,120
And remember, convolutional
neural nets work really

1820
01:14:10,120 --> 01:14:12,170
well with images.

1821
01:14:12,170 --> 01:14:15,520
What it outputs, though, is
a probability distribution

1822
01:14:15,520 --> 01:14:16,770
over the legal moves.

1823
01:14:16,770 --> 01:14:20,740
And the idea is, that if a
move has a higher probability

1824
01:14:20,740 --> 01:14:23,830
it will be a more promising
move for you to take.

1825
01:14:23,830 --> 01:14:27,195
But another key point is
that it's not deterministic.

1826
01:14:27,195 --> 01:14:28,820
It's not telling you
to take this move.

1827
01:14:28,820 --> 01:14:32,310
It's just assigning a higher
probability to this move.
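
A sketch of what "a probability distribution over the legal moves" can look like in code. It assumes the network produces one raw score per board point and that a softmax restricted to the legal moves turns those scores into probabilities; the scores, the legal-move list, and the function names are hypothetical.

```python
import math
import random

def move_distribution(scores, legal):
    """Softmax over legal moves only; illegal moves are left out entirely."""
    top = max(scores[i] for i in legal)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - top) for i in legal}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def sample_move(probs, rng):
    """Draw a move at random, weighted by its probability (not an argmax)."""
    moves = list(probs)
    return rng.choices(moves, weights=[probs[m] for m in moves], k=1)[0]

scores = [2.0, 0.5, 1.0, -1.0]  # hypothetical raw outputs for four board points
legal = [0, 2, 3]               # point 1 is assumed to be occupied
probs = move_distribution(scores, legal)

rng = random.Random(0)
picks = [sample_move(probs, rng) for _ in range(1000)]
print(probs)
```

Sampling instead of always taking the top move is what keeps the policy non-deterministic.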

1828
01:14:32,310 --> 01:14:34,990
And this network was generated
by doing supervised learning

1829
01:14:34,990 --> 01:14:39,390
on 30 million positions
from human expert games.

1830
01:14:39,390 --> 01:14:42,640
Apparently, there's a giant
database of Go expert games.

1831
01:14:42,640 --> 01:14:44,260
So that came in handy.

1832
01:14:44,260 --> 01:14:46,870
And there were two
different networks trained.

1833
01:14:46,870 --> 01:14:49,810
One of them was a slow policy,
the other was a fast policy.

1834
01:14:49,810 --> 01:14:54,980
The slow was able to predict an
expert move with 57% accuracy,

1835
01:14:54,980 --> 01:14:57,250
which to me was mind blowing.

1836
01:14:57,250 --> 01:15:00,460
Using this neural
network, 57% of the time

1837
01:15:00,460 --> 01:15:04,260
it could pinpoint where the
expert would place his move.

1838
01:15:04,260 --> 01:15:05,780
That took 3,000 microseconds.

1839
01:15:05,780 --> 01:15:08,995
Versus the fast policy, which
suffered a bit in the accuracy,

1840
01:15:08,995 --> 01:15:11,259
but it's 1,500 times faster.

1841
01:15:11,259 --> 01:15:12,675
And we'll see where
they used each

1842
01:15:12,675 --> 01:15:15,580
of these different
policies later on.

1843
01:15:15,580 --> 01:15:20,680
But it could predict the
expert move with 57% accuracy.

1844
01:15:20,680 --> 01:15:22,472
The AlphaGo team was,
that's not our goal.

1845
01:15:22,472 --> 01:15:24,138
We don't want to
predict an expert move.

1846
01:15:24,138 --> 01:15:25,630
We want to predict
a winning move.

1847
01:15:25,630 --> 01:15:28,180
And so to do that, they
took their policy network,

1848
01:15:28,180 --> 01:15:30,170
and then they would use
reinforcement learning.

1849
01:15:30,170 --> 01:15:32,950
That's where you play the
network against iterations

1850
01:15:32,950 --> 01:15:35,830
of itself in order to hone
in a better policy that's

1851
01:15:35,830 --> 01:15:39,560
geared towards winning moves.

1852
01:15:39,560 --> 01:15:42,889
Then they tested this
against Pachi, which uses--

1853
01:15:42,889 --> 01:15:44,555
for the camera, I
have no idea if that's

1854
01:15:44,555 --> 01:15:45,340
how you pronounce Pachi.

1855
01:15:45,340 --> 01:15:46,295
It might be Patchey.

1856
01:15:46,295 --> 01:15:47,320
I'm not sure.

1857
01:15:47,320 --> 01:15:52,330
But there's 100,000 MCTS
simulations at each turn.

1858
01:15:52,330 --> 01:15:55,420
So this is purely MCTS.

1859
01:15:55,420 --> 01:15:59,842
If it were playing just
the AlphaGo policy network,

1860
01:15:59,842 --> 01:16:03,700
the policy network
won 85% of the games.

1861
01:16:03,700 --> 01:16:06,780
So without any sort of tree
search or anything involved,

1862
01:16:06,780 --> 01:16:08,680
it won 85%, which
is pretty great.

1863
01:16:08,680 --> 01:16:11,810
And that suggests that
maybe intuition wins

1864
01:16:11,810 --> 01:16:13,880
over long reflections in Go.

1865
01:16:13,880 --> 01:16:16,535
And interestingly, if you
talk to expert Go players

1866
01:16:16,535 --> 01:16:19,340
and you ask them why they did a
certain move, they'll just say,

1867
01:16:19,340 --> 01:16:22,840
It felt good, or I
had a hunch in this.

1868
01:16:22,840 --> 01:16:26,660
That's indicative there.

1869
01:16:26,660 --> 01:16:28,330
Hopefully, I'm not
going overtime.

1870
01:16:28,330 --> 01:16:29,970
Sorry.

1871
01:16:29,970 --> 01:16:31,505
Those are the two
policy networks.

1872
01:16:31,505 --> 01:16:32,713
There's also a value network.

1873
01:16:32,713 --> 01:16:35,450
What the value network does
is it takes in a board,

1874
01:16:35,450 --> 01:16:40,140
and it'll give you a value,
like how good is this board?

1875
01:16:40,140 --> 01:16:42,260
It'll give you a win
probability number.

1876
01:16:42,260 --> 01:16:45,592
So 77%, it would say,
77% of the time you

1877
01:16:45,592 --> 01:16:47,362
should win from the board.

1878
01:16:47,362 --> 01:16:49,820
That's similar to the evaluation
that comes from Deep Blue.

1879
01:16:49,820 --> 01:16:53,570
But rather than a Go master
coming in and telling you,

1880
01:16:53,570 --> 01:16:55,730
well, if these are
connected in this way,

1881
01:16:55,730 --> 01:16:57,730
and down here we have
this certain thing

1882
01:16:57,730 --> 01:17:00,060
then here's the score
we should expect,

1883
01:17:00,060 --> 01:17:01,579
in chess, they
had chess masters,

1884
01:17:01,579 --> 01:17:03,620
like if the knight is here
and the queen is here,

1885
01:17:03,620 --> 01:17:04,840
all these specific things.

1886
01:17:04,840 --> 01:17:07,649
This was actually learned from
the reinforcement learning that

1887
01:17:07,649 --> 01:17:09,440
was happening when the
policy networks were

1888
01:17:09,440 --> 01:17:10,240
playing each other.

1889
01:17:10,240 --> 01:17:12,890
The value network was
learning about those positions

1890
01:17:12,890 --> 01:17:14,294
during that time.

1891
01:17:14,294 --> 01:17:16,210
And the predictions get
better towards the end

1892
01:17:16,210 --> 01:17:21,590
of the game, which I think
Yo mentioned in his talk.

1893
01:17:21,590 --> 01:17:23,651
So how do you combine
all these into MCTS?

1894
01:17:23,651 --> 01:17:25,359
The slow policy network,
if you remember,

1895
01:17:25,359 --> 01:17:27,830
is slower but should
give us stronger moves.

1896
01:17:27,830 --> 01:17:29,780
It is used to guide our
tree search in order

1897
01:17:29,780 --> 01:17:33,322
to help us decide which
nodes to expand next.
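
One way to picture a policy network guiding expansion: each move's score is its current average value plus a bonus proportional to the network's prior probability, and the bonus fades as the move gets visited. This is a simplified sketch under that assumption; the constant `c`, the tuple layout, and the numbers are illustrative and are not AlphaGo's actual settings.

```python
def selection_score(q, prior, visits, c=1.0):
    """Average value plus a prior-weighted bonus that decays with visits."""
    return q + c * prior / (1 + visits)

def pick(children, c=1.0):
    """children: dict mapping move -> (average value, prior, visits)."""
    return max(children, key=lambda m: selection_score(*children[m], c=c))

# Hypothetical statistics: the network strongly prefers move "a".
children = {"a": (0.50, 0.60, 0), "b": (0.55, 0.05, 0), "c": (0.52, 0.35, 20)}
first = pick(children)

# After many visits to "a" its bonus has faded, so other moves get a turn.
children["a"] = (0.50, 0.60, 50)
later = pick(children)
print(first, later)
```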

1898
01:17:33,322 --> 01:17:35,840
When we expand that
node to get the value,

1899
01:17:35,840 --> 01:17:38,390
the value of the state is
the simulation, like before,

1900
01:17:38,390 --> 01:17:41,000
like normal MCTS,
except it's not

1901
01:17:41,000 --> 01:17:42,560
a completely random simulation.

1902
01:17:42,560 --> 01:17:45,200
We use our fast policy network
to give us a more educated

1903
01:17:45,200 --> 01:17:46,117
simulation here.

1904
01:17:46,117 --> 01:17:47,700
But we're using a
fast one, obviously,

1905
01:17:47,700 --> 01:17:49,990
to save some computation time.

1906
01:17:49,990 --> 01:17:53,810
It's giving us probably a more
indicative random simulation

1907
01:17:53,810 --> 01:17:55,920
of what's going to
actually happen.

1908
01:17:55,920 --> 01:17:58,830
And then we also combine that
with our value network output.

1909
01:17:58,830 --> 01:18:01,070
So we run our value network
on this node, as well.

1910
01:18:01,070 --> 01:18:02,695
And we add that to
our simulation value

1911
01:18:02,695 --> 01:18:03,800
and we propagate it.

1912
01:18:03,800 --> 01:18:06,440
Interestingly, the
AlphaGo team tested out

1913
01:18:06,440 --> 01:18:09,140
just using the fast
policy simulation value

1914
01:18:09,140 --> 01:18:11,180
and scrapping the value network.

1915
01:18:11,180 --> 01:18:13,160
And they also just
used the value network

1916
01:18:13,160 --> 01:18:14,720
and scrapped the
simulation value.

1917
01:18:14,720 --> 01:18:17,030
And those both performed
worse than if it had these.

1918
01:18:17,030 --> 01:18:19,625
And another added
interesting point here,

1919
01:18:19,625 --> 01:18:22,630
is that these two
factors in our value

1920
01:18:22,630 --> 01:18:24,130
have about the same weight.

1921
01:18:24,130 --> 01:18:27,970
They were both about
equally important.
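
A sketch of combining the two evaluations with equal weight and propagating the result up the tree. The mixing parameter `lam`, the dictionary-based nodes, and the example numbers are assumptions made for illustration. The sketch also ignores flipping the value's sign between the two players.

```python
def leaf_value(value_net_estimate, rollout_outcome, lam=0.5):
    """Weighted mix of the value network output and the rollout result."""
    return (1 - lam) * value_net_estimate + lam * rollout_outcome

def backpropagate(path, value):
    """Add the leaf value to every node on the path from root to leaf."""
    for node in path:
        node["visits"] += 1
        node["total"] += value

# Hypothetical leaf: the value network says 0.77 and the fast rollout ended in a win.
path = [{"visits": 0, "total": 0.0} for _ in range(3)]
v = leaf_value(0.77, 1.0)
backpropagate(path, v)
print(v, path[0])
```

Setting `lam` to 0 or 1 gives the value-network-only and rollout-only variants, which the lecture says both performed worse.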

1922
01:18:27,970 --> 01:18:29,635
I think I'll get
into that later.

1923
01:18:29,635 --> 01:18:30,410
But first--

1924
01:18:30,410 --> 01:18:32,160
AUDIENCE: Can I just
ask a quick question?

1925
01:18:32,160 --> 01:18:33,980
PROFESSOR 3: Yeah.

1926
01:18:33,980 --> 01:18:36,230
AUDIENCE: So when you said
the policy network is used,

1927
01:18:36,230 --> 01:18:38,240
is that used when you're
navigating through the tree

1928
01:18:38,240 --> 01:18:40,350
to get to a leaf,
or is policy network

1929
01:18:40,350 --> 01:18:43,470
being used to do the
simulation once you're

1930
01:18:43,470 --> 01:18:46,340
at the leaf, or both?

1931
01:18:46,340 --> 01:18:49,177
PROFESSOR 3: The slow policy
is done for this part.

1932
01:18:49,177 --> 01:18:51,176
Then the fast policy is
used for the simulation.

1933
01:18:51,176 --> 01:18:54,455
Because the slow policy does
take 1,500 faster than--

1934
01:18:54,455 --> 01:18:58,052
or the slow takes 1,500 times
longer than the fast policy.

1935
01:18:58,052 --> 01:19:00,010
You don't want to use
that in your simulations.

1936
01:19:00,010 --> 01:19:01,927
That would just
take way too long.

1937
01:19:01,927 --> 01:19:03,510
It's basically just
a way of making it

1938
01:19:03,510 --> 01:19:05,260
so our simulation isn't
completely random.

1939
01:19:05,260 --> 01:19:06,755
It has some educated moves.

1940
01:19:09,490 --> 01:19:11,392
Why use policy and
value network synergy?

1941
01:19:11,392 --> 01:19:13,100
Why can't we just use
the policy network?

1942
01:19:13,100 --> 01:19:15,070
Why can't we just use
the value network?

1943
01:19:15,070 --> 01:19:16,920
If we have the
value network alone,

1944
01:19:16,920 --> 01:19:18,572
we'll actually--
here's a side point.

1945
01:19:18,572 --> 01:19:20,030
Remember, the value
network learned

1946
01:19:20,030 --> 01:19:21,320
from the policy network.

1947
01:19:21,320 --> 01:19:23,140
And then also, later
on, the policy network

1948
01:19:23,140 --> 01:19:26,070
is improved by our values.

1949
01:19:26,070 --> 01:19:27,400
They work hand-in-hand.

1950
01:19:27,400 --> 01:19:29,114
But if we had the
value network alone,

1951
01:19:29,114 --> 01:19:30,780
when we're deciding
on the next move,

1952
01:19:30,780 --> 01:19:33,113
we're going to have to evaluate
every single move, which

1953
01:19:33,113 --> 01:19:34,510
would take forever.

1954
01:19:34,510 --> 01:19:36,010
And so, what the
policy network does

1955
01:19:36,010 --> 01:19:41,040
is predict the best move
with a probability distribution.

1956
01:19:41,040 --> 01:19:43,110
And it narrows our search space.

1957
01:19:43,110 --> 01:19:45,010
And then, if we had the
policy network alone,

1958
01:19:45,010 --> 01:19:48,366
we'd be unable to compare nodes
in different parts of our tree.

1959
01:19:48,366 --> 01:19:50,450
The policy network
is able to tell us

1960
01:19:50,450 --> 01:19:52,370
a distribution over
which move we should

1961
01:19:52,370 --> 01:19:54,230
take from a certain node.

1962
01:19:54,230 --> 01:19:57,350
But then, if I ask it if
I'm in a better position

1963
01:19:57,350 --> 01:19:59,724
here than in some other
place, it won't know.

1964
01:19:59,724 --> 01:20:01,390
That's where the value
network comes in.

1965
01:20:01,390 --> 01:20:06,140
It will give us an estimated
number of the value assigned

1966
01:20:06,140 --> 01:20:08,570
and open an evaluation
of that node.

1967
01:20:08,570 --> 01:20:10,760
And then these
values are later used

1968
01:20:10,760 --> 01:20:12,860
to direct our tree
searches based

1969
01:20:12,860 --> 01:20:16,360
on updating the policy
once it realizes,

1970
01:20:16,360 --> 01:20:19,470
oh, I thought this would be
a good path but the value is

1971
01:20:19,470 --> 01:20:23,240
this, so update all that.

1972
01:20:23,240 --> 01:20:25,440
Then why do we combine
neural networks with MCTS?

1973
01:20:25,440 --> 01:20:27,500
Remember, the
policy network alone

1974
01:20:27,500 --> 01:20:31,000
played against Pachi,
which was purely MCTS,

1975
01:20:31,000 --> 01:20:33,000
and it did pretty well.

1976
01:20:33,000 --> 01:20:37,220
So how does MCTS improve
our policy network?

1977
01:20:37,220 --> 01:20:42,055
Remember, MCTS did win
15% of those games.

1978
01:20:42,055 --> 01:20:44,900
So already, that makes you
think there's something there

1979
01:20:44,900 --> 01:20:47,145
that maybe the policy
network is missing.

1980
01:20:47,145 --> 01:20:49,220
Also, the policy network
is just a prediction.

1981
01:20:49,220 --> 01:20:51,410
So by using this
tree structure, we're

1982
01:20:51,410 --> 01:20:57,730
able to use these Monte Carlo
rollouts to adjust our policy

1983
01:20:57,730 --> 01:21:01,520
to move towards nodes that are
actually evaluated to be good.

1984
01:21:01,520 --> 01:21:03,960
And then, how do neural
networks improve MCTS?

1985
01:21:03,960 --> 01:21:06,280
The point should
probably be clear by now.

1986
01:21:06,280 --> 01:21:09,930
We're able to more intelligently
lead our tree exploration.

1987
01:21:09,930 --> 01:21:13,420
Our simulations are more
reflective of actual games.

1988
01:21:13,420 --> 01:21:17,530
And the value network
and our simulation value

1989
01:21:17,530 --> 01:21:21,400
are complementary, which
I've mentioned before.

1990
01:21:21,400 --> 01:21:25,150
And just to highlight that,
basically, the value network

1991
01:21:25,150 --> 01:21:27,910
is going to give us a
value that is reflective

1992
01:21:27,910 --> 01:21:30,680
as if we've played the
slow policy the whole time.

1993
01:21:30,680 --> 01:21:35,170
And the simulation is if
we used a faster policy.

1994
01:21:35,170 --> 01:21:38,070
So they are complementary.

1995
01:21:38,070 --> 01:21:39,710
And I know I'm over time.

1996
01:21:39,710 --> 01:21:44,390
So I just wanted to skim
through the stats real quick.

1997
01:21:44,390 --> 01:21:47,395
Distributed AlphaGo
won 77% of the games

1998
01:21:47,395 --> 01:21:49,039
against regular AlphaGo.

1999
01:21:49,039 --> 01:21:51,080
So it's the only thing
that beat regular AlphaGo.

2000
01:21:51,080 --> 01:21:54,250
And then distributed AlphaGo
won 100% of the games

2001
01:21:54,250 --> 01:21:55,130
against all these.

2002
01:21:55,130 --> 01:21:57,720
In a rematch against Pachi,
now that we've added MCTS

2003
01:21:57,720 --> 01:21:59,886
to our policy network and
we have our value network,

2004
01:21:59,886 --> 01:22:03,170
we slaughtered Pachi 100%.

2005
01:22:03,170 --> 01:22:05,460
Then we decided to see how
we fare against humans.

2006
01:22:05,460 --> 01:22:08,540
And by we, I mean not
me, I mean Google.

2007
01:22:08,540 --> 01:22:11,190
And they won 4 to 1.

2008
01:22:11,190 --> 01:22:14,680
And Lee Sedol's rating was 3,520.

2009
01:22:14,680 --> 01:22:17,880
Now AlphaGo's rating is
estimated to be about 3,586.

2010
01:22:17,880 --> 01:22:19,960
So you're like, whoo,
we beat the best dude.

2011
01:22:19,960 --> 01:22:22,180
Except we didn't because
there's another dude

2012
01:22:22,180 --> 01:22:31,320
who has an even higher
score, apparently, 3,621.

2013
01:22:31,320 --> 01:22:32,970
This should be the last part.

2014
01:22:32,970 --> 01:22:34,750
Here's this timeline.

2015
01:22:34,750 --> 01:22:39,410
Basically, tic-tac-toe,
checkers were conquered in '50.

2016
01:22:39,410 --> 01:22:42,155
About 40 years later, we
conquered checkers, chess.

2017
01:22:42,155 --> 01:22:45,800
Then we scroll down
to 2015, is when

2018
01:22:45,800 --> 01:22:48,065
AlphaGo was able to
beat Fan Hui, who

2019
01:22:48,065 --> 01:22:51,340
was a two-dan player, which
is considered lower down

2020
01:22:51,340 --> 01:22:54,425
in the tier of professional Go.

2021
01:22:54,425 --> 01:22:56,470
But then, Lee Sedol
was a nine-dan player.

2022
01:22:56,470 --> 01:23:00,187
And it was able to beat
him literally last month.

2023
01:23:00,187 --> 01:23:01,520
PROFESSOR WILLIAMS: So good job.

2024
01:23:01,520 --> 01:23:02,520
PROFESSOR 3: We're done.

2025
01:23:02,520 --> 01:23:03,800
[APPLAUSE]