1 00:00:00,530 --> 00:00:02,960 The following content is provided under a Creative 2 00:00:02,960 --> 00:00:04,370 Commons license. 3 00:00:04,370 --> 00:00:07,410 Your support will help MIT OpenCourseWare continue to 4 00:00:07,410 --> 00:00:11,060 offer high quality educational resources for free. 5 00:00:11,060 --> 00:00:13,960 To make a donation or view additional materials from 6 00:00:13,960 --> 00:00:17,890 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,890 --> 00:00:19,140 ocw.mit.edu. 8 00:00:24,220 --> 00:00:24,840 PROFESSOR: OK. 9 00:00:24,840 --> 00:00:30,230 Today we're going to finish up with Markov chains. 10 00:00:30,230 --> 00:00:34,570 And the last topic will be dynamic programming. 11 00:00:34,570 --> 00:00:39,900 I'm not going to say an awful lot about dynamic programming. 12 00:00:39,900 --> 00:00:43,530 It's a topic that was enormously important in 13 00:00:43,530 --> 00:00:49,600 research for probably 20 years from 1960 until 14 00:00:49,600 --> 00:00:53,540 about 1980, or 1990. 15 00:00:53,540 --> 00:01:00,300 And it seemed as if half the Ph.D. theses done in the 16 00:01:00,300 --> 00:01:03,920 control area and in operations research 17 00:01:03,920 --> 00:01:07,630 were in this area. 18 00:01:07,630 --> 00:01:11,950 Suddenly, everything seemed to be done, or could be done. 19 00:01:11,950 --> 00:01:15,310 And strangely enough, not many people seem to 20 00:01:15,310 --> 00:01:16,760 know about it anymore. 21 00:01:16,760 --> 00:01:20,760 It's an enormously useful algorithm for solving an awful 22 00:01:20,760 --> 00:01:23,000 lot of different problems. 23 00:01:23,000 --> 00:01:25,420 It's quite a simple algorithm. 24 00:01:25,420 --> 00:01:28,780 You don't need the full power of Markov chains in order to 25 00:01:28,780 --> 00:01:30,470 understand it. 26 00:01:30,470 --> 00:01:34,250 So I do want to at least talk about it a little bit. 
27 00:01:34,250 --> 00:01:38,070 And we will use what we've done so far with Markov chains 28 00:01:38,070 --> 00:01:40,940 in order to understand it. 29 00:01:40,940 --> 00:01:44,200 I want to start out today by reviewing a little bit of what 30 00:01:44,200 --> 00:01:49,040 we did last time about eigenvalues and eigenvectors. 31 00:01:49,040 --> 00:01:56,320 This was a somewhat awkward topic to talk about, because 32 00:01:56,320 --> 00:01:59,970 you people have very different backgrounds in linear algebra. 33 00:01:59,970 --> 00:02:03,450 Some of you have a very strong background, some of you have 34 00:02:03,450 --> 00:02:05,240 almost no background. 35 00:02:05,240 --> 00:02:10,509 So it was a lot of material for those of you who know very 36 00:02:10,509 --> 00:02:14,190 little about linear algebra. 37 00:02:14,190 --> 00:02:16,620 And probably somewhat boring for those of you who 38 00:02:16,620 --> 00:02:18,690 use it all the time. 39 00:02:18,690 --> 00:02:22,670 At any rate, if you don't know anything about it, linear 40 00:02:22,670 --> 00:02:28,820 algebra is a topic that you ought to understand for almost 41 00:02:28,820 --> 00:02:30,270 anything you do. 42 00:02:30,270 --> 00:02:35,230 If you've gotten to this point without having to study it, 43 00:02:35,230 --> 00:02:37,460 it's very strange. 44 00:02:37,460 --> 00:02:41,720 So you should probably take some extra time out, not 45 00:02:41,720 --> 00:02:43,900 because you need it so much for this course. 46 00:02:43,900 --> 00:02:46,670 We won't use it enormously in many of the 47 00:02:46,670 --> 00:02:48,500 things we do later. 48 00:02:48,500 --> 00:02:51,930 But you will use it so many times in the future that you 49 00:02:51,930 --> 00:02:56,870 ought to just sit down, not to learn abstract linear algebra, 50 00:02:56,870 --> 00:03:00,150 which is very useful also, but just to learn how to use the 51 00:03:00,150 --> 00:03:03,280 topic of solving linear equations. 
52 00:03:03,280 --> 00:03:06,450 Being able to express them in terms of matrices. 53 00:03:06,450 --> 00:03:09,310 Being able to use the eigenvalues and eigenvectors, 54 00:03:09,310 --> 00:03:12,220 and matrices as a way of understanding these things. 55 00:03:12,220 --> 00:03:16,440 So I want to say a little more about that today, which is why 56 00:03:16,440 --> 00:03:19,720 I've called this a review plus of eigenvalues and 57 00:03:19,720 --> 00:03:21,020 eigenvectors. 58 00:03:21,020 --> 00:03:25,930 It's a review of the topics we did last time, but it's 59 00:03:25,930 --> 00:03:28,250 looking at it in a somewhat different way. 60 00:03:28,250 --> 00:03:32,150 So let's proceed with that. 61 00:03:32,150 --> 00:03:36,810 We said that the determinant of an M by M matrix is given 62 00:03:36,810 --> 00:03:38,530 by this strange formula. 63 00:03:38,530 --> 00:03:44,340 The determinant of a is the sum over all permutations of 64 00:03:44,340 --> 00:03:51,260 the integers 1 to M of the product from i equals 1 to M 65 00:03:51,260 --> 00:03:56,080 of the matrix element a sub i mu of i. 66 00:03:56,080 --> 00:04:01,670 Mu of i is the permutation of the number i. i is between one 67 00:04:01,670 --> 00:04:05,510 and M, and mu of i is a permutation of that. 68 00:04:05,510 --> 00:04:17,529 Now if you look at the matrix, which has the form, which is 69 00:04:17,529 --> 00:04:19,600 block upper diagonal. 70 00:04:19,600 --> 00:04:22,990 In other words, there's a matrix here, a square matrix a 71 00:04:22,990 --> 00:04:26,390 sub t, which is a transient matrix. 72 00:04:26,390 --> 00:04:31,610 There's a recurrent matrix here, and there's some way of 73 00:04:31,610 --> 00:04:33,900 getting from the transient states to 74 00:04:33,900 --> 00:04:36,730 the recurring states. 75 00:04:36,730 --> 00:04:41,630 And this is the general form that a unit chain has to have. 
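The permutation formula for the determinant, and the block upper triangular form just described, can both be checked numerically. Here is a sketch in Python; the matrices are invented examples (the lecture names no specific numbers), and `det_by_permutations` implements the sum over permutations directly.

```python
import numpy as np
from itertools import permutations

# det(A) = sum over permutations mu of sign(mu) * prod_i A[i, mu(i)]
def det_by_permutations(A):
    M = A.shape[0]
    total = 0.0
    for mu in permutations(range(M)):
        # sign of mu: parity of the number of inversions
        inversions = sum(mu[i] > mu[j]
                         for i in range(M) for j in range(i + 1, M))
        sign = -1.0 if inversions % 2 else 1.0
        total += sign * np.prod([A[i, mu[i]] for i in range(M)])
    return total

# A hypothetical block upper triangular matrix like the one on the slide:
# a transient block a_t, a coupling block a_tr, a recurrent block a_r.
A_t = np.array([[0.5, 0.2], [0.1, 0.4]])   # transient block
A_tr = np.array([[0.2, 0.1], [0.3, 0.2]])  # transient -> recurrent
A_r = np.array([[0.7, 0.3], [0.4, 0.6]])   # recurrent block
A = np.block([[A_t, A_tr], [np.zeros((2, 2)), A_r]])

# The permutation formula agrees with the library determinant, and the
# determinant factors into the product of the diagonal-block determinants.
print(np.isclose(det_by_permutations(A), np.linalg.det(A)))
print(np.isclose(np.linalg.det(A), np.linalg.det(A_t) * np.linalg.det(A_r)))
```

Only permutations that keep the transient rows on transient columns survive, which is exactly why the second check holds.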
76 00:04:41,630 --> 00:04:44,970 There are a bunch of transient states, there are a bunch of 77 00:04:44,970 --> 00:04:47,230 recurring states. 78 00:04:47,230 --> 00:04:52,630 And the interesting thing here is that the determinant of a 79 00:04:52,630 --> 00:04:57,620 is exactly the determinant of a sub t times the 80 00:04:57,620 --> 00:04:59,410 determinant of a sub r. 81 00:04:59,410 --> 00:05:03,210 I'm calling this a instead of the transition matrix p 82 00:05:03,210 --> 00:05:08,840 because I want to replace a by p minus lambda i, so I can 83 00:05:08,840 --> 00:05:11,820 talk about the eigenvalues of p. 84 00:05:11,820 --> 00:05:15,690 So when I do that replacement here, if I know that the 85 00:05:15,690 --> 00:05:20,140 determinant of a is this product of determinants, then 86 00:05:20,140 --> 00:05:24,130 the determinant of p minus lambda i is the determinant of 87 00:05:24,130 --> 00:05:32,160 pt minus lambda it, where i sub t is just a crazy way of saying 88 00:05:32,160 --> 00:05:35,120 a diagonal matrix. 89 00:05:35,120 --> 00:05:40,070 A diagonal t by t matrix, because this is a t by t 90 00:05:40,070 --> 00:05:41,740 matrix, also. 91 00:05:41,740 --> 00:05:48,580 i sub r is an r by r matrix, where this is a square r by r 92 00:05:48,580 --> 00:05:50,260 matrix also. 93 00:05:50,260 --> 00:05:53,970 Now, why is it that this determinant is equal to this 94 00:05:53,970 --> 00:05:56,670 product of determinants here? 95 00:05:56,670 --> 00:06:02,010 Well, before explaining why this is true, why do you care? 96 00:06:02,010 --> 00:06:08,180 Well, because we know that if we have a recurring matrix 97 00:06:08,180 --> 00:06:11,630 here, we know that it has-- 98 00:06:11,630 --> 00:06:13,790 I mean, we know a great deal about it. 99 00:06:13,790 --> 00:06:21,150 We know that any square matrix, r by r matrix has r 100 00:06:21,150 --> 00:06:22,750 different eigenvalues. 
101 00:06:22,750 --> 00:06:26,330 Some of them might be repeated, but they're always r 102 00:06:26,330 --> 00:06:27,480 eigenvalues. 103 00:06:27,480 --> 00:06:31,420 This matrix here has t eigenvalues. 104 00:06:31,420 --> 00:06:32,520 OK. 105 00:06:32,520 --> 00:06:37,730 This matrix here, we know has r plus t eigenvalues. 106 00:06:37,730 --> 00:06:42,060 You look at this formula here and you say aha, I can take 107 00:06:42,060 --> 00:06:46,670 all the eigenvalues here, combine them with all the eigenvalues 108 00:06:46,670 --> 00:06:50,280 here, and I have every one of the eigenvalues here. 109 00:06:50,280 --> 00:06:54,780 In other words, if I want to find all of the eigenvalues of 110 00:06:54,780 --> 00:06:59,620 p, all I have to do is find the eigenvalues of p sub t, 111 00:06:59,620 --> 00:07:04,710 combine them with the eigenvalues of p sub r, and I'm all done. 112 00:07:04,710 --> 00:07:08,640 So that really has simplified things a good deal. 113 00:07:08,640 --> 00:07:14,270 And it also really says explicitly that if you 114 00:07:14,270 --> 00:07:20,060 understand how to deal with recurrent Markov chains, you 115 00:07:20,060 --> 00:07:22,620 really know everything. 116 00:07:22,620 --> 00:07:25,840 Well, you also have to know how to deal with a transient 117 00:07:25,840 --> 00:07:29,880 chain, but the main part of it is dealing with this chain. 118 00:07:29,880 --> 00:07:34,870 This matrix p sub r has little r different eigenvalues, 119 00:07:34,870 --> 00:07:41,860 and all of those are eigenvalues 120 00:07:41,860 --> 00:07:42,710 of p. 121 00:07:42,710 --> 00:07:46,860 They're given by the roots of this determinant here. 122 00:07:46,860 --> 00:07:49,530 And all of those are roots here. 123 00:07:49,530 --> 00:07:51,580 OK, so why is this true? 124 00:07:51,580 --> 00:07:57,990 Well, the reason for it is that this product up here, 125 00:07:57,990 --> 00:07:59,200 look at this. 
126 00:07:59,200 --> 00:08:02,490 We're taking the sum over all permutations. 127 00:08:02,490 --> 00:08:05,315 But which one of those permutations can be non-zero? 128 00:08:12,940 --> 00:08:18,740 If I start out by saying that a sub t is t by t, then I know 129 00:08:18,740 --> 00:08:21,440 that this might be anything. 130 00:08:21,440 --> 00:08:24,050 These have to be zeroes here. 131 00:08:24,050 --> 00:08:30,450 If I choose some permutation down here, of some i, which 132 00:08:30,450 --> 00:08:31,530 is greater than t. 133 00:08:31,530 --> 00:08:35,030 In other words, if I choose mu of i to be some 134 00:08:35,030 --> 00:08:36,130 element over here. 135 00:08:36,130 --> 00:08:42,309 If I choose mu of i to be less than or equal to t, and i to 136 00:08:42,309 --> 00:08:45,500 be greater than t, what happens? 137 00:08:45,500 --> 00:08:47,790 I get a term which is equal to zero. 138 00:08:47,790 --> 00:08:51,210 That term in this product is zero. 139 00:08:51,210 --> 00:08:55,670 So none of those products can be non-zero. 140 00:08:55,670 --> 00:09:00,830 So the only way I can get non zeros here is when I'm dealing 141 00:09:00,830 --> 00:09:03,730 with an i which is less than or equal to t. 142 00:09:03,730 --> 00:09:06,100 Namely an i here. 143 00:09:06,100 --> 00:09:09,440 I have to choose a mu of i, a column which is 144 00:09:09,440 --> 00:09:10,870 less than or equal to t, also. 145 00:09:10,870 --> 00:09:17,540 If I'm dealing with an i which is greater than t, namely an 146 00:09:17,540 --> 00:09:23,410 i up here, then, well, it looks like I can choose 147 00:09:23,410 --> 00:09:24,950 anything there. 148 00:09:24,950 --> 00:09:25,630 But look. 149 00:09:25,630 --> 00:09:31,180 I've already used up all of these columns here 150 00:09:31,180 --> 00:09:33,470 by the non-zero terms here. 151 00:09:33,470 --> 00:09:37,360 So I can't do anything but use a mu of i 152 00:09:37,360 --> 00:09:40,080 greater than t up here. 
153 00:09:40,080 --> 00:09:44,703 So when I look at the permutations that are non 154 00:09:44,703 --> 00:09:49,010 zero, the only permutations that are non zero are those 155 00:09:49,010 --> 00:09:55,610 where mu of i is less than or equal to t if i 156 00:09:55,610 --> 00:10:01,960 is less than or equal to t. 157 00:10:01,960 --> 00:10:06,100 And mu of i is greater than t if i is greater than t. 158 00:10:06,100 --> 00:10:11,580 Now, how does that show that this is equal here? 159 00:10:11,580 --> 00:10:16,480 Well, let's look at that a little bit. 160 00:10:16,480 --> 00:10:19,740 I didn't even try to do it on the slide because the notation 161 00:10:19,740 --> 00:10:20,970 is kind of horrifying. 162 00:10:20,970 --> 00:10:24,850 But let's try to write this the following way. 163 00:10:24,850 --> 00:10:36,910 Determinant of a is equal to the sum, and now I'll write it 164 00:10:36,910 --> 00:10:48,040 as a sum over mu of 1 up to t. 165 00:10:48,040 --> 00:10:59,690 And the sum over mu of t plus 1 up to, well, t 166 00:10:59,690 --> 00:11:02,460 plus r, let's say. 167 00:11:02,460 --> 00:11:06,210 OK, so here I have all of the permutations of the 168 00:11:06,210 --> 00:11:08,870 numbers 1 to t. 169 00:11:08,870 --> 00:11:11,350 And here I have all the permutations of the 170 00:11:11,350 --> 00:11:14,010 numbers t plus 1 up. 171 00:11:14,010 --> 00:11:16,760 And for all of those, I'm going to 172 00:11:16,760 --> 00:11:18,190 ignore this plus minus. 173 00:11:18,190 --> 00:11:21,420 You can sort that out for yourselves. 174 00:11:21,420 --> 00:11:27,620 And then I have a product from i equals 1 to t 175 00:11:27,620 --> 00:11:37,950 of a sub i, mu of i. 176 00:11:37,950 --> 00:11:39,200 And then a product of a sub i, mu of i 177 00:11:42,410 --> 00:11:53,000 for i equals 178 00:11:53,000 --> 00:12:07,130 t plus 1 up to t plus r. 
179 00:12:07,130 --> 00:12:09,300 OK? 180 00:12:09,300 --> 00:12:14,740 So I'm separating this product here into a product first of 181 00:12:14,740 --> 00:12:19,070 the terms i less than or equal to t, and then for the terms i 182 00:12:19,070 --> 00:12:20,180 greater than t. 183 00:12:20,180 --> 00:12:24,620 For every permutation I choose using the i's that are less 184 00:12:24,620 --> 00:12:29,090 than or equal to t, I can choose any of the permutations 185 00:12:29,090 --> 00:12:33,520 using mu of i greater than t that I choose to use. 186 00:12:33,520 --> 00:12:35,570 So this breaks up in this way. 187 00:12:35,570 --> 00:12:37,960 I have this sum, I have this sum. 188 00:12:37,960 --> 00:12:43,120 I have these two products, so I can break this up as a sum 189 00:12:43,120 --> 00:12:55,270 over mu of 1 to t of plus minus the product from i equals 1 190 00:12:55,270 --> 00:13:08,752 to t of a sub i, mu of i, times the sum over mu of t plus 1 up to 191 00:13:08,752 --> 00:13:15,072 t plus r of the product of 192 00:13:20,160 --> 00:13:22,380 a sub i, mu of i. 193 00:13:22,380 --> 00:13:23,300 OK. 194 00:13:23,300 --> 00:13:26,030 So I've separated that into two different terms. 195 00:13:26,030 --> 00:13:27,000 STUDENT: T equals [INAUDIBLE]. 196 00:13:27,000 --> 00:13:27,570 PROFESSOR: What? 197 00:13:27,570 --> 00:13:30,680 STUDENT: T plus r equals big m? 198 00:13:30,680 --> 00:13:33,230 PROFESSOR: T plus r is big m, yes. 199 00:13:33,230 --> 00:13:40,060 Because I have t terms here, and I have r terms here. 200 00:13:40,060 --> 00:13:44,710 OK, so the interesting thing here is having this non-zero 201 00:13:44,710 --> 00:13:48,400 term here doesn't make any difference here. 202 00:13:48,400 --> 00:13:52,430 I mean, this is more straightforward if you have a 203 00:13:52,430 --> 00:13:54,020 block diagonal matrix. 
204 00:13:54,020 --> 00:13:58,330 It's clear that the eigenvalues of a block 205 00:13:58,330 --> 00:14:03,700 diagonal matrix are going to be the eigenvalues of one block plus 206 00:14:03,700 --> 00:14:05,560 the eigenvalues of the other. 207 00:14:05,560 --> 00:14:09,980 Here we have the eigenvalues of this, and the 208 00:14:09,980 --> 00:14:11,450 eigenvalues of this. 209 00:14:11,450 --> 00:14:14,910 And what's surprising is that as far as the eigenvalues are 210 00:14:14,910 --> 00:14:19,950 concerned, this has nothing whatsoever to do with it. 211 00:14:19,950 --> 00:14:20,690 OK. 212 00:14:20,690 --> 00:14:24,480 The only thing that this has to do with it is it says 213 00:14:24,480 --> 00:14:28,780 something about the sums of this matrix here, because the 214 00:14:28,780 --> 00:14:31,500 sums of these rows are now less than 1. 215 00:14:31,500 --> 00:14:34,660 They all have to be, some of them, at least, have to be 216 00:14:34,660 --> 00:14:36,760 less than or equal to 1. 217 00:14:36,760 --> 00:14:40,090 Because you do have this way of getting from the transient 218 00:14:40,090 --> 00:14:43,470 elements to the non transient elements. 219 00:14:43,470 --> 00:14:48,060 But it's very surprising that these elements, which are 220 00:14:48,060 --> 00:14:52,100 critically important, because those are the things that get 221 00:14:52,100 --> 00:14:55,800 you from the transient states to the recurrent states, have 222 00:14:55,800 --> 00:14:59,540 nothing to do with the eigenvalues whatsoever. 223 00:14:59,540 --> 00:15:00,105 I don't know why. 224 00:15:00,105 --> 00:15:04,310 I can't give you any insights about that, but 225 00:15:04,310 --> 00:15:06,810 that's the way it is. 
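This surprising claim, that the transient-to-recurrent coupling block has no effect on the eigenvalues, can be checked numerically. Here is a sketch with made-up blocks (none of these numbers are from the lecture): hold p sub t and p sub r fixed, swap in two different coupling blocks, and compare the eigenvalue lists.

```python
import numpy as np

# Fixed diagonal blocks: a transient block P_t and a recurrent block P_r.
P_t = np.array([[0.5, 0.2],
                [0.1, 0.4]])
P_r = np.array([[0.7, 0.3],
                [0.4, 0.6]])

def full_chain(coupling):
    """Assemble the block upper triangular chain with a given coupling block."""
    return np.block([[P_t, coupling],
                     [np.zeros((2, 2)), P_r]])

# Two different transient -> recurrent coupling blocks (each keeps the
# row sums equal to 1, so both are valid transition matrices).
eigs1 = np.sort_complex(np.linalg.eigvals(full_chain(np.array([[0.2, 0.1],
                                                               [0.3, 0.2]]))))
eigs2 = np.sort_complex(np.linalg.eigvals(full_chain(np.array([[0.0, 0.3],
                                                               [0.25, 0.25]]))))
print(np.allclose(eigs1, eigs2))  # same eigenvalues either way
```

The determinant factorization explains why: the characteristic polynomial only ever sees the two diagonal blocks.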
226 00:15:06,810 --> 00:15:12,030 That's an interesting thing, because if you take this 227 00:15:12,030 --> 00:15:19,930 transition matrix, and you keep a sub t and a sub r fixed, and 228 00:15:19,930 --> 00:15:23,250 you play any kind of funny game you want to with those 229 00:15:23,250 --> 00:15:28,780 terms going from the transient states to the non transient 230 00:15:28,780 --> 00:15:33,370 states, it won't change any eigenvalues. 231 00:15:33,370 --> 00:15:35,490 Don't know why it doesn't. 232 00:15:35,490 --> 00:15:39,400 OK, so where do we go with that? 233 00:15:39,400 --> 00:15:45,440 Well, that's what it says. 234 00:15:45,440 --> 00:15:50,580 The eigenvalues of p are the t eigenvalues of p sub t and the r 235 00:15:50,580 --> 00:15:52,200 eigenvalues of p sub r. 236 00:15:52,200 --> 00:15:56,180 It also tells you something about simple eigenvalues, and 237 00:15:56,180 --> 00:15:59,800 these crazy eigenvalues, which don't have enough eigenvectors 238 00:15:59,800 --> 00:16:01,230 to go along with them. 239 00:16:01,230 --> 00:16:06,420 Because it tells you that if p sub r has all of its 240 00:16:06,420 --> 00:16:11,880 eigenvectors, and p sub t has all of its eigenvectors, 241 00:16:11,880 --> 00:16:14,550 then you don't have any of this crazy Jordan form 242 00:16:14,550 --> 00:16:16,520 thing, or anything. 243 00:16:16,520 --> 00:16:29,670 OK. If pi is a left eigenvector of this recurrent matrix, then 244 00:16:29,670 --> 00:16:35,550 if you look at the vector, starting with zeros, and then I 245 00:16:35,550 --> 00:16:42,390 guess I should really say, well, if pi sub 1 up to pi sub 246 00:16:42,390 --> 00:16:47,910 r is a left eigenvector of this r by r matrix, then if I start 247 00:16:47,910 --> 00:16:52,620 out with t zeroes, and then put in pi 1 to pi r, this 248 00:16:52,620 --> 00:16:57,310 vector here has to be a left eigenvector of all of p. 249 00:16:57,310 --> 00:16:58,310 Why is that? 
250 00:16:58,310 --> 00:17:01,610 Well, if I look at a vector, which starts out with zeroes, 251 00:17:01,610 --> 00:17:06,900 and then has this eigenvector pi, and I multiply that vector 252 00:17:06,900 --> 00:17:10,210 by this matrix here, I'm taking these terms, 253 00:17:10,210 --> 00:17:16,260 multiplying them by the columns of this matrix, these 254 00:17:16,260 --> 00:17:22,310 zeros knock out all of these elements here. 255 00:17:22,310 --> 00:17:25,470 These zeroes knock out all of these elements. 256 00:17:25,470 --> 00:17:28,410 So I start out with zeroes everywhere here. 257 00:17:28,410 --> 00:17:30,480 That's what this says. 258 00:17:30,480 --> 00:17:34,660 And then when I'm dealing with this part of the matrix, the 259 00:17:34,660 --> 00:17:39,750 zeros knock out all of this, and I just have pi multiplying 260 00:17:39,750 --> 00:17:40,820 p sub r. 261 00:17:40,820 --> 00:17:45,220 So if I have an eigenvalue lambda, it says I have the 262 00:17:45,220 --> 00:17:50,170 eigenvalue lambda times the vector zero followed by pi. 263 00:17:50,170 --> 00:17:54,760 It says that if I have an eigenvector, a left 264 00:17:54,760 --> 00:18:01,010 eigenvector of this recurrent matrix, then that turns into, 265 00:18:01,010 --> 00:18:05,670 if you put some zeroes up in front of it, it turns into an 266 00:18:05,670 --> 00:18:07,790 eigenvector of the whole matrix. 267 00:18:07,790 --> 00:18:11,580 If we look at the eigenvalue 1, which is the most important 268 00:18:11,580 --> 00:18:14,350 thing, this is the thing that gives you the steady state 269 00:18:14,350 --> 00:18:16,930 vector, this is sort of obvious. 270 00:18:16,930 --> 00:18:19,630 Because the steady state vector is where you go 271 00:18:19,630 --> 00:18:23,960 eventually, and eventually where you go is you have to be 272 00:18:23,960 --> 00:18:27,290 in one of these recurrent states, eventually. 
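This padding argument is easy to check numerically. A sketch with invented matrices: compute the steady-state vector of p sub r alone, prepend zeroes for the transient states, and verify the result is a left eigenvector of the whole chain with eigenvalue 1.

```python
import numpy as np

# Invented example blocks (not from the lecture).
P_t = np.array([[0.5, 0.2],
                [0.1, 0.4]])
P_tr = np.array([[0.2, 0.1],
                 [0.3, 0.2]])
P_r = np.array([[0.7, 0.3],
                [0.4, 0.6]])
P = np.block([[P_t, P_tr],
              [np.zeros((2, 2)), P_r]])

# Steady-state vector of P_r alone: the left eigenvector of P_r for
# eigenvalue 1 (a right eigenvector of P_r transposed), normalized.
w, vl = np.linalg.eig(P_r.T)
pi_r = np.real(vl[:, np.argmin(np.abs(w - 1))])
pi_r = pi_r / pi_r.sum()

# Prepend t zeroes; the zeroes knock out the transient columns and the
# coupling block, so this is a left eigenvector of the whole P.
pi = np.concatenate([np.zeros(2), pi_r])
print(np.allclose(pi @ P, pi))
```

The zero entries in pi are exactly what makes the coupling block P_tr irrelevant in the product, matching the slide's argument.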
273 00:18:27,290 --> 00:18:30,610 And the probabilities within the recurrent set of states 274 00:18:30,610 --> 00:18:33,400 are the same as the probabilities if you didn't 275 00:18:33,400 --> 00:18:36,590 have these transient states at all. 276 00:18:36,590 --> 00:18:40,490 So this is all sort of obvious, as far as the steady 277 00:18:40,490 --> 00:18:43,020 state vector pi. 278 00:18:43,020 --> 00:18:47,480 But it's a little less obvious as far as the other vectors. 279 00:18:47,480 --> 00:18:52,300 The left eigenvectors corresponding to p sub t, I don't 280 00:18:52,300 --> 00:18:53,610 understand them at all. 281 00:18:53,610 --> 00:18:59,660 They aren't the same as the left eigenvectors of p sub t 282 00:18:59,660 --> 00:19:04,670 themselves. 283 00:19:08,040 --> 00:19:10,270 I didn't say this right here. 284 00:19:10,270 --> 00:19:15,870 I mean the left eigenvectors of p corresponding to the 285 00:19:15,870 --> 00:19:18,700 eigenvalues of p sub t. 286 00:19:18,700 --> 00:19:22,010 I don't understand how they work, and I don't understand 287 00:19:22,010 --> 00:19:24,350 anything you can derive from them. 288 00:19:24,350 --> 00:19:26,740 They're just kind of crazy things, which are what they 289 00:19:26,740 --> 00:19:27,780 happen to be. 290 00:19:27,780 --> 00:19:29,350 And I don't care about them. 291 00:19:29,350 --> 00:19:32,200 I don't know anything to do with them. 292 00:19:32,200 --> 00:19:35,200 But these other eigenvectors are very useful. 293 00:19:35,200 --> 00:19:38,130 OK. 294 00:19:38,130 --> 00:19:45,040 We can extend this to as many different recurrent sets of 295 00:19:45,040 --> 00:19:47,080 states as you choose. 296 00:19:47,080 --> 00:19:53,100 Here I'm doing it with a Markov chain, which has two 297 00:19:53,100 --> 00:19:56,550 different sets of recurrent states. 
298 00:19:56,550 --> 00:20:00,010 They might be periodic, they might be ergodic, it doesn't 299 00:20:00,010 --> 00:20:01,340 make any difference. 300 00:20:01,340 --> 00:20:07,730 So the matrix p has these transient states up here. 301 00:20:07,730 --> 00:20:11,990 Here we have the transitions where the transient states just go to each 302 00:20:11,990 --> 00:20:16,320 other, the transition probabilities starting in 303 00:20:16,320 --> 00:20:19,140 a transient state and going to a transient state. 304 00:20:19,140 --> 00:20:24,090 Here we have the transitions, which go from transient states 305 00:20:24,090 --> 00:20:26,500 to this first set of recurrent states. 306 00:20:26,500 --> 00:20:30,810 Here we have the transitions, which go from a transient 307 00:20:30,810 --> 00:20:35,480 state to the second set of recurrent states. 308 00:20:35,480 --> 00:20:36,180 OK. 309 00:20:36,180 --> 00:20:39,330 The same way as before, the determinant of this whole 310 00:20:39,330 --> 00:20:44,790 thing here, and this determinant, the roots of that 311 00:20:44,790 --> 00:20:49,300 are in fact the eigenvalues of p, are the product of the 312 00:20:49,300 --> 00:20:54,930 determinant of pt minus lambda it times the product of this, 313 00:20:54,930 --> 00:20:58,030 times this determinant here. 314 00:20:58,030 --> 00:21:02,180 This has little t eigenvalues. 315 00:21:02,180 --> 00:21:05,220 This has little r eigenvalues. 316 00:21:05,220 --> 00:21:08,690 This has little r prime eigenvalues, and if you add up 317 00:21:08,690 --> 00:21:11,880 t plus little r plus little r prime, what do you get? 318 00:21:11,880 --> 00:21:17,790 You get capital M, which is the total number 319 00:21:17,790 --> 00:21:21,470 of states in the Markov chain. 320 00:21:21,470 --> 00:21:27,110 So the eigenvalues here are exactly the eigenvalues here 321 00:21:27,110 --> 00:21:33,300 plus the eigenvalues here, plus the eigenvalues here. 
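A sketch of this count with invented numbers: build a chain with one transient state and two 2-state recurrent classes, and check that the eigenvalues of p are exactly the eigenvalues of the three diagonal blocks put together. Note that the eigenvalue 1 shows up twice, once for each recurrent class.

```python
import numpy as np

# One transient state plus two recurrent classes (numbers are invented;
# each row of the assembled matrix sums to 1).
P_t  = np.array([[0.4]])                       # t = 1 transient state
P_r  = np.array([[0.7, 0.3], [0.4, 0.6]])      # first recurrent class, r = 2
P_rp = np.array([[0.1, 0.9], [0.8, 0.2]])      # second recurrent class, r' = 2
P = np.block([
    [P_t,              np.array([[0.3, 0.1]]), np.array([[0.1, 0.1]])],
    [np.zeros((2, 1)), P_r,                    np.zeros((2, 2))],
    [np.zeros((2, 1)), np.zeros((2, 2)),       P_rp],
])

# t + r + r' = M = 5 eigenvalues: the blocks' eigenvalues, pooled.
block_eigs = np.concatenate([np.linalg.eigvals(B) for B in (P_t, P_r, P_rp)])
full_eigs = np.linalg.eigvals(P)
print(np.allclose(np.sort_complex(block_eigs), np.sort_complex(full_eigs)))
```

With two recurrent classes the pooled list contains the eigenvalue 1 with multiplicity two, which bears directly on the repeated-eigenvalue question that comes up in a moment.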
322 00:21:33,300 --> 00:21:36,720 And you can find the eigenvectors, the left 323 00:21:36,720 --> 00:21:40,810 eigenvectors for these states in exactly 324 00:21:40,810 --> 00:21:43,450 the same way as before. 325 00:21:43,450 --> 00:21:44,570 OK. 326 00:21:44,570 --> 00:21:45,772 Yeah? 327 00:21:45,772 --> 00:21:48,628 STUDENT: So again, the eigenvalues can be repeated 328 00:21:48,628 --> 00:21:51,960 both within t, r, r prime, and in between the-- 329 00:21:51,960 --> 00:21:52,436 PROFESSOR: Yes. 330 00:21:52,436 --> 00:21:54,340 STUDENT: There's nothing that says [INAUDIBLE]. 331 00:21:54,340 --> 00:21:54,610 PROFESSOR: No. 332 00:21:54,610 --> 00:21:58,440 There's nothing that says they can't, except you can always 333 00:21:58,440 --> 00:22:05,980 find the left eigenvectors, anyway, and they are, in fact, 334 00:22:05,980 --> 00:22:08,680 things of this form. 335 00:22:08,680 --> 00:22:15,840 If pi is a left eigenvector of p sub r, then zero followed by 336 00:22:15,840 --> 00:22:17,460 pi followed by zero. 337 00:22:17,460 --> 00:22:26,480 In other words, little t zeros, followed by the 338 00:22:26,480 --> 00:22:32,060 eigenvector pi, followed by little r prime zeroes here, 339 00:22:32,060 --> 00:22:34,490 this has to be a left eigenvector of p. 340 00:22:34,490 --> 00:22:37,280 So this tells you something about whether you're going to 341 00:22:37,280 --> 00:22:40,140 have a Jordan form or not, one of these really 342 00:22:40,140 --> 00:22:41,240 ugly things in it. 343 00:22:41,240 --> 00:22:44,590 And it tells you that in many cases, you 344 00:22:44,590 --> 00:22:46,370 just can't have them. 345 00:22:46,370 --> 00:22:48,850 If you have them, they're usually tied up with this 346 00:22:48,850 --> 00:22:50,730 matrix here. 347 00:22:50,730 --> 00:22:53,140 OK, so that, I don't know. 348 00:22:53,140 --> 00:22:53,950 Was this useful? 349 00:22:53,950 --> 00:22:55,550 Does this clarify anything? 
350 00:22:55,550 --> 00:22:58,830 Or if it doesn't, it's too bad. 351 00:23:01,810 --> 00:23:02,330 OK. 352 00:23:02,330 --> 00:23:05,080 So now we want to start talking about rewards. 353 00:23:07,580 --> 00:23:09,150 Some people call these costs. 354 00:23:09,150 --> 00:23:11,230 If you're an optimist, you call it rewards. 355 00:23:11,230 --> 00:23:13,870 If you're a pessimist, you call it costs. 356 00:23:13,870 --> 00:23:15,520 They're both the same thing. 357 00:23:15,520 --> 00:23:18,180 If you're dealing with rewards, you maximize them. 358 00:23:18,180 --> 00:23:20,470 If you're dealing with costs, you minimize them. 359 00:23:20,470 --> 00:23:24,800 So mathematically, who cares? 360 00:23:24,800 --> 00:23:30,590 OK, so suppose that each state i of a Markov chain is 361 00:23:30,590 --> 00:23:33,280 associated with a given reward, r sub i. 362 00:23:33,280 --> 00:23:36,350 In other words, you think of this Markov chain, which is 363 00:23:36,350 --> 00:23:37,180 running along. 364 00:23:37,180 --> 00:23:41,320 You go from one state to another over time. 365 00:23:41,320 --> 00:23:45,930 And while this is happening, you're pocketing some reward 366 00:23:45,930 --> 00:23:47,250 all the time. 367 00:23:47,250 --> 00:23:47,650 OK. 368 00:23:47,650 --> 00:23:50,890 You invest in a stock. 369 00:23:50,890 --> 00:23:53,470 Strangely enough, these particular stocks we're 370 00:23:53,470 --> 00:23:57,270 thinking about here have this Markov property. 371 00:23:57,270 --> 00:23:59,970 Stocks really don't have a Markov property, but we'll 372 00:23:59,970 --> 00:24:02,130 assume they do. 373 00:24:02,130 --> 00:24:06,200 And since they have this Markov property, you win for a 374 00:24:06,200 --> 00:24:07,840 while, and you lose for a while. 375 00:24:07,840 --> 00:24:10,060 You win for a while, you lose for a while. 376 00:24:10,060 --> 00:24:12,770 But we have something extra, other than 377 00:24:12,770 --> 00:24:15,050 just the Markov chains. 
378 00:24:15,050 --> 00:24:18,830 We can analyze this whole situation, knowing how Markov 379 00:24:18,830 --> 00:24:20,670 chains behave. 380 00:24:20,670 --> 00:24:24,980 There's not much left besides that, but there are an 381 00:24:24,980 --> 00:24:29,860 extraordinary number of applications of this idea, and 382 00:24:29,860 --> 00:24:31,900 dynamic programming is one of them. 383 00:24:31,900 --> 00:24:35,380 Because that's just one added extension beyond 384 00:24:35,380 --> 00:24:37,880 this idea of rewards. 385 00:24:37,880 --> 00:24:38,380 OK. 386 00:24:38,380 --> 00:24:40,770 The random variable x of n. 387 00:24:40,770 --> 00:24:43,240 That's a random quantity. 388 00:24:43,240 --> 00:24:45,840 It's the state at time n. 389 00:24:45,840 --> 00:24:50,010 And the random reward at time n is then the random variable 390 00:24:50,010 --> 00:24:55,680 r of xn that maps xn equals i into ri for each i. 391 00:24:55,680 --> 00:24:59,140 This is the same idea of taking one random variable, 392 00:24:59,140 --> 00:25:02,030 which is a function of another random variable. 393 00:25:02,030 --> 00:25:06,000 The one random variable takes on the values one up to 394 00:25:06,000 --> 00:25:07,740 capital M. 395 00:25:07,740 --> 00:25:11,080 And then the other random variable takes on a value 396 00:25:11,080 --> 00:25:14,680 which is determined by the state that you happen to be 397 00:25:14,680 --> 00:25:16,600 in, which is this random state. 398 00:25:16,600 --> 00:25:21,700 So specifying r sub i specifies what the set of 399 00:25:21,700 --> 00:25:25,380 rewards are, what the reward is in each given state. 400 00:25:25,380 --> 00:25:28,520 Again, we have this awful problem, which I wish we could 401 00:25:28,520 --> 00:25:32,760 avoid in Markov chains, of using the same word state to 402 00:25:32,760 --> 00:25:35,900 talk about the set of different states. 
403 00:25:35,900 --> 00:25:38,120 And also to talk about the random state 404 00:25:38,120 --> 00:25:39,170 at any given time. 405 00:25:39,170 --> 00:25:43,560 But hopefully by now you're used to that. 406 00:25:43,560 --> 00:25:47,700 In our discussion here, the only thing we're going to talk 407 00:25:47,700 --> 00:25:50,670 about are expected rewards. 408 00:25:50,670 --> 00:25:55,810 Now, you know that expected rewards, or expectations, are a 409 00:25:55,810 --> 00:25:58,310 little more general than you would think they would be, 410 00:25:58,310 --> 00:26:02,060 because you can take the expected value of any sort 411 00:26:02,060 --> 00:26:04,300 of crazy thing. 412 00:26:04,300 --> 00:26:07,870 If you want to talk about any event, you can take the 413 00:26:07,870 --> 00:26:11,310 indicator function of that event, and find the expected 414 00:26:11,310 --> 00:26:13,890 value of that indicator function. 415 00:26:13,890 --> 00:26:16,920 And that's just the probability of that event. 416 00:26:16,920 --> 00:26:22,660 So by understanding how to deal with expectations, you 417 00:26:22,660 --> 00:26:25,560 really have the capability of finding distribution 418 00:26:25,560 --> 00:26:28,480 functions, or anything else you want to find. 419 00:26:28,480 --> 00:26:28,970 OK. 420 00:26:28,970 --> 00:26:31,490 But anyway, since we're interested only in expected 421 00:26:31,490 --> 00:26:37,555 rewards, the expected reward at time n, given that x zero 422 00:26:37,555 --> 00:26:44,950 is i, is the expected value of r of xn given x zero equals i, 423 00:26:44,950 --> 00:26:49,840 which is the sum over j of the reward you get if you're in 424 00:26:49,840 --> 00:26:55,700 state j at time n times p sub ij, super n, which we've 425 00:26:55,700 --> 00:27:00,850 talked about ad nauseum for the last four lectures now. 
426 00:27:00,850 --> 00:27:06,900 And this is the probability that the state at time n is j, 427 00:27:06,900 --> 00:27:09,910 given that the state at time zero is i. 428 00:27:09,910 --> 00:27:13,650 So you can just automatically find the expected 429 00:27:13,650 --> 00:27:17,570 value of r of xn. 430 00:27:17,570 --> 00:27:20,610 And it's by that formula. 431 00:27:20,610 --> 00:27:24,230 Now, recall that this quantity here is not all that simple. 432 00:27:24,230 --> 00:27:28,680 This is the ij element of the nth power of the 433 00:27:28,680 --> 00:27:31,010 matrix p. 434 00:27:31,010 --> 00:27:32,370 But, so what? 435 00:27:32,370 --> 00:27:36,130 We can at least write a nice formula for it now. 436 00:27:36,130 --> 00:27:40,140 The expected aggregate reward over the n steps from m to m 437 00:27:40,140 --> 00:27:43,080 plus n minus 1. 438 00:27:43,080 --> 00:27:44,900 What is m doing in here? 439 00:27:44,900 --> 00:27:48,970 It's just reminding us that Markov chains are 440 00:27:48,970 --> 00:27:51,890 homogeneous over time. 441 00:27:51,890 --> 00:27:56,370 So, when I talk about the aggregate reward from time m 442 00:27:56,370 --> 00:28:01,200 to m plus n minus 1, it's the same as the aggregate reward 443 00:28:01,200 --> 00:28:04,500 from time 0 up to time n minus 1. 444 00:28:04,500 --> 00:28:06,270 The expected values are the same. 445 00:28:06,270 --> 00:28:09,550 The actual sample functions are different. 446 00:28:09,550 --> 00:28:14,290 OK, so if I try to calculate this aggregate reward 447 00:28:14,290 --> 00:28:18,880 conditional on xm equals i, namely conditional on starting 448 00:28:18,880 --> 00:28:23,660 in state i, then this expected aggregate reward, I use that 449 00:28:23,660 --> 00:28:28,610 as a symbol for it, is the expected value of r of xm, 450 00:28:28,610 --> 00:28:30,310 given xm equals i. 451 00:28:30,310 --> 00:28:30,890 What is that? 452 00:28:30,890 --> 00:28:33,030 Well, that's ri. 
453 00:28:33,030 --> 00:28:35,220 I mean, given that xm is equal to i, this 454 00:28:35,220 --> 00:28:36,490 isn't random anymore. 455 00:28:36,490 --> 00:28:38,500 It's just the reward r sub i. 456 00:28:38,500 --> 00:28:45,350 Plus the expected value of r of xm plus 1, which is the sum 457 00:28:45,350 --> 00:28:49,490 over j, of pij times r sub j. 458 00:28:49,490 --> 00:28:54,305 That's at time m plus 1, given that you're in state i at time 459 00:28:54,305 --> 00:29:00,370 m, and so forth, up until time n minus 1, where the expected 460 00:29:00,370 --> 00:29:03,240 reward, then, involves p sub ij to the n minus 1. 00:29:06,180 --> 00:29:10,860 That's the probability of being in state j at time n minus 1 given that 461 00:29:10,860 --> 00:29:16,190 you started off in state i at time 0, times r sub j. 462 00:29:16,190 --> 00:29:20,790 And since expectations add, we have this nice, convenient 463 00:29:20,790 --> 00:29:22,040 formula here. 00:29:26,180 --> 00:29:30,580 We're doing something I normally hate doing, which is 464 00:29:30,580 --> 00:29:35,290 building up a lot of notation, and then using that notation 465 00:29:35,290 --> 00:29:40,470 to write extremely complicated formulas in a way that looks 466 00:29:40,470 --> 00:29:41,200 very simple. 467 00:29:41,200 --> 00:29:44,480 And therefore you will get some sense that what we're doing 468 00:29:44,480 --> 00:29:45,840 is very simple. 469 00:29:45,840 --> 00:29:48,160 These quantities in here, again, are 470 00:29:48,160 --> 00:29:49,790 not all that simple. 471 00:29:49,790 --> 00:29:52,550 But at least we can write it in a simple way. 472 00:29:52,550 --> 00:29:56,260 And since we can write it in a simple way, it turns out we 473 00:29:56,260 --> 00:29:59,160 can do some nice things with it. 474 00:29:59,160 --> 00:29:59,420 OK. 475 00:29:59,420 --> 00:30:00,970 So where do we go from all of this? 
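The two formulas just derived, the expected reward at time n and the expected aggregate reward over n steps, can be sketched numerically. This is a minimal illustration, not anything from the lecture itself: the transition matrix P and the reward vector r below are made-up numbers for a hypothetical 3-state chain.

```python
import numpy as np

# Hypothetical 3-state chain and per-state rewards (illustrative numbers only).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 4.0])

def expected_reward(P, r, i, n):
    """E[R(X_n) | X_0 = i] = sum_j P^n_{ij} r_j."""
    return np.linalg.matrix_power(P, n)[i] @ r

def expected_aggregate_reward(P, r, i, n):
    """v_i(n): expected sum of rewards over n steps, starting in state i.
    By homogeneity the start time can be taken to be 0."""
    return sum(expected_reward(P, r, i, h) for h in range(n))

single = expected_reward(P, r, 0, 5)
aggregate = expected_aggregate_reward(P, r, 0, 5)
```

Note that the aggregate reward satisfies the recursion v_i(n) = r_i + sum_j p_ij v_j(n-1), which is the same structure exploited later in the lecture.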
478 00:30:04,860 --> 00:30:12,280 We have just said that the expected reward we get, 479 00:30:12,280 --> 00:30:18,550 expected aggregate reward over n steps, namely from m up to m 480 00:30:18,550 --> 00:30:20,210 plus n minus 1. 481 00:30:20,210 --> 00:30:25,660 We're assuming that if we start at time m, we pick up a 482 00:30:25,660 --> 00:30:27,660 reward at time m. 483 00:30:27,660 --> 00:30:30,530 I mean, that's just an arbitrary decision. 484 00:30:30,530 --> 00:30:33,960 We might as well do that, because otherwise we just have 485 00:30:33,960 --> 00:30:36,840 one more transition matrix sitting here. 486 00:30:36,840 --> 00:30:38,660 OK, so we start at time m. 487 00:30:38,660 --> 00:30:42,640 We pick up a reward, which is conditional on the 488 00:30:42,640 --> 00:30:45,030 state we start in. 489 00:30:45,030 --> 00:30:53,040 And then we look at the expected reward for time m and 490 00:30:53,040 --> 00:30:58,420 time m plus 1, m plus 2, up to m plus n minus 1. 491 00:30:58,420 --> 00:31:00,610 Since we started at m, we're picking 492 00:31:00,610 --> 00:31:02,620 up n different rewards. 493 00:31:02,620 --> 00:31:07,490 We have to stop at time m plus n minus 1. 494 00:31:07,490 --> 00:31:14,040 OK, so that's this expected aggregate reward. 495 00:31:14,040 --> 00:31:17,890 Why do I care about expected aggregate reward? 496 00:31:17,890 --> 00:31:22,220 Because the rewards at any time n are sort of trivial. 497 00:31:22,220 --> 00:31:24,640 What we are interested in is how does this 498 00:31:24,640 --> 00:31:27,320 build up over time? 499 00:31:27,320 --> 00:31:29,150 You start to invest in a stock. 500 00:31:29,150 --> 00:31:34,480 You don't much care what it's worth at time 10. 501 00:31:34,480 --> 00:31:35,785 You care how it grows. 502 00:31:38,390 --> 00:31:41,040 You care about its value when you want to sell it, and you 503 00:31:41,040 --> 00:31:44,880 don't know when you're going to sell it, most of the time. 
504 00:31:44,880 --> 00:31:48,150 So you're really interested in these aggregate 505 00:31:48,150 --> 00:31:49,400 rewards that you accumulate. 00:31:52,260 --> 00:31:54,590 You'll see when we get to dynamic programming why 506 00:31:54,590 --> 00:31:56,780 you're interested in that, also. 507 00:31:56,780 --> 00:31:57,430 OK. 508 00:31:57,430 --> 00:32:01,340 If the Markov chain is an ergodic unit chain, then 509 00:32:01,340 --> 00:32:04,710 successive terms of this expression tend to a steady 510 00:32:04,710 --> 00:32:06,450 state gain per step. 511 00:32:06,450 --> 00:32:11,520 In other words, these terms here, when n gets very large, 512 00:32:11,520 --> 00:32:17,070 if I run this process for a very long time, what happens to p 513 00:32:17,070 --> 00:32:20,640 sub ij to the n minus 1? 514 00:32:20,640 --> 00:32:27,920 This tends towards the steady state probability pi sub j. 515 00:32:27,920 --> 00:32:31,710 And it doesn't matter where we started. 516 00:32:31,710 --> 00:32:34,690 The only thing of importance is where we end up. 517 00:32:34,690 --> 00:32:37,180 It doesn't matter how high this is. 518 00:32:37,180 --> 00:32:42,670 So we have a sum over j, of pi sub j times r sub j. 519 00:32:42,670 --> 00:32:48,745 After a very long time, the expected gain per step is just 520 00:32:48,745 --> 00:32:51,930 a sum of pi sub j times r sub j. 521 00:32:51,930 --> 00:32:56,000 That's what's important after a long time. 522 00:32:56,000 --> 00:32:58,290 And that's independent of the starting state. 523 00:32:58,290 --> 00:33:02,670 So what we have here is a big, messy transient, which is a 524 00:33:02,670 --> 00:33:04,780 sum of a whole bunch of things. 525 00:33:04,780 --> 00:33:08,090 And then eventually it just settles down, and every extra 526 00:33:08,090 --> 00:33:15,190 step you do, you just pick up an extra factor of g as an 527 00:33:15,190 --> 00:33:16,970 extra reward. 
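That steady-state gain per step, g, can be computed directly: solve pi P = pi with the components of pi summing to 1, then form g as the sum over j of pi sub j times r sub j. A minimal numpy sketch, again assuming a made-up ergodic chain rather than any example from the lecture:

```python
import numpy as np

# Hypothetical ergodic unit chain and rewards (illustrative numbers only).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 4.0])

def steady_state_gain(P, r):
    """Solve pi P = pi with sum(pi) = 1, then return g = sum_j pi_j r_j."""
    M = P.shape[0]
    # Stack the balance equations (P^T - I) pi = 0 with the
    # normalization sum(pi) = 1, and solve by least squares.
    A = np.vstack([P.T - np.eye(M), np.ones(M)])
    b = np.zeros(M + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi @ r

g = steady_state_gain(P, r)
```

Running the chain for many steps from any start state gives the same gain per step, which is exactly the independence-of-starting-state point being made here.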
529 00:33:16,970 --> 00:33:19,960 The reward might, of course, be negative, like in the stock 530 00:33:19,960 --> 00:33:25,100 market over the last 10 years, or up until the last year or 531 00:33:25,100 --> 00:33:27,980 so, which was negative for a long time. 532 00:33:27,980 --> 00:33:30,800 But that doesn't make any difference. 533 00:33:30,800 --> 00:33:34,480 This is just a number, and this is independent of 534 00:33:34,480 --> 00:33:36,590 starting state. 535 00:33:36,590 --> 00:33:41,740 And v sub i of n can be viewed as a transient in i, which is all 536 00:33:41,740 --> 00:33:43,330 this stuff at the beginning. 537 00:33:43,330 --> 00:33:47,010 The sum of all these terms at the beginning plus something 538 00:33:47,010 --> 00:33:50,290 that settles down over a long period of time. 539 00:33:50,290 --> 00:33:54,200 How to calculate that transient, and how to combine it 540 00:33:54,200 --> 00:33:56,230 with the steady state gain? 541 00:33:56,230 --> 00:33:59,920 The notes talk a great deal about that. 542 00:33:59,920 --> 00:34:03,970 What we're trying to do today is to talk about dynamic 543 00:34:03,970 --> 00:34:09,080 programming without going into all of this terrible mess 544 00:34:09,080 --> 00:34:12,250 of dealing with rewards in a very 545 00:34:12,250 --> 00:34:14,239 systematic and simple way. 546 00:34:14,239 --> 00:34:16,199 You can read about that later. 547 00:34:16,199 --> 00:34:19,610 What we're aiming at is to talk about dynamic programming 548 00:34:19,610 --> 00:34:23,340 a little bit, and then get off to other things. 549 00:34:23,340 --> 00:34:23,870 OK. 550 00:34:23,870 --> 00:34:27,239 So anyway, we have a transient, plus we have a 551 00:34:27,239 --> 00:34:29,330 steady state gain. 552 00:34:29,330 --> 00:34:31,470 The transient is important. 553 00:34:31,470 --> 00:34:34,520 And it's particularly important if g equals zero. 
554 00:34:34,520 --> 00:34:40,090 Namely if your average gain per step is nothing, then what 555 00:34:40,090 --> 00:34:47,980 you're primarily interested in is how valuable is it to start 556 00:34:47,980 --> 00:34:49,360 in a particular state? 557 00:34:49,360 --> 00:34:53,000 If you start in one state versus another state, you 558 00:34:53,000 --> 00:34:56,600 might get a great deal of reward in this one state, 559 00:34:56,600 --> 00:34:59,120 whereas you make a loss in some other state. 560 00:34:59,120 --> 00:35:03,200 So it's important to know which state is worth being in. 561 00:35:03,200 --> 00:35:07,960 So that's the next thing we try to look at. 562 00:35:07,960 --> 00:35:12,410 How does the state affect things? 563 00:35:12,410 --> 00:35:17,760 This brings us to one example which is particularly useful. 564 00:35:17,760 --> 00:35:22,360 And along with being a useful example, well, it's a nice 565 00:35:22,360 --> 00:35:25,840 illustration of Markov rewards. 566 00:35:25,840 --> 00:35:30,980 It's also something which you often want to find. 567 00:35:30,980 --> 00:35:35,800 And when we start talking about renewal processes, you 568 00:35:35,800 --> 00:35:40,890 will find that this idea here is a nice connection between 569 00:35:40,890 --> 00:35:43,340 Markov chains and renewal processes. 570 00:35:43,340 --> 00:35:47,240 So it's important for a whole bunch of different reasons. 571 00:35:47,240 --> 00:35:48,220 OK. 572 00:35:48,220 --> 00:35:52,470 Suppose we have some arbitrary unit chain, namely a chain 573 00:35:52,470 --> 00:35:56,060 with a single set of recurrent states. 574 00:35:56,060 --> 00:35:59,710 We want to find the expected number of steps, starting from 575 00:35:59,710 --> 00:36:04,260 a given state i, until some particular 576 00:36:04,260 --> 00:36:06,560 state 1 is first entered. 577 00:36:06,560 --> 00:36:09,070 So you start at one state. 578 00:36:09,070 --> 00:36:12,090 There's this other state way over here. 
579 00:36:12,090 --> 00:36:15,690 This state is recurrent, so presumably, eventually you're 580 00:36:15,690 --> 00:36:17,580 going to enter it. 581 00:36:17,580 --> 00:36:20,170 And you want to find out, what's the expected time that 582 00:36:20,170 --> 00:36:23,810 it takes to get to that particular state? 583 00:36:23,810 --> 00:36:26,110 OK? 584 00:36:26,110 --> 00:36:30,160 If you're a Ph.D. student, you have this Markov chain of 585 00:36:30,160 --> 00:36:32,310 doing your research. 586 00:36:32,310 --> 00:36:36,180 And at some point, you're going to get a Ph.D. So we can 587 00:36:36,180 --> 00:36:39,900 think of this as the first-passage time to your first 588 00:36:39,900 --> 00:36:44,500 Ph.D. I mean, if you want to get more Ph.D.'s, fine, but 589 00:36:44,500 --> 00:36:47,560 that's probably a different Markov chain. 590 00:36:47,560 --> 00:36:48,550 OK. 591 00:36:48,550 --> 00:36:53,110 So anyway, that's the problem we're trying to solve here. 592 00:36:53,110 --> 00:36:56,690 We can view this problem as a reward problem. 593 00:36:56,690 --> 00:36:59,750 We have to go through a number of steps if we want to view it 594 00:36:59,750 --> 00:37:01,940 as a reward problem. 595 00:37:01,940 --> 00:37:07,390 The first one, first step is to assign one unit of reward 596 00:37:07,390 --> 00:37:11,430 to each successive state until you enter state 1. 597 00:37:11,430 --> 00:37:15,040 So you're bombing through this Markov chain, a frog jumping 598 00:37:15,040 --> 00:37:17,120 from lily pad to lily pad. 599 00:37:17,120 --> 00:37:19,590 And finally, the frog gets to the lily pad 600 00:37:19,590 --> 00:37:21,500 with the food on it. 601 00:37:21,500 --> 00:37:25,780 And the frog wants to know, is it going to starve before it 602 00:37:25,780 --> 00:37:28,830 gets to this lily pad with the food on it? 
603 00:37:28,830 --> 00:37:32,940 So, if we're trying to find the expected time to get 604 00:37:32,940 --> 00:37:35,850 there, here what we're really interested in is a cost, 605 00:37:35,850 --> 00:37:39,920 because the frog is in danger of starving. 606 00:37:39,920 --> 00:37:42,220 Or on the other hand, there might be a snake lying under 607 00:37:42,220 --> 00:37:44,470 this one lily pad. 608 00:37:44,470 --> 00:37:47,770 And then he's getting a reward for staying alive. 609 00:37:47,770 --> 00:37:51,390 You can look at these things whichever way you want to. 610 00:37:51,390 --> 00:37:51,880 OK. 611 00:37:51,880 --> 00:37:55,020 We're going to assign one unit of reward to each successive 612 00:37:55,020 --> 00:37:56,800 state until state 1 is entered. 613 00:37:56,800 --> 00:38:01,430 1 is just an arbitrary state that we've selected. 614 00:38:01,430 --> 00:38:04,760 That's where the snake is underneath a lily pad, or 615 00:38:04,760 --> 00:38:08,130 that's where the food is, or what have you. 616 00:38:08,130 --> 00:38:10,450 Now, there's something else we have to do. 617 00:38:10,450 --> 00:38:17,010 Because if we're starting out at some arbitrary state i, and 618 00:38:17,010 --> 00:38:19,910 we're trying to look for the first time that we enter state 619 00:38:19,910 --> 00:38:23,695 1, what do you do after you enter state 1? 620 00:38:26,670 --> 00:38:32,400 Well eventually, normally you're going to go away from 621 00:38:32,400 --> 00:38:34,110 state 1, and you're going to start 622 00:38:34,110 --> 00:38:36,380 picking up rewards again. 623 00:38:36,380 --> 00:38:38,990 You don't want that to happen. 624 00:38:38,990 --> 00:38:42,020 So you do something we do all the time when we're dealing 625 00:38:42,020 --> 00:38:45,510 with Markov chains, which is we start with one Markov 626 00:38:45,510 --> 00:38:49,070 chain, and we say, to solve this problem I'm interested 627 00:38:49,070 --> 00:38:52,110 in, I've got to change the Markov chain. 
628 00:38:52,110 --> 00:38:54,350 So how are we going to change it? 629 00:38:54,350 --> 00:38:58,160 We're going to change it to say, once we get in state 1, 630 00:38:58,160 --> 00:38:59,455 we're going to stay there forever. 631 00:39:02,070 --> 00:39:04,600 Or in other words, the frog gets eaten by the snake, and 632 00:39:04,600 --> 00:39:09,650 therefore its remains always stay at that one lily pad. 633 00:39:09,650 --> 00:39:11,750 So we change the Markov chain again. 634 00:39:11,750 --> 00:39:14,450 The frog can't jump anymore. 635 00:39:14,450 --> 00:39:18,290 And the way we change it is to set the transition 636 00:39:18,290 --> 00:39:23,910 probability out of state 1, p sub 1, 1, namely the 637 00:39:23,910 --> 00:39:27,010 probability, given you're in state 1, of going back to 638 00:39:27,010 --> 00:39:30,320 state 1 in the next transition, equal to 1. 639 00:39:30,320 --> 00:39:32,670 So whenever you get to state 1, you 640 00:39:32,670 --> 00:39:35,270 just stay there forever. 641 00:39:35,270 --> 00:39:39,210 We're going to set r1 equal to zero, namely the reward you 642 00:39:39,210 --> 00:39:42,240 get in state 1 will be zero. 643 00:39:42,240 --> 00:39:46,070 So you keep getting rewards until you go to state 1. 644 00:39:46,070 --> 00:39:49,840 And then when you go to state 1, you don't get any reward. 645 00:39:49,840 --> 00:39:54,150 You don't get any reward at any time after that. 646 00:39:54,150 --> 00:39:56,600 So in fact, we've converted the problem. 647 00:39:56,600 --> 00:39:59,970 We've converted the Markov chain to be able to solve the 648 00:39:59,970 --> 00:40:03,160 problem that we want to solve. 649 00:40:03,160 --> 00:40:07,660 Now, how do we know that we haven't changed the problem in 650 00:40:07,660 --> 00:40:10,330 some awful way? 
651 00:40:10,330 --> 00:40:13,710 I mean, any time you start out with a Markov chain and you 652 00:40:13,710 --> 00:40:16,510 modify it, and you solve a problem for the modified 653 00:40:16,510 --> 00:40:20,410 chain, you have to really think through whether you 654 00:40:20,410 --> 00:40:23,550 changed the problem that you started to solve. 655 00:40:23,550 --> 00:40:27,790 Well, think of any sample path which starts in some state i, 656 00:40:27,790 --> 00:40:29,610 which is not equal to 1. 657 00:40:29,610 --> 00:40:33,930 Think of the sample path as going forever. 658 00:40:33,930 --> 00:40:38,430 In the original Markov chain, that sample path at some 659 00:40:38,430 --> 00:40:43,050 point, presumably, is going to get to state 1. 660 00:40:43,050 --> 00:40:47,100 After it gets to state 1, we don't care what happens, 661 00:40:47,100 --> 00:40:51,520 because we then know how long it's taken to get to state 1. 662 00:40:51,520 --> 00:40:54,550 And after it gets to state 1, the transition 663 00:40:54,550 --> 00:40:56,410 probabilities change. 664 00:40:56,410 --> 00:40:58,410 We don't care about that. 665 00:40:58,410 --> 00:41:03,570 So for every sample path, the first-passage 666 00:41:03,570 --> 00:41:08,370 time to state 1 is the same in the modified chain 667 00:41:08,370 --> 00:41:10,920 as it is in the original chain. 668 00:41:10,920 --> 00:41:15,750 The transition probabilities are the same up until the time 669 00:41:15,750 --> 00:41:17,770 when you first get to state 1. 670 00:41:17,770 --> 00:41:22,300 So for first-passage-time problems, it doesn't make any 671 00:41:22,300 --> 00:41:26,550 difference what you do after you get to state 1. 672 00:41:26,550 --> 00:41:30,590 So to make the problem easy, we're going to set this 673 00:41:30,590 --> 00:41:34,450 transition probability in state 1 to 1, and we're going 674 00:41:34,450 --> 00:41:38,830 to set the reward equal to zero. 
675 00:41:38,830 --> 00:41:46,710 What do you call a state which has p sub i, i equal to 1? 676 00:41:46,710 --> 00:41:48,700 You call it a trapping state. 677 00:41:48,700 --> 00:41:51,080 It's a trapping state because once you get there, 678 00:41:51,080 --> 00:41:52,330 you can't get out. 679 00:41:55,500 --> 00:41:59,710 And since we started out with a unit chain, and since 680 00:41:59,710 --> 00:42:03,650 presumably state 1 is a recurrent state in that unit 681 00:42:03,650 --> 00:42:06,500 chain, eventually you're going to get to state 1. 682 00:42:06,500 --> 00:42:08,560 But once you get there, you can't get out. 683 00:42:08,560 --> 00:42:11,690 So what you've done is you've turned the unit chain into 684 00:42:11,690 --> 00:42:15,200 another unit chain where the recurrent set of states has 685 00:42:15,200 --> 00:42:17,900 only this one state, state 1, in it. 686 00:42:17,900 --> 00:42:19,690 So it's a trapping state. 687 00:42:19,690 --> 00:42:23,920 Everything eventually leads to state 1. 688 00:42:23,920 --> 00:42:26,600 All roads lead to Rome, but it's not obvious that they're 689 00:42:26,600 --> 00:42:28,350 leading to Rome. 690 00:42:28,350 --> 00:42:31,480 And all of these states eventually lead to state 1, 691 00:42:31,480 --> 00:42:34,420 but not for quite a while sometimes. 692 00:42:34,420 --> 00:42:35,050 OK. 693 00:42:35,050 --> 00:42:37,710 So the probability of an initial segment until 1 is 694 00:42:37,710 --> 00:42:41,960 entered is unchanged, and the expected first-passage time 695 00:42:41,960 --> 00:42:43,210 is unchanged. 696 00:42:45,630 --> 00:42:45,770 OK. 697 00:42:45,770 --> 00:42:50,430 The modified Markov chain is now an ergodic unit chain. 698 00:42:50,430 --> 00:42:53,580 It has a single recurrent state. 699 00:42:53,580 --> 00:42:57,150 State 1 is a trapping state, we call it. 700 00:42:57,150 --> 00:43:03,730 ri is equal to 1 for i not equal to 1, and r1 is equal to zero. 
701 00:43:03,730 --> 00:43:08,480 This says that if state 1 is first entered at time l, then 702 00:43:08,480 --> 00:43:13,770 the aggregate reward from 0 to n is l for all n greater than 703 00:43:13,770 --> 00:43:14,335 or equal to l. 704 00:43:14,335 --> 00:43:16,780 In other words, after you get to the trapping state, you 705 00:43:16,780 --> 00:43:19,410 stay there, and you don't pick up any more 706 00:43:19,410 --> 00:43:21,250 reward from then on. 707 00:43:21,250 --> 00:43:23,970 One of the things that's maddening about problems like 708 00:43:23,970 --> 00:43:26,720 this, at least that's maddening for me, because I 709 00:43:26,720 --> 00:43:30,710 can't keep those things straight, is the difference 710 00:43:30,710 --> 00:43:34,290 between n and n plus 1, or n and n minus 1. 711 00:43:34,290 --> 00:43:37,280 There's always that strange thing, we've started at time 712 00:43:37,280 --> 00:43:40,270 m, we get reward at time m. 713 00:43:40,270 --> 00:43:43,600 So if we're looking at n rewards, we go from m 714 00:43:43,600 --> 00:43:46,860 to m plus n minus 1. 715 00:43:46,860 --> 00:43:50,150 And that's just life. 716 00:43:50,150 --> 00:43:52,910 If you try to do it in a different way, you wind up 717 00:43:52,910 --> 00:43:54,800 with a similar problem. 718 00:43:54,800 --> 00:43:56,220 You can't avoid it. 719 00:43:56,220 --> 00:44:02,130 OK, so what we're trying to find is this expected value, 720 00:44:02,130 --> 00:44:06,470 v sub i of n, and the limit as n goes to infinity, we'll just 721 00:44:06,470 --> 00:44:10,640 call that v sub i without the n on it. 722 00:44:10,640 --> 00:44:14,620 And what we want to do is to calculate this expected time 723 00:44:14,620 --> 00:44:18,040 until we first enter state 1. 724 00:44:18,040 --> 00:44:22,900 We want to calculate that for all of the other states i. 725 00:44:22,900 --> 00:44:26,980 Well fortunately, there's a sneaky way to calculate this. 
726 00:44:26,980 --> 00:44:29,170 For most of these problems, there's a sneaky way to 727 00:44:29,170 --> 00:44:30,680 calculate these limits. 728 00:44:30,680 --> 00:44:34,640 And you don't have to worry about the limit. 729 00:44:34,640 --> 00:44:37,010 So the next thing I'm going to do is to explain what this 730 00:44:37,010 --> 00:44:39,760 sneaky way is. 731 00:44:39,760 --> 00:44:44,710 You will see the same sneaky method done about 100 times 732 00:44:44,710 --> 00:44:46,460 from now on until the end of the course. 733 00:44:46,460 --> 00:44:48,760 We use it all the time. 734 00:44:48,760 --> 00:44:52,250 And each time we do it, we'll get a better sense of what it 735 00:44:52,250 --> 00:44:53,710 really amounts to. 736 00:44:53,710 --> 00:44:59,150 So for each state not equal to the trapping state, let's 737 00:44:59,150 --> 00:45:02,290 start out by assuming that we start at time 738 00:45:02,290 --> 00:45:04,470 zero in state i. 739 00:45:04,470 --> 00:45:08,580 In other words, what this means is first we're going to 740 00:45:08,580 --> 00:45:12,490 assume that x sub 0 equals i for some given i. 741 00:45:12,490 --> 00:45:14,300 We're going to go through whatever we're going to go 742 00:45:14,300 --> 00:45:17,620 through, then we'll go back and assume that x sub 0 is 743 00:45:17,620 --> 00:45:18,890 some other i. 744 00:45:18,890 --> 00:45:21,800 And we don't have to worry about that, because i is just 745 00:45:21,800 --> 00:45:22,900 a generic state. 746 00:45:22,900 --> 00:45:26,320 So we'll do it for everything at once. 747 00:45:26,320 --> 00:45:30,630 There's a unit reward at time 0. 748 00:45:30,630 --> 00:45:32,970 r sub i is equal to 1. 749 00:45:32,970 --> 00:45:37,270 So we start out at time zero in state i. 750 00:45:37,270 --> 00:45:41,070 We pick up our reward of 1, and then we go on from there 751 00:45:41,070 --> 00:45:46,370 to see how much longer it takes to get to state 1. 
752 00:45:46,370 --> 00:45:53,170 In addition to this unit reward at time zero, which 753 00:45:53,170 --> 00:45:56,430 means it's already taken us one unit of time to get to 754 00:45:56,430 --> 00:46:02,120 state 1, given that x sub 1 equals j, namely, given that 755 00:46:02,120 --> 00:46:07,910 we go from state i to state j, the remaining expected reward 756 00:46:07,910 --> 00:46:10,380 is v sub j. 757 00:46:10,380 --> 00:46:15,830 In other words, at time 0, I'm in some state i. 758 00:46:15,830 --> 00:46:21,110 Given that I go to some state j at the next unit of time, 759 00:46:21,110 --> 00:46:24,930 what's the remaining expected time to 760 00:46:24,930 --> 00:46:27,560 get to state 1? 761 00:46:27,560 --> 00:46:32,830 The remaining expected time is just v sub j, because that's 762 00:46:32,830 --> 00:46:34,050 the expected time. 763 00:46:34,050 --> 00:46:37,550 I mean, if v sub j is something where it's very hard 764 00:46:37,550 --> 00:46:41,560 to get to state 1, then we really lost out. 765 00:46:41,560 --> 00:46:44,370 If it's something which is closer to state 1 in some 766 00:46:44,370 --> 00:46:45,730 sense, then we've gained. 767 00:46:45,730 --> 00:46:51,180 But what we wind up with is the expected time to get to 768 00:46:51,180 --> 00:46:55,370 state 1 from state i is 1. 769 00:46:55,370 --> 00:46:59,450 That's the instant reward that we get, or the instant cost 770 00:46:59,450 --> 00:47:04,880 that we pay, plus, for each of the possible states 771 00:47:04,880 --> 00:47:06,420 we might get to, 772 00:47:06,420 --> 00:47:11,290 the cost to go, or reward to go, from that 773 00:47:11,290 --> 00:47:12,470 particular j. 774 00:47:12,470 --> 00:47:15,320 So this is the formula we have to solve. 775 00:47:15,320 --> 00:47:16,190 What's this mean? 776 00:47:16,190 --> 00:47:20,280 It means we have to solve this formula for all i. 
777 00:47:20,280 --> 00:47:24,870 If I solve it for all i, and I've solved this for all i, 778 00:47:24,870 --> 00:47:28,910 then we have a set of linear equations in the variables v 779 00:47:28,910 --> 00:47:40,010 sub 2 up to v sub M, one equation for each i equals 2, up to M. 780 00:47:40,010 --> 00:47:44,660 We also have decided that v sub 1 is equal to 0. 781 00:47:44,660 --> 00:47:48,350 In other words, if we start out in state 1, the expected 782 00:47:48,350 --> 00:47:50,670 time to get to state 1 is 0. 783 00:47:50,670 --> 00:47:53,260 We're already there. 784 00:47:53,260 --> 00:47:53,730 OK. 785 00:47:53,730 --> 00:47:57,300 So we have to solve these linear equations. 786 00:47:57,300 --> 00:48:03,130 And if your philosophy on solving linear equations is 787 00:48:03,130 --> 00:48:08,930 that of, I shouldn't say a computer scientist because I 788 00:48:08,930 --> 00:48:11,830 don't want to indicate that they are any different from 789 00:48:11,830 --> 00:48:16,960 any of the rest of us, but for many people, your philosophy 790 00:48:16,960 --> 00:48:20,720 of solving linear equations is to try to solve it. 791 00:48:20,720 --> 00:48:24,440 If you can't solve it, it doesn't have any solution. 792 00:48:24,440 --> 00:48:28,020 And if you're happy with doing that, fine. 793 00:48:28,020 --> 00:48:33,480 Some people would rather spend 10 hours asking whether in 794 00:48:33,480 --> 00:48:37,030 general it has any solution, rather than spending five 795 00:48:37,030 --> 00:48:38,806 minutes solving it. 796 00:48:38,806 --> 00:48:48,420 So either way, this expected first-passage time, we've just 797 00:48:48,420 --> 00:48:50,390 stated what it is. 798 00:48:50,390 --> 00:48:57,710 Starting in state i, it's 1 plus the time to go from whatever 799 00:48:57,710 --> 00:48:59,840 other state you happen to go to. 
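These equations, v1 equals 0 and v sub i equals 1 plus the sum over j of p sub ij times v sub j for i not equal to 1, reduce to a single linear solve over the non-trapping states. A minimal sketch with a hypothetical three-state unit chain (the numbers are invented, and state 0 below plays the role of the lecture's state 1):

```python
import numpy as np

# Hypothetical unit chain (illustrative numbers only); index 0 plays the
# role of the lecture's "state 1", the state we want to enter first.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

def expected_first_passage_times(P, target=0):
    """Solve v_target = 0 and v_i = 1 + sum_j P_ij v_j for i != target.
    Since v_target = 0, only transitions among non-target states matter,
    so the system reduces to (I - Q) v = 1, where Q is the sub-matrix of
    P restricted to the non-target states."""
    M = P.shape[0]
    others = [i for i in range(M) if i != target]
    Q = P[np.ix_(others, others)]
    v = np.zeros(M)
    v[others] = np.linalg.solve(np.eye(M - 1) - Q, np.ones(M - 1))
    return v

v = expected_first_passage_times(P)
```

The returned vector satisfies exactly the fixed-point relation stated above: each entry is 1 plus the expected time-to-go from the next state.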
800 00:48:59,840 --> 00:49:03,910 If we put this in vector form, you put things in vector form 801 00:49:03,910 --> 00:49:06,670 because you want to spend two hours finding the general 802 00:49:06,670 --> 00:49:09,685 solution, rather than five minutes solving the problem. 803 00:49:14,240 --> 00:49:18,660 If you have 1,000 states, then it works the other way. 804 00:49:18,660 --> 00:49:22,300 It takes you multiple hours to work it out by hand, and it 805 00:49:22,300 --> 00:49:25,430 takes you five minutes by looking at the equation. 806 00:49:25,430 --> 00:49:29,240 So sometimes you win, and sometimes you lose by looking 807 00:49:29,240 --> 00:49:30,780 at the general solution. 808 00:49:30,780 --> 00:49:37,360 If you look at this as a vector equation, the vector v, 809 00:49:37,360 --> 00:49:43,080 where v1 is equal to zero, and the other v's are unknowns, satisfies 810 00:49:43,080 --> 00:49:47,590 v equals r plus Pv, where the vector r is 0 in state 1. 811 00:49:47,590 --> 00:49:50,030 Zero reward in state 1. 812 00:49:50,030 --> 00:49:53,020 Unit reward in all other states, because we're trying 813 00:49:53,020 --> 00:49:55,860 to get to this end. 814 00:49:55,860 --> 00:50:00,780 And then we have the matrix here, P times v. 815 00:50:00,780 --> 00:50:04,780 So we want to solve this set of linear equations, and what 816 00:50:04,780 --> 00:50:08,720 do we know about this set of linear equations? 817 00:50:08,720 --> 00:50:11,890 We have an ergodic unit chain. 818 00:50:11,890 --> 00:50:16,410 We know that P has an eigenvalue, 819 00:50:16,410 --> 00:50:18,700 which is equal to 1. 820 00:50:18,700 --> 00:50:22,040 We know that's a simple eigenvalue. 821 00:50:22,040 --> 00:50:37,130 So in fact, we can write v equals r plus Pv as zero 822 00:50:37,130 --> 00:50:50,070 equals r plus the quantity P minus I, times v. 823 00:50:50,070 --> 00:50:52,190 And when we try to ask whether v has any 824 00:50:52,190 --> 00:50:55,040 solution, what's the answer? 
825 00:50:55,040 --> 00:50:59,140 Well, this matrix P here has an eigenvalue of 1. 826 00:50:59,140 --> 00:51:02,030 Since it has an eigenvalue of one, and since it's a simple 827 00:51:02,030 --> 00:51:06,160 eigenvalue, there's a space of solutions to this equation. 828 00:51:06,160 --> 00:51:11,330 The space of solutions is spanned by the 829 00:51:11,330 --> 00:51:12,850 vector e of all ones. 830 00:51:12,850 --> 00:51:17,650 In other words, it's the vector e times any constant alpha. 831 00:51:17,650 --> 00:51:21,460 Now we've stuck this in here, so now we want to find out 832 00:51:21,460 --> 00:51:25,200 what's the set of solutions now. 833 00:51:25,200 --> 00:51:31,730 We observe that v plus alpha e also satisfies this equation, so if we 834 00:51:31,730 --> 00:51:33,500 found one solution, we've found another solution. 835 00:51:33,500 --> 00:51:37,450 So if we found a solution, we have a one dimensional family 836 00:51:37,450 --> 00:51:40,110 of solutions. 837 00:51:40,110 --> 00:51:47,520 Well, since this eigenvalue is a simple eigenvalue, the set 838 00:51:47,520 --> 00:51:56,040 of vectors v for which r is equal to I minus P, times v, is 839 00:51:56,040 --> 00:51:59,390 a one dimensional family, and therefore, once we set v1 to zero, there has to be a 840 00:51:59,390 --> 00:52:02,350 unique solution to this equation. 841 00:52:02,350 --> 00:52:03,490 OK. 842 00:52:03,490 --> 00:52:07,460 So in fact, in only 15 minutes, we've solved the 843 00:52:07,460 --> 00:52:13,710 problem in general, so that you can deal with matrices of 844 00:52:13,710 --> 00:52:17,990 1,000 states, as opposed to two states. 845 00:52:17,990 --> 00:52:20,170 And you still have the same answer. 846 00:52:20,170 --> 00:52:21,840 OK. 847 00:52:21,840 --> 00:52:26,970 So this equation has a unique solution, which says that you 848 00:52:26,970 --> 00:52:29,850 can program your computer to solve this set of linear 849 00:52:29,850 --> 00:52:33,270 equations, and you're bound to get an answer. 
850 00:52:33,270 --> 00:52:35,740 And the answer will tell you how long it takes to get to 851 00:52:35,740 --> 00:52:39,958 this particular state. 852 00:52:39,958 --> 00:52:40,390 OK. 853 00:52:40,390 --> 00:52:46,705 Let's go on to aggregate rewards with a final reward. 854 00:52:51,420 --> 00:52:53,560 Starting to sound like-- yes? 855 00:52:53,560 --> 00:52:56,990 STUDENT: I'm sorry, for the last example, how are we 856 00:52:56,990 --> 00:52:57,970 guaranteed that it's ergodic? 857 00:52:57,970 --> 00:53:01,370 Like, isn't it possible you enter a loop somewhere that can never 858 00:53:01,370 --> 00:53:05,670 go to your trapping state, right? 859 00:53:05,670 --> 00:53:09,750 PROFESSOR: But I can't do that because there always has to be 860 00:53:09,750 --> 00:53:12,520 a way of getting to the trapping state, because 861 00:53:12,520 --> 00:53:14,770 there's only one recurrent state. 862 00:53:14,770 --> 00:53:19,170 All these other states are transient now. 863 00:53:19,170 --> 00:53:19,920 STUDENT: No, but I mean-- 864 00:53:19,920 --> 00:53:21,467 OK, like, let's say you start off with a 865 00:53:21,467 --> 00:53:22,655 general Markov chain. 866 00:53:22,655 --> 00:53:24,560 PROFESSOR: Oh, I start off with a general Markov chain? 867 00:53:24,560 --> 00:53:27,060 You're absolutely right. 868 00:53:27,060 --> 00:53:30,060 Then there might be no way of getting from some starting 869 00:53:30,060 --> 00:53:34,610 state to state 1, and therefore, the amount of time 870 00:53:34,610 --> 00:53:36,890 that it takes you to get from that starting state to the trapping 871 00:53:36,890 --> 00:53:38,750 state is going to be infinite. 872 00:53:38,750 --> 00:53:40,250 You can't get there. 873 00:53:40,250 --> 00:53:43,960 So in fact, what you have to do with a problem like this is 874 00:53:43,960 --> 00:53:48,730 to look at it first, and say, are you in fact dealing with a 875 00:53:48,730 --> 00:53:49,760 unit chain? 
876 00:53:49,760 --> 00:53:52,990 Or do you have multiple recurrent sets? 877 00:53:52,990 --> 00:53:57,100 If you have multiple recurrent sets, then the expected time 878 00:53:57,100 --> 00:54:00,770 to get into one of the recurrent states, starting 879 00:54:00,770 --> 00:54:04,840 from either a transient state, or from some other recurrent 880 00:54:04,840 --> 00:54:08,720 set is infinite. 881 00:54:08,720 --> 00:54:11,820 I mean, just like this business we were going through 882 00:54:11,820 --> 00:54:13,480 at the beginning. 883 00:54:13,480 --> 00:54:16,050 What you would like to do is not have to go through a lot 884 00:54:16,050 --> 00:54:20,750 of calculation, or a lot of thinking, when you 885 00:54:20,750 --> 00:54:24,070 have multiple recurrent sets of states. 886 00:54:24,070 --> 00:54:25,980 You just know what happens there. 887 00:54:25,980 --> 00:54:28,540 There's no way to get from this recurrent set to this 888 00:54:28,540 --> 00:54:30,020 recurrent set. 889 00:54:30,020 --> 00:54:31,440 So that's the end of it. 890 00:54:31,440 --> 00:54:31,888 STUDENT: OK. 891 00:54:31,888 --> 00:54:34,277 So like it works when you have the unit chain, and then you 892 00:54:34,277 --> 00:54:36,585 choose your trapping state to be one instance [INAUDIBLE]. 893 00:54:36,585 --> 00:54:37,835 PROFESSOR: Yes. 894 00:54:39,700 --> 00:54:40,150 OK. 895 00:54:40,150 --> 00:54:41,400 Good. 896 00:54:44,220 --> 00:54:47,160 Now, yes? 897 00:54:47,160 --> 00:54:50,410 STUDENT: The previous equation is true for any reward. 898 00:54:50,410 --> 00:54:51,692 But it's not necessary-- 899 00:54:51,692 --> 00:54:53,950 PROFESSOR: Yeah, it is true for any set of rewards, yes. 900 00:54:59,720 --> 00:55:02,090 Although what the interpretation would be of any 901 00:55:02,090 --> 00:55:05,900 set of rewards is something you have to sort out. 902 00:55:05,900 --> 00:55:06,590 But yes. 
903 00:55:06,590 --> 00:55:10,200 For any r that you choose, there's going to be one unique 904 00:55:10,200 --> 00:55:15,530 solution, so long as state 1 is actually a trapping state, and 905 00:55:15,530 --> 00:55:16,950 everything else leads to state 1. 906 00:55:20,600 --> 00:55:25,875 OK, so why do I want to put a-- ah, good. 907 00:55:25,875 --> 00:55:27,537 STUDENT: I feel like a lot of the rewards are 908 00:55:27,537 --> 00:55:30,625 designed with respect to being in a 909 00:55:30,625 --> 00:55:31,575 particular state. 910 00:55:31,575 --> 00:55:32,060 PROFESSOR: Yes. 911 00:55:32,060 --> 00:55:34,340 STUDENT: But if the rewards are actually on transitions-- so 912 00:55:34,340 --> 00:55:38,012 for example, if you go from i to j, it's going to be a 913 00:55:38,012 --> 00:55:40,015 different number than from j to j. 914 00:55:40,015 --> 00:55:41,580 How do you deal with that? 915 00:55:41,580 --> 00:55:42,400 PROFESSOR: How do I deal with that? 916 00:55:42,400 --> 00:55:45,000 Well, then let's talk about that. 917 00:55:45,000 --> 00:55:48,200 And in fact, it's fairly simple so long as you're only 918 00:55:48,200 --> 00:55:50,750 talking about expected rewards. 919 00:55:50,750 --> 00:55:54,450 Because if I have a reward associated with-- 920 00:55:57,096 --> 00:56:18,574 if I have a reward rij, which is the reward for transition i 921 00:56:18,574 --> 00:56:36,600 to j, then if I take the sum of rij times pij, summed over j, 922 00:56:36,600 --> 00:56:51,768 what this gives me is the expected reward associated 923 00:56:51,768 --> 00:56:53,940 with state i. 
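In matrix terms, this conversion is just a row-wise expectation over the transition probabilities. A quick sketch, with made-up numbers for P and for the per-transition rewards R:

```python
import numpy as np

# Hypothetical transition probabilities and per-transition rewards:
# R[i][j] = reward for making the transition i -> j.
P = np.array([[0.9, 0.1],
              [0.6, 0.4]])
R = np.array([[ 0.0, 5.0],
              [-1.0, 2.0]])

# Expected reward in state i: sum over j of p_ij * r_ij,
# i.e. an elementwise product followed by a row sum.
r_state = (P * R).sum(axis=1)
print(r_state)   # -> [0.5, 0.2]
```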
924 00:56:59,750 --> 00:57:02,680 Now, you have to be a little bit careful with this because 925 00:57:02,680 --> 00:57:06,310 before we've been picking up this reward as soon as we get 926 00:57:06,310 --> 00:57:09,860 to state i, and here suddenly we have a slightly different 927 00:57:09,860 --> 00:57:14,560 situation where you have a reward associated with state i 928 00:57:14,560 --> 00:57:17,230 but you don't pick it up until the next step. 929 00:57:17,230 --> 00:57:23,780 So this is where this problem of i or i plus 1 comes in. 930 00:57:23,780 --> 00:57:29,070 And you guys can do that much better than I can, because at 931 00:57:29,070 --> 00:57:36,480 my age I start out with an age of 60 and an age of 61 is the 932 00:57:36,480 --> 00:57:38,500 same thing. 933 00:57:38,500 --> 00:57:40,880 I mean, these are-- 934 00:57:40,880 --> 00:57:42,130 OK. 935 00:57:44,790 --> 00:57:48,660 So anyway, the point of it is, if you have rewards associated 936 00:57:48,660 --> 00:57:52,320 with transitions you can always convert that to rewards 937 00:57:52,320 --> 00:57:53,570 associated with states. 938 00:57:58,320 --> 00:58:02,220 Oh, I didn't really get to this. 939 00:58:02,220 --> 00:58:06,150 What I've been trying to say now for a while is that 940 00:58:06,150 --> 00:58:13,120 sometimes, for some reason or other, after you go through 941 00:58:13,120 --> 00:58:16,990 n steps of this Markov chain, when you get to the 942 00:58:16,990 --> 00:58:21,340 end, you want to consider some particularly large reward for 943 00:58:21,340 --> 00:58:24,540 having gotten to the end, or some particularly large cost 944 00:58:24,540 --> 00:58:27,950 of getting to the end, or something which depends on the 945 00:58:27,950 --> 00:58:30,190 state that you happen to be in. 
946 00:58:30,190 --> 00:58:34,630 So we will assign some final reward which in general can be 947 00:58:34,630 --> 00:58:37,820 different from the reward that we're picking up at each of 948 00:58:37,820 --> 00:58:38,840 the other states. 949 00:58:38,840 --> 00:58:41,105 We're going to do this in a particular way. 950 00:58:47,740 --> 00:58:50,920 You would think that what we would want to do is, if we 951 00:58:50,920 --> 00:58:55,210 went through n steps, we would associate this final 952 00:58:55,210 --> 00:58:57,580 reward with the n-th step. 953 00:58:57,580 --> 00:58:59,220 We're going to do it a different way. 954 00:58:59,220 --> 00:59:02,180 We're going to go through n steps, and then the final 955 00:59:02,180 --> 00:59:05,980 reward is what happens on the step after that. 956 00:59:05,980 --> 00:59:09,480 So we're really turning the problem of looking at n steps 957 00:59:09,480 --> 00:59:13,490 into a problem of looking at n plus 1 steps. 958 00:59:13,490 --> 00:59:14,490 Why do we do that? 959 00:59:14,490 --> 00:59:16,320 Completely arbitrary. 960 00:59:16,320 --> 00:59:19,320 It turns out to be convenient when we talk about dynamic 961 00:59:19,320 --> 00:59:24,720 programming, and you'll see why in just a minute. 962 00:59:24,720 --> 00:59:29,770 So this extra final step is just an arbitrary thing that 963 00:59:29,770 --> 00:59:34,230 you add, and we'll see the main purpose for 964 00:59:34,230 --> 00:59:35,780 it in just a minute. 965 00:59:38,730 --> 00:59:39,380 OK. 966 00:59:39,380 --> 00:59:45,910 So we're going to now look at what in principle is a much 967 00:59:45,910 --> 00:59:48,880 more complicated situation than what we were looking at 968 00:59:48,880 --> 00:59:53,180 before, but you still have this basic Markov condition 969 00:59:53,180 --> 00:59:56,310 which is making things simple for you. 970 00:59:56,310 --> 01:00:00,990 So the idea is, you're looking at a discrete time situation. 
971 01:00:00,990 --> 01:00:04,260 Things happen in steps. 972 01:00:04,260 --> 01:00:07,655 There's a finite set of states which don't change over time. 973 01:00:10,530 --> 01:00:13,690 At each unit of time, you're going to be in one of the set 974 01:00:13,690 --> 01:00:20,420 of m states, and at each time l, there's some decision maker 975 01:00:20,420 --> 01:00:24,520 sitting around who looks at the state that 976 01:00:24,520 --> 01:00:26,530 you're in at time l. 977 01:00:26,530 --> 01:00:31,970 And the decision maker says I have a choice of what 978 01:00:31,970 --> 01:00:38,570 reward I'm going to pick up at this time and what the 979 01:00:38,570 --> 01:00:43,020 transition probabilities are for going to the next state. 980 01:00:43,020 --> 01:00:46,110 OK, so it's kind of a complicated thing. 981 01:00:46,110 --> 01:00:51,440 It's the same thing that you face all the time. 982 01:00:51,440 --> 01:00:54,300 I mean, in the stock market for example, you see that one 983 01:00:54,300 --> 01:00:57,010 stock is doing poorly, so you have a choice. 984 01:00:57,010 --> 01:01:03,620 Should I sell it, eat my losses, or should I keep on 985 01:01:03,620 --> 01:01:05,980 going and hope it'll turn around? 986 01:01:05,980 --> 01:01:09,980 If you're doing a thesis, you have the even worse problem. 987 01:01:09,980 --> 01:01:13,540 You go for three months without getting the result 988 01:01:13,540 --> 01:01:19,120 that you need, and you say, well, I don't have a thesis. 989 01:01:19,120 --> 01:01:21,960 I can't say something about this. 990 01:01:21,960 --> 01:01:25,280 Should I go on for one more month, or should I can it and 991 01:01:25,280 --> 01:01:27,400 go on to another topic? 992 01:01:27,400 --> 01:01:30,460 OK, it's exactly the same situation. 993 01:01:30,460 --> 01:01:34,900 So this is really a very broad set of situations. 
994 01:01:34,900 --> 01:01:37,858 The only thing that makes it really different from real 995 01:01:37,858 --> 01:01:42,260 life is this Markov property sitting there and the fact 996 01:01:42,260 --> 01:01:46,190 that you actually understand what the rewards are and you 997 01:01:46,190 --> 01:01:48,180 can predict them in advance. 998 01:01:48,180 --> 01:01:51,990 You can't predict what state you're going to be in, but you 999 01:01:51,990 --> 01:01:54,230 know that if you're in a particular state, you know 1000 01:01:54,230 --> 01:01:58,560 what your choices are in the future as well as now, and all 1001 01:01:58,560 --> 01:02:03,360 you have to do at each unit of time is to make this choice 1002 01:02:03,360 --> 01:02:05,860 between various different things. 1003 01:02:05,860 --> 01:02:08,485 You see an interesting example of that here. 1004 01:02:13,890 --> 01:02:17,430 If you look at this Markov chain here, it's a two state 1005 01:02:17,430 --> 01:02:18,680 Markov chain. 1006 01:02:21,770 --> 01:02:23,860 And what's the steady state probability of 1007 01:02:23,860 --> 01:02:25,150 being in state one? 1008 01:02:32,420 --> 01:02:33,670 Anybody? 1009 01:02:35,596 --> 01:02:37,050 It's a half, yes. 1010 01:02:37,050 --> 01:02:40,480 Why is it a half, and why don't you have 1011 01:02:40,480 --> 01:02:41,930 to solve for this? 1012 01:02:41,930 --> 01:02:45,150 Why can you look at it and say it's a half? 1013 01:02:45,150 --> 01:02:46,740 Because it's completely symmetric. 1014 01:02:46,740 --> 01:02:53,930 0.99 here, 0.99 here, 0.01 here, 0.01 here. 1015 01:02:53,930 --> 01:02:56,450 These rewards here had nothing to do with the 1016 01:02:56,450 --> 01:02:58,290 Markov chain itself. 1017 01:02:58,290 --> 01:03:02,210 The Markov chain is symmetric between states one and two, 1018 01:03:02,210 --> 01:03:05,200 and therefore, the steady state probabilities have to be 1019 01:03:05,200 --> 01:03:06,660 one half each. 
1020 01:03:06,660 --> 01:03:13,410 So here's something where, if you happen to be in state two, 1021 01:03:13,410 --> 01:03:15,090 you're going to stay there typically 1022 01:03:15,090 --> 01:03:17,100 for a very long time. 1023 01:03:17,100 --> 01:03:20,080 And while you're staying there for a very long time, 1024 01:03:20,080 --> 01:03:24,020 you're going to be picking up one unit of reward 1025 01:03:24,020 --> 01:03:26,930 every unit of time. 1026 01:03:26,930 --> 01:03:30,720 You work for some very stable employer who pays you very 1027 01:03:30,720 --> 01:03:33,540 little, and that's a situation you have. 1028 01:03:33,540 --> 01:03:37,880 You're sitting here, you have a job but you're not making 1029 01:03:37,880 --> 01:03:42,710 much, but still you're making something, and you have a lot 1030 01:03:42,710 --> 01:03:45,510 of job security. 1031 01:03:45,510 --> 01:03:49,760 Now, we have a different choice when we're sitting here 1032 01:03:49,760 --> 01:03:57,390 with a job in state two. You can, for example, go 1033 01:03:57,390 --> 01:04:00,300 to the cash register and take all the money out of it and 1034 01:04:00,300 --> 01:04:01,550 disappear from the company. 1035 01:04:03,920 --> 01:04:07,170 I don't advocate doing that, except, 1036 01:04:07,170 --> 01:04:09,190 it's one of your choices. 1037 01:04:09,190 --> 01:04:13,730 So you pick up a big reward of 50, and then for a long period 1038 01:04:13,730 --> 01:04:18,820 of time you go back to this state over here and you make 1039 01:04:18,820 --> 01:04:21,720 nothing in reward for a long period of time 1040 01:04:21,720 --> 01:04:23,360 while you're in jail. 1041 01:04:23,360 --> 01:04:28,650 And then eventually you pop back here, and if we assume 1042 01:04:28,650 --> 01:04:32,320 the judicial system is such that it has no memory, 1043 01:04:32,320 --> 01:04:33,670 [INAUDIBLE] 1044 01:04:33,670 --> 01:04:40,020 you can get into the cash register again, and, well, OK. 
1045 01:04:40,020 --> 01:04:43,850 So anyway, with this decision two, you're looking for instant 1046 01:04:43,850 --> 01:04:45,410 gratification here. 1047 01:04:45,410 --> 01:04:48,340 You're getting a big reward all at once, but by getting a 1048 01:04:48,340 --> 01:04:53,040 big reward, with probability one, you're going back to 1049 01:04:53,040 --> 01:04:54,390 state one. 1050 01:04:54,390 --> 01:04:57,830 From state one, it takes a long time to get back to the 1051 01:04:57,830 --> 01:05:02,940 point where you can get a big reward again, so you wonder, 1052 01:05:02,940 --> 01:05:07,020 is it better to use this policy or is it better to use 1053 01:05:07,020 --> 01:05:08,270 this policy? 1054 01:05:10,670 --> 01:05:14,160 Now, there are two basic ways to look at this problem. 1055 01:05:14,160 --> 01:05:16,280 I think it's important to understand what they are 1056 01:05:16,280 --> 01:05:18,330 before we go further. 1057 01:05:18,330 --> 01:05:24,660 One of the ways is to say, OK, let's suppose that I work out 1058 01:05:24,660 --> 01:05:30,440 which is the best policy and I use it forever. 1059 01:05:30,440 --> 01:05:34,140 Namely, I use this policy forever or I 1060 01:05:34,140 --> 01:05:36,910 use this policy forever. 1061 01:05:36,910 --> 01:05:40,570 And if I use this policy forever, I can pretty easily 1062 01:05:40,570 --> 01:05:43,470 work out what the steady state probabilities of these two 1063 01:05:43,470 --> 01:05:44,680 states are. 1064 01:05:44,680 --> 01:05:50,040 I can then work out what my expected gain is per unit time 1065 01:05:50,040 --> 01:05:52,690 and I can compare this with that. 1066 01:05:55,260 --> 01:05:58,140 And who thinks that this is going to be better than that 1067 01:05:58,140 --> 01:06:01,370 and who thinks that this is going to be better than that? 1068 01:06:01,370 --> 01:06:03,600 Well, you can work it out easily. 
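Working it out numerically, using the 0.99/0.01 probabilities and the rewards of 1 and 50 quoted above (the stationary-distribution helper is a standard least-squares construction, not anything specific to these notes):

```python
import numpy as np

def stationary(P):
    """Stationary distribution pi with pi P = pi, normalized to sum to 1."""
    m = P.shape[0]
    # Stack (P^T - I) pi = 0 with the normalization row sum(pi) = 1
    # and solve the overdetermined system by least squares.
    A = np.vstack([P.T - np.eye(m), np.ones(m)])
    b = np.zeros(m + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Decision 1: sit in state 2 and collect reward 1 per step.
P1 = np.array([[0.99, 0.01], [0.01, 0.99]])
r1 = np.array([0.0, 1.0])
# Decision 2: grab the reward of 50, which sends you back to state 1.
P2 = np.array([[0.99, 0.01], [1.00, 0.00]])
r2 = np.array([0.0, 50.0])

gain1 = stationary(P1) @ r1
gain2 = stationary(P2) @ r2
print(gain1, gain2)   # gain1 = 0.5, gain2 is about 0.495 -- a smidgen less
```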
1069 01:06:03,600 --> 01:06:07,080 It's kind of interesting because the steady state gains 1070 01:06:07,080 --> 01:06:12,940 here and here are very close to the same. 1071 01:06:12,940 --> 01:06:17,480 It turns out that this is just a smidgen better than this, 1072 01:06:17,480 --> 01:06:19,610 only by a very small amount. 1073 01:06:19,610 --> 01:06:19,950 OK. 1074 01:06:19,950 --> 01:06:25,620 See, what happens here is that here, you tend to stay for about 1075 01:06:25,620 --> 01:06:28,110 100 steps here. 1076 01:06:28,110 --> 01:06:33,090 So you pick up a total reward of about 100 if you use this very 1077 01:06:33,090 --> 01:06:34,890 simple minded analysis. 1078 01:06:34,890 --> 01:06:37,810 Then for 100 steps, you're sitting here, you're getting 1079 01:06:37,810 --> 01:06:43,020 no reward, so you think you ought to get a reward of 1080 01:06:43,020 --> 01:06:46,310 one half on the average, and that's exactly 1081 01:06:46,310 --> 01:06:48,090 what you do get here. 1082 01:06:48,090 --> 01:06:52,820 And here, you get this big reward of 50, but then you go 1083 01:06:52,820 --> 01:06:57,560 over here and you spend 100 units of time in purgatory and 1084 01:06:57,560 --> 01:07:00,470 then you get back again, you get another reward of 50 and 1085 01:07:00,470 --> 01:07:03,830 then spend 100 units of time in purgatory. 1086 01:07:03,830 --> 01:07:07,300 So again, you're getting pretty close to a half of a 1087 01:07:07,300 --> 01:07:10,690 unit of reward, but it turns out, when you work it out, 1088 01:07:10,690 --> 01:07:12,690 that this is just a smidgen less. 1089 01:07:12,690 --> 01:07:18,190 It's 1% less than a half, so this is not as good as that. 1090 01:07:18,190 --> 01:07:24,900 But suppose that you have a shorter time horizon. 1091 01:07:24,900 --> 01:07:28,380 Suppose you don't want to wait for 1,000 steps to see what's 1092 01:07:28,380 --> 01:07:32,280 going on, so you don't want to look at the average. 
1093 01:07:32,280 --> 01:07:34,280 Suppose this was a gambling game. 1094 01:07:34,280 --> 01:07:38,230 You have your choice of these two gambling options, and 1095 01:07:38,230 --> 01:07:41,820 suppose you're only going to be playing for a short time. 1096 01:07:41,820 --> 01:07:43,180 Suppose you're going to be only playing 1097 01:07:43,180 --> 01:07:44,830 for one unit of time. 1098 01:07:44,830 --> 01:07:47,180 You can only play for one unit of time and then you have to 1099 01:07:47,180 --> 01:07:50,780 stop, you have to go home, you have to go back to work, or 1100 01:07:50,780 --> 01:07:52,180 something else. 1101 01:07:52,180 --> 01:07:54,870 And you happen to be sitting in state two. 1102 01:07:54,870 --> 01:07:57,180 What do you want to do if you only have one 1103 01:07:57,180 --> 01:07:58,730 unit of time to play? 1104 01:07:58,730 --> 01:08:03,630 Well, obviously, you want to get the reward of 50, because 1105 01:08:03,630 --> 01:08:07,620 delayed gratification doesn't work here, because you don't 1106 01:08:07,620 --> 01:08:11,330 get any opportunity for that gratification later. 1107 01:08:11,330 --> 01:08:14,900 So you pick up the big reward at first. 1108 01:08:14,900 --> 01:08:18,630 So when you have this problem of playing for a finite amount 1109 01:08:18,630 --> 01:08:24,649 of time, whatever kind of situation you're in, what you 1110 01:08:24,649 --> 01:08:28,310 would like to do is say, for this finite amount of time 1111 01:08:28,310 --> 01:08:34,290 that I'm going to play, what's my best strategy then? 1112 01:08:34,290 --> 01:08:39,850 Dynamic programming is the 1113 01:08:39,850 --> 01:08:43,600 algorithm which finds out what the best thing to do is, 1114 01:08:43,600 --> 01:08:45,000 dynamically. 
1115 01:08:45,000 --> 01:08:48,670 Namely, if you're going to stop in 10 steps, stop in 100 1116 01:08:48,670 --> 01:08:52,710 steps, stop in one step, it tells you what to do under all 1117 01:08:52,710 --> 01:08:55,350 of those circumstances. 1118 01:08:55,350 --> 01:08:59,340 And the stationary policy tells you what to do if you're 1119 01:08:59,340 --> 01:09:02,630 going to play forever. 1120 01:09:02,630 --> 01:09:05,760 But in a situation like this where things happen rather 1121 01:09:05,760 --> 01:09:10,189 slowly, it might not be the relevant thing to deal with. 1122 01:09:10,189 --> 01:09:13,170 A lot of the notes deal with comparing the stationary 1123 01:09:13,170 --> 01:09:17,180 policy with this dynamic policy. 1124 01:09:17,180 --> 01:09:21,399 And I'm not going to do that here because, well, we have 1125 01:09:21,399 --> 01:09:23,290 too many other interesting things that we 1126 01:09:23,290 --> 01:09:24,170 want to deal with. 1127 01:09:24,170 --> 01:09:26,939 So we're just going to skip all of that stuff about 1128 01:09:26,939 --> 01:09:28,470 stationary policies. 1129 01:09:28,470 --> 01:09:30,670 You don't have to bother to read it unless you're 1130 01:09:30,670 --> 01:09:32,580 interested in it. 1131 01:09:32,580 --> 01:09:35,029 I mean, if you're interested in it, by all means, read it. 1132 01:09:35,029 --> 01:09:38,950 It's a very interesting topic. 1133 01:09:38,950 --> 01:09:41,580 It's not all that interesting to find out what the best 1134 01:09:41,580 --> 01:09:42,990 stationary policy is. 1135 01:09:42,990 --> 01:09:45,210 That's kind of simple. 1136 01:09:45,210 --> 01:09:48,729 What's the interesting topic is what's the comparison 1137 01:09:48,729 --> 01:09:53,100 between the dynamic policy and the stationary policy. 1138 01:09:53,100 --> 01:09:56,500 But all we're going to do is worry about what the 1139 01:09:56,500 --> 01:09:58,160 dynamic policy is. 
1140 01:09:58,160 --> 01:10:03,460 That seems like a hard problem, and someone by the 1141 01:10:03,460 --> 01:10:09,720 name of Bellman figured out what the optimal solution to 1142 01:10:09,720 --> 01:10:12,025 that dynamic policy was. 1143 01:10:12,025 --> 01:10:16,900 And it turned out to be a trivially simple algorithm, 1144 01:10:16,900 --> 01:10:20,030 and Bellman became famous forever. 1145 01:10:20,030 --> 01:10:23,080 One of the things I want to point out to you, again, I 1146 01:10:23,080 --> 01:10:27,250 keep coming back to this because you people are just 1147 01:10:27,250 --> 01:10:29,970 starting a research career. 1148 01:10:29,970 --> 01:10:34,490 Everyone in this class, given the formulation of this 1149 01:10:34,490 --> 01:10:38,670 dynamic programming problem, could develop and would 1150 01:10:38,670 --> 01:10:43,440 develop, I'm pretty sure, the dynamic programming algorithm. 1151 01:10:43,440 --> 01:10:47,020 Developing the algorithm, understanding what the problem 1152 01:10:47,020 --> 01:10:50,210 is is a trivial matter. 1153 01:10:50,210 --> 01:10:52,390 Why is Bellman famous? 1154 01:10:52,390 --> 01:10:56,270 Because he formulated the problem. 1155 01:10:56,270 --> 01:11:01,010 He said, aha, this dynamic problem is interesting. 1156 01:11:01,010 --> 01:11:04,710 I don't have to go through the stationary problem. 1157 01:11:04,710 --> 01:11:08,430 And in fact, my sense from reading his book and from 1158 01:11:08,430 --> 01:11:11,470 reading things he's written is that he couldn't have solved 1159 01:11:11,470 --> 01:11:14,240 the stationary problem because he didn't understand 1160 01:11:14,240 --> 01:11:16,750 probability that well. 1161 01:11:16,750 --> 01:11:20,600 But he did understand how to formulate what this really 1162 01:11:20,600 --> 01:11:24,330 important problem was and he solved it. 
1163 01:11:24,330 --> 01:11:27,880 So, all the more credit to him, but when you're doing 1164 01:11:27,880 --> 01:11:32,460 research, the time you spend on formulating the right 1165 01:11:32,460 --> 01:11:37,430 problem is far more important than the time you spend 1166 01:11:37,430 --> 01:11:38,390 solving it. 1167 01:11:38,390 --> 01:11:41,490 If you start out with the right problem, the solution is 1168 01:11:41,490 --> 01:11:45,650 trivial and you're all done. 1169 01:11:45,650 --> 01:11:49,930 It's hard to formulate the right problem, and you learn 1170 01:11:49,930 --> 01:11:57,810 to formulate the problem not by plugging away at all of these 1171 01:11:57,810 --> 01:12:01,570 calculations, but by sitting back and thinking 1172 01:12:01,570 --> 01:12:04,480 about the problem and trying to look at things in a more 1173 01:12:04,480 --> 01:12:06,050 general way. 1174 01:12:06,050 --> 01:12:07,660 So just another plug. 1175 01:12:07,660 --> 01:12:10,440 I've been saying this, I will probably say it every three or 1176 01:12:10,440 --> 01:12:14,420 four lectures throughout the term. 1177 01:12:14,420 --> 01:12:14,860 OK. 1178 01:12:14,860 --> 01:12:18,450 So let's go back and look at what the problem is. 1179 01:12:18,450 --> 01:12:21,330 We haven't quite formulated it yet. 1180 01:12:21,330 --> 01:12:24,940 We're going to assume this process of random transitions 1181 01:12:24,940 --> 01:12:27,790 combined with decisions based on the current state. 1182 01:12:27,790 --> 01:12:30,380 In other words, in this decision maker, the decision 1183 01:12:30,380 --> 01:12:34,960 maker at each unit of time sees what state you're in at 1184 01:12:34,960 --> 01:12:37,040 this unit of time. 
1185 01:12:37,040 --> 01:12:40,940 And seeing what state you're in at this given unit of time, 1186 01:12:40,940 --> 01:12:45,020 the decision maker has a choice of how much reward 1187 01:12:45,020 --> 01:12:51,740 is to be taken and, along with how much reward is to be 1188 01:12:51,740 --> 01:12:54,940 taken, what the transition probabilities are 1189 01:12:54,940 --> 01:12:56,160 for the next state. 1190 01:12:56,160 --> 01:13:00,150 If you rob the cash register, your transition probabilities 1191 01:13:00,150 --> 01:13:02,230 are going to be very different than if you don't 1192 01:13:02,230 --> 01:13:04,680 rob the cash register. 1193 01:13:04,680 --> 01:13:08,190 By robbing the cash register, your transition probabilities 1194 01:13:08,190 --> 01:13:10,770 go into a rather high transition probability that 1195 01:13:10,770 --> 01:13:12,270 you're going to be caught. 1196 01:13:12,270 --> 01:13:16,050 OK, so you don't want that. 1197 01:13:16,050 --> 01:13:20,600 So you can't avoid the problem of having the rewards at a 1198 01:13:20,600 --> 01:13:24,290 given time locked into what the transition probabilities 1199 01:13:24,290 --> 01:13:27,990 are for going to the next state, and that's the essence 1200 01:13:27,990 --> 01:13:29,890 of this problem. 1201 01:13:29,890 --> 01:13:30,500 OK. 1202 01:13:30,500 --> 01:13:33,470 So, the decision maker observes the state and 1203 01:13:33,470 --> 01:13:36,530 chooses one of a finite set of alternatives. 1204 01:13:36,530 --> 01:13:39,790 Each alternative consists of a current reward, which we'll 1205 01:13:39,790 --> 01:13:44,030 call r sub j of k, where the alternative is k, and a set of 1206 01:13:44,030 --> 01:13:45,980 transition probabilities, 1207 01:13:45,980 --> 01:13:50,250 p sub jl of k, with 1 less than or equal to l less than or 1208 01:13:50,250 --> 01:13:52,750 equal to m, for going to the next state. 
1209 01:13:52,750 --> 01:13:56,450 OK, the notation here is horrifying, but the idea is 1210 01:13:56,450 --> 01:13:57,880 very simple. 1211 01:13:57,880 --> 01:14:01,370 I mean, once you get used to the notation, there's nothing 1212 01:14:01,370 --> 01:14:04,880 complicated here at all. 1213 01:14:04,880 --> 01:14:08,940 OK, so in this example here, well, we already 1214 01:14:08,940 --> 01:14:10,190 talked about that. 1215 01:14:13,120 --> 01:14:14,990 We're going to start out at time m. 1216 01:14:17,960 --> 01:14:21,150 We're going to make a decision at time m, pick up the 1217 01:14:21,150 --> 01:14:28,090 associated reward for that decision, and pick the 1218 01:14:28,090 --> 01:14:30,970 transition probabilities that we're going to use at that 1219 01:14:30,970 --> 01:14:33,460 time m, and then go on to the next state. 1220 01:14:33,460 --> 01:14:36,380 We're going to continue doing this until time m 1221 01:14:36,380 --> 01:14:37,960 plus n minus 1. 1222 01:14:37,960 --> 01:14:41,450 Namely, we're going to do this for n steps of time. 1223 01:14:41,450 --> 01:14:43,690 After the n-th decision-- 1224 01:14:43,690 --> 01:14:47,140 you make the n-th decision at m plus n minus 1-- 1225 01:14:47,140 --> 01:14:52,270 there's a final transition based on that decision. 1226 01:14:52,270 --> 01:14:55,490 The final transition is based on that decision, but the 1227 01:14:55,490 --> 01:14:58,345 final reward is fixed ahead of time. 1228 01:14:58,345 --> 01:15:01,500 You know what the final reward is going to be, which happens 1229 01:15:01,500 --> 01:15:03,480 at time m plus n. 1230 01:15:03,480 --> 01:15:07,465 So the things which are variable are how much reward 1231 01:15:07,465 --> 01:15:14,070 you get at each of these first n time units, and what 1232 01:15:14,070 --> 01:15:17,870 probabilities you choose for going to the next state. 1233 01:15:17,870 --> 01:15:20,170 Is this still a Markov chain? 1234 01:15:20,170 --> 01:15:21,420 Is this still Markov? 
1235 01:15:24,750 --> 01:15:26,560 You can talk about this for a long time. 1236 01:15:26,560 --> 01:15:30,460 You can think about it for a long time because this 1237 01:15:30,460 --> 01:15:34,770 decision maker might or might not be Markov. 1238 01:15:34,770 --> 01:15:38,870 What is Markov is the transition probabilities that 1239 01:15:38,870 --> 01:15:41,380 are taking place in each unit of time. 1240 01:15:41,380 --> 01:15:46,410 After I make a decision, the transition probabilities are 1241 01:15:46,410 --> 01:15:51,370 fixed for that decision and that initial state and have 1242 01:15:51,370 --> 01:15:54,650 nothing to do with the decisions that were made 1243 01:15:54,650 --> 01:15:58,650 before that or the states you were in before that. 1244 01:15:58,650 --> 01:16:02,220 The Markov condition says that what happens in the next unit 1245 01:16:02,220 --> 01:16:06,020 of time is a function simply of those transition 1246 01:16:06,020 --> 01:16:10,370 probabilities that have been chosen. 1247 01:16:10,370 --> 01:16:13,530 We will see that when we look at the algorithm, and then you 1248 01:16:13,530 --> 01:16:16,670 can sort out for yourselves whether there's something 1249 01:16:16,670 --> 01:16:18,190 dishonest here or not. 1250 01:16:18,190 --> 01:16:25,480 Turns out there isn't, but to Bellman's credit he did sort 1251 01:16:25,480 --> 01:16:28,740 out correctly that this worked, and many people for a 1252 01:16:28,740 --> 01:16:30,520 long time did not think it worked. 1253 01:16:34,080 --> 01:16:37,150 So the objective of dynamic programming is both to 1254 01:16:37,150 --> 01:16:41,540 determine the optimal decision at each time and to determine 1255 01:16:41,540 --> 01:16:45,040 the expected reward for each starting state and for each 1256 01:16:45,040 --> 01:16:47,690 number of steps. 1257 01:16:47,690 --> 01:16:51,090 As one might suspect, now here's the first thing that 1258 01:16:51,090 --> 01:16:52,500 Bellman did. 
1259 01:16:52,500 --> 01:16:54,010 He said, here, I have this problem. 1260 01:16:54,010 --> 01:16:57,880 I want to find out what happens after 1,000 steps. 1261 01:16:57,880 --> 01:17:00,850 How do I solve the problem? 1262 01:17:00,850 --> 01:17:04,330 Well, anybody with any sense will tell you don't solve the 1263 01:17:04,330 --> 01:17:06,740 problem with 1,000 steps first. 1264 01:17:06,740 --> 01:17:10,220 Solve the problem with one step first, and then see if 1265 01:17:10,220 --> 01:17:13,330 you find out anything from it, and then maybe you can solve 1266 01:17:13,330 --> 01:17:17,030 the problem with two steps, and then maybe something nice will 1267 01:17:17,030 --> 01:17:20,600 happen, or maybe it won't. 1268 01:17:20,600 --> 01:17:25,320 When we do this, it'll turn out that what we're really 1269 01:17:25,320 --> 01:17:30,820 doing is starting at the end and working our way back, 1270 01:17:30,820 --> 01:17:34,010 and this algorithm is due to Richard Bellman, as I said. 1271 01:17:34,010 --> 01:17:38,400 And he was the one who sorted out how it worked. 1272 01:17:38,400 --> 01:17:40,630 So what is the algorithm? 1273 01:17:40,630 --> 01:17:45,250 We're going to start out making just one decision. 1274 01:17:45,250 --> 01:17:50,500 So we're going to start at time m. 1275 01:17:50,500 --> 01:17:53,610 We're going to start in a given state i. 1276 01:17:53,610 --> 01:17:58,580 You make a decision, decision k, at time m. 1277 01:17:58,580 --> 01:18:03,040 This provides a reward at time m, and the selected transition 1278 01:18:03,040 --> 01:18:06,240 probabilities lead to a final expected reward. 1279 01:18:06,240 --> 01:18:11,380 These are the final rewards which occur at time m plus 1. 1280 01:18:11,380 --> 01:18:13,710 It's nice to have that final reward u because it's what lets us 1281 01:18:13,710 --> 01:18:15,550 generalize the problem. 1282 01:18:15,550 --> 01:18:18,460 So this was another clever thing that went on here.
1283 01:18:18,460 --> 01:18:24,710 So the expected optimal aggregate reward for a one 1284 01:18:24,710 --> 01:18:32,230 step problem is the sum of the reward that you get at time m 1285 01:18:32,230 --> 01:18:37,260 plus this final reward you get at time m plus 1, and you're 1286 01:18:37,260 --> 01:18:40,290 maximizing over the different policies you have 1287 01:18:40,290 --> 01:18:41,490 available to you. 1288 01:18:41,490 --> 01:18:44,970 So it looks like a trivial problem, but the optimal 1289 01:18:44,970 --> 01:18:47,980 reward for a one step problem is just this. 1290 01:18:51,170 --> 01:18:54,820 OK, next you want to consider the two step problem. 1291 01:18:54,820 --> 01:18:58,900 What's the maximum expected reward starting at X_m equals i 1292 01:18:58,900 --> 01:19:03,480 with decisions at times m and m plus 1? 1293 01:19:03,480 --> 01:19:05,400 You make two decisions. 1294 01:19:05,400 --> 01:19:08,240 Now, before, we just made one decision at time m. 1295 01:19:08,240 --> 01:19:13,000 Now we make a decision at time m and at time m plus 1, and 1296 01:19:13,000 --> 01:19:17,750 finally we pick up a final reward at time m plus 2. 1297 01:19:17,750 --> 01:19:20,540 Knowing what that final reward is going to be is going to 1298 01:19:20,540 --> 01:19:26,230 affect the decision you make at time m plus 1, but it's a 1299 01:19:26,230 --> 01:19:29,770 fixed reward which is a function of the state. 1300 01:19:29,770 --> 01:19:32,720 You can adjust the transition probabilities of getting to 1301 01:19:32,720 --> 01:19:35,110 those different rewards. 1302 01:19:35,110 --> 01:19:38,420 The key to dynamic programming is that an optimal decision at time 1303 01:19:38,420 --> 01:19:42,630 m plus 1 can be selected based only on the state j 1304 01:19:42,630 --> 01:19:45,060 at time m plus 1.
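A minimal sketch of this one step maximization, on a made-up two-state, two-decision example (the name one_step, the arrays r, P, u, and every number are assumptions for illustration, not from the lecture): for each starting state i, take the best over decisions k of the immediate reward plus the expected fixed final reward.

```python
# Hypothetical data: r[k][i] is the reward for decision k in state i,
# P[k][i][j] the transition probability under decision k, u[j] the
# fixed final reward in state j. All numbers are made up.
r = [[1.0, 0.0],
     [0.5, 0.5]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.5, 0.5]]]
u = [0.0, 1.0]

def one_step(r, P, u):
    """For each starting state i, maximize over decisions k the
    immediate reward plus the expected value of the final reward."""
    states = range(len(u))
    return [max(r[k][i] + sum(P[k][i][j] * u[j] for j in states)
                for k in range(len(r)))
            for i in states]

print(one_step(r, P, u))   # → [1.1, 1.0]
```

It looks like a trivial problem, and as code it is: one max per state, each over a handful of expected values.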
1305 01:19:45,060 --> 01:19:48,960 This decision, given that you're in state j at time m 1306 01:19:48,960 --> 01:19:53,600 plus 1, is optimal independent of what you did before that, 1307 01:19:53,600 --> 01:19:55,770 which is why we're starting out looking at what we're 1308 01:19:55,770 --> 01:19:59,240 going to do at time m plus 1 before we even worry about 1309 01:19:59,240 --> 01:20:02,630 what we're going to do at time m. 1310 01:20:02,630 --> 01:20:06,340 So, whatever decision you made at time m, you observe what 1311 01:20:06,340 --> 01:20:10,900 state you're in at time m plus 1, and the maximal expected 1312 01:20:10,900 --> 01:20:15,510 reward over times m plus 1 and m plus 2, given that you 1313 01:20:15,510 --> 01:20:20,610 happen to be in state j, is just the maximum over k of the 1314 01:20:20,610 --> 01:20:26,430 reward you're going to get by choosing policy k plus the 1315 01:20:26,430 --> 01:20:30,670 expected value of the final reward you get if you're using 1316 01:20:30,670 --> 01:20:32,480 this policy k. 1317 01:20:32,480 --> 01:20:36,850 This is just v_j star of 1 and u, as you just found. 1318 01:20:36,850 --> 01:20:40,070 In other words, you have the same situation at time m plus 1319 01:20:40,070 --> 01:20:42,090 1 as you have at time m. 1320 01:20:44,600 --> 01:20:49,785 Well, surprisingly, you've just solved the whole problem. 1321 01:20:52,810 --> 01:20:58,410 So we've seen that what we should do at time m plus 1 is 1322 01:20:58,410 --> 01:21:00,450 do this maximization.
1323 01:21:00,450 --> 01:21:05,670 So the optimal aggregate reward over times m, 1324 01:21:05,670 --> 01:21:11,815 m plus 1, and m plus 2 is what we get maximizing over our 1325 01:21:11,815 --> 01:21:18,110 choice at time m of the reward we get at time m from the 1326 01:21:18,110 --> 01:21:21,750 decision, plus the transition probabilities which we've 1327 01:21:21,750 --> 01:21:27,340 decided on, which get us to this reward at times m plus 1 1328 01:21:27,340 --> 01:21:29,020 and m plus 2. 1329 01:21:29,020 --> 01:21:33,070 We found out what the reward is for times m plus 1 and m 1330 01:21:33,070 --> 01:21:34,370 plus 2 together. 1331 01:21:34,370 --> 01:21:38,060 That's the reward to go, and we know what that is, so we 1332 01:21:38,060 --> 01:21:40,210 have this same formula we used before. 1333 01:21:40,210 --> 01:21:46,965 Why do we want to look at these final rewards now? 1334 01:21:46,965 --> 01:21:50,980 Well, you can view this as a final reward at time m plus 1. 1335 01:21:50,980 --> 01:21:54,220 It's the final reward which tells you what you get both 1336 01:21:54,220 --> 01:21:57,930 from times m plus 1 and m plus 2. 1337 01:21:57,930 --> 01:22:04,790 And, going quickly, if we look at playing this game for three 1338 01:22:04,790 --> 01:22:11,280 steps, the optimal reward for the three step game is the 1339 01:22:11,280 --> 01:22:16,600 immediate reward optimized over k plus the rewards at m 1340 01:22:16,600 --> 01:22:22,900 plus 1, m plus 2, and m plus 3, which we've already found. 1341 01:22:22,900 --> 01:22:28,450 And in general, for n steps-- 1342 01:22:28,450 --> 01:22:33,900 when you play the game for n steps, the optimal reward is 1343 01:22:33,900 --> 01:22:35,170 this same maximum.
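The whole backward recursion can be sketched in a few lines, again on a made-up two-state, two-decision example (the name dynamic_program, the arrays r, P, u, and the numbers are assumptions for illustration): at each backward step, solve a one step problem using the previous answer as the reward to go.

```python
# Hypothetical data: r[k][i] is the reward for decision k in state i,
# P[k][i][j] the transition probability under decision k, u[j] the
# fixed final reward. All numbers are made up.
r = [[1.0, 0.0], [0.5, 0.5]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.5, 0.5]]]
u = [0.0, 1.0]

def dynamic_program(r, P, u, n):
    """Bellman's backward recursion: solve the one step problem n times,
    feeding each answer back in as the next final reward (reward to go)."""
    v = list(u)                   # reward to go; starts as the final reward
    policy = []
    states = range(len(u))
    for _ in range(n):            # one backward step per decision time
        gain = lambda k, i: r[k][i] + sum(P[k][i][j] * v[j] for j in states)
        dec = [max(range(len(r)), key=lambda k: gain(k, i)) for i in states]
        v = [gain(dec[i], i) for i in states]
        policy.insert(0, dec)     # earlier decisions are computed later
    return v, policy

v, policy = dynamic_program(r, P, u, 2)   # the two step problem
```

With n equal to 1 this reproduces the one step answer; each extra step just reuses the previous v in place of u, which is exactly why carrying an arbitrary final reward vector around pays off.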
1344 01:22:35,170 --> 01:22:39,980 So, all you do in the algorithm is, for each value 1345 01:22:39,980 --> 01:22:43,500 of n, when you start with n equal to 1, you solve the 1346 01:22:43,500 --> 01:22:48,950 problem for all states and you maximize over all policies you 1347 01:22:48,950 --> 01:22:52,100 have a choice over, and then you go on to the next larger 1348 01:22:52,100 --> 01:22:56,140 value of n, you solve the problem for all states, and you 1349 01:22:56,140 --> 01:22:56,950 keep on going. 1350 01:22:56,950 --> 01:22:59,820 If you don't have many states, it's easy. 1351 01:22:59,820 --> 01:23:05,100 If you have 100,000 states, it's kind of tedious to run 1352 01:23:05,100 --> 01:23:05,880 the algorithm. 1353 01:23:05,880 --> 01:23:09,380 Today it's not bad, but now we look at problems with 1354 01:23:09,380 --> 01:23:11,770 millions and millions of states or billions of states, 1355 01:23:11,770 --> 01:23:18,100 and no matter how fast computation gets, the 1356 01:23:18,100 --> 01:23:22,280 ingenuity of people in inventing harder problems always makes 1357 01:23:22,280 --> 01:23:24,630 it hard to solve these problems. 1358 01:23:24,630 --> 01:23:29,060 So anyway, that's the dynamic programming algorithm. 1359 01:23:29,060 --> 01:23:31,320 And next time, we're going to start on renewal processes.