1 00:00:00,530 --> 00:00:02,960 The following content is provided under a Creative 2 00:00:02,960 --> 00:00:04,370 Commons license. 3 00:00:04,370 --> 00:00:07,410 Your support will help MIT OpenCourseWare continue to 4 00:00:07,410 --> 00:00:11,060 offer high quality educational resources for free. 5 00:00:11,060 --> 00:00:13,960 To make a donation or view additional materials from 6 00:00:13,960 --> 00:00:17,890 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,890 --> 00:00:19,140 ocw.mit.edu. 8 00:00:24,220 --> 00:00:24,840 PROFESSOR: OK. 9 00:00:24,840 --> 00:00:30,230 Today we're going to finish up with Markov chains. 10 00:00:30,230 --> 00:00:34,570 And the last topic will be dynamic programming. 11 00:00:34,570 --> 00:00:39,900 I'm not going to say an awful lot about dynamic programming. 12 00:00:39,900 --> 00:00:43,530 It's a topic that was enormously important in 13 00:00:43,530 --> 00:00:49,600 research for probably 20 years from 1960 until 14 00:00:49,600 --> 00:00:53,540 about 1980, or 1990. 15 00:00:53,540 --> 00:01:00,300 And it seemed as if half the Ph.D. theses done in the 16 00:01:00,300 --> 00:01:03,920 control area and in operations research 17 00:01:03,920 --> 00:01:07,630 were in this area. 18 00:01:07,630 --> 00:01:11,950 Suddenly, everything seemed to be done, or could be done. 19 00:01:11,950 --> 00:01:15,310 And strangely enough, not many people seem to 20 00:01:15,310 --> 00:01:16,760 know about it anymore. 21 00:01:16,760 --> 00:01:20,760 It's an enormously useful algorithm for solving an awful 22 00:01:20,760 --> 00:01:23,000 lot of different problems. 23 00:01:23,000 --> 00:01:25,420 It's quite a simple algorithm. 24 00:01:25,420 --> 00:01:28,780 You don't need the full power of Markov chains in order to 25 00:01:28,780 --> 00:01:30,470 understand it. 26 00:01:30,470 --> 00:01:34,250 So I do want to at least talk about it a little bit. 
27 00:01:34,250 --> 00:01:38,070 And we will use what we've done so far with Markov chains 28 00:01:38,070 --> 00:01:40,940 in order to understand it. 29 00:01:40,940 --> 00:01:44,200 I want to start out today by reviewing a little bit of what 30 00:01:44,200 --> 00:01:49,040 we did last time about eigenvalues and eigenvectors. 31 00:01:49,040 --> 00:01:56,320 This was a somewhat awkward topic to talk about, because 32 00:01:56,320 --> 00:01:59,970 you people have very different backgrounds in linear algebra. 33 00:01:59,970 --> 00:02:03,450 Some of you have a very strong background, some of you have 34 00:02:03,450 --> 00:02:05,240 almost no background. 35 00:02:05,240 --> 00:02:10,509 So it was a lot of material for those of you who know very 36 00:02:10,509 --> 00:02:14,190 little about linear algebra. 37 00:02:14,190 --> 00:02:16,620 And probably somewhat boring for those of you who 38 00:02:16,620 --> 00:02:18,690 use it all the time. 39 00:02:18,690 --> 00:02:22,670 At any rate, if you don't know anything about it, linear 40 00:02:22,670 --> 00:02:28,820 algebra is a topic that you ought to understand for almost 41 00:02:28,820 --> 00:02:30,270 anything you do. 42 00:02:30,270 --> 00:02:35,230 If you've gotten to this point without having to study it, 43 00:02:35,230 --> 00:02:37,460 it's very strange. 44 00:02:37,460 --> 00:02:41,720 So you should probably take some extra time out, not 45 00:02:41,720 --> 00:02:43,900 because you need it so much for this course. 46 00:02:43,900 --> 00:02:46,670 We won't use it enormously in many of the 47 00:02:46,670 --> 00:02:48,500 things we do later. 48 00:02:48,500 --> 00:02:51,930 But you will use it so many times in the future that you 49 00:02:51,930 --> 00:02:56,870 ought to just sit down, not to learn abstract linear algebra, 50 00:02:56,870 --> 00:03:00,150 which is very useful also, but just to learn how to use the 51 00:03:00,150 --> 00:03:03,280 topic of solving linear equations. 
52 00:03:03,280 --> 00:03:06,450 Being able to express them in terms of matrices. 53 00:03:06,450 --> 00:03:09,310 Being able to use the eigenvalues and eigenvectors, 54 00:03:09,310 --> 00:03:12,220 and matrices as a way of understanding these things. 55 00:03:12,220 --> 00:03:16,440 So I want to say a little more about that today, which is why 56 00:03:16,440 --> 00:03:19,720 I've called this a review plus of eigenvalues and 57 00:03:19,720 --> 00:03:21,020 eigenvectors. 58 00:03:21,020 --> 00:03:25,930 It's a review of the topics we did last time, but it's 59 00:03:25,930 --> 00:03:28,250 looking at it in a somewhat different way. 60 00:03:28,250 --> 00:03:32,150 So let's proceed with that. 61 00:03:32,150 --> 00:03:36,810 We said that the determinant of an M by M matrix is given 62 00:03:36,810 --> 00:03:38,530 by this strange formula. 63 00:03:38,530 --> 00:03:44,340 The determinant of a is the sum over all permutations of 64 00:03:44,340 --> 00:03:51,260 the integers 1 to M of the product from i equals 1 to M 65 00:03:51,260 --> 00:03:56,080 of the matrix element a sub i mu of i. 66 00:03:56,080 --> 00:04:01,670 Mu of i is the permutation of the number i. i is between one 67 00:04:01,670 --> 00:04:05,510 and M, and mu of i is a permutation of that. 68 00:04:05,510 --> 00:04:17,529 Now if you look at the matrix, which has the form, which is 69 00:04:17,529 --> 00:04:19,600 block upper diagonal. 70 00:04:19,600 --> 00:04:22,990 In other words, there's a matrix here, a square matrix a 71 00:04:22,990 --> 00:04:26,390 sub t, which is a transient matrix. 72 00:04:26,390 --> 00:04:31,610 There's a recurrent matrix here, and there's some way of 73 00:04:31,610 --> 00:04:33,900 getting from the transient states to 74 00:04:33,900 --> 00:04:36,730 the recurring states. 75 00:04:36,730 --> 00:04:41,630 And this is the general form that a unit chain has to have. 
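The permutation formula for the determinant, and the block upper triangular form just described, can both be checked numerically. Here is a sketch in Python; the matrices are invented examples (the lecture names no specific numbers), and `det_by_permutations` implements the sum over permutations directly.

```python
import numpy as np
from itertools import permutations

# det(A) = sum over permutations mu of sign(mu) * prod_i A[i, mu(i)]
def det_by_permutations(A):
    M = A.shape[0]
    total = 0.0
    for mu in permutations(range(M)):
        # sign of mu: parity of the number of inversions
        inversions = sum(mu[i] > mu[j]
                         for i in range(M) for j in range(i + 1, M))
        sign = -1.0 if inversions % 2 else 1.0
        total += sign * np.prod([A[i, mu[i]] for i in range(M)])
    return total

# A hypothetical block upper triangular matrix like the one on the slide:
# a transient block a_t, a coupling block a_tr, a recurrent block a_r.
A_t = np.array([[0.5, 0.2], [0.1, 0.4]])   # transient block
A_tr = np.array([[0.2, 0.1], [0.3, 0.2]])  # transient -> recurrent
A_r = np.array([[0.7, 0.3], [0.4, 0.6]])   # recurrent block
A = np.block([[A_t, A_tr], [np.zeros((2, 2)), A_r]])

# The permutation formula agrees with the library determinant, and the
# determinant factors into the product of the diagonal-block determinants.
print(np.isclose(det_by_permutations(A), np.linalg.det(A)))
print(np.isclose(np.linalg.det(A), np.linalg.det(A_t) * np.linalg.det(A_r)))
```

Only permutations that keep the transient rows on transient columns survive, which is exactly why the second check holds.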
76 00:04:41,630 --> 00:04:44,970 There are a bunch of transient states, there are a bunch of 77 00:04:44,970 --> 00:04:47,230 recurring states. 78 00:04:47,230 --> 00:04:52,630 And the interesting thing here is that the determinant of a 79 00:04:52,630 --> 00:04:57,620 is exactly the determinant of a sub t times the 80 00:04:57,620 --> 00:04:59,410 determinant of a sub r. 81 00:04:59,410 --> 00:05:03,210 I'm calling this a instead of the transition matrix p 82 00:05:03,210 --> 00:05:08,840 because I want to replace a by p minus lambda i, so I can 83 00:05:08,840 --> 00:05:11,820 talk about the eigenvalues of p. 84 00:05:11,820 --> 00:05:15,690 So when I do that replacement here, if I know that the 85 00:05:15,690 --> 00:05:20,140 determinant of a is this product of determinants, then 86 00:05:20,140 --> 00:05:24,130 the determinant of p minus lambda i is the determinant of 87 00:05:24,130 --> 00:05:32,160 pt minus lambda it, where i sub t is just a crazy way of saying 88 00:05:32,160 --> 00:05:35,120 a diagonal matrix. 89 00:05:35,120 --> 00:05:40,070 A diagonal t by t matrix, because this is a t by t 90 00:05:40,070 --> 00:05:41,740 matrix, also. 91 00:05:41,740 --> 00:05:48,580 i sub r is an r by r matrix, where this is a square r by r 92 00:05:48,580 --> 00:05:50,260 matrix also. 93 00:05:50,260 --> 00:05:53,970 Now, why is it that this determinant is equal to this 94 00:05:53,970 --> 00:05:56,670 product of determinants here? 95 00:05:56,670 --> 00:06:02,010 Well, before explaining why this is true, why do you care? 96 00:06:02,010 --> 00:06:08,180 Well, because we know that if we have a recurring matrix 97 00:06:08,180 --> 00:06:11,630 here, we know that it has-- 98 00:06:11,630 --> 00:06:13,790 I mean, we know a great deal about it. 99 00:06:13,790 --> 00:06:21,150 We know that any square matrix, r by r matrix has r 100 00:06:21,150 --> 00:06:22,750 different eigenvalues. 
101 00:06:22,750 --> 00:06:26,330 Some of them might be repeated, but they're always r 102 00:06:26,330 --> 00:06:27,480 eigenvalues. 103 00:06:27,480 --> 00:06:31,420 This matrix here has t eigenvalues. 104 00:06:31,420 --> 00:06:32,520 OK. 105 00:06:32,520 --> 00:06:37,730 This matrix here, we know has r plus t eigenvalues. 106 00:06:37,730 --> 00:06:42,060 You look at this formula here and you say aha, I can take 107 00:06:42,060 --> 00:06:46,670 all the eigenvalues here, combine them with all the eigenvalues 108 00:06:46,670 --> 00:06:50,280 here, and I have every one of the eigenvalues here. 109 00:06:50,280 --> 00:06:54,780 In other words, if I want to find all of the eigenvalues of 110 00:06:54,780 --> 00:06:59,620 p, all I have to do is find the eigenvalues of p sub t, 111 00:06:59,620 --> 00:07:04,710 combine them with the eigenvalues of p sub r, and I'm all done. 112 00:07:04,710 --> 00:07:08,640 So that really has simplified things a good deal. 113 00:07:08,640 --> 00:07:14,270 And it also really says explicitly that if you 114 00:07:14,270 --> 00:07:20,060 understand how to deal with recurrent Markov chains, you 115 00:07:20,060 --> 00:07:22,620 really know everything. 116 00:07:22,620 --> 00:07:25,840 Well, you also have to know how to deal with a transient 117 00:07:25,840 --> 00:07:29,880 chain, but the main part of it is dealing with this chain. 118 00:07:29,880 --> 00:07:34,870 This matrix p sub r has little r different eigenvalues, 119 00:07:34,870 --> 00:07:41,860 and all of those are eigenvalues 120 00:07:41,860 --> 00:07:42,710 of p. 121 00:07:42,710 --> 00:07:46,860 They're given by the roots of this determinant here. 122 00:07:46,860 --> 00:07:49,530 And all of those are roots here. 123 00:07:49,530 --> 00:07:51,580 OK, so why is this true? 124 00:07:51,580 --> 00:07:57,990 Well, the reason for it is that this product up here, 125 00:07:57,990 --> 00:07:59,200 look at this. 
126 00:07:59,200 --> 00:08:02,490 We're taking the sum over all permutations. 127 00:08:02,490 --> 00:08:05,315 But which one of those permutations can be non-zero? 128 00:08:12,940 --> 00:08:18,740 If I start out by saying that a sub t is t by t, then I know 129 00:08:18,740 --> 00:08:21,440 that this might be anything. 130 00:08:21,440 --> 00:08:24,050 These have to be zeroes here. 131 00:08:24,050 --> 00:08:30,450 If I choose some permutation down here, of some i, which 132 00:08:30,450 --> 00:08:31,530 is greater than t. 133 00:08:31,530 --> 00:08:35,030 In other words, if I choose mu of i to be some 134 00:08:35,030 --> 00:08:36,130 element over here. 135 00:08:36,130 --> 00:08:42,309 If I choose mu of i to be less than or equal to t, and i to 136 00:08:42,309 --> 00:08:45,500 be greater than t, what happens? 137 00:08:45,500 --> 00:08:47,790 I get a term which is equal to zero. 138 00:08:47,790 --> 00:08:51,210 That term in this product is zero. 139 00:08:51,210 --> 00:08:55,670 So none of those products can be non-zero. 140 00:08:55,670 --> 00:09:00,830 So the only way I can get non zeros here is when I'm dealing 141 00:09:00,830 --> 00:09:03,730 with an i which is less than or equal to t. 142 00:09:03,730 --> 00:09:06,100 Namely an i here. 143 00:09:06,100 --> 00:09:09,440 I have to choose a mu of i, a column which is 144 00:09:09,440 --> 00:09:10,870 less than or equal to t, also. 145 00:09:10,870 --> 00:09:17,540 If I'm dealing with an i which is greater than t, namely an 146 00:09:17,540 --> 00:09:23,410 i up here, then, well, it looks like I can choose 147 00:09:23,410 --> 00:09:24,950 anything there. 148 00:09:24,950 --> 00:09:25,630 But look. 149 00:09:25,630 --> 00:09:31,180 I've already used up all of these columns here 150 00:09:31,180 --> 00:09:33,470 by the non-zero terms here. 151 00:09:33,470 --> 00:09:37,360 So I can't do anything but use a mu of i 152 00:09:37,360 --> 00:09:40,080 greater than t up here. 
153 00:09:40,080 --> 00:09:44,703 So when I look at the permutations that are non 154 00:09:44,703 --> 00:09:49,010 zero, the only permutations that are non zero are those 155 00:09:49,010 --> 00:09:55,610 where mu of i is less than or equal to t if i 156 00:09:55,610 --> 00:10:01,960 is less than or equal to t. 157 00:10:01,960 --> 00:10:06,100 And mu of i is greater than t if i is greater than t. 158 00:10:06,100 --> 00:10:11,580 Now, how does that show that this is equal here? 159 00:10:11,580 --> 00:10:16,480 Well, let's look at that a little bit. 160 00:10:16,480 --> 00:10:19,740 I didn't even try to do it on the slide because the notation 161 00:10:19,740 --> 00:10:20,970 is kind of horrifying. 162 00:10:20,970 --> 00:10:24,850 But let's try to write this the following way. 163 00:10:24,850 --> 00:10:36,910 Determinant of a is equal to the sum, and now I'll write it 164 00:10:36,910 --> 00:10:48,040 as a sum over mu of 1 up to t. 165 00:10:48,040 --> 00:10:59,690 And the sum over mu of t plus 1 up to, well, t 166 00:10:59,690 --> 00:11:02,460 plus r, let's say. 167 00:11:02,460 --> 00:11:06,210 OK, so here I have all of the permutations of the 168 00:11:06,210 --> 00:11:08,870 numbers 1 to t. 169 00:11:08,870 --> 00:11:11,350 And here I have all the permutations of the 170 00:11:11,350 --> 00:11:14,010 numbers t plus 1 up. 171 00:11:14,010 --> 00:11:16,760 And for all of those, I'm going to 172 00:11:16,760 --> 00:11:18,190 ignore this plus minus. 173 00:11:18,190 --> 00:11:21,420 You can sort that out for yourselves. 174 00:11:21,420 --> 00:11:27,620 And then I have a product from i equals 1 to t 175 00:11:27,620 --> 00:11:37,950 of a sub i, mu of i. 176 00:11:37,950 --> 00:11:39,200 And then a product of a sub i, mu of i 177 00:11:42,410 --> 00:11:53,000 for i equals 178 00:11:53,000 --> 00:12:07,130 t plus 1 up to t plus r. 
179 00:12:07,130 --> 00:12:09,300 OK? 180 00:12:09,300 --> 00:12:14,740 So I'm separating this product here into a product first of 181 00:12:14,740 --> 00:12:19,070 the terms i less than or equal to t, and then for the terms i 182 00:12:19,070 --> 00:12:20,180 greater than t. 183 00:12:20,180 --> 00:12:24,620 For every permutation I choose using the i's that are less 184 00:12:24,620 --> 00:12:29,090 than or equal to t, I can choose any of the permutations 185 00:12:29,090 --> 00:12:33,520 using mu of i greater than t that I choose to use. 186 00:12:33,520 --> 00:12:35,570 So this breaks up in this way. 187 00:12:35,570 --> 00:12:37,960 I have this sum, I have this sum. 188 00:12:37,960 --> 00:12:43,120 I have these two products, so I can break this up as a sum 189 00:12:43,120 --> 00:12:55,270 over mu of 1 to t of plus minus the product from i equals 1 190 00:12:55,270 --> 00:13:08,752 to t of a sub i, mu of i, times the sum over mu of t plus 1 up to 191 00:13:08,752 --> 00:13:15,072 t plus r of the product of 192 00:13:20,160 --> 00:13:22,380 a sub i, mu of i. 193 00:13:22,380 --> 00:13:23,300 OK. 194 00:13:23,300 --> 00:13:26,030 So I've separated that into two different terms. 195 00:13:26,030 --> 00:13:27,000 STUDENT: T equals [INAUDIBLE]. 196 00:13:27,000 --> 00:13:27,570 PROFESSOR: What? 197 00:13:27,570 --> 00:13:30,680 STUDENT: T plus r equals big m? 198 00:13:30,680 --> 00:13:33,230 PROFESSOR: T plus r is big m, yes. 199 00:13:33,230 --> 00:13:40,060 Because I have t terms here, and I have r terms here. 200 00:13:40,060 --> 00:13:44,710 OK, so the interesting thing here is having this non-zero 201 00:13:44,710 --> 00:13:48,400 term here doesn't make any difference here. 202 00:13:48,400 --> 00:13:52,430 I mean, this is more straightforward if you have a 203 00:13:52,430 --> 00:13:54,020 block diagonal matrix. 
204 00:13:54,020 --> 00:13:58,330 It's clear that the eigenvalues of a block 205 00:13:58,330 --> 00:14:03,700 diagonal matrix are going to be the eigenvalues of one block plus 206 00:14:03,700 --> 00:14:05,560 the eigenvalues of the other. 207 00:14:05,560 --> 00:14:09,980 Here we have the eigenvalues of this, and the 208 00:14:09,980 --> 00:14:11,450 eigenvalues of this. 209 00:14:11,450 --> 00:14:14,910 And what's surprising is that as far as the eigenvalues are 210 00:14:14,910 --> 00:14:19,950 concerned, this has nothing whatsoever to do with it. 211 00:14:19,950 --> 00:14:20,690 OK. 212 00:14:20,690 --> 00:14:24,480 The only thing that this has to do with it is it says 213 00:14:24,480 --> 00:14:28,780 something about the sums of this matrix here, because the 214 00:14:28,780 --> 00:14:31,500 sums of these rows are now less than 1. 215 00:14:31,500 --> 00:14:34,660 They all have to be, some of them, at least, have to be 216 00:14:34,660 --> 00:14:36,760 less than or equal to 1. 217 00:14:36,760 --> 00:14:40,090 Because you do have this way of getting from the transient 218 00:14:40,090 --> 00:14:43,470 elements to the non transient elements. 219 00:14:43,470 --> 00:14:48,060 But it's very surprising that these elements, which are 220 00:14:48,060 --> 00:14:52,100 critically important, because those are the things that get 221 00:14:52,100 --> 00:14:55,800 you from the transient states to the recurrent states, have 222 00:14:55,800 --> 00:14:59,540 nothing to do with the eigenvalues whatsoever. 223 00:14:59,540 --> 00:15:00,105 I don't know why. 224 00:15:00,105 --> 00:15:04,310 I can't give you any insights about that, but 225 00:15:04,310 --> 00:15:06,810 that's the way it is. 
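This surprising claim, that the transient-to-recurrent coupling block has no effect on the eigenvalues, can be checked numerically. Here is a sketch with made-up blocks (none of these numbers are from the lecture): hold p sub t and p sub r fixed, swap in two different coupling blocks, and compare the eigenvalue lists.

```python
import numpy as np

# Fixed diagonal blocks: a transient block P_t and a recurrent block P_r.
P_t = np.array([[0.5, 0.2],
                [0.1, 0.4]])
P_r = np.array([[0.7, 0.3],
                [0.4, 0.6]])

def full_chain(coupling):
    """Assemble the block upper triangular chain with a given coupling block."""
    return np.block([[P_t, coupling],
                     [np.zeros((2, 2)), P_r]])

# Two different transient -> recurrent coupling blocks (each keeps the
# row sums equal to 1, so both are valid transition matrices).
eigs1 = np.sort_complex(np.linalg.eigvals(full_chain(np.array([[0.2, 0.1],
                                                               [0.3, 0.2]]))))
eigs2 = np.sort_complex(np.linalg.eigvals(full_chain(np.array([[0.0, 0.3],
                                                               [0.25, 0.25]]))))
print(np.allclose(eigs1, eigs2))  # same eigenvalues either way
```

The determinant factorization explains why: the characteristic polynomial only ever sees the two diagonal blocks.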
226 00:15:06,810 --> 00:15:12,030 That's an interesting thing, because if you take this 227 00:15:12,030 --> 00:15:19,930 transition matrix, and you keep a sub t and a sub r fixed, and 228 00:15:19,930 --> 00:15:23,250 you play any kind of funny game you want to with those 229 00:15:23,250 --> 00:15:28,780 terms going from the transient states to the non transient 230 00:15:28,780 --> 00:15:33,370 states, it won't change any eigenvalues. 231 00:15:33,370 --> 00:15:35,490 Don't know why it doesn't. 232 00:15:35,490 --> 00:15:39,400 OK, so where do we go with that? 233 00:15:39,400 --> 00:15:45,440 Well, that's what it says. 234 00:15:45,440 --> 00:15:50,580 The eigenvalues of p are the t eigenvalues of p sub t and the r 235 00:15:50,580 --> 00:15:52,200 eigenvalues of p sub r. 236 00:15:52,200 --> 00:15:56,180 It also tells you something about simple eigenvalues, and 237 00:15:56,180 --> 00:15:59,800 these crazy eigenvalues, which don't have enough eigenvectors 238 00:15:59,800 --> 00:16:01,230 to go along with them. 239 00:16:01,230 --> 00:16:06,420 Because it tells you that if p sub r has all of its 240 00:16:06,420 --> 00:16:11,880 eigenvectors, and p sub t has all of its eigenvectors, 241 00:16:11,880 --> 00:16:14,550 then you don't have any of this crazy Jordan form 242 00:16:14,550 --> 00:16:16,520 thing, or anything. 243 00:16:16,520 --> 00:16:29,670 OK. If pi is a left eigenvector of this recurrent matrix, then 244 00:16:29,670 --> 00:16:35,550 if you look at the vector, starting with zeros, and then I 245 00:16:35,550 --> 00:16:42,390 guess I should really say, well, if pi sub 1 up to pi sub 246 00:16:42,390 --> 00:16:47,910 r is a left eigenvector of this r by r matrix, then if I start 247 00:16:47,910 --> 00:16:52,620 out with t zeroes, and then put in pi 1 to pi r, this 248 00:16:52,620 --> 00:16:57,310 vector here has to be a left eigenvector of all of p. 249 00:16:57,310 --> 00:16:58,310 Why is that? 
250 00:16:58,310 --> 00:17:01,610 Well, if I look at a vector, which starts out with zeroes, 251 00:17:01,610 --> 00:17:06,900 and then has this eigenvector pi, and I multiply that vector 252 00:17:06,900 --> 00:17:10,210 by this matrix here, I'm taking these terms, 253 00:17:10,210 --> 00:17:16,260 multiplying them by the columns of this matrix, these 254 00:17:16,260 --> 00:17:22,310 zeros knock out all of these elements here. 255 00:17:22,310 --> 00:17:25,470 These zeroes knock out all of these elements. 256 00:17:25,470 --> 00:17:28,410 So I start out with zeroes everywhere here. 257 00:17:28,410 --> 00:17:30,480 That's what this says. 258 00:17:30,480 --> 00:17:34,660 And then when I'm dealing with this part of the matrix, the 259 00:17:34,660 --> 00:17:39,750 zeros knock out all of this, and I just have pi multiplying 260 00:17:39,750 --> 00:17:40,820 p sub r. 261 00:17:40,820 --> 00:17:45,220 So if I have an eigenvalue lambda, it says I have the 262 00:17:45,220 --> 00:17:50,170 eigenvalue lambda times the vector zero followed by pi. 263 00:17:50,170 --> 00:17:54,760 It says that if I have an eigenvector, a left 264 00:17:54,760 --> 00:18:01,010 eigenvector of this recurrent matrix, then that turns into, 265 00:18:01,010 --> 00:18:05,670 if you put some zeroes up in front of it, it turns into an 266 00:18:05,670 --> 00:18:07,790 eigenvector of the whole matrix. 267 00:18:07,790 --> 00:18:11,580 If we look at the eigenvalue 1, which is the most important 268 00:18:11,580 --> 00:18:14,350 thing, this is the thing that gives you the steady state 269 00:18:14,350 --> 00:18:16,930 vector, this is sort of obvious. 270 00:18:16,930 --> 00:18:19,630 Because the steady state vector is where you go 271 00:18:19,630 --> 00:18:23,960 eventually, and eventually where you go is you have to be 272 00:18:23,960 --> 00:18:27,290 in one of these recurrent states, eventually. 
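This padding argument is easy to check numerically. A sketch with invented matrices: compute the steady-state vector of p sub r alone, prepend zeroes for the transient states, and verify the result is a left eigenvector of the whole chain with eigenvalue 1.

```python
import numpy as np

# Invented example blocks (not from the lecture).
P_t = np.array([[0.5, 0.2],
                [0.1, 0.4]])
P_tr = np.array([[0.2, 0.1],
                 [0.3, 0.2]])
P_r = np.array([[0.7, 0.3],
                [0.4, 0.6]])
P = np.block([[P_t, P_tr],
              [np.zeros((2, 2)), P_r]])

# Steady-state vector of P_r alone: the left eigenvector of P_r for
# eigenvalue 1 (a right eigenvector of P_r transposed), normalized.
w, vl = np.linalg.eig(P_r.T)
pi_r = np.real(vl[:, np.argmin(np.abs(w - 1))])
pi_r = pi_r / pi_r.sum()

# Prepend t zeroes; the zeroes knock out the transient columns and the
# coupling block, so this is a left eigenvector of the whole P.
pi = np.concatenate([np.zeros(2), pi_r])
print(np.allclose(pi @ P, pi))
```

The zero entries in pi are exactly what makes the coupling block P_tr irrelevant in the product, matching the slide's argument.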
273 00:18:27,290 --> 00:18:30,610 And the probabilities within the recurrent set of states 274 00:18:30,610 --> 00:18:33,400 are the same as the probabilities if you didn't 275 00:18:33,400 --> 00:18:36,590 have these transient states at all. 276 00:18:36,590 --> 00:18:40,490 So this is all sort of obvious, as far as the steady 277 00:18:40,490 --> 00:18:43,020 state vector pi. 278 00:18:43,020 --> 00:18:47,480 But it's a little less obvious as far as the other vectors. 279 00:18:47,480 --> 00:18:52,300 The left eigenvectors corresponding to p sub t, I don't 280 00:18:52,300 --> 00:18:53,610 understand them at all. 281 00:18:53,610 --> 00:18:59,660 They aren't the same as the left eigenvectors of p sub t 282 00:18:59,660 --> 00:19:04,670 themselves. 283 00:19:08,040 --> 00:19:10,270 I didn't say this right here. 284 00:19:10,270 --> 00:19:15,870 I mean the left eigenvectors of p corresponding to the 285 00:19:15,870 --> 00:19:18,700 eigenvalues of p sub t. 286 00:19:18,700 --> 00:19:22,010 I don't understand how they work, and I don't understand 287 00:19:22,010 --> 00:19:24,350 anything you can derive from them. 288 00:19:24,350 --> 00:19:26,740 They're just kind of crazy things, which are what they 289 00:19:26,740 --> 00:19:27,780 happen to be. 290 00:19:27,780 --> 00:19:29,350 And I don't care about them. 291 00:19:29,350 --> 00:19:32,200 I don't know anything to do with them. 292 00:19:32,200 --> 00:19:35,200 But these other eigenvectors are very useful. 293 00:19:35,200 --> 00:19:38,130 OK. 294 00:19:38,130 --> 00:19:45,040 We can extend this to as many different recurrent sets of 295 00:19:45,040 --> 00:19:47,080 states as you choose. 296 00:19:47,080 --> 00:19:53,100 Here I'm doing it with a Markov chain, which has two 297 00:19:53,100 --> 00:19:56,550 different sets of recurrent states. 
298 00:19:56,550 --> 00:20:00,010 They might be periodic, they might be ergodic, it doesn't 299 00:20:00,010 --> 00:20:01,340 make any difference. 300 00:20:01,340 --> 00:20:07,730 So the matrix p has these transient states up here. 301 00:20:07,730 --> 00:20:11,990 Here we have the transitions where the transient states just go to each 302 00:20:11,990 --> 00:20:16,320 other, the transition probabilities starting in 303 00:20:16,320 --> 00:20:19,140 a transient state and going to a transient state. 304 00:20:19,140 --> 00:20:24,090 Here we have the transitions, which go from transient states 305 00:20:24,090 --> 00:20:26,500 to this first set of recurrent states. 306 00:20:26,500 --> 00:20:30,810 Here we have the transitions, which go from a transient 307 00:20:30,810 --> 00:20:35,480 state to the second set of recurrent states. 308 00:20:35,480 --> 00:20:36,180 OK. 309 00:20:36,180 --> 00:20:39,330 The same way as before, the determinant of this whole 310 00:20:39,330 --> 00:20:44,790 thing here, and this determinant, the roots of that 311 00:20:44,790 --> 00:20:49,300 are in fact the eigenvalues of p, are the product of the 312 00:20:49,300 --> 00:20:54,930 determinant of pt minus lambda it times the product of this, 313 00:20:54,930 --> 00:20:58,030 times this determinant here. 314 00:20:58,030 --> 00:21:02,180 This has little t eigenvalues. 315 00:21:02,180 --> 00:21:05,220 This has little r eigenvalues. 316 00:21:05,220 --> 00:21:08,690 This has little r prime eigenvalues, and if you add up 317 00:21:08,690 --> 00:21:11,880 t plus little r plus little r prime, what do you get? 318 00:21:11,880 --> 00:21:17,790 You get capital M, which is the total number 319 00:21:17,790 --> 00:21:21,470 of states in the Markov chain. 320 00:21:21,470 --> 00:21:27,110 So the eigenvalues here are exactly the eigenvalues here 321 00:21:27,110 --> 00:21:33,300 plus the eigenvalues here, plus the eigenvalues here. 
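A sketch of this count with invented numbers: build a chain with one transient state and two 2-state recurrent classes, and check that the eigenvalues of p are exactly the eigenvalues of the three diagonal blocks put together. Note that the eigenvalue 1 shows up twice, once for each recurrent class.

```python
import numpy as np

# One transient state plus two recurrent classes (numbers are invented;
# each row of the assembled matrix sums to 1).
P_t  = np.array([[0.4]])                       # t = 1 transient state
P_r  = np.array([[0.7, 0.3], [0.4, 0.6]])      # first recurrent class, r = 2
P_rp = np.array([[0.1, 0.9], [0.8, 0.2]])      # second recurrent class, r' = 2
P = np.block([
    [P_t,              np.array([[0.3, 0.1]]), np.array([[0.1, 0.1]])],
    [np.zeros((2, 1)), P_r,                    np.zeros((2, 2))],
    [np.zeros((2, 1)), np.zeros((2, 2)),       P_rp],
])

# t + r + r' = M = 5 eigenvalues: the blocks' eigenvalues, pooled.
block_eigs = np.concatenate([np.linalg.eigvals(B) for B in (P_t, P_r, P_rp)])
full_eigs = np.linalg.eigvals(P)
print(np.allclose(np.sort_complex(block_eigs), np.sort_complex(full_eigs)))
```

With two recurrent classes the pooled list contains the eigenvalue 1 with multiplicity two, which bears directly on the repeated-eigenvalue question that comes up in a moment.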
322 00:21:33,300 --> 00:21:36,720 And you can find the eigenvectors, the left 323 00:21:36,720 --> 00:21:40,810 eigenvectors for these states in exactly 324 00:21:40,810 --> 00:21:43,450 the same way as before. 325 00:21:43,450 --> 00:21:44,570 OK. 326 00:21:44,570 --> 00:21:45,772 Yeah? 327 00:21:45,772 --> 00:21:48,628 STUDENT: So again, the eigenvalues can be repeated 328 00:21:48,628 --> 00:21:51,960 both within t, r, r prime, and in between the-- 329 00:21:51,960 --> 00:21:52,436 PROFESSOR: Yes. 330 00:21:52,436 --> 00:21:54,340 STUDENT: There's nothing that says [INAUDIBLE]. 331 00:21:54,340 --> 00:21:54,610 PROFESSOR: No. 332 00:21:54,610 --> 00:21:58,440 There's nothing that says they can't, except you can always 333 00:21:58,440 --> 00:22:05,980 find the left eigenvectors, anyway, and they are, in fact, 334 00:22:05,980 --> 00:22:08,680 things of this form. 335 00:22:08,680 --> 00:22:15,840 If pi is a left eigenvector of p sub r, then zero followed by 336 00:22:15,840 --> 00:22:17,460 pi followed by zero. 337 00:22:17,460 --> 00:22:26,480 In other words, little t zeros, followed by the 338 00:22:26,480 --> 00:22:32,060 eigenvector pi, followed by little r prime zeroes here, 339 00:22:32,060 --> 00:22:34,490 this has to be a left eigenvector of p. 340 00:22:34,490 --> 00:22:37,280 So this tells you something about whether you're going to 341 00:22:37,280 --> 00:22:40,140 have a Jordan form or not, one of these really 342 00:22:40,140 --> 00:22:41,240 ugly things in it. 343 00:22:41,240 --> 00:22:44,590 And it tells you that in many cases, you 344 00:22:44,590 --> 00:22:46,370 just can't have them. 345 00:22:46,370 --> 00:22:48,850 If you have them, they're usually tied up with this 346 00:22:48,850 --> 00:22:50,730 matrix here. 347 00:22:50,730 --> 00:22:53,140 OK, so that, I don't know. 348 00:22:53,140 --> 00:22:53,950 Was this useful? 349 00:22:53,950 --> 00:22:55,550 Does this clarify anything? 
350 00:22:55,550 --> 00:22:58,830 Or if it doesn't, it's too bad. 351 00:23:01,810 --> 00:23:02,330 OK. 352 00:23:02,330 --> 00:23:05,080 So now we want to start talking about rewards. 353 00:23:07,580 --> 00:23:09,150 Some people call these costs. 354 00:23:09,150 --> 00:23:11,230 If you're an optimist, you call it rewards. 355 00:23:11,230 --> 00:23:13,870 If you're a pessimist, you call it costs. 356 00:23:13,870 --> 00:23:15,520 They're both the same thing. 357 00:23:15,520 --> 00:23:18,180 If you're dealing with rewards, you maximize them. 358 00:23:18,180 --> 00:23:20,470 If you're dealing with costs, you minimize them. 359 00:23:20,470 --> 00:23:24,800 So mathematically, who cares? 360 00:23:24,800 --> 00:23:30,590 OK, so suppose that each state i of a Markov chain is 361 00:23:30,590 --> 00:23:33,280 associated with a given reward, r sub i. 362 00:23:33,280 --> 00:23:36,350 In other words, you think of this Markov chain, which is 363 00:23:36,350 --> 00:23:37,180 running along. 364 00:23:37,180 --> 00:23:41,320 You go from one state to another over time. 365 00:23:41,320 --> 00:23:45,930 And while this is happening, you're pocketing some reward 366 00:23:45,930 --> 00:23:47,250 all the time. 367 00:23:47,250 --> 00:23:47,650 OK. 368 00:23:47,650 --> 00:23:50,890 You invest in a stock. 369 00:23:50,890 --> 00:23:53,470 Strangely enough, these particular stocks we're 370 00:23:53,470 --> 00:23:57,270 thinking about here have this Markov property. 371 00:23:57,270 --> 00:23:59,970 Stocks really don't have a Markov property, but we'll 372 00:23:59,970 --> 00:24:02,130 assume they do. 373 00:24:02,130 --> 00:24:06,200 And since they have this Markov property, you win for a 374 00:24:06,200 --> 00:24:07,840 while, and you lose for a while. 375 00:24:07,840 --> 00:24:10,060 You win for a while, you lose for a while. 376 00:24:10,060 --> 00:24:12,770 But we have something extra, other than 377 00:24:12,770 --> 00:24:15,050 just the Markov chains. 
378 00:24:15,050 --> 00:24:18,830 We can analyze this whole situation, knowing how Markov 379 00:24:18,830 --> 00:24:20,670 chains behave. 380 00:24:20,670 --> 00:24:24,980 There's not much left besides that, but there are an 381 00:24:24,980 --> 00:24:29,860 extraordinary number of applications of this idea, and 382 00:24:29,860 --> 00:24:31,900 dynamic programming is one of them. 383 00:24:31,900 --> 00:24:35,380 Because that's just one added extension beyond 384 00:24:35,380 --> 00:24:37,880 this idea of rewards. 385 00:24:37,880 --> 00:24:38,380 OK. 386 00:24:38,380 --> 00:24:40,770 The random variable x of n. 387 00:24:40,770 --> 00:24:43,240 That's a random quantity. 388 00:24:43,240 --> 00:24:45,840 It's the state at time n. 389 00:24:45,840 --> 00:24:50,010 And the random reward at time n is then the random variable 390 00:24:50,010 --> 00:24:55,680 r of xn that maps xn equals i into ri for each i. 391 00:24:55,680 --> 00:24:59,140 This is the same idea of taking one random variable, 392 00:24:59,140 --> 00:25:02,030 which is a function of another random variable. 393 00:25:02,030 --> 00:25:06,000 The one random variable takes on the values one up to 394 00:25:06,000 --> 00:25:07,740 capital M. 395 00:25:07,740 --> 00:25:11,080 And then the other random variable takes on a value 396 00:25:11,080 --> 00:25:14,680 which is determined by the state that you happen to be 397 00:25:14,680 --> 00:25:16,600 in, which is this random state. 398 00:25:16,600 --> 00:25:21,700 So specifying r sub i specifies what the set of 399 00:25:21,700 --> 00:25:25,380 rewards are, what the reward is in each given state. 400 00:25:25,380 --> 00:25:28,520 Again, we have this awful problem, which I wish we could 401 00:25:28,520 --> 00:25:32,760 avoid in Markov chains, of using the same word state to 402 00:25:32,760 --> 00:25:35,900 talk about the set of different states. 
403 00:25:35,900 --> 00:25:38,120 And also to talk about the random state 404 00:25:38,120 --> 00:25:39,170 at any given time. 405 00:25:39,170 --> 00:25:43,560 But hopefully by now you're used to that. 406 00:25:43,560 --> 00:25:47,700 In our discussion here, the only thing we're going to talk 407 00:25:47,700 --> 00:25:50,670 about are expected rewards. 408 00:25:50,670 --> 00:25:55,810 Now, you know that expected rewards, or expectations, are a 409 00:25:55,810 --> 00:25:58,310 little more general than you would think they would be, 410 00:25:58,310 --> 00:26:02,060 because you can take the expected value of any sort 411 00:26:02,060 --> 00:26:04,300 of crazy thing. 412 00:26:04,300 --> 00:26:07,870 If you want to talk about any event, you can take the 413 00:26:07,870 --> 00:26:11,310 indicator function of that event, and find the expected 414 00:26:11,310 --> 00:26:13,890 value of that indicator function. 415 00:26:13,890 --> 00:26:16,920 And that's just the probability of that event. 416 00:26:16,920 --> 00:26:22,660 So by understanding how to deal with expectations, you 417 00:26:22,660 --> 00:26:25,560 really have the capability of finding distribution 418 00:26:25,560 --> 00:26:28,480 functions, or anything else you want to find. 419 00:26:28,480 --> 00:26:28,970 OK. 420 00:26:28,970 --> 00:26:31,490 But anyway, since we're interested only in expected 421 00:26:31,490 --> 00:26:37,555 rewards, the expected reward at time n, given that x zero 422 00:26:37,555 --> 00:26:44,950 is i, is the expected value of r of xn given x zero equals i, 423 00:26:44,950 --> 00:26:49,840 which is the sum over j of the reward you get if you're in 424 00:26:49,840 --> 00:26:55,700 state j at time n times p sub ij, super n, which we've 425 00:26:55,700 --> 00:27:00,850 talked about ad nauseum for the last four lectures now. 
426 00:27:00,850 --> 00:27:06,900 And this is the probability that the state at time n is j, 427 00:27:06,900 --> 00:27:09,910 given that the state at time zero is i. 428 00:27:09,910 --> 00:27:13,650 So you can just automatically find the expected 429 00:27:13,650 --> 00:27:17,570 value of r of xn. 430 00:27:17,570 --> 00:27:20,610 And it's by that formula. 431 00:27:20,610 --> 00:27:24,230 Now, recall that this quantity here is not all that simple. 432 00:27:24,230 --> 00:27:28,680 This is the ij element of the nth power of the 433 00:27:28,680 --> 00:27:31,010 matrix p. 434 00:27:31,010 --> 00:27:32,370 But, so what? 435 00:27:32,370 --> 00:27:36,130 We can at least write a nice formula for it now. 436 00:27:36,130 --> 00:27:40,140 The expected aggregate reward over the n steps from m to m 437 00:27:40,140 --> 00:27:43,080 plus n minus 1. 438 00:27:43,080 --> 00:27:44,900 What is m doing in here? 439 00:27:44,900 --> 00:27:48,970 It's just reminding us that Markov chains are 440 00:27:48,970 --> 00:27:51,890 homogeneous over time. 441 00:27:51,890 --> 00:27:56,370 So, when I talk about the aggregate reward from time m 442 00:27:56,370 --> 00:28:01,200 to m plus n minus 1, it's the same as the aggregate reward 443 00:28:01,200 --> 00:28:04,500 from time 0 up to time n minus 1. 444 00:28:04,500 --> 00:28:06,270 The expected values are the same. 445 00:28:06,270 --> 00:28:09,550 The actual sample functions are different. 446 00:28:09,550 --> 00:28:14,290 OK, so if I try to calculate this aggregate reward 447 00:28:14,290 --> 00:28:18,880 conditional on xm equals i, namely conditional on starting 448 00:28:18,880 --> 00:28:23,660 in state i, then this expected aggregate reward, I use that 449 00:28:23,660 --> 00:28:28,610 as a symbol for it, is the expected value of r of xm, 450 00:28:28,610 --> 00:28:30,310 given xm equals i. 451 00:28:30,310 --> 00:28:30,890 What is that? 452 00:28:30,890 --> 00:28:33,030 Well, that's ri. 
453 00:28:33,030 --> 00:28:35,220 I mean, given that xm is equal to i, this 454 00:28:35,220 --> 00:28:36,490 isn't random anymore. 455 00:28:36,490 --> 00:28:38,500 It's just the reward r sub i. 456 00:28:38,500 --> 00:28:45,350 Plus the expected value of r of xm plus 1, which is the sum 457 00:28:45,350 --> 00:28:49,490 over j, of pij times r sub j. 458 00:28:49,490 --> 00:28:54,305 That's at time m plus 1, given that you're in state i at time 459 00:28:54,305 --> 00:29:00,370 m, and so forth, up until time n minus 1, where the expected 460 00:29:00,370 --> 00:29:03,240 reward, then, involves p sub ij to the n minus 1. 00:29:06,180 --> 00:29:10,860 That's the probability of being in state j at time n minus 1 given that 461 00:29:10,860 --> 00:29:16,190 you started off in state i at time 0, times r sub j. 462 00:29:16,190 --> 00:29:20,790 And since expectations add, we have this nice, convenient 463 00:29:20,790 --> 00:29:22,040 formula here. 00:29:26,180 --> 00:29:30,580 We're doing something I normally hate doing, which is 464 00:29:30,580 --> 00:29:35,290 building up a lot of notation, and then using that notation 465 00:29:35,290 --> 00:29:40,470 to write extremely complicated formulas in a way that looks 466 00:29:40,470 --> 00:29:41,200 very simple. 467 00:29:41,200 --> 00:29:44,480 And therefore you will get some sense that what we're doing 468 00:29:44,480 --> 00:29:45,840 is very simple. 469 00:29:45,840 --> 00:29:48,160 These quantities in here, again, are 470 00:29:48,160 --> 00:29:49,790 not all that simple. 471 00:29:49,790 --> 00:29:52,550 But at least we can write it in a simple way. 472 00:29:52,550 --> 00:29:56,260 And since we can write it in a simple way, it turns out we 473 00:29:56,260 --> 00:29:59,160 can do some nice things with it. 474 00:29:59,160 --> 00:29:59,420 OK. 475 00:29:59,420 --> 00:30:00,970 So where do we go from all of this? 
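The two formulas just derived, the expected reward at time n and the expected aggregate reward over n steps, can be sketched numerically. This is a minimal illustration, not anything from the lecture itself: the transition matrix P and the reward vector r below are made-up numbers for a hypothetical 3-state chain.

```python
import numpy as np

# Hypothetical 3-state chain and per-state rewards (illustrative numbers only).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 4.0])

def expected_reward(P, r, i, n):
    """E[R(X_n) | X_0 = i] = sum_j P^n_{ij} r_j."""
    return np.linalg.matrix_power(P, n)[i] @ r

def expected_aggregate_reward(P, r, i, n):
    """v_i(n): expected sum of rewards over n steps, starting in state i.
    By homogeneity the start time can be taken to be 0."""
    return sum(expected_reward(P, r, i, h) for h in range(n))

single = expected_reward(P, r, 0, 5)
aggregate = expected_aggregate_reward(P, r, 0, 5)
```

Note that the aggregate reward satisfies the recursion v_i(n) = r_i + sum_j p_ij v_j(n-1), which is the same structure exploited later in the lecture.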
478 00:30:04,860 --> 00:30:12,280 We have just said that the expected reward we get, 479 00:30:12,280 --> 00:30:18,550 expected aggregate reward over n steps, namely from m up to m 480 00:30:18,550 --> 00:30:20,210 plus n minus 1. 481 00:30:20,210 --> 00:30:25,660 We're assuming that if we start at time m, we pick up a 482 00:30:25,660 --> 00:30:27,660 reward at time m. 483 00:30:27,660 --> 00:30:30,530 I mean, that's just an arbitrary decision. 484 00:30:30,530 --> 00:30:33,960 We might as well do that, because otherwise we just have 485 00:30:33,960 --> 00:30:36,840 one more transition matrix sitting here. 486 00:30:36,840 --> 00:30:38,660 OK, so we start at time m. 487 00:30:38,660 --> 00:30:42,640 We pick up a reward, which is conditional on the 488 00:30:42,640 --> 00:30:45,030 state we start in. 489 00:30:45,030 --> 00:30:53,040 And then we look at the expected reward for time m and 490 00:30:53,040 --> 00:30:58,420 time m plus 1, m plus 2, up to m plus n minus 1. 491 00:30:58,420 --> 00:31:00,610 Since we started at m, we're picking 492 00:31:00,610 --> 00:31:02,620 up n different rewards. 493 00:31:02,620 --> 00:31:07,490 We have to stop at time m plus n minus 1. 494 00:31:07,490 --> 00:31:14,040 OK, so that's this expected aggregate reward. 495 00:31:14,040 --> 00:31:17,890 Why do I care about expected aggregate reward? 496 00:31:17,890 --> 00:31:22,220 Because the rewards at any time n are sort of trivial. 497 00:31:22,220 --> 00:31:24,640 What we are interested in is how does this 498 00:31:24,640 --> 00:31:27,320 build up over time? 499 00:31:27,320 --> 00:31:29,150 You start to invest in a stock. 500 00:31:29,150 --> 00:31:34,480 You don't much care what it's worth at time 10. 501 00:31:34,480 --> 00:31:35,785 You care how it grows. 502 00:31:38,390 --> 00:31:41,040 You care about its value when you want to sell it, and you 503 00:31:41,040 --> 00:31:44,880 don't know when you're going to sell it, most of the time. 
504 00:31:44,880 --> 00:31:48,150 So you're really interested in these aggregate 505 00:31:48,150 --> 00:31:49,400 rewards that you accumulate. 00:31:52,260 --> 00:31:54,590 You'll see when we get to dynamic programming why 506 00:31:54,590 --> 00:31:56,780 you're interested in that, also. 507 00:31:56,780 --> 00:31:57,430 OK. 508 00:31:57,430 --> 00:32:01,340 If the Markov chain is an ergodic unit chain, then 509 00:32:01,340 --> 00:32:04,710 successive terms of this expression tend to a steady 510 00:32:04,710 --> 00:32:06,450 state gain per step. 511 00:32:06,450 --> 00:32:11,520 In other words, these terms here, when n gets very large, 512 00:32:11,520 --> 00:32:17,070 if I run this process for a very long time, what happens to p 513 00:32:17,070 --> 00:32:20,640 sub ij to the n minus 1? 514 00:32:20,640 --> 00:32:27,920 This tends towards the steady state probability pi sub j. 515 00:32:27,920 --> 00:32:31,710 And it doesn't matter where we started. 516 00:32:31,710 --> 00:32:34,690 The only thing of importance is where we end up. 517 00:32:34,690 --> 00:32:37,180 It doesn't matter how high this is. 518 00:32:37,180 --> 00:32:42,670 So we have a sum over j, of pi sub j times r sub j. 519 00:32:42,670 --> 00:32:48,745 After a very long time, the expected gain per step is just 520 00:32:48,745 --> 00:32:51,930 a sum of pi sub j times r sub j. 521 00:32:51,930 --> 00:32:56,000 That's what's important after a long time. 522 00:32:56,000 --> 00:32:58,290 And that's independent of the starting state. 523 00:32:58,290 --> 00:33:02,670 So what we have here is a big, messy transient, which is a 524 00:33:02,670 --> 00:33:04,780 sum of a whole bunch of things. 525 00:33:04,780 --> 00:33:08,090 And then eventually it just settles down, and every extra 526 00:33:08,090 --> 00:33:15,190 step you do, you just pick up an extra factor of g as an 527 00:33:15,190 --> 00:33:16,970 extra reward. 
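That steady-state gain per step, g, can be computed directly: solve pi P = pi with the components of pi summing to 1, then form g as the sum over j of pi sub j times r sub j. A minimal numpy sketch, again assuming a made-up ergodic chain rather than any example from the lecture:

```python
import numpy as np

# Hypothetical ergodic unit chain and rewards (illustrative numbers only).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 4.0])

def steady_state_gain(P, r):
    """Solve pi P = pi with sum(pi) = 1, then return g = sum_j pi_j r_j."""
    M = P.shape[0]
    # Stack the balance equations (P^T - I) pi = 0 with the
    # normalization sum(pi) = 1, and solve by least squares.
    A = np.vstack([P.T - np.eye(M), np.ones(M)])
    b = np.zeros(M + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi @ r

g = steady_state_gain(P, r)
```

Running the chain for many steps from any start state gives the same gain per step, which is exactly the independence-of-starting-state point being made here.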
529 00:33:16,970 --> 00:33:19,960 The reward might, of course, be negative, like in the stock 530 00:33:19,960 --> 00:33:25,100 market over the last 10 years, or up until the last year or 531 00:33:25,100 --> 00:33:27,980 so, which was negative for a long time. 532 00:33:27,980 --> 00:33:30,800 But that doesn't make any difference. 533 00:33:30,800 --> 00:33:34,480 This is just a number, and this is independent of 534 00:33:34,480 --> 00:33:36,590 starting state. 535 00:33:36,590 --> 00:33:41,740 And v sub i of n can be viewed as a transient in i, which is all 536 00:33:41,740 --> 00:33:43,330 this stuff at the beginning. 537 00:33:43,330 --> 00:33:47,010 The sum of all these terms at the beginning plus something 538 00:33:47,010 --> 00:33:50,290 that settles down over a long period of time. 539 00:33:50,290 --> 00:33:54,200 How to calculate that transient, and how to combine it 540 00:33:54,200 --> 00:33:56,230 with the steady state gain? 541 00:33:56,230 --> 00:33:59,920 The notes talk a great deal about that. 542 00:33:59,920 --> 00:34:03,970 What we're trying to do today is to talk about dynamic 543 00:34:03,970 --> 00:34:09,080 programming without going into all of this terrible mess 544 00:34:09,080 --> 00:34:12,250 of dealing with rewards in a very 545 00:34:12,250 --> 00:34:14,239 systematic and simple way. 546 00:34:14,239 --> 00:34:16,199 You can read about that later. 547 00:34:16,199 --> 00:34:19,610 What we're aiming at is to talk about dynamic programming 548 00:34:19,610 --> 00:34:23,340 a little bit, and then get off to other things. 549 00:34:23,340 --> 00:34:23,870 OK. 550 00:34:23,870 --> 00:34:27,239 So anyway, we have a transient, plus we have a 551 00:34:27,239 --> 00:34:29,330 steady state gain. 552 00:34:29,330 --> 00:34:31,470 The transient is important. 553 00:34:31,470 --> 00:34:34,520 And it's particularly important if g equals zero. 
554 00:34:34,520 --> 00:34:40,090 Namely if your average gain per step is nothing, then what 555 00:34:40,090 --> 00:34:47,980 you're primarily interested in is how valuable is it to start 556 00:34:47,980 --> 00:34:49,360 in a particular state? 557 00:34:49,360 --> 00:34:53,000 If you start in one state versus another state, you 558 00:34:53,000 --> 00:34:56,600 might get a great deal of reward in this one state, 559 00:34:56,600 --> 00:34:59,120 whereas you make a loss in some other state. 560 00:34:59,120 --> 00:35:03,200 So it's important to know which state is worth being in. 561 00:35:03,200 --> 00:35:07,960 So that's the next thing we try to look at. 562 00:35:07,960 --> 00:35:12,410 How does the state affect things? 563 00:35:12,410 --> 00:35:17,760 This brings us to one example which is particularly useful. 564 00:35:17,760 --> 00:35:22,360 And along with being a useful example, well, it's a nice 565 00:35:22,360 --> 00:35:25,840 illustration of Markov rewards. 566 00:35:25,840 --> 00:35:30,980 It's also something which you often want to find. 567 00:35:30,980 --> 00:35:35,800 And when we start talking about renewal processes, you 568 00:35:35,800 --> 00:35:40,890 will find that this idea here is a nice connection between 569 00:35:40,890 --> 00:35:43,340 Markov chains and renewal processes. 570 00:35:43,340 --> 00:35:47,240 So it's important for a whole bunch of different reasons. 571 00:35:47,240 --> 00:35:48,220 OK. 572 00:35:48,220 --> 00:35:52,470 Suppose we have some arbitrary unit chain, namely a chain 573 00:35:52,470 --> 00:35:56,060 with a single set of recurrent states. 574 00:35:56,060 --> 00:35:59,710 We want to find the expected number of steps, starting from 575 00:35:59,710 --> 00:36:04,260 a given state i, until some particular 576 00:36:04,260 --> 00:36:06,560 state 1 is first entered. 577 00:36:06,560 --> 00:36:09,070 So you start at one state. 578 00:36:09,070 --> 00:36:12,090 There's this other state way over here. 
579 00:36:12,090 --> 00:36:15,690 This state is recurrent, so presumably, eventually you're 580 00:36:15,690 --> 00:36:17,580 going to enter it. 581 00:36:17,580 --> 00:36:20,170 And you want to find out, what's the expected time that 582 00:36:20,170 --> 00:36:23,810 it takes to get to that particular state? 583 00:36:23,810 --> 00:36:26,110 OK? 584 00:36:26,110 --> 00:36:30,160 If you're a Ph.D. student, you have this Markov chain of 585 00:36:30,160 --> 00:36:32,310 doing your research. 586 00:36:32,310 --> 00:36:36,180 And at some point, you're going to get a Ph.D. So we can 587 00:36:36,180 --> 00:36:39,900 think of this as the first-passage time to your first 588 00:36:39,900 --> 00:36:44,500 Ph.D. I mean, if you want to get more Ph.D.'s, fine, but 589 00:36:44,500 --> 00:36:47,560 that's probably a different Markov chain. 590 00:36:47,560 --> 00:36:48,550 OK. 591 00:36:48,550 --> 00:36:53,110 So anyway, that's the problem we're trying to solve here. 592 00:36:53,110 --> 00:36:56,690 We can view this problem as a reward problem. 593 00:36:56,690 --> 00:36:59,750 We have to go through a number of steps if we want to view it 594 00:36:59,750 --> 00:37:01,940 as a reward problem. 595 00:37:01,940 --> 00:37:07,390 The first one, first step is to assign one unit of reward 596 00:37:07,390 --> 00:37:11,430 to each successive state until you enter state 1. 597 00:37:11,430 --> 00:37:15,040 So you're bombing through this Markov chain, a frog jumping 598 00:37:15,040 --> 00:37:17,120 from lily pad to lily pad. 599 00:37:17,120 --> 00:37:19,590 And finally, the frog gets to the lily pad 600 00:37:19,590 --> 00:37:21,500 with the food on it. 601 00:37:21,500 --> 00:37:25,780 And the frog wants to know, is it going to starve before it 602 00:37:25,780 --> 00:37:28,830 gets to this lily pad with the food on it? 
603 00:37:28,830 --> 00:37:32,940 So, if we're trying to find the expected time to get 604 00:37:32,940 --> 00:37:35,850 there, here what we're really interested in is a cost, 605 00:37:35,850 --> 00:37:39,920 because the frog is in danger of starving. 606 00:37:39,920 --> 00:37:42,220 Or on the other hand, there might be a snake lying under 607 00:37:42,220 --> 00:37:44,470 this one lily pad. 608 00:37:44,470 --> 00:37:47,770 And then he's getting a reward for staying alive. 609 00:37:47,770 --> 00:37:51,390 You can look at these things whichever way you want to. 610 00:37:51,390 --> 00:37:51,880 OK. 611 00:37:51,880 --> 00:37:55,020 We're going to assign one unit of reward to each successive 612 00:37:55,020 --> 00:37:56,800 state until state 1 is entered. 613 00:37:56,800 --> 00:38:01,430 1 is just an arbitrary state that we've selected. 614 00:38:01,430 --> 00:38:04,760 That's where the snake is underneath a lily pad, or 615 00:38:04,760 --> 00:38:08,130 that's where the food is, or what have you. 616 00:38:08,130 --> 00:38:10,450 Now, there's something else we have to do. 617 00:38:10,450 --> 00:38:17,010 Because if we're starting out at some arbitrary state i, and 618 00:38:17,010 --> 00:38:19,910 we're trying to look for the first time that we enter state 619 00:38:19,910 --> 00:38:23,695 1, what do you do after you enter state 1? 620 00:38:26,670 --> 00:38:32,400 Well eventually, normally you're going to go away from 621 00:38:32,400 --> 00:38:34,110 state 1, and you're going to start 622 00:38:34,110 --> 00:38:36,380 picking up rewards again. 623 00:38:36,380 --> 00:38:38,990 You don't want that to happen. 624 00:38:38,990 --> 00:38:42,020 So you do something we do all the time when we're dealing 625 00:38:42,020 --> 00:38:45,510 with Markov chains, which is we start with one Markov 626 00:38:45,510 --> 00:38:49,070 chain, and we say, to solve this problem I'm interested 627 00:38:49,070 --> 00:38:52,110 in, I've got to change the Markov chain. 
628 00:38:52,110 --> 00:38:54,350 So how are we going to change it? 629 00:38:54,350 --> 00:38:58,160 We're going to change it to say, once we get in state 1, 630 00:38:58,160 --> 00:38:59,455 we're going to stay there forever. 631 00:39:02,070 --> 00:39:04,600 Or in other words, the frog gets eaten by the snake, and 632 00:39:04,600 --> 00:39:09,650 therefore its remains always stay at that one lily pad. 633 00:39:09,650 --> 00:39:11,750 So we change the Markov chain again. 634 00:39:11,750 --> 00:39:14,450 The frog can't jump anymore. 635 00:39:14,450 --> 00:39:18,290 And the way we change it is to set the transition 636 00:39:18,290 --> 00:39:23,910 probability out of state 1, p sub 1, 1, namely the 637 00:39:23,910 --> 00:39:27,010 probability, given you're in state 1, of going back to 638 00:39:27,010 --> 00:39:30,320 state 1 in the next transition, equal to 1. 639 00:39:30,320 --> 00:39:32,670 So whenever you get to state 1, you 640 00:39:32,670 --> 00:39:35,270 just stay there forever. 641 00:39:35,270 --> 00:39:39,210 We're going to set r1 equal to zero, namely the reward you 642 00:39:39,210 --> 00:39:42,240 get in state 1 will be zero. 643 00:39:42,240 --> 00:39:46,070 So you keep getting rewards until you go to state 1. 644 00:39:46,070 --> 00:39:49,840 And then when you go to state 1, you don't get any reward. 645 00:39:49,840 --> 00:39:54,150 You don't get any reward at any time after that. 646 00:39:54,150 --> 00:39:56,600 So in fact, we've converted the problem. 647 00:39:56,600 --> 00:39:59,970 We've converted the Markov chain to be able to solve the 648 00:39:59,970 --> 00:40:03,160 problem that we want to solve. 649 00:40:03,160 --> 00:40:07,660 Now, how do we know that we haven't changed the problem in 650 00:40:07,660 --> 00:40:10,330 some awful way? 
651 00:40:10,330 --> 00:40:13,710 I mean, any time you start out with a Markov chain and you 652 00:40:13,710 --> 00:40:16,510 modify it, and you solve a problem for the modified 653 00:40:16,510 --> 00:40:20,410 chain, you have to really think through whether you 654 00:40:20,410 --> 00:40:23,550 changed the problem that you started to solve. 655 00:40:23,550 --> 00:40:27,790 Well, think of any sample path which starts in some state i, 656 00:40:27,790 --> 00:40:29,610 which is not equal to 1. 657 00:40:29,610 --> 00:40:33,930 Think of the sample path as going forever. 658 00:40:33,930 --> 00:40:38,430 In the original Markov chain, that sample path at some 659 00:40:38,430 --> 00:40:43,050 point, presumably, is going to get to state 1. 660 00:40:43,050 --> 00:40:47,100 After it gets to state 1, we don't care what happens, 661 00:40:47,100 --> 00:40:51,520 because we then know how long it's taken to get to state 1. 662 00:40:51,520 --> 00:40:54,550 And after it gets to state 1, the transition 663 00:40:54,550 --> 00:40:56,410 probabilities change. 664 00:40:56,410 --> 00:40:58,410 We don't care about that. 665 00:40:58,410 --> 00:41:03,570 So for every sample path, the first-passage 666 00:41:03,570 --> 00:41:08,370 time to state 1 is the same in the modified chain 667 00:41:08,370 --> 00:41:10,920 as it is in the original chain. 668 00:41:10,920 --> 00:41:15,750 The transition probabilities are the same up until the time 669 00:41:15,750 --> 00:41:17,770 when you first get to state 1. 670 00:41:17,770 --> 00:41:22,300 So for first-passage-time problems, it doesn't make any 671 00:41:22,300 --> 00:41:26,550 difference what you do after you get to state 1. 672 00:41:26,550 --> 00:41:30,590 So to make the problem easy, we're going to set this 673 00:41:30,590 --> 00:41:34,450 transition probability in state 1 to 1, and we're going 674 00:41:34,450 --> 00:41:38,830 to set the reward equal to zero. 
675 00:41:38,830 --> 00:41:46,710 What do you call a state which has p sub i, i equal to 1? 676 00:41:46,710 --> 00:41:48,700 You call it a trapping state. 677 00:41:48,700 --> 00:41:51,080 It's a trapping state because once you get there, 678 00:41:51,080 --> 00:41:52,330 you can't get out. 679 00:41:55,500 --> 00:41:59,710 And since we started out with a unit chain, and since 680 00:41:59,710 --> 00:42:03,650 presumably state 1 is a recurrent state in that unit 681 00:42:03,650 --> 00:42:06,500 chain, eventually you're going to get to state 1. 682 00:42:06,500 --> 00:42:08,560 But once you get there, you can't get out. 683 00:42:08,560 --> 00:42:11,690 So what you've done is you've turned the unit chain into 684 00:42:11,690 --> 00:42:15,200 another unit chain where the recurrent set of states has 685 00:42:15,200 --> 00:42:17,900 only this one state, state 1, in it. 686 00:42:17,900 --> 00:42:19,690 So it's a trapping state. 687 00:42:19,690 --> 00:42:23,920 Everything eventually leads to state 1. 688 00:42:23,920 --> 00:42:26,600 All roads lead to Rome, but it's not obvious that they're 689 00:42:26,600 --> 00:42:28,350 leading to Rome. 690 00:42:28,350 --> 00:42:31,480 And all of these states eventually lead to state 1, 691 00:42:31,480 --> 00:42:34,420 but not for quite a while sometimes. 692 00:42:34,420 --> 00:42:35,050 OK. 693 00:42:35,050 --> 00:42:37,710 So the probability of an initial segment until 1 is 694 00:42:37,710 --> 00:42:41,960 entered is unchanged, and the expected first-passage time 695 00:42:41,960 --> 00:42:43,210 is unchanged. 696 00:42:45,630 --> 00:42:45,770 OK. 697 00:42:45,770 --> 00:42:50,430 The modified Markov chain is now an ergodic unit chain. 698 00:42:50,430 --> 00:42:53,580 It has a single recurrent state. 699 00:42:53,580 --> 00:42:57,150 State 1 is a trapping state, we call it. 700 00:42:57,150 --> 00:43:03,730 ri is equal to 1 for i not equal to 1, and r1 is equal to zero. 
701 00:43:03,730 --> 00:43:08,480 This says that if state 1 is first entered at time l, then 702 00:43:08,480 --> 00:43:13,770 the aggregate reward from 0 to n is l for all n greater than 703 00:43:13,770 --> 00:43:14,335 or equal to l. 704 00:43:14,335 --> 00:43:16,780 In other words, after you get to the trapping state, you 705 00:43:16,780 --> 00:43:19,410 stay there, and you don't pick up any more 706 00:43:19,410 --> 00:43:21,250 reward from then on. 707 00:43:21,250 --> 00:43:23,970 One of the things that's maddening about problems like 708 00:43:23,970 --> 00:43:26,720 this, at least that's maddening for me, because I 709 00:43:26,720 --> 00:43:30,710 can't keep those things straight, is the difference 710 00:43:30,710 --> 00:43:34,290 between n and n plus 1, or n and n minus 1. 711 00:43:34,290 --> 00:43:37,280 There's always that strange thing, we've started at time 712 00:43:37,280 --> 00:43:40,270 m, we get reward at time m. 713 00:43:40,270 --> 00:43:43,600 So if we're looking at n rewards, we go from m 714 00:43:43,600 --> 00:43:46,860 to m plus n minus 1. 715 00:43:46,860 --> 00:43:50,150 And that's just life. 716 00:43:50,150 --> 00:43:52,910 If you try to do it in a different way, you wind up 717 00:43:52,910 --> 00:43:54,800 with a similar problem. 718 00:43:54,800 --> 00:43:56,220 You can't avoid it. 719 00:43:56,220 --> 00:44:02,130 OK, so what we're trying to find is this expected value, 720 00:44:02,130 --> 00:44:06,470 v sub i of n, and the limit as n goes to infinity, we'll just 721 00:44:06,470 --> 00:44:10,640 call that v sub i without the n on it. 722 00:44:10,640 --> 00:44:14,620 And what we want to do is to calculate this expected time 723 00:44:14,620 --> 00:44:18,040 until we first enter state 1. 724 00:44:18,040 --> 00:44:22,900 We want to calculate that for all of the other states i. 725 00:44:22,900 --> 00:44:26,980 Well fortunately, there's a sneaky way to calculate this. 
726 00:44:26,980 --> 00:44:29,170 For most of these problems, there's a sneaky way to 727 00:44:29,170 --> 00:44:30,680 calculate these limits. 728 00:44:30,680 --> 00:44:34,640 And you don't have to worry about the limit. 729 00:44:34,640 --> 00:44:37,010 So the next thing I'm going to do is to explain what this 730 00:44:37,010 --> 00:44:39,760 sneaky way is. 731 00:44:39,760 --> 00:44:44,710 You will see the same sneaky method done about 100 times 732 00:44:44,710 --> 00:44:46,460 from now on until the end of the course. 733 00:44:46,460 --> 00:44:48,760 We use it all the time. 734 00:44:48,760 --> 00:44:52,250 And each time we do it, we'll get a better sense of what it 735 00:44:52,250 --> 00:44:53,710 really amounts to. 736 00:44:53,710 --> 00:44:59,150 So for each state not equal to the trapping state, let's 737 00:44:59,150 --> 00:45:02,290 start out by assuming that we start at time 738 00:45:02,290 --> 00:45:04,470 zero in state i. 739 00:45:04,470 --> 00:45:08,580 In other words, what this means is first we're going to 740 00:45:08,580 --> 00:45:12,490 assume that x sub 0 equals i for some given i. 741 00:45:12,490 --> 00:45:14,300 We're going to go through whatever we're going to go 742 00:45:14,300 --> 00:45:17,620 through, then we'll go back and assume that x sub 0 is 743 00:45:17,620 --> 00:45:18,890 some other i. 744 00:45:18,890 --> 00:45:21,800 And we don't have to worry about that, because i is just 745 00:45:21,800 --> 00:45:22,900 a generic state. 746 00:45:22,900 --> 00:45:26,320 So we'll do it for everything at once. 747 00:45:26,320 --> 00:45:30,630 There's a unit reward at time 0. 748 00:45:30,630 --> 00:45:32,970 r sub i is equal to 1. 749 00:45:32,970 --> 00:45:37,270 So we start out at time zero in state i. 750 00:45:37,270 --> 00:45:41,070 We pick up our reward of 1, and then we go on from there 751 00:45:41,070 --> 00:45:46,370 to see how much longer it takes to get to state 1. 
752 00:45:46,370 --> 00:45:53,170 In addition to this unit reward at time zero, which 753 00:45:53,170 --> 00:45:56,430 means it's already taken us one unit of time to get to 754 00:45:56,430 --> 00:46:02,120 state 1, given that x sub 1 equals j, namely, given that 755 00:46:02,120 --> 00:46:07,910 we go from state i to state j, the remaining expected reward 756 00:46:07,910 --> 00:46:10,380 is v sub j. 757 00:46:10,380 --> 00:46:15,830 In other words, at time 0, I'm in some state i. 758 00:46:15,830 --> 00:46:21,110 Given that I go to some state j at the next unit of time, 759 00:46:21,110 --> 00:46:24,930 what's the remaining expected time to 760 00:46:24,930 --> 00:46:27,560 get to state 1? 761 00:46:27,560 --> 00:46:32,830 The remaining expected time is just v sub j, because that's 762 00:46:32,830 --> 00:46:34,050 the expected time. 763 00:46:34,050 --> 00:46:37,550 I mean, if v sub j is something where it's very hard 764 00:46:37,550 --> 00:46:41,560 to get to state 1, then we really lost out. 765 00:46:41,560 --> 00:46:44,370 If it's something which is closer to state 1 in some 766 00:46:44,370 --> 00:46:45,730 sense, then we've gained. 767 00:46:45,730 --> 00:46:51,180 But what we wind up with is the expected time to get to 768 00:46:51,180 --> 00:46:55,370 state 1 from state i is 1. 769 00:46:55,370 --> 00:46:59,450 That's the instant reward that we get, or the instant cost 770 00:46:59,450 --> 00:47:04,880 that we pay, plus, for each of the possible states 771 00:47:04,880 --> 00:47:06,420 we might get to, 772 00:47:06,420 --> 00:47:11,290 the cost to go, or reward to go, from that 773 00:47:11,290 --> 00:47:12,470 particular j. 774 00:47:12,470 --> 00:47:15,320 So this is the formula we have to solve. 775 00:47:15,320 --> 00:47:16,190 What's this mean? 776 00:47:16,190 --> 00:47:20,280 It means we have to solve this formula for all i. 
777 00:47:20,280 --> 00:47:24,870 If I solve it for all i, and I've solved this for all i, 778 00:47:24,870 --> 00:47:28,910 then we have a set of linear equations in the variables v 779 00:47:28,910 --> 00:47:40,010 sub 2 up to v sub M, one equation for each i equals 2, up to M. 780 00:47:40,010 --> 00:47:44,660 We also have decided that v sub 1 is equal to 0. 781 00:47:44,660 --> 00:47:48,350 In other words, if we start out in state 1, the expected 782 00:47:48,350 --> 00:47:50,670 time to get to state 1 is 0. 783 00:47:50,670 --> 00:47:53,260 We're already there. 784 00:47:53,260 --> 00:47:53,730 OK. 785 00:47:53,730 --> 00:47:57,300 So we have to solve these linear equations. 786 00:47:57,300 --> 00:48:03,130 And if your philosophy on solving linear equations is 787 00:48:03,130 --> 00:48:08,930 that of, I shouldn't say a computer scientist because I 788 00:48:08,930 --> 00:48:11,830 don't want to indicate that they are any different from 789 00:48:11,830 --> 00:48:16,960 any of the rest of us, but for many people, your philosophy 790 00:48:16,960 --> 00:48:20,720 of solving linear equations is to try to solve it. 791 00:48:20,720 --> 00:48:24,440 If you can't solve it, it doesn't have any solution. 792 00:48:24,440 --> 00:48:28,020 And if you're happy with doing that, fine. 793 00:48:28,020 --> 00:48:33,480 Some people would rather spend 10 hours asking whether in 794 00:48:33,480 --> 00:48:37,030 general it has any solution, rather than spending five 795 00:48:37,030 --> 00:48:38,806 minutes solving it. 796 00:48:38,806 --> 00:48:48,420 So either way, this expected first-passage time, we've just 797 00:48:48,420 --> 00:48:50,390 stated what it is. 798 00:48:50,390 --> 00:48:57,710 Starting in state i, it's 1 plus the time to go from whatever 799 00:48:57,710 --> 00:48:59,840 other state you happen to go to. 
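These equations, v1 equals 0 and v sub i equals 1 plus the sum over j of p sub ij times v sub j for i not equal to 1, reduce to a single linear solve over the non-trapping states. A minimal sketch with a hypothetical three-state unit chain (the numbers are invented, and state 0 below plays the role of the lecture's state 1):

```python
import numpy as np

# Hypothetical unit chain (illustrative numbers only); index 0 plays the
# role of the lecture's "state 1", the state we want to enter first.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

def expected_first_passage_times(P, target=0):
    """Solve v_target = 0 and v_i = 1 + sum_j P_ij v_j for i != target.
    Since v_target = 0, only transitions among non-target states matter,
    so the system reduces to (I - Q) v = 1, where Q is the sub-matrix of
    P restricted to the non-target states."""
    M = P.shape[0]
    others = [i for i in range(M) if i != target]
    Q = P[np.ix_(others, others)]
    v = np.zeros(M)
    v[others] = np.linalg.solve(np.eye(M - 1) - Q, np.ones(M - 1))
    return v

v = expected_first_passage_times(P)
```

The returned vector satisfies exactly the fixed-point relation stated above: each entry is 1 plus the expected time-to-go from the next state.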
800 00:48:59,840 --> 00:49:03,910 If we put this in vector form, you put things in vector form 801 00:49:03,910 --> 00:49:06,670 because you want to spend two hours finding the general 802 00:49:06,670 --> 00:49:09,685 solution, rather than five minutes solving the problem. 803 00:49:14,240 --> 00:49:18,660 If you have 1,000 states, then it works the other way. 804 00:49:18,660 --> 00:49:22,300 It takes you multiple hours to work it out by hand, and it 805 00:49:22,300 --> 00:49:25,430 takes you five minutes by looking at the equation. 806 00:49:25,430 --> 00:49:29,240 So sometimes you win, and sometimes you lose by looking 807 00:49:29,240 --> 00:49:30,780 at the general solution. 808 00:49:30,780 --> 00:49:37,360 If you look at this as a vector equation, the vector v, 809 00:49:37,360 --> 00:49:43,080 where v1 is equal to zero, and the other v's are unknowns, satisfies 810 00:49:43,080 --> 00:49:47,590 v equals r plus Pv, where the vector r is 0 in state 1. 811 00:49:47,590 --> 00:49:50,030 Zero reward in state 1. 812 00:49:50,030 --> 00:49:53,020 Unit reward in all other states, because we're trying 813 00:49:53,020 --> 00:49:55,860 to get to this end. 814 00:49:55,860 --> 00:50:00,780 And then we have the matrix here, P times v. 815 00:50:00,780 --> 00:50:04,780 So we want to solve this set of linear equations, and what 816 00:50:04,780 --> 00:50:08,720 do we know about this set of linear equations? 817 00:50:08,720 --> 00:50:11,890 We have an ergodic unit chain. 818 00:50:11,890 --> 00:50:16,410 We know that P has an eigenvalue, 819 00:50:16,410 --> 00:50:18,700 which is equal to 1. 820 00:50:18,700 --> 00:50:22,040 We know that's a simple eigenvalue. 821 00:50:22,040 --> 00:50:37,130 So in fact, we can write v equals r plus Pv as zero 822 00:50:37,130 --> 00:50:50,070 equals r plus the quantity P minus I, times v. 823 00:50:50,070 --> 00:50:52,190 And when we try to ask whether v has any 824 00:50:52,190 --> 00:50:55,040 solution, what's the answer? 
825 00:50:55,040 --> 00:50:59,140 Well, this matrix P here has an eigenvalue of 1. 826 00:50:59,140 --> 00:51:02,030 Since it has an eigenvalue of one, and since it's a simple 827 00:51:02,030 --> 00:51:06,160 eigenvalue, there's a space of solutions to this equation. 828 00:51:06,160 --> 00:51:11,330 The space of solutions is spanned by the 829 00:51:11,330 --> 00:51:12,850 vector e of all ones. 830 00:51:12,850 --> 00:51:17,650 In other words, it's the vector e times any constant alpha. 831 00:51:17,650 --> 00:51:21,460 Now we've stuck this in here, so now we want to find out 832 00:51:21,460 --> 00:51:25,200 what's the set of solutions now. 833 00:51:25,200 --> 00:51:31,730 We observe that v plus alpha e also satisfies this equation, so if we 834 00:51:31,730 --> 00:51:33,500 found one solution, we've found another solution. 835 00:51:33,500 --> 00:51:37,450 So if we found a solution, we have a one dimensional family 836 00:51:37,450 --> 00:51:40,110 of solutions. 837 00:51:40,110 --> 00:51:47,520 Well, since this eigenvalue is a simple eigenvalue, the set 838 00:51:47,520 --> 00:51:56,040 of vectors v for which r is equal to I minus P, times v, is 839 00:51:56,040 --> 00:51:59,390 a one dimensional family, and therefore, once we set v1 to zero, there has to be a 840 00:51:59,390 --> 00:52:02,350 unique solution to this equation. 841 00:52:02,350 --> 00:52:03,490 OK. 842 00:52:03,490 --> 00:52:07,460 So in fact, in only 15 minutes, we've solved the 843 00:52:07,460 --> 00:52:13,710 problem in general, so that you can deal with matrices of 844 00:52:13,710 --> 00:52:17,990 1,000 states, as opposed to two states. 845 00:52:17,990 --> 00:52:20,170 And you still have the same answer. 846 00:52:20,170 --> 00:52:21,840 OK. 847 00:52:21,840 --> 00:52:26,970 So this equation has a unique solution, which says that you 848 00:52:26,970 --> 00:52:29,850 can program your computer to solve this set of linear 849 00:52:29,850 --> 00:52:33,270 equations, and you're bound to get an answer. 
850 00:52:33,270 --> 00:52:35,740 And the answer will tell you how long it takes to get to 851 00:52:35,740 --> 00:52:39,958 this particular state. 852 00:52:39,958 --> 00:52:40,390 OK. 853 00:52:40,390 --> 00:52:46,705 Let's go on to aggregate rewards with a final reward. 854 00:52:51,420 --> 00:52:53,560 Starting to sound like-- yes? 855 00:52:53,560 --> 00:52:56,990 STUDENT: I'm sorry, for the last example, how are we 856 00:52:56,990 --> 00:52:57,970 guaranteed that it's ergodic? 857 00:52:57,970 --> 00:53:01,370 Like, isn't it possible you enter a loop somewhere that can never 858 00:53:01,370 --> 00:53:05,670 go to your trapping state, right? 859 00:53:05,670 --> 00:53:09,750 PROFESSOR: But I can't do that because there always has to be 860 00:53:09,750 --> 00:53:12,520 a way of getting to the trapping state, because 861 00:53:12,520 --> 00:53:14,770 there's only one recurrent state. 862 00:53:14,770 --> 00:53:19,170 All these other states are transient now. 863 00:53:19,170 --> 00:53:19,920 STUDENT: No, but I mean-- 864 00:53:19,920 --> 00:53:21,467 OK, like, let's say you start off with a 865 00:53:21,467 --> 00:53:22,655 general Markov chain. 866 00:53:22,655 --> 00:53:24,560 PROFESSOR: Oh, I start off with a general Markov chain? 867 00:53:24,560 --> 00:53:27,060 You're absolutely right. 868 00:53:27,060 --> 00:53:30,060 Then there might be no way of getting from some starting 869 00:53:30,060 --> 00:53:34,610 state to state 1, and therefore, the amount of time 870 00:53:34,610 --> 00:53:36,890 that it takes you to get from that starting state to the trapping 871 00:53:36,890 --> 00:53:38,750 state is going to be infinite. 872 00:53:38,750 --> 00:53:40,250 You can't get there. 873 00:53:40,250 --> 00:53:43,960 So in fact, what you have to do with a problem like this is 874 00:53:43,960 --> 00:53:48,730 to look at it first, and say, are you in fact dealing with a 875 00:53:48,730 --> 00:53:49,760 unit chain? 
876 00:53:49,760 --> 00:53:52,990 Or do you have multiple recurrent sets? 877 00:53:52,990 --> 00:53:57,100 If you have multiple recurrent sets, then the expected time 878 00:53:57,100 --> 00:54:00,770 to get into one of the recurrent states, starting 879 00:54:00,770 --> 00:54:04,840 from either a transient state, or from some other recurrent 880 00:54:04,840 --> 00:54:08,720 set is infinite. 881 00:54:08,720 --> 00:54:11,820 I mean, just like this business we were going through 882 00:54:11,820 --> 00:54:13,480 at the beginning. 883 00:54:13,480 --> 00:54:16,050 What you would like to do is not have to go through a lot 884 00:54:16,050 --> 00:54:20,750 of calculation, or a lot of thinking, when you 885 00:54:20,750 --> 00:54:24,070 have multiple recurrent sets of states. 886 00:54:24,070 --> 00:54:25,980 You just know what happens there. 887 00:54:25,980 --> 00:54:28,540 There's no way to get from this recurrent set to this 888 00:54:28,540 --> 00:54:30,020 recurrent set. 889 00:54:30,020 --> 00:54:31,440 So that's the end of it. 890 00:54:31,440 --> 00:54:31,888 STUDENT: OK. 891 00:54:31,888 --> 00:54:34,277 So like it works when you have the unit chain, and then you 892 00:54:34,277 --> 00:54:36,585 choose your trapping state to be one instance [INAUDIBLE]. 893 00:54:36,585 --> 00:54:37,835 PROFESSOR: Yes. 894 00:54:39,700 --> 00:54:40,150 OK. 895 00:54:40,150 --> 00:54:41,400 Good. 896 00:54:44,220 --> 00:54:47,160 Now, yes? 897 00:54:47,160 --> 00:54:50,410 STUDENT: The previous equation is true for any reward. 898 00:54:50,410 --> 00:54:51,692 But it's not necessary-- 899 00:54:51,692 --> 00:54:53,950 PROFESSOR: Yeah, it is true for any set of rewards, yes. 900 00:54:59,720 --> 00:55:02,090 Although what the interpretation would be of any 901 00:55:02,090 --> 00:55:05,900 set of rewards is something you have to sort out. 902 00:55:05,900 --> 00:55:06,590 But yes. 
903 00:55:06,590 --> 00:55:10,200 For any r that you choose, there's going to be one unique 904 00:55:10,200 --> 00:55:15,530 solution, so long as state 1 is actually a trapping state, and 905 00:55:15,530 --> 00:55:16,950 everything else leads to state 1. 906 00:55:20,600 --> 00:55:25,875 OK, so why do I want to put a-- ah, good. 907 00:55:25,875 --> 00:55:27,537 STUDENT: I feel like a lot of the rewards are 908 00:55:27,537 --> 00:55:30,625 designed with respect to being in a 909 00:55:30,625 --> 00:55:31,575 particular state. 910 00:55:31,575 --> 00:55:32,060 PROFESSOR: Yes. 911 00:55:32,060 --> 00:55:34,340 STUDENT: But if the rewards are actually on transitions-- so 912 00:55:34,340 --> 00:55:38,012 for example, if you go from i to j, it's going to be a 913 00:55:38,012 --> 00:55:40,015 different number than from j to j. 914 00:55:40,015 --> 00:55:41,580 How do you deal with that? 915 00:55:41,580 --> 00:55:42,400 PROFESSOR: How do I deal with that? 916 00:55:42,400 --> 00:55:45,000 Well, then let's talk about that. 917 00:55:45,000 --> 00:55:48,200 And in fact, it's fairly simple so long as you're only 918 00:55:48,200 --> 00:55:50,750 talking about expected rewards. 919 00:55:50,750 --> 00:55:54,450 Because if I have a reward associated with-- 920 00:55:57,096 --> 00:56:18,574 if I have a reward rij, which is the reward for transition i 921 00:56:18,574 --> 00:56:36,600 to j, then if I take the sum of rij times pij, summed over j, 922 00:56:36,600 --> 00:56:51,768 what this gives me is the expected reward associated 923 00:56:51,768 --> 00:56:53,940 with state i. 
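In matrix terms, this conversion is just a row-wise expectation over the transition probabilities. A quick sketch, with made-up numbers for P and for the per-transition rewards R:

```python
import numpy as np

# Hypothetical transition probabilities and per-transition rewards:
# R[i][j] = reward for making the transition i -> j.
P = np.array([[0.9, 0.1],
              [0.6, 0.4]])
R = np.array([[ 0.0, 5.0],
              [-1.0, 2.0]])

# Expected reward in state i: sum over j of p_ij * r_ij,
# i.e. an elementwise product followed by a row sum.
r_state = (P * R).sum(axis=1)
print(r_state)   # -> [0.5, 0.2]
```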
924 00:56:59,750 --> 00:57:02,680 Now, you have to be a little bit careful with this because 925 00:57:02,680 --> 00:57:06,310 before we've been picking up this reward as soon as we get 926 00:57:06,310 --> 00:57:09,860 to state i, and here suddenly we have a slightly different 927 00:57:09,860 --> 00:57:14,560 situation where you have a reward associated with state i 928 00:57:14,560 --> 00:57:17,230 but you don't pick it up until the next step. 929 00:57:17,230 --> 00:57:23,780 So this is where this problem of i or i plus 1 comes in. 930 00:57:23,780 --> 00:57:29,070 And you guys can do that much better than I can, because at 931 00:57:29,070 --> 00:57:36,480 my age I start out with an age of 60 and an age of 61 is the 932 00:57:36,480 --> 00:57:38,500 same thing. 933 00:57:38,500 --> 00:57:40,880 I mean, these are-- 934 00:57:40,880 --> 00:57:42,130 OK. 935 00:57:44,790 --> 00:57:48,660 So anyway, the point of it is, if you have rewards associated 936 00:57:48,660 --> 00:57:52,320 with transitions you can always convert that to rewards 937 00:57:52,320 --> 00:57:53,570 associated with states. 938 00:57:58,320 --> 00:58:02,220 Oh, I didn't really get to this. 939 00:58:02,220 --> 00:58:06,150 What I've been trying to say now for a while is that 940 00:58:06,150 --> 00:58:13,120 sometimes, for some reason or other, after you go through 941 00:58:13,120 --> 00:58:16,990 n steps of this Markov chain, when you get to the 942 00:58:16,990 --> 00:58:21,340 end, you want to consider some particularly large reward for 943 00:58:21,340 --> 00:58:24,540 having gotten to the end, or some particularly large cost 944 00:58:24,540 --> 00:58:27,950 of getting to the end, or something which depends on the 945 00:58:27,950 --> 00:58:30,190 state that you happen to be in. 
946 00:58:30,190 --> 00:58:34,630 So we will assign some final reward which in general can be 947 00:58:34,630 --> 00:58:37,820 different from the reward that we're picking up at each of 948 00:58:37,820 --> 00:58:38,840 the other states. 949 00:58:38,840 --> 00:58:41,105 We're going to do this in a particular way. 950 00:58:47,740 --> 00:58:50,920 You would think that what we would want to do is, if we 951 00:58:50,920 --> 00:58:55,210 went through n steps, we would associate this final 952 00:58:55,210 --> 00:58:57,580 reward with the n-th step. 953 00:58:57,580 --> 00:58:59,220 We're going to do it a different way. 954 00:58:59,220 --> 00:59:02,180 We're going to go through n steps, and then the final 955 00:59:02,180 --> 00:59:05,980 reward is what happens on the step after that. 956 00:59:05,980 --> 00:59:09,480 So we're really turning the problem of looking at n steps 957 00:59:09,480 --> 00:59:13,490 into a problem of looking at n plus 1 steps. 958 00:59:13,490 --> 00:59:14,490 Why do we do that? 959 00:59:14,490 --> 00:59:16,320 Completely arbitrary. 960 00:59:16,320 --> 00:59:19,320 It turns out to be convenient when we talk about dynamic 961 00:59:19,320 --> 00:59:24,720 programming, and you'll see why in just a minute. 962 00:59:24,720 --> 00:59:29,770 So this extra final step is just an arbitrary thing that 963 00:59:29,770 --> 00:59:34,230 you add, and we'll see the main purpose for 964 00:59:34,230 --> 00:59:35,780 it in just a minute. 965 00:59:38,730 --> 00:59:39,380 OK. 966 00:59:39,380 --> 00:59:45,910 So we're going to now look at what in principle is a much 967 00:59:45,910 --> 00:59:48,880 more complicated situation than what we were looking at 968 00:59:48,880 --> 00:59:53,180 before, but you still have this basic Markov condition 969 00:59:53,180 --> 00:59:56,310 which is making things simple for you. 970 00:59:56,310 --> 01:00:00,990 So the idea is, you're looking at a discrete time situation. 
971 01:00:00,990 --> 01:00:04,260 Things happen in steps. 972 01:00:04,260 --> 01:00:07,655 There's a finite set of states which don't change over time. 973 01:00:10,530 --> 01:00:13,690 At each unit of time, you're going to be in one of the set 974 01:00:13,690 --> 01:00:20,420 of m states, and at each time l, there's some decision maker 975 01:00:20,420 --> 01:00:24,520 sitting around who looks at the state that 976 01:00:24,520 --> 01:00:26,530 you're in at time l. 977 01:00:26,530 --> 01:00:31,970 And the decision maker says I have a choice of what 978 01:00:31,970 --> 01:00:38,570 reward I'm going to pick up at this time and what the 979 01:00:38,570 --> 01:00:43,020 transition probabilities are for going to the next state. 980 01:00:43,020 --> 01:00:46,110 OK, so it's kind of a complicated thing. 981 01:00:46,110 --> 01:00:51,440 It's the same thing that you face all the time. 982 01:00:51,440 --> 01:00:54,300 I mean, in the stock market for example, you see that one 983 01:00:54,300 --> 01:00:57,010 stock is doing poorly, so you have a choice. 984 01:00:57,010 --> 01:01:03,620 Should I sell it, eat my losses, or should I keep on 985 01:01:03,620 --> 01:01:05,980 going and hope it'll turn around? 986 01:01:05,980 --> 01:01:09,980 If you're doing a thesis, you have the even worse problem. 987 01:01:09,980 --> 01:01:13,540 You go for three months without getting the result 988 01:01:13,540 --> 01:01:19,120 that you need, and you say, well, I don't have a thesis. 989 01:01:19,120 --> 01:01:21,960 I can't say something about this. 990 01:01:21,960 --> 01:01:25,280 Should I go on for one more month, or should I can it and 991 01:01:25,280 --> 01:01:27,400 go on to another topic? 992 01:01:27,400 --> 01:01:30,460 OK, it's exactly the same situation. 993 01:01:30,460 --> 01:01:34,900 So this is really a very broad set of situations. 
994 01:01:34,900 --> 01:01:37,858 The only thing that makes it really different from real 995 01:01:37,858 --> 01:01:42,260 life is this Markov property sitting there and the fact 996 01:01:42,260 --> 01:01:46,190 that you actually understand what the rewards are and you 997 01:01:46,190 --> 01:01:48,180 can predict them in advance. 998 01:01:48,180 --> 01:01:51,990 You can't predict what state you're going to be in, but you 999 01:01:51,990 --> 01:01:54,230 know that if you're in a particular state, you know 1000 01:01:54,230 --> 01:01:58,560 what your choices are in the future as well as now, and all 1001 01:01:58,560 --> 01:02:03,360 you have to do at each unit of time is to make this choice 1002 01:02:03,360 --> 01:02:05,860 between various different things. 1003 01:02:05,860 --> 01:02:08,485 You see an interesting example of that here. 1004 01:02:13,890 --> 01:02:17,430 If you look at this Markov chain here, it's a two state 1005 01:02:17,430 --> 01:02:18,680 Markov chain. 1006 01:02:21,770 --> 01:02:23,860 And what's the steady state probability of 1007 01:02:23,860 --> 01:02:25,150 being in state one? 1008 01:02:32,420 --> 01:02:33,670 Anybody? 1009 01:02:35,596 --> 01:02:37,050 It's a half, yes. 1010 01:02:37,050 --> 01:02:40,480 Why is it a half, and why don't you have 1011 01:02:40,480 --> 01:02:41,930 to solve for this? 1012 01:02:41,930 --> 01:02:45,150 Why can you look at it and say it's a half? 1013 01:02:45,150 --> 01:02:46,740 Because it's completely symmetric. 1014 01:02:46,740 --> 01:02:53,930 0.99 here, 0.99 here, 0.01 here, 0.01 here. 1015 01:02:53,930 --> 01:02:56,450 These rewards here had nothing to do with the 1016 01:02:56,450 --> 01:02:58,290 Markov chain itself. 1017 01:02:58,290 --> 01:03:02,210 The Markov chain is symmetric between states one and two, 1018 01:03:02,210 --> 01:03:05,200 and therefore, the steady state probabilities have to be 1019 01:03:05,200 --> 01:03:06,660 one half each. 
1020 01:03:06,660 --> 01:03:13,410 So here's something where, if you happen to be in state two, 1021 01:03:13,410 --> 01:03:15,090 you're going to stay there typically 1022 01:03:15,090 --> 01:03:17,100 for a very long time. 1023 01:03:17,100 --> 01:03:20,080 And while you're staying there for a very long time, 1024 01:03:20,080 --> 01:03:24,020 you're going to be picking up one unit of reward 1025 01:03:24,020 --> 01:03:26,930 every unit of time. 1026 01:03:26,930 --> 01:03:30,720 You work for some very stable employer who pays you very 1027 01:03:30,720 --> 01:03:33,540 little, and that's a situation you have. 1028 01:03:33,540 --> 01:03:37,880 You're sitting here, you have a job but you're not making 1029 01:03:37,880 --> 01:03:42,710 much, but still you're making something, and you have a lot 1030 01:03:42,710 --> 01:03:45,510 of job security. 1031 01:03:45,510 --> 01:03:49,760 Now, we have a different choice when we're sitting here 1032 01:03:49,760 --> 01:03:57,390 with a job in state two. You can, for example, go 1033 01:03:57,390 --> 01:04:00,300 to the cash register and take all the money out of it and 1034 01:04:00,300 --> 01:04:01,550 disappear from the company. 1035 01:04:03,920 --> 01:04:07,170 I don't advocate doing that, except, 1036 01:04:07,170 --> 01:04:09,190 it's one of your choices. 1037 01:04:09,190 --> 01:04:13,730 So you pick up a big reward of 50, and then for a long period 1038 01:04:13,730 --> 01:04:18,820 of time you go back to this state over here and you make 1039 01:04:18,820 --> 01:04:21,720 nothing in reward for a long period of time 1040 01:04:21,720 --> 01:04:23,360 while you're in jail. 1041 01:04:23,360 --> 01:04:28,650 And then eventually you pop back here, and if we assume 1042 01:04:28,650 --> 01:04:32,320 the judicial system is such that it has no memory, 1043 01:04:32,320 --> 01:04:33,670 [INAUDIBLE] 1044 01:04:33,670 --> 01:04:40,020 you can get into the cash register again, and, well, OK. 
1045 01:04:40,020 --> 01:04:43,850 So anyway, with this decision two, you're looking for instant 1046 01:04:43,850 --> 01:04:45,410 gratification here. 1047 01:04:45,410 --> 01:04:48,340 You're getting a big reward all at once, but by getting a 1048 01:04:48,340 --> 01:04:53,040 big reward, with probability one, you're going back to 1049 01:04:53,040 --> 01:04:54,390 state one. 1050 01:04:54,390 --> 01:04:57,830 From state one, it takes a long time to get back to the 1051 01:04:57,830 --> 01:05:02,940 point where you can get a big reward again, so you wonder, 1052 01:05:02,940 --> 01:05:07,020 is it better to use this policy or is it better to use 1053 01:05:07,020 --> 01:05:08,270 this policy? 1054 01:05:10,670 --> 01:05:14,160 Now, there are two basic ways to look at this problem. 1055 01:05:14,160 --> 01:05:16,280 I think it's important to understand what they are 1056 01:05:16,280 --> 01:05:18,330 before we go further. 1057 01:05:18,330 --> 01:05:24,660 One of the ways is to say, OK, let's suppose that I work out 1058 01:05:24,660 --> 01:05:30,440 which is the best policy and I use it forever. 1059 01:05:30,440 --> 01:05:34,140 Namely, I use this policy forever or I 1060 01:05:34,140 --> 01:05:36,910 use this policy forever. 1061 01:05:36,910 --> 01:05:40,570 And if I use this policy forever, I can pretty easily 1062 01:05:40,570 --> 01:05:43,470 work out what the steady state probabilities of these two 1063 01:05:43,470 --> 01:05:44,680 states are. 1064 01:05:44,680 --> 01:05:50,040 I can then work out what my expected gain is per unit time 1065 01:05:50,040 --> 01:05:52,690 and I can compare this with that. 1066 01:05:55,260 --> 01:05:58,140 And who thinks that this is going to be better than that 1067 01:05:58,140 --> 01:06:01,370 and who thinks that this is going to be better than that? 1068 01:06:01,370 --> 01:06:03,600 Well, you can work it out easily. 
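Working it out numerically, using the 0.99/0.01 probabilities and the rewards of 1 and 50 quoted above (the stationary-distribution helper is a standard least-squares construction, not anything specific to these notes):

```python
import numpy as np

def stationary(P):
    """Stationary distribution pi with pi P = pi, normalized to sum to 1."""
    m = P.shape[0]
    # Stack (P^T - I) pi = 0 with the normalization row sum(pi) = 1
    # and solve the overdetermined system by least squares.
    A = np.vstack([P.T - np.eye(m), np.ones(m)])
    b = np.zeros(m + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Decision 1: sit in state 2 and collect reward 1 per step.
P1 = np.array([[0.99, 0.01], [0.01, 0.99]])
r1 = np.array([0.0, 1.0])
# Decision 2: grab the reward of 50, which sends you back to state 1.
P2 = np.array([[0.99, 0.01], [1.00, 0.00]])
r2 = np.array([0.0, 50.0])

gain1 = stationary(P1) @ r1
gain2 = stationary(P2) @ r2
print(gain1, gain2)   # gain1 = 0.5, gain2 is about 0.495 -- a smidgen less
```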
1069 01:06:03,600 --> 01:06:07,080 It's kind of interesting because the steady state gains 1070 01:06:07,080 --> 01:06:12,940 here and here are very close to the same. 1071 01:06:12,940 --> 01:06:17,480 It turns out that this is just a smidgen better than this, 1072 01:06:17,480 --> 01:06:19,610 only by a very small amount. 1073 01:06:19,610 --> 01:06:19,950 OK. 1074 01:06:19,950 --> 01:06:25,620 See, what happens here is that here, you tend to stay for about 1075 01:06:25,620 --> 01:06:28,110 100 steps here. 1076 01:06:28,110 --> 01:06:33,090 So you pick up a total reward of about 100 if you use this very 1077 01:06:33,090 --> 01:06:34,890 simple minded analysis. 1078 01:06:34,890 --> 01:06:37,810 Then for 100 steps, you're sitting here, you're getting 1079 01:06:37,810 --> 01:06:43,020 no reward, so you think you ought to get a reward of 1080 01:06:43,020 --> 01:06:46,310 one half on the average, and that's exactly 1081 01:06:46,310 --> 01:06:48,090 what you do get here. 1082 01:06:48,090 --> 01:06:52,820 And here, you get this big reward of 50, but then you go 1083 01:06:52,820 --> 01:06:57,560 over here and you spend 100 units of time in purgatory and 1084 01:06:57,560 --> 01:07:00,470 then you get back again, you get another reward of 50 and 1085 01:07:00,470 --> 01:07:03,830 then spend 100 units of time in purgatory. 1086 01:07:03,830 --> 01:07:07,300 So again, you're getting pretty close to a half of a 1087 01:07:07,300 --> 01:07:10,690 unit of reward, but it turns out, when you work it out, 1088 01:07:10,690 --> 01:07:12,690 that this is just a smidgen less. 1089 01:07:12,690 --> 01:07:18,190 It's 1% less than a half, so this is not as good as that. 1090 01:07:18,190 --> 01:07:24,900 But suppose that you have a shorter time horizon. 1091 01:07:24,900 --> 01:07:28,380 Suppose you don't want to wait for 1,000 steps to see what's 1092 01:07:28,380 --> 01:07:32,280 going on, so you don't want to look at the average. 
1093 01:07:32,280 --> 01:07:34,280 Suppose this was a gambling game. 1094 01:07:34,280 --> 01:07:38,230 You have your choice of these two gambling options, and 1095 01:07:38,230 --> 01:07:41,820 suppose you're only going to be playing for a short time. 1096 01:07:41,820 --> 01:07:43,180 Suppose you're going to be only playing 1097 01:07:43,180 --> 01:07:44,830 for one unit of time. 1098 01:07:44,830 --> 01:07:47,180 You can only play for one unit of time and then you have to 1099 01:07:47,180 --> 01:07:50,780 stop, you have to go home, you have to go back to work, or 1100 01:07:50,780 --> 01:07:52,180 something else. 1101 01:07:52,180 --> 01:07:54,870 And you happen to be sitting in state two. 1102 01:07:54,870 --> 01:07:57,180 What do you want to do if you only have one 1103 01:07:57,180 --> 01:07:58,730 unit of time to play? 1104 01:07:58,730 --> 01:08:03,630 Well, obviously, you want to get the reward of 50, because 1105 01:08:03,630 --> 01:08:07,620 delayed gratification doesn't work here, because you don't 1106 01:08:07,620 --> 01:08:11,330 get any opportunity for that gratification later. 1107 01:08:11,330 --> 01:08:14,900 So you pick up the big reward at first. 1108 01:08:14,900 --> 01:08:18,630 So when you have this problem of playing for a finite amount 1109 01:08:18,630 --> 01:08:24,649 of time, whatever kind of situation you're in, what you 1110 01:08:24,649 --> 01:08:28,310 would like to do is say, for this finite amount of time 1111 01:08:28,310 --> 01:08:34,290 that I'm going to play, what's my best strategy then? 1112 01:08:34,290 --> 01:08:39,850 Dynamic programming is the 1113 01:08:39,850 --> 01:08:43,600 algorithm which finds out what the best thing to do is, 1114 01:08:43,600 --> 01:08:45,000 dynamically. 
1115 01:08:45,000 --> 01:08:48,670 Namely, if you're going to stop in 10 steps, stop in 100 1116 01:08:48,670 --> 01:08:52,710 steps, stop in one step, it tells you what to do under all 1117 01:08:52,710 --> 01:08:55,350 of those circumstances. 1118 01:08:55,350 --> 01:08:59,340 And the stationary policy tells you what to do if you're 1119 01:08:59,340 --> 01:09:02,630 going to play forever. 1120 01:09:02,630 --> 01:09:05,760 But in a situation like this where things happen rather 1121 01:09:05,760 --> 01:09:10,189 slowly, it might not be the relevant thing to deal with. 1122 01:09:10,189 --> 01:09:13,170 A lot of the notes deal with comparing the stationary 1123 01:09:13,170 --> 01:09:17,180 policy with this dynamic policy. 1124 01:09:17,180 --> 01:09:21,399 And I'm not going to do that here because, well, we have 1125 01:09:21,399 --> 01:09:23,290 too many other interesting things that we 1126 01:09:23,290 --> 01:09:24,170 want to deal with. 1127 01:09:24,170 --> 01:09:26,939 So we're just going to skip all of that stuff about 1128 01:09:26,939 --> 01:09:28,470 stationary policies. 1129 01:09:28,470 --> 01:09:30,670 You don't have to bother to read it unless you're 1130 01:09:30,670 --> 01:09:32,580 interested in it. 1131 01:09:32,580 --> 01:09:35,029 I mean, if you're interested in it, by all means, read it. 1132 01:09:35,029 --> 01:09:38,950 It's a very interesting topic. 1133 01:09:38,950 --> 01:09:41,580 It's not all that interesting to find out what the best 1134 01:09:41,580 --> 01:09:42,990 stationary policy is. 1135 01:09:42,990 --> 01:09:45,210 That's kind of simple. 1136 01:09:45,210 --> 01:09:48,729 What's the interesting topic is what's the comparison 1137 01:09:48,729 --> 01:09:53,100 between the dynamic policy and the stationary policy. 1138 01:09:53,100 --> 01:09:56,500 But all we're going to do is worry about what the 1139 01:09:56,500 --> 01:09:58,160 dynamic policy is. 
1140 01:09:58,160 --> 01:10:03,460 That seems like a hard problem, and someone by the 1141 01:10:03,460 --> 01:10:09,720 name of Bellman figured out what the optimal solution to 1142 01:10:09,720 --> 01:10:12,025 that dynamic policy was. 1143 01:10:12,025 --> 01:10:16,900 And it turned out to be a trivially simple algorithm, 1144 01:10:16,900 --> 01:10:20,030 and Bellman became famous forever. 1145 01:10:20,030 --> 01:10:23,080 One of the things I want to point out to you, again, I 1146 01:10:23,080 --> 01:10:27,250 keep coming back to this because you people are just 1147 01:10:27,250 --> 01:10:29,970 starting a research career. 1148 01:10:29,970 --> 01:10:34,490 Everyone in this class, given the formulation of this 1149 01:10:34,490 --> 01:10:38,670 dynamic programming problem, could develop and would 1150 01:10:38,670 --> 01:10:43,440 develop, I'm pretty sure, the dynamic programming algorithm. 1151 01:10:43,440 --> 01:10:47,020 Developing the algorithm, understanding what the problem 1152 01:10:47,020 --> 01:10:50,210 is is a trivial matter. 1153 01:10:50,210 --> 01:10:52,390 Why is Bellman famous? 1154 01:10:52,390 --> 01:10:56,270 Because he formulated the problem. 1155 01:10:56,270 --> 01:11:01,010 He said, aha, this dynamic problem is interesting. 1156 01:11:01,010 --> 01:11:04,710 I don't have to go through the stationary problem. 1157 01:11:04,710 --> 01:11:08,430 And in fact, my sense from reading his book and from 1158 01:11:08,430 --> 01:11:11,470 reading things he's written is that he couldn't have solved 1159 01:11:11,470 --> 01:11:14,240 the stationary problem because he didn't understand 1160 01:11:14,240 --> 01:11:16,750 probability that well. 1161 01:11:16,750 --> 01:11:20,600 But he did understand how to formulate what this really 1162 01:11:20,600 --> 01:11:24,330 important problem was and he solved it. 
1163 01:11:24,330 --> 01:11:27,880 So, all the more credit to him, but when you're doing 1164 01:11:27,880 --> 01:11:32,460 research, the time you spend on formulating the right 1165 01:11:32,460 --> 01:11:37,430 problem is far more important than the time you spend 1166 01:11:37,430 --> 01:11:38,390 solving it. 1167 01:11:38,390 --> 01:11:41,490 If you start out with the right problem, the solution is 1168 01:11:41,490 --> 01:11:45,650 trivial and you're all done. 1169 01:11:45,650 --> 01:11:49,930 It's hard to formulate the right problem, and you learn 1170 01:11:49,930 --> 01:11:57,810 to formulate the problem not by plugging away at all of these 1171 01:11:57,810 --> 01:12:01,570 calculations, but by sitting back and thinking 1172 01:12:01,570 --> 01:12:04,480 about the problem and trying to look at things in a more 1173 01:12:04,480 --> 01:12:06,050 general way. 1174 01:12:06,050 --> 01:12:07,660 So just another plug. 1175 01:12:07,660 --> 01:12:10,440 I've been saying this, I will probably say it every three or 1176 01:12:10,440 --> 01:12:14,420 four lectures throughout the term. 1177 01:12:14,420 --> 01:12:14,860 OK. 1178 01:12:14,860 --> 01:12:18,450 So let's go back and look at what the problem is. 1179 01:12:18,450 --> 01:12:21,330 We haven't quite formulated it yet. 1180 01:12:21,330 --> 01:12:24,940 We're going to assume this process of random transitions 1181 01:12:24,940 --> 01:12:27,790 combined with decisions based on the current state. 1182 01:12:27,790 --> 01:12:30,380 In other words, in this decision maker, the decision 1183 01:12:30,380 --> 01:12:34,960 maker at each unit of time sees what state you're in at 1184 01:12:34,960 --> 01:12:37,040 this unit of time. 
1185 01:12:37,040 --> 01:12:40,940 And seeing what state you're in at this given unit of time, 1186 01:12:40,940 --> 01:12:45,020 the decision maker has a choice of how much reward 1187 01:12:45,020 --> 01:12:51,740 is to be taken and, along with how much reward is to be 1188 01:12:51,740 --> 01:12:54,940 taken, what the transition probabilities are 1189 01:12:54,940 --> 01:12:56,160 for the next state. 1190 01:12:56,160 --> 01:13:00,150 If you rob the cash register, your transition probabilities 1191 01:13:00,150 --> 01:13:02,230 are going to be very different than if you don't 1192 01:13:02,230 --> 01:13:04,680 rob the cash register. 1193 01:13:04,680 --> 01:13:08,190 By robbing the cash register, your transition probabilities 1194 01:13:08,190 --> 01:13:10,770 go into a rather high transition probability that 1195 01:13:10,770 --> 01:13:12,270 you're going to be caught. 1196 01:13:12,270 --> 01:13:16,050 OK, so you don't want that. 1197 01:13:16,050 --> 01:13:20,600 So you can't avoid the problem of having the rewards at a 1198 01:13:20,600 --> 01:13:24,290 given time locked into what the transition probabilities 1199 01:13:24,290 --> 01:13:27,990 are for going to the next state, and that's the essence 1200 01:13:27,990 --> 01:13:29,890 of this problem. 1201 01:13:29,890 --> 01:13:30,500 OK. 1202 01:13:30,500 --> 01:13:33,470 So, the decision maker observes the state and 1203 01:13:33,470 --> 01:13:36,530 chooses one of a finite set of alternatives. 1204 01:13:36,530 --> 01:13:39,790 Each alternative consists of a current reward, which we'll 1205 01:13:39,790 --> 01:13:44,030 call r sub j of k, where the alternative is k, and a set of 1206 01:13:44,030 --> 01:13:45,980 transition probabilities, 1207 01:13:45,980 --> 01:13:50,250 p sub jl of k, with 1 less than or equal to l less than or 1208 01:13:50,250 --> 01:13:52,750 equal to m, for going to the next state. 
1209 01:13:52,750 --> 01:13:56,450 OK, the notation here is horrifying, but the idea is 1210 01:13:56,450 --> 01:13:57,880 very simple. 1211 01:13:57,880 --> 01:14:01,370 I mean, once you get used to the notation, there's nothing 1212 01:14:01,370 --> 01:14:04,880 complicated here at all. 1213 01:14:04,880 --> 01:14:08,940 OK, so in this example here, well, we already 1214 01:14:08,940 --> 01:14:10,190 talked about that. 1215 01:14:13,120 --> 01:14:14,990 We're going to start out at time m. 1216 01:14:17,960 --> 01:14:21,150 We're going to make a decision at time m, pick up the 1217 01:14:21,150 --> 01:14:28,090 associated reward for that decision, and pick the 1218 01:14:28,090 --> 01:14:30,970 transition probabilities that we're going to use at that 1219 01:14:30,970 --> 01:14:33,460 time m, and then go on to the next state. 1220 01:14:33,460 --> 01:14:36,380 We're going to continue doing this until time m 1221 01:14:36,380 --> 01:14:37,960 plus n minus 1. 1222 01:14:37,960 --> 01:14:41,450 Namely, we're going to do this for n steps of time. 1223 01:14:41,450 --> 01:14:43,690 After the n-th decision-- 1224 01:14:43,690 --> 01:14:47,140 you make the n-th decision at m plus n minus 1-- 1225 01:14:47,140 --> 01:14:52,270 there's a final transition based on that decision. 1226 01:14:52,270 --> 01:14:55,490 The final transition is based on that decision, but the 1227 01:14:55,490 --> 01:14:58,345 final reward is fixed ahead of time. 1228 01:14:58,345 --> 01:15:01,500 You know what the final reward is going to be, which happens 1229 01:15:01,500 --> 01:15:03,480 at time m plus n. 1230 01:15:03,480 --> 01:15:07,465 So the things which are variable are how much reward 1231 01:15:07,465 --> 01:15:14,070 you get at each of these first n time units, and what 1232 01:15:14,070 --> 01:15:17,870 probabilities you choose for going to the next state. 1233 01:15:17,870 --> 01:15:20,170 Is this still a Markov chain? 1234 01:15:20,170 --> 01:15:21,420 Is this still Markov? 
1235 01:15:24,750 --> 01:15:26,560 You can talk about this for a long time. 1236 01:15:26,560 --> 01:15:30,460 You can think about it for a long time because this 1237 01:15:30,460 --> 01:15:34,770 decision maker might or might not be Markov. 1238 01:15:34,770 --> 01:15:38,870 What is Markov is the transition probabilities that 1239 01:15:38,870 --> 01:15:41,380 are taking place in each unit of time. 1240 01:15:41,380 --> 01:15:46,410 After I make a decision, the transition probabilities are 1241 01:15:46,410 --> 01:15:51,370 fixed for that decision and that initial state and have 1242 01:15:51,370 --> 01:15:54,650 nothing to do with the decisions that were made 1243 01:15:54,650 --> 01:15:58,650 before that or the states you were in before that. 1244 01:15:58,650 --> 01:16:02,220 The Markov condition says that what happens in the next unit 1245 01:16:02,220 --> 01:16:06,020 of time is a function simply of those transition 1246 01:16:06,020 --> 01:16:10,370 probabilities that have been chosen. 1247 01:16:10,370 --> 01:16:13,530 We will see that when we look at the algorithm, and then you 1248 01:16:13,530 --> 01:16:16,670 can sort out for yourselves whether there's something 1249 01:16:16,670 --> 01:16:18,190 dishonest here or not. 1250 01:16:18,190 --> 01:16:25,480 Turns out there isn't, but to Bellman's credit he did sort 1251 01:16:25,480 --> 01:16:28,740 out correctly that this worked, and many people for a 1252 01:16:28,740 --> 01:16:30,520 long time did not think it worked. 1253 01:16:34,080 --> 01:16:37,150 So the objective of dynamic programming is both to 1254 01:16:37,150 --> 01:16:41,540 determine the optimal decision at each time and to determine 1255 01:16:41,540 --> 01:16:45,040 the expected reward for each starting state and for each 1256 01:16:45,040 --> 01:16:47,690 number of steps. 1257 01:16:47,690 --> 01:16:51,090 As one might suspect, now here's the first thing that 1258 01:16:51,090 --> 01:16:52,500 Bellman did. 
1259 01:16:52,500 --> 01:16:54,010 He said, here, I have this problem. 1260 01:16:54,010 --> 01:16:57,880 I want to find out what happens after 1,000 steps. 1261 01:16:57,880 --> 01:17:00,850 How do I solve the problem? 1262 01:17:00,850 --> 01:17:04,330 Well, anybody with any sense will tell you don't solve the 1263 01:17:04,330 --> 01:17:06,740 problem with 1,000 steps first. 1264 01:17:06,740 --> 01:17:10,220 Solve the problem with one step first, and then see if 1265 01:17:10,220 --> 01:17:13,330 you find out anything from it, and then maybe you can solve 1266 01:17:13,330 --> 01:17:17,030 the problem with two steps, and then maybe something nice will 1267 01:17:17,030 --> 01:17:20,600 happen, or maybe it won't. 1268 01:17:20,600 --> 01:17:25,320 When we do this, it'll turn out that what we're really 1269 01:17:25,320 --> 01:17:30,820 doing is starting at the end and working our way back, 1270 01:17:30,820 --> 01:17:34,010 and this algorithm is due to Richard Bellman, as I said. 1271 01:17:34,010 --> 01:17:38,400 And he was the one who sorted out how it worked. 1272 01:17:38,400 --> 01:17:40,630 So what is the algorithm? 1273 01:17:40,630 --> 01:17:45,250 We're going to start out making just one decision. 1274 01:17:45,250 --> 01:17:50,500 So we're going to start at time m. 1275 01:17:50,500 --> 01:17:53,610 We're going to start in a given state i. 1276 01:17:53,610 --> 01:17:58,580 You make a decision, decision k, at time m. 1277 01:17:58,580 --> 01:18:03,040 This provides a reward at time m, and the selected transition 1278 01:18:03,040 --> 01:18:06,240 probabilities lead to a final expected reward. 1279 01:18:06,240 --> 01:18:11,380 These are the final rewards which occur at time m plus 1. 1280 01:18:11,380 --> 01:18:13,710 It's nice to have that final reward u because it's what lets us 1281 01:18:13,710 --> 01:18:15,550 generalize the problem. 1282 01:18:15,550 --> 01:18:18,460 So this was another clever thing that went on here.
1283 01:18:18,460 --> 01:18:24,710 So the expected optimal aggregate reward for a one 1284 01:18:24,710 --> 01:18:32,230 step problem is the sum of the reward that you get at time m 1285 01:18:32,230 --> 01:18:37,260 plus this final reward you get at time m plus 1, and you're 1286 01:18:37,260 --> 01:18:40,290 maximizing over the different policies you have 1287 01:18:40,290 --> 01:18:41,490 available to you. 1288 01:18:41,490 --> 01:18:44,970 So it looks like a trivial problem, but the optimal 1289 01:18:44,970 --> 01:18:47,980 reward for a one step problem is just this. 1290 01:18:51,170 --> 01:18:54,820 OK, next you want to consider the two step problem. 1291 01:18:54,820 --> 01:18:58,900 What's the maximum expected reward starting at X_m equals i 1292 01:18:58,900 --> 01:19:03,480 with decisions at times m and m plus 1? 1293 01:19:03,480 --> 01:19:05,400 You make two decisions. 1294 01:19:05,400 --> 01:19:08,240 Now, before, we just made one decision at time m. 1295 01:19:08,240 --> 01:19:13,000 Now we make a decision at time m and at time m plus 1, and 1296 01:19:13,000 --> 01:19:17,750 finally we pick up a final reward at time m plus 2. 1297 01:19:17,750 --> 01:19:20,540 Knowing what that final reward is going to be is going to 1298 01:19:20,540 --> 01:19:26,230 affect the decision you make at time m plus 1, but it's a 1299 01:19:26,230 --> 01:19:29,770 fixed reward which is a function of the state. 1300 01:19:29,770 --> 01:19:32,720 You can adjust the transition probabilities of getting to 1301 01:19:32,720 --> 01:19:35,110 those different rewards. 1302 01:19:35,110 --> 01:19:38,420 The key to dynamic programming is that an optimal decision at time 1303 01:19:38,420 --> 01:19:42,630 m plus 1 can be selected based only on the state j 1304 01:19:42,630 --> 01:19:45,060 at time m plus 1.
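A minimal sketch of this one step maximization, on a made-up two-state, two-decision example (the name one_step, the arrays r, P, u, and every number are assumptions for illustration, not from the lecture): for each starting state i, take the best over decisions k of the immediate reward plus the expected fixed final reward.

```python
# Hypothetical data: r[k][i] is the reward for decision k in state i,
# P[k][i][j] the transition probability under decision k, u[j] the
# fixed final reward in state j. All numbers are made up.
r = [[1.0, 0.0],
     [0.5, 0.5]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.5, 0.5]]]
u = [0.0, 1.0]

def one_step(r, P, u):
    """For each starting state i, maximize over decisions k the
    immediate reward plus the expected value of the final reward."""
    states = range(len(u))
    return [max(r[k][i] + sum(P[k][i][j] * u[j] for j in states)
                for k in range(len(r)))
            for i in states]

print(one_step(r, P, u))   # → [1.1, 1.0]
```

It looks like a trivial problem, and as code it is: one max per state, each over a handful of expected values.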
1305 01:19:45,060 --> 01:19:48,960 This decision, given that you're in state j at time m 1306 01:19:48,960 --> 01:19:53,600 plus 1, is optimal independent of what you did before that, 1307 01:19:53,600 --> 01:19:55,770 which is why we're starting out looking at what we're 1308 01:19:55,770 --> 01:19:59,240 going to do at time m plus 1 before we even worry about 1309 01:19:59,240 --> 01:20:02,630 what we're going to do at time m. 1310 01:20:02,630 --> 01:20:06,340 So, whatever decision you made at time m, you observe what 1311 01:20:06,340 --> 01:20:10,900 state you're in at time m plus 1, and the maximal expected 1312 01:20:10,900 --> 01:20:15,510 reward over times m plus 1 and m plus 2, given that you 1313 01:20:15,510 --> 01:20:20,610 happen to be in state j, is just the maximum over k of the 1314 01:20:20,610 --> 01:20:26,430 reward you're going to get by choosing policy k plus the 1315 01:20:26,430 --> 01:20:30,670 expected value of the final reward you get if you're using 1316 01:20:30,670 --> 01:20:32,480 this policy k. 1317 01:20:32,480 --> 01:20:36,850 This is just v_j star of 1 and u, as you just found. 1318 01:20:36,850 --> 01:20:40,070 In other words, you have the same situation at time m plus 1319 01:20:40,070 --> 01:20:42,090 1 as you have at time m. 1320 01:20:44,600 --> 01:20:49,785 Well, surprisingly, you've just solved the whole problem. 1321 01:20:52,810 --> 01:20:58,410 So we've seen that what we should do at time m plus 1 is 1322 01:20:58,410 --> 01:21:00,450 do this maximization.
1323 01:21:00,450 --> 01:21:05,670 So the optimal aggregate reward over times m, 1324 01:21:05,670 --> 01:21:11,815 m plus 1, and m plus 2 is what we get maximizing over our 1325 01:21:11,815 --> 01:21:18,110 choice at time m of the reward we get at time m from the 1326 01:21:18,110 --> 01:21:21,750 decision, plus the transition probabilities which we've 1327 01:21:21,750 --> 01:21:27,340 decided on, which get us to this reward at times m plus 1 1328 01:21:27,340 --> 01:21:29,020 and m plus 2. 1329 01:21:29,020 --> 01:21:33,070 We found out what the reward is for times m plus 1 and m 1330 01:21:33,070 --> 01:21:34,370 plus 2 together. 1331 01:21:34,370 --> 01:21:38,060 That's the reward to go, and we know what that is, so we 1332 01:21:38,060 --> 01:21:40,210 have this same formula we used before. 1333 01:21:40,210 --> 01:21:46,965 Why do we want to look at these final rewards now? 1334 01:21:46,965 --> 01:21:50,980 Well, you can view this as a final reward at time m plus 1. 1335 01:21:50,980 --> 01:21:54,220 It's the final reward which tells you what you get both 1336 01:21:54,220 --> 01:21:57,930 from times m plus 1 and m plus 2. 1337 01:21:57,930 --> 01:22:04,790 And, going quickly, if we look at playing this game for three 1338 01:22:04,790 --> 01:22:11,280 steps, the optimal reward for the three step game is the 1339 01:22:11,280 --> 01:22:16,600 immediate reward optimized over k plus the rewards at m 1340 01:22:16,600 --> 01:22:22,900 plus 1, m plus 2, and m plus 3, which we've already found. 1341 01:22:22,900 --> 01:22:28,450 And in general, for n steps-- 1342 01:22:28,450 --> 01:22:33,900 when you play the game for n steps, the optimal reward is 1343 01:22:33,900 --> 01:22:35,170 this same maximum.
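The whole backward recursion can be sketched in a few lines, again on a made-up two-state, two-decision example (the name dynamic_program, the arrays r, P, u, and the numbers are assumptions for illustration): at each backward step, solve a one step problem using the previous answer as the reward to go.

```python
# Hypothetical data: r[k][i] is the reward for decision k in state i,
# P[k][i][j] the transition probability under decision k, u[j] the
# fixed final reward. All numbers are made up.
r = [[1.0, 0.0], [0.5, 0.5]]
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.5, 0.5]]]
u = [0.0, 1.0]

def dynamic_program(r, P, u, n):
    """Bellman's backward recursion: solve the one step problem n times,
    feeding each answer back in as the next final reward (reward to go)."""
    v = list(u)                   # reward to go; starts as the final reward
    policy = []
    states = range(len(u))
    for _ in range(n):            # one backward step per decision time
        gain = lambda k, i: r[k][i] + sum(P[k][i][j] * v[j] for j in states)
        dec = [max(range(len(r)), key=lambda k: gain(k, i)) for i in states]
        v = [gain(dec[i], i) for i in states]
        policy.insert(0, dec)     # earlier decisions are computed later
    return v, policy

v, policy = dynamic_program(r, P, u, 2)   # the two step problem
```

With n equal to 1 this reproduces the one step answer; each extra step just reuses the previous v in place of u, which is exactly why carrying an arbitrary final reward vector around pays off.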
1344 01:22:35,170 --> 01:22:39,980 So, all you do in the algorithm is, for each value 1345 01:22:39,980 --> 01:22:43,500 of n, when you start with n equal to 1, you solve the 1346 01:22:43,500 --> 01:22:48,950 problem for all states and you maximize over all policies you 1347 01:22:48,950 --> 01:22:52,100 have a choice over, and then you go on to the next larger 1348 01:22:52,100 --> 01:22:56,140 value of n, you solve the problem for all states, and you 1349 01:22:56,140 --> 01:22:56,950 keep on going. 1350 01:22:56,950 --> 01:22:59,820 If you don't have many states, it's easy. 1351 01:22:59,820 --> 01:23:05,100 If you have 100,000 states, it's kind of tedious to run 1352 01:23:05,100 --> 01:23:05,880 the algorithm. 1353 01:23:05,880 --> 01:23:09,380 Today it's not bad, but now we look at problems with 1354 01:23:09,380 --> 01:23:11,770 millions and millions of states or billions of states, 1355 01:23:11,770 --> 01:23:18,100 and no matter how fast computation gets, the 1356 01:23:18,100 --> 01:23:22,280 ingenuity of people in inventing harder problems always makes 1357 01:23:22,280 --> 01:23:24,630 it hard to solve these problems. 1358 01:23:24,630 --> 01:23:29,060 So anyway, that's the dynamic programming algorithm. 1359 01:23:29,060 --> 01:23:31,320 And next time, we're going to start on renewal processes.