1 00:00:01,540 --> 00:00:03,910 The following content is provided under a Creative 2 00:00:03,910 --> 00:00:05,300 Commons license. 3 00:00:05,300 --> 00:00:07,510 Your support will help MIT OpenCourseWare 4 00:00:07,510 --> 00:00:11,600 continue to offer high quality educational resources for free. 5 00:00:11,600 --> 00:00:14,140 To make a donation or to view additional materials 6 00:00:14,140 --> 00:00:18,100 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,100 --> 00:00:18,980 at ocw.mit.edu. 8 00:00:23,905 --> 00:00:24,530 JAMES SWAN: OK. 9 00:00:24,530 --> 00:00:25,821 Let's go ahead and get started. 10 00:00:34,650 --> 00:00:36,380 We saw a lot of good conversation 11 00:00:36,380 --> 00:00:38,275 on Piazza this weekend. 12 00:00:38,275 --> 00:00:38,900 So that's good. 13 00:00:38,900 --> 00:00:40,525 Seems like you guys are making your way 14 00:00:40,525 --> 00:00:46,010 through these two problems on the latest assignment. 15 00:00:46,010 --> 00:00:51,120 I would try to focus less on the chemical engineering 16 00:00:51,120 --> 00:00:54,420 science and problems that involve those. 17 00:00:54,420 --> 00:00:57,360 Usually the topic of interest, the thing that's useful to you 18 00:00:57,360 --> 00:01:00,840 educationally is going to be the numerics, right. 19 00:01:00,840 --> 00:01:02,700 So if you get hung up on the definition 20 00:01:02,700 --> 00:01:06,630 of a particular quantity, yield was one that came up. 21 00:01:06,630 --> 00:01:09,300 Rather than let that prevent you from solving the problem, 22 00:01:09,300 --> 00:01:11,730 pick a definition and see what happens. 23 00:01:11,730 --> 00:01:14,010 You can always ask yourself if the results 24 00:01:14,010 --> 00:01:15,430 seem physically reasonable to you 25 00:01:15,430 --> 00:01:16,890 not based on your definition. 26 00:01:16,890 --> 00:01:20,500 And as long as you explain what you did in your write up, 27 00:01:20,500 --> 00:01:22,260 you're going to get full points. 28 00:01:22,260 --> 00:01:24,750 We want to solve the problems numerically. 29 00:01:24,750 --> 00:01:29,430 If there's some hang up in the science, don't sweat it. 30 00:01:29,430 --> 00:01:32,010 Don't let that stop you from moving ahead with it. 31 00:01:32,010 --> 00:01:36,631 Don't let it make it seem like the problem can't be solved 32 00:01:36,631 --> 00:01:38,130 or there isn't a path to a solution. 33 00:01:38,130 --> 00:01:42,450 Pick a definition and go with it and see what happens, right. 34 00:01:42,450 --> 00:01:43,860 The root of the second problem is 35 00:01:43,860 --> 00:01:46,720 trying to nest together two different numerical methods. 36 00:01:46,720 --> 00:01:48,720 One of those is optimization, and the other one 37 00:01:48,720 --> 00:01:50,650 is solutions of nonlinear equation, 38 00:01:50,650 --> 00:01:52,800 putting those two techniques together, 39 00:01:52,800 --> 00:01:55,040 using them in combination. 40 00:01:55,040 --> 00:01:58,200 The engineering science problem gives us a solvable problem 41 00:01:58,200 --> 00:02:00,240 to work with in that context, but it's not 42 00:02:00,240 --> 00:02:02,800 the key element of it. 43 00:02:02,800 --> 00:02:03,720 OK, good. 44 00:02:03,720 --> 00:02:06,222 So we're continuing optimization, right. 45 00:02:06,222 --> 00:02:08,190 Move this just a little bit. 46 00:02:08,190 --> 00:02:11,039 We're continuing with optimization. 47 00:02:11,039 --> 00:02:14,760 Last time we posed lots of optimization problems. 
48 00:02:14,760 --> 00:02:18,420 We talked about constrained optimization, 49 00:02:18,420 --> 00:02:19,990 unconstrained optimization. 50 00:02:19,990 --> 00:02:23,730 We heard a little bit about linear programs. 51 00:02:23,730 --> 00:02:27,480 We started approaching unconstrained optimization 52 00:02:27,480 --> 00:02:31,635 problems from the perspective of steepest descent. 53 00:02:31,635 --> 00:02:35,080 OK, so that's where I want to pick up as we get started. 54 00:02:35,080 --> 00:02:37,620 So you'll recall the idea behind steepest descent 55 00:02:37,620 --> 00:02:40,290 was all the unconstrained optimization 56 00:02:40,290 --> 00:02:43,710 problems we're interested in are based around trying 57 00:02:43,710 --> 00:02:45,909 to find minima, OK. 58 00:02:45,909 --> 00:02:47,700 And so we should think about these problems 59 00:02:47,700 --> 00:02:50,310 as though we're standing on top of a mountain. 60 00:02:50,310 --> 00:02:53,430 And we're looking for directions that allow us to descend. 61 00:02:53,430 --> 00:02:56,327 And as long as we're heading in descending directions, 62 00:02:56,327 --> 00:02:58,410 right, there's a good chance we're going to bottom 63 00:02:58,410 --> 00:02:59,784 out someplace and stop. 64 00:02:59,784 --> 00:03:01,200 And when we've bottomed out, we've 65 00:03:01,200 --> 00:03:02,834 found one of those local minima. 66 00:03:02,834 --> 00:03:04,500 That bottom is going to be a place where 67 00:03:04,500 --> 00:03:06,870 the gradient of the function we're trying to find 68 00:03:06,870 --> 00:03:10,200 the minimum of is zero, OK. 69 00:03:10,200 --> 00:03:11,730 And the idea behind steepest descent 70 00:03:11,730 --> 00:03:15,060 was well, don't just pick any direction that's down hill. 71 00:03:15,060 --> 00:03:16,830 Pick the steepest direction, right. 72 00:03:16,830 --> 00:03:19,350 Go in the direction of the gradient. 73 00:03:19,350 --> 00:03:20,820 That's the steepest descent idea. 74 00:03:20,820 --> 00:03:23,700 And then we did something a little sophisticated last time. 75 00:03:23,700 --> 00:03:26,225 We said well OK, I know the direction. 76 00:03:26,225 --> 00:03:27,600 I'm standing on top the mountain. 77 00:03:27,600 --> 00:03:29,760 I point myself in the steepest descent direction. 78 00:03:29,760 --> 00:03:32,070 How big a step do I take? 79 00:03:32,070 --> 00:03:33,240 I can take any size step. 80 00:03:33,240 --> 00:03:36,180 And some steps may be good and some steps may be bad. 81 00:03:36,180 --> 00:03:39,300 It turns out there are some good estimates for step size 82 00:03:39,300 --> 00:03:41,320 that we can get by taking a Taylor expansion. 83 00:03:41,320 --> 00:03:43,860 So we take our function, right, and we 84 00:03:43,860 --> 00:03:46,800 write it at the next iterate is a Taylor expansion. 85 00:03:46,800 --> 00:03:50,100 About the current iterate, that expansion looks like this. 86 00:03:50,100 --> 00:03:54,010 And it will be quadratic with respect to the step size alpha. 87 00:03:56,550 --> 00:03:59,880 If we want to minimize the value of the function here, 88 00:03:59,880 --> 00:04:02,940 we want the next iterate to be a minimum 89 00:04:02,940 --> 00:04:04,230 of this quadratic function. 90 00:04:04,230 --> 00:04:07,620 Then there's an obvious choice of alpha, right. 91 00:04:07,620 --> 00:04:11,400 We find the vertex of this quadratic functional. 92 00:04:11,400 --> 00:04:14,040 That gives us the optimal step size. 93 00:04:14,040 --> 00:04:16,320 It's optimal if our function actually is quadratic. 
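Here is a minimal sketch of that step-size rule in Python with NumPy. It assumes the usual result of that expansion for the steepest-descent direction, alpha = (g . g) / (g' H g), with g the gradient and H the Hessian at the current iterate; the test function and the starting point are illustrative assumptions, not the example on the slides.

import numpy as np

def steepest_descent(grad, hess, x0, tol=1e-8, max_iter=200):
    # Steepest descent with the step size taken from the quadratic (Taylor) model:
    # f(x - a*g) ~ f(x) - a*(g.g) + 0.5*a^2*(g.H.g), whose vertex is a = (g.g)/(g.H.g).
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # gradient is (nearly) zero: at a local minimum
            break
        H = hess(x)
        alpha = (g @ g) / (g @ H @ g)    # vertex of the quadratic model in alpha
        x = x - alpha * g                # step in the steepest-descent direction
    return x

# An illustrative steep quadratic bowl (an assumption, not the plotted example):
A = np.array([[100.0, 0.0], [0.0, 1.0]])
x_min = steepest_descent(grad=lambda x: A @ x, hess=lambda x: A, x0=[1.0, 1.0])
print(x_min)   # approaches [0, 0]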
94 00:04:16,320 --> 00:04:18,060 It's an approximate, right. 95 00:04:18,060 --> 00:04:20,250 It's an estimation of the right sort of step size 96 00:04:20,250 --> 00:04:22,352 if it's not quadratic. 97 00:04:22,352 --> 00:04:26,100 And so I showed you here was a function where the contours are 98 00:04:26,100 --> 00:04:27,210 very closely spaced. 99 00:04:27,210 --> 00:04:28,890 So it's a very steep function. 100 00:04:28,890 --> 00:04:30,660 And the minima is in the middle. 101 00:04:30,660 --> 00:04:33,600 If we try to solve this with the steepest descent 102 00:04:33,600 --> 00:04:37,060 and we pick different steps sizes, uniform step sizes, 103 00:04:37,060 --> 00:04:41,480 so we try 0.1 and 1 and 10 step sizes, 104 00:04:41,480 --> 00:04:43,710 we'll never find an appropriate choice 105 00:04:43,710 --> 00:04:45,359 to converge to the solution, OK. 106 00:04:45,359 --> 00:04:47,400 We're going to have to pick impossibly small step 107 00:04:47,400 --> 00:04:50,945 sizes, which will require tons of steps in order to get there. 108 00:04:50,945 --> 00:04:52,320 But with this quadratic estimate, 109 00:04:52,320 --> 00:04:55,960 you can get a reasonably smooth convergence to the root. 110 00:04:55,960 --> 00:04:58,750 So that's nice. 111 00:04:58,750 --> 00:05:01,280 And here's a task for you to test whether you understand 112 00:05:01,280 --> 00:05:02,680 steepest descent or not. 113 00:05:02,680 --> 00:05:05,710 In your notes, I've drawn some contours. 114 00:05:05,710 --> 00:05:07,840 For function, we'd like to minimize using 115 00:05:07,840 --> 00:05:09,630 the method of steepest descent. 116 00:05:09,630 --> 00:05:13,900 And I want you to try to draw steepest descent paths on top 117 00:05:13,900 --> 00:05:18,010 of these contours starting from initial conditions 118 00:05:18,010 --> 00:05:20,630 where these stars are located. 119 00:05:20,630 --> 00:05:23,260 So if I'm following steepest descent, the rules 120 00:05:23,260 --> 00:05:27,670 of steepest descent here, and I start from these stars, 121 00:05:27,670 --> 00:05:29,440 what sort of paths do I follow? 122 00:05:29,440 --> 00:05:31,240 You're going to need to pick a step size. 123 00:05:31,240 --> 00:05:34,480 I would suggest thinking about the small step size limit. 124 00:05:34,480 --> 00:05:37,090 What is the steepest descent path in the small step size 125 00:05:37,090 --> 00:05:37,614 limit? 126 00:05:37,614 --> 00:05:39,530 Can you work that out, you and your neighbors? 127 00:05:39,530 --> 00:05:40,870 You don't have to do all of them by yourself. 128 00:05:40,870 --> 00:05:42,710 You can do one, your neighbor could do another. 129 00:05:42,710 --> 00:05:44,335 And we'll take a look at them together. 130 00:08:08,380 --> 00:08:11,485 OK, the roar has turned into a rumble and then a murmur, 131 00:08:11,485 --> 00:08:13,360 so I think you guys are making some progress. 132 00:08:16,790 --> 00:08:17,540 What do you think? 133 00:08:17,540 --> 00:08:20,290 How about let's do an easy one. 134 00:08:20,290 --> 00:08:21,550 How about this one here. 135 00:08:21,550 --> 00:08:24,210 What sort of path does it take? 136 00:08:24,210 --> 00:08:26,340 Yeah, it sort of curls right down into the center 137 00:08:26,340 --> 00:08:26,840 here, right. 138 00:08:26,840 --> 00:08:30,300 Remember, steepest descent paths run perpendicular 139 00:08:30,300 --> 00:08:31,020 to the contours. 140 00:08:31,020 --> 00:08:33,677 So jumps perpendicular to the contour, almost a straight line 141 00:08:33,677 --> 00:08:34,260 to the center. 
142 00:08:34,260 --> 00:08:36,799 How about this one over here? 143 00:08:36,799 --> 00:08:37,594 Same thing, right? 144 00:08:37,594 --> 00:08:38,510 It runs the other way. 145 00:08:38,510 --> 00:08:41,870 It's going downhill 1, 0, minus 1, minus 2. 146 00:08:41,870 --> 00:08:44,960 So it runs downhill and curls into the center. 147 00:08:44,960 --> 00:08:47,482 What about this one up here? 148 00:08:47,482 --> 00:08:49,210 What's it do? 149 00:08:49,210 --> 00:08:51,850 Yeah, it just runs to the left, right. 150 00:08:51,850 --> 00:08:54,256 The contour lines had normals that 151 00:08:54,256 --> 00:08:56,130 just keep it running all the way to the left. 152 00:08:56,130 --> 00:08:59,120 So this actually doesn't run into this minimum, right. 153 00:08:59,120 --> 00:09:03,230 It finds a cliff and steps right off of it, keeps on going. 154 00:09:03,230 --> 00:09:04,890 Steepest descent, that's what it does. 155 00:09:04,890 --> 00:09:06,920 How about this one here? 156 00:09:06,920 --> 00:09:08,734 Same thing, right, just to the left. 157 00:09:08,734 --> 00:09:10,400 So these are what these paths look like. 158 00:09:10,400 --> 00:09:13,750 You can draw them yourself. 159 00:09:13,750 --> 00:09:16,095 If I showed you paths and asked you what sort of method 160 00:09:16,095 --> 00:09:18,720 made them, you should be able to identify that actually, right? 161 00:09:18,720 --> 00:09:22,660 You should be able to detect what sort of methodology 162 00:09:22,660 --> 00:09:24,075 generated those kinds of paths. 163 00:09:28,770 --> 00:09:30,330 We're not always so fortunate to have 164 00:09:30,330 --> 00:09:32,730 this graphical view of the landscape 165 00:09:32,730 --> 00:09:34,680 that our method is navigating. 166 00:09:34,680 --> 00:09:36,637 But it's good to have these 2D depictions. 167 00:09:36,637 --> 00:09:38,220 Because they really help us understand 168 00:09:38,220 --> 00:09:41,260 when a method doesn't converge what might be going wrong, 169 00:09:41,260 --> 00:09:41,760 right. 170 00:09:41,760 --> 00:09:44,910 So steepest descent, it always heads downhill. 171 00:09:44,910 --> 00:09:47,640 But if there is no bottom, it's just going to keep going down, 172 00:09:47,640 --> 00:09:48,180 right. 173 00:09:48,180 --> 00:09:50,754 It's never going to find it. 174 00:09:50,754 --> 00:09:51,470 Oh, OK. 175 00:09:51,470 --> 00:09:53,960 Here's a-- this is a story now that you 176 00:09:53,960 --> 00:09:55,010 understand optimization. 177 00:09:55,010 --> 00:09:59,330 So let's see, so mechanical systems, 178 00:09:59,330 --> 00:10:02,190 conservation of momentum, that's also, 179 00:10:02,190 --> 00:10:05,990 in a certain sense, an optimization problem, right. 180 00:10:05,990 --> 00:10:11,690 So conservation of momentum says that the acceleration on a body 181 00:10:11,690 --> 00:10:13,710 is equal to the sum of the forces on it. 182 00:10:13,710 --> 00:10:15,500 And some of those forces are what 183 00:10:15,500 --> 00:10:16,960 we call conservative forces. 184 00:10:16,960 --> 00:10:18,970 They're proportional to gradients 185 00:10:18,970 --> 00:10:20,410 of some energy landscape. 186 00:10:20,410 --> 00:10:22,160 Some of those forces are non-conservative, 187 00:10:22,160 --> 00:10:22,951 like this one here. 188 00:10:22,951 --> 00:10:24,740 It's a little damping force, a little bit 189 00:10:24,740 --> 00:10:28,220 of friction proportional to the velocity with which 190 00:10:28,220 --> 00:10:30,990 the object moves instead. 
191 00:10:30,990 --> 00:10:33,860 And if we start some system like this, 192 00:10:33,860 --> 00:10:37,040 we give it some initial inertia and let it go, 193 00:10:37,040 --> 00:10:39,920 right, eventually it's going to want to come to rest at a place 194 00:10:39,920 --> 00:10:42,980 where the gradient in the potential is 0 195 00:10:42,980 --> 00:10:45,120 and the velocity is 0 on the acceleration is 0. 196 00:10:45,120 --> 00:10:46,750 We call that mechanical equilibrium. 197 00:10:46,750 --> 00:10:49,700 We get to mechanical equilibrium and we stop, right. 198 00:10:49,700 --> 00:10:54,500 So physical systems many times are seeking out minimum 199 00:10:54,500 --> 00:10:55,610 of an objective function. 200 00:10:55,610 --> 00:10:58,980 The objective function is the potential energy. 201 00:10:58,980 --> 00:11:02,090 I saw last year at my house we had a pipe underground 202 00:11:02,090 --> 00:11:06,050 that leaked in the front yard. 203 00:11:06,050 --> 00:11:07,874 And they needed to find the pipe, right. 204 00:11:07,874 --> 00:11:10,415 It was like under the asphalt. So they got to dig up asphalt, 205 00:11:10,415 --> 00:11:12,530 and they need to know where is the pipe. 206 00:11:12,530 --> 00:11:15,470 They know it's leaking, but where does the pipe sit? 207 00:11:15,470 --> 00:11:19,080 So the city came out and the guy from the city brought this. 208 00:11:19,080 --> 00:11:20,270 Do you know what this is? 209 00:11:23,240 --> 00:11:23,760 What is it? 210 00:11:23,760 --> 00:11:25,510 Do you know? 211 00:11:25,510 --> 00:11:28,210 Yeah, yeah, yeah. 212 00:11:28,210 --> 00:11:29,290 It's a dowsing rod. 213 00:11:29,290 --> 00:11:33,770 OK, this kind of crazy story right, a dowsing rod. 214 00:11:33,770 --> 00:11:35,500 OK, a dowsing rod. 215 00:11:35,500 --> 00:11:37,090 How does it work? 216 00:11:37,090 --> 00:11:40,270 The way it's supposed to work is I hold it out 217 00:11:40,270 --> 00:11:43,630 and it should turn and rotate in point 218 00:11:43,630 --> 00:11:45,520 in a direction that's parallel to the flow 219 00:11:45,520 --> 00:11:46,660 of the water in the pipe. 220 00:11:46,660 --> 00:11:48,860 That's the theory that this is supposed to work on. 221 00:11:48,860 --> 00:11:50,740 I'm a scientist. 222 00:11:50,740 --> 00:11:53,470 So I expect that somehow the water 223 00:11:53,470 --> 00:11:57,970 is exerting a force on the tip of the dowsing rod, OK. 224 00:11:57,970 --> 00:11:59,920 So the dowsing rod is moving around 225 00:11:59,920 --> 00:12:01,840 as this guy walks around. 226 00:12:01,840 --> 00:12:05,160 And it's going to stop when it finds a point 227 00:12:05,160 --> 00:12:06,530 of mechanical equilibrium. 228 00:12:06,530 --> 00:12:08,350 So the dowsing rod is seeking out 229 00:12:08,350 --> 00:12:10,927 a minimum of some potential energy, let's say. 230 00:12:10,927 --> 00:12:12,760 That's what the physics says has to be true. 231 00:12:12,760 --> 00:12:15,310 I don't know that flowing water exerts a force 232 00:12:15,310 --> 00:12:16,570 on the tip of the dowsing rod. 233 00:12:16,570 --> 00:12:19,750 The guy who had this believed that was true, OK. 234 00:12:19,750 --> 00:12:23,190 It turns out, this is not such a good idea, though, OK. 235 00:12:23,190 --> 00:12:25,690 Like in terms of a method for seeking out 236 00:12:25,690 --> 00:12:29,040 the minimum of a potential, it's not such a great way to do it. 237 00:12:29,040 --> 00:12:31,990 Because he's way up here, and the water's way underground. 
238 00:12:31,990 --> 00:12:34,630 So there's a huge distance between these things. 239 00:12:34,630 --> 00:12:37,570 It's not exerting a strong force, OK. 240 00:12:37,570 --> 00:12:40,420 The gradient isn't very big here. 241 00:12:40,420 --> 00:12:41,800 It's a relatively weak force. 242 00:12:41,800 --> 00:12:44,290 So this instrument is incredibly sensitive to all sorts 243 00:12:44,290 --> 00:12:45,670 of external fluctuations. 244 00:12:45,670 --> 00:12:47,320 The gradient is small. 245 00:12:47,320 --> 00:12:50,710 The potential energy landscape is very, very flat. 246 00:12:50,710 --> 00:12:52,630 And we know already from applying things 247 00:12:52,630 --> 00:12:55,240 like steepest descent methods or Newton-Raphson 248 00:12:55,240 --> 00:12:58,210 that those circumstances are disastrous for any method 249 00:12:58,210 --> 00:13:02,381 seeking out minima of potential energies, right. 250 00:13:02,381 --> 00:13:04,630 Those landscapes are the hardest ones to detect it in. 251 00:13:04,630 --> 00:13:08,150 Because every point looks like it's close to being a minima, 252 00:13:08,150 --> 00:13:08,650 right. 253 00:13:08,650 --> 00:13:12,310 It's really difficult to see the differences between these. 254 00:13:12,310 --> 00:13:16,170 Nonetheless, he figured out where the pipe was. 255 00:13:16,170 --> 00:13:19,680 I don't think it was because of this though. 256 00:13:19,680 --> 00:13:21,150 How did he know where the pipe was? 257 00:13:24,054 --> 00:13:25,022 What's that? 258 00:13:25,022 --> 00:13:26,960 STUDENT: Where the ground was squishy? 259 00:13:26,960 --> 00:13:27,720 JAMES SWAN: Where the ground was squishy. 260 00:13:27,720 --> 00:13:29,090 Well yeah, had some good guesses because it 261 00:13:29,090 --> 00:13:30,290 was leaking up a little bit. 262 00:13:30,290 --> 00:13:33,324 No, I looked carefully afterwards. 263 00:13:33,324 --> 00:13:35,240 And I think it turned out the city had come by 264 00:13:35,240 --> 00:13:36,971 and actually painted some white lines 265 00:13:36,971 --> 00:13:39,220 on either side of the street to indicate where it was. 266 00:13:39,220 --> 00:13:40,928 But he was out there with his dowsing rod 267 00:13:40,928 --> 00:13:42,952 making sure the city had gotten it right. 268 00:13:42,952 --> 00:13:45,410 It turns out, there's something called the ideomotor effect 269 00:13:45,410 --> 00:13:47,580 where your hand has very little, you 270 00:13:47,580 --> 00:13:49,330 know, very sensitive little tremors in it. 271 00:13:49,330 --> 00:13:51,929 And can guide something like this, a little weight 272 00:13:51,929 --> 00:13:53,720 at the end of a rod to go wherever you want 273 00:13:53,720 --> 00:13:54,928 it to go when you want it to. 274 00:13:54,928 --> 00:13:56,240 It's like a Ouija board, right. 275 00:13:56,240 --> 00:13:57,812 It works exactly the same way. 276 00:13:57,812 --> 00:13:59,270 Anyway, it's not a good way to find 277 00:13:59,270 --> 00:14:03,620 the minimum of potential energy surfaces, OK. 278 00:14:03,620 --> 00:14:05,990 We have the same problem with numerical methods. 279 00:14:05,990 --> 00:14:08,180 It's really difficult when these potential energy 280 00:14:08,180 --> 00:14:13,688 landscapes are flat to find where the minimum is, OK. 281 00:14:13,688 --> 00:14:14,900 So fun and games are over. 282 00:14:14,900 --> 00:14:16,025 Now we got to do some math. 283 00:14:19,960 --> 00:14:24,450 So we talked about steepest descent. 
284 00:14:24,450 --> 00:14:26,370 And steepest descent is an interesting way 285 00:14:26,370 --> 00:14:30,640 to approach these kinds of optimization problems. 286 00:14:30,640 --> 00:14:35,580 It turns out, it turns out that linear equations like Ax 287 00:14:35,580 --> 00:14:40,320 equals b can also be cast as optimization problems, right. 288 00:14:40,320 --> 00:14:42,660 So the solution to this equation Ax 289 00:14:42,660 --> 00:14:47,520 equals b is also a minimum of this quadratic function up 290 00:14:47,520 --> 00:14:49,230 here. 291 00:14:49,230 --> 00:14:49,980 How do you know? 292 00:14:49,980 --> 00:14:52,720 You take the gradient of this function, 293 00:14:52,720 --> 00:14:57,090 which is Ax minus b, and set the gradient to 0 at the minimum. 294 00:14:57,090 --> 00:15:00,240 So Ax minus b is 0, or Ax equals b. 295 00:15:00,240 --> 00:15:02,460 So we can do optimization on these sorts 296 00:15:02,460 --> 00:15:06,640 of quadratic functionals, and we would find the solution 297 00:15:06,640 --> 00:15:08,787 of systems of linear equations. 298 00:15:08,787 --> 00:15:10,120 This is an alternative approach. 299 00:15:10,120 --> 00:15:12,161 Sometimes this is called the variational approach 300 00:15:12,161 --> 00:15:15,870 to solving these systems of linear equations. 301 00:15:15,870 --> 00:15:19,280 There are a couple of things that have to be true. 302 00:15:19,280 --> 00:15:21,740 The linear operator, right, the matrix here, 303 00:15:21,740 --> 00:15:23,310 it has to be symmetric. 304 00:15:23,310 --> 00:15:25,700 OK, it has to be symmetric, because it's multiplied 305 00:15:25,700 --> 00:15:27,380 by x from both sides. 306 00:15:27,380 --> 00:15:29,990 It doesn't know that its transpose is 307 00:15:29,990 --> 00:15:33,140 different from itself in the form of this functional. 308 00:15:33,140 --> 00:15:35,150 If A wasn't symmetric, the functional 309 00:15:35,150 --> 00:15:37,850 would symmetrize it automatically, OK. 310 00:15:37,850 --> 00:15:41,870 So a functional like this only corresponds 311 00:15:41,870 --> 00:15:44,750 to this linear equation when A is symmetric. 312 00:15:44,750 --> 00:15:50,270 And this sort of thing only has a minimum, right, 313 00:15:50,270 --> 00:15:52,392 when the matrix A is positive definite. 314 00:15:52,392 --> 00:15:54,940 It has to have all positive eigenvalues, right. 315 00:15:54,940 --> 00:15:58,910 The Hessian, right, of this functional, 316 00:15:58,910 --> 00:16:00,410 is just the matrix A. And we already 317 00:16:00,410 --> 00:16:03,940 said that the Hessian needs all positive eigenvalues to confirm 318 00:16:03,940 --> 00:16:05,100 we have a minimum. 319 00:16:05,100 --> 00:16:06,260 OK? 320 00:16:06,260 --> 00:16:08,300 If one of the eigenvalues is zero, 321 00:16:08,300 --> 00:16:09,980 then the problem is indeterminate. 322 00:16:09,980 --> 00:16:11,690 The linear problem is indeterminate. 323 00:16:11,690 --> 00:16:13,670 And there isn't a single local minimum, right. 324 00:16:13,670 --> 00:16:16,100 There's going to be a line of minima or a plane of minima 325 00:16:16,100 --> 00:16:17,300 instead. 326 00:16:17,300 --> 00:16:19,210 OK? 327 00:16:19,210 --> 00:16:22,540 OK, so you can solve systems of linear equations 328 00:16:22,540 --> 00:16:25,750 as optimization problems. 329 00:16:25,750 --> 00:16:28,570 And people have tried to apply things like steepest descent 330 00:16:28,570 --> 00:16:29,870 to these problems.
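As a quick check of that variational picture, here is a minimal sketch in Python with NumPy; the 2-by-2 matrix and right-hand side are illustrative assumptions, not the example from the slides. The gradient of phi(x) = 1/2 x'Ax - b'x is Ax - b, so the minimizer of phi is the solution of Ax = b when A is symmetric positive definite.

import numpy as np

# A small symmetric positive definite system (an illustrative assumption):
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

phi = lambda x: 0.5 * x @ A @ x - b @ x   # the quadratic objective
x_star = np.linalg.solve(A, b)            # solution of A x = b

print(A @ x_star - b)                     # gradient of phi at x_star: ~[0, 0]
# phi only increases if we move away from x_star, consistent with a minimum there
for d in np.eye(2):
    print(phi(x_star + 0.1 * d) > phi(x_star), phi(x_star - 0.1 * d) > phi(x_star))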
331 00:16:29,870 --> 00:16:31,870 And it turns out steepest descent is 332 00:16:31,870 --> 00:16:33,980 kind of challenging to apply. 333 00:16:33,980 --> 00:16:38,560 So what winds up happening is let's 334 00:16:38,560 --> 00:16:41,020 suppose we don't take our quadratic approximation 335 00:16:41,020 --> 00:16:42,670 for the descent direction first. 336 00:16:42,670 --> 00:16:45,820 Let's just say we take some fixed step size, right. 337 00:16:45,820 --> 00:16:50,050 When you take that fixed step size, it'll always be, 338 00:16:50,050 --> 00:16:54,250 let's say good for one particular direction. 339 00:16:54,250 --> 00:16:56,110 OK, so I'll step in a particular direction. 340 00:16:56,110 --> 00:16:56,850 It'll be good. 341 00:16:56,850 --> 00:16:59,470 It'll be a nice step into a local minimum. 342 00:16:59,470 --> 00:17:02,260 But when I try to step in the next gradient direction, 343 00:17:02,260 --> 00:17:03,850 it may be too big or too small. 344 00:17:03,850 --> 00:17:06,700 And that will depend on the eigenvalues associated 345 00:17:06,700 --> 00:17:09,160 with the direction that I am trying to step in, OK. 346 00:17:09,160 --> 00:17:12,460 How steep is this convex function? 347 00:17:12,460 --> 00:17:13,240 Right? 348 00:17:13,240 --> 00:17:15,180 How strongly curved is that convex function? 349 00:17:15,180 --> 00:17:17,619 That's what the eigenvalues are describing. 350 00:17:17,619 --> 00:17:20,680 And so fixed value of alpha will lead to cases 351 00:17:20,680 --> 00:17:23,778 where we wind up stepping too far or not far enough. 352 00:17:23,778 --> 00:17:25,569 And there'll be a lot of oscillating around 353 00:17:25,569 --> 00:17:28,900 on this path that converges to a solution. 354 00:17:28,900 --> 00:17:31,200 I showed you how to pick an optimal step size. 355 00:17:31,200 --> 00:17:32,980 It said look in a particular direction 356 00:17:32,980 --> 00:17:34,750 and treat your function as though it were 357 00:17:34,750 --> 00:17:37,120 quadratic along that direction. 358 00:17:37,120 --> 00:17:40,124 That's going to be true for all directions associated 359 00:17:40,124 --> 00:17:41,290 with this functional, right. 360 00:17:41,290 --> 00:17:44,570 It's always quadratic no matter which direction I point in. 361 00:17:44,570 --> 00:17:45,070 Right? 362 00:17:45,070 --> 00:17:47,440 So I pick a direction and I step and I'll 363 00:17:47,440 --> 00:17:50,020 be stepping to the minimal point along that direction. 364 00:17:50,020 --> 00:17:51,664 It'll be exact, OK. 365 00:17:51,664 --> 00:17:53,080 And then I've got to turn and I've 366 00:17:53,080 --> 00:17:55,150 got to go in another gradient direction 367 00:17:55,150 --> 00:17:56,810 and take a step there. 368 00:17:56,810 --> 00:17:58,980 And I'll turn and go in another gradient direction 369 00:17:58,980 --> 00:17:59,920 and take a step there. 370 00:17:59,920 --> 00:18:04,360 And in each direction I go, I'll be minimizing every time. 371 00:18:04,360 --> 00:18:09,127 Because this step size is the ideal step size. 372 00:18:09,127 --> 00:18:11,210 But it turns out you can do even better than that. 373 00:18:14,390 --> 00:18:16,060 So we can step in some direction, which 374 00:18:16,060 --> 00:18:19,270 is a descent direction, but not necessarily 375 00:18:19,270 --> 00:18:20,920 the steepest descent. 376 00:18:20,920 --> 00:18:23,979 And it's going to give us some extra control over how 377 00:18:23,979 --> 00:18:25,270 we're minimizing this function. 
378 00:18:25,270 --> 00:18:27,040 I'll explain on the next slide, OK. 379 00:18:27,040 --> 00:18:28,540 The first thing you've got to do though 380 00:18:28,540 --> 00:18:32,324 is, given some descent direction, what is the optimal step size? 381 00:18:32,324 --> 00:18:34,240 Well, we'll work that out the same way, right. 382 00:18:34,240 --> 00:18:37,480 We can write f at the next iterate in terms 383 00:18:37,480 --> 00:18:41,830 of f at the current iterate plus all the perturbations, right. 384 00:18:41,830 --> 00:18:46,907 So our step method is Xi plus 1 is Xi plus alpha pi, right. 385 00:18:46,907 --> 00:18:48,490 So we do a Taylor expansion, and we'll 386 00:18:48,490 --> 00:18:51,250 get a quadratic function again. 387 00:18:51,250 --> 00:18:54,190 And we'll minimize this quadratic function with respect 388 00:18:54,190 --> 00:18:58,160 to alpha i when alpha takes on this value. 389 00:18:58,160 --> 00:19:01,970 So this is the value of the vertex of this function. 390 00:19:01,970 --> 00:19:03,910 So we'll minimize this quadratic function 391 00:19:03,910 --> 00:19:06,175 in one direction, the direction p. 392 00:19:08,700 --> 00:19:10,660 But is there an optimal choice of direction? 393 00:19:10,660 --> 00:19:14,860 Is it really best to step in the descent direction? 394 00:19:14,860 --> 00:19:17,710 Or are there better directions that I could go in? 395 00:19:17,710 --> 00:19:20,080 We thought going downhill fastest might be best, 396 00:19:20,080 --> 00:19:21,860 but maybe that's not true. 397 00:19:21,860 --> 00:19:23,920 Because if I point in a direction 398 00:19:23,920 --> 00:19:26,650 and I apply my quadratic approximation, 399 00:19:26,650 --> 00:19:28,550 I minimize the function in this direction. 400 00:19:28,550 --> 00:19:30,050 Now I'm going to turn, and I'm going 401 00:19:30,050 --> 00:19:31,630 to go in a different direction. 402 00:19:31,630 --> 00:19:34,210 And I'll minimize it here, but I'll 403 00:19:34,210 --> 00:19:37,170 lose some of the minimization that I got previously, right? 404 00:19:37,170 --> 00:19:38,467 I minimized in this direction. 405 00:19:38,467 --> 00:19:40,300 Then I turned, I went some other way, right. 406 00:19:40,300 --> 00:19:41,720 And I minimized in this direction. 407 00:19:41,720 --> 00:19:44,860 So this will still be a process that will sort of weave 408 00:19:44,860 --> 00:19:47,630 back and forth potentially. 409 00:19:47,630 --> 00:19:51,130 And so the idea instead is to try to preserve minimization 410 00:19:51,130 --> 00:19:53,900 along one particular direction. 411 00:19:53,900 --> 00:19:56,510 So how do we choose an optimal direction? 412 00:19:56,510 --> 00:19:59,060 So f, right, at the current iterate, 413 00:19:59,060 --> 00:20:01,550 it's already minimized along p, right. 414 00:20:01,550 --> 00:20:03,957 Moving in p, forwards or backwards, 415 00:20:03,957 --> 00:20:05,540 isn't going to make f any smaller. 416 00:20:05,540 --> 00:20:07,130 That's as small as it can be. 417 00:20:09,680 --> 00:20:16,610 So why not choose p so that it's normal to the gradient 418 00:20:16,610 --> 00:20:17,980 at the next iterate? 419 00:20:17,980 --> 00:20:20,330 OK, so choose this direction p so it's 420 00:20:20,330 --> 00:20:22,610 normal to the gradient at the next iterate. 421 00:20:22,610 --> 00:20:27,170 And then see if that holds for one iterate more after that. 422 00:20:27,170 --> 00:20:29,810 So I move in a direction. 423 00:20:29,810 --> 00:20:31,910 I step up to a contour.
424 00:20:31,910 --> 00:20:34,970 And I want my p to be orthogonal to the gradient 425 00:20:34,970 --> 00:20:35,990 at that next contour. 426 00:20:35,990 --> 00:20:38,810 So I've minimized this way, right. 427 00:20:38,810 --> 00:20:41,240 I've minimized everything that I could in directions 428 00:20:41,240 --> 00:20:43,310 that aren't in the gradient direction 429 00:20:43,310 --> 00:20:44,945 associated with the next iterate. 430 00:20:44,945 --> 00:20:47,570 And then let's see if I can even do that for the next iteration 431 00:20:47,570 --> 00:20:48,070 too. 432 00:20:48,070 --> 00:20:51,020 So can I make it true that the gradient at the next iterate 433 00:20:51,020 --> 00:20:52,650 is also orthogonal to p? 434 00:20:56,870 --> 00:20:59,710 By doing this, I get to preserve all the minimization 435 00:20:59,710 --> 00:21:00,710 from the previous steps. 436 00:21:00,710 --> 00:21:02,734 So I minimize in this direction. 437 00:21:02,734 --> 00:21:05,150 And now I'm going to take a step in a different direction. 438 00:21:05,150 --> 00:21:06,950 But I'm going to make sure that as I 439 00:21:06,950 --> 00:21:10,631 take that step in another direction, right, 440 00:21:10,631 --> 00:21:12,630 I don't have to step completely in the gradient. 441 00:21:12,630 --> 00:21:14,360 I don't have to go in the steepest descent direction. 442 00:21:14,360 --> 00:21:16,050 I can project out everything that I've 443 00:21:16,050 --> 00:21:17,640 stepped in already, right. 444 00:21:17,640 --> 00:21:19,380 I can project out all the minimization 445 00:21:19,380 --> 00:21:22,860 I've already accomplished along this p direction. 446 00:21:22,860 --> 00:21:25,860 So it turns out you can solve, right, 447 00:21:25,860 --> 00:21:28,260 you can calculate what this gradient is. 448 00:21:28,260 --> 00:21:32,610 The gradient in this function is Ax minus b. 449 00:21:32,610 --> 00:21:35,514 So you can substitute exactly what that gradient is. 450 00:21:35,514 --> 00:21:41,550 A times Xi plus 2, minus b, dotted with p, right. 451 00:21:41,550 --> 00:21:43,080 This has to be equal to 0. 452 00:21:43,080 --> 00:21:46,260 And you can show that means that p i plus 1 transpose A times 453 00:21:46,260 --> 00:21:48,497 p i has to be equal to 0 as well. 454 00:21:48,497 --> 00:21:50,830 You don't need to be able to work through these details. 455 00:21:50,830 --> 00:21:52,740 You just need to know that this gives 456 00:21:52,740 --> 00:21:54,780 a relationship between the directions 457 00:21:54,780 --> 00:21:57,774 on two consecutive iterates, OK. 458 00:21:57,774 --> 00:21:59,190 So it says if I picked a direction 459 00:21:59,190 --> 00:22:04,776 p on the previous iteration, take how it's transformed by A, 460 00:22:04,776 --> 00:22:07,110 and make sure that my next direction is 461 00:22:07,110 --> 00:22:10,228 orthogonal to that vector, OK. 462 00:22:10,228 --> 00:22:11,206 Yeah? 463 00:22:11,206 --> 00:22:15,118 STUDENT: So does that mean that your p's 464 00:22:15,118 --> 00:22:17,850 are all independent of each other, 465 00:22:17,850 --> 00:22:20,670 or just that adjacent p's, 466 00:22:20,670 --> 00:22:23,329 the k, k plus 1 p's, are? 467 00:22:23,329 --> 00:22:24,870 JAMES SWAN: This is a great question. 468 00:22:29,300 --> 00:22:34,040 So the goal with this method, the ideal way to do this 469 00:22:34,040 --> 00:22:36,080 would be to have these directions actually 470 00:22:36,080 --> 00:22:40,301 be the directions of the eigenvectors of A.
471 00:22:40,301 --> 00:22:42,710 And those eigenvectors for symmetric matrix 472 00:22:42,710 --> 00:22:45,230 are all orthogonal to each other. 473 00:22:45,230 --> 00:22:46,361 OK? 474 00:22:46,361 --> 00:22:48,860 And so you'll be stepping along these orthogonal directions. 475 00:22:48,860 --> 00:22:50,990 And they would be all independent of each other. 476 00:22:50,990 --> 00:22:51,200 OK? 477 00:22:51,200 --> 00:22:53,491 But that's a hard problem, finding all the eigenvectors 478 00:22:53,491 --> 00:22:55,180 associated with a matrix. 479 00:22:55,180 --> 00:22:59,900 Instead, OK, we pick an initial direction p to go in. 480 00:22:59,900 --> 00:23:02,420 And then we try to ensure that all of the other directions 481 00:23:02,420 --> 00:23:07,640 satisfy this conjugacy condition, right. 482 00:23:07,640 --> 00:23:09,620 That the transformation of p by A 483 00:23:09,620 --> 00:23:12,630 is orthogonal with the next direction that I choose. 484 00:23:12,630 --> 00:23:15,450 So they're not independent of each other. 485 00:23:15,450 --> 00:23:19,280 But they are what we call conjugate to each other. 486 00:23:19,280 --> 00:23:23,530 It turns out that by doing this, these sets of directions p 487 00:23:23,530 --> 00:23:25,190 will belong to-- 488 00:23:25,190 --> 00:23:29,120 they can be expressed in terms of many products of A 489 00:23:29,120 --> 00:23:30,660 with the initial direction p. 490 00:23:30,660 --> 00:23:32,660 That'll give you all these different directions. 491 00:23:32,660 --> 00:23:35,150 It starts to look something like the power iteration method 492 00:23:35,150 --> 00:23:38,121 for finding the largest eigenvector of a matrix. 493 00:23:38,121 --> 00:23:38,620 OK? 494 00:23:38,620 --> 00:23:41,450 So you create a certain set of vectors 495 00:23:41,450 --> 00:23:44,045 that span the entire subspace of A. 496 00:23:44,045 --> 00:23:47,456 And you step specifically along those directions. 497 00:23:47,456 --> 00:23:49,580 And that lets you preserve some of the minimization 498 00:23:49,580 --> 00:23:51,470 as you step each way. 499 00:23:54,490 --> 00:23:59,000 So what's said here is that the direction p plus 1 500 00:23:59,000 --> 00:24:03,290 is conjugate to the direction p. 501 00:24:03,290 --> 00:24:05,420 And by choosing the directions in this way, 502 00:24:05,420 --> 00:24:10,520 you're ensuring that p is orthogonal to the gradient at i 503 00:24:10,520 --> 00:24:12,850 plus 1 and the gradient i plus 2. 504 00:24:12,850 --> 00:24:17,000 So you're not stepping in the steepest descent directions 505 00:24:17,000 --> 00:24:21,328 that you'll pick up later on in the iterative process. 506 00:24:21,328 --> 00:24:23,074 OK? 507 00:24:23,074 --> 00:24:25,240 So when you know which direction you're stepping in, 508 00:24:25,240 --> 00:24:29,100 then you've got to satisfy this conjugacy condition. 509 00:24:29,100 --> 00:24:34,710 But actually, this is a vector in n space, right. 510 00:24:34,710 --> 00:24:37,000 This is also a vector n space. 511 00:24:37,000 --> 00:24:40,260 And we have one equation to describe all and components. 512 00:24:40,260 --> 00:24:42,430 So it's an under-determined problem. 513 00:24:42,430 --> 00:24:45,960 So then one has to pick which particular one 514 00:24:45,960 --> 00:24:48,240 of these conjugate vectors do I want to step along. 
515 00:24:48,240 --> 00:24:49,900 And one particular choice is this one, 516 00:24:49,900 --> 00:24:53,790 which says, step along the gradient direction, OK, do 517 00:24:53,790 --> 00:24:57,450 steepest descent, but project out 518 00:24:57,450 --> 00:25:00,090 the component of the gradient along pi. 519 00:25:00,090 --> 00:25:01,650 We already minimized along pi. 520 00:25:01,650 --> 00:25:04,108 We don't have to go in the pi direction anymore, right. 521 00:25:04,108 --> 00:25:08,676 So do steepest descent, but remove the pi component. 522 00:25:13,140 --> 00:25:19,590 So here is a quadratic objective function. 523 00:25:19,590 --> 00:25:22,350 It corresponds to a linear equation with coefficient 524 00:25:22,350 --> 00:25:27,690 matrix [1 0; 0 10], a diagonal coefficient matrix. 525 00:25:27,690 --> 00:25:29,760 And b equals 0. 526 00:25:29,760 --> 00:25:33,720 So the solution of the system of linear equations is (0, 0). 527 00:25:33,720 --> 00:25:37,380 We start with an initial guess up here, OK. 528 00:25:37,380 --> 00:25:43,390 And we try steepest descent with some small step size, right. 529 00:25:43,390 --> 00:25:44,870 You'll follow this blue path here. 530 00:25:44,870 --> 00:25:46,120 And you can see what happened. 531 00:25:46,120 --> 00:25:49,240 That step size was reasonable as we 532 00:25:49,240 --> 00:25:51,250 moved along the steepest descent direction 533 00:25:51,250 --> 00:25:54,950 where the contours were pretty narrowly spaced. 534 00:25:54,950 --> 00:25:57,290 But as we got down to the flatter section, 535 00:25:57,290 --> 00:25:59,710 OK, as we got down to the flatter 536 00:25:59,710 --> 00:26:03,760 section of our objective function, 537 00:26:03,760 --> 00:26:05,060 those steps are really small. 538 00:26:05,060 --> 00:26:05,230 Right? 539 00:26:05,230 --> 00:26:06,729 We're headed in the right direction, 540 00:26:06,729 --> 00:26:09,545 we're just taking very, very small steps. 541 00:26:09,545 --> 00:26:13,370 If you apply this conjugate gradient methodology, well, 542 00:26:13,370 --> 00:26:18,110 the first step you take, that's prescribed. 543 00:26:18,110 --> 00:26:21,290 You've got to step in some direction. 544 00:26:21,290 --> 00:26:24,230 The second step you take though minimizes completely 545 00:26:24,230 --> 00:26:25,610 along this direction. 546 00:26:25,610 --> 00:26:27,800 So the first step was the same for both of these. 547 00:26:27,800 --> 00:26:32,710 But the second step was chosen to minimize completely 548 00:26:32,710 --> 00:26:33,790 along this direction. 549 00:26:33,790 --> 00:26:35,980 So it's totally minimized. 550 00:26:35,980 --> 00:26:41,490 And the third step here also steps all the way 551 00:26:41,490 --> 00:26:43,020 to the center. 552 00:26:43,020 --> 00:26:44,610 So it shows a conjugate direction that 553 00:26:44,610 --> 00:26:46,290 stepped from here to there. 554 00:26:46,290 --> 00:26:49,110 And it didn't lose any of the minimization 555 00:26:49,110 --> 00:26:51,960 in the original direction that it proceeded along. 556 00:26:51,960 --> 00:26:54,936 So that's conjugate gradient. 557 00:26:54,936 --> 00:26:58,190 It's used to solve linear equations with order n 558 00:26:58,190 --> 00:26:59,230 iterations, right. 559 00:26:59,230 --> 00:27:05,690 So A has at most n independent eigenvectors, 560 00:27:05,690 --> 00:27:08,360 independent directions that I can step along and do 561 00:27:08,360 --> 00:27:10,700 this minimization. 562 00:27:10,700 --> 00:27:13,070 The conjugate gradient method is doing precisely that.
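Here is a minimal numerical sketch, in Python with NumPy, of the comparison in that figure; the starting guess and the fixed step size are illustrative assumptions. For this quadratic objective, the conjugate gradient iteration reaches the solution at (0, 0) in two steps, while fixed-step steepest descent creeps along the shallow direction.

import numpy as np

A = np.diag([1.0, 10.0])           # the diagonal coefficient matrix from the example
b = np.zeros(2)                    # b = 0, so the solution of A x = b is (0, 0)
x0 = np.array([2.0, 1.5])          # an illustrative starting guess

# steepest descent with a fixed step size alpha
x = x0.copy()
for _ in range(50):
    g = A @ x - b                  # gradient of 0.5*x.A.x - b.x
    x = x - 0.05 * g
print("fixed-step steepest descent after 50 steps:", x)

# conjugate gradient: first step is steepest descent, later directions are conjugate
x, r = x0.copy(), b - A @ x0
p = r.copy()
for _ in range(2):                 # n = 2 unknowns, so at most 2 steps
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)     # exact minimizer along the direction p
    x = x + alpha * p
    r_new = r - alpha * Ap
    p = r_new + (r_new @ r_new) / (r @ r) * p   # conjugate to the previous direction
    r = r_new
print("conjugate gradient after 2 steps:", x)    # essentially (0, 0)

Note that A only ever enters through products like A @ p, which is the property the lecture comes back to in a moment.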
563 00:27:13,070 --> 00:27:14,570 It doesn't know what the eigendirections are, but it 564 00:27:14,570 --> 00:27:17,150 steps along these conjugate directions as a proxy 565 00:27:17,150 --> 00:27:19,310 for the eigendirections. 566 00:27:19,310 --> 00:27:24,560 So it can do minimization with just n steps for a system 567 00:27:24,560 --> 00:27:26,030 of n equations for n unknowns. 568 00:27:28,890 --> 00:27:31,350 It requires only the ability to compute 569 00:27:31,350 --> 00:27:34,962 the product of your matrix A with some vector, right. 570 00:27:34,962 --> 00:27:36,420 All the calculations there only 571 00:27:36,420 --> 00:27:38,820 depend on the product of A with a vector. 572 00:27:38,820 --> 00:27:41,340 So we don't have to store A, we just have to know what A is. 573 00:27:41,340 --> 00:27:44,250 We have some procedure for generating A. 574 00:27:44,250 --> 00:27:48,300 Maybe A is a linear operator that comes 575 00:27:48,300 --> 00:27:50,550 from a solution of some differential equations 576 00:27:50,550 --> 00:27:51,892 instead, right. 577 00:27:51,892 --> 00:27:53,850 And we don't have an explicit expression for A, 578 00:27:53,850 --> 00:27:57,990 but we have some simulator that takes some data, 579 00:27:57,990 --> 00:28:01,732 and applies A to give some answer, right. 580 00:28:01,732 --> 00:28:02,940 So we just need this product. 581 00:28:02,940 --> 00:28:05,280 We don't have to store A exactly. 582 00:28:05,280 --> 00:28:07,650 It's only good for symmetric positive definite matrices, 583 00:28:07,650 --> 00:28:08,340 right. 584 00:28:08,340 --> 00:28:10,950 This sort of free energy functional that we wrote 585 00:28:10,950 --> 00:28:12,630 or objective function we wrote only 586 00:28:12,630 --> 00:28:16,239 admits symmetric matrices which are positive definite. 587 00:28:16,239 --> 00:28:18,030 That's the only way it will have a minimum. 588 00:28:18,030 --> 00:28:20,460 And so that's the only way a steepest descent or descent-type 589 00:28:20,460 --> 00:28:24,930 procedure is going to get to the optimum. 590 00:28:24,930 --> 00:28:27,390 But there are more sophisticated methods that 591 00:28:27,390 --> 00:28:28,700 exist for arbitrary matrices. 592 00:28:28,700 --> 00:28:32,970 So if we don't want symmetry or we don't care about 593 00:28:32,970 --> 00:28:34,774 whether it's positive definite, there 594 00:28:34,774 --> 00:28:36,690 are equivalent sorts of methods that are based 595 00:28:36,690 --> 00:28:39,600 around the same principle. 596 00:28:39,600 --> 00:28:41,950 And it turns out, this is really the state of the art. 597 00:28:41,950 --> 00:28:44,970 So if you want to solve complicated large systems 598 00:28:44,970 --> 00:28:48,560 of equations, you know Gaussian elimination, that 599 00:28:48,560 --> 00:28:49,920 will get you an exact solution. 600 00:28:49,920 --> 00:28:52,817 But that's often infeasible for the sorts of problems 601 00:28:52,817 --> 00:28:54,150 that we're really interested in. 602 00:28:54,150 --> 00:28:57,120 So instead, you use these sorts of iterative methods. 603 00:28:57,120 --> 00:28:59,850 Things like Jacobi and Gauss-Seidel, 604 00:28:59,850 --> 00:29:02,780 they're sort of the classics in the field. 605 00:29:02,780 --> 00:29:05,220 And they work, and you can show that they converge in 606 00:29:05,220 --> 00:29:06,720 lots of circumstances.
607 00:29:06,720 --> 00:29:09,150 But these sorts of iterative methods, 608 00:29:09,150 --> 00:29:13,140 like conjugate gradient and its brethren other Krylov subspace 609 00:29:13,140 --> 00:29:14,900 methods they're called, are really 610 00:29:14,900 --> 00:29:17,344 the state of the art, and the ones that you reach for. 611 00:29:17,344 --> 00:29:19,510 You already did conjugate gradients in one homework, 612 00:29:19,510 --> 00:29:20,010 right. 613 00:29:20,010 --> 00:29:23,220 You used this PCG iterative method in Matlab 614 00:29:23,220 --> 00:29:24,980 to solve a system of linear equations. 615 00:29:24,980 --> 00:29:26,310 It was doing this, right. 616 00:29:26,310 --> 00:29:28,154 This is how it works. 617 00:29:28,154 --> 00:29:29,570 OK? 618 00:29:29,570 --> 00:29:31,930 OK. 619 00:29:31,930 --> 00:29:33,420 OK, so that's conjugate gradients. 620 00:29:33,420 --> 00:29:41,400 You could apply it also to objective functions that 621 00:29:41,400 --> 00:29:44,786 aren't quadratic in nature. 622 00:29:44,786 --> 00:29:46,680 And the formulation changes a little bit. 623 00:29:46,680 --> 00:29:48,630 Everywhere the matrix A appeared, 624 00:29:48,630 --> 00:29:51,210 it needs to be replaced with the Hessian 625 00:29:51,210 --> 00:29:52,590 at a certain iterate. 626 00:29:52,590 --> 00:29:54,300 But the same idea persists. 627 00:29:54,300 --> 00:29:56,716 It says well, we think, in our best approximation 628 00:29:56,716 --> 00:29:58,590 for the function, that we've minimized as much 629 00:29:58,590 --> 00:29:59,890 as we can in one direction. 630 00:29:59,890 --> 00:30:02,310 So let's choose a conjugate direction to go in, 631 00:30:02,310 --> 00:30:04,950 and try not to ruin the minimizations we 632 00:30:04,950 --> 00:30:07,860 did in the direction we were headed before. 633 00:30:07,860 --> 00:30:12,390 Of course, these are all linearly convergent sorts 634 00:30:12,390 --> 00:30:15,180 of methods. 635 00:30:15,180 --> 00:30:16,860 And we know that there are better ways 636 00:30:16,860 --> 00:30:20,070 to find roots of non-linear equations like this one, 637 00:30:20,070 --> 00:30:22,410 grad f equals zero, namely the Newton-Raphson method, 638 00:30:22,410 --> 00:30:23,950 which is quadratically convergent. 639 00:30:23,950 --> 00:30:27,010 So if we're really close to a critical point, 640 00:30:27,010 --> 00:30:29,730 and hopefully that critical point is a minimum of f, 641 00:30:29,730 --> 00:30:31,860 right, then we should rapidly converge 642 00:30:31,860 --> 00:30:35,700 to the solution of this system of nonlinear equations 643 00:30:35,700 --> 00:30:37,548 just by applying the Newton-Raphson method. 644 00:30:40,620 --> 00:30:42,120 It's locally convergent, right. 645 00:30:42,120 --> 00:30:43,500 So we're going to get close. 646 00:30:43,500 --> 00:30:45,420 And we get quadratic improvement. 647 00:30:45,420 --> 00:30:48,750 What is the Newton-Raphson iteration, though? 648 00:30:48,750 --> 00:30:49,890 Can you write that down? 649 00:30:49,890 --> 00:30:51,460 What is the Newton-Raphson iteration 650 00:30:51,460 --> 00:30:55,110 that's the iterative map for this system 651 00:30:55,110 --> 00:30:58,230 of non-linear equations, grad f equals 0? 652 00:30:58,230 --> 00:31:00,890 Can you work that out? 653 00:31:00,890 --> 00:31:01,850 What's that look like? 654 00:32:11,200 --> 00:32:13,200 Have we got this?
655 00:32:13,200 --> 00:32:14,990 What's the Newton-Raphson iterative map 656 00:32:14,990 --> 00:32:20,630 look like for this system of non-linear equations? 657 00:32:23,450 --> 00:32:25,178 Want to volunteer an answer? 658 00:32:28,530 --> 00:32:30,500 Nobody knows or nobody is sharing. 659 00:32:30,500 --> 00:32:33,180 OK, that's fine. 660 00:32:33,180 --> 00:32:36,680 Right, so we're trying to solve an equation g of x equals 0. 661 00:32:36,680 --> 00:32:41,570 So the iterative map is Xi plus 1 is Xi minus Jacobian 662 00:32:41,570 --> 00:32:43,470 inverse times g. 663 00:32:43,470 --> 00:32:45,790 And what's the Jacobian of g? 664 00:32:48,680 --> 00:32:50,830 What's the Jacobian of g? 665 00:32:50,830 --> 00:32:52,730 The Hessian, right. 666 00:32:52,730 --> 00:32:54,810 So the Jacobian of g is the gradient 667 00:32:54,810 --> 00:32:57,690 of g, which is the gradient of the gradient of f, which is 668 00:32:57,690 --> 00:32:59,280 the definition of the Hessian. 669 00:32:59,280 --> 00:33:01,680 So really, the Newton-Raphson iteration 670 00:33:01,680 --> 00:33:06,915 is Xi plus 1 is Xi minus Hessian inverse times g. 671 00:33:14,350 --> 00:33:17,590 So the Hessian plays the role of the Jacobian, 672 00:33:17,590 --> 00:33:19,120 in this sort of solution procedure. 673 00:33:24,492 --> 00:33:26,450 And so everything you know about Newton-Raphson 674 00:33:26,450 --> 00:33:27,560 is going to apply here. 675 00:33:27,560 --> 00:33:29,810 Everything you know about quasi-Newton-Raphson methods 676 00:33:29,810 --> 00:33:31,490 is going to apply here. 677 00:33:31,490 --> 00:33:33,675 You're going to substitute for your nonlinear function. 678 00:33:33,675 --> 00:33:35,800 The nonlinear function you're finding the root for, 679 00:33:35,800 --> 00:33:37,085 you're going to substitute the gradient. 680 00:33:37,085 --> 00:33:38,501 And for the Jacobian, you're going 681 00:33:38,501 --> 00:33:40,190 to substitute the Hessian. 682 00:33:40,190 --> 00:33:44,360 Places where the determinant of the Hessian is 0, 683 00:33:44,360 --> 00:33:45,770 right, it's going to be a problem. 684 00:33:45,770 --> 00:33:46,985 Places where the Hessian is singular 685 00:33:46,985 --> 00:33:48,026 are going to be a problem. 686 00:33:48,026 --> 00:33:49,280 Same as with the Jacobian. 687 00:33:52,342 --> 00:33:54,050 But Newton-Raphson has the great property 688 00:33:54,050 --> 00:33:58,250 that if our function is quadratic, like this one is, 689 00:33:58,250 --> 00:34:01,630 it will converge in exactly one step. 690 00:34:06,020 --> 00:34:07,690 So here's steepest descent with a fixed 691 00:34:07,690 --> 00:34:10,449 value of alpha, Newton-Raphson, one step 692 00:34:10,449 --> 00:34:13,370 for a quadratic function. 693 00:34:13,370 --> 00:34:16,216 And why is it one step? 694 00:34:16,216 --> 00:34:18,636 STUDENT: [INAUDIBLE] 695 00:34:23,322 --> 00:34:24,030 JAMES SWAN: Good. 696 00:34:24,030 --> 00:34:29,760 So when we take a Taylor expansion of our f, 697 00:34:29,760 --> 00:34:32,580 in order to derive the Newton-Raphson step, 698 00:34:32,580 --> 00:34:35,420 we're expanding it out to quadratic order. 699 00:34:35,420 --> 00:34:36,690 If the function is quadratic, 700 00:34:36,690 --> 00:34:38,250 the Taylor expansion is exact. 701 00:34:38,250 --> 00:34:40,800 And the solution of that equation, right, 702 00:34:40,800 --> 00:34:43,739 gradient f equals 0 or g equals 0, 703 00:34:43,739 --> 00:34:45,870 that's the solution of a linear equation, right.
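Here is a minimal sketch of that iterative map in Python with NumPy; the quadratic test problem is an illustrative assumption. The Hessian stands in for the Jacobian, and the linear system is solved directly at each step rather than forming the inverse.

import numpy as np

def newton_raphson_min(grad, hess, x0, tol=1e-10, max_iter=50):
    # Newton-Raphson applied to g(x) = grad f(x) = 0; the Hessian plays the role of the Jacobian.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)   # x_{i+1} = x_i - H^{-1} grad f
    return x

# For a quadratic f(x) = 0.5*x.A.x - b.x, a single Newton-Raphson step is exact:
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = newton_raphson_min(grad=lambda x: A @ x - b, hess=lambda x: A, x0=[5.0, 5.0])
print(x, np.linalg.solve(A, b))               # the two agree: one step lands on the minimum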
704 00:34:48,630 --> 00:34:53,100 So it gives exactly the right step size here 705 00:34:53,100 --> 00:34:57,240 to move from an initial guess to the exact solution 706 00:34:57,240 --> 00:34:58,745 or the minimum of this equation. 707 00:34:58,745 --> 00:35:02,490 So for quadratic equations, Newton-Raphson is exact. 708 00:35:02,490 --> 00:35:06,750 It doesn't go in the steepest descent direction, right. 709 00:35:06,750 --> 00:35:09,452 It goes in a different direction. 710 00:35:09,452 --> 00:35:11,660 It would like to go in the steepest descent direction 711 00:35:11,660 --> 00:35:15,450 if the Jacobian were identity. 712 00:35:15,450 --> 00:35:19,600 But the Jacobian is a measure of how curved f is. 713 00:35:19,600 --> 00:35:23,310 The Hessian, let's say, is a measure of how curved f is. 714 00:35:23,310 --> 00:35:23,940 Right? 715 00:35:23,940 --> 00:35:26,530 And so there's a projection of the gradient 716 00:35:26,530 --> 00:35:29,980 through the Hessian that changes the direction we go in. 717 00:35:29,980 --> 00:35:31,450 That change in direction is meant 718 00:35:31,450 --> 00:35:35,770 to find the minimum of the quadratic function that we 719 00:35:35,770 --> 00:35:37,090 approximate at this point. 720 00:35:37,090 --> 00:35:39,400 So as long as we have a good quadratic approximation, 721 00:35:39,400 --> 00:35:42,024 Newton-Raphson is going to give us good convergence to a minimum 722 00:35:42,024 --> 00:35:44,450 or whatever nearby critical point there is. 723 00:35:44,450 --> 00:35:46,690 If we have a bad approximation for a quadratic, 724 00:35:46,690 --> 00:35:50,160 then it's not going to be so good, right. 725 00:35:50,160 --> 00:35:51,780 So here's this very steep function. 726 00:35:51,780 --> 00:35:56,850 Log of f is quadratic, but f is exponential in x here. 727 00:35:56,850 --> 00:36:00,390 So you got all these tightly spaced contours converging 728 00:36:00,390 --> 00:36:02,790 towards a minimum at (0, 0). 729 00:36:02,790 --> 00:36:05,250 And here I've got to use the steepest descent step 730 00:36:05,250 --> 00:36:07,860 size, the optimal steepest descent step size, which 731 00:36:07,860 --> 00:36:09,960 is a quadratic approximation for the function, 732 00:36:09,960 --> 00:36:12,075 but in the steepest descent direction only. 733 00:36:12,075 --> 00:36:14,640 And here's the path that it follows. 734 00:36:14,640 --> 00:36:16,710 And if I applied Newton-Raphson to this function, 735 00:36:16,710 --> 00:36:19,830 here is the path that it follows instead. 736 00:36:19,830 --> 00:36:21,110 The function isn't quadratic. 737 00:36:21,110 --> 00:36:23,700 So these quadratic approximations aren't-- 738 00:36:23,700 --> 00:36:25,590 they're not great, right. 739 00:36:25,590 --> 00:36:29,730 But the function is convex, right. 740 00:36:29,730 --> 00:36:31,950 So Newton-Raphson is going to proceed downhill 741 00:36:31,950 --> 00:36:35,340 until it converges towards a solution anyways. 742 00:36:35,340 --> 00:36:38,280 Because the Hessian has positive eigenvalues all the time. 743 00:36:40,800 --> 00:36:43,450 Questions about this? 744 00:36:43,450 --> 00:36:44,620 Make sense? 745 00:36:44,620 --> 00:36:45,350 OK? 746 00:36:45,350 --> 00:36:47,332 So you get two different types of methods 747 00:36:47,332 --> 00:36:48,290 that you can play with. 748 00:36:48,290 --> 00:36:54,250 One of which, right, is always going to direct you down hill. 749 00:36:54,250 --> 00:36:58,032 Steepest descent will always carry you downhill, right, 750 00:36:58,032 --> 00:36:58,740 towards a minimum.
751 00:36:58,740 --> 00:37:00,430 And the other one, Newton-Raphson, 752 00:37:00,430 --> 00:37:02,619 converges very quickly when it's close to the root. 753 00:37:02,619 --> 00:37:03,910 OK, so they each have a virtue. 754 00:37:07,504 --> 00:37:08,420 And they're different. 755 00:37:08,420 --> 00:37:10,100 They're fundamentally different, right. 756 00:37:10,100 --> 00:37:13,320 They take steps in completely different directions. 757 00:37:13,320 --> 00:37:15,604 When is Newton-Raphson not going to step down hill? 758 00:37:30,918 --> 00:37:31,906 STUDENT: [INAUDIBLE] 759 00:37:31,906 --> 00:37:32,894 What's that? 760 00:37:32,894 --> 00:37:35,890 STUDENT: [INAUDIBLE] 761 00:37:35,890 --> 00:37:37,929 JAMES SWAN: OK, that's more generic 762 00:37:37,929 --> 00:37:39,220 an answer than I'm looking for. 763 00:37:39,220 --> 00:37:43,390 So there may be circumstances where I have two local minima. 764 00:37:43,390 --> 00:37:45,700 That means there must be maybe a saddle point that 765 00:37:45,700 --> 00:37:47,331 sits between them. 766 00:37:47,331 --> 00:37:49,330 Newton-Raphson doesn't care which critical point 767 00:37:49,330 --> 00:37:50,050 it's going after. 768 00:37:50,050 --> 00:37:52,570 So it may try to approach the saddle point instead. 769 00:37:52,570 --> 00:37:54,280 That's true. 770 00:37:54,280 --> 00:37:56,030 That's true. 771 00:37:56,030 --> 00:37:56,690 Yeah? 772 00:37:56,690 --> 00:38:00,120 STUDENT: When Hessian [INAUDIBLE].. 773 00:38:00,120 --> 00:38:01,510 JAMES SWAN: Good, yeah. 774 00:38:01,510 --> 00:38:04,840 With the Hessian doesn't have all positive eigenvalues, 775 00:38:04,840 --> 00:38:05,980 right. 776 00:38:05,980 --> 00:38:10,390 So if all the eigenvalues of the Hessian are positive, 777 00:38:10,390 --> 00:38:15,190 then the transformation h times g or h inverse times g, 778 00:38:15,190 --> 00:38:17,760 it'll never switch the direction I'm going. 779 00:38:17,760 --> 00:38:20,450 I'll always be headed in a downhill direction. 780 00:38:20,450 --> 00:38:20,950 Right? 781 00:38:20,950 --> 00:38:25,011 In a direction that's anti-parallel to the gradient. 782 00:38:25,011 --> 00:38:25,510 OK? 783 00:38:25,510 --> 00:38:27,190 But if the eigenvalues of the Hessian 784 00:38:27,190 --> 00:38:30,430 are negative, if some of them are negative 785 00:38:30,430 --> 00:38:32,260 and the gradient has me pointing along 786 00:38:32,260 --> 00:38:35,890 that eigenvector in a significant amount, 787 00:38:35,890 --> 00:38:37,990 then this product will switch me around 788 00:38:37,990 --> 00:38:40,360 and will have me go uphill instead. 789 00:38:40,360 --> 00:38:44,260 It'll have me chasing down a maxima or a saddle point 790 00:38:44,260 --> 00:38:44,830 instead. 791 00:38:44,830 --> 00:38:47,050 That's what the quadratic approximation 792 00:38:47,050 --> 00:38:49,834 of our objective function will look like. 793 00:38:49,834 --> 00:38:52,000 It looks like there's a maximum or a saddle instead. 794 00:38:52,000 --> 00:38:54,671 And the function will run uphill. 795 00:38:54,671 --> 00:38:56,645 OK? 796 00:38:56,645 --> 00:38:58,520 So there lots of strengths to Newton-Raphson. 797 00:38:58,520 --> 00:38:59,936 Convergence is one of them, right. 798 00:38:59,936 --> 00:39:01,725 The rate of convergence is good. 799 00:39:01,725 --> 00:39:03,350 It's a locally convergent, that's good. 800 00:39:03,350 --> 00:39:04,760 It's got lots of weaknesses, though. 801 00:39:04,760 --> 00:39:05,060 Right? 
802 00:39:05,060 --> 00:39:06,950 It's going to be a pain when the Hessian is 803 00:39:06,950 --> 00:39:08,600 singular at various places. 804 00:39:08,600 --> 00:39:11,672 You've got to solve systems of linear equations 805 00:39:11,672 --> 00:39:13,130 to figure out what these steps are. 806 00:39:13,130 --> 00:39:16,460 That's expensive computationally. 807 00:39:16,460 --> 00:39:19,070 It's not designed to seek out minima, 808 00:39:19,070 --> 00:39:22,075 but to seek out critical points of our objective function. 809 00:39:24,870 --> 00:39:27,210 Steepest descent has lots of strengths, right. 810 00:39:27,210 --> 00:39:29,540 Always heads downhill, that's good. 811 00:39:29,540 --> 00:39:31,800 If we put a little quadratic approximation on it, 812 00:39:31,800 --> 00:39:36,870 we can even stabilize it and get good control over the descent. 813 00:39:36,870 --> 00:39:41,320 Its weaknesses are it's got the property 814 00:39:41,320 --> 00:39:44,827 that it's linearly convergent instead of quadratically 815 00:39:44,827 --> 00:39:46,035 convergent when it converges. 816 00:39:46,035 --> 00:39:47,470 So it's slower, right. 817 00:39:47,470 --> 00:39:49,450 It might be harder to find a minimum. 818 00:39:49,450 --> 00:39:52,210 You've seen several examples where the path sort of peters 819 00:39:52,210 --> 00:39:55,630 out with lots of little iterations, tiny steps 820 00:39:55,630 --> 00:39:57,400 towards the solution. 821 00:39:57,400 --> 00:39:59,740 That's a weakness of steepest descent. 822 00:39:59,740 --> 00:40:01,960 We know that if we go over the edge 823 00:40:01,960 --> 00:40:04,480 of a cliff on our potential energy landscape, 824 00:40:04,480 --> 00:40:06,977 steepest descent is just going to run away, right. 825 00:40:06,977 --> 00:40:08,560 As long as there's one of these edges, 826 00:40:08,560 --> 00:40:11,217 it'll just keep running downhill for as long as it can. 827 00:40:14,140 --> 00:40:17,540 So what's done is to try to combine these methods. 828 00:40:17,540 --> 00:40:19,370 Why choose one, right? 829 00:40:19,370 --> 00:40:22,640 We're trying to step our way towards a solution. 830 00:40:22,640 --> 00:40:24,140 What if we could craft a heuristic 831 00:40:24,140 --> 00:40:26,300 procedure that mixed these two? 832 00:40:26,300 --> 00:40:28,487 And when steepest descent would be best, use that. 833 00:40:28,487 --> 00:40:30,320 When Newton-Raphson would be best, use that. 834 00:40:30,320 --> 00:40:31,105 Yes? 835 00:40:31,105 --> 00:40:32,470 STUDENT: Just a quick question on Newton-Raphson. 836 00:40:32,470 --> 00:40:33,136 JAMES SWAN: Yes? 837 00:40:33,136 --> 00:40:34,682 STUDENT: Would it run downhill also 838 00:40:34,682 --> 00:40:36,674 if you started it over there? 839 00:40:36,674 --> 00:40:39,164 Or since it seeks critical points, 840 00:40:39,164 --> 00:40:42,482 could you go back up to the [INAUDIBLE].. 841 00:40:42,482 --> 00:40:43,940 JAMES SWAN: That's a good question. 842 00:40:43,940 --> 00:40:50,370 So if there's an asymptote in f, it 843 00:40:50,370 --> 00:40:54,720 will perceive the asymptote as a critical point and chase it. 844 00:40:54,720 --> 00:40:55,449 OK? 845 00:40:55,449 --> 00:40:57,240 And so if there's an asymptote in f, it can 846 00:40:57,240 --> 00:40:58,440 perceive that and chase it. 847 00:40:58,440 --> 00:41:01,000 It can also run away as it gets very far away. 848 00:41:01,000 --> 00:41:02,082 This is true. 849 00:41:02,082 --> 00:41:02,582 OK?
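To make the two kinds of steps concrete, here is a minimal sketch in Python/NumPy (the course tools are MATLAB, and these function names are just for illustration, not anything from the lecture). It builds both steps from a gradient g and Hessian H, and checks that the Newton-Raphson direction is downhill when H is positive definite:

    import numpy as np

    def steepest_descent_step(g, H):
        # Optimal steepest-descent step: direction -g, with the step size alpha
        # taken from the vertex of the quadratic (Taylor) model along that direction.
        alpha = (g @ g) / (g @ H @ g)
        return -alpha * g

    def newton_raphson_step(g, H):
        # Newton-Raphson step: solve H p = -g.  It heads for the critical point of
        # the local quadratic model, a minimum only if H is positive definite.
        return np.linalg.solve(H, -g)

    # Example on a quadratic bowl f(x) = 0.5 x.H.x, where Newton-Raphson is exact.
    H = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    x = np.array([2.0, -1.0])
    g = H @ x                       # gradient of the bowl at x

    p_nr = newton_raphson_step(g, H)
    p_sd = steepest_descent_step(g, H)

    print(x + p_nr)                 # lands exactly on the minimum at the origin
    print(g @ p_nr < 0)             # True here: a descent direction, since H is positive definite
    print(x + p_sd)                 # steepest descent makes progress but is not exact

On the quadratic bowl in this example the Newton-Raphson step lands exactly on the minimum, which is the "exact for quadratics" behavior described above; the optimal steepest-descent step only gets part of the way there.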
850 00:41:08,580 --> 00:41:11,940 The contour example that I gave you at the start of class 851 00:41:11,940 --> 00:41:15,120 had sort of bowl-shaped functions superimposed 852 00:41:15,120 --> 00:41:19,140 with a linear function, sort of planar function instead. 853 00:41:19,140 --> 00:41:24,520 For that one, right, the Hessian is ill-defined, right. 854 00:41:24,520 --> 00:41:26,160 There is no curvature to the function. 855 00:41:26,160 --> 00:41:28,500 But you can imagine adding a small bit of curvature 856 00:41:28,500 --> 00:41:29,410 to that, right. 857 00:41:29,410 --> 00:41:31,410 And depending on the direction of the curvature, 858 00:41:31,410 --> 00:41:35,290 Newton-Raphson may run downhill or it may run back uphill, 859 00:41:35,290 --> 00:41:35,790 right? 860 00:41:35,790 --> 00:41:37,230 We can't guarantee which direction it's going to go. 861 00:41:37,230 --> 00:41:39,150 Depends on the details of the function. 862 00:41:39,150 --> 00:41:39,870 Does that answer your question? 863 00:41:39,870 --> 00:41:40,370 Yeah? 864 00:41:40,370 --> 00:41:41,482 Good. 865 00:41:41,482 --> 00:41:43,690 STUDENT: Sir, can you just go back to that one slide? 866 00:41:43,690 --> 00:41:44,398 JAMES SWAN: Yeah. 867 00:41:44,398 --> 00:41:46,692 STUDENT: I'm just pointing out, if the eigenvalues of h 868 00:41:46,692 --> 00:41:49,982 are negative, then the formula there for alpha 869 00:41:49,982 --> 00:41:50,940 could have trouble too. 870 00:41:50,940 --> 00:41:51,940 JAMES SWAN: That's true. 871 00:41:51,940 --> 00:41:54,410 STUDENT: Similar to how the Newton-Raphson had trouble. 872 00:41:54,410 --> 00:41:56,551 JAMES SWAN: This is true. 873 00:41:56,551 --> 00:41:58,020 This is true, yeah. 874 00:41:58,020 --> 00:42:01,800 So we chose a quadratic approximation here, right, 875 00:42:01,800 --> 00:42:02,730 for our function. 876 00:42:02,730 --> 00:42:05,875 We sought a critical point of this quadratic approximation. 877 00:42:05,875 --> 00:42:07,750 We didn't mandate that it had to be a minimum. 878 00:42:07,750 --> 00:42:09,660 So that's absolutely right. 879 00:42:09,660 --> 00:42:13,620 So if h has negative eigenvalues and the gradient points 880 00:42:13,620 --> 00:42:16,599 enough in the direction of the eigenvectors 881 00:42:16,599 --> 00:42:18,390 associated with those negative eigenvalues, 882 00:42:18,390 --> 00:42:22,100 then we may have a case where alpha isn't positive. 883 00:42:22,100 --> 00:42:24,090 We required early on that alpha should be 884 00:42:24,090 --> 00:42:25,470 positive for steepest descent. 885 00:42:25,470 --> 00:42:28,930 So we can't have a case where alpha is not positive. 886 00:42:28,930 --> 00:42:31,380 That's true. 887 00:42:31,380 --> 00:42:32,221 OK. 888 00:42:32,221 --> 00:42:33,720 So they're both interesting methods, 889 00:42:33,720 --> 00:42:35,270 and they can be mixed together. 890 00:42:35,270 --> 00:42:37,250 And the way you mix those is with what's 891 00:42:37,250 --> 00:42:41,030 called trust-region ideas, OK. 892 00:42:41,030 --> 00:42:44,630 Because it could be that we've had an iteration 893 00:42:44,630 --> 00:42:48,380 Xi and we do a quadratic approximation 894 00:42:48,380 --> 00:42:51,350 to our functional, which is this blue curve. 895 00:42:51,350 --> 00:42:54,260 Our quadratic approximation is this red one. 896 00:42:54,260 --> 00:42:56,150 And we find the minimum of this red curve 897 00:42:56,150 --> 00:42:58,765 and use that as our next best guess for the solution 898 00:42:58,765 --> 00:42:59,390 to the problem.
899 00:42:59,390 --> 00:43:02,219 And this seems to be working us closer and closer 900 00:43:02,219 --> 00:43:04,010 towards the actual minimum in the function. 901 00:43:04,010 --> 00:43:06,410 So the quadratic approximation seems good. 902 00:43:06,410 --> 00:43:08,270 If the quadratic approximation is good, 903 00:43:08,270 --> 00:43:11,440 which method should we choose? 904 00:43:11,440 --> 00:43:12,440 STUDENT: Newton-Raphson. 905 00:43:12,440 --> 00:43:14,840 JAMES SWAN: Newton-Raphson, right. 906 00:43:14,840 --> 00:43:16,970 Could also be the case though that we 907 00:43:16,970 --> 00:43:20,020 make this quadratic approximation 908 00:43:20,020 --> 00:43:22,510 from our current iteration, and we 909 00:43:22,510 --> 00:43:26,630 find a minimum that somehow oversteps the minimum here. 910 00:43:26,630 --> 00:43:29,980 In fact, if we look at the value of our objective function 911 00:43:29,980 --> 00:43:32,230 at this next step, it's higher than the value 912 00:43:32,230 --> 00:43:34,350 of the objective function where we started. 913 00:43:34,350 --> 00:43:36,980 So it seems like a quadratic approximation is not so good, 914 00:43:36,980 --> 00:43:37,640 right. 915 00:43:37,640 --> 00:43:40,521 That's a clear indication that this quadratic approximation 916 00:43:40,521 --> 00:43:41,020 isn't right. 917 00:43:41,020 --> 00:43:44,140 Because it suggested that we should have had a minimum here, 918 00:43:44,140 --> 00:43:45,100 right. 919 00:43:45,100 --> 00:43:47,456 But our function got bigger instead. 920 00:43:47,456 --> 00:43:49,580 And so in this case, it doesn't seem 921 00:43:49,580 --> 00:43:52,700 like you'd want to choose Newton-Raphson 922 00:43:52,700 --> 00:43:54,250 to take your steps. 923 00:43:54,250 --> 00:43:56,150 The quadratic approximation is not so good. 924 00:43:56,150 --> 00:44:00,640 Maybe just simple steepest descent is a better choice. 925 00:44:00,640 --> 00:44:01,510 OK, so here's what's done. 926 00:44:05,040 --> 00:44:08,540 So if you're at a point, you might draw a circle 927 00:44:08,540 --> 00:44:11,510 around that point with some prescribed radius. 928 00:44:11,510 --> 00:44:13,430 Call that Ri. 929 00:44:13,430 --> 00:44:15,770 This is our iterate Xi. 930 00:44:15,770 --> 00:44:20,050 This is our trust-region radius Ri. 931 00:44:20,050 --> 00:44:25,800 And we might ask, where does our Newton-Raphson step go? 932 00:44:25,800 --> 00:44:29,520 And where does our steepest descent step take us? 933 00:44:29,520 --> 00:44:32,370 And then based on whether these steps carry us 934 00:44:32,370 --> 00:44:35,280 outside of our trust-region, we might decide 935 00:44:35,280 --> 00:44:38,050 to take one or the other. 936 00:44:38,050 --> 00:44:40,950 So if I set a particular size Ri, 937 00:44:40,950 --> 00:44:44,130 particular trust-region size Ri and the Newton-Raphson step 938 00:44:44,130 --> 00:44:48,310 goes outside of that, we might say well, 939 00:44:48,310 --> 00:44:50,710 I don't actually trust my quadratic approximation 940 00:44:50,710 --> 00:44:52,630 this far away from the starred point. 941 00:44:52,630 --> 00:44:56,860 So let's not take a step in that direction. 942 00:44:56,860 --> 00:45:00,620 Instead, let's move in a steepest descent direction. 943 00:45:00,620 --> 00:45:03,170 If my Newton-Raphson step is inside the trust-region, 944 00:45:03,170 --> 00:45:04,970 maybe I'll choose to take it, right.
945 00:45:04,970 --> 00:45:07,010 I trust the quadratic approximation 946 00:45:07,010 --> 00:45:10,660 within a distance Ri of my current iteration. 947 00:45:10,660 --> 00:45:13,360 Does that strategy make sense? 948 00:45:13,360 --> 00:45:15,980 So we're trying to pick between two different methods 949 00:45:15,980 --> 00:45:18,860 in order to give us more reliable convergence 950 00:45:18,860 --> 00:45:19,839 to a local minimum. 951 00:45:24,830 --> 00:45:26,840 So here's our Newton-Raphson step. 952 00:45:26,840 --> 00:45:29,660 It's minus the Hessian inverse times the gradient. 953 00:45:29,660 --> 00:45:31,400 Here's our steepest descent step. 954 00:45:31,400 --> 00:45:34,490 It's minus alpha times the gradient. 955 00:45:34,490 --> 00:45:37,430 And if the Newton-Raphson step is 956 00:45:37,430 --> 00:45:39,860 smaller than the trust-region radius, 957 00:45:39,860 --> 00:45:45,520 and the value of the objective function at Xi, 958 00:45:45,520 --> 00:45:47,780 plus the Newton-Raphson step is smaller 959 00:45:47,780 --> 00:45:49,780 than the current objective function, 960 00:45:49,780 --> 00:45:51,500 it seems like the quadratic approximation 961 00:45:51,500 --> 00:45:52,730 is a good one, right. 962 00:45:52,730 --> 00:45:56,469 I'm within the region in which I trust this approximation, 963 00:45:56,469 --> 00:45:58,260 and I've reduced the value of the function. 964 00:45:58,260 --> 00:45:59,690 So why not go that way, right? 965 00:45:59,690 --> 00:46:02,660 So take the Newton-Raphson step. 966 00:46:02,660 --> 00:46:09,022 Else, let's try taking a step in the steepest descent direction instead. 967 00:46:09,022 --> 00:46:11,230 So again, if the steepest descent step is smaller 968 00:46:11,230 --> 00:46:14,500 than Ri and the value of the function 969 00:46:14,500 --> 00:46:15,950 in the steepest descent direction, 970 00:46:15,950 --> 00:46:19,399 the optimal steepest descent direction or the optimal step 971 00:46:19,399 --> 00:46:21,190 in the steepest descent direction is smaller 972 00:46:21,190 --> 00:46:23,440 than the value of the function at the current point, 973 00:46:23,440 --> 00:46:25,201 seems like we should take that step. 974 00:46:25,201 --> 00:46:25,700 Right? 975 00:46:25,700 --> 00:46:27,410 The Newton-Raphson step was no good. 976 00:46:27,410 --> 00:46:29,680 We've already discarded it. 977 00:46:29,680 --> 00:46:31,490 But our optimized steepest descent step 978 00:46:31,490 --> 00:46:32,380 seems like an OK one. 979 00:46:32,380 --> 00:46:34,259 It reduces the value of the function. 980 00:46:34,259 --> 00:46:35,800 And it's within the trust-region where 981 00:46:35,800 --> 00:46:37,870 we think quadratic approximations are valid. 982 00:46:41,160 --> 00:46:44,550 If that's not true, if the steepest descent step takes us 983 00:46:44,550 --> 00:46:46,770 outside of our trust-region or we 984 00:46:46,770 --> 00:46:48,750 don't reduce the value of the function 985 00:46:48,750 --> 00:46:51,270 when we take that step, then the next best strategy 986 00:46:51,270 --> 00:46:52,830 is to just take a steepest descent 987 00:46:52,830 --> 00:46:55,430 step to the edge of the trust-region boundary. 988 00:46:55,430 --> 00:46:55,930 Yeah? 989 00:46:55,930 --> 00:46:58,335 STUDENT: Is there a reason here that Newton-Raphson 990 00:46:58,335 --> 00:47:00,260 is the default? 991 00:47:00,260 --> 00:47:01,530 JAMES SWAN: Oh, good question.
992 00:47:01,530 --> 00:47:04,320 So eventually we're going to get close enough to the solution, 993 00:47:04,320 --> 00:47:07,140 all right, that all these steps are going to live 994 00:47:07,140 --> 00:47:09,566 inside the trust-region ring. 995 00:47:09,566 --> 00:47:11,107 It's going to require very small steps 996 00:47:11,107 --> 00:47:12,630 to converge to the solution. 997 00:47:12,630 --> 00:47:16,060 And which of these two methods is going to converge faster? 998 00:47:16,060 --> 00:47:17,060 STUDENT: Newton-Raphson. 999 00:47:17,060 --> 00:47:18,184 JAMES SWAN: Newton-Raphson. 1000 00:47:18,184 --> 00:47:21,485 So we prioritize Newton-Raphson over steepest descent. 1001 00:47:21,485 --> 00:47:22,485 That's a great question. 1002 00:47:25,620 --> 00:47:27,690 It's the faster converging one, but 1003 00:47:27,690 --> 00:47:30,330 it's a little unwieldy, right. 1004 00:47:30,330 --> 00:47:34,740 So let's take it when it seems valid. 1005 00:47:34,740 --> 00:47:37,200 But when it requires steps that are too big 1006 00:47:37,200 --> 00:47:40,180 or steps that don't minimize f, let's take some different steps 1007 00:47:40,180 --> 00:47:40,680 instead. 1008 00:47:40,680 --> 00:47:42,730 Let's use steepest descent as the strategy. 1009 00:47:42,730 --> 00:47:45,516 So this is heuristic. 1010 00:47:45,516 --> 00:47:48,140 So you got to have some rules to go with this heuristic, right. 1011 00:47:48,140 --> 00:47:50,300 We have a set of conditions under which we're going 1012 00:47:50,300 --> 00:47:51,383 to choose different steps. 1013 00:47:54,300 --> 00:47:57,540 We've got to set this trust-region size. 1014 00:47:57,540 --> 00:47:59,060 This Ri has to be set. 1015 00:47:59,060 --> 00:48:00,720 How big is it going to be? 1016 00:48:00,720 --> 00:48:01,290 I don't know. 1017 00:48:01,290 --> 00:48:03,780 You don't know, right, from the start you can't guess 1018 00:48:03,780 --> 00:48:05,580 how big Ri is going to be. 1019 00:48:05,580 --> 00:48:07,320 So you got to pick some initial guess. 1020 00:48:07,320 --> 00:48:10,090 And then we've got to modify the size of the trust-region 1021 00:48:10,090 --> 00:48:11,421 too, right. 1022 00:48:11,421 --> 00:48:13,920 The size of the trust-region is not going to be appropriate. 1023 00:48:13,920 --> 00:48:16,378 One fixed size is not going to be appropriate all the time. 1024 00:48:16,378 --> 00:48:20,020 Instead, we want a strategy for changing its size. 1025 00:48:20,020 --> 00:48:22,330 So it should grow or shrink depending on which steps we 1026 00:48:22,330 --> 00:48:23,470 choose, right. 1027 00:48:23,470 --> 00:48:29,350 Like if we take the Newton-Raphson step 1028 00:48:29,350 --> 00:48:33,040 and we find that our quadratic approximation is 1029 00:48:33,040 --> 00:48:35,830 a little bit bigger than the function value 1030 00:48:35,830 --> 00:48:40,240 we actually found, we might want to grow the trust-region. 1031 00:48:40,240 --> 00:48:42,550 We might be more likely to believe 1032 00:48:42,550 --> 00:48:45,130 that these Newton-Raphson steps are getting us 1033 00:48:45,130 --> 00:48:47,366 to smaller and smaller function values, right. 1034 00:48:47,366 --> 00:48:49,490 The step was even better than we expected it to be. 1035 00:48:49,490 --> 00:48:51,040 Here's the quadratic approximation 1036 00:48:51,040 --> 00:48:52,600 in the Newton-Raphson direction. 1037 00:48:52,600 --> 00:48:55,030 And it was actually bigger than the actual value 1038 00:48:55,030 --> 00:48:55,700 of the function.
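The step-selection heuristic described above can be sketched like this, again in Python/NumPy with illustrative names (a rough sketch of the general idea, not MATLAB's actual implementation): prefer the Newton-Raphson step when it stays inside the trust region and lowers f, fall back to the optimal steepest-descent step, and otherwise just walk downhill to the trust-region boundary.

    import numpy as np

    def trust_region_step(f, g, H, x, R):
        # Choose one step from the point x, given gradient g, Hessian H,
        # and trust-region radius R.
        p_nr = np.linalg.solve(H, -g)              # Newton-Raphson step
        if np.linalg.norm(p_nr) <= R and f(x + p_nr) < f(x):
            return p_nr                            # trusted, and it lowers f

        alpha = (g @ g) / (g @ H @ g)              # optimal step size along -g
        p_sd = -alpha * g                          # optimal steepest-descent step
        if np.linalg.norm(p_sd) <= R and f(x + p_sd) < f(x):
            return p_sd                            # fall back to steepest descent

        return -R * g / np.linalg.norm(g)          # last resort: descend to the boundary

Newton-Raphson is checked first for the reason given a moment ago: near the solution both candidate steps fit inside the trust region, and the Newton-Raphson one converges faster.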
1039 00:48:55,700 --> 00:48:58,124 So we got more, you know, we got more than we expected out 1040 00:48:58,124 --> 00:48:59,290 of a step in that direction. 1041 00:48:59,290 --> 00:49:04,200 So why not loosen up, accept more Newton-Raphson steps? 1042 00:49:04,200 --> 00:49:06,340 OK, that's a strategy we can take. 1043 00:49:06,340 --> 00:49:09,754 Otherwise, we might think about shrinking instead, right. 1044 00:49:09,754 --> 00:49:11,170 So there could be the circumstance 1045 00:49:11,170 --> 00:49:14,230 where our quadratic approximation predicted 1046 00:49:14,230 --> 00:49:18,980 a smaller value for the function than we actually found. 1047 00:49:18,980 --> 00:49:21,590 It's not quite as reliable for getting us to the minimum. 1048 00:49:21,590 --> 00:49:24,720 These two circumstances are actually these. 1049 00:49:24,720 --> 00:49:26,980 So this one, the quadratic approximation 1050 00:49:26,980 --> 00:49:31,990 predicted a slightly bigger value than we found. 1051 00:49:31,990 --> 00:49:33,400 Say grow the trust-region, right. 1052 00:49:33,400 --> 00:49:35,140 Try some more Newton-Raphson steps. 1053 00:49:35,140 --> 00:49:38,440 Seems like the Newton-Raphson steps are pretty reliable here. 1054 00:49:38,440 --> 00:49:42,940 Here the value of the function in the quadratic approximation 1055 00:49:42,940 --> 00:49:44,710 is smaller than the value of the function 1056 00:49:44,710 --> 00:49:46,630 after we took the step. 1057 00:49:46,630 --> 00:49:49,690 Seems like our trust-region is probably too big 1058 00:49:49,690 --> 00:49:51,370 if we have a circumstance like that. 1059 00:49:51,370 --> 00:49:52,690 Should shrink it a little bit, right? 1060 00:49:52,690 --> 00:49:54,280 We took the Newton-Raphson step, but it actually 1061 00:49:54,280 --> 00:49:55,966 did worse than we expected it to do 1062 00:49:55,966 --> 00:49:57,340 with the quadratic approximation. 1063 00:49:57,340 --> 00:50:00,066 So maybe we ought to shrink the trust-region a little bit. 1064 00:50:03,585 --> 00:50:04,960 And you need a good initial value 1065 00:50:04,960 --> 00:50:06,670 for the trust-region radius. 1066 00:50:06,670 --> 00:50:07,580 What does Matlab use? 1067 00:50:07,580 --> 00:50:08,780 It uses 1. 1068 00:50:08,780 --> 00:50:09,280 OK. 1069 00:50:09,280 --> 00:50:10,430 It doesn't know. 1070 00:50:10,430 --> 00:50:12,110 It has no clue. 1071 00:50:12,110 --> 00:50:13,420 It's just a heuristic. 1072 00:50:13,420 --> 00:50:16,480 It starts with 1 and it changes it as need be. 1073 00:50:16,480 --> 00:50:20,920 So this is how fsolve solves systems of nonlinear equations. 1074 00:50:20,920 --> 00:50:24,550 This is how all of the minimizers in Matlab, this 1075 00:50:24,550 --> 00:50:27,400 is the strategy they use to try to find minima. 1076 00:50:27,400 --> 00:50:30,970 They use these sorts of trust-region methods. 1077 00:50:30,970 --> 00:50:34,680 Matlab uses a slight improvement, which is also heuristic, 1078 00:50:34,680 --> 00:50:38,830 called a dogleg trust-region method. 1079 00:50:38,830 --> 00:50:41,950 So you can take a Newton-Raphson step 1080 00:50:41,950 --> 00:50:44,660 or you can take a steepest descent step. 1081 00:50:44,660 --> 00:50:47,620 And if you found the steepest descent step didn't quite 1082 00:50:47,620 --> 00:50:49,570 get you to the boundary of your trust-region, 1083 00:50:49,570 --> 00:50:52,030 you could then step in the Newton-Raphson direction. 1084 00:50:52,030 --> 00:50:52,997 Why do you do that?
1085 00:50:52,997 --> 00:50:55,330 I don't know, people have found that it's useful, right. 1086 00:50:55,330 --> 00:50:57,400 There's actually no good reason to take 1087 00:50:57,400 --> 00:50:59,550 these sorts of dogleg steps. 1088 00:50:59,550 --> 00:51:03,220 People found that for general, right, general objective 1089 00:51:03,220 --> 00:51:06,010 functions that you might want to find minima of, 1090 00:51:06,010 --> 00:51:09,220 this is a reliable strategy for getting there. 1091 00:51:09,220 --> 00:51:12,330 There's no guarantee that this is the best strategy. 1092 00:51:12,330 --> 00:51:14,760 These are general non-convex functions. 1093 00:51:14,760 --> 00:51:18,090 These are just hard problems that one encounters. 1094 00:51:18,090 --> 00:51:20,090 So when you make a software package like Matlab, 1095 00:51:20,090 --> 00:51:20,923 this is what you do. 1096 00:51:20,923 --> 00:51:25,192 You come up with heuristics that work most of the time. 1097 00:51:25,192 --> 00:51:27,150 I'll just provide you with an example here, OK. 1098 00:51:27,150 --> 00:51:29,108 So you've seen this function now several times. 1099 00:51:31,720 --> 00:51:33,720 Let's see, so in red covered up back here 1100 00:51:33,720 --> 00:51:35,940 is the Newton-Raphson path. 1101 00:51:35,940 --> 00:51:38,680 In blue is the optimal steepest descent path. 1102 00:51:38,680 --> 00:51:41,130 And in purple is the trust-region method 1103 00:51:41,130 --> 00:51:43,260 that Matlab uses to find the minima. 1104 00:51:43,260 --> 00:51:44,919 They all start from the same place. 1105 00:51:44,919 --> 00:51:46,710 And you can see the purple path is a little 1106 00:51:46,710 --> 00:51:48,010 different from these two. 1107 00:51:48,010 --> 00:51:52,260 If I zoom in right up here, what you'll see 1108 00:51:52,260 --> 00:51:55,520 is initially Matlab chose to follow the steepest descent 1109 00:51:55,520 --> 00:51:57,090 path. 1110 00:51:57,090 --> 00:52:00,690 And then at a certain point it decided, because of the value 1111 00:52:00,690 --> 00:52:03,150 of the trust-region that Newton-Raphson steps were 1112 00:52:03,150 --> 00:52:04,520 to be preferred. 1113 00:52:04,520 --> 00:52:06,005 And so it changed direction and it 1114 00:52:06,005 --> 00:52:08,130 started stepping along the Newton-Raphson direction 1115 00:52:08,130 --> 00:52:08,630 instead. 1116 00:52:13,060 --> 00:52:15,257 It has some built in logic that tells it 1117 00:52:15,257 --> 00:52:16,840 when to make that choice for switching 1118 00:52:16,840 --> 00:52:18,850 based on the size of the trust-region. 1119 00:52:18,850 --> 00:52:23,500 And the idea is just to choose the best sorts of steps 1120 00:52:23,500 --> 00:52:24,880 possible. 1121 00:52:24,880 --> 00:52:27,130 Your best guess at what the right steps are. 1122 00:52:27,130 --> 00:52:29,650 And this is all based around how trustworthy we 1123 00:52:29,650 --> 00:52:33,010 think this quadratic approximation 1124 00:52:33,010 --> 00:52:34,410 for objective function is. 1125 00:52:34,410 --> 00:52:35,505 Yeah, Dan? 1126 00:52:35,505 --> 00:52:37,880 STUDENT: So for the trust-region on the graph what Matlab 1127 00:52:37,880 --> 00:52:40,901 is doing is at each R trust-region 1128 00:52:40,901 --> 00:52:43,770 length it's reevaluating which way it should go? 1129 00:52:43,770 --> 00:52:45,080 JAMES SWAN: Yes. 1130 00:52:45,080 --> 00:52:45,790 Yes. 1131 00:52:45,790 --> 00:52:47,410 It's computing both sets of steps, 1132 00:52:47,410 --> 00:52:50,674 and it's deciding which one it should take, right. 
1133 00:52:50,674 --> 00:52:51,340 It doesn't know. 1134 00:52:51,340 --> 00:52:53,737 It's trying to choose between them. 1135 00:52:53,737 --> 00:52:55,778 STUDENT: Why don't you do the Newton-Raphson step 1136 00:52:55,778 --> 00:52:58,002 through the [? negative R? ?] 1137 00:52:58,002 --> 00:53:00,210 JAMES SWAN: You can do that as well, actually, right. 1138 00:53:00,210 --> 00:53:02,220 But if you're doing that, now you 1139 00:53:02,220 --> 00:53:03,870 have to choose between that strategy 1140 00:53:03,870 --> 00:53:09,000 and taking a steepest descent step up to R as well, right. 1141 00:53:09,000 --> 00:53:11,850 And I think one has to decide which would you prefer. 1142 00:53:11,850 --> 00:53:14,430 It's possible the Newton-Raphson step also doesn't actually 1143 00:53:14,430 --> 00:53:15,720 reduce f. 1144 00:53:15,720 --> 00:53:19,050 In which case, you should discard it entirely, right. 1145 00:53:19,050 --> 00:53:21,690 But you could craft a strategy that does that, right. 1146 00:53:21,690 --> 00:53:23,920 It's still going to converge, likely. 1147 00:53:23,920 --> 00:53:24,670 OK? 1148 00:53:24,670 --> 00:53:25,340 OK. 1149 00:53:25,340 --> 00:53:26,190 I'm going to let you guys go, there's 1150 00:53:26,190 --> 00:53:27,220 another class coming in. 1151 00:53:27,220 --> 00:53:28,770 Thanks.