1 00:00:01,540 --> 00:00:03,910 The following content is provided under a Creative 2 00:00:03,910 --> 00:00:05,300 Commons license. 3 00:00:05,300 --> 00:00:07,510 Your support will help MIT OpenCourseWare 4 00:00:07,510 --> 00:00:11,600 continue to offer high quality educational resources for free. 5 00:00:11,600 --> 00:00:14,140 To make a donation or to view additional materials 6 00:00:14,140 --> 00:00:18,100 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,100 --> 00:00:18,980 at ocw.mit.edu. 8 00:00:23,905 --> 00:00:24,530 JAMES SWAN: OK. 9 00:00:24,530 --> 00:00:25,821 Let's go ahead and get started. 10 00:00:34,650 --> 00:00:36,380 We saw a lot of good conversation 11 00:00:36,380 --> 00:00:38,275 on Piazza this weekend. 12 00:00:38,275 --> 00:00:38,900 So that's good. 13 00:00:38,900 --> 00:00:40,525 Seems like you guys are making your way 14 00:00:40,525 --> 00:00:46,010 through these two problems on the latest assignment. 15 00:00:46,010 --> 00:00:51,120 I would try to focus less on the chemical engineering 16 00:00:51,120 --> 00:00:54,420 science and problems that involve those. 17 00:00:54,420 --> 00:00:57,360 Usually the topic of interest, the thing that's useful to you 18 00:00:57,360 --> 00:01:00,840 educationally is going to be the numerics, right. 19 00:01:00,840 --> 00:01:02,700 So if you get hung up on the definition 20 00:01:02,700 --> 00:01:06,630 of a particular quantity, yield was one that came up. 21 00:01:06,630 --> 00:01:09,300 Rather than let that prevent you from solving the problem, 22 00:01:09,300 --> 00:01:11,730 pick a definition and see what happens. 23 00:01:11,730 --> 00:01:14,010 You can always ask yourself if the results 24 00:01:14,010 --> 00:01:15,430 seem physically reasonable to you 25 00:01:15,430 --> 00:01:16,890 not based on your definition. 26 00:01:16,890 --> 00:01:20,500 And as long as you explain what you did in your write up, 27 00:01:20,500 --> 00:01:22,260 you're going to get full points. 28 00:01:22,260 --> 00:01:24,750 We want to solve the problems numerically. 29 00:01:24,750 --> 00:01:29,430 If there's some hang up in the science, don't sweat it. 30 00:01:29,430 --> 00:01:32,010 Don't let that stop you from moving ahead with it. 31 00:01:32,010 --> 00:01:36,631 Don't let it make it seem like the problem can't be solved 32 00:01:36,631 --> 00:01:38,130 or there isn't a path to a solution. 33 00:01:38,130 --> 00:01:42,450 Pick a definition and go with it and see what happens, right. 34 00:01:42,450 --> 00:01:43,860 The root of the second problem is 35 00:01:43,860 --> 00:01:46,720 trying to nest together two different numerical methods. 36 00:01:46,720 --> 00:01:48,720 One of those is optimization, and the other one 37 00:01:48,720 --> 00:01:50,650 is solutions of nonlinear equation, 38 00:01:50,650 --> 00:01:52,800 putting those two techniques together, 39 00:01:52,800 --> 00:01:55,040 using them in combination. 40 00:01:55,040 --> 00:01:58,200 The engineering science problem gives us a solvable problem 41 00:01:58,200 --> 00:02:00,240 to work with in that context, but it's not 42 00:02:00,240 --> 00:02:02,800 the key element of it. 43 00:02:02,800 --> 00:02:03,720 OK, good. 44 00:02:03,720 --> 00:02:06,222 So we're continuing optimization, right. 45 00:02:06,222 --> 00:02:08,190 Move this just a little bit. 46 00:02:08,190 --> 00:02:11,039 We're continuing with optimization. 47 00:02:11,039 --> 00:02:14,760 Last time we posed lots of optimization problems. 
48 00:02:14,760 --> 00:02:18,420 We talked about constrained optimization, 49 00:02:18,420 --> 00:02:19,990 unconstrained optimization. 50 00:02:19,990 --> 00:02:23,730 We heard a little bit about linear programs. 51 00:02:23,730 --> 00:02:27,480 We started approaching unconstrained optimization 52 00:02:27,480 --> 00:02:31,635 problems from the perspective of steepest descent. 53 00:02:31,635 --> 00:02:35,080 OK, so that's where I want to pick up as we get started. 54 00:02:35,080 --> 00:02:37,620 So you'll recall the idea behind steepest descent 55 00:02:37,620 --> 00:02:40,290 was all the unconstrained optimization 56 00:02:40,290 --> 00:02:43,710 problems we're interested in are based around trying 57 00:02:43,710 --> 00:02:45,909 to find minima, OK. 58 00:02:45,909 --> 00:02:47,700 And so we should think about these problems 59 00:02:47,700 --> 00:02:50,310 as though we're standing on top of a mountain. 60 00:02:50,310 --> 00:02:53,430 And we're looking for directions that allow us to descend. 61 00:02:53,430 --> 00:02:56,327 And as long as we're heading in descending directions, 62 00:02:56,327 --> 00:02:58,410 right, there's a good chance we're going to bottom 63 00:02:58,410 --> 00:02:59,784 out someplace and stop. 64 00:02:59,784 --> 00:03:01,200 And when we've bottomed out, we've 65 00:03:01,200 --> 00:03:02,834 found one of those local minima. 66 00:03:02,834 --> 00:03:04,500 That bottom is going to be a place where 67 00:03:04,500 --> 00:03:06,870 the gradient of the function we're trying to find 68 00:03:06,870 --> 00:03:10,200 the minimum of is zero, OK. 69 00:03:10,200 --> 00:03:11,730 And the idea behind steepest descent 70 00:03:11,730 --> 00:03:15,060 was well, don't just pick any direction that's down hill. 71 00:03:15,060 --> 00:03:16,830 Pick the steepest direction, right. 72 00:03:16,830 --> 00:03:19,350 Go in the direction of the gradient. 73 00:03:19,350 --> 00:03:20,820 That's the steepest descent idea. 74 00:03:20,820 --> 00:03:23,700 And then we did something a little sophisticated last time. 75 00:03:23,700 --> 00:03:26,225 We said well OK, I know the direction. 76 00:03:26,225 --> 00:03:27,600 I'm standing on top the mountain. 77 00:03:27,600 --> 00:03:29,760 I point myself in the steepest descent direction. 78 00:03:29,760 --> 00:03:32,070 How big a step do I take? 79 00:03:32,070 --> 00:03:33,240 I can take any size step. 80 00:03:33,240 --> 00:03:36,180 And some steps may be good and some steps may be bad. 81 00:03:36,180 --> 00:03:39,300 It turns out there are some good estimates for step size 82 00:03:39,300 --> 00:03:41,320 that we can get by taking a Taylor expansion. 83 00:03:41,320 --> 00:03:43,860 So we take our function, right, and we 84 00:03:43,860 --> 00:03:46,800 write it at the next iterate is a Taylor expansion. 85 00:03:46,800 --> 00:03:50,100 About the current iterate, that expansion looks like this. 86 00:03:50,100 --> 00:03:54,010 And it will be quadratic with respect to the step size alpha. 87 00:03:56,550 --> 00:03:59,880 If we want to minimize the value of the function here, 88 00:03:59,880 --> 00:04:02,940 we want the next iterate to be a minimum 89 00:04:02,940 --> 00:04:04,230 of this quadratic function. 90 00:04:04,230 --> 00:04:07,620 Then there's an obvious choice of alpha, right. 91 00:04:07,620 --> 00:04:11,400 We find the vertex of this quadratic functional. 92 00:04:11,400 --> 00:04:14,040 That gives us the optimal step size. 93 00:04:14,040 --> 00:04:16,320 It's optimal if our function actually is quadratic. 
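Here is a minimal sketch of that step-size rule in Python with NumPy. It assumes the usual result of that expansion for the steepest-descent direction, alpha = (g . g) / (g' H g), with g the gradient and H the Hessian at the current iterate; the test function and the starting point are illustrative assumptions, not the example on the slides.

import numpy as np

def steepest_descent(grad, hess, x0, tol=1e-8, max_iter=200):
    # Steepest descent with the step size taken from the quadratic (Taylor) model:
    # f(x - a*g) ~ f(x) - a*(g.g) + 0.5*a^2*(g.H.g), whose vertex is a = (g.g)/(g.H.g).
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # gradient is (nearly) zero: at a local minimum
            break
        H = hess(x)
        alpha = (g @ g) / (g @ H @ g)    # vertex of the quadratic model in alpha
        x = x - alpha * g                # step in the steepest-descent direction
    return x

# An illustrative steep quadratic bowl (an assumption, not the plotted example):
A = np.array([[100.0, 0.0], [0.0, 1.0]])
x_min = steepest_descent(grad=lambda x: A @ x, hess=lambda x: A, x0=[1.0, 1.0])
print(x_min)   # approaches [0, 0]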
94 00:04:16,320 --> 00:04:18,060 It's an approximate, right. 95 00:04:18,060 --> 00:04:20,250 It's an estimation of the right sort of step size 96 00:04:20,250 --> 00:04:22,352 if it's not quadratic. 97 00:04:22,352 --> 00:04:26,100 And so I showed you here was a function where the contours are 98 00:04:26,100 --> 00:04:27,210 very closely spaced. 99 00:04:27,210 --> 00:04:28,890 So it's a very steep function. 100 00:04:28,890 --> 00:04:30,660 And the minima is in the middle. 101 00:04:30,660 --> 00:04:33,600 If we try to solve this with the steepest descent 102 00:04:33,600 --> 00:04:37,060 and we pick different steps sizes, uniform step sizes, 103 00:04:37,060 --> 00:04:41,480 so we try 0.1 and 1 and 10 step sizes, 104 00:04:41,480 --> 00:04:43,710 we'll never find an appropriate choice 105 00:04:43,710 --> 00:04:45,359 to converge to the solution, OK. 106 00:04:45,359 --> 00:04:47,400 We're going to have to pick impossibly small step 107 00:04:47,400 --> 00:04:50,945 sizes, which will require tons of steps in order to get there. 108 00:04:50,945 --> 00:04:52,320 But with this quadratic estimate, 109 00:04:52,320 --> 00:04:55,960 you can get a reasonably smooth convergence to the root. 110 00:04:55,960 --> 00:04:58,750 So that's nice. 111 00:04:58,750 --> 00:05:01,280 And here's a task for you to test whether you understand 112 00:05:01,280 --> 00:05:02,680 steepest descent or not. 113 00:05:02,680 --> 00:05:05,710 In your notes, I've drawn some contours. 114 00:05:05,710 --> 00:05:07,840 For function, we'd like to minimize using 115 00:05:07,840 --> 00:05:09,630 the method of steepest descent. 116 00:05:09,630 --> 00:05:13,900 And I want you to try to draw steepest descent paths on top 117 00:05:13,900 --> 00:05:18,010 of these contours starting from initial conditions 118 00:05:18,010 --> 00:05:20,630 where these stars are located. 119 00:05:20,630 --> 00:05:23,260 So if I'm following steepest descent, the rules 120 00:05:23,260 --> 00:05:27,670 of steepest descent here, and I start from these stars, 121 00:05:27,670 --> 00:05:29,440 what sort of paths do I follow? 122 00:05:29,440 --> 00:05:31,240 You're going to need to pick a step size. 123 00:05:31,240 --> 00:05:34,480 I would suggest thinking about the small step size limit. 124 00:05:34,480 --> 00:05:37,090 What is the steepest descent path in the small step size 125 00:05:37,090 --> 00:05:37,614 limit? 126 00:05:37,614 --> 00:05:39,530 Can you work that out, you and your neighbors? 127 00:05:39,530 --> 00:05:40,870 You don't have to do all of them by yourself. 128 00:05:40,870 --> 00:05:42,710 You can do one, your neighbor could do another. 129 00:05:42,710 --> 00:05:44,335 And we'll take a look at them together. 130 00:08:08,380 --> 00:08:11,485 OK, the roar has turned into a rumble and then a murmur, 131 00:08:11,485 --> 00:08:13,360 so I think you guys are making some progress. 132 00:08:16,790 --> 00:08:17,540 What do you think? 133 00:08:17,540 --> 00:08:20,290 How about let's do an easy one. 134 00:08:20,290 --> 00:08:21,550 How about this one here. 135 00:08:21,550 --> 00:08:24,210 What sort of path does it take? 136 00:08:24,210 --> 00:08:26,340 Yeah, it sort of curls right down into the center 137 00:08:26,340 --> 00:08:26,840 here, right. 138 00:08:26,840 --> 00:08:30,300 Remember, steepest descent paths run perpendicular 139 00:08:30,300 --> 00:08:31,020 to the contours. 140 00:08:31,020 --> 00:08:33,677 So jumps perpendicular to the contour, almost a straight line 141 00:08:33,677 --> 00:08:34,260 to the center. 
142 00:08:34,260 --> 00:08:36,799 How about this one over here? 143 00:08:36,799 --> 00:08:37,594 Same thing, right? 144 00:08:37,594 --> 00:08:38,510 It runs the other way. 145 00:08:38,510 --> 00:08:41,870 It's going downhill 1, 0, minus 1, minus 2. 146 00:08:41,870 --> 00:08:44,960 So it runs downhill and curls into the center. 147 00:08:44,960 --> 00:08:47,482 What about this one up here? 148 00:08:47,482 --> 00:08:49,210 What's it do? 149 00:08:49,210 --> 00:08:51,850 Yeah, it just runs to the left, right. 150 00:08:51,850 --> 00:08:54,256 The contour lines had normals that 151 00:08:54,256 --> 00:08:56,130 just keep it running all the way to the left. 152 00:08:56,130 --> 00:08:59,120 So this actually doesn't run into this minimum, right. 153 00:08:59,120 --> 00:09:03,230 It finds a cliff and steps right off of it, keeps on going. 154 00:09:03,230 --> 00:09:04,890 Steepest descent, that's what it does. 155 00:09:04,890 --> 00:09:06,920 How about this one here? 156 00:09:06,920 --> 00:09:08,734 Same thing, right, just to the left. 157 00:09:08,734 --> 00:09:10,400 So these are what these paths look like. 158 00:09:10,400 --> 00:09:13,750 You can draw them yourself. 159 00:09:13,750 --> 00:09:16,095 If I showed you paths and asked you what sort of method 160 00:09:16,095 --> 00:09:18,720 made them, you should be able to identify that actually, right? 161 00:09:18,720 --> 00:09:22,660 You should be able to detect what sort of methodology 162 00:09:22,660 --> 00:09:24,075 generated those kinds of paths. 163 00:09:28,770 --> 00:09:30,330 We're not always so fortunate to have 164 00:09:30,330 --> 00:09:32,730 this graphical view of the landscape 165 00:09:32,730 --> 00:09:34,680 that our method is navigating. 166 00:09:34,680 --> 00:09:36,637 But it's good to have these 2D depictions. 167 00:09:36,637 --> 00:09:38,220 Because they really help us understand 168 00:09:38,220 --> 00:09:41,260 when a method doesn't converge what might be going wrong, 169 00:09:41,260 --> 00:09:41,760 right. 170 00:09:41,760 --> 00:09:44,910 So steepest descent, it always heads downhill. 171 00:09:44,910 --> 00:09:47,640 But if there is no bottom, it's just going to keep going down, 172 00:09:47,640 --> 00:09:48,180 right. 173 00:09:48,180 --> 00:09:50,754 It's never going to find it. 174 00:09:50,754 --> 00:09:51,470 Oh, OK. 175 00:09:51,470 --> 00:09:53,960 Here's a-- this is a story now that you 176 00:09:53,960 --> 00:09:55,010 understand optimization. 177 00:09:55,010 --> 00:09:59,330 So let's see, so mechanical systems, 178 00:09:59,330 --> 00:10:02,190 conservation of momentum, that's also, 179 00:10:02,190 --> 00:10:05,990 in a certain sense, an optimization problem, right. 180 00:10:05,990 --> 00:10:11,690 So conservation of momentum says that the acceleration on a body 181 00:10:11,690 --> 00:10:13,710 is equal to the sum of the forces on it. 182 00:10:13,710 --> 00:10:15,500 And some of those forces are what 183 00:10:15,500 --> 00:10:16,960 we call conservative forces. 184 00:10:16,960 --> 00:10:18,970 They're proportional to gradients 185 00:10:18,970 --> 00:10:20,410 of some energy landscape. 186 00:10:20,410 --> 00:10:22,160 Some of those forces are non-conservative, 187 00:10:22,160 --> 00:10:22,951 like this one here. 188 00:10:22,951 --> 00:10:24,740 It's a little damping force, a little bit 189 00:10:24,740 --> 00:10:28,220 of friction proportional to the velocity with which 190 00:10:28,220 --> 00:10:30,990 the object moves instead. 
191 00:10:30,990 --> 00:10:33,860 And if we start some system like this, 192 00:10:33,860 --> 00:10:37,040 we give it some initial inertia and let it go, 193 00:10:37,040 --> 00:10:39,920 right, eventually it's going to want to come to rest at a place 194 00:10:39,920 --> 00:10:42,980 where the gradient in the potential is 0 195 00:10:42,980 --> 00:10:45,120 and the velocity is 0 on the acceleration is 0. 196 00:10:45,120 --> 00:10:46,750 We call that mechanical equilibrium. 197 00:10:46,750 --> 00:10:49,700 We get to mechanical equilibrium and we stop, right. 198 00:10:49,700 --> 00:10:54,500 So physical systems many times are seeking out minimum 199 00:10:54,500 --> 00:10:55,610 of an objective function. 200 00:10:55,610 --> 00:10:58,980 The objective function is the potential energy. 201 00:10:58,980 --> 00:11:02,090 I saw last year at my house we had a pipe underground 202 00:11:02,090 --> 00:11:06,050 that leaked in the front yard. 203 00:11:06,050 --> 00:11:07,874 And they needed to find the pipe, right. 204 00:11:07,874 --> 00:11:10,415 It was like under the asphalt. So they got to dig up asphalt, 205 00:11:10,415 --> 00:11:12,530 and they need to know where is the pipe. 206 00:11:12,530 --> 00:11:15,470 They know it's leaking, but where does the pipe sit? 207 00:11:15,470 --> 00:11:19,080 So the city came out and the guy from the city brought this. 208 00:11:19,080 --> 00:11:20,270 Do you know what this is? 209 00:11:23,240 --> 00:11:23,760 What is it? 210 00:11:23,760 --> 00:11:25,510 Do you know? 211 00:11:25,510 --> 00:11:28,210 Yeah, yeah, yeah. 212 00:11:28,210 --> 00:11:29,290 It's a dowsing rod. 213 00:11:29,290 --> 00:11:33,770 OK, this kind of crazy story right, a dowsing rod. 214 00:11:33,770 --> 00:11:35,500 OK, a dowsing rod. 215 00:11:35,500 --> 00:11:37,090 How does it work? 216 00:11:37,090 --> 00:11:40,270 The way it's supposed to work is I hold it out 217 00:11:40,270 --> 00:11:43,630 and it should turn and rotate in point 218 00:11:43,630 --> 00:11:45,520 in a direction that's parallel to the flow 219 00:11:45,520 --> 00:11:46,660 of the water in the pipe. 220 00:11:46,660 --> 00:11:48,860 That's the theory that this is supposed to work on. 221 00:11:48,860 --> 00:11:50,740 I'm a scientist. 222 00:11:50,740 --> 00:11:53,470 So I expect that somehow the water 223 00:11:53,470 --> 00:11:57,970 is exerting a force on the tip of the dowsing rod, OK. 224 00:11:57,970 --> 00:11:59,920 So the dowsing rod is moving around 225 00:11:59,920 --> 00:12:01,840 as this guy walks around. 226 00:12:01,840 --> 00:12:05,160 And it's going to stop when it finds a point 227 00:12:05,160 --> 00:12:06,530 of mechanical equilibrium. 228 00:12:06,530 --> 00:12:08,350 So the dowsing rod is seeking out 229 00:12:08,350 --> 00:12:10,927 a minimum of some potential energy, let's say. 230 00:12:10,927 --> 00:12:12,760 That's what the physics says has to be true. 231 00:12:12,760 --> 00:12:15,310 I don't know that flowing water exerts a force 232 00:12:15,310 --> 00:12:16,570 on the tip of the dowsing rod. 233 00:12:16,570 --> 00:12:19,750 The guy who had this believed that was true, OK. 234 00:12:19,750 --> 00:12:23,190 It turns out, this is not such a good idea, though, OK. 235 00:12:23,190 --> 00:12:25,690 Like in terms of a method for seeking out 236 00:12:25,690 --> 00:12:29,040 the minimum of a potential, it's not such a great way to do it. 237 00:12:29,040 --> 00:12:31,990 Because he's way up here, and the water's way underground. 
238 00:12:31,990 --> 00:12:34,630 So there's a huge distance between these things. 239 00:12:34,630 --> 00:12:37,570 It's not exerting a strong force, OK. 240 00:12:37,570 --> 00:12:40,420 The gradient isn't very big here. 241 00:12:40,420 --> 00:12:41,800 It's a relatively weak force. 242 00:12:41,800 --> 00:12:44,290 So this instrument is incredibly sensitive to all sorts 243 00:12:44,290 --> 00:12:45,670 of external fluctuations. 244 00:12:45,670 --> 00:12:47,320 The gradient is small. 245 00:12:47,320 --> 00:12:50,710 The potential energy landscape is very, very flat. 246 00:12:50,710 --> 00:12:52,630 And we know already from applying things 247 00:12:52,630 --> 00:12:55,240 like steepest descent methods or Newton-Raphson 248 00:12:55,240 --> 00:12:58,210 that those circumstances are disastrous for any method 249 00:12:58,210 --> 00:13:02,381 seeking out minima of potential energies, right. 250 00:13:02,381 --> 00:13:04,630 Those landscapes are the hardest ones to detect it in. 251 00:13:04,630 --> 00:13:08,150 Because every point looks like it's close to being a minima, 252 00:13:08,150 --> 00:13:08,650 right. 253 00:13:08,650 --> 00:13:12,310 It's really difficult to see the differences between these. 254 00:13:12,310 --> 00:13:16,170 Nonetheless, he figured out where the pipe was. 255 00:13:16,170 --> 00:13:19,680 I don't think it was because of this though. 256 00:13:19,680 --> 00:13:21,150 How did he know where the pipe was? 257 00:13:24,054 --> 00:13:25,022 What's that? 258 00:13:25,022 --> 00:13:26,960 STUDENT: Where the ground was squishy? 259 00:13:26,960 --> 00:13:27,720 JAMES SWAN: Where the ground was squishy. 260 00:13:27,720 --> 00:13:29,090 Well yeah, had some good guesses because it 261 00:13:29,090 --> 00:13:30,290 was leaking up a little bit. 262 00:13:30,290 --> 00:13:33,324 No, I looked carefully afterwards. 263 00:13:33,324 --> 00:13:35,240 And I think it turned out the city had come by 264 00:13:35,240 --> 00:13:36,971 and actually painted some white lines 265 00:13:36,971 --> 00:13:39,220 on either side of the street to indicate where it was. 266 00:13:39,220 --> 00:13:40,928 But he was out there with his dowsing rod 267 00:13:40,928 --> 00:13:42,952 making sure the city had gotten it right. 268 00:13:42,952 --> 00:13:45,410 It turns out, there's something called the ideomotor effect 269 00:13:45,410 --> 00:13:47,580 where your hand has very little, you 270 00:13:47,580 --> 00:13:49,330 know, very sensitive little tremors in it. 271 00:13:49,330 --> 00:13:51,929 And can guide something like this, a little weight 272 00:13:51,929 --> 00:13:53,720 at the end of a rod to go wherever you want 273 00:13:53,720 --> 00:13:54,928 it to go when you want it to. 274 00:13:54,928 --> 00:13:56,240 It's like a Ouija board, right. 275 00:13:56,240 --> 00:13:57,812 It works exactly the same way. 276 00:13:57,812 --> 00:13:59,270 Anyway, it's not a good way to find 277 00:13:59,270 --> 00:14:03,620 the minimum of potential energy surfaces, OK. 278 00:14:03,620 --> 00:14:05,990 We have the same problem with numerical methods. 279 00:14:05,990 --> 00:14:08,180 It's really difficult when these potential energy 280 00:14:08,180 --> 00:14:13,688 landscapes are flat to find where the minimum is, OK. 281 00:14:13,688 --> 00:14:14,900 So fun and games are over. 282 00:14:14,900 --> 00:14:16,025 Now we got to do some math. 283 00:14:19,960 --> 00:14:24,450 So we talked about steepest descent. 
284 00:14:24,450 --> 00:14:26,370 And steepest descent is an interesting way 285 00:14:26,370 --> 00:14:30,640 to approach these kinds of optimization problems. 286 00:14:30,640 --> 00:14:35,580 It turns out, it turns out that linear equations like Ax 287 00:14:35,580 --> 00:14:40,320 equals b can also be cast as optimization problems, right. 288 00:14:40,320 --> 00:14:42,660 So the solution to this equation Ax 289 00:14:42,660 --> 00:14:47,520 equals b is also a minimum of this quadratic function up 290 00:14:47,520 --> 00:14:49,230 here. 291 00:14:49,230 --> 00:14:49,980 How do you know? 292 00:14:49,980 --> 00:14:52,720 You take the gradient of this function, 293 00:14:52,720 --> 00:14:57,090 which is Ax minus b, and set the gradient to 0 at the minimum. 294 00:14:57,090 --> 00:15:00,240 So Ax minus b is 0, or Ax equals b. 295 00:15:00,240 --> 00:15:02,460 So we can do optimization on these sorts 296 00:15:02,460 --> 00:15:06,640 of quadratic functionals, and we would find the solution 297 00:15:06,640 --> 00:15:08,787 of systems of linear equations. 298 00:15:08,787 --> 00:15:10,120 This is an alternative approach. 299 00:15:10,120 --> 00:15:12,161 Sometimes this is called the variational approach 300 00:15:12,161 --> 00:15:15,870 to solving these systems of linear equations. 301 00:15:15,870 --> 00:15:19,280 There are a couple of things that have to be true. 302 00:15:19,280 --> 00:15:21,740 The linear operator, right, the matrix here, 303 00:15:21,740 --> 00:15:23,310 it has to be symmetric. 304 00:15:23,310 --> 00:15:25,700 OK, it has to be symmetric, because it's multiplied 305 00:15:25,700 --> 00:15:27,380 by x from both sides. 306 00:15:27,380 --> 00:15:29,990 It doesn't know that its transpose is 307 00:15:29,990 --> 00:15:33,140 different from itself in the form of this functional. 308 00:15:33,140 --> 00:15:35,150 If A wasn't symmetric, the functional 309 00:15:35,150 --> 00:15:37,850 would symmetrize it automatically, OK. 310 00:15:37,850 --> 00:15:41,870 So a functional like this only corresponds 311 00:15:41,870 --> 00:15:44,750 to this linear equation when A is symmetric. 312 00:15:44,750 --> 00:15:50,270 And this sort of thing only has a minimum, right, 313 00:15:50,270 --> 00:15:52,392 when the matrix A is positive definite. 314 00:15:52,392 --> 00:15:54,940 It has to have all positive eigenvalues, right. 315 00:15:54,940 --> 00:15:58,910 The Hessian, right, of this functional, 316 00:15:58,910 --> 00:16:00,410 is just the matrix A. And we already 317 00:16:00,410 --> 00:16:03,940 said that the Hessian needs all positive eigenvalues to confirm 318 00:16:03,940 --> 00:16:05,100 we have a minimum. 319 00:16:05,100 --> 00:16:06,260 OK? 320 00:16:06,260 --> 00:16:08,300 If one of the eigenvalues is zero, 321 00:16:08,300 --> 00:16:09,980 then the problem is indeterminate. 322 00:16:09,980 --> 00:16:11,690 The linear problem is indeterminate. 323 00:16:11,690 --> 00:16:13,670 And there isn't a single local minimum, right. 324 00:16:13,670 --> 00:16:16,100 There's going to be a line of minima or a plane of minima 325 00:16:16,100 --> 00:16:17,300 instead. 326 00:16:17,300 --> 00:16:19,210 OK? 327 00:16:19,210 --> 00:16:22,540 OK, so you can solve systems of linear equations 328 00:16:22,540 --> 00:16:25,750 as optimization problems. 329 00:16:25,750 --> 00:16:28,570 And people have tried to apply things like steepest descent 330 00:16:28,570 --> 00:16:29,870 to these problems.
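As a quick check of that variational picture, here is a minimal sketch in Python with NumPy; the 2-by-2 matrix and right-hand side are illustrative assumptions, not the example from the slides. The gradient of phi(x) = 1/2 x'Ax - b'x is Ax - b, so the minimizer of phi is the solution of Ax = b when A is symmetric positive definite.

import numpy as np

# A small symmetric positive definite system (an illustrative assumption):
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

phi = lambda x: 0.5 * x @ A @ x - b @ x   # the quadratic objective
x_star = np.linalg.solve(A, b)            # solution of A x = b

print(A @ x_star - b)                     # gradient of phi at x_star: ~[0, 0]
# phi only increases if we move away from x_star, consistent with a minimum there
for d in np.eye(2):
    print(phi(x_star + 0.1 * d) > phi(x_star), phi(x_star - 0.1 * d) > phi(x_star))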
331 00:16:29,870 --> 00:16:31,870 And it turns out steepest descent is 332 00:16:31,870 --> 00:16:33,980 kind of challenging to apply. 333 00:16:33,980 --> 00:16:38,560 So what winds up happening is let's 334 00:16:38,560 --> 00:16:41,020 suppose we don't take our quadratic approximation 335 00:16:41,020 --> 00:16:42,670 for the descent direction first. 336 00:16:42,670 --> 00:16:45,820 Let's just say we take some fixed step size, right. 337 00:16:45,820 --> 00:16:50,050 When you take that fixed step size, it'll always be, 338 00:16:50,050 --> 00:16:54,250 let's say good for one particular direction. 339 00:16:54,250 --> 00:16:56,110 OK, so I'll step in a particular direction. 340 00:16:56,110 --> 00:16:56,850 It'll be good. 341 00:16:56,850 --> 00:16:59,470 It'll be a nice step into a local minimum. 342 00:16:59,470 --> 00:17:02,260 But when I try to step in the next gradient direction, 343 00:17:02,260 --> 00:17:03,850 it may be too big or too small. 344 00:17:03,850 --> 00:17:06,700 And that will depend on the eigenvalues associated 345 00:17:06,700 --> 00:17:09,160 with the direction that I am trying to step in, OK. 346 00:17:09,160 --> 00:17:12,460 How steep is this convex function? 347 00:17:12,460 --> 00:17:13,240 Right? 348 00:17:13,240 --> 00:17:15,180 How strongly curved is that convex function? 349 00:17:15,180 --> 00:17:17,619 That's what the eigenvalues are describing. 350 00:17:17,619 --> 00:17:20,680 And so fixed value of alpha will lead to cases 351 00:17:20,680 --> 00:17:23,778 where we wind up stepping too far or not far enough. 352 00:17:23,778 --> 00:17:25,569 And there'll be a lot of oscillating around 353 00:17:25,569 --> 00:17:28,900 on this path that converges to a solution. 354 00:17:28,900 --> 00:17:31,200 I showed you how to pick an optimal step size. 355 00:17:31,200 --> 00:17:32,980 It said look in a particular direction 356 00:17:32,980 --> 00:17:34,750 and treat your function as though it were 357 00:17:34,750 --> 00:17:37,120 quadratic along that direction. 358 00:17:37,120 --> 00:17:40,124 That's going to be true for all directions associated 359 00:17:40,124 --> 00:17:41,290 with this functional, right. 360 00:17:41,290 --> 00:17:44,570 It's always quadratic no matter which direction I point in. 361 00:17:44,570 --> 00:17:45,070 Right? 362 00:17:45,070 --> 00:17:47,440 So I pick a direction and I step and I'll 363 00:17:47,440 --> 00:17:50,020 be stepping to the minimal point along that direction. 364 00:17:50,020 --> 00:17:51,664 It'll be exact, OK. 365 00:17:51,664 --> 00:17:53,080 And then I've got to turn and I've 366 00:17:53,080 --> 00:17:55,150 got to go in another gradient direction 367 00:17:55,150 --> 00:17:56,810 and take a step there. 368 00:17:56,810 --> 00:17:58,980 And I'll turn and go in another gradient direction 369 00:17:58,980 --> 00:17:59,920 and take a step there. 370 00:17:59,920 --> 00:18:04,360 And in each direction I go, I'll be minimizing every time. 371 00:18:04,360 --> 00:18:09,127 Because this step size is the ideal step size. 372 00:18:09,127 --> 00:18:11,210 But it turns out you can do even better than that. 373 00:18:14,390 --> 00:18:16,060 So we can step in some direction, which 374 00:18:16,060 --> 00:18:19,270 is a descent direction, but not necessarily 375 00:18:19,270 --> 00:18:20,920 the steepest descent. 376 00:18:20,920 --> 00:18:23,979 And it's going to give us some extra control over how 377 00:18:23,979 --> 00:18:25,270 we're minimizing this function. 
378 00:18:25,270 --> 00:18:27,040 I'll explain on the next slide, OK. 379 00:18:27,040 --> 00:18:28,540 The first thing you've got to do though 380 00:18:28,540 --> 00:18:32,324 is, given some descent direction, what is the optimal step size? 381 00:18:32,324 --> 00:18:34,240 Well, we'll work that out the same way, right. 382 00:18:34,240 --> 00:18:37,480 We can write f at the next iterate in terms 383 00:18:37,480 --> 00:18:41,830 of f at the current iterate plus all the perturbations, right. 384 00:18:41,830 --> 00:18:46,907 So our step method is Xi plus 1 is Xi plus alpha pi, right. 385 00:18:46,907 --> 00:18:48,490 So we do a Taylor expansion, and we'll 386 00:18:48,490 --> 00:18:51,250 get a quadratic function again. 387 00:18:51,250 --> 00:18:54,190 And we'll minimize this quadratic function with respect 388 00:18:54,190 --> 00:18:58,160 to alpha i when alpha takes on this value. 389 00:18:58,160 --> 00:19:01,970 So this is the value of the vertex of this function. 390 00:19:01,970 --> 00:19:03,910 So we'll minimize this quadratic function 391 00:19:03,910 --> 00:19:06,175 in one direction, the direction p. 392 00:19:08,700 --> 00:19:10,660 But is there an optimal choice of direction? 393 00:19:10,660 --> 00:19:14,860 Is it really best to step in the descent direction? 394 00:19:14,860 --> 00:19:17,710 Or are there better directions that I could go in? 395 00:19:17,710 --> 00:19:20,080 We thought going downhill fastest might be best, 396 00:19:20,080 --> 00:19:21,860 but maybe that's not true. 397 00:19:21,860 --> 00:19:23,920 Because if I point in a direction 398 00:19:23,920 --> 00:19:26,650 and I apply my quadratic approximation, 399 00:19:26,650 --> 00:19:28,550 I minimize the function in this direction. 400 00:19:28,550 --> 00:19:30,050 Now I'm going to turn, and I'm going 401 00:19:30,050 --> 00:19:31,630 to go in a different direction. 402 00:19:31,630 --> 00:19:34,210 And I'll minimize it here, but I'll 403 00:19:34,210 --> 00:19:37,170 lose some of the minimization that I got previously, right? 404 00:19:37,170 --> 00:19:38,467 I minimized in this direction. 405 00:19:38,467 --> 00:19:40,300 Then I turned, I went some other way, right. 406 00:19:40,300 --> 00:19:41,720 And I minimized in this direction. 407 00:19:41,720 --> 00:19:44,860 So this will still be a process that will sort of weave 408 00:19:44,860 --> 00:19:47,630 back and forth potentially. 409 00:19:47,630 --> 00:19:51,130 And so the idea instead is to try to preserve minimization 410 00:19:51,130 --> 00:19:53,900 along one particular direction. 411 00:19:53,900 --> 00:19:56,510 So how do we choose an optimal direction? 412 00:19:56,510 --> 00:19:59,060 So f, right, at the current iterate, 413 00:19:59,060 --> 00:20:01,550 it's already minimized along p, right. 414 00:20:01,550 --> 00:20:03,957 Moving in p, forwards or backwards, 415 00:20:03,957 --> 00:20:05,540 isn't going to make f any smaller. 416 00:20:05,540 --> 00:20:07,130 That's as small as it can be. 417 00:20:09,680 --> 00:20:16,610 So why not choose p so that it's normal to the gradient 418 00:20:16,610 --> 00:20:17,980 at the next iterate? 419 00:20:17,980 --> 00:20:20,330 OK, so choose this direction p so it's 420 00:20:20,330 --> 00:20:22,610 normal to the gradient at the next iterate. 421 00:20:22,610 --> 00:20:27,170 And then see if that holds for one iterate more after that. 422 00:20:27,170 --> 00:20:29,810 So I move in a direction. 423 00:20:29,810 --> 00:20:31,910 I step up to a contour.
424 00:20:31,910 --> 00:20:34,970 And I want my p to be orthogonal to the gradient 425 00:20:34,970 --> 00:20:35,990 at that next contour. 426 00:20:35,990 --> 00:20:38,810 So I've minimized this way, right. 427 00:20:38,810 --> 00:20:41,240 I've minimized everything that I could in directions 428 00:20:41,240 --> 00:20:43,310 that aren't in the gradient direction 429 00:20:43,310 --> 00:20:44,945 associated with the next iterate. 430 00:20:44,945 --> 00:20:47,570 And then let's see if I can even do that for the next iteration 431 00:20:47,570 --> 00:20:48,070 too. 432 00:20:48,070 --> 00:20:51,020 So can I make it true that the gradient at the next iterate 433 00:20:51,020 --> 00:20:52,650 is also orthogonal to p? 434 00:20:56,870 --> 00:20:59,710 By doing this, I get to preserve all the minimization 435 00:20:59,710 --> 00:21:00,710 from the previous steps. 436 00:21:00,710 --> 00:21:02,734 So I minimize in this direction. 437 00:21:02,734 --> 00:21:05,150 And now I'm going to take a step in a different direction. 438 00:21:05,150 --> 00:21:06,950 But I'm going to make sure that as I 439 00:21:06,950 --> 00:21:10,631 take that step in another direction, right, 440 00:21:10,631 --> 00:21:12,630 I don't have to step completely in the gradient. 441 00:21:12,630 --> 00:21:14,360 I don't have to go in the steepest descent direction. 442 00:21:14,360 --> 00:21:16,050 I can project out everything that I've 443 00:21:16,050 --> 00:21:17,640 stepped in already, right. 444 00:21:17,640 --> 00:21:19,380 I can project out all the minimization 445 00:21:19,380 --> 00:21:22,860 I've already accomplished along this p direction. 446 00:21:22,860 --> 00:21:25,860 So it turns out you can solve, right, 447 00:21:25,860 --> 00:21:28,260 you can calculate what this gradient is. 448 00:21:28,260 --> 00:21:32,610 The gradient in this function is Ax minus b. 449 00:21:32,610 --> 00:21:35,514 So you can substitute exactly what that gradient is. 450 00:21:35,514 --> 00:21:41,550 A times Xi plus 2, minus b, dotted with p, right. 451 00:21:41,550 --> 00:21:43,080 This has to be equal to 0. 452 00:21:43,080 --> 00:21:46,260 And you can show that means that p i plus 1 transpose A times 453 00:21:46,260 --> 00:21:48,497 p i has to be equal to 0 as well. 454 00:21:48,497 --> 00:21:50,830 You don't need to be able to work through these details. 455 00:21:50,830 --> 00:21:52,740 You just need to know that this gives 456 00:21:52,740 --> 00:21:54,780 a relationship between the directions 457 00:21:54,780 --> 00:21:57,774 on two consecutive iterates, OK. 458 00:21:57,774 --> 00:21:59,190 So it says if I picked a direction 459 00:21:59,190 --> 00:22:04,776 p on the previous iteration, take how it's transformed by A, 460 00:22:04,776 --> 00:22:07,110 and make sure that my next direction is 461 00:22:07,110 --> 00:22:10,228 orthogonal to that vector, OK. 462 00:22:10,228 --> 00:22:11,206 Yeah? 463 00:22:11,206 --> 00:22:15,118 STUDENT: So does that mean that your p's 464 00:22:15,118 --> 00:22:17,850 are all independent of each other, 465 00:22:17,850 --> 00:22:20,670 or just that adjacent p's, 466 00:22:20,670 --> 00:22:23,329 the k, k plus 1 p's, are? 467 00:22:23,329 --> 00:22:24,870 JAMES SWAN: This is a great question. 468 00:22:29,300 --> 00:22:34,040 So the goal with this method, the ideal way to do this 469 00:22:34,040 --> 00:22:36,080 would be to have these directions actually 470 00:22:36,080 --> 00:22:40,301 be the directions of the eigenvectors of A.
471 00:22:40,301 --> 00:22:42,710 And those eigenvectors for symmetric matrix 472 00:22:42,710 --> 00:22:45,230 are all orthogonal to each other. 473 00:22:45,230 --> 00:22:46,361 OK? 474 00:22:46,361 --> 00:22:48,860 And so you'll be stepping along these orthogonal directions. 475 00:22:48,860 --> 00:22:50,990 And they would be all independent of each other. 476 00:22:50,990 --> 00:22:51,200 OK? 477 00:22:51,200 --> 00:22:53,491 But that's a hard problem, finding all the eigenvectors 478 00:22:53,491 --> 00:22:55,180 associated with a matrix. 479 00:22:55,180 --> 00:22:59,900 Instead, OK, we pick an initial direction p to go in. 480 00:22:59,900 --> 00:23:02,420 And then we try to ensure that all of the other directions 481 00:23:02,420 --> 00:23:07,640 satisfy this conjugacy condition, right. 482 00:23:07,640 --> 00:23:09,620 That the transformation of p by A 483 00:23:09,620 --> 00:23:12,630 is orthogonal with the next direction that I choose. 484 00:23:12,630 --> 00:23:15,450 So they're not independent of each other. 485 00:23:15,450 --> 00:23:19,280 But they are what we call conjugate to each other. 486 00:23:19,280 --> 00:23:23,530 It turns out that by doing this, these sets of directions p 487 00:23:23,530 --> 00:23:25,190 will belong to-- 488 00:23:25,190 --> 00:23:29,120 they can be expressed in terms of many products of A 489 00:23:29,120 --> 00:23:30,660 with the initial direction p. 490 00:23:30,660 --> 00:23:32,660 That'll give you all these different directions. 491 00:23:32,660 --> 00:23:35,150 It starts to look something like the power iteration method 492 00:23:35,150 --> 00:23:38,121 for finding the largest eigenvector of a matrix. 493 00:23:38,121 --> 00:23:38,620 OK? 494 00:23:38,620 --> 00:23:41,450 So you create a certain set of vectors 495 00:23:41,450 --> 00:23:44,045 that span the entire subspace of A. 496 00:23:44,045 --> 00:23:47,456 And you step specifically along those directions. 497 00:23:47,456 --> 00:23:49,580 And that lets you preserve some of the minimization 498 00:23:49,580 --> 00:23:51,470 as you step each way. 499 00:23:54,490 --> 00:23:59,000 So what's said here is that the direction p plus 1 500 00:23:59,000 --> 00:24:03,290 is conjugate to the direction p. 501 00:24:03,290 --> 00:24:05,420 And by choosing the directions in this way, 502 00:24:05,420 --> 00:24:10,520 you're ensuring that p is orthogonal to the gradient at i 503 00:24:10,520 --> 00:24:12,850 plus 1 and the gradient i plus 2. 504 00:24:12,850 --> 00:24:17,000 So you're not stepping in the steepest descent directions 505 00:24:17,000 --> 00:24:21,328 that you'll pick up later on in the iterative process. 506 00:24:21,328 --> 00:24:23,074 OK? 507 00:24:23,074 --> 00:24:25,240 So when you know which direction you're stepping in, 508 00:24:25,240 --> 00:24:29,100 then you've got to satisfy this conjugacy condition. 509 00:24:29,100 --> 00:24:34,710 But actually, this is a vector in n space, right. 510 00:24:34,710 --> 00:24:37,000 This is also a vector n space. 511 00:24:37,000 --> 00:24:40,260 And we have one equation to describe all and components. 512 00:24:40,260 --> 00:24:42,430 So it's an under-determined problem. 513 00:24:42,430 --> 00:24:45,960 So then one has to pick which particular one 514 00:24:45,960 --> 00:24:48,240 of these conjugate vectors do I want to step along. 
515 00:24:48,240 --> 00:24:49,900 And one particular choice is this one, 516 00:24:49,900 --> 00:24:53,790 which says, step along the gradient direction, OK, do 517 00:24:53,790 --> 00:24:57,450 steepest descent, but project out 518 00:24:57,450 --> 00:25:00,090 the component of the gradient along pi. 519 00:25:00,090 --> 00:25:01,650 We already minimized along pi. 520 00:25:01,650 --> 00:25:04,108 We don't have to go in the pi direction anymore, right. 521 00:25:04,108 --> 00:25:08,676 So do steepest descent, but remove the pi component. 522 00:25:13,140 --> 00:25:19,590 So here is a quadratic objective function. 523 00:25:19,590 --> 00:25:22,350 It corresponds to a linear equation with coefficient 524 00:25:22,350 --> 00:25:27,690 matrix [1 0; 0 10], a diagonal coefficient matrix. 525 00:25:27,690 --> 00:25:29,760 And b equals 0. 526 00:25:29,760 --> 00:25:33,720 So the solution of the system of linear equations is (0, 0). 527 00:25:33,720 --> 00:25:37,380 We start with an initial guess up here, OK. 528 00:25:37,380 --> 00:25:43,390 And we try steepest descent with some small step size, right. 529 00:25:43,390 --> 00:25:44,870 You'll follow this blue path here. 530 00:25:44,870 --> 00:25:46,120 And you can see what happened. 531 00:25:46,120 --> 00:25:49,240 That step size was reasonable as we 532 00:25:49,240 --> 00:25:51,250 moved along the steepest descent direction 533 00:25:51,250 --> 00:25:54,950 where the contours were pretty narrowly spaced. 534 00:25:54,950 --> 00:25:57,290 But as we got down to the flatter section, 535 00:25:57,290 --> 00:25:59,710 OK, as we got down to the flatter 536 00:25:59,710 --> 00:26:03,760 section of our objective function, 537 00:26:03,760 --> 00:26:05,060 those steps are really small. 538 00:26:05,060 --> 00:26:05,230 Right? 539 00:26:05,230 --> 00:26:06,729 We're headed in the right direction, 540 00:26:06,729 --> 00:26:09,545 we're just taking very, very small steps. 541 00:26:09,545 --> 00:26:13,370 If you apply this conjugate gradient methodology, well, 542 00:26:13,370 --> 00:26:18,110 the first step you take, that's prescribed. 543 00:26:18,110 --> 00:26:21,290 You've got to step in some direction. 544 00:26:21,290 --> 00:26:24,230 The second step you take though minimizes completely 545 00:26:24,230 --> 00:26:25,610 along this direction. 546 00:26:25,610 --> 00:26:27,800 So the first step was the same for both of these. 547 00:26:27,800 --> 00:26:32,710 But the second step was chosen to minimize completely 548 00:26:32,710 --> 00:26:33,790 along this direction. 549 00:26:33,790 --> 00:26:35,980 So it's totally minimized. 550 00:26:35,980 --> 00:26:41,490 And the third step here also steps all the way 551 00:26:41,490 --> 00:26:43,020 to the center. 552 00:26:43,020 --> 00:26:44,610 So it shows a conjugate direction that 553 00:26:44,610 --> 00:26:46,290 stepped from here to there. 554 00:26:46,290 --> 00:26:49,110 And it didn't lose any of the minimization 555 00:26:49,110 --> 00:26:51,960 in the original direction that it proceeded along. 556 00:26:51,960 --> 00:26:54,936 So that's conjugate gradient. 557 00:26:54,936 --> 00:26:58,190 It's used to solve linear equations with order n 558 00:26:58,190 --> 00:26:59,230 iterations, right. 559 00:26:59,230 --> 00:27:05,690 So A has at most n independent eigenvectors, 560 00:27:05,690 --> 00:27:08,360 independent directions that I can step along and do 561 00:27:08,360 --> 00:27:10,700 this minimization. 562 00:27:10,700 --> 00:27:13,070 The conjugate gradient method is doing precisely that.
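Here is a minimal numerical sketch, in Python with NumPy, of the comparison in that figure; the starting guess and the fixed step size are illustrative assumptions. For this quadratic objective, the conjugate gradient iteration reaches the solution at (0, 0) in two steps, while fixed-step steepest descent creeps along the shallow direction.

import numpy as np

A = np.diag([1.0, 10.0])           # the diagonal coefficient matrix from the example
b = np.zeros(2)                    # b = 0, so the solution of A x = b is (0, 0)
x0 = np.array([2.0, 1.5])          # an illustrative starting guess

# steepest descent with a fixed step size alpha
x = x0.copy()
for _ in range(50):
    g = A @ x - b                  # gradient of 0.5*x.A.x - b.x
    x = x - 0.05 * g
print("fixed-step steepest descent after 50 steps:", x)

# conjugate gradient: first step is steepest descent, later directions are conjugate
x, r = x0.copy(), b - A @ x0
p = r.copy()
for _ in range(2):                 # n = 2 unknowns, so at most 2 steps
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)     # exact minimizer along the direction p
    x = x + alpha * p
    r_new = r - alpha * Ap
    p = r_new + (r_new @ r_new) / (r @ r) * p   # conjugate to the previous direction
    r = r_new
print("conjugate gradient after 2 steps:", x)    # essentially (0, 0)

Note that A only ever enters through products like A @ p, which is the property the lecture comes back to in a moment.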
563 00:27:13,070 --> 00:27:14,570 It doesn't know what the eigendirections are, but it 564 00:27:14,570 --> 00:27:17,150 steps along these conjugate directions as a proxy 565 00:27:17,150 --> 00:27:19,310 for the eigendirections. 566 00:27:19,310 --> 00:27:24,560 So it can do minimization with just n steps for a system 567 00:27:24,560 --> 00:27:26,030 of n equations for n unknowns. 568 00:27:28,890 --> 00:27:31,350 It requires only the ability to compute 569 00:27:31,350 --> 00:27:34,962 the product of your matrix A with some vector, right. 570 00:27:34,962 --> 00:27:36,420 All the calculations there only 571 00:27:36,420 --> 00:27:38,820 depend on the product of A with a vector. 572 00:27:38,820 --> 00:27:41,340 So we don't have to store A, we just have to know what A is. 573 00:27:41,340 --> 00:27:44,250 We have some procedure for generating A. 574 00:27:44,250 --> 00:27:48,300 Maybe A is a linear operator that comes 575 00:27:48,300 --> 00:27:50,550 from a solution of some differential equations 576 00:27:50,550 --> 00:27:51,892 instead, right. 577 00:27:51,892 --> 00:27:53,850 And we don't have an explicit expression for A, 578 00:27:53,850 --> 00:27:57,990 but we have some simulator that takes some data, 579 00:27:57,990 --> 00:28:01,732 and applies A to give some answer, right. 580 00:28:01,732 --> 00:28:02,940 So we just need this product. 581 00:28:02,940 --> 00:28:05,280 We don't have to store A exactly. 582 00:28:05,280 --> 00:28:07,650 It's only good for symmetric positive definite matrices, 583 00:28:07,650 --> 00:28:08,340 right. 584 00:28:08,340 --> 00:28:10,950 This sort of free energy functional that we wrote 585 00:28:10,950 --> 00:28:12,630 or objective function we wrote only 586 00:28:12,630 --> 00:28:16,239 admits symmetric matrices which are positive definite. 587 00:28:16,239 --> 00:28:18,030 That's the only way it will have a minimum. 588 00:28:18,030 --> 00:28:20,460 And so that's the only way a steepest descent or descent-type 589 00:28:20,460 --> 00:28:24,930 procedure is going to get to the optimum. 590 00:28:24,930 --> 00:28:27,390 But there are more sophisticated methods that 591 00:28:27,390 --> 00:28:28,700 exist for arbitrary matrices. 592 00:28:28,700 --> 00:28:32,970 So if we don't want symmetry or we don't care about 593 00:28:32,970 --> 00:28:34,774 whether it's positive definite, there 594 00:28:34,774 --> 00:28:36,690 are equivalent sorts of methods that are based 595 00:28:36,690 --> 00:28:39,600 around the same principle. 596 00:28:39,600 --> 00:28:41,950 And it turns out, this is really the state of the art. 597 00:28:41,950 --> 00:28:44,970 So if you want to solve complicated large systems 598 00:28:44,970 --> 00:28:48,560 of equations, you know Gaussian elimination, that 599 00:28:48,560 --> 00:28:49,920 will get you an exact solution. 600 00:28:49,920 --> 00:28:52,817 But that's often infeasible for the sorts of problems 601 00:28:52,817 --> 00:28:54,150 that we're really interested in. 602 00:28:54,150 --> 00:28:57,120 So instead, you use these sorts of iterative methods. 603 00:28:57,120 --> 00:28:59,850 Things like Jacobi and Gauss-Seidel, 604 00:28:59,850 --> 00:29:02,780 they're sort of the classics in the field. 605 00:29:02,780 --> 00:29:05,220 And they work, and you can show that they converge in 606 00:29:05,220 --> 00:29:06,720 lots of circumstances.
607 00:29:06,720 --> 00:29:09,150 But these sorts of iterative methods, 608 00:29:09,150 --> 00:29:13,140 like conjugate gradient and its brethren other Krylov subspace 609 00:29:13,140 --> 00:29:14,900 methods they're called, are really 610 00:29:14,900 --> 00:29:17,344 the state of the art, and the ones that you reach for. 611 00:29:17,344 --> 00:29:19,510 You already did conjugate gradients in one homework, 612 00:29:19,510 --> 00:29:20,010 right. 613 00:29:20,010 --> 00:29:23,220 You used this PCG iterative method in Matlab 614 00:29:23,220 --> 00:29:24,980 to solve a system of linear equations. 615 00:29:24,980 --> 00:29:26,310 It was doing this, right. 616 00:29:26,310 --> 00:29:28,154 This is how it works. 617 00:29:28,154 --> 00:29:29,570 OK? 618 00:29:29,570 --> 00:29:31,930 OK. 619 00:29:31,930 --> 00:29:33,420 OK, so that's conjugate gradients. 620 00:29:33,420 --> 00:29:41,400 You could apply it also to objective functions that 621 00:29:41,400 --> 00:29:44,786 aren't quadratic in nature. 622 00:29:44,786 --> 00:29:46,680 And the formulation changes a little bit. 623 00:29:46,680 --> 00:29:48,630 Everywhere the matrix A appeared, 624 00:29:48,630 --> 00:29:51,210 it needs to be replaced with the Hessian 625 00:29:51,210 --> 00:29:52,590 at a certain iterate. 626 00:29:52,590 --> 00:29:54,300 But the same idea persists. 627 00:29:54,300 --> 00:29:56,716 It says well, we think, in our best approximation 628 00:29:56,716 --> 00:29:58,590 for the function, that we've minimized as much 629 00:29:58,590 --> 00:29:59,890 as we can in one direction. 630 00:29:59,890 --> 00:30:02,310 So let's choose a conjugate direction to go in, 631 00:30:02,310 --> 00:30:04,950 and try not to ruin the minimizations we 632 00:30:04,950 --> 00:30:07,860 did in the direction we were headed before. 633 00:30:07,860 --> 00:30:12,390 Of course, these are all linearly convergent sorts 634 00:30:12,390 --> 00:30:15,180 of methods. 635 00:30:15,180 --> 00:30:16,860 And we know that there are better ways 636 00:30:16,860 --> 00:30:20,070 to find roots of non-linear equations like this one, 637 00:30:20,070 --> 00:30:22,410 grad f equals zero, namely the Newton-Raphson method, 638 00:30:22,410 --> 00:30:23,950 which is quadratically convergent. 639 00:30:23,950 --> 00:30:27,010 So if we're really close to a critical point, 640 00:30:27,010 --> 00:30:29,730 and hopefully that critical point is a minimum of f, 641 00:30:29,730 --> 00:30:31,860 right, then we should rapidly converge 642 00:30:31,860 --> 00:30:35,700 to the solution of this system of nonlinear equations 643 00:30:35,700 --> 00:30:37,548 just by applying the Newton-Raphson method. 644 00:30:40,620 --> 00:30:42,120 It's locally convergent, right. 645 00:30:42,120 --> 00:30:43,500 So we're going to get close. 646 00:30:43,500 --> 00:30:45,420 And we get quadratic improvement. 647 00:30:45,420 --> 00:30:48,750 What is the Newton-Raphson iteration, though? 648 00:30:48,750 --> 00:30:49,890 Can you write that down? 649 00:30:49,890 --> 00:30:51,460 What is the Newton-Raphson iteration 650 00:30:51,460 --> 00:30:55,110 that's the iterative map for this system 651 00:30:55,110 --> 00:30:58,230 of non-linear equations, grad f equals 0? 652 00:30:58,230 --> 00:31:00,890 Can you work that out? 653 00:31:00,890 --> 00:31:01,850 What's that look like? 654 00:32:11,200 --> 00:32:13,200 Have we got this?
655 00:32:13,200 --> 00:32:14,990 What's the Newton-Raphson iterative map 656 00:32:14,990 --> 00:32:20,630 look like for this system of non-linear equations? 657 00:32:23,450 --> 00:32:25,178 Want to volunteer an answer? 658 00:32:28,530 --> 00:32:30,500 Nobody knows or nobody is sharing. 659 00:32:30,500 --> 00:32:33,180 OK, that's fine. 660 00:32:33,180 --> 00:32:36,680 Right, so we're trying to solve an equation g of x equals 0. 661 00:32:36,680 --> 00:32:41,570 So the iterative map is Xi plus 1 is Xi minus Jacobian 662 00:32:41,570 --> 00:32:43,470 inverse times g. 663 00:32:43,470 --> 00:32:45,790 And what's the Jacobian of g? 664 00:32:48,680 --> 00:32:50,830 What's the Jacobian of g? 665 00:32:50,830 --> 00:32:52,730 The Hessian, right. 666 00:32:52,730 --> 00:32:54,810 So the Jacobian of g is the gradient 667 00:32:54,810 --> 00:32:57,690 of g, which is the gradient of the gradient of f, which is 668 00:32:57,690 --> 00:32:59,280 the definition of the Hessian. 669 00:32:59,280 --> 00:33:01,680 So really, the Newton-Raphson iteration 670 00:33:01,680 --> 00:33:06,915 is Xi plus 1 is Xi minus Hessian inverse times g. 671 00:33:14,350 --> 00:33:17,590 So the Hessian plays the role of the Jacobian, 672 00:33:17,590 --> 00:33:19,120 in this sort of solution procedure. 673 00:33:24,492 --> 00:33:26,450 And so everything you know about Newton-Raphson 674 00:33:26,450 --> 00:33:27,560 is going to apply here. 675 00:33:27,560 --> 00:33:29,810 Everything you know about quasi-Newton-Raphson methods 676 00:33:29,810 --> 00:33:31,490 is going to apply here. 677 00:33:31,490 --> 00:33:33,675 You're going to substitute for your nonlinear function. 678 00:33:33,675 --> 00:33:35,800 The nonlinear function you're finding the root for, 679 00:33:35,800 --> 00:33:37,085 you're going to substitute the gradient. 680 00:33:37,085 --> 00:33:38,501 And for the Jacobian, you're going 681 00:33:38,501 --> 00:33:40,190 to substitute the Hessian. 682 00:33:40,190 --> 00:33:44,360 Places where the determinant of the Hessian is 0, 683 00:33:44,360 --> 00:33:45,770 right, it's going to be a problem. 684 00:33:45,770 --> 00:33:46,985 Places where the Hessian is singular 685 00:33:46,985 --> 00:33:48,026 are going to be a problem. 686 00:33:48,026 --> 00:33:49,280 Same as with the Jacobian. 687 00:33:52,342 --> 00:33:54,050 But Newton-Raphson has the great property 688 00:33:54,050 --> 00:33:58,250 that if our function is quadratic, like this one is, 689 00:33:58,250 --> 00:34:01,630 it will converge in exactly one step. 690 00:34:06,020 --> 00:34:07,690 So here's steepest descent with a fixed 691 00:34:07,690 --> 00:34:10,449 value of alpha, Newton-Raphson, one step 692 00:34:10,449 --> 00:34:13,370 for a quadratic function. 693 00:34:13,370 --> 00:34:16,216 And why is it one step? 694 00:34:16,216 --> 00:34:18,636 STUDENT: [INAUDIBLE] 695 00:34:23,322 --> 00:34:24,030 JAMES SWAN: Good. 696 00:34:24,030 --> 00:34:29,760 So when we take a Taylor expansion of our f, 697 00:34:29,760 --> 00:34:32,580 in order to derive the Newton-Raphson step, 698 00:34:32,580 --> 00:34:35,420 we're expanding it out to quadratic order. 699 00:34:35,420 --> 00:34:36,690 If the function is quadratic, 700 00:34:36,690 --> 00:34:38,250 the Taylor expansion is exact. 701 00:34:38,250 --> 00:34:40,800 And the solution of that equation, right, 702 00:34:40,800 --> 00:34:43,739 gradient f equals 0 or g equals 0, 703 00:34:43,739 --> 00:34:45,870 that's the solution of a linear equation, right.
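Here is a minimal sketch of that iterative map in Python with NumPy; the quadratic test problem is an illustrative assumption. The Hessian stands in for the Jacobian, and the linear system is solved directly at each step rather than forming the inverse.

import numpy as np

def newton_raphson_min(grad, hess, x0, tol=1e-10, max_iter=50):
    # Newton-Raphson applied to g(x) = grad f(x) = 0; the Hessian plays the role of the Jacobian.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)   # x_{i+1} = x_i - H^{-1} grad f
    return x

# For a quadratic f(x) = 0.5*x.A.x - b.x, a single Newton-Raphson step is exact:
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = newton_raphson_min(grad=lambda x: A @ x - b, hess=lambda x: A, x0=[5.0, 5.0])
print(x, np.linalg.solve(A, b))               # the two agree: one step lands on the minimum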
704 00:34:48,630 --> 00:34:53,100 So it gives exactly the right step size here 705 00:34:53,100 --> 00:34:57,240 to move from an initial guess to the exact solution 706 00:34:57,240 --> 00:34:58,745 or the minimum of this equation. 707 00:34:58,745 --> 00:35:02,490 So for quadratic equations, Newton-Raphson is exact. 708 00:35:02,490 --> 00:35:06,750 It doesn't go in the steepest descent direction, right. 709 00:35:06,750 --> 00:35:09,452 It goes in a different direction. 710 00:35:09,452 --> 00:35:11,660 It would like to go in the steepest descent direction 711 00:35:11,660 --> 00:35:15,450 if the Jacobian were identity. 712 00:35:15,450 --> 00:35:19,600 But the Jacobian is a measure of how curved f is. 713 00:35:19,600 --> 00:35:23,310 The Hessian, let's say, is a measure of how curved f is. 714 00:35:23,310 --> 00:35:23,940 Right? 715 00:35:23,940 --> 00:35:26,530 And so there's a projection of the gradient 716 00:35:26,530 --> 00:35:29,980 through the Hessian that changes the direction we go in. 717 00:35:29,980 --> 00:35:31,450 That change in direction is meant 718 00:35:31,450 --> 00:35:35,770 to find the minimum of the quadratic function that we 719 00:35:35,770 --> 00:35:37,090 approximate at this point. 720 00:35:37,090 --> 00:35:39,400 So as long as we have a good quadratic approximation, 721 00:35:39,400 --> 00:35:42,024 Newton-Raphson is going to give us good convergence to a minimum 722 00:35:42,024 --> 00:35:44,450 or whatever nearby critical point there is. 723 00:35:44,450 --> 00:35:46,690 If we have a bad approximation for a quadratic, 724 00:35:46,690 --> 00:35:50,160 then it's not going to be so good, right. 725 00:35:50,160 --> 00:35:51,780 So here's this very steep function. 726 00:35:51,780 --> 00:35:56,850 Log of f is quadratic, but f is exponential in x here. 727 00:35:56,850 --> 00:36:00,390 So you got all these tightly spaced contours converging 728 00:36:00,390 --> 00:36:02,790 towards a minimum at (0, 0). 729 00:36:02,790 --> 00:36:05,250 And here I've got to use the steepest descent step 730 00:36:05,250 --> 00:36:07,860 size, the optimal steepest descent step size, which 731 00:36:07,860 --> 00:36:09,960 is a quadratic approximation for the function, 732 00:36:09,960 --> 00:36:12,075 but in the steepest descent direction only. 733 00:36:12,075 --> 00:36:14,640 And here's the path that it follows. 734 00:36:14,640 --> 00:36:16,710 And if I applied Newton-Raphson to this function, 735 00:36:16,710 --> 00:36:19,830 here is the path that it follows instead. 736 00:36:19,830 --> 00:36:21,110 The function isn't quadratic. 737 00:36:21,110 --> 00:36:23,700 So these quadratic approximations aren't-- 738 00:36:23,700 --> 00:36:25,590 they're not great, right. 739 00:36:25,590 --> 00:36:29,730 But the function is convex, right. 740 00:36:29,730 --> 00:36:31,950 So Newton-Raphson is going to proceed downhill 741 00:36:31,950 --> 00:36:35,340 until it converges towards a solution anyways. 742 00:36:35,340 --> 00:36:38,280 Because the Hessian has positive eigenvalues all the time. 743 00:36:40,800 --> 00:36:43,450 Questions about this? 744 00:36:43,450 --> 00:36:44,620 Make sense? 745 00:36:44,620 --> 00:36:45,350 OK? 746 00:36:45,350 --> 00:36:47,332 So you get two different types of methods 747 00:36:47,332 --> 00:36:48,290 that you can play with. 748 00:36:48,290 --> 00:36:54,250 One of which, right, is always going to direct you down hill. 749 00:36:54,250 --> 00:36:58,032 Steepest descent will always carry you downhill, right, 750 00:36:58,032 --> 00:36:58,740 towards a minimum.
751 00:36:58,740 --> 00:37:00,430 And the other one, Newton-Raphson, 752 00:37:00,430 --> 00:37:02,619 converges very quickly when it's close to the root. 753 00:37:02,619 --> 00:37:03,910 OK, so they each have a virtue. 754 00:37:07,504 --> 00:37:08,420 And they're different. 755 00:37:08,420 --> 00:37:10,100 They're fundamentally different, right. 756 00:37:10,100 --> 00:37:13,320 They take steps in completely different directions. 757 00:37:13,320 --> 00:37:15,604 When is Newton-Raphson not going to step down hill? 758 00:37:30,918 --> 00:37:31,906 STUDENT: [INAUDIBLE] 759 00:37:31,906 --> 00:37:32,894 What's that? 760 00:37:32,894 --> 00:37:35,890 STUDENT: [INAUDIBLE] 761 00:37:35,890 --> 00:37:37,929 JAMES SWAN: OK, that's more generic 762 00:37:37,929 --> 00:37:39,220 an answer than I'm looking for. 763 00:37:39,220 --> 00:37:43,390 So there may be circumstances where I have two local minima. 764 00:37:43,390 --> 00:37:45,700 That means there must be maybe a saddle point that 765 00:37:45,700 --> 00:37:47,331 sits between them. 766 00:37:47,331 --> 00:37:49,330 Newton-Raphson doesn't care which critical point 767 00:37:49,330 --> 00:37:50,050 it's going after. 768 00:37:50,050 --> 00:37:52,570 So it may try to approach the saddle point instead. 769 00:37:52,570 --> 00:37:54,280 That's true. 770 00:37:54,280 --> 00:37:56,030 That's true. 771 00:37:56,030 --> 00:37:56,690 Yeah? 772 00:37:56,690 --> 00:38:00,120 STUDENT: When Hessian [INAUDIBLE].. 773 00:38:00,120 --> 00:38:01,510 JAMES SWAN: Good, yeah. 774 00:38:01,510 --> 00:38:04,840 With the Hessian doesn't have all positive eigenvalues, 775 00:38:04,840 --> 00:38:05,980 right. 776 00:38:05,980 --> 00:38:10,390 So if all the eigenvalues of the Hessian are positive, 777 00:38:10,390 --> 00:38:15,190 then the transformation h times g or h inverse times g, 778 00:38:15,190 --> 00:38:17,760 it'll never switch the direction I'm going. 779 00:38:17,760 --> 00:38:20,450 I'll always be headed in a downhill direction. 780 00:38:20,450 --> 00:38:20,950 Right? 781 00:38:20,950 --> 00:38:25,011 In a direction that's anti-parallel to the gradient. 782 00:38:25,011 --> 00:38:25,510 OK? 783 00:38:25,510 --> 00:38:27,190 But if the eigenvalues of the Hessian 784 00:38:27,190 --> 00:38:30,430 are negative, if some of them are negative 785 00:38:30,430 --> 00:38:32,260 and the gradient has me pointing along 786 00:38:32,260 --> 00:38:35,890 that eigenvector in a significant amount, 787 00:38:35,890 --> 00:38:37,990 then this product will switch me around 788 00:38:37,990 --> 00:38:40,360 and will have me go uphill instead. 789 00:38:40,360 --> 00:38:44,260 It'll have me chasing down a maxima or a saddle point 790 00:38:44,260 --> 00:38:44,830 instead. 791 00:38:44,830 --> 00:38:47,050 That's what the quadratic approximation 792 00:38:47,050 --> 00:38:49,834 of our objective function will look like. 793 00:38:49,834 --> 00:38:52,000 It looks like there's a maximum or a saddle instead. 794 00:38:52,000 --> 00:38:54,671 And the function will run uphill. 795 00:38:54,671 --> 00:38:56,645 OK? 796 00:38:56,645 --> 00:38:58,520 So there lots of strengths to Newton-Raphson. 797 00:38:58,520 --> 00:38:59,936 Convergence is one of them, right. 798 00:38:59,936 --> 00:39:01,725 The rate of convergence is good. 799 00:39:01,725 --> 00:39:03,350 It's a locally convergent, that's good. 800 00:39:03,350 --> 00:39:04,760 It's got lots of weaknesses, though. 801 00:39:04,760 --> 00:39:05,060 Right? 
802 00:39:05,060 --> 00:39:06,950 It's going to be a pain when the Hessian is 803 00:39:06,950 --> 00:39:08,600 singular at various places. 804 00:39:08,600 --> 00:39:11,672 You've got to solve systems of linear equations 805 00:39:11,672 --> 00:39:13,130 to figure out what these steps are. 806 00:39:13,130 --> 00:39:16,460 That's expensive computationally. 807 00:39:16,460 --> 00:39:19,070 It's not designed to seek out minima, 808 00:39:19,070 --> 00:39:22,075 but to seek out critical points of our objective function. 809 00:39:24,870 --> 00:39:27,210 Steepest descent has lots of strengths, right. 810 00:39:27,210 --> 00:39:29,540 Always heads downhill, that's good. 811 00:39:29,540 --> 00:39:31,800 If we put a little quadratic approximation on it, 812 00:39:31,800 --> 00:39:36,870 we can even stabilize it and get good control over the descent. 813 00:39:36,870 --> 00:39:41,320 Its weaknesses are it's got the property 814 00:39:41,320 --> 00:39:44,827 that it's linearly convergent instead of quadratically 815 00:39:44,827 --> 00:39:46,035 convergent when it converges. 816 00:39:46,035 --> 00:39:47,470 So it's slower, right. 817 00:39:47,470 --> 00:39:49,450 It might be harder to find a minimum. 818 00:39:49,450 --> 00:39:52,210 You've seen several examples where the path sort of peters 819 00:39:52,210 --> 00:39:55,630 out with lots of little iterations, tiny steps 820 00:39:55,630 --> 00:39:57,400 towards the solution. 821 00:39:57,400 --> 00:39:59,740 That's a weakness of steepest descent. 822 00:39:59,740 --> 00:40:01,960 We know that if we go over the edge 823 00:40:01,960 --> 00:40:04,480 of a cliff on our potential energy landscape, 824 00:40:04,480 --> 00:40:06,977 steepest descent is just going to run away, right. 825 00:40:06,977 --> 00:40:08,560 As long as there's one of these edges, 826 00:40:08,560 --> 00:40:11,217 it'll just keep running downhill for as long as it can. 827 00:40:14,140 --> 00:40:17,540 So what's done is to try to combine these methods. 828 00:40:17,540 --> 00:40:19,370 Why choose one, right? 829 00:40:19,370 --> 00:40:22,640 We're trying to step our way towards a solution. 830 00:40:22,640 --> 00:40:24,140 What if we could craft a heuristic 831 00:40:24,140 --> 00:40:26,300 procedure that mixed these two? 832 00:40:26,300 --> 00:40:28,487 And when steepest descent would be best, use that. 833 00:40:28,487 --> 00:40:30,320 When Newton-Raphson would be best, use that. 834 00:40:30,320 --> 00:40:31,105 Yes? 835 00:40:31,105 --> 00:40:32,470 STUDENT: Just a quick question on Newton-Raphson. 836 00:40:32,470 --> 00:40:33,136 JAMES SWAN: Yes? 837 00:40:33,136 --> 00:40:34,682 STUDENT: Would it run downhill also 838 00:40:34,682 --> 00:40:36,674 if you started it over there? 839 00:40:36,674 --> 00:40:39,164 Or since it seeks critical points, 840 00:40:39,164 --> 00:40:42,482 could you go back up to the [INAUDIBLE].. 841 00:40:42,482 --> 00:40:43,940 JAMES SWAN: That's a good question. 842 00:40:43,940 --> 00:40:50,370 So if there's an asymptote in f, it 843 00:40:50,370 --> 00:40:54,720 will perceive the asymptote as a critical point and chase it. 844 00:40:54,720 --> 00:40:55,449 OK? 845 00:40:55,449 --> 00:40:57,240 And so if there's an asymptote in f, it can 846 00:40:57,240 --> 00:40:58,440 perceive that and chase it. 847 00:40:58,440 --> 00:41:01,000 It can also run away as it gets very far away. 848 00:41:01,000 --> 00:41:02,082 This is true. 849 00:41:02,082 --> 00:41:02,582 OK?
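To make the two kinds of steps concrete, here is a minimal sketch in Python/NumPy (the course tools are MATLAB, and these function names are just for illustration, not anything from the lecture). It builds both steps from a gradient g and Hessian H, and checks that the Newton-Raphson direction is downhill when H is positive definite:

    import numpy as np

    def steepest_descent_step(g, H):
        # Optimal steepest-descent step: direction -g, with the step size alpha
        # taken from the vertex of the quadratic (Taylor) model along that direction.
        alpha = (g @ g) / (g @ H @ g)
        return -alpha * g

    def newton_raphson_step(g, H):
        # Newton-Raphson step: solve H p = -g.  It heads for the critical point of
        # the local quadratic model, a minimum only if H is positive definite.
        return np.linalg.solve(H, -g)

    # Example on a quadratic bowl f(x) = 0.5 x.H.x, where Newton-Raphson is exact.
    H = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    x = np.array([2.0, -1.0])
    g = H @ x                       # gradient of the bowl at x

    p_nr = newton_raphson_step(g, H)
    p_sd = steepest_descent_step(g, H)

    print(x + p_nr)                 # lands exactly on the minimum at the origin
    print(g @ p_nr < 0)             # True here: a descent direction, since H is positive definite
    print(x + p_sd)                 # steepest descent makes progress but is not exact

On the quadratic bowl in this example the Newton-Raphson step lands exactly on the minimum, which is the "exact for quadratics" behavior described above; the optimal steepest-descent step only gets part of the way there.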
850 00:41:08,580 --> 00:41:11,940 The contour example that I gave you at the start of class 851 00:41:11,940 --> 00:41:15,120 had sort of bowl-shaped functions superimposed 852 00:41:15,120 --> 00:41:19,140 with a linear function, sort of planar function instead. 853 00:41:19,140 --> 00:41:24,520 For that one, right, the Hessian is ill-defined, right. 854 00:41:24,520 --> 00:41:26,160 There is no curvature to the function. 855 00:41:26,160 --> 00:41:28,500 But you can imagine adding a small bit of curvature 856 00:41:28,500 --> 00:41:29,410 to that, right. 857 00:41:29,410 --> 00:41:31,410 And depending on the direction of the curvature, 858 00:41:31,410 --> 00:41:35,290 Newton-Raphson may run downhill or it may run back uphill, 859 00:41:35,290 --> 00:41:35,790 right? 860 00:41:35,790 --> 00:41:37,230 We can't guarantee which direction it's going to go. 861 00:41:37,230 --> 00:41:39,150 Depends on the details of the function. 862 00:41:39,150 --> 00:41:39,870 Does that answer your question? 863 00:41:39,870 --> 00:41:40,370 Yeah? 864 00:41:40,370 --> 00:41:41,482 Good. 865 00:41:41,482 --> 00:41:43,690 STUDENT: Sir, can you just go back to that one slide? 866 00:41:43,690 --> 00:41:44,398 JAMES SWAN: Yeah. 867 00:41:44,398 --> 00:41:46,692 STUDENT: I'm just pointing out, if the eigenvalues of h 868 00:41:46,692 --> 00:41:49,982 are negative, then the formula there for alpha 869 00:41:49,982 --> 00:41:50,940 could have trouble too. 870 00:41:50,940 --> 00:41:51,940 JAMES SWAN: That's true. 871 00:41:51,940 --> 00:41:54,410 STUDENT: Similar to how the Newton-Raphson had trouble. 872 00:41:54,410 --> 00:41:56,551 JAMES SWAN: This is true. 873 00:41:56,551 --> 00:41:58,020 This is true, yeah. 874 00:41:58,020 --> 00:42:01,800 So we chose a quadratic approximation here, right, 875 00:42:01,800 --> 00:42:02,730 for our function. 876 00:42:02,730 --> 00:42:05,875 We sought a critical point of this quadratic approximation. 877 00:42:05,875 --> 00:42:07,750 We didn't mandate that it had to be a minimum. 878 00:42:07,750 --> 00:42:09,660 So that's absolutely right. 879 00:42:09,660 --> 00:42:13,620 So if h has negative eigenvalues and the gradient points 880 00:42:13,620 --> 00:42:16,599 enough in the direction of the eigenvectors 881 00:42:16,599 --> 00:42:18,390 associated with those negative eigenvalues, 882 00:42:18,390 --> 00:42:22,100 then we may have a case where alpha isn't positive. 883 00:42:22,100 --> 00:42:24,090 We required early on that alpha should be 884 00:42:24,090 --> 00:42:25,470 positive for steepest descent. 885 00:42:25,470 --> 00:42:28,930 So we can't have a case where alpha is not positive. 886 00:42:28,930 --> 00:42:31,380 That's true. 887 00:42:31,380 --> 00:42:32,221 OK. 888 00:42:32,221 --> 00:42:33,720 So they're both interesting methods, 889 00:42:33,720 --> 00:42:35,270 and they can be mixed together. 890 00:42:35,270 --> 00:42:37,250 And the way you mix those is with what's 891 00:42:37,250 --> 00:42:41,030 called trust-region ideas, OK. 892 00:42:41,030 --> 00:42:44,630 Because it could be that we've had an iteration 893 00:42:44,630 --> 00:42:48,380 Xi and we do a quadratic approximation 894 00:42:48,380 --> 00:42:51,350 to our functional, which is this blue curve. 895 00:42:51,350 --> 00:42:54,260 Our quadratic approximation is this red one. 896 00:42:54,260 --> 00:42:56,150 And we find the minimum of this red curve 897 00:42:56,150 --> 00:42:58,765 and use that as our next best guess for the solution 898 00:42:58,765 --> 00:42:59,390 to the problem.
899 00:42:59,390 --> 00:43:02,219 And this seems to be working us closer and closer 900 00:43:02,219 --> 00:43:04,010 towards the actual minimum in the function. 901 00:43:04,010 --> 00:43:06,410 So the quadratic approximation seems good. 902 00:43:06,410 --> 00:43:08,270 If the quadratic approximation is good, 903 00:43:08,270 --> 00:43:11,440 which method should we choose? 904 00:43:11,440 --> 00:43:12,440 STUDENT: Newton-Raphson. 905 00:43:12,440 --> 00:43:14,840 JAMES SWAN: Newton-Raphson, right. 906 00:43:14,840 --> 00:43:16,970 Could also be the case though that we 907 00:43:16,970 --> 00:43:20,020 make this quadratic approximation 908 00:43:20,020 --> 00:43:22,510 from our current iteration, and we 909 00:43:22,510 --> 00:43:26,630 find a minimum that somehow oversteps the minimum here. 910 00:43:26,630 --> 00:43:29,980 In fact, if we look at the value of our objective function 911 00:43:29,980 --> 00:43:32,230 at this next step, it's higher than the value 912 00:43:32,230 --> 00:43:34,350 of the objective function where we started. 913 00:43:34,350 --> 00:43:36,980 So it seems like a quadratic approximation is not so good, 914 00:43:36,980 --> 00:43:37,640 right. 915 00:43:37,640 --> 00:43:40,521 That's a clear indication that this quadratic approximation 916 00:43:40,521 --> 00:43:41,020 isn't right. 917 00:43:41,020 --> 00:43:44,140 Because it suggested that we should have had a minimum here, 918 00:43:44,140 --> 00:43:45,100 right. 919 00:43:45,100 --> 00:43:47,456 But our function got bigger instead. 920 00:43:47,456 --> 00:43:49,580 And so in this case, it doesn't seem 921 00:43:49,580 --> 00:43:52,700 like you'd want to choose Newton-Raphson 922 00:43:52,700 --> 00:43:54,250 to take your steps. 923 00:43:54,250 --> 00:43:56,150 The quadratic approximation is not so good. 924 00:43:56,150 --> 00:44:00,640 Maybe just simple steepest descent is a better choice. 925 00:44:00,640 --> 00:44:01,510 OK, so here's what's done. 926 00:44:05,040 --> 00:44:08,540 So if you're at a point, you might draw a circle 927 00:44:08,540 --> 00:44:11,510 around that point with some prescribed radius. 928 00:44:11,510 --> 00:44:13,430 Call that Ri. 929 00:44:13,430 --> 00:44:15,770 This is our iterate Xi. 930 00:44:15,770 --> 00:44:20,050 This is our trust-region radius Ri. 931 00:44:20,050 --> 00:44:25,800 And we might ask, where does our Newton-Raphson step go? 932 00:44:25,800 --> 00:44:29,520 And where does our steepest descent step take us? 933 00:44:29,520 --> 00:44:32,370 And then based on whether these steps carry us 934 00:44:32,370 --> 00:44:35,280 outside of our trust-region, we might decide 935 00:44:35,280 --> 00:44:38,050 to take one or the other. 936 00:44:38,050 --> 00:44:40,950 So if I set a particular size Ri, 937 00:44:40,950 --> 00:44:44,130 particular trust-region size Ri and the Newton-Raphson step 938 00:44:44,130 --> 00:44:48,310 goes outside of that, we might say well, 939 00:44:48,310 --> 00:44:50,710 I don't actually trust my quadratic approximation 940 00:44:50,710 --> 00:44:52,630 this far away from the starred point. 941 00:44:52,630 --> 00:44:56,860 So let's not take a step in that direction. 942 00:44:56,860 --> 00:45:00,620 Instead, let's move in a steepest descent direction. 943 00:45:00,620 --> 00:45:03,170 If my Newton-Raphson step is inside the trust-region, 944 00:45:03,170 --> 00:45:04,970 maybe I'll choose to take it, right.
945 00:45:04,970 --> 00:45:07,010 I trust the quadratic approximation 946 00:45:07,010 --> 00:45:10,660 within a distance Ri of my current iteration. 947 00:45:10,660 --> 00:45:13,360 Does that strategy make sense? 948 00:45:13,360 --> 00:45:15,980 So we're trying to pick between two different methods 949 00:45:15,980 --> 00:45:18,860 in order to give us more reliable convergence 950 00:45:18,860 --> 00:45:19,839 to a local minimum. 951 00:45:24,830 --> 00:45:26,840 So here's our Newton-Raphson step. 952 00:45:26,840 --> 00:45:29,660 It's minus the Hessian inverse times the gradient. 953 00:45:29,660 --> 00:45:31,400 Here's our steepest descent step. 954 00:45:31,400 --> 00:45:34,490 It's minus alpha times the gradient. 955 00:45:34,490 --> 00:45:37,430 And if the Newton-Raphson step is 956 00:45:37,430 --> 00:45:39,860 smaller than the trust-region radius, 957 00:45:39,860 --> 00:45:45,520 and the value of the objective function at Xi, 958 00:45:45,520 --> 00:45:47,780 plus the Newton-Raphson step is smaller 959 00:45:47,780 --> 00:45:49,780 than the current objective function, 960 00:45:49,780 --> 00:45:51,500 it seems like the quadratic approximation 961 00:45:51,500 --> 00:45:52,730 is a good one, right. 962 00:45:52,730 --> 00:45:56,469 I'm within the region in which I trust this approximation, 963 00:45:56,469 --> 00:45:58,260 and I've reduced the value of the function. 964 00:45:58,260 --> 00:45:59,690 So why not go that way, right? 965 00:45:59,690 --> 00:46:02,660 So take the Newton-Raphson step. 966 00:46:02,660 --> 00:46:09,022 Else, let's try taking a step in the steepest descent direction instead. 967 00:46:09,022 --> 00:46:11,230 So again, if the steepest descent step is smaller 968 00:46:11,230 --> 00:46:14,500 than Ri and the value of the function 969 00:46:14,500 --> 00:46:15,950 in the steepest descent direction, 970 00:46:15,950 --> 00:46:19,399 the optimal steepest descent direction or the optimal step 971 00:46:19,399 --> 00:46:21,190 in the steepest descent direction is smaller 972 00:46:21,190 --> 00:46:23,440 than the value of the function at the current point, 973 00:46:23,440 --> 00:46:25,201 seems like we should take that step. 974 00:46:25,201 --> 00:46:25,700 Right? 975 00:46:25,700 --> 00:46:27,410 The Newton-Raphson step was no good. 976 00:46:27,410 --> 00:46:29,680 We've already discarded it. 977 00:46:29,680 --> 00:46:31,490 But our optimized steepest descent step 978 00:46:31,490 --> 00:46:32,380 seems like an OK one. 979 00:46:32,380 --> 00:46:34,259 It reduces the value of the function. 980 00:46:34,259 --> 00:46:35,800 And it's within the trust-region where 981 00:46:35,800 --> 00:46:37,870 we think quadratic approximations are valid. 982 00:46:41,160 --> 00:46:44,550 If that's not true, if the steepest descent step takes us 983 00:46:44,550 --> 00:46:46,770 outside of our trust-region or we 984 00:46:46,770 --> 00:46:48,750 don't reduce the value of the function 985 00:46:48,750 --> 00:46:51,270 when we take that step, then the next best strategy 986 00:46:51,270 --> 00:46:52,830 is to just take a steepest descent 987 00:46:52,830 --> 00:46:55,430 step to the edge of the trust-region boundary. 988 00:46:55,430 --> 00:46:55,930 Yeah? 989 00:46:55,930 --> 00:46:58,335 STUDENT: Is there a reason here that Newton-Raphson 990 00:46:58,335 --> 00:47:00,260 is the default? 991 00:47:00,260 --> 00:47:01,530 JAMES SWAN: Oh, good question.
992 00:47:01,530 --> 00:47:04,320 So eventually we're going to get close enough to the solution, 993 00:47:04,320 --> 00:47:07,140 all right, that all these steps are going to live 994 00:47:07,140 --> 00:47:09,566 inside the trust-region ring. 995 00:47:09,566 --> 00:47:11,107 It's going to require very small steps 996 00:47:11,107 --> 00:47:12,630 to converge to the solution. 997 00:47:12,630 --> 00:47:16,060 And which of these two methods is going to converge faster? 998 00:47:16,060 --> 00:47:17,060 STUDENT: Newton-Raphson. 999 00:47:17,060 --> 00:47:18,184 JAMES SWAN: Newton-Raphson. 1000 00:47:18,184 --> 00:47:21,485 So we prioritize Newton-Raphson over steepest descent. 1001 00:47:21,485 --> 00:47:22,485 That's a great question. 1002 00:47:25,620 --> 00:47:27,690 It's the faster converging one, but 1003 00:47:27,690 --> 00:47:30,330 it's a little unwieldy, right. 1004 00:47:30,330 --> 00:47:34,740 So let's take it when it seems valid. 1005 00:47:34,740 --> 00:47:37,200 But when it requires steps that are too big 1006 00:47:37,200 --> 00:47:40,180 or steps that don't minimize f, let's take some different steps 1007 00:47:40,180 --> 00:47:40,680 instead. 1008 00:47:40,680 --> 00:47:42,730 Let's use steepest descent as the strategy. 1009 00:47:42,730 --> 00:47:45,516 So this is heuristic. 1010 00:47:45,516 --> 00:47:48,140 So you got to have some rules to go with this heuristic, right. 1011 00:47:48,140 --> 00:47:50,300 We have a set of conditions under which we're going 1012 00:47:50,300 --> 00:47:51,383 to choose different steps. 1013 00:47:54,300 --> 00:47:57,540 We've got to set this trust-region size. 1014 00:47:57,540 --> 00:47:59,060 This Ri has to be set. 1015 00:47:59,060 --> 00:48:00,720 How big is it going to be? 1016 00:48:00,720 --> 00:48:01,290 I don't know. 1017 00:48:01,290 --> 00:48:03,780 You don't know, right, from the start you can't guess 1018 00:48:03,780 --> 00:48:05,580 how big Ri is going to be. 1019 00:48:05,580 --> 00:48:07,320 So you got to pick some initial guess. 1020 00:48:07,320 --> 00:48:10,090 And then we've got to modify the size of the trust-region 1021 00:48:10,090 --> 00:48:11,421 too, right. 1022 00:48:11,421 --> 00:48:13,920 The size of the trust-region is not going to be appropriate. 1023 00:48:13,920 --> 00:48:16,378 One fixed size is not going to be appropriate all the time. 1024 00:48:16,378 --> 00:48:20,020 Instead, we want a strategy for changing its size. 1025 00:48:20,020 --> 00:48:22,330 So it should grow or shrink depending on which steps we 1026 00:48:22,330 --> 00:48:23,470 choose, right. 1027 00:48:23,470 --> 00:48:29,350 Like if we take the Newton-Raphson step 1028 00:48:29,350 --> 00:48:33,040 and we find that our quadratic approximation is 1029 00:48:33,040 --> 00:48:35,830 a little bit bigger than the function value 1030 00:48:35,830 --> 00:48:40,240 we actually found, we might want to grow the trust-region. 1031 00:48:40,240 --> 00:48:42,550 We might be more likely to believe 1032 00:48:42,550 --> 00:48:45,130 that these Newton-Raphson steps are getting us 1033 00:48:45,130 --> 00:48:47,366 to smaller and smaller function values, right. 1034 00:48:47,366 --> 00:48:49,490 The step was even better than we expected it to be. 1035 00:48:49,490 --> 00:48:51,040 Here's the quadratic approximation 1036 00:48:51,040 --> 00:48:52,600 in the Newton-Raphson direction. 1037 00:48:52,600 --> 00:48:55,030 And it was actually bigger than the actual value 1038 00:48:55,030 --> 00:48:55,700 of the function.
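The step-selection heuristic described above can be sketched like this, again in Python/NumPy with illustrative names (a rough sketch of the general idea, not MATLAB's actual implementation): prefer the Newton-Raphson step when it stays inside the trust region and lowers f, fall back to the optimal steepest-descent step, and otherwise just walk downhill to the trust-region boundary.

    import numpy as np

    def trust_region_step(f, g, H, x, R):
        # Choose one step from the point x, given gradient g, Hessian H,
        # and trust-region radius R.
        p_nr = np.linalg.solve(H, -g)              # Newton-Raphson step
        if np.linalg.norm(p_nr) <= R and f(x + p_nr) < f(x):
            return p_nr                            # trusted, and it lowers f

        alpha = (g @ g) / (g @ H @ g)              # optimal step size along -g
        p_sd = -alpha * g                          # optimal steepest-descent step
        if np.linalg.norm(p_sd) <= R and f(x + p_sd) < f(x):
            return p_sd                            # fall back to steepest descent

        return -R * g / np.linalg.norm(g)          # last resort: descend to the boundary

Newton-Raphson is checked first for the reason given a moment ago: near the solution both candidate steps fit inside the trust region, and the Newton-Raphson one converges faster.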
1039 00:48:55,700 --> 00:48:58,124 So we got more, you know, we got more than we expected out 1040 00:48:58,124 --> 00:48:59,290 of a step in that direction. 1041 00:48:59,290 --> 00:49:04,200 So why not loosen up, accept more Newton-Raphson steps? 1042 00:49:04,200 --> 00:49:06,340 OK, that's a strategy we can take. 1043 00:49:06,340 --> 00:49:09,754 Otherwise, we might think about shrinking instead, right. 1044 00:49:09,754 --> 00:49:11,170 So there could be the circumstance 1045 00:49:11,170 --> 00:49:14,230 where our quadratic approximation predicted 1046 00:49:14,230 --> 00:49:18,980 a smaller value for the function than we actually found. 1047 00:49:18,980 --> 00:49:21,590 It's not quite as reliable for getting us to the minimum. 1048 00:49:21,590 --> 00:49:24,720 These two circumstances are actually these. 1049 00:49:24,720 --> 00:49:26,980 So this one, the quadratic approximation 1050 00:49:26,980 --> 00:49:31,990 predicted a slightly bigger value than we found. 1051 00:49:31,990 --> 00:49:33,400 Say grow the trust-region, right. 1052 00:49:33,400 --> 00:49:35,140 Try some more Newton-Raphson steps. 1053 00:49:35,140 --> 00:49:38,440 Seems like the Newton-Raphson steps are pretty reliable here. 1054 00:49:38,440 --> 00:49:42,940 Here the value of the function in the quadratic approximation 1055 00:49:42,940 --> 00:49:44,710 is smaller than the value of the function 1056 00:49:44,710 --> 00:49:46,630 after we took the step. 1057 00:49:46,630 --> 00:49:49,690 Seems like our trust-region is probably too big 1058 00:49:49,690 --> 00:49:51,370 if we have a circumstance like that. 1059 00:49:51,370 --> 00:49:52,690 Should shrink it a little bit, right? 1060 00:49:52,690 --> 00:49:54,280 We took the Newton-Raphson step, but it actually 1061 00:49:54,280 --> 00:49:55,966 did worse than we expected it to do 1062 00:49:55,966 --> 00:49:57,340 with the quadratic approximation. 1063 00:49:57,340 --> 00:50:00,066 So maybe we ought to shrink the trust-region a little bit. 1064 00:50:03,585 --> 00:50:04,960 And you need a good initial value 1065 00:50:04,960 --> 00:50:06,670 for the trust-region radius. 1066 00:50:06,670 --> 00:50:07,580 What does Matlab use? 1067 00:50:07,580 --> 00:50:08,780 It uses 1. 1068 00:50:08,780 --> 00:50:09,280 OK. 1069 00:50:09,280 --> 00:50:10,430 It doesn't know. 1070 00:50:10,430 --> 00:50:12,110 It has no clue. 1071 00:50:12,110 --> 00:50:13,420 It's just a heuristic. 1072 00:50:13,420 --> 00:50:16,480 It starts with 1 and it changes it as need be. 1073 00:50:16,480 --> 00:50:20,920 So this is how fsolve solves systems of nonlinear equations. 1074 00:50:20,920 --> 00:50:24,550 This is how all of the minimizers in Matlab, this 1075 00:50:24,550 --> 00:50:27,400 is the strategy they use to try to find minima. 1076 00:50:27,400 --> 00:50:30,970 They use these sorts of trust-region methods. 1077 00:50:30,970 --> 00:50:34,680 Matlab uses a slight improvement, which is also heuristic, 1078 00:50:34,680 --> 00:50:38,830 called a dogleg trust-region method. 1079 00:50:38,830 --> 00:50:41,950 So you can take a Newton-Raphson step 1080 00:50:41,950 --> 00:50:44,660 or you can take a steepest descent step. 1081 00:50:44,660 --> 00:50:47,620 And if you found the steepest descent step didn't quite 1082 00:50:47,620 --> 00:50:49,570 get you to the boundary of your trust-region, 1083 00:50:49,570 --> 00:50:52,030 you could then step in the Newton-Raphson direction. 1084 00:50:52,030 --> 00:50:52,997 Why do you do that?
1085 00:50:52,997 --> 00:50:55,330 I don't know, people have found that it's useful, right. 1086 00:50:55,330 --> 00:50:57,400 There's actually no good reason to take 1087 00:50:57,400 --> 00:50:59,550 these sorts of dogleg steps. 1088 00:50:59,550 --> 00:51:03,220 People found that for general, right, general objective 1089 00:51:03,220 --> 00:51:06,010 functions that you might want to find minima of, 1090 00:51:06,010 --> 00:51:09,220 this is a reliable strategy for getting there. 1091 00:51:09,220 --> 00:51:12,330 There's no guarantee that this is the best strategy. 1092 00:51:12,330 --> 00:51:14,760 These are general non-convex functions. 1093 00:51:14,760 --> 00:51:18,090 These are just hard problems that one encounters. 1094 00:51:18,090 --> 00:51:20,090 So when you make a software package like Matlab, 1095 00:51:20,090 --> 00:51:20,923 this is what you do. 1096 00:51:20,923 --> 00:51:25,192 You come up with heuristics that work most of the time. 1097 00:51:25,192 --> 00:51:27,150 I'll just provide you with an example here, OK. 1098 00:51:27,150 --> 00:51:29,108 So you've seen this function now several times. 1099 00:51:31,720 --> 00:51:33,720 Let's see, so in red covered up back here 1100 00:51:33,720 --> 00:51:35,940 is the Newton-Raphson path. 1101 00:51:35,940 --> 00:51:38,680 In blue is the optimal steepest descent path. 1102 00:51:38,680 --> 00:51:41,130 And in purple is the trust-region method 1103 00:51:41,130 --> 00:51:43,260 that Matlab uses to find the minima. 1104 00:51:43,260 --> 00:51:44,919 They all start from the same place. 1105 00:51:44,919 --> 00:51:46,710 And you can see the purple path is a little 1106 00:51:46,710 --> 00:51:48,010 different from these two. 1107 00:51:48,010 --> 00:51:52,260 If I zoom in right up here, what you'll see 1108 00:51:52,260 --> 00:51:55,520 is initially Matlab chose to follow the steepest descent 1109 00:51:55,520 --> 00:51:57,090 path. 1110 00:51:57,090 --> 00:52:00,690 And then at a certain point it decided, because of the value 1111 00:52:00,690 --> 00:52:03,150 of the trust-region that Newton-Raphson steps were 1112 00:52:03,150 --> 00:52:04,520 to be preferred. 1113 00:52:04,520 --> 00:52:06,005 And so it changed direction and it 1114 00:52:06,005 --> 00:52:08,130 started stepping along the Newton-Raphson direction 1115 00:52:08,130 --> 00:52:08,630 instead. 1116 00:52:13,060 --> 00:52:15,257 It has some built in logic that tells it 1117 00:52:15,257 --> 00:52:16,840 when to make that choice for switching 1118 00:52:16,840 --> 00:52:18,850 based on the size of the trust-region. 1119 00:52:18,850 --> 00:52:23,500 And the idea is just to choose the best sorts of steps 1120 00:52:23,500 --> 00:52:24,880 possible. 1121 00:52:24,880 --> 00:52:27,130 Your best guess at what the right steps are. 1122 00:52:27,130 --> 00:52:29,650 And this is all based around how trustworthy we 1123 00:52:29,650 --> 00:52:33,010 think this quadratic approximation 1124 00:52:33,010 --> 00:52:34,410 for objective function is. 1125 00:52:34,410 --> 00:52:35,505 Yeah, Dan? 1126 00:52:35,505 --> 00:52:37,880 STUDENT: So for the trust-region on the graph what Matlab 1127 00:52:37,880 --> 00:52:40,901 is doing is at each R trust-region 1128 00:52:40,901 --> 00:52:43,770 length it's reevaluating which way it should go? 1129 00:52:43,770 --> 00:52:45,080 JAMES SWAN: Yes. 1130 00:52:45,080 --> 00:52:45,790 Yes. 1131 00:52:45,790 --> 00:52:47,410 It's computing both sets of steps, 1132 00:52:47,410 --> 00:52:50,674 and it's deciding which one it should take, right. 
1133 00:52:50,674 --> 00:52:51,340 It doesn't know. 1134 00:52:51,340 --> 00:52:53,737 It's trying to choose between them. 1135 00:52:53,737 --> 00:52:55,778 STUDENT: Why don't you do the Newton-Raphson step 1136 00:52:55,778 --> 00:52:58,002 through the [? negative R? ?] 1137 00:52:58,002 --> 00:53:00,210 JAMES SWAN: You can do that as well, actually, right. 1138 00:53:00,210 --> 00:53:02,220 But if you're doing that, now you 1139 00:53:02,220 --> 00:53:03,870 have to choose between that strategy 1140 00:53:03,870 --> 00:53:09,000 and taking a steepest descent step up to R as well, right. 1141 00:53:09,000 --> 00:53:11,850 And I think one has to decide which would you prefer. 1142 00:53:11,850 --> 00:53:14,430 It's possible the Newton-Raphson step also doesn't actually 1143 00:53:14,430 --> 00:53:15,720 reduce f. 1144 00:53:15,720 --> 00:53:19,050 In which case, you should discard it entirely, right. 1145 00:53:19,050 --> 00:53:21,690 But you could craft a strategy that does that, right. 1146 00:53:21,690 --> 00:53:23,920 It's still going to converge, likely. 1147 00:53:23,920 --> 00:53:24,670 OK? 1148 00:53:24,670 --> 00:53:25,340 OK. 1149 00:53:25,340 --> 00:53:26,190 I'm going to let you guys go, there's 1150 00:53:26,190 --> 00:53:27,220 another class coming in. 1151 00:53:27,220 --> 00:53:28,770 Thanks.