The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: I'm going to review what we did with the Kraft inequality just a little bit, because evidently a number of people were confused about this. I'm going to put a little more notation in with it. For some people, notation helps. For other people, it hinders things. But after you've thought about it a little bit, a little more notation can certainly be helpful.

What we're trying to do with this Kraft inequality is, we're thinking of a set of symbols where, supposing that there's a codeword for each symbol, c of x is the codeword for symbol x, which is a string of binary digits, y1 up to y sub n. In the world of two-toed sloths, the representation of numbers that sloths use is binary, base 2. And therefore the number associated with some sequence of bits like this would be the sum of y sub i times 2 to the minus i. In other words, it's the same thing as a decimal, except it's in a world where people have two fingers instead of ten.

There's an interval associated with this number, also. And the interval is what you would get if you took an arbitrary real number and rounded it down -- well, in this case to n binary digits. So one side of that interval, then, is the number itself. And the other side of the interval is that number itself -- the two-toed sloth didn't like what I was going to say about it, and it changed my slide -- plus that extra factor, the 2 to the minus n there. In other words, any number in that range, if you round it down to n significant binary digits, is going to be this number here. Well, the point of this is, if the number, namely this base 2 expansion of a number y prime, is in this interval, then y is going to be a prefix of y prime.
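To make the number-and-interval picture concrete, here is a minimal Python sketch (the helper names are made up for illustration):

    from fractions import Fraction

    def value(bits):
        # r(y) = sum of y_i * 2^-i for a binary string y = y1 ... yn
        return sum(Fraction(int(b), 2 ** (i + 1)) for i, b in enumerate(bits))

    def interval(bits):
        # the half-open interval [r(y), r(y) + 2^-n) covered by an n-bit string
        r = value(bits)
        return r, r + Fraction(1, 2 ** len(bits))

    def is_prefix(y, y_prime):
        # y is a prefix of y_prime exactly when r(y_prime) lands in y's interval
        lo, hi = interval(y)
        return lo <= value(y_prime) < hi

    print(value('101'))              # 5/8, namely 1/2 plus 1/8
    print(interval('101'))           # [5/8, 3/4)
    print(is_prefix('101', '1011'))  # True: 5/8 + 1/16 is in [5/8, 3/4)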
Let me just give you some examples of that, because saying it in words is confusing; the idea is very simple. Suppose you have a binary string, 011. That corresponds to the number 3/8, namely 1/4 plus 1/8. And the interval there is going to be the interval -- the sloth really is hitting hard this morning; maybe I'm the sloth; OK, there we go -- from 3/8, including 3/8, up to 1/2, but not including 1/2. Namely, 1/2 will be represented as just 1, nothing more than that. Or it could be 10 or 100, and so forth.

So then, as an example, 011 is going to be a prefix of all of these quantities here. It's a prefix of 0111, because 0111 is 3/8 plus 1/16, which is in that interval we're talking about. 0110 is a more interesting case. Because 0110 is itself -- the number associated with it is just 3/8. But what this is saying is, if you take the same number but expand it to four digits, according to what this says, r of y prime is then in this interval. And therefore this prefix situation holds. So this is a prefix of this.

Why do I make it so complicated? I make it so complicated because I want to talk about the length of that interval. And the length of these intervals is denoted in this diagram here. Any time I have a number expressed to n binary digits, it covers an interval of 2 to the minus n. And because of this prefix property, as soon as one number covers an interval, no other number can have its base in that interval. In other words, all of these intervals have to be disjoint, exactly as is indicated here. So when you add up the size of all these intervals, you have to get something less than or equal to 1. And that's the proof of the Kraft inequality.

So, let's go on and talk about discrete source probabilities. Now the two-toed sloth has gotten his machine out there; he's really mad at me this morning.
If we try to model English text, you know that some letters are far more probable than others. Namely, if you take an enormous amount of English text and you measure the relative frequency with which each letter occurs, you'll get more or less stable relative frequencies if you take enough text. And these letters are far more probable than these letters. So that gives you part of a model.

You can also say that successive letters are going to be very dependent. Namely, t is very often followed by h, and h is often preceded by t. Q-u is even more of a case here, because as far as English language words are concerned, u always follows q. Some letter strings are words. Other letter strings are not words. There are constraints on grammar. And what is really the clincher, which says there's no way you're going to model English in any sensible, nice way, is meaning. And even worse than that, depending on who writes the English, it might have meaning or it might not have meaning. And for those English texts that don't have any meaning, the entropy is going to depend very much on whether the meaningless text is written by a salesperson or it's written by James Joyce. And in one case you have -- well, in one case you have an enormous amount of freedom in what this sequence of letters is. And in the other case you have letters which are very, very constrained.

So what's the point of this? The point of this is, if you're interested in trying to find a source coding method for English, what you don't want to do is to start out trying to get the best statistical model of English that you can. Because it's a losing proposition. And by trying to do that, you'll spend all your time trying to get the model. And you won't get any insight into what you ought to do as far as source coding is concerned. This is pretty much true throughout all of technology. You don't solve problems by first getting too far into the details of what the problem is, before you start thinking about structures of possible solutions.
In other words, we always deal with technological problems by dealing with toy problems first. Now, there's a difference in how engineers and theoreticians worry about toy problems. Because engineers, if they hate theory, usually don't say what the toy problem is. But they have that toy problem very firmly in the back of their mind, because of all their experience. Theoreticians, on the other hand, make their models very, very explicit. They don't often like to say that they're toy models, because DARPA doesn't tend to support things that are working on toy models. So they try to conceal this. And often they just hide the fact that they're using a model. So, all of this becomes very complicated.

But, for you, if you're trying to do either engineering or mathematics or teaching, or just be a sensible person: when you're dealing with these problems, be explicit about what your models are. Try to understand the toy problems before you understand the more complicated problems. If you take that away from this course, it'll be a worthwhile course for you. And, of course, you won't understand it until you get lots more experience. But believe me, that's the way it is.

OK, so that's the whole point of this. You want to start with simple toy models. And I'm not just justifying the fact that we're going to study this incredibly simple model here, which is a toy model. But by studying this you will see that everything else follows. If you read Claude Shannon's work on information theory, this, in fact, is where he started. He started with a beautiful description of trying to model the English language. Finally he wound up talking about this. The conclusions he drew from studying these discrete memoryless sources led to his general theorems about data compression on sources. They led to his general theorems about the capacity of channels.
They led to the idea that you want to separate source coding from channel coding. And finally, they led to all of the modern ideas that we have about quantization. In other words, the simple ideas you get out of this generalize directly to everything that's known about information theory.

So. Enough philosophy; let's get on with the business of what we're trying to do. A discrete memoryless source has the following properties. The source output has to be an unending sequence, x1, x2, x3, blah, blah, blah, of random letters drawn from a finite alphabet, capital X. In other words, we are taking these real sources, and we're saying, let's make them not real now. Let's put a probability measure on them. And in this probability measure, one of the things that the probability measure will do will be to describe the probability on each one of these letters and the sequence in which it's coming out of the source.

Each source output, x1, x2, blah, blah, blah, is selected from a common alphabet. Namely, if you're using English on one letter of the sequence, you're going to use English on every letter of the sequence. We're going to use a common probability measure, with some probability mass function p sub X of x. This notation means this is the probability mass function for the chance variable X. A chance variable is like a random variable, except the objects are not necessarily numbers. The objects can be anything. So a chance variable is a generalization of a random variable. So this probability mass function talks about the probability of each of the symbols in this alphabet X.

Then the final thing is, each source output, x sub k, is statistically independent of all other source outputs, x1 up to x k minus 1, and x k plus 1 on to forever. This is a nice example, because if you're going to specify a source probabilistically, you have to somehow find a way of explaining what the probability of every possible event within this source is. This is an easy way of doing it. You say they're independent.
And then you can find the probability of anything you want to find. So that's a generic way of putting probability measures on things.

So then, we want to go into the idea of prefix-free codes for these discrete memoryless sources. We've already talked about prefix-free codes. We talked about the Kraft inequality. You might have thought it was a little bit strange talking about this strictly combinatorial property of codes without talking at all about the probabilities, which are the things that led us into talking about these codes in the first place. Namely, we want to use unequal length, variable length codes because of the fact that some letters are more likely than other letters. And eventually we'll be using them because of all these constraints between different words.

So, for notation, let little l of x be the length of the codeword for letter x in the alphabet capital X. OK, so that's the same as this y1, y2, up to y sub n -- some string of binary symbols. Capital L of X is a random variable, where capital L of X is equal to little l of x when capital X is equal to little x. Now, what the heck does that mean? It's just notation. In other words, what we're starting out with is this ensemble of letters. We have a probability assignment on each letter in that alphabet. And then what we would like to talk about is a length function on those letters. So we have little l of x, which is defined for each x. We then want to talk about this as a random variable. Because when we choose some random letter, little x, out of this ensemble, capital X, l of x becomes a random variable.

We will always, in this course, use capital letters to talk about random variables. And we will always use little letters to talk about things which are not random variables -- excuse me, not random variables or chance variables. I think we can probably leave it open now. It seems as if the sloth has gone away.
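To pin the notation down, here is a minimal Python sketch (the alphabet, the probabilities, and the code are made-up examples, not from the lecture):

    import random

    # a made-up discrete memoryless source and a prefix-free code for it
    alphabet = ['a', 'b', 'c', 'd']
    p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
    codeword = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
    l = {x: len(codeword[x]) for x in alphabet}   # little l of x, a fixed function

    # drawing a random letter capital X makes capital L = l(X) a random variable
    X = random.choices(alphabet, weights=[p[x] for x in alphabet])[0]
    L = l[X]
    print(X, L)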
So then we want to talk about the expected value of the length. When you talk about the expected value of something, you're talking about the expected value of a random variable. We will also denote the expected value of this random variable L with a bar over it, which is the sum, over the letters in the alphabet, of p of x times l of x. So all this is what you would do anyway if you never thought about this. Until, at some point, when you're taking a quiz or something, you start to get confused and say, what is this stuff all about? I don't have any idea what this means, after you've written five pages of stuff. So it's worthwhile spending a little bit of time sorting that out.

So, L bar is the number of encoder output bits per source symbol -- in some strange sense. Namely, it's this expected value. Now, to finish this off, we want to look at the number of binary digits that come out of the encoder when a long sequence of letters comes out of the source. A long sequence of letters, x1, x2, x3, and so forth, comes out of the source. They go into the encoder. The encoder is mapping each letter that comes out of the source into this codeword c of x. So we have a sequence of codewords which are all concatenated together. And, therefore, the total number of binary digits which have come out of the encoder, corresponding to these n symbols that have come out of the source, is the sum of l of x1 plus l of x2 plus l of x3 plus l of x4, and so forth. So we have a sum of independent random variables.

Now, what do you know about sums of independent random variables? Well, the one thing you ought to know about, and which should be stamped on your brain because it's the central thing that makes any probabilistic theory make sense -- it's the only way that we can ever understand our environment. You look at the past. You try to figure out from the past what's going on in the future.
And the only way you can do that, the only tool you have, really, is this law of large numbers. Which says, when you see a long sequence of things, from that long sequence of things, you sort of figure out what's going on. If you're dealing with a random variable, the thing you do is add up all of these numbers. You divide by the total number of them that you have. And that gives you the expected value. It gives you a typical value. What the law of large numbers really says is, if you look at the sum of binary digits out of this encoder, over a very long period of time, and divide by the total number of symbols, that's a random variable again. And this random variable is, with high probability, going to be very, very close to this expected value, which is this quantity here. In other words, the ensemble average, which is this, is going to be very close to the time average. And the time average, now, is a random variable. And that's what the law of large numbers says.

You see, the problem that we all have, dealing with real world problems, is that there's nobody to tell us, this is what the ensemble is. Unless you believe somebody that doesn't know. And the only real evidence that you have is the actual sequence. And from the actual sequence, you then look at what happens for this particular sequence. You then build a model. And your model, by definition, has the expected value of L equal to the expected value in the model that you've chosen.

So. What's your objective? Your objective in trying to form a prefix-free code, then, is to find a set of integers, l of x, which satisfy the Kraft inequality and minimize L bar. In other words, what we're trying to do is, we're trying to choose a code which minimizes the expected length of the code. Which is really, over a long period of time, going to minimize the number of binary digits that come out of the source encoder.
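As a quick check on that picture, here is a sketch that encodes a long i.i.d. sequence and compares the time average to L bar (same made-up source and code as in the sketch above):

    import random

    alphabet = ['a', 'b', 'c', 'd']
    p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
    codeword = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

    # the ensemble average: L bar = sum over x of p(x) * l(x)
    L_bar = sum(p[x] * len(codeword[x]) for x in alphabet)

    # encode n source letters and look at the time average, in bits per symbol
    n = 100_000
    seq = random.choices(alphabet, weights=[p[x] for x in alphabet], k=n)
    total_bits = sum(len(codeword[x]) for x in seq)
    print(L_bar, total_bits / n)   # very close, by the law of large numbers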
What we want to do is to choose these integers to minimize this. So what we're going to do now is, suppose our alphabet is just 1, 2, up to capital M. What am I doing here? I'm saying, we don't care what these symbols are called, anyway. It's totally irrelevant what the names of the symbols are. So I will name them 1, 2, up to capital M. The probability mass function, then, I can denote as p sub 1 up to p sub capital M. In other words, I've gotten rid of all these x's that were lousing up our equations all along. Now, I'll denote the unknown lengths by l1 up to l sub M.

So the problem is, somebody gives you this set of numbers, p1 to p sub M, which is a PMF. In other words, these numbers add up to 1. And tells you, I want a prefix-free code which minimizes this expected length. Namely, the expected value corresponding to these lengths here. So, to minimize the expected length, what we want to do is to minimize, over the choice of l1 up to l sub M, subject to the Kraft inequality, this expected value. So we have a nice, clean mathematical problem now. We want to minimize this sum, subject to this constraint. And the constraint includes the fact that all of these things have to be integers.

Well, for those of you who have studied minimization, there's a funny thing in here. Because integer minimization problems tend to be very, very nasty. And, therefore, you look at this and you say, this is probably something I'm going to have trouble solving. Strangely enough, it isn't. But you would think it probably is something which will be hard to solve. So, since integers louse up minimization problems, what do we do? Well, we say, just for fun, let's try to solve this problem without the integer constraint on it. Let's see what that leads to, and see if we can do anything with that. So we say, OK, let's try to minimize this function here over the numbers l1 to l sub M, subject to this constraint. So we're minimizing this, subject to this constraint.
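Written out, the relaxed problem (with the lengths allowed to be real numbers for the moment) is:

    \min_{l_1, \ldots, l_M} \sum_{i=1}^{M} p_i \, l_i
    \quad \text{subject to} \quad
    \sum_{i=1}^{M} 2^{-l_i} \le 1 .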
Now, an easy way to do that -- yes?

AUDIENCE: Are you saying that the lengths are not fixed by the probabilities?

PROFESSOR: No, I still have these fixed probabilities. I still have p1 up to p sub M as known probabilities. But I'm going to say, let's suppose I can choose a length which is two point five bits instead of two bits.

AUDIENCE: You're saying the shortest length [UNINTELLIGIBLE]

PROFESSOR: Well, we're going to wind up there eventually. But for now, all I want to do is to look at this problem. If I start out by saying, assign the shortest lengths to the biggest probabilities, I have two problems. One is, it's a little hard to prove to you that I want to do that. Although we'll do that later today. And the other is, it doesn't really give you the general properties that we want to know about this. So, for those two reasons, I want to just attack this as a straightforward mathematical problem.

If you're a computer scientist, this looks strange. Because computer scientists like to attack problems by algorithms. Analog engineers like to attack problems by writing a complicated formula, and taking derivatives, and all sorts of things like that. We're going to be doing both of those things in this course. And you'll see that both of them lead to certain advantages. And here, we're taking the analog engineer's approach, which says: suppose these are a bunch of numbers. I want to minimize this function over a set of numbers, l1 up to l sub capital M.

So, how do I do that? Well, this guy Lagrange, he was a great mathematician. He was also a great mathematician early enough that he could do some really trivial things and become famous for them. Just like Kraft, who we were talking about before. But, unlike Kraft, Lagrange really did a lot of other very important things. And what Lagrange said was the following. Well, suppose I want to minimize this sum. And I want to have this constraint added in.
Sort of what I want to do, then, is to minimize a weighted sum of this, which is what I'm interested in, and this. In other words, if I minimize this weighted sum here of these two things, I'm going to wind up with some sort of value for this, and some sort of value for this. By changing lambda, then -- which stands for Lagrange; he was also clever in making himself famous that way -- I can change the balance between how important these two things are. And as I change the balance between how important they are, when I change it to just the right place, I'm going to have this constraint here satisfied with equality. So that's the whole idea of Lagrange minimization.

So we take this function. How do you now minimize a function of multiple variables? Well, again, it's a messy problem. But the first thing you can try to do is find a stationary point. So, let's always do the easy thing first. We take the partial derivative of this function here with respect to l sub i. That's what we're trying to minimize. And what we get is p sub i minus lambda times the natural log of 2, times 2 to the minus l sub i. I'm not very good at differentiation any more, so I only differentiate things which are easy. And that's easy. I want to find a stationary point, so I set this equal to 0.

That makes the problem worse, because now I have a function of lambda and also all of these l sub i's. But now I choose lambda so that I satisfy the constraint. Namely, I choose lambda to satisfy this equation here. When I choose lambda to satisfy this equation here, what I get is p sub i is equal to 2 to the minus l sub i, and therefore l sub i is equal to minus log p sub i. In other words, I have this equation here. What happens when I sum this equation over i? Let's look at it. We sum this over i. The sum of p sub i over i is 1, minus lambda times the natural log of 2 times the sum of 2 to the minus l sub i. And I want to make this sum of 2 to the minus l sub i equal to 1.
So what I get is that 1 is equal to this sum times lambda times the natural log of 2. So I hope that choosing lambda equal to 1 over the natural log of 2 is what I want to do. And when I do that, this becomes 1 here. And I just have the sum of 2 to the minus l sub i equal to 1. OK, good. And then, going back to this equation, p sub i is equal to 2 to the minus l sub i. OK, this is just arithmetic. I mean, if you don't follow what I'm doing, just look at it later and you'll find that there's nothing really there.

So we wind up with these lengths being equal to the negative of the binary logarithms of these probabilities. It's only a stationary point. We don't know whether it's a minimum yet. And, unfortunately, we also have the problem that they might not be integers. But, anyway, what we wind up with, then, if we ignore these problems for the time being, is that the lengths are going to be equal to this. The expected value of the lengths is then going to be equal to the sum, over i, of minus p sub i times the logarithm of p sub i.

When Shannon saw this, and when various other people saw it, they said, gee, this looks like the entropy of statistical mechanics. So let's call this quantity entropy. For no better reason than that. And it would probably have been far better if they had called it something else. Because for years, there were physicists and philosophers trying to figure out what the deep relationship was between statistical mechanical entropy and information theoretic entropy. And there probably is such a relationship, but the relationship is far more complicated than understanding information theory. And it's far more complicated than understanding statistical mechanics. So I advise you not to worry about that one until after you understand what it means in an information theoretic sense. So H of X is what we call the entropy of the random variable X. And it really is the entropy associated with these logarithms of p sub i.
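Here is a small sketch of that conclusion (a made-up dyadic pmf, chosen so that the optimal lengths happen to come out as integers):

    import math

    p = [0.5, 0.25, 0.125, 0.125]          # a made-up probability mass function

    # the stationary-point solution: l_i = -log2 p_i
    l_opt = [-math.log2(pi) for pi in p]   # [1.0, 2.0, 3.0, 3.0]

    # the resulting expected length is the entropy, H = -sum of p_i log2 p_i
    H = -sum(pi * math.log2(pi) for pi in p)
    L_bar = sum(pi * li for pi, li in zip(p, l_opt))
    print(l_opt, H, L_bar)                 # H and L bar both equal 1.75 here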
So when you take functions of a random variable -- a random variable carries along a lot of baggage with it, including the probabilities of everything. And when you take the expected value of a random variable, the individual values of the sample points of that random variable are important, and the probabilities are important. Here we have something even stranger. Because it's only the probabilities that have anything to do with it. And this makes sense. We already said that these symbols have nothing to do with this problem we're dealing with. You can call the symbols whatever you want to call them. And, therefore, the only thing of any interest to us is these probabilities that we're dealing with. So H of X is a function only of these probabilities. It's the expected value of minus log p sub i.

This is called entropy, and in fact we will find out very shortly that it really is the minimum number of bits per source symbol needed to represent the source. In other words, when we generalize the problem from just plain ordinary garden-variety prefix-free codes, we will find that this number is really what characterizes the whole problem for discrete memoryless sources. So let's go on and say more about that.

Let's say something about bounds on the entropy. First, what's the relationship between the entropy and this minimum of the expected length that we started to talk about? And I claim that H of X is less than or equal to L min, which is less than the entropy plus 1. And why is that? We already have the machinery to see this. We almost have the machinery to see this. Namely, we have solved this minimization problem. We've only found a stationary point, and we've ignored the fact that we have an integer constraint. So, if you allow me to say for the time being that, in fact, when we solve the problem without worrying about integers, it actually gives me a minimum, then in fact this follows very easily.
Because what I'm going to do is to find those optimal lengths, which are non-integers. And then I can satisfy the prefix condition and get a code by simply increasing each of those numbers to the next integer up. In other words, I can take the ceiling function of each of those real numbers to get an integer. When I take the ceiling function, 2 to the minus l sub i is going to go down. So the Kraft inequality is still satisfied. So the entropy has to be less than or equal to this average, which has to be less than H of X plus 1. And the average is equal to H of X if and only if each of these probabilities is an integer power of 2 to start with. In other words, the solution I came up with before is that the length I wanted should be equal to minus the logarithm, to base 2, of p sub i. So if p sub i is already a power of 2, then I'm home free. Because I just pick that length to be minus log of p sub i, and it happens to be an integer. And I don't have to round it up.

So if I let l1 to lM be these codeword lengths -- well, here's where I'm going to prove this to you. And the proof is the following. I want to prove that H of X is less than or equal to L min, which I'll just call here L bar. So H of X minus L bar is equal to -- this is the entropy, and this is the expected length here. I can rewrite this as the sum of p sub i times the logarithm of 2 to the minus l sub i divided by p sub i. That's just arithmetic. l sub i is equal to the logarithm of 2 to the l sub i, which is equal to minus the logarithm of 2 to the minus l sub i. So I get this.

There's an inequality. I hate to call it an inequality, it's so trivial. Here's the point 1. If you plot the natural log of x, and if you compare it with the function x minus 1, you can see that the natural log of x is less than or equal to x minus 1. Now, this is an inequality which happens to be very useful in information theory.
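Written out, that inequality is:

    \ln u \le u - 1 \quad \text{for all } u > 0,
    \text{ with equality if and only if } u = 1 .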
I would claim that any inequality that you can prove in information theory, by any means at all, I can prove using this inequality and nothing else. And I've believed that for 50 years and nobody's proven me wrong yet. And also, this is something you can draw and remember. So it's simple.

So the idea, then, is that this logarithm of 2 to the minus l sub i over p sub i is a logarithm to the base 2, which is just a constant times the natural logarithm. You just -- here we go. For any u greater than 0, the natural log of u is less than or equal to u minus 1. So the logarithm to the base 2 of u is less than or equal to the logarithm to the base 2 of e, which is some number, times u minus 1. With equality at u equals 1. So this is less than or equal to this. And look how nice that is. The p sub i's cancel out, and you get that the sum over i of 2 to the minus l sub i, minus p sub i, is less than or equal to 0. And equality occurs if, and only if, p sub i is equal to 2 to the minus l sub i. OK? So that's all there is to it. And that establishes -- well, establishes part of this theorem here. And the other part we already established. And if you don't believe me, the notes do it more carefully.

Well, this left a serious problem unsolved. Which is, how do you actually solve this integer minimization problem? How do you solve it if you have a big, long, complicated source with lots of probabilities in it? And everybody thought it was hopeless. Even Shannon thought it was hopeless. And Shannon sort of figured out ways to approach this problem. He said, well, you want to have about half the probability starting with 1, and about half the probability starting with 0. So he would divide up the symbols in the alphabet, so he could come as close as possible to half of them being up here and half of them coming down here.
And he would continue to do that. I mean, I don't usually like to write on the blackboard, but he would start out generating a code like this. And this would be approximately 1/2. This is approximately 1/2. And then he would take these symbols and split them, again, in probability. And everybody was starting the problem over here, and trying to generate a code working their way out.

Well, Dave Huffman was a graduate student at the time. And he took Bob Fano's graduate course in information theory, I think a year or so later than Kraft did. And Bob Fano assigned as a homework problem: how do you solve this problem? Sneaky guy. And he was very amazed when Dave Huffman came in the next day and said, oh, it's easy, you do it this way. So the question is, how did he do it?

Well, Huffman, instead of looking at the problem from here out, looked at the problem from here in. He was -- I mean, this was before there was anything called computer science. But he thought like a computer scientist does. In other words, he thought algorithmically. And he also thought in terms of discrete problems. And therefore, he looked for properties that these optimum codes should have. And it was neat.

So, he started out with a lemma. He said, an optimal code has to have the property that if p sub i is greater than p sub j, then the optimal length associated with p sub i, namely the optimal length of the i'th codeword, has to be less than or equal to the length of the j'th codeword. And you can see this by saying, well, suppose that's not true. Suppose that p sub i is greater than p sub j, and also l sub i is greater than l sub j. And then you say, OK, take this situation. We will interchange those two codewords in the code. And we'll look at what that does to the average. And if you work that through, you find out that what you've done is, you've shortened the codeword associated with the more probable symbol and lengthened the codeword associated with the less probable one.
You have changed the average length to make it smaller. Now, let me warn you about something. When you start looking at these properties, the most confusing thing is what happens when two probabilities are the same, or when two lengths are the same. And I would advise you to just ignore that problem until you get an idea of what's going on. Namely, assume that all lengths are different, all probabilities are different. And then it's easy to see what's going on. And when you get all done, go back and straighten out the cases where things are equal. And I think the notes do this carefully. If you read books on information theory, about half of them do it carefully, and about half of them don't. So you should be suspicious. But anyway, that's one of those trivialities that you just have to sort out for yourself.

OK. The next lemma is: optimal prefix-free codes are full. We talked about what a full code is. When you draw the binary tree for it, you don't have any leaves that are not associated with codewords. Because if you do, we showed you can shorten the codewords in the part of the tree on the other side of that leaf. In other words, if this is a codeword and this is not a codeword, then you just get rid of this and bring that back here. And if this is a whole tree stemming off here, you do the same thing. You take this whole tree and you bring it into there, and you throw this away. So, optimal prefix-free codes are full. So far there's nothing to this.

The next part of it is the sibling of a codeword. And what's a sibling? Well, we used to call it a brother. But then we couldn't do that, because we would have to call it a brother or sister. And that got too difficult. So people invented the word sibling, to talk about a brother or a sister. So the sibling of a codeword is the string formed by changing its last bit.
770 00:41:34,910 --> 00:41:37,410 In other words, in this family tree here, the 771 00:41:37,410 --> 00:41:38,980 sibling of this is this. 772 00:41:38,980 --> 00:41:41,300 The sibling of this is this. 773 00:41:41,300 --> 00:41:43,890 The sibling of this is this. 774 00:41:43,890 --> 00:41:49,690 So, a leaf can have a sibling which is an intermediate node, 775 00:41:49,690 --> 00:41:50,940 and vice versa. 776 00:41:54,520 --> 00:41:58,100 So then he said, the sibling of a codeword is a string 777 00:41:58,100 --> 00:42:00,930 formed by changing the last bit. 778 00:42:00,930 --> 00:42:04,720 I think he probably said the brother, but, anyway. 779 00:42:04,720 --> 00:42:09,650 For optimality, the sibling of each maximum length codeword 780 00:42:09,650 --> 00:42:12,200 is another codeword. 781 00:42:12,200 --> 00:42:14,520 Now, that's a really simple one. 782 00:42:14,520 --> 00:42:17,790 If I make this a codeword, and this is the maximal length 783 00:42:17,790 --> 00:42:22,380 codeword in this code I'm talking about, this can't be 784 00:42:22,380 --> 00:42:24,640 an intermediate node because then there would have to be 785 00:42:24,640 --> 00:42:27,240 longer codewords. 786 00:42:27,240 --> 00:42:30,030 And it can't be empty because these optimal 787 00:42:30,030 --> 00:42:31,720 codes are all full. 788 00:42:31,720 --> 00:42:34,530 And therefore, this has to have a sibling 789 00:42:34,530 --> 00:42:36,800 which is also a codeword. 790 00:42:36,800 --> 00:42:43,970 So the longest codewords have to have siblings. 791 00:42:43,970 --> 00:42:46,960 Well, that's easy enough. 792 00:42:46,960 --> 00:42:50,435 Incidentally, one of the problems that you have in 793 00:42:50,435 --> 00:42:53,250 proving this sort of thing is, what happens if you have 794 00:42:53,250 --> 00:42:56,350 zero-probability letters? 795 00:42:56,350 --> 00:42:59,200 Well, we just get rid of that problem and say, well, there 796 00:42:59,200 --> 00:43:01,860 aren't any zero-probability letters. 797 00:43:01,860 --> 00:43:04,760 Because if we want to come up with a sensible model for 798 00:43:04,760 --> 00:43:07,850 something, we're not going to create a codeword for 799 00:43:07,850 --> 00:43:09,490 something that can't happen. 800 00:43:09,490 --> 00:43:13,630 So, there are no zero-probability letters in 801 00:43:13,630 --> 00:43:15,620 this alphabet. 802 00:43:15,620 --> 00:43:17,810 I mean, if you want to put them in, it just complicates 803 00:43:17,810 --> 00:43:19,290 the whole thing. 804 00:43:19,290 --> 00:43:20,540 And you can do it. 805 00:43:23,910 --> 00:43:26,730 Then, finally, there's this lemma which says, there is an 806 00:43:26,730 --> 00:43:31,910 optimal prefix-free code in which, after you order the 807 00:43:31,910 --> 00:43:37,250 probabilities of all of the messages, namely you order p1 808 00:43:37,250 --> 00:43:40,000 to be greater than or equal to p2, and so on down to 809 00:43:40,000 --> 00:43:40,880 p sub m. 810 00:43:40,880 --> 00:43:45,380 In other words, we just rename the letters in the alphabet, 811 00:43:45,380 --> 00:43:49,780 so that letter m is less likely than letter m minus 1, 812 00:43:49,780 --> 00:43:50,760 and so forth. 813 00:43:50,760 --> 00:43:51,480 Back to 1. 814 00:43:51,480 --> 00:43:55,560 1 is the most probable, m is the least likely. 815 00:43:55,560 --> 00:43:58,010 Well, we've already concluded that we want to assign the 816 00:43:58,010 --> 00:44:02,910 longest codewords to the least probable symbols.
817 00:44:02,910 --> 00:44:06,860 And this says, take the two least probable symbols, and 818 00:44:06,860 --> 00:44:09,400 we can always make an optimal code in which those two 819 00:44:09,400 --> 00:44:11,660 codewords are siblings. 820 00:44:11,660 --> 00:44:14,280 And the reason for that is, one of them is not going to be 821 00:44:14,280 --> 00:44:19,180 longer than the other, or else you can shorten the code by 822 00:44:19,180 --> 00:44:21,110 interchanging things. 823 00:44:21,110 --> 00:44:24,640 So there is an optimal prefix-free code in which the 824 00:44:24,640 --> 00:44:27,520 codeword for m minus 1 825 00:44:27,520 --> 00:44:29,370 and the codeword for m are maximal 826 00:44:29,370 --> 00:44:33,640 length and they're siblings. 827 00:44:33,640 --> 00:44:37,360 So the Huffman algorithm first combines these two. 828 00:44:37,360 --> 00:44:41,100 And then looks at the reduced tree with m minus 1 nodes. 829 00:44:41,100 --> 00:44:42,900 Let me show you an example of that. 830 00:44:46,700 --> 00:44:47,990 So it starts out. 831 00:44:47,990 --> 00:44:50,570 Here, I've ordered the probabilities associated with 832 00:44:50,570 --> 00:44:51,410 a set of symbols. 833 00:44:51,410 --> 00:44:54,700 The symbols are 1, 2, 3, 4, 5. 834 00:44:54,700 --> 00:45:00,310 The two least likely messages are 0.1 and 0.15. 835 00:45:00,310 --> 00:45:02,260 Obviously, I could've interchanged these 836 00:45:02,260 --> 00:45:03,330 two if I want to. 837 00:45:03,330 --> 00:45:06,340 But why interchange them? 838 00:45:06,340 --> 00:45:11,270 So I say, OK, the last digit on this one, I'm going to 839 00:45:11,270 --> 00:45:13,050 assign to be a 0. 840 00:45:13,050 --> 00:45:16,920 The last digit on this, I'm going to assign to be a 1. 841 00:45:16,920 --> 00:45:19,000 And the important thing is, I'm going to make them 842 00:45:19,000 --> 00:45:20,700 siblings in this tree. 843 00:45:20,700 --> 00:45:24,740 And what I'm going to do now, terribly complicated thing, 844 00:45:24,740 --> 00:45:27,890 instead of building a tree from left to right, I'm going 845 00:45:27,890 --> 00:45:30,500 to build a tree from right to left. 846 00:45:30,500 --> 00:45:32,850 So when I get all done with the tree it's going to 847 00:45:32,850 --> 00:45:34,250 come in like this. 848 00:45:34,250 --> 00:45:38,560 And what I'm doing is starting out at the end, to start to 849 00:45:38,560 --> 00:45:40,760 build the end of the tree. 850 00:45:40,760 --> 00:45:44,610 And what happens after I go through this first step is, I 851 00:45:44,610 --> 00:45:47,280 say, OK, there is an optimal code 852 00:45:47,280 --> 00:45:50,420 in which these two quantities are 853 00:45:50,420 --> 00:45:52,950 siblings of maximal length. 854 00:45:52,950 --> 00:45:55,740 I now want to form an optimal code for these 855 00:45:55,740 --> 00:45:58,640 probabilities here. 856 00:45:58,640 --> 00:46:01,140 So, I go back and I iterate again. 857 00:46:01,140 --> 00:46:04,590 And I say, OK, if I have these probabilities here, 858 00:46:04,590 --> 00:46:07,150 what's the optimal code? 859 00:46:07,150 --> 00:46:08,830 Well, I could reorder the things. 860 00:46:08,830 --> 00:46:11,400 But now I know that the only thing I'm interested in is the 861 00:46:11,400 --> 00:46:16,690 two least likely symbols in this new alphabet here. 862 00:46:16,690 --> 00:46:19,360 Which is 0.2 and 0.15. 863 00:46:19,360 --> 00:46:21,300 So I combine those together.
864 00:46:21,300 --> 00:46:24,610 I tie them together as siblings in this last 865 00:46:24,610 --> 00:46:28,000 generation, however it works out. 866 00:46:28,000 --> 00:46:31,450 So then I have an alphabet of size three. 867 00:46:31,450 --> 00:46:32,990 And then down here, I have these 868 00:46:32,990 --> 00:46:34,550 two things tied together. 869 00:46:34,550 --> 00:46:36,730 These two things tied together. 870 00:46:36,730 --> 00:46:41,200 So I have a node of probability 0.25. 871 00:46:41,200 --> 00:46:45,200 I have a node of probability 0.35, and I have a node of 872 00:46:45,200 --> 00:46:46,870 probability 0.4. 873 00:46:46,870 --> 00:46:51,310 I take the two least likely, and I tie them together. 874 00:46:51,310 --> 00:46:54,120 And then I have two nodes left, one with probability 0.6 875 00:46:54,120 --> 00:46:56,480 and one with probability 0.4. 876 00:46:56,480 --> 00:46:58,010 And I tie them together. 877 00:46:58,010 --> 00:47:01,760 And, presto, I have my whole code, except for flipping it 878 00:47:01,760 --> 00:47:04,240 over, to go from left to right if you like. 879 00:47:04,240 --> 00:47:05,960 Codes that go from left to right, 880 00:47:05,960 --> 00:47:07,220 instead of right to left. 881 00:47:11,780 --> 00:47:12,190 OK. 882 00:47:12,190 --> 00:47:13,050 I have swindled you. 883 00:47:13,050 --> 00:47:14,790 How have I swindled you? 884 00:47:18,150 --> 00:47:20,680 I mean, I've swindled you a little bit by talking about 885 00:47:20,680 --> 00:47:23,360 these things that might be equal or not equal. 886 00:47:23,360 --> 00:47:24,350 And that's not important. 887 00:47:24,350 --> 00:47:26,380 You can sort that out on your own. 888 00:47:26,380 --> 00:47:29,200 There's a very important swindle I pulled. 889 00:47:29,200 --> 00:47:30,450 And what's that? 890 00:47:42,260 --> 00:47:46,450 What's very incomplete in this argument? 891 00:47:50,240 --> 00:47:53,090 This part is fine. 892 00:47:53,090 --> 00:47:54,850 Nothing wrong here. 893 00:47:54,850 --> 00:47:58,750 We have a lemma which says, you can find an optimal code 894 00:47:58,750 --> 00:48:00,410 by tying these two things together. 895 00:48:03,270 --> 00:48:03,560 Yeah? 896 00:48:03,560 --> 00:48:04,810 AUDIENCE: [UNINTELLIGIBLE] 897 00:48:11,085 --> 00:48:14,020 combine those two [UNINTELLIGIBLE] combination. 898 00:48:14,020 --> 00:48:15,540 PROFESSOR: You're saying, how do I know to 899 00:48:15,540 --> 00:48:17,380 combine these two? 900 00:48:17,380 --> 00:48:18,640 OK, which means what? 901 00:48:18,640 --> 00:48:19,585 Yeah. 902 00:48:19,585 --> 00:48:21,722 AUDIENCE: [UNINTELLIGIBLE] you've just added the 903 00:48:21,722 --> 00:48:24,340 probabilities -- 904 00:48:24,340 --> 00:48:26,760 PROFESSOR: I've just added those two probabilities. 905 00:48:26,760 --> 00:48:30,190 So I have a new ensemble where I have four probabilities, 906 00:48:30,190 --> 00:48:36,350 0.25, 0.15, 0.2, and 0.4. 907 00:48:36,350 --> 00:48:37,070 And that's fine. 908 00:48:37,070 --> 00:48:39,250 I still have these things. 909 00:48:39,250 --> 00:48:41,450 No, there's no independence involved here at all. 910 00:48:41,450 --> 00:48:46,140 I mean, I started out with five letters. 911 00:48:46,140 --> 00:48:48,150 Which are disjoint. 912 00:48:48,150 --> 00:48:50,100 I now have four letters that are disjoint. 913 00:48:56,200 --> 00:48:57,930 What have I done? 914 00:48:57,930 --> 00:48:58,160 Yeah. 915 00:48:58,160 --> 00:48:59,410 AUDIENCE: [UNINTELLIGIBLE] 916 00:49:03,490 --> 00:49:05,200 PROFESSOR: Yes.
917 00:49:05,200 --> 00:49:05,510 Yeah. 918 00:49:05,510 --> 00:49:11,610 I have assumed, now, that once I get these four symbols, if I 919 00:49:11,610 --> 00:49:15,910 have those four symbols, I can form an optimal code for those 920 00:49:15,910 --> 00:49:20,720 four symbols in which these two symbols get tied together. 921 00:49:20,720 --> 00:49:24,500 But how do I know that an optimal code for this reduced 922 00:49:24,500 --> 00:49:27,980 set of probabilities is also an optimal code for the 923 00:49:27,980 --> 00:49:29,230 original problem? 924 00:49:34,940 --> 00:49:37,910 I have tied these two things together. 925 00:49:37,910 --> 00:49:40,870 I know there's an optimal code in which these two things are 926 00:49:40,870 --> 00:49:42,480 tied together. 927 00:49:42,480 --> 00:49:44,760 I then have four symbols. 928 00:49:44,760 --> 00:49:47,990 I want to find a code for those four symbols. 929 00:49:47,990 --> 00:49:51,950 But I assume that the optimal code for these four symbols, 930 00:49:51,950 --> 00:49:55,000 when I break apart these two things, gives me an optimal 931 00:49:55,000 --> 00:49:58,430 code for five symbols. 932 00:49:58,430 --> 00:50:01,090 That's the sort of thing I want you people to start 933 00:50:01,090 --> 00:50:02,620 catching onto immediately. 934 00:50:02,620 --> 00:50:07,520 I want you to start asking those nasty questions. 935 00:50:07,520 --> 00:50:11,840 And those nasty questions are the things that say, OK, how 936 00:50:11,840 --> 00:50:15,290 do I know that this works? 937 00:50:15,290 --> 00:50:17,090 In other words, you're not here to learn these 938 00:50:17,090 --> 00:50:18,100 algorithms. 939 00:50:18,100 --> 00:50:21,160 I can tell you what the algorithm is in an instant. 940 00:50:21,160 --> 00:50:23,440 You can do the algorithm. 941 00:50:23,440 --> 00:50:26,790 A computer can do the algorithm about three thousand 942 00:50:26,790 --> 00:50:29,290 times faster than you can. 943 00:50:29,290 --> 00:50:32,260 And you can be replaced by a computer, if you only learn 944 00:50:32,260 --> 00:50:34,440 the algorithms. 945 00:50:34,440 --> 00:50:37,190 You can program the algorithm. 946 00:50:37,190 --> 00:50:40,100 You can probably find a computer that can program the 947 00:50:40,100 --> 00:50:42,130 algorithm too. 948 00:50:42,130 --> 00:50:45,370 And there's no need to program it more than once. 949 00:50:45,370 --> 00:50:50,530 So that after you've done that, you are useless again. 950 00:50:50,530 --> 00:50:53,030 So the only thing that's worthwhile for you is to be 951 00:50:53,030 --> 00:50:55,910 able to spot these problems and to understand 952 00:50:55,910 --> 00:50:58,590 what's going on. 953 00:50:58,590 --> 00:50:59,810 So. 954 00:50:59,810 --> 00:51:04,050 How do I know that this first optimization leads to the 955 00:51:04,050 --> 00:51:07,200 second optimization? 956 00:51:07,200 --> 00:51:10,080 After combining these two least likely codewords, or 957 00:51:10,080 --> 00:51:15,110 siblings, we've gotten a reduced set of probabilities. 958 00:51:15,110 --> 00:51:19,040 In this problem here, what we've done, the reduced set of 959 00:51:19,040 --> 00:51:27,770 probabilities are 0.4, 0.2, 0.15, and 0.25. 960 00:51:27,770 --> 00:51:32,000 Why does finding the optimal code for this reduced set 961 00:51:32,000 --> 00:51:35,680 result in an optimal code for the original set? 962 00:51:35,680 --> 00:51:39,000 That's really the question that we're asking. 963 00:51:39,000 --> 00:51:43,240 Well, it's not hard.
964 00:51:43,240 --> 00:51:47,450 If you take any code for the reduced set, let's call the 965 00:51:47,450 --> 00:51:51,410 reduced set of probabilities x prime. 966 00:51:51,410 --> 00:51:55,160 Let the expected length of that be L bar prime. 967 00:51:55,160 --> 00:51:58,450 It's not necessarily an optimal code, but it's any old 968 00:51:58,450 --> 00:52:00,040 code that I generate. 969 00:52:00,040 --> 00:52:04,600 Any old code I generate for x prime, I can now take that 970 00:52:04,600 --> 00:52:12,360 code and I can expand it out to a code for x. 971 00:52:12,360 --> 00:52:18,070 Namely, I have this code here: this, this, this, and that's 972 00:52:18,070 --> 00:52:19,920 the expanded -- 973 00:52:19,920 --> 00:52:23,620 and now I can expand it into a code for the original set, by 974 00:52:23,620 --> 00:52:28,180 adding on this and this, as leaves on this. 975 00:52:28,180 --> 00:52:31,100 This leaf here then becomes an intermediate node. 976 00:52:31,100 --> 00:52:34,980 And I add two extra leaves to it. 977 00:52:34,980 --> 00:52:37,130 OK, well, it's not hard. 978 00:52:37,130 --> 00:52:45,060 The expected length for this code, for these five letters, 979 00:52:45,060 --> 00:52:48,340 I claim, is equal to the expected length for this 980 00:52:48,340 --> 00:52:52,210 reduced code, this, this, this, and this. 981 00:52:52,210 --> 00:52:55,160 Plus one extra digit for this. 982 00:52:55,160 --> 00:52:58,650 Plus one extra digit for this. 983 00:52:58,650 --> 00:53:02,310 So the expected length L bar is the expected length L 984 00:53:02,310 --> 00:53:08,970 bar prime plus 0.15 plus 0.1. 985 00:53:08,970 --> 00:53:19,310 Which says the following: if I want to minimize this, and I 986 00:53:19,310 --> 00:53:23,070 know that this has to be equal to this, and these two numbers 987 00:53:23,070 --> 00:53:25,440 are fixed, I can't change them. 988 00:53:25,440 --> 00:53:29,770 I can minimize this, by minimizing this. 989 00:53:29,770 --> 00:53:31,610 And that's the final step in the whole argument. 990 00:53:35,060 --> 00:53:37,170 And what's peculiar is that everybody 991 00:53:37,170 --> 00:53:39,720 learns the Huffman algorithm. 992 00:53:39,720 --> 00:53:43,980 And what Huffman did, which was really very smart, was to 993 00:53:43,980 --> 00:53:45,230 sort out this issue. 994 00:53:48,400 --> 00:53:50,860 And I can teach this to a hundred classes, and nobody 995 00:53:50,860 --> 00:53:54,100 will ever point out to me that there's a logical flaw in the 996 00:53:54,100 --> 00:53:56,320 whole argument. 997 00:53:56,320 --> 00:53:58,770 And you can look at most books on information theory and they 998 00:53:58,770 --> 00:54:00,660 never point out that there's that 999 00:54:00,660 --> 00:54:03,120 logical flaw there, either. 1000 00:54:03,120 --> 00:54:09,740 So, anyway, that's the end of Huffman's algorithm. 1001 00:54:09,740 --> 00:54:11,960 You can see when you look at this that this is really an 1002 00:54:11,960 --> 00:54:14,730 extraordinarily easy thing to do. 1003 00:54:14,730 --> 00:54:16,530 I mean, you can take an alphabet of 1004 00:54:16,530 --> 00:54:18,760 several thousand symbols. 1005 00:54:18,760 --> 00:54:20,870 All you have to do is order them. 1006 00:54:20,870 --> 00:54:23,290 Tie the two least likely together. 1007 00:54:23,290 --> 00:54:25,840 Assign a 1 and a 0 to them. 1008 00:54:25,840 --> 00:54:30,350 Then, stick that into an ordered list again. 1009 00:54:30,350 --> 00:54:31,720 Take the two least probable.
1010 00:54:31,720 --> 00:54:32,910 Tie them together. 1011 00:54:32,910 --> 00:54:35,170 Stick it into an ordered list again. 1012 00:54:35,170 --> 00:54:40,050 And, if you have some minimal knowledge of data structures, 1013 00:54:40,050 --> 00:54:42,280 you can do this with essentially on the order of 1014 00:54:42,280 --> 00:54:46,440 one operation for each letter in this alphabet. 1015 00:54:46,440 --> 00:54:48,890 So it really isn't a very difficult task. 1016 00:54:48,890 --> 00:54:53,240 So here's an integer problem which is really easy to solve. 1017 00:54:53,240 --> 00:54:55,880 And the way to solve it is to look at the problem in the 1018 00:54:55,880 --> 00:54:58,160 opposite way from what everybody else has 1019 00:54:58,160 --> 00:55:00,570 looked at it in. 1020 00:55:00,570 --> 00:55:03,190 Does this say you want to ignore everything that 1021 00:55:03,190 --> 00:55:05,670 everybody else has done, and go your own way? 1022 00:55:05,670 --> 00:55:07,470 Not quite. 1023 00:55:07,470 --> 00:55:10,440 But it says that's one of the things you ought to try: if 1024 00:55:10,440 --> 00:55:13,560 you find that everybody is doing something one way and 1025 00:55:13,560 --> 00:55:19,250 you can find another way to look at it, that's very rich. 1026 00:55:19,250 --> 00:55:21,880 It might turn out to be nothing, but it might turn out 1027 00:55:21,880 --> 00:55:24,610 to be something very worthwhile. 1028 00:55:31,380 --> 00:55:36,870 Let's now talk about this quantity, entropy. 1029 00:55:36,870 --> 00:55:45,560 And for every chance variable, x, if that chance variable, x, 1030 00:55:45,560 --> 00:55:51,610 is discrete and has a finite number of elements in it, so 1031 00:55:51,610 --> 00:55:56,490 I'm talking about a chance variable x, what does a chance 1032 00:55:56,490 --> 00:56:00,660 variable have tagging along after it? 1033 00:56:00,660 --> 00:56:02,190 It has a set of probabilities 1034 00:56:02,190 --> 00:56:04,690 tagging along after it. 1035 00:56:04,690 --> 00:56:06,360 That's what a chance variable is. 1036 00:56:06,360 --> 00:56:08,620 A chance variable is not just the alphabet. 1037 00:56:08,620 --> 00:56:11,040 A chance variable is the alphabet plus the 1038 00:56:11,040 --> 00:56:12,250 probabilities. 1039 00:56:12,250 --> 00:56:16,200 That's why you can then talk about it having an entropy. 1040 00:56:16,200 --> 00:56:20,020 And the entropy is the expected value of minus the 1041 00:56:20,020 --> 00:56:22,760 logarithm of this PMF. 1042 00:56:26,310 --> 00:56:30,340 So, in fact, this is an unusual statistic in the sense 1043 00:56:30,340 --> 00:56:34,400 that it has nothing to do with the symbol values, and 1044 00:56:34,400 --> 00:56:37,090 everything to do with just the probabilities 1045 00:56:37,090 --> 00:56:39,580 of the symbol values. 1046 00:56:39,580 --> 00:56:42,750 And as we go on, you'll see that in fact this is a very 1047 00:56:42,750 --> 00:56:44,680 important property of it. 1048 00:56:44,680 --> 00:56:47,280 And dealing with the logarithms of these 1049 00:56:47,280 --> 00:56:53,400 probabilities is, in fact, a much more worthwhile thing to do 1050 00:56:53,400 --> 00:56:57,250 than dealing with the probabilities themselves. 1051 00:56:57,250 --> 00:57:02,550 Now, let me pause again and see if anybody has any 1052 00:57:02,550 --> 00:57:07,290 idea of why logarithms of probabilities might be more 1053 00:57:07,290 --> 00:57:10,990 significant than probabilities.
1054 00:57:10,990 --> 00:57:13,410 And think of what we're going to be doing here. 1055 00:57:13,410 --> 00:57:17,490 We're taking a sequence of letters. 1056 00:57:17,490 --> 00:57:19,990 When I take a sequence of letters, what's the 1057 00:57:19,990 --> 00:57:22,210 probability of the sequence of letters? 1058 00:57:24,820 --> 00:57:25,760 If they're IID. 1059 00:57:25,760 --> 00:57:27,460 Namely, we're looking -- 1060 00:57:27,460 --> 00:57:28,790 AUDIENCE: [UNINTELLIGIBLE] 1061 00:57:28,790 --> 00:57:32,310 PROFESSOR: It's the product of those probabilities. 1062 00:57:32,310 --> 00:57:39,070 Now, if you agree with me that probability theory is 1063 00:57:39,070 --> 00:57:43,450 concerned 50% with the law of large numbers and 50% with 1064 00:57:43,450 --> 00:57:47,030 everything else all put together, why is the logarithm 1065 00:57:47,030 --> 00:57:48,900 of a probability important? 1066 00:57:48,900 --> 00:57:50,150 AUDIENCE: [UNINTELLIGIBLE] 1067 00:57:55,310 --> 00:57:58,020 PROFESSOR: You change your product to a sum, yes. 1068 00:57:58,020 --> 00:58:02,140 If you have a product of probabilities, you can talk 1069 00:58:02,140 --> 00:58:07,650 about a sum of the logarithms of probabilities. 1070 00:58:07,650 --> 00:58:12,250 That's why entropy is important 1071 00:58:12,250 --> 00:58:14,340 in statistical mechanics. 1072 00:58:14,340 --> 00:58:18,290 It also is, fundamentally, the reason why entropy is 1073 00:58:18,290 --> 00:58:20,530 important in information theory. 1074 00:58:20,530 --> 00:58:24,360 It's because what you're almost always interested in is a 1075 00:58:24,360 --> 00:58:26,360 product of probabilities. 1076 00:58:26,360 --> 00:58:29,310 And when you're interested in a product of probabilities and 1077 00:58:29,310 --> 00:58:32,050 you want to use the law of large numbers, you turn that 1078 00:58:32,050 --> 00:58:35,580 product of probabilities into a sum of the logarithms of 1079 00:58:35,580 --> 00:58:37,820 probabilities. 1080 00:58:37,820 --> 00:58:39,070 Fundamental idea. 1081 00:58:41,560 --> 00:58:44,710 Shannon took eight years sorting all this out. 1082 00:58:44,710 --> 00:58:49,970 And Shannon was by far the smartest person I've ever met. 1083 00:58:49,970 --> 00:58:54,570 I mean, the problems that we worry about, he just, bip. 1084 00:58:54,570 --> 00:58:57,970 Solves them with no effort at all. 1085 00:58:57,970 --> 00:59:00,570 This one took him a while to sort out. 1086 00:59:00,570 --> 00:59:03,030 It also took him a while to sort out the fact that once he 1087 00:59:03,030 --> 00:59:06,670 sorted this out, he could sort out all of the other problems, 1088 00:59:06,670 --> 00:59:09,730 as far as communications was concerned. 1089 00:59:09,730 --> 00:59:13,660 So it was quite important. 1090 00:59:13,660 --> 00:59:15,140 I mean, I can tell you one of the peculiar 1091 00:59:15,140 --> 00:59:17,590 things about Shannon. 1092 00:59:17,590 --> 00:59:20,430 Just from the first time I ever talked to him about a 1093 00:59:20,430 --> 00:59:22,300 technical problem. 1094 00:59:22,300 --> 00:59:24,860 I'd just become a faculty member here. 1095 00:59:24,860 --> 00:59:28,790 And his office was about five doors down from mine. 1096 00:59:28,790 --> 00:59:31,630 And one day I screwed up my courage to go down and talk to 1097 00:59:31,630 --> 00:59:34,340 the guy about a problem I was working on. 1098 00:59:34,340 --> 00:59:36,500 And I thought it was a really neat problem.
1099 00:59:36,500 --> 00:59:39,020 It had all sorts of pieces to it, all sorts of bells and 1100 00:59:39,020 --> 00:59:40,280 whistles on it. 1101 00:59:40,280 --> 00:59:42,850 And I started to explain it to him. 1102 00:59:42,850 --> 00:59:45,380 And he said, well, can we look at a slightly simpler case 1103 00:59:45,380 --> 00:59:49,270 where you throw out this part of it, you throw out one bell. 1104 00:59:49,270 --> 00:59:50,840 Then he'd throw out a whistle. 1105 00:59:50,840 --> 00:59:52,530 Then he'd throw out a bell. 1106 00:59:52,530 --> 00:59:54,690 And I was going along with this and saying, 1107 00:59:54,690 --> 00:59:55,690 yeah, I guess we could. 1108 00:59:55,690 --> 00:59:56,770 We could. 1109 00:59:56,770 --> 01:00:00,150 We can throw out all of these things without really losing 1110 01:00:00,150 --> 01:00:02,380 the essence of the problem. 1111 01:00:02,380 --> 01:00:04,150 And, finally, I started to get discouraged. 1112 01:00:04,150 --> 01:00:06,890 Because this really neat research problem, this really 1113 01:00:06,890 --> 01:00:10,680 important research problem, was turning into a toy problem 1114 01:00:10,680 --> 01:00:13,530 which was almost trivial. 1115 01:00:13,530 --> 01:00:17,080 It had nothing to do, it seemed, with anything. 1116 01:00:17,080 --> 01:00:20,260 And finally we got down to a certain point. 1117 01:00:20,260 --> 01:00:24,200 And I said, yeah, but this is trivial, the solution is this. 1118 01:00:24,200 --> 01:00:26,170 And he said, yeah. 1119 01:00:26,170 --> 01:00:29,180 And then we started putting back all the pieces. 1120 01:00:29,180 --> 01:00:33,270 And his genius was, he knew which things to throw out. 1121 01:00:33,270 --> 01:00:36,580 So that each of the things we threw out, we could put them 1122 01:00:36,580 --> 01:00:38,060 back in again. 1123 01:00:38,060 --> 01:00:40,330 When we got done, the research problem was trivial. 1124 01:00:42,870 --> 01:00:46,450 And his genius was in finding the right trivial 1125 01:00:46,450 --> 01:00:48,960 example to look at. 1126 01:00:48,960 --> 01:00:52,290 So, in fact, what you always want to look at, in the 1127 01:00:52,290 --> 01:00:55,980 communications field -- and in most fields, I think -- is 1128 01:00:55,980 --> 01:00:59,320 finding the really simple way of looking at something. 1129 01:00:59,320 --> 01:01:03,660 Which means you have to throw out most of the nonsense. 1130 01:01:03,660 --> 01:01:06,560 So, in this case, it's looking at entropy, which is the 1131 01:01:06,560 --> 01:01:09,010 expected logarithm of a probability assignment. 1132 01:01:09,010 --> 01:01:12,250 And you want to look at that because the logarithm of a 1133 01:01:12,250 --> 01:01:15,950 probability assignment lets you add the logarithms of 1134 01:01:15,950 --> 01:01:16,830 probabilities. 1135 01:01:16,830 --> 01:01:18,990 Use the law of large numbers. 1136 01:01:18,990 --> 01:01:25,500 And then you can talk about sequences of elements. 1137 01:01:25,500 --> 01:01:27,320 Properties of entropy. 1138 01:01:27,320 --> 01:01:32,570 For a discrete chance variable. 1139 01:01:32,570 --> 01:01:35,240 We have m elements in the alphabet. 1140 01:01:35,240 --> 01:01:39,440 First thing is that the entropy is always greater than 1141 01:01:39,440 --> 01:01:41,570 or equal to 0. 1142 01:01:41,570 --> 01:01:42,920 Why is that? 1143 01:01:42,920 --> 01:01:44,170 I'll let you figure it out.
1144 01:01:46,660 --> 01:01:49,910 Why is minus the logarithm 1145 01:01:49,910 --> 01:01:55,270 of a probability greater than or equal to 0? 1146 01:01:55,270 --> 01:01:56,520 Why is it non-negative? 1147 01:01:58,810 --> 01:01:59,550 Yeah. 1148 01:01:59,550 --> 01:02:00,870 AUDIENCE: [UNINTELLIGIBLE] 1149 01:02:00,870 --> 01:02:02,330 PROFESSOR: Probabilities are always less than 1150 01:02:02,330 --> 01:02:04,860 or equal to 1, yes. 1151 01:02:04,860 --> 01:02:07,390 So this quantity here is always greater 1152 01:02:07,390 --> 01:02:08,550 than or equal to 0. 1153 01:02:08,550 --> 01:02:10,980 Because the logarithm of 1 is equal to 0. 1154 01:02:15,970 --> 01:02:19,440 We have equality here if x is deterministic. 1155 01:02:19,440 --> 01:02:21,590 Which is just a special case there. 1156 01:02:21,590 --> 01:02:24,880 Where you have an ensemble of one element and it has 1157 01:02:24,880 --> 01:02:26,830 probability 1. 1158 01:02:26,830 --> 01:02:30,610 Or, in fact, at this point you could add in things which have 1159 01:02:30,610 --> 01:02:32,750 zero probability. 1160 01:02:32,750 --> 01:02:35,500 Well, that's a little bit tricky. 1161 01:02:35,500 --> 01:02:38,320 Because you add in something that's zero probability. 1162 01:02:38,320 --> 01:02:44,640 And minus the logarithm of 0 is infinity. 1163 01:02:44,640 --> 01:02:48,820 So you're dealing with the expected value of a bunch of 1164 01:02:48,820 --> 01:02:51,930 infinities, which each occur with zero probability. 1165 01:02:51,930 --> 01:02:56,330 And you're forced to say, well, I think that 0 times log 1166 01:02:56,330 --> 01:02:59,270 of 0 is equal to 0. 1167 01:02:59,270 --> 01:03:04,590 And in fact, epsilon times log of epsilon goes to 0 as 1168 01:03:04,590 --> 01:03:06,490 epsilon goes to 0. 1169 01:03:06,490 --> 01:03:09,670 But you save yourself a lot of worry by just leaving out 1170 01:03:09,670 --> 01:03:13,540 things of zero probability. 1171 01:03:13,540 --> 01:03:16,580 So H of x is greater than or equal to 0. 1172 01:03:16,580 --> 01:03:19,020 We have equality if x is deterministic. 1173 01:03:19,020 --> 01:03:24,750 H of x is less than or equal to log of m. 1174 01:03:24,750 --> 01:03:28,830 Equality, if x is equiprobable. 1175 01:03:28,830 --> 01:03:30,080 And how do I know that? 1176 01:03:33,030 --> 01:03:35,140 I look at this again. 1177 01:03:35,140 --> 01:03:39,630 I'm not going to prove it here but, essentially, this follows 1178 01:03:39,630 --> 01:03:45,620 from saying that the natural logarithm of something is less 1179 01:03:45,620 --> 01:03:48,500 than or equal to that something minus 1. 1180 01:03:48,500 --> 01:03:52,990 And you take the difference of the entropy, and log of m. 1181 01:03:52,990 --> 01:03:57,510 And, presto, it gives you the result that you want. 1182 01:03:57,510 --> 01:04:01,285 So you've got the most entropy if everything is 1183 01:04:01,285 --> 01:04:02,535 equiprobable. 1184 01:04:05,360 --> 01:04:09,450 For any code satisfying the Kraft inequality, the entropy 1185 01:04:09,450 --> 01:04:12,130 is less than or equal to L bar. 1186 01:04:12,130 --> 01:04:15,930 Well, that's what we already proved.
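[Editor's note: collected in symbols, the properties stated so far are, presumably, the following, where $H(X)$ is the entropy, $m$ is the alphabet size, and $\bar{L}$ is the expected codeword length of any code satisfying the Kraft inequality:

$$
0 \;\le\; H(X) \;\le\; \log_2 m, \qquad H(X) \;\le\; \bar{L},
$$

with $H(X) = 0$ exactly when $X$ is deterministic, and $H(X) = \log_2 m$ exactly when all $m$ symbols are equiprobable.]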
1187 01:04:15,930 --> 01:04:20,490 Namely, in the middle of the lecture, we showed that for 1188 01:04:20,490 --> 01:04:24,370 any code that satisfies the Kraft inequality, the entropy 1189 01:04:24,370 --> 01:04:27,620 is always less than or equal to L bar, because the entropy 1190 01:04:27,620 --> 01:04:30,700 is what you get if you minimize the expected length 1191 01:04:30,700 --> 01:04:32,990 without the integer constraint. 1192 01:04:32,990 --> 01:04:36,830 And L bar is what you get -- well, L bar min is what you 1193 01:04:36,830 --> 01:04:43,040 get when you minimize it with the integer constraint. 1194 01:04:43,040 --> 01:04:45,100 I mean, you don't bother about minimizing it. 1195 01:04:45,100 --> 01:04:47,010 You get something bigger than L bar min. 1196 01:04:47,010 --> 01:04:51,320 So this is less than or equal to the expected length of any code. 1197 01:04:51,320 --> 01:04:55,170 For the very best codeword, for the very best code, the 1198 01:04:55,170 --> 01:04:59,290 minimum expected length is less than or equal to the 1199 01:04:59,290 --> 01:05:01,190 entropy plus 1. 1200 01:05:01,190 --> 01:05:05,010 And you get that just by rounding each non-integer 1201 01:05:05,010 --> 01:05:09,530 length up with the ceiling function. 1202 01:05:09,530 --> 01:05:13,770 Which gives you, at most, one extra digit for each codeword. 1203 01:05:13,770 --> 01:05:16,870 Now, here's the more interesting one. 1204 01:05:16,870 --> 01:05:22,100 For independent chance variables, x and y, here's 1205 01:05:22,100 --> 01:05:28,120 where the nice part about notation comes along. 1206 01:05:28,120 --> 01:05:32,250 What's the entropy of x y? 1207 01:05:32,250 --> 01:05:35,330 Well, what do I mean by x y first? 1208 01:05:35,330 --> 01:05:38,460 I have a chance variable, x. 1209 01:05:38,460 --> 01:05:45,300 And this chance variable x has an alphabet associated with 1210 01:05:45,300 --> 01:05:47,800 it, x1 up to x sub m. 1211 01:05:47,800 --> 01:05:49,850 I have a chance variable y. 1212 01:05:49,850 --> 01:05:53,490 It has an alphabet associated with it. 1213 01:05:53,490 --> 01:05:58,870 What's the sample space, what's the set of events 1214 01:05:58,870 --> 01:06:02,950 corresponding to the chance variable x y? 1215 01:06:05,790 --> 01:06:09,990 By x y, I mean a chance variable whose elements are 1216 01:06:09,990 --> 01:06:14,170 the possible values of both x and y. 1217 01:06:14,170 --> 01:06:17,740 So, I'm talking about the joint ensemble of x and y. 1218 01:06:17,740 --> 01:06:21,080 I have a bunch of possible values for that. 1219 01:06:21,080 --> 01:06:26,260 And those possible values, if I have m possibilities for 1220 01:06:26,260 --> 01:06:28,940 each, I have m squared possible values 1221 01:06:28,940 --> 01:06:31,180 for the two of them. 1222 01:06:31,180 --> 01:06:35,050 So I'm talking about the expected value of minus the 1223 01:06:35,050 --> 01:06:40,750 logarithm of the probability of x and y. 1224 01:06:40,750 --> 01:06:43,890 In other words, I am trying to take the -- 1225 01:06:43,890 --> 01:06:46,640 let me write it out. 1226 01:06:46,640 --> 01:06:49,840 I'm probably giving conniptions to -- no? 1227 01:06:49,840 --> 01:06:51,090 OK. 1228 01:06:58,930 --> 01:07:06,960 I want to take p sub x y of symbols x y, times minus the 1229 01:07:06,960 --> 01:07:16,440 logarithm to the base 2 of p sub x y of x y. 1230 01:07:16,440 --> 01:07:21,340 That's what this means if I write it out.
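[Editor's note: written as a formula, the sum on the board is presumably:

$$
H(XY) \;=\; \sum_{x,\,y} p_{XY}(x,y)\,\bigl(-\log_2 p_{XY}(x,y)\bigr).
$$
]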
1231 01:07:24,620 --> 01:07:32,740 Well, this probability here is p sub x of little x, times p 1232 01:07:32,740 --> 01:07:35,280 sub y of little y. 1233 01:07:35,280 --> 01:07:36,490 Why is that? 1234 01:07:36,490 --> 01:07:37,660 Because I'm assuming that they're 1235 01:07:37,660 --> 01:07:39,630 independent of each other. 1236 01:07:39,630 --> 01:07:42,160 And, therefore, the probability of the two of them is 1237 01:07:42,160 --> 01:07:45,405 the product of the probabilities. Times minus log 1238 01:07:45,405 --> 01:07:46,080 to the base 2 1239 01:07:46,080 --> 01:07:55,410 of p sub x of x, minus logarithm to the base 2 of p 1240 01:07:55,410 --> 01:07:58,560 sub y of y. 1241 01:07:58,560 --> 01:08:05,120 And I'm summing this over all x and all y. 1242 01:08:05,120 --> 01:08:10,440 And the more sophisticated way to write this -- 1243 01:08:10,440 --> 01:08:13,070 things I say in lecture, you don't have to copy down 1244 01:08:13,070 --> 01:08:16,290 because they're always in the notes. 1245 01:08:16,290 --> 01:08:17,880 If they're not in the notes, it's probably 1246 01:08:17,880 --> 01:08:19,130 wrong anyway, so. 1247 01:08:21,640 --> 01:08:25,190 So this expected value is the expected value of minus the 1248 01:08:25,190 --> 01:08:27,490 logarithm of the probability of x y. 1249 01:08:27,490 --> 01:08:32,190 Which is the expected value of minus the logarithm of p of 1250 01:08:32,190 --> 01:08:34,350 x times p of y. 1251 01:08:34,350 --> 01:08:37,290 And, since I have a logarithm of a product, that's the 1252 01:08:37,290 --> 01:08:42,350 expected value of minus log p of x minus log p of y, which 1253 01:08:42,350 --> 01:08:47,370 is the entropy of x plus the entropy of y. 1254 01:08:47,370 --> 01:08:50,720 In other words, when I have a joint ensemble of even more 1255 01:08:50,720 --> 01:08:56,660 independent quantities, the entropy of the sequence is equal 1256 01:08:56,660 --> 01:09:01,800 to the sum of the entropies of the individual elements in 1257 01:09:01,800 --> 01:09:03,050 that sequence. 1258 01:09:07,270 --> 01:09:09,630 Well, that's all I wanted to talk about today. 1259 01:09:09,630 --> 01:09:12,010 If any of you have any questions to ask, you should 1260 01:09:12,010 --> 01:09:13,500 ask them now. 1261 01:09:13,500 --> 01:09:17,210 I went through Huffman coding pretty quickly, because it's 1262 01:09:17,210 --> 01:09:19,670 something where you have to do some exercises on 1263 01:09:19,670 --> 01:09:21,920 it to sort it out. 1264 01:09:21,920 --> 01:09:24,480 And I didn't want to do any more than that.
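[Editor's note: as one such exercise, here is a minimal Python sketch of the procedure described in the lecture -- repeatedly tie the two least likely nodes together as siblings -- run on the lecture's example probabilities 0.4, 0.2, 0.15, 0.15, 0.1. The heap-based priority queue and all names here are the editor's choices, not anything from the lecture; the sketch only tracks codeword lengths, since each tie-them-together step adds one binary digit to every codeword inside the merged subtree.]

import heapq
import math

def huffman_lengths(probs):
    # Heap entries: (probability, unique id, symbol indices in this subtree).
    # The unique id breaks probability ties so the lists are never compared.
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        # Take the two least likely nodes and tie them together as siblings.
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Every codeword inside the merged subtree picks up one more digit.
        for s in syms1 + syms2:
            lengths[s] += 1
        # The merged node goes back into the ordered list with the summed probability.
        heapq.heappush(heap, (p1 + p2, next_id, syms1 + syms2))
        next_id += 1
    return lengths

probs = [0.4, 0.2, 0.15, 0.15, 0.1]  # the example from the lecture
lengths = huffman_lengths(probs)     # comes out [1, 3, 3, 3, 3]
L_bar = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * math.log2(p) for p in probs)
print(lengths, L_bar, round(H, 3))

[On this example the expected length is L bar = 2.2 and the entropy is approximately 2.146, so the Huffman code sits within one bit of the entropy, exactly as the bound H <= L bar min < H + 1 from the lecture promises.]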