The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK. Just to review where we are, we've been talking about source coding as one of the two major parts of digital communication. Remember, you take sources, you turn them into bits. Then you take bits and you transmit them over channels. And that sums up the whole course. This is the part where you transmit over channels; this is the part where you process the sources. We're concentrating now on the source side of things, partly because by concentrating on the source side we will build up the machinery we need to look at the channel side. The channel side is really more interesting, I think, although there's been a great deal of work on both of them. They're both important.

And we said that we could separate source coding into three pieces. If you start out with a waveform source, the typical thing to do, and almost the only thing to do, is to turn those waveforms into sequences of numbers. Those sequences might be samples; they might be numbers in an expansion; they might be whatever. But the first thing you almost always do is turn waveforms into a sequence of numbers, because waveforms are just too complicated to deal with. The next thing we do with a sequence of numbers is quantize them. After we quantize them we wind up with a finite set of symbols. And the next thing we do is take that sequence of symbols, and what we do at that point is data compression. We try to represent those symbols with as small a number of binary digits per source symbol as possible, and we want to do that in such a way that we can reconstruct the symbols at the other end.
So let's review a little bit about what we've done in the last couple of lectures. We talked about the Kraft inequality. And the Kraft inequality, you remember, says that the lengths l_1, ..., l_M of the codewords in any prefix-free code have to satisfy the inequality: the sum over i of 2 to the minus l_i is less than or equal to 1. And this inequality, in some sense, says that if you want to make one codeword short, making that one codeword short eats up a large part of the sum, since the sum has to be less than or equal to 1. If, for example, you make l_1 equal to 1, then that uses up half of the sum right there, and all the other codewords have to be much longer. So that's essentially what it's saying; we proved it, we did a bunch of things with it, you worked with it in your homework, and we have shown that that inequality has to hold.

The next thing we did is, given a set of probabilities on a source, for example p_1 up to p_M -- by this time you should feel very comfortable in realizing that what you call these symbols doesn't make any difference whatsoever as far as encoding sources is concerned. The first thing you can do, if you like, is take whatever names somebody has given to a set of symbols and replace them with your own symbols, and the easiest set of symbols to use is the integers 1 to M. And that's what we will usually do. So, given this set of probabilities, and they have to add up to 1, the Huffman algorithm is this ingenious algorithm, very, very clever, which constructs a prefix-free code of minimum expected length. And the expected length is just defined as the sum over i of p_i times l_i. The trick in the algorithm is to find the set of l_i's that satisfy the Kraft inequality but minimize this expected value. (There's a small code sketch of the Kraft sum and the Huffman construction just below.)

The next thing we started to talk about was a discrete memoryless source. A discrete memoryless source is really a toy source. It's a toy source where you assume that the letters in the sequence are independent and identically distributed. In other words, every letter has the same distribution.
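As a quick aside (not from the lecture): here is a minimal Python sketch of the two ideas just reviewed, the Kraft sum and the Huffman construction. The probabilities below are made up purely for illustration.

```python
import heapq
from math import log2

def kraft_sum(lengths):
    """Sum of 2^(-l_i); a prefix-free code with these lengths exists iff this is <= 1."""
    return sum(2.0 ** -l for l in lengths)

def huffman_lengths(probs):
    """Codeword lengths of an optimal (Huffman) prefix-free code for the pmf `probs`."""
    # Heap entries: (subtree probability, tie-breaker, symbols in that subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # the two least probable subtrees...
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                 # ...get merged, pushing every leaf one bit deeper
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]              # invented source probabilities for symbols 1..4
lengths = huffman_lengths(probs)
L_bar = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * log2(p) for p in probs)
print("lengths:", lengths, "  Kraft sum:", kraft_sum(lengths))
print(f"expected length L = {L_bar:.3f}   entropy H(X) = {H:.3f}")
```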
Every letter is independent of every other letter. That's more appropriate for a gambling game than it is for real sources. But, on the other hand, by understanding this problem, we're starting to see that we understand the whole problem of source coding. So we'll get on with that as we go.

But, anyway, when we have a discrete memoryless source, what we found -- first we defined the entropy of such a source as H(X), which is the sum over i of minus p_i times the logarithm of p_i. And that was just something that came out of trying to do this optimization not the way that Huffman did it but the way that Shannon did it. Namely, Shannon looked at this optimization in terms of dealing with entropy and things like that; Huffman dealt with it in terms of finding the optimal code. One of the very surprising things is that the way Shannon looked at it, in terms of entropy, is the way that's really valuable, even though it doesn't come up with an optimal code. I mean, here was poor Huffman, who generated this beautiful algorithm, which is extraordinarily simple, which solved what looked like a hard problem. And yes, as far as information theory is concerned, his algorithm gets used, sure. But as far as all the generalizations are concerned, it has almost nothing to do with anything.

But, anyway, when you look at this entropy, what comes out of it is the fact that the entropy of the source is less than or equal to L bar min, the minimum number of bits per source symbol that you can come up with in a prefix-free code, which in turn is less than H(X) plus 1. And the way we did that was just to look at this minimization. By looking at the minimization, we showed the minimum had to be greater than or equal to H(X). And by looking at any code which satisfied the Kraft inequality with some set of lengths -- well, after we looked at this, it said that what we really wanted to do was to make the length of each codeword equal to minus the log of p_i. That's not an integer.
So the thing we did to get this inequality was to say, OK, if it's not an integer, we'll raise it up to the next integer value. And as soon as we do that, the Kraft inequality is satisfied, and you can generate a code with that set of lengths. So that's where you get these two bounds. (There's a small numerical sketch of this ceiling construction below.)

This upper bound here is usually very, very weak. Can anybody suggest the almost unique example where it is almost tight? It's the simplest example you can think of. It's a binary source. And what binary source has the property that this is almost equal to this? Anybody out there?

AUDIENCE: [UNINTELLIGIBLE]

PROFESSOR: Make it almost probability 0 and probability 1. You can't quite do that, because as soon as you make the probability of the 0 equal to 0, then you don't have to represent it. You just know that it's a sequence of all 1's, so you're all done, and you don't need any bits to represent it. But if there's just some very tiny epsilon probability of a 0 and a big probability of a 1, then the entropy is almost equal to 0. And this 1 here is needed because L bar min is 1. You can't make it any smaller than that. So that's where that comes from.

Let's talk about entropy just a little bit. Suppose we have an alphabet of size M -- that's the symbol we'll usually use for the alphabet size of a chance variable X -- and the probability that X equals i is p_i. In other words, again, we're using the convention of calling the symbols the integers 1 to M. Then the entropy is equal to this. And a nice way of representing it is that the entropy is equal to the expected value of minus the logarithm of the probability of the symbol X. Logarithms are always to the base 2 in this course; when we want natural logarithms we'll write ln. So it's the expected value of minus log to the base 2 of p_X(X). We call this the log pmf random variable. We started out with X being a chance variable.
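A small numerical sketch of that ceiling construction (my own example, not from the lecture): take each length to be the ceiling of minus log2 of p_i. Those lengths always satisfy the Kraft inequality, and the expected length lands between H(X) and H(X) + 1. The second call below is the nearly deterministic binary source just discussed, where the +1 really is needed.

```python
from math import ceil, log2

def ceiling_code_check(probs):
    lengths = [ceil(-log2(p)) for p in probs]          # l_i = ceil(-log2 p_i)
    kraft = sum(2.0 ** -l for l in lengths)            # always <= 1 for these lengths
    L_bar = sum(p * l for p, l in zip(probs, lengths))
    H = -sum(p * log2(p) for p in probs)
    print(f"H(X) = {H:.4f}   expected length = {L_bar:.4f}   Kraft sum = {kraft:.4f}")

ceiling_code_check([0.4, 0.3, 0.2, 0.1])   # ordinary case: well inside [H, H+1)
ceiling_code_check([0.999, 0.001])         # nearly deterministic binary source:
                                           # H is almost 0, but the expected length stays near 1
```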
Back to the log pmf: I mean, we happen to have turned X into a random variable because we've given it numbers, but that's irrelevant; we really want to think of it as a chance variable. But now this quantity, minus log p_X(x), is a numerical function of the symbol which comes out of the source, and therefore it is a well-defined random variable. And we call it the log pmf random variable. Some people call it self-information; we'll find out why later. I don't particularly like that word, partly because what we're dealing with here has nothing to do with information. The thought that all of this has something to do with information -- namely, that information theory has something to do with information -- probably held up the field for about five years, because everyone tried to figure out what it had to do with information. And, of course, it had nothing to do with information. It really only had to do with data, and with probabilities of various things in the data. So, anyway, some people call it self-information, and we'll see why later. But this is the quantity we're interested in. And we call it the log pmf; we sort of forget about the minus sign. It's not good to forget about the minus sign, but I always do it, so I sort of expect other people to do it, too.

One of the properties of entropy is that it has to be greater than or equal to 0. Why? Because these probabilities have to be less than or equal to 1, so the logarithm of each probability is less than or equal to 0, and therefore minus the logarithm has to be greater than or equal to 0. This entropy is also less than or equal to log M, log capital M. I'm not going to prove that here; it's proven in the notes, or maybe in one of the problems, I forget -- it's a trivial thing to do. But, anyway, you can do it using the inequality log of x is less than or equal to x minus 1.
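For completeness, a one-line sketch of that argument (my reconstruction, assuming every p_i is strictly positive), using ln x <= x - 1:

\[
H(X) - \log_2 M \;=\; \sum_i p_i \log_2\frac{1}{p_i M}
\;\le\; \frac{1}{\ln 2}\sum_i p_i\Bigl(\frac{1}{p_i M}-1\Bigr)
\;=\; \frac{1}{\ln 2}\Bigl(\sum_i \frac{1}{M} - \sum_i p_i\Bigr) \;=\; 0,
\]

with equality exactly when every p_i equals 1/M, which is the equiprobable case mentioned next.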
Just like all inequalities can be proven with that inequality. So there's equality here if X is equiprobable. Which is pretty clear, because if all of these probabilities are equal to 1 over M, then you're taking the expected value of the logarithm of M, and you get the logarithm of M. So there's nothing very surprising here.

Now, the next thing -- and here's where what we're going to do is on one hand very simple, and on the other hand very confusing. After you get the picture of it, it becomes very simple; before that, it all looks rather strange. If you have two independent chance variables, say X and Y, then the choice of the sample value of the chance variable X and the choice of the sample value of Y together form a pair of sample values, which we can view as one sample value. In other words, we can view XY as a chance variable in its own right. This isn't just the sequence X followed by Y, although you can think of it that way; we want to think of it here as a chance variable itself, which takes on different values. And the values it takes on are pairs of sample values, one from the ensemble of X and one from the ensemble of Y. And XY takes on the sample value little xy with probability p_X(x) times p_Y(y).

As we move on with this course, we'll become much less careful about writing down these subscripts, which refer to random variables, and the arguments, which refer to sample values of those random variables. I want to keep doing it for a while, because most courses in probability, even 6.041, which is the first course in probability, almost deliberately fudge the difference between sample values and random variables. Most people who work with probability do this all the time; you never know when you're talking about a random variable and when you're talking about a sample value of a random variable. And that's convenient for getting insight about things. But you do it for a while and then pretty soon you wonder what the heck is going on.
Because you have no idea anymore what's a random variable and what's a sample value of it. So, the entropy H of the chance variable XY is then the expected value of minus the logarithm of the probability of the chance variable XY. Namely, when you take the expected value, you're taking the expected value of a random variable, and the random variable here is the log pmf of the chance variable XY. This is the expected value of minus the logarithm of p_X(x) times p_Y(y). Since the two are independent of each other, the log of the product splits into a sum, and that gives you H(XY) is equal to H(X) plus H(Y).

This is really the reason why you're interested in these chance variables which are logarithms of probabilities. Because when you have independent chance variables, then you have the situation that probabilities multiply and therefore log probabilities add. All of the major theorems in probability theory -- in particular the law of large numbers, which is the most important result in probability, simple though it might be; that's the key to everything in probability -- talk about sums of random variables and not about products of random variables. So that's why Shannon did everything in terms of these log pmf's. And we will soon be doing everything in terms of log pmf's also.

So now let's get back to discrete memoryless sources. If you now have a block of n chance variables, X1 to Xn, and they're all IID, again we can treat the whole block as one big monster chance variable. If each one of these takes on M possible values, then this monster chance variable can take on M to the n possible values. Namely, each possible string of n symbols, where each symbol is a choice from the integers 1 to M. So we're talking about n-tuples of numbers now. And those are the values that this particular chance variable, X^n, takes on.
So it takes on each of those values with probability p_{X^n}(x^n) equal to the product, from i equals 1 to n, of the individual probabilities p_X(x_i). In other words, again, when you have independent chance variables, the probabilities multiply, which is all I'm saying here. So the chance variable X^n has entropy H(X^n), which is the expected value of minus the logarithm of that probability. Which is the expected value of minus the logarithm of the product of probabilities, which is the expected value of the sum of minus the logs of the probabilities, which is n times H(X). If you compare this with the previous slide, you'll see I haven't said anything new; this argument and that argument are really exactly the same. All I did was do it for two chance variables first, and then observe that it generalizes to an arbitrary number of chance variables. You can say that it generalizes to an infinite number of chance variables also, and in some sense it does, but I would advise you not to go there, because you just get tangled up with a lot of mathematics that you don't need.

So the next thing is fixed-to-variable-length prefix-free codes: how do we form them, and what do we gain by them? The thing we're going to do now, instead of trying to compress one symbol at a time from the source, is to segment the source into blocks of n symbols each. And after we segment it into blocks of n symbols each, we're going to encode each block of n symbols. Now, what's new there? Nothing whatsoever is new. A block of n symbols is just a chance variable. We know how to do optimal encoding of chance variables -- namely, we use the Huffman algorithm -- and you can do that here on these blocks of n. We also have this nice theorem, which says that the entropy of X^n is n times the entropy of X. So, in other words, when you have independent identically distributed chance variables, this entropy is just n times this entropy. (There's a small numerical check of this, and of the per-symbol coding gain that follows, just below.)
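A quick numerical check (an invented three-letter source, not from the lecture). For an IID block, the product pmf has entropy exactly n times H(X); and even the crude code that gives each block the length ceil(-log2 p(x^n)) drives the expected number of bits per source symbol down toward H(X) as n grows, which is the point made next.

```python
from itertools import product
from math import ceil, log2

p = {'a': 0.7, 'b': 0.2, 'c': 0.1}              # made-up single-letter pmf
H = -sum(q * log2(q) for q in p.values())
print(f"H(X) = {H:.4f}")

for n in (1, 2, 4, 6):
    block_probs = []
    for block in product(p, repeat=n):          # all M**n strings of n source letters
        q = 1.0
        for sym in block:
            q *= p[sym]                          # probabilities multiply for IID letters
        block_probs.append(q)
    H_n = -sum(q * log2(q) for q in block_probs)
    L = sum(q * ceil(-log2(q)) for q in block_probs)   # bits per *block* with ceiling lengths
    print(f"n = {n}:  H(X^n)/n = {H_n/n:.4f}   bits per source symbol = {L/n:.4f}")
```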
But the important thing is the result of doing the encoding, which is the same result we had before. Namely, this is what happens when you take an alphabet -- and the alphabet can be anything whatsoever -- and you encode that alphabet in an optimal way, according to the probabilities of each symbol within the alphabet. The result is that the entropy of this big chance variable X^n is less than or equal to the expected length of the code, whatever prefix-free code you have; and when you put the minimum on here, that minimum expected length is less than the entropy of the chance variable X^n plus 1. That's the same theorem that we proved before; there's nothing new here.

Now, if you divide this by n, this by n, and this by n, you still have a valid inequality. When you divide the left side by n, what do you get? You get H(X). When you divide the middle by n, you get, by definition, L bar min, where what we mean now is the number of bits per source symbol: we have n source symbols here, so when we divide by n, we get the number of bits per source symbol in this monster symbol. And when we divide the right side by n, we get H(X) plus 1 over n. And now the whole reason for doing this is that this silly little 1, which we were trying very hard to think of as negligible or unimportant, has suddenly become 1 over n. And by making n big enough, it truly is unimportant.

If you're thinking in terms of encoding a binary source, this 1 here is very important. In other words, when you're trying to encode a binary source, if you're encoding one letter at a time, there's nothing you can do. You're absolutely stuck. Because if both of those letters have non-zero probabilities, and you want a uniquely decodable code, and you want to find codewords for each of those two symbols, the best you can do is an expected length of 1. Namely, you need 1 to encode 1, and you need 0 to encode 0.
And there's nothing else; there's no freedom at all that you have. So you say, OK, in that situation, I really have to go to longer blocks. And when I go to longer blocks, I can get this resolved, and I know how to do it: I use Huffman's algorithm or whatever. So, suddenly, I can start to get the expected number of bits per source symbol down as close to H(X) as I want to make it. And I can't make it any smaller. Which says that H(X) now has a very clear interpretation, at least for prefix-free codes, as the number of bits per source symbol you need when you allow the possibility of encoding the source a block at a time.

We're going to find later today that the significance of this is far greater than that, even. Because why use prefix-free codes? We could use any old kind of code. When we study the Lempel-Ziv codes tomorrow, we'll find out they aren't prefix-free codes at all. They're really variable-length to variable-length codes, so they aren't fixed-to-variable length, and they do some pretty fancy and tricky things. But they're still limited to this same inequality. You can never beat the entropy bound. If you want something to be uniquely decodable, you're stuck with this bound. And we'll see why in a very straightforward way, later. It's a very straightforward way which I can guarantee all of you are going to look at and say, yes, that's obvious. And tomorrow you will look at it and say, I don't understand that at all. And the next day you'll look at it again and say, well, of course. And the day after that you'll say, I don't understand that. And you'll go back and forth like that for about two weeks. Don't be frustrated, because it is simple, but at the same time it's very sophisticated.

So, let's now review the weak law of large numbers. I usually just call it the law of large numbers. I bridle a little bit when people call it weak, because in fact it's the centerpiece of probability theory.
But there is this other thing called the strong law of large numbers, which mathematicians love because it lets them look at all kinds of mathematical minutiae. It's also important; I shouldn't play it down too much, and there are places where you truly need it. For what we'll be doing, we don't need it at all. And the weak law of large numbers does in fact hold in many places where the strong law doesn't hold. So if you know what the strong law is, temporarily forget it for the rest of the term and just focus on the weak law.

And the weak law is not terribly complicated. We have a sequence of random variables, each of them with a mean y bar and a variance sigma sub y squared. And let's assume that they're independent and identically distributed for the time being, just to avoid worrying about anything. If we look at the sum of those random variables, namely A equal to Y1 plus Y2 up to Y sub n, then the expected value of A is the expected value of Y1 plus the expected value of Y2, and so forth. So the expected value of A is n times y bar. And I think in one of the homework problems you found the variance of A. For the variance of A, the easiest thing to do is to reduce each of these variables to its fluctuation around the mean, and then look at the variance of the sum of the fluctuations, which is just the expected value of that sum squared. Because of independence, that's the expected value of the first fluctuation squared plus the expected value of the second fluctuation squared, and so forth. So the variance of A is n times sigma sub y squared. I want all of you to know that. That's sort of day two of a probability course. As soon as you start talking about random variables, that's one of the key things you do, one of the most important things you do.

The thing that we're interested in here is more the sample average of Y1 up to Y sub n. And the sample average, by definition, is the sum divided by n. So, in other words, the thing that you're interested in here is to add all of these random variables up.
Take 1 over n times it. Which is a thing we do all the time. I mean, we sum up a lot of observations, we divide by n, and we hope by doing that to get some sort of typical value. And usually there is some sort of typical value that arises from that. What the law of large numbers says is that there in fact is a typical value that arises. So this sample average is A over n, the sum divided by n, and we call that the sample average. The mean of the sample average is just the mean of A, which is n times y bar, divided by n. So the mean of the sample average is y bar itself.

The variance of the sample average -- I'm talking too fast. The sample average here: you would like to think of it as something which is known and specific, like the expected value. In fact, it is a random variable. It changes with different sample values. It can change from almost nothing to very large quantities. And what we're interested in saying is that most of the time it's close to the expected value. That's what we're aiming at here, and that's what the law of large numbers says.

The variance of the sample average is equal to the variance of A divided by n squared. In other words, we're taking the expected value of this quantity squared, so there's a 1 over n squared that comes in here. When you take the 1 over n squared here, this variance then becomes sigma -- I don't know why I have the n there; just take that n out, if you will, I don't have my red pen with me -- so it's the variance of the random variable Y, divided by n. In other words, the limit as n goes to infinity of the variance of the sum is equal to infinity, and the variance of the sample average as n goes to infinity is equal to 0. And that's because of this 1 over n squared effect here.
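A small simulation of those two statements (the mean, variance, and Gaussian distribution here are invented for illustration): the sample average S keeps its mean at y bar while its variance shrinks like sigma sub y squared over n.

```python
import random
random.seed(1)

ybar, sigma2 = 2.0, 9.0                  # made-up mean and variance of each Y_i
trials = 5000

for n in (10, 100, 1000):
    averages = []
    for _ in range(trials):
        a = sum(random.gauss(ybar, sigma2 ** 0.5) for _ in range(n))   # the sum A
        averages.append(a / n)                                         # the sample average S
    mean_S = sum(averages) / trials
    var_S = sum((s - mean_S) ** 2 for s in averages) / trials
    print(f"n = {n:4d}   mean(S) = {mean_S:.3f}   var(S) = {var_S:.4f}   sigma^2/n = {sigma2 / n:.4f}")
```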
When you add up a lot of independent random variables, what you wind up with is that the sample average has a variance which is going to 0, and the sum has a variance which is going to infinity. That's important. Aside from all of the theorems you've ever heard, this is sort of the gross, simple-minded thing which you always ought to keep foremost in your mind. This is what's happening in probability theory: when you talk about sample averages, this variance is getting small.

Let's look at a picture of this. Let's look at the distribution function of this random variable, the sample average, as a random variable. And what we're finding here is that if we look at this distribution function for some modest value of n, we get something which looks like this upper curve here, which is then the lower curve over here. It's spread out more, so it has a larger variance; namely, the sample average has a larger variance. When you make n bigger, what's happening to the variance? The variance is getting smaller -- in this picture, smaller by a factor of 1/2, so this curve is supposed to have a variance which is equal to 1/2 the variance of that one. How you find a variance in a distribution function is your problem. But you know that if something has a small variance, it's very closely tucked in around the mean. In other words, as the variance goes to 0, and the mean is y bar, you have a distribution function which approaches a unit step. And all of that just comes from this very, very simple argument that says, when you have a sum of IID random variables and you take the sample average of it -- namely, you divide by n -- the variance goes to 0. Which says, no matter how you look at it, you wind up with something that looks like a unit step.

Now, the Chebyshev inequality is one of the simpler inequalities in probability theory, and I won't prove it because it's something you've all seen. I don't know of any course in probability which avoids the Chebyshev inequality.
And what it says is, for any epsilon greater than 0, the probability that the difference between the sample average and the true mean is, in magnitude, greater than or equal to epsilon, is less than or equal to sigma sub y squared divided by n epsilon squared. Oh, incidentally, that thing that was called sigma sub n before was really sigma sub y squared -- namely, the variance of Y. I hope it's right in the notes. Might not be. It is? Good.

So, that's what this inequality says. There's an easy way to derive it on the fly; namely, if you're wondering what all these constants are, here's a way to do it. What we're looking at, in this curve here, is how much probability there is outside of these plus and minus epsilon limits. And the Chebyshev inequality says there can't be too much probability out here, and there can't be too much probability out here. So, one way to get at this is to say, OK, suppose I have some given probability out here and some given probability out here. And suppose I want to minimize the variance of a random variable which has that much probability out here and that much probability out here. How do I do it? Well, the variance deals with the square of how far you are away from the mean. So if I want to have a certain amount of probability out here, I minimize my variance by making this come straight, come up here with a little step, then go across here, go up here, and then -- oops -- go up here. I wish I had my red pencil. Does anybody have a red pen that will write on this stuff? Yes? No? Oh, great. Thank you. I will return it.
So what we want is something which goes over here, comes up here, goes across here, goes up here, goes across here, and goes up again. That's the smallest you can make the variance. It's squeezing everything in as far as it can be squeezed: namely, everything out here gets squeezed in to y bar minus epsilon, everything in the middle gets squeezed in to y bar itself, and everything out here gets squeezed in to y bar plus epsilon. OK, calculate the variance of that, and it satisfies the Chebyshev inequality with equality. So that's all the Chebyshev inequality is. And it's usually a loose inequality, because usually these curves look very nice. Usually this looks like a Gaussian distribution function -- that's what the central limit theorem says -- but we don't need the central limit theorem here, and we don't want it here, because this thing is an inequality that says life can't be any worse than this. All the central limit theorem gives you is an approximation, and then we have to worry about when it's a good approximation and when it's not.

So this says, when we carry it one piece further: for any epsilon and delta greater than 0, if we make n large enough -- in other words, substitute delta for this bound, and when you make n large enough, this quantity is smaller than delta -- then the probability that S and y bar differ by more than epsilon is less than or equal to delta. So it says, you can make this as small as you want, and you can make this as small as you want, and all you need to do is make the sequence long enough. (There's a small sketch of that arithmetic just below.)

Now, the thing which is mystifying about the law of large numbers is that you need both the epsilon and the delta. You can't get rid of either of them. In other words, you can't reduce this to 0, because it won't make any sense. This curve here tends to spread out on its tails, and therefore there's always some probability out there. You can't move epsilon to 0 because, for no finite n, do you really get a step function here.
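A sketch of that epsilon-delta arithmetic (the numbers are invented): Chebyshev gives Pr(|S - y bar| >= eps) <= sigma_y^2 / (n * eps^2), so pushing the right-hand side below delta only requires n at least sigma_y^2 / (delta * eps^2).

```python
from math import ceil

def n_needed(sigma2, eps, delta):
    """Smallest n for which the Chebyshev bound sigma2 / (n * eps**2) drops below delta."""
    return ceil(sigma2 / (delta * eps ** 2))

sigma2 = 1.0                                       # invented variance of each Y_i
for eps, delta in [(0.1, 0.1), (0.1, 0.01), (0.01, 0.01)]:
    print(f"eps = {eps}, delta = {delta}:  n >= {n_needed(sigma2, eps, delta)}")
```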
So, again, you need some wiggle room on both ends. You need wiggle room here, and you need wiggle room here. And you have to recognize that you need those two pieces of wiggle room: namely, you cannot talk about the probability that this is exactly equal to y bar, because that's usually 0, and you cannot talk about reducing this to 0 either. So both of those are needed.

Why did I go through all of this? Well, partly because it's important. But partly because I want to talk about something which is called the asymptotic equipartition property. And because of those long words, you believe this has to be very complicated. I hope to convince you that the asymptotic equipartition property is simply the weak law of large numbers applied to the log pmf. Because that, in fact, is all it is. But it says some unusual and fascinating things.

So let's suppose that X1, X2, and so forth is the output from a discrete memoryless source. Look at the log pmf of each of these. They each have the same distribution, so W(x) is going to be equal to minus the logarithm of p_X(x) for each of these chance variables. W maps source symbols into real numbers. So there's a random variable, capital W of X_j, for each one of the symbols that comes out of the source. So, for each one of these symbols, there's this log pmf random variable, which takes on different values. And the expected value of this log pmf, for the j'th symbol out of the source, is the sum over x of p_X(x) -- namely, the probability that the source takes on the value x -- times minus the logarithm of p_X(x). And that's just the entropy. So the strange thing about this log pmf random variable is that its expected value is the entropy. And W(X1), W(X2), and so forth are a sequence of IID random variables, each of which has a mean which is the entropy. So it's exactly the situation we had before. Instead of y bar, we have the entropy of X.
And instead of the random variable Y_j, we have this random variable W(X_j), which is defined by the symbol in the alphabet. And just to review this -- it's what we said before -- if little x1 is the sample value for the chance variable X1, and if little x2 is the sample value for the chance variable X2, then the outcome for W(X1) plus W(X2) -- it's very hard to keep all these little letters and capital letters straight -- is w(x1) plus w(x2), which is minus the logarithm of the probability of x1, minus the logarithm of the probability of x2, which is minus the logarithm of the product, which is minus the logarithm of the joint probability of x1 and x2, which is the sample value of the random variable W(X1 X2). So the sum here is the random variable corresponding to the log pmf of the joint output x1 x2. So w(x1 x2) is the log pmf of the event that this joint chance variable takes on the value x1 x2. And the random variable W(X1 X2) is the sum of W(X1) and W(X2).

So, what have I done here? I said this at the end of the last slide, and you won't believe me. Again, this is one of those things where tomorrow you won't believe me, and you'll have to go back and look at it. But, anyway, X1 X2 is a chance variable, and probabilities multiply and log pmf's add, which is what we've been saying for a couple of days now.

So, if I look at the sum of n of these random variables, the sum of these log pmf's is minus the logarithm of the probability of the entire sequence. That's just saying the same thing we said before for two random variables. The sample average of the log pmf is the sum of the W's divided by n, which is minus the logarithm of the probability of X^n, divided by n. The weak law of large numbers applies, and the probability that this sample average minus the expected value of W(X) is, in magnitude, greater than or equal to epsilon, is less than or equal to this quantity here.
766 00:42:47,690 --> 00:42:51,610 This quantity is minus the logarithm of the probability 767 00:42:51,610 --> 00:42:57,740 of x sub n, divided by n, minus H of x, greater than or 768 00:42:57,740 --> 00:42:59,340 equal to epsilon. 769 00:43:07,210 --> 00:43:09,450 So this is the thing that we really want. 770 00:43:15,610 --> 00:43:17,470 I'm going to spend a few slides trying to 771 00:43:17,470 --> 00:43:18,620 say what this means. 772 00:43:18,620 --> 00:43:22,170 But let's try to just look at it now, and see 773 00:43:22,170 --> 00:43:24,190 what it must mean. 774 00:43:24,190 --> 00:43:29,350 It says that with high probability, this quantity is 775 00:43:29,350 --> 00:43:32,900 almost the same as this quantity. 776 00:43:32,900 --> 00:43:35,810 It says that with high probability, the thing which 777 00:43:35,810 --> 00:43:42,630 comes out of the source is going to have a probability, a 778 00:43:42,630 --> 00:43:47,930 log probability, divided by n, which is close to the entropy. 779 00:43:47,930 --> 00:43:52,350 It says in some sense that with high probability, the 780 00:43:52,350 --> 00:43:54,240 probability of what comes out of the 781 00:43:54,240 --> 00:43:55,940 source is almost a constant. 782 00:43:59,020 --> 00:44:02,060 And that's amazing. 783 00:44:02,060 --> 00:44:04,200 That's what you'll wake up in the morning and say, I don't 784 00:44:04,200 --> 00:44:05,450 believe that. 785 00:44:07,900 --> 00:44:10,430 But it's true. 786 00:44:10,430 --> 00:44:12,870 But you have to be careful to interpret it right. 787 00:44:15,450 --> 00:44:18,710 So, we're going to define the typical set. 788 00:44:18,710 --> 00:44:22,680 Namely, this is the typical set of x's, which come out of 789 00:44:22,680 --> 00:44:23,630 the source. 790 00:44:23,630 --> 00:44:26,520 Namely, the typical set of blocks of n 791 00:44:26,520 --> 00:44:29,490 symbols out of the source. 792 00:44:29,490 --> 00:44:32,510 And when you talk about a typical set, you want 793 00:44:32,510 --> 00:44:36,180 something which includes most of the probability. 794 00:44:36,180 --> 00:44:40,560 So what I'm going to include in this typical set is all of 795 00:44:40,560 --> 00:44:43,160 these things that we talked about before. 796 00:44:43,160 --> 00:44:46,520 Namely, we showed that the probability that this quantity 797 00:44:46,520 --> 00:44:49,960 is greater than or equal to epsilon is very small. 798 00:44:49,960 --> 00:44:53,980 So, with high probability what comes out of the source 799 00:44:53,980 --> 00:44:57,030 satisfies this inequality here. 800 00:44:57,030 --> 00:45:00,840 So I can write down the distribution function of this 801 00:45:00,840 --> 00:45:02,480 random variable here. 802 00:45:02,480 --> 00:45:09,070 It's just this w -- 803 00:45:12,840 --> 00:45:14,750 this is a random variable w. 804 00:45:14,750 --> 00:45:17,550 I'm looking at the distribution of that random 805 00:45:17,550 --> 00:45:20,170 variable w. 806 00:45:20,170 --> 00:45:25,340 And this quantity in here is the probability of this 807 00:45:25,340 --> 00:45:28,260 typical set. 808 00:45:28,260 --> 00:45:31,090 In other words, when I draw this distribution function for 809 00:45:31,090 --> 00:45:34,820 this combined random variable, I've defined this typical set 810 00:45:34,820 --> 00:45:40,020 to be all the sequences which lie between this point and 811 00:45:40,020 --> 00:45:41,070 this point. 812 00:45:41,070 --> 00:45:43,690 Namely, this is H minus epsilon. 
813 00:45:43,690 --> 00:45:47,360 And this is H plus epsilon, moving H out here. 814 00:45:47,360 --> 00:45:50,580 So these are all the sequences which satisfy 815 00:45:50,580 --> 00:45:52,510 this inequality here. 816 00:45:52,510 --> 00:45:54,550 So that's what I mean by the typical set. 817 00:45:54,550 --> 00:45:59,290 It's all things which are clustered around H in this 818 00:45:59,290 --> 00:46:00,540 distribution function. 819 00:46:03,450 --> 00:46:07,320 And as n approaches infinity, this typical set approaches 820 00:46:07,320 --> 00:46:09,170 probability 1. 821 00:46:09,170 --> 00:46:11,560 In the same way that the law of large numbers 822 00:46:11,560 --> 00:46:12,420 behaves that way. 823 00:46:12,420 --> 00:46:18,090 The probability that x sub n is in this typical set is 824 00:46:18,090 --> 00:46:23,180 greater than or equal to 1 minus sigma squared divided by 825 00:46:23,180 --> 00:46:25,080 n epsilon squared. 826 00:46:30,800 --> 00:46:34,230 Let's try to express that in a bunch of other ways. 827 00:46:40,400 --> 00:46:44,880 If you're getting lost, please ask questions. 828 00:46:44,880 --> 00:46:49,800 But I hope to come back to this in a little bit, after we 829 00:46:49,800 --> 00:46:52,850 finish a little more of the story. 830 00:46:52,850 --> 00:47:03,060 So, another way of expressing this typical set -- let me 831 00:47:03,060 --> 00:47:05,760 look at that as the typical set. 832 00:47:05,760 --> 00:47:10,920 If I take this inequality here and I rewrite this, namely, 833 00:47:10,920 --> 00:47:16,190 this is the set of x's for which this is less than 834 00:47:16,190 --> 00:47:20,970 epsilon plus H of x and greater than 835 00:47:20,970 --> 00:47:23,330 H of x minus epsilon. 836 00:47:23,330 --> 00:47:26,900 So that's what I've expressed here. 837 00:47:26,900 --> 00:47:31,880 It's the set of x's for which n times H of x minus epsilon 838 00:47:31,880 --> 00:47:36,260 is less than this logarithm of probability, which is less than 839 00:47:36,260 --> 00:47:38,830 n times H of x plus epsilon. 840 00:47:38,830 --> 00:47:43,630 Namely, I'm looking at this range of epsilon around H, 841 00:47:43,630 --> 00:47:46,980 which is this and this. 842 00:47:46,980 --> 00:47:50,840 If I write it again, if I exponentiate all of this, it's 843 00:47:50,840 --> 00:47:55,810 the set of x's for which 2 to the minus n, H of x, plus 844 00:47:55,810 --> 00:47:59,740 epsilon, that's this term exponentiated, is less than 845 00:47:59,740 --> 00:48:03,170 this is less than this term exponentiated. 846 00:48:03,170 --> 00:48:05,610 And what's going on here is, I've taken care of 847 00:48:05,610 --> 00:48:08,170 the minus sign also. 848 00:48:08,170 --> 00:48:10,400 And if you can follow that in your head, you're a better 849 00:48:10,400 --> 00:48:12,400 person than I am. 850 00:48:12,400 --> 00:48:14,700 But, anyway, it works. 851 00:48:14,700 --> 00:48:16,790 And if you fiddle around with that, you'll see that that's 852 00:48:16,790 --> 00:48:17,860 what it is. 853 00:48:17,860 --> 00:48:24,300 So this typical set is a bound on the probabilities of all 854 00:48:24,300 --> 00:48:26,090 these typical sequences. 855 00:48:26,090 --> 00:48:31,980 The typical sequences all are enclosed in this range of 856 00:48:31,980 --> 00:48:33,230 probabilities. 857 00:48:35,680 --> 00:48:39,440 So the typical elements are approximately equiprobable, in 858 00:48:39,440 --> 00:48:42,130 this strange sense above. 859 00:48:42,130 --> 00:48:45,100 Why do I say this is a strange sense?
860 00:48:45,100 --> 00:48:49,690 Well, as n gets large, what happens here? 861 00:48:49,690 --> 00:48:52,810 This is 2 to the minus n times H of x. 862 00:48:52,810 --> 00:48:55,360 Which is the important part of this. 863 00:48:55,360 --> 00:48:59,820 This epsilon here is multiplied by n. 864 00:48:59,820 --> 00:49:02,400 And we're trying to say, as n gets very, very big, we can 865 00:49:02,400 --> 00:49:04,700 make epsilon very, very small. 866 00:49:04,700 --> 00:49:09,130 But we really can't make n times epsilon very negligible. 867 00:49:09,130 --> 00:49:12,410 But the point is, the important thing here is, 2 to 868 00:49:12,410 --> 00:49:15,680 the minus n times H of x. 869 00:49:15,680 --> 00:49:19,640 So, in some sense, this is close to 2 to the 870 00:49:19,640 --> 00:49:21,750 minus n H of x. 871 00:49:21,750 --> 00:49:23,140 In what sense is it true? 872 00:49:23,140 --> 00:49:28,140 Well, it's true in that sense. 873 00:49:28,140 --> 00:49:31,980 Where that, in fact, is a valid inequality. 874 00:49:31,980 --> 00:49:35,130 Namely in terms of sample averages, 875 00:49:35,130 --> 00:49:37,160 these things are close. 876 00:49:37,160 --> 00:49:40,210 When I do the exponentiation and get rid of the n and all 877 00:49:40,210 --> 00:49:43,820 that stuff, they aren't very close. 878 00:49:43,820 --> 00:49:48,760 But saying this sort of thing is sort of like saying that 10 879 00:49:48,760 --> 00:49:52,502 to the minus 23 is approximately equal to 10 to 880 00:49:52,502 --> 00:49:54,950 the minus 25. 881 00:49:54,950 --> 00:49:57,060 And they're approximately equal because they're both 882 00:49:57,060 --> 00:49:59,170 very, very small. 883 00:49:59,170 --> 00:50:02,510 And that's the kind of thing that's going on here. 884 00:50:02,510 --> 00:50:05,330 And you're trying to distinguish 10 to the minus 23 885 00:50:05,330 --> 00:50:10,540 and 10 to the minus 25 from 10 to the minus 60th and from 10 886 00:50:10,540 --> 00:50:12,950 to the minus three. 887 00:50:12,950 --> 00:50:16,500 So that's the kind of approximations we're using. 888 00:50:16,500 --> 00:50:20,040 Namely, we're using approximations on a log scale, 889 00:50:20,040 --> 00:50:23,510 instead of approximations of ordinary numbers. 890 00:50:23,510 --> 00:50:27,800 But, still, it's convenient to think of these typical x's, 891 00:50:27,800 --> 00:50:31,270 typical sequences, as being sequences which are 892 00:50:31,270 --> 00:50:33,900 constrained in probability in this way. 893 00:50:33,900 --> 00:50:37,290 And this is the thing which is easy to work with. 894 00:50:37,290 --> 00:50:41,910 The atypical set of strings, namely, the complement of this 895 00:50:41,910 --> 00:50:45,990 set, the thing we know about that is the entire set doesn't 896 00:50:45,990 --> 00:50:48,030 have much probability. 897 00:50:48,030 --> 00:50:53,080 Namely, if you fix epsilon and you let n get bigger and 898 00:50:53,080 --> 00:50:57,860 bigger, this atypical set becomes totally negligible. 899 00:50:57,860 --> 00:50:59,110 And you can ignore it. 900 00:51:02,330 --> 00:51:06,940 So let's plow ahead. 901 00:51:06,940 --> 00:51:12,220 Stop for an example pretty soon, but -- 902 00:51:12,220 --> 00:51:15,830 If I have a sequence which is in the typical set, we then 903 00:51:15,830 --> 00:51:20,400 know that its probability is greater than 2 to the minus n 904 00:51:20,400 --> 00:51:23,520 times H of x plus epsilon. 905 00:51:23,520 --> 00:51:26,150 That's what we said before.
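For reference, the equivalent ways of writing the typical set that were developed above are, in the same notation as the lecture, with $p(x^n)$ the probability of the whole n-tuple:

$$
T_\epsilon^n
= \Bigl\{x^n : \bigl|-\tfrac{1}{n}\log_2 p(x^n) - H(X)\bigr| < \epsilon\Bigr\}
= \bigl\{x^n : n(H(X)-\epsilon) < -\log_2 p(x^n) < n(H(X)+\epsilon)\bigr\}
= \bigl\{x^n : 2^{-n(H(X)+\epsilon)} < p(x^n) < 2^{-n(H(X)-\epsilon)}\bigr\}.
$$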
906 00:51:26,150 --> 00:51:29,330 And, therefore, when I use this inequality, the 907 00:51:29,330 --> 00:51:34,900 probability of x to the n, for something in the typical set, 908 00:51:34,900 --> 00:51:37,940 is greater than this quantity here. 909 00:51:37,940 --> 00:51:47,950 In other words, this is greater than that. 910 00:51:47,950 --> 00:51:50,420 For everything in a typical set. 911 00:51:50,420 --> 00:51:53,640 So now I'm adding over things in a typical set. 912 00:51:53,640 --> 00:51:56,170 So I need to include the number of things 913 00:51:56,170 --> 00:51:57,590 in a typical set. 914 00:51:57,590 --> 00:52:01,190 So what I have is this sum. 915 00:52:01,190 --> 00:52:02,470 And what is this sum? 916 00:52:02,470 --> 00:52:06,000 This is the probability of the typical set. 917 00:52:06,000 --> 00:52:08,960 Because I'm adding over all elements in the typical set. 918 00:52:08,960 --> 00:52:11,880 And it's greater than or equal to the number of elements in a 919 00:52:11,880 --> 00:52:15,660 typical set times these small probabilities. 920 00:52:15,660 --> 00:52:19,230 If I turn this around, it says that the number of elements in 921 00:52:19,230 --> 00:52:22,460 a typical set is less than 2 to the n 922 00:52:22,460 --> 00:52:25,820 times H of x plus epsilon. 923 00:52:25,820 --> 00:52:30,000 For any epsilon, no matter how small I want to make it. 924 00:52:30,000 --> 00:52:33,710 Which says that the elements in a typical set have 925 00:52:33,710 --> 00:52:38,200 probabilities which are about 2 to the minus n times H of x. 926 00:52:38,200 --> 00:52:41,480 And the number of them is approximately 2 to the 927 00:52:41,480 --> 00:52:44,110 n times H of x. 928 00:52:44,110 --> 00:52:47,910 In other words, what it says is that this typical set is a 929 00:52:47,910 --> 00:52:53,900 bunch of essentially uniform probabilities. 930 00:52:53,900 --> 00:52:58,550 So what I've done is to take this very complicated source. 931 00:52:58,550 --> 00:53:05,360 And when I look at these very humongous chance variables, 932 00:53:05,360 --> 00:53:10,670 which are very large sequences out of the source, what I find 933 00:53:10,670 --> 00:53:14,510 is that there's a bunch of things which collectively have 934 00:53:14,510 --> 00:53:16,410 zilch probability. 935 00:53:16,410 --> 00:53:18,980 There's a bunch of other things which all have equal 936 00:53:18,980 --> 00:53:20,090 probability. 937 00:53:20,090 --> 00:53:24,650 And the number of them is enough to add up to 1. 938 00:53:24,650 --> 00:53:28,820 So I have turned this source, when I look at it over a long 939 00:53:28,820 --> 00:53:38,080 enough sequence, into a source of equiprobable events. 940 00:53:38,080 --> 00:53:41,470 And each of those events has this probability here. 941 00:53:41,470 --> 00:53:46,540 Now, we know how to encode equiprobable events. 942 00:53:46,540 --> 00:53:48,140 And that's the whole point of this. 943 00:53:50,770 --> 00:53:55,820 So, this is less than or equal to that. 944 00:53:55,820 --> 00:53:59,000 On the other side, we know that 1 minus delta is less 945 00:53:59,000 --> 00:54:04,970 than or equal to this probability of a typical set. 946 00:54:04,970 --> 00:54:09,590 And this is less than the number of elements in a 947 00:54:09,590 --> 00:54:13,860 typical set times 2 to the minus n times H of x minus epsilon. 948 00:54:13,860 --> 00:54:16,320 This is an upper bound on this. 949 00:54:16,320 --> 00:54:24,240 This is less than this.
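Written out, the two counting bounds being put together here are, in the lecture's notation, with $T_\epsilon^n$ the typical set:

$$1 \;\ge\; \Pr\{T_\epsilon^n\} \;=\; \sum_{x^n \in T_\epsilon^n} p(x^n) \;\ge\; |T_\epsilon^n|\, 2^{-n(H(X)+\epsilon)} \quad\Longrightarrow\quad |T_\epsilon^n| \;\le\; 2^{\,n(H(X)+\epsilon)},$$

$$1-\delta \;\le\; \Pr\{T_\epsilon^n\} \;\le\; |T_\epsilon^n|\, 2^{-n(H(X)-\epsilon)} \quad\Longrightarrow\quad |T_\epsilon^n| \;\ge\; (1-\delta)\, 2^{\,n(H(X)-\epsilon)}.$$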
950 00:54:27,600 --> 00:54:30,570 So I just add all these things up and I get this bound. 951 00:54:30,570 --> 00:54:34,200 So it says, the size of the typical set is greater than 1 952 00:54:34,200 --> 00:54:37,360 minus delta, times this quantity. 953 00:54:37,360 --> 00:54:41,520 In other words, this is a pretty exact sort of thing. 954 00:54:41,520 --> 00:54:44,870 If you don't mind dealing with this 2 to the n 955 00:54:44,870 --> 00:54:47,270 epsilon factor here. 956 00:54:47,270 --> 00:54:50,150 If you agree that that's negligible in some strange 957 00:54:50,150 --> 00:54:53,860 sense, then all of this makes good sense. 958 00:54:53,860 --> 00:54:57,760 And if it is negligible, let me start talking about source 959 00:54:57,760 --> 00:55:01,420 coding, which is why this all works out. 960 00:55:01,420 --> 00:55:05,460 So the summary is that the probability of the complement 961 00:55:05,460 --> 00:55:10,650 of the typical set is essentially 0. 962 00:55:10,650 --> 00:55:14,340 The number of elements in a typical set is approximately 2 963 00:55:14,340 --> 00:55:16,130 to the n times h of x. 964 00:55:16,130 --> 00:55:18,610 I'm getting rid of all the deltas and epsilons here, to 965 00:55:18,610 --> 00:55:22,380 get sort of the broad view of what's important here. 966 00:55:22,380 --> 00:55:25,650 Each of the elements in a typical set has probability 2 967 00:55:25,650 --> 00:55:28,170 to the minus n times H of x. 968 00:55:28,170 --> 00:55:32,175 So I've turned a source into a source of 969 00:55:32,175 --> 00:55:34,230 equiprobable elements. 970 00:55:34,230 --> 00:55:37,070 And there are 2 to the n times h of x of them. 971 00:55:43,100 --> 00:55:46,320 Let's do an example of this. 972 00:55:46,320 --> 00:55:48,890 It's an example that you'll work on more in the homework 973 00:55:48,890 --> 00:55:52,810 and do it a little more cleanly. 974 00:55:52,810 --> 00:55:57,120 Let's look at a binary discrete memoryless source, 975 00:55:57,120 --> 00:56:02,310 where the probability that x is equal to 1 is p, which is 976 00:56:02,310 --> 00:56:03,920 less than 1/2. 977 00:56:03,920 --> 00:56:07,070 And the probability of 0 is greater than 1/2. 978 00:56:07,070 --> 00:56:12,640 So, this is what you get when you have a biased coin. 979 00:56:12,640 --> 00:56:17,420 And the biased coin has a 1 on one side and a 0 980 00:56:17,420 --> 00:56:19,340 on the other side. 981 00:56:19,340 --> 00:56:23,070 And it's more likely to come up 0's than 1's. 982 00:56:23,070 --> 00:56:26,080 I always used to wonder how to make a biased coin. 983 00:56:26,080 --> 00:56:28,240 And I can give you a little experiment which shows you you 984 00:56:28,240 --> 00:56:30,400 can make a biased coin. 985 00:56:30,400 --> 00:56:34,140 I mean, a coin is a little round thing which is flat on 986 00:56:34,140 --> 00:56:35,840 the top and bottom. 987 00:56:35,840 --> 00:56:40,070 Suppose instead of that you make a triangular coin. 988 00:56:40,070 --> 00:56:43,140 And instead of making it flat on top and bottom, you turn it 989 00:56:43,140 --> 00:56:45,800 into a tetrahedron. 990 00:56:45,800 --> 00:56:50,630 So in fact, what this is now is a coin which is built up on 991 00:56:50,630 --> 00:56:54,090 one side into a very massive thing. 992 00:56:54,090 --> 00:56:57,070 And is flat on the other side.
993 00:56:57,070 --> 00:56:59,700 Since it's a tetrahedron and it's an equilateral 994 00:56:59,700 --> 00:57:04,730 tetrahedron, the probability of 1 is going to be 1/4, and 995 00:57:04,730 --> 00:57:07,850 the probability of 0 is going to be 3/4. 996 00:57:07,850 --> 00:57:10,760 So you can make biased coins. 997 00:57:10,760 --> 00:57:12,760 So when you get into coin-tossing games with 998 00:57:12,760 --> 00:57:15,045 people, watch the coin that they're using. 999 00:57:15,045 --> 00:57:19,120 It probably won't be a tetrahedron, but anyway. 1000 00:57:21,820 --> 00:57:28,520 So the entropy here, the log pmf random variable, takes on 1001 00:57:28,520 --> 00:57:32,300 the value of minus log p with probability p. 1002 00:57:32,300 --> 00:57:35,950 And it takes on the value minus log 1 minus p, with 1003 00:57:35,950 --> 00:57:37,490 probability 1 minus p. 1004 00:57:37,490 --> 00:57:40,080 This is a probability of a 1. 1005 00:57:40,080 --> 00:57:42,700 This is a probability of a 0. 1006 00:57:42,700 --> 00:57:46,270 So, the entropy is equal to this. 1007 00:57:46,270 --> 00:57:48,980 Used to be that in information theory courses, people would 1008 00:57:48,980 --> 00:57:52,050 almost memorize what this curve looked like. 1009 00:57:52,050 --> 00:57:53,250 And they'd draw pictures of it. 1010 00:57:53,250 --> 00:57:56,140 There were famous curves of this function, 1011 00:57:56,140 --> 00:57:58,950 which looks like this. 1012 00:58:07,280 --> 00:58:17,620 0, 1, 1. 1013 00:58:17,620 --> 00:58:20,800 Turns out, that's not all that important a distribution. 1014 00:58:20,800 --> 00:58:24,510 It's a nice example to talk about. 1015 00:58:24,510 --> 00:58:28,400 The typical set, t epsilon n, is the set of strings with 1016 00:58:28,400 --> 00:58:34,710 about p n1's and about 1 minus p times n 0's. 1017 00:58:34,710 --> 00:58:38,770 In other words, that's the typical thing to happen. 1018 00:58:38,770 --> 00:58:41,900 And it's the typical thing in terms of this law of large 1019 00:58:41,900 --> 00:58:42,690 numbers here. 1020 00:58:42,690 --> 00:58:46,520 Because you get 1's with probability p. 1021 00:58:46,520 --> 00:58:48,700 And therefore in a long sequence, you're going to get 1022 00:58:48,700 --> 00:58:53,190 about pn 1's and 1 minus p 0's. 1023 00:58:53,190 --> 00:58:58,520 The probability of a typical string is, if you get a string 1024 00:58:58,520 --> 00:59:01,940 with this many 1's and this many 0's, it's 1025 00:59:01,940 --> 00:59:04,500 probability is p. 1026 00:59:04,500 --> 00:59:08,280 Namely, the probability of a 1 times the number of 1's you 1027 00:59:08,280 --> 00:59:10,610 get, which is pn. 1028 00:59:10,610 --> 00:59:13,420 Times the probability of a 0, times the 1029 00:59:13,420 --> 00:59:16,210 number of 0's you get. 1030 00:59:16,210 --> 00:59:19,170 And if you look at what this is, if you take p up in the 1031 00:59:19,170 --> 00:59:22,850 exponent and 1 minus the p up in the exponent, this becomes 1032 00:59:22,850 --> 00:59:27,700 2 to the minus n times h of x, just like what it should be. 1033 00:59:27,700 --> 00:59:31,780 So these typical strings, with about pn 1's and 1 minus pn 1034 00:59:31,780 --> 00:59:34,720 0's, are in fact typical in the sense we've 1035 00:59:34,720 --> 00:59:36,560 been talking about. 1036 00:59:36,560 --> 00:59:43,100 The number of n strings with pn 1's is n factorial divided 1037 00:59:43,100 --> 00:59:47,760 by pn factorial divided by n times 1 minus p factorial. 
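In symbols, the per-string probability and the count just described are

$$p^{pn}(1-p)^{(1-p)n} = 2^{-nH(X)}, \qquad \#\{\text{strings with } pn \text{ 1's}\} = \binom{n}{pn} = \frac{n!}{(pn)!\,\bigl((1-p)n\bigr)!},$$

where $H(X) = -p\log_2 p - (1-p)\log_2(1-p)$ is the binary entropy just written out, and, to first order in the exponent, $\binom{n}{pn}$ is about $2^{nH(X)}$, consistent with the general statement that the typical set has roughly $2^{nH(X)}$ elements.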
1038 00:59:52,070 --> 00:59:54,960 I mean I hope you learned that a long time ago, but you 1039 00:59:54,960 --> 00:59:56,910 should learn it in probability anyway. 1040 00:59:56,910 --> 01:00:01,260 It's just very simple combinatorics. 1041 01:00:01,260 --> 01:00:04,270 So you have that many different strings. 1042 01:00:04,270 --> 01:00:07,430 So what I'm trying to get across here is, there are a 1043 01:00:07,430 --> 01:00:10,580 bunch of different things going on here. 1044 01:00:10,580 --> 01:00:13,600 We can talk about the random variable which is the number 1045 01:00:13,600 --> 01:00:16,990 of 1's that occur in this long sequence. 1046 01:00:16,990 --> 01:00:20,460 And with high probability, the number of 1's that occur is 1047 01:00:20,460 --> 01:00:22,970 close to pn. 1048 01:00:22,970 --> 01:00:26,470 But if pn 1's occur, there's still an awful lot of 1049 01:00:26,470 --> 01:00:28,400 randomness left. 1050 01:00:28,400 --> 01:00:33,310 Because we have to worry about where those pn 1's appear. 1051 01:00:33,310 --> 01:00:36,140 And those are the sequences we're talking about. 1052 01:00:36,140 --> 01:00:41,520 So, there are this many sequences, all of which have 1053 01:00:41,520 --> 01:00:44,940 that many 1's in them. 1054 01:00:44,940 --> 01:00:48,850 And there's a similar number of sequences for all similar 1055 01:00:48,850 --> 01:00:50,160 numbers of 1's. 1056 01:00:50,160 --> 01:00:54,510 Namely, if you take pn plus 1 and pn plus 2, pn minus 1, pn 1057 01:00:54,510 --> 01:00:57,780 minus 2, you get similar numbers here. 1058 01:00:57,780 --> 01:01:00,890 So those are the typical sequences. 1059 01:01:00,890 --> 01:01:03,980 Now, the important thing to observe here is that you 1060 01:01:03,980 --> 01:01:08,890 really have 2 to the n binary strings altogether. 1061 01:01:08,890 --> 01:01:13,270 And what this result is saying is that collectively those 1062 01:01:13,270 --> 01:01:14,490 don't make any difference. 1063 01:01:14,490 --> 01:01:17,820 The law of large numbers says, OK, there's just a humongous 1064 01:01:17,820 --> 01:01:20,080 number of strings. 1065 01:01:20,080 --> 01:01:23,780 You get the largest number strings which have about half 1066 01:01:23,780 --> 01:01:25,510 1's and half 0's. 1067 01:01:25,510 --> 01:01:29,100 But their probability is zilch. 1068 01:01:29,100 --> 01:01:32,540 So the thing which is probable is getting pn 1's 1069 01:01:32,540 --> 01:01:34,750 and 1 minus pn 0's. 1070 01:01:34,750 --> 01:01:37,290 Now, we have this typical set. 1071 01:01:37,290 --> 01:01:41,410 What is the most likely sequence of all, in this 1072 01:01:41,410 --> 01:01:42,660 experiment? 1073 01:01:45,450 --> 01:01:48,130 How do I maximize the probability of 1074 01:01:48,130 --> 01:01:49,620 a particular sequence? 1075 01:01:49,620 --> 01:02:03,910 The probability of the sequence is p to the i times 1 1076 01:02:03,910 --> 01:02:07,420 minus p to the n minus i. 1077 01:02:07,420 --> 01:02:11,050 And 1 minus p is the probability of 0. 1078 01:02:11,050 --> 01:02:14,240 And p is the probability of a 1. 1079 01:02:14,240 --> 01:02:15,970 How do I choose i to maximize this? 1080 01:02:15,970 --> 01:02:16,300 Yeah? 1081 01:02:16,300 --> 01:02:18,150 AUDIENCE: [UNINTELLIGIBLE] all 0's. 1082 01:02:18,150 --> 01:02:19,540 PROFESSOR: You make them all 0's. 1083 01:02:19,540 --> 01:02:23,750 So the most likely sequence is all 0's. 1084 01:02:23,750 --> 01:02:25,860 But that's not a typical sequence. 1085 01:02:29,700 --> 01:02:33,290 Why isn't it a typical sequence? 
1086 01:02:33,290 --> 01:02:36,060 Because we chose to define typical sequence in a 1087 01:02:36,060 --> 01:02:37,880 different way. 1088 01:02:37,880 --> 01:02:41,180 Namely, there is only one of those, and there are only n of them 1089 01:02:41,180 --> 01:02:43,650 with only a single 1. 1090 01:02:43,650 --> 01:02:46,920 So, in other words, what's going on is that we have an 1091 01:02:46,920 --> 01:02:49,640 enormous number of sequences which have around half 1092 01:02:49,640 --> 01:02:50,890 1's and half 0's. 1093 01:02:53,430 --> 01:02:55,240 But they don't have any probability. 1094 01:02:55,240 --> 01:02:57,840 And collectively they don't have any probability. 1095 01:02:57,840 --> 01:03:01,380 We have a very small number of sequences which have a very 1096 01:03:01,380 --> 01:03:03,750 large number of 0's. 1097 01:03:03,750 --> 01:03:07,960 But there aren't enough of those to make any difference. 1098 01:03:07,960 --> 01:03:10,750 And, therefore, the things that make a difference are 1099 01:03:10,750 --> 01:03:14,710 these typical things which have about pn 1's 1100 01:03:14,710 --> 01:03:18,270 and 1 minus p times n 0's. 1101 01:03:18,270 --> 01:03:20,680 And that all sounds very strange. 1102 01:03:20,680 --> 01:03:22,800 But if I phrase this a different way, you would all 1103 01:03:22,800 --> 01:03:27,470 say that's exactly the way you ought to do things. 1104 01:03:27,470 --> 01:03:32,210 Because, in fact, when we look at very, very long sequences, 1105 01:03:32,210 --> 01:03:35,175 you know with extraordinarily high probability what's going 1106 01:03:35,175 --> 01:03:39,050 to come out of the source is something with about pn 1's 1107 01:03:39,050 --> 01:03:42,430 and about 1 minus p times n 0's. 1108 01:03:42,430 --> 01:03:46,410 So that's the likely set of things to have happen. 1109 01:03:46,410 --> 01:03:47,590 And it's just that there are an enormous 1110 01:03:47,590 --> 01:03:49,200 number of those things. 1111 01:03:49,200 --> 01:03:51,890 There are this many of them. 1112 01:03:51,890 --> 01:03:56,150 So, here what we're dealing with is a balance between the 1113 01:03:56,150 --> 01:04:01,090 number of elements of a particular type, and the 1114 01:04:01,090 --> 01:04:03,520 probability of them. 1115 01:04:03,520 --> 01:04:07,030 And it turns out that this number and its probability 1116 01:04:07,030 --> 01:04:10,650 balance out to say that usually what you get is about 1117 01:04:10,650 --> 01:04:13,780 pn 1's and 1 minus p times n 0's. 1118 01:04:13,780 --> 01:04:16,730 Which is what the law of large numbers said to begin with. 1119 01:04:16,730 --> 01:04:20,300 All we're doing is interpreting that here. 1120 01:04:20,300 --> 01:04:25,210 But the thing that you see from this example is, all of 1121 01:04:25,210 --> 01:04:28,680 these things with exactly pn 1's in them, assuming that pn 1122 01:04:28,680 --> 01:04:31,270 is an integer, are all equiprobable. 1123 01:04:31,270 --> 01:04:34,940 They're all exactly equiprobable. 1124 01:04:34,940 --> 01:04:37,990 So what we're doing when we're talking about this typical 1125 01:04:37,990 --> 01:04:42,140 set, is first throwing out all the things which have too many 1126 01:04:42,140 --> 01:04:44,570 1's or too few 1's in them. 1127 01:04:44,570 --> 01:04:48,560 We're keeping only the ones which are typical in the sense 1128 01:04:48,560 --> 01:04:50,920 that they obey the law of large numbers.
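A small numeric check may help here. This sketch is not from the lecture; n = 100 and p = 0.25 are made-up values. It compares the single most likely string (all 0's) with the strings that have exactly pn 1's:

```python
from math import comb, log2

# Made-up numbers for the biased-coin source: n and p are chosen only for illustration.
n, p = 100, 0.25
H = -p * log2(p) - (1 - p) * log2(1 - p)       # binary entropy, about 0.811 bits/symbol

k = round(p * n)                               # the "typical" number of 1's, here 25
count_typical = comb(n, k)                     # number of strings with exactly pn 1's
prob_each = p**k * (1 - p)**(n - k)            # probability of each such string, here exactly 2**(-n*H)
prob_all_zeros = (1 - p)**n                    # the single most likely string

print("n*H                  =", round(n * H, 1))
print("log2(count_typical)  =", round(log2(count_typical), 1))
print("log2(prob_each)      =", round(log2(prob_each), 1))       # about -n*H
print("log2(prob_all_zeros) =", round(log2(prob_all_zeros), 1))  # larger, but there is only one such string
print("P(exactly p*n 1's)   =", round(count_typical * prob_each, 3))
```

Strings with close to pn 1's collectively carry nearly all of the probability, while the all-0's string, though individually the most likely, contributes essentially nothing; that is the balance being described above.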
1129 01:04:50,920 --> 01:04:54,100 And in this case, they obey the law of large numbers for 1130 01:04:54,100 --> 01:04:56,730 log pmf's also. 1131 01:04:56,730 --> 01:05:01,770 And then all of those things are about equally probable. 1132 01:05:01,770 --> 01:05:05,460 So the idea in source coding is, one of the ways to deal 1133 01:05:05,460 --> 01:05:10,430 with source coding is, you want to assign codewords to 1134 01:05:10,430 --> 01:05:13,570 only these typical things. 1135 01:05:13,570 --> 01:05:16,240 Now, maybe you might want to assign codewords to something 1136 01:05:16,240 --> 01:05:17,870 like all 0's also. 1137 01:05:17,870 --> 01:05:20,570 Because it hardly costs anything. 1138 01:05:20,570 --> 01:05:23,810 And a Huffman code would certainly do that. 1139 01:05:23,810 --> 01:05:27,310 But it's not very important whether you do or not. 1140 01:05:27,310 --> 01:05:30,300 The important thing is, you assign codewords to all of 1141 01:05:30,300 --> 01:05:31,910 these typical sequences. 1142 01:05:37,770 --> 01:05:41,280 So let's go back to fixed-to-fixed 1143 01:05:41,280 --> 01:05:42,660 length source codes. 1144 01:05:42,660 --> 01:05:45,500 We talked a little bit about fixed-to-fixed length source 1145 01:05:45,500 --> 01:05:46,940 codes before. 1146 01:05:46,940 --> 01:05:48,980 Do you remember what we did with fixed-to-fixed length 1147 01:05:48,980 --> 01:05:50,720 source codes before? 1148 01:05:50,720 --> 01:05:53,520 We said we have an alphabet of size m. 1149 01:05:53,520 --> 01:05:56,250 We want something which is uniquely decodable. 1150 01:05:56,250 --> 01:05:59,020 And since we want something which is uniquely decodable, 1151 01:05:59,020 --> 01:06:02,510 we have to provide codewords for everything. 1152 01:06:02,510 --> 01:06:07,780 And, therefore, if we want to choose a block length of n, 1153 01:06:07,780 --> 01:06:11,730 we've got to generate m to the n codewords. 1154 01:06:11,730 --> 01:06:14,700 Here we say, wow, maybe we don't have to provide 1155 01:06:14,700 --> 01:06:17,250 codewords for everything. 1156 01:06:17,250 --> 01:06:20,520 Maybe we're willing to tolerate a certain small 1157 01:06:20,520 --> 01:06:23,070 probability that the whole thing fails and 1158 01:06:23,070 --> 01:06:24,320 falls on its face. 1159 01:06:27,040 --> 01:06:30,280 Now, does that make any sense? 1160 01:06:30,280 --> 01:06:32,330 Well, view things the following way. 1161 01:06:32,330 --> 01:06:36,090 We said, when we started out all of this, that we were 1162 01:06:36,090 --> 01:06:38,880 going to look at prefix-free codes. 1163 01:06:38,880 --> 01:06:42,640 Where some codewords had a longer length and some 1164 01:06:42,640 --> 01:06:44,730 codewords had a shorter length. 1165 01:06:44,730 --> 01:06:48,040 And we were thinking of encoding either single letters 1166 01:06:48,040 --> 01:06:52,340 at a time, or a small block of letters at a time. 1167 01:06:52,340 --> 01:06:55,960 So think of encoding, say, 10 letters at a time. 1168 01:06:55,960 --> 01:07:02,250 And think of doing this for 10 to the 20th letters. 1169 01:07:02,250 --> 01:07:05,740 So you have the source here which is pumping out letters 1170 01:07:05,740 --> 01:07:08,280 at a regular rate. 1171 01:07:08,280 --> 01:07:12,540 You're blocking them into n letters at a time. 1172 01:07:12,540 --> 01:07:15,540 You're encoding in a prefix-free code. 1173 01:07:15,540 --> 01:07:17,790 Out comes something. 1174 01:07:17,790 --> 01:07:22,560 What comes out is not coming out at a regular rate.
1175 01:07:22,560 --> 01:07:25,670 What is coming out, sometimes you get a lot of bits out. 1176 01:07:25,670 --> 01:07:28,450 Sometimes a small number of bits out. 1177 01:07:28,450 --> 01:07:30,730 So, in other words, if you want to send things over a 1178 01:07:30,730 --> 01:07:34,970 channel, you need a buffer there to save things. 1179 01:07:34,970 --> 01:07:39,000 If, in fact, we decide that the expected number of bits 1180 01:07:39,000 --> 01:07:43,960 per source letter is, say, five bits per source letter, 1181 01:07:43,960 --> 01:07:48,540 then we expect over a very long time to be producing five 1182 01:07:48,540 --> 01:07:50,830 bits per source letter. 1183 01:07:50,830 --> 01:07:54,460 And if we turn our channel on for one year, to transmit all 1184 01:07:54,460 --> 01:07:59,010 of these things, what's going to happen is this very 1185 01:07:59,010 --> 01:08:02,080 unlikely sequence occurs. 1186 01:08:02,080 --> 01:08:05,910 Which in fact requires not one year to transmit, but two 1187 01:08:05,910 --> 01:08:09,520 years to transmit. 1188 01:08:09,520 --> 01:08:13,150 In fact, what do we do if it takes one year and five 1189 01:08:13,150 --> 01:08:18,140 minutes to transmit instead of one year? 1190 01:08:18,140 --> 01:08:19,050 Well, we've got a failure. 1191 01:08:19,050 --> 01:08:22,520 Somehow or other, the network is going to fail us. 1192 01:08:22,520 --> 01:08:25,350 I mean we all know that networks fail all the time 1193 01:08:25,350 --> 01:08:28,530 despite what engineers say. 1194 01:08:28,530 --> 01:08:32,120 I mean, all of us who use networks know that they do 1195 01:08:32,120 --> 01:08:33,820 crazy things. 1196 01:08:33,820 --> 01:08:36,590 And one of those crazy things is that unusual things 1197 01:08:36,590 --> 01:08:38,270 sometimes happen. 1198 01:08:38,270 --> 01:08:42,640 So, we develop this very nice theory of prefix-free codes. 1199 01:08:42,640 --> 01:08:46,580 But prefix-free codes, in fact, fail also. 1200 01:08:46,580 --> 01:08:50,880 And they fail also because buffers overflow. 1201 01:08:50,880 --> 01:08:54,160 In other words, we are counting on encoding things 1202 01:08:54,160 --> 01:08:58,020 with a certain number of bits per source symbol. 1203 01:08:58,020 --> 01:09:00,770 And if these unusual things occur, and we have too many 1204 01:09:00,770 --> 01:09:04,780 bits per source symbol, then we fail. 1205 01:09:04,780 --> 01:09:08,960 So the idea that we're trying to get at now is that 1206 01:09:08,960 --> 01:09:13,560 prefix-free codes and fixed-to-fixed length source 1207 01:09:13,560 --> 01:09:16,640 codes which only encode typical things. 1208 01:09:16,640 --> 01:09:20,710 In fact, are sort of the same if you look at them over a 1209 01:09:20,710 --> 01:09:22,860 very, very large sequence length. 1210 01:09:22,860 --> 01:09:26,980 In other words, if you look at a prefix-free code which is 1211 01:09:26,980 --> 01:09:31,190 dealing with blocks of 10 letters, and you look at a 1212 01:09:31,190 --> 01:09:34,120 fixed-to-fixed length code which is only dealing with 1213 01:09:34,120 --> 01:09:39,320 typical things but is looking at a length of 10 to the 20th, 1214 01:09:39,320 --> 01:09:43,570 then over that length of 10 to the 20th, your variable length 1215 01:09:43,570 --> 01:09:47,020 code is going to have a bunch of things which are about the 1216 01:09:47,020 --> 01:09:48,630 length they ought to be. 1217 01:09:48,630 --> 01:09:50,970 And a bunch of other things which are 1218 01:09:50,970 --> 01:09:53,090 extraordinarily long. 
1219 01:09:53,090 --> 01:09:56,360 The bunch of things which are extraordinarily long are 1220 01:09:56,360 --> 01:09:59,910 extraordinarily unpopular, but there are an extraordinarily 1221 01:09:59,910 --> 01:10:02,020 large number of them. 1222 01:10:02,020 --> 01:10:05,760 Just like with a fixed-to-fixed length code, 1223 01:10:05,760 --> 01:10:07,700 you are going to fail. 1224 01:10:07,700 --> 01:10:10,200 And you're going to fail on an extraordinary number of 1225 01:10:10,200 --> 01:10:12,500 different sequences. 1226 01:10:12,500 --> 01:10:15,290 But, collectively, that set of sequences don't have any 1227 01:10:15,290 --> 01:10:17,850 probability. 1228 01:10:17,850 --> 01:10:20,720 So the point that I'm trying to get across is that, really, 1229 01:10:20,720 --> 01:10:24,020 these two situations come together when we look very 1230 01:10:24,020 --> 01:10:25,630 long lengths. 1231 01:10:25,630 --> 01:10:30,030 Namely, prefix-free codes are just a way of generating codes 1232 01:10:30,030 --> 01:10:33,260 that work for typical sequences and over a very 1233 01:10:33,260 --> 01:10:37,390 large, long period of time, will generate about the right 1234 01:10:37,390 --> 01:10:40,550 number of symbols. 1235 01:10:40,550 --> 01:10:42,420 And that's what I'm trying to get at here. 1236 01:10:42,420 --> 01:10:45,980 Or what I'm trying to get at in the next slide. 1237 01:10:45,980 --> 01:10:50,650 So the fixed-to-fixed length source code, I'm going to pick 1238 01:10:50,650 --> 01:10:52,860 some epsilon and some delta. 1239 01:10:52,860 --> 01:10:55,770 Namely, that epsilon and delta which appeared in the law of 1240 01:10:55,770 --> 01:10:58,280 large numbers. 1241 01:10:58,280 --> 01:11:01,400 I'm going to make n as big as I have to make it for that 1242 01:11:01,400 --> 01:11:03,220 epsilon and that delta. 1243 01:11:03,220 --> 01:11:07,120 And calculate how large it has to be, but we won't. 1244 01:11:07,120 --> 01:11:12,150 Then I'm going to assign fixed length codewords to each 1245 01:11:12,150 --> 01:11:15,390 sequence in the typical set. 1246 01:11:15,390 --> 01:11:16,490 Now, am I going to really build 1247 01:11:16,490 --> 01:11:18,410 something which does this? 1248 01:11:18,410 --> 01:11:20,210 Of course not. 1249 01:11:20,210 --> 01:11:23,140 I mean, I'm talking about truly humongous lengths. 1250 01:11:23,140 --> 01:11:25,620 So, this is really a conceptual tool to understand 1251 01:11:25,620 --> 01:11:27,070 what's going on. 1252 01:11:27,070 --> 01:11:30,100 It's not something we're going to implement. 1253 01:11:30,100 --> 01:11:32,490 So I'm going to assign codewords to all 1254 01:11:32,490 --> 01:11:34,910 these typical elements. 1255 01:11:34,910 --> 01:11:40,900 And then what I find is that since the typical set, since 1256 01:11:40,900 --> 01:11:44,730 the number of elements in it is less than 2 to the n times 1257 01:11:44,730 --> 01:11:51,200 H of x plus epsilon, if I choose L bar, namely, the 1258 01:11:51,200 --> 01:11:56,980 number of bits I'm going to use for encoding these things, 1259 01:11:56,980 --> 01:12:00,470 it's going to have to be H of x plus epsilon in length. 1260 01:12:00,470 --> 01:12:02,190 Because I need to provide codewords for 1261 01:12:02,190 --> 01:12:05,600 each of these things. 1262 01:12:05,600 --> 01:12:08,930 And it needs to be an extra 1 over n because of this integer 1263 01:12:08,930 --> 01:12:11,460 constraint that we've been dealing with all along, which 1264 01:12:11,460 --> 01:12:14,120 doesn't make any difference. 
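As a toy illustration of this kind of encoder, here is a minimal sketch. It is not something from the lecture; the alphabet, probabilities, n, and epsilon below are made up, and the all-zeros codeword is reserved to signal failure on atypical n-tuples:

```python
import itertools, math

# Hypothetical 3-letter discrete memoryless source, purely for illustration.
probs = {'a': 0.7, 'b': 0.2, 'c': 0.1}
H = -sum(q * math.log2(q) for q in probs.values())
n, eps = 12, 0.15

def is_typical(seq):
    # epsilon-typical test: | -(1/n) log2 p(seq) - H | < eps
    logp = sum(math.log2(probs[s]) for s in seq)
    return abs(-logp / n - H) < eps

typical = [seq for seq in itertools.product(probs, repeat=n) if is_typical(seq)]
L = math.ceil(math.log2(len(typical) + 1))      # fixed codeword length in bits; index 0 is reserved for failure
index = {seq: i + 1 for i, seq in enumerate(typical)}

def encode(seq):
    return format(index.get(seq, 0), f'0{L}b')  # every atypical n-tuple maps to the same failure codeword

print(round(L / n, 3), "bits per source symbol, versus entropy", round(H, 3))
```

With a toy n like this, L/n is still well above H(X); the point of the slide is that as n grows, the rate can be brought down to about H(X) plus epsilon plus 1/n while the failure probability stays below delta.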
1265 01:12:14,120 --> 01:12:17,830 So if I choose L bar, that big, in other words, if I make 1266 01:12:17,830 --> 01:12:21,670 it just a little bit bigger than the entropy, the 1267 01:12:21,670 --> 01:12:23,790 probability of failure is going to be less 1268 01:12:23,790 --> 01:12:25,640 than or equal to delta. 1269 01:12:25,640 --> 01:12:27,910 And I can make delta -- and I can make the probability of 1270 01:12:27,910 --> 01:12:30,110 failure as small as I want. 1271 01:12:30,110 --> 01:12:32,960 So I can make this epsilon here which is the extra bits 1272 01:12:32,960 --> 01:12:36,710 per source symbol as small as I want. 1273 01:12:36,710 --> 01:12:39,790 So it says I can come as close to the entropy bound in doing 1274 01:12:39,790 --> 01:12:43,350 this, and come as close to unique decodability as I want 1275 01:12:43,350 --> 01:12:45,140 in doing this. 1276 01:12:45,140 --> 01:12:48,720 And I have a fixed-to-fixed length code, which after one 1277 01:12:48,720 --> 01:12:50,880 year is going to stop. 1278 01:12:50,880 --> 01:12:53,730 And I can turn my decoder off. 1279 01:12:53,730 --> 01:12:55,950 I can turn my encoder off. 1280 01:12:55,950 --> 01:12:59,160 I can go buy a new encoder and a new decoder, which 1281 01:12:59,160 --> 01:13:01,770 presumably works a little bit better. 1282 01:13:01,770 --> 01:13:04,150 And there isn't any problem about when to turn it off. 1283 01:13:04,150 --> 01:13:05,730 Because I know I can turn it off. 1284 01:13:05,730 --> 01:13:09,630 Because everything will have come in by then. 1285 01:13:09,630 --> 01:13:12,420 Here's a more interesting story. 1286 01:13:12,420 --> 01:13:18,250 Suppose I choose the number of bits per source symbol that 1287 01:13:18,250 --> 01:13:23,390 I'm going to use to be less than or equal to the entropy 1288 01:13:23,390 --> 01:13:24,420 minus 2 epsilon. 1289 01:13:24,420 --> 01:13:25,670 Why 2 epsilon? 1290 01:13:25,670 --> 01:13:29,110 Well, just wait a second. 1291 01:13:29,110 --> 01:13:31,830 I mean, 2 epsilon is small and epsilon is small. 1292 01:13:31,830 --> 01:13:34,145 But I want to compare with this other epsilon and my law 1293 01:13:34,145 --> 01:13:35,590 of large numbers. 1294 01:13:35,590 --> 01:13:39,430 And I'm going to pick n large enough. 1295 01:13:39,430 --> 01:13:43,480 The number of typical sequences, we said before, was 1296 01:13:43,480 --> 01:13:48,300 greater than 1 minus delta times 2 to the n times h of x 1297 01:13:48,300 --> 01:13:48,950 minus epsilon. 1298 01:13:48,950 --> 01:13:52,430 I'm going to make this epsilon the same as that epsilon, 1299 01:13:52,430 --> 01:13:54,170 which is why I wanted this to be 2 epsilon. 1300 01:13:56,700 --> 01:14:01,680 So my typical set is this big when I choose n large enough. 1301 01:14:01,680 --> 01:14:04,890 And this says that most of the typical set 1302 01:14:04,890 --> 01:14:07,440 can't be assigned codewords. 1303 01:14:07,440 --> 01:14:15,510 In other words, this number here is humongously larger 1304 01:14:15,510 --> 01:14:35,870 then 2 to the l bar, which is in the order of 2 to the nh of 1305 01:14:35,870 --> 01:14:42,200 x minus 2 epsilon n. 1306 01:14:42,200 --> 01:14:45,660 So the fraction of typical elements that I can provide 1307 01:14:45,660 --> 01:14:52,040 codewords for, between this and this, I can only provide 1308 01:14:52,040 --> 01:14:54,660 codewords for a fraction 2 to the minus 1309 01:14:54,660 --> 01:14:58,670 epsilon n of the codewords. 
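In symbols, with L bar the number of bits per source symbol, the comparison being set up here is

$$|T_\epsilon^n| \;\ge\; (1-\delta)\,2^{\,n(H(X)-\epsilon)}, \qquad 2^{\,n\bar L} \;\le\; 2^{\,n(H(X)-2\epsilon)}, \qquad \frac{2^{\,n\bar L}}{|T_\epsilon^n|} \;\le\; \frac{2^{-n\epsilon}}{1-\delta},$$

so only about a fraction $2^{-n\epsilon}$ of the typical sequences can be given codewords.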
1310 01:14:58,670 --> 01:15:01,770 We have this big sea of codewords, which are all 1311 01:15:01,770 --> 01:15:04,200 essentially equally likely. 1312 01:15:04,200 --> 01:15:07,230 And I can't provide codewords for even a 1313 01:15:07,230 --> 01:15:09,860 small fraction of them. 1314 01:15:09,860 --> 01:15:13,130 So the probability of failure is going to be 1 minus delta. 1315 01:15:13,130 --> 01:15:15,460 The 1 minus delta's the probability that I get 1316 01:15:15,460 --> 01:15:17,950 something atypical. 1317 01:15:17,950 --> 01:15:24,190 Plus, well, minus in this case, 2 to the minus epsilon 1318 01:15:24,190 --> 01:15:28,280 n, which is the probability that I can't encode a typical 1319 01:15:28,280 --> 01:15:30,670 codeword that comes out. 1320 01:15:30,670 --> 01:15:34,550 And this quantity goes to 1. 1321 01:15:34,550 --> 01:15:37,995 So this says that if I'm willing to use a number of 1322 01:15:37,995 --> 01:15:42,690 bits bigger than the entropy, I can succeed with probability 1323 01:15:42,690 --> 01:15:45,010 very close to 1. 1324 01:15:45,010 --> 01:15:48,150 And if I want to use a smaller number of bits, I fail with 1325 01:15:48,150 --> 01:15:49,400 probability 1. 1326 01:15:52,810 --> 01:15:56,320 Which is the same as saying that if I'm using a prefix-free 1327 01:15:56,320 --> 01:16:01,950 code, I'm going to run out of buffer space eventually if I 1328 01:16:01,950 --> 01:16:05,730 run long enough. 1329 01:16:05,730 --> 01:16:11,650 If I have something that I'm encoding -- 1330 01:16:11,650 --> 01:16:13,980 well, just erase that. 1331 01:16:13,980 --> 01:16:15,570 I'll say it more carefully later. 1332 01:16:18,150 --> 01:16:22,210 I do want to talk a little bit about this Kraft inequality 1333 01:16:22,210 --> 01:16:23,610 for unique decodability. 1334 01:16:23,610 --> 01:16:26,780 You remember we proved the Kraft inequality for 1335 01:16:26,780 --> 01:16:29,460 prefix-free codes. 1336 01:16:29,460 --> 01:16:32,930 I now want to talk about the Kraft inequality for uniquely 1337 01:16:32,930 --> 01:16:36,060 decodable codes. 1338 01:16:36,060 --> 01:16:39,330 And you might think that I've done all of this development 1339 01:16:39,330 --> 01:16:45,990 of the AEP, the asymptotic equipartition property. 1340 01:16:45,990 --> 01:16:49,560 Incidentally, you now know where those words come from. 1341 01:16:49,560 --> 01:16:53,500 It's asymptotic because this result is valid asymptotically 1342 01:16:53,500 --> 01:16:55,960 as n goes to infinity. 1343 01:16:55,960 --> 01:17:01,260 It's equipartition because everything is equally likely. 1344 01:17:01,260 --> 01:17:03,480 And it's property because it's a property. 1345 01:17:03,480 --> 01:17:08,490 So it's the asymptotic equipartition property. 1346 01:17:08,490 --> 01:17:12,260 And I didn't do it so I could prove the Kraft inequality. 1347 01:17:12,260 --> 01:17:14,850 It's just that that's an extra bonus that we get. 1348 01:17:14,850 --> 01:17:20,070 And understanding why the Kraft inequality has to hold 1349 01:17:20,070 --> 01:17:28,890 for uniquely decodable codes is one application of the AEP 1350 01:17:28,890 --> 01:17:32,470 which lets you see a little bit about how to use it. 1351 01:17:32,470 --> 01:17:36,520 OK, so the argument is an argument by contradiction. 1352 01:17:36,520 --> 01:17:43,010 Suppose you generate a set of lengths for codewords. 1353 01:17:43,010 --> 01:17:44,550 And you want this -- yeah?
1354 01:17:55,250 --> 01:17:58,090 And the thing you would like to do is to assign codewords 1355 01:17:58,090 --> 01:18:01,220 of these lengths. 1356 01:18:01,220 --> 01:18:04,860 And what we want to do is to set this equal to some 1357 01:18:04,860 --> 01:18:05,630 quantity b. 1358 01:18:05,630 --> 01:18:09,020 In other words, suppose we beat the Kraft inequality. 1359 01:18:09,020 --> 01:18:12,130 Suppose we can make the lengths even shorter than 1360 01:18:12,130 --> 01:18:15,730 Kraft says we can make them. 1361 01:18:15,730 --> 01:18:17,905 I mean, he was only a graduate student, so we've got to be 1362 01:18:17,905 --> 01:18:21,480 able to beat his inequality somehow. 1363 01:18:21,480 --> 01:18:24,460 So we're going to try to make this equal to b. 1364 01:18:24,460 --> 01:18:27,930 We're going to assume that b is greater than 1. 1365 01:18:27,930 --> 01:18:30,890 And then what we're going to do is to show that we get a 1366 01:18:30,890 --> 01:18:32,470 contradiction here. 1367 01:18:32,470 --> 01:18:36,090 And this same argument can work whether we have a 1368 01:18:36,090 --> 01:18:39,600 discrete memoryless source or a source with memory, or 1369 01:18:39,600 --> 01:18:40,420 anything else. 1370 01:18:40,420 --> 01:18:42,830 It can work with blocks, it can work with variable length 1371 01:18:42,830 --> 01:18:46,000 to variable length codes. 1372 01:18:46,000 --> 01:18:49,560 It's all essentially the same argument. 1373 01:18:49,560 --> 01:18:52,390 So what I want to do is to get a contradiction. 1374 01:18:52,390 --> 01:18:56,230 I'm going to choose a discrete memoryless source. 1375 01:18:56,230 --> 01:18:58,900 And I'm going to make the probabilities equal to 1 over 1376 01:18:58,900 --> 01:19:02,300 b times 2 to the minus l sub i. 1377 01:19:02,300 --> 01:19:04,800 In other words, I can generate a discrete memoryless source 1378 01:19:04,800 --> 01:19:07,270 for talking about it with any probabilities I 1379 01:19:07,270 --> 01:19:08,800 want to give it. 1380 01:19:08,800 --> 01:19:12,650 So I'm going to generate one with these probabilities. 1381 01:19:12,650 --> 01:19:16,530 So the lengths are going to be equal to minus log of 1382 01:19:16,530 --> 01:19:19,220 b times p sub i. 1383 01:19:19,220 --> 01:19:22,920 Which says that the expected length of the codewords is 1384 01:19:22,920 --> 01:19:27,820 equal to the sum of p sub i l sub i, which is equal to the 1385 01:19:27,820 --> 01:19:31,780 entropy minus the logarithm of b. 1386 01:19:31,780 --> 01:19:34,450 Which means I can get an expected length which is a 1387 01:19:34,450 --> 01:19:37,440 little bit less than the entropy. 1388 01:19:37,440 --> 01:19:40,600 So now what I'm going to do is to consider strings of n 1389 01:19:40,600 --> 01:19:41,330 source letters. 1390 01:19:41,330 --> 01:19:43,460 I'm going to make these strings very, very long. 1391 01:19:46,270 --> 01:19:50,430 When I concatenate all these codewords, I'm going to wind 1392 01:19:50,430 --> 01:19:54,290 up with a length that's less than n times H of x minus 1393 01:19:54,290 --> 01:19:59,400 log b over 2, with high probability. 1394 01:20:13,510 --> 01:20:18,940 And as a fixed-length code of this length it's going to have 1395 01:20:18,940 --> 01:20:21,810 a low failure probability.
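Restating the construction just described in symbols, with b the value of the Kraft sum:

$$\sum_i 2^{-l_i} = b > 1, \qquad p_i = \frac{2^{-l_i}}{b}, \qquad l_i = -\log_2(b\,p_i),$$

$$\bar L = \sum_i p_i\, l_i = H(X) - \log_2 b \;<\; H(X),$$

and by the weak law of large numbers the concatenation of the codewords for n source letters has total length less than $n\bigl(H(X) - \tfrac{1}{2}\log_2 b\bigr)$ with probability approaching 1.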
1396 01:20:21,810 --> 01:20:26,740 And, therefore, what this says is I can, using this 1397 01:20:26,740 --> 01:20:32,670 remarkable code with unique decodability, and generating 1398 01:20:32,670 --> 01:20:37,500 very long strings from it, I can generate a fixed-length 1399 01:20:37,500 --> 01:20:41,550 code which has a low failure probability. 1400 01:20:41,550 --> 01:20:45,640 And I just showed you in the last slide 1401 01:20:45,640 --> 01:20:46,530 that I can't do that. 1402 01:20:46,530 --> 01:20:49,830 The probability of failure with such a code has to be 1403 01:20:49,830 --> 01:20:51,540 essentially 1. 1404 01:20:51,540 --> 01:20:54,870 So that's a contradiction that says you can't have these 1405 01:20:54,870 --> 01:20:57,460 unique decodable codes. 1406 01:20:57,460 --> 01:21:01,670 If you didn't get that in what I said, don't be surprised. 1407 01:21:01,670 --> 01:21:06,200 Because all I'm trying to do is to steer you towards how to 1408 01:21:06,200 --> 01:21:09,610 look at the section in the notes that does that. 1409 01:21:09,610 --> 01:21:12,430 It was a little too fast and a little too late. 1410 01:21:12,430 --> 01:21:15,570 But, anyway, that is the Kraft inequality for unique 1411 01:21:15,570 --> 01:21:16,650 decodability. 1412 01:21:16,650 --> 01:21:18,170 OK, thanks.