SPEAKER: The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation, or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: The AEP is probably one of the most difficult concepts we talk about in this course. It seems simple to start with. As I said before, it's one of those things where you think you understand it, and then you think you don't understand it. When Shannon first came out with this theory, there were a lot of very, very good professional mathematicians who spent a long time trying to understand it, and who, in fact, blew it, because they were trying to look at it strictly in terms of mathematics. They were looking for strict mathematical theorems; they weren't looking to get the insight from it. Because of that, they couldn't absorb it. There were a lot of engineers who looked at it and couldn't absorb it because they couldn't match it with any of the mathematics, and therefore they started thinking there was more there than there really was. So this is, in fact, tricky.

What we're looking at is a sequence of chance variables coming from a discrete memoryless source. In other words, a discrete memoryless source is something that spits out symbols, where each symbol is independent of each other symbol and each symbol has the same probability mass function as every other symbol. We said it's a neat thing to look at this log pmf random variable, and the log pmf random variable is minus the log of the probability of the particular symbol. So we have a sequence of random variables. And the expected value of each of those random variables is the expected value of the log pmf, which, as we said before, is this thing we've called the entropy, which we're trying to get some insight into. So we have this log pmf random variable, and we have the entropy, which is the expected value of it.
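To make that concrete, here is a minimal numerical sketch — the three-symbol pmf is an assumed toy example, not one from the lecture:

```python
import math

# Assumed toy pmf. The log pmf of a symbol x is -log2 p(x);
# the entropy H(X) is its expected value.
pmf = {'a': 0.5, 'b': 0.25, 'c': 0.25}

def log_pmf(x):
    """Sample value of the log pmf random variable: -log2 p(x)."""
    return -math.log2(pmf[x])

H = sum(p * -math.log2(p) for p in pmf.values())  # H(X) = E[-log2 p(X)]
print(log_pmf('a'), H)                            # 1.0 1.5
```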
We then talked about the sequence of these random variables. We talked about the sample average of them, and the whole reason you want to look at this random variable is what happens to the sample average of a lot of these log pmfs. Since logs add when the things you're taking the log of multiply, what you wind up with is that the sum of these log pmfs is, in fact, equal to minus the log of the probability of the whole sequence. In other words, you look at this whole sequence as one big, giant chance variable, which has M to the n different possible values — namely, every possible sequence of length n. When you're talking about source coding, you have to find a code word for each of those M to the n different sequences. So what you're doing here is trying to look at the probability of each of those M to the n sequences. We can then use Huffman coding, or whatever we choose to use, to try to encode those things.

OK, the weak law of large numbers applies here, and it bounds the probability that this sample average of the log pmf is far from the expected value of the log pmf: the probability that the difference is greater than or equal to epsilon is less than or equal to the variance of the log pmf random variable, divided by n times epsilon squared. Now, we are going to take the viewpoint that n epsilon squared is very small — excuse me, very large. We're thinking of epsilon as a small number and of n as a large number, but the game we always play here is to first pick some epsilon, as small as you want to make it, and then make n larger and larger. And as n gets bigger and bigger, eventually this bound gets very small. So that's the game that we're playing. That's why we're thinking of n times epsilon squared as a large number. So what this says is that the probability that the sample value — namely, the log pmf of the entire sequence divided by n — fails to be close to H of X is less than or equal to this. We then define the typical set, T sub epsilon of n. This is the set of all typical sequences out of the source.
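In code, the definition is one line — a sketch with the same assumed toy pmf as above:

```python
import math

pmf = {'a': 0.5, 'b': 0.25, 'c': 0.25}            # assumed toy pmf
H = sum(-p * math.log2(p) for p in pmf.values())  # entropy H(X)

def is_typical(seq, eps):
    """True if seq is in T_eps^n, i.e. |-(1/n) log2 p(seq) - H(X)| < eps."""
    avg = sum(-math.log2(pmf[x]) for x in seq) / len(seq)
    return abs(avg - H) < eps

print(is_typical('aabc' * 25, eps=0.1))           # True for this 100-symbol sequence
```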
So we're defining typical sequences in this way: the typical set is the set of all sequences whose sample average of the log pmf is within epsilon of the entropy. Well, that's what we put into the set, but the Chebyshev bound is about the complement of this set. Those are the exceptional things; these are the typical things. We're saying that the exceptional things have a small probability when you make n big enough. This is saying that the exceptional things don't amount to anything, and the typical things fill up the entire probability space. So what we're saying is that the probability that this sequence is actually typical is greater than or equal to 1 minus something small. It says that when n gets big enough, the probability that you get a typical sequence out of the source is going to one.

OK. We drew this in terms of a probability distribution, which I hope makes things look a little more straightforward. This is the distribution function of the sample average of this log pmf random variable. And what we're saying is that as n gets larger and larger — because the variance of the sample average shrinks; the sample average is always a random variable, while the average is not a random variable, it's a fixed number — this distribution function gets closer and closer to a stack. In other words, as n gets big, you go along here, nothing happens, suddenly you move up here, and suddenly you move across there. That's what happens in the limit: namely, the sample average is, at that point, always equal to the average. So as n goes to infinity, the typical set approaches probability 1. We expressed that in terms of the Chebyshev inequality in this way, but the picture says you can interpret it in any one of a hundred different ways. Because the real essence of this is not the Chebyshev inequality; the real essence is that as n gets bigger and bigger, this distribution function becomes a stack. So that's a nice way of thinking about it.
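You can watch the stack form in a small simulation — again a sketch with the assumed toy pmf, where H(X) = 1.5 bits:

```python
import math, random

pmf = {'a': 0.5, 'b': 0.25, 'c': 0.25}   # assumed toy pmf
symbols, probs = zip(*pmf.items())
H = sum(-p * math.log2(p) for p in probs)

def sample_average(n):
    """-(1/n) log2 p(X^n) for one random length-n sequence."""
    seq = random.choices(symbols, weights=probs, k=n)
    return sum(-math.log2(pmf[x]) for x in seq) / n

for n in (10, 100, 10000):
    trials = [sample_average(n) for _ in range(500)]
    frac = sum(abs(t - H) < 0.05 for t in trials) / len(trials)
    print(n, frac)   # the fraction inside (H - 0.05, H + 0.05) climbs toward 1
```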
Let's summarize what we did with that — all of the major results about typical sets. The first of them is this one. It's a bound on the number of elements in a typical set. I'm not going to rederive it, but it comes from the fact that all of the typical elements have roughly the same probability, and all of them collectively fill up the whole space, and therefore you can find the number of them by taking one and dividing by the probability of each of them. When you do that, you get two bounds. One of them says that this magnitude is less than 2 to the n times (H of X plus epsilon). The other one says it's greater than 2 to the n times (H of X minus epsilon) — and you have to throw in this little fudge factor here, because the typical set doesn't quite fill up the whole space.

What that all says is that when n gets large, the number of typical elements is approximately equal to 2 to the n times H of X. Again, think hard about what this means. This number here really isn't at all like 2 to the n H of X: this n times epsilon is going to be a big number, and as n gets bigger, it gets bigger and bigger. But at the same time, when you look at 2 to the n times (H of X plus epsilon), and you know that H of X is something substantial and that epsilon is very small, in some sense it still says this. And in source coding terms, it very much says this. Because in source coding terms, what you're always looking at is these exponents. If you have some number of different things you're trying to encode, it takes you the log of that number of bits to encode them. So when you take the log of this, you see that the number of extra bits you need to encode these sequences is on the order of n times epsilon. Which is some number of bits, but the real number of bits you're looking at is n times H of X. So that's the major term, and this is just fiddly.

The next thing we found is that the probability of an element in a typical set is between 2 to the minus n times (H of X plus epsilon) and 2 to the minus n times (H of X minus epsilon).
Which is saying almost the same thing as the first result. And again, there's this approximation which says this is about equal to 2 to the minus n H of X — an approximation in exactly the same sense that that one is an approximation. Finally, the last statement is that the probability that you get a typical sequence is greater than or equal to 1 minus this variance divided by n times epsilon squared. With the same kind of approximation, the probability that you get a typical sequence is about 1.

So what this is saying is that there are hardly any exceptions in terms of probability. There are a huge number of exceptions, but they're all extraordinarily small in probability. So most of the space is all tucked into this typical set. All of these typical sequences have about the same probability, and the number of typical sequences is about 2 to the n times H of X.

Does that say that the entropy has any significance? It sure does. Because the entropy is really what's determining everything about this distribution, as far as probabilities are concerned. There's nothing left beyond that. If you're going to look at very long sequences — and in source coding we want to look at long sequences — that's the story. So the entropy tells the story, despite what Huffman said. Huffman totally ignored the idea of entropy, and because of that he came up with the optimal algorithm; but his optimal algorithm didn't tell him anything about what was going on. This is not a knock on Huffman — I think his algorithm is one of the neatest things I've ever seen, because he came up with it out of nowhere. Just pure thought, which is nice.

We then started to talk about fixed-length to fixed-length source coding. This is not a practical thing. This is not something I would advise trying to do.
It's something which is conceptually useful, because it talks about anything you can do eventually, when you look at an almost infinite sequence of symbols from the source and ask how many bits it takes to represent them. Ultimately you need to turn that encoder on at some point in pre-history — and pre-history, in our present day, is about one year ago — and you have to turn it off sometime in the distant future, which is maybe six months from now. During that time you have to encode all these bits. So, in fact, when you get it all done and look at the overall picture, it's fixed length to fixed length. And all of these algorithms are just ways of doing the fixed length to fixed length without too much delay involved in them.

I didn't say everything there. What this typical set picture gives us is that you can achieve an expected number of bits per source symbol which is about H of X, with very rare failures. And if you try to achieve H of X minus epsilon bits per symbol, the interesting thing we found last time was that the fraction of sequences you can encode was zilch. In other words, there is a very rapid transition here. If you try to get by with too few bits per symbol, you die very, very quickly. It's not that your error probability is large — your error probability is asymptotically equal to 1. You always screw up. So that's the picture.

We want to go on to Markov sources. Let me explain why I want to do this first. When we're talking about discrete memoryless sources, it should be obvious that that's totally a toy problem. There aren't any sources I can imagine, whose output you would want to encode, where you can reasonably conclude that they were discrete and memoryless — namely, that each symbol was independent of each other symbol. The only possibility I can think of is where you're trying to report the results of gambling or something. And gambling is so dishonest that they probably aren't independent anyway.
You could use this as a way of showing they aren't independent, but it's not a very useful thing. So somehow we want to be able to talk about how you encode sources with memory. Well, Markov sources are the easiest kind of sources with memory. And they have the nice property that you can include as much statistics in them as you want to. You can make them include as much of the structure you can find as anything else will. So people talk about much more general classes of sources, but these are really sufficient — sufficient to talk about everything useful. Not necessarily the nicest way to think about useful things, but sufficient.

So, a finite-state Markov chain — I assume you're somewhat familiar with Markov chains from taking probability courses. If not, you should probably review them there, because the notes go through this pretty quickly. There's nothing terribly complicated there, but a finite-state Markov chain is a sequence of discrete chance variables; in that sense it's exactly like the discrete memoryless sources we were looking at. The letters come from some finite alphabet, so in that sense, too, it's like the discrete memoryless sources. But here the difference is that each letter depends on the letter before. Namely, before the Markov chain changes from one state to another state in one of these steps, it looks at the state it's in and decides where it's going to go next.

So we have a transition probability matrix — you can think of this as a matrix — which has a value for every s in the state space and for every s prime in the state space. It represents the probability that the state at time n is equal to this state s, given that the state at time n minus 1 is equal to the state s prime. So this tells you what the probabilities are of going from one state to another. The important thing here is that this single-step transition incorporates all of the statistical knowledge.
In other words, this is also equal to the probability that S n is equal to this state s, given that S n minus 1 is equal to the state s prime, and also that S n minus 2 is equal to any given state at time n minus 2, and all the way back to S zero being any old state at time zero. So it says that this source loses memory, except for the first thing back.

I like to think of this as a blind frog who has very good sensory perception, jumping around on lily pads. In other words, he can't see the lily pads; he can only sense the nearby lily pads. So he jumps from one lily pad to the next, and when he gets to the next lily pad, he then decides which lily pad he's going to go to next. That'll wake you up, anyway.

And we also want to define some initial pmf on the initial state. So you'd like to think of Markov chains as starting at some time and then proceeding forever into the future.

That seems to indicate that all we've done is to replace one trivial problem with another trivial problem. In other words, one-step memory is not enough to deal with things like English text — or any other language that you like. So the idea of a Markov source is that you create whatever set of states you want — you can have a very large state space — but you associate the output of the source not with the states, but with the transitions from one state to another.

So this gives an example of it; it's easier to explain the idea by simply looking at an example. In this particular example, which I think is the same as the one in the notes, what you're looking at is a memory of the two previous digits. So this is a binary source: it produces binary digits, either zero or 1. If the previous two digits were zero, zero, it says that the next digit is going to be a 1 with probability 0.1 and a zero with probability 0.9.
If you're in state zero, zero and the next thing that comes out is a zero, then at that point — namely, one step into the future — you have a zero as your last digit, a zero as the previous digit, and a zero as the digit before that. So your state has moved from (x n minus 2, x n minus 1) to (x n minus 1, x n). In other words, any time you make a transition, this digit here, which is the previous digit, always has to become that digit. You'll notice the same thing in all of these: when you go from here to there, the last digit becomes the first digit; when you go from here to here, the last digit becomes the first digit; and so forth. And that's a characteristic of this particular structure of having the memory represented by the last two digits.

So, about the kind of output you can get from this source — I mean, you see that it doesn't do anything very interesting. When you have two zeroes in the past, it tends to produce a lot more zeroes; it tends to get stuck with lots of zeroes. If you get a single 1, that takes you over into this state, and from there it can either go there or there. Once it gets here, it tends to produce a large number of 1's. So what's happening here is you have a Markov chain which goes from long sequences of zeroes, to transition regions where there are a bunch of zeroes and 1's, and finally gets trapped into either the all-zero state again or the all-1 state again, and moves on from there.

So the letter is, in this case, the source output. If you know the old state, the source output specifies the new state. If you know the old state and the new state, that specifies the letter. In other words, one of the curious things about this chain is that I've arranged it so that, since we have a binary output, there are only two possible transitions from each state. Which says that the initial state plus the sequence of source outputs specifies the state at every point in time, and the state at every point in time specifies the source sequence. In other words, the two are isomorphic to each other. One specifies the other.
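Here is a sketch of that source in code. Only the 0.9/0.1 probabilities for the all-zeroes state are given in the lecture; the values assumed for the other three states are for illustration only:

```python
import random

# P(next bit = 1 | state); the state is the previous two digits.
# 0.1 for state '00' is from the lecture; the other values are assumed.
P1 = {'00': 0.1, '01': 0.5, '10': 0.5, '11': 0.9}

def run(n, state='00'):
    out = []
    for _ in range(n):
        bit = '1' if random.random() < P1[state] else '0'
        out.append(bit)
        state = state[1] + bit   # the old last digit becomes the new first digit
    return ''.join(out)

print(run(60))   # long runs of 0's and 1's, with short transition regions
```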
Since one specifies the other, you can pretty much forget about the sequence and look at the state chain, if you want to. And therefore everything you know about Markov chains is useful here. Or if you like to think about the real source, you can think about the real source as producing these letters. Either way is fine, because either one specifies the other. When you don't have that property, and you look at a sequence of letters from such a source, what you have is called a partially specified Markov chain. And they're awful things to deal with. You can write theses about them, but you don't get any insight about them. There's hardly anything that you would like to be true which is true. They're just awful things to look at. So we won't look at them.

One of the nice things about engineering is you can create your own models. Mathematicians have to look at the crazy models that engineers suggest to them. They have no choice. That's their job. But as engineers, we can look at only the nice models. We can play, and the mathematicians have to work. So it's nicer to be an engineer, I think. Although famous mathematicians only look at the engineering problems that appeal to them, so in fact, when they become famous, the two groups come back together again. Because the good engineers are also good mathematicians, so they become sort of the same group.

These transitions — namely, the transition lines that we draw on a graph like this — always indicate positive probability. In other words, if there's zero probability of going from here to here, you don't clutter up the diagram by putting a line in. Which allows you to just look at these transitions to figure out what's going on.

One of the things that you learn when you study finite-state Markov chains is that a state s is accessible from some other state s prime if the graph has some path from s prime to s. In other words, it's not saying you can go there in one step.
It's saying that there's some way you can get there if you go long enough. Which means there's some probability of getting there. And the fact that there's a probability of getting there pretty much means that you're going to get there eventually. That's not an obvious statement, but let's see what it means here. This state is accessible from this state, in the sense that you can get from here to there by going over here and then going to here.

If we look at the states which are accessible from each other, you get some set of states, and if you're in that set of states, you can never get out of it. Therefore, every one of those states keeps recurring with positive probability, and you keep rotating back and forth among them in some way, but you never get out of them. In other words, a Markov chain which doesn't have this property would be the following Markov chain — that's the simplest one I can think of. If you start out in this state, you stay there. If you start out in this state, you stay there. That's not a very nice chain. Is this a decent model for an engineering study? No. Because when you're looking at engineering, the thing that you're interested in is something that happens over a long period of time. Back at time minus infinity you can decide whether you're here or whether you're here, and you might as well not worry a whole lot about what happened back at time minus infinity as far as building your model goes. You may as well just build a model for this, or build a model for that.

There's another thing here, which is periodicity. In some chains, you can go from this state to this state in — one, two steps. Or you can go there in one, two, three, four steps. Or you can go there in one, two, three steps, and so forth. Which says that the period of s is the greatest common divisor of the path lengths from s back to s again.
If that period is not equal to 1 — namely, if there's some periodic structure which says the only way you can get back to a state is by coming back every two steps, or every three steps, or something — then, again, it's not a very nice Markov chain. Because if you're modeling it, you might as well just model things two steps at a time instead of one step at a time.

So the upshot of that is that you define these nice Markov chains, which are aperiodic — which don't have any of this periodic structure — and in which every state is accessible from every other state. And you call them ergodic Markov chains. And what ergodic means, sort of, and as a more general principle, is that the probabilities of things are equal to the relative frequencies of things. Namely, if you look at a very long sequence of things out of a Markov chain, what you see in that very long sequence should be representative of the probabilities of that very long sequence. The probabilities of transitions at various times should be the same from one time to another. And that's what ergodicity means. It means that the thing is stationary: you look at it at one time, and it behaves the same way as at another time. It doesn't have any periodic structure, which would mean that if you look at it at even times, it behaves differently from looking at it at odd times. That's the kind of Markov chain you would think you would have, unless you look at these oddball examples of other things. Everything we do is going to be based on the idea of ergodic Markov chains, because they're the nicest models to use.

A Markov source, then, is a sequence of labeled transitions on an ergodic Markov chain. Those are the only things we want to look at. But that's general enough to do most of the things we want to do.

And once you have ergodic Markov chains, there are a lot of nice things that happen. Namely, you can try to solve this set of equations — that is, you suppose there is some pmf function which gives the relative frequency of a given state.
Namely, if I look at an enormous number of steps, I would like the state little s to come up with some relative frequency — all the time that I do it. That is, I wouldn't like to have one sample path which comes up with a relative frequency of 1/2, and another sample path of almost infinite length which comes up with a different relative frequency. Because that would mean that different sequences of states are not typical of the Markov chain. That's another way of looking at what ergodicity means. Without it, infinite-length sequences are not typical anymore: things depend on when they start and when they stop; they depend on whether you start at an even time or an odd time — all of these things that real sources shouldn't depend on.

So, if you have relative frequencies, then you should have those relative frequencies at time n and at time n minus 1. The probability of a particular state s, if the probabilities of the previous state s prime have these same values q of s prime, comes from the transition probabilities, Q of s given s prime. The sum over s prime of q of s prime times Q of s given s prime — what is that? That's the probability of s. In other words, if you start out at time n minus 1 with some pmf q on the states at time n minus 1, this is the formula you would use to calculate the pmf for the states at time n: it's the probability mass function for the states at the next unit of time. If this probability distribution is the same as this probability distribution, then you say that you're in steady state. Because if you do this again — you plug this into here — it's the same thing: you get the same answer at time n plus 1. You plug it in again, you get the same answer at time n plus 2, and so on forever. So you stay in steady state.

The question is: if you have a matrix here, can you solve this equation? Is it easy to solve?
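Numerically it's easy — you just grind the equation q(s) = sum over s prime of q(s prime) Q(s | s prime) away. A sketch, using the transition matrix of the running two-bit-memory example with the same assumed probabilities as before:

```python
import numpy as np

# Rows are the old state s', columns the new state s; order 00, 01, 10, 11.
# The 0.9/0.1 values for states 00 and 11 follow the lecture's example;
# the 0.5's are assumed.
Q = np.array([[0.9, 0.1, 0.0, 0.0],   # from 00: emit 0 -> 00, emit 1 -> 01
              [0.0, 0.0, 0.5, 0.5],   # from 01: emit 0 -> 10, emit 1 -> 11
              [0.5, 0.5, 0.0, 0.0],   # from 10
              [0.0, 0.0, 0.1, 0.9]])  # from 11

q = np.array([1.0, 0.0, 0.0, 0.0])    # start in a known state, say 00
for _ in range(1000):
    q = q @ Q                         # q_n = q_{n-1} Q
print(q.round(4))                     # -> [0.4167 0.0833 0.0833 0.4167]
```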
And what's the solution? There's a nice theorem that says that if the chain is ergodic — namely, if it has these nice properties of transition probabilities, which correspond to a particular kind of matrix here — then this vector-matrix equation has a unique probability-vector solution for this little q, in terms of this transition probability matrix. It also has the nice property that if you start out with any old distribution, and you grind this thing away a number of times, this q is going to approach the steady-state solution. Which means that if you start a Markov chain out in some known state, after a while the probability that you're in state s is going to become this steady-state probability. It gets closer and closer to it, exponentially, as time goes on. So that's just arithmetic.

These steady-state probabilities are approached asymptotically from any starting state. I.e., for all s and s prime, the limit as n goes to infinity of the probability that S sub n, the state at time n, is equal to a given state s, given that the state at time zero was equal to s prime, is equal to q of s. All of you know those things from studying Markov chains — I hope, because those are the main facts about Markov chains. Incidentally, I'm not interested in Markov chains for their own sake here. We're not going to do anything with them; the only thing I want to do with them is to show you that there are ways of modeling real sources, and of coming as close to good models for real sources as you want to come. That's the whole approach here.

How do you do coding for Markov sources? The simplest approach, which doesn't work very well, is to use a separate prefix-free code for each prior state. Namely, if I look at this Markov chain that I had, it says that when I'm in a given state, I want to somehow encode the next state that I go to — or the next letter that comes out of the Markov source.
The things that can come out of the Markov source are either a 1 or a zero. Now you see the whole problem with this approach. As soon as you look at an example, it sort of blows the cover on this. What's the best prefix-free code to encode a 1 and a zero, where one appears with probability 0.9 and the other one appears with probability 0.1? What does the Huffman encoder do? It assigns one of those symbols to 1 and one of them to zero. You might as well encode 1 as this one, and zero as that one. Which means that all the theory does is generate the same symbols that you had before. You're not doing any compression at all.

It's a nice thing to think about, though. In other words, by thinking about these things you then see the solution. And our solution before to these problems was: if you don't get anywhere by using a prefix-free code on a single digit, take a block of n digits and use a prefix-free code there. So that's the approach we will take here.

The general idea for single letters is this: the prefix-free code we're going to generate satisfies the Kraft inequality, you can use the Huffman algorithm, and you get this property here. And this entropy, which is now a function of the particular state that we were in to start with, is just this entropy here. So this is a conditional entropy, and we get a different conditional entropy for each possible previous state. And that conditional entropy for each possible previous state is what tells us exactly what we can do as far as generating a Huffman code for that next state. This would work fine if you had a symbol alphabet of size 10,000 or something. It just doesn't work well when your symbol alphabet is binary.
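Here is a minimal sketch of the per-state scheme, using blocks of two letters so the binary-alphabet problem doesn't bite. The codebooks are assumed toy ones, not Huffman codes computed from the example's actual probabilities:

```python
# state -> {2-letter block: code word}; each codebook is prefix-free.
# In states 00 and 11 the likely all-same block gets the 1-bit code word.
codebooks = {
    '00': {'00': '0', '01': '10', '10': '110', '11': '111'},
    '01': {'00': '00', '01': '01', '10': '10', '11': '11'},
    '10': {'00': '00', '01': '01', '10': '10', '11': '11'},
    '11': {'11': '0', '10': '10', '01': '110', '00': '111'},
}

def encode(blocks, state='00'):
    bits = []
    for blk in blocks:
        bits.append(codebooks[state][blk])
        state = blk              # a 2-letter block is exactly the new state
    return ''.join(bits)

print(encode(['00', '00', '01', '11']))   # -> '001011'
```

The decoder runs the same state machine: knowing the current state, it parses the next code word with that state's prefix-free codebook, which gives it the block, which gives it the next state.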
If we start out in steady state, then all of these probabilities stay in steady state. When we look at the number of binary digits per source symbol, and we average over all of the initial states — the initial states occur with these probabilities q of s, and we have these best Huffman codes — the number of binary digits we're using per source symbol is really this average here. Because this is averaging over all the states you're going to go into. And the entropy of the source output, conditional on the chance variable S, is now, in fact, defined just as that average.

So the encoder transmits s zero, followed by the code word for x 1 using the code for s zero. That specifies s 1, and then you encode x 2 using s 1, and so forth. And the decoder is sitting there and does exactly the same thing. Namely, the decoder first sees what s zero is, then it uses the code for s zero to decide what x 1 was, then it uses the code for s 1 — s 1 being determined by s zero and x 1 — and it goes on and on like that.

Let me review a little bit about conditional entropy. I'm going pretty fast here, and I'm not deriving these things because they're in the notes. And it's almost as if I want you to get some kind of pattern sensitivity to these things, without the idea that we're going to use them a lot. You ought to have a general idea of what the results are. We're not going to spend a lot of time on this because, as I said before, the only thing I want you to recognize is that if you ever want to model a source, this, in fact, gives you a general way of doing it.

So this conditional entropy is defined using what we had before: it's the sum over all the states, and the sum over all of the source outputs, of these log pmf probabilities. The joint entropy of both a symbol and a state is equal to this combined thing. Which is equal, not surprisingly, to the entropy of the state plus the entropy of the letter conditional on the state. Namely, first you want to know what the state is — that has this entropy — and then, given the state, this is the entropy of the next letter, conditional on the state. And this joint entropy is less than or equal to H of S plus H of X — I think you just proved that in the homework, didn't you? I hope so.
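In symbols, the chain rule and that homework bound combine to give the inequality the next remark draws on:

$$
H(X,S) \;=\; H(S) + H(X \mid S) \;\le\; H(S) + H(X)
\quad\Longrightarrow\quad H(X \mid S) \;\le\; H(X).
$$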
That says that the entropy of X conditional on S is less than or equal to the entropy of X, which is not surprising. It says that if you use the previous state in trying to do source encoding, you're going to do better than if you don't use it. I mean, the whole theory would be pretty stupid if you didn't. That's what that says.

As I told you before, the only way we can make all of this work is to use n-to-variable-length codes for each state. In other words, you encode n letters at the same time. If you look at the entropy of the first n states given a starting state, it turns out to be n times H of X given S, by the same kind of rule that you were using before. The same argument that you used to show that this is equal to that, you can use to show that this is equal to H of S 1 given S zero, plus H of S 2 given S 1, plus H of S 3 given S 2, and so forth. And by the stationarity that we have here, these are all equal, so you wind up with n times this conditional entropy. And since the source outputs specify the states, and the states specify the source outputs, you can then convince yourself that this entropy is also equal to n times H of X given S.

And once you do that, you're back in the same position we were in when we looked at n-to-variable-length coding for discrete memoryless sources. Namely, the only thing that happens when you're looking at n-to-variable-length coding is that the one fudge factor becomes a 1 over n. When you have a small symbol alphabet, by going to blocks you get rid of this; you can make the expected length close to H of X given S. Which means, in fact, that all of the memory is taken into account, and it still is this one parameter, the entropy, that says everything.

The AEP holds — I mean, if you want to, you can sit down and just see that it holds once you see what these entropies are. The entropies are using log pmfs; you're just looking at products of probabilities, which are sums of log pmfs, and everything is the same as before.
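To put a number on H of X given S for the running example — same assumed transition probabilities, and the steady state computed above:

```python
import math

P1 = {'00': 0.1, '01': 0.5, '10': 0.5, '11': 0.9}     # assumed P(1 | state)
q = {'00': 5/12, '01': 1/12, '10': 1/12, '11': 5/12}  # steady state from above

def h(p):
    """Binary entropy function, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

H_X_given_S = sum(q[s] * h(P1[s]) for s in q)   # H(X|S) = sum_s q(s) H(X|S=s)
print(round(H_X_given_S, 3))                    # ~0.557 bits, well below 1
```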
746 00:43:13,170 --> 00:43:17,920 And again, if you're using n-to-variable-length codes, 747 00:43:17,920 --> 00:43:22,010 you just can't achieve an expected length less than H of 748 00:43:22,010 --> 00:43:26,510 X, given S. So H of X, given S gives the whole story. 749 00:43:31,140 --> 00:43:34,130 You should read those notes, because I've gone through that 750 00:43:34,130 --> 00:43:38,160 very, very fast, partly because some of you are 751 00:43:38,160 --> 00:43:40,840 already very familiar with Markov chains and some of you are 752 00:43:40,840 --> 00:43:44,490 probably less familiar with them, so you should check that 753 00:43:44,490 --> 00:43:45,730 out a little bit on your own. 754 00:43:45,730 --> 00:43:50,980 I want to talk about the Lempel Ziv universal 755 00:43:50,980 --> 00:43:56,270 algorithm, which was rather surprising to many people. 756 00:43:59,440 --> 00:44:03,120 Jacob Ziv is one of the great theorists 757 00:44:03,120 --> 00:44:05,560 of information theory. 758 00:44:05,560 --> 00:44:08,300 Before this time, he wrote a lot of very, 759 00:44:08,300 --> 00:44:10,720 very powerful papers. 760 00:44:10,720 --> 00:44:14,500 Which were quite hard to read in many cases. 761 00:44:14,500 --> 00:44:17,690 So it was a real surprise to people when he came up with 762 00:44:17,690 --> 00:44:21,750 this beautiful idea, which was a lovely, simple algorithm. 763 00:44:21,750 --> 00:44:25,425 Some people, because of that, thought it was Abe Lempel, who 764 00:44:25,425 --> 00:44:28,760 spends part of his year at Brandeis, who was really the 765 00:44:28,760 --> 00:44:30,720 genius behind it. 766 00:44:30,720 --> 00:44:34,160 In fact, it wasn't Abe Lempel, it was Jacob Ziv who was the 767 00:44:34,160 --> 00:44:36,070 genius behind it. 768 00:44:36,070 --> 00:44:42,390 Abe Lempel was pretty much the one who really implemented it, 769 00:44:42,390 --> 00:44:45,630 because once you see what the algorithm is, it's still not 770 00:44:45,630 --> 00:44:49,650 trivial to try to find how to implement it in a 771 00:44:49,650 --> 00:44:52,990 simple, easy way. 772 00:44:52,990 --> 00:44:56,350 If you look at all the articles about it, the authors 773 00:44:56,350 --> 00:45:00,230 are Ziv and Lempel, instead of Lempel and Ziv, so why it got 774 00:45:00,230 --> 00:45:04,020 called Lempel Ziv is a mystery to everyone. 775 00:45:04,020 --> 00:45:07,450 Anyway, they came up with two algorithms. 776 00:45:07,450 --> 00:45:12,770 One which they came up with in 1977, and people looked at 777 00:45:12,770 --> 00:45:16,530 their 1977 algorithm and said, oh that's much too complicated 778 00:45:16,530 --> 00:45:18,210 to implement. 779 00:45:18,210 --> 00:45:20,830 So they went back to the drawing board, came up with 780 00:45:20,830 --> 00:45:24,890 another one in 1978, which people said, ah, we can 781 00:45:24,890 --> 00:45:26,900 implement that. 782 00:45:26,900 --> 00:45:31,750 So people started implementing the LZ78 and of course, by 783 00:45:31,750 --> 00:45:34,920 that time, all the technology was much better. 784 00:45:34,920 --> 00:45:39,380 You could do things faster and cheaper than you could before, 785 00:45:39,380 --> 00:45:44,920 and what happened then is that a few years after that people 786 00:45:44,920 --> 00:45:48,720 were implementing LZ77, which turned out 787 00:45:48,720 --> 00:45:50,190 to work much better. 788 00:45:50,190 --> 00:45:51,930 Which is often the way this field works.
789 00:45:51,930 --> 00:45:55,030 People do something interesting theoretically, 790 00:45:55,030 --> 00:45:59,580 people say, no you can't do it, so they simplify it, 791 00:45:59,580 --> 00:46:03,200 thereby destroying some of its best characteristics. 792 00:46:03,200 --> 00:46:06,210 And then a few years later people are doing the more 793 00:46:06,210 --> 00:46:08,930 sophisticated thing, which they should have started out 794 00:46:08,930 --> 00:46:10,180 doing at the beginning. 795 00:46:13,730 --> 00:46:17,210 What is a Universal Data Compression algorithm? 796 00:46:17,210 --> 00:46:21,330 A Universal Data Compression algorithm is an algorithm 797 00:46:21,330 --> 00:46:25,240 which doesn't have any probabilities tucked into it. 798 00:46:25,240 --> 00:46:28,690 In other words, the algorithm itself simply looks at a 799 00:46:28,690 --> 00:46:31,320 sequence of letters from an alphabet, and 800 00:46:31,320 --> 00:46:34,060 encodes it in some way. 801 00:46:34,060 --> 00:46:38,300 And what you would like to be able to do is somehow measure 802 00:46:38,300 --> 00:46:42,400 what the statistics are, and at the same time as you're 803 00:46:42,400 --> 00:46:47,000 measuring the statistics, you want to encode the digits. 804 00:46:47,000 --> 00:46:49,630 You don't care too much about delay. 805 00:46:49,630 --> 00:46:51,610 In fact, one way to do this -- 806 00:46:51,610 --> 00:46:53,680 I mean, if you're surprised that you can build a good 807 00:46:53,680 --> 00:46:56,880 universal encoder, you shouldn't be. 808 00:46:56,880 --> 00:47:00,200 Because you could just take the first million letters out 809 00:47:00,200 --> 00:47:03,510 of the source, go through all the statistical analysis you 810 00:47:03,510 --> 00:47:08,560 want to, model the source in whatever way makes the best 811 00:47:08,560 --> 00:47:12,270 sense to you, and then build a Huffman encoder, which, in 812 00:47:12,270 --> 00:47:16,780 fact, encodes things according to that model that you have. 813 00:47:16,780 --> 00:47:19,710 Of course you then have to send the decoder the first 814 00:47:19,710 --> 00:47:23,560 million digits, and the decoder goes through the same 815 00:47:23,560 --> 00:47:26,730 statistical analysis, and therefore finds out what code 816 00:47:26,730 --> 00:47:30,410 you're going to use, and then the encoder encodes, the 817 00:47:30,410 --> 00:47:34,690 decoder decodes and you have this million symbols of 818 00:47:34,690 --> 00:47:38,460 overhead in the algorithm, so that if you use the algorithm 819 00:47:38,460 --> 00:47:41,710 for a billion letters instead of a million letters, then it 820 00:47:41,710 --> 00:47:43,160 all works pretty well. 821 00:47:43,160 --> 00:47:46,610 So there's a little bit of that flavor here, but the 822 00:47:46,610 --> 00:47:50,730 other part of it is, it's a neat algorithm. 823 00:47:50,730 --> 00:47:53,550 And the algorithm measures things in a faster way than 824 00:47:53,550 --> 00:47:55,520 you would believe. 825 00:47:55,520 --> 00:47:59,080 And as you look at it later you say, gee, this makes a 826 00:47:59,080 --> 00:48:02,820 great deal of sense even if there isn't much statistical 827 00:48:02,820 --> 00:48:04,690 structure here.
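As a sketch of that two-pass thought experiment (not the Lempel Ziv algorithm itself, just the train-then-encode scheme described above), here is a minimal Python version; the names huffman_code and two_pass_encode are mine, and it assumes every symbol of the alphabet shows up in the training prefix:

import heapq
from collections import Counter

def huffman_code(freqs):
    """Build a Huffman code (symbol -> bit string) from symbol counts."""
    # Heap entries are (count, tiebreak, tree); the unique tiebreak keeps
    # heapq from ever comparing two trees directly.
    heap = [(n, i, sym) for i, (sym, n) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)   # merge the two least likely trees
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next_id, (t1, t2)))
        next_id += 1
    code = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: branch on 0/1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix or "0"    # degenerate one-symbol alphabet
    walk(heap[0][2], "")
    return code

def two_pass_encode(source, train_len):
    """Send the first train_len symbols raw, then Huffman-encode the rest
    using the statistics measured on that prefix."""
    prefix, rest = source[:train_len], source[train_len:]
    code = huffman_code(Counter(prefix))  # decoder rebuilds this same code
    return prefix, "".join(code[s] for s in rest)

The Lempel Ziv algorithm gets the same effect without any separate training pass, which is what makes it interesting.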
828 00:48:04,690 --> 00:48:09,860 In other words, you can show that if the source really is a 829 00:48:09,860 --> 00:48:15,220 Markov source, then this algorithm will behave just as 830 00:48:15,220 --> 00:48:20,740 well, asymptotically, as the best algorithm you can design 831 00:48:20,740 --> 00:48:22,350 for that Markov source. 832 00:48:22,350 --> 00:48:28,450 Namely, it's so good that it will in fact measure the 833 00:48:28,450 --> 00:48:33,970 statistics in that Markov model and implement them. 834 00:48:33,970 --> 00:48:36,780 But it does something better than that. 835 00:48:36,780 --> 00:48:40,280 And the thing which is better is that, if you're going to 836 00:48:40,280 --> 00:48:43,690 look at this first million symbols and your objective 837 00:48:43,690 --> 00:48:48,920 then is to build a Markov model, after you build the 838 00:48:48,920 --> 00:48:52,240 Markov model for that million symbols, one of the things 839 00:48:52,240 --> 00:48:54,890 that you always question is, should I have used a 840 00:48:54,890 --> 00:48:59,270 Markov model or should I have used some other kind of model? 841 00:48:59,270 --> 00:49:01,330 And that's a difficult question to ask. 842 00:49:01,330 --> 00:49:04,460 You go through all of the different possibilities, and 843 00:49:04,460 --> 00:49:07,460 one of the nice things about the Lempel Ziv algorithm is, 844 00:49:07,460 --> 00:49:11,010 in a sense, it just does this automatically. 845 00:49:11,010 --> 00:49:13,620 If there's some kind of statistical structure there, 846 00:49:13,620 --> 00:49:15,820 it's going to find it. 847 00:49:15,820 --> 00:49:19,040 If it's not Markov, if it's some other kind of structure, 848 00:49:19,040 --> 00:49:20,050 it will find it. 849 00:49:20,050 --> 00:49:22,790 The question is how does it find this statistical 850 00:49:22,790 --> 00:49:27,800 structure without knowing what kind of model you should use 851 00:49:27,800 --> 00:49:29,010 to start with? 852 00:49:29,010 --> 00:49:32,100 And that's the genius of things which are universal, 853 00:49:32,100 --> 00:49:35,430 because they don't assume that you have to measure particular 854 00:49:35,430 --> 00:49:38,560 things in some model that you believe in. 855 00:49:38,560 --> 00:49:41,090 It just does the whole thing all at once. 856 00:49:41,090 --> 00:49:45,230 If it's running along and the statistics change, bing, it 857 00:49:45,230 --> 00:49:48,070 changes, too. 858 00:49:48,070 --> 00:49:51,750 And suddenly it will start producing more binary digits 859 00:49:51,750 --> 00:49:54,590 per source symbol, or fewer, because that's 860 00:49:54,590 --> 00:49:56,480 what it has to do. 861 00:49:56,480 --> 00:49:59,140 And that's just the way it works. 862 00:49:59,140 --> 00:50:02,510 But it does have all these nice properties. 863 00:50:02,510 --> 00:50:05,770 It has instantaneous decodability. 864 00:50:05,770 --> 00:50:09,770 In a sense, it is a prefix-free code, although you 865 00:50:09,770 --> 00:50:11,430 have to interpret pretty carefully 866 00:50:11,430 --> 00:50:12,730 what you mean by that. 867 00:50:12,730 --> 00:50:16,010 We'll understand that in a little bit. 868 00:50:16,010 --> 00:50:20,680 But in fact, it does do all of these neat things. 869 00:50:20,680 --> 00:50:25,420 And there are better algorithms out there now, 870 00:50:25,420 --> 00:50:29,380 whether they're better in terms of the trade-off between 871 00:50:29,380 --> 00:50:34,690 complexity and compressibility, I don't know.
872 00:50:34,690 --> 00:50:37,390 But anyway, the people who do research on these things have 873 00:50:37,390 --> 00:50:40,670 to have something to keep them busy. 874 00:50:40,670 --> 00:50:44,690 And they have to have some kind of results to get money 875 00:50:44,690 --> 00:50:48,160 for, and therefore they claim that the new algorithms are 876 00:50:48,160 --> 00:50:50,110 better than the old algorithms. 877 00:50:50,110 --> 00:50:53,620 And they probably are, but I'm not sure. 878 00:50:53,620 --> 00:50:55,140 Anyway, this is a very cute algorithm. 879 00:51:00,540 --> 00:51:04,190 So what you're trying to do here, the objective, one 880 00:51:04,190 --> 00:51:09,500 objective which is achieved, is if you observe the output 881 00:51:09,500 --> 00:51:13,620 from the given probability model, say a Markov source, 882 00:51:13,620 --> 00:51:17,810 and I build the best code I can for that Markov source, 883 00:51:17,810 --> 00:51:20,250 then we know how many bits we need per symbol. 884 00:51:20,250 --> 00:51:23,800 The number of bits we need per symbol is this entropy of a 885 00:51:23,800 --> 00:51:27,280 symbol given the state. 886 00:51:27,280 --> 00:51:29,750 That's the best we can do. 887 00:51:29,750 --> 00:51:33,160 How well does the Lempel Ziv algorithm do? 888 00:51:33,160 --> 00:51:37,290 Asymptotically, when you make everything large it will 889 00:51:37,290 --> 00:51:41,130 encode using a number of bits per symbol, which is H of X, 890 00:51:41,130 --> 00:51:45,770 given S. So it'll do just as well as the best thing does 891 00:51:45,770 --> 00:51:49,030 which happens to know the model to start with. 892 00:51:49,030 --> 00:51:53,520 As I said before, the algorithm also compresses in the absence 893 00:51:53,520 --> 00:51:56,120 of any ordinary kind of statistical structure. 894 00:51:56,120 --> 00:51:58,860 Whatever kind of structure is there, this 895 00:51:58,860 --> 00:52:01,160 algorithm sorts it out. 896 00:52:01,160 --> 00:52:04,080 It should deal with gradually changing statistics. 897 00:52:04,080 --> 00:52:06,980 It does that also, but perhaps not in the best way. 898 00:52:06,980 --> 00:52:08,280 We'll talk about that later. 899 00:52:11,730 --> 00:52:13,180 Let's describe it a little bit. 900 00:52:19,350 --> 00:52:23,800 If we let x 1, x 2 blah blah blah, be the output of the 901 00:52:23,800 --> 00:52:29,610 source, and the alphabet is some alphabet capital X, which 902 00:52:29,610 --> 00:52:41,720 has size m, let's just as notation, let x sub m super n 903 00:52:41,720 --> 00:52:46,780 denote the string xm, xm plus 1, up to xn. 904 00:52:46,780 --> 00:52:49,900 In other words, in describing this algorithm we are, all the 905 00:52:49,900 --> 00:52:54,310 time, talking about strings of letters taken out of this 906 00:52:54,310 --> 00:52:57,110 infinite length string that comes out of the source. 907 00:52:57,110 --> 00:53:00,800 We want to have a nice notation for talking about a 908 00:53:00,800 --> 00:53:04,210 sub string of the actual sequence. 909 00:53:04,210 --> 00:53:08,300 We're going to use a window in this algorithm. 910 00:53:08,300 --> 00:53:12,510 We want the window to have a size which is a power of 2. 911 00:53:12,510 --> 00:53:17,290 Typical values for the window range from about a thousand up 912 00:53:17,290 --> 00:53:20,270 to about a million. 913 00:53:20,270 --> 00:53:23,830 Maybe they're even bigger now, I don't know.
914 00:53:23,830 --> 00:53:25,740 But as we'll see later, there's 915 00:53:25,740 --> 00:53:28,370 some constraints there. 916 00:53:28,370 --> 00:53:33,950 What the Lempel Ziv algorithm does, this LZ77 algorithm, is 917 00:53:33,950 --> 00:53:38,830 it matches the longest string of yet unencoded -- this is 918 00:53:38,830 --> 00:53:47,310 unencoded also, isn't it, that's simple enough -- 919 00:53:47,310 --> 00:53:51,330 of yet unencoded symbols by using strings 920 00:53:51,330 --> 00:53:52,370 starting in the window. 921 00:53:52,370 --> 00:53:56,700 So it takes this sequence of stuff we haven't observed yet, 922 00:53:56,700 --> 00:54:00,750 it tries to find the longest string starting there which it 923 00:54:00,750 --> 00:54:04,590 can match with something that's already in the window. 924 00:54:04,590 --> 00:54:07,580 If it can find something which matches with something in the 925 00:54:07,580 --> 00:54:10,630 window, what does it do? 926 00:54:10,630 --> 00:54:13,940 It's going to first say how long the match was, and then 927 00:54:13,940 --> 00:54:17,210 it's going to say where in the window it found it. 928 00:54:17,210 --> 00:54:19,830 And the decoder is sitting there, the decoder has this 929 00:54:19,830 --> 00:54:24,570 window which it observes also, so the decoder can find the 930 00:54:24,570 --> 00:54:27,200 same match which is in the window. 931 00:54:27,200 --> 00:54:29,870 Why does it work? 932 00:54:29,870 --> 00:54:33,740 Well, it works because with all of these AEP properties 933 00:54:33,740 --> 00:54:37,750 that we're thinking of, you tend to have typical sequences 934 00:54:37,750 --> 00:54:39,790 sitting there in the window. 935 00:54:39,790 --> 00:54:42,420 And you tend to have typical sequences which come out of 936 00:54:42,420 --> 00:54:43,300 the source. 937 00:54:43,300 --> 00:54:46,300 So the thing we're trying to encode is some typical 938 00:54:46,300 --> 00:54:48,660 sequence -- 939 00:54:48,660 --> 00:54:51,470 well, you can think of short typical sequences and 940 00:54:51,470 --> 00:54:53,510 longer typical sequences. 941 00:54:53,510 --> 00:54:56,980 We try to find the longest typical sequence that we can. 942 00:54:56,980 --> 00:54:59,530 And we're looking back into this window, and there are 943 00:54:59,530 --> 00:55:02,290 an enormous number of typical sequences there. 944 00:55:02,290 --> 00:55:04,720 If we make the typical sequences short enough, there 945 00:55:04,720 --> 00:55:07,430 aren't too many of them, and most of them are sitting there 946 00:55:07,430 --> 00:55:09,000 in the window. 947 00:55:09,000 --> 00:55:13,150 This'll become clearer as we go. 948 00:55:13,150 --> 00:55:15,280 Let's go on and actually explain what 949 00:55:15,280 --> 00:55:16,560 the algorithm does. 950 00:55:20,130 --> 00:55:23,930 So here's the algorithm. 951 00:55:23,930 --> 00:55:29,850 First, you take this large W, this large window size, and 952 00:55:29,850 --> 00:55:32,190 we're going to encode the first W symbols. 953 00:55:32,190 --> 00:55:35,500 We're not even going to use any compression, that's just 954 00:55:35,500 --> 00:55:37,320 lost stuff. 955 00:55:37,320 --> 00:55:41,820 So we encode this first million symbols, we grin and 956 00:55:41,820 --> 00:55:45,640 bear it, and then the decoder has this window 957 00:55:45,640 --> 00:55:47,170 of a million symbols. 958 00:55:47,170 --> 00:55:50,840 We at the encoder have this window of a million symbols, 959 00:55:50,840 --> 00:55:53,350 and we proceed from there.
960 00:55:53,350 --> 00:55:56,150 So it gets amortized, so we don't care. 961 00:55:56,150 --> 00:56:01,350 So we then have a pointer, and we set the pointer to W. 962 00:56:01,350 --> 00:56:05,610 So the pointer is the last thing that we encoded. 963 00:56:05,610 --> 00:56:10,190 So we have all this encoded stuff; starting at time P, 964 00:56:10,190 --> 00:56:14,850 everything beyond there is as yet unencoded. 965 00:56:14,850 --> 00:56:17,200 That's the first step in the algorithm. 966 00:56:17,200 --> 00:56:18,450 So far, so good. 967 00:56:22,740 --> 00:56:27,360 The next step is to find the largest n, greater than or 968 00:56:27,360 --> 00:56:29,780 equal to 2, I'll explain why greater than or equal to 2 969 00:56:29,780 --> 00:56:38,160 later, such that the string x sub p plus 1 up to p plus n, 970 00:56:38,160 --> 00:56:39,190 what is that? 971 00:56:39,190 --> 00:56:43,820 It's the string which starts right beyond the pointer, 972 00:56:43,820 --> 00:56:47,160 namely the string that starts here, what we're trying to do 973 00:56:47,160 --> 00:56:50,560 is find the largest n, in other words the longest string 974 00:56:50,560 --> 00:56:54,020 starting here, which we can match with something that's in 975 00:56:54,020 --> 00:56:55,290 the window. 976 00:56:55,290 --> 00:57:00,430 Now we look at a, a is in the window, we look at a b, a b is 977 00:57:00,430 --> 00:57:01,530 in the window. 978 00:57:01,530 --> 00:57:10,900 We look at a b a, a b a is in the window. a b a b, a b a b 979 00:57:10,900 --> 00:57:12,530 is not in the window. 980 00:57:12,530 --> 00:57:15,960 At least I hope it's not in the window or I screwed up. 981 00:57:15,960 --> 00:57:17,210 Yeah, it's not in the window. 982 00:57:17,210 --> 00:57:20,810 So the longest thing we can find which matches with what's 983 00:57:20,810 --> 00:57:23,430 in the window is this match of length three. 984 00:57:26,440 --> 00:57:29,420 So this is finding the longest match which matches with 985 00:57:29,420 --> 00:57:31,770 something here. 986 00:57:31,770 --> 00:57:36,900 This next example, I think the only way I can regard 987 00:57:36,900 --> 00:57:39,300 that is as a hack. 988 00:57:39,300 --> 00:57:42,410 It's a kind of hack that programmers like. 989 00:57:42,410 --> 00:57:45,500 It's very mysterious, but it's also the kind of hack that 990 00:57:45,500 --> 00:57:49,840 mathematicians like because in this case this particular hack 991 00:57:49,840 --> 00:57:52,790 makes the analysis much easier. 992 00:57:52,790 --> 00:57:55,420 So this is another kind of match. 993 00:57:55,420 --> 00:57:58,240 It's looking for the longest string here, 994 00:57:58,240 --> 00:58:00,510 starting at this pointer. 995 00:58:00,510 --> 00:58:05,450 a b a b and so forth, which matches things starting here. 996 00:58:05,450 --> 00:58:07,150 Starting somewhere in the window. 997 00:58:07,150 --> 00:58:11,990 So it finds a match a b here. a b a here. 998 00:58:11,990 --> 00:58:17,050 But now it looks for a b a b, a match of four. 999 00:58:17,050 --> 00:58:18,640 Where do we find it? 1000 00:58:18,640 --> 00:58:21,940 We can start back here, which is still in the window, and 1001 00:58:21,940 --> 00:58:25,190 what we see is a b a b. 1002 00:58:25,190 --> 00:58:30,160 So these four digits match these four digits.
1003 00:58:30,160 --> 00:58:34,410 Well you might say foul ball, because if I tell you there's 1004 00:58:34,410 --> 00:58:40,740 a match of four and I tell you where it is, in fact, all the 1005 00:58:40,740 --> 00:58:44,600 poor decoder knows is this. 1006 00:58:44,600 --> 00:58:48,180 If I tell you there's a match of four and it starts here, 1007 00:58:48,180 --> 00:58:50,110 what's the poor decoder going to do? 1008 00:58:50,110 --> 00:58:56,950 The poor decoder says, ok, so a is that digit two digits 1009 00:58:56,950 --> 00:58:59,180 ago, so that gives me the a. 1010 00:58:59,180 --> 00:59:02,000 So I know there's an a there. b is the next 1011 00:59:02,000 --> 00:59:04,920 digit, so b is there. 1012 00:59:04,920 --> 00:59:08,420 And then I know the first two digits beyond the window, and 1013 00:59:08,420 --> 00:59:12,960 therefore this third digit is a, so that must be that digit. 1014 00:59:12,960 --> 00:59:15,340 The fourth digit is this digit, which 1015 00:59:15,340 --> 00:59:17,030 must be that digit. 1016 00:59:17,030 --> 00:59:18,430 OK? 1017 00:59:18,430 --> 00:59:21,220 If you didn't catch that, you can just think about it, it'll 1018 00:59:21,220 --> 00:59:23,140 become clear. 1019 00:59:23,140 --> 00:59:24,510 I mean it really is a hack. 1020 00:59:24,510 --> 00:59:25,830 It's not very important. 1021 00:59:25,830 --> 00:59:28,000 It won't change the way this thing behaves. 1022 00:59:28,000 --> 00:59:33,470 But it does change the way you analyze it. 1023 00:59:33,470 --> 00:59:34,690 So that's the first thing you do, you 1024 00:59:34,690 --> 00:59:38,170 look for these matches. 1025 00:59:38,170 --> 00:59:40,230 Next thing we're going to do is we're going to try to 1026 00:59:40,230 --> 00:59:42,690 encode the matches. 1027 00:59:42,690 --> 00:59:46,640 Namely, we're going to try to encode the thing that we found 1028 00:59:46,640 --> 00:59:48,180 in the window. 1029 00:59:48,180 --> 00:59:50,900 How do we encode what we found in the window? 1030 00:59:50,900 --> 00:59:52,480 Well the first thing we have to do -- yeah? 1031 00:59:52,480 --> 00:59:55,630 AUDIENCE: What if you don't find any matches? 1032 00:59:55,630 --> 00:59:56,980 PROFESSOR: I'm going to talk about that later. 1033 00:59:56,980 --> 01:00:00,030 If you don't find any matches, I mean what I was looking for 1034 01:00:00,030 --> 01:00:02,900 was matches of two or more. 1035 01:00:02,900 --> 01:00:06,440 If you don't find any matches of two or more, what you do is 1036 01:00:06,440 --> 01:00:11,040 you just take the first unencoded letter and you encode 1037 01:00:11,040 --> 01:00:12,390 that without any compression. 1038 01:00:16,790 --> 01:00:21,180 I mean our strategy here is to always send the length of the 1039 01:00:21,180 --> 01:00:22,530 match first. 1040 01:00:22,530 --> 01:00:25,870 If you say the length of the match is one, then the decoder 1041 01:00:25,870 --> 01:00:28,850 knows to look for uncompressed symbols, instead of looking 1042 01:00:28,850 --> 01:00:30,750 for something in the window. 1043 01:00:30,750 --> 01:00:33,880 So it takes care of the case where there haven't been any 1044 01:00:33,880 --> 01:00:38,240 occurrences of the symbol anywhere in the window. 1045 01:00:38,240 --> 01:00:41,030 So you only look for matches of length two or more. 1046 01:00:44,400 --> 01:00:50,350 So then you use something called a unary-binary code. 1047 01:00:50,350 --> 01:00:54,990 Theoreticians always copy everybody else's work.
1048 01:00:54,990 --> 01:01:01,030 The unary-binary code was due to Peter Elias, who was the 1049 01:01:01,030 --> 01:01:03,960 head of this department for a long time. 1050 01:01:03,960 --> 01:01:07,610 He just died about six months ago. 1051 01:01:07,610 --> 01:01:10,970 He was here up until his death. 1052 01:01:10,970 --> 01:01:14,290 He used to organize department colloquia. 1053 01:01:14,290 --> 01:01:17,620 He was so essential that since he died, nobody's taken over 1054 01:01:17,620 --> 01:01:19,430 the department colloquia. 1055 01:01:19,430 --> 01:01:23,090 He was my thesis adviser, so I tend to think very kindly of 1056 01:01:23,090 --> 01:01:26,350 him. And he was lots of other things. 1057 01:01:26,350 --> 01:01:31,195 But anyway, he invented this unary-binary code, which is a 1058 01:01:31,195 --> 01:01:33,980 way of encoding the integers, which has a lot of nice 1059 01:01:33,980 --> 01:01:35,530 properties. 1060 01:01:35,530 --> 01:01:39,230 And they're universal properties, as you will see. 1061 01:01:39,230 --> 01:01:42,130 The idea is to encode the integers; there are an infinite 1062 01:01:42,130 --> 01:01:44,550 number of integers. 1063 01:01:44,550 --> 01:01:47,870 What you'd like to do, somehow or other, is have shorter code 1064 01:01:47,870 --> 01:01:51,530 words for lower integers, and longer code 1065 01:01:51,530 --> 01:01:54,920 words for larger integers. 1066 01:01:54,920 --> 01:01:58,140 In this particular Lempel Ziv algorithm, it's particularly 1067 01:01:58,140 --> 01:02:01,920 important to have the length of the code words growing as 1068 01:02:01,920 --> 01:02:04,790 the logarithm of n. 1069 01:02:04,790 --> 01:02:08,720 Because then anytime you find a really long match, and you 1070 01:02:08,720 --> 01:02:11,560 got a very large n, you're encoding a whole lot of 1071 01:02:11,560 --> 01:02:16,020 letters, and therefore you don't care if there's an 1072 01:02:16,020 --> 01:02:19,080 overhead which is proportional to log n. 1073 01:02:19,080 --> 01:02:20,720 So you don't mind that. 1074 01:02:20,720 --> 01:02:23,680 And if there's a very small number of letters encoded, you 1075 01:02:23,680 --> 01:02:27,280 want something very efficient then. 1076 01:02:27,280 --> 01:02:28,910 So it does that. 1077 01:02:28,910 --> 01:02:33,950 And the way it does it is, first you generate a prefix, 1078 01:02:33,950 --> 01:02:38,470 and then you have a representation in base 2. 1079 01:02:38,470 --> 01:02:40,450 Namely base 2 expansion. 1080 01:02:40,450 --> 01:02:46,750 So the number n, the prefix here, I think I said it here, 1081 01:02:46,750 --> 01:02:49,980 the positive integer n is encoded into the binary 1082 01:02:49,980 --> 01:02:56,070 representation of n, preceded by a prefix of integer part of 1083 01:02:56,070 --> 01:02:58,790 log to the base 2 of n zeros. 1084 01:02:58,790 --> 01:03:03,930 Now what's the integer part of log to the base 2 of 1? 1085 01:03:03,930 --> 01:03:07,220 Log to the base 2 of 1 is zero. 1086 01:03:07,220 --> 01:03:10,320 So it has a prefix of zero zeros. 1087 01:03:10,320 --> 01:03:12,390 So zero zeros is nothing. 1088 01:03:12,390 --> 01:03:17,890 So the prefix is nothing, the expansion of 1, in a base 2 1089 01:03:17,890 --> 01:03:21,360 expansion or any other expansion, is 1. 1090 01:03:21,360 --> 01:03:25,830 So the code word for 1 is 1. 1091 01:03:25,830 --> 01:03:31,820 If you have the number 2, log to the base 2 of 2 is 1.
1092 01:03:31,820 --> 01:03:35,510 The integer part of 1 is 1, so you start out with a single 1093 01:03:35,510 --> 01:03:38,550 zero and then you have the expansion, 2 is 1094 01:03:38,550 --> 01:03:40,590 expanded as 1 zero. 1095 01:03:40,590 --> 01:03:41,970 And so forth. 1096 01:03:41,970 --> 01:03:46,700 Oh and then 3 is expanded as 1 1, again with a prefix of 1. 1097 01:03:46,700 --> 01:03:49,470 Four is encoded as 1 zero zero, blah 1098 01:03:49,470 --> 01:03:52,600 blah blah and so forth. 1099 01:03:52,600 --> 01:03:55,500 Why don't you just leave the prefix out? 1100 01:04:00,090 --> 01:04:02,750 Anybody figure out why I need the prefix there? 1101 01:04:02,750 --> 01:04:07,250 AUDIENCE: Without those prefixes, you don't have a 1102 01:04:07,250 --> 01:04:07,490 prefix-free code. 1103 01:04:07,490 --> 01:04:12,430 PROFESSOR: Yeah. Right. 1104 01:04:12,430 --> 01:04:15,850 If I left them out, everything would start with 1. 1105 01:04:15,850 --> 01:04:18,680 I would get to the 1, I would say, gee, is that the end of 1106 01:04:18,680 --> 01:04:20,580 it or isn't it the end of it? 1107 01:04:20,580 --> 01:04:23,050 I wouldn't know. 1108 01:04:23,050 --> 01:04:26,840 But with this, if I see 1, the only code word that starts 1109 01:04:26,840 --> 01:04:30,030 with 1 is this one. 1110 01:04:30,030 --> 01:04:34,780 If it's 2 or 3, it starts with zero and then there's a 1, 1111 01:04:34,780 --> 01:04:39,010 which says it's on that next branch which has probability 1/4. 1112 01:04:39,010 --> 01:04:45,380 I have a 1 0, 1 1, a prefix of 0 0, followed by a 1, puts me 1113 01:04:45,380 --> 01:04:47,640 off on another branch. 1114 01:04:47,640 --> 01:04:49,150 And so forth. 1115 01:04:49,150 --> 01:04:51,810 So, yes, this is a prefix-free code. 1116 01:04:51,810 --> 01:04:56,000 And it's a prefix-free code which has this nice property 1117 01:04:56,000 --> 01:05:02,890 that the number of digits in the code word is approximately 1118 01:05:02,890 --> 01:05:04,980 2 times log n. 1119 01:05:04,980 --> 01:05:08,940 Namely, it goes up both ways here. 1120 01:05:08,940 --> 01:05:12,540 The number of zeros I need is log to the base 2 of n. 1121 01:05:12,540 --> 01:05:15,910 The number of digits in the base 2 expansion, is also the 1122 01:05:15,910 --> 01:05:19,310 integer part of log to the base 2 of n. 1123 01:05:19,310 --> 01:05:21,820 So it works both ways. 1124 01:05:21,820 --> 01:05:26,920 And there's always this 1 in the middle. 1125 01:05:26,920 --> 01:05:28,880 Again, it's a hack. 1126 01:05:28,880 --> 01:05:30,940 It's a hack that works very nicely when you 1127 01:05:30,940 --> 01:05:32,350 try to analyze this. 1128 01:05:35,830 --> 01:05:41,480 OK so if the size of the match is bigger than one, we're 1129 01:05:41,480 --> 01:05:44,810 going to encode the positive integer u. 1130 01:05:44,810 --> 01:05:47,750 u was where the match occurred. 1131 01:05:47,750 --> 01:05:49,670 How far back do you have to count before 1132 01:05:49,670 --> 01:05:51,780 you find this match? 1133 01:05:51,780 --> 01:05:55,020 You're going to encode that integer u into a fixed length 1134 01:05:55,020 --> 01:06:00,530 code of length log of w bits. 1135 01:06:00,530 --> 01:06:03,020 In other words, you have a window of 1136 01:06:03,020 --> 01:06:05,780 size 2 to the twentieth. 1137 01:06:05,780 --> 01:06:10,860 You can encode any point in there with 20 binary digits. 1138 01:06:10,860 --> 01:06:14,740 The 20 binary digits say how far do you have to go back to 1139 01:06:14,740 --> 01:06:16,060 find this code word.
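Here is a minimal Python sketch of this unary-binary code, plus the fixed-length encoding of the position (the exact mapping of the offset u onto log w bits is my own convention; the lecture only fixes its length):

def unary_binary_encode(n):
    """Positive integer n -> floor(log2 n) zeros, then n in binary."""
    assert n >= 1
    binary = bin(n)[2:]                  # base 2 expansion, leading 1 first
    return "0" * (len(binary) - 1) + binary

def unary_binary_decode(bits, pos=0):
    """Read one integer starting at bits[pos]; return (n, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":      # the prefix says how many bits follow
        zeros += 1
    start = pos + zeros
    return int(bits[start:start + zeros + 1], 2), start + zeros + 1

def encode_position(u, W):
    """Match position u, 1 <= u <= W, in exactly log2(W) bits
    (u = W wraps to all zeros; W is assumed to be a power of 2)."""
    w_bits = W.bit_length() - 1
    return format(u % W, "0{}b".format(w_bits))

So 1, 2, 3, 4 come out as 1, 010, 011, 00100, matching the codewords above, and a match costs about 2 log n bits for its length plus log w bits for its position.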
1140 01:06:19,660 --> 01:06:23,850 So first we're encoding n, by this unary-binary code, then 1141 01:06:23,850 --> 01:06:27,240 we're encoding u just with this simple-minded way of 1142 01:06:27,240 --> 01:06:29,780 encoding log w bits. 1143 01:06:29,780 --> 01:06:32,000 And that tells us where the match is. 1144 01:06:32,000 --> 01:06:40,110 The decoder goes back there, finds the match, pumps it out. 1145 01:06:40,110 --> 01:06:42,990 If n is equal to 1 (here's the answer to your question), you 1146 01:06:42,990 --> 01:06:45,840 encode the single letter without compression. 1147 01:06:45,840 --> 01:06:49,370 And that takes care of the case, either where you have a 1148 01:06:49,370 --> 01:06:52,890 match to that single letter, or there isn't any match to 1149 01:06:52,890 --> 01:06:55,980 the single letter. 1150 01:06:55,980 --> 01:06:58,140 The next thing, as you might imagine, is you set the 1151 01:06:58,140 --> 01:07:03,570 pointer to P plus n, because you've encoded n digits, and 1152 01:07:03,570 --> 01:07:04,770 you go to step two. 1153 01:07:04,770 --> 01:07:07,470 Namely, you keep iterating forever. 1154 01:07:07,470 --> 01:07:11,420 Until the source wears out, or until the encoder wears out, 1155 01:07:11,420 --> 01:07:13,370 or until the decoder wears out. 1156 01:07:13,370 --> 01:07:16,030 You just keep going. 1157 01:07:16,030 --> 01:07:17,790 That's all the algorithm is. 1158 01:07:17,790 --> 01:07:18,050 Yeah? 1159 01:07:18,050 --> 01:07:23,692 AUDIENCE: Can you throw out the first n bits, when you 1160 01:07:23,692 --> 01:07:27,852 reset the pointer, because you only have w bits that say 1161 01:07:27,852 --> 01:07:29,102 where n was for the next iteration? 1162 01:07:31,260 --> 01:07:35,730 PROFESSOR: No, I throw out the n oldest bits in the window. 1163 01:07:35,730 --> 01:07:37,570 AUDIENCE: Well, those are the first n bits. 1164 01:07:37,570 --> 01:07:40,790 PROFESSOR: Yes, the first n bits out of the window, and I 1165 01:07:40,790 --> 01:07:45,010 keep all of the more recent bits. 1166 01:07:45,010 --> 01:07:47,530 I tend to think of the first ones as the things closest to 1167 01:07:47,530 --> 01:07:53,600 the pointer, but you think of it either way, which is fine. 1168 01:07:53,600 --> 01:07:56,960 So as you do it, the window keeps sliding along. 1169 01:08:00,790 --> 01:08:02,700 That's what it does. 1170 01:08:07,650 --> 01:08:11,430 Why do you think this works? There's a partial 1171 01:08:11,430 --> 01:08:13,520 analysis in the notes. 1172 01:08:17,000 --> 01:08:20,530 I'd like to say a little bit about how that analysis is 1173 01:08:20,530 --> 01:08:25,450 cheating, because it's not quite a fair analysis. 1174 01:08:25,450 --> 01:08:29,460 If you look at the window, there are w different starting 1175 01:08:29,460 --> 01:08:32,830 points in the window. 1176 01:08:32,830 --> 01:08:34,630 So let's write this down. 1177 01:08:38,100 --> 01:08:40,780 w starting points. 1178 01:08:48,590 --> 01:08:56,860 So for any given n, there are w 1179 01:08:56,860 --> 01:09:02,050 strings of length n. 1180 01:09:07,180 --> 01:09:11,070 We don't know how long this match is going to be, but what 1181 01:09:11,070 --> 01:09:14,850 I would like to do, if I'm thinking of a Markov source, 1182 01:09:14,850 --> 01:09:20,810 is to say, OK, let's make n large enough so that the size 1183 01:09:20,810 --> 01:09:24,620 of the typical set is about w. 1184 01:09:24,620 --> 01:09:25,040 OK. 1185 01:09:25,040 --> 01:09:45,910 So choose n to be about log w divided by H of X given S.
And 1186 01:09:45,910 --> 01:09:50,530 the size of the typical set is then going to be 2 to the n, 1187 01:09:50,530 --> 01:09:52,660 wait a minute. 1188 01:10:00,400 --> 01:10:02,860 n times H of X given S is equal to log w. 1189 01:10:12,000 --> 01:10:16,460 So the size of the typical set, T sub epsilon, is 1190 01:10:16,460 --> 01:10:23,240 going to be, roughly, from what we said, 2 to the n times H of 1191 01:10:23,240 --> 01:10:32,550 X given S. So what I'm going to do is to set w equal to the 1192 01:10:32,550 --> 01:10:33,800 size of T of epsilon. 1193 01:10:39,090 --> 01:10:45,080 I'm going to focus on a match length which I'm hoping to 1194 01:10:45,080 --> 01:10:50,470 achieve, of log w over H of X given S. The typical set, 1195 01:10:50,470 --> 01:10:55,675 then, is of size 2 to the n times H of X given S. And if 1196 01:10:55,675 --> 01:11:00,380 the typical set is of this size, and I look at these w 1197 01:11:00,380 --> 01:11:04,030 strings in the window, yeah, I'm going to have some 1198 01:11:04,030 --> 01:11:08,380 duplicates but roughly I'm going to have a large enough 1199 01:11:08,380 --> 01:11:12,780 number of things in the window to represent all of these 1200 01:11:12,780 --> 01:11:13,990 typical strings. 1201 01:11:13,990 --> 01:11:15,660 Or most of them. 1202 01:11:15,660 --> 01:11:19,680 If I try to choose an n which is a little bigger than that, 1203 01:11:19,680 --> 01:11:22,540 let's call this n star. 1204 01:11:22,540 --> 01:11:26,190 If I try to make n a little bit bigger than this typical 1205 01:11:26,190 --> 01:11:29,970 match size, I don't have a prayer of a chance, because 1206 01:11:29,970 --> 01:11:34,340 the typical set then is just very much larger than w, so 1207 01:11:34,340 --> 01:11:38,360 I'd be very, very lucky if I found anything in the window. 1208 01:11:38,360 --> 01:11:40,310 So that can't work. 1209 01:11:40,310 --> 01:11:45,160 If I make n a good deal smaller, then I'm going to 1210 01:11:45,160 --> 01:11:49,030 succeed with great probability it seems, because I'm even 1211 01:11:49,030 --> 01:11:51,600 allowing for many, many duplicates of each of these 1212 01:11:51,600 --> 01:11:55,540 typical sets to be in the window. 1213 01:11:55,540 --> 01:11:57,610 So what this is saying is there ought to be some 1214 01:11:57,610 --> 01:12:00,600 critical length when the window is very large, a 1215 01:12:00,600 --> 01:12:04,050 critical match length, and most of the time the match is 1216 01:12:04,050 --> 01:12:08,310 going to be somewhere around this value here. 1217 01:12:08,310 --> 01:12:11,880 And as w becomes truly humongous, and as the match 1218 01:12:11,880 --> 01:12:13,990 size becomes large -- 1219 01:12:13,990 --> 01:12:17,360 you remember for these typical sets to make any sense, this 1220 01:12:17,360 --> 01:12:19,690 number has to be large. 1221 01:12:19,690 --> 01:12:22,230 And when this number gets large, the size of the typical 1222 01:12:22,230 --> 01:12:25,750 set is humongous. 1223 01:12:25,750 --> 01:12:29,770 Which says, that for this asymptotic analysis, a window 1224 01:12:29,770 --> 01:12:32,370 of 2 to the twentieth, probably 1225 01:12:32,370 --> 01:12:34,530 isn't nearly big enough. 1226 01:12:34,530 --> 01:12:38,530 So the asymptotic analysis is really saying, when you have 1227 01:12:38,530 --> 01:12:41,670 really humongous windows this is going to work.
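In symbols, the sizing argument, together with the bit count that comes next, looks like this (n star is the critical match length):

\begin{gather*}
n^{*} \approx \frac{\log_2 W}{H(X \mid S)}, \qquad
|T_\epsilon| \approx 2^{\,n^{*} H(X \mid S)} = W, \\
\text{bits per source symbol} \approx
\frac{\log_2 W + 2\log_2 n^{*}}{n^{*}}
\;\approx\; \frac{\log_2 W}{n^{*}} \;=\; H(X \mid S).
\end{gather*}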
1228 01:12:44,630 --> 01:12:47,390 You don't make windows that large, so you have to have 1229 01:12:47,390 --> 01:12:50,560 some faith that this theoretical argument is going 1230 01:12:50,560 --> 01:12:51,250 to work here. 1231 01:12:51,250 --> 01:12:54,950 But that tells you roughly what the size of these matches 1232 01:12:54,950 --> 01:12:56,280 is going to be. 1233 01:12:56,280 --> 01:13:05,020 If the size of the matches is that, and you use log w bits 1234 01:13:05,020 --> 01:13:09,390 plus 2 log n bits to encode each match, what happens? 1235 01:13:14,130 --> 01:13:15,890 Encode match. 1236 01:13:23,350 --> 01:13:36,180 You use log w, plus 2 log n star. 1237 01:13:36,180 --> 01:13:38,530 That's the number of bits it takes you, this is the number 1238 01:13:38,530 --> 01:13:42,770 of bits it takes you to encode what the match size is. 1239 01:13:42,770 --> 01:13:44,720 You still have to encode that. 1240 01:13:44,720 --> 01:13:48,790 This is the number of bits it takes you to encode where the 1241 01:13:48,790 --> 01:13:50,720 match occurs. 1242 01:13:50,720 --> 01:13:54,290 Now how big is this relative to this? 1243 01:13:54,290 --> 01:14:00,700 Well n star is on the order of log w, so we're taking log w 1244 01:14:00,700 --> 01:14:05,480 plus 2 times log of log of w. 1245 01:14:05,480 --> 01:14:08,860 So in an approximate analysis, you say, I don't even care 1246 01:14:08,860 --> 01:14:11,760 about that. 1247 01:14:11,760 --> 01:14:17,460 You wind up with encoding a match with log w bits. 1248 01:14:17,460 --> 01:14:22,530 So you encode n star symbols, you use log w bits to do it, how 1249 01:14:22,530 --> 01:14:27,300 many bits are you using per symbol? 1250 01:14:27,300 --> 01:14:43,725 H of X given S. That's roughly the idea of why the Lempel Ziv 1251 01:14:43,725 --> 01:14:46,020 algorithm works. 1252 01:14:46,020 --> 01:14:48,880 Can anybody spot any problems with that analysis? 1253 01:14:48,880 --> 01:14:49,210 Yeah. 1254 01:14:49,210 --> 01:14:55,643 AUDIENCE: You don't know the probabilities beforehand, so 1255 01:14:55,643 --> 01:15:00,490 how do you pick w? 1256 01:15:00,490 --> 01:15:02,350 PROFESSOR: Good one. 1257 01:15:02,350 --> 01:15:09,320 You pick w by saying, I have a computer which will go at a 1258 01:15:09,320 --> 01:15:10,820 certain speed. 1259 01:15:10,820 --> 01:15:14,390 My data rate is coming in at a certain speed, and I'm going 1260 01:15:14,390 --> 01:15:17,690 to pick w as large as I can keep up with. 1261 01:15:17,690 --> 01:15:19,710 With the best algorithm I can think of for 1262 01:15:19,710 --> 01:15:23,010 doing string matching. 1263 01:15:23,010 --> 01:15:26,690 And string matching is not a terribly hard thing to do, but it's not a 1264 01:15:26,690 --> 01:15:29,100 terribly easy thing to do either. 1265 01:15:29,100 --> 01:15:30,810 So you make w as large as you can. 1266 01:15:33,830 --> 01:15:36,480 And if it's not large enough, tough. 1267 01:15:36,480 --> 01:15:39,760 You got matches which are somewhat smaller -- all this 1268 01:15:39,760 --> 01:15:43,830 argument about typical sets still works except for the 1269 01:15:43,830 --> 01:15:48,690 epsilons and deltas that are tucked in there. 1270 01:15:48,690 --> 01:15:51,560 So it's just that the epsilons and the deltas get too big 1271 01:15:51,560 --> 01:15:54,410 when your strings are not long enough. 1272 01:15:54,410 --> 01:15:54,740 Yeah? 1273 01:15:54,740 --> 01:15:59,931 AUDIENCE: So your w is just make your processing time 1274 01:15:59,931 --> 01:16:01,160 equal the time [UNINTELLIGIBLE]?
1275 01:16:01,160 --> 01:16:03,550 PROFESSOR: Yeah. 1276 01:16:03,550 --> 01:16:05,290 That's what determines w. 1277 01:16:05,290 --> 01:16:08,690 It's how fast you can do a string search over this long, 1278 01:16:08,690 --> 01:16:10,410 long window. 1279 01:16:10,410 --> 01:16:13,950 You're not going to just search everything one by one, 1280 01:16:13,950 --> 01:16:16,300 you're going to build some kind of data structure there 1281 01:16:16,300 --> 01:16:19,940 that makes these searches run fast. 1282 01:16:19,940 --> 01:16:21,870 Can anybody think of why you might not want 1283 01:16:21,870 --> 01:16:23,350 to make w too large? 1284 01:16:28,150 --> 01:16:31,010 This isn't a theoretical reason, this is a 1285 01:16:31,010 --> 01:16:31,790 more practical thing. 1286 01:16:31,790 --> 01:16:33,993 AUDIENCE: Is it that if the probabilities change, it's 1287 01:16:33,993 --> 01:16:35,370 slow to react to those changes? 1288 01:16:35,370 --> 01:16:37,110 PROFESSOR: If the probabilities change, it's 1289 01:16:37,110 --> 01:16:41,900 slow to react to them, because it's got this humongous window 1290 01:16:41,900 --> 01:16:46,300 here, and it's not until the window fills up with all of 1291 01:16:46,300 --> 01:16:53,240 this new stuff, that it starts to work well. 1292 01:16:53,240 --> 01:16:57,150 And before it fills up, you're using an effectively small 1293 01:16:57,150 --> 01:17:00,910 window, but you're using a number of bits which is 1294 01:17:00,910 --> 01:17:04,250 proportional to log of a large window, and therefore 1295 01:17:04,250 --> 01:17:06,370 you're wasting bits. 1296 01:17:06,370 --> 01:17:09,450 So that's another thing that determines how big you 1297 01:17:09,450 --> 01:17:11,630 want w to be -- 1298 01:17:11,630 --> 01:17:14,040 I mean the main thing that determines it is just that you 1299 01:17:14,040 --> 01:17:15,940 can't run that fast. 1300 01:17:15,940 --> 01:17:19,130 Because you'd like to make it pretty big. 1301 01:17:19,130 --> 01:17:21,090 Another question. 1302 01:17:21,090 --> 01:17:25,140 How about this matter of wasting w symbols at the 1303 01:17:25,140 --> 01:17:27,560 beginning to fill up the window? 1304 01:17:27,560 --> 01:17:28,810 What do you do about that? 1305 01:17:31,420 --> 01:17:33,220 I mean, that's a pretty stupid thing, right? 1306 01:17:37,900 --> 01:17:40,340 Anybody suggest a solution to that? 1307 01:17:40,340 --> 01:17:44,720 If you're building this yourself, how 1308 01:17:44,720 --> 01:17:47,400 would you handle it? 1309 01:17:47,400 --> 01:17:50,970 It's the same argument as if the statistics change in 1310 01:17:50,970 --> 01:17:52,220 mid-stream. 1311 01:17:54,010 --> 01:17:56,890 I mean, you don't measure that the statistics have changed, 1312 01:17:56,890 --> 01:17:59,060 and throw out what's in the window. 1313 01:18:05,610 --> 01:18:06,000 What? 1314 01:18:06,000 --> 01:18:08,366 AUDIENCE: Can you not assume that you already have a 1315 01:18:08,366 --> 01:18:09,040 typical sequence? 1316 01:18:09,040 --> 01:18:10,810 PROFESSOR: Yes, and you don't care whether 1317 01:18:10,810 --> 01:18:12,730 it's right or wrong. 1318 01:18:12,730 --> 01:18:16,240 You could assume that the typical sequence is all zeros, 1319 01:18:16,240 --> 01:18:18,570 so you fill up the window with all zeros. 1320 01:18:18,570 --> 01:18:21,340 The decoder also fills it up with all zeros, because this 1321 01:18:21,340 --> 01:18:23,750 is the way you always start.
1322 01:18:23,750 --> 01:18:25,840 And then you just start running along, looking for 1323 01:18:25,840 --> 01:18:28,300 matches and encoding things. 1324 01:18:28,300 --> 01:18:35,530 And as w builds up, you start matching things. 1325 01:18:35,530 --> 01:18:38,320 You could even be smarter, and know that your window wasn't 1326 01:18:38,320 --> 01:18:41,780 very big and let your window grow also. 1327 01:18:41,780 --> 01:18:45,990 If you wanted to really be fancy about this. 1328 01:18:45,990 --> 01:18:48,880 So if you want to encode this you can have a lot of fun, and 1329 01:18:48,880 --> 01:18:51,600 a lot of people over the years have had a lot of fun trying 1330 01:18:51,600 --> 01:18:53,570 to encode these things. 1331 01:18:53,570 --> 01:18:54,820 It's a neat thing to do.
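Pulling the steps together, here is a bare-bones Python sketch of the LZ77 loop as described in this lecture: fill the window raw, find the longest match of length at least 2 starting in the window (allowed to run past the pointer), send its length with the unary-binary code and its position in log w bits, and send single letters raw. The linear scan and the 8-bit raw letter code are toy choices of mine; a real implementation builds a search structure over the window, as discussed above. unary_binary_encode and encode_position are the sketches given earlier.

def lz77_encode(x, W):
    """Toy LZ77 encoder: x is the source string, W the window size."""
    raw = lambda s: format(ord(s), "08b")    # toy raw code: 8 bits per letter
    out = [raw(s) for s in x[:W]]            # step 1: first W letters, uncompressed
    P = W                                    # pointer: everything before P is encoded
    while P < len(x):
        # Step 2: largest n >= 2 such that x[P:P+n] matches a string
        # starting u places back, for some 1 <= u <= W.  The match source
        # may overlap x[P:] -- that's the hack from the example above.
        best_n, best_u = 0, 0
        for u in range(1, W + 1):
            n = 0
            while P + n < len(x) and x[P + n] == x[P - u + n]:
                n += 1
            if n > best_n:
                best_n, best_u = n, u
        if best_n >= 2:                      # step 3: length, then position
            out.append(unary_binary_encode(best_n))
            out.append(encode_position(best_u, W))
            P += best_n                      # step 4: slide pointer and window
        else:                                # n = 1: next letter sent raw
            out.append(unary_binary_encode(1))
            out.append(raw(x[P]))
            P += 1
    return "".join(out)

The decoder undoes a match by copying best_n symbols one at a time from u places back; copying one symbol at a time is exactly why the overlapping match in the earlier example decodes correctly.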