The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: OK. Just to review where we are, we've been talking about source coding as one of the two major parts of digital communication. Remember, you take sources, you turn them into bits. Then you take bits and you transmit them over channels. And that sums up the whole course. This is the part where you transmit over channels; this is the part where you process the sources. We're concentrating now on the source side of things, partly because by concentrating on the source side we will build up the machinery we need to look at the channel side. The channel side is really more interesting, I think, although there's been a great deal of work on both of them. They're both important.

And we said that we could separate source coding into three pieces. If you start out with a waveform source, the typical thing to do, and almost the only thing to do, is to turn those waveforms into sequences of numbers. Those sequences might be samples; they might be numbers in an expansion; they might be whatever. But the first thing you almost always do is turn waveforms into a sequence of numbers, because waveforms are just too complicated to deal with. The next thing we do with a sequence of numbers is quantize them. After we quantize them we wind up with a finite set of symbols. And the next thing we do is take that sequence of symbols, and what we do at that point is data compression. We try to represent those symbols with as small a number of binary digits per source symbol as possible, and we want to do that in such a way that we can reconstruct the symbols at the other end.
So let's review a little bit about what we've done in the last couple of lectures. We talked about the Kraft inequality. And the Kraft inequality, you remember, says that the lengths l_1, ..., l_M of the codewords in any prefix-free code have to satisfy the inequality: the sum over i of 2 to the minus l_i is less than or equal to 1. And this inequality, in some sense, says that if you want to make one codeword short, making that one codeword short eats up a large part of the sum, since the sum has to be less than or equal to 1. If, for example, you make l_1 equal to 1, then that uses up half of the sum right there, and all the other codewords have to be much longer. So that's essentially what it's saying; we proved it, we did a bunch of things with it, you worked with it in your homework, and we have shown that that inequality has to hold.

The next thing we did is, given a set of probabilities on a source, for example p_1 up to p_M -- by this time you should feel very comfortable in realizing that what you call these symbols doesn't make any difference whatsoever as far as encoding sources is concerned. The first thing you can do, if you like, is take whatever names somebody has given to a set of symbols and replace them with your own symbols, and the easiest set of symbols to use is the integers 1 to M. And that's what we will usually do. So, given this set of probabilities, and they have to add up to 1, the Huffman algorithm is this ingenious algorithm, very, very clever, which constructs a prefix-free code of minimum expected length. And the expected length is just defined as the sum over i of p_i times l_i. The trick in the algorithm is to find the set of l_i's that satisfy the Kraft inequality but minimize this expected value. (There's a small code sketch of the Kraft sum and the Huffman construction just below.)

The next thing we started to talk about was a discrete memoryless source. A discrete memoryless source is really a toy source. It's a toy source where you assume that the letters in the sequence are independent and identically distributed. In other words, every letter has the same distribution.
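As a quick aside (not from the lecture): here is a minimal Python sketch of the two ideas just reviewed, the Kraft sum and the Huffman construction. The probabilities below are made up purely for illustration.

```python
import heapq
from math import log2

def kraft_sum(lengths):
    """Sum of 2^(-l_i); a prefix-free code with these lengths exists iff this is <= 1."""
    return sum(2.0 ** -l for l in lengths)

def huffman_lengths(probs):
    """Codeword lengths of an optimal (Huffman) prefix-free code for the pmf `probs`."""
    # Heap entries: (subtree probability, tie-breaker, symbols in that subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # the two least probable subtrees...
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:                 # ...get merged, pushing every leaf one bit deeper
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]              # invented source probabilities for symbols 1..4
lengths = huffman_lengths(probs)
L_bar = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * log2(p) for p in probs)
print("lengths:", lengths, "  Kraft sum:", kraft_sum(lengths))
print(f"expected length L = {L_bar:.3f}   entropy H(X) = {H:.3f}")
```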
Every letter is independent of every other letter. That's more appropriate for a gambling game than it is for real sources. But, on the other hand, by understanding this problem, we're starting to see that we understand the whole problem of source coding. So we'll get on with that as we go.

But, anyway, when we have a discrete memoryless source, what we found -- first we defined the entropy of such a source as H(X), which is the sum over i of minus p_i times the logarithm of p_i. And that was just something that came out of trying to do this optimization not the way that Huffman did it but the way that Shannon did it. Namely, Shannon looked at this optimization in terms of dealing with entropy and things like that; Huffman dealt with it in terms of finding the optimal code. One of the very surprising things is that the way Shannon looked at it, in terms of entropy, is the way that's really valuable, even though it doesn't come up with an optimal code. I mean, here was poor Huffman, who generated this beautiful algorithm, which is extraordinarily simple, which solved what looked like a hard problem. And yes, as far as information theory is concerned, his algorithm gets used, sure. But as far as all the generalizations are concerned, it has almost nothing to do with anything.

But, anyway, when you look at this entropy, what comes out of it is the fact that the entropy of the source is less than or equal to L bar min, the minimum number of bits per source symbol that you can come up with in a prefix-free code, which in turn is less than H(X) plus 1. And the way we did that was just to look at this minimization. By looking at the minimization, we showed the minimum had to be greater than or equal to H(X). And by looking at any code which satisfied the Kraft inequality with some set of lengths -- well, after we looked at this, it said that what we really wanted to do was to make the length of each codeword equal to minus the log of p_i. That's not an integer.
So the thing we did to get this inequality was to say, OK, if it's not an integer, we'll raise it up to the next integer value. And as soon as we do that, the Kraft inequality is satisfied, and you can generate a code with that set of lengths. So that's where you get these two bounds. (There's a small numerical sketch of this ceiling construction below.)

This upper bound here is usually very, very weak. Can anybody suggest the almost unique example where it is almost tight? It's the simplest example you can think of. It's a binary source. And what binary source has the property that this is almost equal to this? Anybody out there?

AUDIENCE: [UNINTELLIGIBLE]

PROFESSOR: Make it almost probability 0 and probability 1. You can't quite do that, because as soon as you make the probability of the 0 equal to 0, then you don't have to represent it. You just know that it's a sequence of all 1's, so you're all done, and you don't need any bits to represent it. But if there's just some very tiny epsilon probability of a 0 and a big probability of a 1, then the entropy is almost equal to 0. And this 1 here is needed because L bar min is 1. You can't make it any smaller than that. So that's where that comes from.

Let's talk about entropy just a little bit. Suppose we have an alphabet of size M -- that's the symbol we'll usually use for the alphabet size of a chance variable X -- and the probability that X equals i is p_i. In other words, again, we're using the convention of calling the symbols the integers 1 to M. Then the entropy is equal to this. And a nice way of representing it is that the entropy is equal to the expected value of minus the logarithm of the probability of the symbol X. Logarithms are always to the base 2 in this course; when we want natural logarithms we'll write ln. So it's the expected value of minus log to the base 2 of p_X(X). We call this the log pmf random variable. We started out with X being a chance variable.
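A small numerical sketch of that ceiling construction (my own example, not from the lecture): take each length to be the ceiling of minus log2 of p_i. Those lengths always satisfy the Kraft inequality, and the expected length lands between H(X) and H(X) + 1. The second call below is the nearly deterministic binary source just discussed, where the +1 really is needed.

```python
from math import ceil, log2

def ceiling_code_check(probs):
    lengths = [ceil(-log2(p)) for p in probs]          # l_i = ceil(-log2 p_i)
    kraft = sum(2.0 ** -l for l in lengths)            # always <= 1 for these lengths
    L_bar = sum(p * l for p, l in zip(probs, lengths))
    H = -sum(p * log2(p) for p in probs)
    print(f"H(X) = {H:.4f}   expected length = {L_bar:.4f}   Kraft sum = {kraft:.4f}")

ceiling_code_check([0.4, 0.3, 0.2, 0.1])   # ordinary case: well inside [H, H+1)
ceiling_code_check([0.999, 0.001])         # nearly deterministic binary source:
                                           # H is almost 0, but the expected length stays near 1
```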
Back to the log pmf: I mean, we happen to have turned X into a random variable because we've given it numbers, but that's irrelevant; we really want to think of it as a chance variable. But now this quantity, minus log p_X(x), is a numerical function of the symbol which comes out of the source, and therefore it is a well-defined random variable. And we call it the log pmf random variable. Some people call it self-information; we'll find out why later. I don't particularly like that word, partly because what we're dealing with here has nothing to do with information. The thought that all of this has something to do with information -- namely, that information theory has something to do with information -- probably held up the field for about five years, because everyone tried to figure out what it had to do with information. And, of course, it had nothing to do with information. It really only had to do with data, and with probabilities of various things in the data. So, anyway, some people call it self-information, and we'll see why later. But this is the quantity we're interested in. And we call it the log pmf; we sort of forget about the minus sign. It's not good to forget about the minus sign, but I always do it, so I sort of expect other people to do it, too.

One of the properties of entropy is that it has to be greater than or equal to 0. Why? Because these probabilities have to be less than or equal to 1, so the logarithm of each probability is less than or equal to 0, and therefore minus the logarithm has to be greater than or equal to 0. This entropy is also less than or equal to log M, log capital M. I'm not going to prove that here; it's proven in the notes, or maybe in one of the problems, I forget -- it's a trivial thing to do. But, anyway, you can do it using the inequality log of x is less than or equal to x minus 1.
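For completeness, a one-line sketch of that argument (my reconstruction, assuming every p_i is strictly positive), using ln x <= x - 1:

\[
H(X) - \log_2 M \;=\; \sum_i p_i \log_2\frac{1}{p_i M}
\;\le\; \frac{1}{\ln 2}\sum_i p_i\Bigl(\frac{1}{p_i M}-1\Bigr)
\;=\; \frac{1}{\ln 2}\Bigl(\sum_i \frac{1}{M} - \sum_i p_i\Bigr) \;=\; 0,
\]

with equality exactly when every p_i equals 1/M, which is the equiprobable case mentioned next.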
Just like all inequalities can be proven with that inequality. So there's equality here if X is equiprobable. Which is pretty clear, because if all of these probabilities are equal to 1 over M, then you're taking the expected value of the logarithm of M, and you get the logarithm of M. So there's nothing very surprising here.

Now, the next thing -- and here's where what we're going to do is on one hand very simple, and on the other hand very confusing. After you get the picture of it, it becomes very simple; before that, it all looks rather strange. If you have two independent chance variables, say X and Y, then the choice of the sample value of the chance variable X and the choice of the sample value of Y together form a pair of sample values, which we can view as one sample value. In other words, we can view XY as a chance variable in its own right. This isn't just the sequence X followed by Y, although you can think of it that way; we want to think of it here as a chance variable itself, which takes on different values. And the values it takes on are pairs of sample values, one from the ensemble of X and one from the ensemble of Y. And XY takes on the sample value little xy with probability p_X(x) times p_Y(y).

As we move on with this course, we'll become much less careful about writing down these subscripts, which refer to random variables, and the arguments, which refer to sample values of those random variables. I want to keep doing it for a while, because most courses in probability, even 6.041, which is the first course in probability, almost deliberately fudge the difference between sample values and random variables. Most people who work with probability do this all the time; you never know when you're talking about a random variable and when you're talking about a sample value of a random variable. And that's convenient for getting insight about things. But you do it for a while and then pretty soon you wonder what the heck is going on.
Because you have no idea anymore what's a random variable and what's a sample value of it. So, the entropy H of the chance variable XY is then the expected value of minus the logarithm of the probability of the chance variable XY. Namely, when you take the expected value, you're taking the expected value of a random variable, and the random variable here is the log pmf of the chance variable XY. This is the expected value of minus the logarithm of p_X(x) times p_Y(y). Since the two are independent of each other, the log of the product splits into a sum, and that gives you H(XY) is equal to H(X) plus H(Y).

This is really the reason why you're interested in these chance variables which are logarithms of probabilities. Because when you have independent chance variables, then you have the situation that probabilities multiply and therefore log probabilities add. All of the major theorems in probability theory -- in particular the law of large numbers, which is the most important result in probability, simple though it might be; that's the key to everything in probability -- talk about sums of random variables and not about products of random variables. So that's why Shannon did everything in terms of these log pmf's. And we will soon be doing everything in terms of log pmf's also.

So now let's get back to discrete memoryless sources. If you now have a block of n chance variables, X1 to Xn, and they're all IID, again we can treat the whole block as one big monster chance variable. If each one of these takes on M possible values, then this monster chance variable can take on M to the n possible values. Namely, each possible string of n symbols, where each symbol is a choice from the integers 1 to M. So we're talking about n-tuples of numbers now. And those are the values that this particular chance variable, X^n, takes on.
So it takes on each of those values with probability p_{X^n}(x^n) equal to the product, from i equals 1 to n, of the individual probabilities p_X(x_i). In other words, again, when you have independent chance variables, the probabilities multiply, which is all I'm saying here. So the chance variable X^n has entropy H(X^n), which is the expected value of minus the logarithm of that probability. Which is the expected value of minus the logarithm of the product of probabilities, which is the expected value of the sum of minus the logs of the probabilities, which is n times H(X). If you compare this with the previous slide, you'll see I haven't said anything new; this argument and that argument are really exactly the same. All I did was do it for two chance variables first, and then observe that it generalizes to an arbitrary number of chance variables. You can say that it generalizes to an infinite number of chance variables also, and in some sense it does, but I would advise you not to go there, because you just get tangled up with a lot of mathematics that you don't need.

So the next thing is fixed-to-variable-length prefix-free codes: how do we form them, and what do we gain by them? The thing we're going to do now, instead of trying to compress one symbol at a time from the source, is to segment the source into blocks of n symbols each. And after we segment it into blocks of n symbols each, we're going to encode each block of n symbols. Now, what's new there? Nothing whatsoever is new. A block of n symbols is just a chance variable. We know how to do optimal encoding of chance variables -- namely, we use the Huffman algorithm -- and you can do that here on these blocks of n. We also have this nice theorem, which says that the entropy of X^n is n times the entropy of X. So, in other words, when you have independent identically distributed chance variables, this entropy is just n times this entropy. (There's a small numerical check of this, and of the per-symbol coding gain that follows, just below.)
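A quick numerical check (an invented three-letter source, not from the lecture). For an IID block, the product pmf has entropy exactly n times H(X); and even the crude code that gives each block the length ceil(-log2 p(x^n)) drives the expected number of bits per source symbol down toward H(X) as n grows, which is the point made next.

```python
from itertools import product
from math import ceil, log2

p = {'a': 0.7, 'b': 0.2, 'c': 0.1}              # made-up single-letter pmf
H = -sum(q * log2(q) for q in p.values())
print(f"H(X) = {H:.4f}")

for n in (1, 2, 4, 6):
    block_probs = []
    for block in product(p, repeat=n):          # all M**n strings of n source letters
        q = 1.0
        for sym in block:
            q *= p[sym]                          # probabilities multiply for IID letters
        block_probs.append(q)
    H_n = -sum(q * log2(q) for q in block_probs)
    L = sum(q * ceil(-log2(q)) for q in block_probs)   # bits per *block* with ceiling lengths
    print(f"n = {n}:  H(X^n)/n = {H_n/n:.4f}   bits per source symbol = {L/n:.4f}")
```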
But the important thing is the result of doing the encoding, which is the same result we had before. Namely, this is what happens when you take an alphabet -- and the alphabet can be anything whatsoever -- and you encode that alphabet in an optimal way, according to the probabilities of each symbol within the alphabet. The result is that the entropy of this big chance variable X^n is less than or equal to the expected length of the code, whatever prefix-free code you have; and when you put the minimum on here, that minimum expected length is less than the entropy of the chance variable X^n plus 1. That's the same theorem that we proved before; there's nothing new here.

Now, if you divide this by n, this by n, and this by n, you still have a valid inequality. When you divide the left side by n, what do you get? You get H(X). When you divide the middle by n, you get, by definition, L bar min, where what we mean now is the number of bits per source symbol: we have n source symbols here, so when we divide by n, we get the number of bits per source symbol in this monster symbol. And when we divide the right side by n, we get H(X) plus 1 over n. And now the whole reason for doing this is that this silly little 1, which we were trying very hard to think of as negligible or unimportant, has suddenly become 1 over n. And by making n big enough, it truly is unimportant.

If you're thinking in terms of encoding a binary source, this 1 here is very important. In other words, when you're trying to encode a binary source, if you're encoding one letter at a time, there's nothing you can do. You're absolutely stuck. Because if both of those letters have non-zero probabilities, and you want a uniquely decodable code, and you want to find codewords for each of those two symbols, the best you can do is an expected length of 1. Namely, you need 1 to encode 1, and you need 0 to encode 0.
And there's nothing else; there's no freedom at all that you have. So you say, OK, in that situation, I really have to go to longer blocks. And when I go to longer blocks, I can get this resolved, and I know how to do it: I use Huffman's algorithm or whatever. So, suddenly, I can start to get the expected number of bits per source symbol down as close to H(X) as I want to make it. And I can't make it any smaller. Which says that H(X) now has a very clear interpretation, at least for prefix-free codes, as the number of bits per source symbol you need when you allow the possibility of encoding the source a block at a time.

We're going to find later today that the significance of this is far greater than that, even. Because why use prefix-free codes? We could use any old kind of code. When we study the Lempel-Ziv codes tomorrow, we'll find out they aren't prefix-free codes at all. They're really variable-length to variable-length codes, so they aren't fixed-to-variable length, and they do some pretty fancy and tricky things. But they're still limited to this same inequality. You can never beat the entropy bound. If you want something to be uniquely decodable, you're stuck with this bound. And we'll see why in a very straightforward way, later. It's a very straightforward way which I can guarantee all of you are going to look at and say, yes, that's obvious. And tomorrow you will look at it and say, I don't understand that at all. And the next day you'll look at it again and say, well, of course. And the day after that you'll say, I don't understand that. And you'll go back and forth like that for about two weeks. Don't be frustrated, because it is simple, but at the same time it's very sophisticated.

So, let's now review the weak law of large numbers. I usually just call it the law of large numbers. I bridle a little bit when people call it weak, because in fact it's the centerpiece of probability theory.
But there is this other thing called the strong law of large numbers, which mathematicians love because it lets them look at all kinds of mathematical minutiae. It's also important; I shouldn't play it down too much, and there are places where you truly need it. For what we'll be doing, we don't need it at all. And the weak law of large numbers does in fact hold in many places where the strong law doesn't hold. So if you know what the strong law is, temporarily forget it for the rest of the term and just focus on the weak law.

And the weak law is not terribly complicated. We have a sequence of random variables, each of them with a mean y bar and a variance sigma sub y squared. And let's assume that they're independent and identically distributed for the time being, just to avoid worrying about anything. If we look at the sum of those random variables, namely A equal to Y1 plus Y2 up to Y sub n, then the expected value of A is the expected value of Y1 plus the expected value of Y2, and so forth. So the expected value of A is n times y bar. And I think in one of the homework problems you found the variance of A. For the variance of A, the easiest thing to do is to reduce each of these variables to its fluctuation around the mean, and then look at the variance of the sum of the fluctuations, which is just the expected value of that sum squared. Because of independence, that's the expected value of the first fluctuation squared plus the expected value of the second fluctuation squared, and so forth. So the variance of A is n times sigma sub y squared. I want all of you to know that. That's sort of day two of a probability course. As soon as you start talking about random variables, that's one of the key things you do, one of the most important things you do.

The thing that we're interested in here is more the sample average of Y1 up to Y sub n. And the sample average, by definition, is the sum divided by n. So, in other words, the thing that you're interested in here is to add all of these random variables up.
Take 1 over n times it. Which is a thing we do all the time. I mean, we sum up a lot of observations, we divide by n, and we hope by doing that to get some sort of typical value. And usually there is some sort of typical value that arises from that. What the law of large numbers says is that there in fact is a typical value that arises. So this sample average is A over n, the sum divided by n, and we call that the sample average. The mean of the sample average is just the mean of A, which is n times y bar, divided by n. So the mean of the sample average is y bar itself.

The variance of the sample average -- I'm talking too fast. The sample average here: you would like to think of it as something which is known and specific, like the expected value. In fact, it is a random variable. It changes with different sample values. It can change from almost nothing to very large quantities. And what we're interested in saying is that most of the time it's close to the expected value. That's what we're aiming at here, and that's what the law of large numbers says.

The variance of the sample average is equal to the variance of A divided by n squared. In other words, we're taking the expected value of this quantity squared, so there's a 1 over n squared that comes in here. When you take the 1 over n squared here, this variance then becomes sigma -- I don't know why I have the n there; just take that n out, if you will, I don't have my red pen with me -- so it's the variance of the random variable Y, divided by n. In other words, the limit as n goes to infinity of the variance of the sum is equal to infinity, and the variance of the sample average as n goes to infinity is equal to 0. And that's because of this 1 over n squared effect here.
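A small simulation of those two statements (the mean, variance, and Gaussian distribution here are invented for illustration): the sample average S keeps its mean at y bar while its variance shrinks like sigma sub y squared over n.

```python
import random
random.seed(1)

ybar, sigma2 = 2.0, 9.0                  # made-up mean and variance of each Y_i
trials = 5000

for n in (10, 100, 1000):
    averages = []
    for _ in range(trials):
        a = sum(random.gauss(ybar, sigma2 ** 0.5) for _ in range(n))   # the sum A
        averages.append(a / n)                                         # the sample average S
    mean_S = sum(averages) / trials
    var_S = sum((s - mean_S) ** 2 for s in averages) / trials
    print(f"n = {n:4d}   mean(S) = {mean_S:.3f}   var(S) = {var_S:.4f}   sigma^2/n = {sigma2 / n:.4f}")
```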
When you add up a lot of independent random variables, what you wind up with is that the sample average has a variance which is going to 0, and the sum has a variance which is going to infinity. That's important. Aside from all of the theorems you've ever heard, this is sort of the gross, simple-minded thing which you always ought to keep foremost in your mind. This is what's happening in probability theory: when you talk about sample averages, this variance is getting small.

Let's look at a picture of this. Let's look at the distribution function of this random variable, the sample average, as a random variable. And what we're finding here is that if we look at this distribution function for some modest value of n, we get something which looks like this upper curve here, which is then the lower curve over here. It's spread out more, so it has a larger variance; namely, the sample average has a larger variance. When you make n bigger, what's happening to the variance? The variance is getting smaller -- in this picture, smaller by a factor of 1/2, so this curve is supposed to have a variance which is equal to 1/2 the variance of that one. How you find a variance in a distribution function is your problem. But you know that if something has a small variance, it's very closely tucked in around the mean. In other words, as the variance goes to 0, and the mean is y bar, you have a distribution function which approaches a unit step. And all of that just comes from this very, very simple argument that says, when you have a sum of IID random variables and you take the sample average of it -- namely, you divide by n -- the variance goes to 0. Which says, no matter how you look at it, you wind up with something that looks like a unit step.

Now, the Chebyshev inequality is one of the simpler inequalities in probability theory, and I won't prove it because it's something you've all seen. I don't know of any course in probability which avoids the Chebyshev inequality.
And what it says is, for any epsilon greater than 0, the probability that the difference between the sample average and the true mean is, in magnitude, greater than or equal to epsilon, is less than or equal to sigma sub y squared divided by n epsilon squared. Oh, incidentally, that thing that was called sigma sub n before was really sigma sub y squared -- namely, the variance of Y. I hope it's right in the notes. Might not be. It is? Good.

So, that's what this inequality says. There's an easy way to derive it on the fly; namely, if you're wondering what all these constants are, here's a way to do it. What we're looking at, in this curve here, is how much probability there is outside of these plus and minus epsilon limits. And the Chebyshev inequality says there can't be too much probability out here, and there can't be too much probability out here. So, one way to get at this is to say, OK, suppose I have some given probability out here and some given probability out here. And suppose I want to minimize the variance of a random variable which has that much probability out here and that much probability out here. How do I do it? Well, the variance deals with the square of how far you are away from the mean. So if I want to have a certain amount of probability out here, I minimize my variance by making this come straight, come up here with a little step, then go across here, go up here, and then -- oops -- go up here. I wish I had my red pencil. Does anybody have a red pen that will write on this stuff? Yes? No? Oh, great. Thank you. I will return it.
So what we want is something which goes over here, comes up here, goes across here, goes up here, goes across here, and goes up again. That's the smallest you can make the variance. It's squeezing everything in as far as it can be squeezed: namely, everything out here gets squeezed in to y bar minus epsilon, everything in the middle gets squeezed in to y bar itself, and everything out here gets squeezed in to y bar plus epsilon. OK, calculate the variance of that, and it satisfies the Chebyshev inequality with equality. So that's all the Chebyshev inequality is. And it's usually a loose inequality, because usually these curves look very nice. Usually this looks like a Gaussian distribution function -- that's what the central limit theorem says -- but we don't need the central limit theorem here, and we don't want it here, because this thing is an inequality that says life can't be any worse than this. All the central limit theorem gives you is an approximation, and then we have to worry about when it's a good approximation and when it's not.

So this says, when we carry it one piece further: for any epsilon and delta greater than 0, if we make n large enough -- in other words, substitute delta for this bound, and when you make n large enough, this quantity is smaller than delta -- then the probability that S and y bar differ by more than epsilon is less than or equal to delta. So it says, you can make this as small as you want, and you can make this as small as you want, and all you need to do is make the sequence long enough. (There's a small sketch of that arithmetic just below.)

Now, the thing which is mystifying about the law of large numbers is that you need both the epsilon and the delta. You can't get rid of either of them. In other words, you can't reduce this to 0, because it won't make any sense. This curve here tends to spread out on its tails, and therefore there's always some probability out there. You can't move epsilon to 0 because, for no finite n, do you really get a step function here.
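A sketch of that epsilon-delta arithmetic (the numbers are invented): Chebyshev gives Pr(|S - y bar| >= eps) <= sigma_y^2 / (n * eps^2), so pushing the right-hand side below delta only requires n at least sigma_y^2 / (delta * eps^2).

```python
from math import ceil

def n_needed(sigma2, eps, delta):
    """Smallest n for which the Chebyshev bound sigma2 / (n * eps**2) drops below delta."""
    return ceil(sigma2 / (delta * eps ** 2))

sigma2 = 1.0                                       # invented variance of each Y_i
for eps, delta in [(0.1, 0.1), (0.1, 0.01), (0.01, 0.01)]:
    print(f"eps = {eps}, delta = {delta}:  n >= {n_needed(sigma2, eps, delta)}")
```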
So, again, you need some wiggle room on both ends. You need wiggle room here, and you need wiggle room here. And you have to recognize that you need those two pieces of wiggle room: namely, you cannot talk about the probability that this is exactly equal to y bar, because that's usually 0, and you cannot talk about reducing this to 0 either. So both of those are needed.

Why did I go through all of this? Well, partly because it's important. But partly because I want to talk about something which is called the asymptotic equipartition property. And because of those long words, you believe this has to be very complicated. I hope to convince you that the asymptotic equipartition property is simply the weak law of large numbers applied to the log pmf. Because that, in fact, is all it is. But it says some unusual and fascinating things.

So let's suppose that X1, X2, and so forth is the output from a discrete memoryless source. Look at the log pmf of each of these. They each have the same distribution, so W(x) is going to be equal to minus the logarithm of p_X(x) for each of these chance variables. W maps source symbols into real numbers. So there's a random variable, capital W of X_j, for each one of the symbols that comes out of the source. So, for each one of these symbols, there's this log pmf random variable, which takes on different values. And the expected value of this log pmf, for the j'th symbol out of the source, is the sum over x of p_X(x) -- namely, the probability that the source takes on the value x -- times minus the logarithm of p_X(x). And that's just the entropy. So the strange thing about this log pmf random variable is that its expected value is the entropy. And W(X1), W(X2), and so forth are a sequence of IID random variables, each of which has a mean which is the entropy. So it's exactly the situation we had before. Instead of y bar, we have the entropy of X.
And instead of the random variable Y_j, we have this random variable W(X_j), which is defined by the symbol in the alphabet. And just to review this -- it's what we said before -- if little x1 is the sample value for the chance variable X1, and if little x2 is the sample value for the chance variable X2, then the outcome for W(X1) plus W(X2) -- it's very hard to keep all these little letters and capital letters straight -- is w(x1) plus w(x2), which is minus the logarithm of the probability of x1, minus the logarithm of the probability of x2, which is minus the logarithm of the product, which is minus the logarithm of the joint probability of x1 and x2, which is the sample value of the random variable W(X1 X2). So the sum here is the random variable corresponding to the log pmf of the joint output x1 x2. So w(x1 x2) is the log pmf of the event that this joint chance variable takes on the value x1 x2. And the random variable W(X1 X2) is the sum of W(X1) and W(X2).

So, what have I done here? I said this at the end of the last slide, and you won't believe me. Again, this is one of those things where tomorrow you won't believe me, and you'll have to go back and look at it. But, anyway, X1 X2 is a chance variable, and probabilities multiply and log pmf's add, which is what we've been saying for a couple of days now.

So, if I look at the sum of n of these random variables, the sum of these log pmf's is minus the logarithm of the probability of the entire sequence. That's just saying the same thing we said before for two random variables. The sample average of the log pmf is the sum of the W's divided by n, which is minus the logarithm of the probability of X^n, divided by n. The weak law of large numbers applies, and the probability that this sample average minus the expected value of W(X) is, in magnitude, greater than or equal to epsilon, is less than or equal to this quantity here.
766 00:42:47,690 --> 00:42:51,610 This quantity is minus the logarithm of the probability 767 00:42:51,610 --> 00:42:57,740 of x sub n, divided by n, minus H of x, greater than or 768 00:42:57,740 --> 00:42:59,340 equal to epsilon. 769 00:43:07,210 --> 00:43:09,450 So this is the thing that we really want. 770 00:43:15,610 --> 00:43:17,470 I'm going to spend a few slides trying to 771 00:43:17,470 --> 00:43:18,620 say what this means. 772 00:43:18,620 --> 00:43:22,170 But let's try to just look at it now, and see 773 00:43:22,170 --> 00:43:24,190 what it must mean. 774 00:43:24,190 --> 00:43:29,350 It says that with high probability, this quantity is 775 00:43:29,350 --> 00:43:32,900 almost the same as this quantity. 776 00:43:32,900 --> 00:43:35,810 It says that with high probability, the thing which 777 00:43:35,810 --> 00:43:42,630 comes out of the source is going to have a probability, a 778 00:43:42,630 --> 00:43:47,930 log probability, divided by n, which is close to the entropy. 779 00:43:47,930 --> 00:43:52,350 It says in some sense that with high probability, the 780 00:43:52,350 --> 00:43:54,240 probability of what comes out of the 781 00:43:54,240 --> 00:43:55,940 source is almost a constant. 782 00:43:59,020 --> 00:44:02,060 And that's amazing. 783 00:44:02,060 --> 00:44:04,200 That's what you'll wake up in the morning and say, I don't 784 00:44:04,200 --> 00:44:05,450 believe that. 785 00:44:07,900 --> 00:44:10,430 But it's true. 786 00:44:10,430 --> 00:44:12,870 But you have to be careful to interpret it right. 787 00:44:15,450 --> 00:44:18,710 So, we're going to define the typical set. 788 00:44:18,710 --> 00:44:22,680 Namely, this is the typical set of x's, which come out of 789 00:44:22,680 --> 00:44:23,630 the source. 790 00:44:23,630 --> 00:44:26,520 Namely, the typical set of blocks of n 791 00:44:26,520 --> 00:44:29,490 symbols out of the source. 792 00:44:29,490 --> 00:44:32,510 And when you talk about a typical set, you want 793 00:44:32,510 --> 00:44:36,180 something which includes most of the probability. 794 00:44:36,180 --> 00:44:40,560 So what I'm going to include in this typical set is all of 795 00:44:40,560 --> 00:44:43,160 these things that we talked about before. 796 00:44:43,160 --> 00:44:46,520 Namely, we showed that the probability that this quantity 797 00:44:46,520 --> 00:44:49,960 is greater than or equal to epsilon is very small. 798 00:44:49,960 --> 00:44:53,980 So, with high probability what comes out of the source 799 00:44:53,980 --> 00:44:57,030 satisfies this inequality here. 800 00:44:57,030 --> 00:45:00,840 So I can write down the distribution function of this 801 00:45:00,840 --> 00:45:02,480 random variable here. 802 00:45:02,480 --> 00:45:09,070 It's just this w -- 803 00:45:12,840 --> 00:45:14,750 this is a random variable w. 804 00:45:14,750 --> 00:45:17,550 I'm looking at the distribution of that random 805 00:45:17,550 --> 00:45:20,170 variable w. 806 00:45:20,170 --> 00:45:25,340 And this quantity in here is the probability of this 807 00:45:25,340 --> 00:45:28,260 typical set. 808 00:45:28,260 --> 00:45:31,090 In other words, when I draw this distribution function for 809 00:45:31,090 --> 00:45:34,820 this combined random variable, I've defined this typical set 810 00:45:34,820 --> 00:45:40,020 to be all the sequences which lie between this point and 811 00:45:40,020 --> 00:45:41,070 this point. 812 00:45:41,070 --> 00:45:43,690 Namely, this is H minus epsilon. 
813 00:45:43,690 --> 00:45:47,360 And this is H plus epsilon, moving H out here. 814 00:45:47,360 --> 00:45:50,580 So these are all the sequences which satisfy 815 00:45:50,580 --> 00:45:52,510 this inequality here. 816 00:45:52,510 --> 00:45:54,550 So that's what I mean by the typical set. 817 00:45:54,550 --> 00:45:59,290 It's all things which are clustered around H in this 818 00:45:59,290 --> 00:46:00,540 distribution function. 819 00:46:03,450 --> 00:46:07,320 And as n approaches infinity, this typical set approaches 820 00:46:07,320 --> 00:46:09,170 probability 1. 821 00:46:09,170 --> 00:46:11,560 In the same way that the law of large numbers 822 00:46:11,560 --> 00:46:12,420 behaves that way. 823 00:46:12,420 --> 00:46:18,090 The probability that x sub n is in this typical set is 824 00:46:18,090 --> 00:46:23,180 greater than or equal to 1 minus sigma squared divided by 825 00:46:23,180 --> 00:46:25,080 n epsilon squared. 826 00:46:30,800 --> 00:46:34,230 Let's try to express that in a bunch of other ways. 827 00:46:40,400 --> 00:46:44,880 If you're getting lost, please ask questions. 828 00:46:44,880 --> 00:46:49,800 But I hope to come back to this in a little bit, after we 829 00:46:49,800 --> 00:46:52,850 finish a little more of the story. 830 00:46:52,850 --> 00:47:03,060 So, another way of expressing this typical set -- let me 831 00:47:03,060 --> 00:47:05,760 look at that as the typical set. 832 00:47:05,760 --> 00:47:10,920 If I take this inequality here and I rewrite this, namely, 833 00:47:10,920 --> 00:47:16,190 this is the set of x's for which this is less than 834 00:47:16,190 --> 00:47:20,970 epsilon plus H of x and greater than 835 00:47:20,970 --> 00:47:23,330 H of x minus epsilon. 836 00:47:23,330 --> 00:47:26,900 So that's what I've expressed here. 837 00:47:26,900 --> 00:47:31,880 It's the set of x's for which n times H of x minus epsilon 838 00:47:31,880 --> 00:47:36,260 is less than this logarithm of probability, which is less than 839 00:47:36,260 --> 00:47:38,830 n times H of x plus epsilon. 840 00:47:38,830 --> 00:47:43,630 Namely, I'm looking at this range of epsilon around H, 841 00:47:43,630 --> 00:47:46,980 which is this and this. 842 00:47:46,980 --> 00:47:50,840 If I write it again, if I exponentiate all of this, it's 843 00:47:50,840 --> 00:47:55,810 the set of x's for which 2 to the minus n, H of x, plus 844 00:47:55,810 --> 00:47:59,740 epsilon, that's this term exponentiated, is less than 845 00:47:59,740 --> 00:48:03,170 this is less than this term exponentiated. 846 00:48:03,170 --> 00:48:05,610 And what's going on here is, I've taken care of 847 00:48:05,610 --> 00:48:08,170 the minus sign also. 848 00:48:08,170 --> 00:48:10,400 And if you can follow that in your head, you're a better 849 00:48:10,400 --> 00:48:12,400 person than I am. 850 00:48:12,400 --> 00:48:14,700 But, anyway, it works. 851 00:48:14,700 --> 00:48:16,790 And if you fiddle around with that, you'll see that that's 852 00:48:16,790 --> 00:48:17,860 what it is. 853 00:48:17,860 --> 00:48:24,300 So this typical set is a bound on the probabilities of all 854 00:48:24,300 --> 00:48:26,090 these typical sequences. 855 00:48:26,090 --> 00:48:31,980 The typical sequences all are enclosed in this range of 856 00:48:31,980 --> 00:48:33,230 probabilities. 857 00:48:35,680 --> 00:48:39,440 So the typical elements are approximately equiprobable, in 858 00:48:39,440 --> 00:48:42,130 this strange sense above. 859 00:48:42,130 --> 00:48:45,100 Why do I say this is a strange sense?
860 00:48:45,100 --> 00:48:49,690 Well, as n gets large, what happens here? 861 00:48:49,690 --> 00:48:52,810 This is 2 to the minus n times H of x. 862 00:48:52,810 --> 00:48:55,360 Which is the important part of this. 863 00:48:55,360 --> 00:48:59,820 This epsilon here is multiplied by n. 864 00:48:59,820 --> 00:49:02,400 And we're trying to say, as n gets very, very big, we can 865 00:49:02,400 --> 00:49:04,700 make epsilon very, very small. 866 00:49:04,700 --> 00:49:09,130 But we really can't make n times epsilon very negligible. 867 00:49:09,130 --> 00:49:12,410 But the point is, the important thing here is, 2 to 868 00:49:12,410 --> 00:49:15,680 the minus n times H of x. 869 00:49:15,680 --> 00:49:19,640 So, in some sense, this is close to 2 to the 870 00:49:19,640 --> 00:49:21,750 minus n H of x. 871 00:49:21,750 --> 00:49:23,140 In what sense is it true? 872 00:49:23,140 --> 00:49:28,140 Well, it's true in that sense. 873 00:49:28,140 --> 00:49:31,980 Where that, in fact, is a valid inequality. 874 00:49:31,980 --> 00:49:35,130 Namely in terms of sample averages, 875 00:49:35,130 --> 00:49:37,160 these things are close. 876 00:49:37,160 --> 00:49:40,210 When I do the exponentiation and get rid of the n and all 877 00:49:40,210 --> 00:49:43,820 that stuff, they aren't very close. 878 00:49:43,820 --> 00:49:48,760 But saying this sort of thing is sort of like saying that 10 879 00:49:48,760 --> 00:49:52,502 to the minus 23 is approximately equal to 10 to 880 00:49:52,502 --> 00:49:54,950 the minus 25. 881 00:49:54,950 --> 00:49:57,060 And they're approximately equal because they're both 882 00:49:57,060 --> 00:49:59,170 very, very small. 883 00:49:59,170 --> 00:50:02,510 And that's the kind of thing that's going on here. 884 00:50:02,510 --> 00:50:05,330 And you're trying to distinguish 10 to the minus 23 885 00:50:05,330 --> 00:50:10,540 and 10 to the minus 25 from 10 to the minus 60th and from 10 886 00:50:10,540 --> 00:50:12,950 to the minus three. 887 00:50:12,950 --> 00:50:16,500 So that's the kind of approximations we're using. 888 00:50:16,500 --> 00:50:20,040 Namely, we're using approximations on a log scale, 889 00:50:20,040 --> 00:50:23,510 instead of approximations of ordinary numbers. 890 00:50:23,510 --> 00:50:27,800 But, still, it's convenient to think of these typical x's, 891 00:50:27,800 --> 00:50:31,270 typical sequences, as being sequences which are 892 00:50:31,270 --> 00:50:33,900 constrained in probability in this way. 893 00:50:33,900 --> 00:50:37,290 And this is the thing which is easy to work with. 894 00:50:37,290 --> 00:50:41,910 The atypical set of strings, namely, the complement of this 895 00:50:41,910 --> 00:50:45,990 set, the thing we know about that is the entire set doesn't 896 00:50:45,990 --> 00:50:48,030 have much probability. 897 00:50:48,030 --> 00:50:53,080 Namely, if you fix epsilon and you let n get bigger and 898 00:50:53,080 --> 00:50:57,860 bigger, this atypical set becomes totally negligible. 899 00:50:57,860 --> 00:50:59,110 And you can ignore it. 900 00:51:02,330 --> 00:51:06,940 So let's plow ahead. 901 00:51:06,940 --> 00:51:12,220 Stop for an example pretty soon, but -- 902 00:51:12,220 --> 00:51:15,830 If I have a sequence which is in the typical set, we then 903 00:51:15,830 --> 00:51:20,400 know that its probability is greater than 2 to the minus n 904 00:51:20,400 --> 00:51:23,520 times H of x plus epsilon. 905 00:51:23,520 --> 00:51:26,150 That's what we said before.
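For reference, the equivalent ways of writing the typical set that were developed above are, in the same notation as the lecture, with $p(x^n)$ the probability of the whole n-tuple:

$$
T_\epsilon^n
= \Bigl\{x^n : \bigl|-\tfrac{1}{n}\log_2 p(x^n) - H(X)\bigr| < \epsilon\Bigr\}
= \bigl\{x^n : n(H(X)-\epsilon) < -\log_2 p(x^n) < n(H(X)+\epsilon)\bigr\}
= \bigl\{x^n : 2^{-n(H(X)+\epsilon)} < p(x^n) < 2^{-n(H(X)-\epsilon)}\bigr\}.
$$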
906 00:51:26,150 --> 00:51:29,330 And, therefore, when I use this inequality, the 907 00:51:29,330 --> 00:51:34,900 probability of x to the n, for something in the typical set, 908 00:51:34,900 --> 00:51:37,940 is greater than this quantity here. 909 00:51:37,940 --> 00:51:47,950 In other words, this is greater than that. 910 00:51:47,950 --> 00:51:50,420 For everything in a typical set. 911 00:51:50,420 --> 00:51:53,640 So now I'm adding over things in a typical set. 912 00:51:53,640 --> 00:51:56,170 So I need to include the number of things 913 00:51:56,170 --> 00:51:57,590 in a typical set. 914 00:51:57,590 --> 00:52:01,190 So what I have is this sum. 915 00:52:01,190 --> 00:52:02,470 And what is this sum? 916 00:52:02,470 --> 00:52:06,000 This is the probability of the typical set. 917 00:52:06,000 --> 00:52:08,960 Because I'm adding over all elements in the typical set. 918 00:52:08,960 --> 00:52:11,880 And it's greater than or equal to the number of elements in a 919 00:52:11,880 --> 00:52:15,660 typical set times these small probabilities. 920 00:52:15,660 --> 00:52:19,230 If I turn this around, it says that the number of elements in 921 00:52:19,230 --> 00:52:22,460 a typical set is less than 2 to the n 922 00:52:22,460 --> 00:52:25,820 times H of x plus epsilon. 923 00:52:25,820 --> 00:52:30,000 For any epsilon, no matter how small I want to make it. 924 00:52:30,000 --> 00:52:33,710 Which says that the elements in a typical set have 925 00:52:33,710 --> 00:52:38,200 probabilities which are about 2 to the minus n times H of x. 926 00:52:38,200 --> 00:52:41,480 And the number of them is approximately 2 to the 927 00:52:41,480 --> 00:52:44,110 n times H of x. 928 00:52:44,110 --> 00:52:47,910 In other words, what it says is that this typical set is a 929 00:52:47,910 --> 00:52:53,900 bunch of essentially uniform probabilities. 930 00:52:53,900 --> 00:52:58,550 So what I've done is to take this very complicated source. 931 00:52:58,550 --> 00:53:05,360 And when I look at these very humongous chance variables, 932 00:53:05,360 --> 00:53:10,670 which are very large sequences out of the source, what I find 933 00:53:10,670 --> 00:53:14,510 is that there's a bunch of things which collectively have 934 00:53:14,510 --> 00:53:16,410 zilch probability. 935 00:53:16,410 --> 00:53:18,980 There's a bunch of other things which all have equal 936 00:53:18,980 --> 00:53:20,090 probability. 937 00:53:20,090 --> 00:53:24,650 And the number of them is enough to add up to 1. 938 00:53:24,650 --> 00:53:28,820 So I have turned this source, when I look at it over a long 939 00:53:28,820 --> 00:53:38,080 enough sequence, into a source of equiprobable events. 940 00:53:38,080 --> 00:53:41,470 And each of those events has this probability here. 941 00:53:41,470 --> 00:53:46,540 Now, we know how to encode equiprobable events. 942 00:53:46,540 --> 00:53:48,140 And that's the whole point of this. 943 00:53:50,770 --> 00:53:55,820 So, this is less than or equal to that. 944 00:53:55,820 --> 00:53:59,000 On the other side, we know that 1 minus delta is less 945 00:53:59,000 --> 00:54:04,970 than or equal to this probability of a typical set. 946 00:54:04,970 --> 00:54:09,590 And this is less than the number of elements in a 947 00:54:09,590 --> 00:54:13,860 typical set times 2 to the minus n times H of x minus epsilon. 948 00:54:13,860 --> 00:54:16,320 This is an upper bound on this. 949 00:54:16,320 --> 00:54:24,240 This is less than this.
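Written out, the two counting bounds being put together here are, in the lecture's notation, with $T_\epsilon^n$ the typical set:

$$1 \;\ge\; \Pr\{T_\epsilon^n\} \;=\; \sum_{x^n \in T_\epsilon^n} p(x^n) \;\ge\; |T_\epsilon^n|\, 2^{-n(H(X)+\epsilon)} \quad\Longrightarrow\quad |T_\epsilon^n| \;\le\; 2^{\,n(H(X)+\epsilon)},$$

$$1-\delta \;\le\; \Pr\{T_\epsilon^n\} \;\le\; |T_\epsilon^n|\, 2^{-n(H(X)-\epsilon)} \quad\Longrightarrow\quad |T_\epsilon^n| \;\ge\; (1-\delta)\, 2^{\,n(H(X)-\epsilon)}.$$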
950 00:54:27,600 --> 00:54:30,570 So I just add all these things up and I get this bound. 951 00:54:30,570 --> 00:54:34,200 So it says, the size of the typical set is greater than 1 952 00:54:34,200 --> 00:54:37,360 minus delta, times this quantity. 953 00:54:37,360 --> 00:54:41,520 In other words, this is a pretty exact sort of thing. 954 00:54:41,520 --> 00:54:44,870 If you don't mind dealing with this 2 to the n 955 00:54:44,870 --> 00:54:47,270 epsilon factor here. 956 00:54:47,270 --> 00:54:50,150 If you agree that that's negligible in some strange 957 00:54:50,150 --> 00:54:53,860 sense, then all of this makes good sense. 958 00:54:53,860 --> 00:54:57,760 And if it is negligible, let me start talking about source 959 00:54:57,760 --> 00:55:01,420 coding, which is why this all works out. 960 00:55:01,420 --> 00:55:05,460 So the summary is that the probability of the complement 961 00:55:05,460 --> 00:55:10,650 of the typical set is essentially 0. 962 00:55:10,650 --> 00:55:14,340 The number of elements in a typical set is approximately 2 963 00:55:14,340 --> 00:55:16,130 to the n times h of x. 964 00:55:16,130 --> 00:55:18,610 I'm getting rid of all the deltas and epsilons here, to 965 00:55:18,610 --> 00:55:22,380 get sort of the broad view of what's important here. 966 00:55:22,380 --> 00:55:25,650 Each of the elements in a typical set has probability 2 967 00:55:25,650 --> 00:55:28,170 to the minus n times H of x. 968 00:55:28,170 --> 00:55:32,175 So I've turned a source into a source of 969 00:55:32,175 --> 00:55:34,230 equiprobable elements. 970 00:55:34,230 --> 00:55:37,070 And there are 2 to the n times h of x of them. 971 00:55:43,100 --> 00:55:46,320 Let's do an example of this. 972 00:55:46,320 --> 00:55:48,890 It's an example that you'll work on more in the homework 973 00:55:48,890 --> 00:55:52,810 and do it a little more cleanly. 974 00:55:52,810 --> 00:55:57,120 Let's look at a binary discrete memoryless source, 975 00:55:57,120 --> 00:56:02,310 where the probability that x is equal to 1 is p, which is 976 00:56:02,310 --> 00:56:03,920 less than 1/2. 977 00:56:03,920 --> 00:56:07,070 And the probability of 0 is greater than 1/2. 978 00:56:07,070 --> 00:56:12,640 So, this is what you get when you have a biased coin. 979 00:56:12,640 --> 00:56:17,420 And the biased coin has a 1 on one side and a 0 980 00:56:17,420 --> 00:56:19,340 on the other side. 981 00:56:19,340 --> 00:56:23,070 And it's more likely to come up 0's than 1's. 982 00:56:23,070 --> 00:56:26,080 I always used to wonder how to make a biased coin. 983 00:56:26,080 --> 00:56:28,240 And I can give you a little experiment which shows you you 984 00:56:28,240 --> 00:56:30,400 can make a biased coin. 985 00:56:30,400 --> 00:56:34,140 I mean, a coin is a little round thing which is flat on 986 00:56:34,140 --> 00:56:35,840 the top and bottom. 987 00:56:35,840 --> 00:56:40,070 Suppose instead of that you make a triangular coin. 988 00:56:40,070 --> 00:56:43,140 And instead of making it flat on top and bottom, you turn it 989 00:56:43,140 --> 00:56:45,800 into a tetrahedron. 990 00:56:45,800 --> 00:56:50,630 So in fact, what this is now is a coin which is built up on 991 00:56:50,630 --> 00:56:54,090 one side into a very massive thing. 992 00:56:54,090 --> 00:56:57,070 And is flat on the other side.
993 00:56:57,070 --> 00:56:59,700 Since it's a tetrahedron and it's an equilateral 994 00:56:59,700 --> 00:57:04,730 tetrahedron, the probability of 1 is going to be 1/4, and 995 00:57:04,730 --> 00:57:07,850 the probability of 0 is going to be 3/4. 996 00:57:07,850 --> 00:57:10,760 So you can make biased coins. 997 00:57:10,760 --> 00:57:12,760 So when you get into coin-tossing games with 998 00:57:12,760 --> 00:57:15,045 people, watch the coin that they're using. 999 00:57:15,045 --> 00:57:19,120 It probably won't be a tetrahedron, but anyway. 1000 00:57:21,820 --> 00:57:28,520 So the entropy here, the log pmf random variable, takes on 1001 00:57:28,520 --> 00:57:32,300 the value of minus log p with probability p. 1002 00:57:32,300 --> 00:57:35,950 And it takes on the value minus log 1 minus p, with 1003 00:57:35,950 --> 00:57:37,490 probability 1 minus p. 1004 00:57:37,490 --> 00:57:40,080 This is a probability of a 1. 1005 00:57:40,080 --> 00:57:42,700 This is a probability of a 0. 1006 00:57:42,700 --> 00:57:46,270 So, the entropy is equal to this. 1007 00:57:46,270 --> 00:57:48,980 Used to be that in information theory courses, people would 1008 00:57:48,980 --> 00:57:52,050 almost memorize what this curve looked like. 1009 00:57:52,050 --> 00:57:53,250 And they'd draw pictures of it. 1010 00:57:53,250 --> 00:57:56,140 There were famous curves of this function, 1011 00:57:56,140 --> 00:57:58,950 which looks like this. 1012 00:58:07,280 --> 00:58:17,620 0, 1, 1. 1013 00:58:17,620 --> 00:58:20,800 Turns out, that's not all that important a distribution. 1014 00:58:20,800 --> 00:58:24,510 It's a nice example to talk about. 1015 00:58:24,510 --> 00:58:28,400 The typical set, t epsilon n, is the set of strings with 1016 00:58:28,400 --> 00:58:34,710 about p n1's and about 1 minus p times n 0's. 1017 00:58:34,710 --> 00:58:38,770 In other words, that's the typical thing to happen. 1018 00:58:38,770 --> 00:58:41,900 And it's the typical thing in terms of this law of large 1019 00:58:41,900 --> 00:58:42,690 numbers here. 1020 00:58:42,690 --> 00:58:46,520 Because you get 1's with probability p. 1021 00:58:46,520 --> 00:58:48,700 And therefore in a long sequence, you're going to get 1022 00:58:48,700 --> 00:58:53,190 about pn 1's and 1 minus p 0's. 1023 00:58:53,190 --> 00:58:58,520 The probability of a typical string is, if you get a string 1024 00:58:58,520 --> 00:59:01,940 with this many 1's and this many 0's, it's 1025 00:59:01,940 --> 00:59:04,500 probability is p. 1026 00:59:04,500 --> 00:59:08,280 Namely, the probability of a 1 times the number of 1's you 1027 00:59:08,280 --> 00:59:10,610 get, which is pn. 1028 00:59:10,610 --> 00:59:13,420 Times the probability of a 0, times the 1029 00:59:13,420 --> 00:59:16,210 number of 0's you get. 1030 00:59:16,210 --> 00:59:19,170 And if you look at what this is, if you take p up in the 1031 00:59:19,170 --> 00:59:22,850 exponent and 1 minus the p up in the exponent, this becomes 1032 00:59:22,850 --> 00:59:27,700 2 to the minus n times h of x, just like what it should be. 1033 00:59:27,700 --> 00:59:31,780 So these typical strings, with about pn 1's and 1 minus pn 1034 00:59:31,780 --> 00:59:34,720 0's, are in fact typical in the sense we've 1035 00:59:34,720 --> 00:59:36,560 been talking about. 1036 00:59:36,560 --> 00:59:43,100 The number of n strings with pn 1's is n factorial divided 1037 00:59:43,100 --> 00:59:47,760 by pn factorial divided by n times 1 minus p factorial. 
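In symbols, the per-string probability and the count just described are

$$p^{pn}(1-p)^{(1-p)n} = 2^{-nH(X)}, \qquad \#\{\text{strings with } pn \text{ 1's}\} = \binom{n}{pn} = \frac{n!}{(pn)!\,\bigl((1-p)n\bigr)!},$$

where $H(X) = -p\log_2 p - (1-p)\log_2(1-p)$ is the binary entropy just written out, and, to first order in the exponent, $\binom{n}{pn}$ is about $2^{nH(X)}$, consistent with the general statement that the typical set has roughly $2^{nH(X)}$ elements.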
1038 00:59:52,070 --> 00:59:54,960 I mean I hope you learned that a long time ago, but you 1039 00:59:54,960 --> 00:59:56,910 should learn it in probability anyway. 1040 00:59:56,910 --> 01:00:01,260 It's just very simple combinatorics. 1041 01:00:01,260 --> 01:00:04,270 So you have that many different strings. 1042 01:00:04,270 --> 01:00:07,430 So what I'm trying to get across here is, there are a 1043 01:00:07,430 --> 01:00:10,580 bunch of different things going on here. 1044 01:00:10,580 --> 01:00:13,600 We can talk about the random variable which is the number 1045 01:00:13,600 --> 01:00:16,990 of 1's that occur in this long sequence. 1046 01:00:16,990 --> 01:00:20,460 And with high probability, the number of 1's that occur is 1047 01:00:20,460 --> 01:00:22,970 close to pn. 1048 01:00:22,970 --> 01:00:26,470 But if pn 1's occur, there's still an awful lot of 1049 01:00:26,470 --> 01:00:28,400 randomness left. 1050 01:00:28,400 --> 01:00:33,310 Because we have to worry about where those pn 1's appear. 1051 01:00:33,310 --> 01:00:36,140 And those are the sequences we're talking about. 1052 01:00:36,140 --> 01:00:41,520 So, there are this many sequences, all of which have 1053 01:00:41,520 --> 01:00:44,940 that many 1's in them. 1054 01:00:44,940 --> 01:00:48,850 And there's a similar number of sequences for all similar 1055 01:00:48,850 --> 01:00:50,160 numbers of 1's. 1056 01:00:50,160 --> 01:00:54,510 Namely, if you take pn plus 1 and pn plus 2, pn minus 1, pn 1057 01:00:54,510 --> 01:00:57,780 minus 2, you get similar numbers here. 1058 01:00:57,780 --> 01:01:00,890 So those are the typical sequences. 1059 01:01:00,890 --> 01:01:03,980 Now, the important thing to observe here is that you 1060 01:01:03,980 --> 01:01:08,890 really have 2 to the n binary strings altogether. 1061 01:01:08,890 --> 01:01:13,270 And what this result is saying is that collectively those 1062 01:01:13,270 --> 01:01:14,490 don't make any difference. 1063 01:01:14,490 --> 01:01:17,820 The law of large numbers says, OK, there's just a humongous 1064 01:01:17,820 --> 01:01:20,080 number of strings. 1065 01:01:20,080 --> 01:01:23,780 You get the largest number strings which have about half 1066 01:01:23,780 --> 01:01:25,510 1's and half 0's. 1067 01:01:25,510 --> 01:01:29,100 But their probability is zilch. 1068 01:01:29,100 --> 01:01:32,540 So the thing which is probable is getting pn 1's 1069 01:01:32,540 --> 01:01:34,750 and 1 minus pn 0's. 1070 01:01:34,750 --> 01:01:37,290 Now, we have this typical set. 1071 01:01:37,290 --> 01:01:41,410 What is the most likely sequence of all, in this 1072 01:01:41,410 --> 01:01:42,660 experiment? 1073 01:01:45,450 --> 01:01:48,130 How do I maximize the probability of 1074 01:01:48,130 --> 01:01:49,620 a particular sequence? 1075 01:01:49,620 --> 01:02:03,910 The probability of the sequence is p to the i times 1 1076 01:02:03,910 --> 01:02:07,420 minus p to the n minus i. 1077 01:02:07,420 --> 01:02:11,050 And 1 minus p is the probability of 0. 1078 01:02:11,050 --> 01:02:14,240 And p is the probability of a 1. 1079 01:02:14,240 --> 01:02:15,970 How do I choose i to maximize this? 1080 01:02:15,970 --> 01:02:16,300 Yeah? 1081 01:02:16,300 --> 01:02:18,150 AUDIENCE: [UNINTELLIGIBLE] all 0's. 1082 01:02:18,150 --> 01:02:19,540 PROFESSOR: You make them all 0's. 1083 01:02:19,540 --> 01:02:23,750 So the most likely sequence is all 0's. 1084 01:02:23,750 --> 01:02:25,860 But that's not a typical sequence. 1085 01:02:29,700 --> 01:02:33,290 Why isn't it a typical sequence? 
1086 01:02:33,290 --> 01:02:36,060 Because we chose to define typical sequence in a 1087 01:02:36,060 --> 01:02:37,880 different way. 1088 01:02:37,880 --> 01:02:41,180 Namely, there is only one of those, and there are only n of them 1089 01:02:41,180 --> 01:02:43,650 with only a single 1. 1090 01:02:43,650 --> 01:02:46,920 So, in other words, what's going on is that we have an 1091 01:02:46,920 --> 01:02:49,640 enormous number of sequences which have around half 1092 01:02:49,640 --> 01:02:50,890 1's and half 0's. 1093 01:02:53,430 --> 01:02:55,240 But they don't have any probability. 1094 01:02:55,240 --> 01:02:57,840 And collectively they don't have any probability. 1095 01:02:57,840 --> 01:03:01,380 We have a very small number of sequences which have a very 1096 01:03:01,380 --> 01:03:03,750 large number of 0's. 1097 01:03:03,750 --> 01:03:07,960 But there aren't enough of those to make any difference. 1098 01:03:07,960 --> 01:03:10,750 And, therefore, the things that make a difference are 1099 01:03:10,750 --> 01:03:14,710 these typical things which have about pn 1's 1100 01:03:14,710 --> 01:03:18,270 and 1 minus p times n 0's. 1101 01:03:18,270 --> 01:03:20,680 And that all sounds very strange. 1102 01:03:20,680 --> 01:03:22,800 But if I phrase this a different way, you would all 1103 01:03:22,800 --> 01:03:27,470 say that's exactly the way you ought to do things. 1104 01:03:27,470 --> 01:03:32,210 Because, in fact, when we look at very, very long sequences, 1105 01:03:32,210 --> 01:03:35,175 you know with extraordinarily high probability what's going 1106 01:03:35,175 --> 01:03:39,050 to come out of the source is something with about pn 1's 1107 01:03:39,050 --> 01:03:42,430 and about 1 minus p times n 0's. 1108 01:03:42,430 --> 01:03:46,410 So that's the likely set of things to have happen. 1109 01:03:46,410 --> 01:03:47,590 And it's just that there are an enormous 1110 01:03:47,590 --> 01:03:49,200 number of those things. 1111 01:03:49,200 --> 01:03:51,890 There are this many of them. 1112 01:03:51,890 --> 01:03:56,150 So, here what we're dealing with is a balance between the 1113 01:03:56,150 --> 01:04:01,090 number of elements of a particular type, and the 1114 01:04:01,090 --> 01:04:03,520 probability of them. 1115 01:04:03,520 --> 01:04:07,030 And it turns out that this number and its probability 1116 01:04:07,030 --> 01:04:10,650 balance out to say that usually what you get is about 1117 01:04:10,650 --> 01:04:13,780 pn 1's and 1 minus p times n 0's. 1118 01:04:13,780 --> 01:04:16,730 Which is what the law of large numbers said to begin with. 1119 01:04:16,730 --> 01:04:20,300 All we're doing is interpreting that here. 1120 01:04:20,300 --> 01:04:25,210 But the thing that you see from this example is, all of 1121 01:04:25,210 --> 01:04:28,680 these things with exactly pn 1's in them, assuming that pn 1122 01:04:28,680 --> 01:04:31,270 is an integer, are all equiprobable. 1123 01:04:31,270 --> 01:04:34,940 They're all exactly equiprobable. 1124 01:04:34,940 --> 01:04:37,990 So what we're doing when we're talking about this typical 1125 01:04:37,990 --> 01:04:42,140 set, is first throwing out all the things which have too many 1126 01:04:42,140 --> 01:04:44,570 1's or too few 1's in them. 1127 01:04:44,570 --> 01:04:48,560 We're keeping only the ones which are typical in the sense 1128 01:04:48,560 --> 01:04:50,920 that they obey the law of large numbers.
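A small numeric check may help here. This sketch is not from the lecture; n = 100 and p = 0.25 are made-up values. It compares the single most likely string (all 0's) with the strings that have exactly pn 1's:

```python
from math import comb, log2

# Made-up numbers for the biased-coin source: n and p are chosen only for illustration.
n, p = 100, 0.25
H = -p * log2(p) - (1 - p) * log2(1 - p)       # binary entropy, about 0.811 bits/symbol

k = round(p * n)                               # the "typical" number of 1's, here 25
count_typical = comb(n, k)                     # number of strings with exactly pn 1's
prob_each = p**k * (1 - p)**(n - k)            # probability of each such string, here exactly 2**(-n*H)
prob_all_zeros = (1 - p)**n                    # the single most likely string

print("n*H                  =", round(n * H, 1))
print("log2(count_typical)  =", round(log2(count_typical), 1))
print("log2(prob_each)      =", round(log2(prob_each), 1))       # about -n*H
print("log2(prob_all_zeros) =", round(log2(prob_all_zeros), 1))  # larger, but there is only one such string
print("P(exactly p*n 1's)   =", round(count_typical * prob_each, 3))
```

Strings with close to pn 1's collectively carry nearly all of the probability, while the all-0's string, though individually the most likely, contributes essentially nothing; that is the balance being described above.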
1129 01:04:50,920 --> 01:04:54,100 And in this case, they obey the law of large numbers for 1130 01:04:54,100 --> 01:04:56,730 log pmf's also. 1131 01:04:56,730 --> 01:05:01,770 And then all of those things are about equally probable. 1132 01:05:01,770 --> 01:05:05,460 So the idea in source coding is, one of the ways to deal 1133 01:05:05,460 --> 01:05:10,430 with source coding is, you want to assign codewords to 1134 01:05:10,430 --> 01:05:13,570 only these typical things. 1135 01:05:13,570 --> 01:05:16,240 Now, maybe you might want to assign codewords to something 1136 01:05:16,240 --> 01:05:17,870 like all 0's also. 1137 01:05:17,870 --> 01:05:20,570 Because it hardly costs anything. 1138 01:05:20,570 --> 01:05:23,810 And a Huffman code would certainly do that. 1139 01:05:23,810 --> 01:05:27,310 But it's not very important whether you do or not. 1140 01:05:27,310 --> 01:05:30,300 The important thing is, you assign codewords to all of 1141 01:05:30,300 --> 01:05:31,910 these typical sequences. 1142 01:05:37,770 --> 01:05:41,280 So let's go back to fixed-to-fixed 1143 01:05:41,280 --> 01:05:42,660 length source codes. 1144 01:05:42,660 --> 01:05:45,500 We talked a little bit about fixed-to-fixed length source 1145 01:05:45,500 --> 01:05:46,940 codes before. 1146 01:05:46,940 --> 01:05:48,980 Do you remember what we did with fixed-to-fixed length 1147 01:05:48,980 --> 01:05:50,720 source codes before? 1148 01:05:50,720 --> 01:05:53,520 We said we have an alphabet of size m. 1149 01:05:53,520 --> 01:05:56,250 We want something which is uniquely decodable. 1150 01:05:56,250 --> 01:05:59,020 And since we want something which is uniquely decodable, 1151 01:05:59,020 --> 01:06:02,510 we have to provide codewords for everything. 1152 01:06:02,510 --> 01:06:07,780 And, therefore, if we want to choose a block length of n, 1153 01:06:07,780 --> 01:06:11,730 we've got to generate m to the n codewords. 1154 01:06:11,730 --> 01:06:14,700 Here we say, wow, maybe we don't have to provide 1155 01:06:14,700 --> 01:06:17,250 codewords for everything. 1156 01:06:17,250 --> 01:06:20,520 Maybe we're willing to tolerate a certain small 1157 01:06:20,520 --> 01:06:23,070 probability that the whole thing fails and 1158 01:06:23,070 --> 01:06:24,320 falls on its face. 1159 01:06:27,040 --> 01:06:30,280 Now, does that make any sense? 1160 01:06:30,280 --> 01:06:32,330 Well, view things the following way. 1161 01:06:32,330 --> 01:06:36,090 We said, when we started out all of this, that we were 1162 01:06:36,090 --> 01:06:38,880 going to look at prefix-free codes. 1163 01:06:38,880 --> 01:06:42,640 Where some codewords had a longer length and some 1164 01:06:42,640 --> 01:06:44,730 codewords had a shorter length. 1165 01:06:44,730 --> 01:06:48,040 And we were thinking of encoding either single letters 1166 01:06:48,040 --> 01:06:52,340 at a time, or a small block of letters at a time. 1167 01:06:52,340 --> 01:06:55,960 So think of encoding, say, 10 letters at a time. 1168 01:06:55,960 --> 01:07:02,250 And think of doing this for 10 to the 20th letters. 1169 01:07:02,250 --> 01:07:05,740 So you have the source here which is pumping out letters 1170 01:07:05,740 --> 01:07:08,280 at a regular rate. 1171 01:07:08,280 --> 01:07:12,540 You're blocking them into n letters at a time. 1172 01:07:12,540 --> 01:07:15,540 You're encoding in a prefix-free code. 1173 01:07:15,540 --> 01:07:17,790 Out comes something. 1174 01:07:17,790 --> 01:07:22,560 What comes out is not coming out at a regular rate.
1175 01:07:22,560 --> 01:07:25,670 What is coming out, sometimes you get a lot of bits out. 1176 01:07:25,670 --> 01:07:28,450 Sometimes a small number of bits out. 1177 01:07:28,450 --> 01:07:30,730 So, in other words, if you want to send things over a 1178 01:07:30,730 --> 01:07:34,970 channel, you need a buffer there to save things. 1179 01:07:34,970 --> 01:07:39,000 If, in fact, we decide that the expected number of bits 1180 01:07:39,000 --> 01:07:43,960 per source letter is, say, five bits per source letter, 1181 01:07:43,960 --> 01:07:48,540 then we expect over a very long time to be producing five 1182 01:07:48,540 --> 01:07:50,830 bits per source letter. 1183 01:07:50,830 --> 01:07:54,460 And if we turn our channel on for one year, to transmit all 1184 01:07:54,460 --> 01:07:59,010 of these things, what's going to happen is this very 1185 01:07:59,010 --> 01:08:02,080 unlikely sequence occurs. 1186 01:08:02,080 --> 01:08:05,910 Which in fact requires not one year to transmit, but two 1187 01:08:05,910 --> 01:08:09,520 years to transmit. 1188 01:08:09,520 --> 01:08:13,150 In fact, what do we do if it takes one year and five 1189 01:08:13,150 --> 01:08:18,140 minutes to transmit instead of one year? 1190 01:08:18,140 --> 01:08:19,050 Well, we've got a failure. 1191 01:08:19,050 --> 01:08:22,520 Somehow or other, the network is going to fail us. 1192 01:08:22,520 --> 01:08:25,350 I mean we all know that networks fail all the time 1193 01:08:25,350 --> 01:08:28,530 despite what engineers say. 1194 01:08:28,530 --> 01:08:32,120 I mean, all of us who use networks know that they do 1195 01:08:32,120 --> 01:08:33,820 crazy things. 1196 01:08:33,820 --> 01:08:36,590 And one of those crazy things is that unusual things 1197 01:08:36,590 --> 01:08:38,270 sometimes happen. 1198 01:08:38,270 --> 01:08:42,640 So, we develop this very nice theory of prefix-free codes. 1199 01:08:42,640 --> 01:08:46,580 But prefix-free codes, in fact, fail also. 1200 01:08:46,580 --> 01:08:50,880 And they fail also because buffers overflow. 1201 01:08:50,880 --> 01:08:54,160 In other words, we are counting on encoding things 1202 01:08:54,160 --> 01:08:58,020 with a certain number of bits per source symbol. 1203 01:08:58,020 --> 01:09:00,770 And if these unusual things occur, and we have too many 1204 01:09:00,770 --> 01:09:04,780 bits per source symbol, then we fail. 1205 01:09:04,780 --> 01:09:08,960 So the idea that we're trying to get at now is that 1206 01:09:08,960 --> 01:09:13,560 prefix-free codes and fixed-to-fixed length source 1207 01:09:13,560 --> 01:09:16,640 codes which only encode typical things. 1208 01:09:16,640 --> 01:09:20,710 In fact, are sort of the same if you look at them over a 1209 01:09:20,710 --> 01:09:22,860 very, very large sequence length. 1210 01:09:22,860 --> 01:09:26,980 In other words, if you look at a prefix-free code which is 1211 01:09:26,980 --> 01:09:31,190 dealing with blocks of 10 letters, and you look at a 1212 01:09:31,190 --> 01:09:34,120 fixed-to-fixed length code which is only dealing with 1213 01:09:34,120 --> 01:09:39,320 typical things but is looking at a length of 10 to the 20th, 1214 01:09:39,320 --> 01:09:43,570 then over that length of 10 to the 20th, your variable length 1215 01:09:43,570 --> 01:09:47,020 code is going to have a bunch of things which are about the 1216 01:09:47,020 --> 01:09:48,630 length they ought to be. 1217 01:09:48,630 --> 01:09:50,970 And a bunch of other things which are 1218 01:09:50,970 --> 01:09:53,090 extraordinarily long. 
1219 01:09:53,090 --> 01:09:56,360 The bunch of things which are extraordinarily long are 1220 01:09:56,360 --> 01:09:59,910 extraordinarily unpopular, but there are an extraordinarily 1221 01:09:59,910 --> 01:10:02,020 large number of them. 1222 01:10:02,020 --> 01:10:05,760 Just like with a fixed-to-fixed length code, 1223 01:10:05,760 --> 01:10:07,700 you are going to fail. 1224 01:10:07,700 --> 01:10:10,200 And you're going to fail on an extraordinary number of 1225 01:10:10,200 --> 01:10:12,500 different sequences. 1226 01:10:12,500 --> 01:10:15,290 But, collectively, that set of sequences don't have any 1227 01:10:15,290 --> 01:10:17,850 probability. 1228 01:10:17,850 --> 01:10:20,720 So the point that I'm trying to get across is that, really, 1229 01:10:20,720 --> 01:10:24,020 these two situations come together when we look very 1230 01:10:24,020 --> 01:10:25,630 long lengths. 1231 01:10:25,630 --> 01:10:30,030 Namely, prefix-free codes are just a way of generating codes 1232 01:10:30,030 --> 01:10:33,260 that work for typical sequences and over a very 1233 01:10:33,260 --> 01:10:37,390 large, long period of time, will generate about the right 1234 01:10:37,390 --> 01:10:40,550 number of symbols. 1235 01:10:40,550 --> 01:10:42,420 And that's what I'm trying to get at here. 1236 01:10:42,420 --> 01:10:45,980 Or what I'm trying to get at in the next slide. 1237 01:10:45,980 --> 01:10:50,650 So the fixed-to-fixed length source code, I'm going to pick 1238 01:10:50,650 --> 01:10:52,860 some epsilon and some delta. 1239 01:10:52,860 --> 01:10:55,770 Namely, that epsilon and delta which appeared in the law of 1240 01:10:55,770 --> 01:10:58,280 large numbers. 1241 01:10:58,280 --> 01:11:01,400 I'm going to make n as big as I have to make it for that 1242 01:11:01,400 --> 01:11:03,220 epsilon and that delta. 1243 01:11:03,220 --> 01:11:07,120 And calculate how large it has to be, but we won't. 1244 01:11:07,120 --> 01:11:12,150 Then I'm going to assign fixed length codewords to each 1245 01:11:12,150 --> 01:11:15,390 sequence in the typical set. 1246 01:11:15,390 --> 01:11:16,490 Now, am I going to really build 1247 01:11:16,490 --> 01:11:18,410 something which does this? 1248 01:11:18,410 --> 01:11:20,210 Of course not. 1249 01:11:20,210 --> 01:11:23,140 I mean, I'm talking about truly humongous lengths. 1250 01:11:23,140 --> 01:11:25,620 So, this is really a conceptual tool to understand 1251 01:11:25,620 --> 01:11:27,070 what's going on. 1252 01:11:27,070 --> 01:11:30,100 It's not something we're going to implement. 1253 01:11:30,100 --> 01:11:32,490 So I'm going to assign codewords to all 1254 01:11:32,490 --> 01:11:34,910 these typical elements. 1255 01:11:34,910 --> 01:11:40,900 And then what I find is that since the typical set, since 1256 01:11:40,900 --> 01:11:44,730 the number of elements in it is less than 2 to the n times 1257 01:11:44,730 --> 01:11:51,200 H of x plus epsilon, if I choose L bar, namely, the 1258 01:11:51,200 --> 01:11:56,980 number of bits I'm going to use for encoding these things, 1259 01:11:56,980 --> 01:12:00,470 it's going to have to be H of x plus epsilon in length. 1260 01:12:00,470 --> 01:12:02,190 Because I need to provide codewords for 1261 01:12:02,190 --> 01:12:05,600 each of these things. 1262 01:12:05,600 --> 01:12:08,930 And it needs to be an extra 1 over n because of this integer 1263 01:12:08,930 --> 01:12:11,460 constraint that we've been dealing with all along, which 1264 01:12:11,460 --> 01:12:14,120 doesn't make any difference. 
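As a toy illustration of this kind of encoder, here is a minimal sketch. It is not something from the lecture; the alphabet, probabilities, n, and epsilon below are made up, and the all-zeros codeword is reserved to signal failure on atypical n-tuples:

```python
import itertools, math

# Hypothetical 3-letter discrete memoryless source, purely for illustration.
probs = {'a': 0.7, 'b': 0.2, 'c': 0.1}
H = -sum(q * math.log2(q) for q in probs.values())
n, eps = 12, 0.15

def is_typical(seq):
    # epsilon-typical test: | -(1/n) log2 p(seq) - H | < eps
    logp = sum(math.log2(probs[s]) for s in seq)
    return abs(-logp / n - H) < eps

typical = [seq for seq in itertools.product(probs, repeat=n) if is_typical(seq)]
L = math.ceil(math.log2(len(typical) + 1))      # fixed codeword length in bits; index 0 is reserved for failure
index = {seq: i + 1 for i, seq in enumerate(typical)}

def encode(seq):
    return format(index.get(seq, 0), f'0{L}b')  # every atypical n-tuple maps to the same failure codeword

print(round(L / n, 3), "bits per source symbol, versus entropy", round(H, 3))
```

With a toy n like this, L/n is still well above H(X); the point of the slide is that as n grows, the rate can be brought down to about H(X) plus epsilon plus 1/n while the failure probability stays below delta.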
1265 01:12:14,120 --> 01:12:17,830 So if I choose L bar, that big, in other words, if I make 1266 01:12:17,830 --> 01:12:21,670 it just a little bit bigger than the entropy, the 1267 01:12:21,670 --> 01:12:23,790 probability of failure is going to be less 1268 01:12:23,790 --> 01:12:25,640 than or equal to delta. 1269 01:12:25,640 --> 01:12:27,910 And I can make delta -- and I can make the probability of 1270 01:12:27,910 --> 01:12:30,110 failure as small as I want. 1271 01:12:30,110 --> 01:12:32,960 So I can make this epsilon here which is the extra bits 1272 01:12:32,960 --> 01:12:36,710 per source symbol as small as I want. 1273 01:12:36,710 --> 01:12:39,790 So it says I can come as close to the entropy bound in doing 1274 01:12:39,790 --> 01:12:43,350 this, and come as close to unique decodability as I want 1275 01:12:43,350 --> 01:12:45,140 in doing this. 1276 01:12:45,140 --> 01:12:48,720 And I have a fixed-to-fixed length code, which after one 1277 01:12:48,720 --> 01:12:50,880 year is going to stop. 1278 01:12:50,880 --> 01:12:53,730 And I can turn my decoder off. 1279 01:12:53,730 --> 01:12:55,950 I can turn my encoder off. 1280 01:12:55,950 --> 01:12:59,160 I can go buy a new encoder and a new decoder, which 1281 01:12:59,160 --> 01:13:01,770 presumably works a little bit better. 1282 01:13:01,770 --> 01:13:04,150 And there isn't any problem about when to turn it off. 1283 01:13:04,150 --> 01:13:05,730 Because I know I can turn it off. 1284 01:13:05,730 --> 01:13:09,630 Because everything will have come in by then. 1285 01:13:09,630 --> 01:13:12,420 Here's a more interesting story. 1286 01:13:12,420 --> 01:13:18,250 Suppose I choose the number of bits per source symbol that 1287 01:13:18,250 --> 01:13:23,390 I'm going to use to be less than or equal to the entropy 1288 01:13:23,390 --> 01:13:24,420 minus 2 epsilon. 1289 01:13:24,420 --> 01:13:25,670 Why 2 epsilon? 1290 01:13:25,670 --> 01:13:29,110 Well, just wait a second. 1291 01:13:29,110 --> 01:13:31,830 I mean, 2 epsilon is small and epsilon is small. 1292 01:13:31,830 --> 01:13:34,145 But I want to compare with this other epsilon and my law 1293 01:13:34,145 --> 01:13:35,590 of large numbers. 1294 01:13:35,590 --> 01:13:39,430 And I'm going to pick n large enough. 1295 01:13:39,430 --> 01:13:43,480 The number of typical sequences, we said before, was 1296 01:13:43,480 --> 01:13:48,300 greater than 1 minus delta times 2 to the n times h of x 1297 01:13:48,300 --> 01:13:48,950 minus epsilon. 1298 01:13:48,950 --> 01:13:52,430 I'm going to make this epsilon the same as that epsilon, 1299 01:13:52,430 --> 01:13:54,170 which is why I wanted this to be 2 epsilon. 1300 01:13:56,700 --> 01:14:01,680 So my typical set is this big when I choose n large enough. 1301 01:14:01,680 --> 01:14:04,890 And this says that most of the typical set 1302 01:14:04,890 --> 01:14:07,440 can't be assigned codewords. 1303 01:14:07,440 --> 01:14:15,510 In other words, this number here is humongously larger 1304 01:14:15,510 --> 01:14:35,870 then 2 to the l bar, which is in the order of 2 to the nh of 1305 01:14:35,870 --> 01:14:42,200 x minus 2 epsilon n. 1306 01:14:42,200 --> 01:14:45,660 So the fraction of typical elements that I can provide 1307 01:14:45,660 --> 01:14:52,040 codewords for, between this and this, I can only provide 1308 01:14:52,040 --> 01:14:54,660 codewords for a fraction 2 to the minus 1309 01:14:54,660 --> 01:14:58,670 epsilon n of the codewords. 
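In symbols, with L bar the number of bits per source symbol, the comparison being set up here is

$$|T_\epsilon^n| \;\ge\; (1-\delta)\,2^{\,n(H(X)-\epsilon)}, \qquad 2^{\,n\bar L} \;\le\; 2^{\,n(H(X)-2\epsilon)}, \qquad \frac{2^{\,n\bar L}}{|T_\epsilon^n|} \;\le\; \frac{2^{-n\epsilon}}{1-\delta},$$

so only about a fraction $2^{-n\epsilon}$ of the typical sequences can be given codewords.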
1310 01:14:58,670 --> 01:15:01,770 We have this big sea of codewords, which are all 1311 01:15:01,770 --> 01:15:04,200 essentially equally likely. 1312 01:15:04,200 --> 01:15:07,230 And I can't provide codewords for even a 1313 01:15:07,230 --> 01:15:09,860 small fraction of them. 1314 01:15:09,860 --> 01:15:13,130 So the probability of failure is going to be 1 minus delta. 1315 01:15:13,130 --> 01:15:15,460 The 1 minus delta's the probability that I get 1316 01:15:15,460 --> 01:15:17,950 something atypical. 1317 01:15:17,950 --> 01:15:24,190 Plus, well, minus in this case, 2 to the minus epsilon 1318 01:15:24,190 --> 01:15:28,280 n, which is the probability that I can't encode a typical 1319 01:15:28,280 --> 01:15:30,670 codeword that comes out. 1320 01:15:30,670 --> 01:15:34,550 And this quantity goes to 1. 1321 01:15:34,550 --> 01:15:37,995 So this says that if I'm willing to use a number of 1322 01:15:37,995 --> 01:15:42,690 bits bigger than the entropy, I can succeed with probability 1323 01:15:42,690 --> 01:15:45,010 very close to 1. 1324 01:15:45,010 --> 01:15:48,150 And if I want to use a smaller number of bits, I fail with 1325 01:15:48,150 --> 01:15:49,400 probability 1. 1326 01:15:52,810 --> 01:15:56,320 Which is the same as saying that if I'm using a prefix-free 1327 01:15:56,320 --> 01:16:01,950 code, I'm going to run out of buffer space eventually if I 1328 01:16:01,950 --> 01:16:05,730 run long enough. 1329 01:16:05,730 --> 01:16:11,650 If I have something that I'm encoding -- 1330 01:16:11,650 --> 01:16:13,980 well, just erase that. 1331 01:16:13,980 --> 01:16:15,570 I'll say it more carefully later. 1332 01:16:18,150 --> 01:16:22,210 I do want to talk a little bit about this Kraft inequality 1333 01:16:22,210 --> 01:16:23,610 for unique decodability. 1334 01:16:23,610 --> 01:16:26,780 You remember we proved the Kraft inequality for 1335 01:16:26,780 --> 01:16:29,460 prefix-free codes. 1336 01:16:29,460 --> 01:16:32,930 I now want to talk about the Kraft inequality for uniquely 1337 01:16:32,930 --> 01:16:36,060 decodable codes. 1338 01:16:36,060 --> 01:16:39,330 And you might think that I've done all of this development 1339 01:16:39,330 --> 01:16:45,990 of the AEP, the asymptotic equipartition property. 1340 01:16:45,990 --> 01:16:49,560 Incidentally, you now know where those words come from. 1341 01:16:49,560 --> 01:16:53,500 It's asymptotic because this result is valid asymptotically 1342 01:16:53,500 --> 01:16:55,960 as n goes to infinity. 1343 01:16:55,960 --> 01:17:01,260 It's equipartition because everything is equally likely. 1344 01:17:01,260 --> 01:17:03,480 And it's property because it's a property. 1345 01:17:03,480 --> 01:17:08,490 So it's the asymptotic equipartition property. 1346 01:17:08,490 --> 01:17:12,260 And I didn't do it so I could prove the Kraft inequality. 1347 01:17:12,260 --> 01:17:14,850 It's just that that's an extra bonus that we get. 1348 01:17:14,850 --> 01:17:20,070 And understanding why the Kraft inequality has to hold 1349 01:17:20,070 --> 01:17:28,890 for uniquely decodable codes is one application of the AEP 1350 01:17:28,890 --> 01:17:32,470 which lets you see a little bit about how to use it. 1351 01:17:32,470 --> 01:17:36,520 OK, so the argument is an argument by contradiction. 1352 01:17:36,520 --> 01:17:43,010 Suppose you generate a set of lengths for codewords. 1353 01:17:43,010 --> 01:17:44,550 And you want this -- yeah?
1354 01:17:55,250 --> 01:17:58,090 And the thing you would like to do is to assign codewords 1355 01:17:58,090 --> 01:18:01,220 of these lengths. 1356 01:18:01,220 --> 01:18:04,860 And what we want to do is to set this equal to some 1357 01:18:04,860 --> 01:18:05,630 quantity b. 1358 01:18:05,630 --> 01:18:09,020 In other words, suppose we beat the Kraft inequality. 1359 01:18:09,020 --> 01:18:12,130 Suppose we can make the lengths even shorter than 1360 01:18:12,130 --> 01:18:15,730 Kraft says we can make them. 1361 01:18:15,730 --> 01:18:17,905 I mean, he was only a graduate student, so we've got to be 1362 01:18:17,905 --> 01:18:21,480 able to beat his inequality somehow. 1363 01:18:21,480 --> 01:18:24,460 So we're going to try to make this equal to b. 1364 01:18:24,460 --> 01:18:27,930 We're going to assume that b is greater than 1. 1365 01:18:27,930 --> 01:18:30,890 And then what we're going to do is to show that we get a 1366 01:18:30,890 --> 01:18:32,470 contradiction here. 1367 01:18:32,470 --> 01:18:36,090 And this same argument can work whether we have a 1368 01:18:36,090 --> 01:18:39,600 discrete memoryless source or a source with memory, or 1369 01:18:39,600 --> 01:18:40,420 anything else. 1370 01:18:40,420 --> 01:18:42,830 It can work with blocks, it can work with variable length 1371 01:18:42,830 --> 01:18:46,000 to variable length codes. 1372 01:18:46,000 --> 01:18:49,560 It's all essentially the same argument. 1373 01:18:49,560 --> 01:18:52,390 So what I want to do is to get a contradiction. 1374 01:18:52,390 --> 01:18:56,230 I'm going to choose a discrete memoryless source. 1375 01:18:56,230 --> 01:18:58,900 And I'm going to make the probabilities equal to 1 over 1376 01:18:58,900 --> 01:19:02,300 b times 2 to the minus l sub i. 1377 01:19:02,300 --> 01:19:04,800 In other words, I can generate a discrete memoryless source 1378 01:19:04,800 --> 01:19:07,270 for talking about it with any probabilities I 1379 01:19:07,270 --> 01:19:08,800 want to give it. 1380 01:19:08,800 --> 01:19:12,650 So I'm going to generate one with these probabilities. 1381 01:19:12,650 --> 01:19:16,530 So the lengths are going to be equal to minus log of 1382 01:19:16,530 --> 01:19:19,220 b times p sub i. 1383 01:19:19,220 --> 01:19:22,920 Which says that the expected length of the codewords is 1384 01:19:22,920 --> 01:19:27,820 equal to the sum of p sub i l sub i, which is equal to the 1385 01:19:27,820 --> 01:19:31,780 entropy minus the logarithm of b. 1386 01:19:31,780 --> 01:19:34,450 Which means I can get an expected length which is a 1387 01:19:34,450 --> 01:19:37,440 little bit less than the entropy. 1388 01:19:37,440 --> 01:19:40,600 So now what I'm going to do is to consider strings of n 1389 01:19:40,600 --> 01:19:41,330 source letters. 1390 01:19:41,330 --> 01:19:43,460 I'm going to make these strings very, very long. 1391 01:19:46,270 --> 01:19:50,430 When I concatenate all these codewords, I'm going to wind 1392 01:19:50,430 --> 01:19:54,290 up with a length that's less than n times H of x minus 1393 01:19:54,290 --> 01:19:59,400 log b over 2, with high probability. 1394 01:20:13,510 --> 01:20:18,940 And as a fixed-length code of this length it's going to have 1395 01:20:18,940 --> 01:20:21,810 a low failure probability.
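Restating the construction just described in symbols, with b the value of the Kraft sum:

$$\sum_i 2^{-l_i} = b > 1, \qquad p_i = \frac{2^{-l_i}}{b}, \qquad l_i = -\log_2(b\,p_i),$$

$$\bar L = \sum_i p_i\, l_i = H(X) - \log_2 b \;<\; H(X),$$

and by the weak law of large numbers the concatenation of the codewords for n source letters has total length less than $n\bigl(H(X) - \tfrac{1}{2}\log_2 b\bigr)$ with probability approaching 1.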
1396 01:20:21,810 --> 01:20:26,740 And, therefore, what this says is I can, using this 1397 01:20:26,740 --> 01:20:32,670 remarkable code with unique decodability, and generating 1398 01:20:32,670 --> 01:20:37,500 very long strings from it, I can generate a fixed-length 1399 01:20:37,500 --> 01:20:41,550 code which has a low failure probability. 1400 01:20:41,550 --> 01:20:45,640 And I just showed you in the last slide 1401 01:20:45,640 --> 01:20:46,530 that I can't do that. 1402 01:20:46,530 --> 01:20:49,830 The probability of failure with such a code has to be 1403 01:20:49,830 --> 01:20:51,540 essentially 1. 1404 01:20:51,540 --> 01:20:54,870 So that's a contradiction that says you can't have these 1405 01:20:54,870 --> 01:20:57,460 unique decodable codes. 1406 01:20:57,460 --> 01:21:01,670 If you didn't get that in what I said, don't be surprised. 1407 01:21:01,670 --> 01:21:06,200 Because all I'm trying to do is to steer you towards how to 1408 01:21:06,200 --> 01:21:09,610 look at the section in the notes that does that. 1409 01:21:09,610 --> 01:21:12,430 It was a little too fast and a little too late. 1410 01:21:12,430 --> 01:21:15,570 But, anyway, that is the Kraft inequality for unique 1411 01:21:15,570 --> 01:21:16,650 decodability. 1412 01:21:16,650 --> 01:21:18,170 OK, thanks.