The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: I'm going to review what we did with the Kraft inequality just a little bit, because evidently a number of people were confused about this. I'm going to put a little more notation in with it. For some people, notation helps. For other people, it hinders things. But after you've thought about it a little bit, a little more notation can certainly be helpful.

What we're trying to do with this Kraft inequality is, we're thinking of a set of symbols where, supposing that there's a codeword for each symbol, c of x is the codeword for symbol x, which is a string of binary digits, y1 up to y sub n. In the world of two-toed sloths, the representation of numbers that sloths use is binary, base 2. And therefore the number associated with some sequence of bits like this would be the sum of y sub i times 2 to the minus i. In other words, it's the same thing as a decimal, except it's in a world where people have two fingers instead of ten.

There's an interval associated with this number, also. And the interval is what you would get if you took an arbitrary real number and rounded it down -- well, in this case to n binary digits. So one side of that interval, then, is the number itself. And the other side of the interval is that number itself -- the two-toed sloth didn't like what I was going to say about it, and it changed my slide -- plus that extra factor, the 2 to the minus n there. In other words, any number in that range, if you round it down to n significant binary digits, is going to be this number here. Well, the point of this is, if the number, namely this base 2 expansion of a number y prime, is in this interval, then y is going to be a prefix of y prime.
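To make the number-and-interval picture concrete, here is a minimal Python sketch (the helper names are made up for illustration):

    from fractions import Fraction

    def value(bits):
        # r(y) = sum of y_i * 2^-i for a binary string y = y1 ... yn
        return sum(Fraction(int(b), 2 ** (i + 1)) for i, b in enumerate(bits))

    def interval(bits):
        # the half-open interval [r(y), r(y) + 2^-n) covered by an n-bit string
        r = value(bits)
        return r, r + Fraction(1, 2 ** len(bits))

    def is_prefix(y, y_prime):
        # y is a prefix of y_prime exactly when r(y_prime) lands in y's interval
        lo, hi = interval(y)
        return lo <= value(y_prime) < hi

    print(value('101'))              # 5/8, namely 1/2 plus 1/8
    print(interval('101'))           # [5/8, 3/4)
    print(is_prefix('101', '1011'))  # True: 5/8 + 1/16 is in [5/8, 3/4)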
Let me just give you some examples of that, because saying it in words is confusing; the idea is very simple. Suppose you have a binary string, 011. That corresponds to the number 3/8, namely 1/4 plus 1/8. And the interval there is going to be the interval -- the sloth really is hitting hard this morning; maybe I'm the sloth; OK, there we go -- from 3/8, including 3/8, up to 1/2, but not including 1/2. Namely, 1/2 will be represented as just 1, nothing more than that. Or it could be 10 or 100, and so forth.

So then, as an example, 011 is going to be a prefix of all of these quantities here. It's a prefix of 0111, because 0111 is 3/8 plus 1/16, which is in that interval we're talking about. 0110 is a more interesting case. Because 0110 is itself -- the number associated with it is just 3/8. But what this is saying is, if you take the same number but expand it to four digits, according to what this says, r of y prime is then in this interval. And therefore this prefix situation holds. So this is a prefix of this.

Why do I make it so complicated? I make it so complicated because I want to talk about the length of that interval. And the length of these intervals is denoted in this diagram here. Any time I have a number expressed to n binary digits, it covers an interval of 2 to the minus n. And because of this prefix property, as soon as one number covers an interval, no other number can have its base in that interval. In other words, all of these intervals have to be disjoint, exactly as is indicated here. So when you add up the size of all these intervals, you have to get something less than or equal to 1. And that's the proof of the Kraft inequality.

So, let's go on and talk about discrete source probabilities. Now the two-toed sloth has gotten his machine out there; he's really mad at me this morning.
If we try to model English text, you know that some letters are far more probable than others. Namely, if you take an enormous amount of English text and you measure the relative frequency with which each letter occurs, you'll get more or less stable relative frequencies if you take enough text. And these letters are far more probable than these letters. So that gives you part of a model.

You can also say that successive letters are going to be very dependent. Namely, t is very often followed by h, and h is often preceded by t. Q-u is even more of a case here, because as far as English language words are concerned, u always follows q. Some letter strings are words. Other letter strings are not words. There are constraints on grammar. And what is really the clincher, which says there's no way you're going to model English in any sensible, nice way, is meaning. And even worse than that, depending on who writes the English, it might have meaning or it might not have meaning. And for those English texts that don't have any meaning, the entropy is going to depend very much on whether the meaningless text is written by a salesperson or it's written by James Joyce. And in one case you have -- well, in one case you have an enormous amount of freedom in what this sequence of letters is. And in the other case you have letters which are very, very constrained.

So what's the point of this? The point of this is, if you're interested in trying to find a source coding method for English, what you don't want to do is to start out trying to get the best statistical model of English that you can. Because it's a losing proposition. And by trying to do that, you'll spend all your time trying to get the model. And you won't get any insight into what you ought to do as far as source coding is concerned. This is pretty much true throughout all of technology. You don't solve problems by first getting too far into the details of what the problem is, before you start thinking about structures of possible solutions.
In other words, we always deal with technological problems by dealing with toy problems first. Now, there's a difference in how engineers and theoreticians worry about toy problems. Because engineers, if they hate theory, usually don't say what the toy problem is. But they have that toy problem very firmly in the back of their mind, because of all their experience. Theoreticians, on the other hand, make their models very, very explicit. They don't often like to say that they're toy models, because DARPA doesn't tend to support things that are working on toy models. So they try to conceal this. And often they just hide the fact that they're using a model. So, all of this becomes very complicated.

But, for you, if you're trying to do either engineering or mathematics or teaching, or just be a sensible person: when you're dealing with these problems, be explicit about what your models are. Try to understand the toy problems before you understand the more complicated problems. If you take that away from this course, it'll be a worthwhile course for you. And, of course, you won't understand it until you get lots more experience. But believe me, that's the way it is.

OK, so that's the whole point of this. You want to start with simple toy models. And I'm not just justifying the fact that we're going to study this incredibly simple model here, which is a toy model. But by studying this you will see that everything else follows. If you read Claude Shannon's work on information theory, this, in fact, is where he started. He started with a beautiful description of trying to model the English language. Finally he wound up talking about this. The conclusions he drew from studying these discrete memoryless sources led to his general theorems about data compression on sources. They led to his general theorems about the capacity of channels.
They led to the idea that you want to separate source coding from channel coding. And finally, they led to all of the modern ideas that we have about quantization. In other words, the simple ideas you get out of this generalize directly to everything that's known about information theory.

So. Enough philosophy; let's get on with the business of what we're trying to do. A discrete memoryless source has the following properties. The source output has to be an unending sequence, x1, x2, x3, blah, blah, blah, of random letters drawn from a finite alphabet, capital X. In other words, we are taking these real sources, and we're saying, let's make them not real now. Let's put a probability measure on them. And in this probability measure, one of the things that the probability measure will do will be to describe the probability on each one of these letters and the sequence in which it's coming out of the source.

Each source output, x1, x2, blah, blah, blah, is selected from a common alphabet. Namely, if you're using English on one letter of the sequence, you're going to use English on every letter of the sequence. We're going to use a common probability measure, with some probability mass function p sub X of x. This notation means this is the probability mass function for the chance variable X. A chance variable is like a random variable, except the objects are not necessarily numbers. The objects can be anything. So a chance variable is a generalization of a random variable. So this probability mass function talks about the probability of each of the symbols in this alphabet X.

Then the final thing is, each source output, x sub k, is statistically independent of all other source outputs, x1 up to x k minus 1, and x k plus 1 on to forever. This is a nice example, because if you're going to specify a source probabilistically, you have to somehow find a way of explaining what the probability of every possible event within this source is. This is an easy way of doing it. You say they're independent.
And then you can find the probability of anything you want to find. So that's a generic way of putting probability measures on things.

So then, we want to go into the idea of prefix-free codes for these discrete memoryless sources. We've already talked about prefix-free codes. We talked about the Kraft inequality. You might have thought it was a little bit strange talking about this strictly combinatorial property of codes without talking at all about the probabilities, which are the things that led us into talking about these codes in the first place. Namely, we want to use unequal length, variable length codes because of the fact that some letters are more likely than other letters. And eventually we'll be using them because of all these constraints between different words.

So, for notation, let little l of x be the length of the codeword for letter x in the alphabet capital X. OK, so that's the same as this y1, y2, up to y sub n -- some string of binary symbols. Capital L of X is a random variable, where capital L of X is equal to little l of x when capital X is equal to little x. Now, what the heck does that mean? It's just notation. In other words, what we're starting out with is this ensemble of letters. We have a probability assignment on each letter in that alphabet. And then what we would like to talk about is a length function on those letters. So we have little l of x, which is defined for each x. We then want to talk about this as a random variable. Because when we choose some random letter, little x, out of this ensemble, capital X, l of x becomes a random variable.

We will always, in this course, use capital letters to talk about random variables. And we will always use little letters to talk about things which are not random variables -- excuse me, not random variables or chance variables. I think we can probably leave it open now. It seems as if the sloth has gone away.
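To pin the notation down, here is a minimal Python sketch (the alphabet, the probabilities, and the code are made-up examples, not from the lecture):

    import random

    # a made-up discrete memoryless source and a prefix-free code for it
    alphabet = ['a', 'b', 'c', 'd']
    p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
    codeword = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
    l = {x: len(codeword[x]) for x in alphabet}   # little l of x, a fixed function

    # drawing a random letter capital X makes capital L = l(X) a random variable
    X = random.choices(alphabet, weights=[p[x] for x in alphabet])[0]
    L = l[X]
    print(X, L)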
So then we want to talk about the expected value of the length. When you talk about the expected value of something, you're talking about the expected value of a random variable. We will also denote the expected value of this random variable L with a bar over it, which is the sum, over the letters in the alphabet, of p of x times l of x. So all this is what you would do anyway if you never thought about this. Until, at some point, when you're taking a quiz or something, you start to get confused and say, what is this stuff all about? I don't have any idea what this means, after you've written five pages of stuff. So it's worthwhile spending a little bit of time sorting that out.

So, L bar is the number of encoder output bits per source symbol -- in some strange sense. Namely, it's this expected value. Now, to finish this off, we want to look at the number of binary digits that come out of the encoder when a long sequence of letters comes out of the source. A long sequence of letters, x1, x2, x3, and so forth, comes out of the source. They go into the encoder. The encoder is mapping each letter that comes out of the source into this codeword c of x. So we have a sequence of codewords which are all concatenated together. And, therefore, the total number of binary digits which have come out of the encoder, corresponding to these n symbols that have come out of the source, is the sum of l of x1 plus l of x2 plus l of x3 plus l of x4, and so forth. So we have a sum of independent random variables.

Now, what do you know about sums of independent random variables? Well, the one thing you ought to know about, and which should be stamped on your brain because it's the central thing that makes any probabilistic theory make sense -- it's the only way that we can ever understand our environment. You look at the past. You try to figure out from the past what's going on in the future.
And the only way you can do that, the only tool you have, really, is this law of large numbers. Which says, when you see a long sequence of things, from that long sequence of things, you sort of figure out what's going on. If you're dealing with a random variable, the thing you do is add up all of these numbers. You divide by the total number of them that you have. And that gives you the expected value. It gives you a typical value. What the law of large numbers really says is, if you look at the sum of binary digits out of this encoder, over a very long period of time, and divide by the total number of symbols, that's a random variable again. And this random variable is, with high probability, going to be very, very close to this expected value, which is this quantity here. In other words, the ensemble average, which is this, is going to be very close to the time average. And the time average, now, is a random variable. And that's what the law of large numbers says.

You see, the problem that we all have, dealing with real world problems, is that there's nobody to tell us, this is what the ensemble is. Unless you believe somebody that doesn't know. And the only real evidence that you have is the actual sequence. And from the actual sequence, you then look at what happens for this particular sequence. You then build a model. And your model, by definition, has the expected value of L equal to the expected value in the model that you've chosen.

So. What's your objective? Your objective in trying to form a prefix-free code, then, is to find a set of integers, l of x, which satisfy the Kraft inequality and minimize L bar. In other words, what we're trying to do is, we're trying to choose a code which minimizes the expected length of the code. Which is really, over a long period of time, going to minimize the number of binary digits that come out of the source encoder.
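As a quick check on that picture, here is a sketch that encodes a long i.i.d. sequence and compares the time average to L bar (same made-up source and code as in the sketch above):

    import random

    alphabet = ['a', 'b', 'c', 'd']
    p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
    codeword = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

    # the ensemble average: L bar = sum over x of p(x) * l(x)
    L_bar = sum(p[x] * len(codeword[x]) for x in alphabet)

    # encode n source letters and look at the time average, in bits per symbol
    n = 100_000
    seq = random.choices(alphabet, weights=[p[x] for x in alphabet], k=n)
    total_bits = sum(len(codeword[x]) for x in seq)
    print(L_bar, total_bits / n)   # very close, by the law of large numbers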
What we want to do is to choose these integers to minimize this. So what we're going to do now is, suppose our alphabet is just 1, 2, up to capital M. What am I doing here? I'm saying, we don't care what these symbols are called, anyway. It's totally irrelevant what the names of the symbols are. So I will name them 1, 2, up to capital M. The probability mass function, then, I can denote as p sub 1 up to p sub capital M. In other words, I've gotten rid of all these x's that were lousing up our equations all along. Now, I'll denote the unknown lengths by l1 up to l sub M.

So the problem is, somebody gives you this set of numbers, p1 to p sub M, which is a PMF. In other words, these numbers add up to 1. And tells you, I want a prefix-free code which minimizes this expected length. Namely, the expected value corresponding to these lengths here. So, to minimize the expected length, what we want to do is to minimize, over the choice of l1 up to l sub M, subject to the Kraft inequality, this expected value. So we have a nice, clean mathematical problem now. We want to minimize this sum, subject to this constraint. And the constraint includes the fact that all of these things have to be integers.

Well, for those of you who have studied minimization, there's a funny thing in here. Because integer minimization problems tend to be very, very nasty. And, therefore, you look at this and you say, this is probably something I'm going to have trouble solving. Strangely enough, it isn't. But you would think it probably is something which will be hard to solve. So, since integers louse up minimization problems, what do we do? Well, we say, just for fun, let's try to solve this problem without the integer constraint on it. Let's see what that leads to, and see if we can do anything with that. So we say, OK, let's try to minimize this function here over the numbers l1 to l sub M, subject to this constraint. So we're minimizing this, subject to this constraint.
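Written out, the relaxed problem (with the lengths allowed to be real numbers for the moment) is:

    \min_{l_1, \ldots, l_M} \sum_{i=1}^{M} p_i \, l_i
    \quad \text{subject to} \quad
    \sum_{i=1}^{M} 2^{-l_i} \le 1 .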
Now, an easy way to do that -- yes?

AUDIENCE: Are you saying that the lengths are not fixed by the probabilities?

PROFESSOR: No, I still have these fixed probabilities. I still have p1 up to p sub M as known probabilities. But I'm going to say, let's suppose I can choose a length which is two point five bits instead of two bits.

AUDIENCE: You're saying the shortest length [UNINTELLIGIBLE]

PROFESSOR: Well, we're going to wind up there eventually. But for now, all I want to do is to look at this problem. If I start out by saying, assign the shortest lengths to the biggest probabilities, I have two problems. One is, it's a little hard to prove to you that I want to do that. Although we'll do that later today. And the other is, it doesn't really give you the general properties that we want to know about this. So, for those two reasons, I want to just attack this as a straightforward mathematical problem.

If you're a computer scientist, this looks strange. Because computer scientists like to attack problems by algorithms. Analog engineers like to attack problems by writing a complicated formula, and taking derivatives, and all sorts of things like that. We're going to be doing both of those things in this course. And you'll see that both of them lead to certain advantages. And here, we're taking the analog engineer's approach, which says: suppose these are a bunch of numbers. I want to minimize this function over a set of numbers, l1 up to l sub capital M.

So, how do I do that? Well, this guy Lagrange, he was a great mathematician. He was also a great mathematician early enough that he could do some really trivial things and become famous for them. Just like Kraft, who we were talking about before. But, unlike Kraft, Lagrange really did a lot of other very important things. And what Lagrange said was the following. Well, suppose I want to minimize this sum. And I want to have this constraint added in.
Sort of what I want to do, then, is to minimize a weighted sum of this, which is what I'm interested in, and this. In other words, if I minimize this weighted sum here of these two things, I'm going to wind up with some sort of value for this, and some sort of value for this. By changing lambda, then -- which stands for Lagrange; he was also clever in making himself famous that way -- I can change the balance between how important these two things are. And as I change the balance between how important they are, when I change it to just the right place, I'm going to have this constraint here satisfied with equality. So that's the whole idea of Lagrange minimization.

So we take this function. How do you now minimize a function of multiple variables? Well, again, it's a messy problem. But the first thing you can try to do is find a stationary point. So, let's always do the easy thing first. We take the partial derivative of this function here with respect to l sub i. That's what we're trying to minimize. And what we get is p sub i minus lambda times the natural log of 2, times 2 to the minus l sub i. I'm not very good at differentiation any more, so I only differentiate things which are easy. And that's easy. I want to find a stationary point, so I set this equal to 0.

That makes the problem worse, because now I have a function of lambda and also all of these l sub i's. But now I choose lambda so that I satisfy the constraint. Namely, I choose lambda to satisfy this equation here. When I choose lambda to satisfy this equation here, what I get is p sub i is equal to 2 to the minus l sub i, and therefore l sub i is equal to minus log p sub i. In other words, I have this equation here. What happens when I sum this equation over i? Let's look at it. We sum this over i. The sum of p sub i over i is 1, minus lambda times the natural log of 2 times the sum of 2 to the minus l sub i. And I want to make this sum of 2 to the minus l sub i equal to 1.
So what I get is that 1 is equal to this sum times lambda times the natural log of 2. So I hope that choosing lambda equal to 1 over the natural log of 2 is what I want to do. And when I do that, this becomes 1 here. And I just have the sum of 2 to the minus l sub i equal to 1. OK, good. And then, going back to this equation, p sub i is equal to 2 to the minus l sub i. OK, this is just arithmetic. I mean, if you don't follow what I'm doing, just look at it later and you'll find that there's nothing really there.

So we wind up with these lengths being equal to the negative of the binary logarithms of these probabilities. It's only a stationary point. We don't know whether it's a minimum yet. And, unfortunately, we also have the problem that they might not be integers. But, anyway, what we wind up with, then, if we ignore these problems for the time being, is that the lengths are going to be equal to this. The expected value of the lengths is then going to be equal to the sum, over i, of minus p sub i times the logarithm of p sub i.

When Shannon saw this, and when various other people saw it, they said, gee, this looks like the entropy of statistical mechanics. So let's call this quantity entropy. For no better reason than that. And it would probably have been far better if they had called it something else. Because for years, there were physicists and philosophers trying to figure out what the deep relationship was between statistical mechanical entropy and information theoretic entropy. And there probably is such a relationship, but the relationship is far more complicated than understanding information theory. And it's far more complicated than understanding statistical mechanics. So I advise you not to worry about that one until after you understand what it means in an information theoretic sense. So H of X is what we call the entropy of the random variable X. And it really is the entropy associated with these logarithms of p sub i.
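Here is a small sketch of that conclusion (a made-up dyadic pmf, chosen so that the optimal lengths happen to come out as integers):

    import math

    p = [0.5, 0.25, 0.125, 0.125]          # a made-up probability mass function

    # the stationary-point solution: l_i = -log2 p_i
    l_opt = [-math.log2(pi) for pi in p]   # [1.0, 2.0, 3.0, 3.0]

    # the resulting expected length is the entropy, H = -sum of p_i log2 p_i
    H = -sum(pi * math.log2(pi) for pi in p)
    L_bar = sum(pi * li for pi, li in zip(p, l_opt))
    print(l_opt, H, L_bar)                 # H and L bar both equal 1.75 here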
So when you take functions of a random variable -- a random variable carries along a lot of baggage with it, including the probabilities of everything. And when you take the expected value of a random variable, the individual values of the sample points of that random variable are important, and the probabilities are important. Here we have something even stranger. Because it's only the probabilities that have anything to do with it. And this makes sense. We already said that these symbols have nothing to do with this problem we're dealing with. You can call the symbols whatever you want to call them. And, therefore, the only thing of any interest to us is these probabilities that we're dealing with. So H of X is a function only of these probabilities. It's the expected value of minus log p sub i.

This is called entropy, and in fact we will find out very shortly that it really is the minimum number of bits per source symbol needed to represent the source. In other words, when we generalize the problem from just plain ordinary garden-variety prefix-free codes, we will find that this number is really what characterizes the whole problem for discrete memoryless sources. So let's go on and say more about that.

Let's say something about bounds on the entropy. First, what's the relationship between the entropy and this minimum of the expected length that we started to talk about? And I claim that H of X is less than or equal to L min, which is less than the entropy plus 1. And why is that? We already have the machinery to see this. We almost have the machinery to see this. Namely, we have solved this minimization problem. We've only found a stationary point, and we've ignored the fact that we have an integer constraint. So, if you allow me to say for the time being that, in fact, when we solve the problem without worrying about integers, it actually gives me a minimum, then in fact this follows very easily.
Because what I'm going to do is to find those optimal lengths, which are non-integers. And then I can satisfy the prefix condition and get a code by simply increasing each of those numbers to the next integer up. In other words, I can take the ceiling function of each of those real numbers to get an integer. When I take the ceiling function, 2 to the minus l sub i is going to go down. So the Kraft inequality is still satisfied. So the entropy has to be less than or equal to this average, which has to be less than H of X plus 1. And the average is equal to H of X if and only if each of these probabilities is an integer power of 2 to start with. In other words, the solution I came up with before is that the length I wanted should be equal to minus the logarithm, to base 2, of p sub i. So if p sub i is already a power of 2, then I'm home free. Because I just pick that length to be minus log of p sub i, and it happens to be an integer. And I don't have to round it up.

So if I let l1 to lM be these codeword lengths -- well, here's where I'm going to prove this to you. And the proof is the following. I want to prove that H of X is less than or equal to L min, which I'll just call here L bar. So H of X minus L bar is equal to -- this is the entropy, and this is the expected length here. I can rewrite this as the sum of p sub i times the logarithm of 2 to the minus l sub i divided by p sub i. That's just arithmetic. l sub i is equal to the logarithm of 2 to the l sub i, which is equal to minus the logarithm of 2 to the minus l sub i. So I get this.

There's an inequality. I hate to call it an inequality, it's so trivial. Here's the point 1. If you plot the natural log of x, and if you compare it with the function x minus 1, you can see that the natural log of x is less than or equal to x minus 1. Now, this is an inequality which happens to be very useful in information theory.
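Written out, that inequality is:

    \ln u \le u - 1 \quad \text{for all } u > 0,
    \text{ with equality if and only if } u = 1 .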
I would claim that any inequality that you can prove in information theory, by any means at all, I can prove using this inequality and nothing else. And I've believed that for 50 years and nobody's proven me wrong yet. And also, this is something you can draw and remember. So it's simple.

So the idea, then, is that this logarithm of 2 to the minus l sub i over p sub i is a logarithm to the base 2, which is just a constant times the natural logarithm. You just -- here we go. For any u greater than 0, the natural log of u is less than or equal to u minus 1. So the logarithm to the base 2 of u is less than or equal to the logarithm to the base 2 of e, which is some number, times u minus 1. With equality at u equals 1. So this is less than or equal to this. And look how nice that is. The p sub i's cancel out, and you get that the sum over i of 2 to the minus l sub i, minus p sub i, is less than or equal to 0. And equality occurs if, and only if, p sub i is equal to 2 to the minus l sub i. OK? So that's all there is to it. And that establishes -- well, establishes part of this theorem here. And the other part we already established. And if you don't believe me, the notes do it more carefully.

Well, this left a serious problem unsolved. Which is, how do you actually solve this integer minimization problem? How do you solve it if you have a big, long, complicated source with lots of probabilities in it? And everybody thought it was hopeless. Even Shannon thought it was hopeless. And Shannon sort of figured out ways to approach this problem. He said, well, you want to have about half the probability starting with 1, and about half the probability starting with 0. So he would divide up the symbols in the alphabet, so he could come as close as possible to half of them being up here and half of them coming down here.
And he would continue to do that. I mean, I don't usually like to write on the blackboard, but he would start out generating a code like this. And this would be approximately 1/2. This is approximately 1/2. And then he would take these symbols and split them, again, in probability. And everybody was starting the problem over here, and trying to generate a code working their way out.

Well, Dave Huffman was a graduate student at the time. And he took Bob Fano's graduate course in information theory, I think a year or so later than Kraft did. And Bob Fano assigned as a homework problem: how do you solve this problem? Sneaky guy. And he was very amazed when Dave Huffman came in the next day and said, oh, it's easy, you do it this way. So the question is, how did he do it?

Well, Huffman, instead of looking at the problem from here out, looked at the problem from here in. He was -- I mean, this was before there was anything called computer science. But he thought like a computer scientist does. In other words, he thought algorithmically. And he also thought in terms of discrete problems. And therefore, he looked for properties that these optimum codes should have. And it was neat.

So, he started out with a lemma. He said, an optimal code has to have the property that if p sub i is greater than p sub j, then the optimal length associated with p sub i, namely the optimal length of the i'th codeword, has to be less than or equal to the length of the j'th codeword. And you can see this by saying, well, suppose that's not true. Suppose that p sub i is greater than p sub j, and also l sub i is greater than l sub j. And then you say, OK, take this situation. We will interchange those two codewords in the code. And we'll look at what that does to the average. And if you work that through, you find out that what you've done is, you've shortened the codeword associated with the more probable symbol and lengthened the codeword associated with the less probable one.
You have changed the average length to make it smaller. Now, let me warn you about something. When you start looking at these properties, the most confusing thing is what happens when two probabilities are the same, or when two lengths are the same. And I would advise you to just ignore that problem until you get an idea of what's going on. Namely, assume that all lengths are different, all probabilities are different. And then it's easy to see what's going on. And when you get all done, go back and straighten out the cases where things are equal. And I think the notes do this carefully. If you read books on information theory, about half of them do it carefully, and about half of them don't. So you should be suspicious. But anyway, that's one of those trivialities that you just have to sort out for yourself.

OK. The next lemma is: optimal prefix-free codes are full. We talked about what a full code is. When you draw the binary tree for it, you don't have any leaves that are not associated with codewords. Because if you do, we showed you can shorten the codewords in the part of the tree on the other side of that leaf. In other words, if this is a codeword and this is not a codeword, then you just get rid of this and bring that back here. And if this is a whole tree stemming off here, you do the same thing. You take this whole tree and you bring it into there, and you throw this away. So, optimal prefix-free codes are full. So far there's nothing to this.

The next part of it is the sibling of a codeword. And what's a sibling? Well, we used to call it a brother. But then we couldn't do that, because we would have to call it a brother or sister. And that got too difficult. So people invented the word sibling, to talk about a brother or a sister. So the sibling of a codeword is the string formed by changing its last bit.
770 00:41:34,910 --> 00:41:37,410 In other words, in this family tree here, the 771 00:41:37,410 --> 00:41:38,980 sibling of this is this. 772 00:41:38,980 --> 00:41:41,300 The sibling of this is this. 773 00:41:41,300 --> 00:41:43,890 The sibling of this is this. 774 00:41:43,890 --> 00:41:49,690 So, a leaf can have a sibling which is an intermediate node, 775 00:41:49,690 --> 00:41:50,940 and vice versa. 776 00:41:54,520 --> 00:41:58,100 So then he said, the sibling of a codeword is a string 777 00:41:58,100 --> 00:42:00,930 formed by changing the last bit. 778 00:42:00,930 --> 00:42:04,720 I think he probably said the brother, but, anyway. 779 00:42:04,720 --> 00:42:09,650 For optimality, the sibling of each maximum length codeword 780 00:42:09,650 --> 00:42:12,200 is another codeword. 781 00:42:12,200 --> 00:42:14,520 Now, that's a really simple one. 782 00:42:14,520 --> 00:42:17,790 If I make this a codeword, and this is the maximal length 783 00:42:17,790 --> 00:42:22,380 codeword in this code I'm talking about, this can't be 784 00:42:22,380 --> 00:42:24,640 an intermediate node because then there would have to be 785 00:42:24,640 --> 00:42:27,240 longer codewords. 786 00:42:27,240 --> 00:42:30,030 And it can't be empty because these optimal 787 00:42:30,030 --> 00:42:31,720 codes are all full. 788 00:42:31,720 --> 00:42:34,530 And therefore, this has to have a sibling 789 00:42:34,530 --> 00:42:36,800 which is also a codeword. 790 00:42:36,800 --> 00:42:43,970 So the longest codewords have to have siblings. 791 00:42:43,970 --> 00:42:46,960 Well, that's easy enough. 792 00:42:46,960 --> 00:42:50,435 Incidentally, one of the problems that you have in 793 00:42:50,435 --> 00:42:53,250 proving this sort of thing is, what happens if you have 794 00:42:53,250 --> 00:42:56,350 zero-probability letters? 795 00:42:56,350 --> 00:42:59,200 Well, we just get rid of that problem and say, well, there 796 00:42:59,200 --> 00:43:01,860 aren't any zero-probability letters. 797 00:43:01,860 --> 00:43:04,760 Because if we want to come up with a sensible model for 798 00:43:04,760 --> 00:43:07,850 something, we're not going to create a codeword for 799 00:43:07,850 --> 00:43:09,490 something that can't happen. 800 00:43:09,490 --> 00:43:13,630 So, there are no zero-probability letters in 801 00:43:13,630 --> 00:43:15,620 this alphabet. 802 00:43:15,620 --> 00:43:17,810 I mean, if you want to put them in, it just complicates 803 00:43:17,810 --> 00:43:19,290 the whole thing. 804 00:43:19,290 --> 00:43:20,540 And you can do it. 805 00:43:23,910 --> 00:43:26,730 Then, finally, there's this lemma which says, there is an 806 00:43:26,730 --> 00:43:31,910 optimal prefix-free code in which, after you order the 807 00:43:31,910 --> 00:43:37,250 probabilities of all of the messages, namely you order p1 808 00:43:37,250 --> 00:43:40,000 to be greater than or equal to p2, and so on down to 809 00:43:40,000 --> 00:43:40,880 p sub m. 810 00:43:40,880 --> 00:43:45,380 In other words, we just rename the letters in the alphabet, 811 00:43:45,380 --> 00:43:49,780 so that letter m is less likely than letter m minus 1, 812 00:43:49,780 --> 00:43:50,760 and so forth. 813 00:43:50,760 --> 00:43:51,480 Back to 1. 814 00:43:51,480 --> 00:43:55,560 1 is the most probable, m is the least likely. 815 00:43:55,560 --> 00:43:58,010 Well, we've already concluded that we want to assign the 816 00:43:58,010 --> 00:44:02,910 longest codewords to the least probable symbols.
817 00:44:02,910 --> 00:44:06,860 And this says, take the two least probable symbols, and 818 00:44:06,860 --> 00:44:09,400 we can always make an optimal code in which those two 819 00:44:09,400 --> 00:44:11,660 codewords are siblings. 820 00:44:11,660 --> 00:44:14,280 And the reason for that is, one of them is not going to be 821 00:44:14,280 --> 00:44:19,180 longer than the other, or else you can shorten the code by 822 00:44:19,180 --> 00:44:21,110 interchanging things. 823 00:44:21,110 --> 00:44:24,640 So there is an optimal prefix-free code in which the 824 00:44:24,640 --> 00:44:27,520 codeword for m minus 1 825 00:44:27,520 --> 00:44:29,370 and the codeword for m are maximal 826 00:44:29,370 --> 00:44:33,640 length and they're siblings. 827 00:44:33,640 --> 00:44:37,360 So the Huffman algorithm first combines these two. 828 00:44:37,360 --> 00:44:41,100 And then looks at the reduced tree with m minus 1 nodes. 829 00:44:41,100 --> 00:44:42,900 Let me show you an example of that. 830 00:44:46,700 --> 00:44:47,990 So it starts out. 831 00:44:47,990 --> 00:44:50,570 Here, I've ordered the probabilities associated with 832 00:44:50,570 --> 00:44:51,410 a set of symbols. 833 00:44:51,410 --> 00:44:54,700 The symbols are 1, 2, 3, 4, 5. 834 00:44:54,700 --> 00:45:00,310 The two least likely messages are 0.1 and 0.15. 835 00:45:00,310 --> 00:45:02,260 Obviously, I could've interchanged these 836 00:45:02,260 --> 00:45:03,330 two if I want to. 837 00:45:03,330 --> 00:45:06,340 But why interchange them? 838 00:45:06,340 --> 00:45:11,270 So I say, OK, the last digit on this one, I'm going to 839 00:45:11,270 --> 00:45:13,050 assign to be a 0. 840 00:45:13,050 --> 00:45:16,920 The last digit on this, I'm going to assign to be a 1. 841 00:45:16,920 --> 00:45:19,000 And the important thing is, I'm going to make them 842 00:45:19,000 --> 00:45:20,700 siblings in this tree. 843 00:45:20,700 --> 00:45:24,740 And what I'm going to do now, terribly complicated thing, 844 00:45:24,740 --> 00:45:27,890 instead of building a tree from left to right, I'm going 845 00:45:27,890 --> 00:45:30,500 to build a tree from right to left. 846 00:45:30,500 --> 00:45:32,850 So when I get all done with the tree it's going to 847 00:45:32,850 --> 00:45:34,250 come in like this. 848 00:45:34,250 --> 00:45:38,560 And what I'm doing is starting out at the end, to start to 849 00:45:38,560 --> 00:45:40,760 build the end of the tree. 850 00:45:40,760 --> 00:45:44,610 And what happens after I go through this first step is, I 851 00:45:44,610 --> 00:45:47,280 say, OK, there is an optimal code 852 00:45:47,280 --> 00:45:50,420 in which these two quantities are 853 00:45:50,420 --> 00:45:52,950 siblings of maximal length. 854 00:45:52,950 --> 00:45:55,740 I now want to form an optimal code for these 855 00:45:55,740 --> 00:45:58,640 probabilities here. 856 00:45:58,640 --> 00:46:01,140 So, I go back and I iterate again. 857 00:46:01,140 --> 00:46:04,590 And I say, OK, if I have these probabilities here, 858 00:46:04,590 --> 00:46:07,150 what's the optimal code? 859 00:46:07,150 --> 00:46:08,830 Well, I could reorder the things. 860 00:46:08,830 --> 00:46:11,400 But now I know that the only thing I'm interested in is the 861 00:46:11,400 --> 00:46:16,690 two least likely symbols in this new alphabet here. 862 00:46:16,690 --> 00:46:19,360 Which is 0.2 and 0.15. 863 00:46:19,360 --> 00:46:21,300 So I combine those together.
864 00:46:21,300 --> 00:46:24,610 I tie them together as siblings in this last 865 00:46:24,610 --> 00:46:28,000 generation, however it works out. 866 00:46:28,000 --> 00:46:31,450 So then I have an alphabet of size three. 867 00:46:31,450 --> 00:46:32,990 And then down here, I have these 868 00:46:32,990 --> 00:46:34,550 two things tied together. 869 00:46:34,550 --> 00:46:36,730 These two things tied together. 870 00:46:36,730 --> 00:46:41,200 So I have a node of probability 0.25. 871 00:46:41,200 --> 00:46:45,200 I have a node of probability 0.35, and I have a node of 872 00:46:45,200 --> 00:46:46,870 probability 0.4. 873 00:46:46,870 --> 00:46:51,310 I take the two least likely, and I tie them together. 874 00:46:51,310 --> 00:46:54,120 And then I have two nodes left, one with probability 0.6 875 00:46:54,120 --> 00:46:56,480 and one with probability 0.4. 876 00:46:56,480 --> 00:46:58,010 And I tie them together. 877 00:46:58,010 --> 00:47:01,760 And, presto, I have my whole code, except for flipping it 878 00:47:01,760 --> 00:47:04,240 over, to go from left to right if you like. 879 00:47:04,240 --> 00:47:05,960 Codes that go from left to right, 880 00:47:05,960 --> 00:47:07,220 instead of right to left. 881 00:47:11,780 --> 00:47:12,190 OK. 882 00:47:12,190 --> 00:47:13,050 I have swindled you. 883 00:47:13,050 --> 00:47:14,790 How have I swindled you? 884 00:47:18,150 --> 00:47:20,680 I mean, I've swindled you a little bit by talking about 885 00:47:20,680 --> 00:47:23,360 these things that might be equal or not equal. 886 00:47:23,360 --> 00:47:24,350 And that's not important. 887 00:47:24,350 --> 00:47:26,380 You can sort that out on your own. 888 00:47:26,380 --> 00:47:29,200 There's a very important swindle I pulled. 889 00:47:29,200 --> 00:47:30,450 And what's that? 890 00:47:42,260 --> 00:47:46,450 What's very incomplete in this argument? 891 00:47:50,240 --> 00:47:53,090 This part is fine. 892 00:47:53,090 --> 00:47:54,850 Nothing wrong here. 893 00:47:54,850 --> 00:47:58,750 We have a lemma which says, you can find an optimal code 894 00:47:58,750 --> 00:48:00,410 by tying these two things together. 895 00:48:03,270 --> 00:48:03,560 Yeah? 896 00:48:03,560 --> 00:48:04,810 AUDIENCE: [UNINTELLIGIBLE] 897 00:48:11,085 --> 00:48:14,020 combine those two [UNINTELLIGIBLE] combination. 898 00:48:14,020 --> 00:48:15,540 PROFESSOR: You're saying, how do I know to 899 00:48:15,540 --> 00:48:17,380 combine these two? 900 00:48:17,380 --> 00:48:18,640 OK, which means what? 901 00:48:18,640 --> 00:48:19,585 Yeah. 902 00:48:19,585 --> 00:48:21,722 AUDIENCE: [UNINTELLIGIBLE] you've just added the 903 00:48:21,722 --> 00:48:24,340 probabilities -- 904 00:48:24,340 --> 00:48:26,760 PROFESSOR: I've just added those two probabilities. 905 00:48:26,760 --> 00:48:30,190 So I have a new ensemble where I have four probabilities, 906 00:48:30,190 --> 00:48:36,350 0.25, 0.15, 0.2, and 0.4. 907 00:48:36,350 --> 00:48:37,070 And that's fine. 908 00:48:37,070 --> 00:48:39,250 I still have these things. 909 00:48:39,250 --> 00:48:41,450 No, there's no independence involved here at all. 910 00:48:41,450 --> 00:48:46,140 I mean, I started out with five letters. 911 00:48:46,140 --> 00:48:48,150 Which are disjoint. 912 00:48:48,150 --> 00:48:50,100 I now have four letters that are disjoint. 913 00:48:56,200 --> 00:48:57,930 What have I done? 914 00:48:57,930 --> 00:48:58,160 Yeah. 915 00:48:58,160 --> 00:48:59,410 AUDIENCE: [UNINTELLIGIBLE] 916 00:49:03,490 --> 00:49:05,200 PROFESSOR: Yes.
917 00:49:05,200 --> 00:49:05,510 Yeah. 918 00:49:05,510 --> 00:49:11,610 I have assumed, now, that once I get these four symbols, if I 919 00:49:11,610 --> 00:49:15,910 have those four symbols, I can form an optimal code for those 920 00:49:15,910 --> 00:49:20,720 four symbols in which these two symbols get tied together. 921 00:49:20,720 --> 00:49:24,500 But how do I know that an optimal code for this reduced 922 00:49:24,500 --> 00:49:27,980 set of probabilities is also an optimal code for the 923 00:49:27,980 --> 00:49:29,230 original problem? 924 00:49:34,940 --> 00:49:37,910 I have tied these two things together. 925 00:49:37,910 --> 00:49:40,870 I know there's an optimal code in which these two things are 926 00:49:40,870 --> 00:49:42,480 tied together. 927 00:49:42,480 --> 00:49:44,760 I then have four symbols. 928 00:49:44,760 --> 00:49:47,990 I want to find a code for those four symbols. 929 00:49:47,990 --> 00:49:51,950 But I assume that the optimal code for these four symbols, 930 00:49:51,950 --> 00:49:55,000 when I break apart these two things, gives me an optimal 931 00:49:55,000 --> 00:49:58,430 code for five symbols. 932 00:49:58,430 --> 00:50:01,090 That's the sort of thing I want you people to start 933 00:50:01,090 --> 00:50:02,620 catching onto immediately. 934 00:50:02,620 --> 00:50:07,520 I want you to start asking those nasty questions. 935 00:50:07,520 --> 00:50:11,840 And those nasty questions are the things that say, OK, how 936 00:50:11,840 --> 00:50:15,290 do I know that this works? 937 00:50:15,290 --> 00:50:17,090 In other words, you're not here to learn these 938 00:50:17,090 --> 00:50:18,100 algorithms. 939 00:50:18,100 --> 00:50:21,160 I can tell you what the algorithm is in an instant. 940 00:50:21,160 --> 00:50:23,440 You can do the algorithm. 941 00:50:23,440 --> 00:50:26,790 A computer can do the algorithm about three thousand 942 00:50:26,790 --> 00:50:29,290 times faster than you can. 943 00:50:29,290 --> 00:50:32,260 And you can be replaced by a computer, if you only learn 944 00:50:32,260 --> 00:50:34,440 the algorithms. 945 00:50:34,440 --> 00:50:37,190 You can program the algorithm. 946 00:50:37,190 --> 00:50:40,100 You can probably find a computer that can program the 947 00:50:40,100 --> 00:50:42,130 algorithm too. 948 00:50:42,130 --> 00:50:45,370 And there's no need to program it more than once. 949 00:50:45,370 --> 00:50:50,530 So that after you've done that, you are useless again. 950 00:50:50,530 --> 00:50:53,030 So the only thing that's worthwhile for you is to be 951 00:50:53,030 --> 00:50:55,910 able to spot these problems and to understand 952 00:50:55,910 --> 00:50:58,590 what's going on. 953 00:50:58,590 --> 00:50:59,810 So. 954 00:50:59,810 --> 00:51:04,050 How do I know that this first optimization leads to the 955 00:51:04,050 --> 00:51:07,200 second optimization? 956 00:51:07,200 --> 00:51:10,080 After combining these two least likely codewords, or 957 00:51:10,080 --> 00:51:15,110 siblings, we've gotten a reduced set of probabilities. 958 00:51:15,110 --> 00:51:19,040 In this problem here, what we've done, the reduced set of 959 00:51:19,040 --> 00:51:27,770 probabilities are 0.4, 0.2, 0.15, and 0.25. 960 00:51:27,770 --> 00:51:32,000 Why does finding the optimal code for this reduced set 961 00:51:32,000 --> 00:51:35,680 result in an optimal code for the original set? 962 00:51:35,680 --> 00:51:39,000 That's really the question that we're asking. 963 00:51:39,000 --> 00:51:43,240 Well, it's not hard.
964 00:51:43,240 --> 00:51:47,450 If you take any code for the reduced set, let's call the 965 00:51:47,450 --> 00:51:51,410 reduced set of probabilities x prime. 966 00:51:51,410 --> 00:51:55,160 Let the expected length of that be L bar prime. 967 00:51:55,160 --> 00:51:58,450 It's not necessarily an optimal code, but it's any old 968 00:51:58,450 --> 00:52:00,040 code that I generate. 969 00:52:00,040 --> 00:52:04,600 Any old code I generate for x prime, I can now take that 970 00:52:04,600 --> 00:52:12,360 code and I can expand it out to a code for x. 971 00:52:12,360 --> 00:52:18,070 Namely, I have this code here: this, this, this, and that's 972 00:52:18,070 --> 00:52:19,920 the expanded -- 973 00:52:19,920 --> 00:52:23,620 and now I can expand it into a code for the original set, by 974 00:52:23,620 --> 00:52:28,180 adding on this and this, as leaves on this. 975 00:52:28,180 --> 00:52:31,100 This leaf here then becomes an intermediate node. 976 00:52:31,100 --> 00:52:34,980 And I add two extra leaves to it. 977 00:52:34,980 --> 00:52:37,130 OK, well, it's not hard. 978 00:52:37,130 --> 00:52:45,060 The expected length for this code, for these five letters, 979 00:52:45,060 --> 00:52:48,340 I claim, is equal to the expected length for this 980 00:52:48,340 --> 00:52:52,210 reduced code, this, this, this, and this. 981 00:52:52,210 --> 00:52:55,160 Plus one extra digit for this. 982 00:52:55,160 --> 00:52:58,650 Plus one extra digit for this. 983 00:52:58,650 --> 00:53:02,310 So the expected length L bar is the expected length L 984 00:53:02,310 --> 00:53:08,970 bar prime plus 0.15 plus 0.1. 985 00:53:08,970 --> 00:53:19,310 Which says the following: if I want to minimize this, and I 986 00:53:19,310 --> 00:53:23,070 know that this has to be equal to this, and these two numbers 987 00:53:23,070 --> 00:53:25,440 are fixed, I can't change them. 988 00:53:25,440 --> 00:53:29,770 I can minimize this, by minimizing this. 989 00:53:29,770 --> 00:53:31,610 And that's the final step in the whole argument. 990 00:53:35,060 --> 00:53:37,170 And what's peculiar is that everybody 991 00:53:37,170 --> 00:53:39,720 learns the Huffman algorithm. 992 00:53:39,720 --> 00:53:43,980 And what Huffman did, which was really very smart, was to 993 00:53:43,980 --> 00:53:45,230 sort out this issue. 994 00:53:48,400 --> 00:53:50,860 And I can teach this to a hundred classes, and nobody 995 00:53:50,860 --> 00:53:54,100 will ever point out to me that there's a logical flaw in the 996 00:53:54,100 --> 00:53:56,320 whole argument. 997 00:53:56,320 --> 00:53:58,770 And you can look at most books on information theory and they 998 00:53:58,770 --> 00:54:00,660 never point out that there's that 999 00:54:00,660 --> 00:54:03,120 logical flaw there, either. 1000 00:54:03,120 --> 00:54:09,740 So, anyway, that's the end of Huffman's algorithm. 1001 00:54:09,740 --> 00:54:11,960 You can see when you look at this that this is really an 1002 00:54:11,960 --> 00:54:14,730 extraordinarily easy thing to do. 1003 00:54:14,730 --> 00:54:16,530 I mean, you can take an alphabet of 1004 00:54:16,530 --> 00:54:18,760 several thousand symbols. 1005 00:54:18,760 --> 00:54:20,870 All you have to do is order them. 1006 00:54:20,870 --> 00:54:23,290 Tie the two least likely together. 1007 00:54:23,290 --> 00:54:25,840 Assign a 1 and a 0 to them. 1008 00:54:25,840 --> 00:54:30,350 Then, stick that into an ordered list again. 1009 00:54:30,350 --> 00:54:31,720 Take the two least probable.
1010 00:54:31,720 --> 00:54:32,910 Tie them together. 1011 00:54:32,910 --> 00:54:35,170 Stick it into an ordered list again. 1012 00:54:35,170 --> 00:54:40,050 And, if you have some minimal knowledge of data structures, 1013 00:54:40,050 --> 00:54:42,280 you can do this with essentially on the order of 1014 00:54:42,280 --> 00:54:46,440 one operation for each letter in this alphabet. 1015 00:54:46,440 --> 00:54:48,890 So it really isn't a very difficult task. 1016 00:54:48,890 --> 00:54:53,240 So here's an integer problem which is really easy to solve. 1017 00:54:53,240 --> 00:54:55,880 And the way to solve it is to look at the problem in the 1018 00:54:55,880 --> 00:54:58,160 opposite way from what everybody else has 1019 00:54:58,160 --> 00:55:00,570 looked at it in. 1020 00:55:00,570 --> 00:55:03,190 Does this say you want to ignore everything that 1021 00:55:03,190 --> 00:55:05,670 everybody else has done, and go your own way? 1022 00:55:05,670 --> 00:55:07,470 Not quite. 1023 00:55:07,470 --> 00:55:10,440 But it says that's one of the things you ought to try: if 1024 00:55:10,440 --> 00:55:13,560 you find that everybody is doing something one way and 1025 00:55:13,560 --> 00:55:19,250 you can find another way to look at it, that's very rich. 1026 00:55:19,250 --> 00:55:21,880 It might turn out to be nothing, but it might turn out 1027 00:55:21,880 --> 00:55:24,610 to be something very worthwhile. 1028 00:55:31,380 --> 00:55:36,870 Let's now talk about this quantity, entropy. 1029 00:55:36,870 --> 00:55:45,560 And for every chance variable, x, if that chance variable, x, 1030 00:55:45,560 --> 00:55:51,610 is discrete and has a finite number of elements in it, so 1031 00:55:51,610 --> 00:55:56,490 I'm talking about a chance variable x, what does a chance 1032 00:55:56,490 --> 00:56:00,660 variable have tagging along after it? 1033 00:56:00,660 --> 00:56:02,190 It has a set of probabilities 1034 00:56:02,190 --> 00:56:04,690 tagging along after it. 1035 00:56:04,690 --> 00:56:06,360 That's what a chance variable is. 1036 00:56:06,360 --> 00:56:08,620 A chance variable is not just the alphabet. 1037 00:56:08,620 --> 00:56:11,040 A chance variable is the alphabet plus the 1038 00:56:11,040 --> 00:56:12,250 probabilities. 1039 00:56:12,250 --> 00:56:16,200 That's why you can then talk about it having an entropy. 1040 00:56:16,200 --> 00:56:20,020 And the entropy is the expected value of minus the 1041 00:56:20,020 --> 00:56:22,760 logarithm of this PMF. 1042 00:56:26,310 --> 00:56:30,340 So, in fact, this is an unusual statistic in the sense 1043 00:56:30,340 --> 00:56:34,400 that it has nothing to do with the symbol values, and 1044 00:56:34,400 --> 00:56:37,090 everything to do with just the probabilities 1045 00:56:37,090 --> 00:56:39,580 of the symbol values. 1046 00:56:39,580 --> 00:56:42,750 And as we go on, you'll see that in fact this is a very 1047 00:56:42,750 --> 00:56:44,680 important property of it. 1048 00:56:44,680 --> 00:56:47,280 And dealing with the logarithms of these 1049 00:56:47,280 --> 00:56:53,400 probabilities is, in fact, a much more worthwhile thing to do 1050 00:56:53,400 --> 00:56:57,250 than dealing with the probabilities themselves. 1051 00:56:57,250 --> 00:57:02,550 Now, let me pause again and see if anybody has any 1052 00:57:02,550 --> 00:57:07,290 idea of why logarithms of probabilities might be more 1053 00:57:07,290 --> 00:57:10,990 significant than probabilities.
1054 00:57:10,990 --> 00:57:13,410 And think of what we're going to be doing here. 1055 00:57:13,410 --> 00:57:17,490 We're taking a sequence of letters. 1056 00:57:17,490 --> 00:57:19,990 When I take a sequence of letters, what's the 1057 00:57:19,990 --> 00:57:22,210 probability of the sequence of letters? 1058 00:57:24,820 --> 00:57:25,760 If they're IID. 1059 00:57:25,760 --> 00:57:27,460 Namely, we're looking -- 1060 00:57:27,460 --> 00:57:28,790 AUDIENCE: [UNINTELLIGIBLE] 1061 00:57:28,790 --> 00:57:32,310 PROFESSOR: It's the product of those probabilities. 1062 00:57:32,310 --> 00:57:39,070 Now, if you agree with me that probability theory is 1063 00:57:39,070 --> 00:57:43,450 concerned 50% with the law of large numbers and 50% with 1064 00:57:43,450 --> 00:57:47,030 everything else all put together, why is the logarithm 1065 00:57:47,030 --> 00:57:48,900 of a probability important? 1066 00:57:48,900 --> 00:57:50,150 AUDIENCE: [UNINTELLIGIBLE] 1067 00:57:55,310 --> 00:57:58,020 PROFESSOR: You change your product to a sum, yes. 1068 00:57:58,020 --> 00:58:02,140 If you have a product of probabilities, you can talk 1069 00:58:02,140 --> 00:58:07,650 about a sum of the logarithms of probabilities. 1070 00:58:07,650 --> 00:58:12,250 That's why entropy is important 1071 00:58:12,250 --> 00:58:14,340 in statistical mechanics. 1072 00:58:14,340 --> 00:58:18,290 It also is, fundamentally, the reason why entropy is 1073 00:58:18,290 --> 00:58:20,530 important in information theory. 1074 00:58:20,530 --> 00:58:24,360 It's because what you're almost always interested in is a 1075 00:58:24,360 --> 00:58:26,360 product of probabilities. 1076 00:58:26,360 --> 00:58:29,310 And when you're interested in a product of probabilities and 1077 00:58:29,310 --> 00:58:32,050 you want to use the law of large numbers, you turn that 1078 00:58:32,050 --> 00:58:35,580 product of probabilities into a sum of the logarithms of 1079 00:58:35,580 --> 00:58:37,820 probabilities. 1080 00:58:37,820 --> 00:58:39,070 Fundamental idea. 1081 00:58:41,560 --> 00:58:44,710 Shannon took eight years sorting all this out. 1082 00:58:44,710 --> 00:58:49,970 And Shannon was by far the smartest person I've ever met. 1083 00:58:49,970 --> 00:58:54,570 I mean, the problems that we worry about, he just, bip. 1084 00:58:54,570 --> 00:58:57,970 Solves them with no effort at all. 1085 00:58:57,970 --> 00:59:00,570 This one took him a while to sort out. 1086 00:59:00,570 --> 00:59:03,030 It also took him a while to sort out the fact that once he 1087 00:59:03,030 --> 00:59:06,670 sorted this out, he could sort out all of the other problems, 1088 00:59:06,670 --> 00:59:09,730 as far as communications was concerned. 1089 00:59:09,730 --> 00:59:13,660 So it was quite important. 1090 00:59:13,660 --> 00:59:15,140 I mean, I can tell you one of the peculiar 1091 00:59:15,140 --> 00:59:17,590 things about Shannon. 1092 00:59:17,590 --> 00:59:20,430 Just from the first time I ever talked to him about a 1093 00:59:20,430 --> 00:59:22,300 technical problem. 1094 00:59:22,300 --> 00:59:24,860 I'd just become a faculty member here. 1095 00:59:24,860 --> 00:59:28,790 And his office was about five doors down from mine. 1096 00:59:28,790 --> 00:59:31,630 And one day I screwed up my courage to go down and talk to 1097 00:59:31,630 --> 00:59:34,340 the guy about a problem I was working on. 1098 00:59:34,340 --> 00:59:36,500 And I thought it was a really neat problem.
1099 00:59:36,500 --> 00:59:39,020 It had all sorts of pieces to it, all sorts of bells and 1100 00:59:39,020 --> 00:59:40,280 whistles on it. 1101 00:59:40,280 --> 00:59:42,850 And I started to explain it to him. 1102 00:59:42,850 --> 00:59:45,380 And he said, well, can we look at a slightly simpler case 1103 00:59:45,380 --> 00:59:49,270 where you throw out this part of it, you throw out one bell. 1104 00:59:49,270 --> 00:59:50,840 Then he'd throw out a whistle. 1105 00:59:50,840 --> 00:59:52,530 Then he'd throw out a bell. 1106 00:59:52,530 --> 00:59:54,690 And I was going along with this and saying, 1107 00:59:54,690 --> 00:59:55,690 yeah, I guess we could. 1108 00:59:55,690 --> 00:59:56,770 We could. 1109 00:59:56,770 --> 01:00:00,150 We can throw out all of these things without really losing 1110 01:00:00,150 --> 01:00:02,380 the essence of the problem. 1111 01:00:02,380 --> 01:00:04,150 And, finally, I started to get discouraged. 1112 01:00:04,150 --> 01:00:06,890 Because this really neat research problem, this really 1113 01:00:06,890 --> 01:00:10,680 important research problem, was turning into a toy problem 1114 01:00:10,680 --> 01:00:13,530 which was almost trivial. 1115 01:00:13,530 --> 01:00:17,080 It had nothing to do, it seemed, with anything. 1116 01:00:17,080 --> 01:00:20,260 And finally we got down to a certain point. 1117 01:00:20,260 --> 01:00:24,200 And I said, yeah, but this is trivial, the solution is this. 1118 01:00:24,200 --> 01:00:26,170 And he said, yeah. 1119 01:00:26,170 --> 01:00:29,180 And then we started putting back all the pieces. 1120 01:00:29,180 --> 01:00:33,270 And his genius was, he knew which things to throw out. 1121 01:00:33,270 --> 01:00:36,580 So that each of the things we threw out, we could put them 1122 01:00:36,580 --> 01:00:38,060 back in again. 1123 01:00:38,060 --> 01:00:40,330 When we got done, the research problem was trivial. 1124 01:00:42,870 --> 01:00:46,450 And his genius was in finding the right trivial 1125 01:00:46,450 --> 01:00:48,960 example to look at. 1126 01:00:48,960 --> 01:00:52,290 So, in fact, what you always want to look at, in the 1127 01:00:52,290 --> 01:00:55,980 communications field -- and in most fields, I think -- is 1128 01:00:55,980 --> 01:00:59,320 finding the really simple way of looking at something. 1129 01:00:59,320 --> 01:01:03,660 Which means you have to throw out most of the nonsense. 1130 01:01:03,660 --> 01:01:06,560 So, in this case, it's looking at entropy, which is the 1131 01:01:06,560 --> 01:01:09,010 expected logarithm of a probability assignment. 1132 01:01:09,010 --> 01:01:12,250 And you want to look at that because the logarithm of a 1133 01:01:12,250 --> 01:01:15,950 probability assignment lets you add the logarithms of 1134 01:01:15,950 --> 01:01:16,830 probabilities. 1135 01:01:16,830 --> 01:01:18,990 Use the law of large numbers. 1136 01:01:18,990 --> 01:01:25,500 And then you can talk about sequences of elements. 1137 01:01:25,500 --> 01:01:27,320 Properties of entropy. 1138 01:01:27,320 --> 01:01:32,570 For a discrete chance variable. 1139 01:01:32,570 --> 01:01:35,240 We have m elements in the alphabet. 1140 01:01:35,240 --> 01:01:39,440 First thing is that the entropy is always greater than 1141 01:01:39,440 --> 01:01:41,570 or equal to 0. 1142 01:01:41,570 --> 01:01:42,920 Why is that? 1143 01:01:42,920 --> 01:01:44,170 I'll let you figure it out.
1144 01:01:46,660 --> 01:01:49,910 Why is minus the logarithm 1145 01:01:49,910 --> 01:01:55,270 of a probability greater than or equal to 0? 1146 01:01:55,270 --> 01:01:56,520 Why is it non-negative? 1147 01:01:58,810 --> 01:01:59,550 Yeah. 1148 01:01:59,550 --> 01:02:00,870 AUDIENCE: [UNINTELLIGIBLE] 1149 01:02:00,870 --> 01:02:02,330 PROFESSOR: Probabilities are always less than 1150 01:02:02,330 --> 01:02:04,860 or equal to 1, yes. 1151 01:02:04,860 --> 01:02:07,390 So this quantity here is always greater 1152 01:02:07,390 --> 01:02:08,550 than or equal to 0. 1153 01:02:08,550 --> 01:02:10,980 Because the logarithm of 1 is equal to 0. 1154 01:02:15,970 --> 01:02:19,440 We have equality here if x is deterministic. 1155 01:02:19,440 --> 01:02:21,590 Which is just a special case there. 1156 01:02:21,590 --> 01:02:24,880 Where you have an ensemble of one element and it has 1157 01:02:24,880 --> 01:02:26,830 probability 1. 1158 01:02:26,830 --> 01:02:30,610 Or, in fact, at this point you could add in things which have 1159 01:02:30,610 --> 01:02:32,750 zero probability. 1160 01:02:32,750 --> 01:02:35,500 Well, that's a little bit tricky. 1161 01:02:35,500 --> 01:02:38,320 Because you add in something that's zero probability. 1162 01:02:38,320 --> 01:02:44,640 And minus the logarithm of 0 is infinity. 1163 01:02:44,640 --> 01:02:48,820 So you're dealing with the expected value of a bunch of 1164 01:02:48,820 --> 01:02:51,930 infinities, which each occur with zero probability. 1165 01:02:51,930 --> 01:02:56,330 And you're forced to say, well, I think that 0 times log 1166 01:02:56,330 --> 01:02:59,270 of 0 is equal to 0. 1167 01:02:59,270 --> 01:03:04,590 And in fact, epsilon times log of epsilon goes to 0 as 1168 01:03:04,590 --> 01:03:06,490 epsilon goes to 0. 1169 01:03:06,490 --> 01:03:09,670 But you save yourself a lot of worry by just leaving out 1170 01:03:09,670 --> 01:03:13,540 things of zero probability. 1171 01:03:13,540 --> 01:03:16,580 So H of x is greater than or equal to 0. 1172 01:03:16,580 --> 01:03:19,020 We have equality if x is deterministic. 1173 01:03:19,020 --> 01:03:24,750 H of x is less than or equal to log of m. 1174 01:03:24,750 --> 01:03:28,830 Equality, if x is equiprobable. 1175 01:03:28,830 --> 01:03:30,080 And how do I know that? 1176 01:03:33,030 --> 01:03:35,140 I look at this again. 1177 01:03:35,140 --> 01:03:39,630 I'm not going to prove it here but, essentially, this follows 1178 01:03:39,630 --> 01:03:45,620 from saying that the natural logarithm of something is less 1179 01:03:45,620 --> 01:03:48,500 than or equal to that something minus 1. 1180 01:03:48,500 --> 01:03:52,990 And you take the difference of the entropy, and log of m. 1181 01:03:52,990 --> 01:03:57,510 And, presto, it gives you the result that you want. 1182 01:03:57,510 --> 01:04:01,285 So you've got the most entropy if everything is 1183 01:04:01,285 --> 01:04:02,535 equiprobable. 1184 01:04:05,360 --> 01:04:09,450 For any code satisfying the Kraft inequality, the entropy 1185 01:04:09,450 --> 01:04:12,130 is less than or equal to L bar. 1186 01:04:12,130 --> 01:04:15,930 Well, that's what we already proved.
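[Editor's note: collected in symbols, the properties stated so far are, presumably, the following, where $H(X)$ is the entropy, $m$ is the alphabet size, and $\bar{L}$ is the expected codeword length of any code satisfying the Kraft inequality:

$$
0 \;\le\; H(X) \;\le\; \log_2 m, \qquad H(X) \;\le\; \bar{L},
$$

with $H(X) = 0$ exactly when $X$ is deterministic, and $H(X) = \log_2 m$ exactly when all $m$ symbols are equiprobable.]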
1187 01:04:15,930 --> 01:04:20,490 Namely, in the middle of the lecture, we showed that for 1188 01:04:20,490 --> 01:04:24,370 any code that satisfies the Kraft inequality, the entropy 1189 01:04:24,370 --> 01:04:27,620 is always less than or equal to L bar, because the entropy 1190 01:04:27,620 --> 01:04:30,700 is what you get if you minimize the expected length 1191 01:04:30,700 --> 01:04:32,990 without the integer constraint. 1192 01:04:32,990 --> 01:04:36,830 And L bar is what you get -- well, L bar min is what you 1193 01:04:36,830 --> 01:04:43,040 get when you minimize it with the integer constraint. 1194 01:04:43,040 --> 01:04:45,100 I mean, you don't bother about minimizing it. 1195 01:04:45,100 --> 01:04:47,010 You get something bigger than L bar min. 1196 01:04:47,010 --> 01:04:51,320 So this is less than or equal to the expected length of any code. 1197 01:04:51,320 --> 01:04:55,170 For the very best codeword, for the very best code, the 1198 01:04:55,170 --> 01:04:59,290 minimum expected length is less than or equal to the 1199 01:04:59,290 --> 01:05:01,190 entropy plus 1. 1200 01:05:01,190 --> 01:05:05,010 And you get that just by rounding each non-integer 1201 01:05:05,010 --> 01:05:09,530 length up with the ceiling function. 1202 01:05:09,530 --> 01:05:13,770 Which gives you, at most, one extra digit for each codeword. 1203 01:05:13,770 --> 01:05:16,870 Now, here's the more interesting one. 1204 01:05:16,870 --> 01:05:22,100 For independent chance variables, x and y, here's 1205 01:05:22,100 --> 01:05:28,120 where the nice part about notation comes along. 1206 01:05:28,120 --> 01:05:32,250 What's the entropy of x y? 1207 01:05:32,250 --> 01:05:35,330 Well, what do I mean by x y first? 1208 01:05:35,330 --> 01:05:38,460 I have a chance variable, x. 1209 01:05:38,460 --> 01:05:45,300 And this chance variable x has an alphabet associated with 1210 01:05:45,300 --> 01:05:47,800 it, x1 up to x sub m. 1211 01:05:47,800 --> 01:05:49,850 I have a chance variable y. 1212 01:05:49,850 --> 01:05:53,490 It has an alphabet associated with it. 1213 01:05:53,490 --> 01:05:58,870 What's the sample space, what's the set of events 1214 01:05:58,870 --> 01:06:02,950 corresponding to the chance variable x y? 1215 01:06:05,790 --> 01:06:09,990 By x y, I mean a chance variable whose elements are 1216 01:06:09,990 --> 01:06:14,170 the possible values of both x and y. 1217 01:06:14,170 --> 01:06:17,740 So, I'm talking about the joint ensemble of x and y. 1218 01:06:17,740 --> 01:06:21,080 I have a bunch of possible values for that. 1219 01:06:21,080 --> 01:06:26,260 And those possible values, if I have m possibilities for 1220 01:06:26,260 --> 01:06:28,940 each, I have m squared possible values 1221 01:06:28,940 --> 01:06:31,180 for the two of them. 1222 01:06:31,180 --> 01:06:35,050 So I'm talking about the expected value of minus the 1223 01:06:35,050 --> 01:06:40,750 logarithm of the probability of x and y. 1224 01:06:40,750 --> 01:06:43,890 In other words, I am trying to take the -- 1225 01:06:43,890 --> 01:06:46,640 let me write it out. 1226 01:06:46,640 --> 01:06:49,840 I'm probably giving conniptions to -- no? 1227 01:06:49,840 --> 01:06:51,090 OK. 1228 01:06:58,930 --> 01:07:06,960 I want to take p sub x y of symbols x y, times minus the 1229 01:07:06,960 --> 01:07:16,440 logarithm to the base 2 of p sub x y of x y. 1230 01:07:16,440 --> 01:07:21,340 That's what this means if I write it out.
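[Editor's note: written as a formula, the sum on the board is presumably:

$$
H(XY) \;=\; \sum_{x,\,y} p_{XY}(x,y)\,\bigl(-\log_2 p_{XY}(x,y)\bigr).
$$
]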
1231 01:07:24,620 --> 01:07:32,740 Well, this probability here is p sub x of little x, times p 1232 01:07:32,740 --> 01:07:35,280 sub y of little y. 1233 01:07:35,280 --> 01:07:36,490 Why is that? 1234 01:07:36,490 --> 01:07:37,660 Because I'm assuming that they're 1235 01:07:37,660 --> 01:07:39,630 independent of each other. 1236 01:07:39,630 --> 01:07:42,160 And, therefore, the probability of the two of them is 1237 01:07:42,160 --> 01:07:45,405 the product of the probabilities. Times minus log 1238 01:07:45,405 --> 01:07:46,080 to the base 2 1239 01:07:46,080 --> 01:07:55,410 of p sub x of x, minus logarithm to the base 2 of p 1240 01:07:55,410 --> 01:07:58,560 sub y of y. 1241 01:07:58,560 --> 01:08:05,120 And I'm summing this over all x and all y. 1242 01:08:05,120 --> 01:08:10,440 And the more sophisticated way to write this -- 1243 01:08:10,440 --> 01:08:13,070 things I say in lecture, you don't have to copy down 1244 01:08:13,070 --> 01:08:16,290 because they're always in the notes. 1245 01:08:16,290 --> 01:08:17,880 If they're not in the notes, it's probably 1246 01:08:17,880 --> 01:08:19,130 wrong anyway, so. 1247 01:08:21,640 --> 01:08:25,190 So this expected value is the expected value of minus the 1248 01:08:25,190 --> 01:08:27,490 logarithm of the probability of x y. 1249 01:08:27,490 --> 01:08:32,190 Which is the expected value of minus the logarithm of p of 1250 01:08:32,190 --> 01:08:34,350 x times p of y. 1251 01:08:34,350 --> 01:08:37,290 And, since I have a logarithm of a product, that's the 1252 01:08:37,290 --> 01:08:42,350 expected value of minus log p of x minus log p of y, which 1253 01:08:42,350 --> 01:08:47,370 is the entropy of x plus the entropy of y. 1254 01:08:47,370 --> 01:08:50,720 In other words, when I have a joint ensemble of even more 1255 01:08:50,720 --> 01:08:56,660 independent quantities, the entropy of the sequence is equal 1256 01:08:56,660 --> 01:09:01,800 to the sum of the entropies of the individual elements in 1257 01:09:01,800 --> 01:09:03,050 that sequence. 1258 01:09:07,270 --> 01:09:09,630 Well, that's all I wanted to talk about today. 1259 01:09:09,630 --> 01:09:12,010 If any of you have any questions to ask, you should 1260 01:09:12,010 --> 01:09:13,500 ask them now. 1261 01:09:13,500 --> 01:09:17,210 I went through Huffman coding pretty quickly, because it's 1262 01:09:17,210 --> 01:09:19,670 something where you have to do some exercises on 1263 01:09:19,670 --> 01:09:21,920 it to sort it out. 1264 01:09:21,920 --> 01:09:24,480 And I didn't want to do any more than that.
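[Editor's note: as one such exercise, here is a minimal Python sketch of the procedure described in the lecture -- repeatedly tie the two least likely nodes together as siblings -- run on the lecture's example probabilities 0.4, 0.2, 0.15, 0.15, 0.1. The heap-based priority queue and all names here are the editor's choices, not anything from the lecture; the sketch only tracks codeword lengths, since each tie-them-together step adds one binary digit to every codeword inside the merged subtree.]

import heapq
import math

def huffman_lengths(probs):
    # Heap entries: (probability, unique id, symbol indices in this subtree).
    # The unique id breaks probability ties so the lists are never compared.
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        # Take the two least likely nodes and tie them together as siblings.
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Every codeword inside the merged subtree picks up one more digit.
        for s in syms1 + syms2:
            lengths[s] += 1
        # The merged node goes back into the ordered list with the summed probability.
        heapq.heappush(heap, (p1 + p2, next_id, syms1 + syms2))
        next_id += 1
    return lengths

probs = [0.4, 0.2, 0.15, 0.15, 0.1]  # the example from the lecture
lengths = huffman_lengths(probs)     # comes out [1, 3, 3, 3, 3]
L_bar = sum(p * l for p, l in zip(probs, lengths))
H = -sum(p * math.log2(p) for p in probs)
print(lengths, L_bar, round(H, 3))

[On this example the expected length is L bar = 2.2 and the entropy is approximately 2.146, so the Huffman code sits within one bit of the entropy, exactly as the bound H <= L bar min < H + 1 from the lecture promises.]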