1 00:00:00,520 --> 00:00:03,180 We have seen that several properties, such as, for 2 00:00:03,180 --> 00:00:06,320 example, linearity of expectations, are common for 3 00:00:06,320 --> 00:00:08,670 discrete and continuous random variables. 4 00:00:08,670 --> 00:00:11,570 For this reason, it would be nice to have a way of talking 5 00:00:11,570 --> 00:00:14,890 about the distribution of all kinds of random variables 6 00:00:14,890 --> 00:00:17,920 without having to keep making a distinction between the 7 00:00:17,920 --> 00:00:19,150 different types-- 8 00:00:19,150 --> 00:00:20,865 discrete or continuous. 9 00:00:20,865 --> 00:00:23,900 This leads us to describe the distribution of a random 10 00:00:23,900 --> 00:00:27,510 variable in a new way, in terms of a so-called 11 00:00:27,510 --> 00:00:31,660 cumulative distribution function or CDF for short. 12 00:00:31,660 --> 00:00:35,370 A CDF is defined as follows. 13 00:00:35,370 --> 00:00:38,620 The CDF is a function of a single argument, which we 14 00:00:38,620 --> 00:00:41,060 denote by little x in this case. 15 00:00:41,060 --> 00:00:43,600 And it gives us the probability that the random 16 00:00:43,600 --> 00:00:46,210 variable takes a value less than or equal to this 17 00:00:46,210 --> 00:00:47,930 particular little x. 18 00:00:47,930 --> 00:00:52,880 We will always use uppercase Fs to indicate CDFs. 19 00:00:52,880 --> 00:00:56,760 And we will always have some subscripts that indicate which 20 00:00:56,760 --> 00:01:00,090 random variable we're talking about. 21 00:01:00,090 --> 00:01:02,800 The beauty of the CDF is that it just involves a 22 00:01:02,800 --> 00:01:03,600 probability-- 23 00:01:03,600 --> 00:01:06,280 a concept that is well defined, no matter what kind 24 00:01:06,280 --> 00:01:08,920 of random variable we're dealing with. 25 00:01:08,920 --> 00:01:13,710 So in particular, if X is a continuous random variable, 26 00:01:13,710 --> 00:01:17,289 the probability that X is less than or equal 27 00:01:17,289 --> 00:01:18,789 to a certain number-- 28 00:01:18,789 --> 00:01:24,539 this is just the integral of the PDF over that range from 29 00:01:24,539 --> 00:01:27,670 minus infinity up to that number. 30 00:01:27,670 --> 00:01:31,039 As a more concrete example, let us consider a uniform 31 00:01:31,039 --> 00:01:35,330 random variable that ranges between a and b, and let us 32 00:01:35,330 --> 00:01:39,030 just try to plot the corresponding CDF. 33 00:01:39,030 --> 00:01:41,630 The CDF is a function of little x. 34 00:01:41,630 --> 00:01:44,780 And the form that it takes depends on what kind of x 35 00:01:44,780 --> 00:01:46,080 we're talking about. 36 00:01:46,080 --> 00:01:52,009 If little x falls somewhere here to the left of a, and we 37 00:01:52,009 --> 00:01:54,440 ask for the probability that our random variable takes 38 00:01:54,440 --> 00:01:58,400 values in this interval, then this probability will be 0 39 00:01:58,400 --> 00:02:01,150 because all of the probability of this uniform is 40 00:02:01,150 --> 00:02:03,010 between a and b. 41 00:02:03,010 --> 00:02:09,038 Therefore, the CDF is going to be 0 for values of x less than 42 00:02:09,038 --> 00:02:10,990 or equal to a. 43 00:02:10,990 --> 00:02:14,100 How about the case where x lies somewhere 44 00:02:14,100 --> 00:02:16,620 between a and b? 45 00:02:16,620 --> 00:02:20,710 In that case, the probability that our random variable falls 46 00:02:20,710 --> 00:02:23,650 to the left of here-- 47 00:02:23,650 --> 00:02:29,680 this is whatever mass there is under the PDF when we consider 48 00:02:29,680 --> 00:02:32,430 the integral up to this particular point. 49 00:02:32,430 --> 00:02:37,310 So we're looking at the area under the PDF up to this 50 00:02:37,310 --> 00:02:39,030 particular point x. 51 00:02:39,030 --> 00:02:43,490 This area is of the form the base of the rectangle, which 52 00:02:43,490 --> 00:02:47,360 is x minus a, times the height of the rectangle, which is 1 53 00:02:47,360 --> 00:02:49,340 over b minus a. 54 00:02:49,340 --> 00:02:53,500 This is a linear function in x that takes the value of 0 when 55 00:02:53,500 --> 00:03:00,170 x is equal to a, grows linearly, and when x reaches a 56 00:03:00,170 --> 00:03:03,130 value of b, it becomes equal to 1. 57 00:03:08,060 --> 00:03:12,830 How about the case where x lies to the right of b? 58 00:03:12,830 --> 00:03:14,840 We're talking about the probability that our random 59 00:03:14,840 --> 00:03:18,050 variable takes values less than or equal to this 60 00:03:18,050 --> 00:03:19,840 particular x. 61 00:03:19,840 --> 00:03:22,420 But this includes the entire probability 62 00:03:22,420 --> 00:03:23,900 mass of this uniform. 63 00:03:23,900 --> 00:03:27,090 We have unit mass on this particular interval, so the 64 00:03:27,090 --> 00:03:32,030 probability of falling to the left of here is equal to 1. 65 00:03:32,030 --> 00:03:35,210 And this is the shape of the CDF for the case of a uniform 66 00:03:35,210 --> 00:03:36,220 random variable. 67 00:03:36,220 --> 00:03:39,880 It starts at 0, eventually it rises, and eventually it 68 00:03:39,880 --> 00:03:43,290 reaches a value of 1 and stays constant. 69 00:03:43,290 --> 00:03:47,150 Coming back to the general case, CDFs are very useful, 70 00:03:47,150 --> 00:03:50,900 because once we know the CDF of a random variable, we have 71 00:03:50,900 --> 00:03:54,140 enough information to calculate anything we might 72 00:03:54,140 --> 00:03:55,720 want to calculate. 73 00:03:55,720 --> 00:04:00,050 For example, consider the following calculation. 74 00:04:00,050 --> 00:04:05,160 Let us look at the range of numbers from minus infinity to 75 00:04:05,160 --> 00:04:08,320 3 and then up to 4. 76 00:04:08,320 --> 00:04:12,750 If we want to calculate the probability that X is less 77 00:04:12,750 --> 00:04:17,329 than or equal to 4, we can break it down as the 78 00:04:17,329 --> 00:04:22,130 probability that X is less than or equal to 3-- 79 00:04:22,130 --> 00:04:25,040 this is one term-- 80 00:04:25,040 --> 00:04:35,860 plus the probability that X falls between 3 and 4, which 81 00:04:35,860 --> 00:04:40,340 would be this event here. 82 00:04:40,340 --> 00:04:43,320 So this equality is true because of the additivity 83 00:04:43,320 --> 00:04:45,940 property of probabilities. 84 00:04:45,940 --> 00:04:49,409 This event is broken down into two possible events. 85 00:04:49,409 --> 00:04:53,070 Either x is less than or equal to 3 or x is bigger than 3 but 86 00:04:53,070 --> 00:04:55,120 less than or equal to 4. 87 00:04:55,120 --> 00:04:59,050 But now we recognize that if we know the CDF of the random 88 00:04:59,050 --> 00:05:01,580 variable, then we know this quantity. 89 00:05:01,580 --> 00:05:04,820 We also know this quantity, and this allows us to 90 00:05:04,820 --> 00:05:06,390 calculate this quantity. 91 00:05:06,390 --> 00:05:08,810 So we can calculate the probability of a 92 00:05:08,810 --> 00:05:10,370 more general interval. 93 00:05:10,370 --> 00:05:13,140 So in general, the CDF contains all available 94 00:05:13,140 --> 00:05:15,770 probabilistic information about a random variable. 95 00:05:15,770 --> 00:05:19,170 It is just a different way of describing the probability 96 00:05:19,170 --> 00:05:20,280 distribution. 97 00:05:20,280 --> 00:05:22,900 From the CDF, we can recover any quantity we 98 00:05:22,900 --> 00:05:24,530 might wish to know. 99 00:05:24,530 --> 00:05:27,430 And for continuous random variables, the CDF actually 100 00:05:27,430 --> 00:05:32,680 has enough information for us to be able to recover the PDF. 101 00:05:32,680 --> 00:05:34,230 How can we do that? 102 00:05:34,230 --> 00:05:37,050 Let's look at this relation here, and let's take 103 00:05:37,050 --> 00:05:39,510 derivatives of both sides. 104 00:05:39,510 --> 00:05:44,040 On the left, we obtain the derivative of the CDF. 105 00:05:44,040 --> 00:05:47,150 And let's evaluate it at a particular point x. 106 00:05:47,150 --> 00:05:50,250 What do we get on the right? 107 00:05:50,250 --> 00:05:54,850 By basic calculus results, the derivative of an integral, 108 00:05:54,850 --> 00:05:57,820 with respect to the upper limit of the integration, is 109 00:05:57,820 --> 00:06:00,400 just the integrand itself. 110 00:06:00,400 --> 00:06:02,500 So it is the density itself. 111 00:06:05,690 --> 00:06:08,830 So this is a very useful formula, which tells us that 112 00:06:08,830 --> 00:06:12,730 once we have the CDF, we can calculate the PDF. 113 00:06:12,730 --> 00:06:16,830 And conversely, if we have the PDF, we can find the CDF by 114 00:06:16,830 --> 00:06:18,120 integrating. 115 00:06:18,120 --> 00:06:21,410 Of course, this formula can only be correct at those 116 00:06:21,410 --> 00:06:24,920 places where the CDF has a derivative. 117 00:06:24,920 --> 00:06:28,480 For example, at this corner here, the derivative of the 118 00:06:28,480 --> 00:06:30,370 CDF is not well defined. 119 00:06:30,370 --> 00:06:33,150 We would get a different value if we differentiate from the 120 00:06:33,150 --> 00:06:35,580 left, a different value when we differentiate from the 121 00:06:35,580 --> 00:06:38,370 right, so we cannot apply this formula. 122 00:06:38,370 --> 00:06:42,760 But at those places where the CDF is differentiable, at 123 00:06:42,760 --> 00:06:44,940 those places we can find the corresponding 124 00:06:44,940 --> 00:06:47,030 value of the PDF. 125 00:06:47,030 --> 00:06:51,159 For instance, in this diagram, at this point the CDF is 126 00:06:51,159 --> 00:06:52,590 differentiable. 127 00:06:52,590 --> 00:06:56,850 The derivative is equal to the slope, which is this quantity. 128 00:06:56,850 --> 00:07:01,070 And this quantity happens to be exactly the same as the 129 00:07:01,070 --> 00:07:03,090 value of the PDF. 130 00:07:03,090 --> 00:07:07,430 So indeed, here, we see that the PDF can be found by taking 131 00:07:07,430 --> 00:07:11,820 the derivative of the CDF. 132 00:07:11,820 --> 00:07:15,390 Now, as we discussed earlier, CDFs are relevant to all types 133 00:07:15,390 --> 00:07:16,510 of random variables. 134 00:07:16,510 --> 00:07:19,200 So in particular, they are also relevant to discrete 135 00:07:19,200 --> 00:07:20,580 random variables. 136 00:07:20,580 --> 00:07:23,000 For a discrete random variable, the CDF is, of 137 00:07:23,000 --> 00:07:27,230 course, defined the same way, except that we calculate this 138 00:07:27,230 --> 00:07:31,460 probability by adding the probabilities of all possible 139 00:07:31,460 --> 00:07:35,050 values of the random variable that are less than 140 00:07:35,050 --> 00:07:35,115 [or equal to] 141 00:07:35,115 --> 00:07:38,650 the particular little x that we're considering. 142 00:07:38,650 --> 00:07:41,720 So we have a summation instead of an integral. 143 00:07:41,720 --> 00:07:43,500 Let us look at an example. 144 00:07:43,500 --> 00:07:46,030 This is an example of a discrete random variable 145 00:07:46,030 --> 00:07:47,940 described by a PMF. 146 00:07:47,940 --> 00:07:51,530 And let us try to calculate the corresponding CDF. 147 00:07:51,530 --> 00:07:55,770 The probability of falling to the left of this number, for 148 00:07:55,770 --> 00:07:58,020 example, is equal to 0. 149 00:07:58,020 --> 00:08:03,160 And all the way up to 1, there is 0 probability of getting a 150 00:08:03,160 --> 00:08:06,470 value for the random variable less than that. 151 00:08:06,470 --> 00:08:11,380 But now, if we let x to be equal to 1, then we're talking 152 00:08:11,380 --> 00:08:17,210 about the probability that the random variable takes a value 153 00:08:17,210 --> 00:08:19,540 less than or equal to 1. 154 00:08:19,540 --> 00:08:23,670 And because this includes the value of 1, this probability 155 00:08:23,670 --> 00:08:25,880 would be equal to 1/4. 156 00:08:25,880 --> 00:08:29,810 This means that once we reach this point, the value of the 157 00:08:29,810 --> 00:08:34,320 CDF becomes 1/4. 158 00:08:34,320 --> 00:08:38,500 At this point, the CDF makes a jump. 159 00:08:38,500 --> 00:08:42,440 At 1, the value of the CDF is equal to 1/4. 160 00:08:42,440 --> 00:08:47,340 Just before 1, the value of the CDF was equal to 0. 161 00:08:47,340 --> 00:08:50,080 Now what's the probability of falling to the left 162 00:08:50,080 --> 00:08:52,320 of, let's say, 2? 163 00:08:52,320 --> 00:08:54,880 This probability is again 1/4. 164 00:08:54,880 --> 00:08:58,450 There's no change in the probability as we keep moving 165 00:08:58,450 --> 00:09:00,280 inside this interval. 166 00:09:00,280 --> 00:09:04,590 So the CDF stays constant, until at some point we reach 167 00:09:04,590 --> 00:09:07,730 the value of 3. 168 00:09:07,730 --> 00:09:11,410 And at that point, the probability that the random 169 00:09:11,410 --> 00:09:15,750 variable takes a value less than or equal to 3 is going to 170 00:09:15,750 --> 00:09:20,520 be the probability of a 3 plus the probability of a 1 which 171 00:09:20,520 --> 00:09:22,675 becomes 3 over 4. 172 00:09:29,040 --> 00:09:32,940 For any other x in this interval, the probability that 173 00:09:32,940 --> 00:09:36,300 the random variable takes a value less than this number 174 00:09:36,300 --> 00:09:42,730 will stay at 1/4 plus 1/2, so the CDF stays constant. 175 00:09:42,730 --> 00:09:47,200 And at this point, the probability of being less than 176 00:09:47,200 --> 00:09:53,160 or equal to 4, this probability becomes 1. 177 00:09:53,160 --> 00:09:59,130 And so the CDF jumps once more to a value of 1. 178 00:09:59,130 --> 00:10:03,800 Again, at the places where the CDF makes a jump, which one of 179 00:10:03,800 --> 00:10:05,470 the two is the correct value? 180 00:10:05,470 --> 00:10:08,060 The correct value is this one. 181 00:10:08,060 --> 00:10:13,250 And this is because the CDF is defined by using a less than 182 00:10:13,250 --> 00:10:18,280 or equal sign in the probability involved here. 183 00:10:18,280 --> 00:10:21,630 So in the case of discrete random variables, the CDF 184 00:10:21,630 --> 00:10:24,310 takes the form of a staircase function. 185 00:10:24,310 --> 00:10:25,530 It starts at 0. 186 00:10:25,530 --> 00:10:27,290 It ends up at 1. 187 00:10:27,290 --> 00:10:32,390 It has a jump at those points where the PMF assigns a 188 00:10:32,390 --> 00:10:33,810 positive mass. 189 00:10:33,810 --> 00:10:39,630 And the size of the jump is exactly equal to the 190 00:10:39,630 --> 00:10:43,870 corresponding value of the PMF. 191 00:10:43,870 --> 00:10:49,450 Similarly, the size of the PMF here is 1/4, and so the size 192 00:10:49,450 --> 00:10:52,190 of the corresponding jump at the CDF will 193 00:10:52,190 --> 00:10:55,390 also be equal to 1/4. 194 00:10:55,390 --> 00:10:59,140 CDFs have some general properties, and we have seen a 195 00:10:59,140 --> 00:11:03,110 hint of those properties in what we have done so far. 196 00:11:03,110 --> 00:11:07,060 So the CDF is, by definition, the probability of obtaining a 197 00:11:07,060 --> 00:11:11,025 value less than or equal to a certain number little x. 198 00:11:11,025 --> 00:11:13,810 It's the probability of this interval. 199 00:11:13,810 --> 00:11:18,040 If I were to take a larger interval, and go up to some 200 00:11:18,040 --> 00:11:20,380 larger number y, this would be the 201 00:11:20,380 --> 00:11:22,100 probability of a bigger interval. 202 00:11:22,100 --> 00:11:25,770 So that probability would only be bigger. 203 00:11:25,770 --> 00:11:28,990 And this translates into the fact that the CDF is an 204 00:11:28,990 --> 00:11:31,340 non-decreasing function. 205 00:11:31,340 --> 00:11:36,970 If y is larger than or equal to x, as in this picture, then 206 00:11:36,970 --> 00:11:41,690 the value of the CDF evaluated at that point y is going to be 207 00:11:41,690 --> 00:11:45,830 larger than or equal to the CDF evaluated at that 208 00:11:45,830 --> 00:11:47,700 point little x. 209 00:11:47,700 --> 00:11:52,270 Other properties that the CDF has is that as x goes to 210 00:11:52,270 --> 00:11:56,060 infinity, we're talking about the probability essentially of 211 00:11:56,060 --> 00:11:58,200 the entire real line. 212 00:11:58,200 --> 00:12:00,900 And so the CDF will converge to 1. 213 00:12:00,900 --> 00:12:06,160 On the other hand, if x tends to minus infinity, so we're 214 00:12:06,160 --> 00:12:09,740 talking about the probability of an interval to the left of 215 00:12:09,740 --> 00:12:14,650 a point that's all the way out, further and further out. 216 00:12:14,650 --> 00:12:17,640 That probability has to diminish, and eventually 217 00:12:17,640 --> 00:12:19,070 converge to 0. 218 00:12:19,070 --> 00:12:24,070 So in general, CDFs asymptotically start at 0. 219 00:12:24,070 --> 00:12:25,500 They can never go down. 220 00:12:25,500 --> 00:12:27,410 They can only go up. 221 00:12:27,410 --> 00:12:32,560 And in the limit, as x goes to infinity, the CDF has to 222 00:12:32,560 --> 00:12:34,230 approach 1. 223 00:12:34,230 --> 00:12:37,800 Actually in the examples that we saw earlier, it reaches the 224 00:12:37,800 --> 00:12:41,640 value of 1 after a certain finite point. 225 00:12:41,640 --> 00:12:44,420 But in general, for general random variables, it might 226 00:12:44,420 --> 00:12:47,270 only reach the value 1 asymptotically