1 00:00:00,580 --> 00:00:03,220 The covariance between two random variables 2 00:00:03,220 --> 00:00:04,930 tells us something about the strength 3 00:00:04,930 --> 00:00:07,160 of the dependence between them. 4 00:00:07,160 --> 00:00:09,890 But it is not so easy to interpret qualitatively. 5 00:00:09,890 --> 00:00:13,100 For example, if I tell you that the covariance of X and Y 6 00:00:13,100 --> 00:00:15,690 is equal to 5, this does not tell you 7 00:00:15,690 --> 00:00:20,990 very much about whether X and Y are closely related or not. 8 00:00:20,990 --> 00:00:25,300 Another difficulty is that if X and Y are in units, 9 00:00:25,300 --> 00:00:27,830 let's say, of meters, then the covariance 10 00:00:27,830 --> 00:00:30,760 will have units of meters squared. 11 00:00:30,760 --> 00:00:32,820 And this is hard to interpret. 12 00:00:32,820 --> 00:00:36,160 A much more informative quantity is the so-called correlation 13 00:00:36,160 --> 00:00:39,880 coefficient, which is a dimensionless version 14 00:00:39,880 --> 00:00:41,960 of the covariance. 15 00:00:41,960 --> 00:00:45,750 It is defined by this formula here. 16 00:00:45,750 --> 00:00:48,480 We just take the covariance and divide it 17 00:00:48,480 --> 00:00:51,430 by the product of the standard deviations of the two 18 00:00:51,430 --> 00:00:53,210 random variables. 19 00:00:53,210 --> 00:00:57,660 Now, if X has units of meters, then the standard deviation 20 00:00:57,660 --> 00:01:00,140 also has units of meters. 21 00:01:00,140 --> 00:01:03,430 And so this ratio will be dimensionless. 22 00:01:03,430 --> 00:01:06,300 And it is not affected by the units that we're using. 23 00:01:06,300 --> 00:01:08,990 The same is true for this ratio here, 24 00:01:08,990 --> 00:01:13,100 and this is why the correlation coefficient does not 25 00:01:13,100 --> 00:01:16,090 have any units of its own. 26 00:01:16,090 --> 00:01:20,330 One remark-- if we're dealing with a random variable whose 27 00:01:20,330 --> 00:01:23,450 standard deviation is equal to 0-- 28 00:01:23,450 --> 00:01:25,850 so its variance is also equal to 0-- 29 00:01:25,850 --> 00:01:27,890 then we have a random variable, which 30 00:01:27,890 --> 00:01:30,460 is identically equal to a constant. 31 00:01:30,460 --> 00:01:33,620 Well, for such cases of degenerate random variables, 32 00:01:33,620 --> 00:01:35,700 then the correlation coefficient is not 33 00:01:35,700 --> 00:01:40,450 defined, because it would have involved a division by 0. 34 00:01:40,450 --> 00:01:43,500 A very important property of the correlation coefficient 35 00:01:43,500 --> 00:01:44,940 is the following. 36 00:01:44,940 --> 00:01:47,220 It turns out that the correlation coefficient 37 00:01:47,220 --> 00:01:50,440 is always between minus 1 and 1. 38 00:01:50,440 --> 00:01:54,130 And this allows us to judge whether a certain correlation 39 00:01:54,130 --> 00:01:56,930 coefficient is big or not, because we now 40 00:01:56,930 --> 00:01:58,789 have an absolute scale. 41 00:01:58,789 --> 00:02:03,070 And so it does provide a measure of the degree to which two 42 00:02:03,070 --> 00:02:05,530 random variables are associated. 43 00:02:05,530 --> 00:02:07,620 To interpret the correlation coefficient, 44 00:02:07,620 --> 00:02:10,820 let's now look at some extreme cases. 45 00:02:10,820 --> 00:02:14,820 Suppose that X and Y are independent. 46 00:02:14,820 --> 00:02:17,570 In that case, we know that the covariance 47 00:02:17,570 --> 00:02:19,490 is going to be equal to 0. 48 00:02:19,490 --> 00:02:21,480 And therefore, the correlation coefficient 49 00:02:21,480 --> 00:02:23,780 is also going to be equal to 0. 50 00:02:23,780 --> 00:02:26,690 And in that case, we say that the two random variables 51 00:02:26,690 --> 00:02:28,590 are uncorrelated. 52 00:02:28,590 --> 00:02:32,810 However, the converse statement is not true. 53 00:02:32,810 --> 00:02:35,960 We have seen already an example in which we 54 00:02:35,960 --> 00:02:39,310 have zero covariance and therefore zero correlation, 55 00:02:39,310 --> 00:02:44,220 but yet the two random variables were dependent. 56 00:02:44,220 --> 00:02:46,130 Let us now look at the other extreme, 57 00:02:46,130 --> 00:02:48,400 where the two random variables are 58 00:02:48,400 --> 00:02:51,180 as dependent as they can be. 59 00:02:51,180 --> 00:02:53,300 So let's look at the correlation coefficient 60 00:02:53,300 --> 00:02:56,840 of one random variable with itself. 61 00:02:56,840 --> 00:02:58,520 What is it going to be? 62 00:02:58,520 --> 00:03:01,650 The covariance of a random variable with itself 63 00:03:01,650 --> 00:03:07,090 is just the variance of that random variable, now, 64 00:03:07,090 --> 00:03:09,950 sigma X is going to be the same as sigma Y, 65 00:03:09,950 --> 00:03:12,900 because we're taking Y to be the same as X. 66 00:03:12,900 --> 00:03:15,260 So we're dividing by sigma X squared. 67 00:03:15,260 --> 00:03:18,420 But the square of the standard deviation is the variance. 68 00:03:18,420 --> 00:03:21,500 So we obtain a value of 1. 69 00:03:21,500 --> 00:03:23,579 So a correlation coefficient of 1 70 00:03:23,579 --> 00:03:27,770 shows up in such a case of an extreme dependence. 71 00:03:27,770 --> 00:03:31,220 If instead we had taken the correlation coefficient of X 72 00:03:31,220 --> 00:03:34,510 with the negative of X, in that case, 73 00:03:34,510 --> 00:03:37,200 we would have obtained a correlation coefficient 74 00:03:37,200 --> 00:03:38,540 of minus 1. 75 00:03:41,540 --> 00:03:43,400 A somewhat more general situation 76 00:03:43,400 --> 00:03:47,970 than the one we considered here is the following. 77 00:03:47,970 --> 00:03:50,800 If we have two random variables that 78 00:03:50,800 --> 00:03:54,560 have a linear relationship-- that is, if I know Y 79 00:03:54,560 --> 00:03:58,540 I can figure out the value of X with absolute certainty, 80 00:03:58,540 --> 00:04:02,600 and I can figure it out by using a linear formula. 81 00:04:02,600 --> 00:04:06,840 In this case, it turns out that the correlation coefficient 82 00:04:06,840 --> 00:04:09,660 is either plus 1 or minus 1. 83 00:04:09,660 --> 00:04:11,380 And the converse is true. 84 00:04:11,380 --> 00:04:14,310 If the correlation coefficient has absolute value of 1, 85 00:04:14,310 --> 00:04:15,820 then the two random variables obey 86 00:04:15,820 --> 00:04:19,899 a deterministic linear relation between them. 87 00:04:19,899 --> 00:04:24,380 So to conclude, an extreme value for the correlation coefficient 88 00:04:24,380 --> 00:04:28,280 of plus or minus 1 is equivalent to having 89 00:04:28,280 --> 00:04:31,730 a deterministic relation between the two random variables 90 00:04:31,730 --> 00:04:33,820 involved. 91 00:04:33,820 --> 00:04:36,500 A final remark about the algebraic properties 92 00:04:36,500 --> 00:04:40,530 of the correlation coefficient- What can we 93 00:04:40,530 --> 00:04:42,250 say about the correlation coefficient 94 00:04:42,250 --> 00:04:46,450 of a linear function of a random variable with another? 95 00:04:46,450 --> 00:04:48,760 Well, we already know something about what 96 00:04:48,760 --> 00:04:52,280 happens to the covariance when we form a linear function. 97 00:04:52,280 --> 00:04:57,430 And the covariance of aX plus b with Y is related this way 98 00:04:57,430 --> 00:05:01,980 to the covariance of X with Y. Now, let us use this property 99 00:05:01,980 --> 00:05:04,830 and calculate the correlation coefficient between aX 100 00:05:04,830 --> 00:05:06,630 plus b and Y. 101 00:05:06,630 --> 00:05:10,010 In the numerator, we have the covariance of aX 102 00:05:10,010 --> 00:05:13,890 plus b with Y, which is equal to a times 103 00:05:13,890 --> 00:05:20,640 the covariance of X with Y. At the denominator, 104 00:05:20,640 --> 00:05:27,120 we have the standard deviation of this random variable. 105 00:05:27,120 --> 00:05:30,310 Now, the standard deviation of this random variable 106 00:05:30,310 --> 00:05:34,450 is equal to a times the standard deviation of X, 107 00:05:34,450 --> 00:05:36,080 if a is positive. 108 00:05:36,080 --> 00:05:39,110 If a is negative, then we need to put the minus sign. 109 00:05:39,110 --> 00:05:43,450 But in either case, we will have here the absolute value 110 00:05:43,450 --> 00:05:45,940 of a times the standard deviation 111 00:05:45,940 --> 00:05:49,030 of X. And then we divide by the standard deviation 112 00:05:49,030 --> 00:05:52,370 of the second random variable, which is Y. 113 00:05:52,370 --> 00:05:57,080 And so what we obtain here is this ratio, which 114 00:05:57,080 --> 00:06:01,840 is a correlation coefficient of X with Y times 115 00:06:01,840 --> 00:06:07,160 this quantity, which is the sign of a. 116 00:06:07,160 --> 00:06:11,690 So we have the sign of a times the correlation coefficient 117 00:06:11,690 --> 00:06:15,950 of X with Y. So in particular, the magnitude 118 00:06:15,950 --> 00:06:18,050 of the correlation coefficient is not 119 00:06:18,050 --> 00:06:22,370 going to change when we replace X by aX plus b. 120 00:06:22,370 --> 00:06:24,210 And this essentially means that if we 121 00:06:24,210 --> 00:06:27,280 change the units of the random variable X, 122 00:06:27,280 --> 00:06:29,300 for example, suppose that X was degrees 123 00:06:29,300 --> 00:06:34,180 Celsius and aX plus b is degrees Fahrenheit, going 124 00:06:34,180 --> 00:06:37,850 from one set of units, Celsius degrees, 125 00:06:37,850 --> 00:06:41,490 to another set of units, degrees in Fahrenheit, 126 00:06:41,490 --> 00:06:43,570 is not going to change the correlation 127 00:06:43,570 --> 00:06:46,909 coefficient of the temperature with some other random 128 00:06:46,909 --> 00:06:48,690 variable. 129 00:06:48,690 --> 00:06:51,380 So this is a nice property of the correlation coefficient, 130 00:06:51,380 --> 00:06:54,710 again, which reflects the fact that it's dimensionless, 131 00:06:54,710 --> 00:06:57,490 it doesn't have any units of its own, 132 00:06:57,490 --> 00:07:00,120 and it doesn't depend on what kinds of units 133 00:07:00,120 --> 00:07:03,360 we use for each one of the random variables.