1 00:00:00,060 --> 00:00:01,740 The mathematics of the correlation 2 00:00:01,740 --> 00:00:03,810 coefficient are important. 3 00:00:03,810 --> 00:00:06,410 But it is perhaps more important to be able to 4 00:00:06,410 --> 00:00:08,370 interpret it correctly. 5 00:00:08,370 --> 00:00:12,400 A correlation coefficient of let's say 0.5, tells us that 6 00:00:12,400 --> 00:00:15,790 something interesting is going on as far as the relation of X 7 00:00:15,790 --> 00:00:17,910 and Y is concerned. 8 00:00:17,910 --> 00:00:19,300 But what exactly? 9 00:00:19,300 --> 00:00:21,570 It tells us that the two random variables are 10 00:00:21,570 --> 00:00:23,770 associated in some sense. 11 00:00:23,770 --> 00:00:26,740 But this is often misinterpreted to mean that 12 00:00:26,740 --> 00:00:29,950 there is a causal relation between the two. 13 00:00:29,950 --> 00:00:32,155 But this is wrong. 14 00:00:32,155 --> 00:00:35,950 A large correlation coefficient in general does 15 00:00:35,950 --> 00:00:39,580 not indicate that there is a causal relation between the 16 00:00:39,580 --> 00:00:41,450 random variables. 17 00:00:41,450 --> 00:00:45,220 As an example, suppose that X somehow quantifies the 18 00:00:45,220 --> 00:00:47,310 mathematical aptitude of a person. 19 00:00:47,310 --> 00:00:52,290 And Y somehow quantifies the musical ability of a person. 20 00:00:52,290 --> 00:00:55,450 In general, it has been found that mathematical aptitude and 21 00:00:55,450 --> 00:00:57,780 musical ability are correlated. 22 00:00:57,780 --> 00:01:00,800 People who score high on one will score high on 23 00:01:00,800 --> 00:01:02,420 the other as well. 24 00:01:02,420 --> 00:01:04,720 Is there a causal relation? 25 00:01:04,720 --> 00:01:09,789 If you study math a lot and you become very good at math, 26 00:01:09,789 --> 00:01:12,890 does it mean that you would become a better musician? 27 00:01:12,890 --> 00:01:14,140 Not necessarily. 
28 00:01:14,140 --> 00:01:18,210 Or if you practice the violin day in and day out, does it 29 00:01:18,210 --> 00:01:21,510 mean that you will score better in the math exam? 30 00:01:21,510 --> 00:01:23,360 Again, not necessarily. 31 00:01:23,360 --> 00:01:26,760 Perhaps what is going on is that there's a certain feature 32 00:01:26,760 --> 00:01:29,630 of the human brain, and when that feature is well 33 00:01:29,630 --> 00:01:35,289 developed, then that feature helps both in math and in 34 00:01:35,289 --> 00:01:36,750 musical ability. 35 00:01:36,750 --> 00:01:39,920 And this is a typical situation of how a correlation 36 00:01:39,920 --> 00:01:41,940 coefficient may arise. 37 00:01:41,940 --> 00:01:44,340 That is, often a correlation coefficient that's 38 00:01:44,340 --> 00:01:48,570 significant reflects that there is an underlying common 39 00:01:48,570 --> 00:01:52,810 but perhaps hidden factor that affects both of the random 40 00:01:52,810 --> 00:01:54,710 variables X and Y. 41 00:01:54,710 --> 00:01:58,640 Let us go through a simple numerical example that models 42 00:01:58,640 --> 00:02:00,980 a situation of this kind. 43 00:02:00,980 --> 00:02:04,440 Suppose that Z, V, and W are independent random variables. 44 00:02:04,440 --> 00:02:07,120 And that we have two more random variables defined by 45 00:02:07,120 --> 00:02:08,490 these relations. 46 00:02:08,490 --> 00:02:13,110 Note that there's no direct influence from X to Y or from 47 00:02:13,110 --> 00:02:14,520 Y to X. 48 00:02:14,520 --> 00:02:16,990 But on the other hand, there's a common underlying factor, 49 00:02:16,990 --> 00:02:20,880 this random variable Z that affects both X and Y. Because 50 00:02:20,880 --> 00:02:25,270 of this, we expect that X and Y will somehow have some kind 51 00:02:25,270 --> 00:02:28,000 of relation or association between them. 52 00:02:28,000 --> 00:02:30,210 And we would like to measure the strength of that 53 00:02:30,210 --> 00:02:31,460 association.
54 00:02:31,460 --> 00:02:34,329 The way to measure it will be in terms of the correlation 55 00:02:34,329 --> 00:02:37,800 coefficient, which we will now proceed to compute. 56 00:02:37,800 --> 00:02:41,890 To have a complete example in our hands and in order to also 57 00:02:41,890 --> 00:02:44,630 keep things simple, let us assume that the basic 58 00:02:44,630 --> 00:02:48,540 underlying random variables Z, V, and W all have 0 means and 59 00:02:48,540 --> 00:02:50,240 unit variances. 60 00:02:50,240 --> 00:02:52,880 And now let us take the definition of the correlation 61 00:02:52,880 --> 00:02:56,090 coefficient and start calculating. 62 00:02:56,090 --> 00:03:01,600 Let us look at the variance of X. Because X is the sum of two 63 00:03:01,600 --> 00:03:05,300 independent random variables, its variance is going to be 64 00:03:05,300 --> 00:03:08,290 the sum of those variances. 65 00:03:08,290 --> 00:03:11,370 And we have assumed that each one of those variances is 66 00:03:11,370 --> 00:03:12,100 equal to 1. 67 00:03:12,100 --> 00:03:15,050 So the variance of X is equal to 2. 68 00:03:15,050 --> 00:03:18,000 And that implies that the standard deviation of X is 69 00:03:18,000 --> 00:03:20,270 equal to the square root of 2. 70 00:03:20,270 --> 00:03:23,650 By a similar argument, the standard deviation of Y is 71 00:03:23,650 --> 00:03:26,440 also equal to the square root of 2. 72 00:03:26,440 --> 00:03:31,350 Now, let us look at the covariance between X and Y. 73 00:03:31,350 --> 00:03:36,660 Because X and Y have 0 means, the covariance is just the 74 00:03:36,660 --> 00:03:40,160 expected value of the product of the two random variables. 75 00:03:40,160 --> 00:03:42,990 And using the definition of what these two random 76 00:03:42,990 --> 00:03:47,450 variables are, it's this particular product here. 77 00:03:47,450 --> 00:03:51,040 We expand the product into a sum of four terms.
78 00:03:51,040 --> 00:03:54,590 And take the expected value of each one of the four terms. 79 00:04:04,170 --> 00:04:11,580 Which leaves us with this particular expression here. 80 00:04:11,580 --> 00:04:15,240 Now, Z has 0 mean and unit variance. 81 00:04:15,240 --> 00:04:18,940 Therefore, the expected value of Z squared is equal to 1. 82 00:04:18,940 --> 00:04:20,680 How about the next term? 83 00:04:20,680 --> 00:04:22,240 V and Z are independent. 84 00:04:22,240 --> 00:04:25,140 So the expected value of the product is the product of the 85 00:04:25,140 --> 00:04:26,210 expected values. 86 00:04:26,210 --> 00:04:30,410 But the expected values are zero, so this term is zero. 87 00:04:30,410 --> 00:04:32,500 And with a similar argument, the other 88 00:04:32,500 --> 00:04:35,150 terms are zero as well. 89 00:04:35,150 --> 00:04:37,970 So the covariance is equal to 1. 90 00:04:37,970 --> 00:04:43,350 And from this, we can conclude our calculation and write that 91 00:04:43,350 --> 00:04:48,340 the correlation coefficient between X and Y is equal to 1 92 00:04:48,340 --> 00:04:53,270 divided by the square root of 2 times the square root of 2, 93 00:04:53,270 --> 00:04:54,520 which is 1/2. 94 00:04:56,940 --> 00:05:00,610 This example also serves to give you a rough idea of what 95 00:05:00,610 --> 00:05:02,790 it may mean to have a correlation 96 00:05:02,790 --> 00:05:05,320 coefficient of 1/2. 97 00:05:05,320 --> 00:05:07,510 It means that the two random variables 98 00:05:07,510 --> 00:05:10,310 have some common elements. 99 00:05:10,310 --> 00:05:13,690 And they also have some idiosyncratic elements. 100 00:05:13,690 --> 00:05:18,390 And these two elements are roughly equal in weight. 101 00:05:18,390 --> 00:05:22,220 If V and W were completely absent, the correlation 102 00:05:22,220 --> 00:05:24,405 coefficient would have been 1.
103 00:05:24,405 --> 00:05:29,900 If, on the other hand, V and W had a huge variance, so as to 104 00:05:29,900 --> 00:05:33,810 completely hide the effect of Z, then the value of the 105 00:05:33,810 --> 00:05:37,490 correlation coefficient would have been much, much smaller, 106 00:05:37,490 --> 00:05:39,480 perhaps closer to 0. 107 00:05:39,480 --> 00:05:42,260 And in the extreme case, of course, where Z is completely 108 00:05:42,260 --> 00:05:45,610 absent, then X and Y are independent, and we get a 109 00:05:45,610 --> 00:05:47,300 correlation coefficient of 0.
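The calculation above can be checked with a small simulation. This is only a sketch under stated assumptions: the relations shown on the slide are not part of the transcript, so the code assumes X = Z + V and Y = Z + W, which is what the variance and covariance steps imply. It also assumes normal distributions for Z, V, and W, which the lecture does not specify; only the zero means and the variances matter for the result. The function name `simulated_correlation` and its parameters are hypothetical, introduced for illustration. Under these assumptions, if V and W each have variance s^2, the covariance stays at 1 while each variance becomes 1 + s^2, so the correlation coefficient is 1 / (1 + s^2).

```python
import math
import random


def simulated_correlation(noise_std=1.0, include_common=True, n=50_000, seed=0):
    """Estimate the correlation coefficient of X = Z + V and Y = Z + W.

    Assumptions (not stated in the transcript): Z, V, W are independent
    normal random variables with zero mean; Z has unit variance, while
    V and W have standard deviation `noise_std`.
    Setting include_common=False removes Z entirely.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0) if include_common else 0.0
        v = rng.gauss(0.0, noise_std)  # idiosyncratic part of X
        w = rng.gauss(0.0, noise_std)  # idiosyncratic part of Y
        xs.append(z + v)
        ys.append(z + w)

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    # Correlation coefficient: covariance divided by the product of the
    # standard deviations.
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Unit-variance noise, as in the lecture: theory gives 1/2.
    print(simulated_correlation(noise_std=1.0))
    # V and W absent: X and Y both equal Z, theory gives 1.
    print(simulated_correlation(noise_std=0.0))
    # Large noise variance hides Z: theory gives 1/101.
    print(simulated_correlation(noise_std=10.0))
    # Z absent: X and Y are independent, theory gives 0.
    print(simulated_correlation(noise_std=1.0, include_common=False))
```

The printed values are sample estimates, so they should land near the theoretical values in the comments rather than match them exactly. The four calls correspond to the four cases discussed in the lecture: equal weight of common and idiosyncratic parts, no idiosyncratic part, dominant idiosyncratic part, and no common part.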