1 00:00:04,500 --> 00:00:06,680 In the previous video, we observed 2 00:00:06,680 --> 00:00:10,700 that Age and FrancePopulation are highly correlated. 3 00:00:10,700 --> 00:00:12,670 But what is correlation? 4 00:00:12,670 --> 00:00:15,920 Correlation measures the linear relationship between two 5 00:00:15,920 --> 00:00:21,050 variables and is a number between -1 and +1. 6 00:00:21,050 --> 00:00:24,820 A correlation of +1 means a perfect positive linear 7 00:00:24,820 --> 00:00:26,640 relationship. 8 00:00:26,640 --> 00:00:30,450 A correlation of -1 means a perfect negative linear 9 00:00:30,450 --> 00:00:32,090 relationship. 10 00:00:32,090 --> 00:00:33,980 In the middle of these two extremes 11 00:00:33,980 --> 00:00:36,490 is a correlation of 0, which means 12 00:00:36,490 --> 00:00:39,050 that there is no linear relationship between the two 13 00:00:39,050 --> 00:00:41,400 variables. 14 00:00:41,400 --> 00:00:44,390 When we say that two variables are highly correlated, 15 00:00:44,390 --> 00:00:47,220 we mean that the absolute value of the correlation 16 00:00:47,220 --> 00:00:49,220 is close to 1. 17 00:00:49,220 --> 00:00:52,770 Let's look at some examples. 18 00:00:52,770 --> 00:00:56,790 This plot graphs WinterRain on the x-axis 19 00:00:56,790 --> 00:00:59,670 and Price on the y-axis. 20 00:00:59,670 --> 00:01:01,810 By visually inspecting the plot, it's 21 00:01:01,810 --> 00:01:05,260 hard to detect any linear relationship. 22 00:01:05,260 --> 00:01:08,360 But it turns out that the correlation between WinterRain 23 00:01:08,360 --> 00:01:11,740 and Price is 0.14. 24 00:01:11,740 --> 00:01:14,220 So there's a slight positive relationship 25 00:01:14,220 --> 00:01:15,390 between these two variables. 26 00:01:18,280 --> 00:01:23,100 This plot has HarvestRain on the x-axis and the Average Growing 27 00:01:23,100 --> 00:01:25,930 Season Temperature on the y-axis. 28 00:01:25,930 --> 00:01:28,940 It's again hard to visually see a linear relationship 29 00:01:28,940 --> 00:01:31,710 between the two variables, and it turns out 30 00:01:31,710 --> 00:01:37,680 that the correlation is -0.06, even closer to 0 than before. 31 00:01:40,620 --> 00:01:44,830 This plot shows Age of the Wine on the x-axis 32 00:01:44,830 --> 00:01:48,580 and the Population of France on the y-axis. 33 00:01:48,580 --> 00:01:52,000 We can easily see that there's a strong negative linear 34 00:01:52,000 --> 00:01:55,060 relationship between these two variables. 35 00:01:55,060 --> 00:01:57,770 This makes sense, since the population of France 36 00:01:57,770 --> 00:02:00,170 has increased with time. 37 00:02:00,170 --> 00:02:05,810 If we compute the correlation, we get -0.99. 38 00:02:05,810 --> 00:02:10,190 So these two variables are indeed highly correlated. 39 00:02:10,190 --> 00:02:14,030 Let's compute some correlations in R. 40 00:02:14,030 --> 00:02:17,280 We can compute the correlation between a pair of variables 41 00:02:17,280 --> 00:02:20,820 in R by using the cor function. 42 00:02:20,820 --> 00:02:25,440 Let's compute the correlation between WinterRain and Price. 43 00:02:25,440 --> 00:02:29,090 So we type cor, and then in parentheses we 44 00:02:29,090 --> 00:02:34,380 give the names of the two variables, WinterRain 45 00:02:34,380 --> 00:02:37,300 and Price. 46 00:02:37,300 --> 00:02:39,670 If we hit Enter, this function tells us 47 00:02:39,670 --> 00:02:45,190 that the correlation between the two variables is 0.137. 48 00:02:45,190 --> 00:02:47,280 Let's look at another example. 49 00:02:47,280 --> 00:02:49,190 This time we'll compute the correlation 50 00:02:49,190 --> 00:02:52,540 between Age and FrancePopulation. 51 00:02:52,540 --> 00:02:55,650 So again, we use the cor function, 52 00:02:55,650 --> 00:02:58,480 but this time we give as the two variables 53 00:02:58,480 --> 00:03:01,400 Age and FrancePopulation. 54 00:03:06,440 --> 00:03:09,790 As we saw earlier, Age and FrancePopulation 55 00:03:09,790 --> 00:03:15,650 are highly correlated with a correlation of -0.99. 56 00:03:15,650 --> 00:03:17,550 We can also compute the correlation 57 00:03:17,550 --> 00:03:19,680 between all pairs of variables in our data 58 00:03:19,680 --> 00:03:21,900 set using the cor function. 59 00:03:21,900 --> 00:03:29,579 To do so, we just type cor, and then our data set name wine. 60 00:03:29,579 --> 00:03:33,730 The output is a grid of numbers with the rows and columns 61 00:03:33,730 --> 00:03:36,670 labeled with the variables in our data set. 62 00:03:36,670 --> 00:03:39,790 To find the correlation between two variables, 63 00:03:39,790 --> 00:03:42,180 you just need to find the row for one of them 64 00:03:42,180 --> 00:03:43,850 and the column for the other. 65 00:03:43,850 --> 00:03:47,870 For example, we can find the column for Age 66 00:03:47,870 --> 00:03:50,690 and then go down to the row for FrancePopulation 67 00:03:50,690 --> 00:03:53,579 to see the number that we just computed. 68 00:03:53,579 --> 00:03:56,079 Or we could check if WinterRain is 69 00:03:56,079 --> 00:03:59,180 highly correlated with any other independent variables 70 00:03:59,180 --> 00:04:02,710 by looking at the WinterRain column. 71 00:04:02,710 --> 00:04:04,390 So how does this information help 72 00:04:04,390 --> 00:04:07,680 us understand our linear regression model? 73 00:04:07,680 --> 00:04:11,680 We've confirmed that Age and FrancePopulation are definitely 74 00:04:11,680 --> 00:04:13,100 highly correlated. 75 00:04:13,100 --> 00:04:16,589 So we do have multicollinearity problems in our model 76 00:04:16,589 --> 00:04:20,450 that uses all of the available independent variables. 77 00:04:20,450 --> 00:04:22,750 Keep in mind that multicollinearity 78 00:04:22,750 --> 00:04:26,820 refers to the situation when two independent variables are 79 00:04:26,820 --> 00:04:28,640 highly correlated. 80 00:04:28,640 --> 00:04:31,450 A high correlation between an independent variable 81 00:04:31,450 --> 00:04:34,210 and the dependent variable is a good thing 82 00:04:34,210 --> 00:04:37,210 since we're trying to predict the dependent variable using 83 00:04:37,210 --> 00:04:40,020 the independent variables. 84 00:04:40,020 --> 00:04:43,110 Now due to the possibility of multicollinearity, 85 00:04:43,110 --> 00:04:46,170 you always want to remove the insignificant variables 86 00:04:46,170 --> 00:04:47,980 one at a time. 87 00:04:47,980 --> 00:04:49,400 Let's see what would have happened 88 00:04:49,400 --> 00:04:53,630 if we had removed both Age and FrancePopulation, since they 89 00:04:53,630 --> 00:04:55,890 were both insignificant in our model 90 00:04:55,890 --> 00:04:59,140 that used all of the independent variables. 91 00:04:59,140 --> 00:05:03,190 We'll call this new model model5, 92 00:05:03,190 --> 00:05:05,850 and again, we'll use the lm function 93 00:05:05,850 --> 00:05:10,250 to predict Price using as independent variables 94 00:05:10,250 --> 00:05:19,470 AGST, HarvestRain, and WinterRain. 95 00:05:23,020 --> 00:05:27,630 Again, we'll use the data set wine. 96 00:05:27,630 --> 00:05:33,730 If we take a look at the summary of this new model 97 00:05:33,730 --> 00:05:36,240 and look at the Coefficients table, 98 00:05:36,240 --> 00:05:38,830 we can see that AverageGrowingSeasonTemperature 99 00:05:38,830 --> 00:05:41,650 and HarvestRain are very significant, 100 00:05:41,650 --> 00:05:44,810 and WinterRain is almost significant. 101 00:05:44,810 --> 00:05:46,940 So this model looks pretty good, 102 00:05:46,940 --> 00:05:48,840 but if we look at our R-squared, we 103 00:05:48,840 --> 00:05:52,980 can see that it dropped to 0.75. 104 00:05:52,980 --> 00:05:57,500 The model that includes Age has an R-squared of 0.83. 105 00:05:57,500 --> 00:06:00,590 So if we had removed Age and FrancePopulation 106 00:06:00,590 --> 00:06:04,690 at the same time, we would have missed a significant variable, 107 00:06:04,690 --> 00:06:09,210 and the R-squared of our final model would have been lower. 108 00:06:09,210 --> 00:06:11,650 So why didn't we keep FrancePopulation 109 00:06:11,650 --> 00:06:13,440 instead of Age? 110 00:06:13,440 --> 00:06:16,680 Well, we expect Age to be significant. 111 00:06:16,680 --> 00:06:19,600 Older wines are typically more expensive, 112 00:06:19,600 --> 00:06:23,570 so Age makes more intuitive sense in our model. 113 00:06:23,570 --> 00:06:26,650 Multicollinearity reminds us that coefficients 114 00:06:26,650 --> 00:06:29,200 are only interpretable in the presence 115 00:06:29,200 --> 00:06:31,820 of other variables being used. 116 00:06:31,820 --> 00:06:34,270 High correlations can even cause coefficients 117 00:06:34,270 --> 00:06:36,720 to have an unintuitive sign. 118 00:06:36,720 --> 00:06:40,750 We'll see an example of this in the next lecture. 119 00:06:40,750 --> 00:06:43,080 So we fixed the multicollinearity issue 120 00:06:43,080 --> 00:06:46,390 caused by Age and FrancePopulation. 121 00:06:46,390 --> 00:06:48,920 Do we have any other highly-correlated independent 122 00:06:48,920 --> 00:06:50,770 variables? 123 00:06:50,770 --> 00:06:53,350 There is no definitive cut-off value 124 00:06:53,350 --> 00:06:55,930 for what makes a correlation too high. 125 00:06:55,930 --> 00:06:59,270 But typically, a correlation greater than 0.7 126 00:06:59,270 --> 00:07:03,610 or less than -0.7 is cause for concern. 127 00:07:03,610 --> 00:07:06,330 If you look back at all of the correlations we computed 128 00:07:06,330 --> 00:07:09,130 for our data set, you can see that it doesn't 129 00:07:09,130 --> 00:07:11,870 look like we have any other highly-correlated independent 130 00:07:11,870 --> 00:07:13,530 variables. 131 00:07:13,530 --> 00:07:15,610 So we'll stick with model4 for the rest 132 00:07:15,610 --> 00:07:21,300 of this lecture, the model that uses AGST, HarvestRain, 133 00:07:21,300 --> 00:07:25,650 WinterRain, and Age as the independent variables.