In the previous video, we created linear regression models in R. Using the summary function, we were able to see the coefficients, as well as some other information. The output of the coefficients section of the summary function is shown here.

The independent variables are listed on the left. The Estimate column gives the coefficients for the intercept and for each of the independent variables in our model. The remaining columns help us determine if a variable should be included in the model, or if its coefficient is significantly different from 0. A coefficient of 0 means that the value of the independent variable does not change our prediction for the dependent variable. If a coefficient is not significantly different from 0, then we should probably remove the variable from our model, since it's not helping to predict the dependent variable.

The standard error gives a measure of how much the coefficient is likely to vary from the estimate value. The t value is the estimate divided by the standard error. It will be negative if the estimate is negative, and positive if the estimate is positive. The larger the absolute value of the t statistic, the more likely the coefficient is to be significant, so we want variables with a large absolute value in this column.

The last column gives the probability that a coefficient is actually 0. It will be large if the absolute value of the t statistic is small, and it will be small if the absolute value of the t statistic is large. We want variables with small values in this column.
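To see how these columns fit together, here is a minimal sketch (assuming the model3 fit from the previous video, though any lm fit would do) that pulls the coefficients table out of the summary and recomputes the last two columns by hand:

    # Extract the coefficients table from the summary output.
    coefs <- coef(summary(model3))

    # The t value is the estimate divided by the standard error ...
    t_values <- coefs[, "Estimate"] / coefs[, "Std. Error"]

    # ... and the last column is the two-sided tail probability from the
    # t distribution with the model's residual degrees of freedom.
    p_values <- 2 * pt(abs(t_values), df = model3$df.residual, lower.tail = FALSE)

    # These reproduce the "t value" and "Pr(>|t|)" columns of the summary.
    cbind(t_values, p_values)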
This is a lot of information, but the easiest way in R to determine if a variable is significant is to look at the stars at the end of each row. Three stars is the highest level of significance, and corresponds to a probability less than 0.001. The star coding scheme is explained at the bottom of the coefficients output. Three stars corresponds to probability values between 0 and 0.001, or the smallest possible probabilities. Two stars is also very significant, and corresponds to a probability between 0.001 and 0.01. One star is also significant, and corresponds to a probability between 0.01 and 0.05. A period, or dot, means that the variable is almost significant, and corresponds to a probability between 0.05 and 0.1. When we ask you to list the significant variables in the model, we will usually not include these. Nothing at the end of a row means that the variable is not significant in the model. Age and FrancePopulation are both insignificant in our model.

Let's switch to R, and see if we can improve our model. In the previous video, we built a linear regression model called model3 that used all of our independent variables to predict the dependent variable, Price. In the R console, we can see the summary output for this model. By looking at the coefficients section, we can see that both Age and FrancePopulation are insignificant in our model. Because of this, we should consider removing these variables from our model.

Let's start by just removing FrancePopulation, which we intuitively don't expect to be predictive of wine price anyway. Let's create a new model called model4, which again uses the lm function to predict Price using the independent variables AGST, HarvestRain, WinterRain, and Age. Here, we're not using FrancePopulation. Our data set, again, is wine, which will be the data used to create our model. Then let's look at the summary of model4.
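Typed out, this step looks roughly like the following (a sketch; the wine data frame is the one loaded in the previous video, e.g. with read.csv):

    # Fit model4: predict Price, leaving out FrancePopulation.
    model4 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
    summary(model4)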
We can see that our R-squared in this model is 0.8286, and our adjusted R-squared is 0.79. If we scroll back up to our previous model, our R-squared was 0.8294, and our adjusted R-squared was 0.784. So this model is just as strong, if not stronger, than before, because our adjusted R-squared actually increased when we removed FrancePopulation.

If we look at each of our variables and the stars, we now see that something a little strange happened. Before, Age was not significant at all in our model. But now, the variable Age has two stars, meaning that it's very significant in this new model. Why did this happen? This is due to something called multicollinearity. Age and FrancePopulation are what we call highly correlated.

Let's learn a bit more about correlation. Correlation measures the linear relationship between two variables, and is a number between +1 and -1. The highest a correlation can be is positive 1, which corresponds to a perfect positive linear relationship between the two variables. The smallest a correlation can be is negative 1, which corresponds to a perfect negative linear relationship between the two variables. In the middle of these two extremes is a correlation of 0, which corresponds to no linear relationship between the two variables.

Let's look at some examples. This plot graphs WinterRain on the x-axis, and wine price on the y-axis. By visually inspecting this plot, we can see that it looks like there's a slight positive linear relationship between these two variables. It turns out that the correlation between WinterRain and wine price is 0.14, which corresponds to a slightly positive linear relationship, as we saw visually.

This plot graphs HarvestRain on the x-axis, and AGST on the y-axis. It's hard to visually see a positive or negative linear trend in this data. It turns out that the correlation is equal to negative 0.06, which is very close to 0, and corresponds to very little linear relationship.

This plot shows the age of wine compared to the population of France. It looks like there's a very strong negative linear relationship between these two variables. It turns out that the correlation is equal to -0.99, which is very close to -1, the smallest a correlation can be.

Let's compute some correlations in R. We can compute the correlation between a pair of variables by using the cor function. Let's compute the correlation between WinterRain and Price. We start by typing the name of the function, cor, then the name of the first variable, wine$WinterRain, followed by a comma, and then the name of the second variable.
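Typed out, the correlation commands from this segment look like this (a sketch; the values in the comments are the ones quoted in the video):

    # Correlation between an independent variable and the dependent variable.
    cor(wine$WinterRain, wine$Price)       # about 0.1366505

    # Correlation between two independent variables.
    cor(wine$Age, wine$FrancePopulation)   # about -0.99

    # Correlation matrix for every pair of variables in the data set.
    cor(wine)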
If we hit Enter, we see that the correlation here between WinterRain and Price is 0.1366505. We can also compute the correlation between two other variables, let's say Age and FrancePopulation. Again, we use the function cor, and then type the names of the two variables. If we hit Enter, we see that the correlation between Age and FrancePopulation is about -0.99.

We can also compute the correlation between all variables in our data set by using the same function, cor, and typing the name of the data set, wine. Here, our output shows us a lot of numbers, and both the rows and the columns are labeled by the variable names. To find the correlation between two variables, we find one variable name on the row, and then go to the column labeled by the other variable name. So for example, if we want to find the correlation between AGST and Price, we can go down to the row labeled Price, and then across that row to the column labeled AGST. Here, we find that the correlation is about 0.6595.

So we have confirmed that the correlation between Age and FrancePopulation is very high, so we do have multicollinearity in our model. Note that multicollinearity refers to the situation where two independent variables are highly correlated. A high correlation between an independent variable and the dependent variable, like the correlation between AGST and Price, is a good thing, since we're trying to predict the dependent variable using the independent variable. Multicollinearity only applies to the case where two independent variables are highly positively or negatively correlated.

Because of multicollinearity, you always want to remove the insignificant variables one at a time. Let's see what would have happened if we had removed both Age and FrancePopulation at the same time. We'll call this model model5, and use the lm function to predict Price using only AGST, HarvestRain, and WinterRain. Again, we'll use the data set wine.
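That step looks like the following (a sketch, in the same style as model4 above):

    # Fit model5: remove both Age and FrancePopulation at once.
    model5 <- lm(Price ~ AGST + HarvestRain + WinterRain, data = wine)
    summary(model5)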
If we look at the summary of our model, model5, we can see that AGST, HarvestRain, and WinterRain are all fairly significant, but our multiple R-squared dropped to 0.75. In our previous model, model4, our R-squared was about 0.83. So if we had removed both Age and FrancePopulation at the same time, we would have lost a significant variable, Age, and the R-squared of our model would have been lower.

Why did we keep Age, and remove FrancePopulation instead? Well, we expect Age to be significant: older wines are typically more expensive. Since the population of France steadily increases with the year, FrancePopulation captures the same effect as Age, but is less interpretable in our model. Multicollinearity reminds us that coefficients are only interpretable in the presence of the other variables being used. High correlations can even cause coefficients to have the wrong sign. We'll see this in the next lecture.

So we fixed the multicollinearity problem between Age and FrancePopulation. Do we have any other highly correlated independent variables? If you look back at our correlation matrix, you can see that we don't. Since all of our other remaining variables are also significant, we'll stick with model4 as our model for the rest of this lecture.
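One quick way to double-check both of these conclusions at the console is sketched below; it is not part of the original lecture, and the 0.7 cutoff is an arbitrary illustration, not a rule from the video:

    # Correlations among the candidate independent variables only
    # (Price is the dependent variable, so we leave it out).
    ind <- wine[, c("WinterRain", "AGST", "HarvestRain", "Age", "FrancePopulation")]
    corMatrix <- cor(ind)
    diag(corMatrix) <- 0   # ignore each variable's correlation with itself
    which(abs(corMatrix) > 0.7, arr.ind = TRUE)   # only the Age/FrancePopulation pair

    # Adjusted R-squared comparison: model4 comes out ahead.
    summary(model4)$adj.r.squared   # about 0.79
    summary(model5)$adj.r.squared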