1 00:00:04,500 --> 00:00:08,560 In the previous video, we only used one independent variable, 2 00:00:08,560 --> 00:00:10,520 but there are many different variables 3 00:00:10,520 --> 00:00:13,380 that could be used to predict wine price. 4 00:00:13,380 --> 00:00:16,430 We used average growing season temperature, 5 00:00:16,430 --> 00:00:20,690 but we also have data for other weather-related variables-- 6 00:00:20,690 --> 00:00:23,330 harvest rain and winter rain. 7 00:00:23,330 --> 00:00:27,720 Additionally, the age of wine is suspected to be important, 8 00:00:27,720 --> 00:00:29,470 and many other variables could also 9 00:00:29,470 --> 00:00:34,080 be used, such as the population of France. 10 00:00:34,080 --> 00:00:38,760 We can use each variable in a one variable regression model. 11 00:00:38,760 --> 00:00:40,750 Average growing season temperature 12 00:00:40,750 --> 00:00:46,730 gives the best R squared of 0.44, followed by harvest rain 13 00:00:46,730 --> 00:00:50,290 within R squared of 0.32. 14 00:00:50,290 --> 00:00:53,170 France's population and age, both 15 00:00:53,170 --> 00:00:56,650 give models within R squared around 0.2, 16 00:00:56,650 --> 00:01:00,980 and winter rain gives a pretty low R squared of 0.02, 17 00:01:00,980 --> 00:01:04,330 or just barely better than the baseline. 18 00:01:04,330 --> 00:01:07,950 So if we only used one variable, average growing season 19 00:01:07,950 --> 00:01:10,340 temperature is the best choice. 20 00:01:10,340 --> 00:01:13,100 But multiple linear regression allows 21 00:01:13,100 --> 00:01:18,630 you to use multiple variables at once to improve the model. 22 00:01:18,630 --> 00:01:21,090 The multiple linear regression model 23 00:01:21,090 --> 00:01:23,440 is similar to the one variable regression 24 00:01:23,440 --> 00:01:26,250 model that has a coefficient beta 25 00:01:26,250 --> 00:01:28,960 for each independent variable. 26 00:01:28,960 --> 00:01:32,030 We predict the dependent variable y 27 00:01:32,030 --> 00:01:37,850 using the independent variables x1, x2, through xk, 28 00:01:37,850 --> 00:01:41,300 where k here denotes the number of independent variables 29 00:01:41,300 --> 00:01:43,430 in our model. 30 00:01:43,430 --> 00:01:45,789 Beta 0 is, again, the coefficient 31 00:01:45,789 --> 00:01:51,800 for our intercept term, and beta 1, beta 2, through beta k 32 00:01:51,800 --> 00:01:55,640 are the coefficients for the independent variables. 33 00:01:55,640 --> 00:01:59,680 We use i to denote the data for a particular data point 34 00:01:59,680 --> 00:02:01,650 or observation. 35 00:02:01,650 --> 00:02:05,610 The best model is selected in the same way as before. 36 00:02:05,610 --> 00:02:09,389 To minimize the sum of squared errors, using the error terms, 37 00:02:09,389 --> 00:02:11,880 epsilon. 38 00:02:11,880 --> 00:02:15,270 We can start by building a linear regression model that 39 00:02:15,270 --> 00:02:19,290 just uses the variable with the best R squared-- average 40 00:02:19,290 --> 00:02:21,400 growing season temperature. 41 00:02:21,400 --> 00:02:27,030 We saw before that this gives us an R squared of 0.44. 42 00:02:27,030 --> 00:02:29,829 Then we can add variables one at a time 43 00:02:29,829 --> 00:02:33,510 and look at the improvement in R squared. 44 00:02:33,510 --> 00:02:37,040 Note that the improvement is not equal to the one variable 45 00:02:37,040 --> 00:02:40,640 R squared for each independent variable we add, 46 00:02:40,640 --> 00:02:43,329 since they're interactions between the independent 47 00:02:43,329 --> 00:02:45,300 variables. 48 00:02:45,300 --> 00:02:47,480 Adding independent variables improves 49 00:02:47,480 --> 00:02:50,079 the R squared to almost double what 50 00:02:50,079 --> 00:02:52,930 it was with a single independent variable. 51 00:02:52,930 --> 00:02:55,280 But there are diminishing returns. 52 00:02:55,280 --> 00:02:58,860 The marginal improvement from adding an additional variable 53 00:02:58,860 --> 00:03:02,620 decreases as we add more and more variables. 54 00:03:02,620 --> 00:03:05,550 So which model should we use? 55 00:03:05,550 --> 00:03:08,670 Often not all variable should be used. 56 00:03:08,670 --> 00:03:11,930 This is because each additional variable used requires 57 00:03:11,930 --> 00:03:14,740 more data, and using more variables 58 00:03:14,740 --> 00:03:17,579 creates a more complicated model. 59 00:03:17,579 --> 00:03:20,150 Overly complicated models often cause 60 00:03:20,150 --> 00:03:22,600 what's known as overfitting. 61 00:03:22,600 --> 00:03:24,900 This is when you have a higher R squared 62 00:03:24,900 --> 00:03:27,190 on data used to create the model, 63 00:03:27,190 --> 00:03:30,420 but bad performance on unseen data. 64 00:03:30,420 --> 00:03:33,260 For example, suppose we want to use this model 65 00:03:33,260 --> 00:03:36,680 to make a prediction for the year 2013. 66 00:03:36,680 --> 00:03:39,920 We expect an overfit model to perform poorly 67 00:03:39,920 --> 00:03:42,110 on this new data. 68 00:03:42,110 --> 00:03:44,950 In the next video, we'll learn how to build a regression 69 00:03:44,950 --> 00:03:47,400 models in R and then we'll discuss 70 00:03:47,400 --> 00:03:49,180 how to select the variables that should 71 00:03:49,180 --> 00:03:52,180 be included in the final model.