Let's discuss the method Ashenfelter used to build his model, linear regression. We'll start with one-variable linear regression, which uses just one independent variable to predict the dependent variable.

This figure shows a plot of one of the independent variables, average growing season temperature, and the dependent variable, wine price. The goal of linear regression is to create a predictive line through the data. There are many different lines that could be drawn to predict wine price using average growing season temperature. A simple option would be a flat line at the average price, in this case 7.07. The equation for this line is y = 7.07. This linear regression model would predict 7.07 regardless of the temperature. But it looks like a better line would have a positive slope, such as this line in blue. The equation for this line is y = 0.5*(AGST) - 1.25. This linear regression model would predict a higher price when the temperature is higher.

Let's make this idea a little more formal. In general form, a one-variable linear regression model is a linear equation to predict the dependent variable, y, using the independent variable, x. Beta 0 is the intercept term, or intercept coefficient, and Beta 1 is the slope of the line, or the coefficient for the independent variable, x. For each observation, i, we have data for the dependent variable, y_i, and data for the independent variable, x_i. Using this equation, we make a prediction, Beta 0 plus Beta 1 times x_i, for each data point, i. This prediction is hopefully close to the true outcome, y_i. But since the coefficients have to be the same for all data points, i, we often make a small error, which we'll call epsilon_i. This error term is also often called a residual. Our errors will only all be zero if all our points lie perfectly on the same line. This rarely happens, so we know that our model will probably make some errors. The best model, or best choice of coefficients Beta 0 and Beta 1, has the smallest error terms, or smallest residuals.
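Before looking at the residuals point by point, here is a minimal Python sketch of such a one-variable least-squares fit. The temperature and price values are made up purely for illustration (they are not the actual wine data from the lecture), and the closed-form expressions for Beta 1 and Beta 0 are the standard least-squares estimates.

```python
import numpy as np

# Hypothetical data: average growing season temperature (x) and wine price (y).
# Illustrative values only, not the actual data set from the lecture.
x = np.array([15.0, 15.8, 16.2, 16.8, 17.1, 17.5])
y = np.array([6.2, 6.6, 7.0, 7.3, 7.5, 8.0])

# Least-squares estimates for the model y = Beta0 + Beta1 * x + epsilon:
# Beta1 = cov(x, y) / var(x), Beta0 = mean(y) - Beta1 * mean(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# Predictions and residuals (epsilon_i = actual y_i minus the prediction for point i)
predictions = beta0 + beta1 * x
residuals = y - predictions

print(f"Beta0 = {beta0:.3f}, Beta1 = {beta1:.3f}")
print("residuals:", np.round(residuals, 3))
```

The coefficients chosen this way are exactly the ones that minimize the sum of the squared residuals, which is the criterion the rest of this section builds on.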
This figure shows the blue line that we drew in the beginning. We can compute the residuals, or errors, of this line for each data point. For example, for this point the actual value is about 6.2. Using our regression model, we predict about 6.5. So the error for this data point is negative 0.3, which is the actual value minus our prediction. As another example, for this point the actual value is about 8. Using our regression model, we predict about 7.5. So the error for this data point is about 0.5, again the actual value minus our prediction.

One measure of the quality of a regression line is the sum of squared errors, or SSE. This is the sum of the squared residuals, or error terms. Let n equal the number of data points that we have in our data set. Then the sum of squared errors is equal to the error we make on the first data point, squared, plus the error we make on the second data point, squared, and so on, up to the error we make on the n-th data point, squared.

We can compute the sum of squared errors for both the red line and the blue line. As expected, the blue line is a better fit than the red line, since it has a smaller sum of squared errors. The line that gives the minimum sum of squared errors is shown in green. This is the line that our regression model will find.

Although the sum of squared errors allows us to compare lines on the same data set, it's hard to interpret for two reasons. The first is that it scales with n, the number of data points. If we built the same model with twice as much data, the sum of squared errors might be twice as big, but this wouldn't mean it's a worse model. The second is that the units are hard to understand: the sum of squared errors is in squared units of the dependent variable. Because of these problems, Root Mean Squared Error, or RMSE, is often used. This divides the sum of squared errors by n and then takes the square root, so it's normalized by n and is in the same units as the dependent variable.
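To make these two error measures concrete, the short Python sketch below computes SSE and RMSE from a vector of actual values and a vector of model predictions. The numbers are hypothetical, chosen only to mirror the per-point errors worked out above; they are not the lecture's wine data.

```python
import numpy as np

def sse(actual, predicted):
    """Sum of squared errors: add up (actual_i - predicted_i)^2 over all n points."""
    return np.sum((actual - predicted) ** 2)

def rmse(actual, predicted):
    """Root mean squared error: divide SSE by n, then take the square root,
    which puts the error back in the units of the dependent variable."""
    return np.sqrt(sse(actual, predicted) / len(actual))

# Hypothetical actual prices and model predictions (illustrative values only).
actual = np.array([6.2, 7.1, 8.0, 7.4, 6.8])
predicted = np.array([6.5, 7.0, 7.5, 7.6, 6.9])

errors = actual - predicted  # e.g. 6.2 - 6.5 = -0.3 and 8.0 - 7.5 = 0.5, as above
print("errors:", np.round(errors, 2))
print("SSE: ", sse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```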
Another common error measure for linear regression is R squared. This error measure is nice because it compares the best model to a baseline model, the model that does not use any variables, or the red line from before. The baseline model predicts the average value of the dependent variable, regardless of the value of the independent variable. We can compute that the sum of squared errors for the best fit line, or the green line, is 5.73, and the sum of squared errors for the baseline, or the red line, is 10.15. The sum of squared errors for the baseline model is also known as the total sum of squares, commonly referred to as SST. Then the formula for R squared is R squared = 1 - SSE/SST, that is, 1 minus the sum of squared errors divided by the total sum of squares. In this case it equals 1 minus 5.73 divided by 10.15, which equals 0.44. R squared is nice because it captures the value added from using a linear regression model over just predicting the average outcome for every data point.

So what values do we expect to see for R squared? Well, both the sum of squared errors and the total sum of squares have to be greater than or equal to zero, because they are sums of squared terms, so they can't be negative. Additionally, the sum of squared errors has to be less than or equal to the total sum of squares. This is because our linear regression model could just set the coefficient for the independent variable to 0, and then we would have the baseline model. So our linear regression model will never be worse than the baseline model. In the worst case, the sum of squared errors equals the total sum of squares, and our R squared is equal to 0, which means no improvement over the baseline. In the best case, our linear regression model makes no errors, the sum of squared errors is equal to 0, and our R squared is equal to 1. So an R squared equal to 1, or close to 1, means a perfect or almost perfect predictive model.

R squared is nice because it's unitless and therefore universally interpretable between problems. However, it can still be hard to compare between problems. Good models for easy problems will have an R squared close to 1, but good models for hard problems can still have an R squared close to zero. Throughout this course we will see examples of both types of problems.
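As a closing numerical check of the R squared formula from this section, the small Python sketch below plugs in the SSE and SST values quoted for the wine example (5.73 and 10.15) and the two edge cases described above. The helper function is just an illustration of R squared = 1 - SSE/SST.

```python
def r_squared(sse, sst):
    """R squared = 1 - SSE / SST: the improvement over the baseline model
    that always predicts the average of the dependent variable."""
    return 1 - sse / sst

# Values quoted in the lecture for the wine example.
sse_best_fit = 5.73    # sum of squared errors of the green (best-fit) line
sst_baseline = 10.15   # sum of squared errors of the red (baseline) line, i.e. SST

print(round(r_squared(sse_best_fit, sst_baseline), 2))  # -> 0.44

# Edge cases: SSE equal to SST gives 0 (no improvement over the baseline),
# SSE equal to 0 gives 1 (a perfect predictive model).
print(r_squared(sst_baseline, sst_baseline))  # -> 0.0
print(r_squared(0.0, sst_baseline))           # -> 1.0
```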