1 00:00:04,940 --> 00:00:12,350 Our wine model had an R-squared value of 0.83, which tells us how accurate our model 2 00:00:12,350 --> 00:00:15,720 is on the data we used to construct the model. 3 00:00:15,720 --> 00:00:21,080 So we know our model does a good job predicting the data it's seen. 4 00:00:21,080 --> 00:00:26,540 But we also want a model that does well on new data or data it's never seen before 5 00:00:26,540 --> 00:00:30,060 so that we can use the model to make predictions for later years. 6 00:00:30,060 --> 00:00:32,280 Bordeaux wine buyers 7 00:00:32,280 --> 00:00:35,820 profit from being able to predict the quality of a wine 8 00:00:35,820 --> 00:00:38,010 years before it matures. 9 00:00:38,010 --> 00:00:40,980 They know the values of the independent variables, age 10 00:00:40,980 --> 00:00:44,880 and weather, but they don't know the price the wine. 11 00:00:44,880 --> 00:00:49,519 So it's important to build a model that does well at predicting data it's never seen 12 00:00:49,519 --> 00:00:50,519 before. 13 00:00:50,519 --> 00:00:57,030 The data that we use to build a model is often called the training data, 14 00:00:57,030 --> 00:01:01,309 and the new data is often called the test data. 15 00:01:01,309 --> 00:01:12,680 The accuracy of the model on the test data is often referred to as out-of-sample accuracy. 16 00:01:12,680 --> 00:01:17,640 Let's see how well our model performs on some test data in R. 17 00:01:17,640 --> 00:01:21,460 We have two data points that we did not use to build our model 18 00:01:21,460 --> 00:01:25,360 in the file "wine_test.csv". 19 00:01:25,360 --> 00:01:31,490 Let's load this new data file into R. We'll call it wineTest, 20 00:01:31,490 --> 00:01:34,710 and we'll use the read.csv function to read in the data 21 00:01:34,710 --> 00:01:39,979 file "wine_test.csv". 22 00:01:39,979 --> 00:01:43,920 If we take a look at the structure of wineTest, we can see 23 00:01:43,920 --> 00:01:48,039 that we have two observations of the same variables we had before. 24 00:01:48,039 --> 00:01:54,580 These data points are for the years 1979 and 1980. 25 00:01:54,580 --> 00:02:00,140 To make predictions for these two test points, we'll use the function predict. 26 00:02:00,140 --> 00:02:06,460 We'll call our predictions predictTest, and we'll use the predict function. 27 00:02:06,460 --> 00:02:10,240 The first argument to this function is the name of our model. 28 00:02:10,240 --> 00:02:14,060 Here the name of our model is model4. 29 00:02:14,060 --> 00:02:19,840 Then, we say newdata equals name of the data set that we want to 30 00:02:19,840 --> 00:02:25,960 make predictions for, in this case wineTest. 31 00:02:25,960 --> 00:02:31,390 If we look at the values in predictTest, we can see that for the first data point we 32 00:02:31,390 --> 00:02:41,340 predict 6.768925, and for the second data point we predict 6.684910. 33 00:02:41,340 --> 00:02:46,620 If we look at our structure output, we can see that the actual Price for the first 34 00:02:46,630 --> 00:02:51,400 data point is 6.95, and the actual Price for the second 35 00:02:51,400 --> 00:02:54,120 data point is 6.5. 36 00:02:54,120 --> 00:02:57,460 So it looks like our predictions are pretty good. 37 00:02:57,460 --> 00:03:00,640 Let's verify this by computing the R-squared value 38 00:03:00,640 --> 00:03:04,400 for our test set. 39 00:03:04,400 --> 00:03:11,440 Recall that the formula for R-squared is: R-squared equals 40 00:03:11,440 --> 00:03:16,920 1 minus the Sum of Squared Errors divided 41 00:03:16,920 --> 00:03:21,730 by the Total Sum of Squares. 42 00:03:21,730 --> 00:03:24,920 So let's start by computing the Sum of Squared Errors 43 00:03:24,920 --> 00:03:26,540 on our test set. 44 00:03:26,540 --> 00:03:34,780 The Sum of Squared Errors equals the sum of the actual values wineTest 45 00:03:34,780 --> 00:03:43,360 dollar sign Price minus our predictions predictTest squared, 46 00:03:43,360 --> 00:03:45,620 and then summed. 47 00:03:45,620 --> 00:03:51,000 The Total Sum of Squares equals, the sum again of the actual 48 00:03:51,000 --> 00:03:55,260 values wineTest$price, 49 00:03:55,260 --> 00:04:00,380 and difference between the mean of the prices on the training set which is our 50 00:04:00,380 --> 00:04:02,080 baseline model. 51 00:04:02,080 --> 00:04:06,480 We square these values and add them up. 52 00:04:06,480 --> 00:04:13,000 To compute the R-squared now, we type 1 minus 53 00:04:13,000 --> 00:04:17,660 Sum of Squared Errors divided by the Total Sum of Squares. 54 00:04:17,660 --> 00:04:22,060 And we see that the out-of-sample R-squared on this test set 55 00:04:22,060 --> 00:04:26,100 is .7944278. 56 00:04:26,100 --> 00:04:29,260 This is a pretty good out-of-sample R-squared. 57 00:04:29,260 --> 00:04:35,670 But while we do well on these two test points, keep in mind that our test set is really small. 58 00:04:35,670 --> 00:04:40,190 We should increase the size of our test set to be more confident about the out-of-sample 59 00:04:40,190 --> 00:04:43,290 accuracy of our model. 60 00:04:43,290 --> 00:04:47,860 We can compute the test set R-squared for several different models. 61 00:04:47,860 --> 00:04:52,180 This shows the model R-squared and the test set R-squared for our 62 00:04:52,180 --> 00:04:55,400 model as we add more variables. 63 00:04:55,400 --> 00:05:00,160 We saw that the model R-squared will always increases or stays the same 64 00:05:00,160 --> 00:05:02,420 as we add more variables. 65 00:05:02,420 --> 00:05:06,360 However, this is not true for the test set. 66 00:05:06,360 --> 00:05:09,460 We want to look for a model with a good model R-squared 67 00:05:09,460 --> 00:05:12,900 but also with a good test set R-squared. 68 00:05:12,900 --> 00:05:14,600 In this case we would need 69 00:05:14,600 --> 00:05:18,200 more data to be conclusive since two data points in 70 00:05:18,200 --> 00:05:22,560 the test set are not really enough to reach any conclusions. 71 00:05:22,560 --> 00:05:24,900 However, it looks like our model that 72 00:05:24,900 --> 00:05:27,380 uses Average Growing Season Temperature, 73 00:05:27,380 --> 00:05:30,740 Harvest Rain, Age, and Winter Rain does 74 00:05:30,740 --> 00:05:34,580 very well in sample on the training set 75 00:05:34,580 --> 00:05:38,480 as well as out-of-sample on the test set. 76 00:05:38,480 --> 00:05:43,380 Note here that the test set R-squared can actually be negative. 77 00:05:43,389 --> 00:05:48,580 The model R-squared is never negative since our model can't do worse on the training 78 00:05:48,580 --> 00:05:51,040 data than the baseline model. 79 00:05:51,040 --> 00:05:57,220 However, our model can do worse on the test data then the baseline model does. 80 00:05:57,220 --> 00:06:01,300 This leads to a negative R-squared value. 81 00:06:01,300 --> 00:06:07,920 But it looks like our model Average Growing Season Temperature, Harvest Rain, Age, and Winter Rain 82 00:06:07,920 --> 00:06:10,000 beats the baseline model. 83 00:06:10,000 --> 00:06:16,620 We'll see in the next video how well Ashenfelter did using this model to make predictions.