1
00:00:04,940 --> 00:00:12,350
Our wine model had an R-squared value
of 0.83, which tells us how accurate our model

2
00:00:12,350 --> 00:00:15,720
is
on the data we used to construct the model.

3
00:00:15,720 --> 00:00:21,080
So we know our model does a good job predicting
the data it's seen.

4
00:00:21,080 --> 00:00:26,540
But we also want a model that does well
on new data or data it's never seen before

5
00:00:26,540 --> 00:00:30,060
so that we can use
the model to make predictions for later years.

6
00:00:30,060 --> 00:00:32,280
Bordeaux
wine buyers

7
00:00:32,280 --> 00:00:35,820
profit from being able to predict the quality
of a wine

8
00:00:35,820 --> 00:00:38,010
years before it matures.

9
00:00:38,010 --> 00:00:40,980
They know the values of the independent variables,
age

10
00:00:40,980 --> 00:00:44,880
and weather, but they don't know the price of
the wine.

11
00:00:44,880 --> 00:00:49,519
So it's important to build a model that
does well at predicting data it's never seen

12
00:00:49,519 --> 00:00:50,519
before.

13
00:00:50,519 --> 00:00:57,030
The data that we use to build a model
is often called the training data,

14
00:00:57,030 --> 00:01:01,309
and the new data is often called the test
data.

15
00:01:01,309 --> 00:01:12,680
The accuracy of the model on the test data
is often referred to as out-of-sample accuracy.

16
00:01:12,680 --> 00:01:17,640
Let's see how well our model performs
on some test data in R.

17
00:01:17,640 --> 00:01:21,460
We have two data points that we did not use
to build our model

18
00:01:21,460 --> 00:01:25,360
in the file "wine_test.csv".

19
00:01:25,360 --> 00:01:31,490
Let's load this new data file into R.
We'll call it wineTest,

20
00:01:31,490 --> 00:01:34,710
and we'll use the read.csv function to read
in the data

21
00:01:34,710 --> 00:01:39,979
file "wine_test.csv".

22
00:01:39,979 --> 00:01:43,920
If we take a look at the structure of wineTest,
we can see

23
00:01:43,920 --> 00:01:48,039
that we have two observations of the same
variables we had before.

24
00:01:48,039 --> 00:01:54,580
These data points are for the years 1979
and 1980.

25
00:01:54,580 --> 00:02:00,140
To make predictions for these two test points,
we'll use the function predict.

26
00:02:00,140 --> 00:02:06,460
We'll call our predictions predictTest,
and we'll use the predict function.

27
00:02:06,460 --> 00:02:10,240
The first argument to this function is
the name of our model.

28
00:02:10,240 --> 00:02:14,060
Here the name of our model is model4.

29
00:02:14,060 --> 00:02:19,840
Then, we say newdata equals the
name of the data set that we want to

30
00:02:19,840 --> 00:02:25,960
make predictions for, in this case wineTest.
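
Here's a minimal sketch of this predict call. The real wine files aren't reproduced here, so the data frames below hold made-up numbers. The short column names (AGST, HarvestRain, Age, WinterRain, Price) and the training data frame name wine are assumptions; only model4, wineTest, predictTest, read.csv, and the newdata argument come from the steps just described.

```r
# Stand-in training data: these values are simulated for illustration
# and are NOT the contents of the course's wine files.
set.seed(1)
wine <- data.frame(
  AGST        = runif(25, 15, 18),    # assumed name: average growing season temp
  HarvestRain = runif(25, 40, 300),
  Age         = 31:7,
  WinterRain  = runif(25, 350, 850)
)
wine$Price <- -3 + 0.6 * wine$AGST - 0.004 * wine$HarvestRain +
  0.02 * wine$Age + 0.001 * wine$WinterRain + rnorm(25, sd = 0.3)
# With the real file this would be: wineTest <- read.csv("wine_test.csv")
# Here we build two made-up test rows with the same columns instead.
wineTest <- data.frame(
  AGST = c(16.2, 16.0), HarvestRain = c(120, 75),
  Age = c(4, 3), WinterRain = c(700, 580), Price = c(7.1, 6.8)
)
str(wineTest)  # two observations of the same variables as the training data
# Fit on the training data, then predict for the unseen rows via newdata.
model4 <- lm(Price ~ AGST + HarvestRain + Age + WinterRain, data = wine)
predictTest <- predict(model4, newdata = wineTest)
print(predictTest)
```

Without the newdata argument, predict would return fitted values for the training data rather than predictions for the test points.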

31
00:02:25,960 --> 00:02:31,390
If we look at the values in predictTest,
we can see that for the first data point we

32
00:02:31,390 --> 00:02:41,340
predict 6.768925, and for the
second data point we predict 6.684910.

33
00:02:41,340 --> 00:02:46,620
If we look at our structure output,
we can see that the actual Price for the first

34
00:02:46,630 --> 00:02:51,400
data point
is 6.95, and the actual Price for the second

35
00:02:51,400 --> 00:02:54,120
data point
is 6.5.

36
00:02:54,120 --> 00:02:57,460
So it looks like our predictions are pretty
good.

37
00:02:57,460 --> 00:03:00,640
Let's verify this by computing the
R-squared value

38
00:03:00,640 --> 00:03:04,400
for our test set.

39
00:03:04,400 --> 00:03:11,440
Recall that the formula for R-squared
is: R-squared equals

40
00:03:11,440 --> 00:03:16,920
1 minus the Sum of Squared Errors divided

41
00:03:16,920 --> 00:03:21,730
by the Total Sum of Squares.

42
00:03:21,730 --> 00:03:24,920
So let's start by computing the
Sum of Squared Errors

43
00:03:24,920 --> 00:03:26,540
on our test set.

44
00:03:26,540 --> 00:03:34,780
The Sum of Squared Errors equals the
sum of the actual values wineTest

45
00:03:34,780 --> 00:03:43,360
dollar sign Price minus our
predictions predictTest squared,

46
00:03:43,360 --> 00:03:45,620
and then summed.

47
00:03:45,620 --> 00:03:51,000
The Total Sum of Squares equals
the sum again of the actual

48
00:03:51,000 --> 00:03:55,260
values wineTest$Price,

49
00:03:55,260 --> 00:04:00,380
minus the mean of the
prices on the training set, which is our

50
00:04:00,380 --> 00:04:02,080
baseline model.

51
00:04:02,080 --> 00:04:06,480
We square these values and add them up.

52
00:04:06,480 --> 00:04:13,000
To compute the R-squared now, we type 1 minus

53
00:04:13,000 --> 00:04:17,660
Sum of Squared Errors divided by the
Total Sum of Squares.
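
The three steps just described can be sketched as a small helper. The function name outOfSampleR2 and its argument names are hypothetical, and the numbers in the example call are made up so the arithmetic is easy to check by hand; they are not the wine data.

```r
# Out-of-sample R-squared: compare the model's squared errors on the test set
# with those of the baseline, which always predicts the TRAINING-set mean.
outOfSampleR2 <- function(actual, predicted, trainMean) {
  SSE <- sum((actual - predicted)^2)   # Sum of Squared Errors on the test set
  SST <- sum((actual - trainMean)^2)   # Total Sum of Squares vs. the baseline
  1 - SSE / SST
}
# Made-up numbers: SSE = 0.5^2 + 0.5^2 = 0.5 and SST = 1^2 + 1^2 = 2,
# so R-squared = 1 - 0.5 / 2 = 0.75.
r2 <- outOfSampleR2(actual = c(3, 5), predicted = c(2.5, 5.5), trainMean = 4)
print(r2)
```

With the objects from this video, the call would look like outOfSampleR2(wineTest$Price, predictTest, mean(wine$Price)), assuming the training data frame is named wine.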

54
00:04:17,660 --> 00:04:22,060
And we see that the
out-of-sample R-squared on this test set

55
00:04:22,060 --> 00:04:26,100
is 0.7944278.

56
00:04:26,100 --> 00:04:29,260
This is a pretty good out-of-sample R-squared.

57
00:04:29,260 --> 00:04:35,670
But while we do well on these two test points,
keep in mind that our test set is really small.

58
00:04:35,670 --> 00:04:40,190
We should increase the size of our test set
to be more confident about the out-of-sample

59
00:04:40,190 --> 00:04:43,290
accuracy
of our model.

60
00:04:43,290 --> 00:04:47,860
We can compute the test set R-squared
for several different models.

61
00:04:47,860 --> 00:04:52,180
This shows the model R-squared
and the test set R-squared for our

62
00:04:52,180 --> 00:04:55,400
model
as we add more variables.

63
00:04:55,400 --> 00:05:00,160
We saw that the model R-squared always
increases or stays the same

64
00:05:00,160 --> 00:05:02,420
as we add more variables.

65
00:05:02,420 --> 00:05:06,360
However, this is not true for the test set.
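
Here's a rough sketch of that comparison on simulated data, not the wine data. The variable names x1, x2, x3, and y and the data frames train and test are all hypothetical. Only x1 actually drives y, so x2 and x3 play the role of extra variables added to the model.

```r
# Simulated data for illustration only.
set.seed(42)
train <- data.frame(x1 = rnorm(25), x2 = rnorm(25), x3 = rnorm(25))
train$y <- 1 + 2 * train$x1 + rnorm(25)
test <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
test$y <- 1 + 2 * test$x1 + rnorm(10)
# Nested models: each formula adds one more variable to the previous one.
formulas <- list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3)
results <- t(sapply(formulas, function(f) {
  fit  <- lm(f, data = train)
  pred <- predict(fit, newdata = test)
  SSE  <- sum((test$y - pred)^2)
  SST  <- sum((test$y - mean(train$y))^2)  # baseline: training-set mean
  c(trainR2 = summary(fit)$r.squared, testR2 = 1 - SSE / SST)
}))
# One row per model. The trainR2 column never decreases as variables are
# added; the testR2 column carries no such guarantee.
print(results)
```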

66
00:05:06,360 --> 00:05:09,460
We want to look for a model with a
good model R-squared

67
00:05:09,460 --> 00:05:12,900
but also with a good test set R-squared.

68
00:05:12,900 --> 00:05:14,600
In this case we would need

69
00:05:14,600 --> 00:05:18,200
more data
to be conclusive since two data points in

70
00:05:18,200 --> 00:05:22,560
the test set
are not really enough to reach any conclusions.

71
00:05:22,560 --> 00:05:24,900
However, it looks like our model that

72
00:05:24,900 --> 00:05:27,380
uses Average Growing Season Temperature,

73
00:05:27,380 --> 00:05:30,740
Harvest Rain, Age, and Winter Rain does

74
00:05:30,740 --> 00:05:34,580
very well in sample on the training set

75
00:05:34,580 --> 00:05:38,480
as well as out-of-sample on the test set.

76
00:05:38,480 --> 00:05:43,380
Note here that the test set
R-squared can actually be negative.

77
00:05:43,389 --> 00:05:48,580
The model R-squared is never negative
since our model can't do worse on the training

78
00:05:48,580 --> 00:05:51,040
data than the baseline model.

79
00:05:51,040 --> 00:05:57,220
However, our model can do worse on the test
data than the baseline model does.

80
00:05:57,220 --> 00:06:01,300
This leads to a negative R-squared value.
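
To see how that can happen, here's a tiny made-up example where the predictions miss by more than the baseline does. The numbers and the variable names (actual, predicted, trainMean, SSE, SST, R2) are hypothetical and chosen only to keep the arithmetic simple.

```r
actual    <- c(3, 5)   # made-up test-set values
predicted <- c(6, 2)   # made-up predictions, each off by 3
trainMean <- 4         # made-up baseline: mean of the training-set outcome
SSE <- sum((actual - predicted)^2)   # 3^2 + 3^2 = 18
SST <- sum((actual - trainMean)^2)   # 1^2 + 1^2 = 2
R2  <- 1 - SSE / SST                 # 1 - 18 / 2 = -8
print(R2)
```

Because the baseline is off by only 1 on each point while the predictions are off by 3, SSE is larger than SST and the ratio pushes R-squared below zero.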

81
00:06:01,300 --> 00:06:07,920
But it looks like our model with Average Growing Season Temperature, Harvest Rain, Age, and Winter Rain

82
00:06:07,920 --> 00:06:10,000
beats the baseline model.

83
00:06:10,000 --> 00:06:16,620
We'll see in the next video how well Ashenfelter
did using this model to make predictions.