In our R Console, let's start by loading our data set. Don't forget to make sure you're in the directory containing the file wine.csv first. We'll call our data frame wine, and we'll use the read.csv function to read in the data file wine.csv.

We can look at the structure of our data by using the str function. We can see that we have a data frame with 25 observations of seven different variables. Year gives the year the wine was produced, and it's just a unique identifier for each observation. Price is the dependent variable we're trying to predict. And WinterRain, AGST, HarvestRain, Age, and FrancePop are the independent variables we'll use to predict Price.

We can also look at the statistical summary of our data using the summary function. This gives us information about the range of values for each variable in our data set.

Let's now create a one-variable linear regression equation using AGST to predict Price. We'll call our regression model model1, and we'll use the lm function, which stands for linear model. This is the function we'll use whenever we want to build a linear regression model.
Then inside parentheses, type Price, our dependent variable, then a tilde symbol, and then AGST, the independent variable we'll use in this model. Then after a comma, we need to add data = wine to tell the lm function what data set to use to build the model. We're saving the output of the lm function to the variable named model1, so when we hit Enter, we don't see any output, because it's been saved to the variable model1.

Let's take a look at the summary of model1. The first thing we see is a description of the function we used to build the model. Then we see a summary of the residuals, or error terms. Following that is a description of the coefficients of our model. The first row corresponds to the intercept term, and the second row corresponds to our independent variable, AGST. The Estimate column gives estimates of the beta values for our model. So here beta 0, the coefficient for the intercept term, is estimated to be -3.4, and beta 1, the coefficient for our independent variable, is estimated to be 0.635.
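The steps so far can be sketched as the following R commands, assuming wine.csv is present in the current working directory:

```r
# Load the data set (wine.csv must be in the working directory)
wine <- read.csv("wine.csv")

# Inspect the structure: 25 observations of 7 variables
str(wine)

# Statistical summary (range of values for each variable)
summary(wine)

# One-variable linear regression: predict Price from AGST
model1 <- lm(Price ~ AGST, data = wine)

# Residuals, coefficient estimates, and R-squared
summary(model1)
```

The formula notation `Price ~ AGST` reads as "Price as a function of AGST"; lm adds the intercept term automatically.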
There's additional information in this table that we'll discuss in the next video.

Towards the bottom of the output, you can see Multiple R-squared, 0.435, which is the R-squared value that we discussed in the previous video. Beside it is a number labeled Adjusted R-squared; in this case, it's 0.41. This number adjusts the R-squared value to account for the number of independent variables used relative to the number of data points. Multiple R-squared will always increase if you add more independent variables, but Adjusted R-squared will decrease if you add an independent variable that doesn't help the model. This is a good way to determine whether an additional variable should even be included in the model. We'll also discuss other ways to select important independent variables in the next video.

Let's also compute the sum of squared errors, or SSE, for our model. Our residuals, or error terms, are stored in the vector model1$residuals. By hitting Enter, we can see the values of all of our residuals. We can compute the sum of squared errors by taking sum(model1$residuals^2).
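Assuming model1 has been built as above, the SSE computation looks like this:

```r
# The residuals (error terms) are stored in the fitted model object
model1$residuals

# Sum of squared errors: square each residual and add them up
SSE <- sum(model1$residuals^2)
SSE
```

On this data set the value printed should be about 5.73.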
If we type SSE and hit Enter, we can see that our sum of squared errors is 5.73.

Now let's add another variable to our regression model, HarvestRain. We'll call our new model model2, and again we'll use the lm function to predict Price, but this time using AGST and HarvestRain. When you want to use more than one independent variable, you can just separate them with a plus sign, like we did here. Then we again need to indicate that the data set to use is wine.

Let's take a look at the summary of our new model using the summary function. We now have a third row in our Coefficients table, corresponding to HarvestRain. The coefficient estimate for this new independent variable is -0.00457. And if you look at the R-squared near the bottom of the output, you can see that this variable really helped our model: our Multiple R-squared and Adjusted R-squared both increased significantly compared to the previous model. This indicates that this new model is probably better than the previous model.

Let's now also compute the sum of squared errors for this new model: SSE equals sum(model2$residuals^2).
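The two-variable model described above can be sketched as:

```r
# Two-variable model: independent variables are joined with a plus sign
model2 <- lm(Price ~ AGST + HarvestRain, data = wine)

# Both R-squared values should increase relative to model1
summary(model2)

# Sum of squared errors for model2
SSE <- sum(model2$residuals^2)
SSE
```

Note that the `+` here is formula notation meaning "also include this variable", not arithmetic addition.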
If we type SSE, we can see that the sum of squared errors for model2 is 2.97, which is much better than the sum of squared errors for model1.

Now let's build a third model, this time with all of our independent variables. We'll call this one model3, and again use the lm function to predict Price, this time using AGST and HarvestRain and WinterRain and Age and FrancePop. Again, we need to tell the lm function to use the data set wine.

Let's take a look at the summary of model3. Now the Coefficients table has six rows: one for the intercept and one for each of the five independent variables. If we look at the bottom of the output, we can again see that the Multiple R-squared and Adjusted R-squared have both increased.

Let's now compute the sum of squared errors for this new model: SSE equals sum(model3$residuals^2). And if we type SSE, we can see that the sum of squared errors for model3 is 1.7, even better than before.

In the next video, we'll determine if we should keep all of these variables in our final model.
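Putting the final step together, the full five-variable model looks like this:

```r
# Five-variable model using all of the independent variables
model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop,
             data = wine)

# Coefficients table now has six rows: intercept + 5 variables
summary(model3)

# Sum of squared errors for model3
SSE <- sum(model3$residuals^2)
SSE
```

On this data set the SSE printed should be about 1.7, lower than for model1 and model2.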