1
00:00:04,500 --> 00:00:08,560
In the previous video, we only
used one independent variable,

2
00:00:08,560 --> 00:00:10,520
but there are many
different variables

3
00:00:10,520 --> 00:00:13,380
that could be used to
predict wine price.

4
00:00:13,380 --> 00:00:16,430
We used average growing
season temperature,

5
00:00:16,430 --> 00:00:20,690
but we also have data for other
weather-related variables--

6
00:00:20,690 --> 00:00:23,330
harvest rain and winter rain.

7
00:00:23,330 --> 00:00:27,720
Additionally, the age of wine
is suspected to be important,

8
00:00:27,720 --> 00:00:29,470
and many other
variables could also

9
00:00:29,470 --> 00:00:34,080
be used, such as the
population of France.

10
00:00:34,080 --> 00:00:38,760
We can use each variable in a
one variable regression model.

11
00:00:38,760 --> 00:00:40,750
Average growing
season temperature

12
00:00:40,750 --> 00:00:46,730
gives the best R squared of
0.44, followed by harvest rain

13
00:00:46,730 --> 00:00:50,290
with an R squared of 0.32.

14
00:00:50,290 --> 00:00:53,170
France's population
and age, both

15
00:00:53,170 --> 00:00:56,650
give models with an R
squared around 0.2,

16
00:00:56,650 --> 00:01:00,980
and winter rain gives a
pretty low R squared of 0.02,

17
00:01:00,980 --> 00:01:04,330
or just barely better
than the baseline.

18
00:01:04,330 --> 00:01:07,950
So if we only used one
variable, average growing season

19
00:01:07,950 --> 00:01:10,340
temperature is the best choice.

20
00:01:10,340 --> 00:01:13,100
But multiple linear
regression allows

21
00:01:13,100 --> 00:01:18,630
you to use multiple variables
at once to improve the model.

22
00:01:18,630 --> 00:01:21,090
The multiple linear
regression model

23
00:01:21,090 --> 00:01:23,440
is similar to the one
variable regression

24
00:01:23,440 --> 00:01:26,250
model, but has a
coefficient beta

25
00:01:26,250 --> 00:01:28,960
for each independent variable.

26
00:01:28,960 --> 00:01:32,030
We predict the
dependent variable y

27
00:01:32,030 --> 00:01:37,850
using the independent
variables x1, x2, through xk,

28
00:01:37,850 --> 00:01:41,300
where k here denotes the
number of independent variables

29
00:01:41,300 --> 00:01:43,430
in our model.

30
00:01:43,430 --> 00:01:45,789
Beta 0 is, again,
the coefficient

31
00:01:45,789 --> 00:01:51,800
for our intercept term, and
beta 1, beta 2, through beta k

32
00:01:51,800 --> 00:01:55,640
are the coefficients for
the independent variables.

33
00:01:55,640 --> 00:01:59,680
We use i to denote the data
for a particular data point

34
00:01:59,680 --> 00:02:01,650
or observation.

35
00:02:01,650 --> 00:02:05,610
The best model is selected
in the same way as before:

36
00:02:05,610 --> 00:02:09,389
by minimizing the sum of squared
errors, using the error terms,

37
00:02:09,389 --> 00:02:11,880
epsilon.

38
00:02:11,880 --> 00:02:15,270
We can start by building a
linear regression model that

39
00:02:15,270 --> 00:02:19,290
just uses the variable with
the best R squared-- average

40
00:02:19,290 --> 00:02:21,400
growing season temperature.

41
00:02:21,400 --> 00:02:27,030
We saw before that this gives
us an R squared of 0.44.

42
00:02:27,030 --> 00:02:29,829
Then we can add
variables one at a time

43
00:02:29,829 --> 00:02:33,510
and look at the
improvement in R squared.

44
00:02:33,510 --> 00:02:37,040
Note that the improvement is
not equal to the one variable

45
00:02:37,040 --> 00:02:40,640
R squared for each
independent variable we add,

46
00:02:40,640 --> 00:02:43,329
since there are interactions
between the independent

47
00:02:43,329 --> 00:02:45,300
variables.

48
00:02:45,300 --> 00:02:47,480
Adding independent
variables improves

49
00:02:47,480 --> 00:02:50,079
the R squared to
almost double what

50
00:02:50,079 --> 00:02:52,930
it was with a single
independent variable.

51
00:02:52,930 --> 00:02:55,280
But there are
diminishing returns.

52
00:02:55,280 --> 00:02:58,860
The marginal improvement from
adding an additional variable

53
00:02:58,860 --> 00:03:02,620
decreases as we add
more and more variables.

54
00:03:02,620 --> 00:03:05,550
So which model should we use?

55
00:03:05,550 --> 00:03:08,670
Often not all variables
should be used.

56
00:03:08,670 --> 00:03:11,930
This is because each additional
variable used requires

57
00:03:11,930 --> 00:03:14,740
more data, and
using more variables

58
00:03:14,740 --> 00:03:17,579
creates a more
complicated model.

59
00:03:17,579 --> 00:03:20,150
Overly complicated
models often cause

60
00:03:20,150 --> 00:03:22,600
what's known as overfitting.

61
00:03:22,600 --> 00:03:24,900
This is when you have
a higher R squared

62
00:03:24,900 --> 00:03:27,190
on data used to
create the model,

63
00:03:27,190 --> 00:03:30,420
but bad performance
on unseen data.

64
00:03:30,420 --> 00:03:33,260
For example, suppose we
want to use this model

65
00:03:33,260 --> 00:03:36,680
to make a prediction
for the year 2013.

66
00:03:36,680 --> 00:03:39,920
We expect an overfit
model to perform poorly

67
00:03:39,920 --> 00:03:42,110
on this new data.

68
00:03:42,110 --> 00:03:44,950
In the next video, we'll learn
how to build regression

69
00:03:44,950 --> 00:03:47,400
models in R, and
then we'll discuss

70
00:03:47,400 --> 00:03:49,180
how to select the
variables that should

71
00:03:49,180 --> 00:03:52,180
be included in the final model.