In our R Console, let's start by loading our data set. Don't forget to make sure you're in the directory containing the file wine.csv first. We'll call our data frame wine, and we'll use the read.csv function to read in the data file wine.csv.

We can look at the structure of our data by using the str function. We can see that we have a data frame with 25 observations of seven different variables. Year gives the year the wine was produced, and it's just a unique identifier for each observation. Price is the dependent variable we're trying to predict. And WinterRain, AGST, HarvestRain, Age, and FrancePop are the independent variables we'll use to predict Price.

We can also look at the statistical summary of our data using the summary function. This gives us information about the range of values for each variable in our data set.

Let's now create a one-variable linear regression equation using AGST to predict Price. We'll call our regression model model1, and we'll use the lm function, which stands for linear model. This is the function we'll use whenever we want to build a linear regression model.
Then inside parentheses, type Price, our dependent variable, then a tilde symbol, and then AGST, the independent variable we'll use in this model. Then after a comma, we need to add data = wine to tell the lm function what data set to use to build the model. We're saving the output of the lm function to the variable named model1, so when we hit Enter, we don't see any output, because it's been saved to the variable model1.

Let's take a look at the summary of model1. The first thing we see is a description of the function we used to build the model. Then we see a summary of the residuals, or error terms. Following that is a description of the coefficients of our model. The first row corresponds to the intercept term, and the second row corresponds to our independent variable, AGST. The Estimate column gives estimates of the beta values for our model. So here beta 0, the coefficient for the intercept term, is estimated to be -3.4, and beta 1, the coefficient for our independent variable, is estimated to be 0.635.
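The steps so far can be sketched as the following R commands, assuming wine.csv is present in the current working directory:

```r
# Load the data set (wine.csv must be in the working directory)
wine <- read.csv("wine.csv")

# Inspect the structure: 25 observations of 7 variables
str(wine)

# Statistical summary (range of values for each variable)
summary(wine)

# One-variable linear regression: predict Price from AGST
model1 <- lm(Price ~ AGST, data = wine)

# Residuals, coefficient estimates, and R-squared
summary(model1)
```

The formula notation `Price ~ AGST` reads as "Price as a function of AGST"; lm adds the intercept term automatically.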
There's additional information in this table that we'll discuss in the next video.

Towards the bottom of the output, you can see Multiple R-squared, 0.435, which is the R-squared value that we discussed in the previous video. Beside it is a number labeled Adjusted R-squared; in this case, it's 0.41. This number adjusts the R-squared value to account for the number of independent variables used relative to the number of data points. Multiple R-squared will always increase if you add more independent variables, but Adjusted R-squared will decrease if you add an independent variable that doesn't help the model. This is a good way to determine whether an additional variable should even be included in the model. We'll also discuss other ways to select important independent variables in the next video.

Let's also compute the sum of squared errors, or SSE, for our model. Our residuals, or error terms, are stored in the vector model1$residuals. By hitting Enter, we can see the values of all of our residuals. We can compute the sum of squared errors by taking sum(model1$residuals^2).
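Assuming model1 has been built as above, the SSE computation looks like this:

```r
# The residuals (error terms) are stored in the fitted model object
model1$residuals

# Sum of squared errors: square each residual and add them up
SSE <- sum(model1$residuals^2)
SSE
```

On this data set the value printed should be about 5.73.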
If we type SSE and hit Enter, we can see that our sum of squared errors is 5.73.

Now let's add another variable to our regression model, HarvestRain. We'll call our new model model2, and again we'll use the lm function to predict Price, but this time using AGST and HarvestRain. When you want to use more than one independent variable, you can just separate them with a plus sign, like we did here. Then we again need to indicate that the data set to use is wine.

Let's take a look at the summary of our new model using the summary function. We now have a third row in our Coefficients table, corresponding to HarvestRain. The coefficient estimate for this new independent variable is -0.00457. And if you look at the R-squared near the bottom of the output, you can see that this variable really helped our model: our Multiple R-squared and Adjusted R-squared both increased significantly compared to the previous model. This indicates that this new model is probably better than the previous model.

Let's now also compute the sum of squared errors for this new model: SSE equals sum(model2$residuals^2).
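The two-variable model described above can be sketched as:

```r
# Two-variable model: independent variables are joined with a plus sign
model2 <- lm(Price ~ AGST + HarvestRain, data = wine)

# Both R-squared values should increase relative to model1
summary(model2)

# Sum of squared errors for model2
SSE <- sum(model2$residuals^2)
SSE
```

Note that the `+` here is formula notation meaning "also include this variable", not arithmetic addition.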
If we type SSE, we can see that the sum of squared errors for model2 is 2.97, which is much better than the sum of squared errors for model1.

Now let's build a third model, this time with all of our independent variables. We'll call this one model3, and again use the lm function to predict Price, this time using AGST and HarvestRain and WinterRain and Age and FrancePop. Again, we need to tell the lm function to use the data set wine.

Let's take a look at the summary of model3. Now the Coefficients table has six rows: one for the intercept and one for each of the five independent variables. If we look at the bottom of the output, we can again see that the Multiple R-squared and Adjusted R-squared have both increased.

Let's now compute the sum of squared errors for this new model: SSE equals sum(model3$residuals^2). And if we type SSE, we can see that the sum of squared errors for model3 is 1.7, even better than before.

In the next video, we'll determine if we should keep all of these variables in our final model.
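Putting the final step together, the full five-variable model looks like this:

```r
# Five-variable model using all of the independent variables
model3 <- lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop,
             data = wine)

# Coefficients table now has six rows: intercept + 5 variables
summary(model3)

# Sum of squared errors for model3
SSE <- sum(model3$residuals^2)
SSE
```

On this data set the SSE printed should be about 1.7, lower than for model1 and model2.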