So now let's build an equation to predict points scored using some common basketball statistics. Our dependent variable would now be points, and our independent variables would be some of the common basketball statistics that we have in our data set: for example, the number of two-point field goal attempts, the number of three-point field goal attempts, offensive rebounds, defensive rebounds, assists, steals, blocks, turnovers, and free throw attempts. We can use all of these.

So let's build this regression and call it PointsReg. That will just be equal to lm of PTS regressed on all of those variables we just talked about: X2PA for two-point attempts, plus X3PA for three-point attempts, plus FTA for free throw attempts, AST for assists, ORB for offensive rebounds, DRB for defensive rebounds, TOV for turnovers, and STL for steals. And let's also throw in blocks, BLK. As always, the data comes from the NBA data set. So we can go ahead and run this command.

Now that we've created our points regression model, let's take a look at the summary. This is always the first thing you should do. Taking a look at this, we can see that some of our variables are indeed very, very significant. Others are less significant; for example, steals only has one significance star. And some don't seem to be significant at all: for example, defensive rebounds, turnovers, and blocks. We do have a pretty good R-squared value, 0.8992, which shows that there really is a linear relationship between points and all of these basketball statistics.

Let's compute the residuals here. We can do that by calling them directly: PointsReg$residuals. That gives us a giant list of them, and we'll use it to compute the sum of squared errors. SSE, standing for sum of squared errors, is just equal to sum(PointsReg$residuals^2). So what is the sum of squared errors here? It's quite a lot: 28,394,314. But the sum of squared errors is not really a very interpretable quantity.
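As a rough sketch, the commands described so far look like this in R (assuming the NBA data frame is already loaded into the session):

    # Fit a linear regression predicting points from common basketball statistics
    PointsReg <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + TOV + STL + BLK,
                    data = NBA)

    # Always inspect the summary first: coefficients, significance stars, R-squared
    summary(PointsReg)

    # Sum of squared errors, computed from the model's residuals
    SSE <- sum(PointsReg$residuals^2)
    SSE  # 28,394,314 in the run described here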
But remember, we can also calculate the root mean squared error, which is much more interpretable. It's more like the average error we make in our predictions. So let's calculate the root mean squared error, RMSE. RMSE is just equal to the square root of the sum of squared errors divided by n, where n here is the number of rows in our data set. The RMSE in our case is 184.4, so on average we make an error of about 184.4 points. That seems like quite a lot, until you remember that the average number of points in a season is, let's see, mean(NBA$PTS). This gives us the average number of points in a season, and it's 8,370. So if the average number of points is 8,370, being off by about 184.4 points is really not so bad.

But I think we still have room for improvement in this model. If you recall, not all the variables were significant. Let's see if we can remove some of the insignificant variables one at a time. We'll take a look again at our model with summary(PointsReg) in order to figure out which variable we should remove first. The first variable we would want to remove is probably turnovers. And why do I say turnovers? It's because the p-value for turnovers, which you see here in this column, 0.6859, is the highest of all of the p-values. That means turnovers is the least statistically significant variable in our model.

So let's create a new regression model without turnovers. An easy way to do this is to use the up arrow on your keyboard and scroll back through your previous commands until you find the command where you defined the regression model. Now let's delete turnovers, and then we can rename the model; we'll call this one PointsReg2. In our first regression model, PointsReg, we had an R-squared of 0.8992. Let's take a look at the R-squared of PointsReg2. We see that it's 0.8991, so it's almost exactly identical.
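Here's a sketch of those steps, with the same assumptions as above:

    # Root mean squared error: roughly the average error per observation
    RMSE <- sqrt(SSE / nrow(NBA))
    RMSE           # about 184.4
    mean(NBA$PTS)  # about 8,370 points per season, for scale

    # Refit without TOV (turnovers), the variable with the highest p-value
    PointsReg2 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + DRB + STL + BLK,
                     data = NBA)
    summary(PointsReg2)  # R-squared drops only from 0.8992 to 0.8991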
It does go down, as we would expect, but very, very slightly. So it seems that we're justified in removing turnovers. Let's see if we can remove another one of the insignificant variables. The next one we would want to remove, based on p-value, is defensive rebounds. So again, let's create our model, taking out defensive rebounds and calling this PointsReg3. We'll just scroll up again, take out DRB, for defensive rebounds, and change the name to PointsReg3 so we don't overwrite PointsReg2. Let's look at the summary again to see if the R-squared has changed. And it's the same: 0.8991. So I think we're justified again in removing defensive rebounds.

Let's try this one more time and see if we can remove blocks. So we'll remove blocks and call it PointsReg4. Take a look at the summary of PointsReg4, and again, the R-squared value stayed the same. So now we've gotten down to a model that is a bit simpler. All the variables are significant, and we've still got an R-squared of 0.899.

Now let's take a look at the sum of squared errors and the root mean squared error, just to make sure we didn't inflate those too much by removing a few variables. Remember that the sum of squared errors in the original model was that giant number, 28,394,314, and the root mean squared error, the much more interpretable number, was 184.4. Those are the numbers we'll compare against when we calculate the sum of squared errors of the new model, PointsReg4. So let's call this SSE_4, and that will just be equal to sum(PointsReg4$residuals^2). Then RMSE_4 will just be the square root of SSE_4 divided by n, the number of rows in our data set. Okay, let's take a look at these. The new sum of squared errors is now 28,421,465. Again, I find this very difficult to interpret; I like to look at the root mean squared error instead. The root mean squared error here is just RMSE_4, and it's 184.5.
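A sketch of this last round of simplification, under the same assumptions:

    # Remove DRB (defensive rebounds), then BLK (blocks), one variable at a time
    PointsReg3 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL + BLK, data = NBA)
    summary(PointsReg3)  # R-squared still 0.8991

    PointsReg4 <- lm(PTS ~ X2PA + X3PA + FTA + AST + ORB + STL, data = NBA)
    summary(PointsReg4)  # R-squared about 0.899; all remaining variables significant

    # Recompute the error metrics for the simplified model
    SSE_4 <- sum(PointsReg4$residuals^2)
    RMSE_4 <- sqrt(SSE_4 / nrow(NBA))
    SSE_4   # 28,421,465
    RMSE_4  # about 184.5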
So although we've increased the root mean squared error a little bit by removing those variables, it's really a very, very small amount. Essentially, we've kept the root mean squared error the same. So it seems like we've narrowed down on a much better model, because it's simpler, it's more interpretable, and it's got just about the same amount of error.