1 00:00:04,391 --> 00:00:05,890 ALLISON O'HAIR: In the previous video, we 2 00:00:05,890 --> 00:00:08,680 used data and linear regression to show 3 00:00:08,680 --> 00:00:12,100 that if a team scores at least 135 4 00:00:12,100 --> 00:00:15,460 more runs than their opponent throughout the regular season, 5 00:00:15,460 --> 00:00:19,270 then we predict that that team will win at least 95 games 6 00:00:19,270 --> 00:00:21,640 and make the playoffs. 7 00:00:21,640 --> 00:00:25,090 This means that we need to know how many runs a team will 8 00:00:25,090 --> 00:00:28,390 score, which we will show can be predicted using 9 00:00:28,390 --> 00:00:32,950 batting statistics, and how many runs a team will allow, 10 00:00:32,950 --> 00:00:35,740 which we will show can be predicted using fielding 11 00:00:35,740 --> 00:00:38,470 and pitching statistics. 12 00:00:38,470 --> 00:00:40,530 Let's start by creating a linear regression 13 00:00:40,530 --> 00:00:43,200 model to predict runs scored. 14 00:00:43,200 --> 00:00:46,870 The Oakland A's were interested in answering the question, 15 00:00:46,870 --> 00:00:49,710 how does a team score more runs? 16 00:00:49,710 --> 00:00:52,440 They discovered that two baseball statistics 17 00:00:52,440 --> 00:00:55,980 were significantly more important than anything else. 18 00:00:55,980 --> 00:00:59,460 These were on-base percentage, or OBP, 19 00:00:59,460 --> 00:01:01,890 which is the percentage of time a player gets 20 00:01:01,890 --> 00:01:06,300 on base, including walks, and slugging percentage, 21 00:01:06,300 --> 00:01:10,260 or SLG, which measures how far a player gets around 22 00:01:10,260 --> 00:01:15,680 the bases on his turn and measures the power of a hitter. 23 00:01:15,680 --> 00:01:18,050 Most teams and people in baseball 24 00:01:18,050 --> 00:01:22,970 focused instead on batting average, or BA, 25 00:01:22,970 --> 00:01:25,220 which measures how often a hitter gets 26 00:01:25,220 --> 00:01:27,380 on base by hitting the ball.
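For reference, the three hitting statistics described here can be computed from a batting line using their standard definitions. This is only a sketch: the function name `hitting_stats` and the sample counts are hypothetical, made up for illustration, and are not taken from the lecture's dataset.

```r
# Standard definitions of BA, OBP and SLG.
# The function name and the sample batting line are hypothetical.
hitting_stats <- function(AB, H, X2B, X3B, HR, BB, HBP, SF) {
  singles     <- H - X2B - X3B - HR
  total_bases <- singles + 2 * X2B + 3 * X3B + 4 * HR
  c(
    BA  = H / AB,                                  # hits per at-bat
    OBP = (H + BB + HBP) / (AB + BB + HBP + SF),   # includes walks and hit-by-pitch
    SLG = total_bases / AB                         # bases gained per at-bat
  )
}

# Made-up batting line for one player-season.
stats <- hitting_stats(AB = 500, H = 140, X2B = 30, X3B = 2, HR = 20,
                       BB = 60, HBP = 5, SF = 5)
print(round(stats, 3))
```

Walks (`BB`) appear in the OBP formula but not in the BA formula, so a batter who walks often raises OBP without changing BA.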
27 00:01:27,380 --> 00:01:31,070 This focuses on hits instead of walks. 28 00:01:31,070 --> 00:01:34,700 The Oakland A's claimed that on-base percentage 29 00:01:34,700 --> 00:01:36,650 was the most important. 30 00:01:36,650 --> 00:01:39,350 Slugging percentage was important. 31 00:01:39,350 --> 00:01:42,710 And batting average was overvalued. 32 00:01:42,710 --> 00:01:45,860 Let's see if we can use linear regression in R 33 00:01:45,860 --> 00:01:48,500 to verify which baseball statistics are 34 00:01:48,500 --> 00:01:51,720 more important to predict runs. 35 00:01:51,720 --> 00:01:55,200 In the previous video, we created our dataset in R 36 00:01:55,200 --> 00:01:57,000 and called it Moneyball. 37 00:01:57,000 --> 00:01:59,670 Let's take a look at the structure of our data again. 38 00:02:02,410 --> 00:02:05,230 Our dataset includes many variables, including 39 00:02:05,230 --> 00:02:10,090 runs scored, or RS, on-base percentage, OBP, 40 00:02:10,090 --> 00:02:15,680 slugging percentage, SLG, and batting average, BA. 41 00:02:15,680 --> 00:02:19,130 We want to see if we can use linear regression to predict 42 00:02:19,130 --> 00:02:22,880 runs scored using the three hitting statistics, 43 00:02:22,880 --> 00:02:25,970 on-base percentage, slugging percentage, and batting 44 00:02:25,970 --> 00:02:27,480 average. 45 00:02:27,480 --> 00:02:30,740 So let's build a linear regression equation using 46 00:02:30,740 --> 00:02:34,400 the lm function to predict runs scored 47 00:02:34,400 --> 00:02:41,300 using the independent variables OBP, SLG, and BA. 48 00:02:41,300 --> 00:02:46,040 Our dataset is Moneyball. 49 00:02:46,040 --> 00:02:50,950 If we look at the summary of our regression equation, 50 00:02:50,950 --> 00:02:53,580 we can see that all of our independent variables 51 00:02:53,580 --> 00:02:55,470 are highly significant. 52 00:02:55,470 --> 00:02:59,290 And our R squared is 0.93.
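The model just described can be fit with one call to `lm`. The sketch below runs on its own, but under loud assumptions: the real data is not reproduced here, so a small synthetic data frame with made-up numbers stands in for it, and the object names `moneyball` and `RunsReg` are illustrative. Its output will not match the 0.93 quoted above, and the sign of the BA coefficient need not match either.

```r
# Synthetic stand-in for the dataset (made-up numbers, NOT the real data).
set.seed(1)
n   <- 200
BA  <- rnorm(n, mean = 0.260, sd = 0.012)
OBP <- BA + rnorm(n, mean = 0.065, sd = 0.008)        # OBP tracks BA plus walks
SLG <- 1.5 * BA + rnorm(n, mean = 0.010, sd = 0.020)  # SLG also tracks BA
RS  <- -800 + 2700 * OBP + 1600 * SLG + rnorm(n, sd = 25)  # arbitrary coefficients
moneyball <- data.frame(RS, OBP, SLG, BA)

# Look at the structure of the data.
str(moneyball)

# Regress runs scored on the three hitting statistics.
RunsReg <- lm(RS ~ OBP + SLG + BA, data = moneyball)

# Coefficients, significance stars and R-squared.
summary(RunsReg)
```

With the real data, the same `lm` and `summary` calls should give the coefficient table and R squared discussed in the video.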
53 00:02:59,290 --> 00:03:01,050 If we look at our coefficients, we 54 00:03:01,050 --> 00:03:04,950 see that the coefficient for batting average is negative. 55 00:03:04,950 --> 00:03:08,500 This says that all other things being equal, 56 00:03:08,500 --> 00:03:13,020 a team with a higher batting average will score fewer runs. 57 00:03:13,020 --> 00:03:15,660 But this is a bit counterintuitive. 58 00:03:15,660 --> 00:03:19,230 What's going on here is a case of multicollinearity. 59 00:03:19,230 --> 00:03:22,540 These three hitting statistics are highly correlated. 60 00:03:22,540 --> 00:03:25,330 So it's hard to interpret our model. 61 00:03:25,330 --> 00:03:28,230 Let's see if we can remove batting average, the variable 62 00:03:28,230 --> 00:03:31,800 with the negative coefficient and the least significance, 63 00:03:31,800 --> 00:03:37,180 to see if we can improve the interpretability of our model. 64 00:03:37,180 --> 00:03:39,360 So let's rerun our regression equation 65 00:03:39,360 --> 00:03:42,210 without the variable BA. 66 00:03:42,210 --> 00:03:45,580 If we look at the summary of our new regression equation, 67 00:03:45,580 --> 00:03:49,750 we can see that our variables are again highly significant. 68 00:03:49,750 --> 00:03:51,540 The coefficients are all positive, 69 00:03:51,540 --> 00:03:53,580 as we would expect them to be. 70 00:03:53,580 --> 00:03:59,530 And our R squared is now 0.9296, or about the same as before. 71 00:03:59,530 --> 00:04:03,180 So this model is simpler with one fewer variable, 72 00:04:03,180 --> 00:04:05,640 but has about the same predictive ability 73 00:04:05,640 --> 00:04:08,240 as the three-variable model. 74 00:04:08,240 --> 00:04:10,220 You can experiment and see that if we 75 00:04:10,220 --> 00:04:13,340 had taken out on-base percentage or slugging percentage 76 00:04:13,340 --> 00:04:16,160 instead of batting average, our R squared 77 00:04:16,160 --> 00:04:19,380 would have gone down more.
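The experiment suggested here, dropping one variable at a time and comparing R squared, can be sketched as follows. It again uses a synthetic stand-in with made-up numbers rather than the real data, and the object names are illustrative, so only the pattern carries over, not the specific values.

```r
# Synthetic stand-in for the dataset (made-up numbers, NOT the real data).
set.seed(1)
n   <- 200
BA  <- rnorm(n, mean = 0.260, sd = 0.012)
OBP <- BA + rnorm(n, mean = 0.065, sd = 0.008)
SLG <- 1.5 * BA + rnorm(n, mean = 0.010, sd = 0.020)
RS  <- -800 + 2700 * OBP + 1600 * SLG + rnorm(n, sd = 25)  # arbitrary coefficients
moneyball <- data.frame(RS, OBP, SLG, BA)

# Pairwise correlations between the hitting statistics: large values
# are the symptom of multicollinearity.
print(round(cor(moneyball[, c("OBP", "SLG", "BA")]), 2))

# Full model, then each model with one variable removed.
models <- list(
  full   = lm(RS ~ OBP + SLG + BA, data = moneyball),
  no_BA  = lm(RS ~ OBP + SLG,      data = moneyball),
  no_OBP = lm(RS ~ SLG + BA,       data = moneyball),
  no_SLG = lm(RS ~ OBP + BA,       data = moneyball)
)

# Collect R-squared for each model into a named vector.
r2 <- sapply(models, function(m) summary(m)$r.squared)
print(round(r2, 4))

# Coefficients of the two-variable model.
print(coef(models$no_BA))
```

Removing a variable can never raise R squared in ordinary least squares, so the question is how much it falls. A small drop for `no_BA` next to larger drops for `no_OBP` and `no_SLG` is the pattern the video describes.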
78 00:04:19,380 --> 00:04:21,450 If we look at the coefficients, we 79 00:04:21,450 --> 00:04:24,450 can see that the coefficient for on-base percentage 80 00:04:24,450 --> 00:04:26,760 is significantly higher than the coefficient 81 00:04:26,760 --> 00:04:29,250 for slugging percentage. 82 00:04:29,250 --> 00:04:32,370 Since these variables are on about the same scale, 83 00:04:32,370 --> 00:04:35,550 this tells us that on-base percentage is worth more 84 00:04:35,550 --> 00:04:37,870 than slugging percentage. 85 00:04:37,870 --> 00:04:41,650 So using linear regression, we're able to verify the claims made 86 00:04:41,650 --> 00:04:45,730 in Moneyball that batting average is overvalued, 87 00:04:45,730 --> 00:04:49,300 on-base percentage is the most important, and slugging 88 00:04:49,300 --> 00:04:50,830 percentage is important. 89 00:04:50,830 --> 00:04:53,260 We can create a very similar model 90 00:04:53,260 --> 00:04:56,680 to predict runs allowed or opponent runs. 91 00:04:56,680 --> 00:05:00,190 This model uses pitching statistics, opponents' 92 00:05:00,190 --> 00:05:05,590 on-base percentage, or OOBP, and opponents' slugging percentage, 93 00:05:05,590 --> 00:05:07,810 or OSLG. 94 00:05:07,810 --> 00:05:10,810 The statistics are computed in the same way 95 00:05:10,810 --> 00:05:13,810 as on-base percentage and slugging percentage, 96 00:05:13,810 --> 00:05:16,510 but use the actions of the opposing batters 97 00:05:16,510 --> 00:05:19,720 against our team's pitchers and fielders. 98 00:05:19,720 --> 00:05:23,920 Using our dataset in R, we can build a linear regression model 99 00:05:23,920 --> 00:05:27,700 to predict runs allowed using opponents' on-base percentage 100 00:05:27,700 --> 00:05:30,190 and opponents' slugging percentage. 101 00:05:30,190 --> 00:05:33,580 This is, again, a very strong model with an R squared 102 00:05:33,580 --> 00:05:35,950 value of 0.91. 103 00:05:35,950 --> 00:05:39,230 And both variables are significant.
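The runs-allowed model has the same shape as the runs-scored model. A minimal sketch follows, under the same loud assumptions as before: the data frame is synthetic with made-up numbers, the column name `RA` for runs allowed and the object name `RunsAllowedReg` are illustrative, and the hypothetical team passed to `predict` is invented. The fitted values will not match the 0.91 quoted above.

```r
# Synthetic stand-in for the pitching columns (made-up numbers, NOT the real data).
set.seed(2)
n    <- 200
OOBP <- rnorm(n, mean = 0.330, sd = 0.015)
OSLG <- 0.8 * OOBP + rnorm(n, mean = 0.140, sd = 0.020)
RA   <- -840 + 2900 * OOBP + 1500 * OSLG + rnorm(n, sd = 25)  # arbitrary coefficients
moneyball <- data.frame(RA, OOBP, OSLG)

# Regress runs allowed on the two opponent statistics.
RunsAllowedReg <- lm(RA ~ OOBP + OSLG, data = moneyball)
summary(RunsAllowedReg)

# Predicted runs allowed for a hypothetical team.
new_team <- data.frame(OOBP = 0.320, OSLG = 0.400)
pred <- predict(RunsAllowedReg, newdata = new_team)
print(pred)
```

A predicted runs-allowed figure like this, paired with a runs-scored prediction, gives the run difference that the earlier model turns into expected wins.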
104 00:05:39,230 --> 00:05:41,120 In the next video, we'll show how 105 00:05:41,120 --> 00:05:44,870 we can apply these models to predict whether or not 106 00:05:44,870 --> 00:05:47,560 a team will make the playoffs.