1 00:00:06,960 --> 00:00:10,750 The goal of baseball team is to make the playoffs. 2 00:00:10,750 --> 00:00:14,920 The A's approach was to get to the playoffs by using analytics. 3 00:00:14,920 --> 00:00:18,410 We'll first show how we can predict whether or not 4 00:00:18,410 --> 00:00:21,980 a team will make the playoffs by knowing how many games they 5 00:00:21,980 --> 00:00:24,570 won in the regular season. 6 00:00:24,570 --> 00:00:28,570 We will then use linear regression to predict how many games a team will 7 00:00:28,570 --> 00:00:32,930 win using the difference between runs scored and runs allowed, 8 00:00:32,930 --> 00:00:34,720 or opponent runs. 9 00:00:34,720 --> 00:00:39,390 We will then use linear regression again to predict the number of runs a team will 10 00:00:39,390 --> 00:00:43,170 score using batting statistics, and the number of 11 00:00:43,170 --> 00:00:46,460 runs a team will allow using fielding and pitching 12 00:00:46,460 --> 00:00:47,460 statistics. 13 00:00:47,460 --> 00:00:51,359 We'll start by figuring out how many games a team needs 14 00:00:51,359 --> 00:00:56,690 to win to make the playoffs, and then how many more runs a team needs to score than 15 00:00:56,690 --> 00:01:00,589 their opponent to win that many games. 16 00:01:00,589 --> 00:01:03,780 So our first question is how many games does a team 17 00:01:03,780 --> 00:01:08,020 need to win in the regular season to make it to the playoffs. 18 00:01:08,020 --> 00:01:11,590 In Moneyball, Paul DePodesta reduced the regular season 19 00:01:11,590 --> 00:01:12,900 to a math problem. 20 00:01:12,900 --> 00:01:17,450 He judged that it would take 95 wins for the A's to make it 21 00:01:17,450 --> 00:01:19,340 to the playoffs. 22 00:01:19,340 --> 00:01:23,479 Let's see if we can verify this using data. 23 00:01:23,479 --> 00:01:30,329 This graph uses data from all teams and seasons, from 1996 to 2001. 24 00:01:30,329 --> 00:01:34,329 There is a point on the graph for every team and season pair. 25 00:01:34,329 --> 00:01:38,670 They are sorted on the x-axis by number of regular season wins, 26 00:01:38,670 --> 00:01:41,720 and are ordered on the y-axis by team. 27 00:01:41,720 --> 00:01:44,580 The gray points correspond to the teams that did not 28 00:01:44,580 --> 00:01:48,860 make the playoffs, and the red points correspond to the teams that did make the 29 00:01:48,860 --> 00:01:50,729 playoffs. 30 00:01:50,729 --> 00:01:55,060 This graph shows us that if a team wins 95 or more games, 31 00:01:55,060 --> 00:01:59,119 or is on the right side of this line, they almost certainly will make it to the 32 00:01:59,119 --> 00:02:00,119 playoffs. 33 00:02:00,119 --> 00:02:06,539 But if a team wins, say, 85 or more games then they're likely to make it to the playoffs, 34 00:02:06,539 --> 00:02:08,600 but it's not as certain. 35 00:02:08,600 --> 00:02:11,970 And if a team wins, say, 100 or more games then 36 00:02:11,970 --> 00:02:14,980 they definitely will make it to the playoffs. 37 00:02:14,980 --> 00:02:20,200 So this graph shows us that if a team wins 95 or more games then 38 00:02:20,200 --> 00:02:25,560 they have a strong chance of making it to the playoffs, which confirms Paul DePodesta's 39 00:02:25,560 --> 00:02:27,230 claim. 40 00:02:27,230 --> 00:02:31,520 So we know that a team wants to win 95 or more games. 41 00:02:31,520 --> 00:02:33,470 But how does a team win games? 42 00:02:33,470 --> 00:02:37,630 Well, they score more runs than their opponent does. 43 00:02:37,630 --> 00:02:40,340 But how many more do they need to score? 44 00:02:40,340 --> 00:02:46,510 The A's calculated that they needed to score 135 more runs than they allowed 45 00:02:46,510 --> 00:02:51,650 during the regular season to expect to win 95 games. 46 00:02:51,650 --> 00:02:57,420 Let's see if we can verify this using linear regression in R. 47 00:02:57,420 --> 00:03:01,170 In our R console, let's start by loading our data. 48 00:03:01,170 --> 00:03:05,630 We'll call it baseball, and use the read.csv function 49 00:03:05,630 --> 00:03:10,090 to read in the data file, baseball.csv. 50 00:03:10,090 --> 00:03:17,120 We can look at the structure of our data by using the str function. 51 00:03:17,120 --> 00:03:24,290 This data set includes an observation for every team and year pair from 1962 to 52 00:03:24,290 --> 00:03:29,590 for all seasons with 162 games. 53 00:03:29,590 --> 00:03:36,310 We have 15 variables in our data set, including runs scored, RS, runs allowed, 54 00:03:36,310 --> 00:03:40,370 RA, and Wins, W. We also have several other variables 55 00:03:40,370 --> 00:03:44,650 that we'll use when building models later on in the lecture. 56 00:03:44,650 --> 00:03:47,980 Since we are confirming the claims made in Moneyball, 57 00:03:47,980 --> 00:03:52,880 we want to build models using data Paul DePodesta had in 2002. 58 00:03:52,880 --> 00:03:56,540 So let's start by subsetting our data to only include 59 00:03:56,540 --> 00:03:59,400 the years before 2002. 60 00:03:59,400 --> 00:04:03,010 We'll call our new data set, moneyball, and use 61 00:04:03,010 --> 00:04:07,270 the subset function to only take the rows of baseball 62 00:04:07,270 --> 00:04:11,430 for which year is less than 2002. 63 00:04:11,430 --> 00:04:14,430 We can look at this structure of the data set, moneyball, 64 00:04:14,430 --> 00:04:16,720 by using the str function again. 65 00:04:16,720 --> 00:04:23,030 Now, we have 902 observations of the same 15 variables. 66 00:04:23,030 --> 00:04:28,650 So we want to build a linear regression equation to predict wins using the difference between 67 00:04:28,650 --> 00:04:31,270 runs scored and runs allowed. 68 00:04:31,270 --> 00:04:36,849 To make this a little easier, let's start by creating a new variable called, RD, 69 00:04:36,849 --> 00:04:45,139 for run difference, and set that equal to runs scored minus runs 70 00:04:45,139 --> 00:04:46,530 allowed. 71 00:04:46,530 --> 00:04:52,280 If we look at the structure of our data again, we can see that we have a new variable called, 72 00:04:52,280 --> 00:04:53,680 RD. 73 00:04:53,680 --> 00:05:00,259 So let's build a linear regression equation using the lm function to predict wins 74 00:05:00,259 --> 00:05:08,830 with RD as our independent variable, and using the data set, moneyball. 75 00:05:08,830 --> 00:05:13,560 We can look at the summary of our linear regression equation, 76 00:05:13,560 --> 00:05:18,860 and we can see that both the intercept and our independent variable, RD, are highly 77 00:05:18,860 --> 00:05:20,340 significant. 78 00:05:20,340 --> 00:05:23,229 And our R-squared is 0.88. 79 00:05:23,229 --> 00:05:28,600 So we have a strong model to predict wins using the difference between runs scored and 80 00:05:28,600 --> 00:05:29,870 runs allowed. 81 00:05:29,870 --> 00:05:34,870 Now, let's see if we can use this model to confirm the claim made in Moneyball that 82 00:05:34,870 --> 00:05:39,020 a team needs to score at least 135 more runs than they 83 00:05:39,020 --> 00:05:45,719 allow to win at least 95 games. 84 00:05:45,719 --> 00:05:56,909 Our regression equation is wins equals 80.8814, our intercept term, plus the coefficient 85 00:05:56,909 --> 00:06:02,840 for run difference, 0.1058, times run difference, 86 00:06:02,840 --> 00:06:04,699 or RD. 87 00:06:04,699 --> 00:06:08,860 We want wins to be greater than or equal to 95. 88 00:06:08,860 --> 00:06:13,889 This will be true if and only if our regression equation is also 89 00:06:13,889 --> 00:06:24,389 greater than or equal to 95. 90 00:06:24,389 --> 00:06:28,210 By manipulating this equation, we can see that this will be true if 91 00:06:28,210 --> 00:06:37,099 and only if run difference, or RD, is greater than or equal to 95 minus the intercept 92 00:06:37,099 --> 00:06:47,689 term, 88.8814, divided by 0.1058, which is equal 93 00:06:47,689 --> 00:06:53,689 to 133.4. 94 00:06:53,689 --> 00:06:57,539 So this tells us that if the run difference of a team 95 00:06:57,539 --> 00:07:03,919 is greater than or equal to 133.4 then they will win at least 95 games, 96 00:07:03,919 --> 00:07:06,719 according to our regression equation. 97 00:07:06,719 --> 00:07:12,930 This is very close to the claim made in Moneyball that a team needs to score at least 135 more 98 00:07:12,930 --> 00:07:17,029 runs than they allow to win at least 95 games. 99 00:07:17,029 --> 00:07:21,949 So using linear regression and data, we were able to verify the claim made 100 00:07:21,949 --> 00:07:23,699 by Paul DePodesta in Moneyball.