1 00:00:04,580 --> 00:00:06,550 Using our regression models, we would 2 00:00:06,550 --> 00:00:09,300 like to predict before the season starts 3 00:00:09,300 --> 00:00:12,910 how many games the 2002 Oakland A's will win. 4 00:00:12,910 --> 00:00:15,450 To do this, we first have to predict 5 00:00:15,450 --> 00:00:17,640 how many runs the team will score 6 00:00:17,640 --> 00:00:20,070 and how many runs they will allow. 7 00:00:20,070 --> 00:00:22,890 These models use team statistics. 8 00:00:22,890 --> 00:00:26,100 However, when we are predicting for the 2002 Oakland 9 00:00:26,100 --> 00:00:28,700 A's before the season has occurred, 10 00:00:28,700 --> 00:00:31,980 the team is probably different than it was the year before. 11 00:00:31,980 --> 00:00:34,670 So we don't know the team statistics. 12 00:00:34,670 --> 00:00:38,180 But we can estimate these statistics using past player 13 00:00:38,180 --> 00:00:39,520 performance. 14 00:00:39,520 --> 00:00:42,420 This approach assumes that past performance correlates 15 00:00:42,420 --> 00:00:44,520 with future performance and that there 16 00:00:44,520 --> 00:00:47,480 will be few injuries during the season. 17 00:00:47,480 --> 00:00:50,850 Using this approach, we can estimate the team statistics 18 00:00:50,850 --> 00:00:57,050 for 2002 by using the 2001 player statistics. 19 00:00:57,050 --> 00:01:00,360 Let's start by making a prediction for runs scored. 20 00:01:00,360 --> 00:01:03,170 At the beginning of the 2002 season, 21 00:01:03,170 --> 00:01:07,340 the Oakland A's had 24 batters on their roster. 22 00:01:07,340 --> 00:01:12,140 Using the 2001 regular season statistics for these players, 23 00:01:12,140 --> 00:01:15,510 we can estimate that team on-base percentage will be 24 00:01:15,510 --> 00:01:23,330 about 0.339 and team slugging percentage will be about 0.430. 25 00:01:23,330 --> 00:01:27,350 We built the following linear regression equation in R 26 00:01:27,350 --> 00:01:29,640 to predict runs scored. 
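The runs-scored equation referred to here can be written out as a short calculation. This is a minimal sketch in Python rather than R. The coefficients are not stated in this transcript; they are assumed from the model fitted earlier in the course, and the names `predict_runs_scored`, `RS_INTERCEPT`, `RS_OBP_COEF`, and `RS_SLG_COEF` are hypothetical.

```python
# Assumed coefficients for the runs-scored regression (not given in this
# transcript): RS = intercept + b_obp * OBP + b_slg * SLG.
RS_INTERCEPT = -804.63
RS_OBP_COEF = 2737.77
RS_SLG_COEF = 1584.91


def predict_runs_scored(obp: float, slg: float) -> float:
    """Predict season runs scored from team on-base and slugging percentage."""
    return RS_INTERCEPT + RS_OBP_COEF * obp + RS_SLG_COEF * slg


# Team statistics for 2002 as estimated from the 2001 player statistics.
runs_scored = predict_runs_scored(obp=0.339, slg=0.430)
print(round(runs_scored))
```

With the estimated on-base percentage of 0.339 and slugging percentage of 0.430, the result rounds to the figure quoted in the transcript.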
27 00:01:29,640 --> 00:01:35,560 If we plug in 0.339 for on-base percentage and 0.430 28 00:01:35,560 --> 00:01:40,060 for slugging percentage, we predict that the 2002 Oakland 29 00:01:40,060 --> 00:01:45,060 A's will score about 805 runs. 30 00:01:45,060 --> 00:01:48,660 Similarly, we can make a prediction for runs allowed. 31 00:01:48,660 --> 00:01:51,340 At the beginning of the 2002 season, 32 00:01:51,340 --> 00:01:55,300 the Oakland A's had 17 pitchers on their roster. 33 00:01:55,300 --> 00:01:59,810 Using the 2001 regular season statistics for these players, 34 00:01:59,810 --> 00:02:02,990 we can estimate that team opponent on-base percentage 35 00:02:02,990 --> 00:02:07,960 will be about 0.307 and team opponent slugging percentage 36 00:02:07,960 --> 00:02:12,800 will be about 0.373. 37 00:02:12,800 --> 00:02:15,590 Our regression equation to predict runs allowed 38 00:02:15,590 --> 00:02:17,530 was as follows. 39 00:02:17,530 --> 00:02:20,370 By plugging in 0.307 for opponent 40 00:02:20,370 --> 00:02:25,990 on-base percentage and 0.373 for opponent slugging percentage, 41 00:02:25,990 --> 00:02:33,070 we predict that the 2002 Oakland A's will allow 622 runs. 42 00:02:33,070 --> 00:02:37,460 We can now make a prediction for how many games they will win. 43 00:02:37,460 --> 00:02:41,510 Our regression equation to predict wins is as follows. 44 00:02:41,510 --> 00:02:48,370 We predicted 805 runs scored and 622 runs allowed. 45 00:02:48,370 --> 00:02:50,800 We can plug in the difference between runs scored 46 00:02:50,800 --> 00:02:53,620 and runs allowed to predict that the A's 47 00:02:53,620 --> 00:02:59,520 will win 100 games in 2002. 48 00:02:59,520 --> 00:03:03,280 Paul DePodesta used a similar approach to make predictions. 49 00:03:03,280 --> 00:03:05,790 It turns out that our predictions and Paul's 50 00:03:05,790 --> 00:03:10,000 predictions closely match actual performance. 
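The runs-allowed and wins steps described above can be sketched the same way. This is a Python sketch rather than R. The coefficients are again not stated in this transcript and are assumed from the models fitted earlier in the course; the function names are hypothetical.

```python
def predict_runs_allowed(oobp: float, oslg: float) -> float:
    """Predict runs allowed from opponent OBP and opponent SLG.

    Coefficients are assumed; the transcript does not state them.
    """
    return -837.38 + 2913.60 * oobp + 1514.29 * oslg


def predict_wins(run_difference: float) -> float:
    """Predict season wins from runs scored minus runs allowed.

    Coefficients are assumed; the transcript does not state them.
    """
    return 80.8814 + 0.1058 * run_difference


# Opponent statistics estimated from the 2001 pitcher data.
runs_allowed = predict_runs_allowed(oobp=0.307, oslg=0.373)

# Rounded predictions as quoted in the transcript.
runs_scored = 805
wins = predict_wins(runs_scored - round(runs_allowed))

print(round(runs_allowed), round(wins))
```

The wins model takes only the run difference as input, so any error in either runs prediction feeds directly into the wins estimate.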
51 00:03:10,000 --> 00:03:13,650 Our prediction for runs scored was 805 runs. 52 00:03:13,650 --> 00:03:17,730 Paul predicted between 800 and 820 runs. 53 00:03:17,730 --> 00:03:21,400 And it turns out that the 2002 Oakland A's actually 54 00:03:21,400 --> 00:03:24,020 scored 800 runs. 55 00:03:24,020 --> 00:03:28,100 For runs allowed, we predicted 622. 56 00:03:28,100 --> 00:03:33,079 Paul DePodesta predicted between 650 and 670. 57 00:03:33,079 --> 00:03:39,050 It turns out that the Oakland A's actually allowed 653 runs. 58 00:03:39,050 --> 00:03:43,250 For wins, we predicted that they would win 100 games. 59 00:03:43,250 --> 00:03:47,670 Paul predicted that they would win between 93 and 97 games. 60 00:03:47,670 --> 00:03:52,420 And they actually won 103 games. 61 00:03:52,420 --> 00:03:55,070 These predictions show us that by using 62 00:03:55,070 --> 00:03:58,480 publicly available data and simple analytics, 63 00:03:58,480 --> 00:04:01,560 we can predict very close to what actually happened 64 00:04:01,560 --> 00:04:05,340 before the season even started. 65 00:04:05,340 --> 00:04:08,870 It turns out that the A's set a league record in 2002 66 00:04:08,870 --> 00:04:11,350 by winning 20 games in a row. 67 00:04:11,350 --> 00:04:13,940 And they won 1 more game than the previous year 68 00:04:13,940 --> 00:04:17,160 and made it to the playoffs once again.
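The comparison between our predictions and the actual 2002 results, given earlier in this segment, can be tabulated with a few lines of Python. All figures come from the transcript; the variable names are hypothetical.

```python
# Predicted and actual 2002 values as quoted in the transcript.
predictions = {"runs scored": 805, "runs allowed": 622, "wins": 100}
actuals = {"runs scored": 800, "runs allowed": 653, "wins": 103}

# Signed error: positive means the model predicted too high.
errors = {name: predictions[name] - actuals[name] for name in predictions}

for name, error in errors.items():
    pct = 100 * error / actuals[name]
    print(
        f"{name}: predicted {predictions[name]}, "
        f"actual {actuals[name]}, error {error:+d} ({pct:+.1f}%)"
    )
```

Runs allowed shows the largest miss of the three, which is consistent with the wins prediction also coming in low.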