1 00:00:04,500 --> 00:00:06,570 In this recitation we will apply some 2 00:00:06,570 --> 00:00:09,210 of the ideas from Moneyball to data from the National 3 00:00:09,210 --> 00:00:14,180 Basketball Association-- that is, the NBA. 4 00:00:14,180 --> 00:00:16,710 So the first thing we'll do is read in the data 5 00:00:16,710 --> 00:00:18,930 and learn about it. 6 00:00:18,930 --> 00:00:23,360 The data we have is located in the file NBA_train 7 00:00:23,360 --> 00:00:27,780 and contains data from all teams in seasons since 1980, 8 00:00:27,780 --> 00:00:31,280 except for ones with fewer than 82 games. 9 00:00:31,280 --> 00:00:35,590 So I'll read this into the variable NBA, 10 00:00:35,590 --> 00:00:36,930 NBA = read.csv("NBA_train.csv"). 11 00:00:48,710 --> 00:00:49,210 OK. 12 00:00:49,210 --> 00:00:50,550 So we've read it in. 13 00:00:50,550 --> 00:00:52,340 And let's explore it a little bit using 14 00:00:52,340 --> 00:00:54,460 the str command, str(NBA). 15 00:00:58,950 --> 00:00:59,740 All right. 16 00:00:59,740 --> 00:01:01,290 So this is our data frame. 17 00:01:01,290 --> 00:01:06,110 We have 835 observations of 20 variables. 18 00:01:06,110 --> 00:01:09,750 Let's take a look at what some of these variables are. 19 00:01:09,750 --> 00:01:13,020 SeasonEnd is the year the season ended. 20 00:01:13,020 --> 00:01:15,350 Team is the name of the team. 21 00:01:15,350 --> 00:01:18,200 And Playoffs is a binary variable for whether or not 22 00:01:18,200 --> 00:01:20,870 a team made it to the playoffs that year. 23 00:01:20,870 --> 00:01:26,430 If they made it to the playoffs it's a 1, if not it's a 0. 24 00:01:26,430 --> 00:01:30,620 W stands for the number of regular season wins. 25 00:01:30,620 --> 00:01:35,680 PTS stands for points scored during the regular season. 26 00:01:35,680 --> 00:01:38,610 oppPTS stands for opponent points 27 00:01:38,610 --> 00:01:41,990 scored during the regular season. 
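The read-and-inspect step described above can be tried without the course file. The following is a minimal sketch, assuming a tiny stand-in CSV: the two rows, the team names, and the temporary file are invented for illustration and are not taken from NBA_train.csv. Only the column names mentioned so far are used.

```r
# Toy stand-in for NBA_train.csv: the rows below are invented for illustration only.
csv_text <- "SeasonEnd,Team,Playoffs,W,PTS,oppPTS
1980,Team A,1,50,8573,8334
1980,Team B,0,30,8813,9035"

# Write the toy data to a temporary file so read.csv() has something to read.
tmp <- tempfile(fileext = ".csv")
writeLines(csv_text, tmp)

# Same call pattern as in the recitation, pointed at the temporary file.
NBA <- read.csv(tmp)

# str() prints the number of observations and variables and each column's type.
str(NBA)

# Playoffs is coded 1 (made the playoffs) or 0 (did not).
table(NBA$Playoffs)
```

With the real file, the same two calls, read.csv("NBA_train.csv") and str(NBA), would report the 835 observations of 20 variables mentioned in the recitation.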
28 00:01:41,990 --> 00:01:46,420 And then we've got quite a few variables that 29 00:01:46,420 --> 00:01:49,170 have the variable name and then the same variable 30 00:01:49,170 --> 00:01:51,580 with an 'A' afterwards. 31 00:01:51,580 --> 00:02:00,140 So we've got FG and FGA, X2P, X2PA, X3P, X3PA, FT, and FTA. 32 00:02:00,140 --> 00:02:02,780 So what this notation means is that 33 00:02:02,780 --> 00:02:05,720 if there is an 'A', it is the number that were attempted. 34 00:02:05,720 --> 00:02:08,090 And if not, it is the number that were successful. 35 00:02:08,090 --> 00:02:12,860 So for example FG is the number of successful field goals, 36 00:02:12,860 --> 00:02:15,220 including two and three pointers. 37 00:02:15,220 --> 00:02:18,240 Whereas FGA is the number of field goal attempts. 38 00:02:18,240 --> 00:02:22,579 So this also contains the number of unsuccessful field goals. 39 00:02:22,579 --> 00:02:27,829 So FGA will always be a bigger number than FG. 40 00:02:27,829 --> 00:02:31,340 The next pair is for two pointers. 41 00:02:31,340 --> 00:02:34,130 The number of successful two pointers and the number 42 00:02:34,130 --> 00:02:35,610 attempted. 43 00:02:35,610 --> 00:02:40,980 The pair after that, right down here, is for three pointers, 44 00:02:40,980 --> 00:02:43,640 the number successful and the number attempted. 45 00:02:43,640 --> 00:02:46,370 And the next pair is for free throws, 46 00:02:46,370 --> 00:02:50,380 the number successful and the number attempted. 47 00:02:50,380 --> 00:02:53,900 Now you'll notice, actually, that the two pointer and three 48 00:02:53,900 --> 00:02:58,430 pointer variables have an 'X' in front of them. 49 00:02:58,430 --> 00:03:02,590 Well, this isn't because we had an 'X' in the original data. 
50 00:03:02,590 --> 00:03:04,620 In fact, if you were to open up the csv 51 00:03:04,620 --> 00:03:09,920 file of the original data, it would just say 2P and 2PA, 52 00:03:09,920 --> 00:03:14,510 and 3P and 3PA, without the 'X' in front. 53 00:03:14,510 --> 00:03:16,180 The reason there's an 'X' in front of it 54 00:03:16,180 --> 00:03:18,360 is because when we load it into R, 55 00:03:18,360 --> 00:03:23,110 R doesn't like it when a variable begins with a number. 56 00:03:23,110 --> 00:03:25,870 So if a variable begins with a number 57 00:03:25,870 --> 00:03:28,790 it will put an 'X' in front of it. 58 00:03:28,790 --> 00:03:29,650 This is fine. 59 00:03:29,650 --> 00:03:31,030 It's just something we need to be 60 00:03:31,030 --> 00:03:36,370 mindful of when we're dealing with variables in R. 61 00:03:36,370 --> 00:03:38,600 So moving on to the rest of our variables. 62 00:03:38,600 --> 00:03:40,970 We've got ORB and DRB. 63 00:03:40,970 --> 00:03:44,850 These are offensive and defensive rebounds. 64 00:03:44,850 --> 00:03:48,120 AST stands for assists. 65 00:03:48,120 --> 00:03:51,130 STL for steals. 66 00:03:51,130 --> 00:03:54,020 BLK stands for blocks. 67 00:03:54,020 --> 00:03:55,900 And TOV stands for turnovers. 68 00:03:58,820 --> 00:04:02,410 Don't worry if you're not a basketball expert 69 00:04:02,410 --> 00:04:05,090 and don't understand exactly the difference between each 70 00:04:05,090 --> 00:04:06,520 of these variables. 71 00:04:06,520 --> 00:04:08,150 We just wanted to familiarize you 72 00:04:08,150 --> 00:04:12,760 with some common basketball statistics that are recorded, 73 00:04:12,760 --> 00:04:14,250 and explain the labeling notation 74 00:04:14,250 --> 00:04:17,060 that we use in our data. 75 00:04:17,060 --> 00:04:20,769 We'll go on to use these variables in the next video.
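The 'X' prefix discussed above comes from how read.csv treats column names that are not valid R names. The following is a small sketch of that behavior, assuming a one-row toy CSV whose values and team name are invented for illustration. It uses the check.names argument of read.csv and the base function make.names, which read.csv relies on for the repair.

```r
# Toy CSV whose header uses the raw names 2P, 2PA, 3P, 3PA; values are invented.
csv_text <- "Team,2P,2PA,3P,3PA
Team A,3000,6000,400,1100"

# Default behaviour: check.names = TRUE rewrites names that are not valid R names.
fixed <- read.csv(text = csv_text)
names(fixed)

# With check.names = FALSE the original headers are kept as-is.
raw <- read.csv(text = csv_text, check.names = FALSE)
names(raw)

# make.names() is the base R function that performs this kind of repair.
make.names(c("2P", "3PA", "FG"))

# Columns with repaired names work with $ directly; raw names need backticks.
fixed$X2P
raw$`2P`
```

Keeping the default is usually the more convenient choice, since names like X2P can be typed without backticks in formulas and with $, which is why the recitation simply works with the 'X' versions.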