1 00:00:04,860 --> 00:00:09,110 Often, you will need to load an external data file into R 2 00:00:09,110 --> 00:00:11,920 to do some analysis and modeling. 3 00:00:11,920 --> 00:00:15,410 In this class, we'll be working with csv files, 4 00:00:15,410 --> 00:00:18,550 or comma separated value files. 5 00:00:18,550 --> 00:00:21,320 This is a common format for data files 6 00:00:21,320 --> 00:00:24,120 and is easy to work with in R. 7 00:00:24,120 --> 00:00:27,570 The first thing you need to do to read in a data file 8 00:00:27,570 --> 00:00:30,520 is to navigate to the directory on your computer 9 00:00:30,520 --> 00:00:32,970 where the data file is saved. 10 00:00:32,970 --> 00:00:37,990 On a Mac, go to the Misc menu, then select "Change Working 11 00:00:37,990 --> 00:00:39,390 Directory...". 12 00:00:39,390 --> 00:00:46,280 On a PC, go to the File menu and select "Change dir...". 13 00:00:46,280 --> 00:00:50,240 This should pop up a browsing or navigation window. 14 00:00:50,240 --> 00:00:57,390 Navigate to the folder where you saved the data file WHO.csv 15 00:00:57,390 --> 00:00:59,630 that you've downloaded for this class, 16 00:00:59,630 --> 00:01:01,260 and then select that folder. 17 00:01:11,930 --> 00:01:13,850 Nothing should have happened in R, 18 00:01:13,850 --> 00:01:17,690 but if you type getwd, and then empty 19 00:01:17,690 --> 00:01:20,970 parentheses and hit Enter, you should see the path 20 00:01:20,970 --> 00:01:25,590 to the folder containing the data set that you just 21 00:01:25,590 --> 00:01:28,250 selected. 22 00:01:28,250 --> 00:01:33,460 Now, read in the data file by typing WHO = 23 00:01:33,460 --> 00:01:42,490 read.csv("WHO.csv"), the name of the data file we want to read 24 00:01:42,490 --> 00:01:44,030 in. 25 00:01:44,030 --> 00:01:50,160 If you hit Enter, this will save the data set in WHO.csv 26 00:01:50,160 --> 00:01:53,240 to the data frame WHO.
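As a rough sketch of the steps just described, the R snippet below sets a working directory, checks it with getwd(), and reads a csv file with read.csv. Since WHO.csv may not be on your machine, it uses a made-up three-row file called WHO_toy.csv written to a temporary folder. The file name, the toy data frame, and its values are assumptions for illustration only, not the course data.

```r
# Toy stand-in for WHO.csv (hypothetical values, not the course data).
toy <- data.frame(Country = c("A", "B", "C"),
                  Region = c("Europe", "Africa", "Europe"),
                  Population = c(100, 200, 300))

# setwd() does in code what the "Change dir..." menu does in the R GUI.
# It returns the previous working directory, which we keep to restore later.
old_dir <- setwd(tempdir())
write.csv(toy, "WHO_toy.csv", row.names = FALSE)

getwd()                            # shows the folder R is now working in
WHO_toy = read.csv("WHO_toy.csv")  # same pattern as WHO = read.csv("WHO.csv")

setwd(old_dir)                     # go back to where we started
```

With the real course file, only the last two steps of the pattern are needed: point R at the folder holding WHO.csv, then call read.csv("WHO.csv").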
27 00:01:53,240 --> 00:01:57,009 To look at our data, there are two very useful commands. 28 00:01:57,009 --> 00:02:00,480 The first is the str function, which 29 00:02:00,480 --> 00:02:03,000 shows us the structure of the data. 30 00:02:03,000 --> 00:02:09,570 If you type str(WHO), and hit Enter, 31 00:02:09,570 --> 00:02:12,070 you can see that we have a data frame 32 00:02:12,070 --> 00:02:16,740 of 194 observations and 13 variables. 33 00:02:16,740 --> 00:02:19,530 This data set contains recent statistics 34 00:02:19,530 --> 00:02:22,190 from the World Health Organization-- W, 35 00:02:22,190 --> 00:02:26,490 H, O, or WHO-- on all countries. 36 00:02:26,490 --> 00:02:29,329 The variables are the name of the country, 37 00:02:29,329 --> 00:02:34,150 the region the country is in, the population in thousands, 38 00:02:34,150 --> 00:02:38,980 the percentage of the population under 15 and over 60, 39 00:02:38,980 --> 00:02:43,660 the fertility rate or average number of children per woman, 40 00:02:43,660 --> 00:02:48,670 the life expectancy in years, the child mortality rate, which 41 00:02:48,670 --> 00:02:52,260 is the number of children who die by age five per 1,000 42 00:02:52,260 --> 00:02:58,160 births, the number of cellular subscribers per 100 population, 43 00:02:58,160 --> 00:03:01,110 the literacy rate among adults aged greater than 44 00:03:01,110 --> 00:03:06,320 or equal to 15, the gross national income per capita, 45 00:03:06,320 --> 00:03:09,960 the percentage of male children enrolled in primary school, 46 00:03:09,960 --> 00:03:12,000 and the percentage of female children 47 00:03:12,000 --> 00:03:14,160 enrolled in primary school. 48 00:03:14,160 --> 00:03:19,110 For each variable, str gives us the name of the variable, 49 00:03:19,110 --> 00:03:22,300 and then a description of the type of the variable 50 00:03:22,300 --> 00:03:25,550 followed by the first few values of the variable.
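To get a feel for what str reports, here is a small sketch on a made-up data frame. The name toy and all of its values are hypothetical. One assumption worth flagging: since R 4.0, text columns are read as character rather than factor by default, so this sketch sets stringsAsFactors = TRUE to match the factor output discussed in this video.

```r
# Hypothetical miniature data frame, not the WHO data.
toy <- data.frame(
  Country = c("A", "B", "C"),
  Region = c("Europe", "Africa", "Europe"),
  Population = c(100L, 200L, 300L),      # integer column
  LifeExpectancy = c(80.5, 61.2, 78.9),  # general numeric column
  stringsAsFactors = TRUE                # treat text columns as factors
)

# One row per variable: its name, its type, and the first few values.
str(toy)
```

The same argument can be passed to read.csv if you want factor columns when reading a file in a recent version of R.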
51 00:03:25,550 --> 00:03:27,650 We see a couple of different types here. 52 00:03:27,650 --> 00:03:30,270 One is a factor variable. 53 00:03:30,270 --> 00:03:33,700 Country and Region are both factor variables. 54 00:03:33,700 --> 00:03:35,579 This means that the variables have 55 00:03:35,579 --> 00:03:40,210 several different categories, not necessarily numerical. 56 00:03:40,210 --> 00:03:44,390 For example, the Region variable has six different categories 57 00:03:44,390 --> 00:03:45,740 or levels. 58 00:03:45,740 --> 00:03:48,930 These include Africa and Americas. 59 00:03:48,930 --> 00:03:52,790 So each observation in the Region variable 60 00:03:52,790 --> 00:03:56,079 belongs to one of six different categories. 61 00:03:56,079 --> 00:03:57,770 For variables like Country, where 62 00:03:57,770 --> 00:04:01,760 there are 194 levels, which is the same number of observations 63 00:04:01,760 --> 00:04:06,330 we have, each value in this variable is different. 64 00:04:06,330 --> 00:04:09,150 In this case, it makes sense, since each country name 65 00:04:09,150 --> 00:04:11,230 is different. 66 00:04:11,230 --> 00:04:14,340 Then we have two types of numerical values-- integer 67 00:04:14,340 --> 00:04:16,040 and then general numerical values. 68 00:04:18,930 --> 00:04:22,200 The other very useful function to take a look at our data 69 00:04:22,200 --> 00:04:24,130 is the summary function. 70 00:04:24,130 --> 00:04:27,190 In your R console, type summary and then, 71 00:04:27,190 --> 00:04:32,159 in parentheses, WHO, the name of our data frame, and hit Enter. 72 00:04:32,159 --> 00:04:36,250 This gives a numerical summary of each of our variables. 73 00:04:36,250 --> 00:04:39,010 For the factor variables, Country and Region, 74 00:04:39,010 --> 00:04:41,330 it counts the number of observations 75 00:04:41,330 --> 00:04:44,860 in each of the levels or categories.
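Here is a minimal sketch of summary on a made-up data frame, so you can see how factor and numeric columns are treated differently. The data frame toy, its column values, and the helper name region_counts are hypothetical. The sketch assumes text columns are stored as factors, which in R 4.0 and later has to be requested with stringsAsFactors = TRUE.

```r
# Hypothetical miniature data frame, not the WHO data.
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Region = c("Europe", "Africa", "Europe", "Americas"),
  FertilityRate = c(1.5, 4.2, NA, 2.1),  # one value deliberately missing
  stringsAsFactors = TRUE
)

nlevels(toy$Region)   # number of distinct categories in Region

# Factor columns are summarized as counts per level; numeric columns as
# min, quartiles, median, mean, and max, plus a count of missing values.
summary(toy)

# Applied to a single factor, summary returns the per-level counts directly.
region_counts <- summary(toy$Region)
region_counts
```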
76 00:04:44,860 --> 00:04:48,120 So here, we see that we have 46 countries in the region 77 00:04:48,120 --> 00:04:53,110 Africa, 35 in the region Americas, etc. 78 00:04:53,110 --> 00:04:55,010 For each of the numerical values, 79 00:04:55,010 --> 00:05:01,420 we see the min, first quartile, median, mean, third quartile, 80 00:05:01,420 --> 00:05:05,640 and maximum values in that variable. 81 00:05:05,640 --> 00:05:09,430 We can also see in some of the variables that we have this 82 00:05:09,430 --> 00:05:11,630 category called NA's. 83 00:05:11,630 --> 00:05:15,160 This means that some observations 84 00:05:15,160 --> 00:05:17,640 are missing values in that variable. 85 00:05:17,640 --> 00:05:21,200 So for FertilityRate, there are 11 observations that 86 00:05:21,200 --> 00:05:25,440 are missing the value of FertilityRate. 87 00:05:25,440 --> 00:05:28,100 When working with data in R, it can often 88 00:05:28,100 --> 00:05:30,540 be useful to subset your data. 89 00:05:30,540 --> 00:05:33,820 For example, suppose we want to create a new data 90 00:05:33,820 --> 00:05:36,610 frame with only the countries in Europe. 91 00:05:36,610 --> 00:05:42,990 Let's call it WHO_Europe and use the subset function 92 00:05:42,990 --> 00:05:46,210 to subset the data frame WHO to take 93 00:05:46,210 --> 00:05:49,290 only the observations for which Region 94 00:05:49,290 --> 00:05:51,300 is exactly equal to Europe. 95 00:05:53,890 --> 00:05:57,030 The subset function takes two arguments. 96 00:05:57,030 --> 00:05:58,560 The first is the data frame we want 97 00:05:58,560 --> 00:06:01,870 to take a subset of, in this case, WHO. 98 00:06:01,870 --> 00:06:04,200 And the second argument is the criteria 99 00:06:04,200 --> 00:06:08,350 for which observations of WHO should belong in our new data 100 00:06:08,350 --> 00:06:10,160 frame, WHO_Europe.
101 00:06:10,160 --> 00:06:12,760 In this case, we want the observations 102 00:06:12,760 --> 00:06:17,500 for which the Region variable is exactly equal to Europe. 103 00:06:17,500 --> 00:06:21,500 The double equal sign here means exactly equal to. 104 00:06:21,500 --> 00:06:27,240 If we hit Enter and then look at the structure of WHO_Europe, 105 00:06:27,240 --> 00:06:29,200 we can see that we now have a data 106 00:06:29,200 --> 00:06:33,909 frame of 53 observations of the same 13 variables. 107 00:06:33,909 --> 00:06:35,760 Does 53 sound right? 108 00:06:35,760 --> 00:06:39,450 Well, let's look back at the summary output of WHO. 109 00:06:39,450 --> 00:06:42,200 We can see in the Region output, there 110 00:06:42,200 --> 00:06:47,320 were 53 observations that belonged in the region Europe. 111 00:06:47,320 --> 00:06:49,970 So we should expect 53 observations 112 00:06:49,970 --> 00:06:53,010 in our Europe subset, which is right. 113 00:06:55,650 --> 00:06:58,060 Now, suppose we want to save this new data 114 00:06:58,060 --> 00:07:01,500 frame, WHO_Europe, to a csv file. 115 00:07:01,500 --> 00:07:05,430 You can use the write.csv function to do this. 116 00:07:05,430 --> 00:07:10,440 Type write.csv, and then in parentheses 117 00:07:10,440 --> 00:07:12,700 the name of the data frame we want to save, 118 00:07:12,700 --> 00:07:17,280 WHO_Europe, comma, and then in quotes 119 00:07:17,280 --> 00:07:19,740 the name of the file we want to save it to. 120 00:07:19,740 --> 00:07:20,950 Let's call it WHO_Europe.csv. 121 00:07:24,810 --> 00:07:27,850 If you hit Enter, nothing should happen, 122 00:07:27,850 --> 00:07:31,780 but you should now have a file called WHO_Europe.csv 123 00:07:31,780 --> 00:07:35,040 in the same folder that you saved WHO.csv in. 
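The subset and write.csv steps can be sketched on a made-up data frame as follows. The names toy, toy_Europe, and out_file, along with the toy values, are hypothetical, and the file is written to a temporary folder rather than your working directory so the sketch runs anywhere.

```r
# Hypothetical miniature data frame, not the WHO data.
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Region = c("Europe", "Africa", "Europe", "Americas"),
  stringsAsFactors = TRUE
)

# Second argument is the condition; == tests for exact equality.
# Same pattern as WHO_Europe = subset(WHO, Region == "Europe").
toy_Europe = subset(toy, Region == "Europe")
str(toy_Europe)

# Save the subset to a csv file and confirm that the file was created.
out_file <- file.path(tempdir(), "toy_Europe.csv")
write.csv(toy_Europe, out_file)
file.exists(out_file)
```

By default write.csv also writes the row names as an extra first column. If you don't want that, you can pass row.names = FALSE.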
124 00:07:37,670 --> 00:07:40,980 And now that we've saved this as a csv file, 125 00:07:40,980 --> 00:07:43,570 if we're not working with it anymore in R, 126 00:07:43,570 --> 00:07:47,409 we can remove the data frame from our current session in R. 127 00:07:47,409 --> 00:07:50,370 This is often useful if you're working with a large data 128 00:07:50,370 --> 00:07:53,159 set that's taking up a lot of space. 129 00:07:53,159 --> 00:07:58,110 First, let's type ls() to see what variables we currently 130 00:07:58,110 --> 00:08:01,620 have in R. You can see that WHO_Europe is one 131 00:08:01,620 --> 00:08:03,590 of our variables. 132 00:08:03,590 --> 00:08:09,990 Now, type rm for remove and then, in parentheses, the name WHO_Europe and hit 133 00:08:09,990 --> 00:08:11,210 Enter. 134 00:08:11,210 --> 00:08:14,930 If you type ls() again, you should see that WHO_Europe is 135 00:08:14,930 --> 00:08:16,850 gone. 136 00:08:16,850 --> 00:08:21,430 In the next video, we'll explore the WHO data set.
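As a small sketch of the ls and rm steps, the snippet below creates a made-up data frame, checks that it is listed, removes it, and checks again. The names toy_Europe, before, and after are hypothetical and used only for illustration.

```r
# Hypothetical stand-in for WHO_Europe.
toy_Europe <- data.frame(Country = c("A", "C"))

ls()                              # lists the variables currently defined
before <- "toy_Europe" %in% ls()  # is toy_Europe among them?

rm(toy_Europe)                    # same pattern as rm(WHO_Europe)

after <- "toy_Europe" %in% ls()   # check again after removing it
```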