1 00:00:04,500 --> 00:00:08,119 In this video, we'll do some basic data analysis 2 00:00:08,119 --> 00:00:10,500 using our WHO data. 3 00:00:10,500 --> 00:00:13,640 To access a variable in a data frame, 4 00:00:13,640 --> 00:00:17,190 you always have to link it to the data frame it belongs to 5 00:00:17,190 --> 00:00:19,060 with the dollar sign. 6 00:00:19,060 --> 00:00:23,820 To see this, let's first try typing Under15 7 00:00:23,820 --> 00:00:25,700 and hitting Enter. 8 00:00:25,700 --> 00:00:27,800 R responds with an error. 9 00:00:27,800 --> 00:00:32,049 That's because R won't recognize this variable name since it 10 00:00:32,049 --> 00:00:35,050 doesn't know to look in the data frame WHO. 11 00:00:35,050 --> 00:00:41,530 Now type WHO$Under15 and hit Enter. 12 00:00:41,530 --> 00:00:47,130 This outputs the Under15 vector of the data frame WHO. 13 00:00:47,130 --> 00:00:50,430 We can compute some statistics about this variable, 14 00:00:50,430 --> 00:00:54,220 such as the mean, using the mean function and then 15 00:00:54,220 --> 00:00:59,870 in parentheses typing WHO$Under15. 16 00:00:59,870 --> 00:01:02,950 Close the parentheses and hit Enter. 17 00:01:02,950 --> 00:01:05,319 This tells us that the average percentage 18 00:01:05,319 --> 00:01:10,740 of the population under 15 is 28.7. 19 00:01:10,740 --> 00:01:13,539 We can also compute the standard deviation 20 00:01:13,539 --> 00:01:15,740 using the sd function. 21 00:01:15,740 --> 00:01:20,030 So type sd and then in parentheses WHO$Under15. 22 00:01:23,300 --> 00:01:26,140 Close the parentheses and hit Enter. 23 00:01:26,140 --> 00:01:28,670 This tells us that the standard deviation 24 00:01:28,670 --> 00:01:33,910 of the percentage of the population under 15 is 10.5. 25 00:01:33,910 --> 00:01:36,520 We can also get the statistical summary 26 00:01:36,520 --> 00:01:40,259 of just one variable using the summary function like we 27 00:01:40,259 --> 00:01:42,690 did before for the whole data frame. 
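The dollar-sign lookup and the mean and sd calls described so far can be sketched as follows. This is a minimal illustration, assuming a small stand-in data frame: the country labels and Under15 values below are made up and are not the real WHO figures, and the names `under15`, `avg`, and `spread` are hypothetical helpers.

```r
# Hypothetical stand-in for the WHO data frame used in the video.
# The values are invented for illustration only.
WHO <- data.frame(
  Country = c("Country A", "Country B", "Country C", "Country D"),
  Under15 = c(13, 20, 35, 50),
  stringsAsFactors = FALSE
)

# A column name on its own is not an object in the workspace, which is
# why typing Under15 by itself fails in a fresh session.
print(exists("Under15"))

# The dollar sign tells R to look for the column inside the data frame.
under15 <- WHO$Under15
print(under15)

avg <- mean(under15)    # arithmetic mean of the column
spread <- sd(under15)   # sample standard deviation (n - 1 denominator)

print(avg)
print(spread)
```

With the real data, the same two calls are what produce the 28.7 and 10.5 quoted in the video.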
28 00:01:42,690 --> 00:01:48,430 To do this, we can type summary, and then in parentheses 29 00:01:48,430 --> 00:01:48,930 WHO$Under15. 30 00:01:52,710 --> 00:01:55,820 Close the parentheses and hit Enter. 31 00:01:55,820 --> 00:02:00,090 This gives the minimum value, the first quartile, 32 00:02:00,090 --> 00:02:04,320 the median value, the mean, the third quartile, 33 00:02:04,320 --> 00:02:08,330 and the maximum value of the variable Under15. 34 00:02:08,330 --> 00:02:12,720 The first quartile is the value for which 25% of the data 35 00:02:12,720 --> 00:02:16,280 is less than that value, and the third quartile 36 00:02:16,280 --> 00:02:19,470 is the value for which 75% of the data 37 00:02:19,470 --> 00:02:21,990 is less than that value. 38 00:02:21,990 --> 00:02:26,550 This output tells us that there's a country with only 13% 39 00:02:26,550 --> 00:02:29,370 of the population under 15. 40 00:02:29,370 --> 00:02:34,310 Let's see which country it is using the which.min function. 41 00:02:34,310 --> 00:02:40,100 So we can type which.min and then in parentheses 42 00:02:40,100 --> 00:02:42,650 our variable WHO$Under15. 43 00:02:47,579 --> 00:02:50,380 Close the parentheses and hit Enter. 44 00:02:50,380 --> 00:02:53,020 This returns the number 86, which 45 00:02:53,020 --> 00:02:55,630 is the row number of the observation 46 00:02:55,630 --> 00:02:58,700 with the minimum value of Under15. 47 00:02:58,700 --> 00:03:06,170 To see which country is in row 86, we can type WHO$Country, 48 00:03:06,170 --> 00:03:10,550 for the country name, and then in square brackets 86, 49 00:03:10,550 --> 00:03:12,070 and hit Enter. 50 00:03:12,070 --> 00:03:15,160 So Japan is the country with the minimum percentage 51 00:03:15,160 --> 00:03:19,090 of the population under 15. 52 00:03:19,090 --> 00:03:22,120 Now let's see which country has the maximum percentage 53 00:03:22,120 --> 00:03:24,550 of the population under 15. 
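The single-variable summary and the which.min lookup just described might look like this in code. It is only a sketch on an invented stand-in data frame, so the row number it finds will not be the 86 seen in the video; `min_row` and `min_country` are hypothetical names.

```r
# Hypothetical stand-in for the WHO data frame; values are invented.
WHO <- data.frame(
  Country = c("Country A", "Country B", "Country C", "Country D", "Country E"),
  Under15 = c(30, 13, 42, 25, 50),
  stringsAsFactors = FALSE
)

# Minimum, first quartile, median, mean, third quartile, and maximum
# of one column.
print(summary(WHO$Under15))

# which.min returns the position (row number) of the smallest value,
# not the value itself.
min_row <- which.min(WHO$Under15)

# Use that position inside square brackets to read the matching country.
min_country <- WHO$Country[min_row]

print(min_row)
print(min_country)
```

The two steps can also be combined as `WHO$Country[which.min(WHO$Under15)]`, which avoids copying the row number by hand.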
54 00:03:24,550 --> 00:03:28,320 We can do this with the which.max function. 55 00:03:28,320 --> 00:03:40,090 So type which.max and then in parentheses WHO$Under15. 56 00:03:40,090 --> 00:03:43,110 Close the parentheses and hit Enter. 57 00:03:43,110 --> 00:03:46,910 This tells us that the 124th observation 58 00:03:46,910 --> 00:03:50,910 has the maximum value of the variable Under15. 59 00:03:50,910 --> 00:03:55,680 We can look up the country of the 124th observation by typing 60 00:03:55,680 --> 00:04:02,080 WHO$Country and then in square brackets 124. 61 00:04:02,080 --> 00:04:04,910 Close the square brackets and hit Enter. 62 00:04:04,910 --> 00:04:08,000 So Niger is the country with the maximum percentage 63 00:04:08,000 --> 00:04:11,990 of the population under 15. 64 00:04:11,990 --> 00:04:17,579 Let's now create a scatter plot of GNI versus fertility rate. 65 00:04:17,579 --> 00:04:20,410 You can do this using the plot function. 66 00:04:20,410 --> 00:04:27,500 So type plot and then in parentheses WHO$GNI, 67 00:04:27,500 --> 00:04:31,450 the variable we want on our x-axis, comma, 68 00:04:31,450 --> 00:04:38,470 and then WHO$FertilityRate, the variable we want on our y-axis. 69 00:04:38,470 --> 00:04:41,420 Close the parentheses and hit Enter. 70 00:04:41,420 --> 00:04:43,610 A scatter plot should appear. 71 00:04:43,610 --> 00:04:47,140 Income, or GNI, is on the x-axis, 72 00:04:47,140 --> 00:04:50,420 and fertility rate is on the y-axis. 73 00:04:50,420 --> 00:04:53,880 Each point in the scatter plot is a country. 74 00:04:53,880 --> 00:04:56,110 We can see that most countries here either 75 00:04:56,110 --> 00:05:01,350 have a low GNI or a high GNI but a low fertility rate. 76 00:05:01,350 --> 00:05:03,600 However, there are a few countries 77 00:05:03,600 --> 00:05:08,110 for which both the GNI and the fertility rate are high. 78 00:05:08,110 --> 00:05:09,720 Let's investigate. 
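As a rough sketch of the which.max lookup and the scatter plot call described above, again on an invented stand-in data frame (the numbers are not real WHO values, and the `xlab` and `ylab` axis labels are an optional addition not used in the video):

```r
# Hypothetical stand-in for the WHO data frame; values are invented.
WHO <- data.frame(
  Country       = c("Country A", "Country B", "Country C", "Country D", "Country E"),
  Under15       = c(30, 13, 42, 25, 50),
  GNI           = c(4000, 35000, 1500, 22000, 700),
  FertilityRate = c(2.8, 1.4, 4.9, 3.1, 7.0),
  stringsAsFactors = FALSE
)

# which.max returns the position of the largest value in the column.
max_row <- which.max(WHO$Under15)
max_country <- WHO$Country[max_row]
print(max_row)
print(max_country)

# First argument goes on the x-axis, second on the y-axis.
# In a non-interactive session R writes the figure to Rplots.pdf
# instead of opening a plot window.
plot(WHO$GNI, WHO$FertilityRate,
     xlab = "GNI", ylab = "Fertility rate")
```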
79 00:05:09,720 --> 00:05:13,150 We'll use the subset function to identify the countries 80 00:05:13,150 --> 00:05:17,450 with a GNI greater than 10,000, and a fertility 81 00:05:17,450 --> 00:05:20,700 rate greater than 2.5. 82 00:05:20,700 --> 00:05:26,210 So go back to your R console and then type Outliers-- 83 00:05:26,210 --> 00:05:32,050 this is what we'll call our subset-- equals subset, 84 00:05:32,050 --> 00:05:44,420 and then in parentheses WHO comma GNI greater than 10,000 85 00:05:44,420 --> 00:05:52,570 and FertilityRate greater than 2.5. 86 00:05:52,570 --> 00:05:56,140 Close the parentheses and hit Enter. 87 00:05:56,140 --> 00:05:58,230 When we used subset before, we only 88 00:05:58,230 --> 00:06:00,310 had one condition to define which 89 00:06:00,310 --> 00:06:03,190 observations to keep in the subset. 90 00:06:03,190 --> 00:06:08,140 Here we have two conditions, separated by the and symbol. 91 00:06:08,140 --> 00:06:10,180 This means that both conditions must 92 00:06:10,180 --> 00:06:14,510 be true for all observations in the subset. 93 00:06:14,510 --> 00:06:17,720 We can see how many rows of data are in our subset 94 00:06:17,720 --> 00:06:20,390 by using the nrow function. 95 00:06:20,390 --> 00:06:25,080 So type nrow for number of rows, and in parentheses 96 00:06:25,080 --> 00:06:26,980 the name of our subset, Outliers. 97 00:06:29,930 --> 00:06:33,050 This tells us that there are seven countries for which 98 00:06:33,050 --> 00:06:37,290 the GNI is greater than 10,000 and the fertility rate 99 00:06:37,290 --> 00:06:40,190 is greater than 2.5. 100 00:06:40,190 --> 00:06:44,170 Now let's output just the country names, GNI, 101 00:06:44,170 --> 00:06:47,090 and fertility rate of these seven countries 102 00:06:47,090 --> 00:06:49,290 to investigate further. 
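Written out, the spoken subset command with its two conditions might look like the sketch below. The stand-in data frame is invented, so it yields two matching rows rather than the seven found in the real data; `n_outliers` is a hypothetical name.

```r
# Hypothetical stand-in for the WHO data frame; values are invented.
# One GNI value is missing to show how subset treats NA.
WHO <- data.frame(
  Country       = c("Country A", "Country B", "Country C",
                    "Country D", "Country E", "Country F"),
  GNI           = c(1500, 12000, 30000, 25000, 800, NA),
  FertilityRate = c(5.1, 3.2, 1.6, 2.9, 6.0, 4.0),
  stringsAsFactors = FALSE
)

# Keep only rows where BOTH conditions hold. Inside subset, column names
# can be used without the WHO$ prefix. Rows where the condition evaluates
# to NA are dropped.
Outliers <- subset(WHO, GNI > 10000 & FertilityRate > 2.5)

# Count the rows that survived the filter.
n_outliers <- nrow(Outliers)
print(n_outliers)
print(Outliers)
```

The video says "equals subset", that is, `Outliers = subset(...)`; at the top level of the console this behaves the same as the `<-` assignment used here.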
103 00:06:49,290 --> 00:06:51,260 There's an easy way of doing this, 104 00:06:51,260 --> 00:06:54,320 and we'll use this technique several times in this class 105 00:06:54,320 --> 00:06:59,090 when we just want to extract a few variables from a data set. 106 00:06:59,090 --> 00:07:01,350 So type the name of our data set, 107 00:07:01,350 --> 00:07:04,840 Outliers, and then in square brackets 108 00:07:04,840 --> 00:07:07,640 we'll make a vector of the names of the variables 109 00:07:07,640 --> 00:07:09,290 we want to output. 110 00:07:09,290 --> 00:07:15,240 So c, and then in parentheses, "Country" for the country name, 111 00:07:15,240 --> 00:07:20,420 comma, "GNI" comma, and then "FertilityRate". 112 00:07:23,520 --> 00:07:27,570 Close the parentheses, close the square brackets, and hit Enter. 113 00:07:30,180 --> 00:07:33,680 This shows us the values of these three variables 114 00:07:33,680 --> 00:07:36,570 for the seven observations of outliers. 115 00:07:36,570 --> 00:07:38,880 We can see that one of the seven countries 116 00:07:38,880 --> 00:07:42,950 is Equatorial Guinea, a country that is very rich per capita 117 00:07:42,950 --> 00:07:45,250 due to oil production, but the wealth 118 00:07:45,250 --> 00:07:48,690 is distributed very unevenly. 119 00:07:48,690 --> 00:07:51,250 In the next video, we'll see how to create 120 00:07:51,250 --> 00:07:54,710 different types of plots in R and then build some summary 121 00:07:54,710 --> 00:07:56,260 tables.
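The column-selection technique described above, indexing a data frame with a character vector of variable names, can be sketched like this. The `Outliers` data frame here is an invented stand-in rather than the seven real rows, and `selected` is a hypothetical name.

```r
# Hypothetical stand-in for the Outliers subset; values are invented.
Outliers <- data.frame(
  Country       = c("Country B", "Country D"),
  Under15       = c(33, 29),
  GNI           = c(12000, 25000),
  FertilityRate = c(3.2, 2.9),
  stringsAsFactors = FALSE
)

# c(...) builds a vector of column names; placing it in square brackets
# keeps only those columns, in the order listed, for every row.
selected <- Outliers[c("Country", "GNI", "FertilityRate")]
print(selected)
```

The result is still a data frame with the same number of rows, so it can be passed on to functions such as nrow or summary.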