1 00:00:04,500 --> 00:00:07,810 In addition to scatter plots, we can create several other types 2 00:00:07,810 --> 00:00:13,880 of plots in R. Two examples are histograms and box plots. 3 00:00:13,880 --> 00:00:17,650 Let's first create a histogram of CellularSubscribers. 4 00:00:17,650 --> 00:00:21,440 To do this, we'll use the hist function. 5 00:00:21,440 --> 00:00:26,830 So in your R console type hist, and then in parentheses 6 00:00:26,830 --> 00:00:27,830 WHO$CellularSubscribers. 7 00:00:35,440 --> 00:00:38,760 Close the parentheses and hit Enter. 8 00:00:38,760 --> 00:00:40,930 If you go over to your plotting window 9 00:00:40,930 --> 00:00:44,430 you can see that the values of CellularSubscribers 10 00:00:44,430 --> 00:00:48,260 are shown on the x-axis and the frequency of these values 11 00:00:48,260 --> 00:00:50,740 is shown on the y-axis. 12 00:00:50,740 --> 00:00:53,200 A histogram is useful for understanding 13 00:00:53,200 --> 00:00:55,710 the distribution of a variable. 14 00:00:55,710 --> 00:00:58,120 Here we can see that the most frequent value 15 00:00:58,120 --> 00:01:02,610 of CellularSubscribers is around 100. 16 00:01:02,610 --> 00:01:06,570 We can also easily create a box plot in R. 17 00:01:06,570 --> 00:01:09,440 We'll make a box plot of LifeExpectancy 18 00:01:09,440 --> 00:01:11,539 sorted by Region. 19 00:01:11,539 --> 00:01:16,120 So back in your R console type boxplot, 20 00:01:16,120 --> 00:01:24,470 and then in parentheses, WHO$LifeExpectancy and then 21 00:01:24,470 --> 00:01:26,960 a tilde symbol followed by WHO$Region. 22 00:01:29,570 --> 00:01:33,020 Close the parentheses and hit Enter. 23 00:01:33,020 --> 00:01:34,860 Then go over to your plotting window. 24 00:01:34,860 --> 00:01:36,900 You may need to stretch it out a little bit 25 00:01:36,900 --> 00:01:41,410 so that you can see all of the labels on the x-axis. 
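To try the two plotting commands without the lecture's data file, here is a minimal sketch. The `WHO` data frame below is a hypothetical stand-in: the column names follow the transcript, but the values are randomly generated and are not the real WHO data.

```r
# Hypothetical stand-in for the WHO data frame used in the lecture.
# Column names match the transcript; the values are made up.
set.seed(1)
WHO <- data.frame(
  Region = factor(rep(c("Africa", "Americas", "Europe"), each = 20)),
  CellularSubscribers = c(rnorm(20, 50, 15), rnorm(20, 95, 20), rnorm(20, 120, 20)),
  LifeExpectancy = c(rnorm(20, 58, 6), rnorm(20, 74, 3), rnorm(20, 77, 4))
)

# Histogram: values on the x-axis, frequency of those values on the y-axis.
h <- hist(WHO$CellularSubscribers)

# Box plot of LifeExpectancy with one box per Region.
# The tilde reads as "LifeExpectancy split by Region".
b <- boxplot(WHO$LifeExpectancy ~ WHO$Region)
```

Both functions draw the plot and also return an object describing it (`h$counts` holds the bar heights, `b$stats` the five summary values per box), which can be handy for checking what the picture shows.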
26 00:01:41,410 --> 00:01:43,979 A box plot is useful for understanding 27 00:01:43,979 --> 00:01:47,110 the statistical range of a variable. 28 00:01:47,110 --> 00:01:51,270 This box plot shows how life expectancy in countries 29 00:01:51,270 --> 00:01:54,620 varies according to the region the country is in. 30 00:01:54,620 --> 00:01:57,289 The box for each region shows the range 31 00:01:57,289 --> 00:01:59,860 between the first and third quartiles 32 00:01:59,860 --> 00:02:03,460 with the middle line marking the median value. 33 00:02:03,460 --> 00:02:06,990 The dashed lines at the top and bottom of the box, 34 00:02:06,990 --> 00:02:09,669 often called whiskers, show the range 35 00:02:09,669 --> 00:02:12,220 from the minimum to maximum values, 36 00:02:12,220 --> 00:02:16,560 excluding any outliers, which are plotted as circles. 37 00:02:16,560 --> 00:02:18,950 Outliers are defined by first computing 38 00:02:18,950 --> 00:02:22,170 the difference between the first and third quartiles, 39 00:02:22,170 --> 00:02:24,150 or the height of the box. 40 00:02:24,150 --> 00:02:27,400 This number is called the inter-quartile range. 41 00:02:27,400 --> 00:02:30,520 Any point that is greater than the third quartile 42 00:02:30,520 --> 00:02:34,040 plus 1.5 times the inter-quartile range, or any point that 43 00:02:34,040 --> 00:02:38,290 is less than the first quartile minus 1.5 times the inter-quartile range 44 00:02:38,290 --> 00:02:40,510 is considered an outlier. 45 00:02:40,510 --> 00:02:44,110 This box plot shows us that Europe has the highest 46 00:02:44,110 --> 00:02:48,660 median life expectancy, the Americas has the smallest 47 00:02:48,660 --> 00:02:52,730 inter-quartile range, and the eastern Mediterranean region 48 00:02:52,730 --> 00:02:58,660 has the highest overall range of life expectancy values. 49 00:02:58,660 --> 00:03:02,230 If you want to give nice labels to any of your plots, 50 00:03:02,230 --> 00:03:05,920 you can easily do so by adding a few arguments. 
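Before moving on to labels, the outlier rule for the box plot can be checked by hand. This is a small sketch on made-up numbers (not the WHO data). By default, R's `boxplot` flags points that lie more than 1.5 times the inter-quartile range beyond the first or third quartile. Note that `boxplot.stats` computes the quartiles as "hinges", which can differ slightly from `quantile()`, so the two approaches will not always agree on borderline points.

```r
# Made-up values for illustration only.
x <- c(52, 61, 63, 64, 66, 67, 68, 70, 71, 95)

q1  <- quantile(x, 0.25)   # first quartile
q3  <- quantile(x, 0.75)   # third quartile
iqr <- q3 - q1             # inter-quartile range: the height of the box

# Default rule: more than 1.5 * IQR beyond either quartile.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
flagged <- x[x < lower | x > upper]

# What R itself reports as outliers for the same data.
reported <- boxplot.stats(x)$out
```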
51 00:03:05,920 --> 00:03:11,970 Go back to your R console, scroll up and then 52 00:03:11,970 --> 00:03:18,730 inside the parentheses type a comma and then xlab equals 53 00:03:18,730 --> 00:03:20,840 and then empty quotes-- we're not 54 00:03:20,840 --> 00:03:24,040 going to label the x-axis here because the regions are already 55 00:03:24,040 --> 00:03:29,220 nicely labeled-- and then a comma, and then ylab 56 00:03:29,220 --> 00:03:32,470 equals "Life Expectancy". 57 00:03:38,470 --> 00:03:41,550 Close the quotes, and then a comma, 58 00:03:41,550 --> 00:03:55,870 and then main = "Life Expectancy of Countries by Region". 59 00:03:55,870 --> 00:03:58,560 Close the quotes and hit Enter. 60 00:03:58,560 --> 00:04:00,670 If you go back and look at your box plot again 61 00:04:00,670 --> 00:04:04,120 you should now see that there's a nice y-axis label 62 00:04:04,120 --> 00:04:06,080 and an overall title to the plot. 63 00:04:08,960 --> 00:04:12,710 Lastly, let's take a look at some summary tables. 64 00:04:12,710 --> 00:04:15,490 So go back to your R console and we'll 65 00:04:15,490 --> 00:04:19,130 start by making a table of the Region variable. 66 00:04:19,130 --> 00:04:24,560 So we'll type table and then in parentheses WHO$Region. 67 00:04:27,390 --> 00:04:30,550 Close the parentheses and hit Enter. 68 00:04:30,550 --> 00:04:33,990 This is similar to what we saw in the summary output 69 00:04:33,990 --> 00:04:36,150 and counts the number of observations 70 00:04:36,150 --> 00:04:38,900 in each category of Region. 71 00:04:38,900 --> 00:04:41,470 Tables work well for variables with only a few 72 00:04:41,470 --> 00:04:46,620 possible values, and we'll see more of this in recitation. 73 00:04:46,620 --> 00:04:48,340 You can see some nice information 74 00:04:48,340 --> 00:04:52,940 about numerical variables by using the tapply function. 75 00:04:52,940 --> 00:04:55,490 Let's start by looking at an example. 
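Before the tapply example, the labeled box plot and the Region table described above can be sketched as follows. The `WHO` data frame here is a hypothetical stand-in with made-up values; only the column names and the label text come from the transcript.

```r
# Hypothetical stand-in for the WHO data frame; values are made up.
WHO <- data.frame(
  Region = factor(c("Africa", "Africa", "Africa",
                    "Americas", "Americas",
                    "Europe", "Europe", "Europe", "Europe")),
  LifeExpectancy = c(55, 60, 62, 72, 76, 74, 78, 80, 81)
)

# Same box plot as before, with an empty x-axis label,
# a y-axis label, and an overall title.
boxplot(WHO$LifeExpectancy ~ WHO$Region,
        xlab = "",
        ylab = "Life Expectancy",
        main = "Life Expectancy of Countries by Region")

# Count the number of observations in each category of Region.
region_counts <- table(WHO$Region)
```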
76 00:04:55,490 --> 00:05:06,170 So type tapply, and then in parentheses WHO$Over60 comma, 77 00:05:06,170 --> 00:05:12,650 and then WHO$Region comma, and then mean. 78 00:05:12,650 --> 00:05:15,950 Close the parentheses and hit Enter. 79 00:05:15,950 --> 00:05:19,230 This splits the observations by Region 80 00:05:19,230 --> 00:05:23,380 and then computes the mean of the variable Over60. 81 00:05:23,380 --> 00:05:27,820 So tapply splits the data by the second argument you give, 82 00:05:27,820 --> 00:05:30,710 and then applies the third argument function 83 00:05:30,710 --> 00:05:33,790 to the variable given as the first argument. 84 00:05:33,790 --> 00:05:36,560 This result tells us that the average percentage 85 00:05:36,560 --> 00:05:41,720 of the population over 60 in African countries is about 5%, 86 00:05:41,720 --> 00:05:44,490 while the average percentage of the population over 60 87 00:05:44,490 --> 00:05:48,830 in European countries is about 20%. 88 00:05:48,830 --> 00:05:51,130 Let's look at another example. 89 00:05:51,130 --> 00:05:53,840 This time in the tapply function, 90 00:05:53,840 --> 00:06:01,520 we'll give as the first argument WHO$LiteracyRate then 91 00:06:01,520 --> 00:06:06,210 as the second argument we'll give WHO$Region again. 92 00:06:06,210 --> 00:06:08,810 And as our third argument we'll give min. 93 00:06:08,810 --> 00:06:11,560 Close the parentheses and hit Enter. 94 00:06:11,560 --> 00:06:14,270 Here we see something a little strange. 95 00:06:14,270 --> 00:06:17,980 We have the value NA for all of the regions. 96 00:06:17,980 --> 00:06:21,130 This is because we have some missing values in our data 97 00:06:21,130 --> 00:06:22,960 for literacy rate. 98 00:06:22,960 --> 00:06:26,280 A common thing to do is to just remove the missing values 99 00:06:26,280 --> 00:06:28,410 when doing the computation. 
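Here is a minimal sketch of the two tapply calls so far, run on a tiny hypothetical `WHO` data frame (made-up values, chosen so the group means are easy to check by eye). It shows both the group mean and the NA result that appears when a group contains a missing value.

```r
# Hypothetical stand-in for the WHO data frame; values are made up.
WHO <- data.frame(
  Region = factor(c("Africa", "Africa", "Europe", "Europe")),
  Over60 = c(4, 6, 19, 21),
  LiteracyRate = c(60, NA, 99, NA)
)

# Split Over60 by Region (second argument), then apply mean (third argument).
mean_over60 <- tapply(WHO$Over60, WHO$Region, mean)

# Same pattern with min. Each region has a missing value here,
# so min() returns NA for every group.
min_literacy <- tapply(WHO$LiteracyRate, WHO$Region, min)
```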
100 00:06:28,410 --> 00:06:32,830 We need to pass one additional argument, so hit the up arrow, 101 00:06:32,830 --> 00:06:36,180 and then inside the parentheses add a comma 102 00:06:36,180 --> 00:06:44,970 and then na.rm = TRUE and hit Enter. 103 00:06:44,970 --> 00:06:46,980 This removes all of the countries that 104 00:06:46,980 --> 00:06:49,220 are missing a value for LiteracyRate 105 00:06:49,220 --> 00:06:51,810 before doing the computation. 106 00:06:51,810 --> 00:06:55,870 This time we see numerical values, as we expect. 107 00:06:55,870 --> 00:06:58,370 So we've split the data by Region again 108 00:06:58,370 --> 00:07:01,280 and computed the minimum value of LiteracyRate 109 00:07:01,280 --> 00:07:06,450 for all countries with a value in the LiteracyRate variable. 110 00:07:06,450 --> 00:07:11,070 By using some basic functions in R, plots, and summary tables 111 00:07:11,070 --> 00:07:14,450 we were able to get a better understanding of our data. 112 00:07:14,450 --> 00:07:17,110 You'll see more of this in the recitation and homework 113 00:07:17,110 --> 00:07:18,660 assignment.
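As a short sketch of the na.rm fix, here is the same call on a hypothetical `WHO` data frame with made-up values. Extra arguments placed after the function name in `tapply` are passed through to that function, so `na.rm = TRUE` reaches `min`.

```r
# Hypothetical stand-in for the WHO data frame; values are made up.
WHO <- data.frame(
  Region = factor(c("Africa", "Africa", "Africa", "Europe", "Europe")),
  LiteracyRate = c(60, NA, 45, 99, NA)
)

# na.rm = TRUE is handed on to min(), which then ignores missing values
# within each region before taking the minimum.
min_literacy <- tapply(WHO$LiteracyRate, WHO$Region, min, na.rm = TRUE)
```

One thing to watch for: if every value in a region is missing, `min` with `na.rm = TRUE` has nothing left to work with and returns `Inf` with a warning rather than a real minimum.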