1 00:00:05,880 --> 00:00:08,420 Okay, so now we're going to start with a simple bar 2 00:00:08,420 --> 00:00:11,230 plot of the MIT international student data. 3 00:00:11,230 --> 00:00:18,180 So first, let's load the ggplot library, ggplot2, 4 00:00:18,180 --> 00:00:20,230 and load the data frame. 5 00:00:20,230 --> 00:00:22,020 So intl = read.csv("intl.csv"). 6 00:00:28,490 --> 00:00:31,160 Now, the structure of this data frame is very simple. 7 00:00:31,160 --> 00:00:33,680 There are two columns, two variables. 8 00:00:33,680 --> 00:00:36,670 The first one, the region, and the second one 9 00:00:36,670 --> 00:00:38,720 is the percentage of international students 10 00:00:38,720 --> 00:00:41,450 who came from that region. 11 00:00:41,450 --> 00:00:45,970 So making a bar plot from this data isn't too hard. 12 00:00:45,970 --> 00:00:47,940 We start off with a ggplot command, 13 00:00:47,940 --> 00:00:51,970 of course, the first argument being the data frame. 14 00:00:51,970 --> 00:00:56,740 The aesthetic in this case is to have Region on the x-axis, 15 00:00:56,740 --> 00:01:00,070 and on the y-axis, to have the percentage 16 00:01:00,070 --> 00:01:03,290 of international students. 17 00:01:03,290 --> 00:01:05,050 Now, the geometry we're going to use 18 00:01:05,050 --> 00:01:09,470 is, as you might guess, bar, geom_bar. 19 00:01:09,470 --> 00:01:12,850 We have to pass one argument to this geom_bar, 20 00:01:12,850 --> 00:01:15,410 and it's called stat = "identity". 21 00:01:15,410 --> 00:01:18,920 I'm going to come back and explain what that means. 22 00:01:18,920 --> 00:01:22,000 I also want to label my bars with the value, 23 00:01:22,000 --> 00:01:25,210 so it's easy to read in closer detail. 24 00:01:25,210 --> 00:01:29,560 So I'm going to use geom_text to do that. 
25 00:01:29,560 --> 00:01:32,160 And the aesthetic of our text is simply 26 00:01:32,160 --> 00:01:36,130 to have the value of a label, the text of a label, 27 00:01:36,130 --> 00:01:39,110 to be the value of our percentages. 28 00:01:42,850 --> 00:01:45,270 Let's look at that. 29 00:01:45,270 --> 00:01:50,670 So yes, we have a bar for each region. 30 00:01:50,670 --> 00:01:52,170 The values are between zero and one, 31 00:01:52,170 --> 00:01:54,530 which looks kind of strange. 32 00:01:54,530 --> 00:01:57,600 The labels are actually lying over 33 00:01:57,600 --> 00:02:01,030 the top of the columns, which isn't very nice, 34 00:02:01,030 --> 00:02:03,040 and the regions aren't really ordered 35 00:02:03,040 --> 00:02:04,490 in any way that's useful. 36 00:02:04,490 --> 00:02:06,470 They're actually ordered in alphabetical order, 37 00:02:06,470 --> 00:02:08,880 but I think it would be much more interesting to have them 38 00:02:08,880 --> 00:02:11,300 in descending order. 39 00:02:11,300 --> 00:02:14,150 So we're going to work on this. 40 00:02:14,150 --> 00:02:16,640 First of all, though, what is this stat = "identity"? 41 00:02:16,640 --> 00:02:18,270 Well, it's pretty simple. 42 00:02:18,270 --> 00:02:22,400 Geometry bar has multiple modes of operation. 43 00:02:22,400 --> 00:02:26,450 And stat = "identity" says, use the value of the y variable 44 00:02:26,450 --> 00:02:28,400 as is, which is what we want. 45 00:02:28,400 --> 00:02:31,750 The height of the bar is the value of the y variable. 46 00:02:31,750 --> 00:02:33,300 Now, there are other modes, including 47 00:02:33,300 --> 00:02:35,430 one that counts the number of rows 48 00:02:35,430 --> 00:02:39,860 for each value of x, and plots that instead. 49 00:02:39,860 --> 00:02:42,460 So you can look at the documentation for ggplot 50 00:02:42,460 --> 00:02:44,910 to see the different options and how they work. 51 00:02:44,910 --> 00:02:48,880 But stat = "identity" is what we want right now. 
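The first plot described above can be sketched roughly as follows. This is a minimal sketch, not the exact session from the video: the contents of intl.csv are not shown in the transcript, so the small data frame here is a made-up stand-in with the same two columns (Region, PercentOfIntl). The second plot illustrates the counting mode mentioned above, which is the default for geom_bar when no stat is given.

```r
library(ggplot2)

# Hypothetical stand-in for read.csv("intl.csv"); these values are invented
# for illustration and are NOT the real MIT figures.
intl = data.frame(
  Region = c("Asia", "Europe", "Oceania"),
  PercentOfIntl = c(0.2, 0.5, 0.3)
)

# stat = "identity": the bar height is the y value itself.
p = ggplot(intl, aes(x = Region, y = PercentOfIntl)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = PercentOfIntl))
print(p)

# Default mode (stat = "count"): no y aesthetic; the bar height is the
# number of rows for each value of x (here, one row per region).
counts = ggplot(intl, aes(x = Region)) + geom_bar()
print(counts)
```

With one row per region, the counting version draws every bar at height 1, which is why stat = "identity" is the mode that fits this data.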
52 00:02:48,880 --> 00:02:51,230 Now, the x-axis is out of order. 53 00:02:51,230 --> 00:02:53,650 And the reason for this is that ggplot defaults 54 00:02:53,650 --> 00:02:56,490 to alphabetical order for the x-axis. 55 00:02:56,490 --> 00:02:59,570 What we need to do is make Region an ordered factor 56 00:02:59,570 --> 00:03:02,220 instead of an unordered factor. 57 00:03:02,220 --> 00:03:04,100 We can do this with the reorder command 58 00:03:04,100 --> 00:03:06,290 and the transform command. 59 00:03:06,290 --> 00:03:08,170 So let's write this out. 60 00:03:08,170 --> 00:03:14,730 So we're going to transform the international data frame. 61 00:03:14,730 --> 00:03:18,780 And what we're going to do is say, Region, it's 62 00:03:18,780 --> 00:03:22,820 going to be a reordering of itself, 63 00:03:22,820 --> 00:03:25,690 based on decreasing order of PercentOfIntl. 64 00:03:32,910 --> 00:03:37,110 So if we look at the structure of the data frame now, 65 00:03:37,110 --> 00:03:40,980 we see there's something going on in the Region column that 66 00:03:40,980 --> 00:03:41,950 wasn't going on before. 67 00:03:41,950 --> 00:03:44,230 And that's that ordering. 68 00:03:44,230 --> 00:03:47,560 So you might have also noticed that I put a negative sign 69 00:03:47,560 --> 00:03:50,410 in front of PercentOfIntl. 70 00:03:50,410 --> 00:03:53,170 So that negative sign means decreasing order. 71 00:03:53,170 --> 00:03:55,130 If we had left that out, it would have actually 72 00:03:55,130 --> 00:03:57,230 ordered them in increasing order. 73 00:03:57,230 --> 00:03:59,579 So unknown or stateless would have been first, 74 00:03:59,579 --> 00:04:04,230 and Oceania would have been second, and so on. 75 00:04:04,230 --> 00:04:07,190 So that's one thing fixed. 76 00:04:07,190 --> 00:04:09,240 Another thing we didn't like was that the numbers 77 00:04:09,240 --> 00:04:14,210 were between zero and one, which looks a little bit messy. 
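The reordering step described above might look like the sketch below. The data frame is again a made-up stand-in for intl.csv (the values are illustrative only). One detail worth noting: reorder returns a factor whose levels are sorted by the given score, and that level order is what ggplot uses for the x-axis.

```r
# Hypothetical stand-in for the intl data frame; values are invented.
intl = data.frame(
  Region = c("Asia", "Europe", "Oceania"),
  PercentOfIntl = c(0.2, 0.5, 0.3)
)

# Reorder the Region levels by -PercentOfIntl, so the largest share
# comes first (the negative sign gives decreasing order).
intl = transform(intl, Region = reorder(Region, -PercentOfIntl))

# Region is now a factor; its levels follow the new ordering.
str(intl)
levels(intl$Region)  # "Europe" "Oceania" "Asia"
```

Dropping the negative sign would sort the levels in increasing order of PercentOfIntl instead.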
78 00:04:14,210 --> 00:04:17,440 So let's just simply multiply all the values by 100. 79 00:04:17,440 --> 00:04:30,560 So intl$PercentOfIntl = intl$PercentOfIntl*100. 80 00:04:30,560 --> 00:04:32,230 And now the other things we have to fix, 81 00:04:32,230 --> 00:04:36,310 like the text overlying and the x-axis being all bunched up 82 00:04:36,310 --> 00:04:39,750 like that, we're going to do that in a new ggplot command. 83 00:04:39,750 --> 00:04:42,170 So I'm going to break it across multiple lines. 84 00:04:42,170 --> 00:04:47,230 So we start up with the ggplot command, as we did before, 85 00:04:47,230 --> 00:04:49,840 actually identical to what we had before. 86 00:04:49,840 --> 00:04:54,409 So the aesthetic is x-axis is the region, 87 00:04:54,409 --> 00:05:00,390 and the y-axis is the percentage of international students. 88 00:05:00,390 --> 00:05:04,800 We break it into multiple lines, so put the plus at the end 89 00:05:04,800 --> 00:05:07,140 there, and press Shift Enter. 90 00:05:07,140 --> 00:05:08,930 We're going to do a bar plot. 91 00:05:15,400 --> 00:05:20,720 The stat = "identity", as before. 92 00:05:20,720 --> 00:05:22,090 And this time though, we're going 93 00:05:22,090 --> 00:05:24,940 to manually specify a fill. 94 00:05:24,940 --> 00:05:26,640 I'm going to say "dark blue". 95 00:05:26,640 --> 00:05:30,170 I quite like how that looks. 96 00:05:30,170 --> 00:05:33,470 Now, we need the text, and the aesthetic of that 97 00:05:33,470 --> 00:05:36,350 is to have the label equal the value of the column. 98 00:05:39,680 --> 00:05:41,530 I'm going to add one more thing to this. 99 00:05:41,530 --> 00:05:46,050 I'm going to say vjust = -0.4. 100 00:05:46,050 --> 00:05:49,070 And what this does is, it moves the labels up a little bit 101 00:05:49,070 --> 00:05:50,650 and off the top of the bars. 102 00:05:50,650 --> 00:05:51,610 You can play with that. 
103 00:05:51,610 --> 00:05:53,500 So a positive value will move it down, 104 00:05:53,500 --> 00:05:55,050 and a negative value will move it up. 105 00:05:58,270 --> 00:06:01,580 Next, I'm going to set the y-axis label 106 00:06:01,580 --> 00:06:04,090 to be something a bit more sensible-- 107 00:06:04,090 --> 00:06:10,490 so "Percent of International Students". 108 00:06:13,290 --> 00:06:15,730 And finally, I'd like to fix up that x-axis. 109 00:06:15,730 --> 00:06:18,260 So I want to get rid of the word "Region," because it's 110 00:06:18,260 --> 00:06:20,560 pretty obvious these are regions. 111 00:06:20,560 --> 00:06:23,900 And I also want to rotate the text at a bit of an angle, 112 00:06:23,900 --> 00:06:28,440 so you can read it all on a plot like this. 113 00:06:28,440 --> 00:06:29,900 That's done with the theme command. 114 00:06:29,900 --> 00:06:32,690 So the theming we're going to do is 115 00:06:32,690 --> 00:06:37,720 we're going to say the axis title, the x-axis, 116 00:06:37,720 --> 00:06:38,500 should be blank. 117 00:06:43,190 --> 00:06:50,520 And the axis text on the x-axis should be rotated, 118 00:06:50,520 --> 00:06:55,850 so it's a text element whose angle is 45. 119 00:06:55,850 --> 00:06:58,820 And I'll move it sideways just a little bit-- hjust = 1. 120 00:07:02,280 --> 00:07:03,500 And there we go. 121 00:07:03,500 --> 00:07:07,900 So we've got our labels vjust-ed above the columns. 122 00:07:07,900 --> 00:07:10,540 The bars themselves are dark blue. 123 00:07:10,540 --> 00:07:12,340 The numbers are now between 0 and 100, 124 00:07:12,340 --> 00:07:14,440 instead of zero and one. 125 00:07:14,440 --> 00:07:17,140 We can read all the text labels. 126 00:07:17,140 --> 00:07:18,940 And it's generally a lot more readable 127 00:07:18,940 --> 00:07:25,320 than the pie plot or our original ggplot, at that. 128 00:07:25,320 --> 00:07:27,300 Let's go back to the slides now and talk 129 00:07:27,300 --> 00:07:30,020 about what we've just done.
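Putting the steps from this walkthrough together, the final plot could be written roughly as below. This is a sketch under the same assumption as before: the data frame is an invented stand-in for intl.csv, so the numbers are illustrative and not the real percentages. The function names and arguments are the ones named in the narration (geom_bar, geom_text, vjust, ylab, theme, element_blank, element_text).

```r
library(ggplot2)

# Hypothetical stand-in for read.csv("intl.csv"); values are invented.
intl = data.frame(
  Region = c("Asia", "Europe", "Oceania"),
  PercentOfIntl = c(0.2, 0.5, 0.3)
)

# Put the regions in decreasing order of their share.
intl = transform(intl, Region = reorder(Region, -PercentOfIntl))

# Convert fractions (0 to 1) into percentages (0 to 100).
intl$PercentOfIntl = intl$PercentOfIntl * 100

p = ggplot(intl, aes(x = Region, y = PercentOfIntl)) +
  # Bar height is the y value as is; bars filled dark blue.
  geom_bar(stat = "identity", fill = "dark blue") +
  # Label each bar with its value; negative vjust lifts the label above the bar.
  geom_text(aes(label = PercentOfIntl), vjust = -0.4) +
  ylab("Percent of International Students") +
  # Drop the x-axis title and rotate the region names by 45 degrees.
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```

Breaking the command across lines with a trailing + keeps each layer on its own line, which makes it easier to adjust one piece (for example the vjust value) without touching the rest.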