1 00:00:04,500 --> 00:00:06,290 So now we're going to try plotting a world 2 00:00:06,290 --> 00:00:08,810 map with a new data set that has the number 3 00:00:08,810 --> 00:00:11,580 of international students from each country. 4 00:00:11,580 --> 00:00:13,080 So first of all, we're going to need 5 00:00:13,080 --> 00:00:16,570 to use the ggmap package, which you may need to install. 6 00:00:19,940 --> 00:00:21,410 And we're going to load in the data 7 00:00:21,410 --> 00:00:22,780 set, which is called intlall.csv. 8 00:00:26,530 --> 00:00:27,570 So read.csv(intlall.csv). 9 00:00:31,410 --> 00:00:38,290 And I'm going to do stringsAsFactors = FALSE. 10 00:00:38,290 --> 00:00:38,790 OK? 11 00:00:38,790 --> 00:00:42,500 Let's look at the first few rows of intlall. 12 00:00:45,800 --> 00:00:49,210 So you see that each row corresponds to a country. 13 00:00:49,210 --> 00:00:51,350 There's a citizenship column that's 14 00:00:51,350 --> 00:00:54,480 the country name, number of undergraduates, number 15 00:00:54,480 --> 00:00:58,210 of graduates, special undergraduates and graduates, 16 00:00:58,210 --> 00:01:01,630 exchange or visiting, and a total column. 17 00:01:01,630 --> 00:01:04,290 Now there's these NAs in here, but they're not really NAs. 18 00:01:04,290 --> 00:01:05,530 They're just 0's. 19 00:01:05,530 --> 00:01:07,990 So what we're going to do is say, 20 00:01:07,990 --> 00:01:10,539 all these NAs should be 0's. 21 00:01:10,539 --> 00:01:27,160 So in intlall, all entries that are NA, should be 0. 22 00:01:27,160 --> 00:01:28,960 And let's look at the first few rows again. 23 00:01:32,820 --> 00:01:33,320 OK. 24 00:01:33,320 --> 00:01:35,180 Much better. 25 00:01:35,180 --> 00:01:38,520 Right, so next step is to load the world map. 26 00:01:38,520 --> 00:01:44,110 So let's call it world_map = map_data("world"). 27 00:01:50,070 --> 00:01:54,050 We did something similar in the lecture with the state data. 28 00:01:54,050 --> 00:01:58,410 So let's look at the structure of the world_map. 29 00:01:58,410 --> 00:02:02,740 So the first two columns are the longitude and latitude; 30 00:02:02,740 --> 00:02:05,870 third column is something called group -- 31 00:02:05,870 --> 00:02:07,680 that's actually a group for each country, 32 00:02:07,680 --> 00:02:11,310 using a different number for each country; order, 33 00:02:11,310 --> 00:02:15,250 we'll get to that later; region is just the country name, 34 00:02:15,250 --> 00:02:19,350 and subregion is sometimes used for some countries to describe 35 00:02:19,350 --> 00:02:23,900 islands and other things like that. 36 00:02:23,900 --> 00:02:28,150 So we want to shove the world_map data 37 00:02:28,150 --> 00:02:31,400 frame and the intlall data frame into one data frame, 38 00:02:31,400 --> 00:02:33,840 so we can use it for ggplot. 39 00:02:33,840 --> 00:02:46,240 So let's say world_map is a merge of world_map and intlall. 40 00:02:46,240 --> 00:02:48,020 Now, in world_map, the country name 41 00:02:48,020 --> 00:02:52,910 is just called region, as you can see right here. 42 00:02:52,910 --> 00:02:56,240 And in the intlall, the country name 43 00:02:56,240 --> 00:02:58,470 is actually called Citizenship. 44 00:03:04,510 --> 00:03:05,430 OK. 45 00:03:05,430 --> 00:03:07,300 So let's look at the structure of world_map 46 00:03:07,300 --> 00:03:10,820 just to make sure it makes sense. 47 00:03:10,820 --> 00:03:12,300 Looks good. 48 00:03:12,300 --> 00:03:13,360 OK. 49 00:03:13,360 --> 00:03:17,890 So to plot a map, we use the geom_polygon geometry. 50 00:03:17,890 --> 00:03:21,280 So start off ggplot(world_map, aes(x = long, y = lat, 51 00:03:21,280 --> 00:03:21,950 group = group)). 52 00:03:38,700 --> 00:03:41,680 We want to use geom_polygon as the geometry. 53 00:03:47,350 --> 00:03:49,000 Countries will be filled in in white, 54 00:03:49,000 --> 00:03:53,910 and their borders will be in black. 55 00:03:57,270 --> 00:04:00,960 And we'll use a Mercator projection. 56 00:04:00,960 --> 00:04:02,880 There's a few other options in there, as well. 57 00:04:07,560 --> 00:04:08,310 OK. 58 00:04:08,310 --> 00:04:10,850 So that looks kind of like a world map. 59 00:04:10,850 --> 00:04:13,300 There's a few things going on here. 60 00:04:13,300 --> 00:04:14,720 So first of all, all the countries 61 00:04:14,720 --> 00:04:17,640 look like big black blobs. 62 00:04:17,640 --> 00:04:21,010 What on earth is going on, you might say. 63 00:04:21,010 --> 00:04:25,330 Well, sometimes the merge can reorder the data. 64 00:04:25,330 --> 00:04:28,610 And it turns out that what the world_map data frame really is 65 00:04:28,610 --> 00:04:31,530 is actually a list of latitude and longitude points 66 00:04:31,530 --> 00:04:34,490 that define the border of each country. 67 00:04:34,490 --> 00:04:37,659 So if we accidentally reorder our data frame 68 00:04:37,659 --> 00:04:39,560 they no longer make any sense. 69 00:04:39,560 --> 00:04:41,050 And as it goes from point to point, 70 00:04:41,050 --> 00:04:43,210 the points might be on the other side of the country 71 00:04:43,210 --> 00:04:44,900 as it defines the polygon. 72 00:04:44,900 --> 00:04:49,700 So, we have to reorder the data in the correct order. 73 00:04:49,700 --> 00:04:53,060 So this command is a little bit complicated looking, 74 00:04:53,060 --> 00:04:55,880 but when you break it down, it's not so bad. 75 00:04:55,880 --> 00:05:00,850 So, we take the world_map, and we're going to reorder it. 76 00:05:00,850 --> 00:05:04,670 So world_map, we're going to reorder the rows. 77 00:05:04,670 --> 00:05:07,490 We're going to order the rows based on, first 78 00:05:07,490 --> 00:05:11,370 of all, the group, which is pretty much 79 00:05:11,370 --> 00:05:17,690 equivalent to the country, and then the order variable, which 80 00:05:17,690 --> 00:05:21,710 is just the correct order for the border points. 81 00:05:21,710 --> 00:05:25,510 And we're going to take all the columns, of course. 82 00:05:25,510 --> 00:05:26,280 Done. 83 00:05:26,280 --> 00:05:32,000 So if we go and try plotting it again -- so ggplot-- 84 00:05:32,000 --> 00:05:34,190 I guess I should go up, up. 85 00:05:34,190 --> 00:05:37,190 There we go, much easier. 86 00:05:37,190 --> 00:05:39,909 Right, so now we have the map, and it 87 00:05:39,909 --> 00:05:41,960 looks far more reasonable. 88 00:05:41,960 --> 00:05:44,040 OK, next problem. 89 00:05:44,040 --> 00:05:46,400 Some of the countries are missing. 90 00:05:46,400 --> 00:05:49,150 Now of course, the USA is missing 91 00:05:49,150 --> 00:05:51,330 because MIT is in the USA, so that 92 00:05:51,330 --> 00:05:57,480 wouldn't be an international student coming from the USA. 93 00:05:57,480 --> 00:06:00,460 And some parts of Africa are missing, presumably 94 00:06:00,460 --> 00:06:02,960 because there are no students at MIT right now 95 00:06:02,960 --> 00:06:05,620 who are from those countries. 96 00:06:05,620 --> 00:06:08,480 But you'll also notice that Russia is missing, 97 00:06:08,480 --> 00:06:13,350 and a lot of countries near it, as well as China. 98 00:06:13,350 --> 00:06:14,940 Which is definitely not true because I 99 00:06:14,940 --> 00:06:20,020 have many friends at MIT who are from Russia and China. 100 00:06:20,020 --> 00:06:24,010 So, what do we do about that? 101 00:06:24,010 --> 00:06:26,620 The reason China is missing is that it 102 00:06:26,620 --> 00:06:29,890 has a different name in the MIT data frame 103 00:06:29,890 --> 00:06:31,720 than in the world_map data frame. 104 00:06:31,720 --> 00:06:33,880 So when we merged them, it was dropped 105 00:06:33,880 --> 00:06:37,210 from the data set because it didn't match up. 106 00:06:37,210 --> 00:06:44,110 So to see what it's called in the MIT data frame, 107 00:06:44,110 --> 00:06:45,460 let's just do a table. 108 00:06:45,460 --> 00:06:49,760 There's a few ways to do this, but this is pretty easy. 109 00:06:49,760 --> 00:06:51,409 OK, so we get a list of all the names. 110 00:06:54,270 --> 00:06:56,500 If we scroll all the way up, we'll 111 00:06:56,500 --> 00:07:00,720 see it says "China (People's Republic Of)". 112 00:07:00,720 --> 00:07:04,290 Now, in the world_map data frame, 113 00:07:04,290 --> 00:07:05,600 it's simply called "China". 114 00:07:08,360 --> 00:07:13,720 So, what we can do is change the MIT data frame. 115 00:07:13,720 --> 00:07:24,710 So let's say the citizenship column, the one row where 116 00:07:24,710 --> 00:07:30,300 it equals "China (People's Republic Of)" 117 00:07:30,300 --> 00:07:36,180 should just be "China". 118 00:07:36,180 --> 00:07:38,690 OK, let's check. 119 00:07:38,690 --> 00:07:40,710 Do the table again. 120 00:07:40,710 --> 00:07:42,730 Scroll all the way up. 121 00:07:45,950 --> 00:07:47,970 There it is, China. 122 00:07:47,970 --> 00:07:48,950 So we've fixed that. 123 00:07:48,950 --> 00:07:55,230 So now the MIT data frame is consistent with the world_map 124 00:07:55,230 --> 00:07:56,870 data frame. 125 00:07:56,870 --> 00:08:00,270 So now we have to go through the merge again. 126 00:08:00,270 --> 00:08:06,960 So let's say world_map is a merge 127 00:08:06,960 --> 00:08:14,590 of a fresh copy of the map_data, the intlall data 128 00:08:14,590 --> 00:08:15,870 frame with China fixed. 129 00:08:18,800 --> 00:08:24,310 It's called region in the world_map data, 130 00:08:24,310 --> 00:08:31,490 and it's called Citizenship in the MIT data. 131 00:08:31,490 --> 00:08:35,950 Alright, now we need to do the reordering again. 132 00:08:35,950 --> 00:08:41,580 So press up a few times until we find it. 133 00:08:41,580 --> 00:08:42,130 There it was. 134 00:08:42,130 --> 00:08:46,080 So there's the reordering command. 135 00:08:46,080 --> 00:08:46,580 OK. 136 00:08:46,580 --> 00:08:48,200 And we should be good to go, now. 137 00:08:48,200 --> 00:08:50,140 So let's try plotting it. 138 00:08:50,140 --> 00:08:55,900 So ggplot, the world_map data frame. 139 00:08:55,900 --> 00:08:59,840 The aesthetic, x is the longitude, y is the latitude. 140 00:08:59,840 --> 00:09:02,410 We need to group countries together, 141 00:09:02,410 --> 00:09:06,690 so it doesn't all crisscross over the map. 142 00:09:06,690 --> 00:09:10,410 We're going to use geom_polygon again. 143 00:09:10,410 --> 00:09:12,250 This time though, let's actually fill them 144 00:09:12,250 --> 00:09:14,410 with a color that's proportional to the total number 145 00:09:14,410 --> 00:09:15,950 of students. 146 00:09:15,950 --> 00:09:17,960 We'll still outline them in black, though. 147 00:09:21,110 --> 00:09:23,340 And we'll use the Mercator projection. 148 00:09:29,430 --> 00:09:31,190 Much better. 149 00:09:31,190 --> 00:09:32,890 So Russia is missing for similar reasons, 150 00:09:32,890 --> 00:09:35,270 but we won't deal with that now because it's a little bit 151 00:09:35,270 --> 00:09:35,910 annoying. 152 00:09:35,910 --> 00:09:38,960 But you get the idea. 153 00:09:38,960 --> 00:09:40,860 This is pretty interesting actually. 154 00:09:40,860 --> 00:09:47,650 So we can see that Canada, and China, and India supply 155 00:09:47,650 --> 00:09:52,470 a large number of international students to MIT. 156 00:09:52,470 --> 00:09:54,830 But it is a little bit confusing doing it 157 00:09:54,830 --> 00:09:57,900 on a per country basis, because Europe, presumably, 158 00:09:57,900 --> 00:10:00,590 has quite a few students at MIT. 159 00:10:00,590 --> 00:10:03,150 But because Europe is made up of many small countries, 160 00:10:03,150 --> 00:10:05,410 it doesn't look very impressive. 161 00:10:05,410 --> 00:10:06,940 Maybe if all the European countries 162 00:10:06,940 --> 00:10:08,430 were grouped together, it would look 163 00:10:08,430 --> 00:10:10,530 about the same color as Canada. 164 00:10:10,530 --> 00:10:13,740 But it's hard to tell. 165 00:10:13,740 --> 00:10:15,800 There are also other projections we can look at. 166 00:10:15,800 --> 00:10:18,040 So this is a Mercator projection. 167 00:10:18,040 --> 00:10:20,920 What I want to show you is an orthographic projection that 168 00:10:20,920 --> 00:10:24,110 allows you to sort of view the map in 3D, like a globe. 169 00:10:24,110 --> 00:10:27,150 So let's try that out. 170 00:10:27,150 --> 00:10:33,500 ggplot, world_map, aesthetics are the same. 171 00:10:39,710 --> 00:10:42,220 Actually, let me do this the right way. 172 00:10:42,220 --> 00:10:44,050 I'll just press up. 173 00:10:44,050 --> 00:10:45,080 OK. 174 00:10:45,080 --> 00:10:49,400 Let's change it to orthographic projection. 175 00:10:49,400 --> 00:10:53,010 And I want to find, now, an orientation. 176 00:10:53,010 --> 00:10:55,850 And this is almost like thinking about where in the world 177 00:10:55,850 --> 00:10:57,880 you want to focus on. 178 00:10:57,880 --> 00:11:01,310 So this is a latitude and longitude, 20 degrees 179 00:11:01,310 --> 00:11:03,070 and 30 degrees. 180 00:11:03,070 --> 00:11:07,260 If we run this, we should get a map 181 00:11:07,260 --> 00:11:11,010 centered above North Africa. 182 00:11:11,010 --> 00:11:12,730 That's quite a nice visualization 183 00:11:12,730 --> 00:11:16,030 because if you want to look just at Africa and Europe, 184 00:11:16,030 --> 00:11:17,690 this is the way to go. 185 00:11:17,690 --> 00:11:19,580 We can still see China, and Canada, and South 186 00:11:19,580 --> 00:11:22,710 America in there, as well. 187 00:11:22,710 --> 00:11:24,840 Let's do something a little bit more personal. 188 00:11:24,840 --> 00:11:27,240 I want to change the coordinates, now, 189 00:11:27,240 --> 00:11:34,850 to -37 and 175. 190 00:11:34,850 --> 00:11:38,840 Now it's centered on my hometown of Auckland, New Zealand.