1 00:00:04,500 --> 00:00:09,950 In this video, we'll create a basic scatterplot using ggplot. 2 00:00:09,950 --> 00:00:12,570 Let's start by reading in our data. 3 00:00:12,570 --> 00:00:14,500 We'll be using the same data set we 4 00:00:14,500 --> 00:00:18,370 used during week one, WHO.csv. 5 00:00:18,370 --> 00:00:22,850 So let's call it WHO and use the read.csv function 6 00:00:22,850 --> 00:00:26,550 to read in the data file WHO.csv. 7 00:00:26,550 --> 00:00:28,670 Make sure you're in the directory containing 8 00:00:28,670 --> 00:00:30,910 this file first. 9 00:00:30,910 --> 00:00:33,610 Now, let's take a look at the structure of the data 10 00:00:33,610 --> 00:00:37,040 using the str function. 11 00:00:37,040 --> 00:00:42,110 We can see that we have 194 observations, or countries, 12 00:00:42,110 --> 00:00:47,090 and 13 different variables-- the name of the country, the region 13 00:00:47,090 --> 00:00:50,750 the country's in, the population in thousands, 14 00:00:50,750 --> 00:00:55,500 the percentage of the population under 15 or over 60, 15 00:00:55,500 --> 00:01:00,190 the fertility rate or average number of children per woman, 16 00:01:00,190 --> 00:01:04,950 the life expectancy in years, the child mortality rate, 17 00:01:04,950 --> 00:01:08,840 which is the number of children who die by age five per 1,000 18 00:01:08,840 --> 00:01:15,789 births, the number of cellular subscribers per 100 population, 19 00:01:15,789 --> 00:01:19,960 the literacy rate among adults older than 15, 20 00:01:19,960 --> 00:01:23,300 the gross national income per capita, 21 00:01:23,300 --> 00:01:27,080 the percentage of male children enrolled in primary school, 22 00:01:27,080 --> 00:01:29,470 and the percentage of female children enrolled 23 00:01:29,470 --> 00:01:31,620 in primary school. 
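As a rough sketch of the two steps just described (reading the file and inspecting its structure), the snippet below builds a tiny stand-in CSV so it can run without WHO.csv. The file name WHO_example.csv, the Country and Region column names, and every value in it are assumptions made up for illustration; only GNI and FertilityRate are names used later in the video. With the real file in your working directory, the read step would simply be read.csv("WHO.csv").

```r
# Hypothetical stand-in for WHO.csv: a few made-up rows, NOT real WHO data.
csv_path <- file.path(tempdir(), "WHO_example.csv")
writeLines(c(
  "Country,Region,GNI,FertilityRate",
  "A,Europe,30000,1.5",
  "B,Africa,1500,5.2",
  "C,Americas,9000,2.4"
), csv_path)

# Read the CSV into a data frame (the video does this with "WHO.csv")
WHO <- read.csv(csv_path)

# Show the number of observations, the variables, and their types
str(WHO)
```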
24 00:01:31,620 --> 00:01:34,940 In week one, the very first plot we made in R 25 00:01:34,940 --> 00:01:37,750 was a scatterplot of fertility rate 26 00:01:37,750 --> 00:01:40,320 versus gross national income. 27 00:01:40,320 --> 00:01:43,920 Let's make this plot again, just like we did in week one. 28 00:01:43,920 --> 00:01:48,170 So we'll use the plot function and give as the first variable 29 00:01:48,170 --> 00:01:52,680 WHO$GNI, and then give as the second variable, 30 00:01:52,680 --> 00:01:53,430 WHO$FertilityRate. 31 00:01:58,750 --> 00:02:02,140 This plot shows us that a higher fertility rate 32 00:02:02,140 --> 00:02:05,520 is correlated with a lower income. 33 00:02:05,520 --> 00:02:08,380 Now, let's redo this scatterplot, 34 00:02:08,380 --> 00:02:11,110 but this time using ggplot. 35 00:02:11,110 --> 00:02:14,270 We'll see how ggplot can be used to make more visually 36 00:02:14,270 --> 00:02:17,770 appealing and complex scatterplots. 37 00:02:17,770 --> 00:02:23,050 First, we need to install and load the ggplot2 package. 38 00:02:23,050 --> 00:02:24,950 So first type install.packages("ggplot2"). 39 00:02:32,570 --> 00:02:34,850 When the CRAN mirror window pops up, 40 00:02:34,850 --> 00:02:36,840 make sure to pick a location near you. 41 00:02:40,500 --> 00:02:43,070 Then, as soon as the package is done installing 42 00:02:43,070 --> 00:02:45,260 and you're back at the blinking cursor, 43 00:02:45,260 --> 00:02:47,170 load the package with the library function. 44 00:02:51,110 --> 00:02:53,800 Now, remember we need at least three things 45 00:02:53,800 --> 00:02:58,680 to create a plot using ggplot-- data, an aesthetic mapping 46 00:02:58,680 --> 00:03:02,020 of variables in the data frame to visual output, 47 00:03:02,020 --> 00:03:04,510 and a geometric object. 48 00:03:04,510 --> 00:03:07,140 So first, let's create the ggplot 49 00:03:07,140 --> 00:03:10,640 object with the data and the aesthetic mapping. 
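The base R plot and the package setup described above might look like the following sketch. The small WHO data frame here is a made-up stand-in (an assumption, not real data) so the code runs on its own, and the install line is commented out because it only needs to run once per machine.

```r
# Hypothetical stand-in for the WHO data frame; values are illustrative only.
WHO <- data.frame(
  GNI = c(1500, 4000, 9000, 20000, 30000),
  FertilityRate = c(5.2, 3.8, 2.4, 1.9, 1.5)
)

# Base R scatterplot: first argument goes on the x-axis, second on the y-axis
plot(WHO$GNI, WHO$FertilityRate)

# One-time installation (uncomment if ggplot2 is not installed yet)
# install.packages("ggplot2")

# Load the package for the current session
library(ggplot2)
```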
50 00:03:10,640 --> 00:03:14,360 We'll save it to the variable scatterplot, 51 00:03:14,360 --> 00:03:17,329 and then use the ggplot function, where 52 00:03:17,329 --> 00:03:21,470 the first argument is the name of our data set, WHO, 53 00:03:21,470 --> 00:03:25,590 which specifies the data to use, and the second argument 54 00:03:25,590 --> 00:03:28,750 is the aesthetic mapping, aes. 55 00:03:28,750 --> 00:03:31,070 In parentheses, we have to decide 56 00:03:31,070 --> 00:03:34,960 what we want on the x-axis and what we want on the y-axis. 57 00:03:34,960 --> 00:03:38,380 We want the x-axis to be GNI, and we 58 00:03:38,380 --> 00:03:42,810 want the y-axis to be FertilityRate. 59 00:03:42,810 --> 00:03:47,400 Go ahead and close both sets of parentheses, and hit Enter. 60 00:03:47,400 --> 00:03:50,440 Now, we need to tell ggplot what geometric 61 00:03:50,440 --> 00:03:52,480 objects to put in the plot. 62 00:03:52,480 --> 00:03:57,060 We could use bars, lines, points, or something else. 63 00:03:57,060 --> 00:04:00,560 This is a big difference between ggplot and regular plotting 64 00:04:00,560 --> 00:04:03,690 in R. You can build different types of graphs 65 00:04:03,690 --> 00:04:06,670 by using the same ggplot object. 66 00:04:06,670 --> 00:04:08,820 There's no need to learn one function for bar 67 00:04:08,820 --> 00:04:14,290 graphs, a completely different function for line graphs, etc. 68 00:04:14,290 --> 00:04:18,839 So first, let's just create a straightforward scatterplot. 69 00:04:18,839 --> 00:04:22,450 So the geometry we want to add is points. 70 00:04:22,450 --> 00:04:26,430 We can do this by typing the name of our ggplot object, 71 00:04:26,430 --> 00:04:30,690 scatterplot, and then adding the function, geom_point(). 
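Put together, the two commands just dictated could be written as in this sketch. The WHO data frame is again a made-up stand-in (an assumption so the example is self-contained); the calls to ggplot, aes, and geom_point follow what the video describes.

```r
library(ggplot2)

# Hypothetical stand-in for the WHO data frame; values are illustrative only.
WHO <- data.frame(
  GNI = c(1500, 4000, 9000, 20000, 30000),
  FertilityRate = c(5.2, 3.8, 2.4, 1.9, 1.5)
)

# Data plus aesthetic mapping: GNI on the x-axis, FertilityRate on the y-axis
scatterplot <- ggplot(WHO, aes(x = GNI, y = FertilityRate))

# Add the geometric object (points) to draw the scatterplot
scatterplot + geom_point()
```

At the console the last line draws the plot directly. Inside a function or a sourced script you would likely need to wrap it in print().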
72 00:04:34,750 --> 00:04:38,120 If you hit Enter, you should see a new plot in the Graphics 73 00:04:38,120 --> 00:04:41,080 window that looks similar to our original plot, 74 00:04:41,080 --> 00:04:43,980 but there are already a few nice improvements. 75 00:04:43,980 --> 00:04:47,270 One is that we don't have the data set name with a dollar 76 00:04:47,270 --> 00:04:51,140 sign in front of the label on each axis, just 77 00:04:51,140 --> 00:04:53,030 the variable name. 78 00:04:53,030 --> 00:04:54,970 Another is that we have these nice grid 79 00:04:54,970 --> 00:04:57,640 lines in the background and solid points 80 00:04:57,640 --> 00:05:00,880 that pop out from the background. 81 00:05:00,880 --> 00:05:03,690 We could have made a line graph just as easily 82 00:05:03,690 --> 00:05:05,780 by changing point to line. 83 00:05:05,780 --> 00:05:09,750 So in your R console, hit the up arrow, and then just 84 00:05:09,750 --> 00:05:13,410 delete "point" and type "line" and hit Enter. 85 00:05:13,410 --> 00:05:17,020 Now, you can see a line graph in the Graphics window. 86 00:05:17,020 --> 00:05:19,290 However, a line doesn't really make sense 87 00:05:19,290 --> 00:05:21,880 for this particular plot, so let's switch back 88 00:05:21,880 --> 00:05:25,200 to our points, just by hitting the up arrow twice and hitting 89 00:05:25,200 --> 00:05:27,890 Enter. 90 00:05:27,890 --> 00:05:31,630 In addition to specifying that the geometry we want is points, 91 00:05:31,630 --> 00:05:35,010 we can add other options, like the color, shape, 92 00:05:35,010 --> 00:05:37,080 and size of the points. 93 00:05:37,080 --> 00:05:41,460 Let's redo our plot with blue triangles instead of circles. 94 00:05:41,460 --> 00:05:45,240 To do that, go ahead and hit the up arrow in your R console, 95 00:05:45,240 --> 00:05:48,640 and then in the empty parentheses for geom_point, 96 00:05:48,640 --> 00:05:51,850 we're going to specify some properties of the points. 
97 00:05:51,850 --> 00:05:57,920 We want the color to be equal to "blue", the size to equal 3-- 98 00:05:57,920 --> 00:06:00,110 we'll make the points a little bigger -- 99 00:06:00,110 --> 00:06:03,190 and the shape equals 17. 100 00:06:03,190 --> 00:06:06,760 This is the shape number corresponding to triangles. 101 00:06:06,760 --> 00:06:09,760 If you hit Enter, you should now see in your plot 102 00:06:09,760 --> 00:06:13,320 blue triangles instead of black dots. 103 00:06:13,320 --> 00:06:15,120 Let's try another option. 104 00:06:15,120 --> 00:06:21,310 Hit the up arrow again, and change "blue" to "darkred", 105 00:06:21,310 --> 00:06:24,460 and change shape to 8. 106 00:06:24,460 --> 00:06:27,720 Now, you should see dark red stars. 107 00:06:27,720 --> 00:06:29,840 There are many different colors and shapes 108 00:06:29,840 --> 00:06:31,480 that you can specify. 109 00:06:31,480 --> 00:06:36,320 We've provided some information in the text below this video. 110 00:06:36,320 --> 00:06:38,430 Now, let's add a title to the plot. 111 00:06:38,430 --> 00:06:41,010 You can do that by hitting the up arrow, 112 00:06:41,010 --> 00:06:45,740 and then at the very end of everything, add ggtitle, 113 00:06:45,740 --> 00:06:48,210 and then in parentheses in quotes, the title 114 00:06:48,210 --> 00:06:49,750 you want to give your plot. 115 00:06:49,750 --> 00:06:53,200 In our case, we'll call it "Fertility Rate 116 00:06:53,200 --> 00:06:56,240 vs. Gross National Income". 117 00:06:59,610 --> 00:07:01,070 If you look at your plot again, you 118 00:07:01,070 --> 00:07:05,610 should now see that it has a nice title at the top. 119 00:07:05,610 --> 00:07:08,160 Now, let's save our plot to a file. 120 00:07:08,160 --> 00:07:12,450 We can do this by first saving our plot to a variable. 121 00:07:12,450 --> 00:07:15,190 So in your R console, hit the up arrow, 122 00:07:15,190 --> 00:07:18,430 and scroll to the beginning of the line. 
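The variations walked through so far (swapping points for a line, styling the points, and adding a title) can be sketched as below. The WHO data frame is a made-up stand-in, and the variable names line_version, blue_triangles, red_stars, and titled are hypothetical names introduced here only to keep each variant separate; the video retypes the command at the console instead.

```r
library(ggplot2)

# Hypothetical stand-in for the WHO data frame; values are illustrative only.
WHO <- data.frame(
  GNI = c(1500, 4000, 9000, 20000, 30000),
  FertilityRate = c(5.2, 3.8, 2.4, 1.9, 1.5)
)
scatterplot <- ggplot(WHO, aes(x = GNI, y = FertilityRate))

# Same ggplot object, different geometry: a line graph instead of points
line_version <- scatterplot + geom_line()

# Blue triangles: shape 17 is a filled triangle, size 3 makes points larger
blue_triangles <- scatterplot + geom_point(color = "blue", size = 3, shape = 17)

# Dark red stars: shape 8 is a star (asterisk)
red_stars <- scatterplot + geom_point(color = "darkred", size = 3, shape = 8)

# Append a title at the very end of the command
titled <- red_stars + ggtitle("Fertility Rate vs. Gross National Income")
print(titled)
```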
123 00:07:18,430 --> 00:07:21,260 Before scatterplot, type fertilityGNIplot 124 00:07:21,260 --> 00:07:29,080 = and then everything else. 125 00:07:29,080 --> 00:07:31,420 This will save our scatterplot to the variable, 126 00:07:31,420 --> 00:07:32,130 fertilityGNIplot. 127 00:07:35,190 --> 00:07:38,830 Now, let's create a file we want to save our plot to. 128 00:07:38,830 --> 00:07:41,120 We can do that with the pdf function. 129 00:07:41,120 --> 00:07:43,700 And then in parentheses and quotes, type the name 130 00:07:43,700 --> 00:07:45,080 you want your file to have. 131 00:07:45,080 --> 00:07:46,180 We'll call it MyPlot.pdf. 132 00:07:50,159 --> 00:07:53,100 Now, let's just print our plot to that file with the print 133 00:07:53,100 --> 00:07:54,730 function -- so print(fertilityGNIplot). 134 00:07:59,930 --> 00:08:07,890 And lastly, we just have to type dev.off() to close the file. 135 00:08:07,890 --> 00:08:11,670 Now, if you look at the folder where WHO.csv is, 136 00:08:11,670 --> 00:08:15,330 you should see another file called MyPlot.pdf, 137 00:08:15,330 --> 00:08:17,850 containing the plot we made. 138 00:08:17,850 --> 00:08:20,350 In the next video, we'll see how to create 139 00:08:20,350 --> 00:08:23,990 more advanced scatterplots using ggplot.
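The saving steps above can be sketched as follows. This is a minimal version under two assumptions: the WHO data frame is a made-up stand-in, and the file is written to a temporary directory through the hypothetical variable out_file so the example leaves nothing behind. The video instead calls pdf("MyPlot.pdf"), which writes to the current working directory.

```r
library(ggplot2)

# Hypothetical stand-in for the WHO data frame; values are illustrative only.
WHO <- data.frame(
  GNI = c(1500, 4000, 9000, 20000, 30000),
  FertilityRate = c(5.2, 3.8, 2.4, 1.9, 1.5)
)

# Save the finished plot to a variable (the video uses = rather than <-)
fertilityGNIplot <- ggplot(WHO, aes(x = GNI, y = FertilityRate)) +
  geom_point(color = "darkred", size = 3, shape = 8) +
  ggtitle("Fertility Rate vs. Gross National Income")

# Assumed output location; the video uses pdf("MyPlot.pdf") instead
out_file <- file.path(tempdir(), "MyPlot.pdf")

pdf(out_file)            # open a PDF graphics device
print(fertilityGNIplot)  # draw the plot into that device
dev.off()                # close the device so the file is written

file.exists(out_file)
```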