1 00:00:04,710 --> 00:00:08,720 In this video, we'll see how to color our points by region 2 00:00:08,720 --> 00:00:12,830 and how to add a linear regression line to our plot. 3 00:00:12,830 --> 00:00:15,360 Here we have our plot from the last video, 4 00:00:15,360 --> 00:00:18,300 where we've colored the points dark red. 5 00:00:18,300 --> 00:00:22,020 Now, let's color the points by region instead. 6 00:00:22,020 --> 00:00:25,910 This time, we want to add a color option to our aesthetic, 7 00:00:25,910 --> 00:00:30,360 since we're assigning a variable in our data set to the colors. 8 00:00:30,360 --> 00:00:34,540 To do this, we can type ggplot, and then first give 9 00:00:34,540 --> 00:00:36,750 the name of our data, like before, 10 00:00:36,750 --> 00:00:43,870 WHO, and then in our aesthetic, we again specify that x = GNI 11 00:00:43,870 --> 00:00:47,720 and y = FertilityRate. 12 00:00:47,720 --> 00:00:52,620 But then we want to add the option color = Region, 13 00:00:52,620 --> 00:00:56,130 which will color the points by the Region variable. 14 00:00:56,130 --> 00:01:00,840 And then we just want to add the geom_point function and hit 15 00:01:00,840 --> 00:01:02,760 Enter. 16 00:01:02,760 --> 00:01:06,630 Now, in our plot, we should see that each point is colored 17 00:01:06,630 --> 00:01:10,320 corresponding to the region that country belongs in. 18 00:01:10,320 --> 00:01:13,580 So the countries in Africa are colored red, 19 00:01:13,580 --> 00:01:17,070 the countries in the Americas are colored gold, 20 00:01:17,070 --> 00:01:19,350 the countries in the Eastern Mediterranean 21 00:01:19,350 --> 00:01:22,320 are colored green, etc. 22 00:01:22,320 --> 00:01:23,960 This really helps us see something 23 00:01:23,960 --> 00:01:25,940 that we didn't see before. 24 00:01:25,940 --> 00:01:28,210 The points from the different regions 25 00:01:28,210 --> 00:01:32,789 are really located in different areas on the plot. 26 00:01:32,789 --> 00:01:34,970 Let's now instead color the points 27 00:01:34,970 --> 00:01:37,970 according to the country's life expectancy. 28 00:01:37,970 --> 00:01:40,509 To do this, we just have to hit the up arrow 29 00:01:40,509 --> 00:01:43,520 to get back to our ggplot line, and then 30 00:01:43,520 --> 00:01:47,410 delete Region and type LifeExpectancy. 31 00:01:51,080 --> 00:01:54,250 Now, we should see that each point is colored according 32 00:01:54,250 --> 00:01:57,630 to the life expectancy in that country. 33 00:01:57,630 --> 00:01:59,570 Notice that before, we were coloring 34 00:01:59,570 --> 00:02:01,930 by a factor variable, Region. 35 00:02:01,930 --> 00:02:04,800 So we had exactly seven different colors 36 00:02:04,800 --> 00:02:08,070 corresponding to the seven different regions. 37 00:02:08,070 --> 00:02:10,930 Here, we're coloring by LifeExpectancy instead, 38 00:02:10,930 --> 00:02:14,720 which is a numerical variable, so we get a gradient of colors, 39 00:02:14,720 --> 00:02:16,090 like this. 40 00:02:16,090 --> 00:02:20,020 Lighter blue corresponds to a higher life expectancy, 41 00:02:20,020 --> 00:02:25,350 and darker blue corresponds to a lower life expectancy. 42 00:02:25,350 --> 00:02:27,740 Let's take a look at a different plot now. 43 00:02:27,740 --> 00:02:30,870 Suppose we were interested in seeing whether the fertility 44 00:02:30,870 --> 00:02:34,130 rate of a country was a good predictor of the percentage 45 00:02:34,130 --> 00:02:36,850 of the population under 15. 46 00:02:36,850 --> 00:02:39,450 Intuitively, we would expect these variables 47 00:02:39,450 --> 00:02:41,260 to be highly correlated. 48 00:02:41,260 --> 00:02:43,980 But before trying any statistical models, 49 00:02:43,980 --> 00:02:46,800 let's explore our data with a plot. 50 00:02:46,800 --> 00:02:51,640 So now, let's use the ggplot function on the WHO data again, 51 00:02:51,640 --> 00:02:53,730 but we're going to specify in our aesthetic 52 00:02:53,730 --> 00:02:57,750 that the x variable should be FertilityRate, 53 00:02:57,750 --> 00:03:02,390 and the y variable should be the variable, Under15. 54 00:03:02,390 --> 00:03:04,720 Again, we want to add geom_point, 55 00:03:04,720 --> 00:03:05,890 since we want a scatterplot. 56 00:03:09,250 --> 00:03:10,620 This is really interesting. 57 00:03:10,620 --> 00:03:13,710 It looks like the variables are certainly correlated, 58 00:03:13,710 --> 00:03:17,340 but as the fertility rate increases, the variable, 59 00:03:17,340 --> 00:03:21,770 Under15 starts increasing less. 60 00:03:21,770 --> 00:03:24,720 So this doesn't really look like a linear relationship. 61 00:03:24,720 --> 00:03:28,390 But we suspect that a log transformation of FertilityRate 62 00:03:28,390 --> 00:03:29,740 will be better. 63 00:03:29,740 --> 00:03:31,390 Let's give it a shot. 64 00:03:31,390 --> 00:03:33,840 So go ahead and scroll up in your R console 65 00:03:33,840 --> 00:03:38,260 to the previous line, and instead of x = FertilityRate, 66 00:03:38,260 --> 00:03:39,680 we want x = log(FertilityRate). 67 00:03:42,990 --> 00:03:45,180 And hit Enter. 68 00:03:45,180 --> 00:03:48,160 Now this looks like a linear relationship. 69 00:03:48,160 --> 00:03:51,000 Let's try building in a simple linear regression model 70 00:03:51,000 --> 00:03:54,320 to predict the percentage of the population under 15, 71 00:03:54,320 --> 00:03:56,960 using the log of the fertility rate. 72 00:03:56,960 --> 00:04:01,700 So let's call our model, model, and use the lm function 73 00:04:01,700 --> 00:04:06,400 to predict Under15 using as an independent variable 74 00:04:06,400 --> 00:04:07,190 log(FertilityRate). 75 00:04:09,810 --> 00:04:13,870 And our data set will be WHO. 76 00:04:13,870 --> 00:04:15,520 Let's look at the summary of our model. 77 00:04:18,190 --> 00:04:20,190 It looks like the log of FertilityRate 78 00:04:20,190 --> 00:04:22,890 is indeed a great predictor of Under15. 79 00:04:22,890 --> 00:04:25,230 The variable is highly significant, 80 00:04:25,230 --> 00:04:29,320 and our R-squared is 0.9391. 81 00:04:29,320 --> 00:04:31,370 Visualization was a great way for us 82 00:04:31,370 --> 00:04:34,830 to realize that the log transformation would be better. 83 00:04:34,830 --> 00:04:37,600 If we instead had just used the FertilityRate, 84 00:04:37,600 --> 00:04:40,650 the R-squared would have been 0.87. 85 00:04:40,650 --> 00:04:44,610 That's a pretty significant decrease in R-squared. 86 00:04:44,610 --> 00:04:47,800 So now, let's add this regression line to our plot. 87 00:04:47,800 --> 00:04:49,730 This is pretty easy in ggplot. 88 00:04:49,730 --> 00:04:52,420 We just have to add another layer. 89 00:04:52,420 --> 00:04:55,030 So use the up arrow in your R console to get back 90 00:04:55,030 --> 00:05:06,640 to the plotting line, and then add stat_smooth(method = "lm"), 91 00:05:06,640 --> 00:05:08,620 and hit Enter. 92 00:05:08,620 --> 00:05:12,120 Now, you should see a blue line going through the data. 93 00:05:12,120 --> 00:05:14,320 This is our regression line. 94 00:05:14,320 --> 00:05:18,200 By default, ggplot will draw a 95% confidence 95 00:05:18,200 --> 00:05:20,760 interval shaded around the line. 96 00:05:20,760 --> 00:05:23,600 We can change this by specifying options 97 00:05:23,600 --> 00:05:25,730 within the statistics layer. 98 00:05:25,730 --> 00:05:28,920 So go ahead and scroll up in the R console, 99 00:05:28,920 --> 00:05:36,330 and after method = "lm", type level = 0.99, and hit Enter. 100 00:05:36,330 --> 00:05:40,060 This will give a 99% confidence interval. 101 00:05:40,060 --> 00:05:44,430 We could instead take away the confidence interval altogether 102 00:05:44,430 --> 00:05:52,670 by deleting level = 0.99 and typing se = FALSE. 103 00:05:52,670 --> 00:05:56,270 Now, we just have the regression line in blue. 104 00:05:56,270 --> 00:05:59,920 We could also change the color of the regression line 105 00:05:59,920 --> 00:06:07,130 by typing as an option, color = "orange". 106 00:06:07,130 --> 00:06:11,250 Now, we have an orange linear regression line. 107 00:06:11,250 --> 00:06:13,980 As we've seen in this lecture, scatterplots 108 00:06:13,980 --> 00:06:16,340 are great for exploring data. 109 00:06:16,340 --> 00:06:18,300 However, there are many other ways 110 00:06:18,300 --> 00:06:22,650 to represent data visually, such as box plots, line charts, 111 00:06:22,650 --> 00:06:26,520 histograms, heat maps, and geographic maps. 112 00:06:26,520 --> 00:06:28,920 In some cases, it may be better to choose 113 00:06:28,920 --> 00:06:31,980 one of these other ways of visualizing your data. 114 00:06:31,980 --> 00:06:34,550 Luckily, ggplot makes it easy to go 115 00:06:34,550 --> 00:06:37,900 from one type of visualization to another, simply 116 00:06:37,900 --> 00:06:40,860 by adding the appropriate layer to the plot. 117 00:06:40,860 --> 00:06:43,440 We'll learn more about other types of visualizations 118 00:06:43,440 --> 00:06:47,050 and how to create them in the next lecture. 119 00:06:47,050 --> 00:06:50,420 So what is the edge of visualizations? 120 00:06:50,420 --> 00:06:52,720 The WHO data that we used here is 121 00:06:52,720 --> 00:06:56,600 used by citizens, policymakers, and organizations 122 00:06:56,600 --> 00:06:58,300 around the world. 123 00:06:58,300 --> 00:07:01,600 Visualizing the data facilitates the understanding 124 00:07:01,600 --> 00:07:04,730 of global health trends at a glance. 125 00:07:04,730 --> 00:07:08,820 By using ggplot in R, we're able to visualize data 126 00:07:08,820 --> 00:07:14,110 for exploration, modeling, and sharing analytics results.