1
00:00:04,490 --> 00:00:06,910
In this video, we'll create a heat map

2
00:00:06,910 --> 00:00:09,690
on a map of the United States.

3
00:00:09,690 --> 00:00:13,760
We'll be using the data set murders.csv, which

4
00:00:13,760 --> 00:00:16,460
is data provided by the FBI giving

5
00:00:16,460 --> 00:00:21,080
the total number of murders in the United States by state.

6
00:00:21,080 --> 00:00:23,720
Let's start by reading in our data set.

7
00:00:23,720 --> 00:00:31,020
We'll call it murders, and we'll use the read.csv function

8
00:00:31,020 --> 00:00:33,420
to read in the data file murders.csv.

9
00:00:36,610 --> 00:00:39,470
Let's take a look at the structure of this data

10
00:00:39,470 --> 00:00:42,600
using the str function.

11
00:00:42,600 --> 00:00:46,110
We have 51 observations for the 50 states

12
00:00:46,110 --> 00:00:50,460
plus Washington, DC, and six different variables:

13
00:00:50,460 --> 00:00:54,490
the name of the state, the population, the population

14
00:00:54,490 --> 00:00:58,040
density, the number of murders, the number

15
00:00:58,040 --> 00:01:03,230
of murders that used guns, and the rate of gun ownership.

16
00:01:03,230 --> 00:01:06,060
A map of the United States is included

17
00:01:06,060 --> 00:01:12,720
in R. Let's load the map and call it statesMap.

18
00:01:12,720 --> 00:01:18,360
We can do so using the map_data function,

19
00:01:18,360 --> 00:01:24,390
where the only argument is "state" in quotes.

20
00:01:24,390 --> 00:01:27,020
Let's see what this looks like by typing in str(statesMap).

21
00:01:32,930 --> 00:01:35,250
This is just a data frame summarizing

22
00:01:35,250 --> 00:01:37,830
how to draw the United States.

23
00:01:37,830 --> 00:01:43,470
To plot the map, we'll use the polygons geometry of ggplot.

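The loading steps described above might look like the sketch below. Because murders.csv is not available here, the sketch writes a tiny stand-in CSV to a temporary file. Its numbers are placeholders, not the FBI figures, and it carries only the three columns the video later refers to by name (State, Population, Murders), not all six variables.

```r
# Stand-in for murders.csv: placeholder values, NOT the FBI figures.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "State,Population,Murders",
  "Alabama,1000000,50",
  "Alaska,500000,10"
), csv_path)

# In the video this step is: murders = read.csv("murders.csv")
murders <- read.csv(csv_path)

# Show the number of observations and the type of each variable
str(murders)

# The map is loaded with ggplot2's map_data, which relies on the maps package:
# library(ggplot2)
# statesMap <- map_data("state")
# str(statesMap)
```

The map_data call is left commented out so the sketch runs without the maps package installed.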
24
00:01:43,470 --> 00:01:49,039
So type ggplot, and then in parentheses, our data frame

25
00:01:49,039 --> 00:01:55,000
is statesMap, and then our aesthetic is x = long,

26
00:01:55,000 --> 00:01:59,750
the longitude variable in statesMap, y = lat,

27
00:01:59,750 --> 00:02:04,660
the latitude variable, and then group = group.

28
00:02:04,660 --> 00:02:06,590
This is the variable defining how

29
00:02:06,590 --> 00:02:10,780
to draw the United States into groups by state.

30
00:02:10,780 --> 00:02:13,470
Then close both parentheses here,

31
00:02:13,470 --> 00:02:19,780
and we'll add geom_polygon where our arguments here will be

32
00:02:19,780 --> 00:02:25,410
fill="white"-- we'll just fill all states in white--

33
00:02:25,410 --> 00:02:31,950
and color="black" to outline the states in black.

34
00:02:31,950 --> 00:02:34,110
Now in your R graphics window, you

35
00:02:34,110 --> 00:02:37,790
should see a map of the United States.

36
00:02:37,790 --> 00:02:40,500
Before we can plot our data on this map,

37
00:02:40,500 --> 00:02:42,630
we need to make sure that the state names are

38
00:02:42,630 --> 00:02:45,630
the same in the murders data frame

39
00:02:45,630 --> 00:02:48,820
and in the statesMap data frame.

40
00:02:48,820 --> 00:02:51,470
In the murders data frame, our state names

41
00:02:51,470 --> 00:02:53,490
are in the State variable, and they

42
00:02:53,490 --> 00:02:55,510
start with a capital letter.

43
00:02:55,510 --> 00:02:58,540
But in the statesMap data frame, our state names

44
00:02:58,540 --> 00:03:02,550
are in the region variable, and they're all lowercase.

45
00:03:02,550 --> 00:03:07,020
So let's create a new variable called region in our murders

46
00:03:07,020 --> 00:03:11,050
data frame to match the state name variable in the statesMap

47
00:03:11,050 --> 00:03:12,580
data frame.

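The blank-map command dictated above can be written out roughly as follows. This is a sketch, not the video's exact session: to stay runnable without the maps package, statesMap here is a hypothetical stand-in made of two unit squares that reuses the column names map_data returns (long, lat, group, order, region).

```r
library(ggplot2)

# Hypothetical stand-in for: statesMap <- map_data("state")
# Two unit squares with the same column names as the real map data.
statesMap <- data.frame(
  long   = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat    = c(0, 0, 1, 1,  0, 0, 1, 1),
  group  = rep(c(1, 2), each = 4),
  order  = 1:8,
  region = rep(c("alabama", "alaska"), each = 4)
)

# One polygon per group, filled white and outlined in black
p <- ggplot(statesMap, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black")
print(p)
```

With the real statesMap in place of the stand-in, the same two lines of plotting code draw the state outlines.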
48
00:03:12,580 --> 00:03:18,380
So we'll add to our murders data frame the variable region,

49
00:03:18,380 --> 00:03:22,000
which will be equal to the lowercase version--

50
00:03:22,000 --> 00:03:25,400
using the tolower function that we used in the text analytics

51
00:03:25,400 --> 00:03:29,320
lectures-- and the argument will be murders$State.

52
00:03:34,820 --> 00:03:37,140
This will just convert the State variable

53
00:03:37,140 --> 00:03:39,940
to all lowercase letters and store it

54
00:03:39,940 --> 00:03:43,220
as a new variable called region.

55
00:03:43,220 --> 00:03:47,070
Now we can join the statesMap data frame with the murders

56
00:03:47,070 --> 00:03:50,460
data frame by using the merge function, which

57
00:03:50,460 --> 00:03:54,970
matches rows of a data frame based on a shared identifier.

58
00:03:54,970 --> 00:03:57,320
We just defined the variable region,

59
00:03:57,320 --> 00:04:00,140
which exists in both data frames.

60
00:04:00,140 --> 00:04:06,580
So we'll call our new data frame murderMap,

61
00:04:06,580 --> 00:04:08,790
and we'll use the merge function,

62
00:04:08,790 --> 00:04:14,430
where the first argument is our first data frame, statesMap,

63
00:04:14,430 --> 00:04:19,040
the second argument is our second data frame, murders,

64
00:04:19,040 --> 00:04:23,410
and the third argument is by="region".

65
00:04:23,410 --> 00:04:27,570
This is the identifier to use to merge the rows.

66
00:04:27,570 --> 00:04:29,160
Let's take a look at the data frame

67
00:04:29,160 --> 00:04:31,860
we just created using the str function.

68
00:04:34,580 --> 00:04:36,740
We have the same number of observations

69
00:04:36,740 --> 00:04:39,490
here that we had in the statesMap data frame,

70
00:04:39,490 --> 00:04:42,750
but now we have both the variables from the statesMap

71
00:04:42,750 --> 00:04:46,100
data frame and the variables from the murders data

72
00:04:46,100 --> 00:04:50,540
frame, which were matched up based on the region variable.

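The lowercase conversion and the merge described above can be sketched with base R alone. Both data frames below are small hypothetical stand-ins with placeholder values; only the column names State, region, Population and Murders and the calls tolower and merge come from the video.

```r
# Hypothetical stand-ins for the two data frames (placeholder values)
murders <- data.frame(
  State      = c("Alabama", "Alaska"),
  Population = c(1000000, 500000),
  Murders    = c(50, 10)
)
statesMap <- data.frame(
  long   = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat    = c(0, 0, 1, 1,  0, 0, 1, 1),
  group  = rep(c(1, 2), each = 4),
  order  = 1:8,
  region = rep(c("alabama", "alaska"), each = 4)
)

# Lowercase state names so they match statesMap$region
murders$region <- tolower(murders$State)

# Attach the per-state values to every map vertex of that state
murderMap <- merge(statesMap, murders, by = "region")
str(murderMap)
```

Each map row picks up the values of its state, so the row count matches statesMap. merge does not promise to keep rows in drawing order; if outlines ever look scrambled, sorting the result by its order column is a common remedy, though the video does not do this.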
73
00:04:50,540 --> 00:04:53,180
So now, let's plot the number of murders

74
00:04:53,180 --> 00:04:55,720
on our map of the United States.

75
00:04:55,720 --> 00:04:59,870
We'll again use the ggplot function, but this time,

76
00:04:59,870 --> 00:05:06,020
our data frame is murderMap, and in our aesthetic we want

77
00:05:06,020 --> 00:05:14,540
to again say x=long, y=lat, and group=group,

78
00:05:14,540 --> 00:05:16,750
but we'll add one more argument this time,

79
00:05:16,750 --> 00:05:22,370
which is fill=Murders so that the states will be colored

80
00:05:22,370 --> 00:05:25,030
according to the Murders variable.

81
00:05:25,030 --> 00:05:30,840
Then we need to add the polygon geometry where the only

82
00:05:30,840 --> 00:05:33,909
argument here will be color="black"

83
00:05:33,909 --> 00:05:37,880
to outline the states in black, like before.

84
00:05:37,880 --> 00:05:46,790
And lastly, we'll add scale_fill_gradient where

85
00:05:46,790 --> 00:05:56,450
the arguments here, we'll put low="black" and high="red"

86
00:05:56,450 --> 00:06:00,980
to make our color scheme range from black to red,

87
00:06:00,980 --> 00:06:08,350
and then guide="legend" to make sure we get a legend

88
00:06:08,350 --> 00:06:10,220
on our plot.

89
00:06:10,220 --> 00:06:14,400
If you hit Enter and look at your graphics window now,

90
00:06:14,400 --> 00:06:16,060
you should see that each of the states

91
00:06:16,060 --> 00:06:19,480
is colored by the number of murders in that state.

92
00:06:19,480 --> 00:06:23,010
States with a larger number of murders are more red.

93
00:06:23,010 --> 00:06:25,440
So it looks like California and Texas

94
00:06:25,440 --> 00:06:27,750
have the largest number of murders.

95
00:06:27,750 --> 00:06:31,470
But is that just because they're the most populous states?

96
00:06:31,470 --> 00:06:34,390
Let's create a map of the population of each state

97
00:06:34,390 --> 00:06:35,750
to check.

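Written out, the heat-map command described above might look like this. The murderMap below is a hypothetical two-square stand-in with placeholder values for the merged data frame; the aesthetics and the scale_fill_gradient arguments are the ones named in the video.

```r
library(ggplot2)

# Hypothetical stand-in for the merged murderMap (placeholder values)
murderMap <- data.frame(
  long       = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat        = c(0, 0, 1, 1,  0, 0, 1, 1),
  group      = rep(c(1, 2), each = 4),
  Murders    = rep(c(50, 10), each = 4),
  Population = rep(c(1000000, 500000), each = 4)
)

# Fill each polygon by Murders on a black-to-red gradient, with a legend
p <- ggplot(murderMap, aes(x = long, y = lat, group = group, fill = Murders)) +
  geom_polygon(color = "black") +
  scale_fill_gradient(low = "black", high = "red", guide = "legend")
print(p)
```

Swapping fill = Murders for fill = Population in the aes call gives the population map used as a check.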
98
00:06:35,750 --> 00:06:39,780
So back in the R Console, hit the Up arrow, and then,

99
00:06:39,780 --> 00:06:47,740
instead of fill=Murders, we want to put fill=Population to color

100
00:06:47,740 --> 00:06:51,820
each state according to the Population variable.

101
00:06:51,820 --> 00:06:53,320
If you look at the graphics window,

102
00:06:53,320 --> 00:06:55,930
we have a population map here which

103
00:06:55,930 --> 00:06:59,340
looks exactly the same as our murders map.

104
00:06:59,340 --> 00:07:01,990
So we need to plot the murder rate instead

105
00:07:01,990 --> 00:07:04,860
of the number of murders to make sure we're not just

106
00:07:04,860 --> 00:07:07,470
plotting a population map.

107
00:07:07,470 --> 00:07:09,600
So in our R Console, let's create

108
00:07:09,600 --> 00:07:12,130
a new variable for the murder rate.

109
00:07:12,130 --> 00:07:17,580
So in our murderMap data frame, we'll create the MurderRate

110
00:07:17,580 --> 00:07:23,550
variable, which is equal to murderMap$Murders--

111
00:07:23,550 --> 00:07:32,760
the number of murders-- divided by murderMap$Population times

112
00:07:32,760 --> 00:07:34,940
100,000.

113
00:07:34,940 --> 00:07:37,159
So we've created a new variable that's

114
00:07:37,159 --> 00:07:41,480
the number of murders per 100,000 population.

115
00:07:41,480 --> 00:07:45,590
Now let's redo our plot with the fill equal to MurderRate.

116
00:07:45,590 --> 00:07:49,640
So hit the Up arrow twice to get back to the plotting command,

117
00:07:49,640 --> 00:07:53,750
and instead of fill=Population, this time we'll put

118
00:07:53,750 --> 00:07:54,420
fill=MurderRate.

119
00:08:00,310 --> 00:08:02,660
If you look at your graphics window now,

120
00:08:02,660 --> 00:08:06,320
you should see that the plot is surprisingly maroon-looking.

121
00:08:06,320 --> 00:08:08,530
There aren't really any red states.

122
00:08:08,530 --> 00:08:09,770
Why?

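The rate calculation described above is a single line of base R. The sketch below applies it to a hypothetical three-row stand-in with placeholder values; the third row is given a deliberately extreme rate to show how one outlier can stretch the color scale.

```r
# Hypothetical stand-in for murderMap (placeholder values)
murderMap <- data.frame(
  region     = c("alabama", "alaska", "outlier"),
  Murders    = c(50, 10, 120),
  Population = c(1000000, 500000, 600000)
)

# Murders per 100,000 population
murderMap$MurderRate <- murderMap$Murders / murderMap$Population * 100000

print(murderMap[, c("region", "MurderRate")])

# The plot is then redrawn with fill = MurderRate in place of fill = Population.
```

Because the gradient spans the full range of MurderRate, a single very large value pushes every other region toward the dark end of the scale.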
123
00:08:09,770 --> 00:08:12,270
It turns out that Washington, DC is

124
00:08:12,270 --> 00:08:15,210
an outlier with a very high murder rate,

125
00:08:15,210 --> 00:08:17,510
but it's such a small region on the map

126
00:08:17,510 --> 00:08:19,610
that we can't even see it.

127
00:08:19,610 --> 00:08:23,360
So let's redo our plot, removing any observations

128
00:08:23,360 --> 00:08:25,620
with murder rates above 10, which

129
00:08:25,620 --> 00:08:29,160
we know will only exclude Washington, DC.

130
00:08:29,160 --> 00:08:31,720
Keep in mind that when interpreting and explaining

131
00:08:31,720 --> 00:08:34,159
the resulting plot, you should always

132
00:08:34,159 --> 00:08:36,610
note what you did to create it: removed

133
00:08:36,610 --> 00:08:40,210
Washington, DC from the data.

134
00:08:40,210 --> 00:08:44,740
So in your R Console, hit the Up arrow again, and this time,

135
00:08:44,740 --> 00:08:55,670
after guide="legend", we'll type limits=c(0,10) and hit Enter.

136
00:08:55,670 --> 00:08:58,520
Now if you look back at your graphics window,

137
00:08:58,520 --> 00:09:01,930
you can see a range of colors on the map.

138
00:09:01,930 --> 00:09:04,860
In this video, we saw how we can make a heat map

139
00:09:04,860 --> 00:09:07,340
on a map of the United States, which

140
00:09:07,340 --> 00:09:10,340
is very useful for organizations like the World Health

141
00:09:10,340 --> 00:09:14,520
Organization or government entities who want to show data

142
00:09:14,520 --> 00:09:18,510
to the public organized by state or country.

143
00:09:18,510 --> 00:09:20,780
In the next video, we'll conclude

144
00:09:20,780 --> 00:09:24,740
by discussing the analytics edge of predictive policing.
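For reference, the final plotting command with the limits argument described above can be sketched as follows. The murderMap here is a hypothetical three-square stand-in with placeholder rates, one of them above 10.

```r
library(ggplot2)

# Hypothetical stand-in: three unit squares, the third with a rate above 10
murderMap <- data.frame(
  long       = c(0, 1, 1, 0,  2, 3, 3, 2,  4, 5, 5, 4),
  lat        = rep(c(0, 0, 1, 1), times = 3),
  group      = rep(1:3, each = 4),
  MurderRate = rep(c(5, 2, 20), each = 4)
)

# limits = c(0, 10) restricts the color scale to rates between 0 and 10
p <- ggplot(murderMap, aes(x = long, y = lat, group = group, fill = MurderRate)) +
  geom_polygon(color = "black") +
  scale_fill_gradient(low = "black", high = "red",
                      guide = "legend", limits = c(0, 10))
print(p)
```

With ggplot2's default settings, values outside the limits are treated as missing on the fill scale, so the out-of-range polygon is still drawn but in the scale's grey missing-value color. The rows themselves stay in the data frame.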