1 00:00:04,490 --> 00:00:07,640 In this video, we'll create a basic line plot 2 00:00:07,640 --> 00:00:10,280 to visualize crime trends. 3 00:00:10,280 --> 00:00:12,480 Let's start by reading in our data. 4 00:00:12,480 --> 00:00:16,260 We'll call it mvt for motor vehicle thefts, 5 00:00:16,260 --> 00:00:22,520 and use the read.csv function to read in the file mvt.csv. 6 00:00:22,520 --> 00:00:27,580 We'll add the argument stringsAsFactors = FALSE, 7 00:00:27,580 --> 00:00:29,220 since we have a text field, and we 8 00:00:29,220 --> 00:00:32,759 want to make sure it's read in properly. 9 00:00:32,759 --> 00:00:35,300 Let's take a look at the structure of our data 10 00:00:35,300 --> 00:00:38,200 using the str function. 11 00:00:38,200 --> 00:00:42,390 We have over 190,000 observations 12 00:00:42,390 --> 00:00:45,950 of three different variables-- the date of the crime, 13 00:00:45,950 --> 00:00:47,990 and the location of the crime, in terms 14 00:00:47,990 --> 00:00:50,740 of latitude and longitude. 15 00:00:50,740 --> 00:00:54,090 We want to first convert the Date variable to a format 16 00:00:54,090 --> 00:00:58,210 that R will recognize so that we can extract the day of the week 17 00:00:58,210 --> 00:01:00,160 and the hour of the day. 18 00:01:00,160 --> 00:01:04,970 We can do this using the strptime function. 19 00:01:04,970 --> 00:01:08,510 So we want to replace our variable, Date, 20 00:01:08,510 --> 00:01:12,950 with the output of the strptime function, which 21 00:01:12,950 --> 00:01:17,320 takes as a first argument our variable, Date, and then 22 00:01:17,320 --> 00:01:20,910 as a second argument the format that the date is in. 23 00:01:20,910 --> 00:01:24,730 Here, we can see in the output from the str function 24 00:01:24,730 --> 00:01:31,550 that our format is the month slash the day slash the year, 25 00:01:31,550 --> 00:01:35,090 and then the hour colon minutes. 26 00:01:35,090 --> 00:02:02,770 So our format equals, "%m/%d/%y %H:%M", close the parentheses, 27 00:02:02,770 --> 00:02:05,000 and hit Enter. 28 00:02:05,000 --> 00:02:08,500 In this format, we can extract the hour and the day 29 00:02:08,500 --> 00:02:10,560 of the week from the Date variable, 30 00:02:10,560 --> 00:02:14,280 and we can add these as new variables to our data frame. 31 00:02:14,280 --> 00:02:18,690 We can do this by first defining our new variable, 32 00:02:18,690 --> 00:02:20,070 mvt$Weekday = weekdays(mvt$Date). 33 00:02:29,150 --> 00:02:33,180 Then, to add the hour, which we'll call mvt$Hour, 34 00:02:33,180 --> 00:02:39,079 we just take the hour variable out of Date variable. 35 00:02:39,079 --> 00:02:43,970 This only exists because we converted the Date variable. 36 00:02:43,970 --> 00:02:45,990 Let's take a look at the structure of our data 37 00:02:45,990 --> 00:02:48,610 again to see what it looks like. 38 00:02:48,610 --> 00:02:51,520 Now, we have two more variables-- Weekday, 39 00:02:51,520 --> 00:02:54,100 which gives the day of the week, and Hour, 40 00:02:54,100 --> 00:02:56,770 which gives the hour of the day. 41 00:02:56,770 --> 00:02:59,410 Now, we're ready to make some line plots. 42 00:02:59,410 --> 00:03:01,400 Let's start by creating the line plot 43 00:03:01,400 --> 00:03:05,470 we saw in the previous video with just one line and a value 44 00:03:05,470 --> 00:03:08,190 for every day of the week. 45 00:03:08,190 --> 00:03:11,030 We want to plot as that value the total number 46 00:03:11,030 --> 00:03:13,930 of crimes on each day of the week. 47 00:03:13,930 --> 00:03:16,329 We can get this information by creating 48 00:03:16,329 --> 00:03:19,470 a table of the Weekday variable. 49 00:03:22,820 --> 00:03:27,510 This gives the total amount of crime on each day of the week. 50 00:03:27,510 --> 00:03:30,130 Let's save this table as a data frame 51 00:03:30,130 --> 00:03:33,360 so that we can pass it to ggplot as our data. 52 00:03:33,360 --> 00:03:40,980 We'll call it WeekdayCounts, and use the as.data.frame function 53 00:03:40,980 --> 00:03:43,430 to convert our table to a data frame. 54 00:03:50,000 --> 00:03:52,650 Let's see what this looks like with the str function. 55 00:03:56,300 --> 00:03:59,829 We can see that our data frame has seven observations, one 56 00:03:59,829 --> 00:04:03,400 for each day of the week, and two different variables. 57 00:04:03,400 --> 00:04:06,740 The first variable, called Var1, gives 58 00:04:06,740 --> 00:04:10,480 the name of the day of the week, and the second variable, 59 00:04:10,480 --> 00:04:15,010 called Freq, for frequency, gives the total amount of crime 60 00:04:15,010 --> 00:04:17,660 on that day of the week. 61 00:04:17,660 --> 00:04:19,610 Now, we're ready to make our plot. 62 00:04:19,610 --> 00:04:23,310 First, we need to load the ggplot2 package. 63 00:04:23,310 --> 00:04:24,610 So we'll type library(ggplot2). 64 00:04:29,120 --> 00:04:32,630 Now, we'll create our plot using the ggplot function. 65 00:04:32,630 --> 00:04:35,560 So type ggplot, and then we need to give 66 00:04:35,560 --> 00:04:39,740 the name of our data, which is WeekdayCounts. 67 00:04:39,740 --> 00:04:42,500 And then we need to define our aesthetic. 68 00:04:42,500 --> 00:04:47,730 So our aesthetic should have x = Var1, 69 00:04:47,730 --> 00:04:51,200 since we want the day of the week on the x-axis, 70 00:04:51,200 --> 00:04:55,540 and y = Freq, since we want the frequency, 71 00:04:55,540 --> 00:04:58,860 the number of crimes, on the y-axis. 72 00:04:58,860 --> 00:05:00,910 Now, we just need to add geom_line(aes(group=1)). 73 00:05:12,670 --> 00:05:15,680 This just groups all of our data into one line, 74 00:05:15,680 --> 00:05:18,620 since we want one line in our plot. 75 00:05:18,620 --> 00:05:20,900 Go ahead and hit Enter. 76 00:05:20,900 --> 00:05:24,280 We can see that this is very close to the plot we want. 77 00:05:24,280 --> 00:05:27,050 We have the total number of crime plotted 78 00:05:27,050 --> 00:05:29,530 by day of the week, but our days of the week 79 00:05:29,530 --> 00:05:31,240 are a little bit out of order. 80 00:05:31,240 --> 00:05:34,600 We have Friday first, then Monday, then Saturday, 81 00:05:34,600 --> 00:05:36,790 then Sunday, etc. 82 00:05:36,790 --> 00:05:39,420 What ggplot did was it put the days of the week 83 00:05:39,420 --> 00:05:41,510 in alphabetical order. 84 00:05:41,510 --> 00:05:44,780 But we actually want the days of the week in chronological order 85 00:05:44,780 --> 00:05:47,840 to make this plot a bit easier to read. 86 00:05:47,840 --> 00:05:52,560 We can do this by making the Var1 variable an ordered factor 87 00:05:52,560 --> 00:05:53,930 variable. 88 00:05:53,930 --> 00:05:58,030 This signals to ggplot that the ordering is meaningful. 89 00:05:58,030 --> 00:06:01,890 We can do this by using the factor function. 90 00:06:01,890 --> 00:06:08,450 So let's start by typing WeekdayCounts$Var1, 91 00:06:08,450 --> 00:06:11,810 the variable we want to convert, and set that equal 92 00:06:11,810 --> 00:06:14,430 to the output of the factor function, 93 00:06:14,430 --> 00:06:21,590 where the first argument is our variable, WeekdayCounts$Var1, 94 00:06:21,590 --> 00:06:24,950 the second argument is ordered = TRUE. 95 00:06:24,950 --> 00:06:27,770 This says that we want an ordered factor. 96 00:06:27,770 --> 00:06:30,450 And the third argument, which is levels, 97 00:06:30,450 --> 00:06:32,990 should be equal to a vector of the days 98 00:06:32,990 --> 00:06:35,460 of the week in the order we want them to be in. 99 00:06:35,460 --> 00:06:38,140 We'll use the c function to do this. 100 00:06:38,140 --> 00:06:40,500 So first, in quotes, type "Sunday" -- 101 00:06:40,500 --> 00:06:48,110 we want Sunday first-- and then "Monday", "Tuesday", 102 00:06:48,110 --> 00:06:58,290 "Wednesday", "Thursday", "Friday", "Saturday". 103 00:07:01,400 --> 00:07:05,480 Go ahead and close both parentheses and hit Enter. 104 00:07:05,480 --> 00:07:08,900 Now, let's try our plot again by just hitting the up arrow twice 105 00:07:08,900 --> 00:07:11,000 and hitting Enter. 106 00:07:11,000 --> 00:07:13,010 Now, this is the plot we want. 107 00:07:13,010 --> 00:07:15,710 We have the total crime by day of the week 108 00:07:15,710 --> 00:07:18,920 with the days of the week in chronological order. 109 00:07:18,920 --> 00:07:20,880 The last thing we'll want to do to our plot 110 00:07:20,880 --> 00:07:23,360 is just change the x- and y-axis labels, 111 00:07:23,360 --> 00:07:26,060 since they're not very helpful as they are now. 112 00:07:26,060 --> 00:07:28,180 To do this, back in the R console, 113 00:07:28,180 --> 00:07:31,400 just hit the up arrow to get back to our plotting line, 114 00:07:31,400 --> 00:07:33,200 and then we'll add xlab("Day of the Week"). 115 00:07:39,630 --> 00:07:41,880 And then we'll add ylab("Total Motor Vehicle Thefts"). 116 00:07:54,170 --> 00:07:56,040 Now, this is the plot we were trying 117 00:07:56,040 --> 00:08:00,810 to generate with descriptive labels on the x- and y-axis. 118 00:08:00,810 --> 00:08:03,990 In the next video, we'll add the hour of the day 119 00:08:03,990 --> 00:08:08,170 to our line plot, and then create a heat map.