1 00:00:04,500 --> 00:00:08,820 Visualization is a crucial step for initial data exploration. 2 00:00:08,820 --> 00:00:11,740 It helps us discern relationships, patterns, 3 00:00:11,740 --> 00:00:13,220 and outliers. 4 00:00:13,220 --> 00:00:15,300 This video will give us a starting point 5 00:00:15,300 --> 00:00:19,100 on how to make plots in R, but more advanced and way cooler 6 00:00:19,100 --> 00:00:23,570 visualization tips will be given in Week 8 of this class. 7 00:00:23,570 --> 00:00:26,430 Let us first create a scatterplot with Protein 8 00:00:26,430 --> 00:00:30,000 on the x-axis and Fat on the y-axis. 9 00:00:30,000 --> 00:00:32,820 To do this we can use the plot function in R 10 00:00:32,820 --> 00:00:36,710 and give it as a first input the Protein vector on the x-axis, 11 00:00:36,710 --> 00:00:41,790 and as a second input the TotalFat vector on the y-axis. 12 00:00:41,790 --> 00:00:44,950 And now pressing Enter, a new window pops up. 13 00:00:44,950 --> 00:00:48,670 If you are on a PC, the new window is called R Graphics, 14 00:00:48,670 --> 00:00:51,420 and if you are on a Mac it is called Quartz. 15 00:00:51,420 --> 00:00:55,160 The plot has a very interesting triangular shape. 16 00:00:55,160 --> 00:00:57,840 It looks like foods that are higher in protein 17 00:00:57,840 --> 00:01:01,660 are typically lower in fat, and vice versa. 18 00:01:01,660 --> 00:01:03,970 Now, looking at the aesthetics of the graph, 19 00:01:03,970 --> 00:01:06,400 we realize that R gives default names 20 00:01:06,400 --> 00:01:10,020 for the x-axis and the y-axis using the vector dollar 21 00:01:10,020 --> 00:01:11,350 notation. 22 00:01:11,350 --> 00:01:14,360 Now we can modify these labels by adding more arguments 23 00:01:14,360 --> 00:01:16,070 to the plot function. 
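A minimal sketch of the scatterplot call described above. The data frame name `foods` and its simulated values are assumptions made only so the snippet runs on its own; the course data set itself is not reproduced here.

```r
# Hypothetical stand-in data: `foods` and its values are simulated.
set.seed(1)
n <- 200
protein <- runif(n, 0, 90)
# Keep protein + fat at or below 100 g per 100 g of food,
# which gives the point cloud a roughly triangular shape.
fat <- runif(n, 0, 100 - protein)
foods <- data.frame(Protein = protein, TotalFat = fat)

# First input goes on the x-axis, second input on the y-axis.
# With no labels given, R names the axes after the expressions passed in.
plot(foods$Protein, foods$TotalFat)
```

Because the axis names default to the literal expressions (here `foods$Protein` and `foods$TotalFat`), the plot usually needs explicit labels before it is shared.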
24 00:01:16,070 --> 00:01:18,310 To go back to the R console, you can simply 25 00:01:18,310 --> 00:01:20,600 switch windows using Control-Tab 26 00:01:20,600 --> 00:01:22,620 if you are on a Windows machine. 27 00:01:22,620 --> 00:01:25,850 On a Mac, the windows are not overlaid by default, 28 00:01:25,850 --> 00:01:29,030 so accessing the console should be easy. 29 00:01:29,030 --> 00:01:32,000 OK, now let's use the Up arrow to go back to our plot 30 00:01:32,000 --> 00:01:37,770 function, and add the argument xlab = "Protein" 31 00:01:37,770 --> 00:01:40,390 and that gives us the label of the x-axis. 32 00:01:40,390 --> 00:01:46,509 Then ylab = "Fat" and this gives us the label of the y-axis. 33 00:01:46,509 --> 00:01:50,650 And let's add a title to the plot using the argument main, 34 00:01:50,650 --> 00:01:56,080 say the title is "Protein vs Fat". 35 00:01:56,080 --> 00:01:58,370 And let's change the color to red. 36 00:01:58,370 --> 00:02:00,390 And remember to put quotation marks 37 00:02:00,390 --> 00:02:03,230 around the values of all these arguments. 38 00:02:03,230 --> 00:02:06,240 And now pressing Enter and going back to the graphics window, 39 00:02:06,240 --> 00:02:10,979 we see that R made all the modifications we requested. 40 00:02:10,979 --> 00:02:15,050 Another way we can visualize our data is by plotting histograms. 41 00:02:15,050 --> 00:02:17,750 We can do this using the hist function in R, 42 00:02:17,750 --> 00:02:20,140 but note that this function now only takes 43 00:02:20,140 --> 00:02:23,620 one variable as an input, because the y-axis should 44 00:02:23,620 --> 00:02:25,170 have the frequencies. 45 00:02:25,170 --> 00:02:27,940 So let's go back to our console and create 46 00:02:27,940 --> 00:02:30,670 a histogram of VitaminC, for instance. 47 00:02:30,670 --> 00:02:34,310 So we're going to use the hist function, or histogram. 48 00:02:34,310 --> 00:02:39,980 And then the argument that it takes is the VitaminC vector. 
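The relabelled scatterplot and the first histogram call could be sketched as follows. As before, `foods` and its simulated columns are hypothetical placeholders rather than the course data.

```r
# Hypothetical stand-in data (simulated, not the course data set).
set.seed(1)
foods <- data.frame(Protein = runif(200, 0, 90))
foods$TotalFat <- runif(200, 0, 100 - foods$Protein)
foods$VitaminC <- rexp(200, rate = 0.1)

# xlab, ylab, main and col all take quoted strings here.
plot(foods$Protein, foods$TotalFat,
     xlab = "Protein",
     ylab = "Fat",
     main = "Protein vs Fat",
     col = "red")

# hist() takes a single vector; the y-axis shows the frequencies.
hist(foods$VitaminC)
```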
49 00:02:39,980 --> 00:02:44,140 Let's label the x-axis as "Vitamin C", 50 00:02:44,140 --> 00:02:46,990 and this is given to us in milligrams. 51 00:02:46,990 --> 00:02:51,500 And give it a title, say "Histogram 52 00:02:51,500 --> 00:02:56,460 of Vitamin C Levels". 53 00:02:56,460 --> 00:02:57,900 Hmm. 54 00:02:57,900 --> 00:03:00,830 Even though the maximum vitamin C content 55 00:03:00,830 --> 00:03:04,470 is 2000 milligrams, most of our foods-- 56 00:03:04,470 --> 00:03:07,700 well, to be more precise, more than 6,000 of them-- 57 00:03:07,700 --> 00:03:11,060 have less than 200 milligrams of vitamin C. 58 00:03:11,060 --> 00:03:14,080 And the histogram lumps them all together in one cell. 59 00:03:14,080 --> 00:03:17,310 Well, it would be nice if we can zoom into this section 60 00:03:17,310 --> 00:03:20,870 here and get a finer understanding of the data. 61 00:03:20,870 --> 00:03:24,650 To do this we need to limit the x-axis to go from zero 62 00:03:24,650 --> 00:03:26,380 to, say, 100 milligrams. 63 00:03:26,380 --> 00:03:29,220 So let's go back to the console, and then we're 64 00:03:29,220 --> 00:03:33,300 going to add the argument xlim to limit the x-axis. 65 00:03:33,300 --> 00:03:36,370 And using the combine function or the c function, 66 00:03:36,370 --> 00:03:38,190 we're going to set the first input 67 00:03:38,190 --> 00:03:40,680 to be 0, which is the lowest value that we want 68 00:03:40,680 --> 00:03:44,810 to see on the x-axis, and the second argument as being 100, 69 00:03:44,810 --> 00:03:49,000 which is the highest value that we want to see on the x-axis. 70 00:03:49,000 --> 00:03:50,610 And now pressing Enter, and we can 71 00:03:50,610 --> 00:03:55,810 see that R gives us 0 to 100 on the x-axis. 72 00:03:55,810 --> 00:03:58,650 But we only see this one big cell. 
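A sketch of the labelled histogram with a restricted x-axis. The simulated `VitaminC` values are an assumption, chosen so that most foods sit near zero while one reaches 2000 mg, loosely mimicking the shape described above.

```r
# Hypothetical stand-in data: mostly small values, one food forced to 2000 mg.
set.seed(1)
foods <- data.frame(
  VitaminC = c(0, 2000, pmin(rexp(998, rate = 0.1), 2000))
)

hist(foods$VitaminC,
     xlab = "Vitamin C (mg)",
     main = "Histogram of Vitamin C Levels",
     xlim = c(0, 100))  # c(lowest, highest) value to show on the x-axis
```

Note that `xlim` only changes the visible window. The cell boundaries are still computed from the full range of the data, which is why a single wide cell fills the view.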
73 00:03:58,650 --> 00:04:01,530 It seems that R only zoomed into the area, 74 00:04:01,530 --> 00:04:03,370 but it didn't break that huge cell, 75 00:04:03,370 --> 00:04:06,640 and this doesn't give us any additional information. 76 00:04:06,640 --> 00:04:11,440 So we really need to break up the cell into smaller pieces. 77 00:04:11,440 --> 00:04:15,210 And say we want 100 cells, and since the interval 78 00:04:15,210 --> 00:04:19,070 goes from 0 to 100, then we would expect R 79 00:04:19,070 --> 00:04:22,560 to create divisions that are one milligram in length. 80 00:04:22,560 --> 00:04:24,440 So let's do this. 81 00:04:24,440 --> 00:04:26,370 Let's go back to the console, and then we're 82 00:04:26,370 --> 00:04:30,790 going to add the argument breaks = 100 83 00:04:30,790 --> 00:04:32,570 and this sets the number of cells 84 00:04:32,570 --> 00:04:34,080 that we want to see to 100. 85 00:04:34,080 --> 00:04:35,600 So let's see. 86 00:04:35,600 --> 00:04:36,530 Oh. 87 00:04:36,530 --> 00:04:40,790 We actually only see five cells, and each cell 88 00:04:40,790 --> 00:04:43,000 is 20 milligrams long. 89 00:04:43,000 --> 00:04:44,380 Well, what happened? 90 00:04:44,380 --> 00:04:47,440 We were expecting 100 cells. 91 00:04:47,440 --> 00:04:49,840 Well, remember that the histogram originally 92 00:04:49,840 --> 00:04:52,600 went far beyond 100 milligrams. 93 00:04:52,600 --> 00:04:55,950 The maximum was 2000 milligrams. 94 00:04:55,950 --> 00:04:59,310 And now if we were to divide the original interval from 0 95 00:04:59,310 --> 00:05:04,880 to 2000 into 100 cells, then 2000 divided by 100, 96 00:05:04,880 --> 00:05:07,560 each cell would be 20 milligrams long. 97 00:05:07,560 --> 00:05:09,880 And this is exactly what R did. 98 00:05:09,880 --> 00:05:13,160 It actually divided all of the spectrum of values 99 00:05:13,160 --> 00:05:16,860 from 0 to 2000 into 100 cells, and not only 100 00:05:16,860 --> 00:05:19,330 the spectrum from 0 to 100. 
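The cell-width arithmetic above can be checked directly on the object that `hist` returns. This is a sketch on simulated values (the vector `vitamin_c` is hypothetical, built to span exactly 0 to 2000). A single number passed to `breaks` is treated by R as a suggestion that it rounds to convenient boundaries, so results on other data may differ slightly.

```r
# Hypothetical stand-in data spanning exactly 0 to 2000 mg.
set.seed(1)
vitamin_c <- c(0, 2000, pmin(rexp(998, rate = 0.1), 2000))

# plot = FALSE returns the computed cells without drawing anything.
h <- hist(vitamin_c, breaks = 100, plot = FALSE)

length(h$counts)        # 100 cells, spread over the full 0..2000 range
unique(diff(h$breaks))  # each cell is 20 mg wide (2000 / 100)

# Zooming to 0..100 therefore shows only 100 / 20 = 5 of those cells.
hist(vitamin_c, breaks = 100, xlim = c(0, 100),
     xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C Levels")
```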
101 00:05:19,330 --> 00:05:23,090 But we still want to divide the interval 0 to 100 102 00:05:23,090 --> 00:05:26,770 into 100 cells, each of length 1 milligram. 103 00:05:26,770 --> 00:05:28,440 And how can we do this? 104 00:05:28,440 --> 00:05:30,330 Well, we simply need to think in terms 105 00:05:30,330 --> 00:05:33,590 of the original interval, which was 0 to 2000. 106 00:05:33,590 --> 00:05:36,700 And if we were to break it into 2000 cells, 107 00:05:36,700 --> 00:05:39,090 then each one will be of length one milligram. 108 00:05:39,090 --> 00:05:42,600 So now we know that actually we needed 109 00:05:42,600 --> 00:05:44,970 to set the breaks to 2000. 110 00:05:44,970 --> 00:05:46,540 So let's do this. 111 00:05:46,540 --> 00:05:50,290 And now we obtain our refined histogram. 112 00:05:50,290 --> 00:05:52,860 And we see new information here come up. 113 00:05:52,860 --> 00:05:57,010 Remember our initial conclusion was that more than 6,000 foods 114 00:05:57,010 --> 00:06:00,290 have less than 200 milligrams of vitamin C. 115 00:06:00,290 --> 00:06:02,240 But now that we refined our graph, 116 00:06:02,240 --> 00:06:05,310 we obtained an additional level of information. 117 00:06:05,310 --> 00:06:07,980 Actually, more than 5,000 of them 118 00:06:07,980 --> 00:06:12,050 have less than one milligram of vitamin C. 119 00:06:12,050 --> 00:06:15,980 Now a third way we can visualize the data is using box plots. 120 00:06:15,980 --> 00:06:19,140 So let's go back to the console and create a box plot 121 00:06:19,140 --> 00:06:19,970 for sugar. 122 00:06:19,970 --> 00:06:23,450 So the function we're going to be using is simply boxplot, 123 00:06:23,450 --> 00:06:26,170 and similarly to the histogram function, 124 00:06:26,170 --> 00:06:30,050 the boxplot function only takes as an input a single vector. 125 00:06:30,050 --> 00:06:33,740 And in this case it would be the Sugar vector. 
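A sketch of the refined histogram, again on a hypothetical simulated vector `vitamin_c` that spans 0 to 2000. Asking for 2000 cells over that range yields cells one unit wide, so the zoomed view of 0 to 100 contains 100 of them.

```r
# Hypothetical stand-in data spanning exactly 0 to 2000 mg.
set.seed(1)
vitamin_c <- c(0, 2000, pmin(rexp(998, rate = 0.1), 2000))

# 2000 cells over 0..2000 gives cells that are 1 mg wide.
h <- hist(vitamin_c, breaks = 2000, plot = FALSE)
unique(diff(h$breaks))  # 1

# The refined, zoomed-in histogram.
hist(vitamin_c, breaks = 2000, xlim = c(0, 100),
     xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C Levels")

# Counting directly is a useful cross-check on what the first cell shows.
sum(vitamin_c < 1)
```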
126 00:06:33,740 --> 00:06:40,680 And let's create a title that says "Boxplot of Sugar Levels". 127 00:06:40,680 --> 00:06:43,500 And is it the y-axis or the x-axis 128 00:06:43,500 --> 00:06:46,820 that we have to label as the sugar level? 129 00:06:46,820 --> 00:06:49,840 So if we're not so sure, let's just plot it, 130 00:06:49,840 --> 00:06:52,120 and, oh, this should be the y-axis. 131 00:06:52,120 --> 00:06:54,840 So let's go back to the console, and simply 132 00:06:54,840 --> 00:07:00,420 set the y label to be "Sugar (g)". 133 00:07:00,420 --> 00:07:05,100 And now we have our box plot with the right labels. 134 00:07:05,100 --> 00:07:06,680 What is it trying to tell us here? 135 00:07:06,680 --> 00:07:08,520 It looks a little bit strange. 136 00:07:08,520 --> 00:07:11,290 Well, the median sugar level across the data set 137 00:07:11,290 --> 00:07:12,810 seems to be pretty low. 138 00:07:12,810 --> 00:07:15,290 It's somewhere around five grams. 139 00:07:15,290 --> 00:07:19,140 But we have a lot of outliers with extremely high values 140 00:07:19,140 --> 00:07:20,370 of sugar. 141 00:07:20,370 --> 00:07:24,440 There exist some foods that have almost 100 grams of sugar 142 00:07:24,440 --> 00:07:26,120 in 100 grams. 143 00:07:26,120 --> 00:07:29,550 Well, candies are definitely among these foods. 144 00:07:29,550 --> 00:07:31,720 So we just reviewed three ways in which 145 00:07:31,720 --> 00:07:33,600 we can visualize our data. 146 00:07:33,600 --> 00:07:36,840 In Week 8, we will see more advanced visualization tools 147 00:07:36,840 --> 00:07:39,250 to make more informative plots. 148 00:07:39,250 --> 00:07:42,730 In our next video we will see how we can construct and add 149 00:07:42,730 --> 00:07:45,790 new variables to our data set.
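A sketch of the labelled box plot on simulated data. The `sugar` vector is a hypothetical skewed sample capped at 100 g, not the course data. The thick line inside the box marks the median, and points beyond the whiskers are drawn individually as outliers.

```r
# Hypothetical stand-in data: skewed sugar values, capped at 100 g per 100 g.
set.seed(1)
sugar <- pmin(rexp(1000, rate = 1 / 8), 100)

# boxplot() takes a single vector; the values run along the y-axis.
boxplot(sugar,
        main = "Boxplot of Sugar Levels",
        ylab = "Sugar (g)")

# Numeric companions to the picture.
median(sugar)                       # centre line of the box
mean(sugar)                         # pulled upward by the high values
length(boxplot.stats(sugar)$out)    # how many points are flagged as outliers
```

In a skewed sample like this one the mean sits above the median, so it helps to read the box's centre line as a median rather than an average.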