1 00:00:04,500 --> 00:00:08,200 In this video, we will do some basic data analysis. 2 00:00:08,200 --> 00:00:10,260 All that I've done since our previous video 3 00:00:10,260 --> 00:00:13,840 is clear the console, but R still has all the information 4 00:00:13,840 --> 00:00:14,980 stored. 5 00:00:14,980 --> 00:00:17,860 In fact, if we use the Up Arrow on our keyboard, 6 00:00:17,860 --> 00:00:20,120 we retrieve the last command we typed, 7 00:00:20,120 --> 00:00:23,740 which was the summary of the USDA data frame. 8 00:00:23,740 --> 00:00:26,670 And as a quick reminder, at the end of our last video, 9 00:00:26,670 --> 00:00:29,710 we realized that the maximum level of Sodium 10 00:00:29,710 --> 00:00:34,890 was 38,758 milligrams, which is very high. 11 00:00:34,890 --> 00:00:38,210 We would like to see which food this corresponds to. 12 00:00:38,210 --> 00:00:41,060 Well, to check the values of sodium levels in the foods 13 00:00:41,060 --> 00:00:43,790 within the data set, we can type USDA$Sodium. 14 00:00:46,570 --> 00:00:49,220 This gives us a series of numbers associated 15 00:00:49,220 --> 00:00:52,960 with the amount of sodium in all the foods in our data set. 16 00:00:52,960 --> 00:00:55,920 Remember from the lecture that this is called a vector, 17 00:00:55,920 --> 00:00:59,160 and it is associated with the variable Sodium. 18 00:00:59,160 --> 00:01:01,910 For instance, the sodium level of the last food 19 00:01:01,910 --> 00:01:05,960 in our data set is 68 milligrams. 20 00:01:05,960 --> 00:01:09,289 Now, to find which food has the highest level of sodium, 21 00:01:09,289 --> 00:01:13,230 we can simply use the function which.max, which 22 00:01:13,230 --> 00:01:17,120 takes as an input the Sodium vector, 23 00:01:17,120 --> 00:01:20,560 and it gives us the index of the food with the highest sodium 24 00:01:20,560 --> 00:01:21,600 level. 
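The two steps just described (pulling out the Sodium vector with the dollar notation, then locating its largest entry with which.max) can be sketched as follows. This is a minimal illustration on a small made-up data frame, not the real USDA data: the food names and all sodium values other than the 38,758 and 1,500 figures quoted in the video are assumptions chosen for the example.

```r
# Toy stand-in for the USDA data frame (hypothetical rows, not the real data)
USDA <- data.frame(
  Description = c("BUTTER", "SALT,TABLE", "CAVIAR", "APPLE"),
  Sodium = c(700, 38758, 1500, 1),
  stringsAsFactors = FALSE
)

# The dollar notation extracts one column of the data frame as a vector
sodium <- USDA$Sodium
print(sodium)

# which.max returns the position (index) of the largest value in the vector
top_index <- which.max(sodium)
print(top_index)
```

Note that which.max returns a position in the vector, not the value itself, which is why a second lookup is needed to learn which food it is.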
25 00:01:21,600 --> 00:01:25,350 In this case, the 265th food in our data set 26 00:01:25,350 --> 00:01:28,020 has the maximum sodium content. 27 00:01:28,020 --> 00:01:30,430 Now, to know which food that is, we 28 00:01:30,430 --> 00:01:32,970 need to take a look at the vector corresponding 29 00:01:32,970 --> 00:01:35,940 to the text description of the foods. 30 00:01:35,940 --> 00:01:39,390 However, I cannot remember the exact name of that variable 31 00:01:39,390 --> 00:01:42,340 off the top of my head to be able to call it in R. 32 00:01:42,340 --> 00:01:45,270 But we can use the function names, 33 00:01:45,270 --> 00:01:49,560 which takes as an input the USDA data frame and gives us 34 00:01:49,560 --> 00:01:53,900 the exact names of all the variables as stored in the USDA 35 00:01:53,900 --> 00:01:55,210 data frame. 36 00:01:55,210 --> 00:01:58,590 And now we know that the name of the variable we're looking for 37 00:01:58,590 --> 00:02:00,490 is Description. 38 00:02:00,490 --> 00:02:04,340 So now, to get the name of the 265th food, 39 00:02:04,340 --> 00:02:08,740 we simply need to ask R to pick the 265th element 40 00:02:08,740 --> 00:02:10,600 from the vector Description. 41 00:02:10,600 --> 00:02:15,350 So we use our dollar notation to call the Description vector 42 00:02:15,350 --> 00:02:19,780 and then square brackets around the index 265, 43 00:02:19,780 --> 00:02:24,360 and the winner is table salt! 44 00:02:24,360 --> 00:02:30,960 Well, having 38,758 milligrams of sodium in 100 grams of table 45 00:02:30,960 --> 00:02:35,140 salt sort of makes sense, but none of us 46 00:02:35,140 --> 00:02:37,829 would eat 100 grams of salt in one sitting. 47 00:02:37,829 --> 00:02:40,790 So it might be more interesting to find 48 00:02:40,790 --> 00:02:43,800 out which foods, for instance, contain more than, 49 00:02:43,800 --> 00:02:46,920 say, 10,000 milligrams of sodium.
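As a rough sketch of the lookup described above: names lists the variable names of a data frame, and square brackets pick one element of a vector by its index. The data frame below is a small made-up stand-in (the ID column, the food names, and most sodium values are assumptions), so the index found here is 2 rather than 265.

```r
# Toy stand-in for the USDA data frame (hypothetical rows, not the real data)
USDA <- data.frame(
  ID = c(1001, 1002, 1003, 1004),
  Description = c("BUTTER", "SALT,TABLE", "CAVIAR", "APPLE"),
  Sodium = c(700, 38758, 1500, 1),
  stringsAsFactors = FALSE
)

# names() returns the variable (column) names exactly as stored
var_names <- names(USDA)
print(var_names)

# Index of the food with the highest sodium level
idx <- which.max(USDA$Sodium)

# Dollar notation selects the Description vector; [idx] picks one element
top_food <- USDA$Description[idx]
print(top_food)
```

Passing the result of which.max straight into the brackets avoids typing the index by hand, which matters once the data set changes.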
50 00:02:46,920 --> 00:02:49,720 To do so, we can create a new data frame, 51 00:02:49,720 --> 00:02:52,520 and let's call it HighSodium. 52 00:02:52,520 --> 00:02:56,200 And this is going to be a subset of our original data 53 00:02:56,200 --> 00:02:59,300 frame, USDA, with only the foods that 54 00:02:59,300 --> 00:03:03,920 have sodium content that exceeds 10,000. 55 00:03:03,920 --> 00:03:06,770 Now that we have created this new data frame, 56 00:03:06,770 --> 00:03:10,190 to see how many foods there are in this new data frame, 57 00:03:10,190 --> 00:03:13,560 we need to see how many observations this data 58 00:03:13,560 --> 00:03:14,610 frame has. 59 00:03:14,610 --> 00:03:18,510 And this can be done by using the function nrow, which 60 00:03:18,510 --> 00:03:24,190 computes the number of rows in the data frame HighSodium. 61 00:03:24,190 --> 00:03:27,150 And then we obtain 10 foods with sodium levels 62 00:03:27,150 --> 00:03:29,480 above 10,000 milligrams. 63 00:03:29,480 --> 00:03:33,000 Since there are not many, we can output the names of these foods 64 00:03:33,000 --> 00:03:35,850 by looking at their Description vector. 65 00:03:35,850 --> 00:03:37,810 But this time, the Description vector 66 00:03:37,810 --> 00:03:41,110 is not associated with the USDA data frame 67 00:03:41,110 --> 00:03:44,360 but with the HighSodium data frame. 68 00:03:44,360 --> 00:03:50,000 So HighSodium$Description, and now pressing Enter, 69 00:03:50,000 --> 00:03:52,620 we obtain the names of these 10 foods. 70 00:03:52,620 --> 00:03:54,990 So definitely table salt is one of them. 71 00:03:54,990 --> 00:04:00,140 We also have dry soup, gravy, some leavening agents, 72 00:04:00,140 --> 00:04:04,510 but I thought caviar was well known to be among the top 10 73 00:04:04,510 --> 00:04:06,910 foods with the highest levels of sodium. 74 00:04:06,910 --> 00:04:09,660 But it doesn't appear in this list. 75 00:04:09,660 --> 00:04:13,390 Let's find out how much sodium it has in 100 grams.
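The subsetting step above could look roughly like this. It is a sketch on a made-up data frame: the food names and every sodium value except 38,758 and 1,500 are assumptions, so the toy subset has 2 rows rather than the 10 found in the real data.

```r
# Toy stand-in for the USDA data frame (hypothetical rows, not the real data)
USDA <- data.frame(
  Description = c("BUTTER", "SALT,TABLE", "CAVIAR", "SOUP,DRY MIX", "APPLE"),
  Sodium = c(700, 38758, 1500, 24000, 1),
  stringsAsFactors = FALSE
)

# Keep only the rows whose Sodium value exceeds 10,000 mg
HighSodium <- subset(USDA, Sodium > 10000)

# nrow() counts the observations (rows) in the new data frame
n_high <- nrow(HighSodium)
print(n_high)

# The Description vector now belongs to HighSodium, not USDA
print(HighSodium$Description)
```

Inside subset, the condition refers to the column name directly (Sodium rather than USDA$Sodium), because the expression is evaluated within the data frame.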
76 00:04:13,390 --> 00:04:16,360 Now, obviously, this task would have been very easy 77 00:04:16,360 --> 00:04:19,269 if we knew the index of caviar in our data set, 78 00:04:19,269 --> 00:04:23,070 and could simply feed it into the vector Sodium. 79 00:04:23,070 --> 00:04:25,810 However, we need to get the index of caviar, 80 00:04:25,810 --> 00:04:28,260 and to do this, we need to track down 81 00:04:28,260 --> 00:04:31,610 the word caviar in the text description. 82 00:04:31,610 --> 00:04:34,310 To do this, we can use the match function 83 00:04:34,310 --> 00:04:40,020 and ask R to look for the word caviar in the Description vector. 84 00:04:40,020 --> 00:04:40,860 So USDA$Description. 85 00:04:43,670 --> 00:04:47,030 And now pressing Enter, we obtain that caviar is 86 00:04:47,030 --> 00:04:51,360 the 4,154th food in our data set. 87 00:04:51,360 --> 00:04:55,840 So now finding the sodium level of caviar is a piece of cake. 88 00:04:55,840 --> 00:05:01,280 We just type USDA$Sodium and, using the square brackets with 89 00:05:01,280 --> 00:05:06,950 the index 4,154, ask R to pick the sodium level of caviar 90 00:05:06,950 --> 00:05:07,450 for us. 91 00:05:07,450 --> 00:05:11,220 And this is 1,500 milligrams. 92 00:05:11,220 --> 00:05:15,130 Now, to find the level of sodium in caviar, we used two steps, 93 00:05:15,130 --> 00:05:18,450 but we can actually lump them into one single step. 94 00:05:18,450 --> 00:05:21,150 So let's use the Up Arrow twice to go back 95 00:05:21,150 --> 00:05:24,240 to the match function, and we know that this match function 96 00:05:24,240 --> 00:05:27,290 gives us an index that then should be fed into the Sodium 97 00:05:27,290 --> 00:05:29,010 vector using square brackets. 98 00:05:29,010 --> 00:05:32,190 So let's enclose it in square brackets, 99 00:05:32,190 --> 00:05:34,800 and then at the beginning we're going to just write 100 00:05:34,800 --> 00:05:37,530 USDA$Sodium.
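Here is a minimal sketch of the two-step lookup and its one-step version, again on a made-up data frame. The exact string stored for caviar is an assumption here ("CAVIAR" in upper case); match compares whole strings exactly, so the search term must agree with the stored description character for character.

```r
# Toy stand-in for the USDA data frame (hypothetical rows, not the real data)
USDA <- data.frame(
  Description = c("BUTTER", "SALT,TABLE", "CAVIAR", "APPLE"),
  Sodium = c(700, 38758, 1500, 1),
  stringsAsFactors = FALSE
)

# Step 1: match() returns the position of the first exact match
caviar_index <- match("CAVIAR", USDA$Description)
print(caviar_index)

# Step 2: use that index to pick the sodium level
print(USDA$Sodium[caviar_index])

# Both steps in one: the match() call goes inside the square brackets
caviar_sodium <- USDA$Sodium[match("CAVIAR", USDA$Description)]
print(caviar_sodium)
```

If the search term does not occur in the vector, match returns NA, and indexing with NA yields NA as well, so a misspelled name shows up as a missing value rather than an error.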
101 00:05:37,530 --> 00:05:39,140 And, again, of course, this gives us 102 00:05:39,140 --> 00:05:45,080 1,500 milligrams of sodium in 100 grams of caviar. 103 00:05:45,080 --> 00:05:48,020 Now, the value 1,500 milligrams seems 104 00:05:48,020 --> 00:05:52,570 to be very small compared to 10,000 milligrams or 38,000 105 00:05:52,570 --> 00:05:55,890 milligrams, which are the values that we have worked with so far 106 00:05:55,890 --> 00:05:58,570 with respect to sodium levels. 107 00:05:58,570 --> 00:06:01,680 But this doesn't seem to be a fair comparison. 108 00:06:01,680 --> 00:06:04,870 Maybe the best way to figure out how big this value is, 109 00:06:04,870 --> 00:06:08,240 is by comparing it to the mean and the standard deviation 110 00:06:08,240 --> 00:06:11,640 of the sodium levels across the data set. 111 00:06:11,640 --> 00:06:13,830 To find the mean, we know that this information 112 00:06:13,830 --> 00:06:16,650 is given to us by the summary function. 113 00:06:16,650 --> 00:06:19,420 So let's use the summary function, and this time, 114 00:06:19,420 --> 00:06:21,740 give it as input the Sodium vector 115 00:06:21,740 --> 00:06:25,070 instead of the whole USDA data frame. 116 00:06:25,070 --> 00:06:31,450 And we can see that the mean sodium value is 322 milligrams. 117 00:06:31,450 --> 00:06:33,310 However, the summary function does not 118 00:06:33,310 --> 00:06:35,600 give us standard deviation information, 119 00:06:35,600 --> 00:06:38,560 but we can get this using the function sd, which 120 00:06:38,560 --> 00:06:40,550 stands for standard deviation. 121 00:06:40,550 --> 00:06:44,580 Give it as an input the Sodium vector, 122 00:06:44,580 --> 00:06:49,310 and, oh, we obtain NA, meaning not available. 123 00:06:49,310 --> 00:06:51,940 Well, we got NA because we forgot to remove 124 00:06:51,940 --> 00:06:54,220 the missing (NA) entries before computing 125 00:06:54,220 --> 00:06:55,420 our statistical measure.
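The behaviour just described can be reproduced on a short made-up vector (the values below are assumptions for illustration, not USDA data): summary reports the mean with missing values set aside, while sd returns NA as soon as the vector contains a missing entry.

```r
# Hypothetical sodium values with one missing entry
sodium <- c(100, 200, 300, NA)

# summary() on a vector reports min, quartiles, median, mean, max,
# and counts the missing values separately
print(summary(sodium))

# sd() does not drop missing values by default, so the result is NA
sd_raw <- sd(sodium)
print(sd_raw)
```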
126 00:06:55,420 --> 00:06:59,260 So let's use the Up Arrow to go back to the standard deviation 127 00:06:59,260 --> 00:07:03,120 function, and now we have to explicitly tell R to remove 128 00:07:03,120 --> 00:07:05,830 these missing entries by typing na.rm=TRUE. 129 00:07:09,480 --> 00:07:13,840 And now the standard deviation is 1,045 milligrams. 130 00:07:13,840 --> 00:07:18,020 Note that, if we add the mean and the standard deviation, 131 00:07:18,020 --> 00:07:21,020 we obtain around 1,400 milligrams, which 132 00:07:21,020 --> 00:07:24,150 is still smaller than the amount of sodium 133 00:07:24,150 --> 00:07:26,370 in 100 grams of caviar. 134 00:07:26,370 --> 00:07:30,110 Well, this means that caviar is pretty rich in sodium 135 00:07:30,110 --> 00:07:33,470 compared to most of the foods in our data set. 136 00:07:33,470 --> 00:07:37,060 Now that we know how to do a basic analysis of our data, 137 00:07:37,060 --> 00:07:39,420 let's look at the plotting functionality in R 138 00:07:39,420 --> 00:07:41,820 in our next video.
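As a sketch of the fix and of the comparison that follows: the first part uses a short made-up vector (an assumption for illustration) to show the effect of na.rm=TRUE, and the second part redoes the arithmetic with the rounded figures quoted in the video (mean 322, standard deviation 1,045, caviar 1,500).

```r
# Hypothetical sodium values with one missing entry
sodium <- c(100, 200, 300, NA)

# na.rm = TRUE tells sd() to drop missing values before computing
sd_clean <- sd(sodium, na.rm = TRUE)
print(sd_clean)

# Rounded figures quoted in the video, in milligrams per 100 grams
mean_sodium <- 322
sd_sodium <- 1045
caviar_sodium <- 1500

# Mean plus one standard deviation
threshold <- mean_sodium + sd_sodium
print(threshold)

# Is caviar more than one standard deviation above the mean?
print(caviar_sodium > threshold)

# How many standard deviations above the mean caviar sits
z_score <- (caviar_sodium - mean_sodium) / sd_sodium
print(round(z_score, 2))
```

With these rounded inputs the sum is 1,367 milligrams, which is the "around 1,400" mentioned in the video, and caviar lies a little more than one standard deviation above the mean.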