1 00:00:04,500 --> 00:00:06,390 Remember that in our previous video, 2 00:00:06,390 --> 00:00:11,000 we created four new variables, HighSodium, HighFat, HighCarbs, 3 00:00:11,000 --> 00:00:12,850 and HighProtein. 4 00:00:12,850 --> 00:00:15,160 Now in this video, we will try to understand 5 00:00:15,160 --> 00:00:17,850 our data and the relationships between our variables 6 00:00:17,850 --> 00:00:21,770 better, using the table and tapply functions. 7 00:00:21,770 --> 00:00:24,620 To figure out how many foods have a higher sodium level 8 00:00:24,620 --> 00:00:28,010 than average, we want to look at the HighSodium variable 9 00:00:28,010 --> 00:00:31,390 and count the foods that have the value 1. 10 00:00:31,390 --> 00:00:34,020 We can do this using the table function, 11 00:00:34,020 --> 00:00:38,960 and give it as an input the HighSodium vector. 12 00:00:38,960 --> 00:00:42,770 Now pressing Enter, we obtain the following information. 13 00:00:42,770 --> 00:00:44,520 Most of the foods in our data set, 14 00:00:44,520 --> 00:00:50,000 and precisely 4,884 of them, have lower sodium than average, 15 00:00:50,000 --> 00:00:53,570 and we have 2,090 foods that have higher sodium 16 00:00:53,570 --> 00:00:55,570 content than average. 17 00:00:55,570 --> 00:00:57,150 Now let's see how many foods have 18 00:00:57,150 --> 00:00:59,770 both high sodium and high fat. 19 00:00:59,770 --> 00:01:02,570 Well, to do this we can also use the table function, 20 00:01:02,570 --> 00:01:05,319 but instead of giving it one input, now 21 00:01:05,319 --> 00:01:06,930 we can give it two inputs. 22 00:01:06,930 --> 00:01:09,760 So let's go back using the Up arrow, 23 00:01:09,760 --> 00:01:12,690 and now have the first input being the HighSodium 24 00:01:12,690 --> 00:01:18,789 vector and the second input being the HighFat vector. 25 00:01:18,789 --> 00:01:21,289 And we obtain the following table.
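The two table calls described so far might look roughly like the sketch below. The data frame name foods and its six rows are made up for illustration; they are not the data set used in the video, so the counts differ from the ones quoted in the narration.

```r
# Hypothetical toy data (an assumption for illustration, not the video's data set).
# 1 means "above average", 0 means "at or below average".
foods <- data.frame(
  HighSodium = c(0, 1, 0, 0, 1, 0),
  HighFat    = c(0, 1, 1, 0, 0, 0)
)

# One input: how many foods fall into each HighSodium category.
sodium_counts <- table(foods$HighSodium)
print(sodium_counts)

# Two inputs: rows follow the first input (HighSodium),
# columns follow the second input (HighFat).
sodium_fat <- table(foods$HighSodium, foods$HighFat)
print(sodium_fat)
```

In the two-way table, the cell in row "1" and column "1" counts the foods that are high in both sodium and fat.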
26 00:01:21,289 --> 00:01:24,880 The rows belong to the first input, which is HighSodium, 27 00:01:24,880 --> 00:01:27,510 and the columns correspond to the second input, 28 00:01:27,510 --> 00:01:29,200 which is HighFat. 29 00:01:29,200 --> 00:01:31,180 So from the table we see that we have 30 00:01:31,180 --> 00:01:36,360 3,529 foods with low sodium and low fat, 31 00:01:36,360 --> 00:01:43,560 1,355 foods with low sodium and high fat, 1,378 32 00:01:43,560 --> 00:01:46,650 foods with high sodium but low fat, 33 00:01:46,650 --> 00:01:52,750 and finally 712 foods with both high sodium and high fat. 34 00:01:52,750 --> 00:01:56,370 Now, what if we want to compute the average amount of iron 35 00:01:56,370 --> 00:01:59,110 sorted by high and low protein? 36 00:01:59,110 --> 00:02:02,470 Well, to do this we can use the tapply function. 37 00:02:02,470 --> 00:02:04,670 Let us have a little refresher on how 38 00:02:04,670 --> 00:02:07,730 the tapply function works. 39 00:02:07,730 --> 00:02:10,990 The tapply function takes three arguments, and groups 40 00:02:10,990 --> 00:02:13,870 the first argument according to the second argument, 41 00:02:13,870 --> 00:02:16,230 and then applies argument three. 42 00:02:16,230 --> 00:02:18,980 For instance, we wanted to compute the average amount 43 00:02:18,980 --> 00:02:22,300 of iron sorted by high and low protein. 44 00:02:22,300 --> 00:02:24,500 In this case, the first argument is 45 00:02:24,500 --> 00:02:26,130 whatever we are trying to analyze, 46 00:02:26,130 --> 00:02:28,750 which is the Iron vector, and we are sorting it 47 00:02:28,750 --> 00:02:31,210 according to the vector HighProtein, which 48 00:02:31,210 --> 00:02:32,710 is our second argument. 49 00:02:32,710 --> 00:02:34,770 And finally we apply the mean function 50 00:02:34,770 --> 00:02:37,360 in R on the sorted Iron values. 51 00:02:37,360 --> 00:02:38,960 And we should not forget to remove 52 00:02:38,960 --> 00:02:41,270 the nonavailable entries. 
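As a rough sketch of the argument order just described, the example below calls tapply on two short made-up vectors (iron and high_protein are hypothetical names and values, not taken from the video's data). It also shows why the missing entries need to be removed: without na.rm = TRUE, a single NA makes the whole group mean NA.

```r
# Hypothetical values for illustration only; NA marks a missing measurement.
iron         <- c(1.0, NA, 3.0, 2.0, 4.0)
high_protein <- c(0,   0,  0,   1,   1)

# Argument 1: the values to analyze.
# Argument 2: the grouping vector.
# Argument 3: the function applied to each group.
without_rm <- tapply(iron, high_protein, mean)
print(without_rm)  # the group containing NA has an NA mean

# Extra arguments after the function are passed on to it,
# so na.rm = TRUE reaches mean() and the NA is dropped.
with_rm <- tapply(iron, high_protein, mean, na.rm = TRUE)
print(with_rm)
```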
53 00:02:41,270 --> 00:02:44,090 So what does tapply do exactly? 54 00:02:44,090 --> 00:02:45,990 Suppose that we have the following data 55 00:02:45,990 --> 00:02:48,530 frame with the foods one through six, 56 00:02:48,530 --> 00:02:51,260 along with information about their Iron levels 57 00:02:51,260 --> 00:02:55,800 and their values of HighProtein that we just added earlier. 58 00:02:55,800 --> 00:02:58,230 The first step is grouping the Iron data 59 00:02:58,230 --> 00:03:00,690 according to the values of HighProtein. 60 00:03:00,690 --> 00:03:04,570 So, first group, all the foods that have HighProtein 61 00:03:04,570 --> 00:03:07,440 equal 1, and that would be food number two 62 00:03:07,440 --> 00:03:10,930 with 12.8 milligrams of iron, food number three 63 00:03:10,930 --> 00:03:14,930 with 1.44 milligrams of iron, and food number six 64 00:03:14,930 --> 00:03:18,060 with 2.29 milligrams of iron. 65 00:03:18,060 --> 00:03:21,000 Then we group the remaining foods that have protein levels 66 00:03:21,000 --> 00:03:23,140 below average, and this corresponds 67 00:03:23,140 --> 00:03:27,420 to food one, food four, and food five. 68 00:03:27,420 --> 00:03:31,270 Then we compute the mean of Iron level for each group. 69 00:03:31,270 --> 00:03:34,100 In this case, the mean of the group with high protein levels 70 00:03:34,100 --> 00:03:38,210 is 5.51, and the mean of the group with low protein levels 71 00:03:38,210 --> 00:03:40,120 is 1.72. 72 00:03:40,120 --> 00:03:43,540 And this is the result of the tapply function. 73 00:03:43,540 --> 00:03:45,640 Now let's go back to R and have a hands 74 00:03:45,640 --> 00:03:49,829 on practice on how to use the tapply function. 75 00:03:49,829 --> 00:03:51,460 So let's compute the average amount 76 00:03:51,460 --> 00:03:53,980 of iron sorted by protein levels. 
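The six-food walkthrough above can be reproduced step by step. In this sketch the iron values for foods two, three, and six are the ones quoted in the narration; the values for foods one, four, and five are not given there, so the numbers used here are placeholders chosen only so that their mean comes out to 1.72.

```r
# Foods 2, 3 and 6 use the iron values quoted in the walkthrough.
# Foods 1, 4 and 5 use assumed placeholder values (mean 1.72).
foods <- data.frame(
  Food        = 1:6,
  Iron        = c(0.02, 12.8, 1.44, 0.64, 4.50, 2.29),
  HighProtein = c(0,    1,    1,    0,    0,    1)
)

# Step 1: group the Iron values by HighProtein.
groups <- split(foods$Iron, foods$HighProtein)
print(groups)

# Step 2: compute the mean of each group.
manual_means <- sapply(groups, mean)
print(manual_means)

# tapply performs both steps in a single call.
tapply_means <- tapply(foods$Iron, foods$HighProtein, mean)
print(tapply_means)
```

Both approaches give the same group means; tapply is simply the more compact way to write it.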
77 00:03:53,980 --> 00:03:57,550 So we're going to type tapply, and then the first argument 78 00:03:57,550 --> 00:04:00,820 is the Iron vector which we are trying to analyze. 79 00:04:00,820 --> 00:04:04,470 And we are sorting it according to the HighProtein vector, 80 00:04:04,470 --> 00:04:06,710 so this is our second argument. 81 00:04:06,710 --> 00:04:09,050 And then the mean statistic is used, 82 00:04:09,050 --> 00:04:10,700 because we're trying to calculate 83 00:04:10,700 --> 00:04:13,250 the average level of Iron. 84 00:04:13,250 --> 00:04:17,220 And do not forget to remove the non available entries by typing 85 00:04:17,220 --> 00:04:17,720 na.rm=TRUE. 86 00:04:20,820 --> 00:04:22,720 And here's the result. 87 00:04:22,720 --> 00:04:25,260 Foods with low protein content have 88 00:04:25,260 --> 00:04:28,640 on average 2.55 milligrams of iron 89 00:04:28,640 --> 00:04:30,860 and foods with high protein content 90 00:04:30,860 --> 00:04:34,690 have on average 3.2 milligrams of iron. 91 00:04:34,690 --> 00:04:37,740 Now how about the maximum level of vitamin C 92 00:04:37,740 --> 00:04:40,350 in foods with high and low carbs? 93 00:04:40,350 --> 00:04:42,580 Again, we're going to use the tapply function, so 94 00:04:42,580 --> 00:04:45,860 use the Up arrow to go back to the previous command, 95 00:04:45,860 --> 00:04:49,360 but now we're trying to analyze the VitaminC vector. 96 00:04:49,360 --> 00:04:53,630 So this is our first argument, and we are sorting it 97 00:04:53,630 --> 00:04:57,020 according to high and low carbs, so the second argument 98 00:04:57,020 --> 00:04:59,000 is the vector HighCarbs. 99 00:04:59,000 --> 00:05:00,990 And instead of the mean, we're applying here 100 00:05:00,990 --> 00:05:05,390 the max statistic, and we obtain the following. 101 00:05:05,390 --> 00:05:09,980 The maximum vitamin C level, which is 2,400 milligrams, 102 00:05:09,980 --> 00:05:14,430 is actually present in a food that is high in carbs.
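Swapping the mean for the max only changes the third argument. The sketch below uses made-up vectors (vitamin_c and high_carbs are hypothetical names and values, not the video's data), and keeps na.rm = TRUE on the assumption that the edited command still carries it from the previous call.

```r
# Hypothetical values for illustration only; NA marks a missing measurement.
vitamin_c  <- c(0, 12.5, NA, 3.1, 60.0, 8.4)
high_carbs <- c(0, 0,    0,  1,   1,    1)

# Same structure as the mean call, with max as the third argument.
# na.rm = TRUE is passed on to max() so the NA is ignored.
max_vitc <- tapply(vitamin_c, high_carbs, max, na.rm = TRUE)
print(max_vitc)
```

The result has one entry per group, so comparing the two entries shows which group contains the overall maximum.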
103 00:05:14,430 --> 00:05:16,970 Well, is it true that foods that are high in carbs 104 00:05:16,970 --> 00:05:19,760 have generally high vitamin C content? 105 00:05:19,760 --> 00:05:21,860 Well, to see if this is the case, now 106 00:05:21,860 --> 00:05:24,530 we're going to go back to our tapply function, 107 00:05:24,530 --> 00:05:26,640 and instead of the max statistic we're 108 00:05:26,640 --> 00:05:29,090 going to use the summary function. 109 00:05:29,090 --> 00:05:32,350 We obtain the following two sets of information. 110 00:05:32,350 --> 00:05:36,909 The first set corresponds to the foods with low carb content, 111 00:05:36,909 --> 00:05:39,380 and the second set of information corresponds 112 00:05:39,380 --> 00:05:43,310 to foods with higher carb content than average. 113 00:05:43,310 --> 00:05:45,890 Now the statistical information that the summary function 114 00:05:45,890 --> 00:05:49,580 gives us pertains to the vitamin C levels. 115 00:05:49,580 --> 00:05:54,010 This means that we have on average 6.36 milligrams 116 00:05:54,010 --> 00:05:57,600 of vitamin C in foods with low carb content, 117 00:05:57,600 --> 00:06:02,230 and on average 16.31 milligrams of vitamin C in foods 118 00:06:02,230 --> 00:06:04,240 with high carb content. 119 00:06:04,240 --> 00:06:07,170 So, it does seem like a general trend. 120 00:06:07,170 --> 00:06:11,350 Foods with high carb content are on average richer in vitamin C 121 00:06:11,350 --> 00:06:14,370 compared to foods with low carb content. 122 00:06:14,370 --> 00:06:17,410 Now we reach the end of our first recitation. 123 00:06:17,410 --> 00:06:20,850 I hope this was a good exercise to familiarize yourself better 124 00:06:20,850 --> 00:06:24,470 with R, and learn some fun facts about nutrition. 125 00:06:24,470 --> 00:06:26,310 Stay healthy.
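For reference, the summary step from this video could be sketched as follows, again on made-up vectors (vitamin_c and high_carbs are hypothetical names and values). With summary as the third argument, tapply returns one block of statistics per group: minimum, quartiles, median, mean, and maximum.

```r
# Hypothetical values for illustration only.
vitamin_c  <- c(0, 2, 4, 6, 10, 50)
high_carbs <- c(0, 0, 0, 1, 1,  1)

# One summary per group: the first for low carbs (0),
# the second for high carbs (1).
vitc_summary <- tapply(vitamin_c, high_carbs, summary)
print(vitc_summary)

# Pull out just the group means to compare them directly.
group_means <- sapply(vitc_summary, function(s) as.numeric(s["Mean"]))
print(group_means)
```

A single large value can pull a group's mean up considerably, so it can be worth looking at the median in the same output as well.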