1
00:00:04,500 --> 00:00:06,390
Remember that in
our previous video,

2
00:00:06,390 --> 00:00:11,000
we created four new variables,
HighSodium, HighFat, HighCarbs,

3
00:00:11,000 --> 00:00:12,850
and HighProtein.

4
00:00:12,850 --> 00:00:15,160
Now in this video, we
will try to understand

5
00:00:15,160 --> 00:00:17,850
our data and the relationships
between our variables

6
00:00:17,850 --> 00:00:21,770
better, using the table
and tapply functions.

7
00:00:21,770 --> 00:00:24,620
To figure out how many foods
have a higher sodium level

8
00:00:24,620 --> 00:00:28,010
than average, we want to look
at the HighSodium variable

9
00:00:28,010 --> 00:00:31,390
and count the foods
that have the value 1.

10
00:00:31,390 --> 00:00:34,020
We can do this using
the table function,

11
00:00:34,020 --> 00:00:38,960
and give it as an input
the HighSodium vector.
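
A minimal sketch of this step, assuming a small made-up indicator vector rather than the course data set (the values are illustrative only, so the counts differ from those in the video):

```r
# Hypothetical 0/1 indicator: 1 = above-average sodium, 0 = otherwise
HighSodium <- c(0, 1, 0, 0, 1, 0)

# table() counts how many times each distinct value occurs
counts <- table(HighSodium)
print(counts)

# The number of foods flagged 1 can be pulled out by name
n_high <- counts[["1"]]
print(n_high)
```

The names of the result are the distinct values of the vector, which is why the count of high-sodium foods is looked up under "1".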

12
00:00:38,960 --> 00:00:42,770
Now pressing Enter, we obtain
the following information.

13
00:00:42,770 --> 00:00:44,520
Most of the foods
in our data set,

14
00:00:44,520 --> 00:00:50,000
and precisely 4,884 of them,
have lower sodium than average,

15
00:00:50,000 --> 00:00:53,570
and we have 2,090 foods
that have higher sodium

16
00:00:53,570 --> 00:00:55,570
content than average.

17
00:00:55,570 --> 00:00:57,150
Now let's see how
many foods have

18
00:00:57,150 --> 00:00:59,770
both high sodium and high fat.

19
00:00:59,770 --> 00:01:02,570
Well, to do this we can
also use the table function,

20
00:01:02,570 --> 00:01:05,319
but instead of giving
it one input, now

21
00:01:05,319 --> 00:01:06,930
we can give it two inputs.

22
00:01:06,930 --> 00:01:09,760
So let's go back
using the Up arrow,

23
00:01:09,760 --> 00:01:12,690
and now have the first
input being the HighSodium

24
00:01:12,690 --> 00:01:18,789
vector and the second input
being the HighFat vector.

25
00:01:18,789 --> 00:01:21,289
And we obtain the
following table.

26
00:01:21,289 --> 00:01:24,880
The rows belong to the first
input, which is HighSodium,

27
00:01:24,880 --> 00:01:27,510
and the columns correspond
to the second input,

28
00:01:27,510 --> 00:01:29,200
which is HighFat.
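
A minimal sketch of the two-input call, assuming two made-up indicator vectors for six foods (illustrative values only, not the course data):

```r
# Hypothetical 0/1 indicators for six foods
HighSodium <- c(0, 0, 1, 1, 0, 1)
HighFat    <- c(0, 1, 0, 1, 0, 1)

# First argument -> rows, second argument -> columns
cross <- table(HighSodium, HighFat)
print(cross)

# Foods that are high in both: row "1", column "1"
both_high <- cross["1", "1"]

# Summing each row recovers the one-input counts for HighSodium
sodium_counts <- rowSums(cross)
print(sodium_counts)
```

The row sums are a quick consistency check: each one should match the count that table gives for HighSodium on its own.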

29
00:01:29,200 --> 00:01:31,180
So from the table
we see that we have

30
00:01:31,180 --> 00:01:36,360
3,529 foods with low
sodium and low fat,

31
00:01:36,360 --> 00:01:43,560
1,355 foods with low
sodium and high fat, 1,378

32
00:01:43,560 --> 00:01:46,650
foods with high
sodium but low fat,

33
00:01:46,650 --> 00:01:52,750
and finally 712 foods with
both high sodium and high fat.

34
00:01:52,750 --> 00:01:56,370
Now, what if we want to compute
the average amount of iron

35
00:01:56,370 --> 00:01:59,110
sorted by high and low protein?

36
00:01:59,110 --> 00:02:02,470
Well, to do this we can
use the tapply function.

37
00:02:02,470 --> 00:02:04,670
Let us have a little
refresher on how

38
00:02:04,670 --> 00:02:07,730
the tapply function works.

39
00:02:07,730 --> 00:02:10,990
The tapply function takes
three arguments, and groups

40
00:02:10,990 --> 00:02:13,870
the first argument according
to the second argument,

41
00:02:13,870 --> 00:02:16,230
and then applies argument three.

42
00:02:16,230 --> 00:02:18,980
For instance, we wanted to
compute the average amount

43
00:02:18,980 --> 00:02:22,300
of iron sorted by
high and low protein.

44
00:02:22,300 --> 00:02:24,500
In this case, the
first argument is

45
00:02:24,500 --> 00:02:26,130
whatever we are
trying to analyze,

46
00:02:26,130 --> 00:02:28,750
which is the Iron vector,
and we are sorting it

47
00:02:28,750 --> 00:02:31,210
according to the vector
HighProtein, which

48
00:02:31,210 --> 00:02:32,710
is our second argument.

49
00:02:32,710 --> 00:02:34,770
And finally we apply
the mean function

50
00:02:34,770 --> 00:02:37,360
in R on the sorted Iron values.

51
00:02:37,360 --> 00:02:38,960
And we should not
forget to remove

52
00:02:38,960 --> 00:02:41,270
the non-available entries.

53
00:02:41,270 --> 00:02:44,090
So what does tapply do exactly?

54
00:02:44,090 --> 00:02:45,990
Suppose that we have
the following data

55
00:02:45,990 --> 00:02:48,530
frame with the foods
one through six,

56
00:02:48,530 --> 00:02:51,260
along with information
about their Iron levels

57
00:02:51,260 --> 00:02:55,800
and their values of HighProtein
that we just added earlier.

58
00:02:55,800 --> 00:02:58,230
The first step is
grouping the Iron data

59
00:02:58,230 --> 00:03:00,690
according to the
values of HighProtein.

60
00:03:00,690 --> 00:03:04,570
So first we group all the
foods that have HighProtein

61
00:03:04,570 --> 00:03:07,440
equal to 1, and that would
be food number two

62
00:03:07,440 --> 00:03:10,930
with 12.8 milligrams of
iron, food number three

63
00:03:10,930 --> 00:03:14,930
with 1.44 milligrams of
iron, and food number six

64
00:03:14,930 --> 00:03:18,060
with 2.29 milligrams of iron.

65
00:03:18,060 --> 00:03:21,000
Then we group the remaining
foods that have protein levels

66
00:03:21,000 --> 00:03:23,140
below average, and
this corresponds

67
00:03:23,140 --> 00:03:27,420
to food one, food
four, and food five.

68
00:03:27,420 --> 00:03:31,270
Then we compute the mean of
Iron level for each group.

69
00:03:31,270 --> 00:03:34,100
In this case, the mean of the
group with high protein levels

70
00:03:34,100 --> 00:03:38,210
is 5.51, and the mean of the
group with low protein levels

71
00:03:38,210 --> 00:03:40,120
is 1.72.

72
00:03:40,120 --> 00:03:43,540
And this is the result
of the tapply function.
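
The six-food example can be sketched in R as follows. The Iron values for foods two, three, and six are the ones quoted above; the values for foods one, four, and five are not given in the narration, so the ones below are assumptions chosen only so that their mean is 1.72:

```r
# Foods 1..6. Iron for foods 2, 3, 6 is from the example;
# Iron for foods 1, 4, 5 is made up so that the low group averages 1.72
Iron        <- c(1.00, 12.8, 1.44, 2.00, 2.16, 2.29)
HighProtein <- c(0,    1,    1,    0,    0,    1)

# Group Iron by HighProtein, then take the mean of each group
group_means <- tapply(Iron, HighProtein, mean)
print(round(group_means, 2))
```

The result has one entry per group, named "0" for low protein and "1" for high protein.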

73
00:03:43,540 --> 00:03:45,640
Now let's go back to
R and get some hands-on

74
00:03:45,640 --> 00:03:49,829
practice on how to
use the tapply function.

75
00:03:49,829 --> 00:03:51,460
So let's compute
the average amount

76
00:03:51,460 --> 00:03:53,980
of iron sorted by
protein levels.

77
00:03:53,980 --> 00:03:57,550
So we're going to type tapply,
and then the first argument

78
00:03:57,550 --> 00:04:00,820
is the Iron vector which
we are trying to analyze.

79
00:04:00,820 --> 00:04:04,470
And we are sorting it according
to the HighProtein vector,

80
00:04:04,470 --> 00:04:06,710
so this is our second argument.

81
00:04:06,710 --> 00:04:09,050
And then the mean
statistic is used,

82
00:04:09,050 --> 00:04:10,700
because we're
trying to calculate

83
00:04:10,700 --> 00:04:13,250
the average level of Iron.

84
00:04:13,250 --> 00:04:17,220
And do not forget to remove the
non-available entries by typing

85
00:04:17,220 --> 00:04:17,720
na.rm=TRUE.
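
A minimal sketch of why na.rm=TRUE matters, assuming a small made-up data set with one missing Iron value (illustrative only):

```r
# Hypothetical data: the second food has no recorded Iron value
Iron        <- c(2.0, NA, 4.0, 1.0, 3.0)
HighProtein <- c(0,   0,  1,   1,   1)

# Without na.rm, any group that contains an NA gets an NA mean
without_rm <- tapply(Iron, HighProtein, mean)
print(without_rm)

# Arguments after the function are passed on to it, so
# na.rm = TRUE reaches mean() and the NA is dropped first
with_rm <- tapply(Iron, HighProtein, mean, na.rm = TRUE)
print(with_rm)
```

In the first call the low-protein group comes back as NA; in the second it is the mean of the one remaining value.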

86
00:04:20,820 --> 00:04:22,720
And here's the result.

87
00:04:22,720 --> 00:04:25,260
Foods with low
protein content have

88
00:04:25,260 --> 00:04:28,640
on average 2.55
milligrams of iron

89
00:04:28,640 --> 00:04:30,860
and foods with high
protein content

90
00:04:30,860 --> 00:04:34,690
have on average 3.2
milligrams of iron.

91
00:04:34,690 --> 00:04:37,740
Now how about the maximum
level of vitamin C

92
00:04:37,740 --> 00:04:40,350
in foods with high
and low carbs?

93
00:04:40,350 --> 00:04:42,580
Again, we're going to use
the tapply function, so

94
00:04:42,580 --> 00:04:45,860
use the Up arrow to go back
to the previous command,

95
00:04:45,860 --> 00:04:49,360
but now we're trying to
analyze the VitaminC vector.

96
00:04:49,360 --> 00:04:53,630
So this is our first argument,
and we are sorting it

97
00:04:53,630 --> 00:04:57,020
according to high and low
carbs, so the second argument

98
00:04:57,020 --> 00:04:59,000
is the vector HighCarbs.

99
00:04:59,000 --> 00:05:00,990
And instead of the mean,
we're applying here

100
00:05:00,990 --> 00:05:05,390
the max statistic, and
we obtain the following.

101
00:05:05,390 --> 00:05:09,980
The maximum vitamin C level,
which is 2,400 milligrams,

102
00:05:09,980 --> 00:05:14,430
is actually present in a
food that is high in carbs.

103
00:05:14,430 --> 00:05:16,970
Well, is it true that foods
that are high in carbs

104
00:05:16,970 --> 00:05:19,760
generally have high
vitamin C content?

105
00:05:19,760 --> 00:05:21,860
Well, to see if this
is the case, now

106
00:05:21,860 --> 00:05:24,530
we're going to go back
to our tapply function,

107
00:05:24,530 --> 00:05:26,640
and instead of the
max statistic we're

108
00:05:26,640 --> 00:05:29,090
going to use the
summary function.

109
00:05:29,090 --> 00:05:32,350
We obtain the following
two sets of information.

110
00:05:32,350 --> 00:05:36,909
The first set corresponds to
the foods with low carb content,

111
00:05:36,909 --> 00:05:39,380
and the second set of
information corresponds

112
00:05:39,380 --> 00:05:43,310
to foods with higher carb
content than average.

113
00:05:43,310 --> 00:05:45,890
Now the statistical information
that the summary function

114
00:05:45,890 --> 00:05:49,580
gives us pertains to
the vitamin C levels.

115
00:05:49,580 --> 00:05:54,010
This means that we have
on average 6.36 milligrams

116
00:05:54,010 --> 00:05:57,600
of vitamin C in foods
with low carb content,

117
00:05:57,600 --> 00:06:02,230
and on average 16.31 milligrams
of vitamin C in foods

118
00:06:02,230 --> 00:06:04,240
with high carb content.

119
00:06:04,240 --> 00:06:07,170
So, it does seem
like a general trend.

120
00:06:07,170 --> 00:06:11,350
Foods with high carb content are
on average richer in vitamin C

121
00:06:11,350 --> 00:06:14,370
compared to foods
with low carb content.

122
00:06:14,370 --> 00:06:17,410
Now we reach the end of
our first recitation.

123
00:06:17,410 --> 00:06:20,850
I hope this was a good exercise
to familiarize yourself better

124
00:06:20,850 --> 00:06:24,470
with R, and learn some
fun facts about nutrition.

125
00:06:24,470 --> 00:06:26,310
Stay healthy.