1
00:00:04,500 --> 00:00:08,200
In this video, we will do
some basic data analysis.

2
00:00:08,200 --> 00:00:10,260
All that I've done
since our previous video

3
00:00:10,260 --> 00:00:13,840
is clear the console, but R
still has all the information

4
00:00:13,840 --> 00:00:14,980
stored.

5
00:00:14,980 --> 00:00:17,860
In fact, if we use the
Up Arrow on our keyboard,

6
00:00:17,860 --> 00:00:20,120
we retrieve the last
command we typed,

7
00:00:20,120 --> 00:00:23,740
which was the summary
of the USDA data frame.

8
00:00:23,740 --> 00:00:26,670
And as a quick reminder, at
the end of our last video,

9
00:00:26,670 --> 00:00:29,710
we realized that the
maximum level of Sodium

10
00:00:29,710 --> 00:00:34,890
was 38,758 milligrams,
which is very high.

11
00:00:34,890 --> 00:00:38,210
We would like to see which
food this corresponds to.

12
00:00:38,210 --> 00:00:41,060
Well, to check the values of
sodium levels in the foods

13
00:00:41,060 --> 00:00:43,790
within the data set, we
can type USDA$Sodium.

14
00:00:46,570 --> 00:00:49,220
This gives us a series
of numbers associated

15
00:00:49,220 --> 00:00:52,960
with the amount of sodium in
all the foods in our data set.

16
00:00:52,960 --> 00:00:55,920
Remember from the lecture
that this is called a vector,

17
00:00:55,920 --> 00:00:59,160
and it is associated
with the variable Sodium.

18
00:00:59,160 --> 00:01:01,910
For instance, the sodium
level of the last food

19
00:01:01,910 --> 00:01:05,960
in our data set
is 68 milligrams.

20
00:01:05,960 --> 00:01:09,289
Now, to find which food has
the highest level of sodium,

21
00:01:09,289 --> 00:01:13,230
we can simply use the
function which.max, which

22
00:01:13,230 --> 00:01:17,120
takes as an input
the Sodium vector,

23
00:01:17,120 --> 00:01:20,560
and it gives us the index of
the food with the highest sodium

24
00:01:20,560 --> 00:01:21,600
level.

25
00:01:21,600 --> 00:01:25,350
In this case, the 265th
food in our data set

26
00:01:25,350 --> 00:01:28,020
has the maximum sodium content.

27
00:01:28,020 --> 00:01:30,430
Now to know which
food that is, we

28
00:01:30,430 --> 00:01:32,970
need to take a look at
the vector corresponding

29
00:01:32,970 --> 00:01:35,940
to the text description
of the foods.

30
00:01:35,940 --> 00:01:39,390
However, I cannot remember the
exact name of that variable

31
00:01:39,390 --> 00:01:42,340
on top of my head to be
able to call it in R.

32
00:01:42,340 --> 00:01:45,270
But we can use the
function names,

33
00:01:45,270 --> 00:01:49,560
which takes as an input the
USDA data frame and gives us

34
00:01:49,560 --> 00:01:53,900
the exact names of all the
variables as stored in the USDA

35
00:01:53,900 --> 00:01:55,210
data frame.

36
00:01:55,210 --> 00:01:58,590
And now we know that the name
of the variable we're looking at

37
00:01:58,590 --> 00:02:00,490
is Description.

38
00:02:00,490 --> 00:02:04,340
So now, to get the
name of the 265th food,

39
00:02:04,340 --> 00:02:08,740
we simply need to ask R
to pick the 265th element

40
00:02:08,740 --> 00:02:10,600
from the vector Description.

41
00:02:10,600 --> 00:02:15,350
So, using our dollar notation
to call the Description vector

42
00:02:15,350 --> 00:02:19,780
and then the square brackets
around the index 265,

43
00:02:19,780 --> 00:02:24,360
and the winner is table salt!

44
00:02:24,360 --> 00:02:30,960
Well, having 38,758 milligrams
of sodium in 100 grams of table

45
00:02:30,960 --> 00:02:35,140
salt sort of makes
sense, but none of us

46
00:02:35,140 --> 00:02:37,829
would eat 100 grams of
salt in one sitting.

47
00:02:37,829 --> 00:02:40,790
So it might be more
interesting to find

48
00:02:40,790 --> 00:02:43,800
out which foods, for
instance, contain more than,

49
00:02:43,800 --> 00:02:46,920
say, 10,000
milligrams of sodium.

50
00:02:46,920 --> 00:02:49,720
To do so, we can create
a new data frame,

51
00:02:49,720 --> 00:02:52,520
and let's call it HighSodium.

52
00:02:52,520 --> 00:02:56,200
And this is going to be a
subset of our original data

53
00:02:56,200 --> 00:02:59,300
frame, USDA, with
only the foods that

54
00:02:59,300 --> 00:03:03,920
have sodium content
that exceeds 10,000.

55
00:03:03,920 --> 00:03:06,770
And now we created
this new data frame,

56
00:03:06,770 --> 00:03:10,190
and to see how many foods there
exist in this new data frame,

57
00:03:10,190 --> 00:03:13,560
we need to see how many
observations this data

58
00:03:13,560 --> 00:03:14,610
frame has.

59
00:03:14,610 --> 00:03:18,510
And this can be done by using
the function nrow, which

60
00:03:18,510 --> 00:03:24,190
computes the number of rows
in the data frame HighSodium.

61
00:03:24,190 --> 00:03:27,150
And then we obtain 10
foods with sodium levels

62
00:03:27,150 --> 00:03:29,480
above 10,000 milligrams.

63
00:03:29,480 --> 00:03:33,000
Since there are not many, we can
output the names of these foods

64
00:03:33,000 --> 00:03:35,850
by looking at their
Description vector.

65
00:03:35,850 --> 00:03:37,810
But this time, the
Description vector

66
00:03:37,810 --> 00:03:41,110
is not associated with
the USDA data frame

67
00:03:41,110 --> 00:03:44,360
but with the
HighSodium data frame.

68
00:03:44,360 --> 00:03:50,000
So HighSodium$Description,
and now pressing Enter,

69
00:03:50,000 --> 00:03:52,620
we obtain the names
of these 10 foods.

70
00:03:52,620 --> 00:03:54,990
So definitely table
salt is one of them.

71
00:03:54,990 --> 00:04:00,140
We also have dry soup,
gravy, some leavening agents,

72
00:04:00,140 --> 00:04:04,510
but I thought caviar is well
known to be among the top 10

73
00:04:04,510 --> 00:04:06,910
foods with highest
levels of sodium.

74
00:04:06,910 --> 00:04:09,660
But it doesn't
appear in this list.

75
00:04:09,660 --> 00:04:13,390
Let's find how much sodium
it has in 100 grams.

76
00:04:13,390 --> 00:04:16,360
Now, obviously, this task
would have been very easy

77
00:04:16,360 --> 00:04:19,269
if we knew the index of
caviar in our data set,

78
00:04:19,269 --> 00:04:23,070
and we simply feed it
into the vector Sodium.

79
00:04:23,070 --> 00:04:25,810
However, we need to get
the index of caviar,

80
00:04:25,810 --> 00:04:28,260
and to do this, we
need to track down

81
00:04:28,260 --> 00:04:31,610
the word caviar in
the text description.

82
00:04:31,610 --> 00:04:34,310
To do this, we can
use the match function

83
00:04:34,310 --> 00:04:40,020
and ask R to dig the word caviar
in the description vector.

84
00:04:40,020 --> 00:04:40,860
So USDA$Description.

85
00:04:43,670 --> 00:04:47,030
And now pressing Enter,
we obtain that caviar is

86
00:04:47,030 --> 00:04:51,360
the 4,154th food
in our data set.

87
00:04:51,360 --> 00:04:55,840
So now finding the sodium level
of caviar is a piece of cake.

88
00:04:55,840 --> 00:05:01,280
We just type USDA$Sodium and,
using the square brackets with

89
00:05:01,280 --> 00:05:06,950
the index 4,154, ask R to pick
the sodium level of caviar

90
00:05:06,950 --> 00:05:07,450
for us.

91
00:05:07,450 --> 00:05:11,220
And this is 1,500 milligrams.

92
00:05:11,220 --> 00:05:15,130
Now, to find a level of sodium
in caviar, we used two steps,

93
00:05:15,130 --> 00:05:18,450
but we can actually lump
them all in one single step.

94
00:05:18,450 --> 00:05:21,150
So let's use the Up
Arrow twice to go back

95
00:05:21,150 --> 00:05:24,240
to the match function, and we
know that this match function

96
00:05:24,240 --> 00:05:27,290
gives us an index that then
should be fed into the Sodium

97
00:05:27,290 --> 00:05:29,010
vector using square brackets.

98
00:05:29,010 --> 00:05:32,190
So let's enclose it
in square brackets,

99
00:05:32,190 --> 00:05:34,800
and then at the beginning
we're going to just write

100
00:05:34,800 --> 00:05:37,530
USDA$Sodium.

101
00:05:37,530 --> 00:05:39,140
And, again, of
course, this gives us

102
00:05:39,140 --> 00:05:45,080
1,500 milligrams of sodium
in 100 grams of caviar.

103
00:05:45,080 --> 00:05:48,020
Now, the value 1,500
milligrams seems

104
00:05:48,020 --> 00:05:52,570
to be very small compared to
10,000 milligrams or 38,000

105
00:05:52,570 --> 00:05:55,890
milligrams, which are the values
that we worked with so far

106
00:05:55,890 --> 00:05:58,570
with respect to sodium levels.

107
00:05:58,570 --> 00:06:01,680
But this doesn't seem
to be a fair comparison.

108
00:06:01,680 --> 00:06:04,870
Maybe the best way to figure
out how big this value is,

109
00:06:04,870 --> 00:06:08,240
is by comparing it to the mean
and the standard deviation

110
00:06:08,240 --> 00:06:11,640
of the sodium levels
across the data set.

111
00:06:11,640 --> 00:06:13,830
To find the mean, we know
that this information

112
00:06:13,830 --> 00:06:16,650
is given to us using
the summary function.

113
00:06:16,650 --> 00:06:19,420
So let's use the summary
function, and this time,

114
00:06:19,420 --> 00:06:21,740
give it the input
the Sodium vector

115
00:06:21,740 --> 00:06:25,070
instead of the whole
USDA data frame.

116
00:06:25,070 --> 00:06:31,450
And we can see that the mean
sodium value is 322 milligrams.

117
00:06:31,450 --> 00:06:33,310
However, the summary
function does not

118
00:06:33,310 --> 00:06:35,600
give us standard
deviation information,

119
00:06:35,600 --> 00:06:38,560
but we can do this using
the function sd, which

120
00:06:38,560 --> 00:06:40,550
stands for standard deviation.

121
00:06:40,550 --> 00:06:44,580
Give it as an input
the Sodium vector,

122
00:06:44,580 --> 00:06:49,310
and, oh, we obtain
non-available.

123
00:06:49,310 --> 00:06:51,940
Well we got NA because
we forgot to remove

124
00:06:51,940 --> 00:06:54,220
the non-available
entries before computing

125
00:06:54,220 --> 00:06:55,420
our statistical measure.

126
00:06:55,420 --> 00:06:59,260
So let's use the Up Arrow to go
back to the standard deviation

127
00:06:59,260 --> 00:07:03,120
function, and now we have to
explicitly tell R to remove

128
00:07:03,120 --> 00:07:05,830
these non-available entries
by typing na.rm=TRUE.

129
00:07:09,480 --> 00:07:13,840
And now the standard
deviation is 1,045 milligrams.

130
00:07:13,840 --> 00:07:18,020
Note that, if we sum the mean
and the standard deviation,

131
00:07:18,020 --> 00:07:21,020
we obtain around 1,400
milligrams, which

132
00:07:21,020 --> 00:07:24,150
is still smaller than
the amount of sodium

133
00:07:24,150 --> 00:07:26,370
in 100 grams of caviar.

134
00:07:26,370 --> 00:07:30,110
Well, this means that caviar
is pretty rich in sodium

135
00:07:30,110 --> 00:07:33,470
compared to most of the
foods in our data set.

136
00:07:33,470 --> 00:07:37,060
Now that we know how to do a
basic analysis of our data,

137
00:07:37,060 --> 00:07:39,420
let's look at the plotting
functionality in R

138
00:07:39,420 --> 00:07:41,820
in our next video.