1
00:00:04,490 --> 00:00:06,220
In this video, we
will see how we

2
00:00:06,220 --> 00:00:09,490
can add a new variable
to our data frame.

3
00:00:09,490 --> 00:00:13,070
Suppose that we want to add
a variable to our USDA data

4
00:00:13,070 --> 00:00:16,490
frame that takes a value
1 if the food has higher

5
00:00:16,490 --> 00:00:19,610
sodium than average,
and 0 if the food has

6
00:00:19,610 --> 00:00:21,710
lower sodium than average.

7
00:00:21,710 --> 00:00:24,090
Let's do this step by step.

8
00:00:24,090 --> 00:00:26,140
To check if the first
food in the dataset

9
00:00:26,140 --> 00:00:29,280
has a higher amount of sodium
compared to the average,

10
00:00:29,280 --> 00:00:34,270
we can simply ask R to dig up
the first value in the Sodium

11
00:00:34,270 --> 00:00:37,410
vector, using the square
brackets and the index 1.

12
00:00:37,410 --> 00:00:40,850
And then compare it using
the greater-than sign

13
00:00:40,850 --> 00:00:43,980
to the mean of
the Sodium vector,

14
00:00:43,980 --> 00:00:49,770
and then do not forget to remove
the missing (NA) entries.

15
00:00:49,770 --> 00:00:51,670
And we obtain TRUE.
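
The comparison described above could be sketched in R roughly as follows. The small data frame here is a made-up stand-in for the USDA data (its values and food names are assumptions, not the real dataset), and it has only five rows, so the second food is checked instead of the 50th.

```r
# Toy stand-in for the USDA data frame; the values are invented for illustration.
USDA <- data.frame(
  Description = c("Food A", "Food B", "Food C", "Food D", "Food E"),
  Sodium      = c(900, 1, 1200, 5, NA)
)

# Average sodium, ignoring the missing (NA) entry.
avg_sodium <- mean(USDA$Sodium, na.rm = TRUE)

# Square brackets pick out a single entry of the Sodium vector.
first_is_high  <- USDA$Sodium[1] > avg_sodium
second_is_high <- USDA$Sodium[2] > avg_sodium

print(first_is_high)
print(second_is_high)
```

Without na.rm = TRUE, the mean of a vector containing NA is itself NA, and the comparison would return NA instead of TRUE or FALSE.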

16
00:00:51,670 --> 00:00:53,300
How about the 50th food?

17
00:00:53,300 --> 00:00:55,580
Well, let's go back
using the Up arrow,

18
00:00:55,580 --> 00:01:00,950
and simply change the index 1
to 50, and now we get FALSE.

19
00:01:00,950 --> 00:01:03,320
This means that the first
food has higher sodium

20
00:01:03,320 --> 00:01:05,890
content than average,
and the 50th food

21
00:01:05,890 --> 00:01:09,090
has lower sodium
content than average.

22
00:01:09,090 --> 00:01:10,560
Now, we can write
the same command,

23
00:01:10,560 --> 00:01:12,860
but on the whole Sodium vector.

24
00:01:12,860 --> 00:01:16,060
Let's use the Up arrow, and
delete the square brackets

25
00:01:16,060 --> 00:01:18,010
with the index 50.

26
00:01:18,010 --> 00:01:20,940
But we know we have 7,000
foods, and we really

27
00:01:20,940 --> 00:01:23,410
don't want to output
7,000 values right now.

28
00:01:23,410 --> 00:01:27,770
So how about instead, we just
save the output to a vector,

29
00:01:27,770 --> 00:01:30,370
and we're going to
call it HighSodium.

30
00:01:30,370 --> 00:01:33,460
And now let's look at the
structure of the HighSodium

31
00:01:33,460 --> 00:01:35,310
vector.

32
00:01:35,310 --> 00:01:38,080
And then we see that the
HighSodium vector indeed

33
00:01:38,080 --> 00:01:41,020
has all these values--
TRUE and FALSE--

34
00:01:41,020 --> 00:01:43,050
which are called logicals.

35
00:01:43,050 --> 00:01:49,080
So basically the type of the
HighSodium vector is logical.
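
As a rough sketch of this step, assuming the same kind of made-up stand-in for the USDA data frame (the values are not from the real dataset), dropping the index compares every entry to the mean at once:

```r
# Toy stand-in for the USDA data frame; the values are invented for illustration.
USDA <- data.frame(Sodium = c(900, 1, 1200, 5, NA))

# No square brackets: the comparison runs over the whole Sodium vector.
HighSodium <- USDA$Sodium > mean(USDA$Sodium, na.rm = TRUE)

# str() gives a compact summary instead of printing every value.
str(HighSodium)
```

A food with a missing sodium value gets NA in HighSodium, since R cannot tell whether it is above or below the average.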

36
00:01:49,080 --> 00:01:52,920
But remember, we said we
wanted the values 1 and 0.

37
00:01:52,920 --> 00:01:55,590
So instead of TRUE, we want 1.

38
00:01:55,590 --> 00:01:58,500
And instead of FALSE,
we want a value of 0.

39
00:01:58,500 --> 00:02:00,900
Well, to do this, we
need to change the data

40
00:02:00,900 --> 00:02:04,370
type of HighSodium
to numeric, and we

41
00:02:04,370 --> 00:02:07,300
can do this using the
as.numeric function.

42
00:02:07,300 --> 00:02:10,259
So let's use the Up
arrow twice, and then

43
00:02:10,259 --> 00:02:14,250
enclose this logical expression
by the as.numeric function.

44
00:02:14,250 --> 00:02:20,530
So as.numeric, and now look at
the structure of HighSodium,

45
00:02:20,530 --> 00:02:23,860
and now we see that we turned
it into a numerical vector with

46
00:02:23,860 --> 00:02:26,700
the values 0 and 1.
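
The conversion could look something like this, again on a made-up stand-in for the USDA data frame (the values are assumptions for illustration only):

```r
# Toy stand-in for the USDA data frame; the values are invented for illustration.
USDA <- data.frame(Sodium = c(900, 1, 1200, 5, NA))

# Wrapping the logical expression in as.numeric() maps TRUE to 1 and FALSE to 0.
HighSodium <- as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm = TRUE))

str(HighSodium)
```

Missing values stay missing: as.numeric() leaves an NA as NA and does not turn it into 0.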

47
00:02:26,700 --> 00:02:28,690
Now, this vector,
HighSodium, is not

48
00:02:28,690 --> 00:02:31,760
associated with the
USDA data frame.

49
00:02:31,760 --> 00:02:35,940
How can we add a variable,
HighSodium, to our data frame?

50
00:02:35,940 --> 00:02:39,020
Well, we simply need to
use the dollar notation.

51
00:02:39,020 --> 00:02:43,010
So let's go back
twice to the command

52
00:02:43,010 --> 00:02:45,530
where we created the
HighSodium vector,

53
00:02:45,530 --> 00:02:47,900
and now, instead of
simply calling

54
00:02:47,900 --> 00:02:51,360
it HighSodium, we associate
it with the USDA data

55
00:02:51,360 --> 00:02:53,880
frame using the dollar notation.

56
00:02:53,880 --> 00:02:56,780
Now, pressing Enter and then
checking the structure

57
00:02:56,780 --> 00:03:00,480
of the USDA data frame,
we see that we just added

58
00:03:00,480 --> 00:03:03,700
the HighSodium variable
that was not present before,

59
00:03:03,700 --> 00:03:09,230
and it's a numerical variable
with the values 1 and 0.
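
A minimal sketch of adding the variable with the dollar notation, using a made-up stand-in for the USDA data frame (the rows and values are assumptions, not the real data):

```r
# Toy stand-in for the USDA data frame; the values are invented for illustration.
USDA <- data.frame(
  Description = c("Food A", "Food B", "Food C", "Food D", "Food E"),
  Sodium      = c(900, 1, 1200, 5, NA)
)

# Assigning to USDA$HighSodium creates the column if it does not exist yet.
USDA$HighSodium <- as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm = TRUE))

# The structure now lists HighSodium alongside the original variables.
str(USDA)
```

Assigning to a name that already exists in the data frame would overwrite that column, so it is worth checking the name first.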

60
00:03:09,230 --> 00:03:13,040
Now we can do the same, and
add the variables HighProtein,

61
00:03:13,040 --> 00:03:16,770
HighCarbs, HighFat,
similarly to our data frame.

62
00:03:16,770 --> 00:03:19,620
Well, let's do this
quickly using the Up arrow,

63
00:03:19,620 --> 00:03:25,390
and then let's go and replace
Sodium now by Protein.

64
00:03:25,390 --> 00:03:28,870
So again, here Sodium
is replaced by Protein.

65
00:03:28,870 --> 00:03:34,200
And then we're going to call
this new variable HighProtein.

66
00:03:34,200 --> 00:03:36,980
And do the same with TotalFat.

67
00:03:36,980 --> 00:03:38,920
So instead of
Protein, we're going

68
00:03:38,920 --> 00:03:47,730
to have TotalFat, and then
replace it here again,

69
00:03:47,730 --> 00:03:53,240
and the variable name
is going to be HighFat.

70
00:03:53,240 --> 00:03:57,480
And finally,
Carbohydrates-- so here

71
00:03:57,480 --> 00:04:05,260
is the vector of Carbohydrates,
and this one, too,

72
00:04:05,260 --> 00:04:07,290
becomes the
Carbohydrates vector.

73
00:04:10,150 --> 00:04:13,860
And finally, this last
variable that we want to add

74
00:04:13,860 --> 00:04:16,610
is called HighCarbs.

75
00:04:16,610 --> 00:04:20,570
And now looking at the structure
of the USDA data frame,

76
00:04:20,570 --> 00:04:22,980
we see that we
successfully added

77
00:04:22,980 --> 00:04:27,340
these three new variables, which
are HighProtein, HighFat,

78
00:04:27,340 --> 00:04:30,270
and HighCarbs, in
addition to the HighSodium

79
00:04:30,270 --> 00:04:33,250
variable that we
added previously.
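
The three additions might be sketched as follows. The data frame is a made-up stand-in, and the column names Protein, TotalFat, and Carbohydrate are assumptions about how the USDA data frame is laid out (in particular, the carbohydrates column may be named differently in the real file).

```r
# Toy stand-in for the USDA data frame; values and column names are assumed.
USDA <- data.frame(
  Protein      = c(20, 1, 9, 2),
  TotalFat     = c(81, 0, 1, 2),
  Carbohydrate = c(0, 14, 5, 29)
)

# Same pattern as HighSodium: compare each column to its own mean,
# then convert the logical result to 1s and 0s.
USDA$HighProtein <- as.numeric(USDA$Protein > mean(USDA$Protein, na.rm = TRUE))
USDA$HighFat     <- as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm = TRUE))
USDA$HighCarbs   <- as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm = TRUE))

str(USDA)
```

When editing a recalled command, the column name appears twice (once on each side of the greater-than sign), so both occurrences need to be changed.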

80
00:04:33,250 --> 00:04:35,990
So how can we now
find relationships

81
00:04:35,990 --> 00:04:40,140
between these variables, and
also the original variables

82
00:04:40,140 --> 00:04:43,190
that we had in the
USDA data frame?

83
00:04:43,190 --> 00:04:45,630
Well, we're going to
be using the table

84
00:04:45,630 --> 00:04:49,659
and the tapply functions
in our next video.