1 00:00:04,490 --> 00:00:06,220 In this video, we will see how we 2 00:00:06,220 --> 00:00:09,490 can add a new variable to our data frame. 3 00:00:09,490 --> 00:00:13,070 Suppose that we want to add a variable to our USDA data 4 00:00:13,070 --> 00:00:16,490 frame that takes a value 1 if the food has higher 5 00:00:16,490 --> 00:00:19,610 sodium than average, and 0 if the food has 6 00:00:19,610 --> 00:00:21,710 lower sodium than average. 7 00:00:21,710 --> 00:00:24,090 Let's do this step by step. 8 00:00:24,090 --> 00:00:26,140 To check if the first food in the dataset 9 00:00:26,140 --> 00:00:29,280 has a higher amount of sodium compared to the average, 10 00:00:29,280 --> 00:00:34,270 we can simply ask R to dig up the first value in the Sodium 11 00:00:34,270 --> 00:00:37,410 vector, using the square brackets and the index 1. 12 00:00:37,410 --> 00:00:40,850 And then compare it using the greater-than sign 13 00:00:40,850 --> 00:00:43,980 to the mean of the Sodium vector, 14 00:00:43,980 --> 00:00:49,770 and then do not forget to remove the non-available entries. 15 00:00:49,770 --> 00:00:51,670 And we obtain TRUE. 16 00:00:51,670 --> 00:00:53,300 How about the 50th food? 17 00:00:53,300 --> 00:00:55,580 Well, let's go back using the Up arrow, 18 00:00:55,580 --> 00:01:00,950 and simply change the index 1 to 50, and now we get FALSE. 19 00:01:00,950 --> 00:01:03,320 This means that the first food has higher sodium 20 00:01:03,320 --> 00:01:05,890 content than average, and the 50th food 21 00:01:05,890 --> 00:01:09,090 has lower sodium content than average. 22 00:01:09,090 --> 00:01:10,560 Now, we can write the same command, 23 00:01:10,560 --> 00:01:12,860 but on all the vector Sodium. 24 00:01:12,860 --> 00:01:16,060 Let's use the Up arrow, and delete the square brackets 25 00:01:16,060 --> 00:01:18,010 with the index 50. 26 00:01:18,010 --> 00:01:20,940 But we know we have 7,000 foods, and we really 27 00:01:20,940 --> 00:01:23,410 don't want to output 7,000 values right now. 28 00:01:23,410 --> 00:01:27,770 So how about instead, we just save the output to a vector, 29 00:01:27,770 --> 00:01:30,370 and we're going to call it HighSodium. 30 00:01:30,370 --> 00:01:33,460 And now let's look at the structure of the HighSodium 31 00:01:33,460 --> 00:01:35,310 vector. 32 00:01:35,310 --> 00:01:38,080 And then we see that the HighSodium vector indeed 33 00:01:38,080 --> 00:01:41,020 has all these values-- TRUE and FALSE-- 34 00:01:41,020 --> 00:01:43,050 which are called logicals. 35 00:01:43,050 --> 00:01:49,080 So basically the type of the HighSodium vector is logical. 36 00:01:49,080 --> 00:01:52,920 But remember, we said we wanted values 1's and 0's. 37 00:01:52,920 --> 00:01:55,590 So instead of TRUE, we want 1. 38 00:01:55,590 --> 00:01:58,500 And instead of FALSE, we want a value of 0. 39 00:01:58,500 --> 00:02:00,900 Well, to do this, we need to change the data 40 00:02:00,900 --> 00:02:04,370 type of HighSodium to numeric, and we 41 00:02:04,370 --> 00:02:07,300 can do this using the as.numeric function. 42 00:02:07,300 --> 00:02:10,259 So let's use the Up arrow twice, and then 43 00:02:10,259 --> 00:02:14,250 enclose this logical expression by the as.numeric function. 44 00:02:14,250 --> 00:02:20,530 So as.numeric, and now look up the structure of HighSodium, 45 00:02:20,530 --> 00:02:23,860 and now we see that we turned it into a numerical vector with 46 00:02:23,860 --> 00:02:26,700 values 0's and 1's. 47 00:02:26,700 --> 00:02:28,690 Now, this vector, HighSodium, is not 48 00:02:28,690 --> 00:02:31,760 associated with the USDA data frame. 49 00:02:31,760 --> 00:02:35,940 How can we add a variable, HighSodium, to our data frame? 50 00:02:35,940 --> 00:02:39,020 Well, simply we need to use the dollar notation. 51 00:02:39,020 --> 00:02:43,010 So let's go back twice to the command 52 00:02:43,010 --> 00:02:45,530 where we created the HighSodium vector, 53 00:02:45,530 --> 00:02:47,900 and then simply right now, instead of just calling 54 00:02:47,900 --> 00:02:51,360 it HighSodium, we associate it with the USDA data 55 00:02:51,360 --> 00:02:53,880 frame using the dollar notation. 56 00:02:53,880 --> 00:02:56,780 Now, pressing Enter, and going and checking the structure 57 00:02:56,780 --> 00:03:00,480 of the USDA data frame, we see that we just added 58 00:03:00,480 --> 00:03:03,700 the HighSodium variable that was not present before, 59 00:03:03,700 --> 00:03:09,230 and it's a numerical variable with values 1's and 0's. 60 00:03:09,230 --> 00:03:13,040 Now we can do the same, and add the variables HighProtein, 61 00:03:13,040 --> 00:03:16,770 HighCarbs, HighFat, similarly to our data frame. 62 00:03:16,770 --> 00:03:19,620 Well, let's do this quickly using the Up arrow, 63 00:03:19,620 --> 00:03:25,390 and then let's go and replace Sodium now by Protein. 64 00:03:25,390 --> 00:03:28,870 So again, here Sodium is replaced by Protein. 65 00:03:28,870 --> 00:03:34,200 And then we're going to call this new variable HighProtein. 66 00:03:34,200 --> 00:03:36,980 And do the same with TotalFat. 67 00:03:36,980 --> 00:03:38,920 So instead of Protein, we're going 68 00:03:38,920 --> 00:03:47,730 to have TotalFat, and then replace it here again, 69 00:03:47,730 --> 00:03:53,240 and the variable name is going to be HighFat. 70 00:03:53,240 --> 00:03:57,480 And finally, Carbohydrates-- so here 71 00:03:57,480 --> 00:04:05,260 is the vector of Carbohydrates, and this is, too, 72 00:04:05,260 --> 00:04:07,290 getting the Carbohydrates vector. 73 00:04:10,150 --> 00:04:13,860 And finally, this last variable that we want to add 74 00:04:13,860 --> 00:04:16,610 is called HighCarbs. 75 00:04:16,610 --> 00:04:20,570 And now looking at the structure of the USDA data frame, 76 00:04:20,570 --> 00:04:22,980 we see that we successfully added 77 00:04:22,980 --> 00:04:27,340 these three new variables, which are high HighProtein, HighFat, 78 00:04:27,340 --> 00:04:30,270 and HighCarbs, in addition to the HighSodium 79 00:04:30,270 --> 00:04:33,250 variable that we added previously. 80 00:04:33,250 --> 00:04:35,990 So how can we now find relationships 81 00:04:35,990 --> 00:04:40,140 between these variables, and also the original variables 82 00:04:40,140 --> 00:04:43,190 that we had in the USDA data frame? 83 00:04:43,190 --> 00:04:45,630 Well, we're going to be using the table 84 00:04:45,630 --> 00:04:49,659 and the tapply functions in our next video.