1 00:00:04,490 --> 00:00:09,130 Let's go ahead and start R. The first thing you'll see in the R 2 00:00:09,130 --> 00:00:11,730 console is the version of R you are using, 3 00:00:11,730 --> 00:00:14,080 and other basic information related 4 00:00:14,080 --> 00:00:17,630 to licensing, citations, and demos. 5 00:00:17,630 --> 00:00:19,840 To clear the console, you can simply 6 00:00:19,840 --> 00:00:23,410 go to Edit and select Clear Console. 7 00:00:23,410 --> 00:00:27,870 We'll start by reading in our dataset USDA.csv, which 8 00:00:27,870 --> 00:00:32,740 contains all foods in the USDA database in 100-gram amounts. 9 00:00:32,740 --> 00:00:34,860 You should have already downloaded the dataset 10 00:00:34,860 --> 00:00:36,420 to your computer. 11 00:00:36,420 --> 00:00:38,750 To be able to read the dataset in R, 12 00:00:38,750 --> 00:00:42,170 we first need to navigate to the directory in our computer, 13 00:00:42,170 --> 00:00:46,640 where the data file, USDA.csv, is saved. 14 00:00:46,640 --> 00:00:50,160 To do so, if you are on a Mac, go to the Misc menu, 15 00:00:50,160 --> 00:00:53,050 and select Change Working Directory. 16 00:00:53,050 --> 00:00:55,990 If you are on a PC, go to the File menu, 17 00:00:55,990 --> 00:00:59,490 select Change Directory, and then navigate to the folder 18 00:00:59,490 --> 00:01:01,850 where you saved the csv file. 19 00:01:01,850 --> 00:01:03,910 Then press OK. 20 00:01:03,910 --> 00:01:05,830 Nothing has happened in R until now, 21 00:01:05,830 --> 00:01:08,510 except changing the working directory. 22 00:01:08,510 --> 00:01:11,660 To double-check that we are in the right working directory, 23 00:01:11,660 --> 00:01:16,370 we can type getwd, which stands for get working directory, 24 00:01:16,370 --> 00:01:20,630 and then R gives us the path to the folder we just selected. 25 00:01:20,630 --> 00:01:23,789 Now we should be ready to read in our dataset. 26 00:01:23,789 --> 00:01:26,740 We will use the function read.csv, 27 00:01:26,740 --> 00:01:30,750 since the dataset was given to us in a csv format. 28 00:01:30,750 --> 00:01:35,080 And let's save the output to a data frame, and call it USDA. 29 00:01:35,080 --> 00:01:39,960 And this is equal to read.csv, and this takes, as an input, 30 00:01:39,960 --> 00:01:45,110 the name of the csv file, which is USDA.csv, and don't forget 31 00:01:45,110 --> 00:01:48,560 the quotation marks around the name of the csv file. 32 00:01:48,560 --> 00:01:52,850 Pressing Enter, R now read the information from the dataset, 33 00:01:52,850 --> 00:01:57,180 and saved it to the data frame, USDA. 34 00:01:57,180 --> 00:01:59,780 Now it's time to learn about our data. 35 00:01:59,780 --> 00:02:03,700 We can use the structure function, or str in R, 36 00:02:03,700 --> 00:02:06,690 and give it the input USDA. 37 00:02:06,690 --> 00:02:09,370 This gives us the following information. 38 00:02:09,370 --> 00:02:14,100 We have 7,058 observations, or foods in our dataset, 39 00:02:14,100 --> 00:02:16,950 along with 16 different variables. 40 00:02:16,950 --> 00:02:20,160 The first variable gives a unique identification number 41 00:02:20,160 --> 00:02:24,230 for each of the foods, starting with the number 1,001. 42 00:02:24,230 --> 00:02:26,680 The second variable gives a text description 43 00:02:26,680 --> 00:02:28,410 of each of the foods. 44 00:02:28,410 --> 00:02:30,810 The third variable is the amount of calories 45 00:02:30,810 --> 00:02:33,510 in 100 grams of these foods, and it's 46 00:02:33,510 --> 00:02:36,040 given to us in kilocalories. 47 00:02:36,040 --> 00:02:39,850 Then we also have information about the protein, 48 00:02:39,850 --> 00:02:44,090 total fat, carbohydrate, saturated fat, 49 00:02:44,090 --> 00:02:47,240 and sugar levels in grams, as well as 50 00:02:47,240 --> 00:02:52,950 the sodium, cholesterol, calcium, iron, potassium, 51 00:02:52,950 --> 00:02:55,370 and vitamin C levels, in milligrams. 52 00:02:55,370 --> 00:02:59,790 And finally, the amount of vitamin E and vitamin D in what 53 00:02:59,790 --> 00:03:02,330 is known as international units, and this 54 00:03:02,330 --> 00:03:05,660 is a standard measurement for drugs and vitamins. 55 00:03:05,660 --> 00:03:08,600 Now to obtain high-level statistical information 56 00:03:08,600 --> 00:03:12,330 about our dataset, we can use the summary function, 57 00:03:12,330 --> 00:03:17,480 and give it, as an input, the USDA data frame. 58 00:03:17,480 --> 00:03:19,620 The summary function gives us information 59 00:03:19,620 --> 00:03:22,730 such as the minimum, the maximum, and the mean values 60 00:03:22,730 --> 00:03:27,920 across all 7,058 foods for each of the 16 different variables. 61 00:03:27,920 --> 00:03:31,480 For instance, the maximum amount of cholesterol 62 00:03:31,480 --> 00:03:38,090 is 3,100 milligrams, whereas the mean is only 41.55 milligrams. 63 00:03:38,090 --> 00:03:40,680 We also have information about the number 64 00:03:40,680 --> 00:03:42,600 of non-available entries. 65 00:03:42,600 --> 00:03:45,710 For instance, we have 1,910 foods 66 00:03:45,710 --> 00:03:49,210 that are missing entries for their sugar levels. 67 00:03:49,210 --> 00:03:51,260 Now, scrolling through this information, 68 00:03:51,260 --> 00:03:54,600 a startling observation is the maximum level 69 00:03:54,600 --> 00:03:59,560 of sodium, which is 38,758 milligrams. 70 00:03:59,560 --> 00:04:03,370 This number is huge, given that the daily recommended maximum 71 00:04:03,370 --> 00:04:06,530 is only 2,300 milligrams. 72 00:04:06,530 --> 00:04:11,060 Let's investigate this variable further in our next video.