1
00:00:04,490 --> 00:00:09,130
Let's go ahead and start R. The
first thing you'll see in the R

2
00:00:09,130 --> 00:00:11,730
console is the version
of R you are using,

3
00:00:11,730 --> 00:00:14,080
and other basic
information related

4
00:00:14,080 --> 00:00:17,630
to licensing,
citations, and demos.

5
00:00:17,630 --> 00:00:19,840
To clear the console,
you can simply

6
00:00:19,840 --> 00:00:23,410
go to Edit and
select Clear Console.

7
00:00:23,410 --> 00:00:27,870
We'll start by reading in
our dataset USDA.csv, which

8
00:00:27,870 --> 00:00:32,740
contains all foods in the USDA
database in 100-gram amounts.

9
00:00:32,740 --> 00:00:34,860
You should have already
downloaded the dataset

10
00:00:34,860 --> 00:00:36,420
to your computer.

11
00:00:36,420 --> 00:00:38,750
To be able to read
the dataset in R,

12
00:00:38,750 --> 00:00:42,170
we first need to navigate to
the directory on our computer

13
00:00:42,170 --> 00:00:46,640
where the data file,
USDA.csv, is saved.

14
00:00:46,640 --> 00:00:50,160
To do so, if you are on a
Mac, go to the Misc menu,

15
00:00:50,160 --> 00:00:53,050
and select Change
Working Directory.

16
00:00:53,050 --> 00:00:55,990
If you are on a PC,
go to the File menu,

17
00:00:55,990 --> 00:00:59,490
select Change Directory, and
then navigate to the folder

18
00:00:59,490 --> 00:01:01,850
where you saved the csv file.

19
00:01:01,850 --> 00:01:03,910
Then press OK.

20
00:01:03,910 --> 00:01:05,830
Nothing has happened
in R until now,

21
00:01:05,830 --> 00:01:08,510
except changing the
working directory.

22
00:01:08,510 --> 00:01:11,660
To double-check that we are in
the right working directory,

23
00:01:11,660 --> 00:01:16,370
we can type getwd, which stands
for get working directory,

24
00:01:16,370 --> 00:01:20,630
and then R gives us the path
to the folder we just selected.
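
The menu steps above also have a console equivalent. Here is a minimal sketch, assuming we would rather type the commands than use the menus. The folder is a stand-in (R's temporary directory), not the actual folder from the video, and the names data_dir, old_dir, and current_dir are made up for illustration.

```r
# Stand-in folder so the sketch runs anywhere; in practice this would be
# the folder that holds USDA.csv.
data_dir <- tempdir()

# setwd() changes the working directory and invisibly returns the old one,
# so we keep it to restore later.
old_dir <- setwd(data_dir)

# getwd() reports the directory R is now working in.
current_dir <- getwd()
print(current_dir)

# Put the working directory back where it was.
setwd(old_dir)
```

Saving the old directory is optional; it just keeps the sketch from leaving the session in a different folder.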

25
00:01:20,630 --> 00:01:23,789
Now we should be ready
to read in our dataset.

26
00:01:23,789 --> 00:01:26,740
We will use the
function read.csv,

27
00:01:26,740 --> 00:01:30,750
since the dataset was given
to us in a csv format.

28
00:01:30,750 --> 00:01:35,080
And let's save the output to a
data frame, and call it USDA.

29
00:01:35,080 --> 00:01:39,960
And this is equal to read.csv,
and this takes, as an input,

30
00:01:39,960 --> 00:01:45,110
the name of the csv file, which
is USDA.csv, and don't forget

31
00:01:45,110 --> 00:01:48,560
the quotation marks around
the name of the csv file.

32
00:01:48,560 --> 00:01:52,850
Now that we've pressed Enter, R has read
the information from the dataset,

33
00:01:52,850 --> 00:01:57,180
and saved it to the
data frame, USDA.
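
Here is a rough sketch of the read.csv step. Since the real USDA.csv may not be on your machine, it first writes a tiny stand-in file with made-up rows; the column names and values are assumptions for illustration only, not the real data. With the real file in the working directory, the call would simply be read.csv("USDA.csv").

```r
# Write a toy stand-in for USDA.csv (illustrative values, not the real data).
csv_path <- file.path(tempdir(), "USDA.csv")
writeLines(c(
  "ID,Description,Calories,Sodium",
  "1001,Food A,100,50",
  "1002,Food B,250,"
), csv_path)

# Read the file into a data frame; the file name must be a quoted string.
USDA <- read.csv(csv_path)
print(USDA)
```

The empty Sodium field in the second row comes back as NA, which is how R marks a missing entry.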

34
00:01:57,180 --> 00:01:59,780
Now it's time to
learn about our data.

35
00:01:59,780 --> 00:02:03,700
We can use the structure
function, or str in R,

36
00:02:03,700 --> 00:02:06,690
and give it the input USDA.

37
00:02:06,690 --> 00:02:09,370
This gives us the
following information.

38
00:02:09,370 --> 00:02:14,100
We have 7,058 observations,
or foods in our dataset,

39
00:02:14,100 --> 00:02:16,950
along with 16
different variables.
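
To see what str reports, here is a small sketch on a toy data frame. The three columns and their values are assumptions chosen for illustration; the real dataset has 7,058 rows and 16 variables.

```r
# Toy data frame standing in for USDA (illustrative values only).
USDA <- data.frame(
  ID = c(1001L, 1002L, 1003L),
  Description = c("Food A", "Food B", "Food C"),
  Calories = c(100, 250, 40),
  stringsAsFactors = FALSE
)

# str() prints the number of observations and variables,
# then each variable's type and its first few values.
str(USDA)

# dim() returns the same counts as a vector: rows, then columns.
USDA_dim <- dim(USDA)
print(USDA_dim)
```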

40
00:02:16,950 --> 00:02:20,160
The first variable gives a
unique identification number

41
00:02:20,160 --> 00:02:24,230
for each of the foods,
starting with the number 1,001.

42
00:02:24,230 --> 00:02:26,680
The second variable
gives a text description

43
00:02:26,680 --> 00:02:28,410
of each of the foods.

44
00:02:28,410 --> 00:02:30,810
The third variable is
the amount of calories

45
00:02:30,810 --> 00:02:33,510
in 100 grams of
these foods, and it's

46
00:02:33,510 --> 00:02:36,040
given to us in kilocalories.

47
00:02:36,040 --> 00:02:39,850
Then we also have information
about the protein,

48
00:02:39,850 --> 00:02:44,090
total fat, carbohydrate,
saturated fat,

49
00:02:44,090 --> 00:02:47,240
and sugar levels in
grams, as well as

50
00:02:47,240 --> 00:02:52,950
the sodium, cholesterol,
calcium, iron, potassium,

51
00:02:52,950 --> 00:02:55,370
and vitamin C levels,
in milligrams.

52
00:02:55,370 --> 00:02:59,790
And finally, the amount of
vitamin E and vitamin D in what

53
00:02:59,790 --> 00:03:02,330
is known as international
units, and this

54
00:03:02,330 --> 00:03:05,660
is a standard measurement
for drugs and vitamins.

55
00:03:05,660 --> 00:03:08,600
Now to obtain high-level
statistical information

56
00:03:08,600 --> 00:03:12,330
about our dataset, we can
use the summary function,

57
00:03:12,330 --> 00:03:17,480
and give it, as an input,
the USDA data frame.

58
00:03:17,480 --> 00:03:19,620
The summary function
gives us information

59
00:03:19,620 --> 00:03:22,730
such as the minimum, the
maximum, and the mean values

60
00:03:22,730 --> 00:03:27,920
across all 7,058 foods for each
of the 16 different variables.

61
00:03:27,920 --> 00:03:31,480
For instance, the maximum
amount of cholesterol

62
00:03:31,480 --> 00:03:38,090
is 3,100 milligrams, whereas the
mean is only 41.55 milligrams.

63
00:03:38,090 --> 00:03:40,680
We also have information
about the number

64
00:03:40,680 --> 00:03:42,600
of missing entries, which R marks as NA.

65
00:03:42,600 --> 00:03:45,710
For instance, we
have 1,910 foods

66
00:03:45,710 --> 00:03:49,210
that are missing entries
for their sugar levels.
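
The summary output and the missing-entry count can be sketched on a toy data frame as well. The column names Cholesterol and Sugar and all values here are assumptions for illustration, so the numbers will not match the video.

```r
# Toy data frame with some missing sugar entries (illustrative values only).
USDA <- data.frame(
  Cholesterol = c(0, 10, 3100, 5),
  Sugar = c(1.5, NA, NA, 20)
)

# summary() reports the minimum, quartiles, mean, and maximum per column,
# plus a count of NA entries for any column that has them.
print(summary(USDA))

# The same NA count computed directly: is.na() flags missing entries,
# and sum() counts the TRUE values.
sugar_missing <- sum(is.na(USDA$Sugar))
print(sugar_missing)

# Mean of a column with no missing values.
chol_mean <- mean(USDA$Cholesterol)
print(chol_mean)
```

For a column that does contain NA values, mean() and max() return NA unless we pass na.rm = TRUE.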

67
00:03:49,210 --> 00:03:51,260
Now, scrolling through
this information,

68
00:03:51,260 --> 00:03:54,600
a startling observation
is the maximum level

69
00:03:54,600 --> 00:03:59,560
of sodium, which is
38,758 milligrams.

70
00:03:59,560 --> 00:04:03,370
This number is huge, given that
the daily recommended maximum

71
00:04:03,370 --> 00:04:06,530
is only 2,300 milligrams.

72
00:04:06,530 --> 00:04:11,060
Let's investigate this variable
further in our next video.