1
00:00:04,860 --> 00:00:09,110
Often, you will need to load
an external data file into R

2
00:00:09,110 --> 00:00:11,920
to do some analysis
and modeling.

3
00:00:11,920 --> 00:00:15,410
In this class, we'll be
working with csv files,

4
00:00:15,410 --> 00:00:18,550
or comma separated value files.

5
00:00:18,550 --> 00:00:21,320
This is a common
format for data files

6
00:00:21,320 --> 00:00:24,120
and is easy to work with in R.

7
00:00:24,120 --> 00:00:27,570
The first thing you need to
do to read in a data file

8
00:00:27,570 --> 00:00:30,520
is to navigate to the
directory on your computer

9
00:00:30,520 --> 00:00:32,970
where the data file is saved.

10
00:00:32,970 --> 00:00:37,990
On a Mac, go to the Misc menu,
then select "Change Working

11
00:00:37,990 --> 00:00:39,390
Directory...".

12
00:00:39,390 --> 00:00:46,280
On a PC, go to the File menu
and select "Change dir...".

13
00:00:46,280 --> 00:00:50,240
This should pop up a browsing
or navigation window.

14
00:00:50,240 --> 00:00:57,390
Navigate to the folder where
you saved the data file WHO.csv

15
00:00:57,390 --> 00:00:59,630
that you've downloaded
for this class,

16
00:00:59,630 --> 00:01:01,260
and then select that folder.

17
00:01:11,930 --> 00:01:13,850
Nothing should
have happened in R,

18
00:01:13,850 --> 00:01:17,690
but if you type
getwd, and then empty

19
00:01:17,690 --> 00:01:20,970
parentheses and hit Enter,
you should see the path

20
00:01:20,970 --> 00:01:25,590
to the folder containing
the data set that you just

21
00:01:25,590 --> 00:01:28,250
selected.
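
The same step can also be done by typing instead of using the menus. This is a minimal sketch, not what the video shows: setwd is a base R function, and tempdir() is only a stand-in (an assumption) for the folder where you saved WHO.csv.

```r
# Remember the current working directory so we can restore it afterwards.
old_dir <- getwd()
print(old_dir)

# Hypothetical: change the working directory from code.
# tempdir() is a placeholder; replace it with the folder holding WHO.csv.
setwd(tempdir())
print(getwd())

# Go back to where we started.
setwd(old_dir)
```

Setting the directory in code makes a script repeatable, but the path will differ on every computer.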

22
00:01:28,250 --> 00:01:33,460
Now, read in the data
file by typing WHO =

23
00:01:33,460 --> 00:01:42,490
read.csv("WHO.csv"), where WHO.csv is
the name of the data file we want to read

24
00:01:42,490 --> 00:01:44,030
in.

25
00:01:44,030 --> 00:01:50,160
If you hit Enter, this will
save the data set in WHO.csv

26
00:01:50,160 --> 00:01:53,240
to the data frame WHO.
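
Here is a small sketch of the read step that runs without the class file. It assumes a tiny made-up CSV written to a temporary folder; the file name toy.csv, the country names, and all the values are hypothetical. Only the read.csv call mirrors the command above.

```r
# Hypothetical stand-in for WHO.csv: a tiny file with made-up values.
toy_path <- file.path(tempdir(), "toy.csv")
writeLines(c("Country,Region,FertilityRate",
             "Country A,Europe,1.5",
             "Country B,Europe,NA",
             "Country C,Americas,2.1"),
           toy_path)

# Same shape as WHO = read.csv("WHO.csv"), using <- for assignment.
toy <- read.csv(toy_path)
print(toy)
```

Both = and <- assign at the console. Much R code uses <-, but either works here.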

27
00:01:53,240 --> 00:01:57,009
To look at our data, there
are two very useful commands.

28
00:01:57,009 --> 00:02:00,480
The first is the
str function, which

29
00:02:00,480 --> 00:02:03,000
shows us the
structure of the data.

30
00:02:03,000 --> 00:02:09,570
If you type str(WHO),
and hit Enter,

31
00:02:09,570 --> 00:02:12,070
you can see that we
have a data frame

32
00:02:12,070 --> 00:02:16,740
of 194 observations
and 13 variables.

33
00:02:16,740 --> 00:02:19,530
This data set contains
recent statistics

34
00:02:19,530 --> 00:02:22,190
from the World Health
Organization-- W,

35
00:02:22,190 --> 00:02:26,490
H, O, or WHO-- on all countries.

36
00:02:26,490 --> 00:02:29,329
The variables are the
name of the country,

37
00:02:29,329 --> 00:02:34,150
the region the country is in,
the population in thousands,

38
00:02:34,150 --> 00:02:38,980
the percentage of the
population under 15 and over 60,

39
00:02:38,980 --> 00:02:43,660
the fertility rate or average
number of children per woman,

40
00:02:43,660 --> 00:02:48,670
the life expectancy in years,
the child mortality rate which

41
00:02:48,670 --> 00:02:52,260
is the number of children
who die by age five per 1,000

42
00:02:52,260 --> 00:02:58,160
births, the number of cellular
subscribers per 100 population,

43
00:02:58,160 --> 00:03:01,110
the literacy rate among
adults aged greater than

44
00:03:01,110 --> 00:03:06,320
or equal to 15, the gross
national income per capita,

45
00:03:06,320 --> 00:03:09,960
the percentage of male children
enrolled in primary school,

46
00:03:09,960 --> 00:03:12,000
and the percentage
of female children

47
00:03:12,000 --> 00:03:14,160
enrolled in primary school.

48
00:03:14,160 --> 00:03:19,110
For each variable, str gives
us the name of the variable,

49
00:03:19,110 --> 00:03:22,300
and then a description of
the type of the variable

50
00:03:22,300 --> 00:03:25,550
followed by the first few
values of the variable.

51
00:03:25,550 --> 00:03:27,650
We see a couple
different types here.

52
00:03:27,650 --> 00:03:30,270
One is a factor variable.

53
00:03:30,270 --> 00:03:33,700
Country and Region are
both factor variables.

54
00:03:33,700 --> 00:03:35,579
This means that
the variables have

55
00:03:35,579 --> 00:03:40,210
several different categories,
not necessarily numerical.

56
00:03:40,210 --> 00:03:44,390
For example, the Region variable
has six different categories

57
00:03:44,390 --> 00:03:45,740
or levels.

58
00:03:45,740 --> 00:03:48,930
These include
Africa and Americas.

59
00:03:48,930 --> 00:03:52,790
So each observation
in the Region variable

60
00:03:52,790 --> 00:03:56,079
belongs to one of six
different categories.

61
00:03:56,079 --> 00:03:57,770
For variables like
Country, where

62
00:03:57,770 --> 00:04:01,760
there are 194 levels, which is
the same as the number of observations

63
00:04:01,760 --> 00:04:06,330
we have, each value in
this variable is different.

64
00:04:06,330 --> 00:04:09,150
In this case, it makes sense,
since each country name

65
00:04:09,150 --> 00:04:11,230
is different.

66
00:04:11,230 --> 00:04:14,340
Then we have two types of
numerical values-- integer

67
00:04:14,340 --> 00:04:16,040
and then general
numerical values.
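
To see the three types side by side, here is a minimal sketch on a made-up data frame. The toy values are assumptions, not WHO data. Note that in R 4.0.0 and later, read.csv no longer converts text columns to factors by default, so Country and Region may show up as chr rather than Factor unless you pass stringsAsFactors = TRUE.

```r
# Hypothetical toy data with one factor-like, one integer, one numeric column.
toy <- data.frame(
  Country = c("Country A", "Country B", "Country C"),
  Region = c("Europe", "Europe", "Americas"),
  Population = c(30L, 1200L, 540L),   # L marks integer values
  FertilityRate = c(1.5, NA, 2.1),    # general numerical values
  stringsAsFactors = TRUE             # treat text columns as factors
)

# Show the structure: one line per variable with its type and first values.
str(toy)

# Count the categories (levels) of a factor variable.
print(nlevels(toy$Region))
print(levels(toy$Region))
```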

68
00:04:18,930 --> 00:04:22,200
The other very useful function
to take a look at our data

69
00:04:22,200 --> 00:04:24,130
is the summary function.

70
00:04:24,130 --> 00:04:27,190
In your R console,
type summary and then,

71
00:04:27,190 --> 00:04:32,159
in parentheses, WHO, the name of
our data frame, and hit Enter.

72
00:04:32,159 --> 00:04:36,250
This gives a numerical summary
of each of our variables.

73
00:04:36,250 --> 00:04:39,010
For the factor variables,
Country and Region,

74
00:04:39,010 --> 00:04:41,330
it counts the number
of observations

75
00:04:41,330 --> 00:04:44,860
in each of the
levels or categories.

76
00:04:44,860 --> 00:04:48,120
So here, we see that we have
46 countries in the region

77
00:04:48,120 --> 00:04:53,110
Africa, 35 in the
region Americas, etc.

78
00:04:53,110 --> 00:04:55,010
For each of the
numerical values,

79
00:04:55,010 --> 00:05:01,420
we see the min, first quartile,
median, mean, third quartile,

80
00:05:01,420 --> 00:05:05,640
and maximum values
in that variable.

81
00:05:05,640 --> 00:05:09,430
We can also see in some of the
variables that we have this

82
00:05:09,430 --> 00:05:11,630
category called NA's.

83
00:05:11,630 --> 00:05:15,160
This means that
some observations

84
00:05:15,160 --> 00:05:17,640
are missing values
in that variable.

85
00:05:17,640 --> 00:05:21,200
So for FertilityRate,
there are 11 observations that

86
00:05:21,200 --> 00:05:25,440
are missing the value
of FertilityRate.
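
A short sketch of how missing values show up, again on made-up data (all values are hypothetical). It also shows one way to count the NA's directly and to compute a mean that skips them. The is.na function and the na.rm argument are base R.

```r
# Hypothetical toy data with one missing fertility rate.
toy <- data.frame(
  Region = factor(c("Europe", "Europe", "Americas")),
  FertilityRate = c(1.5, NA, 2.1)
)

# Counts per level for the factor, quartiles plus an NA's count for the number.
print(summary(toy))

# Count the missing values in one variable.
na_count <- sum(is.na(toy$FertilityRate))
print(na_count)

# Without na.rm = TRUE, mean() would return NA here.
mean_fert <- mean(toy$FertilityRate, na.rm = TRUE)
print(mean_fert)
```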

87
00:05:25,440 --> 00:05:28,100
When working with data
in R, it can often

88
00:05:28,100 --> 00:05:30,540
be useful to subset your data.

89
00:05:30,540 --> 00:05:33,820
For example, suppose we
want to create a new data

90
00:05:33,820 --> 00:05:36,610
frame with only the
countries in Europe.

91
00:05:36,610 --> 00:05:42,990
Let's call it WHO_Europe
and use the subset function

92
00:05:42,990 --> 00:05:46,210
to subset the data
frame WHO to take

93
00:05:46,210 --> 00:05:49,290
only the observations
for which Region

94
00:05:49,290 --> 00:05:51,300
is exactly equal to Europe.

95
00:05:53,890 --> 00:05:57,030
The subset function
takes two arguments.

96
00:05:57,030 --> 00:05:58,560
The first is the
data frame we want

97
00:05:58,560 --> 00:06:01,870
to take a subset of,
in this case, WHO.

98
00:06:01,870 --> 00:06:04,200
And the second argument
is the criteria

99
00:06:04,200 --> 00:06:08,350
for which observations of WHO
should belong in our new data

100
00:06:08,350 --> 00:06:10,160
frame, WHO_Europe.

101
00:06:10,160 --> 00:06:12,760
In this case, we
want the observations

102
00:06:12,760 --> 00:06:17,500
for which the Region variable
is exactly equal to Europe.

103
00:06:17,500 --> 00:06:21,500
The double equal sign here
means exactly equal to.

104
00:06:21,500 --> 00:06:27,240
If we hit Enter and then look
at the structure of WHO_Europe,

105
00:06:27,240 --> 00:06:29,200
we can see that
we now have a data

106
00:06:29,200 --> 00:06:33,909
frame of 53 observations
of the same 13 variables.

107
00:06:33,909 --> 00:06:35,760
Does 53 sound right?

108
00:06:35,760 --> 00:06:39,450
Well, let's look back at
the summary output of WHO.

109
00:06:39,450 --> 00:06:42,200
We can see in the
Region output, there

110
00:06:42,200 --> 00:06:47,320
were 53 observations that
belonged in the region Europe.

111
00:06:47,320 --> 00:06:49,970
So we should expect
53 observations

112
00:06:49,970 --> 00:06:53,010
in our Europe subset,
which is right.
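
The subset step described above can be sketched as follows. On the real data the command would presumably look like WHO_Europe = subset(WHO, Region == "Europe"). The toy data frame here is a made-up stand-in so the example runs on its own.

```r
# Hypothetical toy data: two European countries and one from the Americas.
toy <- data.frame(
  Country = c("Country A", "Country B", "Country C"),
  Region = c("Europe", "Europe", "Americas"),
  stringsAsFactors = TRUE
)

# First argument: the data frame. Second: the condition rows must satisfy.
# == tests for equality; a single = would not work as a comparison.
toy_europe <- subset(toy, Region == "Europe")
str(toy_europe)

# Sanity check, as in the video: compare against the count per region.
region_counts <- table(toy$Region)
print(region_counts)

# Note: the subset keeps all original factor levels, even the unused ones.
print(nlevels(toy_europe$Region))
```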

113
00:06:55,650 --> 00:06:58,060
Now, suppose we want
to save this new data

114
00:06:58,060 --> 00:07:01,500
frame, WHO_Europe,
to a csv file.

115
00:07:01,500 --> 00:07:05,430
You can use the write.csv
function to do this.

116
00:07:05,430 --> 00:07:10,440
Type write.csv, and
then in parentheses

117
00:07:10,440 --> 00:07:12,700
the name of the data
frame we want to save,

118
00:07:12,700 --> 00:07:17,280
WHO_Europe, comma,
and then in quotes

119
00:07:17,280 --> 00:07:19,740
the name of the file
we want to save it to.

120
00:07:19,740 --> 00:07:20,950
Let's call it WHO_Europe.csv.

121
00:07:24,810 --> 00:07:27,850
If you hit Enter,
nothing should happen,

122
00:07:27,850 --> 00:07:31,780
but you should now have a
file called WHO_Europe.csv

123
00:07:31,780 --> 00:07:35,040
in the same folder that
you saved WHO.csv in.
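
A minimal sketch of the save step, writing to a temporary folder so it does not touch your class files. The data and the file name toy_europe.csv are hypothetical. By default, write.csv also writes the row names as an extra first column; the row.names = FALSE argument, which the video does not use, leaves them out.

```r
# Hypothetical toy subset to save.
toy_europe <- data.frame(
  Country = c("Country A", "Country B"),
  Region = c("Europe", "Europe")
)
out_path <- file.path(tempdir(), "toy_europe.csv")

# Default call, as in the video: prints nothing, but creates the file.
write.csv(toy_europe, out_path)
print(file.exists(out_path))

# Reading it back shows an extra column, X, holding the row names.
with_row_names <- names(read.csv(out_path))
print(with_row_names)

# Optional: skip the row names so the columns round-trip unchanged.
write.csv(toy_europe, out_path, row.names = FALSE)
without_row_names <- names(read.csv(out_path))
print(without_row_names)
```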

124
00:07:37,670 --> 00:07:40,980
And now that we've saved
this as a csv file,

125
00:07:40,980 --> 00:07:43,570
if we're not working
with it anymore in R,

126
00:07:43,570 --> 00:07:47,409
we can remove the data frame
from our current session in R.

127
00:07:47,409 --> 00:07:50,370
This is often useful if you're
working with a large data

128
00:07:50,370 --> 00:07:53,159
set that's taking
up a lot of space.

129
00:07:53,159 --> 00:07:58,110
First, let's type ls() to see
what variables we currently

130
00:07:58,110 --> 00:08:01,620
have in R. You can see
that WHO_Europe is one

131
00:08:01,620 --> 00:08:03,590
of our variables.

132
00:08:03,590 --> 00:08:09,990
Now, type rm for remove and then, in
parentheses, the name WHO_Europe, and hit

133
00:08:09,990 --> 00:08:11,210
Enter.

134
00:08:11,210 --> 00:08:14,930
If you type ls() again, you
should see that WHO_Europe is

135
00:08:14,930 --> 00:08:16,850
gone.
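
The list-and-remove steps can be sketched as below. The variable toy_europe is a made-up stand-in for WHO_Europe. The exists function is base R and is used here only to check the result.

```r
# Hypothetical object standing in for WHO_Europe.
toy_europe <- data.frame(x = 1:3)

# ls() lists the names of the objects in the current session.
before <- "toy_europe" %in% ls()
print(before)

# rm() removes the named object; the name goes inside the parentheses.
rm(toy_europe)

# The name no longer appears in the listing.
after <- "toy_europe" %in% ls()
print(after)
```

Removing a data frame only clears it from the R session. The CSV file saved earlier stays on disk.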

136
00:08:16,850 --> 00:08:21,430
In the next video, we'll
explore the WHO data set.