1
00:00:04,500 --> 00:00:08,119
In this video, we'll do
some basic data analysis

2
00:00:08,119 --> 00:00:10,500
using our WHO data.

3
00:00:10,500 --> 00:00:13,640
To access a variable
in a data frame,

4
00:00:13,640 --> 00:00:17,190
you always have to link it to
the data frame it belongs to

5
00:00:17,190 --> 00:00:19,060
with the dollar sign.

6
00:00:19,060 --> 00:00:23,820
To see this, let's
first try typing Under15

7
00:00:23,820 --> 00:00:25,700
and hitting Enter.

8
00:00:25,700 --> 00:00:27,800
R responds with an error.

9
00:00:27,800 --> 00:00:32,049
That's because R won't recognize
this variable name since it

10
00:00:32,049 --> 00:00:35,050
doesn't know to look
in the data frame WHO.

11
00:00:35,050 --> 00:00:41,530
Now type WHO$Under15
and hit Enter.

12
00:00:41,530 --> 00:00:47,130
This outputs the Under15
vector of the data frame WHO.
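
A minimal sketch of the step above, assuming a small made-up data frame in place of the real WHO data (the country names and values here are hypothetical). It shows that a bare column name is not found, while the dollar-sign form returns the column as a vector.

```r
# Toy stand-in for the WHO data frame; all values are invented for illustration.
WHO <- data.frame(
  Country = c("Alpha", "Beta", "Gamma"),
  Under15 = c(13.1, 28.5, 49.9),
  stringsAsFactors = FALSE
)

# Typing the bare name fails because R does not look inside WHO for it.
bare_lookup_failed <- tryCatch({ Under15; FALSE }, error = function(e) TRUE)
print(bare_lookup_failed)

# Linking the variable to its data frame with the dollar sign works.
print(WHO$Under15)
```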

13
00:00:47,130 --> 00:00:50,430
We can compute some statistics
about this variable,

14
00:00:50,430 --> 00:00:54,220
such as the mean, using
the mean function and then

15
00:00:54,220 --> 00:00:59,870
in parentheses
typing WHO$Under15.

16
00:00:59,870 --> 00:01:02,950
Close the parentheses
and hit Enter.

17
00:01:02,950 --> 00:01:05,319
This tells us that
the average percentage

18
00:01:05,319 --> 00:01:10,740
of the population
under 15 is 28.7.

19
00:01:10,740 --> 00:01:13,539
We can also compute
the standard deviation

20
00:01:13,539 --> 00:01:15,740
using the sd function.

21
00:01:15,740 --> 00:01:20,030
So type sd and then in
parentheses WHO$Under15.

22
00:01:23,300 --> 00:01:26,140
Close the parentheses
and hit Enter.

23
00:01:26,140 --> 00:01:28,670
This tells us that
the standard deviation

24
00:01:28,670 --> 00:01:33,910
of the percentage of the
population under 15 is 10.5.
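
As a rough illustration of the two calls above, here is a sketch on made-up numbers rather than the real WHO data, so the results will not match the 28.7 and 10.5 quoted in the video. Note that sd computes the sample standard deviation, dividing by n - 1.

```r
# Invented percentages standing in for WHO$Under15.
WHO <- data.frame(Under15 = c(13, 20, 30, 40, 47))

avg <- mean(WHO$Under15)    # arithmetic mean of the column
spread <- sd(WHO$Under15)   # sample standard deviation (n - 1 denominator)

print(avg)
print(spread)
```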

25
00:01:33,910 --> 00:01:36,520
We can also get the
statistical summary

26
00:01:36,520 --> 00:01:40,259
of just one variable using
the summary function like we

27
00:01:40,259 --> 00:01:42,690
did before for the
whole data frame.

28
00:01:42,690 --> 00:01:48,430
To do this, we can type
summary, and then in parentheses

29
00:01:48,430 --> 00:01:48,930
WHO$Under15.

30
00:01:52,710 --> 00:01:55,820
Close the parentheses
and hit Enter.

31
00:01:55,820 --> 00:02:00,090
This gives the minimum
value, the first quartile,

32
00:02:00,090 --> 00:02:04,320
the median value, the
mean, the third quartile,

33
00:02:04,320 --> 00:02:08,330
and the maximum value
of the variable Under15.

34
00:02:08,330 --> 00:02:12,720
The first quartile is the
value for which 25% of the data

35
00:02:12,720 --> 00:02:16,280
is less than that value,
and the third quartile

36
00:02:16,280 --> 00:02:19,470
is the value for
which 75% of the data

37
00:02:19,470 --> 00:02:21,990
is less than that value.
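
The sketch below, again on invented values, shows summary on a single vector and checks the quartiles with quantile. One caveat worth hedging: by default R interpolates between data points when computing quantiles, so the "25% of the data is less than that value" description is approximate for small samples.

```r
# Invented values standing in for WHO$Under15.
under15 <- c(13, 20, 30, 40, 47)

# Minimum, first quartile, median, mean, third quartile, maximum.
stats <- summary(under15)
print(stats)

# The same quartiles requested directly.
q <- quantile(under15, probs = c(0.25, 0.75))
print(q)
```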

38
00:02:21,990 --> 00:02:26,550
This output tells us that
there's a country with only 13%

39
00:02:26,550 --> 00:02:29,370
of the population under 15.

40
00:02:29,370 --> 00:02:34,310
Let's see which country it is
using the which.min function.

41
00:02:34,310 --> 00:02:40,100
So we can type which.min
and then in parentheses

42
00:02:40,100 --> 00:02:42,650
our variable WHO$Under15.

43
00:02:47,579 --> 00:02:50,380
Close the parentheses
and hit Enter.

44
00:02:50,380 --> 00:02:53,020
This returns the
number 86, which

45
00:02:53,020 --> 00:02:55,630
is the row number
of the observation

46
00:02:55,630 --> 00:02:58,700
with the minimum
value of Under15.

47
00:02:58,700 --> 00:03:06,170
To see which country is in row
86, we can type WHO$Country,

48
00:03:06,170 --> 00:03:10,550
for the country name, and
then in square brackets 86,

49
00:03:10,550 --> 00:03:12,070
and hit Enter.

50
00:03:12,070 --> 00:03:15,160
So Japan is the country
with the minimum percentage

51
00:03:15,160 --> 00:03:19,090
of the population under 15.

52
00:03:19,090 --> 00:03:22,120
Now let's see which country
has the maximum percentage

53
00:03:22,120 --> 00:03:24,550
of the population under 15.

54
00:03:24,550 --> 00:03:28,320
We can do this with
the which.max function.

55
00:03:28,320 --> 00:03:40,090
So type which.max and then
in parentheses WHO$Under15.

56
00:03:40,090 --> 00:03:43,110
Close the parentheses
and hit Enter.

57
00:03:43,110 --> 00:03:46,910
This tells us that
the 124th observation

58
00:03:46,910 --> 00:03:50,910
has the maximum value
of the variable Under15.

59
00:03:50,910 --> 00:03:55,680
We can look up the country of
the 124th observation by typing

60
00:03:55,680 --> 00:04:02,080
WHO$Country and then
in square brackets 124.

61
00:04:02,080 --> 00:04:04,910
Close the square
brackets and hit Enter.

62
00:04:04,910 --> 00:04:08,000
So Niger is the country
with the maximum percentage

63
00:04:08,000 --> 00:04:11,990
of the population under 15.
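
The lookup pattern used for both the minimum and the maximum can be sketched as follows, assuming a toy data frame with hypothetical country names (so the row numbers are not 86 and 124). which.min and which.max return a row position, which is then used to index the Country vector; if several rows tie, they return the first one.

```r
# Toy data frame; names and values are invented for illustration.
WHO <- data.frame(
  Country = c("Alpha", "Beta", "Gamma", "Delta"),
  Under15 = c(28.0, 13.5, 49.0, 35.2),
  stringsAsFactors = FALSE
)

lowest_row <- which.min(WHO$Under15)    # position of the smallest value
highest_row <- which.max(WHO$Under15)   # position of the largest value

# Use the positions in square brackets to look up the country names.
print(WHO$Country[lowest_row])
print(WHO$Country[highest_row])
```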

64
00:04:11,990 --> 00:04:17,579
Let's now create a scatter plot
of GNI versus fertility rate.

65
00:04:17,579 --> 00:04:20,410
You can do this using
the plot function.

66
00:04:20,410 --> 00:04:27,500
So type plot and then
in parentheses WHO$GNI,

67
00:04:27,500 --> 00:04:31,450
the variable we want
on our x-axis, comma,

68
00:04:31,450 --> 00:04:38,470
and then WHO$FertilityRate, the
variable we want on our y-axis.

69
00:04:38,470 --> 00:04:41,420
Close the parentheses
and hit Enter.

70
00:04:41,420 --> 00:04:43,610
A scatter plot should appear.

71
00:04:43,610 --> 00:04:47,140
Income, or GNI,
is on the x-axis,

72
00:04:47,140 --> 00:04:50,420
and fertility rate
is on the y-axis.

73
00:04:50,420 --> 00:04:53,880
Each point in the scatter
plot is a country.
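
A minimal sketch of the plot call, using invented GNI and fertility values. The xlab and ylab arguments are an addition not used in the video; they only replace the default axis labels. When run as a script rather than in the console, R writes the figure to a file instead of opening a window.

```r
# Toy data frame; values are made up and do not reflect the WHO data.
WHO <- data.frame(
  GNI = c(1500, 25000, 12000, 40000),
  FertilityRate = c(5.1, 1.6, 3.2, 2.9)
)

# First argument goes on the x-axis, second on the y-axis.
plot(WHO$GNI, WHO$FertilityRate, xlab = "GNI", ylab = "Fertility rate")
```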

74
00:04:53,880 --> 00:04:56,110
We can see that most
countries here either

75
00:04:56,110 --> 00:05:01,350
have a low GNI or a high GNI
but a low fertility rate.

76
00:05:01,350 --> 00:05:03,600
However, there are
a few countries

77
00:05:03,600 --> 00:05:08,110
for which both the GNI and
the fertility rate are high.

78
00:05:08,110 --> 00:05:09,720
Let's investigate.

79
00:05:09,720 --> 00:05:13,150
We'll use the subset function
to identify the countries

80
00:05:13,150 --> 00:05:17,450
with a GNI greater than
10,000, and a fertility

81
00:05:17,450 --> 00:05:20,700
rate greater than 2.5.

82
00:05:20,700 --> 00:05:26,210
So go back to your R console
and then type Outliers--

83
00:05:26,210 --> 00:05:32,050
this is what we'll call
our subset-- equals subset,

84
00:05:32,050 --> 00:05:44,420
and then in parentheses WHO
comma GNI greater than 10,000

85
00:05:44,420 --> 00:05:52,570
and FertilityRate
greater than 2.5.

86
00:05:52,570 --> 00:05:56,140
Close the parentheses
and hit Enter.

87
00:05:56,140 --> 00:05:58,230
When we used subset
before, we only

88
00:05:58,230 --> 00:06:00,310
had one condition
to define which

89
00:06:00,310 --> 00:06:03,190
observations to
keep in the subset.

90
00:06:03,190 --> 00:06:08,140
Here we have two conditions,
separated by the and symbol (&).

91
00:06:08,140 --> 00:06:10,180
This means that
both conditions must

92
00:06:10,180 --> 00:06:14,510
be true for all
observations in the subset.

93
00:06:14,510 --> 00:06:17,720
We can see how many rows
of data are in our subset

94
00:06:17,720 --> 00:06:20,390
by using the nrow function.

95
00:06:20,390 --> 00:06:25,080
So type nrow for number of
rows, and in parentheses

96
00:06:25,080 --> 00:06:26,980
the name of our
subset, Outliers.

97
00:06:29,930 --> 00:06:33,050
This tells us that there are
seven countries for which

98
00:06:33,050 --> 00:06:37,290
the GNI is greater than
10,000 and the fertility rate

99
00:06:37,290 --> 00:06:40,190
is greater than 2.5.
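
The two-condition subset and the row count can be sketched like this, on a toy data frame with hypothetical countries (so the count is not seven). The thresholds are typed without a thousands separator, and the sketch assigns with <- where the video uses the equals sign; both work here.

```r
# Toy data frame; names and values are invented for illustration.
WHO <- data.frame(
  Country = c("Alpha", "Beta", "Gamma", "Delta"),
  GNI = c(1500, 25000, 12000, 40000),
  FertilityRate = c(5.1, 1.6, 3.2, 2.9),
  stringsAsFactors = FALSE
)

# Keep only rows where both conditions hold; & combines them element-wise.
Outliers <- subset(WHO, GNI > 10000 & FertilityRate > 2.5)

# Count the rows that satisfied both conditions.
n_outliers <- nrow(Outliers)
print(n_outliers)
```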

100
00:06:40,190 --> 00:06:44,170
Now let's output just
the country names, GNI,

101
00:06:44,170 --> 00:06:47,090
and fertility rate of
these seven countries

102
00:06:47,090 --> 00:06:49,290
to investigate further.

103
00:06:49,290 --> 00:06:51,260
There's an easy
way of doing this,

104
00:06:51,260 --> 00:06:54,320
and we'll use this technique
several times in this class

105
00:06:54,320 --> 00:06:59,090
when we just want to extract a
few variables from a data set.

106
00:06:59,090 --> 00:07:01,350
So type the name
of our data set,

107
00:07:01,350 --> 00:07:04,840
Outliers, and then
in square brackets

108
00:07:04,840 --> 00:07:07,640
we'll make a vector of
the names of the variables

109
00:07:07,640 --> 00:07:09,290
we want to output.

110
00:07:09,290 --> 00:07:15,240
So c, and then in parentheses,
"Country" for the country name,

111
00:07:15,240 --> 00:07:20,420
comma, "GNI" comma, and
then "FertilityRate".

112
00:07:23,520 --> 00:07:27,570
Close the parentheses, close the
square brackets, and hit Enter.

113
00:07:30,180 --> 00:07:33,680
This shows us the values
of these three variables

114
00:07:33,680 --> 00:07:36,570
for the seven
observations in Outliers.
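
A short sketch of the column-extraction technique, assuming a made-up Outliers data frame with one extra column so the selection has a visible effect. Passing a character vector of names in square brackets returns a data frame with just those columns, in the order given.

```r
# Toy stand-in for Outliers; names and values are invented for illustration.
Outliers <- data.frame(
  Country = c("Gamma", "Delta"),
  GNI = c(12000, 40000),
  FertilityRate = c(3.2, 2.9),
  Under15 = c(35.2, 30.1),
  stringsAsFactors = FALSE
)

# Select three variables by name; the Under15 column is left out.
selected <- Outliers[c("Country", "GNI", "FertilityRate")]
print(selected)
```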

115
00:07:36,570 --> 00:07:38,880
We can see that one
of the seven countries

116
00:07:38,880 --> 00:07:42,950
is Equatorial Guinea, a country
that is very rich per capita

117
00:07:42,950 --> 00:07:45,250
due to oil production,
but the wealth

118
00:07:45,250 --> 00:07:48,690
is distributed very unevenly.

119
00:07:48,690 --> 00:07:51,250
In the next video,
we'll see how to create

120
00:07:51,250 --> 00:07:54,710
different types of plots in
R and then build some summary

121
00:07:54,710 --> 00:07:56,260
tables.