1
00:00:04,500 --> 00:00:07,810
In addition to scatter plots, we
can create several other types

2
00:00:07,810 --> 00:00:13,880
of plots in R. Two examples
are histograms and box plots.

3
00:00:13,880 --> 00:00:17,650
Let's first create a histogram
of CellularSubscribers.

4
00:00:17,650 --> 00:00:21,440
To do this, we'll use
the hist function.

5
00:00:21,440 --> 00:00:26,830
So in your R console type
hist, and then in parentheses

6
00:00:26,830 --> 00:00:27,830
WHO$CellularSubscribers.

7
00:00:35,440 --> 00:00:38,760
Close the parentheses
and hit Enter.

8
00:00:38,760 --> 00:00:40,930
If you go over to
your plotting window

9
00:00:40,930 --> 00:00:44,430
you can see that the values
of CellularSubscribers

10
00:00:44,430 --> 00:00:48,260
are shown on the x-axis and
the frequency of these values

11
00:00:48,260 --> 00:00:50,740
is shown on the y-axis.

12
00:00:50,740 --> 00:00:53,200
A histogram is useful
for understanding

13
00:00:53,200 --> 00:00:55,710
the distribution of a variable.

14
00:00:55,710 --> 00:00:58,120
Here we can see that
the most frequent value

15
00:00:58,120 --> 00:01:02,610
of CellularSubscribers
is around 100.

16
00:01:02,610 --> 00:01:06,570
We can also easily
create a box plot in R.

17
00:01:06,570 --> 00:01:09,440
We'll make a box plot
of LifeExpectancy

18
00:01:09,440 --> 00:01:11,539
grouped by Region.

19
00:01:11,539 --> 00:01:16,120
So back in your R
console type boxplot,

20
00:01:16,120 --> 00:01:24,470
and then in parentheses,
WHO$LifeExpectancy and then

21
00:01:24,470 --> 00:01:26,960
a tilde symbol
followed by WHO$Region.

22
00:01:29,570 --> 00:01:33,020
Close the parentheses
and hit Enter.

23
00:01:33,020 --> 00:01:34,860
Then go over to your
plotting window.

24
00:01:34,860 --> 00:01:36,900
You may need to stretch
it out a little bit

25
00:01:36,900 --> 00:01:41,410
so that you can see all of
the labels on the x-axis.

26
00:01:41,410 --> 00:01:43,979
A box plot is useful
for understanding

27
00:01:43,979 --> 00:01:47,110
the statistical
range of a variable.

28
00:01:47,110 --> 00:01:51,270
This box plot shows how
life expectancy in countries

29
00:01:51,270 --> 00:01:54,620
varies according to the
region the country is in.

30
00:01:54,620 --> 00:01:57,289
The box for each
region shows the range

31
00:01:57,289 --> 00:01:59,860
between the first
and third quartiles

32
00:01:59,860 --> 00:02:03,460
with the middle line
marking the median value.

33
00:02:03,460 --> 00:02:06,990
The dashed lines at the
top and bottom of the box,

34
00:02:06,990 --> 00:02:09,669
often called whiskers,
show the range

35
00:02:09,669 --> 00:02:12,220
from the minimum
to maximum values,

36
00:02:12,220 --> 00:02:16,560
excluding any outliers,
which are plotted as circles.

37
00:02:16,560 --> 00:02:18,950
Outliers are defined
by first computing

38
00:02:18,950 --> 00:02:22,170
the difference between the
first and third quartiles,

39
00:02:22,170 --> 00:02:24,150
or the height of the box.

40
00:02:24,150 --> 00:02:27,400
This number is called
the inter-quartile range.

41
00:02:27,400 --> 00:02:30,520
Any point that is greater
than the third quartile

42
00:02:30,520 --> 00:02:34,040
plus 1.5 times the inter-quartile
range, or any point that

43
00:02:34,040 --> 00:02:38,290
is less than the first quartile minus
1.5 times the inter-quartile range

44
00:02:38,290 --> 00:02:40,510
is considered an outlier.

45
00:02:40,510 --> 00:02:44,110
This box plot shows us
that Europe has the highest

46
00:02:44,110 --> 00:02:48,660
median life expectancy, the
Americas has the smallest

47
00:02:48,660 --> 00:02:52,730
inter-quartile range, and the
eastern Mediterranean region

48
00:02:52,730 --> 00:02:58,660
has the highest overall range
of life expectancy values.

49
00:02:58,660 --> 00:03:02,230
If you want to give nice
labels to any of your plots,

50
00:03:02,230 --> 00:03:05,920
you can easily do so by
adding a few arguments.

51
00:03:05,920 --> 00:03:11,970
Go back to your R console,
scroll up and then

52
00:03:11,970 --> 00:03:18,730
inside the parentheses type
a comma and then xlab equals

53
00:03:18,730 --> 00:03:20,840
and then empty
quotes-- we're not

54
00:03:20,840 --> 00:03:24,040
going to label the x-axis here
because the regions are already

55
00:03:24,040 --> 00:03:29,220
nicely labeled-- and then
a comma, and then ylab

56
00:03:29,220 --> 00:03:32,470
equals "Life Expectancy".

57
00:03:38,470 --> 00:03:41,550
Close the quotes,
and then a comma,

58
00:03:41,550 --> 00:03:55,870
and then main = "Life Expectancy
of Countries by Region".

59
00:03:55,870 --> 00:03:58,560
Close the quotes and hit Enter.

60
00:03:58,560 --> 00:04:00,670
If you go back and look
at your box plot again

61
00:04:00,670 --> 00:04:04,120
you should now see that
there's a nice y-axis label

62
00:04:04,120 --> 00:04:06,080
and an overall
title to the plot.

63
00:04:08,960 --> 00:04:12,710
Lastly, let's take a look
at some summary tables.

64
00:04:12,710 --> 00:04:15,490
So go back to your
R console and we'll

65
00:04:15,490 --> 00:04:19,130
start by making a table
of the Region variable.

66
00:04:19,130 --> 00:04:24,560
So we'll type table and then
in parentheses WHO$Region.

67
00:04:27,390 --> 00:04:30,550
Close the parentheses
and hit Enter.

68
00:04:30,550 --> 00:04:33,990
This is similar to what we
saw in the summary output

69
00:04:33,990 --> 00:04:36,150
and counts the number
of observations

70
00:04:36,150 --> 00:04:38,900
in each category of Region.

71
00:04:38,900 --> 00:04:41,470
Tables work well for
variables with only a few

72
00:04:41,470 --> 00:04:46,620
possible values, and we'll see
more of this in recitation.

73
00:04:46,620 --> 00:04:48,340
You can see some
nice information

74
00:04:48,340 --> 00:04:52,940
about numerical variables by
using the tapply function.

75
00:04:52,940 --> 00:04:55,490
Let's start by
looking at an example.

76
00:04:55,490 --> 00:05:06,170
So type tapply, and then in
parentheses WHO$Over60 comma,

77
00:05:06,170 --> 00:05:12,650
and then WHO$Region
comma, and then mean.

78
00:05:12,650 --> 00:05:15,950
Close the parentheses
and hit Enter.

79
00:05:15,950 --> 00:05:19,230
This splits the
observations by Region

80
00:05:19,230 --> 00:05:23,380
and then computes the mean
of the variable Over60.

81
00:05:23,380 --> 00:05:27,820
So tapply splits the data by
the second argument you give,

82
00:05:27,820 --> 00:05:30,710
and then applies the function
given as the third argument

83
00:05:30,710 --> 00:05:33,790
to the variable given
as the first argument.

84
00:05:33,790 --> 00:05:36,560
This result tells us that
the average percentage

85
00:05:36,560 --> 00:05:41,720
of the population over 60 in
African countries is about 5%,

86
00:05:41,720 --> 00:05:44,490
while the average percentage
of the population over 60

87
00:05:44,490 --> 00:05:48,830
in European countries
is about 20%.

88
00:05:48,830 --> 00:05:51,130
Let's look at another example.

89
00:05:51,130 --> 00:05:53,840
This time in the
tapply function,

90
00:05:53,840 --> 00:06:01,520
we'll give as the first
argument WHO$LiteracyRate then

91
00:06:01,520 --> 00:06:06,210
as the second argument
we'll give WHO$Region again.

92
00:06:06,210 --> 00:06:08,810
And as our third
argument we'll give min.

93
00:06:08,810 --> 00:06:11,560
Close the parentheses
and hit Enter.

94
00:06:11,560 --> 00:06:14,270
Here we see something
a little strange.

95
00:06:14,270 --> 00:06:17,980
We have the value NA
for all of the regions.

96
00:06:17,980 --> 00:06:21,130
This is because we have some
missing values in our data

97
00:06:21,130 --> 00:06:22,960
for literacy rate.

98
00:06:22,960 --> 00:06:26,280
A common thing to do is to
just remove the missing values

99
00:06:26,280 --> 00:06:28,410
when doing the computation.

100
00:06:28,410 --> 00:06:32,830
We need to pass one additional
argument, so hit the up arrow,

101
00:06:32,830 --> 00:06:36,180
and then inside the
parentheses add a comma

102
00:06:36,180 --> 00:06:44,970
and then na.rm =
TRUE and hit Enter.

103
00:06:44,970 --> 00:06:46,980
This removes all of
the countries that

104
00:06:46,980 --> 00:06:49,220
are missing a value
for LiteracyRate

105
00:06:49,220 --> 00:06:51,810
before doing the computation.

106
00:06:51,810 --> 00:06:55,870
This time we see numerical
values, as we expect.

107
00:06:55,870 --> 00:06:58,370
So we've split the
data by Region again

108
00:06:58,370 --> 00:07:01,280
and computed the minimum
value of LiteracyRate

109
00:07:01,280 --> 00:07:06,450
for all countries with a value
in the LiteracyRate variable.

110
00:07:06,450 --> 00:07:11,070
By using some basic functions
in R, plots, and summary tables,

111
00:07:11,070 --> 00:07:14,450
we were able to get a better
understanding of our data.

112
00:07:14,450 --> 00:07:17,110
You'll see more of this in
the recitation and homework

113
00:07:17,110 --> 00:07:18,660
assignment.