1
00:00:04,500 --> 00:00:08,820
Visualization is a crucial step
for initial data exploration.

2
00:00:08,820 --> 00:00:11,740
It helps us discern
relationships, patterns,

3
00:00:11,740 --> 00:00:13,220
and outliers.

4
00:00:13,220 --> 00:00:15,300
This video will give
us a starting point

5
00:00:15,300 --> 00:00:19,100
on how to make plots in R, but
more advanced and way cooler

6
00:00:19,100 --> 00:00:23,570
visualization tips will be
given in Week 8 of this class.

7
00:00:23,570 --> 00:00:26,430
Let us first create a
scatterplot with Protein

8
00:00:26,430 --> 00:00:30,000
on the x-axis and
Fat on the y-axis.

9
00:00:30,000 --> 00:00:32,820
To do this we can use
the plot function in R

10
00:00:32,820 --> 00:00:36,710
and give it as a first input the
Protein vector on the x-axis,

11
00:00:36,710 --> 00:00:41,790
and as a second input the
TotalFat vector on the y-axis.

12
00:00:41,790 --> 00:00:44,950
And now pressing Enter,
a new window pops up.

13
00:00:44,950 --> 00:00:48,670
If you are on a PC, the new
window is called R Graphics,

14
00:00:48,670 --> 00:00:51,420
and if you are on a Mac
it is called Quartz.

15
00:00:51,420 --> 00:00:55,160
The plot has a very
interesting triangular shape.

16
00:00:55,160 --> 00:00:57,840
It looks like foods that
are higher in protein

17
00:00:57,840 --> 00:01:01,660
are typically lower in
fat, and vice versa.

18
00:01:01,660 --> 00:01:03,970
Now, looking at the
aesthetics of the graph,

19
00:01:03,970 --> 00:01:06,400
we realize that R
gives default names

20
00:01:06,400 --> 00:01:10,020
for the x-axis and the y-axis
using the vector dollar

21
00:01:10,020 --> 00:01:11,350
notation.
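
As a rough sketch of the call described so far: the data frame name `foods` and its values below are hypothetical stand-ins, since the video's data set is not reproduced here. Only the column names Protein and TotalFat come from the video.

```r
# Hypothetical stand-in data; not the data set used in the video.
foods <- data.frame(
  Protein  = c(0.9, 3.2, 12.6, 20.1, 25.4, 31.0),
  TotalFat = c(81.1, 3.3, 9.5, 13.0, 7.4, 1.2)
)

# The first input goes on the x-axis, the second on the y-axis.
# With no labels supplied, R labels the axes with the expressions
# as typed: foods$Protein and foods$TotalFat.
plot(foods$Protein, foods$TotalFat)
```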

22
00:01:11,350 --> 00:01:14,360
Now we can modify these labels
by adding more arguments

23
00:01:14,360 --> 00:01:16,070
to the plot function.

24
00:01:16,070 --> 00:01:18,310
To go back to the R
console, you can simply

25
00:01:18,310 --> 00:01:20,600
switch windows using
Control+Tab

26
00:01:20,600 --> 00:01:22,620
if you are on a Windows machine.

27
00:01:22,620 --> 00:01:25,850
On a Mac, the windows are
not overlaid by default,

28
00:01:25,850 --> 00:01:29,030
so accessing the
console should be easy.

29
00:01:29,030 --> 00:01:32,000
OK, now let's use the Up
arrow to go back to our plot

30
00:01:32,000 --> 00:01:37,770
function, and add the
argument xlab = "Protein"

31
00:01:37,770 --> 00:01:40,390
and that gives us the
label of the x-axis.

32
00:01:40,390 --> 00:01:46,509
Then ylab = "Fat" and this gives
us the label to the y-axis.

33
00:01:46,509 --> 00:01:50,650
And let's add a title to the
plot using the argument main,

34
00:01:50,650 --> 00:01:56,080
say the title is
"Protein vs Fat".

35
00:01:56,080 --> 00:01:58,370
And let's change
the color to red.

36
00:01:58,370 --> 00:02:00,390
And remember to
put quotation marks

37
00:02:00,390 --> 00:02:03,230
around the values of
all these arguments.

38
00:02:03,230 --> 00:02:06,240
And now pressing Enter and going
back to the graphics window,

39
00:02:06,240 --> 00:02:10,979
we see that R made all the
modifications we requested.
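
The labeled version of the scatterplot might look like the sketch below. The data frame `foods` and its values are hypothetical. The video names `xlab`, `ylab`, and `main`; it does not name the color argument, so the use of base R's `col` here is an assumption.

```r
# Hypothetical stand-in data; not the data set used in the video.
foods <- data.frame(
  Protein  = c(0.9, 3.2, 12.6, 20.1, 25.4, 31.0),
  TotalFat = c(81.1, 3.3, 9.5, 13.0, 7.4, 1.2)
)

# Axis labels, title, and color are all passed as quoted strings.
plot(foods$Protein, foods$TotalFat,
     xlab = "Protein",
     ylab = "Fat",
     main = "Protein vs Fat",
     col  = "red")   # assumed argument for the point color
```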

40
00:02:10,979 --> 00:02:15,050
Another way we can visualize our
data is by plotting histograms.

41
00:02:15,050 --> 00:02:17,750
We can do this using the
hist function in R,

42
00:02:17,750 --> 00:02:20,140
but note that this
function now only takes

43
00:02:20,140 --> 00:02:23,620
one variable as an input,
because the y-axis should

44
00:02:23,620 --> 00:02:25,170
have the frequencies.

45
00:02:25,170 --> 00:02:27,940
So let's go back to
our console and create

46
00:02:27,940 --> 00:02:30,670
a histogram of
VitaminC, for instance.

47
00:02:30,670 --> 00:02:34,310
So we're going to use the
hist function, or histogram.

48
00:02:34,310 --> 00:02:39,980
And then the argument that it
takes is the VitaminC vector.

49
00:02:39,980 --> 00:02:44,140
Let's label the
x-axis as "Vitamin C",

50
00:02:44,140 --> 00:02:46,990
and this is given
to us in milligrams.

51
00:02:46,990 --> 00:02:51,500
And give it a title,
say "Histogram

52
00:02:51,500 --> 00:02:56,460
of Vitamin C Levels".

53
00:02:56,460 --> 00:02:57,900
Hmm.

54
00:02:57,900 --> 00:03:00,830
Even though the maximum
vitamin C content

55
00:03:00,830 --> 00:03:04,470
is 2000 milligrams,
most of our foods--

56
00:03:04,470 --> 00:03:07,700
well, to be more precise,
more than 6,000 of them--

57
00:03:07,700 --> 00:03:11,060
have less than 200
milligrams of vitamin C.

58
00:03:11,060 --> 00:03:14,080
And the histogram lumps them
all together in one cell.

59
00:03:14,080 --> 00:03:17,310
Well, it would be nice if we
could zoom into this section

60
00:03:17,310 --> 00:03:20,870
here and get a finer
understanding of the data.

61
00:03:20,870 --> 00:03:24,650
To do this we need to limit
the x-axis to go from zero

62
00:03:24,650 --> 00:03:26,380
to, say, 100 milligrams.

63
00:03:26,380 --> 00:03:29,220
So let's go back to the
console, and then we're

64
00:03:29,220 --> 00:03:33,300
going to add the argument
xlim to limit the x-axis.

65
00:03:33,300 --> 00:03:36,370
And using the combine
function or the c function,

66
00:03:36,370 --> 00:03:38,190
we're going to set
the first input

67
00:03:38,190 --> 00:03:40,680
to be 0, which is the
lowest value that we want

68
00:03:40,680 --> 00:03:44,810
to see on the x-axis, and the
second argument as being 100,

69
00:03:44,810 --> 00:03:49,000
which is the highest value that
we want to see on the x-axis.

70
00:03:49,000 --> 00:03:50,610
And now pressing
Enter, we can

71
00:03:50,610 --> 00:03:55,810
see that R gives us 0
to 100 on the x-axis.

72
00:03:55,810 --> 00:03:58,650
But we only see
this one big cell.

73
00:03:58,650 --> 00:04:01,530
It seems that R only
zoomed into the area,

74
00:04:01,530 --> 00:04:03,370
but it didn't break
that huge cell,

75
00:04:03,370 --> 00:04:06,640
and this doesn't give us
any additional information.

76
00:04:06,640 --> 00:04:11,440
So we really need to break up
the cell into smaller pieces.

77
00:04:11,440 --> 00:04:15,210
And say we want 100 cells,
and since the interval

78
00:04:15,210 --> 00:04:19,070
goes from 0 to 100,
then we would expect R

79
00:04:19,070 --> 00:04:22,560
to create divisions that are
one milligram in length.

80
00:04:22,560 --> 00:04:24,440
So let's do this.

81
00:04:24,440 --> 00:04:26,370
Let's go back to the
console, and then we're

82
00:04:26,370 --> 00:04:30,790
going to add the
argument breaks = 100

83
00:04:30,790 --> 00:04:32,570
and this sets the
number of cells

84
00:04:32,570 --> 00:04:34,080
that we want to see to 100.

85
00:04:34,080 --> 00:04:35,600
So let's see.

86
00:04:35,600 --> 00:04:36,530
Oh.

87
00:04:36,530 --> 00:04:40,790
We actually only see
five cells, and each cell

88
00:04:40,790 --> 00:04:43,000
is 20 milligrams long.

89
00:04:43,000 --> 00:04:44,380
Well, what happened?

90
00:04:44,380 --> 00:04:47,440
We were expecting 100 cells.

91
00:04:47,440 --> 00:04:49,840
Well, remember that the
histogram originally

92
00:04:49,840 --> 00:04:52,600
went far beyond 100 milligrams.

93
00:04:52,600 --> 00:04:55,950
The maximum was 2000 milligrams.

94
00:04:55,950 --> 00:04:59,310
And now if we were to divide
the original interval from 0

95
00:04:59,310 --> 00:05:04,880
to 2000 into 100 cells,
then 2000 divided by 100,

96
00:05:04,880 --> 00:05:07,560
each cell would be
20 milligrams long.

97
00:05:07,560 --> 00:05:09,880
And this is exactly what R did.

98
00:05:09,880 --> 00:05:13,160
It actually divided all
of the spectrum of values

99
00:05:13,160 --> 00:05:16,860
from 0 to 2000 into
100 cells, and not only

100
00:05:16,860 --> 00:05:19,330
the spectrum from 0 to 100.
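
One way to check this arithmetic is to build the histogram without drawing it and inspect the break points. The vector `vitamin_c` below is a hypothetical toy example chosen only so that its values run from 0 to 2000. Note that a single number passed to `breaks` is treated by R as a suggestion, so the exact cell count can differ on other data.

```r
# Hypothetical toy values in mg; not the data set used in the video.
vitamin_c <- c(rep(0, 50), seq(0.5, 90, by = 0.5), 150, 400, 2000)

# plot = FALSE returns the histogram object without drawing it.
h100 <- hist(vitamin_c, breaks = 100, plot = FALSE)

# Width of every cell: the full 0 to 2000 range split into 100 cells.
cell_widths <- diff(h100$breaks)
print(unique(cell_widths))

# Number of cells that start below 100 mg, i.e. the ones visible
# when the x-axis is limited to 0 to 100.
cells_below_100 <- sum(h100$breaks < 100)
print(cells_below_100)
```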

101
00:05:19,330 --> 00:05:23,090
But we still want to divide
the interval 0 to 100

102
00:05:23,090 --> 00:05:26,770
into 100 cells, each
of length 1 milligram.

103
00:05:26,770 --> 00:05:28,440
And how can we do this?

104
00:05:28,440 --> 00:05:30,330
Well, we simply need
to think in terms

105
00:05:30,330 --> 00:05:33,590
of the original interval,
which was 0 to 2000.

106
00:05:33,590 --> 00:05:36,700
And if we were to break
it into 2000 cells,

107
00:05:36,700 --> 00:05:39,090
then each one will be
of length one milligram.

108
00:05:39,090 --> 00:05:42,600
So now we know that
actually we needed

109
00:05:42,600 --> 00:05:44,970
to set the breaks to 2000.

110
00:05:44,970 --> 00:05:46,540
So let's do this.

111
00:05:46,540 --> 00:05:50,290
And now we obtain our
refined histogram.
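
Put together, the refined histogram call could be sketched as follows. The vector `vitamin_c` is again hypothetical toy data ranging from 0 to 2000, and the exact label text is an assumption based on the labels described earlier in the video.

```r
# Hypothetical toy values in mg; not the data set used in the video.
vitamin_c <- c(rep(0, 50), seq(0.5, 90, by = 0.5), 150, 400, 2000)

# breaks = 2000 asks for about 2000 cells over the full 0 to 2000
# range, so each cell is about 1 mg wide; xlim then zooms the
# x-axis to 0 to 100 without changing the cells.
hist(vitamin_c,
     xlab   = "Vitamin C (mg)",
     main   = "Histogram of Vitamin C Levels",
     xlim   = c(0, 100),
     breaks = 2000)
```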

112
00:05:50,290 --> 00:05:52,860
And we see new
information here come up.

113
00:05:52,860 --> 00:05:57,010
Remember our initial conclusion
was that more than 6,000 foods

114
00:05:57,010 --> 00:06:00,290
have less than 200
milligrams of vitamin C.

115
00:06:00,290 --> 00:06:02,240
But now that we
refined our graph,

116
00:06:02,240 --> 00:06:05,310
we obtained an additional
level of information.

117
00:06:05,310 --> 00:06:07,980
Actually, more
than 5,000 of them

118
00:06:07,980 --> 00:06:12,050
have less than one
milligram of vitamin C.

119
00:06:12,050 --> 00:06:15,980
Now a third way we can visualize
the data is using box plots.

120
00:06:15,980 --> 00:06:19,140
So let's go back to the
console and create a box plot

121
00:06:19,140 --> 00:06:19,970
for sugar.

122
00:06:19,970 --> 00:06:23,450
So the function we're going
to be using is simply boxplot,

123
00:06:23,450 --> 00:06:26,170
and similarly to the
histogram function,

124
00:06:26,170 --> 00:06:30,050
the boxplot function only takes
as an input a single vector.

125
00:06:30,050 --> 00:06:33,740
And in this case it would
be the Sugar vector.

126
00:06:33,740 --> 00:06:40,680
And let's create a title that
says "Boxplot of Sugar Levels".

127
00:06:40,680 --> 00:06:43,500
And is it the
y-axis or the x-axis

128
00:06:43,500 --> 00:06:46,820
that we have to label
as the sugar level?

129
00:06:46,820 --> 00:06:49,840
So if we're not so sure,
let's just plot it,

130
00:06:49,840 --> 00:06:52,120
and, oh, this should
be the y-axis.

131
00:06:52,120 --> 00:06:54,840
So let's go back to
the console, and simply

132
00:06:54,840 --> 00:07:00,420
set the y label
to be "Sugar (g)".

133
00:07:00,420 --> 00:07:05,100
And now we have our box
plot with the right labels.
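
A minimal sketch of the box plot call, assuming a hypothetical vector `sugar` of toy values in grams per 100 grams in place of the Sugar column from the video:

```r
# Hypothetical toy values in g per 100 g; not the video's data set.
sugar <- c(0, 0, 0.5, 1.2, 2.4, 4.9, 5.1, 8.7, 12.3, 45, 68, 99.8)

# For a single vector the values are drawn along the y-axis,
# so the unit label goes in ylab.
boxplot(sugar,
        ylab = "Sugar (g)",
        main = "Boxplot of Sugar Levels")
```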

134
00:07:05,100 --> 00:07:06,680
What is it trying
to tell us here?

135
00:07:06,680 --> 00:07:08,520
It looks a little bit strange.

136
00:07:08,520 --> 00:07:11,290
Well, the average of
sugar across the data set

137
00:07:11,290 --> 00:07:12,810
seems to be pretty low.

138
00:07:12,810 --> 00:07:15,290
It's somewhere
around five grams.

139
00:07:15,290 --> 00:07:19,140
But we have a lot of outliers
with extremely high values

140
00:07:19,140 --> 00:07:20,370
of sugar.

141
00:07:20,370 --> 00:07:24,440
There exist some foods that
have almost 100 grams of sugar

142
00:07:24,440 --> 00:07:26,120
in 100 grams.

143
00:07:26,120 --> 00:07:29,550
Well, candies are definitely
among these foods.

144
00:07:29,550 --> 00:07:31,720
So we just reviewed
three ways in which

145
00:07:31,720 --> 00:07:33,600
we can visualize our data.

146
00:07:33,600 --> 00:07:36,840
In Week 8, we will see more
advanced visualization tools

147
00:07:36,840 --> 00:07:39,250
to make more informative plots.

148
00:07:39,250 --> 00:07:42,730
In our next video we will see
how we can construct and add

149
00:07:42,730 --> 00:07:45,790
new variables to our data set.