Pre-processing the data can be difficult, but, luckily, R's packages provide easy-to-use functions for the most common tasks. In this video, we'll load and process our data in R.

In your R console, let's load the data set tweets.csv with the read.csv function. But since we're working with text data here, we need one extra argument, which is stringsAsFactors=FALSE. So we'll call our data set tweets, and we'll use the read.csv function to read in the data file tweets.csv, adding the extra argument stringsAsFactors=FALSE. You'll always need to add this extra argument when working on a text analytics problem so that the text is read in properly.

Now let's take a look at the structure of our data with the str function. We can see that we have 1,181 observations of two variables: the text of the tweet, called Tweet, and the average sentiment score, called Avg for average. The tweet texts are real tweets that we found on the internet directed at Apple, with a few words cleaned up.

We're more interested in being able to detect the tweets with clear negative sentiment, so let's define a new variable in our data set tweets called Negative, and set it equal to as.factor(tweets$Avg <= -1). This will set tweets$Negative equal to TRUE if the average sentiment score is less than or equal to negative 1, and equal to FALSE if the average sentiment score is greater than negative 1. Let's look at a table of this new variable, Negative. We can see that 182 of the 1,181 tweets, or about 15%, are negative.

Now, to pre-process our text data so that we can use the bag of words approach, we'll be using the tm text mining package. We'll need to install and load two packages to do this. First, let's install the package tm, and go ahead and select a CRAN mirror near you. As soon as that package is done installing and you're back at the blinking cursor, go ahead and load it with the library command. Then we also need to install the package SnowballC, which helps us use the tm package, and go ahead and load SnowballC as well.
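Here is a minimal sketch of these setup steps, assuming tweets.csv sits in your working directory (adjust the path to your own setup):

    # Read the tweets; stringsAsFactors=FALSE keeps the text as character strings
    tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
    str(tweets)                # 1,181 observations of Tweet and Avg

    # Flag tweets with clearly negative sentiment (average score <= -1)
    tweets$Negative = as.factor(tweets$Avg <= -1)
    table(tweets$Negative)     # 182 TRUE, 999 FALSE -- about 15% negative

    # Install (once) and load the two text mining packages
    install.packages("tm")
    install.packages("SnowballC")
    library(tm)
    library(SnowballC)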
One of the concepts introduced by the tm package is that of a corpus. A corpus is a collection of documents. We'll need to convert our tweets to a corpus for pre-processing. tm can create a corpus in many different ways, but we'll create it from the Tweet column of our data frame using two functions, Corpus and VectorSource. We'll call our corpus "corpus", and then use the Corpus and VectorSource functions called on the Tweet variable of our tweets data set. So that's tweets$Tweet.

We can check that this has worked by typing corpus and seeing that our corpus has 1,181 text documents. And we can check that the documents match our tweets by using double brackets. So type corpus[[1]]. This shows us the first tweet in our corpus.

Now we're ready to start pre-processing our data. Pre-processing is easy in tm. Each operation, like stemming or removing stop words, can be done with one line in R, where we use the tm_map function. Let's try it out by changing all of the text in our tweets to lowercase. To do that, we'll replace our corpus with the output of the tm_map function, where the first argument is the name of our corpus and the second argument is what we want to do, in this case tolower. tolower is a standard function in R, and this is like when we passed mean to the tapply function: we're passing the tm_map function a function to use on our corpus. Let's see what that did by looking at our first tweet again. Go ahead and hit the up arrow twice to get back to corpus[[1]], and now we can see that all of our letters are lowercase.

Now let's remove all punctuation. This is done in a very similar way, except this time we give the argument removePunctuation instead of tolower. Hit the up arrow twice, and in the tm_map function, delete tolower and type removePunctuation. Let's see what this did to our first tweet again. Now the comma after "say", the exclamation point after "received", and the @ symbols before "Apple" are all gone.
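A minimal sketch of these corpus steps follows, with one caveat: in newer versions of tm, tolower must be wrapped in content_transformer so the result stays a valid corpus (the video's older tm version accepts tolower directly), and as.character() may be needed to print a document's text:

    # Build a corpus from the Tweet column of the data frame
    corpus = Corpus(VectorSource(tweets$Tweet))
    corpus                     # a corpus with 1,181 text documents
    corpus[[1]]                # the first tweet; in newer tm, use as.character(corpus[[1]])

    # Convert all text to lowercase
    corpus = tm_map(corpus, content_transformer(tolower))

    # Strip punctuation
    corpus = tm_map(corpus, removePunctuation)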
Now we want to remove the stop words in our tweets. tm provides a list of stop words for the English language. We can check it out by typing stopwords("english")[1:10]. We see that these are words like "i", "me", "my", "myself", et cetera.

Removing words can be done with the removeWords argument to the tm_map function, but we need one extra argument this time: the stop words that we want to remove. We'll remove all of these English stop words, but we'll also remove the word "apple", since all of these tweets have the word "apple" and it probably won't be very useful in our prediction problem. So go ahead and hit the up arrow to get back to the tm_map function, delete removePunctuation and, instead, type removeWords. Then we need to add one extra argument, c("apple"). This is us removing the word "apple". And then stopwords("english"). So this will remove the word "apple" and all of the English stop words. Let's take a look at our first tweet again to see what happened. Now we can see that we have significantly fewer words, only the words that are not stop words.

Lastly, we want to stem our document with the stemDocument argument. Go ahead and scroll back up to the removePunctuation command, delete removePunctuation, and type stemDocument. If you hit Enter and then look at the first tweet again, we can see that this took off the endings of "customer," "service," "received," and "appstore." A short sketch of these last two commands appears after the closing note below.

In the next video, we'll investigate our corpus and prepare it for our prediction problem.
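To recap, here is a minimal sketch of the stop word removal and stemming steps:

    # Peek at tm's built-in English stop word list
    stopwords("english")[1:10]

    # Remove "apple" plus all of the English stop words
    corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))

    # Stem each word down to its root (uses SnowballC under the hood)
    corpus = tm_map(corpus, stemDocument)
    corpus[[1]]                # e.g. "customer" becomes "custom", "received" becomes "receiv"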