In the previous video, we preprocessed our data, and we're now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix that generates a matrix where the rows correspond to documents, in our case tweets, and the columns correspond to words in those tweets. The values in the matrix are the number of times that word appears in each document.

Let's go ahead and generate this matrix and call it "frequencies." So we'll use the DocumentTermMatrix function, called on the corpus that we created in the previous video. Let's take a look at our matrix by typing frequencies. We can see that there are 3,289 terms, or words, in our matrix, and 1,181 documents, or tweets, after preprocessing.

Let's see what this matrix looks like using the inspect function. So type inspect(frequencies[1000:1005, 505:515]). In this range we see that the word "cheer" appears in tweet 1005, but "cheap" doesn't appear in any of these tweets. This data is what we call sparse, meaning that there are many zeros in our matrix.

We can look at what the most popular terms, or words, are with the function findFreqTerms. We want to call this on our matrix, frequencies, and then we want to give it the argument lowfreq, which is the minimum number of times a term must appear to be displayed. Let's type 20. We see here 56 different words. So out of the 3,289 words in our matrix, only 56 words appear at least 20 times in our tweets. This means that we probably have a lot of terms that will be pretty useless for our prediction model.

The number of terms is an issue for two main reasons. One is computational: more terms means more independent variables, which usually means it takes longer to build our models. The other is in building models: as we mentioned before, the ratio of independent variables to observations will affect how well the model generalizes. So let's remove some terms that don't appear very often.
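Before we do that, here is a quick recap of the commands so far in code form. This is a minimal sketch; it assumes the preprocessed corpus from the previous video is stored in a variable named corpus.

    library(tm)

    # Build the document-term matrix: one row per tweet, one column per word
    frequencies = DocumentTermMatrix(corpus)

    # Summarize the matrix: 1,181 documents and 3,289 terms
    frequencies

    # Look at a small slice of the matrix: documents 1000-1005, terms 505-515
    inspect(frequencies[1000:1005, 505:515])

    # Show only the terms that appear at least 20 times across all tweets
    findFreqTerms(frequencies, lowfreq = 20)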
We'll call the output sparse, and we'll use the removeSparseTerms function, called on frequencies, with a sparsity threshold of 0.995. The sparsity threshold works as follows. If we say 0.98, this means to only keep terms that appear in 2% or more of the tweets. If we say 0.99, that means to only keep terms that appear in 1% or more of the tweets. If we say 0.995, that means to only keep terms that appear in 0.5% or more of the tweets, about six or more tweets. We'll go ahead and use this last sparsity threshold.

If you type sparse, you can see that there are only 309 terms in our sparse matrix. This is only about 9% of the previous count of 3,289.

Now let's convert the sparse matrix into a data frame that we'll be able to use for our predictive models. We'll call it tweetsSparse and use the as.data.frame function, called on the as.matrix function, called on our matrix sparse. This converts sparse to a data frame called tweetsSparse.

Since R struggles with variable names that start with a number, and we probably have some words here that start with a number, let's run the make.names function to make sure all of our words are appropriate variable names. To do this, type colnames(tweetsSparse) = make.names(colnames(tweetsSparse)). This will just convert our variable names to make sure they're all appropriate before we build our predictive models. You should do this each time you've built a data frame using text analytics.

Now let's add our dependent variable to this data set: tweetsSparse$Negative = tweets$Negative.

Lastly, let's split our data into a training set and a testing set, putting 70% of the data in the training set. First we'll have to load the caTools library so that we can use the sample.split function. Then let's set the seed to 123 and create our split using sample.split, where the dependent variable is tweetsSparse$Negative and the split ratio is 0.7, so that 70% of the data goes in the training set.
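As a sketch, the commands for this stretch look like the following; here tweets is assumed to be the original data frame holding the Negative labels from the earlier videos.

    # Keep only terms that appear in at least 0.5% of the tweets
    sparse = removeSparseTerms(frequencies, 0.995)

    # Convert the sparse document-term matrix into a data frame for modeling
    tweetsSparse = as.data.frame(as.matrix(sparse))

    # Ensure all column names are valid R variable names
    colnames(tweetsSparse) = make.names(colnames(tweetsSparse))

    # Add the dependent variable from the original data frame (assumed to be tweets)
    tweetsSparse$Negative = tweets$Negative

    # Split the data: 70% for training, 30% for testing
    library(caTools)
    set.seed(123)
    split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)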
Then let's just use subset to create a training set called trainSparse, which will take a subset of the whole data set, tweetsSparse, but only take the observations for which split == TRUE. And we'll create our test set, testSparse, again using subset to take the observations of tweetsSparse, but this time for which split == FALSE.
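In code, this last step is simply:

    # Training set: observations where split is TRUE
    trainSparse = subset(tweetsSparse, split == TRUE)

    # Testing set: observations where split is FALSE
    testSparse = subset(tweetsSparse, split == FALSE)

Our data is now ready, and we can build our predictive model. In the next video, we'll use CART and logistic regression to predict negative sentiment.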