Now let's build the document-term matrix for our corpus. We'll create a variable called dtm that contains DocumentTermMatrix(corpus). The corpus has already had all of the pre-processing run on it. To get the summary statistics about the document-term matrix, we'll just type in the name of our variable, dtm. What we can see is that even though we have only 855 emails in the corpus, we have over 22,000 terms that showed up at least once, which is clearly too many variables for the number of observations we have.

So we want to remove the terms that don't appear very often in our data set, and we'll do that using the removeSparseTerms function. We'll have to determine the sparsity, so we'll say that we'll remove any term that doesn't appear in at least 3% of the documents. To do that, we'll pass 0.97 to removeSparseTerms. Now we can take a look at the summary statistics for the document-term matrix, and we can see that we've decreased the number of terms to 788, which is a much more reasonable number.

So let's build a data frame called labeledTerms out of this document-term matrix. To do this, we'll use as.data.frame of as.matrix applied to dtm, the document-term matrix. Right now, this data frame only includes the frequencies of the words that appeared in at least 3% of the documents, but in order to run our text analytics models, we're also going to need the outcome variable, which is whether or not each email was responsive. So we need to add in this outcome variable. We'll create labeledTerms$responsive, and we'll simply copy over the responsive variable from the original emails data frame, so it's equal to emails$responsive.

Finally, let's take a look at our newly constructed data frame with the str function. As we expect, we have a lot of variables, 789 in total. 788 of those variables are the frequencies of various words in the emails, and the last one is responsive, the outcome variable.
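For reference, here is a minimal sketch in R of the steps walked through above, assuming the tm package is loaded and that the pre-processed corpus and the original emails data frame (with its responsive column) are already in the workspace:

library(tm)

# Build the document-term matrix from the pre-processed corpus
dtm <- DocumentTermMatrix(corpus)
dtm                                   # summary: 855 documents, over 22,000 terms

# Keep only terms that appear in at least 3% of the documents
# (i.e., drop terms whose sparsity exceeds 0.97)
dtm <- removeSparseTerms(dtm, 0.97)
dtm                                   # summary: 788 terms remain

# Convert the matrix of word frequencies into a data frame
labeledTerms <- as.data.frame(as.matrix(dtm))

# Add the outcome variable from the original emails data frame
labeledTerms$responsive <- emails$responsive

# Inspect the result: 788 word-frequency variables plus responsive, 789 in total
str(labeledTerms)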