1
00:00:05,080 --> 00:00:08,590
Now it's time to construct
and preprocess the corpus.

2
00:00:08,590 --> 00:00:11,980
So we'll start by loading the
tm package with library(tm).

3
00:00:15,180 --> 00:00:16,600
And now that we've
done that,

4
00:00:16,600 --> 00:00:21,150
we'll construct a variable
called corpus using the Corpus

5
00:00:21,150 --> 00:00:25,000
and VectorSource functions
and passing in all the emails

6
00:00:25,000 --> 00:00:26,630
in our data set,
which is emails$email.

7
00:00:31,360 --> 00:00:33,460
So now that we've
constructed the corpus,

8
00:00:33,460 --> 00:00:37,220
we can output the first
email in the corpus.

9
00:00:37,220 --> 00:00:40,290
We'll start out by calling
the strwrap function to get it

10
00:00:40,290 --> 00:00:43,660
on multiple lines, and then we
can select the first element

11
00:00:43,660 --> 00:00:46,950
in the corpus using the
double square bracket notation

12
00:00:46,950 --> 00:00:50,400
and selecting element 1.

13
00:00:50,400 --> 00:00:53,680
And we can see that
this is exactly

14
00:00:53,680 --> 00:00:56,580
the same email that
we saw originally,

15
00:00:56,580 --> 00:01:00,990
the email about
the working paper.

16
00:01:00,990 --> 00:01:04,480
So now we're ready to preprocess
the corpus using the tm_map

17
00:01:04,480 --> 00:01:05,640
function.

18
00:01:05,640 --> 00:01:08,180
So first, we'll
convert the corpus

19
00:01:08,180 --> 00:01:11,560
to lowercase using tm_map
and the tolower function.

20
00:01:11,560 --> 00:01:13,510
So we'll have corpus =
tm_map(corpus, tolower).

21
00:01:21,000 --> 00:01:23,340
And then we'll do the exact
same thing except removing

22
00:01:23,340 --> 00:01:24,840
punctuation, so we'll have
corpus = tm_map(corpus,

23
00:01:24,840 --> 00:01:25,630
removePunctuation).

24
00:01:36,310 --> 00:01:45,050
We'll remove the stop words
with the removeWords function

25
00:01:45,050 --> 00:01:49,759
and we'll pass along the stop
words of the English language

26
00:01:49,759 --> 00:01:53,270
as the words we want to remove.

27
00:01:53,270 --> 00:01:56,120
And lastly, we're going
to stem the document.

28
00:01:56,120 --> 00:01:57,820
So corpus = tm_map(corpus,
stemDocument).

29
00:02:06,030 --> 00:02:08,770
And now that we've gone through
those four preprocessing steps,

30
00:02:08,770 --> 00:02:11,690
we can take a second look at
the first email in the corpus.

31
00:02:11,690 --> 00:02:13,740
So again, call
strwrap(corpus[[1]]).

32
00:02:20,050 --> 00:02:22,040
And now it looks
quite a bit different.

33
00:02:22,040 --> 00:02:23,900
We can come up to the top here.

34
00:02:23,900 --> 00:02:26,510
It's a lot harder to
read now that we removed

35
00:02:26,510 --> 00:02:29,829
all the stop words and
punctuation and stemmed the words,

36
00:02:29,829 --> 00:02:31,490
but now the emails
in this corpus

37
00:02:31,490 --> 00:02:34,490
are ready for our machine
learning algorithms.