1 00:00:05,080 --> 00:00:08,590 Now it's time to construct and preprocess the corpus. 2 00:00:08,590 --> 00:00:11,980 So we'll start by loading the tm package with library(tm). 3 00:00:15,180 --> 00:00:16,600 And now that we'll have done that, 4 00:00:16,600 --> 00:00:21,150 we'll construct a variable called corpus using the corpus 5 00:00:21,150 --> 00:00:25,000 and vector source functions and passing in all the emails 6 00:00:25,000 --> 00:00:26,630 in our data set, which is emails$email. 7 00:00:31,360 --> 00:00:33,460 So now that we've constructed the corpus, 8 00:00:33,460 --> 00:00:37,220 we can output the first email in the corpus. 9 00:00:37,220 --> 00:00:40,290 We'll start out by calling the strwrap function to get it 10 00:00:40,290 --> 00:00:43,660 on multiple lines, and then we can select the first element 11 00:00:43,660 --> 00:00:46,950 in the corpus using the double square bracket notation 12 00:00:46,950 --> 00:00:50,400 and selecting element 1. 13 00:00:50,400 --> 00:00:53,680 And we can see that this is exactly 14 00:00:53,680 --> 00:00:56,580 the same email that we saw originally, 15 00:00:56,580 --> 00:01:00,990 the email about the working paper. 16 00:01:00,990 --> 00:01:04,480 So now we're ready to preprocess the corpus using the tm map 17 00:01:04,480 --> 00:01:05,640 function. 18 00:01:05,640 --> 00:01:08,180 So first, we'll convert the corpus 19 00:01:08,180 --> 00:01:11,560 to lowercase using tm map and the two lower function. 20 00:01:11,560 --> 00:01:13,510 So we'll have corpus = tm_map(corpus, tolower). 21 00:01:21,000 --> 00:01:23,340 And then we'll do the exact same thing except removing 22 00:01:23,340 --> 00:01:24,840 punctuation, so we'll have corpus = tm_map(corpus, 23 00:01:24,840 --> 00:01:25,630 removePunctuation). 24 00:01:36,310 --> 00:01:45,050 We'll remove the stop words with remove words function 25 00:01:45,050 --> 00:01:49,759 and we'll pass along the stop words of the English language 26 00:01:49,759 --> 00:01:53,270 as the words we want to remove. 27 00:01:53,270 --> 00:01:56,120 And lastly, we're going to stem the document. 28 00:01:56,120 --> 00:01:57,820 So corpus = tm_map(corpus, stemDocument). 29 00:02:06,030 --> 00:02:08,770 And now that we've gone through those four preprocessing steps, 30 00:02:08,770 --> 00:02:11,690 we can take a second look at the first email in the corpus. 31 00:02:11,690 --> 00:02:13,740 So again, call strwrap(corpusstrwrap(corpus{[1]). 32 00:02:20,050 --> 00:02:22,040 And now it looks quite a bit different. 33 00:02:22,040 --> 00:02:23,900 We can come up to the top here. 34 00:02:23,900 --> 00:02:26,510 It's a lot harder to read now that we removed 35 00:02:26,510 --> 00:02:29,829 all the stop words and punctuation and word stems, 36 00:02:29,829 --> 00:02:31,490 but now the emails in this corpus 37 00:02:31,490 --> 00:02:34,490 are ready for our machine learning algorithms.