Now let's build the document-term matrix for our corpus. We'll create a variable called dtm that contains DocumentTermMatrix(corpus). The corpus has already had all of the pre-processing run on it. To get the summary statistics about the document-term matrix, we'll just type in the name of our variable, dtm. What we can see is that even though we have only 855 emails in the corpus, we have over 22,000 terms that showed up at least once, which is clearly too many variables for the number of observations we have.

So we want to remove the terms that don't appear very often in our data set, and we'll do that using the removeSparseTerms function. We'll have to determine the sparsity, so we'll say that we'll remove any term that doesn't appear in at least 3% of the documents. To do that, we'll pass 0.97 to removeSparseTerms. Now we can take a look at the summary statistics for the document-term matrix, and we can see that we've decreased the number of terms to 788, which is a much more reasonable number.

So let's build a data frame called labeledTerms out of this document-term matrix. To do this, we'll use as.data.frame of as.matrix applied to dtm, the document-term matrix. Right now, this data frame only includes the frequencies of the words that appeared in at least 3% of the documents, but in order to run our text analytics models, we're also going to need the outcome variable, which is whether or not each email was responsive. So we need to add in this outcome variable. We'll create labeledTerms$responsive, and we'll simply copy over the responsive variable from the original emails data frame, so it's equal to emails$responsive.

Finally, let's take a look at our newly constructed data frame with the str function. As we expect, we have a lot of variables, 789 in total. 788 of those variables are the frequencies of various words in the emails, and the last one is responsive, the outcome variable.
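For reference, here is a minimal sketch in R of the steps walked through above, assuming the tm package is loaded and that the pre-processed corpus and the original emails data frame (with its responsive column) are already in the workspace:

library(tm)

# Build the document-term matrix from the pre-processed corpus
dtm <- DocumentTermMatrix(corpus)
dtm                                   # summary: 855 documents, over 22,000 terms

# Keep only terms that appear in at least 3% of the documents
# (i.e., drop terms whose sparsity exceeds 0.97)
dtm <- removeSparseTerms(dtm, 0.97)
dtm                                   # summary: 788 terms remain

# Convert the matrix of word frequencies into a data frame
labeledTerms <- as.data.frame(as.matrix(dtm))

# Add the outcome variable from the original emails data frame
labeledTerms$responsive <- emails$responsive

# Inspect the result: 788 word-frequency variables plus responsive, 789 in total
str(labeledTerms)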