In the previous video, we preprocessed our data, and we're now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix that generates a matrix where the rows correspond to documents, in our case tweets, and the columns correspond to words in those tweets. The values in the matrix are the number of times that word appears in each document.

Let's go ahead and generate this matrix and call it "frequencies." So we'll use the DocumentTermMatrix function, called on the corpus that we created in the previous video. Let's take a look at our matrix by typing frequencies. We can see that there are 3,289 terms, or words, in our matrix, and 1,181 documents, or tweets, after preprocessing.

Let's see what this matrix looks like using the inspect function. So type inspect(frequencies[1000:1005, 505:515]). In this range we see that the word "cheer" appears in tweet 1005, but "cheap" doesn't appear in any of these tweets. This data is what we call sparse, meaning that there are many zeros in our matrix.

We can look at what the most popular terms, or words, are with the function findFreqTerms. We want to call this on our matrix, frequencies, and then we want to give it the argument lowfreq, which is the minimum number of times a term must appear to be displayed. Let's type 20. We see here 56 different words. So out of the 3,289 words in our matrix, only 56 words appear at least 20 times in our tweets. This means that we probably have a lot of terms that will be pretty useless for our prediction model.

The number of terms is an issue for two main reasons. One is computational: more terms means more independent variables, which usually means it takes longer to build our models. The other is in building models: as we mentioned before, the ratio of independent variables to observations will affect how well the model generalizes. So let's remove some terms that don't appear very often.
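Before we do that, here is a quick recap of the commands so far in code form. This is a minimal sketch; it assumes the preprocessed corpus from the previous video is stored in a variable named corpus.

    library(tm)

    # Build the document-term matrix: one row per tweet, one column per word
    frequencies = DocumentTermMatrix(corpus)

    # Summarize the matrix: 1,181 documents and 3,289 terms
    frequencies

    # Look at a small slice of the matrix: documents 1000-1005, terms 505-515
    inspect(frequencies[1000:1005, 505:515])

    # Show only the terms that appear at least 20 times across all tweets
    findFreqTerms(frequencies, lowfreq = 20)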
We'll call the output sparse, and we'll use the removeSparseTerms function, called on frequencies, with a sparsity threshold of 0.995. The sparsity threshold works as follows. If we say 0.98, this means to only keep terms that appear in 2% or more of the tweets. If we say 0.99, that means to only keep terms that appear in 1% or more of the tweets. If we say 0.995, that means to only keep terms that appear in 0.5% or more of the tweets, about six or more tweets. We'll go ahead and use this last sparsity threshold.

If you type sparse, you can see that there are only 309 terms in our sparse matrix. This is only about 9% of the previous count of 3,289.

Now let's convert the sparse matrix into a data frame that we'll be able to use for our predictive models. We'll call it tweetsSparse and use the as.data.frame function, called on the as.matrix function, called on our matrix sparse. This converts sparse to a data frame called tweetsSparse.

Since R struggles with variable names that start with a number, and we probably have some words here that start with a number, let's run the make.names function to make sure all of our words are appropriate variable names. To do this, type colnames(tweetsSparse) = make.names(colnames(tweetsSparse)). This will just convert our variable names to make sure they're all appropriate before we build our predictive models. You should do this each time you've built a data frame using text analytics.

Now let's add our dependent variable to this data set: tweetsSparse$Negative = tweets$Negative.

Lastly, let's split our data into a training set and a testing set, putting 70% of the data in the training set. First we'll have to load the caTools library so that we can use the sample.split function. Then let's set the seed to 123 and create our split using sample.split, where the dependent variable is tweetsSparse$Negative and the split ratio is 0.7, so that 70% of the data goes in the training set.
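As a sketch, the commands for this stretch look like the following; here tweets is assumed to be the original data frame holding the Negative labels from the earlier videos.

    # Keep only terms that appear in at least 0.5% of the tweets
    sparse = removeSparseTerms(frequencies, 0.995)

    # Convert the sparse document-term matrix into a data frame for modeling
    tweetsSparse = as.data.frame(as.matrix(sparse))

    # Ensure all column names are valid R variable names
    colnames(tweetsSparse) = make.names(colnames(tweetsSparse))

    # Add the dependent variable from the original data frame (assumed to be tweets)
    tweetsSparse$Negative = tweets$Negative

    # Split the data: 70% for training, 30% for testing
    library(caTools)
    set.seed(123)
    split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)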
Then let's just use subset to create a training set called trainSparse, which will take a subset of the whole data set, tweetsSparse, but only take the observations for which split == TRUE. And we'll create our test set, testSparse, again using subset to take the observations of tweetsSparse, but this time for which split == FALSE.
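In code, this last step is simply:

    # Training set: observations where split is TRUE
    trainSparse = subset(tweetsSparse, split == TRUE)

    # Testing set: observations where split is FALSE
    testSparse = subset(tweetsSparse, split == FALSE)

Our data is now ready, and we can build our predictive model. In the next video, we'll use CART and logistic regression to predict negative sentiment.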