Now that we've prepared our data set, let's use CART to build a predictive model. First, we need to load the necessary packages in our R console by typing library(rpart), and then library(rpart.plot).

Now let's build our model. We'll call it tweetCART, and we'll use the rpart function to predict Negative using all of the other variables as our independent variables and the data set trainSparse. We'll add one more argument, method = "class", so that the rpart function knows to build a classification model. We're just using the default parameter settings, so we won't add anything for minbucket or cp. Now let's plot the tree using the prp function.

Our tree says that if the word "freak" is in the tweet, then predict TRUE, or negative sentiment. If the word "freak" is not in the tweet, but the word "hate" is, again predict TRUE. If neither of these two words is in the tweet, but the word "wtf" is, also predict TRUE, or negative sentiment. If none of these three words is in the tweet, then predict FALSE, or non-negative sentiment.
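The commands narrated above might look like the following sketch. It assumes trainSparse was built in the earlier data-preparation step, with Negative as a factor outcome; the exact tree you get (the splits on "freak", "hate", and "wtf") depends on your data.

```r
# Load the CART packages (install.packages("rpart") and
# install.packages("rpart.plot") first if they are not installed)
library(rpart)
library(rpart.plot)

# Build a classification tree: Negative as the outcome, all other
# variables (the word frequencies) as predictors, default minbucket and cp
tweetCART <- rpart(Negative ~ ., data = trainSparse, method = "class")

# Plot the tree
prp(tweetCART)
```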
This tree makes sense intuitively, since these three words are generally seen as negative words. Now let's go back to our R console and evaluate the numerical performance of our model by making predictions on the test set. We'll call our predictions predictCART, and we'll use the predict function to predict using our model tweetCART on the new data set testSparse. We'll add one more argument, type = "class", to make sure we get class predictions.

Now let's make our confusion matrix using the table function. We'll give as the first argument the actual outcomes, testSparse$Negative, and then as the second argument our predictions, predictCART. To compute the accuracy of our model, we add up the numbers on the diagonal, 294 plus 18 (these are the observations we predicted correctly) and divide by the total number of observations in the table, which is the total number of observations in our test set. So the accuracy of our CART model is about 0.88. Let's compare this to a simple baseline model that always predicts non-negative.
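In code, the evaluation step might look like this. The counts 294 and 18 are the diagonal entries reported in the lecture; your confusion matrix may differ slightly with a different data split.

```r
# Class predictions on the test set
predictCART <- predict(tweetCART, newdata = testSparse, type = "class")

# Confusion matrix: rows are actual outcomes, columns are predictions
table(testSparse$Negative, predictCART)

# Accuracy: correct predictions (the diagonal) over all test observations
(294 + 18) / nrow(testSparse)   # about 0.879 in the lecture's run
```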
To compute the accuracy of the baseline model, let's make a table of just the outcome variable, Negative. So we'll type table, and then in parentheses, testSparse$Negative. This tells us that in our test set we have 300 observations with non-negative sentiment and 55 observations with negative sentiment. So the accuracy of a baseline model that always predicts non-negative would be 300 divided by 355, or about 0.845. So our CART model does better than the simple baseline model.

How about a random forest model? How well would that do? Let's first load the random forest package with library(randomForest), and then we'll set the seed to 123 so that we can replicate our model if we want to. Keep in mind that even if you set the seed to 123, you might get a different random forest model than I do, depending on your operating system. Now let's create our model. We'll call it tweetRF and use the randomForest function to predict Negative, again using all of our other variables as independent variables and the data set trainSparse. We'll again use the default parameter settings.
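A sketch of these two steps, assuming the same trainSparse and testSparse objects as before (install.packages("randomForest") first if needed):

```r
# Baseline: distribution of the outcome variable in the test set
table(testSparse$Negative)
# Accuracy of always predicting non-negative:
300 / 355   # about 0.845, using the counts from the lecture

# Random forest with default parameters; set the seed first so the
# run is reproducible (results can still vary across operating systems)
library(randomForest)
set.seed(123)
tweetRF <- randomForest(Negative ~ ., data = trainSparse)
```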
The random forest model takes significantly longer to build than the CART model. We've seen this before when building CART and random forest models, but in this case the difference is particularly drastic. This is because we have so many independent variables: about 300 different words. So far in this course, we haven't seen data sets with this many independent variables. So keep in mind that for text analytics problems, building a random forest model will take significantly longer than building a CART model.

Now that our model is finished, let's make predictions on our test set. We'll call them predictRF, and again we'll use the predict function to make predictions, this time using the model tweetRF, and again the new data set testSparse. Now let's make our confusion matrix using the table function, first giving the actual outcomes, testSparse$Negative, and then giving our predictions, predictRF. To compute the accuracy of the random forest model, we again sum up the cases we got right, 293 plus 21, and divide by the total number of observations in the table. So our random forest model has an accuracy of about 0.885.
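The evaluation mirrors the CART version; because Negative is a factor, predict on a randomForest object returns class labels by default. The counts 293 and 21 are from the lecture's run and may differ on your machine.

```r
# Predictions on the test set (class labels, since the outcome is a factor)
predictRF <- predict(tweetRF, newdata = testSparse)

# Confusion matrix: actual outcomes vs. random forest predictions
table(testSparse$Negative, predictRF)

# Accuracy: diagonal entries over all test observations
(293 + 21) / nrow(testSparse)   # about 0.885 in the lecture's run
```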
This is a little better than our CART model, but due to the interpretability of our CART model, I'd probably prefer it over the random forest model. If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy would increase to about the same as the random forest model's. So by using a bag-of-words approach and these models, we can reasonably predict sentiment even with a relatively small data set of tweets.
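The cross-validation step is only mentioned, not shown. One way to do it is with the caret package, which is an assumption here, not part of the lecture; the fold count and cp grid below are illustrative choices.

```r
# A sketch of picking cp by 10-fold cross-validation with caret
library(caret)

numFolds <- trainControl(method = "cv", number = 10)
cpGrid <- expand.grid(.cp = seq(0.002, 0.05, by = 0.002))

# train() reports the cp value with the best cross-validated accuracy;
# refit rpart with that cp and re-evaluate on testSparse as before
train(Negative ~ ., data = trainSparse, method = "rpart",
      trControl = numFolds, tuneGrid = cpGrid)
```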