Pre-processing the data can be difficult, but, luckily, R's packages provide easy-to-use functions for the most common tasks. In this video, we'll load and process our data in R.

In your R console, let's load the data set tweets.csv with the read.csv function. But since we're working with text data here, we need one extra argument, which is stringsAsFactors=FALSE. So we'll call our data set tweets, and we'll use the read.csv function to read in the data file tweets.csv, adding the extra argument stringsAsFactors=FALSE. You'll always need to add this extra argument when working on a text analytics problem so that the text is read in properly.

Now let's take a look at the structure of our data with the str function. We can see that we have 1,181 observations of two variables: the text of the tweet, called Tweet, and the average sentiment score, called Avg for average. The tweet texts are real tweets that we found on the internet directed at Apple, with a few words cleaned up.

We're more interested in being able to detect the tweets with clear negative sentiment, so let's define a new variable in our data set tweets called Negative, and set it equal to as.factor(tweets$Avg <= -1). This will set tweets$Negative equal to TRUE if the average sentiment score is less than or equal to negative 1, and equal to FALSE if the average sentiment score is greater than negative 1. Let's look at a table of this new variable, Negative. We can see that 182 of the 1,181 tweets, or about 15%, are negative.

Now, to pre-process our text data so that we can use the bag of words approach, we'll be using the tm text mining package. We'll need to install and load two packages to do this. First, let's install the package tm, and go ahead and select a CRAN mirror near you. As soon as that package is done installing and you're back at the blinking cursor, go ahead and load it with the library command. Then we also need to install the package SnowballC, which helps us use the tm package, and go ahead and load SnowballC as well.
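Here is a minimal sketch of these setup steps, assuming tweets.csv sits in your working directory (adjust the path to your own setup):

    # Read the tweets; stringsAsFactors=FALSE keeps the text as character strings
    tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)
    str(tweets)                # 1,181 observations of Tweet and Avg

    # Flag tweets with clearly negative sentiment (average score <= -1)
    tweets$Negative = as.factor(tweets$Avg <= -1)
    table(tweets$Negative)     # 182 TRUE, 999 FALSE -- about 15% negative

    # Install (once) and load the two text mining packages
    install.packages("tm")
    install.packages("SnowballC")
    library(tm)
    library(SnowballC)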
One of the concepts introduced by the tm package is that of a corpus. A corpus is a collection of documents. We'll need to convert our tweets to a corpus for pre-processing. tm can create a corpus in many different ways, but we'll create it from the Tweet column of our data frame using two functions, Corpus and VectorSource. We'll call our corpus "corpus", and then use the Corpus and VectorSource functions called on the Tweet variable of our tweets data set. So that's tweets$Tweet.

We can check that this has worked by typing corpus and seeing that our corpus has 1,181 text documents. And we can check that the documents match our tweets by using double brackets. So type corpus[[1]]. This shows us the first tweet in our corpus.

Now we're ready to start pre-processing our data. Pre-processing is easy in tm. Each operation, like stemming or removing stop words, can be done with one line in R, where we use the tm_map function. Let's try it out by changing all of the text in our tweets to lowercase. To do that, we'll replace our corpus with the output of the tm_map function, where the first argument is the name of our corpus and the second argument is what we want to do, in this case tolower. tolower is a standard function in R, and this is like when we passed mean to the tapply function: we're passing the tm_map function a function to use on our corpus. Let's see what that did by looking at our first tweet again. Go ahead and hit the up arrow twice to get back to corpus[[1]], and now we can see that all of our letters are lowercase.

Now let's remove all punctuation. This is done in a very similar way, except this time we give the argument removePunctuation instead of tolower. Hit the up arrow twice, and in the tm_map function, delete tolower and type removePunctuation. Let's see what this did to our first tweet again. Now the comma after "say", the exclamation point after "received", and the @ symbols before "Apple" are all gone.
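A minimal sketch of these corpus steps follows, with one caveat: in newer versions of tm, tolower must be wrapped in content_transformer so the result stays a valid corpus (the video's older tm version accepts tolower directly), and as.character() may be needed to print a document's text:

    # Build a corpus from the Tweet column of the data frame
    corpus = Corpus(VectorSource(tweets$Tweet))
    corpus                     # a corpus with 1,181 text documents
    corpus[[1]]                # the first tweet; in newer tm, use as.character(corpus[[1]])

    # Convert all text to lowercase
    corpus = tm_map(corpus, content_transformer(tolower))

    # Strip punctuation
    corpus = tm_map(corpus, removePunctuation)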
Now we want to remove the stop words in our tweets. tm provides a list of stop words for the English language. We can check it out by typing stopwords("english")[1:10]. We see that these are words like "i", "me", "my", "myself", et cetera.

Removing words can be done with the removeWords argument to the tm_map function, but we need one extra argument this time: the stop words that we want to remove. We'll remove all of these English stop words, but we'll also remove the word "apple", since all of these tweets have the word "apple" and it probably won't be very useful in our prediction problem. So go ahead and hit the up arrow to get back to the tm_map function, delete removePunctuation and, instead, type removeWords. Then we need to add one extra argument, c("apple"). This is us removing the word "apple". And then stopwords("english"). So this will remove the word "apple" and all of the English stop words. Let's take a look at our first tweet again to see what happened. Now we can see that we have significantly fewer words, only the words that are not stop words.

Lastly, we want to stem our document with the stemDocument argument. Go ahead and scroll back up to the removePunctuation command, delete removePunctuation, and type stemDocument. If you hit Enter and then look at the first tweet again, we can see that this took off the endings of "customer," "service," "received," and "appstore." A short sketch of these last two commands appears after the closing note below.

In the next video, we'll investigate our corpus and prepare it for our prediction problem.
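To recap, here is a minimal sketch of the stop word removal and stemming steps:

    # Peek at tm's built-in English stop word list
    stopwords("english")[1:10]

    # Remove "apple" plus all of the English stop words
    corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))

    # Stem each word down to its root (uses SnowballC under the hood)
    corpus = tm_map(corpus, stemDocument)
    corpus[[1]]                # e.g. "customer" becomes "custom", "received" becomes "receiv"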