In this lecture, we'll use a technique called Bag of Words to build text analytics models. Fully understanding text is difficult, but Bag of Words provides a very simple approach. It just counts the number of times each word appears in the text and uses these counts as the independent variables.

For example, in the sentence, "This course is great. I would recommend this course to my friends," the word "this" is seen twice, the word "course" is seen twice, the word "great" is seen once, et cetera. In Bag of Words, there's one feature for each word. This is a very simple approach, but it is often very effective, too. It's used as a baseline in text analytics projects and in natural language processing.

This isn't the whole story, though. Preprocessing the text can dramatically improve the performance of the Bag of Words method.

One part of preprocessing the text is to clean up irregularities. Text data often has many inconsistencies that will cause algorithms trouble. Computers are very literal by default. "Apple" with just an uppercase A, "APPLE" all in uppercase letters, or "ApPLe" with a mixture of uppercase and lowercase letters will all be counted separately. We want to change the text so that all three versions of Apple will be counted as the same word, by changing all words either to uppercase or to lowercase. We'll typically change all the letters to lowercase, so these three versions of Apple will all become "apple" with lowercase letters and will be counted as the same word.

Punctuation can also cause problems. The basic approach to deal with this is to remove everything that isn't a standard number or letter. However, sometimes punctuation is meaningful. In the case of Twitter, @Apple denotes a message to Apple, and #Apple is a message about Apple. For web addresses, the punctuation often defines the web address. For these reasons, the removal of punctuation should be tailored to the specific problem.
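Before settling on how to treat punctuation, here is a minimal sketch in base R of the lowercasing and counting steps described so far. This code is just an illustration, not taken from the course:

    # A minimal sketch (illustration only) of lowercasing and counting in base R
    sentence <- "This course is great. I would recommend this course to my friends."

    # Lowercase first, so "This" and "this" are treated as the same word,
    # then split on anything that isn't a letter
    words <- unlist(strsplit(tolower(sentence), "[^a-z]+"))

    # The Bag of Words features are simply the counts of each word
    table(words)
    # "course" and "this" each appear twice; every other word appears once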
In our case, we will remove all punctuation, so @Apple, Apple with an exclamation point, and Apple with dashes will all count as just "apple".

Another preprocessing task we want to do is to remove unhelpful terms. Many words are frequently used but are only meaningful in a sentence. These are called stop words. Examples are "the", "is", "at", and "which". It's unlikely that these words will improve the machine learning prediction quality, so we want to remove them to reduce the size of the data.

There are some potential problems with this approach. Sometimes, two stop words taken together have a very important meaning. For example, "The Who", which is a combination of two stop words, is actually the name of the band we see on the right here. By removing the stop words, we remove both of these words, but "The Who" might actually have a significant meaning for our prediction task. Another example is the phrase "Take That". If we remove the stop words, we'll remove the word "that," so the phrase would just say "take." It no longer has the same meaning as before. So while removing stop words is sometimes not helpful, it generally is a very helpful preprocessing step.

Lastly, an important preprocessing step is called stemming. This step is motivated by the desire to represent words with different endings as the same word. We probably do not need to draw a distinction between argue, argued, argues, and arguing. They could all be represented by a common stem, argue. The algorithmic process of performing this reduction is called stemming.

There are many ways to approach the problem. One approach is to build a database of words and their stems. A pro is that this approach handles exceptions very nicely, since we have defined all of the stems. However, it won't handle new words at all, since they are not in the database. This is especially bad for problems where we're using data from the internet, since we have no idea what words will be used.
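To make the database idea concrete, here is a tiny sketch in base R using a hand-built lookup table. The entries are hypothetical and exist only for illustration:

    # A minimal sketch of the dictionary approach to stemming.
    # These entries are hypothetical, just to show the idea.
    stem_db <- c(argued = "argue", argues = "argue", arguing = "argue",
                 children = "child")   # exceptions can be handled explicitly

    stem_lookup <- function(word) {
      # Unknown words pass through unchanged, since they aren't in the database
      if (word %in% names(stem_db)) stem_db[[word]] else word
    }

    stem_lookup("children")   # "child": the exception is handled correctly
    stem_lookup("blogging")   # "blogging": a new word the database misses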
A different approach is to write a rule-based algorithm. In this approach, if a word ends in things like "ed", "ing", or "ly", we would remove the ending. A pro of this approach is that it handles new or unknown words well. However, there are many exceptions, and this approach would miss all of them. Words like child and children would be considered different, but it would get other plurals, like dog and dogs.

This second approach is widely used and is called the Porter Stemmer. It was designed by Martin Porter in 1980, and it's still used today. Stemmers like this one have been written for many languages. Other options for stemming include machine learning, where algorithms are trained to recognize the roots of words, and combinations of the approaches explained here.

As a real example from our data set, the phrase "by far the best customer care service I have ever received" has three words that would be stemmed: customer, service, and received. The "er" would be removed in customer, the "e" would be removed in service, and the "ed" would be removed in received.

In the next video, we'll see how to run these preprocessing steps in R.
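As a small preview, here is a minimal sketch of all of these steps together, assuming the tm package (a common choice for text mining in R) and SnowballC (which supplies the Porter stemmer). Treat the package choice as an assumption for illustration; the course's own code may differ:

    # A sketch of the full preprocessing pipeline, assuming tm and SnowballC
    library(tm)          # corpora, tm_map, stopwords, removePunctuation
    library(SnowballC)   # Porter stemmer used by tm's stemDocument

    corpus <- Corpus(VectorSource(
      "By far the best customer care service I have ever received!"))

    corpus <- tm_map(corpus, content_transformer(tolower))      # APPLE, ApPLe -> apple
    corpus <- tm_map(corpus, removePunctuation)                 # drop punctuation characters
    corpus <- tm_map(corpus, removeWords, stopwords("english")) # drop "the", "is", "at", ...
    corpus <- tm_map(corpus, stemDocument)                      # Porter: received -> receiv
    corpus <- tm_map(corpus, stripWhitespace)                   # tidy gaps left by removeWords

    content(corpus[[1]])
    # roughly: "far best custom care servic ever receiv"

Note that the Porter stemmer's output is a stem like "argu" or "receiv" rather than a dictionary word; that's fine, since all we need is for related word forms to map to the same feature.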