In this lecture, we'll use a technique called Bag of Words to build text analytics models. Fully understanding text is difficult, but Bag of Words provides a very simple approach. It just counts the number of times each word appears in the text and uses these counts as the independent variables.

For example, in the sentence, "This course is great. I would recommend this course to my friends," the word "this" is seen twice, the word "course" is seen twice, the word "great" is seen once, et cetera. In Bag of Words, there's one feature for each word. This is a very simple approach, but it is often very effective, too. It's used as a baseline in text analytics projects and in natural language processing.

This isn't the whole story, though. Preprocessing the text can dramatically improve the performance of the Bag of Words method.

One part of preprocessing the text is to clean up irregularities. Text data often has many inconsistencies that will cause algorithms trouble. Computers are very literal by default. "Apple" with just an uppercase A, "APPLE" all in uppercase letters, or "ApPLe" with a mixture of uppercase and lowercase letters will all be counted separately. We want to change the text so that all three versions of Apple will be counted as the same word, by changing all words either to uppercase or to lowercase. We'll typically change all the letters to lowercase, so these three versions of Apple will all become "apple" with lowercase letters and will be counted as the same word.

Punctuation can also cause problems. The basic approach to deal with this is to remove everything that isn't a standard number or letter. However, sometimes punctuation is meaningful. In the case of Twitter, @Apple denotes a message to Apple, and #Apple is a message about Apple. For web addresses, the punctuation often defines the web address. For these reasons, the removal of punctuation should be tailored to the specific problem.
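Before settling on how to treat punctuation, here is a minimal sketch in base R of the lowercasing and counting steps described so far. This code is just an illustration, not taken from the course:

    # A minimal sketch (illustration only) of lowercasing and counting in base R
    sentence <- "This course is great. I would recommend this course to my friends."

    # Lowercase first, so "This" and "this" are treated as the same word,
    # then split on anything that isn't a letter
    words <- unlist(strsplit(tolower(sentence), "[^a-z]+"))

    # The Bag of Words features are simply the counts of each word
    table(words)
    # "course" and "this" each appear twice; every other word appears once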
In our case, we will remove all punctuation, so @Apple, Apple with an exclamation point, and Apple with dashes will all count as just "apple".

Another preprocessing task we want to do is to remove unhelpful terms. Many words are frequently used but are only meaningful in a sentence. These are called stop words. Examples are "the", "is", "at", and "which". It's unlikely that these words will improve the machine learning prediction quality, so we want to remove them to reduce the size of the data.

There are some potential problems with this approach. Sometimes, two stop words taken together have a very important meaning. For example, "The Who", which is a combination of two stop words, is actually the name of the band we see on the right here. By removing the stop words, we remove both of these words, but "The Who" might actually have a significant meaning for our prediction task. Another example is the phrase "Take That". If we remove the stop words, we'll remove the word "that," so the phrase would just say "take." It no longer has the same meaning as before. So while removing stop words is sometimes not helpful, it generally is a very helpful preprocessing step.

Lastly, an important preprocessing step is called stemming. This step is motivated by the desire to represent words with different endings as the same word. We probably do not need to draw a distinction between argue, argued, argues, and arguing. They could all be represented by a common stem, argue. The algorithmic process of performing this reduction is called stemming.

There are many ways to approach the problem. One approach is to build a database of words and their stems. A pro is that this approach handles exceptions very nicely, since we have defined all of the stems. However, it won't handle new words at all, since they are not in the database. This is especially bad for problems where we're using data from the internet, since we have no idea what words will be used.
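To make the database idea concrete, here is a tiny sketch in base R using a hand-built lookup table. The entries are hypothetical and exist only for illustration:

    # A minimal sketch of the dictionary approach to stemming.
    # These entries are hypothetical, just to show the idea.
    stem_db <- c(argued = "argue", argues = "argue", arguing = "argue",
                 children = "child")   # exceptions can be handled explicitly

    stem_lookup <- function(word) {
      # Unknown words pass through unchanged, since they aren't in the database
      if (word %in% names(stem_db)) stem_db[[word]] else word
    }

    stem_lookup("children")   # "child": the exception is handled correctly
    stem_lookup("blogging")   # "blogging": a new word the database misses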
A different approach is to write a rule-based algorithm. In this approach, if a word ends in things like "ed", "ing", or "ly", we would remove the ending. A pro of this approach is that it handles new or unknown words well. However, there are many exceptions, and this approach would miss all of them. Words like child and children would be considered different, but it would get other plurals, like dog and dogs.

This second approach is widely used and is called the Porter Stemmer. It was designed by Martin Porter in 1980, and it's still used today. Stemmers like this one have been written for many languages. Other options for stemming include machine learning, where algorithms are trained to recognize the roots of words, and combinations of the approaches explained here.

As a real example from our data set, the phrase "by far the best customer care service I have ever received" has three words that would be stemmed: customer, service, and received. The "er" would be removed in customer, the "e" would be removed in service, and the "ed" would be removed in received.

In the next video, we'll see how to run these preprocessing steps in R.
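As a small preview, here is a minimal sketch of all of these steps together, assuming the tm package (a common choice for text mining in R) and SnowballC (which supplies the Porter stemmer). Treat the package choice as an assumption for illustration; the course's own code may differ:

    # A sketch of the full preprocessing pipeline, assuming tm and SnowballC
    library(tm)          # corpora, tm_map, stopwords, removePunctuation
    library(SnowballC)   # Porter stemmer used by tm's stemDocument

    corpus <- Corpus(VectorSource(
      "By far the best customer care service I have ever received!"))

    corpus <- tm_map(corpus, content_transformer(tolower))      # APPLE, ApPLe -> apple
    corpus <- tm_map(corpus, removePunctuation)                 # drop punctuation characters
    corpus <- tm_map(corpus, removeWords, stopwords("english")) # drop "the", "is", "at", ...
    corpus <- tm_map(corpus, stemDocument)                      # Porter: received -> receiv
    corpus <- tm_map(corpus, stripWhitespace)                   # tidy gaps left by removeWords

    content(corpus[[1]])
    # roughly: "far best custom care servic ever receiv"

Note that the Porter stemmer's output is a stem like "argu" or "receiv" rather than a dictionary word; that's fine, since all we need is for related word forms to map to the same feature.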