1
00:00:04,500 --> 00:00:07,650
In this lecture, we'll be
trying to understand sentiment

2
00:00:07,650 --> 00:00:11,280
of tweets about
the company Apple.

3
00:00:11,280 --> 00:00:15,530
Apple is a computer company
known for its laptops, phones,

4
00:00:15,530 --> 00:00:18,280
tablets, and personal
media players.

5
00:00:18,280 --> 00:00:21,250
While Apple has a
large number of fans,

6
00:00:21,250 --> 00:00:23,220
they also have a
large number of people

7
00:00:23,220 --> 00:00:24,950
who don't like their products.

8
00:00:24,950 --> 00:00:27,540
And they have
several competitors.

9
00:00:27,540 --> 00:00:30,390
To better understand
public perception,

10
00:00:30,390 --> 00:00:33,530
Apple wants to monitor
how people feel over time

11
00:00:33,530 --> 00:00:36,600
and how people receive
new announcements.

12
00:00:36,600 --> 00:00:40,080
Our challenge in this lecture
is to see if we can correctly

13
00:00:40,080 --> 00:00:44,400
classify tweets as being
negative, positive, or neither

14
00:00:44,400 --> 00:00:47,510
about Apple.

15
00:00:47,510 --> 00:00:50,140
To collect the data
needed for this task,

16
00:00:50,140 --> 00:00:52,820
we had to perform two steps.

17
00:00:52,820 --> 00:00:57,750
The first was to collect data
about tweets from the internet.

18
00:00:57,750 --> 00:01:00,060
Twitter data is
publicly available.

19
00:01:00,060 --> 00:01:02,690
And you can collect it
by scraping the website

20
00:01:02,690 --> 00:01:05,680
or by using a special
interface for programmers

21
00:01:05,680 --> 00:01:09,000
that Twitter provides
called an API.

22
00:01:09,000 --> 00:01:12,210
The sender of the tweet might
be useful to predict sentiment.

23
00:01:12,210 --> 00:01:15,020
But we'll ignore it to
keep our data anonymized.

24
00:01:15,020 --> 00:01:19,190
So we'll just be using
the text of the tweet.

25
00:01:19,190 --> 00:01:22,260
Then we need to construct
the outcome variable

26
00:01:22,260 --> 00:01:25,330
for these tweets, which means
that we have to label them

27
00:01:25,330 --> 00:01:29,170
as positive, negative,
or neutral sentiment.

28
00:01:29,170 --> 00:01:31,960
We would like to label
thousands of tweets.

29
00:01:31,960 --> 00:01:34,539
And we know that two
people might disagree over

30
00:01:34,539 --> 00:01:37,380
the correct
classification of a tweet.

31
00:01:37,380 --> 00:01:40,100
So to do this
efficiently, one option

32
00:01:40,100 --> 00:01:42,890
is to use the Amazon
Mechanical Turk.

33
00:01:45,630 --> 00:01:48,870
So what is the Amazon
Mechanical Turk?

34
00:01:48,870 --> 00:01:53,160
It allows people to break tasks
down into small components

35
00:01:53,160 --> 00:01:56,670
and then enables them to
distribute these tasks online

36
00:01:56,670 --> 00:02:00,250
to be solved by people
all over the world.

37
00:02:00,250 --> 00:02:04,520
People can sign up to perform
the available tasks for a fee.

38
00:02:04,520 --> 00:02:07,220
As the task creator,
we pay the workers

39
00:02:07,220 --> 00:02:09,759
a fixed amount per
completed task.

40
00:02:09,759 --> 00:02:14,470
For example, we might pay $0.02
for a single classified tweet.

41
00:02:14,470 --> 00:02:17,610
The Amazon Mechanical
Turk serves as a broker

42
00:02:17,610 --> 00:02:20,520
and takes a small
cut of the money.

43
00:02:20,520 --> 00:02:23,140
Many of the tasks on
the Mechanical Turk

44
00:02:23,140 --> 00:02:25,829
require human intelligence,
like classifying

45
00:02:25,829 --> 00:02:27,620
the sentiment of a tweet.

46
00:02:27,620 --> 00:02:29,890
But these tasks may
be time-consuming

47
00:02:29,890 --> 00:02:33,120
or require building
otherwise unneeded capacity

48
00:02:33,120 --> 00:02:34,900
for the creator of the task.

49
00:02:34,900 --> 00:02:39,570
And so it's appealing
to outsource the job.

50
00:02:39,570 --> 00:02:42,840
The task that we put on
the Amazon Mechanical Turk

51
00:02:42,840 --> 00:02:45,590
was to judge the
sentiment expressed

52
00:02:45,590 --> 00:02:49,760
by the following item toward
the software company Apple.

53
00:02:49,760 --> 00:02:53,300
The items we gave them were
tweets that we had collected.

54
00:02:53,300 --> 00:02:55,550
The workers could pick
from the following options

55
00:02:55,550 --> 00:03:00,410
as their response: strongly
negative, negative, neutral,

56
00:03:00,410 --> 00:03:03,220
positive, and strongly positive.

57
00:03:03,220 --> 00:03:05,340
We represented each
of these outcomes

58
00:03:05,340 --> 00:03:08,800
as a number on the scale
from negative 2 to 2.

59
00:03:08,800 --> 00:03:12,540
We had five workers
label each tweet.

60
00:03:12,540 --> 00:03:14,930
The graph on the right
shows the distribution

61
00:03:14,930 --> 00:03:16,710
of the number of
tweets classified

62
00:03:16,710 --> 00:03:18,900
into each of the categories.

63
00:03:18,900 --> 00:03:21,760
We can see here that
the majority of tweets

64
00:03:21,760 --> 00:03:25,430
were classified as neutral,
with a small number classified

65
00:03:25,430 --> 00:03:27,490
as strongly negative
or strongly positive.

66
00:03:30,750 --> 00:03:34,670
Then, for each tweet, we take
the average of the five scores

67
00:03:34,670 --> 00:03:36,690
given by the five workers.

68
00:03:36,690 --> 00:03:39,620
For example, the
tweet "LOVE U @APPLE"

69
00:03:39,620 --> 00:03:43,010
was seen as strongly
positive by four of the workers

70
00:03:43,010 --> 00:03:45,260
and positive by
one of the workers.

71
00:03:45,260 --> 00:03:47,460
So it gets a score of 1.8.

72
00:03:47,460 --> 00:03:51,150
The tweet "@apple @twitter
Happy Programmers' Day folks!"

73
00:03:51,150 --> 00:03:54,510
was seen as slightly
positive on average.

74
00:03:54,510 --> 00:03:57,480
And the tweet "So
disappointed in @Apple.

75
00:03:57,480 --> 00:04:00,210
Sold me a Macbook Air
that WONT run my apps.

76
00:04:00,210 --> 00:04:02,170
So I have to drive
hours to return it.

77
00:04:02,170 --> 00:04:06,300
They won't let me ship it."
was seen as pretty negative.
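
As a minimal sketch of the averaging step described above, assuming the five response options map to the numbers -2 through 2 in the order they were listed. The names LABEL_SCORES and average_sentiment are hypothetical and only for illustration; they are not from the lecture.

```python
# Assumed mapping from the five response options to the -2..2 scale.
LABEL_SCORES = {
    "strongly negative": -2,
    "negative": -1,
    "neutral": 0,
    "positive": 1,
    "strongly positive": 2,
}


def average_sentiment(labels):
    """Return the mean numeric score for one tweet's worker labels."""
    scores = [LABEL_SCORES[label] for label in labels]
    return sum(scores) / len(scores)


# "LOVE U @APPLE": four workers chose strongly positive, one chose positive.
love_labels = ["strongly positive"] * 4 + ["positive"]
love_score = average_sentiment(love_labels)
print(love_score)  # (2 + 2 + 2 + 2 + 1) / 5 = 1.8
```

Averaging keeps disagreement between workers visible in the score: a tweet that all five workers call strongly positive gets 2, while one with mixed labels lands somewhere in between.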

78
00:04:06,300 --> 00:04:08,050
So now we have a
bunch of tweets that

79
00:04:08,050 --> 00:04:09,710
are labeled with
their sentiment.

80
00:04:09,710 --> 00:04:12,260
But how do we build
independent variables

81
00:04:12,260 --> 00:04:16,760
from the text of a tweet to be
used to predict the sentiment?

82
00:04:16,760 --> 00:04:19,810
In the next video, we'll
discuss a technique

83
00:04:19,810 --> 00:04:22,820
called bag of words
that transforms text

84
00:04:22,820 --> 00:04:25,370
into independent variables.