1
00:00:04,500 --> 00:00:07,650
In this lecture, we'll be trying to understand sentiment

2
00:00:07,650 --> 00:00:11,280
of tweets about the company Apple.

3
00:00:11,280 --> 00:00:15,530
Apple is a computer company known for its laptops, phones,

4
00:00:15,530 --> 00:00:18,280
tablets, and personal media players.

5
00:00:18,280 --> 00:00:21,250
While Apple has a large number of fans,

6
00:00:21,250 --> 00:00:23,220
they also have a large number of people

7
00:00:23,220 --> 00:00:24,950
who don't like their products.

8
00:00:24,950 --> 00:00:27,540
And they have several competitors.

9
00:00:27,540 --> 00:00:30,390
To better understand public perception,

10
00:00:30,390 --> 00:00:33,530
Apple wants to monitor how people feel over time

11
00:00:33,530 --> 00:00:36,600
and how people receive new announcements.

12
00:00:36,600 --> 00:00:40,080
Our challenge in this lecture is to see if we can correctly

13
00:00:40,080 --> 00:00:44,400
classify tweets as being negative, positive, or neither

14
00:00:44,400 --> 00:00:47,510
about Apple.

15
00:00:47,510 --> 00:00:50,140
To collect the data needed for this task,

16
00:00:50,140 --> 00:00:52,820
we had to perform two steps.

17
00:00:52,820 --> 00:00:57,750
The first was to collect data about tweets from the internet.

18
00:00:57,750 --> 00:01:00,060
Twitter data is publicly available.

19
00:01:00,060 --> 00:01:02,690
And you can collect it through scraping the website

20
00:01:02,690 --> 00:01:05,680
or by using a special interface for programmers

21
00:01:05,680 --> 00:01:09,000
that Twitter provides called an API.

22
00:01:09,000 --> 00:01:12,210
The sender of the tweet might be useful to predict sentiment.

23
00:01:12,210 --> 00:01:15,020
But we'll ignore it to keep our data anonymized.

24
00:01:15,020 --> 00:01:19,190
So we'll just be using the text of the tweet.

25
00:01:19,190 --> 00:01:22,260
Then we need to construct the outcome variable

26
00:01:22,260 --> 00:01:25,330
for these tweets, which means that we have to label them

27
00:01:25,330 --> 00:01:29,170
as positive, negative, or neutral sentiment.

28
00:01:29,170 --> 00:01:31,960
We would like to label thousands of tweets.

29
00:01:31,960 --> 00:01:34,539
And we know that two people might disagree over

30
00:01:34,539 --> 00:01:37,380
the correct classification of a tweet.

31
00:01:37,380 --> 00:01:40,100
So to do this efficiently, one option

32
00:01:40,100 --> 00:01:42,890
is to use the Amazon Mechanical Turk.

33
00:01:45,630 --> 00:01:48,870
So what is the Amazon Mechanical Turk?

34
00:01:48,870 --> 00:01:53,160
It allows people to break tasks down into small components

35
00:01:53,160 --> 00:01:56,670
and then enables them to distribute these tasks online

36
00:01:56,670 --> 00:02:00,250
to be solved by people all over the world.

37
00:02:00,250 --> 00:02:04,520
People can sign up to perform the available tasks for a fee.

38
00:02:04,520 --> 00:02:07,220
As the task creator, we pay the workers

39
00:02:07,220 --> 00:02:09,759
a fixed amount per completed task.

40
00:02:09,759 --> 00:02:14,470
For example, we might pay $0.02 for a single classified tweet.

41
00:02:14,470 --> 00:02:17,610
The Amazon Mechanical Turk serves as a broker

42
00:02:17,610 --> 00:02:20,520
and takes a small cut of the money.

43
00:02:20,520 --> 00:02:23,140
Many of the tasks on the Mechanical Turk

44
00:02:23,140 --> 00:02:25,829
require human intelligence, like classifying

45
00:02:25,829 --> 00:02:27,620
the sentiment of a tweet.

46
00:02:27,620 --> 00:02:29,890
But these tasks may be time consuming

47
00:02:29,890 --> 00:02:33,120
or require building otherwise unneeded capacity

48
00:02:33,120 --> 00:02:34,900
for the creator of the task.

49
00:02:34,900 --> 00:02:39,570
And so it's appealing to outsource the job.

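To get a rough feel for the per-task pricing just described, the arithmetic can be sketched in Python. This is only an illustrative estimate, not part of the lecture: the function name `labeling_cost`, the tweet count, the number of labels per tweet, and the 10% broker fee are all hypothetical, and the sketch assumes the fee is charged on top of worker pay. Only the $0.02 per classified tweet comes from the lecture.

```python
from decimal import Decimal

# From the lecture: $0.02 paid for a single classified tweet.
PAY_PER_LABEL = Decimal("0.02")


def labeling_cost(num_tweets, labels_per_tweet, fee_rate):
    """Estimate total cost: worker pay plus a broker fee on that pay.

    fee_rate is a hypothetical fraction (the lecture only says the
    broker takes "a small cut"), assumed here to be added on top.
    """
    worker_pay = PAY_PER_LABEL * num_tweets * labels_per_tweet
    return worker_pay * (1 + fee_rate)


# Hypothetical job: 1,000 tweets, 5 labels each, 10% broker fee.
total = labeling_cost(1000, 5, Decimal("0.10"))
print(f"Estimated cost: ${total:.2f}")
```

Decimal is used instead of float so that cent amounts such as 0.02 are represented exactly.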
50
00:02:39,570 --> 00:02:42,840
The task that we put on the Amazon Mechanical Turk

51
00:02:42,840 --> 00:02:45,590
was to judge the sentiment expressed

52
00:02:45,590 --> 00:02:49,760
by the following item toward the software company Apple.

53
00:02:49,760 --> 00:02:53,300
The items we gave them were tweets that we had collected.

54
00:02:53,300 --> 00:02:55,550
The workers could pick from the following options

55
00:02:55,550 --> 00:03:00,410
as their response-- strongly negative, negative, neutral,

56
00:03:00,410 --> 00:03:03,220
positive, and strongly positive.

57
00:03:03,220 --> 00:03:05,340
We represented each of these outcomes

58
00:03:05,340 --> 00:03:08,800
as a number on the scale from negative 2 to 2.

59
00:03:08,800 --> 00:03:12,540
We had five workers label each tweet.

60
00:03:12,540 --> 00:03:14,930
The graph on the right shows the distribution

61
00:03:14,930 --> 00:03:16,710
of the number of tweets classified

62
00:03:16,710 --> 00:03:18,900
into each of the categories.

63
00:03:18,900 --> 00:03:21,760
We can see here that the majority of tweets

64
00:03:21,760 --> 00:03:25,430
were classified as neutral, with a small number classified

65
00:03:25,430 --> 00:03:27,490
as strongly negative or strongly positive.

66
00:03:30,750 --> 00:03:34,670
Then, for each tweet, we take the average of the five scores

67
00:03:34,670 --> 00:03:36,690
given by the five workers.

68
00:03:36,690 --> 00:03:39,620
For example, the tweet "LOVE U @APPLE"

69
00:03:39,620 --> 00:03:43,010
was seen as strongly positive by 4 of the workers

70
00:03:43,010 --> 00:03:45,260
and positive by one of the workers.

71
00:03:45,260 --> 00:03:47,460
So it gets a score of 1.8.

72
00:03:47,460 --> 00:03:51,150
The tweet "@apple @twitter Happy Programmers' Day folks!"

73
00:03:51,150 --> 00:03:54,510
was seen as slightly positive on average.

74
00:03:54,510 --> 00:03:57,480
And the tweet "So disappointed in @Apple.

75
00:03:57,480 --> 00:04:00,210
Sold me a Macbook Air that WONT run my apps.

76
00:04:00,210 --> 00:04:02,170
So I have to drive hours to return it.

77
00:04:02,170 --> 00:04:06,300
They won't let me ship it." was seen as pretty negative.

78
00:04:06,300 --> 00:04:08,050
So now we have a bunch of tweets that

79
00:04:08,050 --> 00:04:09,710
are labeled with their sentiment.

80
00:04:09,710 --> 00:04:12,260
But how do we build independent variables

81
00:04:12,260 --> 00:04:16,760
from the text of a tweet to be used to predict the sentiment?

82
00:04:16,760 --> 00:04:19,810
In the next video, we'll discuss a technique

83
00:04:19,810 --> 00:04:22,820
called bag of words that transforms text

84
00:04:22,820 --> 00:04:25,370
into independent variables.
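
The averaging step described earlier in this lecture can be sketched in Python. This is a minimal sketch and not the lecture's own code: the names `LABEL_SCORES` and `average_sentiment` are assumptions made for illustration. The five response options, the scale from negative 2 to 2, and the "LOVE U @APPLE" example (four strongly positive labels and one positive label) come from the lecture.

```python
from statistics import mean

# Map each response option to the -2..2 scale used in the lecture.
LABEL_SCORES = {
    "strongly negative": -2,
    "negative": -1,
    "neutral": 0,
    "positive": 1,
    "strongly positive": 2,
}


def average_sentiment(labels):
    """Return the mean numeric score of the worker labels for one tweet."""
    return mean(LABEL_SCORES[label] for label in labels)


# "LOVE U @APPLE": four workers said strongly positive, one said positive.
love_labels = ["strongly positive"] * 4 + ["positive"]
print(average_sentiment(love_labels))  # (2 + 2 + 2 + 2 + 1) / 5 = 1.8
```

The resulting average is the labeled sentiment for the tweet, which is what the independent variables built from the tweet text will be used to predict.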