Until now, we have seen data that are structured: numerical or categorical. Tweets, on the other hand, are loosely structured. They are often textual, have poor spelling, frequently contain non-traditional grammar, and are multilingual. In the example here, we see two tweets that illustrate these aspects.

We have also discussed why people care about textual data. A key question, however, is how to handle this information, including what is in the tweets. Humans cannot, of course, keep up with internet-scale volumes of data: there are about half a billion tweets per day. Even at a small scale, the cost and time required to process text manually are prohibitive. How can computers help?

The field that addresses how computers understand text is called Natural Language Processing. Its goal is to understand and derive meaning from human language. In 1950, Alan Turing, a major computer scientist of the era, proposed a test of machine intelligence: a computer program passes it if it can take part in a real-time conversation and cannot be distinguished from a human.
Let's briefly discuss the history of Natural Language Processing. There has been some progress; for example, the "chatterbot" ELIZA. The initial focus was on understanding grammar. Later, the focus shifted toward statistical, machine learning techniques that learn from large bodies of text. Today, there are modern descendants of Natural Language Processing in everyday products: Apple uses Siri, and Google uses Google Now.

Why is it hard? Let us give you an example. Suppose we say the phrase, "I put my bag in the car. Is it large and blue?" Does the "it" refer to the bag, or does it refer to the car? Context is often important, and humans use homonyms, metaphors, and often sarcasm. In this lecture, we'll see how we can build analytics models using text as our data.
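As a first taste of treating text as data, a minimal sketch of how a tweet can be turned into something a model can use: lowercase the text, split it into words, and count word frequencies. This is a simplified, hypothetical preprocessing step (the function name and regular expression are illustrative; real pipelines use richer tokenizers and further cleaning):

```python
import re
from collections import Counter

def tweet_word_counts(tweet):
    """Lowercase a tweet, keep only letter sequences, and count word frequencies.
    A simplified illustration of turning unstructured text into structured data."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return Counter(words)

counts = tweet_word_counts("I put my bag in the car. Is it large and blue?")
print(counts.most_common(3))
```

Counts like these are the starting point for the bag-of-words style representations used when building analytics models on text.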