Until now, we have seen data that are structured: numerical or categorical. Tweets, on the other hand, are loosely structured. They are often textual, have poor spelling, frequently contain non-traditional grammar, and are multilingual. In the example here, we see two tweets that illustrate these aspects.

We have also discussed why people care about textual data. A key question, however, is how to handle this information, including what is in the tweets. Humans cannot, of course, keep up with internet-scale volumes of data: there are about half a billion tweets per day. Even at a small scale, the cost and time required to process text manually are prohibitive. How can computers help?

The field that addresses how computers understand text is called Natural Language Processing. Its goal is to understand and derive meaning from human language. In 1950, Alan Turing, a major computer scientist of the era, proposed a test of machine intelligence: a computer program passes it if it can take part in a real-time conversation and cannot be distinguished from a human.
Let's briefly discuss the history of Natural Language Processing. There has been some progress; for example, the "chatterbot" ELIZA. The initial focus was on understanding grammar. Later, the focus shifted toward statistical, machine learning techniques that learn from large bodies of text. Today, there are modern descendants of Natural Language Processing in everyday products: Apple uses Siri, and Google uses Google Now.

Why is it hard? Let us give you an example. Suppose we say the phrase, "I put my bag in the car. Is it large and blue?" Does the "it" refer to the bag, or does it refer to the car? Context is often important, and humans use homonyms, metaphors, and often sarcasm. In this lecture, we'll see how we can build analytics models using text as our data.
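As a first taste of treating text as data, a minimal sketch of how a tweet can be turned into something a model can use: lowercase the text, split it into words, and count word frequencies. This is a simplified, hypothetical preprocessing step (the function name and regular expression are illustrative; real pipelines use richer tokenizers and further cleaning):

```python
import re
from collections import Counter

def tweet_word_counts(tweet):
    """Lowercase a tweet, keep only letter sequences, and count word frequencies.
    A simplified illustration of turning unstructured text into structured data."""
    words = re.findall(r"[a-z']+", tweet.lower())
    return Counter(words)

counts = tweet_word_counts("I put my bag in the car. Is it large and blue?")
print(counts.most_common(3))
```

Counts like these are the starting point for the bag-of-words style representations used when building analytics models on text.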