1
00:00:04,500 --> 00:00:08,370
Let's begin by creating a
data frame called emails

2
00:00:08,370 --> 00:00:11,690
using the read.csv function.

3
00:00:11,690 --> 00:00:13,650
And loading up energy_bids.csv.

4
00:00:16,710 --> 00:00:19,310
And as always, in the
text analytics week,

5
00:00:19,310 --> 00:00:24,560
we're going to pass
stringsAsFactors=FALSE to this

6
00:00:24,560 --> 00:00:26,660
function.

7
00:00:26,660 --> 00:00:30,230
So we can take a look at the
structure of our new data frame

8
00:00:30,230 --> 00:00:33,230
using the str function.

9
00:00:33,230 --> 00:00:36,060
We can see that there
are 855 observations.

10
00:00:36,060 --> 00:00:39,730
This means we have 855 labeled
emails in the data set.

11
00:00:39,730 --> 00:00:42,930
And for each one we have
the text of the email

12
00:00:42,930 --> 00:00:45,410
and whether or not it's
responsive to our query

13
00:00:45,410 --> 00:00:48,660
about energy schedules and bids.

14
00:00:48,660 --> 00:00:51,660
So let's take a look at a few
example emails in the data set,

15
00:00:51,660 --> 00:00:53,300
starting with the first one.

16
00:00:53,300 --> 00:00:58,610
So the first email can be
accessed with emails$emailemails$email[1].

17
00:00:58,610 --> 00:01:01,060
Almost like the first one.

18
00:01:01,060 --> 00:01:04,940
So while the output you
get when you type this

19
00:01:04,940 --> 00:01:08,510
will depend on what operating
system you're running on,

20
00:01:08,510 --> 00:01:10,430
many of you will see
what I'm displaying here.

21
00:01:10,430 --> 00:01:12,090
Which is a single
line of text that we

22
00:01:12,090 --> 00:01:15,210
need to horizontally
scroll to read through.

23
00:01:15,210 --> 00:01:18,460
This is a pretty tough way
to read a long piece of text.

24
00:01:18,460 --> 00:01:20,990
So if you have this
sort of display,

25
00:01:20,990 --> 00:01:26,450
you can use the strwrap function
and pass it the long string you

26
00:01:26,450 --> 00:01:30,360
want to print out, in
this case emails$email.

27
00:01:30,360 --> 00:01:32,610
Selecting the first one.

28
00:01:32,610 --> 00:01:35,620
And now we can see that this
has broken down our long string

29
00:01:35,620 --> 00:01:40,150
into multiple shorter lines
that are much easier to read.

30
00:01:40,150 --> 00:01:41,240
OK.

31
00:01:41,240 --> 00:01:43,220
So let's take a look
now at this email,

32
00:01:43,220 --> 00:01:45,780
now that it's a
lot easier to read.

33
00:01:45,780 --> 00:01:47,910
We can see just by parsing
through the first couple

34
00:01:47,910 --> 00:01:50,140
of lines that this
is an email that's

35
00:01:50,140 --> 00:01:52,450
talking about a
new working paper,

36
00:01:52,450 --> 00:01:55,430
"The Environmental
Challenges and Opportunities

37
00:01:55,430 --> 00:01:57,990
in the Evolving North
American Electricity Market"

38
00:01:57,990 --> 00:01:59,870
is the name of the paper.

39
00:01:59,870 --> 00:02:02,640
And it's being released
by the Commission

40
00:02:02,640 --> 00:02:05,300
for Environmental
Cooperation, or CEC.

41
00:02:05,300 --> 00:02:08,430
So while this certainly deals
with electricity markets,

42
00:02:08,430 --> 00:02:11,610
it doesn't have to do with
energy schedules or bids.

43
00:02:11,610 --> 00:02:14,710
So it is not responsive
to our query.

44
00:02:14,710 --> 00:02:18,480
So we can take a look at
the value in the responsive

45
00:02:18,480 --> 00:02:25,640
variable for this email using
email$responsive and selecting

46
00:02:25,640 --> 00:02:27,250
the first one.

47
00:02:27,250 --> 00:02:29,170
And we have value 0 there.

48
00:02:29,170 --> 00:02:32,130
So let's take a look at the
second email in our data set.

49
00:02:32,130 --> 00:02:34,750
Again I'm going to use
the strwrap function.

50
00:02:34,750 --> 00:02:36,800
I'm going to pass
it emails$emailemails$email[1].

51
00:02:42,220 --> 00:02:44,420
And scrolling up
the top here we can

52
00:02:44,420 --> 00:02:46,630
see that the original message
is actually very short,

53
00:02:46,630 --> 00:02:49,540
it just says FYI,
for your information.

54
00:02:49,540 --> 00:02:52,120
And most of it is a
forwarded message.

55
00:02:52,120 --> 00:02:53,910
So we have all the
people who originally

56
00:02:53,910 --> 00:02:55,770
received the message.

57
00:02:55,770 --> 00:02:58,780
And then down at the very
bottom is the message itself.

58
00:02:58,780 --> 00:03:02,340
"Attached is my report prepared
on behalf of the California

59
00:03:02,340 --> 00:03:04,170
State auditor."

60
00:03:04,170 --> 00:03:07,920
And there's an attached
report, ca report new.pdf.

61
00:03:07,920 --> 00:03:11,450
Now our data set contains
just the text of the emails

62
00:03:11,450 --> 00:03:13,410
and not the text
of the attachments.

63
00:03:13,410 --> 00:03:15,790
But it turns out,
as we might expect,

64
00:03:15,790 --> 00:03:18,770
that this attachment had to do
with Enron's electricity bids

65
00:03:18,770 --> 00:03:20,040
in California.

66
00:03:20,040 --> 00:03:22,920
And therefore it is
responsive to our query.

67
00:03:22,920 --> 00:03:25,360
And we can check this in
the responsive variable.

68
00:03:25,360 --> 00:03:26,230
emails$responsive[2].

69
00:03:30,890 --> 00:03:33,240
And we see that that's a 1.

70
00:03:33,240 --> 00:03:35,040
So now let's look
at the breakdown

71
00:03:35,040 --> 00:03:38,710
of the number of emails that are
responsive to our query using

72
00:03:38,710 --> 00:03:40,680
the table function.

73
00:03:40,680 --> 00:03:42,390
We're going to pass
it emails$responsive.

74
00:03:45,110 --> 00:03:47,710
And as we can see the
data set is unbalanced,

75
00:03:47,710 --> 00:03:50,690
with a relatively small
proportion of emails responsive

76
00:03:50,690 --> 00:03:51,670
to the query.

77
00:03:51,670 --> 00:03:55,220
And this is typical in
predictive coding problems.