1 00:00:04,500 --> 00:00:08,370 Let's begin by creating a data frame called emails 2 00:00:08,370 --> 00:00:11,690 using the read.csv function. 3 00:00:11,690 --> 00:00:13,650 And loading up energy_bids.csv. 4 00:00:16,710 --> 00:00:19,310 And as always, in the text analytics week, 5 00:00:19,310 --> 00:00:24,560 we're going to pass stringsAsFactors=FALSE to this 6 00:00:24,560 --> 00:00:26,660 function. 7 00:00:26,660 --> 00:00:30,230 So we can take a look at the structure of our new data frame 8 00:00:30,230 --> 00:00:33,230 using the str function. 9 00:00:33,230 --> 00:00:36,060 We can see that there are 855 observations. 10 00:00:36,060 --> 00:00:39,730 This means we have 855 labeled emails in the data set. 11 00:00:39,730 --> 00:00:42,930 And for each one we have the text of the email 12 00:00:42,930 --> 00:00:45,410 and whether or not it's responsive to our query 13 00:00:45,410 --> 00:00:48,660 about energy schedules and bids. 14 00:00:48,660 --> 00:00:51,660 So let's take a look at a few example emails in the data set, 15 00:00:51,660 --> 00:00:53,300 starting with the first one. 16 00:00:53,300 --> 00:00:58,610 So the first email can be accessed with emails$email[1]. 17 00:00:58,610 --> 00:01:01,060 Almost like the first one. 18 00:01:01,060 --> 00:01:04,940 So while the output you get when you type this 19 00:01:04,940 --> 00:01:08,510 will depend on what operating system you're running on, 20 00:01:08,510 --> 00:01:10,430 many of you will see what I'm displaying here. 21 00:01:10,430 --> 00:01:12,090 Which is a single line of text that we 22 00:01:12,090 --> 00:01:15,210 need to horizontally scroll to read through. 23 00:01:15,210 --> 00:01:18,460 This is a pretty tough way to read a long piece of text.
24 00:01:18,460 --> 00:01:20,990 So if you have this sort of display, 25 00:01:20,990 --> 00:01:26,450 you can use the strwrap function and pass it the long string you 26 00:01:26,450 --> 00:01:30,360 want to print out, in this case emails$email. 27 00:01:30,360 --> 00:01:32,610 Selecting the first one. 28 00:01:32,610 --> 00:01:35,620 And now we can see that this has broken down our long string 29 00:01:35,620 --> 00:01:40,150 into multiple shorter lines that are much easier to read. 30 00:01:40,150 --> 00:01:41,240 OK. 31 00:01:41,240 --> 00:01:43,220 So let's take a look now at this email, 32 00:01:43,220 --> 00:01:45,780 now that it's a lot easier to read. 33 00:01:45,780 --> 00:01:47,910 We can see just by parsing through the first couple 34 00:01:47,910 --> 00:01:50,140 of lines that this is an email that's 35 00:01:50,140 --> 00:01:52,450 talking about a new working paper, 36 00:01:52,450 --> 00:01:55,430 "The Environmental Challenges and Opportunities 37 00:01:55,430 --> 00:01:57,990 in the Evolving North American Electricity Market" 38 00:01:57,990 --> 00:01:59,870 is the name of the paper. 39 00:01:59,870 --> 00:02:02,640 And it's being released by the Commission 40 00:02:02,640 --> 00:02:05,300 for Environmental Cooperation, or CEC. 41 00:02:05,300 --> 00:02:08,430 So while this certainly deals with electricity markets, 42 00:02:08,430 --> 00:02:11,610 it doesn't have to do with energy schedules or bids. 43 00:02:11,610 --> 00:02:14,710 So it is not responsive to our query. 44 00:02:14,710 --> 00:02:18,480 So we can take a look at the value in the responsive 45 00:02:18,480 --> 00:02:25,640 variable for this email using emails$responsive and selecting 46 00:02:25,640 --> 00:02:27,250 the first one. 47 00:02:27,250 --> 00:02:29,170 And we have value 0 there. 48 00:02:29,170 --> 00:02:32,130 So let's take a look at the second email in our data set. 49 00:02:32,130 --> 00:02:34,750 Again I'm going to use the strwrap function.
50 00:02:34,750 --> 00:02:36,800 I'm going to pass it emails$email[2]. 51 00:02:42,220 --> 00:02:44,420 And scrolling up the top here we can 52 00:02:44,420 --> 00:02:46,630 see that the original message is actually very short, 53 00:02:46,630 --> 00:02:49,540 it just says FYI, for your information. 54 00:02:49,540 --> 00:02:52,120 And most of it is a forwarded message. 55 00:02:52,120 --> 00:02:53,910 So we have all the people who originally 56 00:02:53,910 --> 00:02:55,770 received the message. 57 00:02:55,770 --> 00:02:58,780 And then down at the very bottom is the message itself. 58 00:02:58,780 --> 00:03:02,340 "Attached is my report prepared on behalf of the California 59 00:03:02,340 --> 00:03:04,170 State auditor." 60 00:03:04,170 --> 00:03:07,920 And there's an attached report, ca report new.pdf. 61 00:03:07,920 --> 00:03:11,450 Now our data set contains just the text of the emails 62 00:03:11,450 --> 00:03:13,410 and not the text of the attachments. 63 00:03:13,410 --> 00:03:15,790 But it turns out, as we might expect, 64 00:03:15,790 --> 00:03:18,770 that this attachment had to do with Enron's electricity bids 65 00:03:18,770 --> 00:03:20,040 in California. 66 00:03:20,040 --> 00:03:22,920 And therefore it is responsive to our query. 67 00:03:22,920 --> 00:03:25,360 And we can check this in the responsive variable. 68 00:03:25,360 --> 00:03:26,230 emails$responsive[2]. 69 00:03:30,890 --> 00:03:33,240 And we see that that's a 1. 70 00:03:33,240 --> 00:03:35,040 So now let's look at the breakdown 71 00:03:35,040 --> 00:03:38,710 of the number of emails that are responsive to our query using 72 00:03:38,710 --> 00:03:40,680 the table function. 73 00:03:40,680 --> 00:03:42,390 We're going to pass it emails$responsive. 74 00:03:45,110 --> 00:03:47,710 And as we can see the data set is unbalanced, 75 00:03:47,710 --> 00:03:50,690 with a relatively small proportion of emails responsive 76 00:03:50,690 --> 00:03:51,670 to the query.
77 00:03:51,670 --> 00:03:55,220 And this is typical in predictive coding problems.
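
As a recap, the commands typed in this video can be collected into one short R sketch. So that it runs without the course file, it assumes a tiny made-up stand-in for energy_bids.csv: the csv_text variable and its two rows are hypothetical and are not taken from the real data set, which has 855 rows. Only the column names email and responsive follow the video.

```r
# Hypothetical stand-in for energy_bids.csv (two made-up rows).
# With the real file you would instead run:
#   emails <- read.csv("energy_bids.csv", stringsAsFactors = FALSE)
csv_text <- 'email,responsive
"Made-up announcement about a new working paper on electricity markets.",0
"FYI. Forwarded message: attached is my report on the electricity bids.",1'

# stringsAsFactors = FALSE keeps the email text as character strings
emails <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Structure of the data frame: number of observations and the two variables
str(emails)

# First email, wrapped into shorter lines, and its label
strwrap(emails$email[1])
emails$responsive[1]

# Second email and its label
strwrap(emails$email[2])
emails$responsive[2]

# Breakdown of non-responsive (0) versus responsive (1) emails
table(emails$responsive)
```

On the real data set the table call is where the imbalance between the two classes shows up. The two-row stand-in here is balanced by construction, so it only illustrates the calls, not the result.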