D4M Session 4 Demo
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
JEREMY KEPNER: All right. We're back. So I think we just went over sort of a tour of some of the more complicated analytics that people can do. And now, we're going to show some examples, not of those specific analytics, but of some of the analytics that we can do using the Reuters data set that we had before.
So again, this is in the D4M API. We go into the Examples directory. It's Apps. And now we're going to deal with tracks. And this is a data set here.
The data, this Entity.mat, was actually created in the previous example. The entity analysis actually creates the Entity.mat. It's the same file. So we just have it over here.
And so let's get started. And just to remind people, we do load entity.mat. You see we have this E here. That's the data. It represents, essentially, almost 10,000 documents and 3,600 entities in that. If we do spy E. transpose.
AUDIENCE: Can you move the bottom of your screen up? Because we're seeing the top half of your thing.
JEREMY KEPNER: That better?
AUDIENCE: Much better.
JEREMY KEPNER: OK. And so there you go. That is the entire data set. Again, spy plot, very useful tool for doing that. So you can see we have locations and people and times in this data set.
We also have organizations in here. You can zoom in on it if you want to. You can see, these are the locations, and then the organizations, and the people.
Zoom in a little bit more, you can actually look at the actual values here. And you see there's this common popular location here. What is that? Ah, it's the United States. Yes, the United States does appear a lot in Reuters documents as one would expect.
What do we think this one is here? Any guesses?
AUDIENCE: New York.
JEREMY KEPNER: What?
AUDIENCE: Is that New York?
JEREMY KEPNER: Yeah. Oh, New York. See, it's already there. New York. Yes, we can read. All right.
Organizations, anything really popular here? Maybe this guy, is he popular? International Red Cross, right? And people here, we don't have any really popular people in this list. Ahmad Shah Massoud, I have no idea who that person was back in 1996.
Anyway, so that's the data. And you see, by the way, every single time you click, we print out the string. When we make those spy plots, I have to do some compression on the strings just to make it work. But we actually always print out onto the screen the full exact string.
So if I want that string, you can just then copy it if you want to say copy and paste it or something like that. Or the full person or something like that, you can then copy that and paste that. You can do something like that like, E, this guy, yeah.
And there, it's showed. That's all the documents that contain this person's name. And then this shows the character position, I believe, that they appeared in the document. So again, you have all kinds of fun with that.
You can do row, that. That gets us all those rows. And then we can pass those rows back in. And let's say we want to do starts with, let's see here, how about organization?
Everyone correct my spelling here. I'm going to type this wrong, organization slash. All right, there we go.
AUDIENCE: Starts with.
JEREMY KEPNER: Yeah. I always get that wrong. Stars with-- starts with. See? There you go.
All right. There you go. So this shows you all the organizations that are cited in the documents that contained this person's name. You know?
This is kind of the spirit of it, right? I mean, it's like just, oh, I want that? I can get that. Oh, I could then say, oh, all right, well get me-- you know? And you could just keep going and going and going.
If I did r, c, v of that, right, it would then return those as triples. So r, those are all the documents. C, those are all the columns. V, there we go. You get the idea.
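The chain of queries described here (pick a column, get its rows, restrict those rows to columns with a given prefix, then flatten to triples) can be sketched in plain Python. This is only an illustration of the idea, not the D4M API: the nested-dictionary layout, the helper names, and all keys and values below are made up.

```python
# Toy stand-in for an associative array: {row_key: {col_key: value}}.
# Every key and value here is hypothetical, chosen only for illustration.
E = {
    "doc1.txt": {"PERSON/jane doe": "17;", "ORGANIZATION/red cross": "88;",
                 "LOCATION/new york": "5;"},
    "doc2.txt": {"PERSON/jane doe": "42;", "ORGANIZATION/imf": "130;"},
    "doc3.txt": {"PERSON/john roe": "9;", "ORGANIZATION/un": "61;"},
}

def rows_with(assoc, col):
    """Row keys (documents) that have an entry in the given column."""
    return sorted(r for r, cols in assoc.items() if col in cols)

def sub_rows_startswith(assoc, rows, prefix):
    """Keep only the given rows, and only columns that start with prefix."""
    out = {}
    for r in rows:
        kept = {c: v for c, v in assoc[r].items() if c.startswith(prefix)}
        if kept:
            out[r] = kept
    return out

def triples(assoc):
    """Flatten to three parallel lists: rows, columns, values."""
    r, c, v = [], [], []
    for row in sorted(assoc):
        for col in sorted(assoc[row]):
            r.append(row)
            c.append(col)
            v.append(assoc[row][col])
    return r, c, v

docs = rows_with(E, "PERSON/jane doe")                 # documents naming the person
orgs = sub_rows_startswith(E, docs, "ORGANIZATION/")   # organizations in those documents
r, c, v = triples(orgs)
print(list(zip(r, c, v)))
```

Each step returns something the next step can consume, which is the "keep going and going" quality of the query style.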
So again, very, very powerful type of syntax there. All right. So I'll just give you a sense of the data set that we're looking at. So let's look at the first example here. So we're going to do track, analytic, build, test.
So we're going to build some tracks out of this data. So I have these documents. And they have locations in them, and people, and times. So I could say, hey, if there's a person and a location and a time in a document, maybe I could call that sort of a track.
So let's do that. So what do we do here? So the first thing we do is we loaded the data. The string values are those character positions. We don't really care about those. So we're going to just get rid of them and just convert it to a numeric like we've always been doing here.
And then I'm going to say my thing that I want to track is going to be anything starting with person. So I set that. And my time thing is going to be anything starting with time. And my location thing is going to be anything here starting with location.
So I've done the starts with to get these ranges. And now, the first thing I'm going to do is I want to limit my data to only rows that have at least one of all three of those. So I'm not dealing with I have a person and a location and no time or whatever. So I'm just going to clean that up.
So basically, I get all the people. And I sum that. I basically sum across the columns. So I basically compress the columns. And then I sum the rows. All right.
There we go, sum those. I get, then, all three. And then I filter them back out. And that just reduces this to the ones that contain just the ones that have these.
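The cleanup step just described, keeping only documents that have at least one person, one time, and one location, can be sketched in plain Python. This is a minimal illustration under assumed data: the dictionary layout, key prefixes, and document names are hypothetical and are not taken from the demo code.

```python
# Toy document-by-entity table: {document: {entity_column: 1}}.
# All names are made up for illustration.
E = {
    "doc1": {"PERSON/a": 1, "TIME/1996-11-12": 1, "LOCATION/new york": 1},
    "doc2": {"PERSON/b": 1, "LOCATION/buffalo": 1},            # no time
    "doc3": {"TIME/1996-11-13": 1, "LOCATION/australia": 1},   # no person
    "doc4": {"PERSON/a": 1, "PERSON/b": 1, "TIME/1996-11-14": 1,
             "LOCATION/australia": 1},
}
prefixes = ("PERSON/", "TIME/", "LOCATION/")

def count_with_prefix(cols, prefix):
    """Per-row sum over all columns of one entity type."""
    return sum(1 for c in cols if c.startswith(prefix))

# A row survives only if every one of the three per-type sums is nonzero.
keep = [doc for doc, cols in E.items()
        if all(count_with_prefix(cols, p) > 0 for p in prefixes)]
E3 = {doc: E[doc] for doc in keep}
print(sorted(E3))
```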
Let's see here. So now, I want to collapse these. I want to create, essentially, just edges and times. So I can do that with the col2type syntax and the val2col syntax. And going and bopping back and forth between those, I get a set of edges, the edge list, which is essentially the document and the time.
And now, I'm going to combine these back together into a new associative array, which essentially still has the same text label, which is essentially the document. But it has columns of time. And the value is space.
All right. And then I'm going to do another one, which is, again, has a row, which is the document or the edge. And it has a column of space and a value of time. And now, I can construct a track from this through this wonderful sparse matrix multiply.
So essentially, I transpose Etx. And then I'm going to just get the people and convert those to numeric values. And then we do this CatValMul, which will actually convert that.
These are time tracks. And these are space tracks. And again, it's a little difficult to explain to you exactly why this works, why these matrix multiplies give us the answer that we're going for. Because we're going to have to sit and think about the actual matrices.
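One way to build intuition for why the multiply produces tracks is a small plain-Python sketch. It is an assumption-laden stand-in, not the D4M routine: the function name, the data, and the choice to collect values into lists are all made up. The idea is that the document dimension is shared by both tables, so multiplying the transposed document-by-person table by the document-by-time table sums it away and leaves person by time, with the stored locations carried along.

```python
from collections import defaultdict

# Document -> people mentioned (the nonzero entries of a doc-by-person table).
doc_person = {"d1": ["PERSON/a"], "d2": ["PERSON/a", "PERSON/b"]}
# Document -> {time column: location stored as the value}.
doc_time_loc = {
    "d1": {"TIME/1996-11-12": "LOCATION/new york"},
    "d2": {"TIME/1996-11-13": "LOCATION/buffalo"},
}

def transpose_multiply_concat(left, right):
    """Compute left' * right over the shared document keys.

    Instead of adding numbers, each matching document contributes its
    value to a list, so the result is {person: {time: [locations]}}.
    """
    out = defaultdict(lambda: defaultdict(list))
    for doc, persons in left.items():
        for time, loc in right.get(doc, {}).items():
            for person in persons:
                out[person][time].append(loc)
    return {p: dict(ts) for p, ts in out.items()}

time_tracks = transpose_multiply_concat(doc_person, doc_time_loc)
for person, stops in sorted(time_tracks.items()):
    print(person, stops)
```

Swapping the roles of time and location in the second table gives the other orientation, person by location with times as values.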
This is a great example to go do, and then explore these various associative arrays to actually see why these matrix multiplies actually give us the answer that we want. So in fact, we can take a look at those. I just want to look at Figure 1 here.
So this shows us, basically, and I plotted the transpose of this, the people on the right. And then these are times. And so, basically, for each row here, I have a listing of times. And if I click on one of them, it will give me a location. OK.
In fact, I think that number's the number of times that appears. So basically, we have here the person, Daniel Smith was in a document with a time stamp of 1996, November 12. Oh, that's almost-- yesterday.
And then the location New York appeared once in that. Here's another one. And so this is a track of sorts. We basically have a person and a set of times and a set of locations. That's one kind of track.
Another kind of track here is now we have person and locations. So that was the other matrix multiply that we did. And so now, we have person Carole King, location Buffalo, and on this time. So those are two different ways of representing the tracks.
Obviously, these are triples. But then you can use either of these matrices to do additional queries and other types of things. All right. So that just shows you how using matrix multiplies and other types of things you can construct more sophisticated graphs or data structures, in this case tracks, which is a very interesting type of thing.
Let's move on to the next one. That's going to be TA2. So this is a slightly more sophisticated track builder. Again, so when I read the data in, I create my three sort of categories here, the object, and the time, and the location, or the coordinate.
And then I have a function here called find tracks, which actually just goes and creates those tracks that I essentially did in the last section. To be honest with you, the reason I did that is because some of those matrix multiplies used to be really, really, really slow. And so I did a sort of special function that took advantage of certain properties of the data to make it find these tracks much faster.
Eventually, I broke down and just optimized the matrix multiply. In the past when I ran that last query, before it would have taken like a minute to actually run the analytic, which got annoying. So I optimized it.
But we still have this code. This code shows all kinds of little tricks and techniques for doing things that are slightly better and using triples instead of associative arrays if you want to do optimization. So we leave it here. But the matrix multiply performance is now good enough that these tricks are less necessary.
So what we want to do here, we have this track now. And I want to do a track query. So I have a person here, Michael Chang, another person Javier Sanchez. Now, Michael Chang was a tennis player at this time. Was Javier Sanchez also a tennis player at this time?
I don't know. I think there was a Javier Sanchez that was a tennis player at that time. So we just want to look. We're going to just do, essentially, here A and just say give me of this track. And say give me the listings for these two people, P1, P2.
And then we use our displayFull command to sort of make them in a nice neat tabular format. And you see here, basically, here is Javier Sanchez's listing. OK. And here is Michael Chang's. And you see there's no overlap here. We don't ever have them in the same time or same place.
We can also do things like track windows. So we can say I want to set a time range here and a location, Australia. So if we have our track thing here and I say, all right, give me the time range, T, and then equal to all locations in Australia, this shows me all tracks that essentially went through this location in this time window.
And these are the different folks that they list, Sanchez, Melissa Russo, whoever that was, Michael Chang, and Michelle Martin. So those are just an example of a more sophisticated analytic. And here, we're using the fact that for our associative arrays, we actually have defined equals equals.
So this only works, though, if x is a constant. So it will check the value to see if that value, if it's a string, if it's equal to that string, or if it's a numeric value, if it's equal to that numeric value. But it only works with a constant.
One could argue that maybe I should make this work for a list of strings. But then the MATLAB syntax doesn't really work there either. If I have a matrix equals equal to a list of-- I don't know. I don't know if that really works.
So we try and preserve the MATLAB syntax where we can. And again, then we're just getting the columns. Again, this thing returns an associative array. This equal equals returns the associative array of all things that are equal to that. And then we can look at the columns.
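The window query, restricting to a time range and then testing values for equality against a single constant location, can be sketched in plain Python. This is a hedged illustration, not the D4M syntax: the function name and the track data are hypothetical, and the times are assumed to be ISO-style strings so that string comparison matches date order.

```python
# Toy tracks: {person: {time: location}}. All entries are made up.
tracks = {
    "PERSON/michael chang": {"TIME/1996-01-15": "LOCATION/australia",
                             "TIME/1996-08-30": "LOCATION/new york"},
    "PERSON/javier sanchez": {"TIME/1996-01-18": "LOCATION/australia"},
    "PERSON/carole king": {"TIME/1996-01-16": "LOCATION/buffalo",
                           "TIME/1996-06-01": "LOCATION/australia"},
}

def in_window(tracks, t_lo, t_hi, location):
    """People with at least one stop inside [t_lo, t_hi] whose value
    equals the one constant location being tested."""
    hits = []
    for person, stops in tracks.items():
        if any(t_lo <= t <= t_hi and loc == location
               for t, loc in stops.items()):
            hits.append(person)
    return sorted(hits)

people = in_window(tracks, "TIME/1996-01-01", "TIME/1996-01-31",
                   "LOCATION/australia")
print(people)
```

Note that the comparison is against one constant value, mirroring the restriction mentioned above; matching against a list of locations would need a different test.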
All right. Moving on here, next one's TA3. So those are fairly simple track builders. Let's begin to do something that's, I think, kind of-- doing that track analytic, one could imagine doing that with existing techniques that are out there, existing tools and stuff like that.
It would be long. You would write a lot of code to do it. But you could do it. Now, let's do some things where you go, like, wow, this is really something that would just be prohibitively complicated to do it using other techniques.
So once again, we load our data. We convert it to numeric. We get our object and our time and our space keys. We find our tracks. And then we've built something called FindTrackGraph.
All right. And this is actually not that complicated. But it is more than, like, one or two lines. But what it does, it says, OK, I have this track. This track is a sequence of locations in a particular time order.
Well, now, I want to build a graph that's location by location. So if a track started in one location, and then its next destination was another, that will, of course, create a new graph. OK. So I now have a new graph, which is essentially 220 by 220 locations.
And we can actually take a look at that. And that's this graph here. So this basically says, you know, there was a track that started in Belgium, and then its next stop was Albania.
Or here's another one. It started in Australia and ended in Colombia. And obviously, we have a dense diagonal here, because by definition-- well, actually a lot of times, that's just the way it works. And so again, here's Damascus, Florida, all this type of thing.
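The location-by-location graph can be sketched in plain Python as a count of consecutive stops. This is an assumed reading of what is described, not the FindTrackGraph source: the function name `track_graph`, the data, and the use of a counter keyed by (from, to) pairs are all illustrative choices.

```python
from collections import Counter

# Toy tracks: {person: {time: location}}. All entries are made up.
tracks = {
    "PERSON/a": {"TIME/1996-01-01": "belgium", "TIME/1996-01-05": "albania",
                 "TIME/1996-01-09": "albania"},
    "PERSON/b": {"TIME/1996-02-01": "belgium", "TIME/1996-02-03": "albania"},
    "PERSON/c": {"TIME/1996-03-01": "australia", "TIME/1996-03-02": "colombia"},
}

def track_graph(tracks):
    """Count how often one location is immediately followed by another."""
    graph = Counter()
    for stops in tracks.values():
        # Sort each track by time, then pair every stop with the next one.
        ordered = [loc for _, loc in sorted(stops.items())]
        for src, dst in zip(ordered, ordered[1:]):
            graph[(src, dst)] += 1
    return graph

G = track_graph(tracks)
for edge, count in sorted(G.items()):
    print(edge, count)
```

A track that stays in the same place across two time stamps contributes a self-edge, which is one way a dense diagonal can arise.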
So now, we've created a new graph of these tracks. Now, we can do something like a track pattern. So let's say I just want to look at the tracks associated with people associated with the organization International Monetary Fund.
So I'm going to have starts with person. And I'm going to limit my data. So I'm going to basically limit it to data that begins with the organization. So now, I'm building a new graph, G0. OK.
FindTrackGraph of just that data set. So I've basically taken my A, which is this graph, and I said, oh, just go back and find me the people associated with the International Monetary Fund. And now, I can do things like, all right-- because this track graph the value is the number of times that occurred.
So for instance, I can say, show me now all edges that occurred more than twice, and where the tracks that were due to people associated with the International Monetary Fund were greater than 20% of all the tracks that occurred. So basically, I'm looking for something that happened more than twice, that IMF folks did more than twice. And of all the data, the IMF people did it a lot of the time. So we're in a lot of people. It wasn't just a really, really popular track.
And so we see here, now we get Karachi and Afghanistan. So we had, essentially, at least two of those. And all of them were people associated with the IMF. Afghanistan and Britain. Here's Britain and England.
Well, obviously that's a little bit-- Britain and England are the same thing, right? Here's Islamabad, Islamabad, Moscow, Moscow. So here just shows you the kinds of things that you can do with typing in again.
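The pattern filter, edges that the subset produced more than twice and that account for more than 20% of all occurrences of that edge, can be sketched in plain Python. The thresholds come from the description above; the function name and the edge counts are made up, and this is not the demo's actual code.

```python
# Edge counts over all tracks (G) and over tracks of the subset of
# interest (G0). The numbers are hypothetical.
G = {("karachi", "afghanistan"): 4, ("britain", "england"): 30,
     ("moscow", "moscow"): 3}
G0 = {("karachi", "afghanistan"): 3, ("britain", "england"): 4,
      ("moscow", "moscow"): 1}

def frequent_and_distinctive(G0, G, min_count=2, min_frac=0.2):
    """Edges the subset produced more than min_count times AND where the
    subset accounts for more than min_frac of all occurrences."""
    return sorted(edge for edge, n in G0.items()
                  if n > min_count and n / G[edge] > min_frac)

pattern = frequent_and_distinctive(G0, G)
print(pattern)
```

The second condition is what screens out edges that are simply popular with everyone: in this toy data the Britain to England edge occurs four times in the subset but is only a small share of the total.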
This is a very sophisticated type of analytic. If you were to try and to do those things using existing types of things-- if you knew this was the analytic to do and someone handed this to you and said, go implement it using another technique, you definitely could do it. But discovering the analytic using existing techniques would be very, very time consuming.
And this kind of tool makes it very, very easy for you to explore and get the analytic just right. And I would say that in a certain sense what D4M is doing here is playing the same role that we've always used MATLAB for in signal processing. People use it to play with their algorithms, to figure out their algorithms, to get their algorithms right. They know this will give us the right answer.
And then when they deploy it and actually make it part of a real system, sometimes they'll just take the MATLAB code and make it a part of the real system. But more often than not, the target system or the target application will require you to port it to some other language, maybe C++, maybe Java, for deployment reasons. We still see that happening today that, you know, algorithm development is one thing, deployment is another thing.
Even if people use the same language for doing algorithm and deployment, usually deployment people end up having to completely rewrite what the algorithm analysts wrote anyway. Because the algorithm analysts had certain things they were concerned about. And the deployment person will have completely different issues that they have to worry about. But I think D4M allows you to still do that same kind of model on these new types of data in a very useful and productive way to get the productivity that we want out of that.
All right. Let's see here. So finally, one last thing, a more complicated analytic, which I call sort of a multiple hypothesis tracker. Essentially, what we're doing here is we're loading the data. We're going to just focus on one person here.
And then the locations are specified by time. I mean, the time is specified by the time column. And location is specified by location. And then I'm going to have this function called find multiple hypothesis tracks for Michael Chang with respect to the data set E.
And what this does is this says, all right, in the previous thing, I was basically just making one pick. If I had a document and Michael Chang was in it and there was multiple locations and times, I would just sort of pick one of those locations and times. Here, I'm now going to show you, for Michael Chang, for each document, all the locations and times.
So here's, basically, Africa and time. And let's see here. Maybe we can make this a little smaller. That would probably help a bit, a little smaller.
There we go. You can barely see that. But you see here, this basically shows all the times, all the locations. And this shows you, essentially, in theory the true track could be going through any one of these. There isn't a single track really for Michael Chang. There's multiple potential tracks.
And then this complex value I happened to store here just because I'm using complex values, the first one is the character distance. The location is-- so Michael Chang appears in a particular word position in the document. And it tells me that Austria appears 278 characters before him and that the time stamp appears 11 characters before his name.
And so over here, you see the time stamp appears-- this is a different document, but with the same time-- in different locations. So you can then use this data to actually go back and say, you know what, I only want to pick the words that are closest to Michael Chang. And I want that to actually be the real track for Michael Chang.
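The follow-up idea, choosing for each document the candidate location closest to the person's name, can be sketched in plain Python. This is only an illustration under assumptions: signed character offsets stand in for the stored distances (negative meaning before the name), and the function name, documents, and numbers are hypothetical.

```python
# For each document, candidate locations with their signed character
# offset from the person's name. All values are made up.
candidates = {
    "doc1": [("LOCATION/austria", -278), ("LOCATION/australia", 35)],
    "doc2": [("LOCATION/africa", 120), ("LOCATION/paris", -12)],
}

def closest_location(cands):
    """Resolve each document to the location with the smallest
    absolute character distance from the person's name."""
    return {doc: min(locs, key=lambda pair: abs(pair[1]))[0]
            for doc, locs in cands.items()}

resolved = closest_location(candidates)
print(resolved)
```

Nearest-mention is just one plausible rule for collapsing the multiple hypotheses into a single track; other rules could be swapped in at the same point.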
So that just shows the more complicated things that we can do with that. So with that, that leads us to the end of the examples. And if there are any questions, I'll be happy to take them now. All right, good. I'm just showing you some of the kinds of things that people can do. And so there we go.