Lecture 4.3: Aude Oliva - Predicting Visual Memory


Description: What makes an image memorable? Discussing visual memory experiments, consistency of memorability across observers, memorability of images, neural framework for memorability, and biologically inspired deep neural network model of object recognition.

Instructor: Aude Oliva

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

AUDE OLIVA: Thank you very much for the introduction. Good morning, everyone. I'm very pleased to be here. It's my first time visiting.

So in this lecture I would like to give you a tour of how you can predict human visual memory -- an interdisciplinary account of the methods you can combine to get a view of what people will remember or forget.

So the specific question we are going to ask today is this. We are experiencing and seeing a lot of digital information all the time. First you see the real world, but you are also exposed to many, many videos and images. So vision and visual memory are among the core concepts of cognition.

And the question we ask is, can we predict which image, graph, face, word, video, piece of information, or event is going to be memorable or forgettable for a group of people, and eventually for a given individual?

So let's take a moment to imagine that we were able to predict people's memory accurately. This would be very useful for understanding the mechanisms of human memory, both at the cognitive and at the neuroscience, systems level. It could also help diagnose memory problems -- in short-term or long-term visual memory -- that may arise acutely or develop over time, and help design mnemonic aids for recalling information better.

But besides basic science, if you could predict which information is memorable or forgettable, there is a whole realm of applications you could work on or propose, lying everywhere from data visualization, to making slogans better, to the whole realm of education and the individual differences that arise between people.

Someone may not learn very well; what can we do to increase the amount of visual information that this person can retain? There are also various other applications -- social networking, faces, retrieving images better, and so on.

So understanding what makes a piece of information memorable or forgettable is really a very interdisciplinary question, and something very exciting for us to work on, because there is a lot of future in that topic.

So this is a topic we started in my lab a few years ago. And the best way for you to get a sense of how you start working on memory is to do the kind of game and experiment we had people do.

So welcome to the Visual Memory Game. A stream of images will be presented on the screen for one second each. If you were in front of your computer, your task would be to press a key. But what you're going to do is play the game with me and clap your hands anytime you see an image that you saw before.

So you're going to have to be attentive, because the images go by fast in this rapid stream. And you will be getting feedback. So it's a very straightforward memory experiment. And this is the first step toward getting scores on a lot of images, regarding the type of information that people naturally forget or naturally remember.

So let's do the game. Are you ready? All right. Clap your hands whenever you see a repeat. This is how it will run. Very simple images.

[CLAPPING]

Fine. Excellent. So images are shown for one second, with a one-second interval between them.

[CLAPPING]

You're good. No false alarms. Excellent. All right. So that was one of the levels, level 9 out of 30, complete. And here's the game that people had. They could play this game for five minutes, take a break, and come back. And this game had a lot of success, and we ran it.

I'm going to show many results from this. You could see your score and the amount of money you had earned. This was run on Amazon Mechanical Turk, which allowed us to collect a lot of data from all around the world, on many, many images.

Those visual memory experiments were set up by Phillip Isola from MIT. And you can basically play that game for any kind of information displayed visually. So we did it for pictures, faces, and words. And you're going to see the results.

So let's start with the pictures. In the first experiment, we presented 10,000 images, and about 2,200 of them were repeated many, many times. Those were the ones for which we collected the scores. For a given subject, each of those images was actually seen only once -- twice, sorry.

So you have the stream of images at the top. For instance, an image will be shown again after 90, 100, 110 images or so. And if this was an image the subject recognized, they would press a key. So it's exactly the design you just experienced.
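To make the design concrete, here is a minimal Python sketch of how such a repeat-detection stream could be assembled. The function and parameter names are hypothetical, and the scheduling in the published experiments may differ; this only illustrates the idea of targets repeated after roughly 90 to 110 intervening trials.

```python
import random

def build_stream(targets, fillers, n_trials=1200, min_gap=90, max_gap=110):
    """Sketch of a repeat-detection stream: each target image appears twice,
    separated by min_gap..max_gap intervening trials; the remaining slots are
    filled with one-shot filler images (assumes len(fillers) is large enough)."""
    stream = [None] * n_trials
    for img in targets:
        while True:
            first = random.randrange(0, n_trials - max_gap)
            second = first + random.randint(min_gap, max_gap)
            if stream[first] is None and stream[second] is None:
                stream[first], stream[second] = img, img   # the repeat to detect
                break
    pool = list(fillers)
    random.shuffle(pool)
    for i, slot in enumerate(stream):
        if slot is None:
            stream[i] = pool.pop()
    return stream
```

During the experiment, each trial records whether the subject pressed the key, which later becomes the per-image hit and false-alarm data.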

And when you first look at the types of images that are highly memorable or forgettable, there is a trend that we all expect: images that are funny, or that have something distinctive or different, or that show people doing various actions, or objects that are out of context, tend to be memorable. And, in general, landscapes or images without any activity tend to be forgettable. So from that experiment we have scores for more than 2,000 images.

One of the first things we need to know is this: everyone is playing that game, and we can get their own memory scores. But in order to know whether some images are indeed memorable for all of us, or for a good part of the population, we need to see if there is consistency between people.

So here is a simple measure. You have a group of people looking at those images, and we have their memory scores. You can split the group in two and rank the images according to the average score of the first group, from most memorable to least memorable. Then you rank the same images according to the second group.

If the two groups were identical to each other, you would get a correlation of 1 between those two rankings. And what we observe, when we repeatedly split the group in two like this, is a correlation of 0.75, which is pretty high.

And this gives us basically the maximum performance we can expect: how well one group of people can predict the image ranking of another group of people.
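This split-half consistency is straightforward to compute once you have, per subject and per image, whether the repeat was detected. Below is a rough Python sketch, assuming a binary hit matrix as input; the names are hypothetical and the published analyses may use a different splitting or correlation procedure.

```python
import numpy as np
from scipy.stats import spearmanr

def split_half_consistency(hits, n_splits=25, seed=0):
    """hits: (n_subjects, n_images) array, 1 if the subject correctly
    detected the repeat of that image, 0 otherwise. Returns the mean
    Spearman rank correlation between per-image hit rates computed
    from two random halves of the subjects."""
    rng = np.random.default_rng(seed)
    n_subj = hits.shape[0]
    rhos = []
    for _ in range(n_splits):
        order = rng.permutation(n_subj)
        half1, half2 = order[: n_subj // 2], order[n_subj // 2 :]
        score1 = hits[half1].mean(axis=0)   # memorability (hit rate) per image, group 1
        score2 = hits[half2].mean(axis=0)   # same images, independent group 2
        rhos.append(spearmanr(score1, score2).correlation)
    return float(np.mean(rhos))
```

A value around 0.75, as reported in the talk, means one half of the observers ranks the images almost the same way as the other half.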

And here is the curve that shows what that 0.75 consistency looks like. On the x-axis, you have the image rank according to group number one, starting with the images with the highest memory scores. There is a group of them above 90%, which is expected.

Group number one, in blue, decreases as the images become less and less memorable. That is basically your ground-truth curve. The green curve shows group number two, totally independent people, and the performance they got for each bin along the image rank. You can see the curves are pretty close. So that is what a correlation of 0.75 looks like.

It means that, across independent groups of people, some images are going to be systematically more memorable or forgettable. You can see the full range, going from below 40% for the images that were forgotten up to 95% for the ones that were systematically remembered. But, importantly, one group of people predicts another group of people.

Now, there are several ways to test memory. The way we tested memory here to get ground truth was an objective measurement: you see an image again, and if you remember it, you press a key.

But you can also ask people, do you think you will remember this image? Do you think someone else will remember it? We also collected those subjective memory scores.

And we observed this very interesting trend: subjective judgments do not predict image memorability. Which means that if you ask yourself, am I going to remember this? -- well, maybe, maybe not. Your judgment of what your memory will be, or of what someone else's memory will be, is not correlated with true memory, with what you are actually going to remember or not.

This was very interesting, because it shows that objective measurements are needed here in order to really get a sense of what people will remember or forget.

We have published many papers on this topic since 2010; they are all on the website. Some of them look at the correlation between memory -- the fact that you are going to remember certain kinds of images -- and other attributes, for instance aesthetics. And, again, we found that memorability is distinct from image aesthetics.

This means that you could have an image that is judged very beautiful, or, on the contrary, ugly or boring, and in either case you may still remember it. So we found this absence of correlation between those two attributes in our various studies. We replicated it with other data sets, and with faces as well.

So it looks like what drives memory is this notion of distinctiveness. An image can be beautiful or ugly; it doesn't matter. You can still either remember it or forget it.

Then there is the question of the lag. The lag means that you could test visual memory after a few seconds or a few intervening images, or a few minutes, or even an hour later, or even days later.

Because we were running those experiments on Amazon Mechanical Turk, we did not test over days. However, we did run some with a larger gap, up to 1,000 different images between the first presentation and the repeat.

So here is the design, the one I showed you, with about 100 images intervening between the first presentation and the repeat. But what about shorter and longer time scales?

All that work is also published; you can go download the papers and see the details. But the basic idea is that the ranks were conserved. If one image is very memorable -- one of the top images after, let's say, a few seconds -- it will still be memorable after hours. And if an image is forgettable, one of the most forgettable after a few seconds, it will still be forgettable after hours.

The fact that the magnitude, the percentage of images remembered, decreases with delay is normal; that has been known from memory research for decades. So that is expected. However, memorability here is basically the rank of an image for a population, independent of the raw magnitude: an image that is among the most memorable in one condition stays among the most memorable in the others, even if overall performance drops.

We did those experiments both on the web and in the lab, because in the lab we could control for more factors. And it was very interesting to see, in the lab experiment, that after only 20 seconds there were images that were totally forgotten by a good number of people.

So some images really do not stick; they are gone within seconds. We did not go all the way into short-term memory -- we started in long-term memory, at 20 seconds and beyond. But the phenomenon is there.

It suggests that some features of the visual information are encoded in less detail than others, or with more detail than others. And what is very interesting is that you can then go to neuroscience and start studying the level of detail, or the quality of encoding, of an image, and see where in the visual pathway an image is basically gone after 20 seconds or so. So the important point is: the rank is conserved.

We also looked at these principles of memorability, which we found for images, in faces. Faces are a very interesting material to work with, because all the images look alike: you have basically one object, and you look at many, many exemplars of it.

And there is no reason to believe that the high consistency we found for images as different as an amphitheater, a parking lot, a landscape, and so on, would also be found with faces.

So we gathered a data set of 10,000 faces. The paper is published, and the entire data set is available on the web, so you can go and download those 10,000 faces as well as all the attributes we studied with the data set.

And we found the same phenomenon: very high consistency, both in the correct positive responses, when people remembered seeing a face, and in the false alarms, when people had not seen a face but falsely thought they had and pressed the key.

This very high consistency for both measurements suggests that, again, in people's facial features, or at least in the way a photo is taken, there is something at the level of the image that will make a face highly memorable or highly forgettable for most people.

All the details of this study are on the web. It was a pretty complex study to run, because we have very different sensitivities to faces -- the other-race effect, depending on where we grew up -- so we know there are a lot of individual differences.

So the way this study was run, the collection of faces followed the US census in terms of male and female, race, and age, starting at 18 years old and older. So did our subject population.

At the group level, we showed a collection of faces that matched the population of people who were doing the study. And, as I said, all the data are available on the web if some of you want to go back and do additional analyses.

So we kept going with this. We have the consistency for visual material. Now what about words? Words are a very interesting case, because now you do know the words. But are there certain kinds of words that we can predict are more forgettable or memorable?

Again, there is no reason to believe a high consistency will be found. But we ran the study twice, with two different data sets of about ten thousand items, two different sets of words. And we found, again, very high consistency: a collection of words were systematically remembered by people, and others were systematically forgotten.

This work, done in collaboration with Mahowald, Isola, Gibson, and Fedorenko, is under submission. But let me give you a taste of the words that we found.

So what makes words more memorable or forgettable? Here is a cartoon that gives you the basic idea: if there is one word for one meaning -- so a word has a single meaning -- it will tend to be much more memorable than a word that has many meanings.

In the paper, we also look at the correlation between memorability and imageability, frequency, and so on. All of this is described there, but really the main factor is this one-to-one mapping between a word and its meaning or concept.

So let's look at some examples. The paper will come with the two data sets and thousands of words that were found memorable or forgettable. So I think if you write a letter, you should not say that your student is excellent but that she is fabulous. And our research is not a blast; it's in vogue. And the ideas of a team are not irrational; they are grotesque.

Those are just a few examples: the first word of each pair, on average, has more referents and tends to be more forgettable, because it can be used for many more things than just one. I also noticed that a lot of the French words tend to be memorable.

So we do find this stable principle: from the content, you can predict what types of images, faces, and words are memorable. We also did it for visualizations, starting to work on the topic of education.

This is very useful, because at least at the level of a group, you can start making that prediction. Oh, I also have massive and avalanche -- forgot about those.

So now that we have all those data and can see that there is this consistency, one of the next questions you can ask is: OK, if it seems that we all have a tendency to remember the same images, can we find a neural signature of this in the human brain?

So is memorability a perceptual question or a memory question? Because in all our experiments, the images are shown for a short time, and then, when they are repeated, you see them a second time.

But basically, all the action is at the perception level. Whenever you perceive an image for half a second or one second, there is something going on at the perception level that is going to bias whether this image goes into memory or not.

Knowing this, if we want to look at the potential neural framework of memorability, we have to look at the entire brain. We have to look at all the regions that have been found to be related to perception -- face perception, picture perception, objects, space, and so on -- as well as the medial temporal lobe regions, more in the middle of the brain, that have been related to memory.

This is what we did with Wilma Bainbridge; this is her PhD work, basically looking at all those regions. And here is the very simple experiment we ran.

We took a collection of faces and scenes from the thousands we have. We looked separately at the regions that are more activated by scenes versus the regions more activated by faces, and we split each set of stimuli between a memorable and a forgettable set.

In those sets, every image is novel. Exactly like in the memory experiment, we showed each one a single time, for half a second. So you are at the perception level; you are in a perception experiment. You saw those images one after the other, only one time. All those images are novel.

So we are going to look at the contrast of novel image minus novel image, scene minus scene, face minus face, except that some of the images are highly memorable and some are highly forgettable.

Another factor you have to look at is that it is still possible that, within those groups of images and faces that are highly memorable or forgettable, there are a lot of image features that correlate with memorability.

If you take a collection of images like the ones I showed you before, the environments or photos that contain people or action tend to be memorable, whereas landscapes tend to be forgettable. So there you have a lot of visual features that co-vary with the dimension of memorability.

In that brain study, we equalized for that, because we had enough images. So here you have a sample of the two groups, a sample of images, and the type of statistics we looked at.

The two sets, for instance for the scenes, were equalized in terms of the categories they contained -- outdoor, indoor, beach, landscape, house, kitchen, and so on -- as well as a collection of low-level features.

And you can see some of the average signatures, which are actually identical on a lot of low-, mid-, and higher-level image features that were equalized between the two groups. So whatever we find is not going to be due to simple statistics of the images or the type of objects they contain.

We could play the same game for the faces, so we did. Here are the numbers again: memorable and forgettable faces that were also equalized for various attributes like attractiveness, emotion, kindness, happiness, and so on, as well as male and female, race, expression, and so on. And you can see that the statistics, the average faces for both the memorable and the forgettable groups, are also identical.

So with those groups, what is left is hopefully only something else in the image, at the level of higher-order image statistics -- because only image statistics can explain the fact that very different people remember the same faces and forget the same faces -- but certainly not the obvious low-level image statistics. Those cannot explain any of the results.

So, two years and four studies later, we replicated this study four times with different materials. I'm just going to show you one snapshot of the results here.

This is a multivariate pattern analysis (MVPA) comparing the memorable versus the forgettable groups -- a searchlight analysis looking for the regions of the brain that are more active, or rather that show a different pattern. They are also more active, but with a different pattern, for memorable faces.

And we find signatures in the hippocampus, the parahippocampal area, as well as the perirhinal cortex, that are typical for memorable faces and scenes, and no signature in the visual areas or even the higher visual areas, because we did equalize for those.

So it seems to show that those MTL regions play a role in a kind of higher-order statistical perception, a notion of distinctiveness that is within those images.
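As an illustration of what this kind of multivariate comparison boils down to, here is a simplified, ROI-level sketch in Python (the searchlight version repeats the same classification in a small sphere around every voxel). The variable names are hypothetical and this is not the exact pipeline used in the published study.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def roi_mvpa_accuracy(patterns, labels, n_folds=5):
    """patterns: (n_trials, n_voxels) response estimates within one region,
    e.g. the hippocampus; labels: 1 for memorable images, 0 for forgettable.
    Above-chance cross-validated accuracy means the region's activity pattern
    distinguishes the two sets, even though every image is seen only once."""
    clf = LinearSVC()                       # a linear classifier, standard for MVPA
    scores = cross_val_score(clf, patterns, labels, cv=n_folds)
    return scores.mean()
```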

This suggests that at the perception of a new image -- it could be a face, a scene, a collection of objects, and so on -- there is already a signature that is going to guide or bias whether it will be put into memory or not. We have this at the level of the group; we are now looking at the level of individuals.

All right. So the next step, given your question, is to try to model this notion of memorability. We have a good case that there is some information in the image, of a higher-level nature -- we don't know exactly which -- that cannot be explained by simple features and that makes all of us react the same way. Even our brains react the same way; we see a neural signature of memorability emerging.

So we have this intrinsic information that makes all of us remember or forget the same kind of visual information. Can we now, in a way, imitate or model those results in an artificial system?

All right. So you have all heard about the revolution in computer vision a couple of years ago -- deep learning and those neural networks that are now able to perform a lot of tasks, some of them at the level of humans, such as recognizing various objects and so on.

And one aspect of those neural networks, and I'm going to talk about them, is that they require a lot of information. You need to teach them the classes you want them to distinguish, and they can use, and need, a lot of data.

Everything we did so far, we got from a couple of thousand images. And that is really not enough to even start scratching at computational modeling. So with Aditya Khosla, we recently ran a new large-scale visual memorability study on Amazon Mechanical Turk, this time getting scores for 60,000 photographs.

You have a sample of them over here. 60,000. They are all going to be available in a few weeks; the paper is under revision, and it's looking good. As soon as we have a citation, we are going to give away all the data, the scores, the images, and so on.

In this experiment, images were presented for 600 milliseconds, a shorter time, but the results really did not change much. As a snapshot -- because the 60,000 images cover a lot of types of photos: faces, events, actions, even graphs and so on -- you see here some that were highly memorable or forgettable. There is also a website. It is not populated with much yet, but it will be very shortly.

The correlation we got on that data set was pretty high again, 0.68, as expected. The paper explains the various splits we did. But, again, we find this very high correlation, so we do know that there is something to model there.

And again, a very quick summary: the types of images that seem to be the most memorable are the ones that have a single focus and close-up settings, that show some dynamics and something distinctive, unusual, a little different; whereas the less memorable ones seem to have no single focus, a distant view, a static scene, and more commonalities.

Of course, you can still find two photos that are both focused and dynamic, and one of them will be more memorable than the other because it has something unusual that our system can now capture.

So we have this new data set. We have all the memory scores. We have the high consistency. How do we even start thinking about a computational model of visual memory, or memorability?

Well, to give you a sense of one of the basics we need in order to even start thinking of a model, I'm going to run another demo. In this demo, you are also going to see some images, and you clap your hands whenever you see an image that repeats. OK? Exactly the same game as before.

Ready? All right. If everyone plays the game, it will be fun. OK.

[CLAPPING]

[CLAPPING]

A little false alarm.

[CLAPPING]

False alarm.

[CLAPPING]

Good. More energy.

[CLAPPING]

Good.

[CLAPPING]

Sorry.

[CLAPPING]

No.

[LAUGHTER]

[CLAPPING]

No.

[CLAPPING]

Yes.

[CLAPPING]

No.

AUDIENCE: Too close.

[CLAPPING]

AUDE OLIVA: No. Yeah, that was-- yes.

[CLAPPING]

[LAUGHTER]

All right. So for the sake of the demo, I put in images that are really very different. Some you are familiar with: you have a concept, you know what it is. This is a restaurant. This is an alley. This is a stadium, and so on. And some for which you either don't have a specific concept, or you have the same concept -- texture, paintings, texture, texture, texture.

And the basic idea of memory is that you need to recognize, to put a unique tag or collection of tags on an image, in order to remember that individual image. So if you saw a collection of textures or paintings or whatever you want to call them, you will remember them as a group.

But to get to the individual memory of one, you are going to need a specific concept. And in order to remember it, there has to be an abstraction, a format, a collection of words, or a code that makes it unique.

So you need to recognize to remember. Which means that if you want a model of memory that starts from the raw image -- I am not in a toy world here; I really start from the raw image, like the retina -- well, you are going to need to build a visual recognition system first.

So first you need a model that recognizes objects and scenes and events and so on. And then this can be a base from which to start modeling memory.

Fortunately, the field of computer vision has made a lot of progress in the past couple of years. So now we do have visual recognition systems that work pretty well. I'm going to describe them.

We need a visual recognition system first. All right. So what does a visual recognition system need to do? Well, it's Sunday morning. You go to the picnic area, and you are faced with that view. You take a photo. This photo actually went viral on the web.

And here is the state of the art of computer vision systems. When it comes to recognizing the objects -- I know it's a different view, but it works very well for any view -- object recognition for about 1,000 object categories is reaching human performance so far.

And so this will tell you this is a black bear, there's a bench, a table, and so on, and trees in the background. But it's missing the point, that this is a picnic area.

So you need at least two kinds of information in order to reach visual scene understanding. You need the scene and the context -- you need to know the place -- and you need to know the objects.

As I said, on the challenge called the ImageNet Challenge in computer vision, computer vision models now match average human performance, which is about 95% correct, on exemplars of objects they have never been trained on -- new exemplars -- for 1,000 object categories.

And recently, we published a few papers on the other part of visual scene understanding, the place and the context. This is an output of our system, and you can go on the web -- I'm going to give you the address -- and play with it and see how well it recognizes the context and the place.

Just to put in context what the field of computer vision has been doing for the past 15 years: the number of data sets, and the number of images per data set, has been increasing, so that there are more exemplars to learn from.

And, for perspective, you can compare with a two-year-old kid. Of course, the estimate of how much visual information the retina has seen will depend on the sampling rate you assume, but the child sees much, much more variety and a much larger number of visual inputs.

Right now, both ImageNet, which is a data set of objects, and Places, which is a data set of scenes that I'm going to present to you now, have about 10 million labeled images across many categories. Labeled means that, for Places, it will tell you this is a kitchen, or this is a conference room, and so on, for hundreds of categories.

So 10 million is largely enough to start building very serious visual recognition systems, but it is nowhere near the human brain. However, we might be getting there. So how do we even start to build a visual scene recognition data set?

This is work we did and published in 2010, the Scene Understanding, or SUN, data set, where we collected the words from the dictionary that correspond to places at the subordinate level -- I'm going to give you examples -- and then retrieved a lot of images from the web. There was a total of 900 different categories, and about 400 of them had enough exemplars to train an artificial system.

So instead of only going and building a data set of bedrooms and kitchens and so on, what happens for the human brain is that you see a given environment with many attributes you can attach to it, which support description and storytelling and so on.

For most places, it is not only that this is, for instance, a bedroom; you will go further, to a student bedroom or-- Well, I think those are the student bedrooms two doors down from my colleague.

So there are many types of adjectives that we can use to retrieve images, which give us a larger panorama of the kinds of concepts that the human brain uses to recognize environments.

So: a simple bedroom. The tags were assigned automatically. Superior bedroom. Senior bedroom. Colorful bedroom. Hotel bedroom. And so on and so on.

That was the retrieval we did, which means that now, for every category, there is also a tag for the subtype of environment it can be. Messy bedroom.

And a couple of years, 80 million images, and a lot of Amazon Mechanical Turk experiments later, we are launching this week the Places2 data set, with 460 different categories of environments and 10 million labeled images.

So this is a larger data set of labeled images, ready to be used right away for training artificial systems, deep learning, and so on.

Here is just a snapshot of how Places differs from other large data sets in terms of the number of exemplars. The Places data set is actually part of the ImageNet challenge this year, which means you can go to ImageNet, register for the challenge, and download right now, tonight, eight million images of places to use for training your neural networks, as well as a set that will be used for testing, and participate in the challenge.

This was launched last week, and the website associated with Places will be launched this week. We decided to just give this away to everyone right away. We are finishing up the paper now; it will be an arXiv paper.

No time to wait for months and months. This is a data set that can be used by a lot of people to make progress fast. And so that's what we are doing.

As I said, computer vision models now require, if you use deep learning, a lot of data. And we hope that with this data set, fast progress is going to be made.

What we specifically did is use the AlexNet deep learning architecture. If you don't know what this is, I can tell you later how to access it, the paper, and so on. This is not my model; it was put together by Geoffrey Hinton and collaborators a few years ago. You can download the model, or download the code and retrain it.

Neural nets are basically based on a collection of operations called layers -- convolution, normalization, simple image processing operations that you apply in sequence. You do it once, then a second and a third time and so on, and you end up with those multi-layer models.

And the number of layers is still a research question. How do the layers correspond to the brain? I'm going to say a little bit about that.

Using this simple-- well, this AlexNet model, we built a scene recognition system. And now you can go to places.csail.mit.edu -- a smartphone will work -- take a photo, and it should tell you the type of environment the photo represents.

It will give you several possibilities, because environments are ambiguous; they can be of different types. I don't know if you can read it, so let me read a few. The first one says: restaurant, coffee shop, cafeteria, food court, restaurant patio. I guess they all fit.

The second one: parking lot and driveway. The third one: conference room, dining room, banquet hall, classroom. That was a difficult one. And the fourth environment is patio, restaurant patio, or restaurant.

If you go there, you can also give feedback on whether one of the labels matches the environment you are looking at. It should be above 80% correct. This model used 1.5 million images and 200 categories, so soon, with the bigger data set, we hope things will be even more interesting and accurate.
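Under the hood, the demo is just a forward pass through an AlexNet-style classifier. Here is a rough sketch using PyTorch and torchvision; note that torchvision ships an ImageNet-trained AlexNet, so the Places-trained weights and the image path below are placeholders you would substitute with the ones released by the authors.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# AlexNet-style scene classification, as a sketch of what the demo does.
model = models.alexnet(weights="IMAGENET1K_V1")      # ImageNet weights as a stand-in
# model.load_state_dict(torch.load("alexnet_places.pth"))  # hypothetical Places weights
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("photo.jpg")).unsqueeze(0)   # hypothetical input photo
with torch.no_grad():
    probs = torch.softmax(model(img), dim=1)[0]
top5 = torch.topk(probs, 5)      # the demo reports several guesses, since
print(top5.indices.tolist())     # many environments are genuinely ambiguous
```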

I took a couple of photos at breakfast this morning, so you may all recognize the scenery. From the breakfast area looking outside, it says: outdoor harbor, dock, boat deck. Yeah, it could actually be taken from a boat deck looking at the harbor. And the breakfast area itself was: restaurant, cafeteria, coffee shop, food court, or bar. All of those, again, fit.

So those models now work very, very well. Why? Well, let me tell you why. With those neural networks, you can go to any layer, open it up, and look at what every single artificial neuron does -- what we call the receptive field of every single unit in layer one, layer two, layer three, and so on.

The first layer -- here, four layers are shown -- basically learns simple features, contours and simple textures. That is called pool1. And those look like the types of responses of visual cells, possibly V1, V2. I'm going to say more about this.

As you go higher up in the layers, you start having textures and patches that make more sense. And higher up, at layer number five, you start having artificial receptive fields that are specific to parts of objects, parts of scenes, or an entire object by itself -- like here, some kind of [INAUDIBLE] coffee, as well as the tower, and so on.

So it seems that the systems are able to recognize environments and objects, but what they learn are the parts and the objects that the environments contain. I'm going to show examples of those.

Jumping ahead, let me just give you a result in neuroscience. There is a lot of debate out there: OK, there are those models with different layers, and there are different models out there -- to what extent do they correspond to the visual hierarchy of the human brain?

Well, first, the computational models were inspired by the visual hierarchy -- V1, V2, V4, and [INAUDIBLE] and so on -- knowing that more complex features are built up over time and space.

What you can also do is run the same images through a network and through an fMRI experiment, and then look at the correlation between the responses of the units, let's say in layer one, and the responses you measure in different parts of the human brain.

And what we find is that layer one corresponds more -- has a higher correlation -- with responses in the visual areas, literally V1, V2. And as you move up through the layers, there is a higher [INAUDIBLE] correlation with parts of the ventral and parietal regions of the brain.

I know you had the lecture by Jim DiCarlo, who must have explained this. Jim DiCarlo's team did it, Jack Gallant as well at Berkeley. And we also did it with other types of images and other networks, and all the results really corroborate each other, with this nice correspondence between low and high visual areas in the human brain and the layers of those multi-layer models.
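One generic way to set up such a layer-to-region comparison is representational similarity analysis: build a dissimilarity matrix over the same image set for each network layer and each brain region, then correlate the matrices. The sketch below illustrates that idea; it is not the specific pipeline used by any of the groups mentioned, and the names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(responses):
    """responses: (n_images, n_features) -- unit activations of one network layer,
    or voxel patterns of one brain region, for the same set of images.
    Returns the condensed representational dissimilarity matrix (1 - correlation)."""
    return pdist(responses, metric="correlation")

def layer_region_similarity(layer_activations, region_patterns):
    """Second-order similarity between a layer and a region: the Spearman
    correlation between their two dissimilarity matrices."""
    return spearmanr(rdm(layer_activations), rdm(region_patterns)).correlation
```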

All right. Let me show you some of the artificial receptive fields that we find in the higher layers. This network was trained for scene categorization. The only thing this network -- the one we are using here -- learned was to discriminate between, in this case, 200 categories: the kitchen, the bedroom, the bathroom, the alley, the living room, the forest, and so on. 200 of those. That's the task.

What the network learns, and what we observe, is that objects -- the discriminant and diagnostic information between those categories -- form the emerging representations that are automatically, naturally learned by the network.

So the network has never been taught what a wheel is, but the representation emerges, as you can see here. This is one artificial neuron, its receptive field, and its responses -- the highest responses it got over a collection of images.

And as you can see, those higher, pool5 receptive fields are more independent of location; they are built that way. But the network never learned the parts. This is something that emerged naturally from learning to discriminate different environments.
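A simple way to inspect what a unit has become selective for is to pass a large image set through the network and keep the images that drive that unit the most. Below is a rough PyTorch sketch of that idea; the function names are hypothetical and the visualizations in the published work are produced with their own method.

```python
import torch

def top_activating_images(model, layer, unit, image_batches, k=9):
    """Return the k images (from an iterable of preprocessed batches) that most
    strongly activate one unit of a chosen convolutional layer. Activations are
    max-pooled over space so that the unit's preferred content, not its location,
    is what matters."""
    acts = []

    def hook(_module, _input, output):
        acts.append(output[:, unit].flatten(1).max(dim=1).values)

    handle = layer.register_forward_hook(hook)
    scores, images = [], []
    with torch.no_grad():
        for batch in image_batches:
            acts.clear()
            model(batch)
            scores.append(acts[0])
            images.append(batch)
    handle.remove()
    scores = torch.cat(scores)
    top = torch.topk(scores, k).indices
    return torch.cat(images)[top]
```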

The other thing that is very interesting in this model -- you can open it up and look at the receptive fields using various methods, here is one -- is that this model has never learned the notion of a shape or an object. So it becomes sensitive to whatever information is discriminant and diagnostic.

For example, you have this unit that is selective for bottom parts: either the legs of animate things or, you can see, the trunks of the trees over there. This is a unit that was apparently needed to classify environments, and we might not have a word for it. The human brain may well have many such units that do not necessarily correspond to a word.

So, with this model, a lot of new objects can emerge that you might not have thought of, but they become part of the code needed to identify an environment or an object.

We also have another unit for the bottom parts of a collection of chairs -- and we have chairs themselves showing up, of course. Faces: the model never learned faces; it only learned kitchen, bathroom, street. Yet we have several units emerging for faces. Why? Because faces are correlated with a collection of environments.

Then there are entire object shapes, like a bed, very diagnostic of a bedroom in this case. That is a unit that is very, very specific -- really only for beds, that one.

And then others, like lamps and so on; there are thousands and thousands here. Another thing the network never learned: screen monitors. And here is a specific unit for that, which emerged. Also collections of objects, or space -- a collection of chairs over here.

The network found that crowds are discriminant information for classifying environments. It's a very interesting unit because it is really independent of location, as well as of the number of people and whether they are closer or further away. But it does capture this notion of a crowd.

The model does not have the word crowd; it has never been told about crowds. It is one of the emerging units that can now be used as an object detector to enhance the recognition of what is going on in the scene. So this is an ice skating area, and there is a crowd.

There are also units that are more specific to space and useful for navigation, for instance; we have many of those. In that case, just the fact that there are lamps above, or a perspective, we have a unit specific to that.

So it is not only objects as physical objects; there is also a collection of units related to spatial layout and geometry that are discriminant for environments. And many, many others are showing up. So those object detectors naturally emerge inside this kind of network trained for scene understanding.

Now I only have a little time to wrap up -- 20 minutes -- so I'm going to just give you a hint of the next part. With the Places challenge starting this week, certainly in less than a year the computational vision models of scene recognition are going to be very close to human performance.

And then there is still a long way to go and many more things to match, like the errors -- whether the errors look alike, or when a category can contain many types of objects, or what can happen next. So you can really expand from there.

But let's say we can consider that we now have a base of visual recognition in a model that works pretty well at the level of humans, or close enough, or will be close enough. Now that we have that, we can add the memory module.

How to add the memorability module is really an open question; there are many ways to do it. We just did it one way, to have a very first baseline model of visual memorability at the level of humans.

This is going to be out -- the paper is in revision -- in a few weeks, and you are going to be able to download the images, the model, and so on. Again, this is model number one, and we hope that better models can then be built.

So we went for the Occam's razor approach, the simplest one given the model. We took AlexNet and fed it with both ImageNet and Places, because the images that are memorable or forgettable might contain objects and they might contain places. OK, so let's put the two together so we have more power.

We train the model, so we have the visual recognition model. We do know that those units make sense: they recognize that this is a kitchen, and we know why, because we have the parts. So that is a classic, standard AlexNet. The output is scene and object categorization, so we are still in categorization land.

Then we can remove the last layer. This is a procedure that is well established in computer vision, this notion of fine-tuning with back-propagation: you take the network that has learned to recognize places, and you fine-tune and adapt the features because the task has changed.

The task was recognition. Now, at the end, there is a new task for this network that has learned all those objects and scenes: to learn that those images are of high, medium, or low memorability, which is a continuous value.

By doing this, we now have a model where we give it a new image, and it outputs a score between 0 and 1. The human-to-human consistency, as I showed you, is a 0.68 correlation. The human-to-computer correlation is 0.65, which means there is now a first model that can basically replicate the memorability ranking of a group of humans, nearly at the level of a human.

And we used the data set of 60,000 images to fine-tune the recognition model, and tested on images it had not seen.
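In modern terms, this fine-tuning amounts to swapping the classification head for a single scalar output and regressing onto the human scores. Here is a minimal PyTorch sketch of that idea; the weights, hyperparameters, and loss are placeholders and not the exact recipe used in the paper (which starts from a network trained on both ImageNet and Places).

```python
import torch
import torch.nn as nn
from torchvision import models
from scipy.stats import spearmanr

# Start from a recognition network and retarget its last layer to predict
# one continuous memorability score per image.
net = models.alexnet(weights="IMAGENET1K_V1")               # stand-in for the hybrid weights
net.classifier[-1] = nn.Linear(net.classifier[-1].in_features, 1)

optimizer = torch.optim.SGD(net.parameters(), lr=1e-4, momentum=0.9)
loss_fn = nn.MSELoss()          # regress onto the human hit rate, a value in [0, 1]

def train_step(images, human_scores):
    optimizer.zero_grad()
    pred = net(images).squeeze(1)
    loss = loss_fn(pred, human_scores)
    loss.backward()              # back-propagate through all layers (fine-tuning)
    optimizer.step()
    return loss.item()

def rank_consistency(pred_scores, human_scores):
    # Evaluate the model the same way two human groups are compared:
    # Spearman rank correlation over the held-out test images.
    return spearmanr(pred_scores, human_scores).correlation
```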

And because we can open up this new memorability network, we can also look at the receptive fields of the neurons that are related to high- or low-memorability images -- we will publish every single receptive field on the web along with this paper. So now we can see the units, the higher-level information about objects or space or parts and so on, that are related to images that are highly memorable (in green, strongly positive) or highly forgettable.

And we find, again, that if you have animate objects, or certain kinds of objects, roundish objects, and so on, you can go [INAUDIBLE] those will make your images more memorable.

So the last part is this: now that we have a model that spits out responses at the level of humans and indicates the parts related to higher or lower memorability, at least as a guideline, we can go back to a given image and emphasize the parts that correspond to the high-memorability receptive fields and de-emphasize the parts corresponding to the forgettable ones.

And this gives you images, like the ones on the right, that have been weighted by the elements that are memorable and the elements that are forgettable.
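In the simplest form, such a visualization just modulates the image by a spatial memorability map. Here is a small numpy sketch of that idea, assuming the map has already been obtained (for instance by projecting high-layer unit activations back into image space); the function name and the blending rule are hypothetical, not the paper's exact procedure.

```python
import numpy as np

def reweight_by_memorability(image, heatmap, strength=0.6):
    """image: HxWx3 float array in [0, 1]; heatmap: HxW array in [0, 1] giving
    the predicted memorability of each region. Regions with high predicted
    memorability keep their full color; low-memorability regions fade toward gray."""
    gray = image.mean(axis=2, keepdims=True) * np.ones(3)   # grayscale version
    weight = (1 - strength) + strength * heatmap[..., None] # 1 where memorable
    return weight * image + (1 - weight) * gray
```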

Maybe here it doesn't matter that the ground is forgettable; the parts I would like to emphasize are the memorable ones. But, for instance, in this image we have two people, and, well, she just happened to be more forgettable in this case -- which would make her a perfect CIA agent, if you think about it. And we did test those.

Then, for various scenes relevant to navigation, we find that the elements of exit or entrance -- where there is basically a path and a 3D structure for navigation -- also tend to be more memorable. Those are highlighted in these images.

Another example, where the kids are more memorable than the features of the person. And I don't-- Well, I'm not an expert in games; you can explain that one to me.

So I have to stop, because we need a break and I might need to talk. However, here is a vision of where we are going, and maybe other people will be interested in joining this adventure, because it is really, really just the beginning.

But if you can now have a model that, at the level of a group of humans, predicts which images are memorable, and that also matches parts of the visual regions and even higher-level object recognition, then you can start building this correspondence between the human brain -- all the parts, from perception to memory -- and computational models, in order to characterize the computations underlying those particular regions. Because now you have a model zero.

I'm not saying that the model I'm showing you is a model that imitates the brain. But it is a model zero, and it can now be tuned to better capture the computations that are done at different levels along the visual-to-memory hierarchy, in order to characterize visual memory.
