The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

RYAN ALEXANDER: All right, so as this XKCD comic points out, in CS it can be very difficult to figure out when something is just really hard or when something is virtually impossible. And until a couple of years ago, people thought image classification was something closer to the impossible side. But with the advent of deep learning technology, we've made significant strides in image classification, and now the problem is actually quite practical. So today we'll be going through how the process of image classification with deep learning works. We're first going to talk about what deep learning is, and then we'll move into some of the image processing techniques that researchers use, followed by the architecture of convolutional neural networks, which will be the main focus of our presentation. We'll also talk about the training process, and then go through some results and limitations of CNNs in image classification.

So what is deep learning? Well, the term is particularly vague, and purposely so, for a couple of reasons. The first is that mystery is always good for marketing. But the second reason is that deep learning refers to a pretty wide range of machine learning algorithms. They do have some commonalities: they all seek to solve problems of a complexity that previously people thought only humans could solve. So these are more sophisticated classification problems than conventional machine learning algorithms can handle.

So how do they go about doing this? Well, all of these deep learning programs tend to take all the processes that need to happen and split them up.
They've got different parts of the program working on different things, all performing calculations, and then at the end it all comes together and we get a result. Of course, this isn't unique to deep learning, and lots of distributed systems decentralize their calculations. But the key thing about deep learning is that every part is performing these calculations, and the calculations are not simple ones. It's not a matter of doing one simple operation over and over again on a lot of data and then getting a result at the end. Each part is performing some particularly complicated process on all the little parts before they come together.

So why is this architecture a good idea? Why did engineers come up with this sort of decentralized, multi-layered, complex process? Well, take the example of image classification. It turns out that the human brain does a pretty similar process. So here's the human visual system, and it's pretty much a hierarchical process. You begin by moving from the retina into the first areas of the brain, and as the information gets processed, it moves from one region of the brain to another, and each spatial element of your brain is performing an entirely different calculation. For example, the V1 area over here is picking out edges and corners, and then over here, a couple of steps later in V4, you're starting to group those figures together. And so the brain operates in a way that is very similar to the way these networks operate.

So let's talk about how to classify a face. If I asked you how you would classify a face, what is the first thing you might do? Well, as I mentioned before, the first thing our brain does is find these edges. The first thing to do is identify where the face is versus everything else. Now, does anyone have any idea as to what we could do as the next step?

Julian, you have an idea?

AUDIENCE: Maybe you could group these edges together.

RYAN ALEXANDER: Right.
We could maybe identify some of these features that we're working with. So these are things like noses, and lips, and eyes. And then what do we do after we have these individual features?

Steve.

AUDIENCE: Well, maybe we can group some of those together.

RYAN ALEXANDER: Exactly. Yeah, we can organize them into what we know the pattern to be. We know that a face has to have two eyes above a nose, and a nose above the mouth. So that is precisely what a neural network actually ends up doing, and we'll walk through the process of how it does this later on in the talk. But as you can see, the intuitive way that we classify a face, and the way our brains are wired to do it, is pretty similar to the way we got these neural networks to operate.

So like I said, we're going to be talking a lot about these convolutional neural networks. There are other types of architectures involved; like we mentioned before, deep learning covers a pretty wide variety of algorithms, but we're going to focus on CNNs. To give you a sense of how good these CNNs are, here are results from the ImageNet competition. The ImageNet competition is basically exactly what it sounds like: a bunch of computer scientists get together and see how many images they can correctly classify. And the error rate was pretty high, almost a one-third error rate over here in 2010 and 2011, and then when CNNs were introduced in 2012, the error rate plummeted. As you can see over here in 2015, we've got a significant improvement in these ImageNet competitions. So clearly, CNNs have been very effective, and it's definitely something exciting that is happening in the field right now.

All right, so now we're going to move into image processing.

ISHWARYA ANANTHABHOTLA: OK, so Ryan gave us a nice overview of where we get this concept of neural networks, but let's take a time travel and go into a quick history lesson.
So suppose I had a chair, and I wanted the computer to classify this chair. I have some a priori knowledge about what sorts of things make up a chair, so I might be interested in looking at arms, corners of the chair, legs, things like that. So I would go ahead and feature-engineer my discovery scheme to look for specific things. I'm going to talk about some techniques that are traditionally used. For example, chairs and doors have corners, so I might use an image processing technique called a Harris corner detector, where we basically look for large changes in intensity as a window of pixels moves around an image, which indicates the presence of corners, and you can use common corners to say, OK, all of these images are chairs, or doors, or whatever. Similarly, say I have a bunch of pictures of chairs of different sizes, but they all must have so many corners or something. Then typically we'd use the SIFT algorithm, the scale-invariant feature transform, which basically says that across different sizes, I should still be able to extract information about the placement of corners.

Another common technique used in image processing is what we call HOG, the histogram of oriented gradients. So basically, for example, if I want to find all the images that have faces in them, or consist of faces, let's say, I might come up with a template of a face that assigns gradients to groups of pixels forming the outline of what looks like a face, then scan it across my sample images and say, OK, a face is present in this image. Obviously, there are some errors; a cap and a logo back here have been detected as faces, but this is the traditional approach.
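To make this hand-engineered style of feature extraction a little more concrete, here is a minimal sketch of Harris corner detection using OpenCV; the file path and the threshold are placeholder choices, not anything from the lecture.

```python
import cv2
import numpy as np

# Load an image and convert it to grayscale (the path is a placeholder).
img = cv2.imread("chair.jpg")
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# Harris corner response: large where intensity changes sharply in every
# direction as a small window slides over the image.
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep only strong corners (threshold chosen arbitrarily for illustration).
corners = np.argwhere(response > 0.01 * response.max())
print(f"Found {len(corners)} corner pixels")
```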
But here's the problem: what if I don't actually know which features are the most critical for the dataset that I get? I want the system itself to figure out what techniques to apply, without having any a priori knowledge about the dataset.

So this is exactly the idea of CNNs, convolutional neural networks. We want the techniques to be learned automatically by the system. So if I'm trying to classify faces, I want the system to figure out that eyes, and ears, and noses are the most important things. Or if I'm trying to classify elephants, that the ears and trunks are the critical features, without me having to say, OK, we're going to do corner detection, and so on and so forth. So this is the idea.

To be able to understand this process in greater detail, I'm first going to go into a little bit of math, and the idea is to present the most fundamental operation here, which is the convolution. So this is the formal definition of the two-dimensional convolution, and since we're working with images, we're only considering the two-dimensional case. In a more graphical presentation, which is a little bit easier to understand than just seeing the formula, the idea is that we have a kernel, or a convolutional filter, that we apply to an image, and that extracts some information about the image that we can use to help us classify it.

So assume that this is our kernel, or this is our filter, and suppose this-- oh, there it is. So suppose we're applying the kernel [INAUDIBLE] here to the image that's in green. The idea is that we want to slide this filter across the image, and what we're basically doing is a succession of dot products. At each placement on the image, we multiply the overlaid numbers, and the sum becomes the corresponding entry of the convolved output. So this is basically the way the process works. You probably noticed that there's a reduction in dimension, and Henry will talk a little bit more about why this is.
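Here is a minimal NumPy sketch of that sliding dot product, a "valid" two-dimensional convolution; the toy image and kernel values are made up for illustration. (As in most CNN libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.)

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image`, taking a dot product at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h, out_w = ih - kh + 1, iw - kw + 1   # the reduction in dimension
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product of patch and kernel
    return out

# Example: a 3x3 averaging (blur) kernel applied to a small 5x5 image.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0
print(convolve2d_valid(image, kernel))  # 3x3 output: (5 - 3 + 1) in each dimension
```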
So let's see some examples of what information we get by applying a convolution. You see the image of a tiger on the top left. When we apply a filter that's a low-pass filter, basically a Gaussian, then we get the low spatial frequency information about this image. So basically we've blurred it, and this tells us something specific that we might want to learn. The kernel actually looks like a two-dimensional Gaussian function that's been sampled across this three-by-three grid.

Similarly, we might be interested in high spatial frequency information. So in this case, we're looking at sharp features: horizontal edges or vertical edges. So a question for you: if I have this kernel, which of these outputs do you think it produced when applied to the original image?

AUDIENCE: The third one on the right.

ISHWARYA ANANTHABHOTLA: Yeah, that's exactly right, and it's probably pretty easy to see why that's the case, given that the nonzero numbers form horizontal bands here.

Lastly, we may also be interested in extracting information at a particular frequency, so we can take the difference of a high-pass filter and a low-pass filter to get a band-pass filter tuned to that frequency, and extract information about it as well.

OK, one last helpful piece of information is that there's another way you can think about the information that's learned at each stage, because a convolution in the image domain corresponds to a multiplication in the frequency domain, so you can also think of the image transformation through the Fourier transform. From an image perspective, what a Fourier transform gives you is a sum of sinusoidal gratings that differ in frequency, in orientation, in amplitude, and in phase. So you can think of the zebra image here as actually a composite of different gratings that might look like this, and the Fourier coefficients would be how much of each of these pieces comes together to make that final image.

So just to get a sense of what kind of information this could convey, we typically take a Fourier transform and break it apart into magnitude and phase representations. So you see the magnitude, and you see the phase. Those images weren't particularly clear, but this is a really good example: if we take the Fourier transform of all the horizontal text here, you see how the magnitude reflects this, and you can go back to the math to understand why it's reflected as a vertical marking. And similarly, if I were to take that same image, rotate it, and then ask for the Fourier transform, you see how that information is contained very clearly in the magnitude spectrum. So these might be things that a network would learn at each stage to try to identify this as text, or as a body of text that's tilted one way or the other, and so on and so forth.
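As a small illustration of that magnitude and phase decomposition (the slide images themselves aren't reproduced here), this is a NumPy sketch on a synthetic striped image standing in for the block of horizontal text.

```python
import numpy as np

# A tiny synthetic "image" of horizontal stripes, standing in for the text example.
img = np.zeros((64, 64))
img[::4, :] = 1.0               # horizontal bands -> energy along the vertical frequency axis

F = np.fft.fftshift(np.fft.fft2(img))   # 2D FFT, zero frequency moved to the center
magnitude = np.abs(F)                   # "how much" of each sinusoidal grating
phase = np.angle(F)                     # "where" each grating sits

# Rotating the image rotates the magnitude spectrum by the same angle,
# which is the effect described for the tilted block of text.
print(magnitude.shape, phase.shape)
```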
So with that, we can now go into what the actual architecture of a convolutional neural network is.

HENRY NASSIF: All right, so as was said earlier, in order to classify or detect objects, you actually need certain features, and you need to be able to identify those features. And the way you can identify these features is by using certain convolutions, or certain filters. In many cases, we don't know what these features are, and as a result, we don't actually know what filters would extract them. What convolutional neural networks allow us to do is determine what these features are, and also determine what the filters are that extract them.

Now, the idea for convolutional neural networks, or the idea of replicating how the brain works, started in about the 1950s and 1960s after some experiments by Hubel and Wiesel. What happened in these experiments, as can be seen here, is that a cat was shown a band of light at different angles, and the neural activity of the cat was measured using an electrode.
And the outcome from this experiment showed that, based on the angle at which the light was shown, the neural response of the cat was different. As you can see here, the number of neurons firing, as well as which neurons were firing, differed depending on the angle. What you can also see here is a plot of the response versus the orientation of the light. And what this led Hubel and Wiesel to is the idea that neurons in the brain are organized in a certain topographical order, and each one fills a specific role and only fires when its specific input is shown, in this case, when the light is at its preferred angle.

Now, the first step to actually replicating how the brain works in code is really understanding how the building block, the neuron, works. Here's a quick reminder of 7.012. A neuron is a cell with dendrites, a nucleus, an axon, and an axon terminal. What the neuron does is aggregate the action potentials, or the inputs, that it gets from all the neighboring neurons connected to it through its dendrites; it sums these action potentials, compares the sum to a certain threshold that it has internally, and that determines whether or not the neuron fires an action potential of its own. And that very simple idea can actually be replicated in code.

An artificial neuron looks very much like a natural one. What you have is a set of inputs; here we have three inputs that are summed inside of a cell, or a neuron. The sum here is not just a regular sum, it's a weighted sum. The neuron specifies a weight for each input, which you can think of as how much it values the input coming from a specific neuron. Each input is multiplied by its weight, and the total sum that the neuron computes is then fed into an activation function that produces the neuron's output.
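Here is a minimal sketch of that artificial neuron in NumPy: a weighted sum of the inputs passed through an activation function. The input, weight, and bias values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """One artificial neuron: weighted sum of the inputs, then an activation."""
    z = np.dot(weights, inputs) + bias   # weighted sum (a dot product)
    return activation(z)

# Three inputs and three weights, as in the slide's example (values are placeholders).
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
print(neuron(x, w, bias=0.2))
```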
Now, what we just saw is really a simple neuron, a single neuron. You can't really do much with just one neuron, so what you would do is combine these neurons in a certain topology. In this case, we have a network with seven neurons organized in three different layers, and you can think of that as really one big neuron with 12 inputs and one output. So for example, in the case of the chair that was previously mentioned, if you're trying to identify whether a specific image has a chair in it or not, these 12 inputs could be some sub-images, or some small areas of the initial image, that you feed into the network, and the output could be a yes or a no: whether the image has a chair or doesn't have a chair. And that is really the concept behind convolutional neural networks, which we'll go into in detail in a bit.
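As a rough sketch of stacking single neurons into layers, here is a tiny fully connected network in NumPy. The layer sizes loosely mirror the slide's example of seven neurons in three layers with 12 inputs and one output, but the weights are random placeholders, not anything from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of all inputs, then an activation."""
    return sigmoid(weights @ inputs + biases)

rng = np.random.default_rng(0)
x = rng.random(12)                        # 12 inputs (e.g., 12 small image patches)

# Three layers of 4, 2, and 1 neurons (7 total); weight shapes are (out, in).
w1, b1 = rng.normal(size=(4, 12)), np.zeros(4)
w2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
w3, b3 = rng.normal(size=(1, 2)), np.zeros(1)

h1 = layer(x, w1, b1)
h2 = layer(h1, w2, b2)
out = layer(h2, w3, b3)                   # single output: a "chair" vs "not chair" score
print(out)
```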
So what each neuron would be doing in that case is really just performing a dot product, and if you aggregate that with the dot products computed by each of the other neurons, you obtain a convolution. So what we have here is three inputs. If the input, in this case, is an image or a sub-image, then the inputs would be pixels. The weights you would be using here would be the filter weights, that is, the filter that you use in the convolution. And then the sum here would be the dot product of the weights and the inputs, and that sum would be computed by a specific neuron in your network. That would be the convolution step, and that convolution step happens at the first layer in the network. You would be applying this to the input, but you also would be applying this at the second layer and the third layer; in this case, we're only showing what happens in the first layer.

The next step after the convolution is the activation step. There's a function that is applied to the dot product computed here, and that function produces the output of the neuron. And this is where the activation layer is; you also have another activation layer here, and then a final one here.

What we just went through are convolutions and activations, but this is not the only thing that actually happens in a neural net. There is also a step called subsampling, which we will be talking about shortly. For now, we will dig deeper into the activation, and specifically, which activation functions to use. In this case, we can see that that's a neuron, and what the neuron is doing here is the weighted sum that we talked about, or the dot product, and the output from this is fed into a certain activation function. Common activation functions are the sigmoid, tanh, and the rectified linear unit, and we will go through each one independently. So here we can see the sigmoid activation function. What this function essentially does is map any input to an output in the range of zero to one, and it's defined as one divided by one plus e to the minus x. The other common activation function is tanh, and that maps any input to an output between minus one and one. And then finally, there's the rectified linear unit, which maps an input to itself if it's positive, or to zero if it's negative.

Now, in theory, you could use any function as an activation function in your network, but that's not what you want to do in practice. You want your activation functions to be non-linear for one main reason: the goal of the activation function is actually to introduce non-linearity into your system. If all your activation functions were linear, then you would essentially have a linear system, which prevents you from achieving the level of complexity that you would ideally want to achieve with a neural network. And there's a formal proof as to why you need non-linear activation functions. They don't all need to be non-linear, but you need to have at least a few non-linear activation functions in your network. The proof is available in the appendix, along with the link to the paper that contains it.
So after we've discussed what happens at the activation layer, now we want to talk about the convolution layer. As I said earlier, an image is obviously two-dimensional, but we're using RGB images, so we actually need three channels. What this means is that an image is actually three-dimensional, and each 2D matrix represents one channel: one corresponding to R, one corresponding to G, and one corresponding to B. So a 32 by 32 image would essentially be represented by a 32 by 32 by 3 matrix, as can be seen here.

So what happens at the convolution layer? Here we have a nice animation that shows what is happening at each convolutional layer. Assume we have a 5 by 5 by 3 filter. What this would essentially be doing is covering a certain patch of the original image, which is 32 by 32 by 3. So what you can see here is that for that 5 by 5 by 3 patch in the original image, we have a neuron that is performing a dot product on all the pixels in that specific patch. So what is happening here is that the pixel values, which in this case are 5 by 5 by 3 pixels, are being multiplied by the filter values, and this operation is being performed here. Then, after that dot product is performed, it's fed into an activation function, as can be seen here, and this produces the output of this neuron.

Now, this is what a single neuron is doing: it's just covering that one area of the original image. What you would have in a neural net is many neurons, each covering a certain area of the original image, and if you aggregate the outputs of all of these neurons, what you would be performing is, essentially, a convolution on the original image.

And to formalize what happens here, or what output is being produced by that operation, we can look at it from a more mathematical perspective.
So if you have an input of size H1 by W1 by D1, and you're performing a convolution with a filter, then the output width W2 is related to W1 by the following formula: W2 equals W1 minus the filter width, plus one. The same formula applies for the height, and the depth of the output from a single filter is just one, because we're using a filter that has the same depth, three, as the original image.

So what this would produce in aggregate: if you have 28 by 28 by 1 neurons, each one performing a dot product on some patch of pixels in the original image, the output would be an activation map of size 28 by 28 by 1, and the output of each neuron would be one pixel in the activation map.

Now, if we go back to the points we made earlier, one thing we said was that the reason you use a neural network is because you don't know exactly what features you want to extract, and you don't actually have specific filters that you want to apply to the image. So ideally, what you want to do is have multiple filters being applied to the input image, and perform multiple convolutions, and this is what you can do with multiple neuron layers. What we described before was just one neuron layer. In this case, we can assume we have five different neuron layers, each one performing a different convolution on the original image. So we would have 28 by 28 by 1 neurons per layer, and if we aggregate all these neurons together, we need to multiply by five, and that would be the total number of neurons we have in that specific network.

So this actually leaves us with a pretty complicated system. It has many parameters: the neurons have weights, and the number of neurons is itself a parameter. So how do we actually formalize that? If we have an input volume of 32 by 32 by 3, which is our original image, and a filter size of 5 by 5 by 3, then the size of the activation map that would be produced is 28 by 28 by 1. In this case, we also said we have five different neuron layers that perform five different convolutions, so the total number of neurons is 28 by 28 by 5, and the weights per neuron are 5 by 5 by 3, which is 75. Here we're assuming that the neurons independently keep track of their own weights; this could be simplified to each layer sharing one set of weights, which would tremendously reduce the number of parameters. But in this case, just to get an upper bound, this leaves us with a total of 294,000 parameters. And this is just using a 32 by 32 by 3 image, which you can think of as a pretty small image. So if you have a bigger image, you will have many more parameters.
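Here is a short sketch that just reproduces the arithmetic above: the output size of a "valid" convolution and the unshared-weight parameter count for the 32 by 32 by 3 example.

```python
def conv_output_size(w1, h1, filter_w, filter_h):
    """'Valid' convolution: W2 = W1 - filter width + 1 (and the same for height)."""
    return w1 - filter_w + 1, h1 - filter_h + 1

w2, h2 = conv_output_size(32, 32, 5, 5)
num_filters = 5                       # five "neuron layers" (filters) in the example
weights_per_neuron = 5 * 5 * 3        # each neuron sees a 5x5x3 patch -> 75 weights
num_neurons = w2 * h2 * num_filters   # 28 * 28 * 5

# Upper bound where every neuron keeps its own weights, as in the lecture.
print(w2, h2)                              # 28 28
print(num_neurons * weights_per_neuron)    # 294000
```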
Great. So what we just saw and described were convolutions and activations, and these steps happen sequentially in a convolutional neural network, specifically as can be seen here. One step that also happens occasionally is subsampling, and we'll discuss that step in detail here.

There are two main reasons why you would actually subsample your input. One is obviously to reduce the size of your input and your feature space, but the other is that you want to keep the most important information and get rid of everything else that you don't think is going to be relevant to your classification. The common methods used in subsampling are either max pooling or average pooling; we will describe max pooling here.

So what happens in max pooling is, essentially, you divide the image into different non-overlapping sub-images, and you perform a max operation on each. So in this case, if we consider two by two filters, we would split the image, which in this case is four by four, into four sub-images, and for each two by two square, we would take the maximum. In this case, for the first square it would be six, then eight, then three, then four.

And the reason that actually works is because what you want to do is really keep track of the highest response produced by your neurons. In this case, for example, the highest response in the first square is six, and getting that high a response means that something has been detected in that part of the image, and this is something you want to keep track of as you move forward in your network. And although this moves around the location of pixels, because you can think of it as subsampling the image, it does keep the information you care about: at this point you only care about the fact that something has been detected in the image, not exactly where it's located, and you want to keep track of all the features your neurons have detected in order to eventually classify the input correctly.

So if you have multiple feature maps, in this case 224 by 224 by 64, what your subsampling operation does is reduce the height and the width, while the depth remains unchanged. So in this case, you would go from 224 by 224 by 64 to 112 by 112 by 64, which reduces your output size by a factor of four. And formally, if you have an input of size H1 by W1 by D1, the size of your output is related to your input in the following way: W2 is W1 minus the pool width, divided by the stride, plus one, so with non-overlapping two by two pooling, the width is halved. The same applies for H2, and the depth remains unchanged.
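A minimal NumPy sketch of two by two, non-overlapping max pooling on a four by four example; the matrix values are invented for illustration, chosen so the four block maxima come out to 6, 8, 3, and 4 as in the slide.

```python
import numpy as np

def max_pool(x, size=2):
    """Non-overlapping max pooling: split x into size-by-size blocks, keep each block's max."""
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A 4x4 toy feature map whose 2x2 block maxima are 6, 8, 3, 4.
x = np.array([[1, 6, 2, 8],
              [5, 3, 7, 1],
              [3, 2, 1, 0],
              [1, 2, 4, 3]], dtype=float)
print(max_pool(x))   # [[6. 8.]
                     #  [3. 4.]]
```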
So these are, essentially, the steps that happen in a convolutional neural network, and what you could be doing is repeating these steps a certain number of times in your network. But eventually, you have to make a classification and decide, in our case, whether our image has a chair or doesn't have a chair. So how does that happen?
So after you perform all these steps, there's a step that happens here that allows you to make that prediction, and that step is usually called a fully connected layer, or a multi-layer perceptron. What this essentially is is a set of layers that are very similar to, or exactly the same as, what you had before, except that every neuron in the layer is connected to all the neurons in the previous layer. What this allows you to do is consider everything you currently have about your input, or everything that's left of your input, and compute a dot product on all of it, rather than focusing on a subsample of your input like the previous layers do.

In this case, if you're actually trying to classify your input into four classes, you would ideally have four different neurons in your output layer, each one corresponding to one of your classes, and you would perform the same operation as in a previous layer and compute the dot product. Once you obtain the values at every output neuron, you perform a normalization operation on the outputs. This normalization operation is called softmax, or the normalized exponential, and what it does is put more weight on the highest value. By computing the softmax at the output, you're able to compute the posterior probabilities, which allows you to make a more informed, or basically to make a classification decision on your input.
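Here is a small NumPy sketch of the softmax (normalized exponential) over four hypothetical output-neuron values; the scores are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Normalized exponential: exponentiate, then divide by the sum, so the
    outputs are positive, sum to one, and the largest score gets most of the weight."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])  # raw outputs of four class neurons
probs = softmax(scores)
print(probs, probs.sum())                 # class "probabilities", summing to 1
```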
Great. So that's everything, and the next step will be talking about backpropagation.

ISHWARYA ANANTHABHOTLA: OK. So now that Henry has given us an overview of the entire architecture of a CNN, I'm going to quickly spend some time and talk about standard preprocessing tricks and tips that people might use on an image dataset before they actually feed it through a neural net to classify the images.

So let's suppose we have a dataset X, and there are N data points in the dataset, and each point has dimension D, so there are D features per point. In this example, we use these graphs as an illustration: our original data here has just two dimensions, and it spans this range of values. So for example, if we want to center this data, what we would do is mean subtraction. We basically subtract the mean of each feature across all the points, which centers the data, and you can see that transformation here. Then we might also normalize each dimension, so that the data points span the same range of values in both dimensions; you can see that transformation, and how it's taken place, here. We just divide by the standard deviation to do this.

Something else that's very commonly done is called PCA, or principal component analysis. The idea here is that sometimes we have a dataset with a very, very high dimensionality, and we would like to reduce that dimensionality. So basically, our goal is to project the higher-dimensional space onto a lower-dimensional space by keeping a subset of the principal directions. And if you've seen a little bit of 18.06, linear algebra, the way we do this is by generating a covariance matrix and then doing the singular value decomposition. I'll gloss over the math for now, but that's the idea. And you can see here how the original data spanned two dimensions; we decorrelate it so that it can be reduced to a single dimension. And even with this data, you might want to ensure that it's whitened, which is the same deal: you want the values to span the same range in every dimension. So then you would just divide by your eigenvalues to get the whitened data.
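Here is a rough NumPy sketch of those preprocessing steps (mean subtraction, normalization, PCA decorrelation, and whitening) on a random placeholder dataset; the shapes and the small epsilon constant are assumptions, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 1.0], [0.0, 0.5]])  # N=500 points, D=2 features

# Mean subtraction: center every feature at zero.
X -= X.mean(axis=0)

# Normalization: make every feature span a comparable range.
X_norm = X / X.std(axis=0)

# PCA: covariance matrix, then SVD, then rotate the data onto the principal axes.
cov = (X.T @ X) / X.shape[0]
U, S, _ = np.linalg.svd(cov)
X_decorrelated = X @ U                    # decorrelated; keep only the top columns to reduce dimension

# Whitening: scale each principal component by the square root of its eigenvalue.
X_white = X_decorrelated / np.sqrt(S + 1e-8)
print(X_white.std(axis=0))                # roughly 1 in every direction
```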
This last bit is something that's very commonly done as a preprocessing trick, though people aren't entirely sure why it works so well, or that it really does help, but it's something that people do, and it's called data augmentation. Basically, if I have a dataset that contains a bunch of images of chairs, a bunch of images of tables, and then a bunch of images of, say, trees, I might want to intentionally augment that dataset further by introducing a few variations on these same images. So I might take the chair image, rotate some copies, reflect a few more, scale, crop, or remap the color space, or just have a process that does this randomly to create more variation in the same dataset.

And this is a good illustration of why this makes a difference. I've taken an image here of what looks like a waterfall, or some spot of nature, and simply inverted the colors. If I were to see just this image alone, it maybe looks like a curtain, or a bit of texture, or something. The idea is that even to human perception, these two images have very different meanings, and so it's interesting to see what effect they would have on a neural network.
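Here is a minimal sketch of that kind of random augmentation using NumPy array operations on an image already loaded as an array; the specific transforms and probabilities are illustrative choices, not the lecture's recipe.

```python
import numpy as np

def augment(image, rng):
    """Randomly flip, rotate by a multiple of 90 degrees, and occasionally invert colors.

    `image` is an (H, W, 3) array with values in [0, 255]; this is just a small
    subset of the variations mentioned (reflection, rotation, color-space changes).
    """
    if rng.random() < 0.5:
        image = image[:, ::-1, :]               # horizontal reflection
    image = np.rot90(image, k=rng.integers(4))  # rotate 0/90/180/270 degrees
    if rng.random() < 0.2:
        image = 255 - image                     # color inversion, like the waterfall example
    return image

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64, 3))        # placeholder "image"
augmented = [augment(original, rng) for _ in range(10)]  # ten extra variants
```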
754 00:33:20,480 --> 00:33:24,150 So this just goes to show that these convolutional neural 755 00:33:24,150 --> 00:33:26,646 networks are the state of the art when it comes to image 756 00:33:26,646 --> 00:33:30,740 classification, and that's why we're currently 757 00:33:30,740 --> 00:33:31,770 focusing on that. 758 00:33:31,770 --> 00:33:33,820 But you might be wondering what the ImageNet 759 00:33:33,820 --> 00:33:37,340 competition exactly looks like, what the images look like. 760 00:33:37,340 --> 00:33:39,410 So why don't we take a look at that. 761 00:33:39,410 --> 00:33:41,360 As you can see here, these are images 762 00:33:41,360 --> 00:33:43,310 from the ImageNet competition. 763 00:33:43,310 --> 00:33:45,830 Underneath each image is a bold caption, which 764 00:33:45,830 --> 00:33:47,390 is considered to be the ground truth, 765 00:33:47,390 --> 00:33:50,950 or what the competition believes to be 766 00:33:50,950 --> 00:33:55,190 the correct classification of the image. 767 00:33:55,190 --> 00:33:56,850 Underneath that ground truth, you 768 00:33:56,850 --> 00:33:58,630 see a list of five different labels. 769 00:33:58,630 --> 00:34:00,770 Now, these five labels are produced 770 00:34:00,770 --> 00:34:03,120 by a convolutional neural network, 771 00:34:03,120 --> 00:34:05,390 and the different bars-- 772 00:34:05,390 --> 00:34:08,600 the bars of different lengths, some pink and others blue, 773 00:34:08,600 --> 00:34:11,179 represent how confident the CNN is 774 00:34:11,179 --> 00:34:14,989 that what it sees in that image is that specific label. 775 00:34:14,989 --> 00:34:16,850 As you can see in certain examples, 776 00:34:16,850 --> 00:34:21,469 the CNN is pretty confident that it has the correct answer. 777 00:34:21,469 --> 00:34:23,750 For example, when we look at the container ship, 778 00:34:23,750 --> 00:34:26,540 it's pretty confident that what's in that image is exactly 779 00:34:26,540 --> 00:34:27,650 a container ship. 780 00:34:27,650 --> 00:34:30,350 There are certain cases when it doesn't get the correct label 781 00:34:30,350 --> 00:34:34,070 on its first try, but it does have it in its top five labels. 782 00:34:34,070 --> 00:34:36,010 For example, you can see grill and mushroom. 783 00:34:36,010 --> 00:34:38,870 Now, the funny thing about the mushroom image 784 00:34:38,870 --> 00:34:41,750 is that what it thinks the image should 785 00:34:41,750 --> 00:34:43,658 be classified as is agaric. 786 00:34:43,658 --> 00:34:46,199 And if you don't know, agaric is actually a type of mushroom, 787 00:34:46,199 --> 00:34:48,530 and in fact, it's the mushroom in that image. 788 00:34:48,530 --> 00:34:50,510 And it makes sense that their confidence levels 789 00:34:50,510 --> 00:34:51,590 are pretty much the same. 790 00:34:51,590 --> 00:34:55,909 Agaric is slightly-- it's slightly more specific, saying 791 00:34:55,909 --> 00:34:57,770 that what it sees in the image is agaric. 792 00:35:00,800 --> 00:35:03,140 But there are certain cases when the CNN 793 00:35:03,140 --> 00:35:05,870 fails to classify the image correctly 794 00:35:05,870 --> 00:35:07,200 in its top five labels. 795 00:35:07,200 --> 00:35:09,320 This will be registered as a top five error, 796 00:35:09,320 --> 00:35:12,810 as you just saw in the previous slide about the top five error rate. 797 00:35:12,810 --> 00:35:15,560 One example here on this slide is cherry.
798 00:35:15,560 --> 00:35:18,140 Now, the ImageNet competition believed 799 00:35:18,140 --> 00:35:19,902 that this should be classified correctly 800 00:35:19,902 --> 00:35:21,360 as cherry, even though there's also 801 00:35:21,360 --> 00:35:22,850 a Dalmatian in the background. 802 00:35:22,850 --> 00:35:25,940 The CNN, on the other hand, is pretty confident 803 00:35:25,940 --> 00:35:29,410 that what it sees in this image is the Dalmatian. 804 00:35:29,410 --> 00:35:31,550 But if you look at some of the other results 805 00:35:31,550 --> 00:35:34,640 within the top five, although it doesn't guess cherry at all, 806 00:35:34,640 --> 00:35:36,380 it does guess certain fruits that it 807 00:35:36,380 --> 00:35:38,780 may think look sort of like cherries, 808 00:35:38,780 --> 00:35:40,820 like grape or elderberry. 809 00:35:40,820 --> 00:35:44,510 So the CNN does actually pick up on two 810 00:35:44,510 --> 00:35:47,000 different distinct objects within the image, 811 00:35:47,000 --> 00:35:50,210 but as a result of how it's built, or its training set, 812 00:35:50,210 --> 00:35:52,610 it ends up classifying it as a Dalmatian. 813 00:35:52,610 --> 00:35:54,720 But it goes to show you that CNNs could also 814 00:35:54,720 --> 00:35:56,775 be used not just for image classification, 815 00:35:56,775 --> 00:35:59,987 but also for object detection, which we do not touch 816 00:35:59,987 --> 00:36:02,910 on in this lecture at all. 817 00:36:02,910 --> 00:36:05,460 So I'm not going to go further into that. 818 00:36:05,460 --> 00:36:08,080 Now, this is all fun and all, but what about some real world 819 00:36:08,080 --> 00:36:10,040 applications? 820 00:36:10,040 --> 00:36:13,220 So this is a study that they did at Google with Google Street 821 00:36:13,220 --> 00:36:18,670 View house numbers, where they used a CNN to classify 822 00:36:18,670 --> 00:36:21,637 photographic images of house numbers, as you can see here 823 00:36:21,637 --> 00:36:23,470 from certain examples of these house numbers-- 824 00:36:23,470 --> 00:36:24,500 what they look like. 825 00:36:24,500 --> 00:36:27,100 So what the CNN was tasked with doing 826 00:36:27,100 --> 00:36:31,600 was that it was supposed to recognize the individual digits 827 00:36:31,600 --> 00:36:34,630 within the image, and then understand that it's not 828 00:36:34,630 --> 00:36:36,910 just one digit that it's looking at, 829 00:36:36,910 --> 00:36:38,410 but it's actually a string of digits 830 00:36:38,410 --> 00:36:42,720 connected, and successfully classify it as the correct house 831 00:36:42,720 --> 00:36:44,780 number. 832 00:36:44,780 --> 00:36:47,430 This can be quite challenging, even for humans sometimes, 833 00:36:47,430 --> 00:36:48,930 when the image is quite blurry. 834 00:36:48,930 --> 00:36:54,420 You might not know exactly what the house number is, 835 00:36:54,420 --> 00:36:56,890 but they managed to get the convolutional neural network 836 00:36:56,890 --> 00:37:00,410 to operate around human operator levels. 837 00:37:00,410 --> 00:37:04,420 So that corresponds to around 96% to 97% accuracy, 838 00:37:04,420 --> 00:37:06,270 and what that enables Google to do 839 00:37:06,270 --> 00:37:08,950 is that they can deploy the CNN such 840 00:37:08,950 --> 00:37:12,790 that the CNN automatically extracts the house 841 00:37:12,790 --> 00:37:19,180 numbers from the images online, and uses that to geocode 842 00:37:19,180 --> 00:37:20,516 these addresses.
843 00:37:20,516 --> 00:37:24,190 And it's gotten to a point where the CNN is successfully 844 00:37:24,190 --> 00:37:27,520 able to do this process in less than an hour 845 00:37:27,520 --> 00:37:31,390 for all of the street view house numbers in all of France. 846 00:37:31,390 --> 00:37:35,930 Now, you might be asking where this could be useful. 847 00:37:35,930 --> 00:37:38,560 If you don't have access to a lot of resources 848 00:37:38,560 --> 00:37:40,750 to actually do this geocoding process, where 849 00:37:40,750 --> 00:37:45,670 you match latitude and longitude to street addresses, 850 00:37:45,670 --> 00:37:48,680 then your only resource might actually be photographic images. 851 00:37:48,680 --> 00:37:51,160 So you actually need something, hopefully not human, 852 00:37:51,160 --> 00:37:54,030 but some sort of software that can do this successfully. 853 00:37:54,030 --> 00:37:56,200 And so this is, for example, a place 854 00:37:56,200 --> 00:37:58,000 in South Africa, a bird's eye view. 855 00:37:58,000 --> 00:38:00,670 Not sure if you can exactly see, but there 856 00:38:00,670 --> 00:38:03,459 are these small numbers on top of each of the houses. 857 00:38:03,459 --> 00:38:05,500 All of these numbers were extracted and correctly 858 00:38:05,500 --> 00:38:09,930 classified using this previously seen CNN. 859 00:38:09,930 --> 00:38:15,620 Another example from robotics is recognizing hand gestures. 860 00:38:15,620 --> 00:38:19,120 So obviously, robots come equipped with a lot 861 00:38:19,120 --> 00:38:20,680 of different hardware. 862 00:38:20,680 --> 00:38:22,660 They can sense sounds, and they can also 863 00:38:22,660 --> 00:38:25,330 capture images of their surroundings. 864 00:38:25,330 --> 00:38:28,812 And if you're able to classify what you see-- if the robot is 865 00:38:28,812 --> 00:38:30,400 able to classify what it sees, then it 866 00:38:30,400 --> 00:38:33,280 can actually act upon it, and take certain actions. 867 00:38:33,280 --> 00:38:35,785 That's why it becomes really helpful to successfully 868 00:38:35,785 --> 00:38:37,640 classify the images. 869 00:38:37,640 --> 00:38:40,810 So this is what they did using hand gestures, 870 00:38:40,810 --> 00:38:43,810 where there were five different classes. 871 00:38:43,810 --> 00:38:46,460 Each class corresponds to the number of extended fingers. 872 00:38:46,460 --> 00:38:49,790 So a, b, c, d, the top row, correspond to the same class. 873 00:38:49,790 --> 00:38:51,640 They all have two fingers sticking out. 874 00:38:51,640 --> 00:38:53,620 The bottom row has three fingers sticking out. 875 00:38:53,620 --> 00:38:56,230 So that's another class. 876 00:38:56,230 --> 00:39:00,250 And they got the error rate down all the way to 3%. 877 00:39:00,250 --> 00:39:03,370 So 97% of the time, the convolutional neural net 878 00:39:03,370 --> 00:39:05,650 correctly classified the hand gesture. 879 00:39:05,650 --> 00:39:08,230 And you can use these hand gestures then 880 00:39:08,230 --> 00:39:10,990 to give certain commands to a robot, 881 00:39:10,990 --> 00:39:15,100 and you can train the CNN to act upon something 882 00:39:15,100 --> 00:39:16,360 else besides hand gestures.
883 00:39:16,360 --> 00:39:18,276 For example, if it's in some sort of terrain 884 00:39:18,276 --> 00:39:20,850 and you train it on certain images 885 00:39:20,850 --> 00:39:23,740 that you might find in nature, then it 886 00:39:23,740 --> 00:39:27,490 can take those classifications, and act upon them once it sees, 887 00:39:27,490 --> 00:39:32,230 for example, a tree, or some sort of body of water. 888 00:39:32,230 --> 00:39:35,080 It's all thanks to image classification. 889 00:39:35,080 --> 00:39:38,050 Now, obviously, gestures are not necessarily static. 890 00:39:38,050 --> 00:39:40,390 You could be waving your hand, and so that would 891 00:39:40,390 --> 00:39:42,730 require a temporal component. 892 00:39:42,730 --> 00:39:45,660 So it's not just an image you're looking at, but a video. 893 00:39:45,660 --> 00:39:50,380 And so it follows that we can probably 894 00:39:50,380 --> 00:39:53,050 extend image classification into video classification. 895 00:39:53,050 --> 00:39:57,580 After all, videos are just images with an added component, 896 00:39:57,580 --> 00:39:59,920 specifically time. 897 00:39:59,920 --> 00:40:01,930 Obviously, the added temporal component 898 00:40:01,930 --> 00:40:04,520 comes with a lot of additional complexity. 899 00:40:04,520 --> 00:40:07,517 So we're not going to dive into any of that, but in the end, 900 00:40:07,517 --> 00:40:08,850 it comes down to the same thing. 901 00:40:08,850 --> 00:40:10,900 You extract features from the videos, 902 00:40:10,900 --> 00:40:12,520 and you attempt to classify them using 903 00:40:12,520 --> 00:40:14,180 convolutional neural nets. 904 00:40:14,180 --> 00:40:17,190 So why don't we look at a study done, again, 905 00:40:17,190 --> 00:40:21,390 at Google, where they extracted one million videos 906 00:40:21,390 --> 00:40:26,980 from YouTube, sports videos, with somewhere between 400 907 00:40:26,980 --> 00:40:32,630 and 500 different classes, and they used CNNs to attempt 908 00:40:32,630 --> 00:40:34,360 to classify these videos. 909 00:40:34,360 --> 00:40:36,550 Now, they used different approaches-- 910 00:40:36,550 --> 00:40:38,590 different approaches, different tests, 911 00:40:38,590 --> 00:40:43,360 different types of CNNs that I'm not going to go into. 912 00:40:43,360 --> 00:40:45,795 But as you can see here, these are 913 00:40:45,795 --> 00:40:50,260 certain stills from these videos where the caption highlighted 914 00:40:50,260 --> 00:40:53,770 in blue is what the correct answer should be, 915 00:40:53,770 --> 00:40:56,740 and underneath it, the top five labels 916 00:40:56,740 --> 00:40:59,250 that the convolutional neural network produces. 917 00:40:59,250 --> 00:41:01,990 The one highlighted in green is supposed 918 00:41:01,990 --> 00:41:03,420 to be the correct answer. 919 00:41:03,420 --> 00:41:09,180 So you can see on all of these, it gets it within the top five, 920 00:41:09,180 --> 00:41:11,757 and for the most part, within the top two, 921 00:41:11,757 --> 00:41:14,090 and it's pretty confident when it does get it correctly.
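Since the "top five" error rate keeps coming up in these results, here is a small NumPy sketch of how that metric is typically computed from a classifier's per-class scores. The arrays below are random placeholders, purely to illustrate the calculation.

import numpy as np

# scores: (N, C) class scores for N examples and C classes; labels: (N,) true class indices.
# Random placeholder data, purely for illustration.
scores = np.random.rand(8, 1000)
labels = np.random.randint(0, 1000, size=8)

# For each example, take the indices of the five highest-scoring classes.
top5 = np.argsort(scores, axis=1)[:, -5:]

# An example counts as correct if its true label appears anywhere in those five.
correct = np.any(top5 == labels[:, None], axis=1)

top5_error = 1.0 - correct.mean()
print(f"top five error rate: {top5_error:.2f}")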
922 00:41:14,090 --> 00:41:16,673 Now, when I said that they used different types of classifiers, 923 00:41:16,673 --> 00:41:19,880 some of them were more stacked classifiers, 924 00:41:19,880 --> 00:41:22,970 where they were just trained on stills from these videos, 925 00:41:22,970 --> 00:41:26,120 while others were what they called fusion ones, where they 926 00:41:26,120 --> 00:41:28,865 sort of add the temporal component by fusing 927 00:41:28,865 --> 00:41:34,250 different stills from these videos together. 928 00:41:34,250 --> 00:41:36,680 Now, the current accuracy rate-- 929 00:41:36,680 --> 00:41:38,810 the best one they've achieved so far-- 930 00:41:38,810 --> 00:41:43,550 has been around 80% accuracy within the top five labels. 931 00:41:43,550 --> 00:41:46,190 Now, 80% accuracy is nowhere near what 932 00:41:46,190 --> 00:41:48,020 we saw with the ImageNet classification, 933 00:41:48,020 --> 00:41:50,690 where in 2015, they had managed to get it up 934 00:41:50,690 --> 00:41:56,350 to 98% or 99% accuracy. 935 00:41:56,350 --> 00:42:01,520 But obviously, there's way more complexity involved in this. 936 00:42:01,520 --> 00:42:04,520 So it makes sense that it's not quite there yet. 937 00:42:04,520 --> 00:42:08,090 But it does provide a good benchmark, and something 938 00:42:08,090 --> 00:42:11,310 to improve upon in the future as well. 939 00:42:11,310 --> 00:42:15,740 Now, that being said, convolutional neural networks 940 00:42:15,740 --> 00:42:17,450 do come with certain limitations. 941 00:42:17,450 --> 00:42:19,190 They're not perfect. 942 00:42:19,190 --> 00:42:25,020 And so Julian will now talk about the limitations. 943 00:42:25,020 --> 00:42:26,750 JULIAN BROWN: Thanks, Ali. 944 00:42:26,750 --> 00:42:30,210 So Ali talked about the ImageNet competition, 945 00:42:30,210 --> 00:42:32,540 and talked about how the recent winners have 946 00:42:32,540 --> 00:42:35,180 been convolutional neural nets. 947 00:42:35,180 --> 00:42:40,604 So before, the best was about a 26% top five error rate, 948 00:42:40,604 --> 00:42:42,020 but now they've actually gotten it 949 00:42:42,020 --> 00:42:45,330 down to a 4.9% top five error rate, and that 950 00:42:45,330 --> 00:42:47,920 was-- the winner of that competition, the 2015 one, 951 00:42:47,920 --> 00:42:50,452 was actually Microsoft. 952 00:42:50,452 --> 00:42:52,785 They've got the current state of the art implementation, 953 00:42:52,785 --> 00:42:54,890 and so because it's the ImageNet competition, 954 00:42:54,890 --> 00:42:58,070 that means they can identify exactly 1,000 955 00:42:58,070 --> 00:43:00,770 different categories of images. 956 00:43:00,770 --> 00:43:03,050 So there are a few problems, actually, 957 00:43:03,050 --> 00:43:05,240 with the implementation, or just in general 958 00:43:05,240 --> 00:43:07,300 with convolutional neural nets. 959 00:43:07,300 --> 00:43:09,710 So one of them is that 1,000 categories, well, it 960 00:43:09,710 --> 00:43:10,700 may seem like a lot-- 961 00:43:10,700 --> 00:43:14,012 ImageNet is actually one of the largest competitions-- 962 00:43:14,012 --> 00:43:15,720 but that's not actually that many categories. 963 00:43:15,720 --> 00:43:18,500 So it doesn't contain things like hot air balloons, 964 00:43:18,500 --> 00:43:19,330 for instance. 965 00:43:19,330 --> 00:43:23,180 So these things that children would be able to classify, 966 00:43:23,180 --> 00:43:25,400 the neural nets actually aren't able to, even 967 00:43:25,400 --> 00:43:27,810 in the biggest competition.
968 00:43:27,810 --> 00:43:29,570 And each of these categories also 969 00:43:29,570 --> 00:43:31,940 requires thousands of training images, 970 00:43:31,940 --> 00:43:33,860 whereas you could show a child a couple of 971 00:43:33,860 --> 00:43:36,560 examples of a dog or a cat, and they'd 972 00:43:36,560 --> 00:43:39,740 be able to, generally, get a feel for what a dog or a cat 973 00:43:39,740 --> 00:43:41,000 looks like. 974 00:43:41,000 --> 00:43:43,980 It takes thousands of images per category for the neural 975 00:43:43,980 --> 00:43:46,850 nets to learn, which means that the total number of images 976 00:43:46,850 --> 00:43:49,370 you need to train for the ImageNet competition 977 00:43:49,370 --> 00:43:51,510 is over a million. 978 00:43:51,510 --> 00:43:54,200 And so this leads to very long training times. 979 00:43:54,200 --> 00:43:56,360 Even with all of the heavy optimizations 980 00:43:56,360 --> 00:43:59,300 that Ishwari was telling us about, like how 981 00:43:59,300 --> 00:44:02,480 efficient convolution is, it still 982 00:44:02,480 --> 00:44:07,970 takes weeks to train on multiple parallel GPUs working together 983 00:44:07,970 --> 00:44:10,110 to train the net. 984 00:44:10,110 --> 00:44:12,260 There's actually a more fundamental problem 985 00:44:12,260 --> 00:44:14,500 with neural nets as well. 986 00:44:14,500 --> 00:44:18,410 So here on the left, we have a school bus, some kind of bird, 987 00:44:18,410 --> 00:44:20,174 and an Indian temple. 988 00:44:20,174 --> 00:44:21,840 And all of these images on the left side 989 00:44:21,840 --> 00:44:23,350 are actually correctly identified 990 00:44:23,350 --> 00:44:25,670 by convolutional neural nets. 991 00:44:25,670 --> 00:44:27,980 But when we add this small distortion here 992 00:44:27,980 --> 00:44:32,290 in the middle, which doesn't change any of the images 993 00:44:32,290 --> 00:44:36,230 perceptibly to a human, this actually 994 00:44:36,230 --> 00:44:39,660 causes the neural network to misclassify these images, 995 00:44:39,660 --> 00:44:43,170 and now all three of them are ostriches. 996 00:44:43,170 --> 00:44:44,330 So that's a little weird. 997 00:44:44,330 --> 00:44:46,020 How does this work? 998 00:44:46,020 --> 00:44:47,960 How did we find those distortions? 999 00:44:47,960 --> 00:44:50,750 So here on the left side, we see how a neural network typically 1000 00:44:50,750 --> 00:44:51,250 works. 1001 00:44:51,250 --> 00:44:53,690 You start with some images, you put them 1002 00:44:53,690 --> 00:44:55,970 through the different layers of the neural network, 1003 00:44:55,970 --> 00:44:58,630 and then it tells you a certain probability 1004 00:44:58,630 --> 00:45:00,620 that it is a guitar, or a penguin. 1005 00:45:00,620 --> 00:45:04,160 So it classifies it, and so we can 1006 00:45:04,160 --> 00:45:09,010 use a modification of that method 1007 00:45:09,010 --> 00:45:11,120 by applying an evolutionary algorithm, 1008 00:45:11,120 --> 00:45:13,700 or a hill-climbing or gradient ascent algorithm. 1009 00:45:13,700 --> 00:45:16,840 We take a couple of images, and we put them through, 1010 00:45:16,840 --> 00:45:21,290 it classifies them, and we see what the classification is. 1011 00:45:21,290 --> 00:45:23,590 And then we can do some crossover between the images.
1012 00:45:23,590 --> 00:45:27,170 So we take the ones that look the most like what we're training 1013 00:45:27,170 --> 00:45:31,150 for-- guitars or penguins, in this case-- 1014 00:45:31,150 --> 00:45:33,190 and we take the features of those 1015 00:45:33,190 --> 00:45:35,692 that identify very strongly as a guitar, 1016 00:45:35,692 --> 00:45:37,900 and we combine those together in the crossover phase. 1017 00:45:37,900 --> 00:45:39,910 This is for the evolutionary algorithm. 1018 00:45:39,910 --> 00:45:42,690 Then we mutate the images, which is making small changes 1019 00:45:42,690 --> 00:45:45,275 to each one, and then we re-evaluate 1020 00:45:45,275 --> 00:45:47,720 by plugging them back in through the neural network, 1021 00:45:47,720 --> 00:45:49,850 and only the best images, the ones 1022 00:45:49,850 --> 00:45:51,860 that looked the most like a guitar or penguin, 1023 00:45:51,860 --> 00:45:54,397 are then selected for the next iteration. 1024 00:45:54,397 --> 00:45:56,980 And this continues until you get to very high identification 1025 00:45:56,980 --> 00:45:59,595 rates, even higher than for actual images of the objects. 1026 00:46:01,935 --> 00:46:04,560 So using gradient ascent, these are some of the images that you 1027 00:46:04,560 --> 00:46:09,000 could produce if you start with just a flat grey image, 1028 00:46:09,000 --> 00:46:11,530 and then you run it through this algorithm. 1029 00:46:11,530 --> 00:46:13,890 So here on the side, we have a backpack, 1030 00:46:13,890 --> 00:46:16,290 and we can actually see the outline of what 1031 00:46:16,290 --> 00:46:18,200 looks like a backpack in there. 1032 00:46:18,200 --> 00:46:21,790 And over here, we have what looks like a Windsor tie 1033 00:46:21,790 --> 00:46:24,185 right here. And perhaps 1034 00:46:24,185 --> 00:46:25,810 there are objects in these other images too, 1035 00:46:25,810 --> 00:46:31,610 but they seem to be lost in the LSD trip of colors here. 1036 00:46:31,610 --> 00:46:32,966 So that's kind of strange. 1037 00:46:32,966 --> 00:46:34,590 That's definitely not how humans do it. 1038 00:46:34,590 --> 00:46:35,923 So let's try a different method. 1039 00:46:35,923 --> 00:46:37,580 What if instead of directly encoding, 1040 00:46:37,580 --> 00:46:40,200 which is where we change individual pixels, 1041 00:46:40,200 --> 00:46:42,710 we change patterns in the images, 1042 00:46:42,710 --> 00:46:44,340 like different shapes? 1043 00:46:44,340 --> 00:46:46,980 Then this is the kind of output that we get. 1044 00:46:46,980 --> 00:46:49,830 So in the upper left, we have a starfish. 1045 00:46:49,830 --> 00:46:52,526 So you can see that it has the orange 1046 00:46:52,526 --> 00:46:54,150 of the starfish, and also 1047 00:46:54,150 --> 00:46:56,355 the blue of the ocean environment 1048 00:46:56,355 --> 00:46:59,320 that typical images of starfish are taken in. 1049 00:46:59,320 --> 00:47:02,810 And you can also see that it has the points, the jagged lines, 1050 00:47:02,810 --> 00:47:06,194 the triangles that we associate with the arms of a starfish. 1051 00:47:06,194 --> 00:47:08,110 But the strange thing here is that they're not 1052 00:47:08,110 --> 00:47:10,500 arranged in a circular pattern. 1053 00:47:10,500 --> 00:47:12,840 They're not pointing outwards like this, 1054 00:47:12,840 --> 00:47:16,230 like we would expect of an actual starfish.
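To make the gradient ascent idea above concrete, here is a hedged sketch in TensorFlow of optimizing the pixels of a flat grey image to raise one class score of a pretrained classifier. The model choice, target class index, step size, and step count are all illustrative assumptions, not the setup used in the study being described.

import tensorflow as tf

# Start from a flat grey image and repeatedly nudge its pixels in the direction
# that increases one class score.
model = tf.keras.applications.MobileNetV2(weights="imagenet")  # any pretrained ImageNet classifier
target_class = 414   # hypothetical target index; pick whichever class you want to maximize

image = tf.Variable(tf.fill((1, 224, 224, 3), 0.5))  # flat grey start, pixel values in [0, 1]

for step in range(200):
    with tf.GradientTape() as tape:
        # Scale to the model's expected input range before classifying.
        preprocessed = tf.keras.applications.mobilenet_v2.preprocess_input(image * 255.0)
        score = model(preprocessed)[0, target_class]
    grad = tape.gradient(score, image)
    # A simple signed-gradient step; plain gradient ascent with a tuned step size works too.
    image.assign(tf.clip_by_value(image + 0.01 * tf.sign(grad), 0.0, 1.0))

# After enough steps, the model assigns a high score to `image` for the target class,
# even though the picture may look nothing like that object to a human.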
1055 00:47:16,230 --> 00:47:19,920 So clearly, it's not latching onto the same large scale 1056 00:47:19,920 --> 00:47:22,110 features that humans do. 1057 00:47:22,110 --> 00:47:25,470 It's actually just looking at the low-level features. 1058 00:47:25,470 --> 00:47:27,530 Even though it's a deep neural network, 1059 00:47:27,530 --> 00:47:32,040 it doesn't grab onto these abstract concepts 1060 00:47:32,040 --> 00:47:34,980 like a human would. 1061 00:47:34,980 --> 00:47:37,920 So the reason for this problem, or at least 1062 00:47:37,920 --> 00:47:39,420 why we think neural networks aren't 1063 00:47:39,420 --> 00:47:43,210 as good as humans at things like this, is the type of model. 1064 00:47:43,210 --> 00:47:45,210 So a human would have more of what's 1065 00:47:45,210 --> 00:47:48,150 called a generative model, which means if we have examples 1066 00:47:48,150 --> 00:47:51,990 here, these dark blue dots, say, of lions, 1067 00:47:51,990 --> 00:47:55,660 a few examples of images of lions, 1068 00:47:55,660 --> 00:47:58,080 then we could construct a probability distribution, 1069 00:47:58,080 --> 00:48:01,320 and say that images that fall somewhere in this region 1070 00:48:01,320 --> 00:48:02,770 are lions. 1071 00:48:02,770 --> 00:48:06,060 And over here, we have a few examples of giraffes, say, 1072 00:48:06,060 --> 00:48:08,660 and so anything that falls in this region would be a giraffe. 1073 00:48:08,660 --> 00:48:12,710 And so if you had a red triangle in here, that would be a lion. 1074 00:48:12,710 --> 00:48:14,790 But if the red triangle is instead over here, 1075 00:48:14,790 --> 00:48:17,220 it actually wouldn't classify at all. 1076 00:48:17,220 --> 00:48:18,570 We wouldn't know what that is. 1077 00:48:18,570 --> 00:48:21,370 We would say that's something other than a lion or a giraffe, 1078 00:48:21,370 --> 00:48:23,990 but neural networks don't work the same way. 1079 00:48:23,990 --> 00:48:25,480 They just draw a decision boundary. 1080 00:48:25,480 --> 00:48:29,505 They just draw lines between the different categories. 1081 00:48:29,505 --> 00:48:33,480 So they don't say that something really far away from the lion 1082 00:48:33,480 --> 00:48:37,080 class is necessarily not a lion. 1083 00:48:37,080 --> 00:48:40,420 It just depends how far away it is from the decision boundary. 1084 00:48:40,420 --> 00:48:42,770 So if we have the red triangle way over there, 1085 00:48:42,770 --> 00:48:45,030 it's very far away from giraffes, 1086 00:48:45,030 --> 00:48:47,480 and it's just generally closer to lions, 1087 00:48:47,480 --> 00:48:50,890 even though it isn't explicitly very close to them at all, 1088 00:48:50,890 --> 00:48:54,030 and it will still be identified as a lion. 1089 00:48:54,030 --> 00:48:57,200 So that's why we think we're able to fool 1090 00:48:57,200 --> 00:49:01,290 these neural networks in such a simplistic way or in such a 1091 00:49:01,290 --> 00:49:03,842 really abstract way. 1092 00:49:03,842 --> 00:49:05,715 So the main takeaways from our presentation, 1093 00:49:05,715 --> 00:49:09,980 and the salient points, are that deep learning 1094 00:49:09,980 --> 00:49:13,200 is a very powerful tool for image classification, 1095 00:49:13,200 --> 00:49:17,035 and it relies on multiple layers of a network-- 1096 00:49:17,035 --> 00:49:19,570 so multiple processing layers.
1097 00:49:19,570 --> 00:49:24,190 Also, CNNs outperform basically every other method 1098 00:49:24,190 --> 00:49:27,570 for classifying images, and that's their primary use 1099 00:49:27,570 --> 00:49:28,540 right now. 1100 00:49:28,540 --> 00:49:30,810 We're currently exploring other uses, 1101 00:49:30,810 --> 00:49:33,450 but that's generally where it's at, 1102 00:49:33,450 --> 00:49:35,970 and this is because convolutional filters are just 1103 00:49:35,970 --> 00:49:37,230 so incredibly powerful. 1104 00:49:37,230 --> 00:49:41,250 They're very fast and very efficient. 1105 00:49:41,250 --> 00:49:44,670 Also, backpropagation is the way that we train neural networks. 1106 00:49:44,670 --> 00:49:47,280 Normally, if you were to train a neural network that 1107 00:49:47,280 --> 00:49:50,480 has a lot of layers, there's actually an exponential growth 1108 00:49:50,480 --> 00:49:54,470 in the time it takes to train because of the branching when 1109 00:49:54,470 --> 00:49:56,730 you go backwards, because each neuron is connected 1110 00:49:56,730 --> 00:49:59,590 to a large number of neurons in the previous layer. 1111 00:49:59,590 --> 00:50:03,060 You get this exponential growth in the number of dependencies 1112 00:50:03,060 --> 00:50:03,980 from a given neuron. 1113 00:50:03,980 --> 00:50:06,910 By using backpropagation, it actually 1114 00:50:06,910 --> 00:50:09,690 reduces it to linear time to train the networks. 1115 00:50:09,690 --> 00:50:12,600 So this allows for efficient training. 1116 00:50:12,600 --> 00:50:16,410 And even with backpropagation and convolution 1117 00:50:16,410 --> 00:50:20,130 being so efficient, it still takes a very large number 1118 00:50:20,130 --> 00:50:23,910 of images, and a long time with a lot of processing power, 1119 00:50:23,910 --> 00:50:27,900 to train neural networks. 1120 00:50:27,900 --> 00:50:29,670 Also, if you'd like to get started 1121 00:50:29,670 --> 00:50:31,550 working with neural networks, there 1122 00:50:31,550 --> 00:50:36,870 are a couple of really nice open source programming 1123 00:50:36,870 --> 00:50:38,290 platforms for neural networks. 1124 00:50:38,290 --> 00:50:41,480 So one of them that we used for our pset was actually 1125 00:50:41,480 --> 00:50:44,440 TensorFlow, which is Google's open source neural network 1126 00:50:44,440 --> 00:50:46,600 platform, and another one would be 1127 00:50:46,600 --> 00:50:49,839 Caffe, which is Berkeley's neural network platform. 1128 00:50:49,839 --> 00:50:51,380 And they actually have an online demo 1129 00:50:51,380 --> 00:50:54,240 where you can plug in images, and immediately 1130 00:50:54,240 --> 00:50:55,590 get identifications. 1131 00:50:55,590 --> 00:50:59,560 So you can get started very quickly with that one. 1132 00:50:59,560 --> 00:51:01,410 Thank you.
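For readers who want to follow up on the TensorFlow suggestion above, here is a minimal sketch of classifying a single image with a pretrained ImageNet model through TensorFlow's Keras API. The model choice and file name are placeholders, and the exact utility functions may vary slightly between TensorFlow versions.

import tensorflow as tf

# Load a pretrained ImageNet classifier (the specific model is an arbitrary choice).
model = tf.keras.applications.MobileNetV2(weights="imagenet")

# "my_photo.jpg" is a placeholder for whatever image you want to classify.
img = tf.keras.utils.load_img("my_photo.jpg", target_size=(224, 224))
x = tf.keras.utils.img_to_array(img)[None, ...]             # shape (1, 224, 224, 3)
x = tf.keras.applications.mobilenet_v2.preprocess_input(x)  # scale pixels to the model's range

preds = model.predict(x)
# Print the five most likely ImageNet labels with their confidences.
for _, label, prob in tf.keras.applications.mobilenet_v2.decode_predictions(preds, top=5)[0]:
    print(f"{label}: {prob:.2f}")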