1 00:00:00,530 --> 00:00:02,960 The following content is provided under a Creative 2 00:00:02,960 --> 00:00:04,370 Commons license. 3 00:00:04,370 --> 00:00:07,410 Your support will help MIT OpenCourseWare continue to 4 00:00:07,410 --> 00:00:11,060 offer high-quality educational resources for free. 5 00:00:11,060 --> 00:00:13,960 To make a donation or view additional materials from 6 00:00:13,960 --> 00:00:19,790 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:19,790 --> 00:00:21,040 ocw.mit.edu. 8 00:00:22,775 --> 00:00:24,130 PROFESSOR: All right. 9 00:00:24,130 --> 00:00:27,580 So we've got three main topics to talk about. 10 00:00:27,580 --> 00:00:29,320 One is distributions. 11 00:00:29,320 --> 00:00:30,980 The other is Monte Carlo methods. 12 00:00:30,980 --> 00:00:33,340 And one is on regression. 13 00:00:33,340 --> 00:00:38,720 So for distributions, which distributions have we learned 14 00:00:38,720 --> 00:00:39,970 about in class? 15 00:00:43,336 --> 00:00:44,340 Hmm? 16 00:00:44,340 --> 00:00:46,340 AUDIENCE: Normal. 17 00:00:46,340 --> 00:00:46,720 PROFESSOR: OK. 18 00:00:46,720 --> 00:00:47,970 So we have normal. 19 00:00:51,990 --> 00:00:53,382 What's another one? 20 00:00:53,382 --> 00:00:54,326 AUDIENCE: Uniform. 21 00:00:54,326 --> 00:00:55,576 PROFESSOR: OK. 22 00:00:59,530 --> 00:01:01,160 And there's one more that he's kind of 23 00:01:01,160 --> 00:01:03,542 mentioned, I think, in passing. 24 00:01:03,542 --> 00:01:04,760 AUDIENCE: Exponential? 25 00:01:04,760 --> 00:01:05,780 PROFESSOR: Yes. 26 00:01:05,780 --> 00:01:07,030 So exponential. 27 00:01:13,430 --> 00:01:17,010 So for uniform, what would this look like if I were to 28 00:01:17,010 --> 00:01:23,315 plot this as a histogram, and I have endpoints A and B? 29 00:01:26,358 --> 00:01:30,354 Someone clue me in? 30 00:01:30,354 --> 00:01:30,852 Hmm? 31 00:01:30,852 --> 00:01:32,350 AUDIENCE: [INAUDIBLE] straight line. 
32 00:01:32,350 --> 00:01:34,140 PROFESSOR: So it's going to be a horizontal line, right? 33 00:01:37,790 --> 00:01:39,200 And if we were to look at the function for 34 00:01:39,200 --> 00:01:40,470 this, it would be-- 35 00:01:46,890 --> 00:01:52,430 the probability would be 1 over b minus a for all points 36 00:01:52,430 --> 00:01:54,690 between a and b. 37 00:01:54,690 --> 00:01:57,335 So let's look at this graphically. 38 00:02:02,160 --> 00:02:05,010 So this chunk of code should not be too difficult to 39 00:02:05,010 --> 00:02:07,280 understand at this point, right? 40 00:02:07,280 --> 00:02:09,639 All we're doing is we're using the random 41 00:02:09,639 --> 00:02:11,580 number generator, randint. 42 00:02:11,580 --> 00:02:14,720 It's going to return us an integer, random integer from a 43 00:02:14,720 --> 00:02:17,410 uniform distribution between a and b. 44 00:02:17,410 --> 00:02:20,810 Is there anyone that's puzzled by that? 45 00:02:20,810 --> 00:02:21,640 All right. 46 00:02:21,640 --> 00:02:24,280 We're going to do that for numpoints, and then we're 47 00:02:24,280 --> 00:02:25,810 going to plot a histogram. 48 00:02:25,810 --> 00:02:27,980 The only parameter that I don't think you've seen here 49 00:02:27,980 --> 00:02:29,230 is this normed=True. 50 00:02:32,100 --> 00:02:37,460 What this does is, normally, when you use the hist command 51 00:02:37,460 --> 00:02:41,260 in Python, it's going to give you raw frequency counts on 52 00:02:41,260 --> 00:02:42,130 the y-axis. 53 00:02:42,130 --> 00:02:45,380 What normed=True does is it gives you the proportion of 54 00:02:45,380 --> 00:02:47,890 the points that wound up in a particular bin. 55 00:02:47,890 --> 00:02:52,350 So I can actually show you both ways. 56 00:02:56,430 --> 00:02:59,400 So does that look about right? 57 00:02:59,400 --> 00:03:05,770 For 100 bins, got, what, 100,000 points? 58 00:03:05,770 --> 00:03:11,620 Each one has about 0.01, so it looks right, 1% in each bin.
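The normalized histogram described above can be sketched without any plotting. This is a minimal sketch, not the recitation's own code: the function name, the endpoints 1 and 100, and the fixed seed are assumptions chosen for illustration. It draws integers with randint and reports the proportion landing on each value, which is the quantity a normalized histogram with one bin per value would show. (In Matplotlib, normed=True was later renamed density=True; it scales the bars so their total area is 1, which matches the per-bin proportion only when each bin has width 1.)

```python
import random
from collections import Counter


def uniform_int_proportions(a, b, num_points, seed=0):
    """Draw num_points integers uniformly from a..b (inclusive) and return
    the fraction of draws that landed on each value."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    counts = Counter(rng.randint(a, b) for _ in range(num_points))
    return {value: counts[value] / num_points for value in range(a, b + 1)}


# Hypothetical endpoints: 100 possible values, 100,000 draws.
proportions = uniform_int_proportions(1, 100, 100_000)
# Each value should hold roughly 1% of the draws, i.e. 1 / (b - a + 1).
print(min(proportions.values()), max(proportions.values()))
```

Multiplying each proportion by the number of points gives back the raw frequency counts that the un-normed histogram shows.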
59 00:03:11,620 --> 00:03:13,330 So that was normed. 60 00:03:13,330 --> 00:03:14,580 If we do it un-normed-- 61 00:03:21,990 --> 00:03:24,620 see how the y-axis here has changed? 62 00:03:24,620 --> 00:03:29,230 Before, it was from like 0 to 0.12, or [? 0-1 ?] 63 00:03:29,230 --> 00:03:33,060 Now, it's from 0 to like 1,000. 64 00:03:33,060 --> 00:03:34,870 That's all that normed parameter does. 65 00:03:34,870 --> 00:03:38,150 But this is what we would expect. 66 00:03:38,150 --> 00:03:39,380 This is for integers. 67 00:03:39,380 --> 00:03:46,700 And then, of course, Python also has a way of doing it for 68 00:03:46,700 --> 00:03:47,880 floating point. 69 00:03:47,880 --> 00:03:51,210 So here, we are going to use the uniform command. 70 00:03:51,210 --> 00:03:54,650 And then when I say, show continuous uniform, going to 71 00:03:54,650 --> 00:03:59,920 give it the a and b, 0 and 1.0. 72 00:03:59,920 --> 00:04:03,660 And it's really not going to look all that much different. 73 00:04:03,660 --> 00:04:08,265 It's just that the x-axis is from 0 to 1. 74 00:04:15,440 --> 00:04:15,632 Ok. 75 00:04:15,632 --> 00:04:19,329 So uniform is easy. 76 00:04:19,329 --> 00:04:21,594 What does a Gaussian look like, or a normal? 77 00:04:27,710 --> 00:04:31,880 Like if I were to plot it, what should this look like? 78 00:04:31,880 --> 00:04:33,700 AUDIENCE: Bell curve. 79 00:04:33,700 --> 00:04:36,040 PROFESSOR: OK, it'll be a bell curve. 80 00:04:36,040 --> 00:04:37,661 Where is its peak going to be? 81 00:04:37,661 --> 00:04:39,425 AUDIENCE: Exactly in the middle? 82 00:04:39,425 --> 00:04:40,310 AUDIENCE: At the mean. 83 00:04:40,310 --> 00:04:40,850 PROFESSOR: At the mean. 84 00:04:40,850 --> 00:04:42,200 Thank you. 85 00:04:42,200 --> 00:04:44,190 So the peak is going to be at the mean. 86 00:04:44,190 --> 00:04:46,430 We usually denote it with mu.
87 00:04:46,430 --> 00:04:49,540 And then it's going to fall off asymmetrically or 88 00:04:49,540 --> 00:04:52,370 symmetrically off on either side? 89 00:04:52,370 --> 00:04:53,620 Symmetrically. 90 00:04:57,330 --> 00:05:01,200 Now, a Gaussian can be specified fully using two 91 00:05:01,200 --> 00:05:01,820 parameters. 92 00:05:01,820 --> 00:05:04,730 What are they? 93 00:05:04,730 --> 00:05:07,920 You have one here, and then you have standard deviation. 94 00:05:07,920 --> 00:05:12,976 So mean and sigma. 95 00:05:17,040 --> 00:05:20,020 Now, the function for this is not something you're going to 96 00:05:20,020 --> 00:05:20,530 have to know. 97 00:05:20,530 --> 00:05:24,770 But I wanted to show it to you. 98 00:05:24,770 --> 00:05:26,810 And the stats major can correct me if I'm wrong. 99 00:05:45,580 --> 00:05:47,760 So it might be a little scary. 100 00:05:47,760 --> 00:05:48,300 I don't know. 101 00:05:48,300 --> 00:05:50,130 It intimidated me the first time I saw it. 102 00:05:50,130 --> 00:05:51,540 Does that look about right to you? 103 00:05:51,540 --> 00:05:51,830 AUDIENCE: Yes. 104 00:05:51,830 --> 00:05:52,830 PROFESSOR: All right. 105 00:05:52,830 --> 00:05:56,170 So the reason why I threw that out there is because what I 106 00:05:56,170 --> 00:06:03,380 want to do is show you the ideal form when we plot out 107 00:06:03,380 --> 00:06:07,960 this function, versus a bunch of random samples we've drawn 108 00:06:07,960 --> 00:06:10,280 from a distribution that is Gaussian. 109 00:06:15,200 --> 00:06:19,580 So, I have a function make Gaussian plot. 110 00:06:19,580 --> 00:06:23,240 All it takes is the mean, standard deviation, how many 111 00:06:23,240 --> 00:06:26,480 points we want to draw from the distribution. 112 00:06:26,480 --> 00:06:28,720 And then I have a parameter here, show ideal. 113 00:06:28,720 --> 00:06:33,520 And we'll get to that in a second. 114 00:06:33,520 --> 00:06:38,160 The function that we use is called dot Gauss. 
115 00:06:38,160 --> 00:06:41,410 And it just takes a mean and the standard deviation. 116 00:06:44,930 --> 00:06:48,860 We're also going to compute the ideal points. 117 00:06:48,860 --> 00:06:54,010 So if I take the mean, and I go a couple of standard 118 00:06:54,010 --> 00:06:58,590 deviations in either direction on the x-axis, then I can plot 119 00:06:58,590 --> 00:07:01,270 out what the y should be according to this function 120 00:07:01,270 --> 00:07:06,690 here, and then just do a histogram. 121 00:07:09,710 --> 00:07:12,030 If I want to show this plot, that's what 122 00:07:12,030 --> 00:07:13,920 that parameter controls. 123 00:07:13,920 --> 00:07:15,940 It'll plot out the function. 124 00:07:15,940 --> 00:07:19,120 And if not, then it'll just plot the histogram. 125 00:07:19,120 --> 00:07:21,340 So let's see what this looks like with just the histogram. 126 00:07:34,370 --> 00:07:37,290 So it looks like what we would expect. 127 00:07:37,290 --> 00:07:38,610 We have the nice bell shape. 128 00:07:38,610 --> 00:07:42,350 It's centered at 0, and it's got a standard deviation of 1. 129 00:07:46,760 --> 00:07:49,680 These are the relative frequencies of a random 130 00:07:49,680 --> 00:07:54,150 sampling of points from a Gaussian distribution. 131 00:07:54,150 --> 00:07:59,950 And we can see that if we look at the ideal version or the 132 00:07:59,950 --> 00:08:14,595 actual function, it matches very closely. 133 00:08:20,200 --> 00:08:25,850 And then for various shapes, standard deviation of 2, 134 00:08:25,850 --> 00:08:28,520 different mean, different standard deviation. 135 00:08:28,520 --> 00:08:31,500 So it's pretty easy, right? 136 00:08:31,500 --> 00:08:34,200 Are there any questions on Gaussian distributions or 137 00:08:34,200 --> 00:08:35,450 normal distributions? 138 00:08:45,760 --> 00:08:45,790 Ok. 139 00:08:45,790 --> 00:08:48,410 So, the last one we have-- 140 00:08:48,410 --> 00:08:49,660 AUDIENCE: [INAUDIBLE]. 
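The comparison between the sampled histogram and the ideal curve can be sketched as follows. This is a rough sketch under stated assumptions: the board formula is not in the transcript, so the standard normal density is used here, and the names gaussian_pdf and sample_density, the bin range of plus or minus 4, and the fixed seed are all hypothetical. Sampling uses random.gauss from the standard library.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Standard normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def sample_density(mu, sigma, num_points, low, high, num_bins, seed=0):
    """Histogram random.gauss samples on [low, high) and scale each bin to a
    density (proportion / bin width) so it is comparable to the ideal curve."""
    rng = random.Random(seed)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for _ in range(num_points):
        x = rng.gauss(mu, sigma)
        if low <= x < high:
            index = min(int((x - low) / width), num_bins - 1)
            counts[index] += 1
    centers = [low + (i + 0.5) * width for i in range(num_bins)]
    densities = [c / (num_points * width) for c in counts]
    return centers, densities


centers, densities = sample_density(0.0, 1.0, 100_000, -4.0, 4.0, 40)
# Largest gap between the sampled histogram and the ideal curve at bin centers.
worst_gap = max(abs(d - gaussian_pdf(c, 0.0, 1.0)) for c, d in zip(centers, densities))
print(worst_gap)
```

With this many samples the gap should be small, which is the close match between the histogram and the ideal curve seen in the plots.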
141 00:08:52,170 --> 00:08:53,030 PROFESSOR: Oh -- 142 00:08:53,030 --> 00:08:55,060 frange is a custom function. 143 00:08:55,060 --> 00:08:56,890 So we actually define it up here. 144 00:08:56,890 --> 00:08:57,320 AUDIENCE: [INAUDIBLE]. 145 00:08:57,320 --> 00:09:01,640 PROFESSOR: Was kind of hoping I could slip that past you. 146 00:09:01,640 --> 00:09:05,930 It's just like range, except instead of integers, it 147 00:09:05,930 --> 00:09:09,010 returns a list of floating point numbers separated by 148 00:09:09,010 --> 00:09:10,270 the step argument. 149 00:09:10,270 --> 00:09:18,400 So it starts at a lower-end range start, and stops at the 150 00:09:18,400 --> 00:09:24,950 stop, and then increments by step, until it returns a bunch 151 00:09:24,950 --> 00:09:26,200 of floating point numbers. 152 00:09:37,240 --> 00:09:39,710 The last one is the exponential distribution. 153 00:09:39,710 --> 00:09:42,490 And I don't know-- did he really explain what the shape 154 00:09:42,490 --> 00:09:47,350 looked like for this at all? 155 00:09:47,350 --> 00:09:50,740 So we can go really quickly through it, because it doesn't 156 00:09:50,740 --> 00:09:55,850 sound like he actually expects you to know it too deeply. 157 00:09:55,850 --> 00:09:59,250 Basically, it'll look like that. 158 00:09:59,250 --> 00:10:01,270 And the function is-- 159 00:10:13,410 --> 00:10:14,840 you don't need to know it. 160 00:10:14,840 --> 00:10:16,865 It's just there for your edification. 161 00:10:21,030 --> 00:10:24,400 Lambda is greater than 0. 162 00:10:24,400 --> 00:10:27,340 So I'm just going to show you what it looks like, and then 163 00:10:27,340 --> 00:10:28,590 we'll move on. 164 00:10:38,990 --> 00:10:44,250 So here, the blue are the sample points, and the red is 165 00:10:44,250 --> 00:10:45,500 the ideal curve. 166 00:10:47,940 --> 00:10:50,559 Just different values of lambda.
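A helper like frange and the ideal exponential curve might look like the sketch below. Several things here are assumptions: the recitation's own frange is not shown in the transcript, so whether it includes the stop value is a guess (this version excludes it, like range), and the density used is the standard exponential form, lambda times e to the minus lambda x for x at or above 0, since the board formula is not recorded. Sampling uses random.expovariate.

```python
import math
import random


def frange(start, stop, step):
    """Like range, but for floats: start, start + step, ... while below stop.
    Excluding stop is an assumption about the original helper."""
    values = []
    i = 0
    while start + i * step < stop:
        values.append(start + i * step)
        i += 1
    return values


def exponential_pdf(x, lam):
    """Standard exponential density, with lam > 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


lam = 2.0  # hypothetical rate parameter
rng = random.Random(0)
samples = [rng.expovariate(lam) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)  # should be close to 1 / lam

# Ideal curve evaluated on a grid, as one would plot it against the histogram.
ideal_curve = [(x, exponential_pdf(x, lam)) for x in frange(0.0, 3.0, 0.5)]
print(sample_mean, ideal_curve[0])
```

The ideal curve starts at lam when x is 0 and only decreases from there, which is the downward slope visible in the plots.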
167 00:10:50,559 --> 00:10:52,924 AUDIENCE: Does it always have a downward slope like that for 168 00:10:52,924 --> 00:10:55,290 it to be exponential? 169 00:10:55,290 --> 00:10:58,940 PROFESSOR: Yeah, in this case. 170 00:10:58,940 --> 00:11:02,180 There's another family of distributions that we're not 171 00:11:02,180 --> 00:11:03,430 going to touch on. 172 00:11:07,310 --> 00:11:12,660 But that is it for distributions for today. 173 00:11:12,660 --> 00:11:14,110 Unless anyone has any questions, I'm 174 00:11:14,110 --> 00:11:17,150 going to move on. 175 00:11:17,150 --> 00:11:19,310 OK. 176 00:11:19,310 --> 00:11:28,950 So the next big topic is Monte Carlo methods. 177 00:11:28,950 --> 00:11:34,670 So can someone give me an informal definition of what a 178 00:11:34,670 --> 00:11:36,265 Monte Carlo method is? 179 00:11:40,971 --> 00:11:46,350 AUDIENCE: Really roughly, is it based on using a random 180 00:11:46,350 --> 00:11:48,900 method to try to approximate something that's not random, 181 00:11:48,900 --> 00:11:52,370 by doing it many, many times over? 182 00:11:52,370 --> 00:11:53,550 PROFESSOR: Yeah, more or less. 183 00:11:53,550 --> 00:11:57,470 It's trying to arrive at a solution by repeated sampling, 184 00:11:57,470 --> 00:11:58,720 or random sampling. 185 00:12:00,950 --> 00:12:05,040 And we've seen many different applications of this. 186 00:12:05,040 --> 00:12:10,240 But we're going to review them and kind of try and get a 187 00:12:10,240 --> 00:12:11,820 better understanding. 188 00:12:11,820 --> 00:12:15,840 So the Monty Hall problem. 189 00:12:15,840 --> 00:12:18,670 This is a Monte Carlo simulation. 190 00:12:18,670 --> 00:12:23,582 So, one, what's the action that a person should take? 191 00:12:23,582 --> 00:12:24,430 AUDIENCE: [INAUDIBLE]. 192 00:12:24,430 --> 00:12:24,800 PROFESSOR: All right. 193 00:12:24,800 --> 00:12:27,430 And does anyone remember what proportion of the time if they 194 00:12:27,430 --> 00:12:28,754 switch they won?
195 00:12:28,754 --> 00:12:29,640 AUDIENCE: 2/3. 196 00:12:29,640 --> 00:12:31,700 PROFESSOR: Two-thirds, Ok. 197 00:12:31,700 --> 00:12:34,770 So I happen to know this works-- 198 00:12:37,470 --> 00:12:38,934 maybe. 199 00:12:38,934 --> 00:12:41,419 I think my program died. 200 00:12:48,380 --> 00:12:52,640 OK, so it works. 201 00:12:52,640 --> 00:12:57,710 Is this code confusing to anyone or cryptic? 202 00:12:57,710 --> 00:13:00,420 I tried to make it a little bit simpler than the code that 203 00:13:00,420 --> 00:13:01,770 was in the handout for class. 204 00:13:06,020 --> 00:13:07,620 We have a number of trials. 205 00:13:07,620 --> 00:13:10,750 We're going to pick a door for the prize. 206 00:13:10,750 --> 00:13:12,170 The player's going to choose a door. 207 00:13:14,820 --> 00:13:18,720 If they choose to stay, and the prize is in the door that 208 00:13:18,720 --> 00:13:21,400 they chose, then stay wins. 209 00:13:21,400 --> 00:13:27,680 And if they choose to switch, and the prize door is not the 210 00:13:27,680 --> 00:13:31,060 door that they originally chose, then switch wins. 211 00:13:34,447 --> 00:13:36,350 So it's easy. 212 00:13:36,350 --> 00:13:40,380 What I wanted to try and do is look at an intuitive 213 00:13:40,380 --> 00:13:41,730 explanation for this. 214 00:13:45,120 --> 00:13:47,870 At office hours, we were kicking around different ways 215 00:13:47,870 --> 00:13:49,200 of explaining this. 216 00:13:49,200 --> 00:13:53,550 And we went to Wikipedia, and we found this explanation. 217 00:13:53,550 --> 00:13:59,700 So the idea is let's say that the contestant 218 00:13:59,700 --> 00:14:01,130 chooses door One. 219 00:14:01,130 --> 00:14:04,890 So there's a 1/3 probability that they've chosen the door 220 00:14:04,890 --> 00:14:07,210 that has the prize behind it. 
221 00:14:07,210 --> 00:14:10,860 And then there's a 1/3 probability that it's behind 222 00:14:10,860 --> 00:14:13,120 door number Two, 1/3 probability it's behind door 223 00:14:13,120 --> 00:14:14,790 number Three. 224 00:14:14,790 --> 00:14:18,080 The key to this kind of explanation is that if you 225 00:14:18,080 --> 00:14:21,080 consider both Two and Three together, then there's a 2/3 226 00:14:21,080 --> 00:14:25,600 probability that the prize is behind one of those two doors. 227 00:14:28,880 --> 00:14:31,210 So the player chooses, and then Monty opens a door. 228 00:14:31,210 --> 00:14:34,390 There's a goat behind door number Three. 229 00:14:34,390 --> 00:14:37,140 This new knowledge doesn't change, though, the 230 00:14:37,140 --> 00:14:41,180 probability that you chose the correct door. 231 00:14:41,180 --> 00:14:45,520 So you still have 1/3 chance that One was the correct door. 232 00:14:45,520 --> 00:14:49,300 And there's still 2/3 chance on this side. 233 00:14:49,300 --> 00:14:52,100 But you know this one is 0, because you see the goat. 234 00:14:52,100 --> 00:14:57,542 So this door has to have a 2/3 chance of having the prize. 235 00:14:57,542 --> 00:14:58,960 Does that agree with you? 236 00:15:01,550 --> 00:15:04,960 So it's one way of explaining it. 237 00:15:04,960 --> 00:15:05,370 I don't know. 238 00:15:05,370 --> 00:15:09,950 I had problems getting this into my head. 239 00:15:09,950 --> 00:15:12,170 Does anyone want me to try again? 240 00:15:12,170 --> 00:15:14,066 All right. 241 00:15:14,066 --> 00:15:15,316 AUDIENCE: [INAUDIBLE] 242 00:15:17,538 --> 00:15:22,498 two doors the probability that your goat is going to be 243 00:15:22,498 --> 00:15:25,308 [INAUDIBLE] behind the door you chose [INAUDIBLE], so it's 244 00:15:25,308 --> 00:15:26,820 basically the same [INAUDIBLE]? 245 00:15:26,820 --> 00:15:29,490 PROFESSOR: Same idea, but kind of negating it, and thinking 246 00:15:29,490 --> 00:15:31,073 of it from the negative direction.
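The simulation walked through earlier can be sketched in a few lines. This is a sketch in the spirit of the description (pick a prize door, pick a player door, tally stay wins and switch wins), not the handout code; the function name, door numbering, and fixed seed are assumptions. It relies on the fact that Monty always opens a goat door, so switching wins exactly when the first pick was wrong.

```python
import random


def simulate_monty_hall(num_trials, seed=0):
    """Return the fraction of trials won by staying and by switching."""
    rng = random.Random(seed)
    stay_wins = 0
    switch_wins = 0
    for _ in range(num_trials):
        prize_door = rng.randint(1, 3)
        chosen_door = rng.randint(1, 3)
        if chosen_door == prize_door:
            stay_wins += 1    # staying wins only if the first pick was right
        else:
            switch_wins += 1  # Monty removes the other goat, so switching wins
    return stay_wins / num_trials, switch_wins / num_trials


stay_rate, switch_rate = simulate_monty_hall(100_000)
print(stay_rate, switch_rate)
```

The two rates should come out near 1/3 and 2/3, matching the door-by-door argument above.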
247 00:15:34,900 --> 00:15:37,780 Another explanation that was good was if you had a million 248 00:15:37,780 --> 00:15:43,670 doors, and you had 999,999 goats, and you had one prize, 249 00:15:43,670 --> 00:15:45,000 you have a one in a million chance of 250 00:15:45,000 --> 00:15:46,510 choosing the right door. 251 00:15:46,510 --> 00:15:49,770 So now imagine Monty walking down and opening up 252 00:15:49,770 --> 00:15:54,400 999,998 doors, each with a goat behind it. 253 00:15:54,400 --> 00:15:57,560 Well, now you have your door that's still closed, and the 254 00:15:57,560 --> 00:16:02,230 mystery door that's also closed. 255 00:16:02,230 --> 00:16:04,993 The probability that you chose the correct door is still one 256 00:16:04,993 --> 00:16:06,310 in a million. 257 00:16:06,310 --> 00:16:11,380 So if you see 999,998 goats, and one closed door, and you 258 00:16:11,380 --> 00:16:14,080 know that your door only has a one in a million chance, you 259 00:16:14,080 --> 00:16:15,770 want to switch to the other door, because that probably 260 00:16:15,770 --> 00:16:18,440 has the prize. 261 00:16:18,440 --> 00:16:20,860 So different ways of thinking about it. 262 00:16:20,860 --> 00:16:23,960 With probability problems and statistics problems, it always 263 00:16:23,960 --> 00:16:26,410 helps to-- or at least, I think it does-- to have an 264 00:16:26,410 --> 00:16:28,850 intuitive idea of what's going on. 265 00:16:28,850 --> 00:16:33,670 So with that said, let's talk about pi. 266 00:16:33,670 --> 00:16:39,870 Because this is one of my favorite Monte Carlo methods. 267 00:16:39,870 --> 00:16:42,250 Because it's got a nice explanation. 268 00:16:42,250 --> 00:16:48,500 So does anyone need me to talk about the idea behind this, 269 00:16:48,500 --> 00:16:52,435 like how this method works, or to go through it? 270 00:16:56,200 --> 00:16:57,450 Someone's nodding. 271 00:17:00,710 --> 00:17:05,705 So the idea is we have a square.
272 00:17:10,140 --> 00:17:17,630 And its side is 2r units long. 273 00:17:17,630 --> 00:17:19,020 So what's the area of the square? 274 00:17:22,450 --> 00:17:23,700 So Asq ... 275 00:17:27,390 --> 00:17:29,970 squared, right? 276 00:17:29,970 --> 00:17:32,170 Now, we still have a circle that's 277 00:17:32,170 --> 00:17:33,420 inscribed in the square. 278 00:17:37,100 --> 00:17:39,850 And it's got a radius of r. 279 00:17:39,850 --> 00:17:41,150 So area of circle. 280 00:17:46,490 --> 00:17:51,800 If we take the ratio of the area of the circle to the area of the 281 00:17:51,800 --> 00:17:57,410 square, then we have pi over 4. 282 00:17:57,410 --> 00:18:01,875 Now, let's assume that I throw darts at this. 283 00:18:04,430 --> 00:18:07,630 Wakes people up. 284 00:18:07,630 --> 00:18:12,670 And there's a uniform probability that the point 285 00:18:12,670 --> 00:18:15,856 will land somewhere in the square here. 286 00:18:15,856 --> 00:18:23,800 If I throw N of these, then I can expect pi over 4 of them, 287 00:18:23,800 --> 00:18:25,865 times N, to wind up in the circle. 288 00:18:28,500 --> 00:18:31,130 And since I find this number and this number, and I want to 289 00:18:31,130 --> 00:18:33,326 find pi, I can just rearrange this. 290 00:18:36,295 --> 00:18:37,545 That's how we get pi. 291 00:18:40,590 --> 00:18:42,500 So let's go to the code. 292 00:18:45,800 --> 00:18:47,830 We just have some easy code. 293 00:18:47,830 --> 00:18:51,280 It gets a random point within a square that's from minus r 294 00:18:51,280 --> 00:18:55,580 to r, so 2r units long. 295 00:18:55,580 --> 00:18:57,840 I have a function that makes a whole bunch of points. 296 00:19:00,470 --> 00:19:02,760 And then I have a function that checks if a point is 297 00:19:02,760 --> 00:19:09,740 within a circle of radius r and another function that 298 00:19:09,740 --> 00:19:11,910 looks at a bunch of points and counts how many 299 00:19:11,910 --> 00:19:15,330 are within the circle.
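The area argument from the board can be written out as a short derivation. The symbols are a notational assumption: N is the number of darts thrown and M is the number that land inside the circle.

```latex
\begin{align*}
A_{\text{sq}} &= (2r)^2 = 4r^2, &
A_{\text{circ}} &= \pi r^2, &
\frac{A_{\text{circ}}}{A_{\text{sq}}} &= \frac{\pi r^2}{4r^2} = \frac{\pi}{4}. \\[4pt]
M &\approx \frac{\pi}{4}\,N
&&\Longrightarrow&
\pi &\approx \frac{4M}{N}.
\end{align*}
```

The radius cancels in the ratio, so the estimate does not depend on the size of the square.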
300 00:19:15,330 --> 00:19:17,290 And then I have my compute pi function here. 301 00:19:19,900 --> 00:19:23,500 And all it does is you can either pass it some points 302 00:19:23,500 --> 00:19:29,200 that are already made, or just say, I want to have 100,000 303 00:19:29,200 --> 00:19:31,990 darts thrown at this square. 304 00:19:31,990 --> 00:19:36,230 And it'll make a whole bunch of those random points, figure 305 00:19:36,230 --> 00:19:38,470 out how many are in the circle. 306 00:19:38,470 --> 00:19:40,380 And then we have-- 307 00:19:40,380 --> 00:19:48,080 this would be m and numpoints N. If we multiply it by 4, 308 00:19:48,080 --> 00:19:53,190 that gives us pi, more or less. 309 00:19:53,190 --> 00:19:57,893 So let's look at a couple of plots. 310 00:19:57,893 --> 00:20:00,530 I have a function here, runtrials. 311 00:20:00,530 --> 00:20:02,920 And what it's going to do is it's going to run a number of 312 00:20:02,920 --> 00:20:07,020 trials for a given number of points. 313 00:20:07,020 --> 00:20:15,900 So what I want to do is I'm going to run 50 trials for 314 00:20:15,900 --> 00:20:17,900 each number of points. 315 00:20:17,900 --> 00:20:21,590 And I'm going to have a points list that goes from 10 to 316 00:20:21,590 --> 00:20:23,950 10,000 in 1000-point increments. 317 00:20:26,770 --> 00:20:28,470 I'm going to run the trials and get the results. 318 00:20:28,470 --> 00:20:31,880 And then I'm going to plot my results. 319 00:20:31,880 --> 00:20:33,380 And why don't we just throw that out there? 320 00:20:48,410 --> 00:20:48,650 Ok. 321 00:20:48,650 --> 00:20:52,480 So on the plot, the blue line, the blue horizontal line, that's 322 00:20:52,480 --> 00:20:55,750 the actual value of pi, as near as a computer can 323 00:20:55,750 --> 00:20:58,240 approximate it. 324 00:20:58,240 --> 00:21:00,800 On the x-axis, we have the number of darts that we threw 325 00:21:00,800 --> 00:21:03,320 at the square.
326 00:21:03,320 --> 00:21:07,260 And each red dot represents the result of one trial of 327 00:21:07,260 --> 00:21:12,120 throwing however many darts at a board. 328 00:21:12,120 --> 00:21:16,945 So when you're down here, and you're only throwing 10 darts, 329 00:21:16,945 --> 00:21:19,130 you tend to have a very wide spread for the 330 00:21:19,130 --> 00:21:21,120 estimated value of pi. 331 00:21:21,120 --> 00:21:28,770 As you increase the number of darts, you get much closer-- 332 00:21:28,770 --> 00:21:31,330 I would say shot group, but grouping is probably more 333 00:21:31,330 --> 00:21:34,060 appropriate. 334 00:21:34,060 --> 00:21:38,646 And it's much closer to the actual value of pi. 335 00:21:38,646 --> 00:21:41,980 There's nothing really unusual about this, right? 336 00:21:41,980 --> 00:21:44,760 Nothing confusing? 337 00:21:44,760 --> 00:21:53,020 So another way of visualizing this is to actually, well, 338 00:21:53,020 --> 00:21:55,380 look at the darts that are thrown. 339 00:22:02,880 --> 00:22:06,600 So I have a function here, plot pi scatter. 340 00:22:06,600 --> 00:22:10,520 And this is actually just going to plot this. 341 00:22:13,320 --> 00:22:17,280 And it's going to do it for 10 points, 100 points, 1,000 342 00:22:17,280 --> 00:22:19,450 points, and 10,000 points. 343 00:22:19,450 --> 00:22:26,500 And we'll see why we can start converging on pi. 344 00:22:26,500 --> 00:22:30,990 So this is with only 10 darts thrown at the square. 345 00:22:30,990 --> 00:22:34,430 The value for pi is really pretty off. 346 00:22:34,430 --> 00:22:36,140 And it doesn't really look very compelling. 347 00:22:42,810 --> 00:22:47,360 In fact, one of the darts actually 348 00:22:47,360 --> 00:22:49,150 fell outside the circle. 349 00:22:49,150 --> 00:22:50,730 Nine of the darts fell inside the circle. 350 00:22:50,730 --> 00:22:54,310 So you're not going to get a real good estimate there.
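The dart-throwing estimate and the shrinking spread can be sketched as below. This is a sketch under assumptions, not the recitation's code: the function names, the unit radius, and the fixed seed are hypothetical. Each trial throws num_points darts uniformly at the square from minus r to r and returns 4 times the fraction inside the circle; nine of ten darts inside gives 4 * 9 / 10 = 3.6, which is why ten darts is not enough.

```python
import math
import random
import statistics


def in_circle(x, y, r):
    """True if the point (x, y) lies within a circle of radius r at the origin."""
    return x * x + y * y <= r * r


def estimate_pi(num_points, rng, r=1.0):
    """Throw num_points darts at the square [-r, r] x [-r, r]; return 4 * M / N."""
    inside = 0
    for _ in range(num_points):
        x = rng.uniform(-r, r)
        y = rng.uniform(-r, r)
        if in_circle(x, y, r):
            inside += 1
    return 4.0 * inside / num_points


def spread_of_estimates(num_points, num_trials, rng):
    """Mean and sample standard deviation of repeated pi estimates."""
    estimates = [estimate_pi(num_points, rng) for _ in range(num_trials)]
    return statistics.mean(estimates), statistics.stdev(estimates)


rng = random.Random(0)
for num_points in (10, 100, 1000, 10000):
    mean, spread = spread_of_estimates(num_points, 50, rng)
    print(num_points, round(mean, 3), round(spread, 3))
```

The printed spread should fall as the number of darts grows, which is the tighter grouping around pi seen in the trial plot.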
351 00:22:54,310 --> 00:22:57,240 The blue dots there represent being in the circle. 352 00:22:57,240 --> 00:22:59,810 Red is outside. 353 00:22:59,810 --> 00:23:02,250 So if we do it with 100 points, it starts getting a 354 00:23:02,250 --> 00:23:04,832 little better. 355 00:23:04,832 --> 00:23:08,770 If we do with 1,000 points, starts getting better. 356 00:23:12,690 --> 00:23:20,720 If we do it with 10,000 points. 357 00:23:20,720 --> 00:23:21,970 Anyone confused? 358 00:23:25,330 --> 00:23:29,840 So I'm going to move on and show you how we can use the 359 00:23:29,840 --> 00:23:32,895 same method to do numeric integration. 360 00:23:39,740 --> 00:23:41,880 So here we go. 361 00:23:41,880 --> 00:23:44,570 Here's that frange function again, so it's 362 00:23:44,570 --> 00:23:48,360 not confusing anyone. 363 00:23:48,360 --> 00:23:53,170 What we're going to do is we're going to use a Monte 364 00:23:53,170 --> 00:23:58,946 Carlo method to integrate a polynomial. 365 00:24:01,790 --> 00:24:04,030 So let's say that I have-- 366 00:24:11,310 --> 00:24:12,560 what I want to find. 367 00:24:19,730 --> 00:24:22,070 I'm going to do it for-- 368 00:24:22,070 --> 00:24:26,780 because this is a numeric method, let's say do it from 369 00:24:26,780 --> 00:24:27,820 negative 5 to 5. 370 00:24:27,820 --> 00:24:29,155 So I want to do this. 371 00:24:37,180 --> 00:24:39,990 If you haven't had calculus or anything like that, don't 372 00:24:39,990 --> 00:24:41,010 worry about this. 373 00:24:41,010 --> 00:24:45,006 But I think a lot of people have, with a couple of 374 00:24:45,006 --> 00:24:46,710 exceptions. 375 00:24:46,710 --> 00:24:53,350 So this is an easy function to integrate, right? 376 00:24:53,350 --> 00:24:55,570 But there are also some functions that are really hard 377 00:24:55,570 --> 00:24:56,540 or impossible to. 
378 00:24:56,540 --> 00:25:02,360 So that's where a lot of software packages actually use 379 00:25:02,360 --> 00:25:06,740 Monte Carlo methods to do a numeric integration for you. 380 00:25:06,740 --> 00:25:14,080 But the idea is the same. I'm going to take a function. 381 00:25:14,080 --> 00:25:17,000 And this is going to be x-squared. 382 00:25:17,000 --> 00:25:19,576 And then I'm going to take an x-min and an x-max. 383 00:25:23,750 --> 00:25:26,020 These become my left and right boundaries. 384 00:25:26,020 --> 00:25:28,310 And then I'm going to find the minimum of the function 385 00:25:28,310 --> 00:25:34,000 between these limits and the maximum of the function. 386 00:25:34,000 --> 00:25:35,460 So you see what I'm doing? 387 00:25:35,460 --> 00:25:37,715 I'm defining a rectangle. 388 00:25:40,270 --> 00:25:41,975 So again, same thing. 389 00:25:47,260 --> 00:25:48,250 Same principle. 390 00:25:48,250 --> 00:25:49,500 I have the area of the rectangle. 391 00:25:54,070 --> 00:25:57,260 I don't have the area of this guy. 392 00:25:57,260 --> 00:25:59,080 That's what I'm trying to find. 393 00:25:59,080 --> 00:26:04,350 But I know that if I find the ratio, the number of points 394 00:26:04,350 --> 00:26:07,740 that land in the square-- 395 00:26:07,740 --> 00:26:13,220 or the ratio that land in this curve versus the total in the 396 00:26:13,220 --> 00:26:17,565 square, then I can find this area pretty easily. 397 00:26:20,360 --> 00:26:28,940 So this function, find function, y-min, y-max. 398 00:26:28,940 --> 00:26:30,200 Does exactly what it says. 399 00:26:33,750 --> 00:26:37,540 Just goes between x-min and x-max, and then finds where 400 00:26:37,540 --> 00:26:43,150 the function is a minimum and where it's a maximum. 401 00:26:43,150 --> 00:26:46,546 So the function I'm calling f. 402 00:26:46,546 --> 00:26:49,350 It's one of the few single-letter variable names 403 00:26:49,350 --> 00:26:52,850 I'll use that isn't an index counter.
404 00:26:56,610 --> 00:26:58,910 My random point generator, it's 405 00:26:58,910 --> 00:27:00,840 going to take the bounds-- 406 00:27:00,840 --> 00:27:03,090 x-min, x-max, y-min, y-max. 407 00:27:03,090 --> 00:27:05,640 So it's going to uniformly produce a point that falls 408 00:27:05,640 --> 00:27:06,890 within this rectangle. 409 00:27:10,240 --> 00:27:11,520 My make-points -- 410 00:27:11,520 --> 00:27:14,590 it just makes a whole bunch of these. 411 00:27:14,590 --> 00:27:17,790 Then I have this function between curve. 412 00:27:17,790 --> 00:27:21,610 What this tells me is if I have a point here, it'll 413 00:27:21,610 --> 00:27:24,220 return true, because it's between the 414 00:27:24,220 --> 00:27:28,540 curve and the x-axis. 415 00:27:28,540 --> 00:27:31,510 If it's up here, it's false, right? 416 00:27:34,460 --> 00:27:37,130 Does anyone not understand how that works? 417 00:27:37,130 --> 00:27:38,380 Ah, you're all smart. 418 00:27:40,990 --> 00:27:49,335 So here is our estimate of our main function, estimate area. 419 00:27:49,335 --> 00:27:52,780 You give it a function, x-min, x-max. 420 00:27:52,780 --> 00:27:55,690 I'm going to tell it how many points to toss. 421 00:27:55,690 --> 00:27:58,600 And optionally, we can tell it that we already have points 422 00:27:58,600 --> 00:28:01,160 that have been tossed. 423 00:28:01,160 --> 00:28:04,220 And the first thing we do is find the y-min and the y-max. 424 00:28:07,010 --> 00:28:10,960 And then if we don't have points, we make them. 425 00:28:10,960 --> 00:28:14,910 And then point counter counts how many times a point wound 426 00:28:14,910 --> 00:28:18,030 up between the curve and the x-axis. 427 00:28:21,110 --> 00:28:24,150 And we just iterate through the points. 428 00:28:24,150 --> 00:28:27,185 If it's between the curve, that means it's here. 429 00:28:30,150 --> 00:28:32,340 Then, if it's above the x-axis, we're going to 430 00:28:32,340 --> 00:28:33,590 increment the point counter. 
431 00:28:33,590 --> 00:28:37,910 And then if it's below the x-axis, we're going to 432 00:28:37,910 --> 00:28:38,800 decrement the point counter. 433 00:28:38,800 --> 00:28:40,820 So we're accounting for signs here. 434 00:28:40,820 --> 00:28:46,770 So if we had a function that did this, we'd be able to 435 00:28:46,770 --> 00:28:48,020 properly handle it. 436 00:28:51,170 --> 00:28:55,190 Now we get the rectangular area. 437 00:28:55,190 --> 00:29:00,110 And then all we do is we multiply the rectangular area 438 00:29:00,110 --> 00:29:04,240 by the ratio of the number of points between the curve and 439 00:29:04,240 --> 00:29:08,060 the x-axis and the total number of points thrown. 440 00:29:08,060 --> 00:29:10,250 And that gives us the function area. 441 00:29:13,060 --> 00:29:18,680 So here's my function, x-squared. 442 00:29:18,680 --> 00:29:21,040 And this is just a plot function scatter. 443 00:29:21,040 --> 00:29:23,810 All this is going to do is just do the same thing I did 444 00:29:23,810 --> 00:29:25,060 with the circle. 445 00:29:27,070 --> 00:29:31,920 And I am going to do this for-- 446 00:29:31,920 --> 00:29:35,910 if I tossed 10 points, 100 points, 1,000, 447 00:29:35,910 --> 00:29:38,540 10,000, or a 100,000. 448 00:29:38,540 --> 00:29:39,815 So let's see what this looks like. 449 00:29:47,230 --> 00:29:49,260 Assuming that Python doesn't crash. 450 00:29:57,190 --> 00:29:59,485 So not too nice. 451 00:30:06,690 --> 00:30:26,230 100 points, 1,000 points, 10,000 points. 452 00:30:26,230 --> 00:30:27,480 And then a whole mess of points. 453 00:30:36,956 --> 00:30:40,372 Oh, I crashed it. 454 00:30:40,372 --> 00:30:41,348 Hm? 455 00:30:41,348 --> 00:30:42,812 AUDIENCE: Can't we just [INAUDIBLE]? 456 00:30:49,160 --> 00:30:49,580 PROFESSOR: I'm sorry. 457 00:30:49,580 --> 00:30:50,323 Say that again? 
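The estimate-area routine walked through above can be sketched as follows. This is a sketch under assumptions, not the recitation's code: the function names, the grid used to scan for the bounds, and the fixed seed are hypothetical. One deliberate adjustment is that the rectangle is widened to include the x-axis, so the test of whether a point lies between the curve and the x-axis also works for functions that never touch zero.

```python
import random


def find_y_bounds(f, x_min, x_max, num_steps=1000):
    """Scan f on a grid and return (y_min, y_max), widened to include y = 0."""
    step = (x_max - x_min) / num_steps
    ys = [f(x_min + i * step) for i in range(num_steps + 1)]
    return min(0.0, min(ys)), max(0.0, max(ys))


def estimate_area(f, x_min, x_max, num_points, seed=0):
    """Hit-or-miss estimate of the signed area between f and the x-axis."""
    rng = random.Random(seed)
    y_min, y_max = find_y_bounds(f, x_min, x_max)
    signed_hits = 0
    for _ in range(num_points):
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(y_min, y_max)
        fx = f(x)
        if 0 <= y <= fx:
            signed_hits += 1   # between the x-axis and a curve above it
        elif fx <= y < 0:
            signed_hits -= 1   # between the x-axis and a curve below it
    rectangle_area = (x_max - x_min) * (y_max - y_min)
    return rectangle_area * signed_hits / num_points


area = estimate_area(lambda x: x * x, -5.0, 5.0, 100_000)
print(area)  # the exact integral of x**2 from -5 to 5 is 250/3
```

For x squared on the interval from -5 to 5 the rectangle has area 250 and about a third of the points should land under the curve.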
458 00:30:50,323 --> 00:30:51,573 AUDIENCE: Calculate [INAUDIBLE] 459 00:30:54,187 --> 00:30:58,534 split up the x-axis to a lot of points, and then multiply 460 00:30:58,534 --> 00:31:01,432 those by the value function [INAUDIBLE] 461 00:31:01,432 --> 00:31:02,420 add them up? 462 00:31:02,420 --> 00:31:04,490 PROFESSOR: You're talking about doing a Riemann 463 00:31:04,490 --> 00:31:05,390 approximation? 464 00:31:05,390 --> 00:31:07,000 AUDIENCE: Yeah, [INAUDIBLE]. 465 00:31:07,000 --> 00:31:09,630 PROFESSOR: Or a Riemann sum? 466 00:31:09,630 --> 00:31:16,640 So his question is, why don't you do something like this? 467 00:31:23,650 --> 00:31:39,610 Divide up the x-axis into very small portions, like that, and 468 00:31:39,610 --> 00:31:41,265 then sum up the areas of these rectangles. 469 00:31:43,991 --> 00:31:46,426 Yeah, you could do that. 470 00:31:46,426 --> 00:31:47,676 AUDIENCE: [INAUDIBLE]? 471 00:31:51,310 --> 00:31:53,460 PROFESSOR: You know, I don't have an answer for that. 472 00:31:53,460 --> 00:31:57,644 I can't say which one would work better. 473 00:31:57,644 --> 00:31:59,000 Do you know, Serena? 474 00:32:03,230 --> 00:32:08,920 I would say that right now, whichever one you prefer. 475 00:32:11,810 --> 00:32:15,090 But I'll see if there's any actual research on whether or 476 00:32:15,090 --> 00:32:16,620 not one is better than the other. 477 00:32:16,620 --> 00:32:19,950 It might turn out that there are certain instances where 478 00:32:19,950 --> 00:32:22,500 doing this sort of approximation is better than 479 00:32:22,500 --> 00:32:24,340 doing the approximation I'm talking about. 480 00:32:27,570 --> 00:32:30,832 But I don't know. 481 00:32:30,832 --> 00:32:34,095 Yeah, for this problem, you could definitely use that. 482 00:32:38,800 --> 00:32:41,280 Is everyone good with this? 483 00:32:41,280 --> 00:32:42,260 Anyone confused? 484 00:32:42,260 --> 00:32:45,080 Any questions? 485 00:32:45,080 --> 00:32:45,370 Yeah? 
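The Riemann-sum alternative raised in the question could be sketched as follows. The function name `riemann_sum` and the choice of evaluating the function at the midpoint of each slice are assumptions; the question only says to multiply each slice by the value of the function.

```python
def riemann_sum(f, x_min, x_max, num_slices):
    """Approximate the area under f on [x_min, x_max] with thin rectangles."""
    width = (x_max - x_min) / num_slices
    total = 0.0
    for i in range(num_slices):
        # Height of each rectangle is the function value at the slice midpoint.
        midpoint = x_min + (i + 0.5) * width
        total += f(midpoint) * width
    return total

for n in (10, 100, 1000):
    print(n, riemann_sum(lambda x: x * x, 0.0, 1.0, n))
```

For a smooth one-dimensional function like this one, a fixed grid generally converges faster than random points: the midpoint rule's error shrinks roughly like 1/n-squared, while a Monte Carlo estimate's error shrinks roughly like 1 over the square root of n. Monte Carlo tends to become attractive when the region has many dimensions or an awkward shape.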
486 00:32:45,370 --> 00:32:49,314 AUDIENCE: I think my concern is that you need a 487 00:32:49,314 --> 00:32:52,765 fantastically large number of darts to get a reasonably good 488 00:32:52,765 --> 00:32:54,750 integration [INAUDIBLE]. 489 00:32:54,750 --> 00:32:56,090 PROFESSOR: Yeah. 490 00:32:56,090 --> 00:32:59,970 That is one issue with Monte Carlo methods, is that they do 491 00:32:59,970 --> 00:33:02,660 rely on large numbers. 492 00:33:02,660 --> 00:33:09,218 So, yeah, sometimes they can take a while. 493 00:33:09,218 --> 00:33:11,628 AUDIENCE: At least for the purposes of this class, we 494 00:33:11,628 --> 00:33:15,002 don't need to be able to quantify the error or anything 495 00:33:15,002 --> 00:33:16,930 like that, right? 496 00:33:16,930 --> 00:33:18,180 PROFESSOR: No. 497 00:33:22,080 --> 00:33:26,480 You do need to understand that there can be error. 498 00:33:26,480 --> 00:33:29,750 And you should also understand stuff like confidence 499 00:33:29,750 --> 00:33:31,190 intervals and confidence levels. 500 00:33:34,200 --> 00:33:35,448 Are you OK with that? 501 00:33:35,448 --> 00:33:37,938 AUDIENCE: Mostly. 502 00:33:37,938 --> 00:33:41,175 But in order to get a confidence interval, you'd 503 00:33:41,175 --> 00:33:44,412 have to do several trials at, say, 504 00:33:44,412 --> 00:33:46,420 100,000 points, and then-- 505 00:33:46,420 --> 00:33:47,670 PROFESSOR: Right, exactly. 506 00:33:51,260 --> 00:33:56,380 You could estimate the error. 507 00:33:56,380 --> 00:33:58,220 Like you could estimate it. 508 00:33:58,220 --> 00:34:00,860 But in order to really get a good sense for how much 509 00:34:00,860 --> 00:34:03,820 variance there is, you'd have to do repeated trials. 510 00:34:03,820 --> 00:34:05,810 So yeah. 
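The repeated-trials idea in this exchange can be sketched like this. The names `area_trial` and `repeated_trials`, the use of x-squared on the unit square, the 20 trials, and the 1.96 multiplier (which assumes the estimates are roughly normally distributed) are illustrative assumptions.

```python
import random
import statistics

def area_trial(num_points, rng):
    """One Monte Carlo estimate of the area under x**2 on [0, 1]."""
    hits = 0
    for _ in range(num_points):
        x, y = rng.random(), rng.random()
        if y <= x * x:
            hits += 1
    return hits / num_points   # the bounding square has area 1

def repeated_trials(num_trials, num_points, seed=0):
    """Run several independent trials and summarize their spread."""
    rng = random.Random(seed)
    estimates = [area_trial(num_points, rng) for _ in range(num_trials)]
    mean = statistics.mean(estimates)
    std = statistics.stdev(estimates)
    # Rough 95% interval for a single trial's estimate.
    return mean, std, (mean - 1.96 * std, mean + 1.96 * std)

for n in (100, 1000, 10000):
    mean, std, interval = repeated_trials(20, n)
    print(n, round(mean, 4), round(std, 4))
```

The standard deviation across trials is the empirical measure of how much a single run at that number of points can be expected to wander.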
511 00:34:05,810 --> 00:34:08,260 AUDIENCE: What I guess I was getting at was in order to get 512 00:34:08,260 --> 00:34:10,220 a sense of how big the error is relative to 513 00:34:10,220 --> 00:34:11,989 the number of trials-- 514 00:34:11,989 --> 00:34:13,430 PROFESSOR: Yeah. 515 00:34:13,430 --> 00:34:14,855 AUDIENCE: --without sort of analytically. 516 00:34:14,855 --> 00:34:17,710 But I guess that's probably [INAUDIBLE]. 517 00:34:17,710 --> 00:34:19,113 PROFESSOR: I'm sorry, what? 518 00:34:19,113 --> 00:34:20,965 AUDIENCE: That's not something that we're going to be asked 519 00:34:20,965 --> 00:34:22,360 to do, at least in this course? 520 00:34:22,360 --> 00:34:24,475 PROFESSOR: Yeah, no. 521 00:34:24,475 --> 00:34:27,400 The purpose is we want you to understand that when you do 522 00:34:27,400 --> 00:34:32,280 things like this, that there is some thought that has to go 523 00:34:32,280 --> 00:34:33,929 into, well, how many trials do I need to do? 524 00:34:33,929 --> 00:34:36,280 How many points do I need to throw? 525 00:34:36,280 --> 00:34:39,174 And you have to ask yourself, how much error am 526 00:34:39,174 --> 00:34:41,040 I willing to tolerate? 527 00:34:41,040 --> 00:34:45,940 There's the joke that mathematicians call pi pi, and 528 00:34:45,940 --> 00:34:54,810 then engineers call it 3.14. 529 00:34:54,810 --> 00:34:59,770 OK, so if everyone's done with integration, I'm going to move 530 00:34:59,770 --> 00:35:01,020 on to regression. 531 00:35:07,370 --> 00:35:08,260 Oh, wait, now. 532 00:35:08,260 --> 00:35:11,870 There's one thing I wanted to touch on. 533 00:35:11,870 --> 00:35:20,880 So we kind of looked at some toy problems with Monte Carlo. 534 00:35:20,880 --> 00:35:24,070 And this is, I guess, a toy problem too, because it has to 535 00:35:24,070 --> 00:35:24,670 do with a toy. 536 00:35:24,670 --> 00:35:28,110 Is everyone familiar with the game of Monopoly?
537 00:35:28,110 --> 00:35:32,780 So I don't have to explain the rules too much in depth? 538 00:35:32,780 --> 00:35:33,550 OK. 539 00:35:33,550 --> 00:35:41,150 So let's assume that there are no factors that modify this 540 00:35:41,150 --> 00:35:43,590 distribution. 541 00:35:43,590 --> 00:35:49,390 If I roll the die twice, then each one of these spaces has 542 00:35:49,390 --> 00:35:51,940 an equal probability of being landed on. 543 00:35:51,940 --> 00:35:53,890 It's about 2 and 1/2%. 544 00:35:53,890 --> 00:35:57,800 But there are certain rules that distort this 545 00:35:57,800 --> 00:35:58,700 distribution. 546 00:35:58,700 --> 00:36:01,200 So you can land on Go To Jail. 547 00:36:01,200 --> 00:36:05,070 You can roll three doubles, and get sent to Jail. 548 00:36:05,070 --> 00:36:08,660 You can draw a Chance card, and get sent to Jail, sent to 549 00:36:08,660 --> 00:36:12,070 Go, or sent anywhere on the board. 550 00:36:12,070 --> 00:36:15,660 And there are 10 out of 16 Chance cards that modify this 551 00:36:15,660 --> 00:36:17,560 distribution. 552 00:36:17,560 --> 00:36:19,230 And for Community Chest, same thing. 553 00:36:19,230 --> 00:36:22,340 There's 2 out of the 16 cards that distort the distribution. 554 00:36:22,340 --> 00:36:27,570 So the question is, how do you do this analytically? 555 00:36:27,570 --> 00:36:28,710 And I've tried. 556 00:36:28,710 --> 00:36:30,700 It's hard. 557 00:36:30,700 --> 00:36:33,020 I'm actually not sure if it's possible. 558 00:36:33,020 --> 00:36:36,270 Well, this is a perfect example of where you would use 559 00:36:36,270 --> 00:36:39,940 a Monte Carlo simulation in order to arrive at the answer. 560 00:36:39,940 --> 00:36:44,780 So if you actually want to take a whack at this problem, 561 00:36:44,780 --> 00:36:48,090 you can go to this site called projecteuler.net. 
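A heavily simplified sketch of how such a simulation might start, assuming a 40-square board with Go at index 0, Jail at index 10, and Go To Jail at index 30. It models only the Go To Jail square; the doubles rule and the Chance and Community Chest cards mentioned above are deliberately left out, and the function name `landing_frequencies` is hypothetical.

```python
import random

NUM_SQUARES = 40
JAIL = 10         # index of the Jail square, counting Go as 0
GO_TO_JAIL = 30   # index of the Go To Jail square

def landing_frequencies(num_rolls, use_go_to_jail, seed=0):
    """Fraction of turns that end on each square."""
    rng = random.Random(seed)
    counts = [0] * NUM_SQUARES
    position = 0
    for _ in range(num_rolls):
        # Roll two six-sided dice and move around the board.
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        position = (position + roll) % NUM_SQUARES
        if use_go_to_jail and position == GO_TO_JAIL:
            position = JAIL
        counts[position] += 1
    return [c / num_rolls for c in counts]

plain = landing_frequencies(200000, use_go_to_jail=False)
with_jail = landing_frequencies(200000, use_go_to_jail=True)
print(min(plain), max(plain))
print(with_jail[JAIL], with_jail[GO_TO_JAIL])
```

Without any special rules the frequencies should all sit near 1/40, or 2.5%. Turning on even this one rule pulls probability toward Jail, which is the kind of distortion the full problem asks about.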
562 00:36:48,090 --> 00:36:50,880 They have a whole bunch of mathy questions on there that 563 00:36:50,880 --> 00:36:55,740 are meant to get people to think about math and computer 564 00:36:55,740 --> 00:36:58,550 programming. 565 00:36:58,550 --> 00:37:01,890 And you get little rankings the more questions you answer 566 00:37:01,890 --> 00:37:02,890 correctly, and stuff like that. 567 00:37:02,890 --> 00:37:04,790 So there's a little competition. 568 00:37:04,790 --> 00:37:10,650 But the question in this particular case was, what are 569 00:37:10,650 --> 00:37:14,650 the top three places you'll land on 570 00:37:14,650 --> 00:37:16,500 with all these factors? 571 00:37:16,500 --> 00:37:20,950 And if you represent them as a number that is concatenated 572 00:37:20,950 --> 00:37:23,110 one after the other, what is the number? 573 00:37:23,110 --> 00:37:26,140 What is the six-digit number? 574 00:37:26,140 --> 00:37:28,190 But that's a fun problem. 575 00:37:30,990 --> 00:37:34,075 So going onto something that's less fun, regression. 576 00:37:37,280 --> 00:37:43,030 So can someone tell me what purposes we would use 577 00:37:43,030 --> 00:37:44,280 regression for? 578 00:37:52,190 --> 00:37:52,960 Take a stab. 579 00:37:52,960 --> 00:37:53,300 AUDIENCE: Sure. 580 00:37:53,300 --> 00:37:58,100 If you have experimental data which you believe to fit some 581 00:37:58,100 --> 00:37:59,540 type of theoretical model. 582 00:37:59,540 --> 00:38:00,980 But experiments being 583 00:38:00,980 --> 00:38:04,010 experiments, they're not perfect. 584 00:38:04,010 --> 00:38:06,710 You can't-- 585 00:38:06,710 --> 00:38:09,690 the data points exactly fall in the model, so you have to 586 00:38:09,690 --> 00:38:14,078 find which parameters from the model to pick so that your 587 00:38:14,078 --> 00:38:16,874 experiment [UNINTELLIGIBLE] best fits [INAUDIBLE]. 588 00:38:16,874 --> 00:38:17,810 PROFESSOR: Uh-huh. 
589 00:38:17,810 --> 00:38:21,940 So the idea is that you have a bunch of experimental data 590 00:38:21,940 --> 00:38:24,000 that has error. 591 00:38:24,000 --> 00:38:28,260 And you want to be able to maybe find the underlying 592 00:38:28,260 --> 00:38:32,610 function of those observations. 593 00:38:32,610 --> 00:38:35,420 And you would do that using regression. 594 00:38:35,420 --> 00:38:39,880 So we have a couple of nice tools in 595 00:38:39,880 --> 00:38:41,130 Python for doing that. 596 00:38:44,070 --> 00:38:47,710 Actually, before I move on, another reason is you can find 597 00:38:47,710 --> 00:38:48,090 the function. 598 00:38:48,090 --> 00:38:50,310 But you can also then, once you find that function, you 599 00:38:50,310 --> 00:38:52,560 can use it to predict additional values. 600 00:38:52,560 --> 00:38:56,460 So say you have a gap in your data, or you want to predict 601 00:38:56,460 --> 00:38:58,510 values beyond the range that you collected 602 00:38:58,510 --> 00:39:00,050 observations for. 603 00:39:00,050 --> 00:39:03,270 If you do a regression, you find the function, find the 604 00:39:03,270 --> 00:39:05,510 parameters for the function, then you can use it to predict 605 00:39:05,510 --> 00:39:08,450 those values. 606 00:39:08,450 --> 00:39:14,130 And what we mainly want you to understand here are the 607 00:39:14,130 --> 00:39:17,120 functions that you would use to do it, and how you would 608 00:39:17,120 --> 00:39:21,650 tell if you have a good fit or not a good fit, and the idea 609 00:39:21,650 --> 00:39:23,190 of overfitting. 610 00:39:23,190 --> 00:39:30,710 So we have a little bit of code that demonstrates this, 611 00:39:30,710 --> 00:39:36,980 so a couple of helper functions that compute various 612 00:39:36,980 --> 00:39:40,330 values that you've seen before. 613 00:39:40,330 --> 00:39:44,170 So MSE is the sum of the residual squares. 614 00:39:44,170 --> 00:39:48,230 And then you have the total sum of squares.
615 00:39:48,230 --> 00:39:57,160 So these will help you compute the coefficient of 616 00:39:57,160 --> 00:39:58,410 determination. 617 00:40:00,280 --> 00:40:06,680 And what I'm going to show is let's say I 618 00:40:06,680 --> 00:40:07,930 define a function here. 619 00:40:10,380 --> 00:40:12,470 In this case, I have it defined as x-cubed 620 00:40:12,470 --> 00:40:13,820 plus 5x plus 3. 621 00:40:16,580 --> 00:40:22,310 I am going to, for a certain number of x values, apply the 622 00:40:22,310 --> 00:40:25,860 function and get the y value. 623 00:40:25,860 --> 00:40:28,800 And then to simulate observational data, I'm going 624 00:40:28,800 --> 00:40:33,250 to perturb it using a Gaussian distribution. 625 00:40:33,250 --> 00:40:34,910 So it's going to jitter the points. 626 00:40:39,650 --> 00:40:42,220 And that's what the make observations function does, is 627 00:40:42,220 --> 00:40:47,040 it just adds noise to the y values. 628 00:40:47,040 --> 00:40:49,470 And then I'm going to-- 629 00:40:49,470 --> 00:40:55,730 this function here plots out the measured or observed 630 00:40:55,730 --> 00:41:00,430 values, the simulated. 631 00:41:00,430 --> 00:41:07,750 It computes a fit for one degree. 632 00:41:07,750 --> 00:41:10,297 So in this case, I have two parameters, fit degree 1 and 633 00:41:10,297 --> 00:41:13,000 fit degree 2, because I want to do comparisons. 634 00:41:13,000 --> 00:41:19,080 So it'll compute fit using the first degree and predict some 635 00:41:19,080 --> 00:41:20,500 values for the curve. 636 00:41:23,320 --> 00:41:27,590 And then it'll compute the residual error and the 637 00:41:27,590 --> 00:41:32,290 coefficient of determination and plot it out. 638 00:41:32,290 --> 00:41:36,790 And then it'll do the same thing for the second degree. 639 00:41:42,290 --> 00:41:44,410 Let's see what this looks like. 640 00:41:52,460 --> 00:41:56,885 Let's see Python not behave badly. 641 00:41:59,795 --> 00:42:01,250 There we go.
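The workflow just described (apply a known function, add Gaussian noise, fit polynomials of different degrees, compare error and coefficient of determination) could be sketched with NumPy as below. This is not the recitation's code: the helper names, the noise standard deviation of 2.0, the 200 sample points, and the x range of -5 to 5 are assumptions, and plotting is omitted.

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)          # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot

def true_function(x):
    return x ** 3 + 5 * x + 3

def make_observations(x_vals, noise_sd, rng):
    """Simulate measurements by adding Gaussian noise to the true y values."""
    return true_function(x_vals) + rng.normal(0.0, noise_sd, size=x_vals.shape)

def compare_fits(x_vals, y_obs, degrees):
    """Fit one polynomial per degree; return {degree: (mse, r_squared)}."""
    results = {}
    for degree in degrees:
        coeffs = np.polyfit(x_vals, y_obs, degree)   # least-squares regression
        predicted = np.polyval(coeffs, x_vals)
        mse = float(np.mean((y_obs - predicted) ** 2))
        results[degree] = (mse, float(r_squared(y_obs, predicted)))
    return results

rng = np.random.default_rng(0)
x_vals = np.linspace(-5, 5, 200)
y_obs = make_observations(x_vals, noise_sd=2.0, rng=rng)
results = compare_fits(x_vals, y_obs, degrees=[1, 2, 3])
for degree, (mse, r2) in results.items():
    print(degree, round(mse, 2), round(r2, 4))
```

Because higher-degree polynomials contain the lower-degree ones, the error measured on the same points used for fitting should not go up as the degree increases. Spotting overfitting typically needs points held out from the fit.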
642 00:42:10,640 --> 00:42:15,130 The function that we plotted was, what, x-squared 643 00:42:15,130 --> 00:42:17,310 something, 5x-squared? 644 00:42:17,310 --> 00:42:18,560 Let me see. 645 00:42:22,100 --> 00:42:24,830 x-cubed plus 5x plus 3. 646 00:42:27,960 --> 00:42:30,780 And we're plotting it from negative 2 to 2. 647 00:42:30,780 --> 00:42:34,170 So this is what I'm talking about with the noise. 648 00:42:34,170 --> 00:42:37,840 So each of these red dots represents some observation 649 00:42:37,840 --> 00:42:41,280 that's been disturbed a little bit. 650 00:42:41,280 --> 00:42:45,800 And then I try to fit this with a first degree 651 00:42:45,800 --> 00:42:48,380 polynomial, and then a second degree. 652 00:42:48,380 --> 00:42:51,570 And I see-- 653 00:42:51,570 --> 00:42:57,304 actually, my residual error is lower for my first degree fit. 654 00:42:57,304 --> 00:42:58,620 That's interesting. 655 00:43:01,310 --> 00:43:04,340 So I don't know. 656 00:43:04,340 --> 00:43:06,690 At this point, I'd say just stop and 657 00:43:06,690 --> 00:43:07,450 don't proceed further. 658 00:43:07,450 --> 00:43:09,940 But we know that that's not the right function. 659 00:43:09,940 --> 00:43:16,260 So let's look at what we have for a third degree fit. 660 00:43:16,260 --> 00:43:19,025 It's actually worse, huh. 661 00:43:24,470 --> 00:43:26,530 This is the problem with random programs, is that 662 00:43:26,530 --> 00:43:27,780 sometimes they fail you. 663 00:43:34,950 --> 00:43:37,440 I would say that these are nice pretty plots, but they're 664 00:43:37,440 --> 00:43:41,000 not really telling me much, other than I can fit some 665 00:43:41,000 --> 00:43:43,892 lines to some points. 666 00:43:43,892 --> 00:43:45,620 AUDIENCE: What should it look like? 667 00:43:45,620 --> 00:43:48,540 What are you looking for that's not there?
668 00:43:48,540 --> 00:43:51,590 PROFESSOR: So we know that the function that we made the 669 00:43:51,590 --> 00:43:55,900 observations on is a third degree polynomial. 670 00:43:55,900 --> 00:44:06,850 So it's a little puzzling why this first degree fit is 671 00:44:06,850 --> 00:44:14,120 better than our third degree fit. 672 00:44:14,120 --> 00:44:18,800 That's the conundrum. 673 00:44:18,800 --> 00:44:20,150 So maybe-- 674 00:44:20,150 --> 00:44:23,350 I wonder what would happen if I expanded the x range. 675 00:44:23,350 --> 00:44:27,570 So let's say I go from negative 5 to 5. 676 00:44:27,570 --> 00:44:29,820 Maybe it's just too little data. 677 00:44:34,150 --> 00:44:35,400 That's looking a little better. 678 00:44:43,790 --> 00:44:45,040 Now I feel better. 679 00:44:48,080 --> 00:44:51,190 So the issue was that we just were going from negative 2 to 680 00:44:51,190 --> 00:44:54,070 2, and basically it looked linear there. 681 00:44:54,070 --> 00:44:57,370 So the first degree polynomial was doing fine. 682 00:44:57,370 --> 00:45:01,610 But as soon as we go out and get a little curvy in there, 683 00:45:01,610 --> 00:45:04,820 we see that both the first and the second degree fits, they 684 00:45:04,820 --> 00:45:07,770 have pretty high error. 685 00:45:07,770 --> 00:45:09,460 Their R is pretty good. 686 00:45:09,460 --> 00:45:14,940 But when you compare them with, say, a third degree fit, 687 00:45:14,940 --> 00:45:18,240 you see that the error drops down dramatically. 688 00:45:18,240 --> 00:45:22,050 And it's got higher coefficient of determination. 689 00:45:22,050 --> 00:45:25,130 So what we would say in this case is that this third degree 690 00:45:25,130 --> 00:45:29,800 fit here is a lot better than the first or 691 00:45:29,800 --> 00:45:32,660 second degree fit. 692 00:45:32,660 --> 00:45:35,970 And then we can also look at, say, a fourth degree fit, 693 00:45:35,970 --> 00:45:39,040 which in this case happens to have a higher error. 
694 00:45:39,040 --> 00:45:41,580 So that's a good thing. 695 00:45:41,580 --> 00:45:45,080 And then if we look at a fifth degree fit, it also has a 696 00:45:45,080 --> 00:45:45,630 higher error. 697 00:45:45,630 --> 00:45:50,810 So we'd say in this case that the third degree fit is 698 00:45:50,810 --> 00:45:53,650 probably our best bet, and we probably have a pretty good 699 00:45:53,650 --> 00:45:56,600 idea of what the function is for the underlying model. 700 00:45:59,960 --> 00:46:02,780 AUDIENCE: Which part of this is regression? 701 00:46:02,780 --> 00:46:06,220 PROFESSOR: Well, the part of this that is regression is-- 702 00:46:10,170 --> 00:46:12,560 the part that actually does the regression is this poly 703 00:46:12,560 --> 00:46:15,200 fit method here. 704 00:46:15,200 --> 00:46:19,030 And what you do is you pass it in the x values, the y values, 705 00:46:19,030 --> 00:46:20,860 and the degree of the polynomial that you 706 00:46:20,860 --> 00:46:22,110 want to fit to it. 707 00:46:29,950 --> 00:46:32,270 I've hit the end of my material, 708 00:46:32,270 --> 00:46:33,785 unless someone has questions. 709 00:46:36,950 --> 00:46:41,476 Comments, fears, trepidations? 710 00:46:41,476 --> 00:46:42,853 AUDIENCE: Just [INAUDIBLE] 711 00:46:42,853 --> 00:46:46,266 having done some stuff-- like in Excel, you can fit curves 712 00:46:46,266 --> 00:46:47,238 with the R-squares? 713 00:46:47,238 --> 00:46:47,724 PROFESSOR: Yeah. 714 00:46:47,724 --> 00:46:50,640 AUDIENCE: The R-squared values are really, really high, like 715 00:46:50,640 --> 00:46:51,890 really, really [? wanting ?] 716 00:46:51,890 --> 00:46:55,240 these fits, even though the fits are pretty terrible. 717 00:46:55,240 --> 00:46:55,830 PROFESSOR: Yeah. 718 00:46:55,830 --> 00:46:57,510 AUDIENCE: So that's weird to me. 719 00:46:57,510 --> 00:47:00,150 PROFESSOR: That is puzzling. 720 00:47:00,150 --> 00:47:04,925 And it's quite possible that I have a bug. 
721 00:47:04,925 --> 00:47:07,350 AUDIENCE: I wonder whether there were different 722 00:47:07,350 --> 00:47:09,290 definitions for R-squared that are maybe floating around in 723 00:47:09,290 --> 00:47:10,430 different places? 724 00:47:10,430 --> 00:47:12,200 PROFESSOR: No. 725 00:47:12,200 --> 00:47:14,130 I made a correction to this earlier. 726 00:47:14,130 --> 00:47:16,360 And like I said, maybe I introduced a bug. 727 00:47:16,360 --> 00:47:20,140 So I'm going to have to double-check my math. 728 00:47:20,140 --> 00:47:21,803 Unfortunately, I'm not perfect. 729 00:47:21,803 --> 00:47:23,053 I wish I was.