1 00:00:00,910 --> 00:00:03,000 In this segment, we introduce a model 2 00:00:03,000 --> 00:00:07,020 with multiple parameters and multiple observations. 3 00:00:07,020 --> 00:00:09,980 It is a model that appears in countless real-world 4 00:00:09,980 --> 00:00:11,110 applications. 5 00:00:11,110 --> 00:00:13,920 But instead of giving you a general and abstract model, 6 00:00:13,920 --> 00:00:16,160 we will talk about a specific context that 7 00:00:16,160 --> 00:00:18,900 has all the elements of the general model, 8 00:00:18,900 --> 00:00:21,200 but it has the advantage of being concrete. 9 00:00:21,200 --> 00:00:23,930 And we can also visualize the results. 10 00:00:23,930 --> 00:00:26,040 The model is as follows. 11 00:00:26,040 --> 00:00:30,820 Somebody is holding a ball and throws it upwards. 12 00:00:30,820 --> 00:00:35,290 This ball is going to follow a certain trajectory. 13 00:00:35,290 --> 00:00:37,520 What kind of trajectory is it? 14 00:00:37,520 --> 00:00:40,240 According to Newton's laws, we know 15 00:00:40,240 --> 00:00:43,560 that it's going to be described by a quadratic function 16 00:00:43,560 --> 00:00:44,940 of time. 17 00:00:44,940 --> 00:00:48,500 So here's a plot of such a quadratic function, 18 00:00:48,500 --> 00:00:50,700 where this is the time axis. 19 00:00:50,700 --> 00:00:55,355 And this variable here, x, is the vertical displacement 20 00:00:55,355 --> 00:00:56,910 of the ball. 21 00:00:56,910 --> 00:01:00,000 The ball initially is it a certain location, 22 00:01:00,000 --> 00:01:02,790 at a certain height-- theta 0. 23 00:01:02,790 --> 00:01:04,420 It is thrown upwards. 24 00:01:04,420 --> 00:01:09,190 And it starts moving with some initial velocity, theta 1. 25 00:01:09,190 --> 00:01:11,260 But because of the gravitational force, 26 00:01:11,260 --> 00:01:13,640 it will start eventually going down. 27 00:01:13,640 --> 00:01:17,740 So this parameter theta 2, which would actually be negative, 28 00:01:17,740 --> 00:01:21,120 reflects the gravitational constant. 29 00:01:21,120 --> 00:01:25,170 Suppose now that you do not know these parameters. 30 00:01:25,170 --> 00:01:27,750 You do not to know exactly where the ball was 31 00:01:27,750 --> 00:01:28,840 when it was thrown. 32 00:01:28,840 --> 00:01:31,170 You don't know the exact velocity. 33 00:01:31,170 --> 00:01:33,870 And perhaps you also live in a strange gravitational 34 00:01:33,870 --> 00:01:37,190 environment, and you do not know the gravitational constant. 35 00:01:37,190 --> 00:01:39,330 And you would like to estimate those quantities 36 00:01:39,330 --> 00:01:41,150 based on measurements. 37 00:01:41,150 --> 00:01:44,350 If you are to follow the Bayesian Inference methodology, 38 00:01:44,350 --> 00:01:47,550 what you need to do is to model those variables 39 00:01:47,550 --> 00:01:51,539 as random variables, and think of the actual realized 40 00:01:51,539 --> 00:01:55,820 values as realizations of these random variables. 41 00:01:55,820 --> 00:01:58,890 And we also need some prior distributions 42 00:01:58,890 --> 00:02:00,190 for these random variable. 43 00:02:00,190 --> 00:02:03,580 We should specify the joint PDF of these three 44 00:02:03,580 --> 00:02:04,890 random variables. 45 00:02:04,890 --> 00:02:07,290 A common assumption here is to assume 46 00:02:07,290 --> 00:02:09,620 that the random variables involved are 47 00:02:09,620 --> 00:02:11,560 independent of each other. 48 00:02:11,560 --> 00:02:14,210 And each one has a certain prior. 49 00:02:14,210 --> 00:02:16,590 Where do these priors come from? 50 00:02:16,590 --> 00:02:21,520 If for example you know where is the person that's 51 00:02:21,520 --> 00:02:27,220 throwing the ball, if you know their location within let's say 52 00:02:27,220 --> 00:02:31,600 a meter or so, you should have a distribution for Theta 0 53 00:02:31,600 --> 00:02:33,850 that describes your state of knowledge 54 00:02:33,850 --> 00:02:36,160 and which has a width or standard 55 00:02:36,160 --> 00:02:40,060 deviation of maybe a couple of meters. 56 00:02:40,060 --> 00:02:43,280 So the priors are chosen to reflect whatever 57 00:02:43,280 --> 00:02:48,710 you know about the possible values of these parameters. 58 00:02:48,710 --> 00:02:51,260 Then what is going to happen is that you're 59 00:02:51,260 --> 00:02:55,110 going to observe the trajectory at certain points in time. 60 00:02:55,110 --> 00:02:59,100 For example, at a certain time t1, you make a measurement. 61 00:02:59,100 --> 00:03:01,440 But your measurement is not exact. 62 00:03:01,440 --> 00:03:04,900 It is noisy, and you record a certain value. 63 00:03:04,900 --> 00:03:08,020 At another time you make another measurement, 64 00:03:08,020 --> 00:03:10,040 and you record another value. 65 00:03:10,040 --> 00:03:12,560 At another time, you make another measurement, 66 00:03:12,560 --> 00:03:14,550 and you record another value. 67 00:03:14,550 --> 00:03:17,840 And similarly, you get multiple measurements. 68 00:03:17,840 --> 00:03:19,430 On the basis of these measurements, 69 00:03:19,430 --> 00:03:22,500 you would like to estimate the parameters. 70 00:03:22,500 --> 00:03:25,420 Let us write down a model for these measurements. 71 00:03:25,420 --> 00:03:27,220 We assume that the measurement is 72 00:03:27,220 --> 00:03:29,840 equal to the true position of the ball 73 00:03:29,840 --> 00:03:31,720 at the time of the measurement, which 74 00:03:31,720 --> 00:03:35,300 is this quantity, plus some noise. 75 00:03:35,300 --> 00:03:39,170 And we introduce a model for these noise random variables. 76 00:03:39,170 --> 00:03:41,370 We assume that we have a probability 77 00:03:41,370 --> 00:03:43,050 distribution for them. 78 00:03:43,050 --> 00:03:47,079 And we also assume that they're independent of each other, 79 00:03:47,079 --> 00:03:49,440 and independent from the Thetas. 80 00:03:49,440 --> 00:03:52,550 It is quite often that measuring devices, 81 00:03:52,550 --> 00:03:55,220 when they try to measure something multiple times, 82 00:03:55,220 --> 00:03:57,740 the corresponding noises will be independent. 83 00:03:57,740 --> 00:04:00,370 So this is often a realistic assumption. 84 00:04:00,370 --> 00:04:04,150 And in addition, the noise in the measuring devices 85 00:04:04,150 --> 00:04:08,410 is usually independent from whatever randomness there 86 00:04:08,410 --> 00:04:12,430 is that determined the values of the phenomenon 87 00:04:12,430 --> 00:04:14,470 that you are trying to measure. 88 00:04:14,470 --> 00:04:19,120 So these are pretty common and realistic assumptions. 89 00:04:19,120 --> 00:04:21,220 Let us take stock of what we have. 90 00:04:21,220 --> 00:04:24,670 We have observations that are determined according 91 00:04:24,670 --> 00:04:28,160 to this relation, where Wi are noise terms. 92 00:04:28,160 --> 00:04:30,700 Now let us make some more concrete assumptions. 93 00:04:30,700 --> 00:04:33,970 We will assume that the random variables, the Thetas, 94 00:04:33,970 --> 00:04:36,010 are normal random variables. 95 00:04:36,010 --> 00:04:38,320 And similarly, the W's are normal. 96 00:04:38,320 --> 00:04:42,020 As we said before, they're all independent. 97 00:04:42,020 --> 00:04:44,470 And to keep the formula simple, we 98 00:04:44,470 --> 00:04:47,270 will assume that the means of those random variables 99 00:04:47,270 --> 00:04:50,640 are 0, although the same procedure can 100 00:04:50,640 --> 00:04:53,900 be followed if the means are non-0. 101 00:04:53,900 --> 00:04:56,930 We would like to estimate the Theta parameters 102 00:04:56,930 --> 00:04:59,370 on the basis of the observations. 103 00:04:59,370 --> 00:05:03,070 We will use as usual, the appropriate form of the Bayes 104 00:05:03,070 --> 00:05:05,320 rule, which is this one-- the Bayes 105 00:05:05,320 --> 00:05:08,030 rule for continuous random variables. 106 00:05:08,030 --> 00:05:12,430 The only thing to notice is that in this notation here, 107 00:05:12,430 --> 00:05:18,720 x is an n-dimensional vector, because we have n observations. 108 00:05:18,720 --> 00:05:22,700 And theta in this example is a three-dimensional vector, 109 00:05:22,700 --> 00:05:25,830 because we have three unknown parameters. 110 00:05:25,830 --> 00:05:29,850 So wherever you see a theta or an x without a subscript, 111 00:05:29,850 --> 00:05:34,230 it should be interpreted as a vector. 112 00:05:34,230 --> 00:05:37,890 Now, in order to calculate this posterior distribution, 113 00:05:37,890 --> 00:05:43,060 we need to put our hands on the conditional density of X. 114 00:05:43,060 --> 00:05:45,650 Actually, it's a joint density, because X 115 00:05:45,650 --> 00:05:49,460 is a vector given the value of Theta. 116 00:05:49,460 --> 00:05:51,720 The arguments are pretty much the same 117 00:05:51,720 --> 00:05:53,950 as in our previous examples. 118 00:05:53,950 --> 00:05:56,390 And it goes as follows. 119 00:05:56,390 --> 00:05:59,420 Suppose that I tell you the value of the parameters, 120 00:05:59,420 --> 00:06:00,550 as here. 121 00:06:00,550 --> 00:06:02,590 Then we look at this equation. 122 00:06:02,590 --> 00:06:06,560 This quantity is now a constant. 123 00:06:06,560 --> 00:06:09,090 And we have a constant plus normal noise. 124 00:06:09,090 --> 00:06:13,050 So Xi is this normal noise whose mean 125 00:06:13,050 --> 00:06:15,610 is shifted by this constant. 126 00:06:15,610 --> 00:06:19,090 So Xi is going to be a normal random variable, 127 00:06:19,090 --> 00:06:24,160 with a mean of theta 0 plus theta 1ti, 128 00:06:24,160 --> 00:06:29,270 plus theta 2ti squared. 129 00:06:29,270 --> 00:06:32,075 And a variance equal to the variance of Wi. 130 00:06:34,970 --> 00:06:37,900 We know what the normal PDF is. 131 00:06:37,900 --> 00:06:39,840 So we can write it down. 132 00:06:39,840 --> 00:06:42,880 It's the usual exponential of a quadratic. 133 00:06:42,880 --> 00:06:45,880 And in this quadratic, we have Xi 134 00:06:45,880 --> 00:06:49,710 minus the mean of the normal random variable 135 00:06:49,710 --> 00:06:52,870 that we're dealing with. 136 00:06:52,870 --> 00:06:55,500 Let us now continue and write down 137 00:06:55,500 --> 00:06:57,600 a formula for the posterior. 138 00:06:57,600 --> 00:07:00,270 We first have this denominator term, 139 00:07:00,270 --> 00:07:02,760 which does not involve any thetas, 140 00:07:02,760 --> 00:07:07,580 and as in previous examples, does not really concern us. 141 00:07:07,580 --> 00:07:11,860 Here we have the joint PDF of the vector of Thetas. 142 00:07:11,860 --> 00:07:13,450 There's three of them. 143 00:07:13,450 --> 00:07:17,110 Because we assumed that the Thetas are independent, 144 00:07:17,110 --> 00:07:23,390 the joint PDF factors as a product of individual PDFs. 145 00:07:30,550 --> 00:07:37,120 And then, we have here the joint PDF of X, conditioned on Theta. 146 00:07:37,120 --> 00:07:39,960 Now with the same argument is in our previous discussion 147 00:07:39,960 --> 00:07:42,290 of the case of multiple observations, 148 00:07:42,290 --> 00:07:45,310 once I tell you the values of Theta, 149 00:07:45,310 --> 00:07:48,960 then the Xi's are just functions of the noises. 150 00:07:48,960 --> 00:07:53,470 The noises are independent, so the Xi's are also independent. 151 00:07:53,470 --> 00:07:57,020 So in the conditional universe, where the value of Theta is 152 00:07:57,020 --> 00:08:00,950 known, the Xi's are independent and therefore, 153 00:08:00,950 --> 00:08:06,270 the joint PDF of the Xi's is equal to the product 154 00:08:06,270 --> 00:08:10,770 of the marginal PDFs of each one of the Xi's. 155 00:08:14,330 --> 00:08:19,950 But this marginal PDF in the conditional universe 156 00:08:19,950 --> 00:08:24,390 of the Xi's is something that we have already calculated. 157 00:08:24,390 --> 00:08:27,560 And so we know what each one of these densities is. 158 00:08:27,560 --> 00:08:29,300 We can write them down. 159 00:08:29,300 --> 00:08:32,169 And we obtain an expression of this form. 160 00:08:32,169 --> 00:08:35,000 We have a normalizing constant in the beginning. 161 00:08:35,000 --> 00:08:39,090 We have here a term that comes from the prior for Theta 0, 162 00:08:39,090 --> 00:08:41,789 a prior from Theta 1. 163 00:08:41,789 --> 00:08:44,120 Here is a typical term that comes 164 00:08:44,120 --> 00:08:50,390 from the density of Xi, which is this term up here. 165 00:08:50,390 --> 00:08:53,610 So here is what we have so far. 166 00:08:53,610 --> 00:08:56,840 How should we now estimate Theta if we 167 00:08:56,840 --> 00:08:59,330 wish to derive a point estimate? 168 00:08:59,330 --> 00:09:02,570 The natural process is to look for the maximum posteriori 169 00:09:02,570 --> 00:09:05,760 probability estimate, which will maximize 170 00:09:05,760 --> 00:09:09,040 this expression over theta. 171 00:09:09,040 --> 00:09:12,550 Find a set of theta parameters. 172 00:09:12,550 --> 00:09:15,270 It's a three-dimensional vector for which 173 00:09:15,270 --> 00:09:17,480 is this quantity is largest. 174 00:09:17,480 --> 00:09:21,660 Equivalently we can look at the exponent, get rid of the minus 175 00:09:21,660 --> 00:09:24,550 signs, and minimize the quadratic function 176 00:09:24,550 --> 00:09:26,430 that we obtain here. 177 00:09:26,430 --> 00:09:29,680 How does one minimize a quadratic function? 178 00:09:29,680 --> 00:09:33,080 We take the derivatives with respect 179 00:09:33,080 --> 00:09:39,450 to each one of the parameters, and set the derivative to 0. 180 00:09:42,000 --> 00:09:48,375 We will get this way three equations and three unknowns. 181 00:09:51,490 --> 00:09:54,730 And because it's a quadratic function of theta, 182 00:09:54,730 --> 00:09:58,970 these derivatives will be linear functions of theta. 183 00:09:58,970 --> 00:10:01,510 So these equations that we're dealing with 184 00:10:01,510 --> 00:10:03,560 will be linear equations. 185 00:10:03,560 --> 00:10:06,430 So it's a system of three linear equations 186 00:10:06,430 --> 00:10:08,190 which we can solve numerically. 187 00:10:08,190 --> 00:10:11,570 And this is what is usually done in practice. 188 00:10:11,570 --> 00:10:14,920 A this is exactly what it takes to calculate 189 00:10:14,920 --> 00:10:16,960 the maximum a posteriori probability 190 00:10:16,960 --> 00:10:21,910 estimate in this particular example that we have discussed. 191 00:10:21,910 --> 00:10:24,680 It turns out as we will discuss later, 192 00:10:24,680 --> 00:10:28,600 that there are many interesting properties for this estimate, 193 00:10:28,600 --> 00:10:31,160 and which are quite general.