Lecture 17: Bayesian Statistics
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PHILIPPE RIGOLLET: So today we'll actually just do a brief chapter on Bayesian statistics. And there are entire courses on Bayesian statistics, there are entire books on Bayesian statistics, there are entire careers in Bayesian statistics. So admittedly, I'm not going to be able to do it justice and tell you all the interesting things that are happening in Bayesian statistics. But I think it's important as a statistician to know what it is and how it works, because it's actually a weapon of choice for many practitioners, and because it allows them to incorporate their knowledge about a problem in a fairly systematic manner. So if you look at, say, the Bayesian statistics literature, it's huge. And so here I give you a sort of range of what you can expect to see in Bayesian statistics, from your second edition of a traditional book, to something that involves computation, to things that involve risk thinking.

And there's a lot of Bayesian thinking. There's a lot of material about, you know, the philosophy of thinking Bayesian. This book, for example, seems to be one of them. This book is definitely one of them. This one represents a broad literature on Bayesian statistics for applications, for example, in social sciences. But even in large scale machine learning, there's a lot of Bayesian statistics happening, in particular using something called Bayesian nonparametrics, or hierarchical Bayesian modeling. So we do have some experts at MIT in CSAIL.

Tamara Broderick, for example, is a person who does quite a bit of interesting work on Bayesian nonparametrics. And if that's something you want to know more about, I urge you to go and talk to her. So before we go into more advanced things, we need to start with what the Bayesian approach is. What do Bayesians do, and how is it different from what we've been doing so far?

To understand the difference between Bayesians and what we've been doing so far, we need to first put a name on what we've been doing so far. It's called frequentist statistics. People usually say Bayesian versus frequentist statistics, but by versus I don't mean that the two are naturally in opposition. Actually, often you will see the same method come out of both approaches.
So let's see how we did it, right. The first thing: we had data. We observed some data. And we assumed that this data was generated randomly. The reason we did that is because this would allow us to leverage tools from probability. So let's say by nature, by measurements, you do a survey, you get some data. Then we made some assumptions on the data generating process.

For example, we assumed they were iid. That was one of the recurring things. Sometimes we assumed it was Gaussian, if we wanted to use, say, a t-test. Maybe we did some nonparametric statistics. We assumed it was a smooth function, or maybe a linear regression function. So that was our modeling. And this was basically a way to say, well, we're not going to allow for arbitrary distributions for the data that we have, but only a small set of distributions, indexed by a few parameters, for example. Or at least remove some of the possibilities. Otherwise, there's nothing we can learn.

And so for example, this was associated to some parameter of interest, say theta, or beta in the regression model. Then we had this unknown thing, an unknown parameter. And we wanted to find it. We wanted to either estimate it or test it, or maybe find a confidence interval for this object.

So far I should not have said anything that's new. But this last sentence is actually what's going to be different from the Bayesian part. In particular, this unknown but fixed thing is what's going to be changing. In the Bayesian approach, we still assume that we observe some random data. But the generating process is slightly different. It's sort of a two-layer process. There's one process that generates the parameter, and then one process that, given this parameter, generates the data.
So for the first layer, nobody really believes that there's some random process happening that generates what is going to be the true expected proportion of people who turn their head to the right when they kiss. But this is actually going to give us an easy way to incorporate what we call prior belief. We'll see an example in a second.

But often, you actually have prior belief about what this parameter should be. When we did, say, least squares, we looked over all of the vectors in all of R to the p, including the ones that have coefficients equal to 50 million. Those are things that we might be able to rule out. We might be able to narrow things down to a much smaller scale.

For example, well, I'm not an expert on turning your head to the right or to the left. But maybe you can rule out the possibility that almost everybody is turning their head in one direction, or almost everybody is turning their head in the other direction.

So we have this prior belief. And this belief is going to play, hopefully, a less and less important role as we collect more and more data. But if we have a smaller amount of data, we might want to be able to use this information, rather than just shooting in the dark. And so the idea is to have this prior belief, and then to update this prior belief into what's called the posterior belief after we've seen some data.

Maybe I believe that some parameter should be in some range. But maybe after I see data, it's confirming my belief, so I end up with a belief that's even more concentrated. So belief encompasses basically what you think and how strongly you think about it. That's what I call belief.

So for example, if I have a belief about some parameter theta, maybe my belief is telling me where theta should be and how strongly I believe in it, in the sense that I have a very narrow region where theta could be. The posterior belief is what you have after you see some data. Maybe you're more confident or less confident about what you've seen. Maybe you've shifted your belief a little bit. And so that's what we're going to try to see, how to do this in a principled manner.
To understand this better, there's nothing better than an example. So let's talk about another stupid statistical question. Which is, let's try to understand p. Of course, I'm not going to talk about politics from now on. So let's talk about p, the proportion of women in the population. And so what I could do is to collect some data, X1, ..., Xn, and assume that they're Bernoulli with some parameter p, unknown.

So p is in (0, 1). OK, let's assume that those guys are iid. So this is just an indicator, for each of my collected data points, of whether the person I randomly sampled is a woman: I get a one. If it's a man, I get a zero. Now, I sample these people randomly, and I do know their gender. And the frequentist approach was just saying, OK, let's just estimate p hat being Xn bar. And then we could do some tests.

So here, there's a test. I want to test maybe if p is equal to 0.5 or not. That sounds like a pretty reasonable thing to test. But we want to also maybe estimate p. And this is a case where we definitely have prior belief about what p should be. We are pretty confident that p is not going to be 0.7. We actually believe that it should be extremely close to one half, but maybe not exactly.

Maybe this population is not the population of the world. Maybe this is the population of, say, some college, and we want to understand if this college has half women or not. Maybe we know it's going to be close to one half, but maybe we're not quite sure. We're going to want to integrate that knowledge.

So I could integrate it in a blunt manner by saying, discard the data and say that p is equal to one half. But maybe that's just a little too much. So how do I do this trade-off between the data and this prior knowledge, how do I combine them?
In many instances, essentially what's going to happen is this one half is going to act like one new observation. So if you have five observations, this is just a sixth observation, which will play a role. If you have a million observations, you're going to have a million and one. It's not going to play so much of a role. That's basically how it goes. But definitely not always, because we'll see that if I take my prior to be a point mass at one half, it's basically as if I was discarding my data.

So essentially, there's also your ability to encode how strongly you believe in this prior. And if you believe infinitely more in the prior than you believe in the data you collected, then it's not going to act like one more observation. The Bayesian approach is a tool to, one, include mathematically our prior belief in statistical procedures. Maybe I have this prior knowledge. But if I'm a medical doctor, it's not clear to me how I'm going to turn this into some principled way of building estimators.

And the second goal is going to be to update this prior belief into a posterior belief by using the data. How do I do this? At some point, I sort of suggested that there are two layers. One is where you draw the parameter at random. And two, once you have the parameter, conditionally on this parameter, you draw your data.

Nobody believes this is actually happening, that nature is just rolling dice for us and choosing parameters at random. But this idea that the parameter comes from some random distribution actually captures, very well, how you would encode your prior. How would you say, my belief is as follows?

Well, here's an example about p. I'm 90% sure that p is between 0.4 and 0.6. And I'm 95% sure that p is between 0.3 and 0.8. So essentially, I have these possible values of p. And what I know is that there's 90% chance I'm here, between 0.4 and 0.6. And then I have 0.3 and 0.8, and I know that I'm 95% sure that I'm in here.
If you remember, this sort of looks like the kind of pictures that I made when I had some Gaussian, for example. And I said, oh here we have 90% of the observations. And here, we have 95% of the observations. So in a way, if I were able to tell you all those ranges for all possible values, then I would essentially describe a probability distribution for p.
And what I'm saying is that p is going to have this kind of shape. So of course, if I give you only these two pieces of information, that with 90% I'm between here and here, and with 95% I'm between here and here, then there are many ways I can accomplish that, right. I could have something that looks like this, maybe. It could be like this. There are many ways I can have this.

Some of them are definitely going to be mathematically more convenient than others. And hopefully, we're going to have things that I can parameterize very well. Because if I tell you this is this guy, then there are basically one, two, three, four, five, six, seven parameters. So I probably don't want something that has seven parameters. But maybe I can say, oh, it's a Gaussian, and all I have to do is to tell you where it's centered and what the standard deviation is.

So the idea of using this two-layer thing, where we think of the parameter p as being drawn from some distribution, is really just a way for us to capture this information. Our prior belief being, well, there's this percentage of chances that it's there. And I say percentage of chances because I'm deliberately not using the word probability here. So it's really a way to get close to this.

That's why I say the true parameter is not random, but the Bayesian approach does as if it were random, and then just spits out a procedure out of this thought process, this thought experiment. So when you practice Bayesian statistics a lot, you start getting automatisms. You start doing some things without really thinking about them. Just like when you're a statistician, the first thing you do is ask, can I think of this data as being Gaussian, for example?

When you're Bayesian, you're thinking, OK, I have a set of parameters. So here, I can describe my parameter as being theta in general, in some big parameter space, capital Theta. But what spaces did we encounter? Well, we encountered the real line. We encountered the interval (0, 1) for Bernoullis. And we encountered the positive real line for exponential distributions, et cetera.
And so, if I want to put some prior on those spaces, I'm going to have to have a usual set of tools for this guy, a usual set of tools for this guy, a usual set of tools for this guy. And by usual set of tools, I mean I'm going to have to have a family of distributions that's supported on this space.

So in particular, this is the space in which my parameter, which I usually denote by p for Bernoulli, lives. And so what I need is to find a distribution on the interval (0, 1), just like this guy. The problem with the Gaussian is that it's not on the interval (0, 1). It's going to spill out at the ends. And so it's not going to be something that works for me.

And so the question is, I need to think about distributions that are preferably continuous. Why would I restrict myself to discrete distributions? And for Bernoulli, the one that's actually convenient, and is basically the main tool that everybody uses, is the so-called beta distribution.

So the beta distribution has two parameters. X follows a beta with parameters a and b if it has a density f of x equal to x to the (a minus 1), times (1 minus x) to the (b minus 1), if x is in the interval (0, 1), and 0 for all other x's. OK?
Why is that a good thing? Well, it's a density that's on the interval (0, 1), for sure. But now I have these two parameters, and the set of shapes that I can get by tweaking those two parameters is incredible. It's going to be a unimodal distribution. It's still fairly nice. It's not going to be something that goes like this and this.

Because if you think about this, what would it mean if your prior distribution on the interval (0, 1) had this shape? It would mean that maybe you think that p is here, or maybe you think that p is here, or maybe you think that p is here. Which essentially means that you think that p can come from three different phenomena. And there are other models, called mixtures, for that, that directly account for the fact that maybe there are several phenomena aggregated in your data set.

But if you think that your data set is sort of pure, and that everything comes from the same phenomenon, you want something that looks like this, or maybe looks like this, or maybe is sort of symmetric. You want to get all this stuff. Maybe you want something that says, well, if I'm talking about p being the proportion of women in the whole world, you want something that's probably really spiked around one half. Almost a point mass, because, you know, let's agree that 0.5 is the actual number.

So you want something that says, OK, maybe I'm wrong, but I'm sure I'm not going to be really that far off. So you want something that's really pointy. But if it's something you've never checked, and again I cannot make references at this point, but something where you might have some uncertainty around one half, maybe you want something a little more spread out, which allows you to say, well, I think there's more mass around one half, but there are still some fluctuations that are possible.
And in particular here, I talk about p where the two parameters a and b are actually the same. I call both of them a. In general, a and b are the two shape parameters of the beta.
Oh, sorry, this is not a density. It actually has to be normalized. When you integrate this guy, you get some constant that depends on a and b, through what's called the beta function, which is this combination of gamma functions. That's why it's called the beta distribution. The beta function is, by definition, what you get when you integrate this thing. So you just have to normalize by it. It's just a number that depends on a and b.
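For reference — this is just the standard normalization, written out, not something from the board:

```latex
f(x) = \frac{x^{a-1}(1-x)^{b-1}}{B(a,b)}, \qquad x \in (0,1),
\quad \text{where } B(a,b) = \int_0^1 t^{a-1}(1-t)^{b-1}\,dt
= \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}.
```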
So here, if you take a equal to b, you have something that essentially is symmetric around one half. Because what does it look like? Well, my density f of x is going to be what? It's going to be my constant times x times (1 minus x), all to the power a minus 1. And this function, x times (1 minus x), looks like this.

We've drawn it before. That was something that showed up as the variance of my Bernoulli. So we know it's something that takes its maximum at one half. And now I'm just taking a power of this guy. So I'm really just distorting this thing in some fairly symmetric manner.

So this is the distribution that we actually take for p, the parameter. Notice that this is kind of weird. First of all, this is probably the first time in this entire course that something has a distribution when it's actually a lowercase letter. That's something you have to deal with, because we've been using lowercase letters for parameters, and now we want them to have a distribution. So that's what's going to happen.

This is called the prior distribution. So really, I should write something like f of p is equal to a constant times p to the (a minus 1) times (1 minus p) to the (a minus 1). Well, no, actually I should not, because then it's confusing.
One thing in terms of notation that I'm going to use: when I have a constant here and I don't want to make it explicit — and we'll see in a second why I don't need to make it explicit — I'm going to write this as f of x is proportional to x to the (a minus 1) times (1 minus x) to the (a minus 1). That's just to say, equal to some constant that does not depend on x, times this thing.

So if we continue with our experiment where I'm drawing this data, X1 to Xn, which are Bernoulli p: if p has some distribution, it's not clear what it means to have a Bernoulli with some random parameter. So what I'm going to do is, first I'm going to draw my p. Let's say I get a number, 0.52. And then, I'm going to draw my data conditionally on p. So here comes the first and last flowchart of this class.

So nature first draws p; p follows some beta (a, a). Then I condition on p. And then I draw X1, ..., Xn that are iid Bernoulli p. Everybody understands the process of generating this data? So you first draw a parameter, and then you just flip independent biased coins with this particular p. There's this layered thing.

Now, conditionally on p — so here I have this prior about p, which was this thing. Again, this is just the thought process. It's not anything that actually happens in practice. This is my way of thinking about how the data was generated. And from this, I'm going to try to come up with some procedure.
Just like, if your estimator is the average of the data, you don't have to understand probability to say that my estimator is the average of the data. Anyone outside this room understands that the average is a good estimator for some average behavior. And they don't need to think of the data as being a random variable, et cetera. So same thing, basically.
In this case, you can see that the posterior distribution is still a beta. What it means is, I had this thing. Then I observed my data. And then I continue, and here I'm going to update my prior into some posterior distribution, pi. And this guy is actually also a beta. My posterior distribution for p is also a beta distribution, with the parameters that are on this slide. And I'll have the space to reproduce them.

So I start at the beginning of this flowchart with pi, which is my prior. I'm going to get some observations, and then I'm going to update it into my posterior. And what's beautiful in Bayesian statistics is that, as soon as you have this posterior distribution, it's essentially capturing all the information about the data that you want for p.
And it's not just a point. It's not just an average. It's actually an entire distribution for the possible values of theta. And it's not the same thing as saying, well, if theta hat is equal to Xn bar, in the Gaussian case I know that this has some mean mu, and then maybe it has variance sigma squared over n. That's not what I mean by, this is my posterior distribution. This is not what I mean.

That would come from the Gaussian thing and the central limit theorem. But what I mean is this guy. And this came exclusively from the prior distribution. If I had another prior, I would not necessarily have a beta distribution as the output.

So when I have the same family of distributions at the beginning and at the end of this flowchart, I say that the beta is a conjugate prior. Meaning I put in a beta as a prior and I get a beta as a posterior. And that's why betas are so popular.
Conjugate priors are really nice, because you know that whatever you put in, what you're going to get in the end is a beta. So all you have to think about is the parameters. You don't have to check again what the posterior is going to look like, what the PDF of this guy is going to be. You don't have to think about it. You just have to check what the parameters are.
And there's families of conjugate priors. Gaussian gives Gaussian, for example. There's a bunch of them. And this is what drives people into using specific priors as opposed to others. It has nice mathematical properties. Nobody believes that p is really distributed according to beta. But it's flexible enough and super convenient mathematically.
Now let's see for one second, before we actually go any further. I didn't mention it: a and b here are both positive numbers. They can be anything positive. So what I did is, I updated a into a plus the sum of my data, and b into b plus n minus the sum of my data. So essentially, a becomes a plus the number of ones — well, here I have a and a. So the first parameter becomes itself plus the number of ones, and the second one becomes itself plus the number of zeros.
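As a minimal sketch of the update rule just described, assuming the Beta(a, b) prior and Bernoulli data from the slides (the specific numbers below are illustrative), the bookkeeping is just two additions:

```python
import numpy as np

def beta_bernoulli_update(a, b, x):
    """Conjugate update: Beta(a, b) prior + Bernoulli data -> Beta posterior.

    a, b: prior parameters (both > 0)
    x:    array of 0/1 observations
    Returns the posterior parameters (a + #ones, b + #zeros).
    """
    x = np.asarray(x)
    n, s = len(x), int(x.sum())
    return a + s, b + (n - s)

# Example: symmetric Beta(2, 2) prior, 10 coin flips with 7 ones.
a_post, b_post = beta_bernoulli_update(2, 2, [1, 1, 1, 0, 1, 1, 0, 1, 0, 1])
print(a_post, b_post)  # -> 9, 5
```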
And so just as a sanity check, what does this mean? If a goes to zero, what does the beta look like? We can actually read this from here. Actually, let's take a goes to — no. Sorry, let's not do this now. I'll do it when we talk about non-informative priors, because it's a little too messy.

So how do we do this? How did I get this posterior distribution, given the prior? How do I update? Well, this is called Bayesian statistics, and you've heard this word, Bayes, before. And the way you've heard it is in the Bayes formula. What was the Bayes formula? The Bayes formula was telling you that the probability of A given B was equal to something that depends on the probability of B given A. That's what it was.

You can actually either remember the formula or you can remember the definition: the probability of A given B is the probability of A and B, divided by the probability of B. And this is the probability of B given A, times the probability of A, divided by the probability of B. That's what Bayes' formula is telling you. Agree?

So now what I want is something that's telling me how this is going to work. What is going to play the role of those events, A and B? Well, one is going to be the parameter: this is going to be the distribution of my parameter theta, given that I see the data. And the other is going to tell me, what is the distribution of the data, given that I know what my parameter theta is.

But that part, the distribution of the data given the parameter theta, is what we've been doing all along. The distribution of the data given the parameter here was n iid Bernoulli p. I knew exactly what their joint probability mass function is.
Then, that was what? So we said that this is going to be my data and this is going to be my parameter. So that means that, this is the probability of my data, given the parameter. This is the probability of the parameter. What is this? What did we call this? This is the prior. It's just the distribution of my parameter.
Now what is this? Well, this is just the distribution of the data itself. This is essentially the distribution of this, if it were not conditioned on p. So if I don't condition on p, this data is going to be a bunch of iid Bernoullis with some parameter. But the parameter is random, right?

So for different realizations of this data set, I'm going to get different parameters for the Bernoulli. And so that leads to some sort of convolution. It's not really a convolution in this case, but it's like some sort of composition of distributions. I have the randomness that comes from the parameter, and then the randomness that comes from realizing the Bernoulli.

That's just the marginal distribution. It actually might be painful to understand what this is, right. In a way, it's sort of a mixture, and it's not super nice. But we'll see that this actually won't matter for us. This is going to be some number. It's going to be there. But it won't matter for us what it is, because it actually does not depend on the parameter. And that's all that matters to us.
Let's put some names on those things. This was very informal. So let's put some actual names on what we call prior. So what is the formal definition of a prior, what is the formal definition of a posterior, and what are the rules to update it?
So I'm going to have my data, which is going to be X1, ..., Xn. Let's say they are iid, but they don't actually have to be. And I'm going to have their distribution given theta. And when I say given, it's either given like I did in the first part of this course, in all previous chapters, or conditionally on. If you're thinking like a Bayesian, what I really mean is conditioning on this random parameter; once I condition, it's as if it were a fixed number.

Given theta, X1, ..., Xn are going to have some distribution. Let's assume for now it's a PDF, pn of X1, ..., Xn given theta. I'm going to write theta like this, after a bar. So for example, let's say this is a PDF. It could be a PMF. Everything I say, I'm going to state for PDFs. I'm going to combine PDFs with PDFs, but I could combine PDFs with PMFs, PMFs with PDFs, or PMFs with PMFs. So everywhere you see a D, it could be an M.

Now I have those things. So what does it mean? Here is an example: X1, ..., Xn are iid N(theta, 1). Now I know exactly what the joint PDF of this thing is. It means that pn of X1, ..., Xn given theta is equal to what? Well, it's 1 over square root of 2 pi, to the power n, times e to the minus sum from i equal 1 to n of (Xi minus theta) squared, divided by 2. So that's just the joint distribution of n iid N(theta, 1) random variables.
That's my pn given theta. Now, this is what we denoted by f sub theta before. We had the subscript before, but now we just put a bar and theta, because we want to remember that this is actually conditioned on theta. But this is just notation. You should just think of this as being the usual thing that you get from some statistical model.

Now, that's pn. Theta has prior distribution pi. Think of it as either a PDF or a PMF again. For example, pi of theta was what? Well, it was some constant times theta to the (a minus 1) times (1 minus theta) to the (a minus 1). So theta has some prior distribution, and that's another PDF.

So now I'm given the distribution of my X's given theta, and I'm given the distribution of my theta. I'm given this guy, that's this guy. I'm given that guy, which is my pi. So that's my pn of X1, ..., Xn given theta; that's my pi of theta. And what is the marginal distribution of the data? Well, it's just the integral of pn of X1, ..., Xn given theta, times pi of theta, d theta, over all possible values of theta. When I integrate out my theta, I compute the marginal distribution. I did this by integrating. That's just basic probability, conditional probabilities. And if I had a PMF, I would just sum over the values of theta.

Now, what I want to find is what's called the posterior distribution; that was the prior distribution. The posterior is pi of theta given X1, ..., Xn. If I use Bayes' rule, I know that this is pn of X1, ..., Xn given theta, times pi of theta, and then divided by the marginal distribution of those guys, which I will write as the integral over theta of pn of X1, ..., Xn given theta, times pi of theta, d theta.
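In symbols, the update rule just written on the board is:

```latex
\pi(\theta \mid X_1,\dots,X_n)
= \frac{p_n(X_1,\dots,X_n \mid \theta)\,\pi(\theta)}
       {\int_\Theta p_n(X_1,\dots,X_n \mid t)\,\pi(t)\,dt}
\;\propto\; p_n(X_1,\dots,X_n \mid \theta)\,\pi(\theta).
```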
Everybody's with me, still? If you're not comfortable with this, it means that you probably need to go read a couple of pages on conditional densities and conditional PMFs from your probability class. There's really not much there. It's just a matter of being able to define those quantities, the density of x given y. This is just what's called a conditional density. You need to understand what this object is and how it relates to the joint distribution of x and y, or maybe the distribution of x or the distribution of y.

But it's the same rules. One way to actually remember this is, these are exactly the same rules as for events. When you see a bar, it's the same thing as the probability of this and this guy, divided by the probability of the second guy. For densities, it's just the same: the joint (the comma) divided by the density of the second guy. That's it. So if you remember this, you can just do some pattern matching and see what I just wrote here.

Now, I can compute every single one of these guys. This is something I get from my modeling. So I did not write this; it's not written in the slides. But I gave a name to this guy: that was my prior distribution. And that was my posterior distribution. In chapter three, maybe, what did we call this guy? The one that does not have a name and that's in the box. What did we call it?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: It is the joint distribution of the Xi's. And we gave it a name.
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: It's the likelihood, right? This is exactly the likelihood. This was the likelihood of theta. And this is something that's very important to remember, and that really reminds you that these things are really not that different, maximum likelihood estimation and Bayesian estimation. Because your posterior is really just your likelihood times something that's just putting some weights on the thetas, depending on where you think theta should be.

Say my likelihood in theta looked like this, but my prior in theta looked like this. I said, oh, I really want thetas that are like this. So what's going to happen is, I'm going to turn this into some posterior that looks like this. So I'm really just reweighting. And this denominator is a constant that does not depend on theta, right? Agreed? I integrated over theta, so theta is gone. So forget about this guy.

I have, basically, that the posterior distribution, up to scaling — because it has to be a probability density and not just any function that's positive — is the product of these two guys. It's a weighted version of my likelihood. That's all it is. I'm just weighting the likelihood, using my prior belief on theta. And so given this guy, a natural estimator, if you follow the maximum likelihood principle, would be the maximizer of this posterior. Agreed? That would basically be doing exactly what maximum likelihood estimation is telling you.
So it turns out that you can. It's called Maximum A Posteriori, or MAP, and I won't talk much about this. So theta hat is the argmax of pi of theta given X1, ..., Xn. And it sounds like it's OK. You give me a density over all values of my parameter, and you're asking me to summarize it into one number. I'm just going to take the most likely of those values.

But you could summarize it otherwise. You could take the average. You could take the median. You could take a bunch of numbers. And the beauty of Bayesian statistics is that you don't have to take any number in particular. You have an entire posterior distribution. It's not only telling you where theta is, but it's also telling you how confident you are, if you actually report the entire posterior.
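As a minimal sketch of these one-number summaries, assuming the Beta(9, 5) posterior from the earlier conjugate-update example (the numbers are illustrative):

```python
from scipy import stats

# Hypothetical Beta posterior, e.g. from the conjugate update above.
a_post, b_post = 9, 5
post = stats.beta(a_post, b_post)

# Three one-number summaries of the same posterior:
map_est = (a_post - 1) / (a_post + b_post - 2)  # mode of Beta(a,b), valid for a, b > 1
mean_est = post.mean()                           # posterior mean, a / (a + b)
median_est = post.median()                       # posterior median

print(map_est, mean_est, median_est)
```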
Now, let's say theta is p, between 0 and 1. If my posterior distribution looks like this, or my posterior distribution looks like this, then those two guys have the same mode. This is the same value. And they're symmetric, so they'll also have the same mean. So these two posterior distributions give me the same summary into one number. However, clearly one is much more confident than the other one. So I might as well spit out the whole posterior as a solution.
You can do even better. People actually do things, such as drawing a random number from this distribution. Say, this is my number. That's kind of dangerous, but you can imagine you could do this.
So this is how it works. That's what we went through. As you notice, I don't care so much about this part here, because it does not depend on theta. So given the product of those two things, this thing is only the constant that I need to divide by, so that when I integrate this thing over theta, it integrates to one. Because this has to be a probability density on theta. I can write this and just forget about that part. And that's what's written at the top of this slide.

This notation, this sort of weird alpha, or infinity sign open to the right, whatever you want to call this thing, is really just emphasizing the fact that I don't care about the constant. I write it because I can, but you know what it is. In some instances, you have to compute the integral. In some instances, you don't have to compute the integral.
And a lot of Bayesian computation is about saying, OK, it's actually really hard to compute this integral, so I'd rather not do it. So let me try to find some methods that will allow me to sample from the posterior distribution without having to compute this. That's what's called Markov chain Monte Carlo, or MCMC, and that's exactly what these methods are doing. They're using only ratios of things like that, for different thetas. And if you take ratios, the normalizing constant is gone, and you don't need to find this integral.
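To make that concrete, here is a minimal sketch of the idea — a random-walk Metropolis sampler that only ever evaluates ratios (here, differences of logs) of the unnormalized posterior, so the integral in the denominator is never computed. The data, prior parameter, and step size below are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1, 1, 1, 0, 1, 1, 0, 1, 0, 1])  # illustrative Bernoulli data
s, n = x.sum(), len(x)

def log_unnormalized_posterior(p, a=2.0):
    # Beta(a, a) prior times Bernoulli likelihood, without the normalizing constant.
    if not 0.0 < p < 1.0:
        return -np.inf
    return (s + a - 1) * np.log(p) + (n - s + a - 1) * np.log(1 - p)

p, samples = 0.5, []
for _ in range(20000):
    proposal = p + 0.1 * rng.standard_normal()  # random-walk proposal
    # Accept with probability min(1, ratio); the unknown constant cancels in the ratio.
    if np.log(rng.random()) < log_unnormalized_posterior(proposal) - log_unnormalized_posterior(p):
        p = proposal
    samples.append(p)

print(np.mean(samples[5000:]))  # ≈ posterior mean (s + a)/(n + 2a) = 9/14
```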
So we won't go into those details at all. That would be the purpose of an entire course on Bayesian inference. Actually, even Bayesian computations would be an entire course on its own. And there's some very interesting things that are going on there, the interface of stats and computation.
So let's go back to our example and see if we can actually compute any of those things. Because it's very nice if I give you some data and some formulas; let's see if we can actually do it. In particular, can I actually recover this claim, that the posterior associated to a beta prior with a Bernoulli likelihood is actually giving me a beta again?

What was my prior? So p was following a beta (a, a), which means that the density of p — that was pi of theta, but I'm going to write it as pi of p — was proportional to p to the (a minus 1), times (1 minus p) to the (a minus 1).

So that's the first ingredient I need to compute my posterior. I really need only two, if I want it only up to constant. The second one was pn. We've computed that many times, and we had even a nice compact way of writing it: pn of X1, ..., Xn given the parameter p — so the joint density of my data given p, that's my likelihood. The likelihood of p was what? Well, it was p to the sum of the Xi's, times (1 minus p) to the (n minus sum of the Xi's).
Anybody wants me to parse this more? Or do you remember seeing that from maximum likelihood estimation? Yeah?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: That's what conditioning does.
AUDIENCE: [INAUDIBLE] previous slide. [INAUDIBLE] bottom there, it says D pi of t. Shouldn't it be dt pi of t?
PHILIPPE RIGOLLET: So d pi of t is a measure-theoretic notation, which I used without thinking. And I should not, because I can see it upsets you. d pi of t is just a natural way to say that I integrate against whatever I'm given for the prior of theta. In particular, maybe theta is a mix of a PDF and a point mass. Maybe I say that my p takes value 0.5 with probability 0.5, and is uniform on the interval with probability 0.5.

For this, I neither have a PDF nor a PMF. But I can still talk about integrating with respect to this, right? It's going to look like, if I take a function f of t, the integral of f of t d pi of t is going to be one half of f of one half — that's the point mass with probability one half, at one half — plus one half of the integral between 0 and 1 of f of t dt.

This is just notation, which is actually, funnily enough, interchangeable with pi of dt. If pi has a density, it's really just the density pi of t, dt. But this notation is for when pi is a measure and not a density. Everybody else, forget about this. This is not something you should really worry about at this point. This is more graduate level probability classes. But yeah, it's called measure theory. And that's when you think of pi as being a measure in an abstract fashion. You don't have to worry about whether it has a density or not.
So everybody is OK with this?
Now I need to compute my posterior. And as I said, my posterior is really just the likelihood weighted by the prior. Hopefully, at this stage of your education, you can multiply two functions. So what's happening is, if I multiply this guy with this guy, p gets raised to the power this guy plus this guy.

And then (1 minus p) gets the power n minus sum of the Xi's — this is always sum from i equal 1 to n — and then plus a minus 1 as well. This is up to constant, because I still need to normalize this. And I could try to do it. But I really don't have to, because I know that if my density has this form, then it's a beta distribution. And then I can just go on Wikipedia and see what the normalization factor should be.

So I know it's going to be a beta distribution. It's actually the beta with parameters — so this is really my beta with first parameter, sum of the Xi, i equal 1 to n, plus a minus 1... what happened to my one? Oh no, sorry: the beta has the power parameter minus 1. So the first parameter is sum of the Xi's plus a, and the second parameter is n minus sum of the Xi's plus a.
Beta is over there, right? So I just replace A by what I see. A is just becoming this guy plus this guy and this guy plus this guy. Everybody is comfortable with this computation?
We just agreed that beta priors for Bernoulli observations are certainly convenient, because they are conjugate, and we know what is going to come out in the end: that's going to be a beta as well. I just claimed it was convenient, and it was certainly convenient to compute this, right? There was certainly some compatibility when I had to multiply this function by that function. And you can imagine that things could go much more wrong than just having p to some power times p to some power, and (1 minus p) to some power times (1 minus p) to some power. Here, things were nice.

Now this is nice, but I can also question the following things. Why beta, for one? The beta is convenient, but then how do I pick a? I know that a should definitely capture where I want my p to be most likely located. But it actually also captures the variance of my beta.

And so choosing different a's is going to have different effects. If I started with a beta with parameters a and b, with a b here, I would just pick up the b here. Agreed? And that would just be asymmetric. But these parameters are going to capture the mean and variance of this thing. And so how do I pick those guys?

If I'm a doctor and you're asking me, what do you think the chances of this drug working on this kind of patient are, and I have to spit out the parameters of a beta for you, it might be a bit of a complicated thing to do. So how do you do this? By now, people have actually mastered the art of coming up with how to formulate those numbers for established problems.
But in new problems that come up, how do you do this? What happens if you want to use Bayesian methods, but you actually do not know what you expect to see? To be fair, before we started this class, I hope all of you had no idea whether people tend to bend their head to the right or to the left before kissing. Because if you did, well you have too much time on your hands and I should double your homework.
So in this case, maybe you still want to use the Bayesian machinery. Maybe you just want to do something nice. It's nice, right? I mean, it worked out pretty well. So what do you do? Well, you actually want to use some priors that carry no information, that basically do not prefer any theta to another theta.

Now, you could read this slide, or you could look at this formula. We just said that this pi here was just here to weigh some thetas more than others, depending on your prior belief. If our prior belief does not want to put any preference towards some thetas over others, what do I do?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: Yeah, I remove it. And the way to remove something we multiply by is just to replace it by one. That's really what we're doing. If this were a constant not depending on theta, then that would mean that we're not preferring any theta. And we're looking at the likelihood — but not as a function that we're trying to maximize; rather, as a function that we normalize in such a way that it's actually a distribution.

So if I have pi, which is not here, this is really just taking the likelihood, which is a positive function. It may not integrate to 1, so I normalize it so that it integrates to 1. And then I just say, well, this is my posterior distribution. Now I could just maximize this thing and spit out my maximum likelihood estimator.
But I can also integrate and find what the expectation of this guy is. I can find what the median of this guy is. I can sample data from this guy. I can understand what the variance of this guy is. Which is something we did not do when we just did maximum likelihood estimation, because given a function, all we cared about was the argmax of this function.

These priors are called uninformative. This is just replacing this pi by one, or by a constant, because it still has to be a density. If I have a bounded set, I'm just looking at the uniform distribution on this bounded set, the one that puts constant density one over the size of this set.

But if I have an unbounded set, what is the density that takes a constant value on the entire real line, for example? What is this density?
AUDIENCE: [INAUDIBLE]
PHILIPPE RIGOLLET: It doesn't exist, right? It just doesn't exist. The way you can think of it is as a Gaussian with the variance going to infinity, maybe, or something like this. You can think of it in many ways. You can think of the limit of the uniform between minus T and T, with T going to infinity. But this limiting thing is actually zero everywhere. There's nothing there.

But you can actually still talk about this. You can still think of this guy as being a constant, remove this thing from this equation, and just say, well, my posterior is just the likelihood divided by the integral of the likelihood over theta. And if theta ranges over the entire real line, so be it. As long as this integral converges, you can still talk about this stuff.

This is what's called an improper prior. An improper prior is just a non-negative function defined on capital Theta, but it does not have to integrate to one, or to anything finite. If I integrate the function equal to 1 over the entire real line, what do I get? Infinity. So it's not a proper prior; it's called an improper prior.

And those improper priors are usually what you see when you want non-informative priors on infinite parameter sets. That's just the nature of it. You should think of them as being the uniform distribution on some infinite set, if that thing were to exist.
Let's see some examples of non-informative priors. If I'm in the interval (0, 1), this is a bounded set. So I can talk about the uniform prior on the interval (0, 1) for the parameter p of a Bernoulli. If I want to talk about this, then it means that my prior is: p follows some uniform on the interval (0, 1). So that means that f of x is 1 if x is in (0, 1).

And here, there is actually not even a normalization; this thing integrates to 1. And so now if I look at my likelihood, it's still the same thing. So my posterior becomes pi of p given X1, ..., Xn. That's my posterior. I don't write the likelihood again, because we still have it — well, we don't have it here anymore. The likelihood is given here. Copy, paste over there.

The posterior is just this thing times 1. So it's p to the power sum of the Xi's, times (1 minus p) to the power n minus sum of the Xi's. And then it's multiplied by 1, and divided by this integral between 0 and 1 of p to the sum of the Xi's, times (1 minus p) to the (n minus sum of the Xi's), dp — which does not depend on p. And I really don't care what this thing actually is. That's the posterior of p.
And now I can see, well, what is this? It's actually just the beta, with parameters this guy plus 1 and this guy plus 1. Now, I didn't tell you what the expectation of a beta is. We don't know what the expectation of a beta is yet, agreed?
If I wanted to find say, the expectation of this thing that would be some good estimator, we know that the maximum of this guy-- what is the maximum of this thing? Well, it's just this thing, it's the average of the Xi's. That's just the maximum likelihood estimator for Bernoulli. We know it's the average.
Do you think if I take the expectation of this thing, I'm going to get the average? Actually, no. I'm going to get the sum of the Xi's plus 1, divided by n plus 2. Let's look at what this thing is doing. It's looking at the number of ones and it's adding one. And this guy is looking at the number of zeros and it's adding one.
Why is it adding this one? What's going on here? This is going to matter mostly when the number of ones is actually zero, or the number of zeros is zero. Because what it does is push the estimate away from zero.

And why is that something that this Bayesian method actually does for you automatically? It's because we put this non-informative prior on p, which was uniform on the interval (0, 1). In particular, we know that the probability that p is equal to 0 is zero, and the probability that p is equal to 1 is zero.

And so the problem is that, if I did not add this 1, then with some positive probability I would spit out an estimate p hat which was exactly equal to 0. If by chance, let's say, n is equal to 3 and I get only 0, 0, 0 — that happens with probability (1 minus p) cubed — the maximum likelihood estimator would be 0. That's not something that I want.
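A tiny sketch of this effect — the posterior mean under the uniform prior (this is Laplace's rule of succession) never returns exactly 0 or 1:

```python
import numpy as np

def posterior_mean_uniform_prior(x):
    """Posterior mean of p under a uniform (Beta(1,1)) prior: (#ones + 1)/(n + 2)."""
    x = np.asarray(x)
    return (x.sum() + 1) / (len(x) + 2)

print(posterior_mean_uniform_prior([0, 0, 0]))   # 0.2, not 0 like the MLE
print(posterior_mean_uniform_prior([0] * 1000))  # ≈ 0.001, the prior washes out
```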
And I'm using my prior. My prior is non-informative, but somehow it captures the fact that I don't want to believe p is going to be exactly equal to 0 or 1. So that's sort of taken care of here. So let's move away a little bit from the Bernoulli example, shall we? I think we've seen enough of it.
And so let's talk about the Gaussian model. Let's say I want to do inference in a Gaussian model, using Bayesian methods. What I want is that X1, ..., Xn are, say, N(0, 1) iid — sorry, N(theta, 1), iid conditionally on theta. That means that pn of X1, ..., Xn given theta is equal to exactly what I wrote before: 1 over square root of 2 pi, to the power n, times exponential of minus one half sum of (Xi minus theta) squared.

So that's just the joint distribution of my Gaussians with mean theta. And now the question is, what is the posterior distribution? Well, here I said, let's use the uninformative prior, which is an improper prior. It puts weight one on everyone. That's the so-called uniform on the entire real line. So that's certainly not a density. But I can still just use this. So all I need to do is take this, divided by its normalization.
But if you look at this, essentially, this is proportional to the exponential of minus one half sum from i equal 1 to n of (Xi minus theta) squared. And now I want to see this thing as a density, not in the Xi's but in theta. What I want is a density on theta.

So it looks like I have a chance of getting something that looks like a Gaussian. To have a Gaussian, I would need to see minus one half, and then theta minus something, squared — not just a sum of something minus thetas. So I need to work a little bit more, to expand the square here. So this thing here is going to be equal to exponential of minus one half sum from i equal 1 to n of (Xi squared minus 2 Xi theta plus theta squared).

Now remember, everything is up to this little proportionality sign. So every time I see a term that does not depend on theta, I can just push it in there and make it disappear. Agreed? This term here, exponential of minus one half sum of Xi squared — does it depend on theta? No. So I'm just pushing it in there. This guy, yes. And the other one, yes.

So this is proportional to exponential of theta times sum of the Xi's — I pulled out my theta, and the minus one half cancels with the minus 2 — and then I have minus one half sum from i equal 1 to n of theta squared. Agreed?

So now, what does this thing look like? This looks very much like some (theta minus something) squared. This term here is really just n over 2 times theta squared. So now what I need to do is to write this in the form (theta minus something — let's call it mu) squared, divided by 2 sigma squared.
I want to turn this into that, maybe up to terms that do not depend on theta. That's what I'm going to try to do. So that's called completing the square. That's an exercise you do; you've probably done it already in the homework. And that's something you do a lot when you do Bayesian statistics, in particular.

So let's do this. What is the leading term going to be? Theta squared is going to be multiplied by this thing. So I'm going to pull out my n over 2, and then I'm going to write this as minus n over 2 times (theta minus something) squared. And this something is going to be one half of what I see in the cross-product.

I need to actually pull this thing out. So let me write it like that first. So that's theta squared, and then minus 2 times (1 over n, sum from i equal 1 to n of the Xi's) times theta. That's exactly just a rewriting of what we had before. And that should look much more familiar: a squared minus 2 a b, and then I'm missing the b squared.

So this thing, I'm going to be able to rewrite as (theta minus Xn bar) squared. But then I need to remove the square of Xn bar, because it's not here. So I just complete the square. And then I really don't care what this extra term actually is, because it's going to go, again, into the little proportionality sign over there. So this thing eventually is proportional to exponential of minus n over 2 times (theta minus Xn bar) squared.

And so we know that if this is a density that's proportional to this guy, it has to be some Gaussian with mean Xn bar. And the variance? This n is supposed to play the role of 1 over sigma squared. So the variance is really just 1 over n.
So the posterior distribution is a Gaussian centered at the average of my observations, with variance 1 over n. Everybody's with me? Why am I saying this? This was the output of some computation, but it sort of makes sense, right? It's really telling me that the more observations I have, the more concentrated this posterior is. Concentrated around what? Well, around this Xn bar.

That looks like something we've sort of seen before. But it does not have the same meaning, somehow. This is really just the posterior distribution. It's sort of a sanity check that I get this 1 over n around Xn bar. But it's not the same thing as saying that the variance of Xn bar is 1 over n, like we had before.
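To summarize the computation we just did on the board in one line:

```latex
\pi(\theta \mid X_1,\dots,X_n)
\;\propto\; \exp\Big(-\tfrac12 \sum_{i=1}^n (X_i-\theta)^2\Big)
\;\propto\; \exp\Big(-\tfrac{n}{2}\,(\theta-\bar X_n)^2\Big)
\;\Longrightarrow\;
\theta \mid X_1,\dots,X_n \sim \mathcal N\big(\bar X_n,\ \tfrac1n\big).
```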
As an exercise, I would recommend, if you don't get it, to try pi of theta equal to some N(mu, 1). Here, the prior that we used was completely non-informative. What happens if I take my prior to be some Gaussian, which is centered at mu and has the same variance as the other guys?

So what's going to happen here is that we're going to put a weight, and everything that's away from mu is going to get less weight. I want to know how I'm going to update this prior into a posterior. Everybody sees what I'm saying here? So that means that pi of theta has density proportional to exponential of minus one half (theta minus mu) squared.

So I need to multiply my likelihood with this, and then see. It's actually going to be a Gaussian. This is also a conjugate prior: it's going to spit out another Gaussian. You're going to have to complete a square again, and just check what it's actually giving you. And so, spoiler alert, it's going to look like you get an extra observation, which is actually equal to mu.

It's going to be the average of n plus 1 observations, the first n ones being X1 to Xn, and the last one being mu. And it sort of makes sense. That's actually a fairly simple exercise. Rather than going into more computation, this is something you can definitely do in the comfort of your room. See the sketch below.
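For reference, carrying out that exercise (the same completing-the-square step, now with the extra term from the prior) should give:

```latex
\pi(\theta \mid X_1,\dots,X_n)
\;\propto\; \exp\Big(-\tfrac12\Big[\sum_{i=1}^n (X_i-\theta)^2 + (\theta-\mu)^2\Big]\Big)
\;\Longrightarrow\;
\theta \mid X_1,\dots,X_n \sim \mathcal N\Big(\frac{\sum_{i=1}^n X_i + \mu}{n+1},\ \frac{1}{n+1}\Big).
```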
I want to talk about other types of priors. The first thing I said is, there's this beta prior that I just pulled out of my hat and that was just convenient. Then there was this non-informative prior. It was convenient. It was non-informative, so if you don't know anything else maybe that's what you want to do.
The question is, are there any other priors that are sort of principled and generic, in the sense that the uninformative prior was generic, right? It was equal to 1; that's as generic as it gets. So is there anything else that's generic as well? Well, there are these priors called Jeffreys priors.

The Jeffreys prior is proportional to the square root of the determinant of the Fisher information at theta. This is actually a weird thing to do. It says, look at your model. Your model is going to have a Fisher information — let's say it exists, because we know it does not always exist. For example, in the multinomial model, we didn't have a Fisher information.
The determinant of a matrix is somehow measuring the size of a matrix. If you don't trust me, just think about the matrix being of size one by one, then the determinant is just the number that you have there. And so this is really something that looks like the Fisher information. It's proportional to the amount of information that you have at a certain point.
And so this prior is saying, well, I want to put more weight on those thetas that are going to extract more information from the data. You can actually compute those things. In the Bernoulli example, the Jeffreys prior is something that looks like this. In one dimension, the Fisher information is essentially one over the variance. So this is 1 over the square root of the variance, because I take the square root.

And in the Gaussian case, the Fisher information is the identity matrix. The determinant of the identity is 1, so the square root of 1 is 1, and I would basically get the constant 1. And that gives me my improper, uninformative prior that I had before.
So the uninformative prior 1 is fine. Clearly, all the thetas carry the same information in the Gaussian model. Whether I translate it here or here, it's pretty clear none of them is actually better than the other. But clearly for the Bernoulli case, the p's that are closer to the boundary carry more information.
Jeffreys says, I sort of like those guys, because they just carry more information. So what I do is, I take this function p(1 minus p) — remember, it's something that looks like this on the interval (0, 1) — and this guy, 1 over square root of p(1 minus p), is something that looks like this. Agreed?
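In other words, the Jeffreys prior for the Bernoulli is, up to normalization, a Beta(1/2, 1/2). A small illustrative sketch comparing its U shape against flatter priors (matplotlib and scipy assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

p = np.linspace(0.01, 0.99, 200)
plt.plot(p, stats.beta(0.5, 0.5).pdf(p), label="Jeffreys = Beta(1/2, 1/2)")
plt.plot(p, stats.beta(1.0, 1.0).pdf(p), label="uniform = Beta(1, 1)")
plt.plot(p, stats.beta(2.0, 2.0).pdf(p), label="Beta(2, 2)")
plt.xlabel("p"); plt.ylabel("prior density"); plt.legend(); plt.show()
```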
What it's doing is, it sort of wants to push towards the p's that actually carry more information. Whether you want to bias your estimates that way or not is something you need to think about. When you put a prior on your parameter, you're sort of biasing it towards this idea.
That's maybe not such a good idea, when you have some p that's actually close to one half, for example. You're actually saying, no I don't want to see a p that's close to one half. Just make a decision, one way or another. But just make a decision. So it's forcing you to do that.
Jeffreys priors — I'm running out of time so I don't want to go into too much detail, and we'll probably stop here, actually — Jeffreys priors have this very nice property: they actually do not care about the parameterization of your space. If you have p and you suddenly decide that p is not the right parameter for the Bernoulli, but p squared is, you could decide to parameterize this by p squared. Maybe your doctor is much more able to formulate some prior assumption on p squared, rather than p. You never know.

And so what happens is that Jeffreys priors are invariant under this. And the reason is because the information carried by p is the same as the information carried by p squared, somehow. They're essentially the same thing. You need to have a one-to-one map, where basically for each parameter before, you have another parameter after. Let's call eta the new parameter.

Then the PDF of the new prior, indexed by eta this time, is actually also the Jeffreys prior — but this time, the Fisher information is not the Fisher information with respect to theta; it's the Fisher information associated to the statistical model indexed by eta. So essentially, when you change the parameterization of your model, you still get the Jeffreys prior for the new parameterization. Which is, in a way, a desirable property.
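The one-dimensional change-of-variables computation behind this invariance, for a one-to-one, differentiable reparameterization eta = phi(theta), using the fact that the Fisher information transforms as I~(eta) = I(theta(eta)) (d theta / d eta)^2:

```latex
\tilde\pi(\eta) = \pi\big(\theta(\eta)\big)\,\Big|\frac{d\theta}{d\eta}\Big|
\;\propto\; \sqrt{I\big(\theta(\eta)\big)}\;\Big|\frac{d\theta}{d\eta}\Big|
= \sqrt{I\big(\theta(\eta)\big)\Big(\frac{d\theta}{d\eta}\Big)^2}
= \sqrt{\tilde I(\eta)}.
```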
Jeffreys priors are uninformative priors, priors you want to use when you want a systematic way of choosing, without really thinking about what to pick for your model.
I'll finish this next time, and we'll talk about Bayesian confidence regions. We'll talk about Bayesian estimation. Once I have a posterior, what do I get? And basically, the main message is going to be that you might want to integrate against the posterior: find the expectation of your posterior distribution. That's a good point estimator for theta. We'll just do a couple of computations.