Flash and JavaScript are required for this feature.
Download the video from iTunes U or the Internet Archive.
Description: Probability, as it appears in the real world, is related to axiomatic mathematical models. Events, independence, and random variables are reviewed, stressing both the axioms and intuition.
Instructor: Prof. Robert Gallager
Lecture 1: Introduction and...
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
PROFESSOR: OK, we're all ready to go. This is Discrete Stochastic Processes as you are know. It is-- want to get it where I can read it, too.
We're going to try to deal with a bunch of different topics today, some of which are a little bit philosophical saying, what is probability really? You are supposed to have taken a course in probability, but unfortunately courses in probability are almost always courses in how to solve well-posed problems. The big problem in probability theory, and particularly stochastic processes is not so much how do you solve well-posed problems. Anybody can do that. Or anybody who has a little bit of background can do it.
The hard problem is finding the right models for a real-world problem. I will call these real-world problems. I hate people who call things real-world because they sound like they dislike theory or something. It's the only word I can think of though, because physical world is no longer adequate because so much of the applications of probability are to all sorts of different things, not having to do with the physical world very much, but having to do with things like the business world, or the economic world, or the biological world, or all of these things. So real-world is just a code word we'll use to distinguish theory from anything real that you might have to deal with.
Theory is very nice because theory-- everything is specified. There's a right answer to everything. There is a wrong answer usually. But there's at least one right answer. And most of us like those things because they're very specific. People go into engineering and science very often because they don't like the uncertainty of a lot of other fields.
The problem is as soon as you go into probability theory, you're moving away from that safe region where everything is specified, and you're moving into a region where things, in fact, are not very well-specified and you have to be careful about it.
OK, so first we're going to talk about probability in the real world and probability as a branch of mathematics. Then we're going to say what discrete stochastic processes are. Then we're going to talk just a very, very little bit about the processes we're going to study. If you want to see more of that, you have two chapters of the notes. You can look at them. You can look at the table of contents. And more than that, if you look at my website, you will see the notes for all the other chapters if you want to read ahead or if you want to really find out what kinds of things we're going to talk about and what kinds of things we're not going to talk about.
Then we're going to talk about when, where, and how is this useful? The short answer to that is it's useful everywhere. But we'll have to see why that is. Then we're going to talk about the axioms of probability theory. You cannot take any elementary course in probability, or even in statistics, without seeing the axioms of probability. And in almost all of those cases, and in almost all of the graduate courses I've seen, you see them, they disappear, and suddenly you're solving problems in whatever way you can, and the axioms have nothing to do with it anymore. So we're going to see that, in fact, the axioms do have something to do with this.
Those of you who want to be real engineers and not mathematicians, you'll find this a little uncomfortable at times. We are going to be proving things. And you will have to get used to that. And I'll try to convince you of why it's important to be able to prove things.
Then we're going on to a review of probability, independent events, experiments, and random variables. So that's what we'll do today.
Incidentally, this course started about-- must've been 25 years ago or so. I started it because we had a huge number of students at MIT who had been interested in communication and control, and who were suddenly starting to get interested in networks. And there were all sorts of queuing problems that they had to deal with every day. And they started to read about queuing theory. and it was the most disjointed, crazy theory in the world where there were 1,000 different kinds of queues and each one of them had to be treated in its own way. And we realized that stochastic processes was the right way to tie all of that together, so we started this course. And we made it mostly discrete so it would deal primarily with network type applications. As things have grown, it now deals with a whole lot more applications. And we'll see how that works later on.
OK, how did probability get started in the real world? Well, there were games of chance that everybody was interested in. People really like to gamble. I don't know why. I don't like to gamble that much. I would rather be certain about things. But most people love to gamble. And most people have an intuitive sense of what probability is about.
I mean, eight-year-old kids, when they start to learn to play games of chance-- and there are all sorts of board games that involve chance. These kids, if they're bright-- and I'm sure you people fall into that category-- they immediately start to figure out what the odds are. I mean, how many of you have never thought about what the odds are in some gambling game?
OK, that makes my point. So all of you understand this at an intuitive level. But what makes games of chance easier to deal with than all the other issues where we have uncertainty in life?
Well, games of chance are inherently repeatable. You play a game of chance and you play many, many hands, or many, many throws, or many, many trials. And after many, many trials of what's essentially the same experiment, you start to get a sense of what relative frequencies are. You start to get a sense of what the odds are because of doing this repeatedly. So games of chance are easy to use probability on because they are repeatable. You have essentially the same thing going on each time, but each time there's a different answer. You flip a coin and sometimes it comes up heads and sometimes it comes up tails. So in fact, we have to figure out how to deal with that fact that there is uncertainty there. I'll talk about that in just another minute.
But anyway, most of life's decisions involve uncertainty. I mean, for all of you, when you go into a PhD program, you have two problems. Am I going to enjoy this? And you don't know whether you're going to enjoy it because not until you really get into it do you have a sense of whether this set of problems you're dealing with is something that you like to deal with.
And the only way you can do that is to make guesses. You come up with likelihoods. There's some likelihood. There's a risk cost-benefit that you deal with. And in life, risk cost-benefits are always based on some sense of what the likelihood of something is.
Now, what is a likelihood? A likelihood is just a probability. It's a synonym for probability. When you get into the mathematics of probability, likelihood has a special meaning to it. But in the real world, likelihood is just a word you use when you don't want to let people know that you're really talking about probabilities.
OK, so that's where we are. But on the last slide, you saw the word "essentially, essentially, essentially." If you read the notes, and I hope you read the notes, because I spent the last three years doing virtually nothing but trying to make these notes clear. I would appreciate it if any of you, with whatever background you have, when you read these nodes, if you read them twice and you still don't understand something, tell me you don't understand it. If you know why you don't understand it, I'd appreciate it knowing that. But just saying "help" is enough to let me know that I still haven't made something as clear as it should be. At least as clear as it should be for some of the people who I think should be taking this course.
One of the problems we have at MIT now, and at every graduate school in the world I think, is that human knowledge has changed and grown so much in the last 50 years. So when you study something general, like probability, there's just an enormous mass of stuff you have to deal with. And because of that, when you try to write notes for a course, you don't know what anybody's background is anymore.
I mean, it used to be that when you saw a graduate course at MIT, people would know what limits were. People would know what basic mathematics is all about. They would know what continuity means. They would know some linear algebra. They would know all sorts of different things. Many people still do know those things. Many other people have studied all sorts of other fascinating and very interesting things. They're just as smart, but they have a very different background. So if your background is different, it's not your fault that you don't have the kind of background that makes probability easy. Just yell. Or yell in class. Please ask questions. The fact that we're videotaping this makes it far more interesting for anybody who's using OpenCourseWare to see some kinds of questions going on, so I very much encourage that.
I'm fairly old at this point, and my memory is getting shot. So if you ask a question and I don't remember what it's all about, just be patient with it. I will come back the next time, or I'll send you an email straightening it out. But I will often get confused doing something, and that's just because of my age. It's what we call "senior moments." It's not that I don't understand the subject. I think I understand it, I just don't remember it anymore.
Important point about probability. Think about flipping a coin. I'm going to talk about flipping coins a great deal this term. It's an absolutely trivial topic, but it's important because when you understand deep things about a large subject, the best way to understand them is to understand them in terms of the most trivial examples you can think of.
Now, when you flip a coin, the outcome-- heads or tails-- really depends on the initial velocity, the orientation of the person flipping it, or the machine flipping it, the coin surfaces, the ground surface. And after you put all of those things into a careful equation, you will know whether the coin is going to come up heads or tails.
I don't think quantum theory has anything to do with something as big as a coin. I might be wrong. I've never looked into it. And frankly, I don't care. Because the point that I'm trying to make is that flipping a coin, and many of the things that you view as random, when you look at them in a slightly different way, are not random.
There's a big field in communication called data compression. And data compression is based on random models for the data, which is going to be compressed.
Now, what I'm saying here today is by no means random. Or maybe it is partly random. Maybe it's coming out of a random mind, I don't know. But all of the data we try to compress to the people who have created that data, it's not random at all. If you study the data carefully enough-- I mean, code breakers and people like that are extremely good at sorting out what the meaning is in something, which cannot be done by data compression techniques at all.
So the point is, when you're doing data compression, you model the data as being random and having certain characteristics. But it really isn't. So the model is no good.
When you get to more important questions-- well, data compression is an important question. When you ask, what's the probability of another catastrophic oil spill in the next year? Or you ask the question, what's the probability that Google stock will double in five years? That's less important, but it's still important to many people. How do you model that? Understanding probability theory, understanding all the mathematics of this is not going to help you model this.
Now, why do I make such a big deal about this? Well, there have been a number of times in the last 10 or 15 years when the whole financial system of the world has almost been destroyed by very, very bright PhDs. Many of them coming from electrical engineering. Most of whom are really superb at understanding probability theory. And they have used their probability theory to analyze risk and other things in investments. And what has happened?
They do very well for a while. Suddenly they do so well that they think they can borrow all sorts of money and risk other people's money as well as their own. In fact, they try to do that right from the beginning. And then suddenly, the whole thing collapses. Because their models are no damn good. There's nothing wrong with their mathematics, it's that their models are no good.
So please, especially if you're going to do something important in your lives-- if you're just going to write papers in engineering journals, maybe it's all right. But if you're going to make decisions about things, please spend some time thinking about the probability models that you come up because this is vitally important.
OK, what's probability? It's a branch of mathematics. Now we're into something that's more familiar, something that's simpler, something we can deal with.
You might be uncomfortable with what probability really means. And all probability books, all stochastic process books are uncomfortable with this. Feller is the best book in probability there's ever been written. Any question you have, he probably has the answer to it.
When you look at what he says about real-world probability, the modeling issues, he's an extraordinarily bright guy. And he spent some time thinking about this. But you read it and you realize that it's pure nonsense. So please, take my word for it. Don't assume that real-world probability is something you're going to learn about from other people because you can't trust what any of them say. It's something you have to think through for yourselves, and we'll talk more about this as we go.
But now, when we get into mathematics, that's fine. We just create models. And once we have the model, we just use it.
We have standard models for all sorts of different standard problems. When you talk about coin tossing, what almost everyone means is not this crazy thing I was just talking about where you have an initial angular momentum when you flip a coin and all of that stuff. It's a purely mathematical model where a coin is flipped and with probability one half it comes up heads and with probability one half it comes up tails.
OK, students are given a well-specified model, and they calculate various things. This is in mathematical probability. Heads and tails are equiprobable in that system. Subsequent tosses are independent. Here's a little bit of cynicism. I apologize for insulting you people with it. I apologize to any faculty member who later reads this. And I particularly apologize to businessmen and government people who might read it.
Students compute, professors write papers, business and government leaders obtain questionable models and data on which they can blame failures.
Most cynical towards business leaders because business leaders often hire consultants. Not so much to learn what to do, but so they have excuses when what they do doesn't work out right.
When I say the students compute, what I mean is this in almost all the courses you've taken up until now-- and in this course also-- what you're going to be doing is solving well-posed problems. You solve well-posed exercises because that's a good way to understand what the mathematics of the subject is about.
Don't think that that's the only part of it. If that's the only thing you're doing, you might as well not waste your time. You might as well do something else. You might as well go out and shovel snow today instead of trying to learn about probability theory. It's more pleasant to learn about probability theory.
OK, the use of probability models has two major problems with it. The first problem is, how do you make a model for a real-world problem? And a partial answer is, learn about estimation and decisions in the context of standard models. In other words, decisions and estimation inside a completely mathematical framework.
Then you learn a great deal about the real-world problem itself. Not about the mathematics of it, but about how you actually understand what's going on. If you talk to somebody who is a superb architect in any field-- networks, computer systems, control systems, anything-- what are you going to find?
You're not going to find huge, involved sets of equations that they're going to use to explain something to you. They're going to pick at-- if there any good, they're going to take this big problem, and they're going to take your issue with this big problem. And they're going to find the one or two really important things that tell you something that you have to know. And that's what you want to get out of this course. You want to get the ability to take all of the chat, put it all together, and be able to say one or two important things which is really necessary. That's where you're going to.
Before you get there, you'll take low-level jobs in various companies and you'll compute a lot of things. You'll simulate a lot of things. You'll deal with a lot of detail. Eventually, you're going to get to the point where you've got to make major decisions. And you want to be ready for it. OK, that's enough philosophy. I will try to give no more philosophy today, except when I get pushed into it.
OK, one of the problems in this problem of finding a good model is that no model is perfect. Namely, what happens is you keep finding more and more complicated models, which deal with more and more of the issues. And as you deal with them, things get more complicated. You're more down in the level of details and you're finding out less. So you want to find some sort of match between a model that tells you something and a model which is complicated enough to deal with the issues.
There's a beautiful quote by Alfred North Whitehead. I don't know whether you've ever heard of Whitehead. You've probably heard of Bertrand Russell, who was both a great logician and a great philosopher, and had a lot to do with the origins of set theory. Whitehead and Russell together, wrote this massive book around the turn of the last century between the 1900s and the 2000s called Principia Mathematica where they try to resolve all of the paradoxes which were coming up in mathematics.
And Whitehead's general philosophical comment was, "Seek simplicity and distrust it."
Now, every time I look at that, I say, why in hell didn't he say, seek simplicity and question it? I mean, you all hear about questioning authority, of course, and that's important to do. Why when you find a simple model for something should you distrust it?
Well, the reason is psychological. If you find a simple model for something and you question it, you have an enormous psychological bias towards not giving up the simple model. You want to keep that simple model. And therefore, it takes an enormous amount of evidence before you're going to give something out.
Whitehead said something more than that. He said, "Seek simplicity and distrust it."
Now, why do I talk about the philosophy of science when we're trying to learn about probability theory? Well, probability theory is a mathematical theory. It's the basis for a great deal of science. And it's the place where modeling is most difficult. Scientific questions in most areas, if there's no probability or uncertainty involved, you just do an experiment that tells you the answer. You might not do it carefully enough and then 10 other people do it. And finally, everybody agrees, this is the answer to that problem. In probability, it ain't that simple. And that's why one has to focus on this a little more than usual.
The second problem is, how do you make a probability model that has no hidden paradoxes in it? In other words, when you make a mathematical model, how do you make sure that it really is well-posed? How do you make sure that when you solve a problem in that mathematical model that you don't come to something that doesn't make any sense?
Well, everyone's answer to that is you use Kolmogorov's axioms of probability. Because back in 1933, Kolmogorov published this little thin book. Those of you who are interested in the history of science probably ought to read it. You will find you only understand the first five pages the first time you look at it. But it's worthwhile doing that because here was one of the truly great minds of the early 20th century.
And he took everything he knew about probability, which was a whole lot more than I know certainly, and a whole lot more than anybody else at the time knew, and he collapsed it into these very simple axioms. And he said, if you obey these axioms in a model that you use in probability, those axioms will keep you out of any paradoxes at all. And then, he showed why that was and he showed how the axioms could be used and so forth. So we're going to spend a little bit of time talking about them today.
OK, quickly, what is a discrete stochastic process? Well, a stochastic process-- you've been talking about probability. And you might be getting the idea that I'm just using the name "stochastic processes" as a foil for talking about what I really love, which is the probability. And there's a certain amount of truth to that.
But stochastic processes are special types of probability models where the sample points represent functions in time. In other words, when we're dealing with a probability model, the basis of a probability model is a sample space. It's the set of possible things that might happen. And you can reduce that to the sample points, which are the indivisible, little, tiny crumbs of what happens when you do an experiment. It's the thing which specifies everything that can be specified in that model of that experiment.
OK, when you get to a stochastic process, what you're doing is you're looking at a situation in which these sample points, the solutions to what happens is, in fact, a whole sequence of random variables in time. And what that means is instead of looking at just a vector of random variables, you're going to be looking at a whole sequence of random variables.
Now, what is different about a vector of a very large number of random variables and an infinite sequence of random variables? Well, from an engineering standpoint, not very much. I mean, there's nothing you can do to actually look at an infinite sequence of random variables. If you start out at the Big Bang and you carry it on to what you might imagine is the time when the sun explodes or something, that's a finite amount of time. And if you imagine how fast you can observe things, there's a finite number of random variables you might observe.
All these models we're going to be dealing with get outside or that realm, and they deal with something that starts infinitely far in the past and goes infinitely far in the future. It doesn't make much sense, does it?
But then look at the alternative. You built a device which you're going to sell to people, and which they're going to use. And you know they're only going to use it three or four year until something better comes along. But do you want to build in to everything you're doing the idea that it's going to be obsolete in three years?
No. You want to design this thing so, in fact, it will work for an essentially arbitrary amount of time. And therefore, you make a mathematical model of it. You look at what happens over an infinite span of time. So whenever we get into mathematics, we always go to an infinite number of things rather than a finite number of things.
Now, discrete stochastic processes are those where the random variables are discrete in time. Namely, a finite number of possible outcomes from each of them. Or the set of possible sample values is discrete. What does that mean?
It doesn't mean a whole lot when you really start asking detailed questions about this. What it means is, I want to talk about a particular kind of stochastic processes. And it's a class of processes which will be more than we can deal with in one term. And I want to exclude certain processes, like noise processes, because we don't have time to do both of them. So don't worry too much about exactly what a discrete stochastic process is. It's whatever we want to call it when we deal with it.
Oops. Oh, where am I? Oh, I wanted to talk about the different processes we're going to study.
The first kind of process is what we call a counting process. The sample points in the process-- remember, a sample point specifies everything about an experiment. It tells you every little detail. And the sample points here in counting processes are sequences of arrivals. This is a very useful idea in dealing with queuing systems because queuing systems have arrivals. They have departures. They have rules for how the arrivals get processed before they get spit out. And a big part of that is studying first the arrival process, then we study the departure process. We study how to put them together. And when we get to chapter 2 of the notes, we're going to be studying Poisson processes, which are in a sense, the perfect discrete stochastic process. It's like coin tossing in probability. Everything that might be true with a Poisson process is true. The only things that aren't true are the things that obviously can't be true. And we'll find out why that is and how that works a little later.
We're then going to study renewal processes in chapter 4. We're going to put Markov chains in between. And you'll see why when we do it. And renewal processes are a more complicated kind of thing than Poisson processes. And there's no point confusing you at this point saying what the difference is, so I won't.
Markov processes are processes. In other words, the sequences in time of things where what happens in the future depends on what happens in the past, only through the state at the present. In other words, if you can specify the state in the present, you can forget about everything in the past.
If you have those kinds of processes around, you don't have to study history at all, which would be very nice. But unfortunately, not all processes behave that way. When you do the modeling to try to find out what the state is, which is what you have to know at the present, you find out there's a lot of history involved.
OK, finally, we're going to talk about random wa;ls and martingales. I'm not going to even say what a random walks or a martingale is. We will find out about that soon enough, but I want to tell you that's what's in chapter 7 of the notes. That's the last topic we will deal with.
We'll study all sorts of mixtures of these. Things which involve a little bit of each. We'll start out working on one thing and we'll find out another. One of these other topics is the right way to look at it.
If you want to know more about that, please go look at the notes, and you'll find out as much as you want. But it's not appropriate to talk about it right now.
OK, when, where, and how is this useful? You see, I'm almost at the point where we'll start actually talking about real stuff. And when I say real stuff, I mean mathematical stuff, which is not real stuff.
Broad answer-- probability in stochastic processes are an important adjunct to rational thought about all human and scientific endeavor. That's a very strong statement. I happen to believe it. You might not believe it. And you're welcome to not believe it. It won't be on a quiz or anything, believe me. But almost anything you have to deal with is dealing with something in the future. I mean, you have to plan for things which are going to happen in the future.
When you look at the future, there's a lot of uncertainty involved with it. One of the ways is dealing with uncertainty. And probably the only scientific way of dealing with uncertainty is through the mechanism of probability models. So anything you want to deal with, which is important, you're probably better off knowing something about probability than not.
A narrow answer is probability in stochastic processes are essential components of the following areas. Now, I must confess I made up this list in about 10 minutes without thinking about it very seriously. And these things are related to each other. Some of them are parts of others. Let me read them.
Communication systems and networks. That's where I got involved in this question, and very important there. Computer systems. I also got involved in it because of computer systems. Queuing in all areas. Well, I got involved in queuing because of being interested in networks. risk management in all areas. I got interested in that because I started to get disturbed about civilization destroying itself because people who have a great deal of power don't know anything about taking risks.
OK, catastrophe management. How do you prevent oil spills and things like that? How do you prevent nuclear plants from going off? How do you prevent nuclear weapons from falling in the hands of the wrong people? These again, are probability issues. These are important probability issues because most people don't regard them as probability issues.
If you say there is one chance in a billion that something will happen, 3/4 of the population will say, that's not acceptable. I don't want any risk. And these people are fools. But unfortunately, these fools outnumber those of us who have studied these issues. So we have to deal with it. We have to understand it if nothing else.
OK, failures in all types of systems-- operations research, biology, medicine, optical systems, and control system. Name your own favorite thing. You can put it all in. Probability gets used everywhere.
OK, let's go to the axioms. Probability models have three components to them. There's a sample space.
Now, here we're in mathematics again. The sample space is just a set of things. You don't have to be at all specific about what those things are. I mean, at this point we're right in to set theory, which is the most basic part of mathematics again. And a set contains elements. And that's what we're talking about here. So there's a sample space. There are the elements in that sample space. There's also a collection of things called events.
Now, the events are subsets of the sample space. And if you're dealing with a finite set of things, there's no reason why the events should not be all subsets of that countable collection of things.
If you have a deck of cards, there are 52 factorial ways of arranging the cards in that deck of cards. Very large number. But when you talk about subsets of that, you might as well talk about all combinations of those configurations of the deck.
You can talk about, what's the probability that the first five cards in that deck happen to contain 4 aces? That's an easy thing to compute. I'm sure you've all computed it at some point or other. Those who like to play poker, of course do this. It's fun. But it's a straightforward problem.
When you have these countable sets of things, there's no reason at all for not having the set of events consist of all possible subsets.
Well, people believed that for a long time. One of the things that forced Kolmogorov to start dealing with these axioms was the realization that when you had much more complicated sets, where in fact you had the set of real numbers as possible outcomes, or sequences of things which go from 0 to infinity, and all of these sets, which are uncountable, you really can't make sense out of probability models where all subsets of sample points are called events. So in terms of measure theory, you're forced to restrict the set of things you call events.
Now, we're not going to deal with measure theory in this subject. But every once in a while, we will have to mention it because the reason why a lot of things are the way they are is because of measure theory. So you'll have to be at least conscious of it.
If you really want to be serious, as far as your study of mathematical probability theory, you really have to take a course in measure theory at some point. But you don't have to do it now. In fact, I would almost urge most of you not to do it now. Because once you get all the way into measure theory, you're so far into measure theory that you can't come back and think about real problems anymore. You're suddenly stuck in the world of mathematics, which happens to lots of people.
So anyway, some of you should learn about all this mathematics. Some of you shouldn't. Some of you should learn about it later. So you can do whatever you want.
OK, the axioms about events is that if you have a set of events. In other words, a set of subsets, and it's a countable set, then the union of all of those-- the union from n equals 1 to infinity of a sub n is also an event.
I've gone for 50 minutes and nobody has asked a question yet. Who has a question? Who thinks that all of this is nonsense? How many of you? I do. OK, I'll come back in another 10 minutes. And if nobody has a question by then, I'm just going to stop and wait. OK, so anyway.
If you look at a union of events. Now, remember, that an event is a subset of points. We're just talking about set theory now. So the union of this union here-- excuse me. This union here is A1, all the points in A1, and all the points in A2, and all the points in A3, all the way up to infinity. That's what we're talking about here.
And one of the axioms of probability theory is that if each of these things are events, then this union is also an event. That's just an axiom. You can't define events if that's not true. And if you try to define events where this isn't true, you eventually come into the most god awful problems you might imagine. And suddenly, nothing make sense anymore.
Most of the time when we define a set of events in a probability model, each singleton event-- namely, each single point has a set, which contains only that element, is taken as an event. There's no real reason to not do that.
If you don't do that, you might as well just put those points together and not regard them as separate points. We will see an example in a little bit where, in fact, you might want to do that. But let's hold that off for a little bit.
OK, not all subsets need to be events. Usually, each sample point is taken to be a singleton event. And then non-events are truly weird things. I mean, as soon as you take all sample points to be events, all countable unions of sample points are events. And then intersections of events are events, and so forth, and so forth, and so forth. So most things are events. And just because of measure theory, you can't make all things events. And I'm not going to give you any example of that because examples are horrendous.
OK, the empty set has to be an event. Why does the empty set have to be an event. If we're going to believe these axioms-- I'm in a real bind here because every one of you people has seen these axioms before. And you've all gone on and said, I can get an A in any probability class in the world without having any idea of what these axioms are all about. And therefore, it's unimportant. So you see something that says, the empty set is an event. And you say, well, of course that has nothing to do with anything. Why should I worry about whether the empty set is an event or not? The empty set can't happen, so how can it be an event?
Well, because of these axioms, it has to be an event. The axioms say that if A is an event, and that's the whole sample space, then the complement has to be an event also. So that says that the empty set has to be an event. And that just follows from the axioms.
If all sample points are singleton events, then all finite and countable sets are events.
And finally, deMorgan's law. Is there anyone who isn't familiar with deMorgan's law? Anyone who hasn't seen even that small amount of set theory? If not, look it up on-- what's the name of the computer-- Wikipedia.
Most of you will think that things on Wikipedia are not reliable. Strangely enough, in terms of probability theory and a lot of mathematics, Wikipedia does things a whole lot better than most textbooks do. So any time you're unfamiliar with what a word means or something, you can look it up in your old probability textbook. If you've used [INAUDIBLE] and [INAUDIBLE], you will probably find the right answer there. Other textbooks, maybe the right answer. Wikipedia's more reliable than most of them. And it's also clearer than most of them. So I highly recommend using Wikipedia whenever you get totally confused by something.
OK, so probability measure and events satisfies the following axioms. We've said what things are events. The only things that have probabilities are events. So the entire set has a probability. When you do the experiment, something has to happen. So one of the sample points occurs. That's the whole idea of probability. And therefore, omega has probability 1. Capital Omega.
If A is an event, then the probability of A has to be greater than or equal to 0. You can probably see without too much trouble why it has to be less than or equal to 1 also. But that's not one of the axioms. You see, when you state a set of axioms for something, you'd like to use the minimum set of axioms you can, so that you don't have to verify too many things before you say, yes, this satisfies all the axioms. So the second one is the probability of A has to be greater than or equal to 0.
The third one says that if you have a sequence of disjoint events, incidentally when I say a sequence, I will almost always mean a countably infinite sequence-- A1, A2, A3, all the way up. If I'm talking about what most of you would call a finite sequence-- and I like the word "finite sequence," but I like to be able to talk about sequences. I'm talking about a finite sequence I will usually call it an n-tuple of random variables or an n-tuple of things. So sequence really means you go the whole way out.
OK, if A1, A2, all the way up are disjoint events-- disjoint. Disjoint means if omega is only in one, it can't be in any of the others. Then the probability of this countable union is going to be equal to the sum of the probabilities of the individual event.
Anyone who has ever done a probability problem knows all of these things. The only thing you don't know and you probably haven't thought about is why everything else follows from this. But this is the whole mathematical theory. Why should we study it anymore? We're done. We have the axioms. Everything else follows, it's just a matter of computation. Just sit down and do it. Not quite that simple.
Anyway, a few consequences of the probability of the empty set is 0, which says when you do an experiment something's going to happen. And therefore, the probability that nothing happens is 0 because that's what the model says. The probability of the complement of an event is 1 minus the probability of that event. Which, in fact, is what's says that all events have to have probabilities less than or equal to 1. And if the event A is contained in the event B-- remember when we talk about events, we're talking about two different things, both simultaneously.
One of them is this beautiful idea with measure theory worked into it and everything else. And the other is just a simple set theoretic idea. And all we need to be familiar with is a set theoretical idea. Within that set theoretical idea, A contained in B means that every sample point that's in A is also in B. It means that when you do an experiment, and the event A occurs, the event B has to occur because one of the things that compose A has to occur. And that thing has to be in B because A is contained in B. So the probability of A has to be less than or equal to the probability of B. That has to be less than or equal to 1. These are things you all know.
Another consequence is the union bound. Many of you have probably seen the union bound. We will use it probably almost every day in this course. So it's good to have that as one of the things you remember at the highest level.
If you have a set of events-- A1, A2, and so forth-- the probability of that union-- namely, the event that consists of all of them-- is less than or equal to the sum of the individual event probabilities.
I give a little proof here for just two events, A1 and A2. So you see why this is true. I hope you can extend this to 3 and 4. I can't draw a picture of it very easily for 3 and 4 and so forth. But here's the event A1. Here's the event A2. Visualize this as a set of sample points, which are just in the two-dimensional space here. So all these points here are in A1. All these points are in A2. This set of points here are the points that are in A1 and A2.
I will use just writing things next to each other to mean intersection. And sometimes I'll use a big cap to main intersection. So all of these things are both in A1 and A2. This is A2, but not A1. So the probability of this whole event, A1 union A2, is the probability of this thing and this thing together. So it's the probability of this plus the probability of this. The probability of this is less than the probability of A2 because this is contained in that whole rectangle. And therefore, the probability of the union of A1 and A2 is less than or equal to the probability of A1 plus probability of A2.
Now, the classy way to extend this to a countably infinite set is to use induction. And I leave that as something that you can all play with some time when it's not between 9:30 and 11:00 in the morning and you're struggling to stay awake. And if you don't want to do that on your own, you can look at it in the notes.
OK, these axioms look ho-hum to you. And you've always ignored them before, and you think you're going to be able to ignore them now. Partly you can, but partly you can't because every once in a while we'll start doing things where you really need to understand what the axioms say.
OK, one other thing which you might not have noticed. When you studied elementary probability, wherever you studied it, what do you spend most of your time doing? You spent most of your time talking about random variables and talking about expectations. The axioms don't have random variables in them. They don't have expectations in them. All they have in them is events and probabilities of events. So these axioms say that the really important things in probability are the events and the probabilities of events. And the random variables and the expectations are derived quantities, which we'll now start to talk about.
OK, so we're now down to independent events and experiments. Two events, A1 and A2, are independent if the probability of the two of them is equal to the product of their probabilities. You've all seen this. I'm sure you've all seen it. If you haven't at least seen it, you probably shouldn't be in this class because even though the text does everything in detail that has to be done, you need to have a little bit of insight from having dealt with these subjects before. If you don't have that, you're just going to get lost very, very quickly.
So the probability is the intersection of the event A1 and A2 is the product of the two. Now, in other words, you have a red die and a white die. You flip the dice, what's the probability that you get a 1 for the red die and a 1 for the white die? Well, the probability you get a 1 for the red die is 1/6. Just by symmetry, there are only 6 possible things that can happen. Probability of white die comes up as 1. Probability is 1/6 for that. And the probability of the two things, they're independent. There's a sense of real-world independence and probability theory independence. Real-world independence says the two things are isolated, they don't interfere with each other.
Probability theory says just by definition, well, the real-world idea of them not interfering with each other should say-- and I'm waving my hands here because this is so elementary, you all know it. And I would bore you if I talked about it more. But I probably should talk about it, but I'm not going to.
Anyway, this is the definition of independence. If you don't have any idea of how this corresponds to being unconnected in the real-world, then go to Wikipedia. Read the notes. Well, you should read the notes anyway. I hope you will read the notes because I'm not going to say everything in class that needs to be said. And you will get a better feeling for it.
Now, here's something important. Given two probability models, a combined model can be defined in which, first, the sample space, omega, is the Cartesian product of omega 1 and omega 2. Namely, it's the Cartesian product of the two sample spaces.
Think of rolling the red die and the white die. Rolling a red die is an experiment. There are 6 possible outcomes, a 1 to 6. Rolled a white die, there are 6 possible outcomes, a 1 to a 6. You roll the two dice together, and you really need to have some way of putting these two experiments together. How do you put them together? You talk about the outcome for the two dice, number for one and number for the other. The Cartesian product simply means you have the set made up of 1 to 6 Cartesian product with 1 to 6. So you have 36 possibilities.
It's an interesting thing, which comes from Kolmogorov's axioms. That, in fact, you can take in any two probability models for two different experiments. You can take this Cartesian product of sample points. You can assume that what happens here is independent of what happens here. And when you do this, you will, in fact, get something for the two experiments put together which satisfies Kolmogorov's axioms. That is neither trivial nor very hard to prove. I mean, for the case of two dice, you can see it almost immediately. I mean, you see what the sample space is. It's this Cartesian product. And you see what the probabilities have to be because the probability of, say, 1 and 2 for the red die and 1 and 2 for the white is 2, 6 times 2, 6. So with probability 1/9, you're going to get a 1 and a 2 combined with a 1 and a 2.
I'm going to talk a little bit later about something that you all know. What happens if you roll two white dice? This is something you all ought to think about a little bit because it really isn't as simple as it sounds. If you roll two dice, what's the probability that you'll get a 1 and a 2? And how can you justify that?
First, what's a sample space when you roll two white dice?
Well, if you look at the possible things that might happen, you can get 1, 1; 2, 2; 3, 3; 4, 4; 5, 5; 6, 6. You can also get 1, 2 or 2, 1. But you can't tell them apart, so there's one sample point, you might say, which is 1, 2, and 2, 1. Another sample point which is 2, 3; 3, 2, and so forth. If you count them up, there are 21 separate sample points that you might have. And when you look at what the probabilities ought to be, the probabilities of the pairs are 136 each. And the probabilities of the I J, where I is unequal to J is 1/18 each. That's awful. So what do you do?
When you're rolling two dice, you do the same thing that everybody else does. You say, well, even though it's two white dice, I'm going to think of it as if it's a white die and a red die. I'm going to think of it as if the two are indistinguishable. My sample space is going to be these 36 different things. I will never be able to distinguish a 1, 2 from a 2, 1. But I don't care because I now know the probability of each of them.
What I'm trying to say by this, is this a very, very trivial example of where you really have to think through the question of what kind of mathematical model do you want of the most simple situation you can think of almost. When you combine two different experiments together and you lose distinguishability, then what do you do?
Well, the sensible thing to do is assume that the distinguishability is still there, but it's not observable. But that makes it hard to make a correspondence between the real world and the probability world. So we'll come back to that later. But for the most part, you don't have to worry about it because this is something you've dealt with all of your lives.
I mean, you've done probability problems with dice. You've done probability problems with all sorts of other things where things are indistinguishable from each other. And after doing a few of these problems, you are used to being schizophrenic about it. And on one hand, thinking that these things are distinguishable to figure out what all the probabilities are. And then you go back to saying, well, they aren't really distinguishable, and you find the right answer. So you don't have to worry about it. All I'm trying to say here is that you should understand it. Because when you get the complicated situations, this is one of the main things which will cause confusion. It's one of the main things where people write papers and other people say that paper is wrong because they're both thinking of different models for it.
Important thing is if you satisfy Kolmogorov's axioms in each of a set of models, and most important thing is where each of these models are exactly the same. And then you make them each independent of each other, Kolmogorov's axioms are going to be satisfied for the combination, as well as the individual model.
Why do I care about that? Because we're studying stochastic processes. We're studying an infinite sequence of random variables. And I don't want to generate a complete probability model for an infinite set of random variables every time I talk about an infinite set of random variables.
If I'm talking about an infinite sequence of flipping a coin, I want to do what you do, which is say, for each coin, the coin is equiprobably a head or a tail. And the coin tosses are independent of each other. And I want to know that I can go from that to thinking about this sequence. Strange things will happen in these sequences when we go to the limit. But still, we don't want to have to worry about a model for the whole infinite sequence. So that's one of the things we should deal with.
Finally, random variables. Definition. Three years ago, I taught this course and I asked people to write down definition of what a random variable was. And almost no one really had any idea of what it was. They said it was something that had a probability density, or something that had a probability mass function, or something that had a distribution function, or something like that.
What it is, if you want to get a definition which fits in with the axioms, the only thing we know from the axioms is there's a sample space. There are events and there are probabilities. So a random variable, what it really is, is it's a function from the set of sample points to the set of real values. And as you get into this, you will realize that the set of real values does not include minus infinity or plus infinity. It says that every sample point gets mapped into a finite value. This happens, of course, when you flip a coin.
Well, flipping a coin, the outcome is not a random variable. But you'd like to make it a random variable, so you say, OK, I'm going to model tails as 0 and heads as 1, or vice versa. And then what happens?
Your model for coin tossing, a sequence of coin tosses becomes the same as your model for data. So that what you know about coin tossing, you can apply to data compression. You see, when you think about these things mathematically, then you can make all sorts of connections you couldn't make otherwise.
So random variables have to satisfy the constraint that-- they have to satisfy the constraint that the set of sample points, such that x, x of omega, which is a real number, is less than or equal to some given real number. That this set has to be an event. Because those are the only things that have probabilities. So if we want to be able to talk about the probabilities of these random variables lying in certain ranges, or things like this, or having PMFs, or anything that you like to do, you need this constraint on it. It's an event for all A in the set of real numbers.
Also, if this set of things here are each random variables. In other words, if each of them are functions from the sample space to the real line, then the set of omega such that x1 of omega is less than or equal to A1, up to An of omega is less than or equal to A n is an event also. You might recognize this as the distribution function, the joint distribution function for n random variables. You might recognize this as the distribution function evaluated at A for a single random variable. So you define a random variable.
And what we're doing here is-- it's kind of funny because we already have these axioms. But now when we want to define things in the context of these axioms, we need extra things in the definitions.
This is a distribution function, a distribution function of the random variable x is the probability that the random variable x is less than or equal to x, which means that x is a mapping from omega into real numbers. It says that with this mapping here, you're mapping this whole sample space into the real line. Some omegas get mapped into things less than or equal to a real number x. Some of them get mapped into things greater than the real number x. And the set that gets mapped into something less than or equal to x, according to the definition of a random variable, has to be an event. Therefore, it has to have a probability. And these probabilities increase as we go.
It is totally immaterial for all purposes whether we have a less than or equal to here or a less than here. And everyone follows the convention of using a less than or equal to here rather than a less than here. The importance of that is that when you look at a distribution function, the distribution function often has jumps. And the distribution will have a jump whenever there's a nonzero probability that the random variable takes on a particular value x here. It takes on this particular value with something more than probabilities here.
If you have a probability density for a random variable, this curve just moves up continuously. And the derivative of this curve is the probability density.
If you have a probability mass function, this is a staircase type of function. Because of the fact that we define the distribution function with a less than or equal to rather than a less than means that in every one of these jumps, the value here is the upper value of the jump. Value here is the upper value of the jump, and so forth.
Now, I'm going to-- I've already said half of this.
Affects maps only until finite or countable set of values. It's discrete. And it has a probability mass function-- this notation.
If the derivative exists, then you say that the random variable is continuous and it has a density. And most problems that you do in probability theory, you're dealing with random variables. And they either have a probability mass function if they're discrete or they have a density if they're continuous. And this is just saying some things are one way, some things are the other way. And some things are neither. And we'll see lots of things that are neither. And you need the distribution function to talk about things that are neither. We will find that the distribution function, which you've hardly ever used in the past, is extraordinarily important, both for theoretical purposes and for other purposes. You really need that as a way of solving problems, as well as keeping yourself out of trouble.
For every random variable, the distribution function exists. Why? Anybody know why this has to exist for every random variable? Yeah.
AUDIENCE: Because the [INAUDIBLE].
PROFESSOR: Yes. Because we insisted that it did. Namely, we insisted that this event actually was an event for all little x. That's part of the definition. So, in fact, when you do these things a little more carefully than you might be used to, the definition implies that the distribution function always exists. As a more real-world kind of argument, we now have a way of dealing with things that are discrete, and continuous, and mixed continuous and discrete, and anything else that you might think of because the definition restricts it.
Now, one other thing. How do I know that this starts at 0? That's a more complicated thing. And I'm not even going to do it in detail. But since every omega maps into a finite number, you can't have a jump down here at minus infinity. And you can't have a jump here at plus infinity. Because omegas don't map into plus infinity or minus infinity. So you have to start down here at 0. You have to climb up here to 1. You might never reach 1. You might reach it only as a limit, but you have to reach it as a limit. Yes?
AUDIENCE: In the first paragraph, [INAUDIBLE].
PROFESSOR: If we have a sequence of [INAUDIBLE], yeah.
AUDIENCE: It's [INAUDIBLE].
PROFESSOR: You are probably right. Yes. Well, I don't know I don't think about that one. I don't think you're right, but we can argue about it. But anyway, this has to start at 0. It has to go up to 1.
OK, we did this. Now, I'm going to go through a theoretical nitpick for the last five minutes of the class. Anyone who doesn't like theoretical nitpicks, you're welcome to either go to sleep for five minutes, or you're welcome to go out and get a cup of coffee, or whatever you want to do. I will do this to you occasionally. And I realize it's almost torture for some of you, because I want to get you used to thinking about how relatively obvious things actually get proven. I want to increase your ability to prove things.
The general statement about proving things, or at least the way I prove things, is not the way most mathematicians prove things. Most mathematicians prove things by starting out with axioms and going step by step until they get to what they're trying to prove.
Now, every time I talk to a good mathematician, I find out that's what they write down when they prove something, but that's not the way they think about it at all. All of us-- engineers, businesspeople, everyone-- thinks about problems in a different way. If we're trying to prove something, we first give a really half-assed proof of it. And after we do that, we look at it and we say, well, I don't see why this is true and I don't see why that's true. And then you go back and you patch these things up. And then after you patch things up, it starts to look ugly. So you go back and do it a nicer way. And you go back and forth and back and forth and back and forth, using both contradiction and implication. You use both of them.
Now, when you're proving things in this class, I don't care whether you make it look like you're a formal mathematician or not. I would just assume you didn't pretend you were a formal mathematician. I would like to see you prove things in such a way that it is at least difficult to poke a hole in your argument. In other words, I would like you to give an argument which you've thought about enough that there aren't obvious counter examples to it. And if you learn to do that, you're well on your way to learning to use this theory in a way where you can actually come up with correct answer.
And in fact, I'm not going to go through this proof at all. And I don't think I really wanted to. I just did it because-- well, I think it's something you ought to read. It is important to learn to prove things because when you get to complicated systems, you cannot see your way through them intuitively. And if you can't see your way through it intuitively, you need to understand something about how to prove things, and you need to put all the techniques of proving things that you learned together with all the techniques that you've learned for doing things intuitively. And you need to know how to put them together.
If you're stuck dealing only with things that are intuitive, or things that you learned in high school like calculus, then you really can't deal with complicated systems very well.
OK, I'm going to end at that point. You can read this theoretical nitpick if you want, and play with it. And we'll go on next time.
Introduction and Probability Review (PDF)