1 00:00:00,740 --> 00:00:02,760 Before we get going with our discussion 2 00:00:02,760 --> 00:00:04,800 of inference methods, it is worth 3 00:00:04,800 --> 00:00:07,550 looking at the big picture for some perspective. 4 00:00:07,550 --> 00:00:09,680 So far, we have concentrated on ways 5 00:00:09,680 --> 00:00:11,860 to analyze probability models. 6 00:00:11,860 --> 00:00:14,230 This part of the picture. 7 00:00:14,230 --> 00:00:18,190 If our model has been selected in a careful way, 8 00:00:18,190 --> 00:00:21,730 then it should also be relevant to the real world 9 00:00:21,730 --> 00:00:26,000 and help us make predictions or decisions. 10 00:00:26,000 --> 00:00:28,480 But how do we know that this is the case? 11 00:00:28,480 --> 00:00:32,450 This is why we need to look at data that are generated 12 00:00:32,450 --> 00:00:37,010 by the real world, and then use these to come up with a model. 13 00:00:37,010 --> 00:00:39,760 This is what the field of inference and statistics 14 00:00:39,760 --> 00:00:41,230 is all about. 15 00:00:41,230 --> 00:00:43,950 This field has undergone a radical transformation 16 00:00:43,950 --> 00:00:45,670 in recent years. 17 00:00:45,670 --> 00:00:48,490 I will exaggerate a little, but in the past, 18 00:00:48,490 --> 00:00:50,480 a statistician might be called to look 19 00:00:50,480 --> 00:00:53,360 at problems such as this one. 20 00:00:53,360 --> 00:00:56,000 You're given data on a few patients, 21 00:00:56,000 --> 00:01:00,200 and you need to figure out whether a certain treatment is 22 00:01:00,200 --> 00:01:01,850 effective or not. 23 00:01:01,850 --> 00:01:05,510 But today, a statistician lives in a dreamland. 24 00:01:05,510 --> 00:01:09,370 There are tons of data that are generated everywhere. 25 00:01:09,370 --> 00:01:15,010 These allow us to build quite detailed large models involving 26 00:01:15,010 --> 00:01:17,100 thousands of parameters. 27 00:01:17,100 --> 00:01:20,890 And we do have the computational power to do all that. 28 00:01:20,890 --> 00:01:22,700 In this landscape, the opportunities 29 00:01:22,700 --> 00:01:25,030 for a statistician are endless. 30 00:01:25,030 --> 00:01:29,130 So let me give you a small representative sample. 31 00:01:29,130 --> 00:01:30,960 In a somewhat traditional setting, 32 00:01:30,960 --> 00:01:33,670 one designs a data collection method, 33 00:01:33,670 --> 00:01:38,310 and then uses these data to make a simple prediction. 34 00:01:38,310 --> 00:01:41,200 This is the case, for example, in polling. 35 00:01:41,200 --> 00:01:45,479 Where the purpose is to predict the outcome of an election. 36 00:01:45,479 --> 00:01:48,330 Another field is marketing and advertising, 37 00:01:48,330 --> 00:01:50,220 where the situation is somewhat similar. 38 00:01:50,220 --> 00:01:52,729 Except now, we want to make predictions 39 00:01:52,729 --> 00:01:55,229 not for a population as a whole, but 40 00:01:55,229 --> 00:01:58,220 for each individual consumer. 41 00:01:58,220 --> 00:02:00,260 And a particular application has to do 42 00:02:00,260 --> 00:02:03,350 with so-called recommendation systems. 43 00:02:03,350 --> 00:02:07,400 You collect ratings that people give to movies, as 44 00:02:07,400 --> 00:02:11,550 in a famous competition that was announced by Netflix. 45 00:02:11,550 --> 00:02:16,600 So you have data for every movie and the people 46 00:02:16,600 --> 00:02:19,250 who have watched them. 47 00:02:19,250 --> 00:02:22,840 You make a note of what rating that person 48 00:02:22,840 --> 00:02:25,530 gave to a particular movie. 49 00:02:25,530 --> 00:02:29,880 And now after you collect huge amounts of data of this kind, 50 00:02:29,880 --> 00:02:33,240 you try to use this information to guess 51 00:02:33,240 --> 00:02:37,280 whether, for example, this person is 52 00:02:37,280 --> 00:02:41,640 interested in this particular movie or not. 53 00:02:41,640 --> 00:02:44,060 This is a quite difficult problem. 54 00:02:44,060 --> 00:02:46,030 A quite complicated one. 55 00:02:46,030 --> 00:02:48,870 And it gave the community an opportunity 56 00:02:48,870 --> 00:02:52,760 to develop fancier and fancier combinations of methods 57 00:02:52,760 --> 00:02:55,870 in order to come up with good predictions 58 00:02:55,870 --> 00:02:59,060 of unknown entries in this table. 59 00:02:59,060 --> 00:03:01,770 Another field is, of course, finance. 60 00:03:01,770 --> 00:03:03,960 The markets are truly uncertain. 61 00:03:03,960 --> 00:03:07,090 And there are quite complete historical data. 62 00:03:07,090 --> 00:03:08,210 Lots of them. 63 00:03:08,210 --> 00:03:11,670 How do we use these data to make predictions? 64 00:03:11,670 --> 00:03:13,650 Coming now to the natural sciences, 65 00:03:13,650 --> 00:03:18,010 a revolution has been taking place in the life sciences. 66 00:03:18,010 --> 00:03:22,220 There are tons of genomic data to be processed to find out 67 00:03:22,220 --> 00:03:26,810 what combination of genes causes what disease. 68 00:03:26,810 --> 00:03:29,870 Or we may want to find out the details 69 00:03:29,870 --> 00:03:33,100 of the chemical reactions inside a living cell. 70 00:03:33,100 --> 00:03:37,880 And there is an upcoming new frontier, neuroscience, 71 00:03:37,880 --> 00:03:43,110 where there will be vast amounts of data that will be generated. 72 00:03:43,110 --> 00:03:45,420 These will consist of brain measurements. 73 00:03:45,420 --> 00:03:48,290 Of measurements of what each neuron is doing. 74 00:03:48,290 --> 00:03:51,710 And hopefully, these will lead us one day 75 00:03:51,710 --> 00:03:56,870 to finding out what the brain really does and how it works. 76 00:03:56,870 --> 00:03:59,430 In the sciences, the list is endless. 77 00:03:59,430 --> 00:04:01,240 It goes on and on. 78 00:04:01,240 --> 00:04:04,070 In modeling climate and the environment, 79 00:04:04,070 --> 00:04:07,690 scientists are using a huge models these days. 80 00:04:07,690 --> 00:04:11,610 Which they try to calibrate using lots of available data. 81 00:04:11,610 --> 00:04:14,640 And in physics as well, scientists 82 00:04:14,640 --> 00:04:17,820 to use fancy inference methods trying to find 83 00:04:17,820 --> 00:04:19,959 needles in a haystack. 84 00:04:19,959 --> 00:04:24,120 Like rare particles or remote planets. 85 00:04:24,120 --> 00:04:28,680 Finally, engineering is a fight against noise. 86 00:04:28,680 --> 00:04:31,110 Engineers try to make devices that 87 00:04:31,110 --> 00:04:33,800 will work in uncertain environments. 88 00:04:33,800 --> 00:04:36,510 The field of signal processing is a prime example 89 00:04:36,510 --> 00:04:39,050 where the generic question is to recover 90 00:04:39,050 --> 00:04:40,850 the content of a signal. 91 00:04:40,850 --> 00:04:44,260 For example, the content of a radio transmission 92 00:04:44,260 --> 00:04:49,550 when a signal is received after it gets corrupted by noise. 93 00:04:49,550 --> 00:04:53,800 I could go on and on for hours generating lists of this kind, 94 00:04:53,800 --> 00:04:55,620 but we have to stop somewhere. 95 00:04:55,620 --> 00:04:59,500 The bottom line is that the opportunities and the needs 96 00:04:59,500 --> 00:05:00,960 are vast. 97 00:05:00,960 --> 00:05:04,750 For this reason, we will look into the core methodologies 98 00:05:04,750 --> 00:05:06,560 that come into play. 99 00:05:06,560 --> 00:05:09,720 Fortunately for us, the fundamental concepts 100 00:05:09,720 --> 00:05:13,790 and approaches turn out to be the same independent 101 00:05:13,790 --> 00:05:16,220 of the particular application.