1 00:00:01,030 --> 00:00:03,150 In this segment we introduce the concept of 2 00:00:03,150 --> 00:00:05,120 a confidence interval. 3 00:00:05,120 --> 00:00:08,109 The starting point is that an estimate, the value of an 4 00:00:08,109 --> 00:00:11,110 estimator, does not tell the whole story. 5 00:00:11,110 --> 00:00:13,910 One option is to also provide the standard error of the 6 00:00:13,910 --> 00:00:17,290 estimator, but a more common practice is to report a 7 00:00:17,290 --> 00:00:18,910 confidence interval. 8 00:00:18,910 --> 00:00:20,040 What is that? 9 00:00:20,040 --> 00:00:22,060 We'll introduce the notion of a confidence 10 00:00:22,060 --> 00:00:23,740 interval through a story. 11 00:00:23,740 --> 00:00:25,910 You're working for a polling company. 12 00:00:25,910 --> 00:00:31,520 You carry out a poll, and then you go and report to your boss 13 00:00:31,520 --> 00:00:38,000 that my estimate is this particular number. 14 00:00:38,000 --> 00:00:41,620 And then your boss says, I appreciate the five digit 15 00:00:41,620 --> 00:00:45,510 accuracy, but are your conclusions that accurate? 16 00:00:45,510 --> 00:00:49,240 You go back to your desk, you do some more calculations, and 17 00:00:49,240 --> 00:00:55,880 then you tell your boss, here is a 95% confidence interval. 18 00:00:58,480 --> 00:01:01,930 Your boss tells you, that looks great, but what does 19 00:01:01,930 --> 00:01:03,570 that exactly mean? 20 00:01:03,570 --> 00:01:06,980 You go back to your textbook, you pull out the definition, 21 00:01:06,980 --> 00:01:09,420 and you reply as follows. 22 00:01:09,420 --> 00:01:13,130 Well, a 95% confidence interval-- 23 00:01:13,130 --> 00:01:17,550 so here I am letting alpha to be 5%-- 24 00:01:17,550 --> 00:01:22,800 a 95% confidence interval is an interval that has the 25 00:01:22,800 --> 00:01:24,360 following property. 26 00:01:24,360 --> 00:01:27,710 That the unknown value of the parameter that we're trying to 27 00:01:27,710 --> 00:01:32,060 estimate falls inside this interval with 28 00:01:32,060 --> 00:01:35,320 probability at least 95%. 29 00:01:35,320 --> 00:01:39,390 And if you wish, I could also let alpha be equal to 1%, in 30 00:01:39,390 --> 00:01:43,630 which case I could give you a 99% confidence interval. 31 00:01:43,630 --> 00:01:46,220 Your boss replies, no, that sounds good. 32 00:01:46,220 --> 00:01:49,920 A 95% interval sounds fine. 33 00:01:49,920 --> 00:01:53,310 And your boss goes out to the press, holds a press 34 00:01:53,310 --> 00:01:58,740 conference, and reports that the true value of the 35 00:01:58,740 --> 00:02:04,490 parameter lies inside this range, inside the reported 36 00:02:04,490 --> 00:02:08,627 confidence interval with probability at least 95%. 37 00:02:11,680 --> 00:02:14,610 Does this statement make sense? 38 00:02:14,610 --> 00:02:16,220 Actually, no. 39 00:02:16,220 --> 00:02:20,300 This statement is the most common misconception of what a 40 00:02:20,300 --> 00:02:22,740 confidence interval is. 41 00:02:22,740 --> 00:02:24,790 To see why this statement does not make 42 00:02:24,790 --> 00:02:27,030 sense, look at it carefully. 43 00:02:27,030 --> 00:02:29,880 We're talking about the probability of something. 44 00:02:29,880 --> 00:02:33,460 But that something does not involve anything random. 45 00:02:33,460 --> 00:02:36,300 0.3 and 0.52 are just numbers. 46 00:02:36,300 --> 00:02:41,079 And theta is also a number which we do not know what it 47 00:02:41,079 --> 00:02:42,816 is, but it is not random. 48 00:02:42,816 --> 00:02:44,650 It is a constant. 49 00:02:44,650 --> 00:02:49,350 So this statement is incorrect on a purely syntactic basis. 50 00:02:49,350 --> 00:02:52,370 I mean the true parameter theta either is inside this 51 00:02:52,370 --> 00:02:54,190 interval, or it is not. 52 00:02:54,190 --> 00:02:57,640 But there's nothing random here, and so this statement 53 00:02:57,640 --> 00:02:59,980 does not make sense. 54 00:02:59,980 --> 00:03:04,070 Instead, let us look carefully at this definition. 55 00:03:04,070 --> 00:03:06,810 This statement does make sense because it 56 00:03:06,810 --> 00:03:09,230 involves random variables. 57 00:03:09,230 --> 00:03:13,150 The lower and the upper end of the confidence interval are 58 00:03:13,150 --> 00:03:17,140 quantities that are determined by the data, and therefore 59 00:03:17,140 --> 00:03:18,530 they are random. 60 00:03:18,530 --> 00:03:21,400 So we do have random variables in here. 61 00:03:21,400 --> 00:03:25,160 And so it makes sense to talk about probabilities. 62 00:03:25,160 --> 00:03:29,590 To really understand what's going on, think as follows. 63 00:03:29,590 --> 00:03:33,860 We're dealing with a poll that is trying to estimate some 64 00:03:33,860 --> 00:03:36,860 unknown value, theta. 65 00:03:36,860 --> 00:03:40,860 You carry out the poll, and you come up with a confidence 66 00:03:40,860 --> 00:03:42,990 interval based on the data. 67 00:03:42,990 --> 00:03:46,820 You might be lucky, and your confidence interval happens to 68 00:03:46,820 --> 00:03:49,100 capture the true parameter. 69 00:03:49,100 --> 00:03:52,579 You carry the poll one more time, maybe on another day. 70 00:03:52,579 --> 00:03:55,170 You come up with another confidence interval. 71 00:03:55,170 --> 00:03:58,360 And you're again lucky, and it captures the true parameter. 72 00:03:58,360 --> 00:04:01,260 You carry it on another day, and you come up with a 73 00:04:01,260 --> 00:04:02,550 confidence interval. 74 00:04:02,550 --> 00:04:06,770 And maybe the data that you got were kind of skewed. 75 00:04:06,770 --> 00:04:10,010 You were unlucky, and your confidence interval does not 76 00:04:10,010 --> 00:04:12,750 capture the true parameter. 77 00:04:12,750 --> 00:04:18,870 Having a 95% confidence interval means that 95% of the 78 00:04:18,870 --> 00:04:24,980 time, 95% of the polls that you carry out will capture the 79 00:04:24,980 --> 00:04:26,380 true parameter. 80 00:04:26,380 --> 00:04:30,190 So the word 95% really talks about your method of 81 00:04:30,190 --> 00:04:32,490 constructing confidence intervals. 82 00:04:32,490 --> 00:04:35,850 It's a method that 95% of the time will 83 00:04:35,850 --> 00:04:37,590 capture the true parameter. 84 00:04:37,590 --> 00:04:42,380 It is not a statement about the actual numbers that you 85 00:04:42,380 --> 00:04:46,040 are reporting on one specific poll. 86 00:04:46,040 --> 00:04:49,190 So it is important to keep this in mind, and to always 87 00:04:49,190 --> 00:04:53,700 interpret confidence intervals the correct way. 88 00:04:53,700 --> 00:04:56,909 So how does one come up with confidence intervals? 89 00:04:56,909 --> 00:04:59,240 The most common method is based on normal 90 00:04:59,240 --> 00:05:02,140 approximations, as we will be seeing next.