1 00:00:00,510 --> 00:00:03,300 In this segment, we want to reinforce the message 2 00:00:03,300 --> 00:00:07,780 that how you choose to sample can give different results, 3 00:00:07,780 --> 00:00:10,180 and the choice of sampling is important. 4 00:00:10,180 --> 00:00:13,810 Suppose you're interested in measuring the average family 5 00:00:13,810 --> 00:00:16,260 size in some population. 6 00:00:16,260 --> 00:00:19,450 Suppose that there are three families that are small 7 00:00:19,450 --> 00:00:22,890 and have just one person in them, 8 00:00:22,890 --> 00:00:31,770 and there's also a family that has six people in it. 9 00:00:31,770 --> 00:00:35,250 What does it mean to measure the average family size? 10 00:00:35,250 --> 00:00:40,760 One possibility is to pick at random a family, each family 11 00:00:40,760 --> 00:00:43,750 being chosen with equal probability, 12 00:00:43,750 --> 00:00:46,810 and talk about the expected value that you get, 13 00:00:46,810 --> 00:00:50,190 or the average value if you sample that way. 14 00:00:50,190 --> 00:00:54,410 In this particular example, with probability 1/4, 15 00:00:54,410 --> 00:00:57,190 you get a 1; with probability 1/4, you get a 1; 16 00:00:57,190 --> 00:00:59,480 with probability 1/4, you get a 1. 17 00:00:59,480 --> 00:01:03,420 So the answer would be: with probability 3/4, you get a 1, 18 00:01:03,420 --> 00:01:09,360 and with probability 1/4, you get a 6. 19 00:01:09,360 --> 00:01:14,120 But suppose that instead, you pick a person at random 20 00:01:14,120 --> 00:01:18,680 and you ask that person, how big is your family? 21 00:01:18,680 --> 00:01:22,420 What's the expected value of the answer you're going to get? 22 00:01:22,420 --> 00:01:24,630 Here we have nine people.
23 00:01:24,630 --> 00:01:27,940 Out of those nine people, three of them 24 00:01:27,940 --> 00:01:31,870 will give you the answer, my family has size one, 25 00:01:31,870 --> 00:01:36,229 and six of those people will give you the answer, 26 00:01:36,229 --> 00:01:39,970 my family has a size of six. 27 00:01:39,970 --> 00:01:42,770 And this expected value is going to be larger 28 00:01:42,770 --> 00:01:44,330 than the previous one. 29 00:01:44,330 --> 00:01:46,920 You're going to get different answers. 30 00:01:46,920 --> 00:01:49,220 So it is possible to have a situation where 31 00:01:49,220 --> 00:01:52,340 you can make a statement such as the following. 32 00:01:52,340 --> 00:01:56,759 The average family size is three, 33 00:01:56,759 --> 00:02:02,070 but the average person lives in a family of size four. 34 00:02:02,070 --> 00:02:04,840 There is no contradiction between these two statements 35 00:02:04,840 --> 00:02:07,730 because we're measuring different things. 36 00:02:07,730 --> 00:02:10,259 Another example of the same flavor. 37 00:02:10,259 --> 00:02:13,250 You're interested in the average bus occupancy. 38 00:02:13,250 --> 00:02:16,500 You're interested in whether buses are crowded or not 39 00:02:16,500 --> 00:02:18,050 in your city. 40 00:02:18,050 --> 00:02:21,850 One way of carrying out this calculation 41 00:02:21,850 --> 00:02:25,400 is to pick buses at random, each bus 42 00:02:25,400 --> 00:02:27,530 being equally likely to be picked, and see 43 00:02:27,530 --> 00:02:30,640 how many people are riding this bus. 44 00:02:30,640 --> 00:02:35,030 Another possibility is to take a typical passenger, 45 00:02:35,030 --> 00:02:40,290 a random passenger, and ask them, how crowded was your bus? 46 00:02:40,290 --> 00:02:41,930 Take an extreme case. 47 00:02:41,930 --> 00:02:45,860 Suppose that half of the buses have 0 people in them, 48 00:02:45,860 --> 00:02:50,440 and half of the buses have 50 people in them.
49 00:02:50,440 --> 00:02:55,050 If you look at random buses, then the average occupancy 50 00:02:55,050 --> 00:02:57,150 would be 25. 51 00:02:57,150 --> 00:03:00,720 But if you ask passengers, all of the passengers 52 00:03:00,720 --> 00:03:06,020 would report 50, and it would be, again, a different answer. 53 00:03:06,020 --> 00:03:09,900 A similar situation is if you're talking about average class 54 00:03:09,900 --> 00:03:11,280 sizes. 55 00:03:11,280 --> 00:03:14,060 One method is to look at all the classes, 56 00:03:14,060 --> 00:03:16,410 see how many students there are in each class, 57 00:03:16,410 --> 00:03:18,079 and take the average. 58 00:03:18,079 --> 00:03:22,860 Another method would be to ask a typical student, 59 00:03:22,860 --> 00:03:25,150 how large is your class? 60 00:03:25,150 --> 00:03:29,240 Because more students are in large classes, when 61 00:03:29,240 --> 00:03:31,430 you pick a student at random, you 62 00:03:31,430 --> 00:03:34,710 are likely to get a higher answer, as opposed 63 00:03:34,710 --> 00:03:37,770 to when you look at a random class. 64 00:03:37,770 --> 00:03:40,250 The moral from all these examples 65 00:03:40,250 --> 00:03:42,450 is that it is very important to be 66 00:03:42,450 --> 00:03:45,730 careful about what you choose to sample. 67 00:03:45,730 --> 00:03:47,860 When you pick at random, what exactly 68 00:03:47,860 --> 00:03:49,720 are you picking at random? 69 00:03:49,720 --> 00:03:52,390 And you need to be aware that different sampling 70 00:03:52,390 --> 00:03:55,000 methods measure different things, 71 00:03:55,000 --> 00:03:58,300 and will generally give you different results.
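
The two ways of sampling discussed in this segment can be compared numerically. The following is a minimal sketch, not part of the lecture: the function names `mean_per_group` and `mean_per_member` are hypothetical, the family sizes `[1, 1, 1, 6]` are read off from the probabilities and head counts quoted in the family example, and the bus example is reduced to just two buses (one empty, one with 50 riders) as a simplifying assumption.

```python
from fractions import Fraction

def mean_per_group(sizes):
    """Expected size when each group (family, bus, class) is equally likely to be picked."""
    return Fraction(sum(sizes), len(sizes))

def mean_per_member(sizes):
    """Expected size reported when each member (person, passenger, student) is equally likely.

    A group of size s is reported by s of its members, so it gets weight s / total.
    Raises ZeroDivisionError if there are no members at all.
    """
    total = sum(sizes)
    return Fraction(sum(s * s for s in sizes), total)

# Family example: three families of size 1 and one family of size 6 (nine people).
families = [1, 1, 1, 6]
print(mean_per_group(families))   # 9/4, about 2.25
print(mean_per_member(families))  # 13/3, about 4.33

# Bus example, simplified to two buses (assumption): one empty, one with 50 riders.
buses = [0, 50]
print(mean_per_group(buses))      # 25
print(mean_per_member(buses))     # 50
```

Sampling members weights each group by its own size, which is why the second number is never smaller than the first, and is strictly larger whenever the group sizes are not all equal.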