1
00:00:00,920 --> 00:00:04,240
We will now consider a very
practical application of the

2
00:00:04,240 --> 00:00:07,060
weak law of large numbers,
and the calculations

3
00:00:07,060 --> 00:00:08,670
associated with it.

4
00:00:08,670 --> 00:00:10,970
The application has to
do with polling.

5
00:00:10,970 --> 00:00:14,200
There's a certain referendum
that's going to take place.

6
00:00:14,200 --> 00:00:16,940
We're close enough to the day
of the referendum so that

7
00:00:16,940 --> 00:00:20,270
voters have made up their minds,
and there is a fraction

8
00:00:20,270 --> 00:00:24,300
p of the population that
represents the voters that are

9
00:00:24,300 --> 00:00:25,780
going to vote yes.

10
00:00:25,780 --> 00:00:28,920
But the referendum has not yet
taken place, and you want to

11
00:00:28,920 --> 00:00:34,990
find out, to predict or estimate
what p actually is.

12
00:00:34,990 --> 00:00:39,280
What you do is that you go
ahead, and you select at

13
00:00:39,280 --> 00:00:42,750
random a number of people
out of the population.

14
00:00:42,750 --> 00:00:46,690
And for each person, you record
their answer, whether

15
00:00:46,690 --> 00:00:49,810
they intend to vote yes,
or whether they

16
00:00:49,810 --> 00:00:52,120
intend to vote no.

17
00:00:52,120 --> 00:00:55,630
When we say that the people are
randomly selected, what we

18
00:00:55,630 --> 00:01:00,890
mean is that we choose them
uniformly from the population.

19
00:01:00,890 --> 00:01:05,110
And since there's a fraction p
that will vote yes, this means

20
00:01:05,110 --> 00:01:09,210
that the random variable Xi, which is 1
if person i says yes, is 1 with probability p, and

21
00:01:09,210 --> 00:01:14,860
therefore, the expected value
of Xi is equal to p.

22
00:01:14,860 --> 00:01:18,680
In addition, we assume that
we select those people

23
00:01:18,680 --> 00:01:19,930
independently.

24
00:01:21,810 --> 00:01:25,789
Now note that if we select
people independently, there's

25
00:01:25,789 --> 00:01:30,289
always a chance that the first
person polled will be the same

26
00:01:30,289 --> 00:01:33,530
as the second person polled,
something that we do not

27
00:01:33,530 --> 00:01:35,140
really want to happen.

28
00:01:35,140 --> 00:01:39,729
However, if we assume that the
population is very large, or

29
00:01:39,729 --> 00:01:43,360
even idealize the situation by
assuming that the population

30
00:01:43,360 --> 00:01:46,360
is infinite, then this is never
going to happen, and

31
00:01:46,360 --> 00:01:49,620
this will not be a concern.

32
00:01:49,620 --> 00:01:51,820
So how do we proceed?

33
00:01:51,820 --> 00:01:54,350
We look at the results
that we got from the

34
00:01:54,350 --> 00:01:55,720
people that we polled.

35
00:01:55,720 --> 00:02:00,420
We count how many said yes,
divide by n, and this gives us

36
00:02:00,420 --> 00:02:04,310
the fraction of yeses in the
sample that we have obtained.

37
00:02:04,310 --> 00:02:08,380
And this is a pretty reasonable
estimate for the

38
00:02:08,380 --> 00:02:11,920
unknown fraction p, the fraction
of yeses in the

39
00:02:11,920 --> 00:02:14,770
overall population.
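
The estimate just described, the fraction of yeses among the n people polled, can be sketched as a small simulation. This is only an illustration: the value p_true = 0.52, the seed, and the function name poll_estimate are assumptions made here, and in a real poll p is of course unknown.

```python
import random

def poll_estimate(p_true, n, seed=0):
    """Simulate n independent yes/no answers and return the fraction of yeses."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    # Each X_i is 1 with probability p_true and 0 otherwise (a Bernoulli draw).
    yes_count = sum(1 for _ in range(n) if rng.random() < p_true)
    return yes_count / n

# p_true is an illustrative assumption; in a real poll it is unknown.
estimate = poll_estimate(p_true=0.52, n=10_000)
print(f"sample fraction of yeses: {estimate:.4f}")
```

Rerunning with a different seed gives a slightly different estimate, which is exactly the random error discussed next.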

40
00:02:14,770 --> 00:02:18,260
Now perhaps your boss has asked
you to find out the

41
00:02:18,260 --> 00:02:20,120
exact value of p.

42
00:02:20,120 --> 00:02:22,310
What should your response be?

43
00:02:22,310 --> 00:02:27,490
Well, there is no way to
calculate p exactly on the

44
00:02:27,490 --> 00:02:31,170
basis of a finite
and random poll.

45
00:02:31,170 --> 00:02:35,090
Therefore, there is going
to be some error in our

46
00:02:35,090 --> 00:02:37,320
estimation of p.

47
00:02:37,320 --> 00:02:42,030
Then, perhaps your boss comes
back and says, OK, then try to

48
00:02:42,030 --> 00:02:45,970
give me an estimate of p
which is very accurate.

49
00:02:45,970 --> 00:02:49,030
I would like you to come up
with an estimate which is

50
00:02:49,030 --> 00:02:52,250
correct within one
percentage point.

51
00:02:52,250 --> 00:02:54,710
Can you do this for me?

52
00:02:54,710 --> 00:03:00,370
Your answer might be, OK, let me
try polling 10,000 people,

53
00:03:00,370 --> 00:03:05,080
and see if I can guarantee for
you such a small error.

54
00:03:05,080 --> 00:03:08,050
But after you think about the
situation a little more, you

55
00:03:08,050 --> 00:03:12,350
realize that there is no way of
guaranteeing such a small

56
00:03:12,350 --> 00:03:13,820
error with certainty.

57
00:03:13,820 --> 00:03:17,329
What if you're unlucky, and the
people that you poll happen to

58
00:03:17,329 --> 00:03:20,870
be not representative of
the true population?

59
00:03:20,870 --> 00:03:24,070
So you come back to your boss
and you say, I cannot

60
00:03:24,070 --> 00:03:27,460
guarantee with certainty that
the error is going to be

61
00:03:27,460 --> 00:03:31,460
small, but perhaps I can
guarantee for you that the

62
00:03:31,460 --> 00:03:36,420
error that I get is small
with high probability.

63
00:03:36,420 --> 00:03:40,610
Or alternatively, I'm going to
guarantee for you that the

64
00:03:40,610 --> 00:03:43,890
probability that we get an error
that's bigger than that

65
00:03:43,890 --> 00:03:46,070
is very small.

66
00:03:46,070 --> 00:03:48,750
So how small is it
going to be?

67
00:03:48,750 --> 00:03:53,000
Let's try to derive a bound on
this probability of an error

68
00:03:53,000 --> 00:03:55,560
larger than one percentage
point.

69
00:03:55,560 --> 00:03:59,250
Using the calculations that we
carried out when we derived

70
00:03:59,250 --> 00:04:03,450
the weak law of large numbers,
we know that this probability

71
00:04:03,450 --> 00:04:08,200
of a large error is bounded
above by this quantity.

72
00:04:08,200 --> 00:04:10,040
What is this quantity?

73
00:04:10,040 --> 00:04:14,650
Sigma squared is the variance
of the random variable that

74
00:04:14,650 --> 00:04:15,580
we're sampling.

75
00:04:15,580 --> 00:04:19,450
And since this is Bernoulli,
this variance is p times 1

76
00:04:19,450 --> 00:04:25,090
minus p, and then we divide by
n, which in our case is 10 to

77
00:04:25,090 --> 00:04:27,960
the fourth times epsilon
squared.

78
00:04:27,960 --> 00:04:31,340
Epsilon is 10 to the minus
2, so here we have 10

79
00:04:31,340 --> 00:04:33,570
to the minus 4.
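
For reference, the bound being described can be written out as follows. The symbol M_n for the sample fraction of yeses is notation assumed here; the transcript itself only refers to "this quantity".

```latex
\[
\mathbf{P}\bigl(|M_n - p| \ge \epsilon\bigr)
\;\le\; \frac{\sigma^2}{n\,\epsilon^2}
\;=\; \frac{p(1-p)}{n\,\epsilon^2}
\;=\; \frac{p(1-p)}{10^{4}\cdot 10^{-4}}
\;=\; p(1-p),
\qquad M_n = \frac{X_1 + \cdots + X_n}{n}.
\]
```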

80
00:04:33,570 --> 00:04:35,620
OK, but now, what is
this quantity?

81
00:04:35,620 --> 00:04:40,620
This quantity depends on p, and
we do not know what p is.

82
00:04:40,620 --> 00:04:44,420
However, if you take this
expression, and plot it as a

83
00:04:44,420 --> 00:04:51,580
function of p, what you obtain
is a plot of this type.

84
00:04:51,580 --> 00:04:57,030
And the maximum happens when p
is equal to 1/2, in which case

85
00:04:57,030 --> 00:04:59,970
we get a value of 1/4.

86
00:04:59,970 --> 00:05:04,060
That is, the variance of the
Bernoulli is, at most, 1/4.

87
00:05:04,060 --> 00:05:07,970
And therefore, we obtain this
bound here where the

88
00:05:07,970 --> 00:05:10,710
denominator terms have
disappeared because their product is

89
00:05:10,710 --> 00:05:12,250
equal to 1.

90
00:05:12,250 --> 00:05:16,710
So you tell your boss, if I
sample 10,000 people, then the

91
00:05:16,710 --> 00:05:20,590
probability of an error of more
than one percentage point

92
00:05:20,590 --> 00:05:23,860
is going to be at most 25%.
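
The calculation behind this 25% figure can be sketched in a few lines. The function name chebyshev_bound is hypothetical, and the worst-case variance of 1/4 is used whenever p is not supplied.

```python
def chebyshev_bound(n, eps, p=None):
    """Upper bound on P(|M_n - p| >= eps) for a Bernoulli(p) poll of size n.

    With p unknown (None), use the worst case p * (1 - p) <= 1/4.
    """
    variance = 0.25 if p is None else p * (1 - p)
    return variance / (n * eps ** 2)

# 10,000 people polled, accuracy of one percentage point.
bound = chebyshev_bound(n=10_000, eps=0.01)
print(f"bound on the probability of a large error: {bound:.2f}")
```

If p happened to be known to be far from 1/2, passing it explicitly would give a smaller bound, but the worst case is the safe choice here.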

93
00:05:23,860 --> 00:05:28,600
At which point, your boss might
reply and say, well, a

94
00:05:28,600 --> 00:05:32,659
probability of a large error
of 25% is too big.

95
00:05:32,659 --> 00:05:34,580
This is unacceptable.

96
00:05:34,580 --> 00:05:38,850
I would like you to have a
probability of error that's

97
00:05:38,850 --> 00:05:43,170
less than 5%.

98
00:05:43,170 --> 00:05:48,260
So suppose now that we want to
change this, and make it only

99
00:05:48,260 --> 00:05:50,700
a 5% error--

100
00:05:50,700 --> 00:05:52,909
5% or less.

101
00:05:52,909 --> 00:05:55,240
How are you going to proceed?

102
00:05:55,240 --> 00:05:58,870
Well, you have this quantity
here, this upper bound, which

103
00:05:58,870 --> 00:06:05,720
we know to be less than or
equal to 1/4 divided by n

104
00:06:05,720 --> 00:06:08,960
times epsilon squared, which
is, in our case, 10

105
00:06:08,960 --> 00:06:10,850
to the minus 4.

106
00:06:10,850 --> 00:06:17,020
We would like this quantity to
be less than or equal to 5%,

107
00:06:17,020 --> 00:06:21,420
which is 5 divided by 10 to the
second power.

108
00:06:21,420 --> 00:06:25,380
And after you solve this
inequality, you find that this

109
00:06:25,380 --> 00:06:30,180
is equivalent to taking
n larger than or equal

110
00:06:30,180 --> 00:06:32,790
to 10 to the sixth.

111
00:06:32,790 --> 00:06:36,010
And then the five together
with that four gives us a

112
00:06:36,010 --> 00:06:38,960
denominator of 20.

113
00:06:38,960 --> 00:06:43,860
And this number is
equal to 50,000.

114
00:06:43,860 --> 00:06:47,560
So at this point, you can go
back to your boss and tell

115
00:06:47,560 --> 00:06:51,200
them that one way of
guaranteeing that the

116
00:06:51,200 --> 00:06:55,909
probability of a large error is
less than or equal to 5% is

117
00:06:55,909 --> 00:06:58,850
to take n equal to 50,000.

118
00:06:58,850 --> 00:07:05,260
So 50,000 will suffice to
achieve the desired specs.
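
As a check on the arithmetic, here is a minimal sketch that solves the inequality exactly. It uses Python's Fraction type to avoid rounding; the names eps, delta, n_min, and bound are chosen here for illustration.

```python
import math
from fractions import Fraction

eps = Fraction(1, 100)    # accuracy: one percentage point
delta = Fraction(5, 100)  # allowed probability of a larger error

# Worst-case Chebyshev bound is (1/4) / (n * eps^2); require it to be <= delta.
n_min = math.ceil(Fraction(1, 4) / (eps ** 2 * delta))

def bound(n):
    """Worst-case Chebyshev bound for a poll of size n."""
    return Fraction(1, 4) / (n * eps ** 2)

print(n_min)  # 50000
```

At n = 50,000 the bound equals 5% exactly, and for any smaller n it exceeds 5%.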

119
00:07:05,260 --> 00:07:08,880
Notice that the desired specs
have two parameters.

120
00:07:08,880 --> 00:07:14,140
One is the accuracy that you
want, and the other is the

121
00:07:14,140 --> 00:07:17,870
confidence with which
the accuracy

122
00:07:17,870 --> 00:07:21,040
is going to be achieved.

123
00:07:21,040 --> 00:07:24,420
Now 50,000 is a pretty
large number.

124
00:07:24,420 --> 00:07:27,890
If you notice the results of
polls, the way that they are

125
00:07:27,890 --> 00:07:31,690
presented in newspapers, they
usually tell you that there's

126
00:07:31,690 --> 00:07:36,570
an accuracy of plus or minus
three percentage points, not

127
00:07:36,570 --> 00:07:38,380
one percentage point.

128
00:07:38,380 --> 00:07:41,250
That helps things a little.

129
00:07:41,250 --> 00:07:43,670
It means that you can
do with a somewhat

130
00:07:43,670 --> 00:07:46,060
smaller sample size.
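
To see how much the looser accuracy helps, the same Chebyshev calculation can be repeated for three percentage points. This sketch assumes that the 5% requirement on the probability of a large error is kept unchanged, which the discussion above does not state explicitly; sample_size is a hypothetical helper.

```python
import math
from fractions import Fraction

def sample_size(eps_points, delta=Fraction(5, 100)):
    """Smallest n with (1/4) / (n * eps^2) <= delta; eps given in percentage points."""
    eps = Fraction(eps_points, 100)
    return math.ceil(1 / (4 * eps ** 2 * delta))

sizes = {points: sample_size(points) for points in (1, 3)}
print(sizes)  # {1: 50000, 3: 5556}
```

Since the bound depends on epsilon squared, tripling the allowed error cuts the required sample size by a factor of about nine.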

131
00:07:46,060 --> 00:07:48,440
And then, there's
another effect.

132
00:07:48,440 --> 00:07:52,790
Our calculation here was based
on this inequality, which is

133
00:07:52,790 --> 00:07:54,390
the Chebyshev inequality.

134
00:07:54,390 --> 00:07:58,650
But the Chebyshev inequality
is not that accurate.

135
00:07:58,650 --> 00:08:03,120
It turns out that if we use
more accurate estimates of

136
00:08:03,120 --> 00:08:07,620
this probability, we will find
that actually much smaller

137
00:08:07,620 --> 00:08:10,910
values of n will be enough
for our purposes.
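
One way to see how loose the Chebyshev bound is: compute the probability of a large error exactly from the binomial distribution. This is only one possible "more accurate estimate", not necessarily the one the lecture has in mind, and p = 1/2 is an assumed worst case for illustration.

```python
import math

def exact_tail(n, p, eps):
    """Exact P(|M_n - p| >= eps) when the yes-count is Binomial(n, p), 0 < p < 1."""
    def log_pmf(k):
        # Log of the binomial probability of exactly k yeses, via log-gamma.
        return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log(1 - p))
    # Sum the probability of the outcomes whose error is strictly below eps.
    inside = sum(math.exp(log_pmf(k)) for k in range(n + 1) if abs(k / n - p) < eps)
    return 1.0 - inside

n, eps, p = 10_000, 0.01, 0.5  # p = 1/2 is an assumption made for this sketch
tail = exact_tail(n, p, eps)
cheb = 0.25 / (n * eps ** 2)
print(f"exact tail: {tail:.4f}, Chebyshev bound: {cheb:.4f}")
```

For these numbers the exact probability comes out well below the Chebyshev bound of about 25%, which is why a smaller n can meet the same specs.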