
# Laws of Statistics

John Smith
Ranch Hand
Posts: 2937
I was watching the elections in progress yesterday. Very soon after I saw exit polls indicating 55%/45% in favor of the recall with only 19% of the precincts reporting, and Arnold with a 600,000 lead over Cruz, Arnold was proclaimed the new Governor of California by the networks, and Bustamante conceded the election.
Well, I thought I understood the law of large numbers, but apparently not. With only 19% of the votes counted, and a 10% difference between the votes, I thought there was a significant chance that the final result might be different. Something like 0.05 probability, just off the top of my head. Apparently, that probability is much lower, and that's what motivated the proclamations of victory and defeat.
Can anyone demonstrate how such a probability can be calculated, statistically? Take the previous paragraph as a set of assumptions.

Paul Stevens
Ranch Hand
Posts: 2823
I couldn't, but I bet the people at VNS (Voter News Service) could. That is the outfit that projects these things. They use key precincts and exit polls to get the trend. They probably use historical data to get what they think the final will be. They aren't wrong often.
The key is that they don't just use the actual hard numbers reported.

HS Thomas
Ranch Hand
Posts: 3404
I couldn't but I bet the people at VNS (Voter News Service) could

With luck, they may even have a Peter Snow: last seen on May 2001 election coverage
Usually famed for his wacky graphics, slick presentation and 21st century swing-o-meter, Snow is said to be incandescent with rage at having to do it all on a shoestring, as dictated by the same bosses who got rid of the famed BBC second breakfast.
It will be back to the Sixties for Snow's analysis, in that he will be demonstrating the potential swing of the electorate using a cardboard cut-out from the back of a Kellogg's cornflake packet, and a pendulum made from a pointed stick discarded by a mortarboard-wearing Oxford don.
Instead of a state-of-the-art computer simulation of what the House of Commons would look like on current predictions, Snow was to be equipped with a matchstick model of the Palace of Westminster, which Newsnight studio cleaner Eric Gates, had been working on for some 18 months.
The idea was that Snow would open this up and shovel in handfuls of red, blue or yellow soldiers garnered from sets of Risk, that were lying around the Wood Lane studios, in proportion to the standings of the major parties. Snow could not control his rage saying, 'You dragged me back from hi-tech heaven and staring down Philippa Forrester's cleavage, at Tomorrow's World for this pile of crap!'

regards

Jim Yingst
Wanderer
Sheriff
Posts: 18671
I'm not sure if the exit poll data had anything really to do with the 19%. "Exit poll" probably means that news services asked some number of people who had just voted, what their vote was. I doubt they came anywhere close to polling 19% of the voters. The 19%, and the 600000 lead, referred to actual votes which had been cast and reported. I'd be somewhat skeptical of those results when it's just 19%, as it's possible that early-reporting precincts have significantly different demographics than late-reporting precincts. E.g. maybe the big cities are the last to report in, or maybe the low-tech backwater areas are last; dunno. Of course in national elections there's the issue of what time zone each state is in.
The exit polls are probably based on much less than 19% of the vote, but nonetheless can be done with a reasonably high degree of accuracy. If you can poll a group of, say, 10000 voters, chosen purely at random from amongst the people who are actually voting (regardless of how big that group is), you can predict the overall voting results within about 2%, with 95% confidence. The key is that the standard error of the mean (SEM) is given by
SEM = s / sqrt(n)
Where s is standard deviation of the sample, and n is the sample size. If 0 represents "no" and 1 represents "yes", the standard deviation of votes will be somewhere less than 1. Setting n = 10000 would give SEM of .01. For a normal distribution, the population mean is within 1.96*SEM of the sample mean, 95% of the time. So if you've got exit polls saying 55% are voting for the recall, you're 95% sure the final number will be between 53% and 57%. That's probably close enough to call a winner. If not, well, poll more people. I don't know how many people they actually poll in these exit polls, but it's probably a lot less than 19% of the voters.
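As a rough check on the numbers above, here is a minimal Python sketch of the SEM calculation. It assumes a purely random sample and uses s = sqrt(p * (1 - p)) for 0/1 votes, which is tighter than the "less than 1" bound used in the post; the function name `margin_of_error` is only illustrative.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a 'yes' share p among n random respondents."""
    s = math.sqrt(p * (1 - p))   # standard deviation of 0/1 votes (assumed formula)
    sem = s / math.sqrt(n)       # SEM = s / sqrt(n)
    return z * sem

p, n = 0.55, 10_000
moe = margin_of_error(p, n)
print(f"s   = {math.sqrt(p * (1 - p)):.4f}")
print(f"SEM = {math.sqrt(p * (1 - p) / n):.5f}")
print(f"95% interval: {p - moe:.4f} to {p + moe:.4f}")
```

With s close to 0.5 rather than 1, the interval comes out at roughly plus or minus 1 percentage point, so the 53% to 57% range quoted above is on the conservative side.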
The key here though is that this assumes that the people in your exit poll represent a suitably random sample of all the voters. In practice, they tend to represent the people who were easy to get hold of and convince to reveal who they voted for. That's where all the extra considerations Paul S mentioned come into play. The pollsters try to detect and correct for discrepancies between the demographics of their sample, and the demographics of the whole voting population. How they do this, I could only guess.

John Smith
Ranch Hand
Posts: 2937
SEM = s / sqrt(n)
Where s is standard deviation of the sample, and n is the sample size. If 0 represents "no" and 1 represents "yes", the standard deviation of votes will be somewhere less than 1. Setting n = 10000 would give SEM of .01. For a normal distribution, the population mean is within 1.96*SEM of the sample mean, 95% of the time. So if you've got exit polls saying 55% are voting for the recall, you're 95% sure the final number will be between 53% and 57%.

I am not sure I follow. Why isn't the population size anywhere in these calculations? It looks like if you sample 10,000 people, then you get a 95% confidence in the result, no matter how large the total population is. In other words, suppose you are a physicist sampling the elementary particles, and the total number of particles in the Universe is some huge number. Now, if you sample 10,000 particles, and get some results, do you still have the same 95% confidence in the result, assuming the normal distribution of particles across everywhere in the Universe?
Shouldn't the confidence be proportional to the percentage of the total population sampled, rather than just to the absolute number of samples?

Jim Yingst
Wanderer
Sheriff
Posts: 18671
Shouldn't the confidence be proportional to the percentage of the total population sampled, rather than just to the absolute number of samples?
I agree that it does seem like there ought to be some role for the ratio of sample size to population size here, but "proportional"? Meaning what, if I get 95% confidence by polling 10% of the populace, I can poll 20% and see the confidence increase to 190%? No, that doesn't work. Perhaps we should work from the other end. If I poll 100% of the populace, I'm 100% confident my results reflect the population mean. (Well assuming they didn't lie or misunderstand the poll, that's another discussion.) So if I poll 10% of the population, can I only get 10% confidence in my results, ever? I'd think that if I polled 6000000 people in California, I could make some sort of prediction with better than 10% confidence.
Truth is, I only vaguely remember how this works, and don't have access to good references at the moment, and would have to study a fair bit anyway to relearn this stuff now. But from my vague recollections...
In other words, suppose you are a physicist sampling the elementary particles, and the total number of particles in the Universe is some huge number. Now, if you sample 10,000 particles, and get some results, do you still have the same 95% confidence in the result, assuming the normal distribution of particles across everywhere in the Universe?
Yes. Assuming that you were able to draw your 10000 samples randomly from all particles in the universe. That's the tough part - those pesky neutrinos don't seem to respond to polls very often. And we haven't sent many pollsters to interesting environments like inside a star or black hole. And also the s / sqrt(n) result applies to statements about the mean values, not individual values. So if a physicist says there's a 95% chance that the average mass of all particles in the universe is between x - dx and x + dx, that doesn't rule out the possibility that there are individual rare particles with mass 10^6 * x - it just means that those are (95% likely) rare enough that they're not going to affect the average mass by more than dx.
For the original polling problem - I think there's probably a more complex formula possible for the exact probability distribution of the population mean based on the sample mean, and this complex formula probably incorporates population size as well as sample size. I suspect this formula is based off the binomial distribution or something similar. But there's a large range of situations in which the ratio of sample size to population size just isn't particularly significant to the results, and it's possible to make some simplifications that lead to the s / sqrt(n) formula. Offhand this probably requires the assumption that sample size is significantly larger than 0, and the population size is significantly larger than the sample size. Obviously as sample size approaches population size, confidence will approach 100% (or the SEM will approach 0, depending which one you keep constant while looking at the other). And you can't have a sample size larger than the population after all. So as sample size approaches population size, the s / sqrt(n) approximation becomes less and less valid. But if we stay away from that region and stick to 0 << sample size << population size, the s / sqrt(n) formula holds up pretty well. The exact formula which incorporates population size as well might give slightly different values, but in general it's not worth the added effort to evaluate that formula; the difference is not that great.
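One way to see that population size barely matters when 0 << sample size << population size is a small simulation. The sketch below is only an illustration under invented settings (a 55% "yes" share, polls of 1000, and three made-up population sizes); `spread_of_sample_share` is a hypothetical helper, not anything from a polling library.

```python
import random
import statistics

def spread_of_sample_share(population_size, sample_size=1000,
                           true_share=0.55, trials=500, seed=0):
    """Standard deviation of the polled 'yes' share over repeated random polls."""
    rng = random.Random(seed)
    # Voters are numbered 0..population_size-1; the first yes_count vote "yes".
    yes_count = round(true_share * population_size)
    shares = []
    for _ in range(trials):
        # Draw a poll without replacement from the whole population.
        polled = rng.sample(range(population_size), sample_size)
        shares.append(sum(v < yes_count for v in polled) / sample_size)
    return statistics.stdev(shares)

results = {N: spread_of_sample_share(N) for N in (20_000, 200_000, 2_000_000)}
for N, sd in results.items():
    print(f"N = {N:>9,}: spread of sample share = {sd:.4f}")
```

All three spreads should land near sqrt(0.55 * 0.45 / 1000), about 0.0157, even though the population grows a hundredfold between the first and last run.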
Now there may well be significant errors in what I've said here; I'm just going off vague memories. (Or maybe my memory is fine, but my understanding was never that great to begin with.) More informed posts would be welcome. But that's the best I have for now. Cheers...
[ October 09, 2003: Message edited by: Jim Yingst ]

Jim Yingst
Wanderer
Sheriff
Posts: 18671
Sadly, the Tucson Public Library does not seem to want to carry books with real math in them. But I did find a slightly more elaborate formula for the variance of the mean. If N is the population size and n is the sample size, the variance of the mean for a measurement is given by:
V = (N - n)/N * S^2 / n
where S^2 is the sample variance, S^2 = sum((x(i) - m)^2) / (n - 1)
From this, the standard error for 95% confidence is given by:
1.96 * sqrt [ (1 - n/N) * S^2 / n ]
It's easy to see that for most real-world polling applications, (1 - n/N) is essentially 1. But if n -> N the error -> 0, much as we'd expect.
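To make the effect of the (1 - n/N) term concrete, here is a short Python sketch comparing the two formulas. The electorate size of 8,000,000 and the variance of 0.25 are assumptions picked for illustration, and the function names are hypothetical.

```python
import math

def margin_with_fpc(sample_var, n, N, z=1.96):
    """z * sqrt((1 - n/N) * S^2 / n), the formula quoted above."""
    return z * math.sqrt((1 - n / N) * sample_var / n)

def margin_simple(sample_var, n, z=1.96):
    """z * s / sqrt(n), which ignores the population size."""
    return z * math.sqrt(sample_var / n)

N = 8_000_000   # hypothetical electorate size
S2 = 0.25       # variance of 0/1 votes near an even split (assumed)
for n in (10_000, 800_000, 4_000_000, 8_000_000):
    print(f"n/N = {n / N:7.4%}  simple = {margin_simple(S2, n):.6f}  "
          f"with (1 - n/N) = {margin_with_fpc(S2, n, N):.6f}")
```

For a poll of 10,000 the two values agree to well under a tenth of a percent, and the corrected margin only drops to 0 in the last row, where the sample is the whole population.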
[ October 12, 2003: Message edited by: Jim Yingst ]

John Smith
Ranch Hand
Posts: 2937
where S^2 is the sample variance, S^2 = sum((x(i) - m)^2) / (n - 1)
From this, the standard error for 95% confidence is given by:
1.96 * sqrt [ (1 - n/N) * S^2 / n ]

Thanks, Jim. I was going to do my own research, but you denied me a chance.
Ok, let's apply this to the recall/no-recall situation after 19% of the votes were counted and the results were 55%/45% in favor of the recall.
I'll go with your definition of 1 as a "yes" and 0 as a "no" on the recall. A series of 55 ones and 45 zeros has a sample variance of 0.25. The total number of votes on the recall question was 7,974,834, and 19% of that is 1,515,218. So N = 7,974,834 and n = 1,515,218.
Then the standard error for 95% confidence is
1.96 * SQRT( (1 - 1515218 / 7974834 ) * 0.25 / 1515218 ) = 0.0007165
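The arithmetic above can be reproduced in a few lines of Python. This is just a check of the numbers in this post, using the same N, n and variance, and still assuming the 19% behaves like a random sample.

```python
import math

N = 7_974_834   # total votes on the recall question
n = 1_515_218   # 19% of N
S2 = 0.25       # sample variance used in the post

# 1.96 * sqrt((1 - n/N) * S^2 / n)
margin = 1.96 * math.sqrt((1 - n / N) * S2 / n)
print(f"margin as a proportion     : {margin:.7f}")
print(f"margin in percentage points: {100 * margin:.3f}")
```

Note that the result is a proportion on the 0 to 1 scale, so it has to be multiplied by 100 before it is compared with figures like 55%.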
So, what does 0.0007165 tell us?
[ October 12, 2003: Message edited by: Eugene Kononov ]

Jim Yingst
Wanderer
Sheriff
Posts: 18671
Thanks, Jim. I was going to do my own research, but you denied me a chance.
Not at all. For example, I've completely glossed over the original binomial distribution for this data, and how we justify the normal approximation, and how we use that to derive the SEM formula in which an SEM of 1.96 or less gives confidence of 95%. So you're welcome to research that part, and post an explanation.
Ok, let's apply this to the recall/no-recall situation after 19% of the votes were counted and the results were 55%/45% in favor of the recall.
Again, I'm not at all convinced that's what the numbers were, in the sense that the 45/55% most likely came from exit poll data which probably involved considerably less than 19% of the population, and the 19% referred to actual official votes. Why not use the actual official votes in place of the much smaller polled sample? Because the 19% do not represent a random selection - they represent those particular precincts which happened to get their results tabulated early. This could be because they are in less-populated regions or something, which would mean they're demographically skewed towards one candidate or another. The exit polls may use a smaller sample, but more work has gone into making sure it's a mixed and representative sample.
But for the sake of argument, I'll pretend that the 45/55% did come from polling 19% of the population.
So, what does 0.0007165 tell us? That the final result would be 55 to 45, with a 95% confidence that the lowest number of the "yes" votes would be at least 54.9993%?
No, 54.93%.
That seems like an awfully low estimated error.
Well it's a bit better now...
In other words, if the intermediate results were 50.0007% in favor of the recall, we would still have a 95% confidence that Davis would be recalled?
If I correct this to 50.07%: Yes. Assuming that our other assumptions were correct - most notably, assuming that the 19% was really selected completely at random from all who voted. In practice there's probably some error in this assumption, though I have little idea how to quantify it. But as an example - what if there are 100 precincts (pretend there's equal size for each), and the 19% comes from the fact that 19 precincts have reported their results, and we have complete results for those 19 precincts, and no information about the other 81 precincts. If there are significant demographic differences between precincts, then we don't really have such a random sample anymore. It becomes similar to a case where we've polled a sample size of 19 rather than 1.5 million. The variance of the mean can now get substantially higher than what we estimated based on 1.5 million random samples.
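The 100-precinct example can be sketched as a small simulation. Everything numeric here is invented for illustration: the precinct-level "yes" shares are drawn around 55% with a spread of 5 percentage points, and the 19 reporting precincts are picked at random (real early-reporting precincts may be skewed, which would make things worse).

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical: 100 equal-size precincts whose "yes" shares vary around 55%
# with a standard deviation of 5 percentage points (an invented figure).
precinct_shares = [rng.gauss(0.55, 0.05) for _ in range(100)]

def estimate_from_reported(k=19):
    """Average 'yes' share over k precincts chosen at random to report first."""
    return statistics.mean(rng.sample(precinct_shares, k))

estimates = [estimate_from_reported() for _ in range(2000)]
cluster_spread = statistics.stdev(estimates)

# Spread expected if the same 19% of votes were individual random voters:
# the 95% margin from the thread divided by 1.96.
random_spread = 0.0007165 / 1.96

print(f"spread with whole precincts: {cluster_spread:.4f}")
print(f"spread with random voters  : {random_spread:.5f}")
```

Under these made-up settings the estimate built from whole precincts is far noisier than the random-voter figure, which is the sense in which the effective sample size behaves more like 19 than 1.5 million.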
There are plenty of other possible sources of error. A big problem in most polls is getting people to actually respond (after you've gone to the trouble of randomly selecting them.) What if people of one party are more likely to say "none of your business" to a pollster than are people of another party? How do you correct for something like that in the data? If your polling data shows 50.07% in favor of recall, I'm betting that these other errors can add up to notably more than .07%. So in this case the polling organization would be wise not to make any predictions yet - uncertainty is too great.
[ October 12, 2003: Message edited by: Jim Yingst ]

John Smith
Ranch Hand
Posts: 2937
Sorry, Jim, I deleted a few last paragraphs of my post, as I realized that the calculated error doesn't directly translate into the percentages, so I was going to correct it, but you posted in the meanwhile. I'll take some time to recalc it to see if it makes sense.

HS Thomas
Ranch Hand
Posts: 3404
Math of Elections
(An unexpected thing happened in the California election. The winner got more votes than the loser.)
regards

John Smith
Ranch Hand
Posts: 2937
I was too lazy to do the research, so I posted the question to the stats group here.
I got two responses (more would come, I think), but there is already a surprise. Here is what one fellow stated:
"This is a common misconception. In fact, with binomial (yes/no)
data, the margin of error and the p-value depend on the absolute
number in your sample, not on the sample size as a fraction of
population size."

Jim Yingst
Wanderer
Sheriff
Posts: 18671
What, you didn't believe me when I told you, but you trust some random person on the net?

John Smith
Ranch Hand
Posts: 2937
What, you didn't believe me when I told you, but you trust some random person on the net?
Well, the last formula that you showed me was using both the population size and the sample size, and the context of the formula implies that the ratio of the two is what counts. This is in contradiction to the random guy from the 'net, who implies that the population size doesn't even matter. So, I am still not convinced. No offence to you, Jim - I am just looking for a 95% confidence.

Jim Yingst
Wanderer
Sheriff
Posts: 18671
Doubter! It seems J. Random Guy is telling you the same thing I did initially, that n/N doesn't really matter. When I did more research I found the more precise formula with the (1 - n/N) term, telling us that n/N does have an effect (though it proves to be generally ignorable). This formula has the nice property that if n = N, the sample error (in measuring the population mean) becomes 0 - which is what we'd expect at this point; the sample is the population, so there should be no error. So clearly any formula that doesn't mention N will be invalid for n = N (and nearby). However in practice we generally assume n << N (we never poll more than a small fraction of the population), so (1 - n/N) is approximately 1, leading to the standard formula which simply ignores N. This would seem to reconcile the two views given here, wouldn't it? If you push J. Random Guy on it, he'll probably concede (or someone else will) that there's a (1 - n/N) term that is generally ignored. There may be even more precise formulas available with additional terms which are even more negligible. It all depends how much precision is required.

John Smith
Ranch Hand
Posts: 2937
Well, I've got to get to my statistics books, I guess there is no way out of this. Now they are talking about a 2x2 contingency chi-squared matrix, and this is beyond my mental capacity at the moment.