This document presents three statistical formulas that give the margin of error when estimating a proportion, and a piece of code that empirically verifies the formulas. It takes an unusual, programming-based approach to discussing statistics (in contrast to pure maths).
Basic proportion estimation consists of sampling a given population (i.e. selecting random elements of this population) and performing a binary test on each of them. The estimated proportion is the ratio of elements that pass the test.
For instance, if the elements are citizens and the test is a question like "Will you vote for the galactic emperor?", then one estimates the proportion of supporters of the galactic emperor for the next election. Another example, in the context of empirical software engineering: the elements are the classes of an object-oriented program, the test is "With intensive software testing, can we find bugs in this class?" and the estimate is the proportion of buggy classes.
The main question for experimenters is: "What is the margin of error of my estimated proportion?". Let n be the number of tested elements, p the estimated proportion and q the real proportion. Statistics gives three formulas. Explaining the theory behind the formulas is beyond the scope of this document.
First formula: The real proportion q lies in the following interval
`q in [p - X/sqrt(n), p + X/sqrt(n)]`, where X depends on the confidence level.
For instance, at a 95% confidence level, X = 0.98 and `q in [p - 0.98/sqrt(n), p + 0.98/sqrt(n)]`
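As a rough illustration, the first formula can be computed as sketched below. This is not the code of this page: the names `conservativeInterval` and `X_BY_CONFIDENCE_PERCENT` are hypothetical. The 95% value (0.98) comes from the example above. The 90% and 99% values are an assumption: they are taken as half of the usual two-sided Z-scores (1.645 and 2.576), in the same way that 0.98 is half of 1.96.

```javascript
// Hypothetical lookup table of X per confidence level (in percent).
// Assumption: X is half of the usual two-sided Z-score (0.98 = 1.96 / 2).
const X_BY_CONFIDENCE_PERCENT = { 90: 0.8225, 95: 0.98, 99: 1.288 };

// First formula: q in [p - X/sqrt(n), p + X/sqrt(n)].
function conservativeInterval(p, n, confidencePercent = 95) {
  const x = X_BY_CONFIDENCE_PERCENT[confidencePercent];
  if (x === undefined) {
    throw new Error("unsupported confidence level: " + confidencePercent);
  }
  const margin = x / Math.sqrt(n);
  return { low: p - margin, high: p + margin, margin: margin };
}

// Example: 520 supporters among 1000 polled citizens.
const poll = conservativeInterval(520 / 1000, 1000);
console.log(poll.margin.toFixed(3)); // prints "0.031"
```

Note that the margin does not depend on p at all: it only shrinks with the square root of the sample size.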
Second formula: A second formula states that p (and not q) is in the following interval
`p in [q - Z*sqrt((q*(1-q))/n), q + Z*sqrt((q*(1-q))/n)]`, where Z (see Z-score in your statistics textbook) depends on the confidence level. The problem with this formula is that we never know q, so it cannot answer the experimenter's question "What is the margin of error of my estimated proportion?". It can only answer questions like "Is it possible that the real proportion is 40%?", which are much narrower in scope.
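The kind of question the second formula can answer may be sketched as follows. The function names `expectedRangeOfEstimate` and `isPlausible` are hypothetical, and Z = 1.96 is assumed to be the Z-score for a 95% confidence level.

```javascript
// Second formula: interval in which the estimate p is expected to fall
// if the real proportion were q (z = 1.96 assumed for 95% confidence).
function expectedRangeOfEstimate(q, n, z = 1.96) {
  const margin = z * Math.sqrt((q * (1 - q)) / n);
  return { low: q - margin, high: q + margin };
}

// "Is it possible that the real proportion is hypotheticalQ?"
// True if the observed estimate p lies in the expected range.
function isPlausible(hypotheticalQ, p, n, z = 1.96) {
  const range = expectedRangeOfEstimate(hypotheticalQ, n, z);
  return p >= range.low && p <= range.high;
}

// With 200 tested elements, is a real proportion of 40% possible?
console.log(isPlausible(0.40, 0.45, 200)); // prints true
console.log(isPlausible(0.40, 0.50, 200)); // prints false
```

Each call checks a single candidate value of q, which is why this formula alone does not give a margin of error around p.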
Third formula: It states that, if the sample size is big enough, p is a good enough approximation of q to replace q by p in the second formula. As a result, q can be considered to lie in the following interval
`q in [p - Z*sqrt((p*(1-p))/n), p + Z*sqrt((p*(1-p))/n)]`
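A minimal sketch of the third formula, again with a hypothetical function name (`approximateInterval`) and Z = 1.96 assumed for a 95% confidence level:

```javascript
// Third formula: q in [p - Z*sqrt(p*(1-p)/n), p + Z*sqrt(p*(1-p)/n)].
function approximateInterval(p, n, z = 1.96) {
  const margin = z * Math.sqrt((p * (1 - p)) / n);
  return { low: p - margin, high: p + margin, margin: margin };
}

// Example: an estimated proportion of 30% on 1000 tested elements.
const estimate = approximateInterval(0.3, 1000);
console.log(estimate.margin.toFixed(3)); // prints "0.028"
```

For the same n = 1000, the first formula gives a margin of about 0.031. The third formula is tighter whenever p is not 0.5, because p*(1-p) is then smaller than 0.25.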
The code above simulates an experiment where q is known so that we can validate the different formulas. Running this code (by clicking on "execute") shows that the first two formulas hold.
It also shows that the third formula is very good, i.e. that the approximation is acceptable. For instance, at a theoretical 95% confidence level, the empirical confidence is 94.4%. Of course, if this code or the implementation of the uniform distribution (Math.random) contains bugs, nothing holds at all.
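For readers who want to reproduce this kind of check outside this page, here is a minimal sketch of such a simulation. It is not the code referenced above. All function names are hypothetical, the parameters (q = 0.3, n = 1000, 2000 experiments) are arbitrary choices, and Math.random is swapped for a small seedable generator (mulberry32) only so that a run can be repeated.

```javascript
// Small seedable generator (mulberry32), returning numbers in [0, 1).
// Used instead of Math.random only to make the run reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One experiment: test n random elements, each passing with probability q,
// and return the estimated proportion p.
function estimateProportion(q, n, random) {
  let passed = 0;
  for (let i = 0; i < n; i++) {
    if (random() < q) passed++;
  }
  return passed / n;
}

// Fraction of experiments in which the interval of the third formula
// actually contains the known real proportion q.
function empiricalCoverage(q, n, experiments, random, z = 1.96) {
  let hits = 0;
  for (let i = 0; i < experiments; i++) {
    const p = estimateProportion(q, n, random);
    const margin = z * Math.sqrt((p * (1 - p)) / n);
    if (q >= p - margin && q <= p + margin) hits++;
  }
  return hits / experiments;
}

// Expected to be close to 0.95; the exact value depends on the seed.
const coverage = empiricalCoverage(0.3, 1000, 2000, mulberry32(42));
console.log(coverage);
```

Replacing the margin computation with `0.98 / Math.sqrt(n)` gives the same check for the first formula.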
Feedback welcome :-)
--Martin