Effective sample size
July 16, 2008
The notion of “effective sample size” is useful when comparing the information gathered using a certain sampling method to the information gathered (or that would be gathered) with a different, reference sampling method applied to the same population. Given a known level of uncertainty of an estimate (the one achieved with the sampling method and sample size actually used), the “effective sample size” is the size of the sample that would be needed to achieve the same level of uncertainty if the reference sampling method were used instead.
The uncertainty can be measured in principle in various ways, but is usually measured as the variance of the estimator.
The most common application is when comparing cluster sampling to simple random sampling (SRS). See, for example, this presentation. Cluster samples are used because it is often much less costly to collect data regarding geographically adjacent sampling units (people, households, etc.) than about geographically dispersed units. The drawback is that there are usually geographically dependent effects that make adjacent sampling units more alike than non-adjacent units. Therefore, averages of data collected from adjacent units have higher variability than averages of data from non-adjacent units. Thus a tradeoff between reducing cost per unit and reducing variability occurs. This tradeoff can be analyzed in terms of effective sample size. Here is a concrete example:
Suppose a survey is designed to estimate the winner in a two-way election race. The population is divided into geographically contiguous zones, each containing k adults. Interviewers show up at a randomly chosen zone and ask all the adults in the zone who their preferred candidate is (for simplicity, assume that all adults will show up to vote). The cost of traveling to a random zone is C, and having arrived at the zone, the cost of each interview is c.
If the split between the candidates in the population is p vs. 1 – p, then the variance in the population is p (1 – p). If there are no zone-dependent effects, i.e., if the preferences of the members of each zone are independent of each other, the variance of the zone averages across the population of zones, v(k), would be p (1 – p) / k. In the general case v(k) could be lower (unlikely) or higher – possibly as high as the variance in the population, p (1 – p), when all the adults in any zone have the same preference. A sample of n_c / k zones (i.e., requiring n_c interviews) generates an estimate with variance k v(k) / n_c, as compared to variance p (1 – p) / n of an estimate generated by an SRS of size n. Equating the two variances and solving for n gives the effective sample size of the cluster sample: n_c p (1 – p) / (k v(k)).
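As a rough illustration of the formula above, the following Python sketch computes the effective sample size of a cluster sample for the two extreme cases mentioned in the text. The function name and the example values (p = 0.5, k = 10, 1000 interviews) are made up for illustration and are not part of the original example.

```python
def cluster_effective_sample_size(n_c, k, p, v_k):
    """Effective sample size of a cluster sample, relative to SRS.

    n_c : total number of interviews in the cluster sample
    k   : number of adults per zone
    p   : share of one candidate in the population
    v_k : variance of the zone averages across zones, v(k)
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if v_k <= 0:
        raise ValueError("v(k) must be positive")
    # n_c * p(1 - p) / (k * v(k)), as in the text
    return n_c * p * (1 - p) / (k * v_k)


# Illustrative values only (assumptions, not from the text).
p, k, n_c = 0.5, 10, 1000

# No zone effects: v(k) = p(1 - p) / k, so nothing is lost by clustering.
independent = cluster_effective_sample_size(n_c, k, p, p * (1 - p) / k)

# Everyone in a zone agrees: v(k) = p(1 - p), so each zone counts as one interview.
identical = cluster_effective_sample_size(n_c, k, p, p * (1 - p))

print(independent, identical)
```

In the first case the effective sample size equals the number of interviews; in the second it equals the number of zones visited, n_c / k.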
The cost of the SRS is n (C + c), since each interview requires a separate trip, while that of the clustered sample is n_c c + n_c C / k. If the total cost is constrained to T, then n_c = T k / (k c + C), and the optimal cluster size is the k that maximizes the effective sample size
T / (k c + C) · p (1 – p) / v(k).
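The text leaves v(k) unspecified, so finding the optimal k requires some model for it. The sketch below assumes, purely for illustration, that adults within a zone share a common intraclass correlation rho, which gives v(k) = p (1 – p) (1 + (k – 1) rho) / k. That model, the function names and the example costs are assumptions, not part of the original example.

```python
def v_icc(k, p, rho):
    """Assumed model for v(k): common intraclass correlation rho within a zone.

    rho = 0 reproduces the independent case p(1 - p) / k;
    rho = 1 reproduces the fully dependent case p(1 - p).
    """
    return p * (1 - p) * (1 + (k - 1) * rho) / k


def effective_sample_size(k, T, C, c, p, rho):
    """Effective sample size for budget T: T / (k c + C) * p(1 - p) / v(k)."""
    return T / (k * c + C) * p * (1 - p) / v_icc(k, p, rho)


def optimal_cluster_size(T, C, c, p, rho, k_max=1000):
    """Brute-force search over k = 1..k_max for the largest effective sample size."""
    return max(
        range(1, k_max + 1),
        key=lambda k: effective_sample_size(k, T, C, c, p, rho),
    )


# Illustrative costs (assumptions): travel is 16 times as costly as an interview.
best_k = optimal_cluster_size(T=1000, C=16, c=1, p=0.5, rho=0.2)
print(best_k, effective_sample_size(best_k, T=1000, C=16, c=1, p=0.5, rho=0.2))
```

Under this assumed model the budget T and the split p do not affect the location of the optimum, only its value; the best k grows with the ratio C / c and shrinks as rho grows. With a different model for v(k) the search itself is unchanged, only `v_icc` would need to be replaced.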
Another interesting relevant scenario is that of stratified sampling: If it is known that different segments of the population are, on average, different in terms of their response variable, and if the proportions of the different segments in the population are known, this information may be used to achieve estimates that are more efficient than those of a simple random sample. Thus, in such a case it may be possible to achieve an effective sample size which is larger than the actual sample size. Here is a concrete example:
Suppose we wish to estimate the mean height of a certain population. Suppose that
- the proportion of women in the population is known to be p,
- the mean height of women in the population is h_w and that of men is h_m (both are unknown),
- the standard deviation of the heights of men, s, is equal to the standard deviation of the heights of women.
The variance in the population is s² + (h_m – h_w)² p (1 – p), and so the mean of a simple random sample of n people has a variance of (s² + (h_m – h_w)² p (1 – p)) / n. A stratified sample which is split into αn women and (1 – α)n men, on the other hand, produces an estimate for the mean height of women which has a variance of s² / (αn) and an estimate for the mean height of men which has a variance of s² / ((1 – α) n). Weighting those estimates by the known population proportions, p and 1 – p, produces an estimate for the mean height of the population which has variance
s² / n · (p² / α + (1 – p)² / (1 – α)).
The effective sample size of such a sample is n’ such that
(s² + (h_m – h_w)² p (1 – p)) / n’ = s² / n · (p² / α + (1 – p)² / (1 – α)),
i.e.,
n’ = n (1 + z) / q,
where z is the ratio of the inter-group variance to the intra-group variance:
z = (h_m – h_w)² p (1 – p) / s²,
and q is the group sample mis-match factor:
q = p² / α + (1 – p)² / (1 – α).
The effective sample size is maximized when q is minimized, i.e., when α = p. In that optimal case q = 1 and n’ = n (1 + z), which, of course, is no less than n and could be significantly higher. When the split of the sample between the groups is mismatched, however, the efficiency can drop precipitously, and the effective sample size, n’, can be much smaller than n.
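To make the effect of a mismatched split concrete, here is a small Python sketch of the formula n’ = n (1 + z) / q. The function name and the numbers (heights in centimeters, a common standard deviation of 7, equal shares of women and men) are invented for illustration only.

```python
def stratified_effective_sample_size(n, p, alpha, s, h_m, h_w):
    """Effective sample size n' = n (1 + z) / q of a stratified sample.

    n     : actual sample size
    p     : known proportion of women in the population
    alpha : proportion of women in the sample
    s     : common standard deviation of heights within each group
    h_m   : mean height of men
    h_w   : mean height of women
    """
    if not (0 < p < 1 and 0 < alpha < 1):
        raise ValueError("p and alpha must be strictly between 0 and 1")
    # Ratio of inter-group to intra-group variance.
    z = (h_m - h_w) ** 2 * p * (1 - p) / s ** 2
    # Group sample mismatch factor; equals 1 when alpha == p.
    q = p ** 2 / alpha + (1 - p) ** 2 / (1 - alpha)
    return n * (1 + z) / q


# Illustrative values (assumptions): with these numbers z = 1.
matched = stratified_effective_sample_size(100, p=0.5, alpha=0.5, s=7, h_m=176, h_w=162)
mismatched = stratified_effective_sample_size(100, p=0.5, alpha=0.05, s=7, h_m=176, h_w=162)
print(matched, mismatched)
```

With these made-up numbers, a sample of 100 split to match the population is worth about 200 simple random draws, while a sample of 100 containing only 5% women is worth about 38.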