IFHS – Effective sample size
January 20, 2008
An explanation of the concept of effective sample size is here. This post applies this concept to a particular study.
Extrapolating from a sample taken in a reference area containing a relatively small number of deaths in order to estimate the number of deaths in the area containing the bulk of the deaths in Iraq is problematic even if we assume that the extrapolation factor is known precisely.
The reference area is the “three provinces that contributed more than 4% each to the total number of deaths reported for the period from March 2003 through June 2006”. By the design of the sample, there are no more than 180 clusters in these provinces (each province was sampled with 54 clusters, except for Nineveh, which was sampled with 72 clusters). Table 3 of the supplementary material of the paper shows the sample sizes in each governorate. Taking the largest 3 samples in the “High mortality governorates” section (Nineveh, Babylon and Basra) gives an upper bound on the total sample in the reference area of 14,891 people. Multiplying that bound by the mortality rate in the area (Table 2 of the supplementary material) – 0.83 deaths per 1000 person-years – and by the length of the period (about 3.33 years) gives an upper bound for the number of violent deaths in the sample in the reference area of 14,891 x 0.83 / 1000 x 3.33 = 41.2 (approximately).
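The arithmetic above can be checked with a short Python sketch. The figures are the ones quoted in this post; the variable and function names are illustrative, and the period length of 40 months (March 2003 through June 2006) is an assumption about where the factor 3.33 comes from.

```python
# Figures as quoted in the post (IFHS supplementary Tables 2 and 3).
CLUSTERS_PER_PROVINCE = [54, 54, 72]   # two provinces with 54 clusters, Nineveh with 72
SAMPLE_UPPER_BOUND = 14_891            # people in the three largest "high mortality" samples
DEATH_RATE_PER_1000_PY = 0.83          # deaths per 1000 person-years
PERIOD_YEARS = 40 / 12                 # assumption: March 2003 through June 2006 = 40 months


def expected_deaths(people, rate_per_1000_py, years):
    """Expected deaths = people x (rate per person-year) x years of exposure."""
    return people * rate_per_1000_py / 1000 * years


max_clusters = sum(CLUSTERS_PER_PROVINCE)
deaths_upper_bound = expected_deaths(
    SAMPLE_UPPER_BOUND, DEATH_RATE_PER_1000_PY, PERIOD_YEARS
)

print(f"clusters in reference area: at most {max_clusters}")
print(f"violent deaths in reference sample: at most about {deaths_upper_bound:.1f}")
```

The result is a bound of roughly 41 deaths, which is the "about 40" used in the rest of the post.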
That is, over 70% of the deaths in the IFHS estimate – those in Baghdad, Anbar and the three reference governorates – are based on a sample containing about 40 deaths. The estimate of the number of deaths in those areas is generated simply by multiplying those 40 deaths by a factor and thus any uncertainty in the number 40 is directly translated into uncertainty in the estimator of the total number of deaths in the areas.
Reference to the large total number of clusters in the IFHS (almost 1000 clusters visited) is therefore misleading. The determining factor of the estimate is the data collected in a much smaller number of clusters – generating a small number of recorded deaths. The uncertainty in the estimate is correspondingly large.
An approximate calculation – which can be made without using detailed counts – shows that the confidence interval reported by the IFHS authors for the rate of violent death (95% CI, 0.81 to 1.50 deaths per 1000 person-years) is indeed as large as would be implied by the small effective size of the sample.
Sampling theory allows us to calculate a lower bound for the sampling variation in the death count in the reference governorates. The count – about 40 deaths – is the sum of many independent variables, namely the death counts from the 180 or so individual clusters. Obviously, in many of those clusters the death count was zero. In others it was one or more. The theory shows that the least variation in the sampling would occur if there were 40 clusters with one death each and zero deaths in all others.
In that case, the count would behave approximately like a Binomial variable with parameters n=180 and p=40/180=0.22. The standard deviation of such a variable is sqrt(180 x 0.22 x (1 – 0.22)), or about 5.6, which is about 14% of the estimate. The ratio of the upper point of a 95% CI (40 + 1.96 x 5.6) to the lower point (40 – 1.96 x 5.6) is about 1.75 – similar to the ratio of the endpoints of the CI reported by the authors (1.50 / 0.81 = 1.85).
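The Binomial calculation can be reproduced with the sketch below. It uses the same normal approximation (the 1.96 multiplier) as the text, takes n=180 and 40 deaths as given above, and uses variable names chosen only for illustration.

```python
import math

N_CLUSTERS = 180             # upper bound on clusters in the reference governorates
DEATHS = 40                  # approximate death count in the reference sample
REPORTED_CI = (0.81, 1.50)   # IFHS 95% CI, violent deaths per 1000 person-years

# Least-variation scenario: 40 clusters with one death each, zero elsewhere,
# so the count is treated as Binomial(n, p) with p = 40/180.
p = DEATHS / N_CLUSTERS
sd = math.sqrt(N_CLUSTERS * p * (1 - p))
relative_sd = sd / DEATHS

# Normal-approximation 95% interval for the count, and the ratio of its endpoints.
half_width = 1.96 * sd
ci_low, ci_high = DEATHS - half_width, DEATHS + half_width
ratio = ci_high / ci_low

reported_ratio = REPORTED_CI[1] / REPORTED_CI[0]

print(f"p = {p:.2f}, sd = {sd:.2f} ({relative_sd:.0%} of the count)")
print(f"95% CI for the count: {ci_low:.1f} to {ci_high:.1f}, ratio {ratio:.2f}")
print(f"ratio of reported CI endpoints: {reported_ratio:.2f}")
```

Because the scenario assumed here gives the least possible variation, the computed ratio is a floor: any clustering of deaths within a smaller number of clusters would widen the interval.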