Mean square error of a maximum likelihood estimate is not monotonic in sample size
May 12, 2008
The IMS Bulletin is usually not a technical publication. It mostly carries announcements about upcoming conferences, obituaries, stories about award winners, job ads, and columns.
The latest issue of the IMS Bulletin, however, has an unusual item on page 4 in which Ning-Zhong Shi of the School of Mathematics and Statistics, Northeast Normal University, P.R. China lays out “A conjecture on Maximum Likelihood Estimation”. Shi conjectures that the MLE has the finite-sample property that its expected squared error decreases monotonically in the number of samples n, i.e., MSE(θ̂ₙ₊₁) ≤ MSE(θ̂ₙ), where θ̂ₙ is the MLE computed from n i.i.d. samples and MSE(θ̂ₙ) = E[(θ̂ₙ − θ)²].
This is a rather bold conjecture, since very little is established regarding finite-sample properties of MLEs. Shi notes that the conjecture is true in the special case in which the MLE is the mean of a sample of i.i.d. variables – such as normal or Bernoulli variables parameterized by their unknown mean. Despite applying the qualifier “under some regularity conditions”, Shi seems to expect the result to hold in a much wider set of cases.
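The special case is easy to check numerically. Here is a minimal sketch (my own illustration, not from Shi's note) computing the exact MSE of the sample-mean MLE for Bernoulli variables, which equals p(1 − p)/n and is indeed monotone in n:

```python
from math import comb

def mse_of_mean(p, n):
    # Exact MSE of the sample-mean MLE p_hat = k/n for Bernoulli(p):
    # sum over k of P(k successes in n trials) * (k/n - p)^2.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * (k / n - p) ** 2
               for k in range(n + 1))

mses = [mse_of_mean(0.5, n) for n in range(1, 11)]
# The MSE equals p(1-p)/n here, so it strictly decreases in n.
assert all(a > b for a, b in zip(mses, mses[1:]))
assert abs(mses[3] - 0.25 / 4) < 1e-12  # n = 4: p(1-p)/n = 0.25/4
```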
He is, of course, wrong.
A simple counter-example involves the case of Bernoulli variables, where instead of parameterizing the distributions using their mean, they are parameterized using some monotonic, non-linear function of the mean. Let

fr(p) = 1/2 + atan(r · tan((p − 1/2) · π)) / π.
This is a sigmoid function that equals 0 at p = 0 and 1 at p = 1. The parameter r controls the steepness at p = 1/2; when r = 1, fr is the identity function. Here is a plot of f10(p). I use r = 10 in the calculations below, but any other value of r > 1 can be used.
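A quick sanity check of these properties, assuming the definition above (the function name `f` is my own choice for this sketch):

```python
from math import atan, tan, pi

def f(p, r):
    # Sigmoid reparameterization of the Bernoulli mean:
    # f_r(p) = 1/2 + atan(r * tan((p - 1/2) * pi)) / pi
    if p in (0.0, 1.0):
        return p  # tan(+-pi/2) is infinite; take the limit
    return 0.5 + atan(r * tan((p - 0.5) * pi)) / pi

# r = 1 gives the identity; large r pushes interior values toward 0 or 1.
assert abs(f(0.25, 1) - 0.25) < 1e-12
assert f(1/3, 10) < 0.06 and f(2/3, 10) > 0.94
```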
Because of the functional invariance of the MLE, the MLE of fr(p) is fr(p̂), where p̂ is the MLE of p. We now look at the case p = 1/2. When n = 1, the MSE of fr(p̂) is 1/4, since p̂ is either 0 or 1, so fr(p̂) is either 0 or 1, and in both cases the error is 1/2. When n = 2, the MSE falls significantly – to 1/8 – because with probability 1/2 we get p̂ = 1/2, in which case fr(p̂) = 1/2 and the error is 0. When n = 3, the MLE fr(p̂) will be 0, 1, fr(1/3) or fr(2/3). Now, if r is large, fr(1/3) ≈ 0 and fr(2/3) ≈ 1, so the MSE climbs almost back up to 1/4. The MSE then keeps oscillating between even n's (low MSE, because the MLE can equal 1/2) and odd n's (high MSE, because it cannot), until the distribution of the MLE becomes tight enough that the sigmoid shape of fr no longer influences the MSE and the asymptotic behavior kicks in. The onset of asymptotic behavior can be made arbitrarily late by increasing the parameter r. The figure below shows the MSE as a function of n:
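The oscillation is easy to reproduce exactly, since p̂ takes only the values k/n. A minimal sketch (my own, not the post's original calculation), computing the exact MSE of fr(p̂) at p = 1/2 with r = 10:

```python
from math import atan, comb, tan, pi

def f(p, r):
    # f_r(p) = 1/2 + atan(r * tan((p - 1/2) * pi)) / pi, with limits at 0 and 1.
    if p in (0.0, 1.0):
        return p
    return 0.5 + atan(r * tan((p - 0.5) * pi)) / pi

def mse(n, r=10, p=0.5):
    # Exact MSE of f_r(p_hat), where p_hat = k/n is the Bernoulli MLE,
    # computed by summing over the binomial distribution of k.
    target = f(p, r)
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * (f(k / n, r) - target) ** 2
               for k in range(n + 1))

# Even n can place probability mass exactly at p_hat = 1/2; odd n cannot,
# so the MSE rises from n = 2 to n = 3, from n = 4 to n = 5, and so on.
assert abs(mse(1) - 0.25) < 1e-12   # n = 1: error is always 1/2
assert abs(mse(2) - 0.125) < 1e-12  # n = 2: zero error with probability 1/2
assert mse(3) > mse(2)
assert mse(5) > mse(4)
```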