Basic unit-level models

The basic unit-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by \[ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) \] where \(j\) runs from 1 to \(n\), the number of unit-level observations, \(\beta\) is a vector of regression coefficients for given covariates \(x_j\), and \(v_i\) are random area intercepts.

We use the api dataset included in packages survey.

library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname))

The apipop data frame contains the complete population whereas apisrs is a simple random sample from it. The variable cname is the county name, and we will be interested in estimation at the county level. Not all counties in the population are sampled. In order to be able to make predictions for out-of-sample areas we make sure that the levels of the sample’s cname variable match those of its population counterpart.

The basic unit-level model with county random area effects is fit as follows

library(mcmcsae)
mod <- api00 ~ 
  reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") +
  gen(factor = ~ iid(cname), name="v")
sampler <- create_sampler(mod, data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
(summary(sim))
## llh_ :
##       Mean   SD t-value MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -1090 4.47    -244 0.12 -1097 -1089 -1083  1400     1
## 
## sigma_ :
##        Mean   SD t-value   MCSE q0.05 q0.5 q0.95 n_eff R_hat
## sigma_ 56.2 3.09    18.2 0.0666  51.4 56.1  61.4  2161     1
## 
## beta :
##                 Mean     SD t-value    MCSE    q0.05     q0.5    q0.95 n_eff R_hat
## (Intercept)  802.649 25.161   31.90 0.49578  761.570  802.762 843.6561  2576 0.999
## ell           -1.957  0.307   -6.37 0.00580   -2.477   -1.959  -1.4546  2805 1.001
## meals         -1.966  0.284   -6.92 0.00542   -2.436   -1.967  -1.5031  2751 1.000
## stypeH      -106.923 13.391   -7.98 0.24736 -128.806 -106.825 -84.3755  2931 0.999
## stypeM       -60.304 11.562   -5.22 0.21109  -79.320  -60.454 -40.9672  3000 1.000
## hsg           -0.616  0.412   -1.49 0.00752   -1.290   -0.611   0.0538  3000 1.000
## col.grad       0.581  0.508    1.14 0.01007   -0.234    0.576   1.4234  2542 0.999
## grad.sch       2.124  0.481    4.42 0.00936    1.343    2.108   2.9256  2642 1.000
## 
## v_sigma :
##         Mean   SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 23.3 6.28    3.72 0.203  13.7 22.9  34.2   956     1
## 
## v :
##                  Mean   SD  t-value  MCSE q0.05      q0.5 q0.95 n_eff R_hat
## Alameda      -25.8712 15.2 -1.69748 0.330 -51.7 -2.54e+01 -1.11  2131     1
## Amador         0.0396 24.2  0.00164 0.450 -40.2 -5.30e-02 39.76  2890     1
## Butte         -0.4771 23.6 -0.02020 0.453 -39.0 -1.97e-01 37.69  2721     1
## Calaveras      8.2061 21.9  0.37429 0.435 -26.5  7.59e+00 44.85  2538     1
## Colusa         0.5383 24.5  0.02198 0.453 -40.4  8.97e-01 40.15  2918     1
## Contra Costa -10.1490 15.3 -0.66427 0.279 -35.4 -9.50e+00 13.53  3000     1
## Del Norte      0.9160 23.8  0.03841 0.435 -38.3  2.72e-01 40.80  3000     1
## El Dorado      0.5042 24.3  0.02077 0.456 -38.3  8.19e-01 39.16  2836     1
## Fresno         0.2572 15.5  0.01656 0.283 -25.7  4.45e-01 25.70  3000     1
## Glenn         -0.6469 23.9 -0.02711 0.436 -39.9  8.09e-04 37.51  3000     1
## ... 47 elements suppressed ...

We wish to estimate the area population means \[ \theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,, \] where \(U_i\) is the set of units in area \(i\) of size \(N_i\). The MCMC output in variable sim can be used to obtain draws from the posterior distribution for \(\theta_i\). The \(r\)th draw can be expressed as \[ \theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i} - n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i} \epsilon_{j;r} \right)\,, \] where \(\bar{y}_i\) is the sample mean of \(y\) in area \(i\) and \(t_{x;i}\) is a vector of population totals for area \(i\).

N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds)  # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
##                 Mean    SD t-value  MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda          678 14.60    46.4 0.315   654  678   701  2144 1.000
## Amador           739 31.81    23.2 0.581   686  738   791  3000 1.000
## Butte            679 26.37    25.7 0.499   637  679   722  2791 0.999
## Calaveras        744 27.43    27.1 0.520   700  743   790  2781 1.000
## Colusa           553 31.97    17.3 0.590   499  553   605  2939 1.000
## Contra Costa     727 14.53    50.0 0.266   702  727   750  2984 1.000
## Del Norte        679 32.12    21.2 0.586   626  680   731  3000 1.000
## El Dorado        754 26.78    28.2 0.496   712  754   797  2910 0.999
## Fresno           595 15.18    39.2 0.277   570  595   620  3000 1.000
## Glenn            625 31.84    19.6 0.581   572  625   677  3000 1.000
## Humboldt         701 27.48    25.5 0.505   657  701   747  2960 1.001
## Imperial         558 25.27    22.1 0.492   518  557   601  2638 0.999
## Inyo             707 33.61    21.0 0.614   651  708   760  3000 1.000
## Kern             596 15.16    39.3 0.315   571  596   621  2318 0.999
## Kings            594 22.95    25.9 0.490   554  595   631  2191 1.001
## Lake             663 25.31    26.2 0.462   622  663   704  3000 0.999
## Lassen           706 26.55    26.6 0.485   661  707   750  3000 1.001
## Los Angeles      636  8.21    77.4 0.150   622  636   649  3000 0.999
## Madera           615 20.21    30.4 0.369   583  615   649  3000 0.999
## Marin            817 20.36    40.1 0.380   785  817   852  2869 1.000
## Mariposa         723 35.66    20.3 0.667   665  723   781  2859 0.999
## Mendocino        662 27.57    24.0 0.513   618  662   708  2888 1.001
## Merced           579 23.81    24.3 0.435   541  579   619  3000 1.000
## Modoc            672 29.46    22.8 0.597   624  671   721  2433 1.000
## Mono             704 40.57    17.3 0.769   636  704   770  2782 1.000
## Monterey         629 18.26    34.4 0.364   599  629   659  2515 1.000
## Napa             702 21.29    33.0 0.395   668  703   737  2909 0.999
## Nevada           799 29.83    26.8 0.545   749  799   847  3000 0.999
## Orange           693 15.12    45.8 0.300   669  693   718  2543 1.000
## Placer           772 23.99    32.2 0.459   736  772   812  2726 0.999
## Plumas           690 32.38    21.3 0.615   638  690   744  2772 1.000
## Riverside        636 14.24    44.6 0.274   612  636   659  2697 1.001
## Sacramento       669 15.38    43.5 0.281   644  669   695  3000 1.000
## San Benito       710 30.37    23.4 0.572   659  710   761  2816 1.000
## San Bernardino   647 13.10    49.4 0.239   625  647   668  3000 1.000
## San Diego        705 15.07    46.8 0.321   680  705   730  2197 1.000
## San Francisco    638 19.73    32.4 0.360   606  638   672  3000 0.999
## San Joaquin      630 17.56    35.9 0.376   600  630   658  2177 1.000
## San Luis Obispo  753 23.99    31.4 0.438   714  753   792  3000 1.000
## San Mateo        735 21.63    34.0 0.407   700  735   772  2821 1.001
## Santa Barbara    679 19.62    34.6 0.358   646  679   711  3000 1.002
## Santa Clara      733 15.79    46.4 0.295   706  733   758  2859 1.000
## Santa Cruz       680 20.05    33.9 0.416   646  680   712  2328 0.999
## Shasta           696 22.19    31.4 0.418   661  695   734  2821 1.002
## Sierra           708 42.19    16.8 0.794   639  708   777  2826 1.000
## Siskiyou         698 25.45    27.4 0.470   656  697   739  2938 0.999
## Solano           712 20.14    35.3 0.374   679  712   745  2900 1.000
## Sonoma           725 22.45    32.3 0.415   687  725   760  2924 1.000
## Stanislaus       661 19.82    33.4 0.374   628  661   694  2812 1.000
## Sutter           653 24.53    26.6 0.455   612  653   692  2900 1.001
## Tehama           660 27.93    23.6 0.554   614  660   705  2546 1.000
## Trinity          650 38.22    17.0 0.698   587  649   714  3000 0.999
## Tulare           583 21.75    26.8 0.402   546  583   618  2927 1.001
## Tuolumne         719 30.82    23.3 0.569   670  720   769  2930 1.001
## Ventura          709 17.44    40.7 0.332   681  709   739  2752 1.000
## Yolo             687 24.24    28.3 0.443   647  687   726  2998 1.000
## Yuba             627 28.68    21.9 0.528   581  627   674  2951 1.000
theta <- c(tapply(apipop$api00, apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

Binomial Unit-Level Model

A model with binomial likelihood can also be fit. We now model the target variable sch.wide, a binary variable indicating whether a school-wide growth target has been met. We use the same mean model structure as above for the linear model, but now using a logistic link function, \[ y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\ \mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \]

apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes")
sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
summary(sim)
## llh_ :
##       Mean  SD t-value   MCSE q0.05  q0.5 q0.95 n_eff R_hat
## llh_ -83.3 2.5   -33.3 0.0616 -87.7 -83.2 -79.6  1651     1
## 
## beta :
##                 Mean     SD t-value     MCSE   q0.05    q0.5    q0.95 n_eff R_hat
## (Intercept)  4.56473 1.4712  3.1028 0.053839  2.2341  4.5078  7.11085   747 1.002
## ell         -0.03652 0.0148 -2.4708 0.000397 -0.0615 -0.0362 -0.01266  1389 1.000
## meals        0.00124 0.0139  0.0889 0.000360 -0.0217  0.0013  0.02444  1495 1.001
## stypeH      -2.83181 0.6298 -4.4966 0.018468 -3.8958 -2.8147 -1.80091  1163 1.000
## stypeM      -1.66401 0.5866 -2.8368 0.018267 -2.6488 -1.6688 -0.69783  1031 1.000
## hsg         -0.02638 0.0222 -1.1890 0.000655 -0.0630 -0.0261  0.01058  1147 1.001
## col.grad    -0.03601 0.0267 -1.3478 0.000850 -0.0783 -0.0362  0.00844   988 1.000
## grad.sch     0.01267 0.0327  0.3868 0.001280 -0.0400  0.0118  0.06877   655 0.999
## 
## v_sigma :
##          Mean    SD t-value   MCSE  q0.05  q0.5 q0.95 n_eff R_hat
## v_sigma 0.372 0.267    1.39 0.0117 0.0336 0.325 0.868   525  1.01
## 
## v :
##                  Mean    SD  t-value    MCSE  q0.05      q0.5 q0.95 n_eff R_hat
## Alameda      -0.19361 0.428 -0.45204 0.01245 -1.041 -0.082856 0.331  1184 1.001
## Amador        0.01073 0.456  0.02353 0.00835 -0.704  0.002050 0.749  2983 0.999
## Butte         0.00163 0.451  0.00362 0.00824 -0.697  0.004466 0.692  2996 1.000
## Calaveras     0.04540 0.454  0.10000 0.00978 -0.613  0.004618 0.818  2157 1.001
## Colusa       -0.00671 0.455 -0.01474 0.00831 -0.729 -0.004081 0.711  3000 0.999
## Contra Costa -0.00822 0.396 -0.02073 0.00873 -0.666  0.000834 0.620  2061 1.000
## Del Norte    -0.00788 0.471 -0.01674 0.00884 -0.761 -0.002417 0.712  2839 1.000
## El Dorado    -0.00810 0.454 -0.01785 0.00933 -0.755  0.000714 0.707  2368 1.000
## Fresno       -0.04653 0.365 -0.12749 0.00714 -0.704 -0.015266 0.517  2612 1.000
## Glenn         0.01084 0.464  0.02336 0.00847 -0.687  0.002663 0.745  3000 1.000
## ... 47 elements suppressed ...

To predict the population fractions of schools that meet the growth target by county,

samplesums <- tapply(apisrs$target.met, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
##                  Mean     SD t-value     MCSE q0.05  q0.5 q0.95 n_eff R_hat
## Alameda         0.784 0.0603   13.00 0.001665 0.674 0.796 0.864  1312 1.002
## Amador          0.845 0.1156    7.32 0.002174 0.600 0.900 1.000  2826 0.999
## Butte           0.838 0.0745   11.25 0.001509 0.708 0.854 0.938  2440 1.000
## Calaveras       0.878 0.0928    9.46 0.001834 0.700 0.900 1.000  2562 1.000
## Colusa          0.643 0.1598    4.02 0.003052 0.333 0.667 0.889  2741 1.000
## Contra Costa    0.816 0.0572   14.26 0.001321 0.715 0.821 0.899  1876 1.001
## Del Norte       0.881 0.1080    8.16 0.001972 0.750 0.875 1.000  3000 1.000
## El Dorado       0.850 0.0744   11.42 0.001713 0.725 0.850 0.950  1886 1.000
## Fresno          0.782 0.0592   13.20 0.001232 0.677 0.785 0.871  2314 0.999
## Glenn           0.798 0.1388    5.75 0.002569 0.556 0.778 1.000  2919 1.000
## Humboldt        0.854 0.0710   12.03 0.001489 0.725 0.850 0.950  2275 1.000
## Imperial        0.698 0.1045    6.68 0.002128 0.525 0.700 0.850  2412 1.000
## Inyo            0.777 0.1541    5.04 0.003090 0.571 0.857 1.000  2488 1.000
## Kern            0.806 0.0522   15.44 0.001136 0.711 0.811 0.878  2109 1.000
## Kings           0.851 0.0803   10.60 0.001618 0.720 0.840 0.960  2460 1.000
## Lake            0.828 0.0953    8.69 0.001822 0.682 0.818 0.955  2739 1.000
## Lassen          0.836 0.1105    7.56 0.002173 0.636 0.818 1.000  2586 0.999
## Los Angeles     0.799 0.0427   18.70 0.001183 0.731 0.799 0.872  1306 1.002
## Madera          0.816 0.0757   10.78 0.001520 0.677 0.839 0.935  2482 1.001
## Marin           0.839 0.0688   12.21 0.001600 0.720 0.840 0.940  1847 1.000
## Mariposa        0.863 0.1473    5.86 0.002882 0.600 0.800 1.000  2611 1.000
## Mendocino       0.817 0.0898    9.10 0.001709 0.680 0.840 0.960  2760 1.000
## Merced          0.791 0.0790   10.02 0.001553 0.651 0.794 0.905  2590 1.001
## Modoc           0.836 0.1569    5.33 0.002864 0.600 0.800 1.000  3000 0.999
## Mono            0.752 0.2432    3.09 0.004440 0.333 0.667 1.000  3000 1.000
## Monterey        0.802 0.0688   11.66 0.001543 0.687 0.807 0.916  1989 1.000
## Napa            0.850 0.0787   10.80 0.001588 0.704 0.852 0.963  2457 1.001
## Nevada          0.871 0.0940    9.26 0.001848 0.714 0.857 1.000  2588 1.000
## Orange          0.774 0.0611   12.67 0.001340 0.670 0.778 0.871  2078 1.000
## Placer          0.817 0.0740   11.05 0.001755 0.682 0.833 0.924  1777 1.002
## Plumas          0.743 0.1509    4.93 0.002975 0.444 0.778 1.000  2573 1.000
## Riverside       0.824 0.0533   15.47 0.001230 0.722 0.831 0.895  1875 0.999
## Sacramento      0.848 0.0484   17.53 0.001240 0.764 0.847 0.927  1521 1.003
## San Benito      0.825 0.1227    6.73 0.002539 0.636 0.818 1.000  2335 1.003
## San Bernardino  0.861 0.0424   20.33 0.001116 0.790 0.862 0.928  1441 1.000
## San Diego       0.849 0.0426   19.91 0.000968 0.778 0.852 0.913  1942 1.001
## San Francisco   0.761 0.0752   10.12 0.001544 0.630 0.770 0.870  2372 1.000
## San Joaquin     0.829 0.0580   14.29 0.001271 0.726 0.839 0.911  2085 0.999
## San Luis Obispo 0.856 0.0699   12.25 0.001505 0.749 0.850 0.950  2158 1.002
## San Mateo       0.801 0.0731   10.96 0.001744 0.678 0.811 0.895  1756 1.002
## Santa Barbara   0.830 0.0641   12.95 0.001326 0.728 0.840 0.926  2337 1.002
## Santa Clara     0.785 0.0685   11.45 0.001975 0.652 0.796 0.875  1203 1.002
## Santa Cruz      0.789 0.0744   10.61 0.001521 0.653 0.796 0.898  2393 1.001
## Shasta          0.817 0.0765   10.68 0.001579 0.675 0.825 0.925  2347 1.000
## Sierra          0.733 0.2346    3.12 0.004376 0.333 0.667 1.000  2875 1.000
## Siskiyou        0.846 0.0977    8.65 0.001925 0.667 0.867 1.000  2577 1.001
## Solano          0.865 0.0634   13.64 0.001670 0.750 0.867 0.967  1442 1.001
## Sonoma          0.846 0.0622   13.60 0.001276 0.741 0.852 0.935  2379 1.001
## Stanislaus      0.812 0.0723   11.24 0.001798 0.681 0.824 0.901  1614 1.000
## Sutter          0.832 0.0919    9.05 0.001805 0.650 0.850 0.950  2590 1.000
## Tehama          0.842 0.0992    8.49 0.001840 0.647 0.882 1.000  2905 1.000
## Trinity         0.759 0.2016    3.76 0.003843 0.500 0.750 1.000  2752 1.000
## Tulare          0.831 0.0633   13.12 0.001422 0.718 0.836 0.927  1985 0.999
## Tuolumne        0.883 0.0932    9.48 0.001790 0.750 0.917 1.000  2709 1.000
## Ventura         0.838 0.0511   16.38 0.001071 0.745 0.845 0.913  2281 1.000
## Yolo            0.850 0.0752   11.30 0.001456 0.714 0.857 0.971  2670 1.000
## Yuba            0.794 0.1019    7.79 0.001984 0.632 0.789 0.947  2640 1.000
theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

References

Battese, G. E., R. M. Harter, and W. A. Fuller. 1988. “An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data.” Journal of the American Statistical Association 83 (401): 28–36.
Rao, J. N. K., and I. Molina. 2015. Small Area Estimation. John Wiley & Sons.