The basic unit-level model (Battese, Harter, and Fuller 1988; Rao and Molina 2015) is given by \[ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) \] where \(j\) runs from 1 to \(n\), the number of unit-level observations, \(\beta\) is a vector of regression coefficients for given covariates \(x_j\), and \(v_i\) are random area intercepts.
We use the api
dataset included in packages survey.
library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname))
The apipop
data frame contains the complete population
whereas apisrs
is a simple random sample from it. The
variable cname
is the county name, and we will be
interested in estimation at the county level. Not all counties in the
population are sampled. In order to be able to make predictions for
out-of-sample areas we make sure that the levels of the sample’s
cname
variable match those of its population
counterpart.
The basic unit-level model with county random area effects is fit as follows
library(mcmcsae)
mod <- api00 ~
reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") +
gen(factor = ~ iid(cname), name="v")
sampler <- create_sampler(mod, data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
(summary(sim))
## llh_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## llh_ -1090 4.47 -244 0.12 -1097 -1089 -1083 1400 1
##
## sigma_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## sigma_ 56.2 3.09 18.2 0.0666 51.4 56.1 61.4 2161 1
##
## beta :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## (Intercept) 802.649 25.161 31.90 0.49578 761.570 802.762 843.6561 2576 0.999
## ell -1.957 0.307 -6.37 0.00580 -2.477 -1.959 -1.4546 2805 1.001
## meals -1.966 0.284 -6.92 0.00542 -2.436 -1.967 -1.5031 2751 1.000
## stypeH -106.923 13.391 -7.98 0.24736 -128.806 -106.825 -84.3755 2931 0.999
## stypeM -60.304 11.562 -5.22 0.21109 -79.320 -60.454 -40.9672 3000 1.000
## hsg -0.616 0.412 -1.49 0.00752 -1.290 -0.611 0.0538 3000 1.000
## col.grad 0.581 0.508 1.14 0.01007 -0.234 0.576 1.4234 2542 0.999
## grad.sch 2.124 0.481 4.42 0.00936 1.343 2.108 2.9256 2642 1.000
##
## v_sigma :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 23.3 6.28 3.72 0.203 13.7 22.9 34.2 956 1
##
## v :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda -25.8712 15.2 -1.69748 0.330 -51.7 -2.54e+01 -1.11 2131 1
## Amador 0.0396 24.2 0.00164 0.450 -40.2 -5.30e-02 39.76 2890 1
## Butte -0.4771 23.6 -0.02020 0.453 -39.0 -1.97e-01 37.69 2721 1
## Calaveras 8.2061 21.9 0.37429 0.435 -26.5 7.59e+00 44.85 2538 1
## Colusa 0.5383 24.5 0.02198 0.453 -40.4 8.97e-01 40.15 2918 1
## Contra Costa -10.1490 15.3 -0.66427 0.279 -35.4 -9.50e+00 13.53 3000 1
## Del Norte 0.9160 23.8 0.03841 0.435 -38.3 2.72e-01 40.80 3000 1
## El Dorado 0.5042 24.3 0.02077 0.456 -38.3 8.19e-01 39.16 2836 1
## Fresno 0.2572 15.5 0.01656 0.283 -25.7 4.45e-01 25.70 3000 1
## Glenn -0.6469 23.9 -0.02711 0.436 -39.9 8.09e-04 37.51 3000 1
## ... 47 elements suppressed ...
We wish to estimate the area population means \[
\theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,,
\] where \(U_i\) is the set of
units in area \(i\) of size \(N_i\). The MCMC output in variable
sim
can be used to obtain draws from the posterior
distribution for \(\theta_i\). The
\(r\)th draw can be expressed as \[
\theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}_i + \beta_r'(t_{x;i}
- n_i \bar{x}_i) + (N_i - n_i)v_{i;r} + \sum_{j \in U_i\setminus s_i}
\epsilon_{j;r} \right)\,,
\] where \(\bar{y}_i\) is the
sample mean of \(y\) in area \(i\) and \(t_{x;i}\) is a vector of population totals
for area \(i\).
N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds) # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
show.progress=FALSE)
(summ <- summary(res))
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda 678 14.60 46.4 0.315 654 678 701 2144 1.000
## Amador 739 31.81 23.2 0.581 686 738 791 3000 1.000
## Butte 679 26.37 25.7 0.499 637 679 722 2791 0.999
## Calaveras 744 27.43 27.1 0.520 700 743 790 2781 1.000
## Colusa 553 31.97 17.3 0.590 499 553 605 2939 1.000
## Contra Costa 727 14.53 50.0 0.266 702 727 750 2984 1.000
## Del Norte 679 32.12 21.2 0.586 626 680 731 3000 1.000
## El Dorado 754 26.78 28.2 0.496 712 754 797 2910 0.999
## Fresno 595 15.18 39.2 0.277 570 595 620 3000 1.000
## Glenn 625 31.84 19.6 0.581 572 625 677 3000 1.000
## Humboldt 701 27.48 25.5 0.505 657 701 747 2960 1.001
## Imperial 558 25.27 22.1 0.492 518 557 601 2638 0.999
## Inyo 707 33.61 21.0 0.614 651 708 760 3000 1.000
## Kern 596 15.16 39.3 0.315 571 596 621 2318 0.999
## Kings 594 22.95 25.9 0.490 554 595 631 2191 1.001
## Lake 663 25.31 26.2 0.462 622 663 704 3000 0.999
## Lassen 706 26.55 26.6 0.485 661 707 750 3000 1.001
## Los Angeles 636 8.21 77.4 0.150 622 636 649 3000 0.999
## Madera 615 20.21 30.4 0.369 583 615 649 3000 0.999
## Marin 817 20.36 40.1 0.380 785 817 852 2869 1.000
## Mariposa 723 35.66 20.3 0.667 665 723 781 2859 0.999
## Mendocino 662 27.57 24.0 0.513 618 662 708 2888 1.001
## Merced 579 23.81 24.3 0.435 541 579 619 3000 1.000
## Modoc 672 29.46 22.8 0.597 624 671 721 2433 1.000
## Mono 704 40.57 17.3 0.769 636 704 770 2782 1.000
## Monterey 629 18.26 34.4 0.364 599 629 659 2515 1.000
## Napa 702 21.29 33.0 0.395 668 703 737 2909 0.999
## Nevada 799 29.83 26.8 0.545 749 799 847 3000 0.999
## Orange 693 15.12 45.8 0.300 669 693 718 2543 1.000
## Placer 772 23.99 32.2 0.459 736 772 812 2726 0.999
## Plumas 690 32.38 21.3 0.615 638 690 744 2772 1.000
## Riverside 636 14.24 44.6 0.274 612 636 659 2697 1.001
## Sacramento 669 15.38 43.5 0.281 644 669 695 3000 1.000
## San Benito 710 30.37 23.4 0.572 659 710 761 2816 1.000
## San Bernardino 647 13.10 49.4 0.239 625 647 668 3000 1.000
## San Diego 705 15.07 46.8 0.321 680 705 730 2197 1.000
## San Francisco 638 19.73 32.4 0.360 606 638 672 3000 0.999
## San Joaquin 630 17.56 35.9 0.376 600 630 658 2177 1.000
## San Luis Obispo 753 23.99 31.4 0.438 714 753 792 3000 1.000
## San Mateo 735 21.63 34.0 0.407 700 735 772 2821 1.001
## Santa Barbara 679 19.62 34.6 0.358 646 679 711 3000 1.002
## Santa Clara 733 15.79 46.4 0.295 706 733 758 2859 1.000
## Santa Cruz 680 20.05 33.9 0.416 646 680 712 2328 0.999
## Shasta 696 22.19 31.4 0.418 661 695 734 2821 1.002
## Sierra 708 42.19 16.8 0.794 639 708 777 2826 1.000
## Siskiyou 698 25.45 27.4 0.470 656 697 739 2938 0.999
## Solano 712 20.14 35.3 0.374 679 712 745 2900 1.000
## Sonoma 725 22.45 32.3 0.415 687 725 760 2924 1.000
## Stanislaus 661 19.82 33.4 0.374 628 661 694 2812 1.000
## Sutter 653 24.53 26.6 0.455 612 653 692 2900 1.001
## Tehama 660 27.93 23.6 0.554 614 660 705 2546 1.000
## Trinity 650 38.22 17.0 0.698 587 649 714 3000 0.999
## Tulare 583 21.75 26.8 0.402 546 583 618 2927 1.001
## Tuolumne 719 30.82 23.3 0.569 670 720 769 2930 1.001
## Ventura 709 17.44 40.7 0.332 681 709 739 2752 1.000
## Yolo 687 24.24 28.3 0.443 647 687 726 2998 1.000
## Yuba 627 28.68 21.9 0.528 581 627 674 2951 1.000
theta <- c(tapply(apipop$api00, apipop$cname, mean)) # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)
A model with binomial likelihood can also be fit. We now model the
target variable sch.wide
, a binary variable indicating
whether a school-wide growth target has been met. We use the same mean
model structure as above for the linear model, but now using a logistic
link function, \[
y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\\
\mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\\
v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2)
\]
apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes")
sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
summary(sim)
## llh_ :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## llh_ -83.3 2.5 -33.3 0.0616 -87.7 -83.2 -79.6 1651 1
##
## beta :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## (Intercept) 4.56473 1.4712 3.1028 0.053839 2.2341 4.5078 7.11085 747 1.002
## ell -0.03652 0.0148 -2.4708 0.000397 -0.0615 -0.0362 -0.01266 1389 1.000
## meals 0.00124 0.0139 0.0889 0.000360 -0.0217 0.0013 0.02444 1495 1.001
## stypeH -2.83181 0.6298 -4.4966 0.018468 -3.8958 -2.8147 -1.80091 1163 1.000
## stypeM -1.66401 0.5866 -2.8368 0.018267 -2.6488 -1.6688 -0.69783 1031 1.000
## hsg -0.02638 0.0222 -1.1890 0.000655 -0.0630 -0.0261 0.01058 1147 1.001
## col.grad -0.03601 0.0267 -1.3478 0.000850 -0.0783 -0.0362 0.00844 988 1.000
## grad.sch 0.01267 0.0327 0.3868 0.001280 -0.0400 0.0118 0.06877 655 0.999
##
## v_sigma :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## v_sigma 0.372 0.267 1.39 0.0117 0.0336 0.325 0.868 525 1.01
##
## v :
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda -0.19361 0.428 -0.45204 0.01245 -1.041 -0.082856 0.331 1184 1.001
## Amador 0.01073 0.456 0.02353 0.00835 -0.704 0.002050 0.749 2983 0.999
## Butte 0.00163 0.451 0.00362 0.00824 -0.697 0.004466 0.692 2996 1.000
## Calaveras 0.04540 0.454 0.10000 0.00978 -0.613 0.004618 0.818 2157 1.001
## Colusa -0.00671 0.455 -0.01474 0.00831 -0.729 -0.004081 0.711 3000 0.999
## Contra Costa -0.00822 0.396 -0.02073 0.00873 -0.666 0.000834 0.620 2061 1.000
## Del Norte -0.00788 0.471 -0.01674 0.00884 -0.761 -0.002417 0.712 2839 1.000
## El Dorado -0.00810 0.454 -0.01785 0.00933 -0.755 0.000714 0.707 2368 1.000
## Fresno -0.04653 0.365 -0.12749 0.00714 -0.704 -0.015266 0.517 2612 1.000
## Glenn 0.01084 0.464 0.02336 0.00847 -0.687 0.002663 0.745 3000 1.000
## ... 47 elements suppressed ...
To predict the population fractions of schools that meet the growth target by county,
samplesums <- tapply(apisrs$target.met, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0 # substitute 0 for out-of-sample areas
res <- predict(sim, newdata=apipop, labels=names(N),
fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
show.progress=FALSE)
(summ <- summary(res))
## Mean SD t-value MCSE q0.05 q0.5 q0.95 n_eff R_hat
## Alameda 0.784 0.0603 13.00 0.001665 0.674 0.796 0.864 1312 1.002
## Amador 0.845 0.1156 7.32 0.002174 0.600 0.900 1.000 2826 0.999
## Butte 0.838 0.0745 11.25 0.001509 0.708 0.854 0.938 2440 1.000
## Calaveras 0.878 0.0928 9.46 0.001834 0.700 0.900 1.000 2562 1.000
## Colusa 0.643 0.1598 4.02 0.003052 0.333 0.667 0.889 2741 1.000
## Contra Costa 0.816 0.0572 14.26 0.001321 0.715 0.821 0.899 1876 1.001
## Del Norte 0.881 0.1080 8.16 0.001972 0.750 0.875 1.000 3000 1.000
## El Dorado 0.850 0.0744 11.42 0.001713 0.725 0.850 0.950 1886 1.000
## Fresno 0.782 0.0592 13.20 0.001232 0.677 0.785 0.871 2314 0.999
## Glenn 0.798 0.1388 5.75 0.002569 0.556 0.778 1.000 2919 1.000
## Humboldt 0.854 0.0710 12.03 0.001489 0.725 0.850 0.950 2275 1.000
## Imperial 0.698 0.1045 6.68 0.002128 0.525 0.700 0.850 2412 1.000
## Inyo 0.777 0.1541 5.04 0.003090 0.571 0.857 1.000 2488 1.000
## Kern 0.806 0.0522 15.44 0.001136 0.711 0.811 0.878 2109 1.000
## Kings 0.851 0.0803 10.60 0.001618 0.720 0.840 0.960 2460 1.000
## Lake 0.828 0.0953 8.69 0.001822 0.682 0.818 0.955 2739 1.000
## Lassen 0.836 0.1105 7.56 0.002173 0.636 0.818 1.000 2586 0.999
## Los Angeles 0.799 0.0427 18.70 0.001183 0.731 0.799 0.872 1306 1.002
## Madera 0.816 0.0757 10.78 0.001520 0.677 0.839 0.935 2482 1.001
## Marin 0.839 0.0688 12.21 0.001600 0.720 0.840 0.940 1847 1.000
## Mariposa 0.863 0.1473 5.86 0.002882 0.600 0.800 1.000 2611 1.000
## Mendocino 0.817 0.0898 9.10 0.001709 0.680 0.840 0.960 2760 1.000
## Merced 0.791 0.0790 10.02 0.001553 0.651 0.794 0.905 2590 1.001
## Modoc 0.836 0.1569 5.33 0.002864 0.600 0.800 1.000 3000 0.999
## Mono 0.752 0.2432 3.09 0.004440 0.333 0.667 1.000 3000 1.000
## Monterey 0.802 0.0688 11.66 0.001543 0.687 0.807 0.916 1989 1.000
## Napa 0.850 0.0787 10.80 0.001588 0.704 0.852 0.963 2457 1.001
## Nevada 0.871 0.0940 9.26 0.001848 0.714 0.857 1.000 2588 1.000
## Orange 0.774 0.0611 12.67 0.001340 0.670 0.778 0.871 2078 1.000
## Placer 0.817 0.0740 11.05 0.001755 0.682 0.833 0.924 1777 1.002
## Plumas 0.743 0.1509 4.93 0.002975 0.444 0.778 1.000 2573 1.000
## Riverside 0.824 0.0533 15.47 0.001230 0.722 0.831 0.895 1875 0.999
## Sacramento 0.848 0.0484 17.53 0.001240 0.764 0.847 0.927 1521 1.003
## San Benito 0.825 0.1227 6.73 0.002539 0.636 0.818 1.000 2335 1.003
## San Bernardino 0.861 0.0424 20.33 0.001116 0.790 0.862 0.928 1441 1.000
## San Diego 0.849 0.0426 19.91 0.000968 0.778 0.852 0.913 1942 1.001
## San Francisco 0.761 0.0752 10.12 0.001544 0.630 0.770 0.870 2372 1.000
## San Joaquin 0.829 0.0580 14.29 0.001271 0.726 0.839 0.911 2085 0.999
## San Luis Obispo 0.856 0.0699 12.25 0.001505 0.749 0.850 0.950 2158 1.002
## San Mateo 0.801 0.0731 10.96 0.001744 0.678 0.811 0.895 1756 1.002
## Santa Barbara 0.830 0.0641 12.95 0.001326 0.728 0.840 0.926 2337 1.002
## Santa Clara 0.785 0.0685 11.45 0.001975 0.652 0.796 0.875 1203 1.002
## Santa Cruz 0.789 0.0744 10.61 0.001521 0.653 0.796 0.898 2393 1.001
## Shasta 0.817 0.0765 10.68 0.001579 0.675 0.825 0.925 2347 1.000
## Sierra 0.733 0.2346 3.12 0.004376 0.333 0.667 1.000 2875 1.000
## Siskiyou 0.846 0.0977 8.65 0.001925 0.667 0.867 1.000 2577 1.001
## Solano 0.865 0.0634 13.64 0.001670 0.750 0.867 0.967 1442 1.001
## Sonoma 0.846 0.0622 13.60 0.001276 0.741 0.852 0.935 2379 1.001
## Stanislaus 0.812 0.0723 11.24 0.001798 0.681 0.824 0.901 1614 1.000
## Sutter 0.832 0.0919 9.05 0.001805 0.650 0.850 0.950 2590 1.000
## Tehama 0.842 0.0992 8.49 0.001840 0.647 0.882 1.000 2905 1.000
## Trinity 0.759 0.2016 3.76 0.003843 0.500 0.750 1.000 2752 1.000
## Tulare 0.831 0.0633 13.12 0.001422 0.718 0.836 0.927 1985 0.999
## Tuolumne 0.883 0.0932 9.48 0.001790 0.750 0.917 1.000 2709 1.000
## Ventura 0.838 0.0511 16.38 0.001071 0.745 0.845 0.913 2281 1.000
## Yolo 0.850 0.0752 11.30 0.001456 0.714 0.857 0.971 2670 1.000
## Yuba 0.794 0.1019 7.79 0.001984 0.632 0.789 0.947 2640 1.000