Boolean Algebra of Hypothesis Tests

library(hypothesize)

The Idea

What if hypothesis tests formed an algebra?

In SICP terms, we want three things: primitive expressions (the tests themselves), means of combination (ways to build compound tests from simpler ones), and means of abstraction (ways to transform and generalize tests). The hypothesize package provides all three, and in v0.11.0, the combinators form a genuine Boolean algebra.

This vignette develops the algebra from first principles.

The Trinity: Three Ways to Test a Hypothesis

Statistical theory gives us three fundamental likelihood-based tests. They approach the same question from different angles and are asymptotically equivalent, but each requires different information:

| Test  | What it needs                    | When to use                                           |
|-------|----------------------------------|-------------------------------------------------------|
| Wald  | MLE + standard error             | You already fit the full model                        |
| LRT   | Log-likelihoods of both models   | You can fit both models                               |
| Score | Score + Fisher info at null only | The null model is cheap, the alternative is expensive |

The score test completes the trinity:

# Setup: testing whether a Poisson rate equals 5
# Observed: 60 events in 10 time units
# MLE: lambda_hat = 6, se = sqrt(6/10)

# Wald: evaluate at the MLE
w <- wald_test(estimate = 6, se = sqrt(6/10), null_value = 5)

# Score: evaluate at the null
# Score function at lambda0=5: n*xbar/lambda0 - n = 60/5 - 10 = 2
# Fisher info at lambda0=5: n/lambda0 = 10/5 = 2
s <- score_test(score = 2, fisher_info = 2)

# LRT: evaluate at both
# loglik(lambda) = sum(x)*log(lambda) - n*lambda
ll_null <- 60 * log(5) - 10 * 5
ll_alt  <- 60 * log(6) - 10 * 6
l <- lrt(null_loglik = ll_null, alt_loglik = ll_alt, dof = 1)

# All three give similar (not identical) answers:
data.frame(
  test = c("Wald", "Score", "LRT"),
  statistic = c(test_stat(w), test_stat(s), test_stat(l)),
  p_value = round(c(pval(w), pval(s), pval(l)), 4)
)
#>    test statistic p_value
#> 1  Wald  1.666667  0.1967
#> 2 Score  2.000000  0.1573
#> 3   LRT  1.878587  0.1705

The three tests agree asymptotically but can diverge in finite samples. This is expected and well-understood — the choice between them depends on what information is available, not which gives the “right” answer.
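The divergence also shrinks as the data grow relative to the effect. As a quick sketch (same constructors as above, with the Poisson quantities computed by hand), here is the setup with ten times the observation window and a proportionally smaller departure from the null: 530 events in 100 time units, so lambda_hat = 5.3.

# 10x the data, smaller relative effect: 530 events in 100 time units
w2 <- wald_test(estimate = 5.3, se = sqrt(5.3/100), null_value = 5)
s2 <- score_test(score = 530/5 - 100, fisher_info = 100/5)
l2 <- lrt(null_loglik = 530 * log(5) - 100 * 5,
          alt_loglik  = 530 * log(5.3) - 100 * 5.3, dof = 1)
round(c(Wald = pval(w2), Score = pval(s2), LRT = pval(l2)), 4)
# approximately 0.193, 0.180, 0.184: a much tighter spread than above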

Boolean Algebra: AND, OR, NOT

The heart of v0.11.0 is a Boolean algebra over hypothesis tests. Just as propositional logic has AND, OR, and NOT, we can combine tests with the same operations.

NOT: The Complement

The complement of a test rejects when the original fails to reject:

# A significant Wald test
w <- wald_test(estimate = 3.0, se = 1.0)
pval(w)                    # small — rejects H0
#> [1] 0.002699796

# Its complement
cw <- complement_test(w)
pval(cw)                   # large — fails to reject
#> [1] 0.9973002

The algebra is clean: double complement is identity.

pval(complement_test(complement_test(w)))
#> [1] 0.002699796
pval(w)
#> [1] 0.002699796

And the class hierarchy is preserved — a complemented Wald test is simultaneously a complemented_test, a wald_test, and a hypothesis_test:

class(cw)
#> [1] "complemented_test" "wald_test"         "hypothesis_test"

Connection to equivalence testing. If a Wald test asks “is the parameter different from the null?”, its complement asks “is the parameter close to the null?” This is exactly the logic of equivalence testing. In bioequivalence studies, you don’t want to show a drug is different — you want to show it’s the same (within tolerance).
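As a toy illustration (just the Boolean skeleton of the idea, not a substitute for a proper TOST procedure), an estimate sitting essentially on the null makes the complement's p-value small:

# Estimate essentially at the null
w_near <- wald_test(estimate = 0.01, se = 1)
pval(w_near)                   # ~0.992: no evidence of a difference
pval(complement_test(w_near))  # ~0.008: the complement flags closeness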

AND: The Intersection

The intersection test rejects only when all component tests reject:

# Both must be significant
intersection_test(0.01, 0.03)          # both < 0.05
#> Hypothesis test (intersection_test)
#> -----------------------------
#> Test statistic: 0.03
#> P-value: 0.03
#> Degrees of freedom: 2
#> Significant at 5% level: TRUE
is_significant_at(intersection_test(0.01, 0.03), 0.05)
#> [1] TRUE

# One non-significant component blocks rejection
intersection_test(0.01, 0.80)
#> Hypothesis test (intersection_test)
#> -----------------------------
#> Test statistic: 0.8
#> P-value: 0.8
#> Degrees of freedom: 2
#> Significant at 5% level: FALSE
is_significant_at(intersection_test(0.01, 0.80), 0.05)
#> [1] FALSE

The p-value is simply \(\max(p_1, \ldots, p_k)\) — the intersection rejects at level \(\alpha\) if and only if every component rejects, which happens if and only if the largest p-value is below \(\alpha\).

This is the intersection-union test (IUT; Berger, 1982). No multiplicity correction is needed — the max operation is inherently conservative.

Use case: bioequivalence. Bioequivalence requires showing a drug’s effect is both “not too low” AND “not too high”:

# Drug effect estimate: 1.05, with SE = 0.08
# Must show: effect > 0.8 AND effect < 1.25
t_lower <- wald_test(estimate = 1.05, se = 0.08, null_value = 0.8)
t_upper <- wald_test(estimate = 1.05, se = 0.08, null_value = 1.25)

bioequiv <- intersection_test(t_lower, t_upper)
is_significant_at(bioequiv, 0.05)
#> [1] TRUE
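Inspecting the components shows which bound binds. The estimate 1.05 sits closer to 1.25 than to 0.8, so the upper test should supply the joint p-value under the max rule (values approximate):

round(c(lower = pval(t_lower), upper = pval(t_upper), joint = pval(bioequiv)), 4)
# roughly 0.0018, 0.0124, 0.0124: the joint p-value comes from the upper bound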

OR: The Union (via De Morgan)

The union test rejects when any component test rejects. But here’s the beautiful part: we don’t implement it directly. Instead, we define it through De Morgan’s law:

\[\text{OR}(t_1, \ldots, t_k) = \text{NOT}(\text{AND}(\text{NOT}(t_1), \ldots, \text{NOT}(t_k)))\]

The implementation IS the theorem:

# This is (essentially) the actual source code of union_test:
union_test <- function(...) {
  tests <- list(...)   # collect the component tests
  complement_test(
    do.call(intersection_test, lapply(tests, complement_test))
  )
}

Let’s verify De Morgan’s law holds:

# Three p-values
p1 <- 0.03; p2 <- 0.15; p3 <- 0.07

# Direct union
ut <- union_test(p1, p2, p3)

# Manual De Morgan construction
tests <- list(
  hypothesis_test(stat = 0, p.value = p1, dof = 1),
  hypothesis_test(stat = 0, p.value = p2, dof = 1),
  hypothesis_test(stat = 0, p.value = p3, dof = 1)
)
dm <- complement_test(
  do.call(intersection_test, lapply(tests, complement_test))
)

# Identical
c(union = pval(ut), de_morgan = pval(dm))
#>     union de_morgan 
#>      0.03      0.03

The resulting p-value is \(\min(p_1, \ldots, p_k)\). This falls out algebraically: complement maps \(p \to 1-p\), the intersection takes the max, and the outer complement maps back. So:

\[1 - \max(1-p_1, \ldots, 1-p_k) = \min(p_1, \ldots, p_k)\]
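The identity is easy to check with plain vectors, independent of any package machinery:

ps <- c(0.03, 0.15, 0.07)
c(de_morgan = 1 - max(1 - ps), direct_min = min(ps))
#>  de_morgan direct_min 
#>       0.03       0.03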

A note on multiplicity. The uncorrected \(\min(p)\) is anti-conservative. If you need to control the family-wise error rate, apply adjust_pval() to the component tests before combining. The raw union test is appropriate for screening or exploratory analysis where you want to detect any signal.
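For instance, a Bonferroni-style correction can be rolled by hand before taking the union (a base-R sketch; adjust_pval() is the package's own interface for this):

ps <- c(0.03, 0.15, 0.07)
min(pmin(ps * length(ps), 1))   # Bonferroni-adjust, then take the union
#> [1] 0.09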

The Full Algebra

The three operations satisfy the axioms of a Boolean algebra:

| Property    | Statement                                                            |
|-------------|----------------------------------------------------------------------|
| Involution  | complement(complement(t)) = t                                        |
| De Morgan 1 | union(a, b) = complement(intersection(complement(a), complement(b))) |
| De Morgan 2 | intersection(a, b) = complement(union(complement(a), complement(b))) |
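De Morgan 1 was verified directly above; the second law can be checked the same way. Under the p-value semantics established so far (complement is \(1-p\), intersection is the max, union is the min), both sides below should print 0.6:

a <- hypothesis_test(stat = 0, p.value = 0.2, dof = 1)
b <- hypothesis_test(stat = 0, p.value = 0.6, dof = 1)
lhs <- intersection_test(a, b)
rhs <- complement_test(union_test(complement_test(a), complement_test(b)))
c(intersection = pval(lhs), de_morgan = pval(rhs))
# both should equal max(0.2, 0.6) = 0.6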

And they compose with the rest of the package — intersection and union tests are hypothesis_test objects, so they work with fisher_combine(), adjust_pval(), pval(), and everything else:

# Compose: take the union, then combine with another test
ut <- union_test(0.01, 0.05)
combined <- fisher_combine(ut, wald_test(estimate = 2, se = 1))
pval(combined)
#> [1] 0.003956342

Test-Confidence Duality

There is a deep connection between hypothesis tests and confidence intervals: a \((1-\alpha)\) confidence set contains exactly those null values that would not be rejected at level \(\alpha\). The invert_test() function makes this duality operational.

It takes a test constructor — any function that maps a null value to a hypothesis_test — and returns the set of non-rejected values:

# Invert a Wald test into a confidence interval
cs <- invert_test(
  test_fn = function(theta) wald_test(estimate = 2.5, se = 0.8, null_value = theta),
  grid = seq(0, 5, by = 0.01)
)
cs
#> Confidence set (95% level)
#> -----------------------------
#> Lower: 0.94
#> Upper: 4.06
#> Grid points in set: 313 of 501

This agrees with the analytical confidence interval (up to grid resolution):

# Analytical CI from the Wald test
confint(wald_test(estimate = 2.5, se = 0.8))
#>     lower     upper 
#> 0.9320288 4.0679712

# Numerical CI from test inversion
c(lower = lower(cs), upper = upper(cs))
#> lower.lower upper.upper 
#>        0.94        4.06
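The grid controls the resolution; a finer grid tightens the agreement (the endpoints below should land within one grid step, 0.001, of the analytical values):

cs_fine <- invert_test(
  test_fn = function(theta) wald_test(estimate = 2.5, se = 0.8, null_value = theta),
  grid = seq(0, 5, by = 0.001)
)
c(lower = lower(cs_fine), upper = upper(cs_fine))
# approximately 0.933 and 4.067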

The power of invert_test() is generality. The analytical confint() methods only work for specific test types (Wald, z-test). But invert_test() works with any test — including user-defined ones:

# A custom test: reject if |observation - theta| > 2
# (This has no analytical CI, but invert_test handles it)
my_test <- function(theta) {
  obs <- 5.0
  hypothesis_test(
    stat = (obs - theta)^2,
    p.value = if (abs(obs - theta) > 2) 0.01 else 0.5,
    dof = 1
  )
}

cs <- invert_test(my_test, grid = seq(0, 10, by = 0.1))
c(lower = lower(cs), upper = upper(cs))
#> lower.lower upper.upper 
#>           3           7

This is the most SICP function in the package. It takes a function as input (the test constructor) and returns a structured result (the confidence set). It works because of the abstraction barrier: invert_test() only needs pval(), so any test that implements the hypothesis_test interface gets confidence sets for free.
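The mechanism fits in a few lines. Here is a minimal sketch of the idea, not the package source (the real function also constructs the confidence-set object and exposes a level argument):

# Minimal sketch: keep every grid value whose test is not rejected
invert_test_sketch <- function(test_fn, grid, level = 0.95) {
  alpha <- 1 - level
  keep <- vapply(grid, function(theta) pval(test_fn(theta)) > alpha, logical(1))
  range(grid[keep])  # assumes the non-rejected region is an interval
}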

Multivariate Tests

Both wald_test() and score_test() are polymorphic — they accept scalar or vector inputs and dispatch accordingly.

The multivariate Wald test uses the quadratic form:

\[W = (\hat\theta - \theta_0)^\top \Sigma^{-1} (\hat\theta - \theta_0) \sim \chi^2_k\]

# Test two parameters jointly
est <- c(2.0, 3.0)
V <- matrix(c(1.0, 0.3, 0.3, 1.0), 2, 2)  # correlated

w_joint <- wald_test(estimate = est, vcov = V, null_value = c(0, 0))
w_joint
#> Hypothesis test (wald_test)
#> -----------------------------
#> Test statistic: 10.3296703296703
#> P-value: 0.00571400463027544
#> Degrees of freedom: 2
#> Significant at 5% level: TRUE
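The statistic can be reproduced by hand from the quadratic form (est and V as defined above; the null value is zero, so no centering is needed):

drop(t(est) %*% solve(V) %*% est)
#> [1] 10.32967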

When parameters are independent (diagonal covariance), the multivariate statistic decomposes into a sum of univariate statistics:

# Independent parameters: diagonal vcov
se1 <- 0.8; se2 <- 1.2
V_diag <- diag(c(se1^2, se2^2))

w_multi <- wald_test(estimate = c(2.0, 3.0), vcov = V_diag)
w1 <- wald_test(estimate = 2.0, se = se1)
w2 <- wald_test(estimate = 3.0, se = se2)

# The joint statistic equals the sum of individual statistics
c(joint = test_stat(w_multi), sum = test_stat(w1) + test_stat(w2))
#> joint   sum 
#>  12.5  12.5

This decomposition breaks down when parameters are correlated — that’s when you need the full vcov matrix. The off-diagonal entries capture the dependence structure that individual standard errors miss.
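A quick counterexample makes the point. Reusing the standard errors above but adding correlation 0.3, the joint statistic no longer matches the sum (values approximate):

rho <- 0.3
V_corr <- matrix(c(se1^2,           rho * se1 * se2,
                   rho * se1 * se2, se2^2), 2, 2)
w_corr <- wald_test(estimate = c(2.0, 3.0), vcov = V_corr)
c(joint = test_stat(w_corr), sum = test_stat(w1) + test_stat(w2))
# roughly 9.62 vs 12.5: positive correlation shrinks the joint statistic here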

The Design as an Algebra

Stepping back, the package implements a small algebra with clear structure:

Primitives:     z_test, wald_test, lrt, score_test
Combinators:    fisher_combine, intersection_test, union_test
Transformers:   adjust_pval, complement_test
Duality:        invert_test, confint
Accessors:      pval, test_stat, dof, is_significant_at, lower, upper

Each layer corresponds to an SICP principle:

| Layer        | SICP Principle                   | What it does                            |
|--------------|----------------------------------|-----------------------------------------|
| Primitives   | Primitive expressions            | The four ways to test a hypothesis      |
| Combinators  | Means of combination             | Build compound tests (closure property) |
| Transformers | Means of abstraction             | Transform tests into new tests          |
| Duality      | Power of the abstraction barrier | Any test becomes a confidence set       |

The package has 18 exported functions. That’s deliberately small. The goal is not to provide every test, but to provide the algebraic primitives from which users can compose arbitrary test logic. A well-chosen basis is more powerful than an exhaustive catalog.