Riko Kelter
Institute of Medical Statistics and Computational Biology
Faculty of Medicine, University of Cologne
Cologne, Germany
This vignette illustrates how to construct frequentist optimal two-stage single-arm designs using the Bayes factor \(BF_{01}\) as the test statistic.
We consider a proof-of-concept phase II trial with binary endpoint and hypotheses
\[ H_0 : p \le p_0, \qquad H_1 : p > p_0, \]
where \(p_0\) is a benchmark response probability; see Kelter and Pawel (2025a).
The decision rule is based on the Bayes factor \(BF_{01}\) for \(H_0\) versus \(H_1\):
At the final analysis, efficacy is concluded when \(BF_{01} \le k\). At the interim analysis, futility is concluded when \(BF_{01} \ge k_f\).
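As a minimal sketch of this decision rule, the following base R code computes \(BF_{01}\) for y responses among n patients. It assumes truncated Beta analysis priors, Beta(a0, b0) on [0, p0] under \(H_0\) and Beta(a1, b1) on (p0, 1] under \(H_1\). The helper names bf01_trunc and decide are illustrative and not part of the package, whose internal computation may differ.

```r
# Hypothetical helper (not from the package): BF_01 for H0: p <= p0 vs H1: p > p0
# under truncated Beta analysis priors. The binomial coefficient cancels in the ratio.
bf01_trunc <- function(y, n, p0, a0 = 1, b0 = 1, a1 = 1, b1 = 1) {
  # log marginal likelihood under H0: Beta(a0, b0) truncated to [0, p0]
  log_m0 <- lbeta(a0 + y, b0 + n - y) - lbeta(a0, b0) +
    pbeta(p0, a0 + y, b0 + n - y, log.p = TRUE) -
    pbeta(p0, a0, b0, log.p = TRUE)
  # log marginal likelihood under H1: Beta(a1, b1) truncated to (p0, 1]
  log_m1 <- lbeta(a1 + y, b1 + n - y) - lbeta(a1, b1) +
    pbeta(p0, a1 + y, b1 + n - y, lower.tail = FALSE, log.p = TRUE) -
    pbeta(p0, a1, b1, lower.tail = FALSE, log.p = TRUE)
  exp(log_m0 - log_m1)
}

# Hypothetical helper: apply the thresholds k (efficacy, final analysis)
# and k_f (futility, interim analysis) to a single Bayes factor.
decide <- function(bf, stage = c("interim", "final"), k = 1/3, k_f = 3) {
  stage <- match.arg(stage)
  if (stage == "interim") {
    if (bf >= k_f) "stop for futility" else "continue"
  } else {
    if (bf <= k) "efficacy" else "no efficacy claim"
  }
}

# Interim look: 2 responses among 12 patients, benchmark p0 = 0.2
bf_interim <- bf01_trunc(y = 2, n = 12, p0 = 0.2)
decide(bf_interim, "interim")
```

Working on the log scale keeps the Beta functions numerically stable for larger sample sizes.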
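In frequentist calibration, we require that:
the frequentist type-I error at \(p = p_0\) does not exceed target_freq_type1, and
the frequentist power at the response rate dp is at least target_freq_power,
even though the decision statistic is a Bayes factor.
Frequentist calibration is requested via
calibration = "frequentist"
in design_singlearm_bf(). In this mode, the following calibration targets must be specified: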
target_freq_power: target frequentist power at dp,
target_freq_type1: target frequentist type-I error at p0.
A typical choice is
target_freq_power = 0.7 or 0.8,
target_freq_type1 = 0.1, 0.05 or 0.025, depending on the phase II context and the statistical test used (directional or two-sided).
We start with a concrete two-stage design chosen manually, for example
\[ n_1 = 12, \qquad n_2 = 24, \]
and investigate its operating characteristics under frequentist calibration.
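res_manual <- design_singlearm_bf(
  n1_min = 8,
  n2_max = 30,
  k = 1/3,                      # efficacy threshold for BF01 at the final analysis
  k_f = 3,                      # futility threshold for BF01 at the interim analysis
  p0 = 0.2,                     # benchmark response probability
  a0 = 1,                       # analysis prior under H0: Beta(a0, b0)
  b0 = 1,
  a1 = 1,                       # analysis prior under H1: Beta(a1, b1)
  b1 = 1,
  dp = 0.4,                     # response rate at which frequentist power is evaluated
  da0 = 2.5,                    # design prior under H0: Beta(da0, db0)
  db0 = 2,
  da1 = 1,                      # design prior under H1: Beta(da1, db1)
  db1 = 1,
  type = "direction",
  calibration = "frequentist",
  algorithm = "manual",         # evaluate the given design, no search
  interim = 12,                 # interim sample size n1
  final = 24,                   # final sample size n2
  target_freq_power = 0.75,
  target_freq_type1 = 0.10
)
We inspect the results: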
summary(res_manual)
#> Summary: Single-arm two-stage Bayes factor design
#> ---------------------------------------------------------
#> Feasible: TRUE
#> Design prior under H0: Beta(2.5, 2) truncated to [0, p0]
#> Design prior under H1: Beta(1, 1) truncated to (p0, 1]
#>
#> Selected design: n1 = 12, n2 = 24
#>
#> Bayesian operating characteristics
#> Power: 0.8379
#> Type-I: 0.0260
#> CE H0: NA
#> EN H0: 14.97
#> EN H1: 23.09
#>
#> Frequentist operating characteristics
#> Power: 0.7838
#> Type-I: 0.0828
#> EN H0: 17.30
#> EN H1: 23.00
In algorithm = "manual" mode, the function does
not optimize over designs. It simply evaluates the
chosen pair (n1, n2) and reports its Bayesian and frequentist
operating characteristics, including the frequentist power at dp
and the frequentist type-I error at p0.
If Feasible is FALSE in the summary, this
only means that the chosen design does not meet the requested targets.
It does not mean the design is incorrect; it simply does not match the
desired calibration. Conversely, even if Feasible is
TRUE in the summary, the proposed design need not be
optimal in a frequentist sense. To be optimal, a design must minimize
the expected sample size \(E_{H_0}[N]\) under the
null hypothesis among all designs that fulfill the specified target
constraints on frequentist power and type-I error rate.
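The frequentist operating characteristics of a fixed design can be sketched by exact enumeration over the binomial outcomes of both stages. The code below is an illustration in base R, not the package's internal implementation. It assumes uniform analysis priors (a0 = b0 = a1 = b1 = 1), so that \(BF_{01}\) equals the posterior odds of \(H_0\) divided by the prior odds, and it assumes that n2 denotes the total sample size at the final analysis. The names bf01_unif and freq_oc are hypothetical.

```r
# Hypothetical helpers for illustration; not the package's internal code.
# BF_01 under uniform analysis priors: posterior odds of H0 over prior odds of H0.
bf01_unif <- function(y, n, p0) {
  post_h0 <- pbeta(p0, y + 1, n - y + 1)
  (post_h0 / (1 - post_h0)) / (p0 / (1 - p0))
}

# Exact operating characteristics at a fixed true response rate p.
# Returns the probability of concluding efficacy and the expected sample size.
freq_oc <- function(n1, n2, p, p0, k, k_f) {
  grid <- expand.grid(y1 = 0:n1, y2 = 0:(n2 - n1))
  prob <- dbinom(grid$y1, n1, p) * dbinom(grid$y2, n2 - n1, p)
  # continue past the interim analysis unless BF_01 >= k_f (futility)
  cont <- bf01_unif(grid$y1, n1, p0) < k_f
  # efficacy requires continuing and BF_01 <= k at the final analysis
  eff <- cont & bf01_unif(grid$y1 + grid$y2, n2, p0) <= k
  c(reject = sum(prob[eff]), en = n1 + (n2 - n1) * sum(prob[cont]))
}

# Manual design n1 = 12, n2 = 24 with k = 1/3, k_f = 3, p0 = 0.2, dp = 0.4
oc_h0 <- freq_oc(12, 24, p = 0.2, p0 = 0.2, k = 1/3, k_f = 3)
oc_h1 <- freq_oc(12, 24, p = 0.4, p0 = 0.2, k = 1/3, k_f = 3)

# Feasibility check against the targets used above
feasible <- oc_h0[["reject"]] <= 0.10 && oc_h1[["reject"]] >= 0.75
```

For the manual design, this enumeration should agree closely with the frequentist rows of the summary above (type-I error about 0.083, power about 0.784, expected sample sizes about 17.3 under p0 and 23.0 under dp).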
We now let the function search for the frequentist-optimal design,
which minimizes the expected sample size under the null hypothesis
within a specified range of sample sizes. To do so, the arguments
algorithm = "manual", interim = 12 and
final = 24 are removed from the call. We also set the required
frequentist power to 80% and the type-I error rate to 5%
(target_freq_type1 = 0.05), widen the search range to n1_min = 5 and
n2_max = 100, and evaluate power at dp = 0.5. Finally, we tighten the
evidence threshold from moderate evidence, \(k=1/3\), to strong evidence,
\(k=1/10\):
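res_freq <- design_singlearm_bf(
  n1_min = 5,                   # lower bound of the search range
  n2_max = 100,                 # upper bound of the search range
  k = 1/10,                     # efficacy threshold for BF01 at the final analysis
  k_f = 3,                      # futility threshold for BF01 at the interim analysis
  p0 = 0.2,
  a0 = 1,
  b0 = 1,
  a1 = 1,
  b1 = 1,
  dp = 0.5,                     # response rate at which frequentist power is evaluated
  da0 = 1,                      # design prior under H0: Beta(da0, db0)
  db0 = 1,
  da1 = 2.5,                    # design prior under H1: Beta(da1, db1)
  db1 = 2,
  type = "direction",
  calibration = "frequentist",
  target_freq_power = 0.8,
  target_freq_type1 = 0.05
)
We inspect the results: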
summary(res_freq)
#> Summary: Single-arm two-stage Bayes factor design
#> ---------------------------------------------------------
#> Feasible: TRUE
#> Calibration: frequentist
#> Design prior under H0: Beta(1, 1) truncated to [0, p0]
#> Design prior under H1: Beta(2.5, 2) truncated to (p0, 1]
#>
#> Selected design: n1 = 7, n2 = 17
#>
#> Bayesian operating characteristics
#> Power: 0.7752
#> Type-I: 0.0056
#> CE H0: NA
#> EN H0: 8.69
#> EN H1: 16.09
#>
#> Frequentist operating characteristics
#> Power: 0.8119
#> Type-I: 0.0351
#> EN H0: 11.23
#> EN H1: 16.38
The summary provides all relevant information about the optimal design computed by the algorithm. Both the frequentist power (0.8119) and the frequentist type-I error (0.0351) meet the target constraints. The frequentist expected sample size under \(H_0\) reported in the summary (11.23) is the smallest among all two-stage designs in the specified sample size range, so the design is optimal in that sense.
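The optimality criterion described above can be sketched as a brute-force search: enumerate candidate pairs (n1, n2), keep those that meet both frequentist targets, and pick the one with the smallest expected sample size under p0. The code below is an illustration only. It assumes uniform analysis priors, uses a reduced search range (n1 from 5 to 15, n2 up to 30) to keep the run short, and uses hypothetical function names. The package may use a different algorithm.

```r
# Hypothetical helpers for illustration; not the package's internal code.
# BF_01 under uniform analysis priors: posterior odds of H0 over prior odds of H0.
bf01_unif <- function(y, n, p0) {
  post_h0 <- pbeta(p0, y + 1, n - y + 1)
  (post_h0 / (1 - post_h0)) / (p0 / (1 - p0))
}

# Probability of concluding efficacy and expected sample size at true rate p.
reject_and_en <- function(n1, n2, p, p0, k, k_f) {
  grid <- expand.grid(y1 = 0:n1, y2 = 0:(n2 - n1))
  prob <- dbinom(grid$y1, n1, p) * dbinom(grid$y2, n2 - n1, p)
  cont <- bf01_unif(grid$y1, n1, p0) < k_f
  eff <- cont & bf01_unif(grid$y1 + grid$y2, n2, p0) <= k
  c(reject = sum(prob[eff]), en = n1 + (n2 - n1) * sum(prob[cont]))
}

# Exhaustive search; requires n1_max < n2_max.
search_design <- function(n1_min, n1_max, n2_max, p0, dp, k, k_f, alpha, power) {
  best <- NULL
  for (n1 in n1_min:n1_max) {
    for (n2 in (n1 + 1):n2_max) {
      h0 <- reject_and_en(n1, n2, p0, p0, k, k_f)
      if (h0[["reject"]] > alpha) next          # type-I error target violated
      h1 <- reject_and_en(n1, n2, dp, p0, k, k_f)
      if (h1[["reject"]] < power) next          # power target violated
      if (is.null(best) || h0[["en"]] < best$en_h0) {
        best <- list(n1 = n1, n2 = n2, type1 = h0[["reject"]],
                     power = h1[["reject"]], en_h0 = h0[["en"]])
      }
    }
  }
  best  # NULL if no design in the range meets both targets
}

best <- search_design(5, 15, 30, p0 = 0.2, dp = 0.5,
                      k = 1/10, k_f = 3, alpha = 0.05, power = 0.8)
```

The same loop applies to the full range n1_min = 5, n2_max = 100, at a higher computational cost. Whether it selects the same design as the package depends on the package's internals.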
The returned object also includes the selected sample sizes (n1, n2)
and the operating characteristics at p0 and dp.
For example:
Also, more information is available by inspecting
which is not shown here to avoid cluttered output.
The search results can be visualized:
Figure 1: Output of the plot function for an optimal frequentist single-arm two-stage design using Bayes factors. The top left panel shows the Bayesian and frequentist power and the Bayesian type-I error for varying interim sample sizes. The top right panel provides information about the optimal frequentist design found by the algorithm and its Bayesian and frequentist operating characteristics. The lower left and right panels visualize the analysis and design priors under the null and alternative hypotheses. These priors are irrelevant for the frequentist operating characteristics; they influence only the Bayesian operating characteristics. Under the null hypothesis \(H_0:p=p_0\), the design and analysis priors are point masses at the specified null probability p0.
The plot shows how Bayesian and frequentist operating characteristics vary as a function of the interim sample size, and highlights the optimal choice selected by the algorithm.
Under calibration = "frequentist", the design has the
following key properties:
The frequentist type-I error does not exceed target_freq_type1 when
the true response rate is \(p = p_0\).
The frequentist power at the response rate dp is at least target_freq_power.
The Bayesian operating characteristics are still reported, but they do not drive the calibration; they serve as additional information about how the design performs under the specified design priors.
When using the frequentist mode in practice:
Choose dp as the clinically relevant response rate
under \(H_1\) where you want to guarantee power.
Check that the targets are attainable within the sample size range
up to n2_max. In particular, very high power combined with very
small type-I error can be incompatible with tight sample size
bounds.