---
title: "Getting started with fbrglm"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with fbrglm}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>",
  fig.width  = 6,
  fig.height = 4
)
set.seed(20260101)
```

## What `fbrglm` is for

`fbrglm` is a formula-based front-end for regularized generalized linear
models. Internally it delegates the fit to
[`glmnet`](https://cran.r-project.org/package=glmnet); the wrapper's job
is to make the user-facing experience look like base R's `glm()` — a
`formula` + `data.frame`, automatic factor handling, complete-case
filtering, and the familiar S3 methods (`print`, `summary`, `coef`,
`predict`, `nobs`, `plot`).

This vignette covers the MVP, which supports only `infer = "none"`: it
returns regularized point estimates and **does not** report classical
standard errors, z values, p values, or confidence intervals. Honest
post-selection inference (via data splitting or selective inference) is
on the roadmap; see the package `TODO.md`.

```{r setup}
library(fbrglm)
```

## A small binomial example

```{r}
n <- 150
dat <- data.frame(
    y  = rbinom(n, 1, 0.5),
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = rnorm(n)
)

fit <- fbrglm(y ~ x1 + x2 + x3, data = dat,
              family = "binomial",
              lambda = "cv_min")
```

`print()` shows the call and the basics of the fit:

```{r}
print(fit)
```

`summary()` returns a structured object that includes the call, family,
chosen λ, complete-case bookkeeping, and the (regularized) coefficient
vector with zeros included:

```{r}
summary(fit)
```

Coefficients and predictions follow the same shapes you'd expect from
`glm()`:

```{r}
coef(fit)

head(predict(fit, newdata = dat[1:5, ], type = "response"))
```

A `plot()` method is registered; it delegates to `plot.cv.glmnet()` when
λ was chosen by cross-validation, and to `plot.glmnet()` otherwise.

```{r, eval = FALSE}
plot(fit)
```

## Choosing `lambda`

There are three rules for choosing λ, all selected through the `lambda`
argument; the `"fix"` rule additionally takes its value from
`lambda_value`:

```{r}
fit_min <- fbrglm(y ~ x1 + x2 + x3, data = dat,
                  family = "binomial", lambda = "cv_min")
fit_1se <- fbrglm(y ~ x1 + x2 + x3, data = dat,
                  family = "binomial", lambda = "cv_1se")
fit_fix <- fbrglm(y ~ x1 + x2 + x3, data = dat,
                  family = "binomial",
                  lambda = "fix", lambda_value = 0.05)

c(cv_min = fit_min$lambda_value,
  cv_1se = fit_1se$lambda_value,
  fix    = fit_fix$lambda_value)
```

`"cv_min"` and `"cv_1se"` go through `cv.glmnet()`; `"fix"` skips CV and
goes straight to `glmnet()` at the supplied `lambda_value`. The numeric
λ actually used is always available at `fit$lambda_value`.
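
For orientation, the sketch below shows what the three rules plausibly
correspond to in raw `glmnet` terms. It assumes, without this vignette
confirming it, that `"cv_min"` and `"cv_1se"` map to the `lambda.min` and
`lambda.1se` elements of a `cv.glmnet` object. The `*_demo` objects are
illustrative only and are not part of `fbrglm`.

```{r}
library(glmnet)

# Illustrative data, independent of the objects used elsewhere in this vignette.
n_demo <- 150
x_demo <- cbind(x1 = rnorm(n_demo), x2 = rnorm(n_demo), x3 = rnorm(n_demo))
y_demo <- rbinom(n_demo, 1, 0.5)

# Cross-validated path: both CV-based choices of lambda come from one call.
cv_demo <- cv.glmnet(x_demo, y_demo, family = "binomial")
c(lambda.min = cv_demo$lambda.min, lambda.1se = cv_demo$lambda.1se)

# Fixed lambda: a single glmnet() fit at the supplied value, no cross-validation.
fix_demo <- glmnet(x_demo, y_demo, family = "binomial", lambda = 0.05)
fix_demo$lambda
```

In `glmnet`, `lambda.1se` is by construction never smaller than
`lambda.min`, so under the assumed mapping `"cv_1se"` yields a fit that
is at least as heavily penalized as `"cv_min"`.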

## Factor predictors

Factor columns are expanded into dummy columns via `model.matrix()`, and
the training factor levels are stored on the fit object so that
`predict(newdata = ...)` can rebuild a design matrix matching the
training column structure, even when some training levels do not occur
in `newdata`.

```{r}
n_train <- 200
train <- data.frame(
    y  = rbinom(n_train, 1, 0.5),
    x1 = rnorm(n_train),
    g  = factor(sample(c("A", "B", "C"), n_train, replace = TRUE),
                levels = c("A", "B", "C"))
)
fit_f <- fbrglm(y ~ x1 + g, data = train,
                family = "binomial",
                lambda = "fix", lambda_value = 0.05)

## newdata contains no rows with g == "C"; "C" is still a declared level
test <- data.frame(
    x1 = rnorm(10),
    g  = factor(rep(c("A", "B"), 5), levels = c("A", "B", "C"))
)
head(predict(fit_f, newdata = test, type = "response"))
```

`fbrglm` also tolerates the stricter case in which the factor in
`newdata` declares fewer **levels** than in training (not merely fewer
observed values): the dummy columns that are then missing from the
design matrix are padded with zeros before it is handed to `glmnet`.
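
The zero-padding idea can be sketched in base R. This is a minimal
illustration under stated assumptions, not the package's internal code:
`pad_to_training()` and the `*_demo` objects are hypothetical names
introduced here.

```{r}
# Training frame with a three-level factor.
train_demo <- data.frame(
    x1 = c(0.1, -0.4, 1.2, 0.7),
    g  = factor(c("A", "B", "C", "A"), levels = c("A", "B", "C"))
)
X_train    <- model.matrix(~ x1 + g, data = train_demo)
train_cols <- colnames(X_train)   # "(Intercept)" "x1" "gB" "gC"

# newdata whose factor declares only two of the three levels.
new_demo <- data.frame(
    x1 = c(0.3, -1.0),
    g  = factor(c("A", "B"), levels = c("A", "B"))
)
X_new <- model.matrix(~ x1 + g, data = new_demo)   # no "gC" column

# Add an all-zero column for every training column that is absent,
# then reorder the columns to match the training design matrix.
pad_to_training <- function(X, cols) {
    missing_cols <- setdiff(cols, colnames(X))
    pad <- matrix(0, nrow = nrow(X), ncol = length(missing_cols),
                  dimnames = list(NULL, missing_cols))
    cbind(X, pad)[, cols, drop = FALSE]
}

X_new_padded <- pad_to_training(X_new, train_cols)
colnames(X_new_padded)
```

Padding with zeros is appropriate here because a level that is absent
from `newdata` has, by definition, an indicator of zero in every row.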

## Missing values

`fbrglm()` drops rows with any `NA` from the modelling frame, prints a
one-line note, and records the counts on the fit object under
`fit$nobs_info`.

```{r}
dat_na <- dat
dat_na$y[1:5] <- NA
fit_na <- fbrglm(y ~ x1 + x2 + x3, data = dat_na,
                 family = "binomial",
                 lambda = "fix", lambda_value = 0.05)

fit_na$nobs_info
nobs(fit_na)
```
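
Complete-case filtering of this kind can be reproduced in base R with
`model.frame()`. The sketch below is only an illustration: the names in
`counts_demo` are hypothetical and are not claimed to match the layout
of `fit$nobs_info`.

```{r}
demo_na <- data.frame(
    y  = c(1, 0, NA, 1, 0),
    x1 = c(0.5, NA, 1.1, -0.3, 0.8)
)

# na.omit drops every row with an NA in a variable used by the formula.
mf_demo <- model.frame(y ~ x1, data = demo_na, na.action = na.omit)

counts_demo <- c(
    n_total   = nrow(demo_na),
    n_used    = nrow(mf_demo),
    n_dropped = nrow(demo_na) - nrow(mf_demo)
)
counts_demo
```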

## Offsets

An `offset` supplied at fit time is passed through to `glmnet()`. At
predict time the behaviour depends on `newdata`: with `newdata = NULL`
the stored training offset is reused; with `newdata` supplied, an
explicit `newoffset` with one value per row of `newdata` is required.

```{r}
n_off <- 80
dat_off <- data.frame(
    y  = rbinom(n_off, 1, 0.5),
    x1 = rnorm(n_off),
    x2 = rnorm(n_off)
)
fit_off <- fbrglm(y ~ x1 + x2, data = dat_off, family = "binomial",
                  offset = rep(0.2, n_off),
                  lambda = "fix", lambda_value = 0.05)

head(predict(fit_off, type = "response"))                  # reuses training offset
head(predict(fit_off, newdata = dat_off[1:5, ],
             newoffset = rep(0.2, 5), type = "response"))
```
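
To see why a prediction needs its own offset, the sketch below works
directly with `glmnet` and checks that a response-scale prediction is
the inverse logit of `offset + intercept + x %*% beta`. It assumes the
wrapper forwards `offset` and `newoffset` to `glmnet` unchanged; the
`*_o` objects are illustrative only.

```{r}
library(glmnet)

n_o   <- 80
x_o   <- cbind(x1 = rnorm(n_o), x2 = rnorm(n_o))
y_o   <- rbinom(n_o, 1, 0.5)
off_o <- rep(0.2, n_o)

fit_o <- glmnet(x_o, y_o, family = "binomial", offset = off_o, lambda = 0.05)

# Prediction for the first five rows, supplying their offsets explicitly.
p_o <- predict(fit_o, newx = x_o[1:5, ], newoffset = off_o[1:5],
               type = "response")

# The same quantity by hand: the offset enters the linear predictor additively.
b_o   <- as.numeric(coef(fit_o))            # intercept first, then slopes
eta_o <- off_o[1:5] + b_o[1] + x_o[1:5, ] %*% b_o[-1]
all.equal(as.numeric(p_o), as.numeric(plogis(eta_o)))
```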

## Reaching the underlying `glmnet` objects

If you need to use a `glmnet`-specific tool, two accessors get you out
of the wrapper:

```{r}
class(as_glmnet(fit_min))
class(as_cv_glmnet(fit_min))

class(as_glmnet(fit_fix))
as_cv_glmnet(fit_fix)        # NULL — no CV was run
```

`as_glmnet()` returns the underlying `glmnet` object (the `$glmnet.fit`
element of the `cv.glmnet` object when the wrapper used CV).
`as_cv_glmnet()` returns the `cv.glmnet` object, or `NULL` for the
`"fix"` λ path.
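
For readers unfamiliar with how the two `glmnet` object types relate,
here is a short sketch using `glmnet` directly rather than the
accessors. The `*_acc` objects are illustrative only.

```{r}
library(glmnet)

x_acc  <- cbind(x1 = rnorm(100), x2 = rnorm(100))
y_acc  <- rbinom(100, 1, 0.5)
cv_acc <- cv.glmnet(x_acc, y_acc, family = "binomial")

class(cv_acc)              # the cross-validation object
class(cv_acc$glmnet.fit)   # the full-data path fit stored inside it

# A glmnet-specific query: coefficients at an arbitrary lambda on the path.
coef(cv_acc$glmnet.fit, s = 0.05)
```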

## Limitations (intentional)

The MVP is deliberately narrow:

- only `infer = "none"` is implemented; `"split"` and `"selective"`
  are planned but not in this release.
- families: `gaussian`, `binomial`, `poisson` only. `multinomial` and
  `cox` will land later.
- the `x` / `y` direct-matrix entry point is reserved but not yet
  supported — supply `formula` + `data` instead.
- classical `glm()`-style standard errors, z / p values, and confidence
  intervals are intentionally **not** shown for `infer = "none"`. Doing
  so naively for regularized estimators would be misleading; honest
  inference is the next milestone.
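
Until the direct-matrix entry point is available, data held as an `x`
matrix and a `y` vector can be converted to the `formula` + `data` form
in base R. The sketch below uses hypothetical objects (`x_mat`, `y_vec`)
and leaves the `fbrglm()` call commented out, since it is shown only to
indicate where the converted inputs would go.

```{r}
x_mat <- cbind(x1 = c(0.2, -1.1, 0.7, 1.5),
               x2 = c(1.0,  0.3, -0.6, 0.1))
y_vec <- c(0, 1, 0, 1)

# One data frame holding the response and the predictor columns.
dat_xy <- data.frame(y = y_vec, x_mat)

# Build y ~ x1 + x2 from the matrix column names.
form_xy <- reformulate(colnames(x_mat), response = "y")
form_xy

# fit_xy <- fbrglm(form_xy, data = dat_xy,
#                  family = "binomial", lambda = "cv_min")
```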

Reproducible benchmarks against raw `glmnet`, `glmnetUtils`, and a
`parsnip` / `workflows` pipeline with the `glmnet` engine live in a
separate repository:
<https://github.com/dsc-chiba-u/fbrglm-experiments>. In the
prediction-failure case (narrowed test factor levels), a design matrix
built naively for raw `glmnet` can fail at predict time; `parsnip` /
`workflows` succeeds, but with higher runtime overhead than `fbrglm` in
the tested small-data setting. See the experiments repository for the
CSVs and figures.
