---
title: "Using statgen in R"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Using statgen in R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

`statgen` stores each analysis object on a shared reference SNP axis. The R
package loads reference panels, summary statistics, annotations, genotypes, and
LD panels and aligns them to that axis.

```{r}
library(statgen)
set_verbosity("quiet")
```

## Reference, Sumstats, And Annotations

The bundled package fixtures are intentionally tiny so that examples and CRAN
checks do not require external tools or large data.

This vignette reads those installed fixtures with
`system.file(..., package = "statgen")`. They are demonstration inputs only. In
real analyses, replace these paths with post-harmonization reference, summary
statistics, annotation, genotype, and LD files prepared for one analysis input
set. For an end-to-end repository workflow that prepares source inputs and then
runs the same R analysis pattern on those files, see
`docs/TUTORIAL_1_PREPARE_DATA.md` and `docs/TUTORIAL_4_R.md`.

```{r}
reference_template <- file.path(
  dirname(system.file("extdata", "reference_chr1.bim", package = "statgen")),
  "reference_chr@.bim"
)
reference <- load_reference(reference_template)

sumstats_path <- system.file("extdata", "traits_complete.tsv.gz", package = "statgen")
sumstats <- load_sumstats(sumstats_path, reference)

annotation_paths <- system.file(
  "extdata",
  c("anno1.bed", "anno2.bed"),
  package = "statgen"
)
annotations <- load_annotations(annotation_paths, reference)

num_snp(reference)
head(logpvec(sumstats))
colnames(annomat(annotations))
```
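
The reference template above contains an `@` where the bundled fixture
`reference_chr1.bim` has its chromosome label. The base R sketch below shows
how such a placeholder could be expanded into one path per chromosome. It is
an illustration only: `expand_chr_template()`, `toy_template`, and the
directory name are hypothetical, and the sketch does not claim to reproduce how
`load_reference()` resolves the template internally.

```{r}
# Hypothetical helper (not part of statgen): replace the "@" placeholder
# with each chromosome label to obtain one path per chromosome.
expand_chr_template <- function(template, chromosomes) {
  vapply(
    chromosomes,
    function(chr) sub("@", chr, template, fixed = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
}

# Made-up directory; only the file name pattern mirrors the fixture.
toy_template <- file.path("some_dir", "reference_chr@.bim")
toy_paths <- expand_chr_template(toy_template, as.character(1:3))
basename(toy_paths)
```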

R-native object caches are RDS files. They let repeated analyses in the same R
runtime reload a prepared object instead of parsing the source files again.

```{r}
sumstats_cache <- tempfile(fileext = ".rds")
save_sumstats_cache(sumstats, sumstats_cache)
sumstats_cached <- load_sumstats_cache(sumstats_cache)
identical(is_present(sumstats_cached), is_present(sumstats))
```
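
For readers unfamiliar with RDS files, the round trip can be sketched with base
R's `saveRDS()` and `readRDS()` on a plain list. Whether the statgen cache
helpers call these functions directly is an assumption; `toy_object` and
`toy_cache` are illustrative names.

```{r}
# A plain list stands in for an analysis object (illustrative data only).
toy_object <- list(
  snp = c("rs1", "rs2", "rs3"),
  present = c(TRUE, FALSE, TRUE)
)

# Serialize to a temporary RDS file and read it back.
toy_cache <- tempfile(fileext = ".rds")
saveRDS(toy_object, toy_cache)
toy_restored <- readRDS(toy_cache)

# The restored object should match the original exactly.
identical(toy_restored, toy_object)

# Remove the temporary file.
unlink(toy_cache)
```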

## LD Panels

Python builds the canonical LD `.npz` distribution. For package examples,
`ld_root` points to a bundled toy distribution that already includes the R
reference-cache sidecar. For real analyses, run `prepare_ld_npz_for_r()` once
on the Python-built `ld_npz/` directory for the same harmonized input set, then
load that directory with `load_ld()`.

```{r}
ld_root <- system.file("extdata", "ld", package = "statgen")
ld <- load_ld(ld_root)

num_snp(ld)
a1freq(ld)
```

`multiply_r2()` multiplies aligned vectors or matrices by LD `r^2`.
`fast_prune()` applies greedy LD pruning to aligned scores.

```{r}
scores <- seq_len(num_snp(ld))
multiply_r2(ld, scores)

logp <- rev(seq_len(num_snp(ld)))
fast_prune(logp, ld, r2_threshold = 0.35)
```
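
To make the two operations concrete, the sketch below reproduces the general
idea on a small dense matrix in base R. The matrix values, `toy_greedy_prune()`,
and the choice to remove SNPs with `r^2` at or above the threshold are all
assumptions made for illustration. The statgen functions work on the LD panel
object, and their tie-breaking rules and return values may differ.

```{r}
# Made-up LD correlation matrix for four SNPs.
toy_r <- matrix(
  c(1.0, 0.8, 0.1, 0.0,
    0.8, 1.0, 0.2, 0.0,
    0.1, 0.2, 1.0, 0.7,
    0.0, 0.0, 0.7, 1.0),
  nrow = 4, byrow = TRUE
)
toy_r2 <- toy_r^2

# Multiplying a score vector by r^2 is a matrix-vector product.
toy_scores <- c(1, 2, 3, 4)
toy_product <- drop(toy_r2 %*% toy_scores)
toy_product

# Hypothetical greedy pruning: visit SNPs from highest to lowest score,
# keep each SNP not yet removed, and remove every SNP whose r^2 with the
# kept SNP is at or above the threshold.
toy_greedy_prune <- function(score, r2, r2_threshold) {
  keep <- rep(FALSE, length(score))
  removed <- rep(FALSE, length(score))
  for (i in order(score, decreasing = TRUE)) {
    if (removed[i]) next
    keep[i] <- TRUE
    removed[r2[i, ] >= r2_threshold] <- TRUE
  }
  keep
}

toy_logp <- c(5, 9, 2, 4)
toy_keep <- toy_greedy_prune(toy_logp, toy_r2, r2_threshold = 0.35)
toy_keep
```

In this toy case SNPs 1 and 2 form one correlated pair and SNPs 3 and 4 form
another, so the sketch keeps the higher-scoring SNP from each pair.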

## Genotype Metadata And Fetching

Genotype panels keep PLINK metadata and fetch hardcalls on demand. The original
BED files must remain available unless a replacement `bed_path` is supplied.

```{r}
ref_chr1 <- load_reference(
  system.file("extdata", "reference_chr1.bim", package = "statgen")
)
bed_chr1 <- system.file("extdata", "genotype_1.bed", package = "statgen")
# Strip the ".bed" extension so that load_genotype() receives the file prefix.
genotype <- load_genotype(sub("\\.bed$", "", bed_chr1), ref_chr1)

geno <- fetch_genotypes_int8(genotype, c(1, 3))
dim(geno)
head(geno)
```