tidyGenR core functions cover all the steps necessary to determine sequence variants and genotypes from sequencing reads coming from multilocus amplicon libraries sequenced by high throughput sequencing technologies.
This vignette contains a standard workflow for paired-end data.
library(tidyGenR)
The raw and processed data used in this vignette come from amplicon libraries of Rattus baluensis (Camacho-Sanchez et al. 2026).
Naming of FASTQ must meet a given format.
# download test raw data (if not downloaded)
raw <-
system.file("extdata", "raw", package = "tidyGenR")
freads <- list.files(raw,
pattern = "1.fastq",
full.names = TRUE)
rreads <- list.files(raw,
pattern = "2.fastq",
full.names = TRUE)
Reads from different loci are mixed together in the same sample FASTQ. The known primer sequences at the ends of amplicons can be used to demultiplex reads by locus using demultiplex(). This function produces an executable for cutadapt. A data.frame with primer sequences can be used to feed demultiplex() with primer sequences.
# data.frame with primers
data("primers")
head(primers, 3)
#> locus fw rv
#> 1 nfkbia GCCTCCAAACACACAGTCAT TGAGGAGAGCTATGACACGG
#> 2 chrna9 TTATCTGGGAGAGCGTGACC TTGGGAAARGATGAACCGGC
#> 3 rogdi AGAARCCGGCTCACTACCC GAGGCACAGCTTGTTGAGG
p_sh_out <- tempfile(fileext = ".sh")
# demultiplex reads by locus using 3 primer pairs
demultiplex(
freads = freads,
rreads = rreads,
primers = primers[1:3, ],
sh_out = p_sh_out,
run = FALSE
)
demultiplex() creates a script as below:
readLines(p_sh_out)
#> [1] "#!/bin/bash"
#> [2] "cat /var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/samples | while read sample"
#> [3] "do"
#> [4] "/usr/local/bin/cutadapt \\"
#> [5] "--discard-untrimmed --pair-adapters --no-indels \\"
#> [6] "--overlap 15 \\"
#> [7] "-e 0.15 \\"
#> [8] "-g file:/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/fw_primers \\"
#> [9] "-G file:/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/rv_primers \\"
#> [10] "-o /var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/$sample.{name}.1.fastq.gz \\"
#> [11] "-p /var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/$sample.{name}.2.fastq.gz \\"
#> [12] "/private/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T/RtmpNrSQoX/Rinste0cb5728fb12/tidyGenR/extdata/raw/\"$sample\".1.fastq.gz \\"
#> [13] "/private/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T/RtmpNrSQoX/Rinste0cb5728fb12/tidyGenR/extdata/raw/\"$sample\".2.fastq.gz"
#> [14] "done > /var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/filee0fb1ce3180f.log"
If problems arise when running cutadapt run = TRUE, the script written to sh_out can be edited manually and run outside R. FASTQ with zero or few reads can be removed using remove_poor_fastq()
Variants are determined on the basis of 1-nt resolution. Thus, reads to be compared need to have the same length and minimum quality stantards (eg. no ‘Ns’) (see DADA2 documentation, Callahan et al. 2016). Sequence quality can be checked with FASTQC, dada2::plotQualityProfile() or other QC checking software to guide truncation lengths. Truncation and minimal quality filtering are performed with trunc_amp(), a wrapper function of dada2::filterAndTrim(). Global or locus-specific truncation lengths can be supplied in a data.frame.
# Download per-locus demultiplexed FASTQ
dem <-
system.file("extdata", "demultiplexed", package = "tidyGenR")
# truncate reads for one locus
p_trun <- file.path(tempdir(), "truncated")
one_locus <-
trunc_amp(
loci = "chrna9",
mode_trun = "pe",
in_dir = dem,
fw_pattern = "1.fastq.gz",
rv_pattern = "2.fastq.gz",
trunc_fr = c(250, 180),
max_ee = c(3, 3),
outdir = p_trun
)
#>
#> Samples detected:
#> BOR1061, BOR1063, BOR1069
#>
#> Loci detected:
#> chrna9, nfkbia, rogdi
#>
#> Sequencing files matching patterns seem to conform the required naming format 'sample.locus.[1|2].fastq.gz'
#>
#> Truncating chrna9
#> Creating output directory: /var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/truncated
#>
#> Forwards reads of chrna9 truncated at 250 nt.
#>
#> Reverse reads of chrna9 truncated at 180 nt.
#> Filtered/truncated reads written to/var/folders/2h/j11yhpy906z10zg4ncxqkx300000gn/T//RtmpRuux8U/truncated
It is possible to use locus-specific truncation lengths specified in a data.frame.
# dataframe with locus-specific truncation lengths
data("trunc_fr")
head(trunc_fr, 3)
#> locus trunc_f trunc_r
#> 1 nfkbia 198 197
#> 2 chrna9 196 182
#> 3 rogdi 183 185
# truncate reads for all loci using locus-specific truncation lengths
all_loci <-
trunc_amp(
mode_trun = "pe",
in_dir = dem,
fw_pattern = "1.fastq.gz",
rv_pattern = "2.fastq.gz",
trunc_fr = trunc_fr,
max_ee = c(3, 3),
outdir = p_trun
)
trunc_amp() writes truncated FASTQ to a directory and returns a list with in and out reads.
Truncated FASTQ:
head(list.files(p_trun))
#> [1] "BOR1061_chrna9_F_filt.fastq.gz" "BOR1061_chrna9_R_filt.fastq.gz"
#> [3] "BOR1061_nfkbia_F_filt.fastq.gz" "BOR1061_nfkbia_R_filt.fastq.gz"
#> [5] "BOR1061_rogdi_F_filt.fastq.gz" "BOR1061_rogdi_R_filt.fastq.gz"
IN and OUT reads after trunc_amp():
head(all_loci, 3)
#> $chrna9
#> reads.in reads.out
#> BOR1061 88 62
#> BOR1063 67 52
#> BOR1069 91 62
#>
#> $nfkbia
#> reads.in reads.out
#> BOR1061 92 46
#> BOR1063 94 44
#> BOR1069 93 52
#>
#> $rogdi
#> reads.in reads.out
#> BOR1061 91 50
#> BOR1063 93 60
#> BOR1069 93 55
Real sequence variants and their frequency are determined with variant_call(). It utilizes the ‘DADA2’ ASV determination pipeline, but for multiple loci. Determination of variants and potentially, merging paired-end reads, and removal of chimeras are performed altogether with variant_call(). Binned read qualities (i.e. from NovaSeq) require the use of error_function = loess_err_mod4 (details in ?loess_err_mod4()). Variants are returned in tidy format.
variants <- variant_call(in_dir = p_trun)
head(variants)
head(variants)
#> # A tibble: 6 × 7
#> sample locus variant reads nt md5 sequence
#> <chr> <fct> <chr> <int> <dbl> <chr> <chr>
#> 1 BOR1061 chrna9 01 28 196 9598f07291c660d1a00c1497350381ad TGCAGTGTG…
#> 2 BOR1061 chrna9 02 34 196 a189edcac75e0853e8fac55df31acede TGCAGTGTG…
#> 3 BOR1061 nfkbia 1 46 198 cf753e1049c5f09f805df1918309a34b CGTAGGGCA…
#> 4 BOR1061 rogdi 1 50 183 6ddfd64d697d724f09d5f0ba1c909de6 CAGCCACCC…
#> 5 BOR1063 chrna9 01 52 196 9598f07291c660d1a00c1497350381ad TGCAGTGTG…
#> 6 BOR1063 nfkbia 1 44 198 cf753e1049c5f09f805df1918309a34b CGTAGGGCA…
Variants can be further filtered with filter_variants() based on frequency (maf) and depth (ad) thresholds.
Genotypes can be determined from variants. So far, the implemented method for genotyping is based on the expected maximum number of alleles per sample and locus and the read threshold (ADt) for discriminating homozygotes from hemizygotes.
genotypes <-
genotype(variants, ADt = 10, ploidy = 2)
head(genotypes)
#> # A tibble: 6 × 8
#> sample locus allele allele_no reads nt md5 sequence
#> <chr> <chr> <chr> <int> <dbl> <dbl> <chr> <chr>
#> 1 BOR1061 chrna9 01 1 28 196 9598f07291c660d1a00c1497… TGCAGTG…
#> 2 BOR1061 chrna9 02 2 34 196 a189edcac75e0853e8fac55d… TGCAGTG…
#> 3 BOR1061 nfkbia 1 1 23 198 cf753e1049c5f09f805df191… CGTAGGG…
#> 4 BOR1061 nfkbia 1 2 23 198 cf753e1049c5f09f805df191… CGTAGGG…
#> 5 BOR1061 rogdi 1 1 25 183 6ddfd64d697d724f09d5f0ba… CAGCCAC…
#> 6 BOR1061 rogdi 1 2 25 183 6ddfd64d697d724f09d5f0ba… CAGCCAC…