| Title: | Fungal Assignment Pipeline |
| Version: | 0.1.0 |
| Description: | Fungi are ubiquitous in Earth's wonderfully diverse ecosystems. The 'ClassifyITS' package aids in the taxonomic classification of environmental internal transcribed spacer (ITS) short-read barcoding data. Unlike previous methods, it employs taxon-specific e-value and percent identity cutoffs at each taxonomic rank from kingdom to species. The package takes a conservative approach and outputs both graphics and user-friendly files to help users manually inspect fungal operational taxonomic units (OTUs) that fail classification at relevant levels (e.g., Phylum). 'ClassifyITS' is based on taxonomic cutoff criteria from "The Global Soil Mycobiome consortium dataset for boosting fungal diversity research" (Fungal Diversity, Tedersoo, 2021, <doi:10.1007/s13225-021-00493-7>) and "Best practices in metabarcoding of fungi: From experimental design to results" (Molecular Ecology, Tedersoo, 2022, <doi:10.1111/mec.16460>). |
| License: | GPL-3 |
| Encoding: | UTF-8 |
| Imports: | ggplot2, dplyr, gridExtra, grid, reshape2, data.table, seqinr |
| Suggests: | formatR, knitr, rmarkdown |
| RoxygenNote: | 7.3.3 |
| VignetteBuilder: | knitr, rmarkdown, formatR |
| NeedsCompilation: | no |
| Packaged: | 2026-04-03 11:04:46 UTC; quinnmoon |
| Author: | Quinn Moon [aut, cre] |
| Maintainer: | Quinn Moon <qmoon@umich.edu> |
| Repository: | CRAN |
| Date/Publication: | 2026-04-09 15:20:09 UTC |
Complete Fungal Assignment Pipeline
Description
Runs all steps: QC, filtering, plotting, assignments; optionally writes outputs.
Usage
ITS_assignment(
blast_file,
rep_fasta,
cutoffs_file = NULL,
cutoff_fraction = 0.6,
n_cutoff = 1,
outdir = NULL,
verbose = FALSE
)
Arguments
blast_file |
Path to BLAST results TSV file |
rep_fasta |
Path to representative sequences FASTA file |
cutoffs_file |
Path to taxonomy cutoffs CSV file (optional; defaults to package example if omitted) |
cutoff_fraction |
Numeric, fraction of median rep-seq length for BLAST filtering (default: 0.6) |
n_cutoff |
Numeric, N base percentage cutoff (default: 1) |
outdir |
Output directory for results. If NULL (default), nothing is written. |
verbose |
Logical; if TRUE, emit progress messages. Default FALSE. |
Value
Named list of results and (if written) output file paths
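A minimal call sketch, assuming the package is installed and that the two input files exist. The file names below are hypothetical placeholders, not files shipped with the package, and the guard simply skips the call when anything is missing. Leaving out cutoffs_file relies on the documented fallback to the package example.

```r
# Hypothetical input paths; replace with your own files.
args <- list(
  blast_file      = "blast_results.tsv",  # assumed file name
  rep_fasta       = "rep_seqs.fasta",     # assumed file name
  cutoff_fraction = 0.6,                  # documented default
  n_cutoff        = 1,                    # documented default
  outdir          = NULL,                 # NULL: nothing is written to disk
  verbose         = TRUE
)

# Run only when the package and both input files are available.
can_run <- requireNamespace("ClassifyITS", quietly = TRUE) &&
  file.exists(args$blast_file) &&
  file.exists(args$rep_fasta)

if (can_run) {
  res <- do.call(ClassifyITS::ITS_assignment, args)
  print(names(res))  # inspect which result elements were returned
}
```

Setting outdir to a directory path instead of NULL should additionally write the output files, as described for that argument.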
Hierarchical best-hit taxonomy assignment with per-rank fallback rule
Description
Pass ONLY those OTUs that haven't been assigned already! For each rank, if the best e-value hit is undefined and the second-best hit is defined and at least 60
Usage
best_hit_taxonomy_assignment(blast_qc, cutoffs_long, defaults)
Arguments
blast_qc |
A data.frame of BLAST results for query sequences, must include columns for taxonomic ranks and alignment statistics. |
cutoffs_long |
A data.frame specifying per-rank cutoffs for assignment. Must include columns 'rank', 'cutoff_type', and 'cutoff_value'. |
defaults |
A named list of default cutoff values for each rank, used as fallback if no matching cutoff found. |
Value
A data.frame containing hierarchical taxonomy assignment for each query sequence.
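The relationship between cutoffs_long and defaults can be sketched as a lookup with fallback. This is an illustration in base R, not the package's implementation: the helper name lookup_cutoff, the cutoff_type label "pident", and all numeric values are assumptions made for the example. Only the three column names come from the argument description above.

```r
# Toy cutoff table with the three documented columns.
# The cutoff_type label "pident" and all values are illustrative assumptions.
cutoffs_long <- data.frame(
  rank         = c("Phylum", "Genus"),
  cutoff_type  = c("pident", "pident"),
  cutoff_value = c(75, 90),
  stringsAsFactors = FALSE
)

# Hypothetical defaults, one per rank.
defaults <- list(Phylum = 70, Class = 80, Genus = 88, Species = 98)

# Return the table cutoff for a rank/type pair, else the default for that rank.
lookup_cutoff <- function(cutoffs_long, defaults, rank, cutoff_type) {
  hit <- cutoffs_long$cutoff_value[
    cutoffs_long$rank == rank & cutoffs_long$cutoff_type == cutoff_type
  ]
  if (length(hit) > 0 && !is.na(hit[1])) hit[1] else defaults[[rank]]
}

phylum_cutoff <- lookup_cutoff(cutoffs_long, defaults, "Phylum", "pident")  # found in the table
class_cutoff  <- lookup_cutoff(cutoffs_long, defaults, "Class", "pident")   # falls back to defaults
```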
Check proportion of N bases in each sequence.
Description
Calculates the proportion of "N" bases (ambiguous bases) in each sequence and flags if above the given threshold.
Usage
check_N(rep_seqs, cutoff = 1)
Arguments
rep_seqs |
Character vector, list (e.g., from seqinr::read.fasta(as.string=TRUE)), or (optionally) a DNAStringSet. |
cutoff |
Numeric, percent threshold (default 1). |
Value
Data frame with columns: qseqid, N_percent, N_flag.
Examples
seqs <- c(seq1 = "ATGCNNNN", seq2 = "NNNNATGC")
check_N(seqs)
check_N(seqs, cutoff = 10)
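The calculation described above can be sketched in base R as follows. The function name n_content_sketch is hypothetical, and the sketch assumes that the input is named, that "above the threshold" means strictly greater than cutoff, and that lower-case "n" also counts. The package's own check_N may differ on each of these points.

```r
# Sketch only: percent of N bases per sequence, flagged when above `cutoff`.
n_content_sketch <- function(seqs, cutoff = 1) {
  seqs <- unlist(seqs)  # accept a named list or a named character vector
  # Count matches of "N" (case-insensitive) in each sequence.
  n_count <- lengths(regmatches(seqs, gregexpr("N", seqs, ignore.case = TRUE)))
  n_pct <- unname(100 * n_count / nchar(seqs))
  data.frame(
    qseqid    = names(seqs),
    N_percent = n_pct,
    N_flag    = n_pct > cutoff,  # assumption: strict inequality
    stringsAsFactors = FALSE
  )
}

seqs <- c(seq1 = "ATGCNNNN", seq2 = "NNNNATGC")
res <- n_content_sketch(seqs)
```

Both toy sequences contain four N bases out of eight, so the sketch reports 50 percent for each and flags both at the default cutoff of 1.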
Per-rank consensus filter for taxonomy assignment
Description
Only confirms or demotes, never promotes Unclassified.
Usage
consensus_taxonomy_assignment(final_table, blast_qc)
Arguments
final_table |
Data frame of taxonomic assignments. |
blast_qc |
Data frame of filtered BLAST hits for each OTU. |
Value
Data frame of consensus assignments (same structure as input).
Easy taxonomy assignment for OTUs using BLAST QC output & phylum-specific thresholds.
Description
Easy taxonomy assignment for OTUs using BLAST QC output & phylum-specific thresholds.
Usage
easy_assignments(blast_filtered, cutoffs_file = NULL, default_cutoff = 98)
Arguments
blast_filtered |
QC-filtered BLAST dataframe (with parsed taxonomy columns!) |
cutoffs_file |
Path to taxonomy cutoffs CSV file. If not supplied or invalid, attempts to locate the default file in the package. |
default_cutoff |
Default percent identity cutoff for species assignment (default: 98) |
Value
List with assigned_otus_df and remaining_otus_df
Ensure data frame has all required columns (as character)
Description
Ensure data frame has all required columns (as character)
Usage
ensure_cols(df, all_cols)
Arguments
df |
Data frame to fix |
all_cols |
Vector of required columns |
Value
Fixed data frame (in correct order, with all columns present)
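One way to picture this helper, as a base R sketch rather than the package's code: missing columns are added as NA character columns, every required column is coerced to character, and the result is returned in the order given by all_cols. The name ensure_cols_sketch is hypothetical, and dropping columns not listed in all_cols is an assumption.

```r
# Sketch only: guarantee that `df` has every column in `all_cols`, as character.
ensure_cols_sketch <- function(df, all_cols) {
  # Add each missing column as an all-NA character column.
  for (col in setdiff(all_cols, names(df))) {
    df[[col]] <- rep(NA_character_, nrow(df))
  }
  # Coerce the required columns to character.
  df[all_cols] <- lapply(df[all_cols], as.character)
  # Return only the required columns, in the requested order (assumption).
  df[, all_cols, drop = FALSE]
}

df <- data.frame(qseqid = "OTU1", Phylum = "Ascomycota", stringsAsFactors = FALSE)
fixed <- ensure_cols_sketch(df, c("qseqid", "Phylum", "Class"))
```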
Load and check BLAST results and rep-seq FASTA
Description
Load and check BLAST results and rep-seq FASTA
Usage
load_and_check(blast_file, rep_fasta, taxonomy_col = "stitle", verbose = FALSE)
Arguments
blast_file |
Path to BLAST results TSV file. |
rep_fasta |
Path to representative sequences FASTA file. |
taxonomy_col |
The column in BLAST file containing taxonomy strings (default "stitle"). |
verbose |
Logical; if TRUE, emit progress messages. Default FALSE. |
Value
List with BLAST dataframe (kingdom-filtered) and rep_seqs as a named list of DNA strings.
Parse taxonomy cutoffs file
Description
Reads and processes a taxonomy cutoffs CSV for assignment thresholds at various ranks.
Usage
parse_taxonomy_cutoffs(cutoffs_file = NULL)
Arguments
cutoffs_file |
Path to a taxonomy cutoffs CSV file. If not supplied or invalid, attempts to locate the default file in the package. |
Value
A list with two elements: long, a data frame of parsed cutoffs, and ranks, the vector of taxonomic ranks.
Create and return alignment length histogram (ggplot object)
Description
Create and return alignment length histogram (ggplot object)
Usage
plot_alignment_hist(blast, rep_seqs, cutoff_fraction = 0.6)
Arguments
blast |
BLAST data frame. |
rep_seqs |
Named list/character vector of DNA strings (from seqinr::read.fasta(as.string = TRUE)). |
cutoff_fraction |
Numeric; fraction of median alignment length for cutoff line. Default 0.6. |
Value
A ggplot object.
Safely rbinds list of data frames, ensuring columns match
Description
Safely rbinds list of data frames, ensuring columns match
Usage
safe_rbind_list(dfs, all_cols = NULL)
Arguments
dfs |
List of data frames |
all_cols |
Vector of required columns |
Value
Combined data frame
Save taxonomy summary charts and tables to multi-page PDF
Description
Save taxonomy summary charts and tables to multi-page PDF
Usage
save_taxonomy_graphics(
all_results,
hist_plot,
pdf_file = NULL,
caption_texts = NULL,
rank_names = c("Phylum", "Class", "Order", "Family", "Genus", "Species"),
verbose = FALSE
)
Arguments
all_results |
Combined assignments table from write_initial_assignments |
hist_plot |
ggplot2 object for histogram |
pdf_file |
Output path for multi-page PDF. If NULL (default), no file is written. |
caption_texts |
Vector of captions for PDF pages (optional) |
rank_names |
Vector of rank names (default: c("Phylum",...)) |
verbose |
Logical; if TRUE, emit a message when a PDF is written. Default FALSE. |
Value
List with plots/tables; includes pdf_file when written.
Trim BLAST alignments by minimum length
Description
Trim BLAST alignments by minimum length
Usage
trim_alignments(blast, rep_seqs, fraction = 0.6)
Arguments
blast |
BLAST data frame. |
rep_seqs |
Named list/character vector of DNA strings (from seqinr::read.fasta(as.string = TRUE)). |
fraction |
Numeric; fraction of the median rep-seq length used as the cutoff. Default 0.6. |
Value
Filtered BLAST data frame.
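The length filter described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the name trim_sketch is hypothetical, the alignment length is assumed to live in a column called length (the standard BLAST tabular field name), and hits exactly at the cutoff are assumed to be kept.

```r
# Sketch only: keep BLAST hits whose alignment length reaches a fraction
# of the median representative-sequence length.
trim_sketch <- function(blast, rep_seqs, fraction = 0.6) {
  min_len <- fraction * median(nchar(unlist(rep_seqs)))
  blast[blast$length >= min_len, , drop = FALSE]  # assumption: inclusive cutoff
}

# Toy data: rep-seq lengths 200, 300, 250 give a median of 250.
rep_seqs <- list(
  OTU1 = strrep("A", 200),
  OTU2 = strrep("C", 300),
  OTU3 = strrep("G", 250)
)
blast <- data.frame(
  qseqid = c("OTU1", "OTU2", "OTU3"),
  length = c(190, 120, 150),
  stringsAsFactors = FALSE
)

kept <- trim_sketch(blast, rep_seqs)
```

With fraction = 0.6 the cutoff is 150, so the hit of length 120 is dropped and the other two are kept.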
Create and write the initial assignments table including drops at all steps
Description
Create and write the initial assignments table including drops at all steps
Usage
write_initial_assignments(
easy_df,
consensus_df,
rep_seqs,
blast,
blast_filtered,
file = NULL,
verbose = FALSE
)
Arguments
easy_df |
Data frame of easy-assigned OTUs |
consensus_df |
Data frame of consensus-assigned OTUs (hard ones) |
rep_seqs |
DNAStringSet or named character vector of rep seqs |
blast |
Data frame of all BLAST results |
blast_filtered |
Data frame of filtered BLAST results |
file |
Path for output CSV. If NULL (default), no file is written. |
verbose |
Logical; if TRUE, emit a message when a file is written. Default FALSE. |
Value
Data frame of assignments (written if file is not NULL)