| Title: | Fast Multi-Pattern String Matching with the 'Aho-Corasick' Algorithm |
| Version: | 0.2.0 |
| Description: | Provides fast multi-pattern string matching for 'R' using the 'Aho-Corasick' algorithm, powered by the 'Rust' 'aho-corasick' crate. It builds reusable automatons for detecting matches, counting matches, locating matches by character offset, extracting matched text, and replacing matches in character vectors. For more details on the 'Aho-Corasick' algorithm, see Aho and Corasick (1975) <doi:10.1145/360825.360855>. |
| License: | MIT + file LICENSE |
| URL: | https://yousa-mirage.github.io/r-ahocorasick/, https://github.com/Yousa-Mirage/r-ahocorasick |
| BugReports: | https://github.com/Yousa-Mirage/r-ahocorasick/issues |
| Encoding: | UTF-8 |
| Config/testthat/edition: | 3 |
| Config/testthat/parallel: | true |
| Config/rextendr/version: | 0.5.0 |
| SystemRequirements: | Cargo (Rust's package manager), rustc >= 1.65.0, xz |
| Depends: | R (≥ 4.2) |
| Imports: | checkmate, cli, fs, rlang |
| Suggests: | dplyr, knitr, pkgdown, rmarkdown, tibble, tidyr, testthat (≥ 3.0.0) |
| VignetteBuilder: | knitr |
| Language: | en-US |
| Config/roxygen2/version: | 8.0.0 |
| NeedsCompilation: | yes |
| Packaged: | 2026-05-29 15:07:59 UTC; Yousa-Mirage |
| Author: | Hao Cheng [aut, cre, cph] |
| Maintainer: | Hao Cheng <Yousa-Mirage@foxmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-06-02 11:00:07 UTC |
Build an Aho-Corasick automaton
Description
ac_build() compiles a character vector of patterns into a reusable
automaton backed by the Rust aho-corasick crate.
Usage
ac_build(
patterns,
match_kind = c("standard", "leftmost_first", "leftmost_longest"),
implementation = c("auto", "noncontiguous_nfa", "contiguous_nfa", "dfa"),
ascii_case_insensitive = FALSE,
duplicate = c("keep", "error", "deduplicate")
)
Arguments
patterns |
A character vector of non-empty patterns. |
match_kind |
Matching semantics:
|
implementation |
Rust automaton implementation. |
ascii_case_insensitive |
Use ASCII-only case-insensitive matching. Default is |
duplicate |
How duplicate patterns are handled:
|
Value
An immutable <ac_automaton> object.
See Also
ac_locate(), ac_locate_df(), ac_detect(), ac_count(),
ac_extract(), ac_extract_df(), ac_replace(), ac_patterns().
Examples
ac <- ac_build(c("hello", "world"))
length(ac)
ac_info(ac)
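The leftmost match kinds can be pictured with a base-R analogy. The sketch below does not call this package. It assumes that "leftmost_first" behaves like a Perl-style regex alternation (earlier patterns win) and "leftmost_longest" like an alternation that tries longer patterns first, which is how the Rust crate describes these semantics. The helper find_all() is hypothetical, and the patterns contain only letters, so no regex escaping is needed.

```r
# Base-R analogy only; nothing here uses the package's automaton.
patterns <- c("append", "appendage", "app")
doc <- "append the app to the appendage"

# Hypothetical helper: all non-overlapping matches of a PCRE alternation.
find_all <- function(alternatives, text) {
  rx <- paste(alternatives, collapse = "|")
  regmatches(text, gregexpr(rx, text, perl = TRUE))[[1]]
}

# Leftmost-first analogue: at a given start position, the earliest
# pattern in the vector that matches is reported.
first_hits <- find_all(patterns, doc)

# Leftmost-longest analogue: order the alternatives by decreasing length
# so the longest pattern matching at a start position is reported.
longest_hits <- find_all(patterns[order(-nchar(patterns))], doc)

print(first_hits)
print(longest_hits)
```

In this sketch the two orderings differ only on the final word, where the first reports "append" and the second reports "appendage".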
Count pattern matches in documents
Description
ac_count() returns the number of pattern matches in each document.
Usage
ac_count(ac, doc, overlapping = FALSE, na = c("keep", "zero", "error"))
Arguments
ac |
An |
doc |
A character vector of documents to search. |
overlapping |
Default is |
na |
How to handle |
Value
An integer vector with the same length as doc.
See Also
ac_count_file(), ac_detect(), ac_locate(), ac_extract().
Examples
if (requireNamespace("dplyr", quietly = TRUE)) {
ac <- ac_build(c("hello", "world"))
docs <- data.frame(doc = c("hello world", "nothing", "world"))
dplyr::mutate(docs, n_matches = ac_count(ac, doc))
}
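The difference that the overlapping argument is presumably meant to control can be shown without the package. The following base-R sketch assumes the usual meaning of the term: a non-overlapping search resumes after the end of each match, while an overlapping search considers every start position. The pattern and document are made up for illustration.

```r
# Base-R illustration of non-overlapping versus overlapping counts.
pattern <- "aba"
doc <- "ababa"

# Non-overlapping: gregexpr() resumes searching after each match ends.
# A result of -1 means "no match", so only positive positions are counted.
non_overlapping <- sum(gregexpr(pattern, doc, fixed = TRUE)[[1]] > 0)

# Overlapping: compare the pattern against the substring at every
# possible start position.
starts <- seq_len(nchar(doc) - nchar(pattern) + 1)
windows <- substring(doc, starts, starts + nchar(pattern) - 1)
overlapping <- sum(windows == pattern)

print(c(non_overlapping = non_overlapping, overlapping = overlapping))
```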
Count pattern matches in files
Description
ac_count_file() returns the number of pattern matches in each file.
Usage
ac_count_file(ac, path, stream = FALSE, overlapping = FALSE)
Arguments
ac |
An |
path |
A vector of file paths to search. |
stream |
If |
overlapping |
Default is |
Value
An integer vector with the same length as path.
See Also
ac_count(), ac_detect_file(), ac_locate_bytes().
Examples
ac <- ac_build(c("hello", "world"))
path <- tempfile()
writeLines("hello hello world", path)
ac_count_file(ac, path)
Detect pattern matches in documents
Description
ac_detect() returns whether each document has at least one pattern match.
Usage
ac_detect(ac, doc, na = c("keep", "false", "error"))
Arguments
ac |
An |
doc |
A character vector of documents to search. |
na |
How to handle |
Value
A logical vector with the same length as doc.
See Also
ac_detect_file(), ac_count(), ac_locate(), ac_extract().
Examples
if (requireNamespace("dplyr", quietly = TRUE)) {
ac <- ac_build(c("hello", "world"))
docs <- data.frame(doc = c("hello world", "nothing", "world"))
dplyr::mutate(docs, matched = ac_detect(ac, doc))
}
Detect pattern matches in files
Description
ac_detect_file() returns whether each file has at least one pattern match.
Usage
ac_detect_file(ac, path, stream = FALSE)
Arguments
ac |
An |
path |
A vector of file paths to search. |
stream |
If |
Value
A logical vector with the same length as path.
See Also
ac_detect(), ac_count_file(), ac_locate_bytes().
Examples
ac <- ac_build(c("hello", "world"))
path <- tempfile()
writeLines("hello world", path)
ac_detect_file(ac, path)
Extract pattern matches from documents
Description
ac_extract() returns one list element per document. Each element contains
the matched text and the corresponding pattern values.
Usage
ac_extract(ac, doc, overlapping = FALSE, na = c("keep", "empty", "error"))
Arguments
ac |
An |
doc |
A character vector of documents to search. |
overlapping |
Default is |
na |
How to handle |
Value
A list with the same length as doc. Each element is a data frame
with one row per match and two columns:
- matches: Text matched in the document.
- patterns: Pattern values corresponding to each match.
See Also
ac_extract_df(), ac_locate(), ac_detect(), ac_count().
Examples
if (
requireNamespace("dplyr", quietly = TRUE) &&
requireNamespace("tibble", quietly = TRUE) &&
requireNamespace("tidyr", quietly = TRUE)
) {
ac <- ac_build(c("hello", "world"))
tibble::tibble(doc = c("hello world", "nothing", "world")) |>
dplyr::mutate(extracted = ac_extract(ac, doc)) |>
tidyr::unnest(extracted)
}
Extract pattern matches as a data frame
Description
ac_extract_df() is the data-frame form of ac_extract(). It is useful when
you want one row per match instead of one list element per document.
Usage
ac_extract_df(ac, doc, overlapping = FALSE, na = c("omit", "keep", "error"))
Arguments
ac |
An |
doc |
A character vector of documents to search. |
overlapping |
Default is |
na |
How to handle |
Value
A data frame with one row per match and three columns:
doc_id, matches, and patterns.
See Also
Examples
ac <- ac_build(c("hello", "world"))
doc <- c("hello world", "nothing", "world hello")
ac_extract_df(ac, doc)
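The relationship between the list form and the data-frame form can be sketched in base R. The example below uses a hand-written list shaped like the documented return value of ac_extract() and flattens it into the three documented columns. It assumes that doc_id is the position of the document in doc; the values in per_doc are illustrative, not output captured from the package.

```r
# Hand-written stand-in for a per-document result list (hypothetical values).
per_doc <- list(
  data.frame(matches = c("hello", "world"), patterns = c("hello", "world")),
  data.frame(matches = character(0), patterns = character(0)),
  data.frame(matches = "world", patterns = "world")
)

# Prefix each element with its document index, then stack the rows.
flat <- do.call(rbind, Map(function(df, i) {
  data.frame(doc_id = rep(i, nrow(df)), df)
}, per_doc, seq_along(per_doc)))
rownames(flat) <- NULL

print(flat)
```

Documents without matches contribute no rows, so the second document does not appear in the flattened result.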
Extract pattern matches from files
Description
ac_extract_file() returns one list element per file. Each element contains
the matched text and the corresponding pattern values.
Usage
ac_extract_file(ac, path, stream = FALSE, overlapping = FALSE)
Arguments
ac |
An |
path |
A vector of file paths to search. |
stream |
If |
overlapping |
Default is |
Value
A list with the same length as path. Each element is a data frame
with one row per match and two columns:
- matches: Text matched in the file.
- patterns: Pattern values corresponding to each match.
See Also
ac_extract(), ac_detect_file(), ac_count_file().
Examples
ac <- ac_build(c("hello", "world"))
path <- tempfile()
writeLines("hello world", path)
ac_extract_file(ac, path)
Return automaton metadata
Description
ac_info() returns metadata about a compiled automaton.
Usage
ac_info(ac)
Arguments
ac |
An |
Value
A list of automaton metadata.
See Also
Examples
ac <- ac_build(c("hello", "world"))
ac_info(ac)
Locate pattern matches in strings
Description
ac_locate() searches a character vector with a compiled automaton and
returns one list element per document. Character offsets are 1-based and
inclusive, so they can be used directly with substr().
Usage
ac_locate(ac, doc, overlapping = FALSE, na = c("keep", "empty", "error"))
Arguments
ac |
An |
doc |
A character vector of documents to search. |
overlapping |
Default is |
na |
How to handle |
Value
A list with the same length as doc. Each element is a data frame
with one row per match and three columns:
- pattern_id: Index of the matched pattern in ac_patterns(ac).
- start: 1-based index of the first character in each match.
- end: 1-based index of the last character in each match.
See Also
ac_locate_df(), ac_locate_bytes(), ac_extract(),
ac_detect(), ac_count().
Examples
if (
requireNamespace("dplyr", quietly = TRUE) &&
requireNamespace("tibble", quietly = TRUE) &&
requireNamespace("tidyr", quietly = TRUE)
) {
ac <- ac_build(c("hello", "world"))
tibble::tibble(doc = c("hello world", "nothing", "world")) |>
dplyr::mutate(hits = ac_locate(ac, doc)) |>
tidyr::unnest(hits)
}
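Because the offsets are 1-based and inclusive, they line up with the start and stop arguments of substr(). The sketch below uses a hand-written data frame with the documented columns rather than real package output, so the offsets are assumptions chosen to match "hello world".

```r
# Hypothetical hits for one document, in the documented column layout.
doc <- "hello world"
hits <- data.frame(
  pattern_id = c(1L, 2L),
  start = c(1L, 7L),
  end = c(5L, 11L)
)

# substr() returns one result per element of its first argument,
# so the document is repeated once per hit.
matched <- substr(rep(doc, nrow(hits)), hits$start, hits$end)
print(matched)
```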
Locate pattern matches with byte offsets
Description
ac_locate_bytes() searches a character vector with a compiled automaton
and returns byte offsets from the Rust aho-corasick crate. Byte offsets are
0-based, and byte_end is end-exclusive.
Usage
ac_locate_bytes(ac, doc, overlapping = FALSE, na = c("omit", "keep", "error"))
Arguments
ac |
An |
doc |
A character vector of documents to search. |
overlapping |
Default is |
na |
How to handle |
Value
A data frame with one row per match and four columns:
doc_id, pattern_id, byte_start, and byte_end.
See Also
Examples
ac <- ac_build(c("hello", "world"))
doc <- c("hello world", "nothing", "world hello")
ac_locate_bytes(ac, doc)
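A 0-based, end-exclusive byte range can be turned into text by slicing the raw bytes of the document. The base-R sketch below uses hand-computed offsets for a document with two multi-byte characters. The offsets are illustrative assumptions, not values returned by the package.

```r
# "café naïve world", written with escapes so the source stays ASCII.
doc <- "caf\u00e9 na\u00efve world"
bytes <- charToRaw(enc2utf8(doc))

# Hand-computed byte range of "world": 0-based start, exclusive end.
byte_start <- 13L
byte_end <- 18L

# R vectors are 1-based and ranges are inclusive, so shift only the start.
matched <- rawToChar(bytes[(byte_start + 1L):byte_end])
print(matched)
```

The match length in bytes is simply byte_end - byte_start, which is one reason end-exclusive ranges are convenient.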
Locate pattern matches as a data frame
Description
ac_locate_df() is the data-frame form of ac_locate(). It is useful when
you want one row per match instead of one list element per document.
Usage
ac_locate_df(ac, doc, overlapping = FALSE, na = c("omit", "keep", "error"))
Arguments
ac |
An |
doc |
A character vector of documents to search. |
overlapping |
Default is |
na |
How to handle |
Value
A data frame with one row per match and four columns:
doc_id, pattern_id, start, and end.
See Also
ac_locate(), ac_locate_bytes(), ac_extract_df().
Examples
ac <- ac_build(c("hello", "world"))
doc <- c("hello world", "nothing", "world hello")
ac_locate_df(ac, doc)
Locate pattern matches in files
Description
ac_locate_file() searches files with a compiled automaton and returns one
list element per file. Character offsets are 1-based and inclusive, so they
can be used directly with substr().
Usage
ac_locate_file(ac, path, overlapping = FALSE)
Arguments
ac |
An |
path |
A vector of file paths to search. |
overlapping |
Default is |
Details
File location search is always non-streaming. Converting byte offsets from a
streaming search into R-facing character offsets would require a second pass
over the same file to reconstruct UTF-8 character boundaries. Keeping
ac_locate_file() as a simple in-memory search is the clearest
implementation.
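To see why character offsets need the UTF-8 text that precedes a match, consider the following base-R sketch. The helper byte_to_char() is hypothetical and is not the package's implementation. It converts a 0-based, end-exclusive byte range into 1-based, inclusive character offsets by counting the characters in the byte prefix, which requires the whole document up to the match to be available.

```r
# Hypothetical helper: byte range (0-based, end-exclusive) to
# character offsets (1-based, inclusive).
byte_to_char <- function(doc, byte_start, byte_end) {
  bytes <- charToRaw(enc2utf8(doc))
  count_chars <- function(n) {
    # Decode the first n bytes and count the characters they contain.
    prefix <- rawToChar(bytes[seq_len(n)])
    Encoding(prefix) <- "UTF-8"
    nchar(prefix, type = "chars")
  }
  c(start = count_chars(byte_start) + 1L, end = count_chars(byte_end))
}

# "café naïve world": "world" occupies bytes 13 to 17 (0-based).
doc <- "caf\u00e9 na\u00efve world"
offsets <- byte_to_char(doc, 13L, 18L)
print(offsets)
print(substr(doc, offsets[["start"]], offsets[["end"]]))
```

The two multi-byte characters before the match shift the character offsets two positions to the left of the byte offsets.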
Value
A list with the same length as path. Each element is a data frame
with one row per match and three columns:
- pattern_id: Index of the matched pattern in ac_patterns(ac).
- start: 1-based index of the first character in each match.
- end: 1-based index of the last character in each match.
See Also
ac_locate(), ac_detect_file(), ac_count_file(),
ac_extract_file().
Examples
ac <- ac_build(c("hello", "world"))
path <- tempfile()
writeLines("hello world", path)
ac_locate_file(ac, path)
Return patterns stored in an automaton
Description
ac_patterns() returns the patterns stored in an automaton.
Usage
ac_patterns(ac)
Arguments
ac |
An |
Value
A character vector of stored patterns.
See Also
Examples
ac <- ac_build(c("hello", "world"))
ac_patterns(ac)
Replace pattern matches in documents
Description
ac_replace() replaces all non-overlapping matches in each document with
the corresponding replacement string.
Usage
ac_replace(ac, doc, replace_with, na = c("keep", "empty", "error"))
Arguments
ac |
An |
doc |
A character vector of documents to search and replace. |
replace_with |
A character vector of replacements. If length 1, the
same replacement is used for every pattern. Otherwise, it MUST have the
same length as |
na |
How to handle |
Value
A character vector with the same length and names as doc.
See Also
ac_build(), ac_detect(), ac_count(), ac_extract(),
ac_locate().
Examples
ac <- ac_build(c("fox", "brown", "quick"))
ac_replace(
ac,
"The quick brown fox.",
c("sloth", "grey", "slow")
)
ac <- ac_build(c("append", "appendage", "app"), match_kind = "leftmost_first")
ac_replace(ac, "append the app to the appendage", c("x", "y", "z"))
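The pairing of patterns with replacements by position can be sketched in base R. The helper replace_sketch() below is hypothetical and uses a Perl-style regex alternation as a stand-in for a leftmost-first search. It does not call the package and only handles patterns without regex metacharacters.

```r
# Hypothetical base-R stand-in: replace each match with the replacement
# that shares its pattern's position in the pattern vector.
replace_sketch <- function(patterns, replace_with, doc) {
  # A single replacement is reused for every pattern.
  if (length(replace_with) == 1L) {
    replace_with <- rep(replace_with, length(patterns))
  }
  stopifnot(length(replace_with) == length(patterns))
  m <- gregexpr(paste(patterns, collapse = "|"), doc, perl = TRUE)
  hits <- regmatches(doc, m)[[1]]
  # Look up each matched text in the pattern vector to pick its replacement.
  regmatches(doc, m) <- list(replace_with[match(hits, patterns)])
  doc
}

out1 <- replace_sketch(
  c("fox", "brown", "quick"),
  c("sloth", "grey", "slow"),
  "The quick brown fox."
)
out2 <- replace_sketch(
  c("append", "appendage", "app"),
  c("x", "y", "z"),
  "append the app to the appendage"
)
print(out1)
print(out2)
```

In the second call the earlier pattern "append" wins over "appendage" at the same start position, so the trailing "age" is left in place.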
Replace pattern matches in files
Description
ac_replace_file() replaces all non-overlapping matches in input files and
writes the result to output files.
Usage
ac_replace_file(ac, path, replace_with, output = NULL, stream = FALSE)
Arguments
ac |
An |
path |
A vector of input file paths to search and replace. |
replace_with |
A character vector of replacements. If length 1, the
same replacement is used for every pattern. Otherwise, it MUST have the
same length as |
output |
A vector of output file paths. It must have the same
length as |
stream |
If |
Value
A character vector of output file paths with the same length as
path.
See Also
ac_replace(), ac_detect_file(), ac_count_file().
Examples
ac <- ac_build(c("fox", "brown", "quick"))
path <- tempfile(fileext = ".txt")
writeLines("The quick brown fox.", path)
ac_replace_file(path = path, ac = ac, replace_with = c("sloth", "grey", "slow"))