textpress

textpress is an R toolkit for building text corpora and searching them – no custom object classes, just plain data frames from start to finish. It covers the full arc from URL to retrieved passage through a consistent four-step API: Fetch, Read, Process, Search. Traditional tools (KWIC, BM25, dictionary matching) sit alongside modern ones (semantic search, LLM-ready chunking), all composing cleanly with |>.


Installation

From CRAN:

install.packages("textpress")

Development version:

remotes::install_github("jaytimm/textpress")

The textpress API

Conventions: a corpus is a data frame with a text column plus one or more identifier columns, which are named through the by argument (default doc_id). All outputs are plain data frames or data.tables, so every step is pipe-friendly.
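As a rough illustration of that convention, the sketch below builds a toy corpus by hand in base R. The document texts and the 30-character threshold are made up for the example; only the doc_id and text column names come from the conventions above.

```r
# A minimal corpus in the shape described above: one row per document,
# an identifier column (doc_id) and a text column. Toy data only.
corpus <- data.frame(
  doc_id = c("1", "2", "3"),
  text = c(
    "The committee met on Tuesday.",
    "Rates were left unchanged.",
    "Markets rallied after the announcement."
  ),
  stringsAsFactors = FALSE
)

# Because the corpus is a plain data frame, base R verbs and the native
# pipe apply directly. Here: keep documents longer than 30 characters.
long_docs <- corpus |> subset(nchar(text) > 30)
```

No special constructor is involved, so a corpus can come from any source that yields these two columns.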

1. Fetch (fetch_*)

Find URLs and metadata – not full text. Pass results to read_urls() to get content.

2. Read (read_*)

Scrape and parse URLs into a structured corpus.

3. Process (nlp_*)

Prepare text for search or indexing.

4. Search (search_*)

Four retrieval modes over the same corpus. Data-first, pipe-friendly.

| Function | Query type | Use case |
| --- | --- | --- |
| search_regex(corpus, query) | Regex pattern | Specific strings, KWIC with inline highlighting. |
| search_dict(corpus, terms) | Term vector | Exact phrases and MWEs; built-in dict_generations, dict_political. |
| search_index(index, query) | Keywords | BM25 ranked retrieval over a token index. |
| search_vector(embeddings, query) | Numeric vector | Semantic nearest-neighbor search; use util_fetch_embeddings() to embed. |
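To make "semantic nearest-neighbor search" concrete, here is a minimal base R sketch of ranking chunks by cosine similarity to a query vector. It is not the implementation behind search_vector(): the function cosine_search(), the toy embedding matrix, and the output column names are all hypothetical, and cosine similarity is an assumed choice of metric.

```r
# Toy embedding matrix: one row per chunk, row names carry the chunk ids.
# Real embeddings would have hundreds of dimensions.
embeddings <- rbind(
  c1 = c(1, 0, 0),
  c2 = c(0.8, 0.6, 0),
  c3 = c(0, 0, 1)
)

# Hypothetical helper: rank rows of `embeddings` by cosine similarity
# to a numeric `query` vector and return the top `n` as a data frame.
cosine_search <- function(embeddings, query, n = 2) {
  sims <- as.vector(embeddings %*% query) /
    (sqrt(rowSums(embeddings^2)) * sqrt(sum(query^2)))
  ord <- order(sims, decreasing = TRUE)[seq_len(min(n, length(sims)))]
  data.frame(
    chunk_id = rownames(embeddings)[ord],
    score = sims[ord],
    rank = seq_along(ord),
    stringsAsFactors = FALSE
  )
}

hits <- cosine_search(embeddings, query = c(1, 0, 0))
```

The result is again a plain data frame with one row per hit, which is the shape the other search modes are described as returning.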

RAG & LLM pipelines

textpress is designed to compose cleanly into retrieval-augmented generation pipelines.

Hybrid retrieval – run search_index() and search_vector() over the same chunks, then merge with reciprocal rank fusion (RRF). Chunks that rank well under both term frequency and meaning rise to the top.
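The fusion step can be sketched in a few lines of base R. The helper rrf_merge() and the chunk_id and rank column names are assumptions for illustration, not part of the textpress API, and k = 60 is simply the constant commonly used for RRF. Each input stands in for the ranked output of one retriever.

```r
# Ranked results from two retrievers over the same chunks (toy data).
bm25_hits <- data.frame(
  chunk_id = c("c1", "c2", "c3"), rank = 1:3, stringsAsFactors = FALSE
)
vector_hits <- data.frame(
  chunk_id = c("c2", "c4", "c1"), rank = 1:3, stringsAsFactors = FALSE
)

# Hypothetical helper implementing reciprocal rank fusion:
# every list contributes 1 / (k + rank) to a chunk's fused score.
rrf_merge <- function(..., k = 60) {
  hits <- do.call(rbind, list(...))
  hits$rrf <- 1 / (k + hits$rank)
  fused <- aggregate(rrf ~ chunk_id, data = hits, FUN = sum)
  fused <- fused[order(-fused$rrf), ]
  rownames(fused) <- NULL
  fused
}

fused <- rrf_merge(bm25_hits, vector_hits)
```

In this toy data c2 and c1 appear in both lists, so they outrank c4 and c3, which each appear in only one. RRF uses ranks rather than raw scores, so BM25 scores and similarity scores never need to be put on a common scale.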

Context assembly – nlp_roll_chunks() with context_size > 0 gives each chunk a focal sentence plus surrounding context, so retrieved passages are self-contained when passed to an LLM.
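The idea of a focal sentence with surrounding context can be illustrated without the package. The sketch below is not how nlp_roll_chunks() is implemented: roll_with_context(), its column names, and the single-document assumption are all hypothetical. Only the meaning of context_size (the number of neighboring sentences on each side) follows the description above.

```r
# Toy sentence-level table for one document.
sentences <- data.frame(
  doc_id = "1",
  sentence_id = 1:4,
  text = c("A.", "B.", "C.", "D."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each focal sentence, paste it together with up
# to `context_size` neighbors on each side. Assumes a single document.
roll_with_context <- function(sentences, context_size = 1) {
  n <- nrow(sentences)
  chunk <- vapply(seq_len(n), function(i) {
    window <- max(1, i - context_size):min(n, i + context_size)
    paste(sentences$text[window], collapse = " ")
  }, character(1))
  data.frame(
    doc_id = sentences$doc_id,
    focal_id = sentences$sentence_id,
    focal = sentences$text,
    chunk = chunk,
    stringsAsFactors = FALSE
  )
}

chunks <- roll_with_context(sentences, context_size = 1)
```

Keeping the focal sentence in its own column lets retrieval score the focal text while the wider chunk is what gets handed to the model.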

Agent tool-calling – the consistent API and plain data-frame outputs map naturally to tool use:

| Agent task | Function |
| --- | --- |
| “Find recent articles on X” | fetch_urls() |
| “Scrape these pages” | read_urls() |
| “Find all mentions of these entities” | search_dict() |
| “Follow citations from this Wikipedia article” | fetch_wiki_refs() |

Vignettes


License

MIT © Jason Timm

Citation

citation("textpress")

Issues

Report bugs or request features at https://github.com/jaytimm/textpress/issues