Title: Robust Probabilistic Matching for German Company Names
Version: 0.1.2
Description: A pipeline for matching messy company name strings against a clean dictionary (e.g., 'Orbis'). Implements a cascading strategy: Exact -> Fuzzy ('zoomerjoin') -> 'FTS5' ('SQLite') -> Rarity Weighted. References: Beniamino Green (2025) https://beniamino.org/zoomerjoin/; https://www.sqlite.org/fts5.html.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.2
Imports: data.table, stringi, stringdist, zoomerjoin, DBI, RSQLite, cli, progressr, httr, jsonlite, glue, purrr, readr, dplyr
Suggests: testthat
NeedsCompilation: no
Packaged: 2026-02-08 23:26:30 UTC; giulianetinginfrati
Author: Giulian Etingin-Frati [aut, cre]
Maintainer: Giulian Etingin-Frati <etingin-frati@kof.ethz.ch>
Repository: CRAN
Date/Publication: 2026-02-11 19:50:07 UTC

Internal Azure Chat Completion Wrapper (Custom Endpoint)

Description

Sends a system message and a user message to a custom Azure-style endpoint (e.g., /openai/v1/responses) and returns the raw response.

Usage

azure_chat_request(
  system_msg,
  user_msg,
  endpoint,
  api_key,
  deployment,
  api_version = "2024-04-14"
)

Arguments

system_msg

String. The instructions for the LLM.

user_msg

String. The specific case to evaluate.

endpoint

String. Base URL of the endpoint.

api_key

String. API Key.

deployment

String. Model/Deployment name.

api_version

String. API version (unused in this custom path but kept for compatibility).

Value

A character string (the JSON response) or NULL on failure.
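
Because the return value is either a JSON string or NULL, callers need to check for NULL before parsing. The following is a minimal sketch, assuming the 'jsonlite' package is available. The helper name parse_llm_response and the JSON fields "decision" and "reason" are hypothetical; the actual response structure depends on the endpoint.

```r
library(jsonlite)

# Hypothetical helper (not part of the package): parse the string returned by
# a request wrapper, passing NULL through and treating malformed JSON as a failure.
parse_llm_response <- function(raw) {
  if (is.null(raw)) {
    return(NULL)
  }
  tryCatch(
    jsonlite::parse_json(raw),
    error = function(e) NULL
  )
}

# Stand-in for a response string; the field names are assumptions.
fake_response <- '{"decision": "match", "reason": "same legal entity"}'

parsed <- parse_llm_response(fake_response)
parsed$decision

# A failed request (NULL) stays NULL instead of raising an error.
parse_llm_response(NULL)
```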


Match Company Names against a Dictionary

Description

Runs a cascading matching pipeline: Exact -> Fuzzy ('zoomerjoin') -> FTS5 -> Rarity Weighted. Queries matched in an earlier step are excluded from later steps.

Usage

match_companies(
  queries,
  dictionary,
  query_col = "company_name",
  dict_col = "company_name",
  unique_id_col = "query_id",
  dict_id_col = "orbis_id",
  threshold_jw = 0.8,
  threshold_zoomer = 0.4,
  threshold_rarity = 1,
  n_cores = 1
)

Arguments

queries

Data frame. Must contain columns specified in query_col and unique_id_col.

dictionary

Data frame. Must contain columns specified in dict_col and dict_id_col.

query_col

String. Column name for company names in queries.

dict_col

String. Column name for company names in dictionary.

unique_id_col

String. ID column in queries.

dict_id_col

String. ID column in dictionary.

threshold_jw

Numeric (0-1). Minimum Jaro-Winkler similarity. Default 0.8.

threshold_zoomer

Numeric (0-1). Jaccard threshold for blocking. Default 0.4.

threshold_rarity

Numeric. Minimum score for rarity matching. Default 1.0.

n_cores

Integer. Number of cores (reserved for future parallel implementation).

Value

A data.table containing query_id, dict_id, and match_type.

Examples

# Create sample query data
queries <- data.frame(
  query_id = 1:3,
  company_name = c("BMW", "Siemens AG", "Deutsche Bank")
)

# Create sample dictionary
dictionary <- data.frame(
  orbis_id = c("D001", "D002", "D003"),
  company_name = c("BMW AG", "Siemens Aktiengesellschaft", "Commerzbank AG")
)

# Match companies
results <- match_companies(
  queries = queries,
  dictionary = dictionary,
  query_col = "company_name",
  dict_col = "company_name",
  unique_id_col = "query_id",
  dict_id_col = "orbis_id"
)

print(results)
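
# Illustration of the cascade idea

The cascading behaviour (an earlier step removes its matched queries from later steps) can be illustrated with a two-step sketch in base R and 'stringdist'. This is not the package's implementation: it skips the 'zoomerjoin' blocking, FTS5 and rarity steps, assumes the names are already normalized, and uses made-up data.

```r
library(stringdist)

queries <- data.frame(
  query_id = 1:3,
  company_name = c("bmw", "siemens", "deutsche bank"),
  stringsAsFactors = FALSE
)
dictionary <- data.frame(
  orbis_id = c("D001", "D002", "D003"),
  company_name = c("bmw", "siemenss", "commerzbank"),
  stringsAsFactors = FALSE
)

# Step 1: exact matches on the name column.
idx <- match(queries$company_name, dictionary$company_name)
hit <- !is.na(idx)
exact <- data.frame(
  query_id = queries$query_id[hit],
  dict_id = dictionary$orbis_id[idx[hit]],
  match_type = rep("Exact", sum(hit)),
  stringsAsFactors = FALSE
)

# Step 2: only queries left unmatched by step 1 enter the fuzzy step.
remaining <- queries[!hit, ]
sim <- 1 - stringdist::stringdistmatrix(
  remaining$company_name, dictionary$company_name,
  method = "jw", p = 0.1
)
best <- apply(sim, 1, which.max)                    # best candidate per query
best_sim <- sim[cbind(seq_len(nrow(sim)), best)]    # its similarity
keep <- best_sim >= 0.8                             # same role as threshold_jw
fuzzy <- data.frame(
  query_id = remaining$query_id[keep],
  dict_id = dictionary$orbis_id[best[keep]],
  match_type = rep("Fuzzy", sum(keep)),
  stringsAsFactors = FALSE
)

results <- rbind(exact, fuzzy)
print(results)
```

In this toy data, "deutsche bank" finds no candidate above the threshold and therefore does not appear in the result.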

Normalize Company Names

Description

Standardizes company names by lowercasing, removing legal suffixes, transliterating characters to ASCII, and removing noise words.

Usage

normalize_company_name(x)

Arguments

x

A character vector of company names.

Value

A character vector of normalized names.

Examples

# Normalize a single company name
normalize_company_name("BMW AG")
normalize_company_name("Siemens GmbH & Co. KG")

# Normalize multiple names
companies <- c("Deutsche Bank AG", "VW Group", "BASF SE")
normalize_company_name(companies)
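
# Illustration of the normalization steps

The steps listed in the description can be sketched as follows. This is a simplified stand-in, not the function's actual code: the name sketch_normalize, the suffix list, and the noise-word list are assumptions chosen for illustration, and the transliteration uses the "Latin-ASCII" transform from 'stringi'.

```r
library(stringi)

# Hypothetical re-implementation of the described steps (illustration only).
sketch_normalize <- function(x) {
  # Transliterate to ASCII (e.g., u-umlaut becomes "u", sharp s becomes "ss").
  x <- stringi::stri_trans_general(x, "Latin-ASCII")
  x <- tolower(x)
  # Replace punctuation such as "&" and "." with spaces.
  x <- gsub("[[:punct:]]+", " ", x)
  # Drop a few German legal-form tokens (assumed list).
  x <- gsub("\\b(ag|se|gmbh|mbh|kg|kgaa|co|ohg|ug)\\b", " ", x, perl = TRUE)
  # Drop generic noise words (assumed list).
  x <- gsub("\\b(group|holding)\\b", " ", x, perl = TRUE)
  # Collapse repeated whitespace and trim the ends.
  trimws(gsub("\\s+", " ", x))
}

sketch_normalize(c("BMW AG", "M\u00fcller GmbH & Co. KG", "VW Group"))
```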

Validate Matches using LLM (Azure OpenAI)

Description

Sends doubtful matches (those labeled neither "Perfect" nor "Unmatched") to an LLM for verification. Interrupted runs can be resumed from the chunk files saved in output_dir.

Usage

validate_matches_llm(
  data,
  query_name_col,
  dict_name_col,
  output_dir = tempdir(),
  filename_stem = "match_validation",
  batch_size = 20,
  api_key = Sys.getenv("AZURE_API_KEY"),
  endpoint = Sys.getenv("AZURE_ENDPOINT"),
  deployment = Sys.getenv("AZURE_DEPLOYMENT")
)

Arguments

data

Data frame. Must contain the columns specified by query_name_col and dict_name_col.

query_name_col

String. Column containing the user's query name (Employer).

dict_name_col

String. Column containing the dictionary match name (Registry).

output_dir

String. Directory to save temporary chunks and final results. Defaults to tempdir().

filename_stem

String. Base name for output files.

batch_size

Integer. Number of rows to process before saving a chunk.

api_key

String. Azure API Key. Defaults to Sys.getenv("AZURE_API_KEY").

endpoint

String. Azure Endpoint. Defaults to Sys.getenv("AZURE_ENDPOINT").

deployment

String. Deployment name. Defaults to Sys.getenv("AZURE_DEPLOYMENT").

Value

A data frame with added LLM_decision and LLM_reason columns.

Examples

## Not run: 
# Sample matched data
matched_data <- data.frame(
  employer_name = c("BMW", "Siemens"),
  registry_name = c("BMW AG", "SAP SE"),
  dict_id = c("D001", "D002"),
  match_type = c("Fuzzy", "Fuzzy")
)

# Validate using LLM (requires Azure credentials)
validated <- validate_matches_llm(
  data = matched_data,
  query_name_col = "employer_name",
  dict_name_col = "registry_name"
)

print(validated)

## End(Not run)
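
# Illustration of resumable chunked processing

The resume-from-chunks behaviour can be sketched in base R. This is a sketch under stated assumptions, not the package's code: the function process_in_chunks, the chunk file naming, and the stand-in fake_validate (used here in place of the LLM call) are all hypothetical.

```r
# Process a data frame in batches, saving each finished batch to disk so that
# a later run can skip batches that already have a chunk file.
process_in_chunks <- function(data, output_dir, filename_stem, batch_size, process_fn) {
  n <- nrow(data)
  if (n == 0) {
    return(data)
  }
  starts <- seq(1, n, by = batch_size)
  chunks <- vector("list", length(starts))
  for (i in seq_along(starts)) {
    chunk_file <- file.path(
      output_dir, sprintf("%s_chunk_%04d.rds", filename_stem, i)
    )
    if (file.exists(chunk_file)) {
      # Resume: reuse the saved result instead of reprocessing the batch.
      chunks[[i]] <- readRDS(chunk_file)
      next
    }
    rows <- starts[i]:min(starts[i] + batch_size - 1, n)
    chunks[[i]] <- process_fn(data[rows, , drop = FALSE])
    saveRDS(chunks[[i]], chunk_file)
  }
  do.call(rbind, chunks)
}

# Stand-in for the LLM call: adds the two output columns with dummy values.
fake_validate <- function(chunk) {
  same_start <- substr(chunk$registry_name, 1, 3) == substr(chunk$employer_name, 1, 3)
  chunk$LLM_decision <- ifelse(same_start, "match", "no match")
  chunk$LLM_reason <- "dummy rule: first three characters compared"
  chunk
}

out_dir <- tempfile("chunk_demo")
dir.create(out_dir)

demo <- data.frame(
  employer_name = c("BMW", "Siemens", "BASF"),
  registry_name = c("BMW AG", "SAP SE", "BASF SE"),
  stringsAsFactors = FALSE
)

first_run <- process_in_chunks(demo, out_dir, "demo", 2, fake_validate)
# A second call finds both chunk files and only reads them back.
second_run <- process_in_chunks(demo, out_dir, "demo", 2, fake_validate)
print(first_run)
```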