| Title: | Universal Messy Panel Data Cleaner |
| Version: | 0.1.2 |
| Description: | A robust toolkit designed to standardize and clean complex tabular data from commercial enterprise systems, healthcare records, logistics software, and HR databases. Features include intelligent regex parsing for domain-specific noise (currencies, percentages), gap-based block clustering, and automated messy table resolution. Methods draw on tidy data principles described in Wickham (2014) <doi:10.18637/jss.v059.i10> and the 'readxl' parsing infrastructure described in Wickham & Bryan (2023) https://readxl.tidyverse.org. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Imports: | dplyr, stringr, readxl |
| Suggests: | knitr, rmarkdown, testthat, writexl |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2026-05-06 03:35:31 UTC; TonyL |
| Author: | Tony Lu [aut, cre] |
| Maintainer: | Tony Lu <xulunt123@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-05-11 19:00:09 UTC |
Standardize and Clean Variable Names
Description
clean_variable_names() standardizes column names in a messy data frame. It converts all names
to snake_case, strips special characters (except _), translates Excel serial dates (e.g., 44197)
into ISO date strings (2021-01-01), and maps common financial/academic synonyms (e.g., gvkey,
permno, cusip) to standard names (id, ticker).
Usage
clean_variable_names(data)
Arguments
data |
A |
Value
A data.frame with the same data but standardized column names.
Examples
# Toy example: standardize column names in a data frame
df <- data.frame(
`Total Revenue ($)` = 1,
`PERMNO` = 3,
`My Custom Column!` = 4,
check.names = FALSE
)
clean_df <- clean_variable_names(df)
colnames(clean_df)
# Returns: c("revenue", "id", "my_custom_column")
# Excel serial dates are also handled
df2 <- data.frame(`44197` = 2, check.names = FALSE)
colnames(clean_variable_names(df2))
# Returns: "2021-01-01"
Smart Type Coercion & NA Recognition
Description
infer_data_types() scans character columns in a data frame, identifies common
financial placeholders for missing data (e.g., "-", "N/A", "n.m."), safely replaces
them with NA, and then coerces the column to numeric or Date if a high
percentage of the remaining values match those types.
Usage
infer_data_types(
data,
na_strings = c("-", "N/A", "n/a", "n.m.", "n.m", "NA", "null", "NULL", "."),
num_threshold = 0.95
)
Arguments
data |
A |
na_strings |
A character vector of strings to be interpreted as |
num_threshold |
Numeric between 0 and 1. The proportion of valid numbers required to convert a column to numeric. Default is |
Value
A data.frame with inferred data types.
Examples
# Clean financial placeholders and coerce to numeric
df <- data.frame(val = c("1.5", "-", "2.0", "N/A"), stringsAsFactors = FALSE)
df_clean <- infer_data_types(df)
df_clean$val # numeric: c(1.5, NA, 2.0, NA)
is.numeric(df_clean$val) # TRUE
Normalize Numeric Columns Based on Header Unit Declarations
Description
normalize_units() scans the column names of a data frame for financial/scientific
unit declarations (e.g., "Revenue (in millions)", "Assets ($k)", "Employees ('000)").
It automatically multiplies the numeric values in the corresponding columns by the
detected multiplier (1,000, 1,000,000, etc.) and optionally strips the unit
declaration from the column name.
Usage
normalize_units(data, strip_units = TRUE)
Arguments
data |
A |
strip_units |
Logical. If |
Value
A data.frame with the normalized data and updated column names.
Examples
# Scale columns declared in millions and thousands
df <- data.frame(
`Revenue ($M)` = c(1.5, 2.0),
`Cost (in thousands)` = c(500, 600),
check.names = FALSE
)
result <- normalize_units(df)
result$Revenue # c(1500000, 2000000)
result$Cost # c(500000, 600000)
Robust Parsing and Extraction of Messy Excel Panel Data
Description
read_messy_panel() is an industrial-grade parser designed to extract clean, standardized data frames
from heavily malformed, human-readable Excel reports (e.g., financial statements, ERP exports).
It automatically bypasses decoy rows, stitches N-dimensional hierarchical headers, extracts structural
indentation hierarchies (parent-child relationships), amputates embedded subtotals, and standardizes
financial/scientific numbers.
Usage
read_messy_panel(
file_path,
sheet = NULL,
na_strings = c("", "NA", "#N/A", "NULL", "S", "D", "ND", "N/A", "*", "**", "***", ".",
"x", "c", "s", "z", "#VALUE!", "#REF!", "#DIV/0!", "#NUM!", "#NAME?", "none", "NR",
"--", "---", "n.a.", "N.A.", "n/a", "Not Applicable"),
clean_vars = TRUE,
auto_pivot = FALSE,
return_audit = FALSE
)
Arguments
file_path |
Character string. Path to the Excel file. |
sheet |
Optional sheet name or index. If |
na_strings |
Character vector. Strings to interpret as missing values. Supports complex missing-value lexicons. |
clean_vars |
Logical. If |
auto_pivot |
Logical. If |
return_audit |
Logical. If |
Value
If return_audit = FALSE, a cleaned and standardized data.frame.
If return_audit = TRUE, a named list containing:
data |
The cleaned |
audit |
A |
Examples
# Toy example: create a small in-memory Excel file and parse it
tmp <- tempfile(fileext = ".xlsx")
df_raw <- data.frame(
Category = c("Revenue", "Cost", "Total"),
`2022` = c("1.2M", "800k", "2.0M"),
`2023` = c("1.5M", "900k", "2.4M"),
check.names = FALSE
)
writexl::write_xlsx(df_raw, tmp)
result <- read_messy_panel(tmp, auto_pivot = TRUE)
head(result)
unlink(tmp)