| Title: | Summarise and Explore Continuous, Categorical and Date Variables |
| Version: | 0.2.0 |
| Description: | Explore continuous, date and categorical variables with summary statistics, visualisations, and frequency tables. Brings the ease and simplicity of the sum and tab commands from 'Stata' to 'R', including support for two-way cross-tabulations, hypothesis tests, duplicate and missing data exploration, and automated HTML or PDF exploratory reports. |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Imports: | cli, dplyr, ggplot2, kableExtra, knitr, magrittr, patchwork, purrr, rlang, rmarkdown, scales, stats, tibble, tidyr, utils |
| Suggests: | testthat (≥ 3.0.0), tinytex |
| Config/testthat/edition: | 3 |
| URL: | https://github.com/alstockdale/sumvar, https://alstockdale.github.io/sumvar/ |
| BugReports: | https://github.com/alstockdale/sumvar/issues |
| License: | MIT + file LICENSE |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2026-03-18 20:35:59 UTC; al_st |
| Author: | Alexander Stockdale [aut, cre, cph] |
| Maintainer: | Alexander Stockdale <a.stockdale@liverpool.ac.uk> |
| Repository: | CRAN |
| Date/Publication: | 2026-03-19 07:10:02 UTC |
sumvar: Summarise Continuous and Categorical Variables in R
Description
The sumvar package explores continuous and categorical variables. sumvar brings the ease and simplicity of the "sum" and "tab" functions from Stata to R.
To explore a continuous variable, use dist_sum(). You can stratify by a grouping variable: df %>% dist_sum(var, group)
To explore dates, use dist_date(); usage is the same as dist_sum().
To summarise a single categorical variable, use tab1(), e.g. df %>% tab1(var). For a two-way table, use tab(), e.g. df %>% tab(var1, var2). Both include options for frequentist hypothesis tests.
Explore duplicates and missing values with dup().
All functions are tidyverse/dplyr-friendly and accept the %>% pipe, outputting results as a tibble. You can save outputs for further manipulation, e.g. summary <- df %>% dist_sum(var).
Author(s)
Maintainer: Alexander Stockdale a.stockdale@liverpool.ac.uk [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/alstockdale/sumvar/issues
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling rhs(lhs).
Summarize and visualize a date variable
Description
Summarises the minimum, maximum, median, and interquartile range of a date variable, optionally stratified by a grouping variable. Produces a histogram and (optionally) a density plot.
Usage
dist_date(data, var, by = NULL)
Arguments
data |
A data frame or tibble. |
var |
The date variable to summarise. |
by |
Optional grouping variable. |
Value
A tibble with summary statistics for the date variable.
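As a rough illustration of the statistics described above, the following base-R sketch computes the minimum, maximum, median and quartiles of a date vector. It is not the sumvar implementation: the helper name summarise_dates, its column names, and the choice of R's default quantile method are assumptions made for this example.

```r
# Hypothetical helper, not part of sumvar: summarise a Date vector with base R.
summarise_dates <- function(x) {
  stopifnot(inherits(x, "Date"))
  days <- as.numeric(x[!is.na(x)])  # days since 1970-01-01
  # Quartiles on the numeric scale (default quantile type 7, an assumption).
  q <- stats::quantile(days, probs = c(0.25, 0.5, 0.75), names = FALSE)
  to_date <- function(d) as.Date(round(d), origin = "1970-01-01")
  data.frame(
    min    = to_date(min(days)),
    p25    = to_date(q[1]),
    median = to_date(q[2]),
    p75    = to_date(q[3]),
    max    = to_date(max(days))
  )
}

dates <- as.Date("2020-01-01") + c(0, 10, 20, 30, 40)
summarise_dates(dates)
```

Working on the numeric day count and converting back avoids relying on how quantile() handles Date objects in a given R version.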
See Also
dist_sum for continuous variables.
Examples
# Example ungrouped
df <- tibble::tibble(
  dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE)
)
dist_date(df, dt)
# Example grouped
df2 <- tibble::tibble(
  dt = as.Date("2020-01-01") + sample(0:1000, 100, TRUE),
  grp = sample(1:2, 100, TRUE)
)
dist_date(df2, dt, grp)
Explore a continuous variable
Description
Summarises a continuous variable, returning a tibble of descriptive statistics and a plot. When a grouping variable is supplied, results are stratified by group.
For two groups, a t-test and Wilcoxon rank-sum test are reported. For three or more groups, a one-way ANOVA and Kruskal-Wallis test are reported.
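The choice of tests by number of groups can be sketched with the stats package alone. This is an illustration under assumptions, not the package's code: compare_groups is a hypothetical helper, and the t-test variant used here is R's default (Welch), which may differ from what dist_sum() uses internally.

```r
# Hypothetical helper, not part of sumvar: pick tests by the number of groups.
compare_groups <- function(x, g) {
  keep <- !is.na(x) & !is.na(g)
  x <- x[keep]
  g <- factor(g[keep])
  k <- nlevels(g)
  if (k == 2) {
    # Two groups: t-test (Welch by default in R) and Wilcoxon rank-sum test.
    c(p_ttest  = unname(stats::t.test(x ~ g)$p.value),
      p_wilcox = unname(stats::wilcox.test(x ~ g)$p.value))
  } else if (k >= 3) {
    # Three or more groups: one-way ANOVA and Kruskal-Wallis test.
    fit <- stats::aov(x ~ g)
    c(p_anova   = unname(summary(fit)[[1]][["Pr(>F)"]][1]),
      p_kruskal = unname(stats::kruskal.test(x ~ g)$p.value))
  } else {
    numeric(0)  # a single group: nothing to compare
  }
}

set.seed(1)
age <- c(rnorm(20, mean = 30, sd = 5), rnorm(20, mean = 40, sd = 5))
compare_groups(age, rep(c("female", "male"), each = 20))
compare_groups(age, rep(c("a", "b", "c", "d"), times = 10))
```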
Usage
dist_sum(data, var, by = NULL)
Arguments
data |
The data frame or tibble |
var |
The continuous variable to summarise |
by |
Optional grouping variable |
Value
A tibble with one row per group (or one row when ungrouped) containing the following columns:
n: Number of non-missing observations.
n_miss: Number of missing (NA) values.
median: Median value.
p25, p75: 25th and 75th percentiles (interquartile range boundaries).
mean: Arithmetic mean.
sd: Standard deviation.
ci_lower, ci_upper: Lower and upper bounds of the 95% confidence interval for the mean. Uses the t-distribution when n < 30, and the Z-distribution when n >= 30.
min, max: Minimum and maximum observed values.
n_outliers: Count of values more than 1.5 x IQR below Q1 or above Q3 (Tukey fence method).
shapiro_p: P-value from the Shapiro-Wilk test of normality. Returns NA when n < 3 or n > 5000 (outside the valid range of the test).
normal: Logical. TRUE if shapiro_p > 0.05, indicating no significant departure from normality at the 5% level.
p_ttest: Shown when two groups are compared. P-value from an independent samples t-test, testing whether the means of the two groups differ. Assumes approximately normal distributions or large samples. All p-values are reported on the first row only; remaining rows contain NA.
p_wilcox: Shown when two groups are compared. P-value from the Wilcoxon rank-sum test (Mann-Whitney U test), a non-parametric alternative to the t-test. Preferred over p_ttest when data are skewed, ordinal, or contain outliers, as it compares ranks rather than means and makes no distributional assumptions.
p_anova: Shown when three or more groups are compared. P-value from a one-way analysis of variance (ANOVA) F-test, testing whether at least one group mean differs from the others. Assumes approximately normal distributions and equal variances across groups.
p_kruskal: Shown when three or more groups are compared. P-value from the Kruskal-Wallis test, a non-parametric alternative to one-way ANOVA. Preferred over p_anova when data are skewed or the normality assumption is not met, as it compares rank distributions and makes no distributional assumptions.
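To make the confidence interval, outlier and normality rules above concrete, here is a base-R sketch for a single ungrouped vector. It follows the rules as documented but is not the package's code: describe_numeric is a hypothetical helper, and the quantile method (R's default type 7) is an assumption.

```r
# Hypothetical helper, not part of sumvar: descriptive statistics for one vector.
describe_numeric <- function(x, level = 0.95) {
  n_miss <- sum(is.na(x))
  x <- x[!is.na(x)]
  n <- length(x)
  q <- stats::quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  # Critical value: t-distribution when n < 30, standard normal when n >= 30.
  alpha <- 1 - level
  crit <- if (n < 30) stats::qt(1 - alpha / 2, df = n - 1) else stats::qnorm(1 - alpha / 2)
  se <- stats::sd(x) / sqrt(n)
  # Shapiro-Wilk is only valid for 3 <= n <= 5000; otherwise report NA.
  shapiro_p <- if (n >= 3 && n <= 5000) stats::shapiro.test(x)$p.value else NA_real_
  data.frame(
    n = n, n_miss = n_miss,
    median = stats::median(x), p25 = q[1], p75 = q[2],
    mean = mean(x), sd = stats::sd(x),
    ci_lower = mean(x) - crit * se,
    ci_upper = mean(x) + crit * se,
    min = min(x), max = max(x),
    # Tukey fences: more than 1.5 x IQR below Q1 or above Q3.
    n_outliers = sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr),
    shapiro_p = shapiro_p,
    normal = shapiro_p > 0.05
  )
}

describe_numeric(c(1:9, 100, NA))
```

In this example the value 100 lies above the upper fence (7.75 + 1.5 x 4.5 = 14.5), so it is counted as the only outlier.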
Examples
# Four groups: one-way ANOVA and Kruskal-Wallis p-values are reported.
example_data <- dplyr::tibble(
  id = 1:100,
  age = rnorm(100, mean = 30, sd = 10),
  group = sample(c("a", "b", "c", "d"), size = 100, replace = TRUE)
)
dist_sum(example_data, age, group)
# Two groups: t-test and Wilcoxon rank-sum p-values are reported.
example_data <- dplyr::tibble(
  id = 1:100,
  age = rnorm(100, mean = 30, sd = 10),
  sex = sample(c("male", "female"), size = 100, replace = TRUE)
)
dist_sum(example_data, age, sex)
summary <- dist_sum(example_data, age, sex) # Save summary statistics as a tibble.
Explore duplicate and missing data
Description
Counts the duplicate values found within a variable, along with its missing values. The function accepts input from a dplyr pipe (%>%) and outputs the results as a tibble,
e.g. example_data %>% dup(variable)
Usage
dup(data, var = NULL)
Arguments
data |
The data frame or tibble |
var |
The variable to assess |
Value
A tibble with the number and percentage of duplicate values found, and the number of missing values (NA), together with percentages.
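The counts described above can be approximated in base R as follows. This is a sketch under assumptions, not the behaviour of dup() itself: count_dup_missing is a hypothetical helper, duplicates are taken to mean the second and later occurrences of a non-missing value, and both percentages use the total number of rows as the denominator.

```r
# Hypothetical helper, not part of sumvar: count duplicate and missing values.
count_dup_missing <- function(x) {
  n <- length(x)
  n_missing <- sum(is.na(x))
  # duplicated() flags the second and later occurrences of each value.
  # NAs are dropped first so they are counted only as missing (an assumption).
  n_dup <- sum(duplicated(x[!is.na(x)]))
  data.frame(
    n = n,
    n_dup = n_dup,
    pct_dup = round(100 * n_dup / n, 1),
    n_missing = n_missing,
    pct_missing = round(100 * n_missing / n, 1)
  )
}

count_dup_missing(c(1, 1, 2, 3, 3, 3, NA, NA))
```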
Examples
example_data <- dplyr::tibble(
  id = 1:200,
  age = round(rnorm(200, mean = 30, sd = 50), digits = 0)
)
example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
dup(example_data, age)
# You can also pass a whole data frame to dup() and it will explore all variables.
example_data <- dplyr::tibble(
  age = round(rnorm(200, mean = 30, sd = 50), digits = 0),
  sex = sample(c("Male", "Female"), 200, TRUE),
  favourite_colour = sample(c("Red", "Blue", "Purple"), 200, TRUE)
)
example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
example_data$sex[sample(1:200, size = 32)] <- NA # Replace 32 values with missing.
dup(example_data)
Explore all variables and generate an HTML or PDF summary report
Description
Analyses a data frame or tibble, summarising all continuous, date and categorical variables, missing data and duplicate values, and produces an HTML or PDF report.
Usage
explorer(
  data,
  output_file = NULL,
  format = c("html", "pdf"),
  progress = TRUE,
  id_var = NULL
)
Arguments
data |
A data frame or tibble to explore. |
output_file |
The name of the output file. Default uses |
format |
Output format, either "html" or "pdf". |
progress |
If |
id_var |
Character vector of column names to treat as IDs (not summarised). |
Value
Outputs an HTML or PDF summary report. PDF output typically takes longer to generate.
For PDF output, a LaTeX distribution must be installed. TinyTeX is recommended. To install, run:
install.packages("tinytex")
tinytex::install_tinytex()
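Before requesting PDF output, it can help to check whether a LaTeX engine is reachable. The sketch below uses only base R and falls back to HTML when none is found. The helper latex_available and the list of engine names are assumptions for this example, and a LaTeX installation that is not on the PATH would not be detected this way.

```r
# Hypothetical helper, not part of sumvar: is any LaTeX engine on the PATH?
latex_available <- function(engines = c("pdflatex", "xelatex", "lualatex")) {
  # Sys.which() returns "" for executables it cannot find.
  any(nzchar(Sys.which(engines)))
}

# Choose a report format that should work on this machine.
report_format <- if (latex_available()) "pdf" else "html"
report_format

# Example use (my_data is a placeholder for your own data frame):
# explorer(my_data, format = report_format)
```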
Examples
## Not run:
# Build example data from mtcars with some factors, a date column and an ID column:
cars_example <- mtcars %>%
  dplyr::mutate(
    dplyr::across(c(vs, am, gear, carb, cyl), as.factor),
    date_var = as.Date("2025-06-01") +
      sample(-300:300, nrow(mtcars), replace = TRUE),
    id = dplyr::row_number()
  )
# To run explorer:
explorer(cars_example) # with progress bar
explorer(cars_example, progress = FALSE) # omit progress bar
explorer(cars_example, format = "pdf") # PDF output
explorer(cars_example, format = "pdf", id_var = "id") # Identify ID variable
## End(Not run)
Create a cross-tabulation of two categorical variables
Description
Creates a cross-tabulation of two categorical variables with row or column
percentages, row and column totals, and optional hypothesis tests. Prints a
formatted table to the console (similar to Stata's tab command) with
the column variable name displayed as a spanning header above its levels.
Usage
tab(
  data,
  variable1,
  variable2,
  show = c("row", "col", "n"),
  test = c("both", "chi", "exact", "none"),
  totals = TRUE,
  dp = 1
)
Arguments
data |
The data frame or tibble. |
variable1 |
The row variable (first categorical variable). |
variable2 |
The column variable (second categorical variable). |
show |
What to display in each cell: |
test |
Hypothesis test(s) to report: |
totals |
If |
dp |
Number of decimal places for percentages. Default is 1. |
Value
A wide-format tibble (invisibly) with:
- First column: levels of variable1 (plus "Total" if totals = TRUE).
- For each level of variable2: {level}_n (integer count) and, when show != "n", {level}_pct (numeric percentage).
- total_n: row totals (when totals = TRUE).
- p_chi, p_fisher: p-values on the first row, NA elsewhere (when test = "both"; individual tests add test, statistic, p_value instead).
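For comparison, the same ingredients (counts, row percentages, totals and the two p-values) can be produced with base R. This sketch only illustrates the underlying calculations and does not reproduce tab()'s output layout; the variable names are made up for the example.

```r
set.seed(42)
row_var <- sample(c("a", "b", "c"), 100, replace = TRUE)
col_var <- sample(c("male", "female"), 100, replace = TRUE)

# Cell counts for the two-way table.
counts <- table(row_var, col_var)

# Row percentages to 1 decimal place (margin = 2 would give column percentages).
row_pct <- round(100 * prop.table(counts, margin = 1), 1)

# Row and column totals, added as a "Sum" row and column.
with_totals <- addmargins(counts)

# Chi-squared and Fisher's exact tests on the table of counts.
p_chi <- chisq.test(counts)$p.value
p_fisher <- fisher.test(counts)$p.value

with_totals
row_pct
c(p_chi = p_chi, p_fisher = p_fisher)
```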
Examples
example_data <- dplyr::tibble(
  group1 = sample(c("a", "b", "c"), 100, replace = TRUE),
  group2 = sample(c("male", "female"), 100, replace = TRUE)
)
tab(example_data, group1, group2)
tab(example_data, group1, group2, show = "col")
tab(example_data, group1, group2, test = "none")
result <- tab(example_data, group1, group2)
Summarise a categorical variable
Description
Summarises frequencies and percentages for a categorical variable.
The function accepts input from a dplyr pipe (%>%) and outputs the results as a tibble, e.g. example_data %>% tab1(variable)
Usage
tab1(data, variable, ..., dp = 1)
Arguments
data |
The data frame or tibble |
variable |
The categorical variable you would like to summarise |
... |
Not used. Passing additional variables raises an informative
error suggesting |
dp |
The number of decimal places for percentages (default=1) |
Value
A tibble with frequencies and percentages
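A one-way frequency table of this kind can be sketched in base R as below. The helper freq_table and its column names are hypothetical, and treating missing values as their own category with all rows in the denominator is an assumption rather than documented tab1() behaviour.

```r
# Hypothetical helper, not part of sumvar: frequencies and percentages for one variable.
freq_table <- function(x, dp = 1) {
  # useNA = "ifany" keeps missing values as a separate category (an assumption).
  counts <- table(x, useNA = "ifany")
  data.frame(
    value = names(counts),
    n = as.integer(counts),
    pct = round(100 * as.integer(counts) / length(x), dp)
  )
}

freq_table(c("a", "a", "b", NA))
```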
Examples
example_data <- dplyr::tibble(
  id = 1:100,
  group = sample(c("a", "b", "c", "d"), size = 100, replace = TRUE)
)
example_data$group[sample(1:100, size = 10)] <- NA # Replace 10 with missing
tab1(example_data, group)
summary <- tab1(example_data, group) # Save summary statistics as a tibble.