| Version: | 0.4.0 |
| Title: | Simplifies Plotting Data Inside Databases |
| Description: | Leverages 'dplyr' to process the calculations of a plot inside a database. This package provides helper functions that abstract the work at three levels: outputs a 'ggplot', outputs the calculations, outputs the formula needed to calculate bins. |
| Depends: | R (≥ 4.1.0) |
| Imports: | dplyr (≥ 1.0.0), rlang (≥ 1.0.0), ggplot2 (≥ 3.3.0), purrr |
| Suggests: | DBI, dbplyr (≥ 2.0.0), testthat (≥ 3.0.0), tidyr, covr, lifecycle, duckdb, RSQLite, nycflights13 |
| License: | MIT + file LICENSE |
| URL: | https://github.com/edgararuiz/dbplot |
| BugReports: | https://github.com/edgararuiz/dbplot/issues |
| RoxygenNote: | 7.3.3 |
| Encoding: | UTF-8 |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2026-03-13 16:09:32 UTC; edgar |
| Author: | Edgar Ruiz [aut, cre] |
| Maintainer: | Edgar Ruiz <edgararuiz@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-03-13 16:30:02 UTC |
dbplot: Simplifies Plotting Data Inside Databases
Description
Leverages 'dplyr' to process the calculations of a plot inside a database. This package provides helper functions that abstract the work at three levels: outputs a 'ggplot', outputs the calculations, outputs the formula needed to calculate bins.
The dbplot package provides functions to create plots using data that resides in databases or remote data sources. It leverages dplyr and dbplyr to push computations to the database, allowing you to visualize large datasets without loading them entirely into R memory.
Details
## Main Features
dbplot provides three levels of functionality:
1. **Plot Functions** - Functions that output ggplot2 objects: - [dbplot_histogram()] - Histogram plots - [dbplot_bar()] - Bar plots for discrete variables - [dbplot_line()] - Line plots for discrete variables - [dbplot_raster()] - Raster/heatmap plots for two continuous variables - [dbplot_boxplot()] - Boxplot for grouped continuous data
2. **Computation Functions** - Functions that return data frames with aggregated data: - [db_compute_bins()] - Histogram bins and counts - [db_compute_count()] - Counts per discrete value - [db_compute_raster()] - Aggregated data per x/y intersection - [db_compute_raster2()] - Raster data with box coordinates - [db_compute_boxplot()] - Boxplot statistics
3. **Utility Functions** - Building blocks for custom operations: - [db_bin()] - Creates binning formulas for use in dplyr verbs
## Database Compatibility
dbplot works with any database backend supported by dplyr/dbplyr, including: - SQLite - PostgreSQL - MySQL/MariaDB - SQL Server - Oracle - Spark (via sparklyr)
## Minimum Requirements
- R >= 4.1.0 (for native pipe '|>' support) - dplyr >= 1.0.0 - ggplot2 >= 3.3.0 - rlang >= 1.0.0 - purrr (any version)
## Usage Philosophy
The package follows these principles:
1. **Push Computations to Database**: All aggregations happen in the database, minimizing data transfer and memory usage.
2. **Familiar dplyr Syntax**: Uses standard dplyr/tidyverse patterns, making it easy to integrate into existing workflows.
3. **Lazy Evaluation**: Leverages dplyr's lazy evaluation to build efficient database queries.
4. **ggplot2 Output**: Plot functions return ggplot2 objects that can be further customized using standard ggplot2 functions.
## Breaking Changes in 0.4.0
Version 0.4.0 introduced some breaking changes:
- **No longer exports ' load magrittr explicitly with 'library(magrittr)'.
- **Minimum R version increased**: Now requires R >= 4.1.0 (previously >= 3.1).
- **Modern dependencies**: Updated to require dplyr >= 1.0.0, rlang >= 1.0.0, and other modern package versions.
Author(s)
Maintainer: Edgar Ruiz edgararuiz@gmail.com
See Also
Useful links:
Useful links: - Report bugs: <https://github.com/edgararuiz/dbplot/issues> - Database connections: <https://solutions.posit.co/connections/db/> - Spark connections: <https://spark.posit.co> - dplyr documentation: <https://dplyr.tidyverse.org/> - dbplyr documentation: <https://dbplyr.tidyverse.org/>
Examples
## Not run:
library(dplyr)
library(dbplot)
library(DBI)
# Connect to database
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_mtcars <- copy_to(con, mtcars, "mtcars")
# Create histogram
db_mtcars |>
dbplot_histogram(mpg)
# Create bar plot with custom aggregation
db_mtcars |>
dbplot_bar(cyl, avg_mpg = mean(mpg))
# Get computation results for custom plotting
db_mtcars |>
db_compute_bins(mpg, bins = 20) |>
ggplot2::ggplot() +
ggplot2::geom_col(ggplot2::aes(mpg, count))
# Use db_bin() directly in dplyr
db_mtcars |>
group_by(bin = !!db_bin(mpg, bins = 10)) |>
count()
dbDisconnect(con)
## End(Not run)
Bin formula
Description
Uses the rlang package to build the formula needed to create the bins of a numeric variable in an unevaluated fashion. This way, the formula can be then passed inside a dplyr verb.
Usage
db_bin(var, bins = 30, binwidth = NULL)
Arguments
var |
Variable name or formula |
bins |
Number of bins. Defaults to 30. |
binwidth |
Fixed width for each bin, in the same units as the data. Overrides bins when specified |
Value
An unevaluated expression (rlang quosure) that calculates bin membership for the specified variable. This expression is designed to be used within dplyr verbs using the '!!' operator.
Examples
## Not run:
library(DBI)
library(dplyr)
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_mtcars <- copy_to(con, mtcars, "mtcars")
# Important: Always name the field and
# prefix the function with `!!` (See Details)
# Uses the default 30 bins
db_mtcars |>
group_by(x = !!db_bin(mpg)) |>
count()
# Uses binwidth which overrides bins
db_mtcars |>
group_by(x = !!db_bin(mpg, binwidth = 10)) |>
count()
dbDisconnect(con)
## End(Not run)
Calculate histogram bins
Description
Uses dplyr operations to create histogram bins. Because of this approach, the calculations automatically run inside the database if 'data' has a database or sparklyr connection. The 'class()' of such tables in R are: tbl_sql, tbl_dbi, tbl_spark
Usage
db_compute_bins(data, x, bins = 30, binwidth = NULL)
Arguments
data |
A table (tbl) |
x |
A continuous variable |
bins |
Number of bins. Defaults to 30. |
binwidth |
Fixed width for each bin, in the same units as the data. Overrides bins when specified |
Value
An ungrouped data.frame with two columns: the bin values for the x variable and the count of observations in each bin.
See Also
Examples
## Not run:
library(DBI)
library(dplyr)
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_mtcars <- copy_to(con, mtcars, "mtcars")
# Returns record count for 30 bins in mpg
db_mtcars |>
db_compute_bins(mpg)
# Returns record count for bins of size 10
db_mtcars |>
db_compute_bins(mpg, binwidth = 10)
dbDisconnect(con)
## End(Not run)
Returns a dataframe with boxplot calculations
Description
Uses dplyr operations to create boxplot calculations. Because of this approach, the calculations automatically run inside the database if 'data' has a database or sparklyr connection. The 'class()' of such tables in R are: tbl_sql, tbl_dbi, tbl_spark
Requires database support for percentile/quantile functions. Confirmed to work with:
DuckDB (recommended for local examples) - uses quantile()
Spark/Hive (via sparklyr) - uses percentile_approx()
SQL Server (2012+) - uses PERCENTILE_CONT()
PostgreSQL (9.4+) - uses percentile_cont()
Oracle (9i+) - uses PERCENTILE_CONT()
Does NOT work with SQLite, MySQL < 8.0, or MariaDB (no percentile support).
Note that this function supports input tbl that already contains grouping variables. This can be useful when creating faceted boxplots.
Usage
db_compute_boxplot(data, x, var, coef = 1.5)
Arguments
data |
A table (tbl) that can already contain grouping variables |
x |
A discrete variable in which to group the boxplots |
var |
A continuous variable |
coef |
Length of the whiskers as multiple of IQR. Defaults to 1.5 |
Value
An ungrouped data.frame with boxplot statistics including columns for the grouping variable(s), quartiles (lower, middle, upper), whisker bounds (ymin, ymax), and the count of observations per group.
Examples
## Not run:
library(DBI)
library(dplyr)
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_mtcars <- copy_to(con, mtcars, "mtcars")
db_mtcars |>
db_compute_boxplot(am, mpg)
dbDisconnect(con)
## End(Not run)
Aggregates over a discrete field
Description
Uses dplyr operations to aggregate data. Because of this approach, the calculations automatically run inside the database if 'data' has a database or sparklyr connection. The 'class()' of such tables in R are: tbl_sql, tbl_dbi, tbl_spark
Usage
db_compute_count(data, x, ..., y = n())
Arguments
data |
A table (tbl) |
x |
A discrete variable |
... |
A set of named or unnamed aggregations |
y |
The aggregation formula. Defaults to count (n) |
Value
An ungrouped data.frame with the discrete variable and one or more aggregation columns. The first column is the grouping variable (x), followed by the aggregated values.
Examples
## Not run:
library(DBI)
library(dplyr)
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_mtcars <- copy_to(con, mtcars, "mtcars")
# Returns the row count per am
db_mtcars |>
db_compute_count(am)
# Returns the average mpg per am
db_mtcars |>
db_compute_count(am, mean(mpg))
# Returns the average and sum of mpg per am
db_mtcars |>
db_compute_count(am, mean(mpg), sum(mpg))
dbDisconnect(con)
## End(Not run)
Aggregates intersections of two variables
Description
To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.
Uses dplyr operations to aggregate data. Because of this approach, the calculations automatically run inside the database if 'data' has a database or sparklyr connection. The 'class()' of such tables in R are: tbl_sql, tbl_dbi, tbl_spark
Usage
db_compute_raster(data, x, y, fill = n(), resolution = 100, complete = FALSE)
db_compute_raster2(data, x, y, fill = n(), resolution = 100, complete = FALSE)
Arguments
data |
A table (tbl) |
x |
A continuous variable |
y |
A continuous variable |
fill |
The aggregation formula. Defaults to count (n) |
resolution |
The number of bins created per variable. The higher the number, the more records will be imported from the source |
complete |
Uses tidyr::complete to include empty bins. Inserts value of 0. |
Details
There are two considerations when using a Raster plot with a database. Both considerations are related to the size of the results downloaded from the database:
- The number of bins requested: The higher the bins value is, the more data is downloaded from the database.
- How concentrated the data is: This refers to how many intersections return a value. The more intersections without a value, the less data is downloaded from the database.
Value
An ungrouped data.frame with three columns: the x variable bins, the y variable bins, and the aggregated fill values for each x-y intersection.
For 'db_compute_raster2': A data.frame with five columns - the x and y variable bins, the fill values, and additional columns for the upper bounds of each bin (x_2 and y_2), useful for defining precise tile boundaries.
Examples
## Not run:
library(DBI)
library(dplyr)
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_faithful <- copy_to(con, faithful, "faithful")
# Returns a 100x100 grid of record count of intersections of eruptions and waiting
db_faithful |>
db_compute_raster(eruptions, waiting)
# Returns a 50x50 grid of eruption averages of intersections of eruptions and waiting
db_faithful |>
db_compute_raster(eruptions, waiting, fill = mean(eruptions), resolution = 50)
dbDisconnect(con)
## End(Not run)
Bar plot
Description
Uses dplyr operations to aggregate data and then 'ggplot2' to create the plot. Because of this approach, the calculations automatically run inside the database if 'data' has a database or sparklyr connection. The 'class()' of such tables in R are: tbl_sql, tbl_dbi, tbl_spark
Usage
dbplot_bar(data, x, ..., y = n())
Arguments
data |
A table (tbl) |
x |
A discrete variable |
... |
A set of named or unnamed aggregations |
y |
The aggregation formula. Defaults to count (n) |
Value
A ggplot object with a bar plot. If multiple aggregations are provided, returns a list of ggplot objects, one for each aggregation.
See Also
dbplot_line ,
dbplot_histogram, dbplot_raster
Examples
## Not run:
library(DBI)
library(dplyr)
library(ggplot2)
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_mtcars <- copy_to(con, mtcars, "mtcars")
# Returns a plot of the row count per am
db_mtcars |>
dbplot_bar(am)
# Returns a plot of the average mpg per am
db_mtcars |>
dbplot_bar(am, mean(mpg))
# Returns the average and sum of mpg per am
db_mtcars |>
dbplot_bar(am, avg_mpg = mean(mpg), sum_mpg = sum(mpg))
dbDisconnect(con)
## End(Not run)
Boxplot
Description
Uses dplyr operations to aggregate data and then 'ggplot2' to create the boxplot. Because of this approach, the calculations automatically run inside the database if 'data' has a database or sparklyr connection. The 'class()' of such tables in R are: tbl_sql, tbl_dbi, tbl_spark
Requires database support for percentile/quantile functions.
See db_compute_boxplot for supported database backends.
Usage
dbplot_boxplot(data, x, var, coef = 1.5)
Arguments
data |
A table (tbl) |
x |
A discrete variable in which to group the boxplots |
var |
A continuous variable |
coef |
Length of the whiskers as multiple of IQR. Defaults to 1.5 |
Value
A ggplot object displaying boxplots for the specified variable grouped by x.
See Also
dbplot_bar, dbplot_line ,
dbplot_raster, dbplot_histogram
Examples
## Not run:
library(DBI)
library(dplyr)
library(ggplot2)
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_mtcars <- copy_to(con, mtcars, "mtcars")
db_mtcars |>
dbplot_boxplot(am, mpg)
dbDisconnect(con)
## End(Not run)
Histogram
Description
Uses dplyr operations to aggregate data and then 'ggplot2' to create the histogram. Because of this approach, the calculations automatically run inside the database if 'data' has a database or sparklyr connection. The 'class()' of such tables in R are: tbl_sql, tbl_dbi, tbl_spark
Usage
dbplot_histogram(data, x, bins = 30, binwidth = NULL)
Arguments
data |
A table (tbl) |
x |
A continuous variable |
bins |
Number of bins. Defaults to 30. |
binwidth |
Fixed width for each bin, in the same units as the data. Overrides bins when specified |
Value
A ggplot object displaying a histogram of the specified variable.
See Also
dbplot_bar, dbplot_line ,
dbplot_raster
Examples
## Not run:
library(DBI)
library(dplyr)
library(ggplot2)
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_mtcars <- copy_to(con, mtcars, "mtcars")
# A ggplot histogram with 30 bins
db_mtcars |>
dbplot_histogram(mpg)
# A ggplot histogram with bins of size 10
db_mtcars |>
dbplot_histogram(mpg, binwidth = 10)
dbDisconnect(con)
## End(Not run)
Line plot
Description
Uses dplyr operations to aggregate data and then 'ggplot2' to create a line plot. Because of this approach, the calculations automatically run inside the database if 'data' has a database or sparklyr connection. The 'class()' of such tables in R are: tbl_sql, tbl_dbi, tbl_spark
If multiple named aggregations are passed, 'dbplot' will only use one SQL query to perform all of the operations. The purpose is to increase efficiency, and only make one "trip" to the database in order to obtains multiple, related, plots.
Usage
dbplot_line(data, x, ..., y = n())
Arguments
data |
A table (tbl) |
x |
A discrete variable |
... |
A set of named or unnamed aggregations |
y |
The aggregation formula. Defaults to count (n) |
Value
A ggplot object with a line plot. If multiple aggregations are provided, returns a list of ggplot objects, one for each aggregation.
See Also
dbplot_bar,
dbplot_histogram, dbplot_raster
Examples
## Not run:
library(DBI)
library(dplyr)
library(ggplot2)
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_mtcars <- copy_to(con, mtcars, "mtcars")
# Returns a plot of the row count per cyl
db_mtcars |>
dbplot_line(cyl)
# Returns a plot of the average mpg per cyl
db_mtcars |>
dbplot_line(cyl, mean(mpg))
# Returns the average and sum of mpg per am
db_mtcars |>
dbplot_line(am, avg_mpg = mean(mpg), sum_mpg = sum(mpg))
dbDisconnect(con)
## End(Not run)
Raster plot
Description
To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.
Uses dplyr operations to aggregate data and ggplot2 to create a raster plot. Because of this approach, the calculations automatically run inside the database if 'data' has a database or sparklyr connection. The 'class()' of such tables in R are: tbl_sql, tbl_dbi, tbl_spark
Usage
dbplot_raster(data, x, y, fill = n(), resolution = 100, complete = FALSE)
Arguments
data |
A table (tbl) |
x |
A continuous variable |
y |
A continuous variable |
fill |
The aggregation formula. Defaults to count (n) |
resolution |
The number of bins created per variable. The higher the number, the more records will be imported from the source |
complete |
Uses tidyr::complete to include empty bins. Inserts value of 0. |
Details
There are two considerations when using a Raster plot with a database. Both considerations are related to the size of the results downloaded from the database:
- The number of bins requested: The higher the bins value is, the more data is downloaded from the database.
- How concentrated the data is: This refers to how many intersections return a value. The more intersections without a value, the less data is downloaded from the database.
Value
A ggplot object displaying a raster/heatmap plot of the aggregated data across the two continuous variables.
See Also
dbplot_bar, dbplot_line ,
dbplot_histogram
Examples
## Not run:
library(DBI)
library(dplyr)
library(ggplot2)
con <- dbConnect(duckdb::duckdb(), ":memory:")
db_faithful <- copy_to(con, faithful, "faithful")
# Returns a 100x100 raster plot of record count of intersections of eruptions and waiting
db_faithful |>
dbplot_raster(eruptions, waiting)
# Returns a 50x50 raster plot of eruption averages of intersections of eruptions and waiting
db_faithful |>
dbplot_raster(eruptions, waiting, fill = mean(eruptions), resolution = 50)
dbDisconnect(con)
## End(Not run)