dbplot

R-CMD-check CRAN status Coverage status

Leverages dplyr to process the calculations of a plot inside a database. This package provides helper functions that abstract the work at three levels:

  1. Functions that output a ggplot2 object
  2. Functions that output a data.frame object with the calculations
  3. Functions that create formulas for calculating bins for a Histogram or a Raster plot

Installation

You can install the released version from CRAN:

install.packages("dbplot")

Or the development version from GitHub, using the remotes package:

install.packages("remotes")
pak::pak("edgararuiz/dbplot")

Connecting to a data source

Example

The functions work with standard database connections (via DBI/dbplyr) and with Spark connections (via sparklyr). A local DuckDB database will be used for the examples in this README.

library(DBI)
library(dplyr)

con <- dbConnect(duckdb::duckdb(), ":memory:")
db_flights <- copy_to(con, nycflights13::flights, "flights")

ggplot

Histogram

By default dbplot_histogram() creates a 30 bin histogram

library(ggplot2)

db_flights |>
  dbplot_histogram(distance)
Histogram showing the distribution of flight distances with 30 bins

Histogram of flight distances with default 30 bins

Use binwidth to fix the bin size

db_flights |>
  dbplot_histogram(distance, binwidth = 400)
Histogram showing the distribution of flight distances with fixed bin width of 400 units

Histogram of flight distances with 400-unit bins

Because it outputs a ggplot2 object, more customization can be done

db_flights |>
  dbplot_histogram(distance, binwidth = 400) +
  labs(title = "Flights - Distance traveled") +
  theme_bw()
Histogram of flight distances with custom title 'Flights - Distance traveled' and minimal theme

Customized histogram with title and theme

Raster

To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.

A Raster plot basically does the same as a Histogram. It takes two continuous variables and creates discrete 2-dimensional bins represented as squares in the plot. It then determines either the number of rows inside each square or processes some aggregation, like an average.

db_flights |>
  dbplot_raster(sched_dep_time, sched_arr_time)
Heatmap showing the count of flights by scheduled departure time (x-axis) and scheduled arrival time (y-axis)

Raster plot of scheduled departure and arrival times

db_flights |>
  dbplot_raster(
    sched_dep_time,
    sched_arr_time,
    mean(distance, na.rm = TRUE)
    )
Heatmap showing the average flight distance for each combination of scheduled departure and arrival times

Raster plot showing average flight distance by time

db_flights |>
  dbplot_raster(
    sched_dep_time,
    sched_arr_time,
    mean(distance, na.rm = TRUE),
    resolution = 20
    )
Heatmap with 20x20 resolution showing average flight distance by scheduled times with larger bins

Raster plot with lower resolution (20x20 grid)

Bar Plot

db_flights |>
  dbplot_bar(origin)
Bar chart showing the number of flights from each origin airport (EWR, JFK, LGA)

Bar plot of flight counts by origin airport

db_flights |>
  dbplot_bar(origin, avg_delay =  mean(dep_delay, na.rm = TRUE))
Bar chart showing the average departure delay in minutes for each origin airport

Bar plot of average departure delay by origin airport

Line plot

db_flights |>
  dbplot_line(month)
Line chart showing the number of flights per month throughout the year

Line plot of flight counts by month

db_flights |>
  dbplot_line(month, avg_delay = mean(dep_delay, na.rm = TRUE))
Line chart showing the average departure delay in minutes for each month of the year

Line plot of average departure delay by month

Boxplot

It expects a discrete variable to group by, and a continuous variable to calculate the percentiles and IQR. It doesn’t calculate outliers.

Boxplot functions require database support for percentile/quantile calculations.

Supported databases:

Not supported: SQLite, MySQL < 8.0, MariaDB (no percentile functions)

Here is an example using dbplot_boxplot() with a local data frame:

nycflights13::flights |>
  dbplot_boxplot(origin, distance)
Boxplot showing the distribution of flight distances for each origin airport using a local data frame

Boxplot of flight distances by origin airport (local data)

Boxplot also works with database connections that support quantile functions:

db_flights |>
  dbplot_boxplot(origin, distance)
Boxplot showing the distribution of flight distances for each origin airport using a DuckDB database connection

Boxplot of flight distances by origin airport (DuckDB)

Calculation functions

If a more customized plot is needed, the data the underpins the plots can also be accessed:

  1. db_compute_bins() - Returns a data frame with the bins and count per bin
  2. db_compute_count() - Returns a data frame with the count per discrete value
  3. db_compute_raster() - Returns a data frame with the results per x/y intersection
  4. db_compute_raster2() - Returns same as db_compute_raster() function plus the coordinates of the x/y boxes
  5. db_compute_boxplot() - Returns a data frame with boxplot calculations
db_flights |>
  db_compute_bins(arr_delay) 
#> # A tibble: 28 × 2
#>    arr_delay  count
#>        <dbl>  <dbl>
#>  1      95.1   7890
#>  2     321.     232
#>  3     729.       5
#>  4     548.       6
#>  5     684.       1
#>  6     -40.7 207999
#>  7      NA     9430
#>  8     276.     425
#>  9     457.      23
#> 10     593        6
#> # ℹ 18 more rows

The data can be piped to a plot

db_flights |>
  filter(arr_delay < 100 , arr_delay > -50) |>
  db_compute_bins(arr_delay) |>
  ggplot() +
  geom_col(aes(arr_delay, count, fill = count))
Custom colored histogram showing the distribution of arrival delays between -50 and 100 minutes

Custom histogram of arrival delays using db_compute_bins

db_bin()

Uses ‘rlang’ to build the formula needed to create the bins of a numeric variable in an un-evaluated fashion. This way, the formula can be then passed inside a dplyr verb.

db_bin(var)
#> (((max(var, na.rm = TRUE) - min(var, na.rm = TRUE))/30) * ifelse(as.integer(floor((var - 
#>     min(var, na.rm = TRUE))/((max(var, na.rm = TRUE) - min(var, 
#>     na.rm = TRUE))/30))) == 30, as.integer(floor((var - min(var, 
#>     na.rm = TRUE))/((max(var, na.rm = TRUE) - min(var, na.rm = TRUE))/30))) - 
#>     1, as.integer(floor((var - min(var, na.rm = TRUE))/((max(var, 
#>     na.rm = TRUE) - min(var, na.rm = TRUE))/30))))) + min(var, 
#>     na.rm = TRUE)
db_flights |>
  group_by(x = !! db_bin(arr_delay)) |>
  count()
#> # Source:   SQL [?? x 2]
#> # Database: DuckDB 1.4.4 [edgar@Darwin 25.3.0:R 4.5.2/:memory:]
#> # Groups:   x
#>         x      n
#>     <dbl>  <dbl>
#>  1 -40.7  207999
#>  2  NA      9430
#>  3 276.      425
#>  4 457.       23
#>  5 593         6
#>  6   4.53  79784
#>  7 186.     1742
#>  8  95.1    7890
#>  9 321.      232
#> 10 729.        5
#> # ℹ more rows
db_flights |>
  filter(!is.na(arr_delay)) |>
  group_by(x = !! db_bin(arr_delay)) |>
  count()|>
  collect() |>
  ggplot() +
  geom_col(aes(x, n))
Histogram showing the distribution of arrival delays created using the db_bin utility function

Custom histogram of arrival delays using db_bin

dbDisconnect(con)