Re-using Methods

NB: the following is an advanced use of deident. If you are just getting started, we recommend looking at the other vignettes first.

While the deident package implements several methods for deidentification, one of its key advantages is that its methods are ‘stateful’: a method keeps what it has learned after being applied, so it can be re-used and shared across data sets.

If you wish to share a method between different pipelines, the cleanest approach is to initialize the method of interest and then pass it into the first pipeline:

library(deident)

psu <- Pseudonymizer$new()

name_pipe <- starwars |>
  deident(psu, name)

apply_deident(starwars, name_pipe)
#> # A tibble: 87 × 14
#>    name  height  mass hair_color    skin_color eye_color birth_year sex   gender
#>    <chr>  <int> <dbl> <chr>         <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 v3MUe    172    77 blond         fair       blue            19   male  mascu…
#>  2 7rHIx    167    75 <NA>          gold       yellow         112   none  mascu…
#>  3 q5Vhs     96    32 <NA>          white, bl… red             33   none  mascu…
#>  4 KQz8x    202   136 none          white      yellow          41.9 male  mascu…
#>  5 50zEr    150    49 brown         light      brown           19   fema… femin…
#>  6 PxvnO    178   120 brown, grey   light      blue            52   male  mascu…
#>  7 riJWk    165    75 brown         light      blue            47   fema… femin…
#>  8 vpMZA     97    32 <NA>          white, red red             NA   none  mascu…
#>  9 4YeYM    183    84 black         light      brown           24   male  mascu…
#> 10 OCtXW    182    77 auburn, white fair       blue-gray       57   male  mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Having called apply_deident, the Pseudonymizer psu has learned an encoding for each string in starwars$name. If these strings appear a second time, they will be replaced in the same way, so we can build a second pipeline using psu:

combined.frm <- data.frame(
  ID = c(head(starwars$name, 5), head(ShiftsWorked$Employee, 5))
)

reused_pipe <- combined.frm |>
  deident(psu, ID)

apply_deident(combined.frm, reused_pipe)
#>       ID
#> 1  v3MUe
#> 2  7rHIx
#> 3  q5Vhs
#> 4  KQz8x
#> 5  50zEr
#> 6  2vEoX
#> 7  beMKE
#> 8  rpSge
#> 9  Zq1ja
#> 10 4Eo42

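Since the first 5 elements of combined.frm$ID are the same as the first 5 elements of starwars$name, the first 5 rows of each transformed data set are also the same. The last 5 values, taken from ShiftsWorked$Employee, had not been seen by psu before and so receive new encodings.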
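
The re-use behaviour described above can be pictured with a toy encoder written in base R. This is only a sketch of the general idea of a stateful lookup and is not how deident implements Pseudonymizer: the function make_toy_encoder and the sequential "id_001" style codes are hypothetical and chosen purely for illustration (the real method produces random-looking strings, as shown in the output above).

```r
# Illustrative only: a toy stateful encoder, NOT the deident implementation.
# make_toy_encoder() is a hypothetical helper invented for this sketch.
make_toy_encoder <- function() {
  # Named character vector acting as the state: names are original
  # values, elements are the codes assigned to them.
  lookup <- character(0)

  function(x) {
    x <- as.character(x)

    # Values that have not been encoded by this encoder before.
    unseen <- setdiff(unique(x[!is.na(x)]), names(lookup))

    if (length(unseen) > 0) {
      # Assign the next free sequential codes to the unseen values.
      new_codes <- sprintf("id_%03d", length(lookup) + seq_along(unseen))
      # `<<-` updates the lookup held in the enclosing environment,
      # so the state survives between calls.
      lookup[unseen] <<- new_codes
    }

    # Previously seen values are mapped to the codes they already have.
    unname(lookup[x])
  }
}

encode <- make_toy_encoder()

first  <- encode(c("Luke", "Leia", "Han"))
second <- encode(c("Leia", "Chewbacca", "Luke"))

print(first)
print(second)
```

In this sketch "Leia" and "Luke" receive the same codes in both calls because the encoder remembers them, while "Chewbacca" is new and receives a fresh code. Sharing psu between pipelines relies on the same kind of remembered state.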