Worked Example

To assist new users, this vignette serves as a worked example of building, applying and saving a de-identification pipeline. Because supplying data that contains genuine PID would itself be a risk, we instead use a synthetic data set, ShiftsWorked, comprising artificial PID.

The data set consists of 3,100 observations, where each row records the work done by an individual on a given day (shift status, shift start and end times, and daily pay). ShiftsWorked has 7 variables:

library(deident)
ShiftsWorked
Record ID  Employee        Date        Shift  Shift Start  Shift End  Daily Pay
1          Maria Cook      2015-01-01  Night  17:01        00:01      78.06
2          Stephen Cox     2015-01-01  Day    08:01        16:01      155.36
3          Kimberly Ortiz  2015-01-01  Day    08:01        16:01      77.81
4          Nathan Alvarez  2015-01-01  Day    08:01        15:01      202.98
5          Samuel Parker   2015-01-01  Night  16:01        23:01      210.52
6          Scott Morris    2015-01-01  Night  17:01        00:01      141.78

Consider a hypothetical problem. As part of a research project we wish to share this data set with others, but we don’t have permission to share individuals’ names. How do we solve this issue?

The simplest method for de-identification is direct replacement, i.e. ‘pseudonymization’. Using deident involves two steps:

  1. Define a pipeline
  2. Apply to the data set

NB: we set a random seed using set.seed here for reproducibility. We recommend users avoid this step when using the package in production code.

set.seed(101)
pipeline <- deident(ShiftsWorked, "psudonymize", Employee)
apply_deident(ShiftsWorked, pipeline)
Record ID  Employee  Date        Shift  Shift Start  Shift End  Daily Pay
1          i4TE2     2015-01-01  Night  17:01        00:01      78.06
2          q5E87     2015-01-01  Day    08:01        16:01      155.36
3          61IcF     2015-01-01  Day    08:01        16:01      77.81
4          iFIIH     2015-01-01  Day    08:01        15:01      202.98
5          ZE7E0     2015-01-01  Night  16:01        23:01      210.52
6          6WnlG     2015-01-01  Night  17:01        00:01      141.78
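
To make the idea of direct replacement concrete, the following base R sketch builds a lookup from each distinct value to a random five-character code and then applies it. It is an illustration only: `make_pseudonym` and `pseudonymize_sketch` are hypothetical helpers written for this example, and we do not claim this is how deident implements its pseudonymizer.

```r
# Illustrative sketch only; these helpers are hypothetical, not part of deident.

# Draw a random alphanumeric code of the requested length.
make_pseudonym <- function(n_chars = 5) {
  pool <- c(letters, LETTERS, 0:9)
  paste(sample(pool, n_chars, replace = TRUE), collapse = "")
}

# Replace each distinct value of x with its own random code.
pseudonymize_sketch <- function(x) {
  x <- as.character(x)
  lookup <- character(0)
  for (key in unique(x)) {
    # Redraw until the code is not already in use, so codes stay unique.
    repeat {
      candidate <- make_pseudonym()
      if (!candidate %in% lookup) break
    }
    lookup[key] <- candidate
  }
  list(values = unname(lookup[x]), lookup = lookup)
}

employees <- c("Maria Cook", "Stephen Cox", "Maria Cook")
result <- pseudonymize_sketch(employees)
result$values  # the first and third entries share one pseudonym
```

The key property is consistency: a repeated name always maps to the same code, so records belonging to one individual remain linkable after the names are removed. The deident call above handles the whole process in a single step.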

This approach is relatively simple, but what if we wish to dig into the transformation? In the case of this direct replacement, we might want to access the lookup table that underpins the pseudonymizer. The easiest way to achieve this is to create an instance of the Pseudonymizer class and pass it to deident:

psu <- Pseudonymizer$new()
pipeline2 <- deident(ShiftsWorked, psu, Employee)

apply_deident(ShiftsWorked, pipeline2)
Record ID  Employee  Date        Shift  Shift Start  Shift End  Daily Pay
1          0CwoC     2015-01-01  Night  17:01        00:01      78.06
2          iTkgr     2015-01-01  Day    08:01        16:01      155.36
3          HwZQa     2015-01-01  Day    08:01        16:01      77.81
4          mCwsK     2015-01-01  Day    08:01        15:01      202.98
5          x5lX2     2015-01-01  Night  16:01        23:01      210.52
6          Gv3KR     2015-01-01  Night  17:01        00:01      141.78

Once the transform has been performed, we can check the lookup attribute:

unlist(psu$lookup)
#>            Maria Cook           Stephen Cox        Kimberly Ortiz 
#>               "0CwoC"               "iTkgr"               "HwZQa" 
#>        Nathan Alvarez         Samuel Parker          Scott Morris 
#>               "mCwsK"               "x5lX2"               "Gv3KR" 
...
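
Because the lookup maps original names to pseudonyms, inverting it maps pseudonyms back to names. The sketch below shows this in base R, using the six entries printed above typed in by hand; in a live session the same vector would presumably come from unlist(psu$lookup). The name `reverse_lookup` is ours, chosen for illustration.

```r
# The six lookup entries shown in the output above, entered by hand so the
# example runs without deident.
lookup <- c(
  "Maria Cook"     = "0CwoC",
  "Stephen Cox"    = "iTkgr",
  "Kimberly Ortiz" = "HwZQa",
  "Nathan Alvarez" = "mCwsK",
  "Samuel Parker"  = "x5lX2",
  "Scott Morris"   = "Gv3KR"
)

# Invert the mapping: pseudonyms become the names, originals the values.
reverse_lookup <- setNames(names(lookup), unname(lookup))

# Recover the original names behind two pseudonyms.
unname(reverse_lookup[c("HwZQa", "0CwoC")])
#> [1] "Kimberly Ortiz" "Maria Cook"
```

Since anyone holding the lookup can undo the pseudonymization this way, it should be stored and shared with the same care as the original data.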

Often we will need to apply multiple transformations. Considering the ShiftsWorked data, imagine the employees object to having their exact daily pay divulged (even once names are removed). We can expand the pipeline to allow for this:

blur <- NumericBlurer$new(cuts = c(0, 100, 200, 300))

multistep_pipeline <- ShiftsWorked |> 
  deident(psu, Employee) |> 
  deident(blur, `Daily Pay`)

ShiftsWorked |> 
  apply_deident(multistep_pipeline)
Record ID  Employee  Date        Shift  Shift Start  Shift End  Daily Pay
1          0CwoC     2015-01-01  Night  17:01        00:01      (0,100]
2          iTkgr     2015-01-01  Day    08:01        16:01      (100,200]
3          HwZQa     2015-01-01  Day    08:01        16:01      (0,100]
4          mCwsK     2015-01-01  Day    08:01        15:01      (200,300]
5          x5lX2     2015-01-01  Night  16:01        23:01      (200,300]
6          Gv3KR     2015-01-01  Night  17:01        00:01      (100,200]
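
The interval labels in the Daily Pay column follow the same notation as base R's cut(). As a point of comparison, the sketch below bins the six pay values from the table with the same cut points. Whether NumericBlurer calls cut() internally is an assumption we have not verified; the example only shows that the binning is equivalent for these values.

```r
# The six Daily Pay values shown in the first table of this vignette.
daily_pay <- c(78.06, 155.36, 77.81, 202.98, 210.52, 141.78)

# Bin into right-closed intervals using the same cut points as `blur`.
binned <- cut(daily_pay, breaks = c(0, 100, 200, 300))

as.character(binned)
#> [1] "(0,100]"   "(100,200]" "(0,100]"   "(200,300]" "(200,300]" "(100,200]"
```

Note that in base R a value outside the outermost cut points (for example, a daily pay above 300) is returned as NA, so the cut points should span the full range of the data.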

As well as offering multiple data transforms to choose from, deident allows the user to serialize their pipelines to a .yml file for audit and transfer. This is done via the to_yaml method and the from_yaml function:

multistep_pipeline$to_yaml("multistep_pipeline.yml")

restored_pipeline <- from_yaml("multistep_pipeline.yml")

ShiftsWorked |> 
  apply_deident(restored_pipeline)
Record ID  Employee  Date        Shift  Shift Start  Shift End  Daily Pay
1          0CwoC     2015-01-01  Night  17:01        00:01      (0,100]
2          iTkgr     2015-01-01  Day    08:01        16:01      (100,200]
3          HwZQa     2015-01-01  Day    08:01        16:01      (0,100]
4          mCwsK     2015-01-01  Day    08:01        15:01      (200,300]
5          x5lX2     2015-01-01  Night  16:01        23:01      (200,300]
6          Gv3KR     2015-01-01  Night  17:01        00:01      (100,200]