To assist new users, this vignette serves as a worked example of
building, applying and saving a de-identification pipeline. Because
supplying data with genuine PID to be redacted would be risky, we
instead supply ShiftsWorked, a synthetic data set comprising
artificial PID.
The data set consists of 3,100 observations, where each row lists the
work done (shift status, actual work done, and daily pay) by an
individual on a given day. ShiftsWorked consists of 7 variables:
Record ID | Employee | Date | Shift | Shift Start | Shift End | Daily Pay |
---|---|---|---|---|---|---|
1 | Maria Cook | 2015-01-01 | Night | 17:01 | 00:01 | 78.06 |
2 | Stephen Cox | 2015-01-01 | Day | 08:01 | 16:01 | 155.36 |
3 | Kimberly Ortiz | 2015-01-01 | Day | 08:01 | 16:01 | 77.81 |
4 | Nathan Alvarez | 2015-01-01 | Day | 08:01 | 15:01 | 202.98 |
5 | Samuel Parker | 2015-01-01 | Night | 16:01 | 23:01 | 210.52 |
6 | Scott Morris | 2015-01-01 | Night | 17:01 | 00:01 | 141.78 |
Consider a hypothetical problem. As part of a research project we wish to share this data set with others, but we don’t have permission to share individuals’ names. How do we solve this issue?
The simplest method for de-identification is a direct replacement,
known as ‘pseudonymization’. Using deident takes two steps: defining
a pipeline with deident, then applying it to the data with
apply_deident.
NB: we set a random seed using set.seed here for reproducibility. We
recommend users avoid this step when using the package in production
code.
set.seed(101)
pipeline <- deident(ShiftsWorked, "psudonymize", Employee)
apply_deident(ShiftsWorked, pipeline)
Record ID | Employee | Date | Shift | Shift Start | Shift End | Daily Pay |
---|---|---|---|---|---|---|
1 | i4TE2 | 2015-01-01 | Night | 17:01 | 00:01 | 78.06 |
2 | q5E87 | 2015-01-01 | Day | 08:01 | 16:01 | 155.36 |
3 | 61IcF | 2015-01-01 | Day | 08:01 | 16:01 | 77.81 |
4 | iFIIH | 2015-01-01 | Day | 08:01 | 15:01 | 202.98 |
5 | ZE7E0 | 2015-01-01 | Night | 16:01 | 23:01 | 210.52 |
6 | 6WnlG | 2015-01-01 | Night | 17:01 | 00:01 | 141.78 |
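To make the idea of direct replacement concrete, the following base-R sketch maps each distinct name to a random five-character code and applies that mapping to a column. It is an illustration only and should not be read as how deident implements its pseudonymizer; `make_code` and `pseudonymize_sketch` are hypothetical helper names introduced here.

```r
# Hypothetical helper: draw one random alphanumeric code of length n.
make_code <- function(n = 5) {
  paste(sample(c(letters, LETTERS, 0:9), n, replace = TRUE), collapse = "")
}

# Hypothetical helper: replace each distinct value with its own random code.
pseudonymize_sketch <- function(x) {
  codes <- character(0)
  for (key in unique(x)) {
    code <- make_code()
    # Redraw in the unlikely event of a collision so codes stay unique.
    while (code %in% codes) code <- make_code()
    codes[key] <- code
  }
  list(values = unname(codes[x]), lookup = codes)
}

employees <- c("Maria Cook", "Stephen Cox", "Maria Cook")
result <- pseudonymize_sketch(employees)
result$values  # the first and third entries share a code
result$lookup  # named vector: original name -> code
```

Because the codes are drawn at random, repeated runs give different codes unless a seed is fixed beforehand, which is why the example above calls set.seed.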
This approach is relatively simple, but what if we wish to dig into
the transformation? In the case of this direct replacement, we might
want to access the lookup table that underpins the pseudonymizer. The
easiest way to achieve this is to create an instance of the
Pseudonymizer class and pass it in to deident:
psu <- Pseudonymizer$new()
pipeline2 <- deident(ShiftsWorked, psu, Employee)
apply_deident(ShiftsWorked, pipeline2)
Record ID | Employee | Date | Shift | Shift Start | Shift End | Daily Pay |
---|---|---|---|---|---|---|
1 | 0CwoC | 2015-01-01 | Night | 17:01 | 00:01 | 78.06 |
2 | iTkgr | 2015-01-01 | Day | 08:01 | 16:01 | 155.36 |
3 | HwZQa | 2015-01-01 | Day | 08:01 | 16:01 | 77.81 |
4 | mCwsK | 2015-01-01 | Day | 08:01 | 15:01 | 202.98 |
5 | x5lX2 | 2015-01-01 | Night | 16:01 | 23:01 | 210.52 |
6 | Gv3KR | 2015-01-01 | Night | 17:01 | 00:01 | 141.78 |
Once the transform has been performed, we can check the lookup
attribute:
unlist(psu$lookup)
#> Maria Cook Stephen Cox Kimberly Ortiz
#> "0CwoC" "iTkgr" "HwZQa"
#> Nathan Alvarez Samuel Parker Scott Morris
#> "mCwsK" "x5lX2" "Gv3KR"
...
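A lookup of this shape can also be read in the other direction. The sketch below assumes only that the lookup is available as a named character vector (original name to code), as in the printed output; the values are copied from that output, and `reverse_lookup` is a name introduced here for illustration.

```r
# Lookup copied from the printed output: names are originals, values are codes.
lookup <- c(
  "Maria Cook"     = "0CwoC",
  "Stephen Cox"    = "iTkgr",
  "Kimberly Ortiz" = "HwZQa",
  "Nathan Alvarez" = "mCwsK",
  "Samuel Parker"  = "x5lX2",
  "Scott Morris"   = "Gv3KR"
)

# Swap names and values to obtain a code -> original name mapping.
reverse_lookup <- setNames(names(lookup), lookup)

reverse_lookup[["iTkgr"]]
#> [1] "Stephen Cox"
```

Since the mapping can be inverted this easily, the lookup table is as sensitive as the original names and should be stored accordingly.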
Often we will need to apply multiple transformations. Considering the
ShiftsWorked data, imagine the employees are complaining that they
don’t want their exact pay divulged (even once names are removed). We
can expand the pipeline to allow for this:
blur <- NumericBlurer$new(cuts = c(0, 100, 200, 300))
multistep_pipeline <- ShiftsWorked |>
  deident(psu, Employee) |>
  deident(blur, `Daily Pay`)
ShiftsWorked |>
  apply_deident(multistep_pipeline)
Record ID | Employee | Date | Shift | Shift Start | Shift End | Daily Pay |
---|---|---|---|---|---|---|
1 | 0CwoC | 2015-01-01 | Night | 17:01 | 00:01 | (0,100] |
2 | iTkgr | 2015-01-01 | Day | 08:01 | 16:01 | (100,200] |
3 | HwZQa | 2015-01-01 | Day | 08:01 | 16:01 | (0,100] |
4 | mCwsK | 2015-01-01 | Day | 08:01 | 15:01 | (200,300] |
5 | x5lX2 | 2015-01-01 | Night | 16:01 | 23:01 | (200,300] |
6 | Gv3KR | 2015-01-01 | Night | 17:01 | 00:01 | (100,200] |
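The interval labels in the Daily Pay column follow the same convention as base R's `cut()`. As a rough sketch, assuming the blurring step behaves like `cut()` with the supplied cut points (the source does not confirm how NumericBlurer is implemented), the six pay values shown earlier bin as follows:

```r
# Daily pay for the first six rows of ShiftsWorked, as printed earlier.
daily_pay <- c(78.06, 155.36, 77.81, 202.98, 210.52, 141.78)

# Bin into right-closed intervals using the same cut points as the blurer.
binned <- cut(daily_pay, breaks = c(0, 100, 200, 300))

as.character(binned)
#> [1] "(0,100]"   "(100,200]" "(0,100]"   "(200,300]" "(200,300]" "(100,200]"
```

With `cut()`, any value outside the outermost cut points becomes `NA`, so the cut points should be chosen to span the full range of the column.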
As well as having multiple data transforms to choose from, deident
also allows the user to easily serialize their pipelines to a .yml
file for audit and transfer. This is done via the to_yaml and
from_yaml calls:
multistep_pipeline$to_yaml("multistep_pipeline.yml")
restored_pipeline <- from_yaml("multistep_pipeline.yml")
ShiftsWorked |>
  apply_deident(restored_pipeline)
Record ID | Employee | Date | Shift | Shift Start | Shift End | Daily Pay |
---|---|---|---|---|---|---|
1 | 0CwoC | 2015-01-01 | Night | 17:01 | 00:01 | (0,100] |
2 | iTkgr | 2015-01-01 | Day | 08:01 | 16:01 | (100,200] |
3 | HwZQa | 2015-01-01 | Day | 08:01 | 16:01 | (0,100] |
4 | mCwsK | 2015-01-01 | Day | 08:01 | 15:01 | (200,300] |
5 | x5lX2 | 2015-01-01 | Night | 16:01 | 23:01 | (200,300] |
6 | Gv3KR | 2015-01-01 | Night | 17:01 | 00:01 | (100,200] |