Worked Example

To assist new users, this vignette serves as a worked example of building, applying and saving a de-identification pipeline. Because supplying data that contains genuine PID would be risky, we instead supply ShiftsWorked, a synthetic data set comprising artificial PID.

The data set consists of 3,100 observations, where each row records the work done by one individual on a given day (shift status, actual work done, and daily pay). ShiftsWorked has 7 variables:

Record ID: integer, primary key.
Employee: character, 100 artificial staff names.
Date: date.
Shift: character, one of ‘Day’, ‘Night’, ‘Rest’.
Shift Start: character, start time of shift (missing for ‘Rest’ shift).
Shift End: character, end time of shift (missing for ‘Rest’ shift).
Daily Pay: numeric, calculated remuneration (£).

library(deident)
ShiftsWorked

Record ID	Employee	Date	Shift	Shift Start	Shift End	Daily Pay
1	Maria Cook	2015-01-01	Night	17:01	00:01	78.06
2	Stephen Cox	2015-01-01	Day	08:01	16:01	155.36
3	Kimberly Ortiz	2015-01-01	Day	08:01	16:01	77.81
4	Nathan Alvarez	2015-01-01	Day	08:01	15:01	202.98
5	Samuel Parker	2015-01-01	Night	16:01	23:01	210.52
6	Scott Morris	2015-01-01	Night	17:01	00:01	141.78

Consider a hypothetical problem. As part of a research project we wish to share this data set with others, but we don’t have permission to share individuals’ names. How do we solve this issue?

The simplest method for de-identification is direct replacement, i.e. ‘pseudonymization’. Using deident takes two steps:

Define a pipeline
Apply to the data set

NB: we set a random seed using set.seed here for reproducibility. We recommend users avoid this step when using the package in production code.

set.seed(101)
pipeline <- deident(ShiftsWorked, "psudonymize", Employee)
apply_deident(ShiftsWorked, pipeline)

Record ID	Employee	Date	Shift	Shift Start	Shift End	Daily Pay
1	i4TE2	2015-01-01	Night	17:01	00:01	78.06
2	q5E87	2015-01-01	Day	08:01	16:01	155.36
3	61IcF	2015-01-01	Day	08:01	16:01	77.81
4	iFIIH	2015-01-01	Day	08:01	15:01	202.98
5	ZE7E0	2015-01-01	Night	16:01	23:01	210.52
6	6WnlG	2015-01-01	Night	17:01	00:01	141.78
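
To make the idea of direct replacement concrete, the following is a minimal base R sketch of a lookup-table pseudonymizer. It is not deident's implementation: the helper make_pseudonyms and its width argument are hypothetical names introduced only for illustration, and the codes it produces will not match the output above. It also shows why fixing the seed makes the mapping reproducible.

```r
# Hypothetical helper, NOT part of deident: illustrates direct replacement
# through a lookup table of random alphanumeric codes.
make_pseudonyms <- function(values, width = 5) {
  pool <- c(letters, LETTERS, 0:9)  # 62 candidate characters
  originals <- unique(values)
  codes <- character(0)
  # Keep drawing random codes until every unique value has a distinct one
  while (length(codes) < length(originals)) {
    candidate <- paste(sample(pool, width, replace = TRUE), collapse = "")
    codes <- unique(c(codes, candidate))
  }
  names(codes) <- originals
  codes
}

employees <- c("Maria Cook", "Stephen Cox", "Maria Cook", "Kimberly Ortiz")

set.seed(101)
lookup <- make_pseudonyms(employees)

# Replace each name with its code; repeated names receive the same code
pseudonymized <- unname(lookup[employees])

# Resetting the seed reproduces exactly the same mapping
set.seed(101)
lookup_again <- make_pseudonyms(employees)
identical(lookup, lookup_again)
```

Because the mapping is driven entirely by the random number generator, anyone who knows the seed and the input can regenerate it, which is one reason to avoid a fixed seed outside of examples.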

Passing the method name as a string is relatively simple, but what if we wish to dig into the transformation? In the case of direct replacement, we might want to access the lookup table that underpins the pseudonymizer. The easiest way to achieve this is to create an instance of the Pseudonymizer class and pass it to deident:

psu <- Pseudonymizer$new()
pipeline2 <- deident(ShiftsWorked, psu, Employee)

apply_deident(ShiftsWorked, pipeline2)

Record ID	Employee	Date	Shift	Shift Start	Shift End	Daily Pay
1	0CwoC	2015-01-01	Night	17:01	00:01	78.06
2	iTkgr	2015-01-01	Day	08:01	16:01	155.36
3	HwZQa	2015-01-01	Day	08:01	16:01	77.81
4	mCwsK	2015-01-01	Day	08:01	15:01	202.98
5	x5lX2	2015-01-01	Night	16:01	23:01	210.52
6	Gv3KR	2015-01-01	Night	17:01	00:01	141.78

Once the transform has been performed, we can check the lookup attribute:

unlist(psu$lookup)
#>            Maria Cook           Stephen Cox        Kimberly Ortiz 
#>               "0CwoC"               "iTkgr"               "HwZQa" 
#>        Nathan Alvarez         Samuel Parker          Scott Morris 
#>               "mCwsK"               "x5lX2"               "Gv3KR" 
...
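
One use of such a lookup is mapping pseudonyms back to the original names, for example when a collaborator reports an issue with a particular record. The sketch below uses base R only and hard-codes the six entries printed above rather than reading psu$lookup; the names reverse_lookup, shared_ids and recovered are introduced here purely for illustration.

```r
# Mapping copied from the six entries printed above (the full table is longer)
lookup <- c(
  "Maria Cook"     = "0CwoC",
  "Stephen Cox"    = "iTkgr",
  "Kimberly Ortiz" = "HwZQa",
  "Nathan Alvarez" = "mCwsK",
  "Samuel Parker"  = "x5lX2",
  "Scott Morris"   = "Gv3KR"
)

# Invert the mapping: pseudonyms become the names, originals the values
reverse_lookup <- setNames(names(lookup), lookup)

# Look up some pseudonyms; an unknown code yields NA
shared_ids <- c("x5lX2", "0CwoC", "zzzzz")
recovered <- unname(reverse_lookup[shared_ids])
recovered
```

Since the lookup allows re-identification, it should be stored separately from the shared data and with appropriate access controls.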

Often we will need to apply multiple transformations. Considering the ShiftsWorked data, imagine the employees complain that they don’t want their exact pay divulged (even once names are removed). We can expand the pipeline to allow for this:

blur <- NumericBlurer$new(cuts = c(0, 100, 200, 300))

multistep_pipeline <- ShiftsWorked |> 
  deident(psu, Employee) |> 
  deident(blur, `Daily Pay`)

ShiftsWorked |> 
  apply_deident(multistep_pipeline)

Record ID	Employee	Date	Shift	Shift Start	Shift End	Daily Pay
1	0CwoC	2015-01-01	Night	17:01	00:01	(0,100]
2	iTkgr	2015-01-01	Day	08:01	16:01	(100,200]
3	HwZQa	2015-01-01	Day	08:01	16:01	(0,100]
4	mCwsK	2015-01-01	Day	08:01	15:01	(200,300]
5	x5lX2	2015-01-01	Night	16:01	23:01	(200,300]
6	Gv3KR	2015-01-01	Night	17:01	00:01	(100,200]
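
The interval labels above have the same form as those produced by base R's cut(). As a rough illustration of the binning, and assuming (not confirmed here) that NumericBlurer groups values into right-closed intervals in the same way, the first six Daily Pay values can be binned directly:

```r
# First six Daily Pay values from the ShiftsWorked preview
daily_pay <- c(78.06, 155.36, 77.81, 202.98, 210.52, 141.78)

# Right-closed intervals (0,100], (100,200], (200,300]
bands <- cut(daily_pay, breaks = c(0, 100, 200, 300))
as.character(bands)

# With base cut(), a value outside the breaks becomes NA
outside <- cut(350, breaks = c(0, 100, 200, 300))
is.na(outside)
```

When choosing cut points, it is worth checking that they span the full range of the variable; whether NumericBlurer treats out-of-range values as cut() does is not established by this sketch.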

As well as having multiple data transforms to choose from, deident also allows the user to easily serialize their pipelines to a .yml file for audit and transfer. This is done via the calls to_yaml and from_yaml:

multistep_pipeline$to_yaml("multistep_pipeline.yml")

restored_pipeline <- from_yaml("multistep_pipeline.yml")

ShiftsWorked |> 
  apply_deident(restored_pipeline)

Record ID	Employee	Date	Shift	Shift Start	Shift End	Daily Pay
1	0CwoC	2015-01-01	Night	17:01	00:01	(0,100]
2	iTkgr	2015-01-01	Day	08:01	16:01	(100,200]
3	HwZQa	2015-01-01	Day	08:01	16:01	(0,100]
4	mCwsK	2015-01-01	Day	08:01	15:01	(200,300]
5	x5lX2	2015-01-01	Night	16:01	23:01	(200,300]
6	Gv3KR	2015-01-01	Night	17:01	00:01	(100,200]
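
A simple way to gain confidence in a restored pipeline is to compare its output with that of the original. The sketch below is self-contained and uses only the deident calls shown in this vignette; the temporary file path and the variable names before, after and restored are illustrative choices, and the comparison result is printed rather than assumed.

```r
library(deident)

# Rebuild the two-step pipeline using only calls shown in this vignette
psu <- Pseudonymizer$new()
blur <- NumericBlurer$new(cuts = c(0, 100, 200, 300))
pipeline <- ShiftsWorked |>
  deident(psu, Employee) |>
  deident(blur, `Daily Pay`)

# Apply once before saving, as in the walkthrough above
before <- apply_deident(ShiftsWorked, pipeline)

# Write to a temporary file rather than the working directory
path <- tempfile(fileext = ".yml")
pipeline$to_yaml(path)

# Restore the pipeline and apply it to the same data
restored <- from_yaml(path)
after <- apply_deident(ShiftsWorked, restored)

# Report whether the two outputs agree
print(all.equal(before, after))
```

If the comparison reports differences, inspect the .yml file before relying on the restored pipeline for audit or transfer.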