One way to reduce identifiability of a data set is by converting a categorical variable to have a more aggregated taxonomy (i.e. a many-to-one mapping). Here we refer to such a method as a ‘blur’ as it causes features to be joined together in such a way to hide the underlying information.
As an example, consider the ShiftsWorked
data:
library(deident)
head(ShiftsWorked)
#> # A tibble: 6 × 7
#> `Record ID` Employee Date Shift `Shift Start` `Shift End` `Daily Pay`
#> <int> <chr> <date> <chr> <chr> <chr> <dbl>
#> 1 1 Maria Cook 2015-01-01 Night 17:01 00:01 78.1
#> 2 2 Stephen Cox 2015-01-01 Day 08:01 16:01 155.
#> 3 3 Kimberly O… 2015-01-01 Day 08:01 16:01 77.8
#> 4 4 Nathan Alv… 2015-01-01 Day 08:01 15:01 203.
#> 5 5 Samuel Par… 2015-01-01 Night 16:01 23:01 211.
#> 6 6 Scott Morr… 2015-01-01 Night 17:01 00:01 142.
A simple ‘blur’ might be to change the taxonomy of ‘Shift’ e.g. combine ‘Day’ and ‘Night’ into a new group ‘Working’ and ignore the ‘Rest’ shifts. To do this we define the values we wish to change as a vector, build a pipeline and apply it to the data:
shift_blur <- c("Day" = "Working", "Night" = "Working")
blur_pipe <- ShiftsWorked |>
add_blur(Shift, blur=shift_blur)
apply_deident(ShiftsWorked, blur_pipe)
#> # A tibble: 3,100 × 7
#> `Record ID` Employee Date Shift `Shift Start` `Shift End` `Daily Pay`
#> <int> <chr> <date> <chr> <chr> <chr> <dbl>
#> 1 1 Maria Cook 2015-01-01 Work… 17:01 00:01 78.1
#> 2 2 Stephen C… 2015-01-01 Work… 08:01 16:01 155.
#> 3 3 Kimberly … 2015-01-01 Work… 08:01 16:01 77.8
#> 4 4 Nathan Al… 2015-01-01 Work… 08:01 15:01 203.
#> 5 5 Samuel Pa… 2015-01-01 Work… 16:01 23:01 211.
#> 6 6 Scott Mor… 2015-01-01 Work… 17:01 00:01 142.
#> 7 7 Nathan Sa… 2015-01-01 Rest <NA> <NA> 0
#> 8 8 Jose Lopez 2015-01-01 Work… 17:01 00:01 213.
#> 9 9 Donna Bro… 2015-01-01 Work… 16:01 00:01 219.
#> 10 10 George Ki… 2015-01-01 Work… 16:01 00:01 242.
#> # ℹ 3,090 more rows
category_blur
utilityApplying the blur is relatively simple, but constructing it can be
complex. Consider the starwars
data set supplied by
dplyr:
starwars <- dplyr::starwars
head(starwars)
#> # A tibble: 6 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
#> 5 Leia Org… 150 49 brown light brown 19 fema… femin…
#> 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
And notably the species
variable:
table(starwars$species)
#>
#> Aleena Besalisk Cerean Chagrian Clawdite
#> 1 1 1 1 1
#> Droid Dug Ewok Geonosian Gungan
#> 6 1 1 1 3
#> Human Hutt Iktotchi Kaleesh Kaminoan
#> 35 1 1 1 2
#> Kel Dor Mirialan Mon Calamari Muun Nautolan
#> 1 2 1 1 1
#> Neimodian Pau'an Quermian Rodian Skakoan
#> 1 1 1 1 1
#> Sullustan Tholothian Togruta Toong Toydarian
#> 1 1 1 1 1
#> Trandoshan Twi'lek Vulptereen Wookiee Xexto
#> 1 2 1 2 1
#> Yoda's species Zabrak
#> 1 2
Imagine we wanted to reduce identifiability by aggregating the data
into Human vs Non-Human. We could code the vector by hand, but human
error can lead to mistakes. To aid in designing complex blurs we supply
the category_blur
utility which uses regex to define the
groups.
human_blur <- category_blur(
starwars$species,
"NotHuman" = "^(?!Human)" # Doesn't start with "Human"
)
And the vector returned can be passed into a new pipeline as before.