Blur Example

One way to reduce identifiability of a data set is by converting a categorical variable to have a more aggregated taxonomy (i.e. a many-to-one mapping). Here we refer to such a method as a ‘blur’ as it causes features to be joined together in such a way to hide the underlying information.

As an example, consider the ShiftsWorked data:

library(deident)
head(ShiftsWorked)
#> # A tibble: 6 × 7
#>   `Record ID` Employee    Date       Shift `Shift Start` `Shift End` `Daily Pay`
#>         <int> <chr>       <date>     <chr> <chr>         <chr>             <dbl>
#> 1           1 Maria Cook  2015-01-01 Night 17:01         00:01              78.1
#> 2           2 Stephen Cox 2015-01-01 Day   08:01         16:01             155. 
#> 3           3 Kimberly O… 2015-01-01 Day   08:01         16:01              77.8
#> 4           4 Nathan Alv… 2015-01-01 Day   08:01         15:01             203. 
#> 5           5 Samuel Par… 2015-01-01 Night 16:01         23:01             211. 
#> 6           6 Scott Morr… 2015-01-01 Night 17:01         00:01             142.

A simple ‘blur’ might be to change the taxonomy of ‘Shift’ e.g. combine ‘Day’ and ‘Night’ into a new group ‘Working’ and ignore the ‘Rest’ shifts. To do this we define the values we wish to change as a vector, build a pipeline and apply it to the data:

shift_blur <- c("Day" = "Working", "Night" = "Working")
blur_pipe <- ShiftsWorked |>
  add_blur(Shift, blur=shift_blur)

apply_deident(ShiftsWorked, blur_pipe)
#> # A tibble: 3,100 × 7
#>    `Record ID` Employee   Date       Shift `Shift Start` `Shift End` `Daily Pay`
#>          <int> <chr>      <date>     <chr> <chr>         <chr>             <dbl>
#>  1           1 Maria Cook 2015-01-01 Work… 17:01         00:01              78.1
#>  2           2 Stephen C… 2015-01-01 Work… 08:01         16:01             155. 
#>  3           3 Kimberly … 2015-01-01 Work… 08:01         16:01              77.8
#>  4           4 Nathan Al… 2015-01-01 Work… 08:01         15:01             203. 
#>  5           5 Samuel Pa… 2015-01-01 Work… 16:01         23:01             211. 
#>  6           6 Scott Mor… 2015-01-01 Work… 17:01         00:01             142. 
#>  7           7 Nathan Sa… 2015-01-01 Rest  <NA>          <NA>                0  
#>  8           8 Jose Lopez 2015-01-01 Work… 17:01         00:01             213. 
#>  9           9 Donna Bro… 2015-01-01 Work… 16:01         00:01             219. 
#> 10          10 George Ki… 2015-01-01 Work… 16:01         00:01             242. 
#> # ℹ 3,090 more rows

The category_blur utility

Applying the blur is relatively simple, but constructing it can be complex. Consider the starwars data set supplied by dplyr:

starwars <- dplyr::starwars
head(starwars)
#> # A tibble: 6 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
#> 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
#> 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
#> 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> 5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
#> 6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

And notably the species variable:

table(starwars$species)
#> 
#>         Aleena       Besalisk         Cerean       Chagrian       Clawdite 
#>              1              1              1              1              1 
#>          Droid            Dug           Ewok      Geonosian         Gungan 
#>              6              1              1              1              3 
#>          Human           Hutt       Iktotchi        Kaleesh       Kaminoan 
#>             35              1              1              1              2 
#>        Kel Dor       Mirialan   Mon Calamari           Muun       Nautolan 
#>              1              2              1              1              1 
#>      Neimodian         Pau'an       Quermian         Rodian        Skakoan 
#>              1              1              1              1              1 
#>      Sullustan     Tholothian        Togruta          Toong      Toydarian 
#>              1              1              1              1              1 
#>     Trandoshan        Twi'lek     Vulptereen        Wookiee          Xexto 
#>              1              2              1              2              1 
#> Yoda's species         Zabrak 
#>              1              2

Imagine we wanted to reduce identifiability by aggregating the data into Human vs Non-Human. We could code the vector by hand, but human error can lead to mistakes. To aid in designing complex blurs we supply the category_blur utility which uses regex to define the groups.

human_blur <- category_blur(
  starwars$species,
  "NotHuman" = "^(?!Human)" # Doesn't start with "Human"
)

And the vector returned can be passed into a new pipeline as before.

species_pipe <- starwars |>
  add_blur(species, blur=human_blur)
  
new_starwars <- apply_deident(starwars, species_pipe)

table(new_starwars$species)
#> 
#>    Human NotHuman 
#>       35       48