Shuffle Example

Individual variables can make data personally identifiable, but this risk is usually easy to spot (e.g. names or social security numbers). The less obvious situation is when a combination of variables renders individuals identifiable.

# The starwars data set ships with dplyr
starwars <- dplyr::starwars
head(starwars)
#> # A tibble: 6 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke Sky…    172    77 blond      fair       blue            19   male  mascu…
#> 2 C-3PO        167    75 <NA>       gold       yellow         112   none  mascu…
#> 3 R2-D2         96    32 <NA>       white, bl… red             33   none  mascu…
#> 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> 5 Leia Org…    150    49 brown      light      brown           19   fema… femin…
#> 6 Owen Lars    178   120 brown, gr… light      blue            52   male  mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Inspection of the data shows that species alone can act as a unique identifier (e.g. ‘Admiral Ackbar’ is the only ‘Mon Calamari’), so we may consider aggregating species:

starwars |> 
  dplyr::filter(species == "Mon Calamari")
#> # A tibble: 1 × 14
#>   name   height  mass hair_color skin_color   eye_color birth_year sex   gender 
#>   <chr>   <int> <dbl> <chr>      <chr>        <chr>          <dbl> <chr> <chr>  
#> 1 Ackbar    180    83 none       brown mottle orange            41 male  mascul…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

Knowing that someone is ‘Human’ does not single them out in the same way. However, if we also know that they are from ‘Coruscant’ and have ‘blond’ hair (neither of which is uniquely identifying on its own), the combination reduces the data to a single case:

starwars |> 
  dplyr::filter(species == "Human", 
                homeworld == "Coruscant",
                hair_color == "blond"
                )
#> # A tibble: 1 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Finis Va…    170    NA blond      fair       blue              91 male  mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
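The same check can be run across the whole data set rather than one combination at a time. The following is a minimal sketch, assuming dplyr is installed; the names `unique_combos` and `n_rows` are illustrative and not part of any package. It counts how many rows share each combination of the three columns and keeps the combinations that occur exactly once:

```r
library(dplyr)

# Count rows for every observed combination of the three columns,
# then keep the combinations that identify exactly one individual.
unique_combos <- starwars |>
  count(species, homeworld, hair_color, name = "n_rows") |>
  filter(n_rows == 1)

# Each row here is a combination that singles out one character
unique_combos
```

Any combination listed in `unique_combos` is a candidate for the kind of disclosure described above, even when no single column is identifying on its own.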

Hence, individual columns can contain useful information, but we may not wish to disclose the relationships between them. To help with this, we introduce the shuffling method, which performs column-wise sampling without replacement.

NB: we set a random seed with set.seed() here for reproducibility. We recommend that users avoid this step when using the package in production code, since a known seed would allow the shuffle to be reproduced.

set.seed(101)

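# Build a pipeline that shuffles the three columns used in the filter above
shuffle_pipe <- starwars |>
  add_shuffle(species, homeworld, hair_color)

# Apply the pipeline to the data
new_starwars <- apply_deident(starwars, shuffle_pipe)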

head(new_starwars)
#> # A tibble: 6 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke Sky…    172    77 white      fair       blue            19   male  mascu…
#> 2 C-3PO        167    75 none       gold       yellow         112   none  mascu…
#> 3 R2-D2         96    32 none       white, bl… red             33   none  mascu…
#> 4 Darth Va…    202   136 none       white      yellow          41.9 male  mascu…
#> 5 Leia Org…    150    49 none       light      brown           19   fema… femin…
#> 6 Owen Lars    178   120 brown      light      blue            52   male  mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

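A shuffle therefore preserves column summaries (e.g. modal values and distributions) but breaks the inter-column relationships that might lead to identification. Filtering the shuffled data on species and homeworld, for example, now returns a row in which those values are attached to a different character: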

new_starwars |> 
  dplyr::filter(species == "Human", 
                homeworld == "Coruscant"
                )
#> # A tibble: 1 × 14
#>   name      height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Kit Fisto    196    87 blond      green      black             NA male  mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
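To make "column-wise sampling without replacement" concrete, here is a minimal base R sketch on a small made-up data frame. It is not the package's implementation: `toy` and `permute()` are hypothetical names, and permuting each column separately is an assumption made for illustration. It shows that a permutation leaves each column's counts unchanged while the pairing of values across columns is free to change:

```r
set.seed(1)  # for a reproducible illustration only

# A small made-up data set (not taken from starwars)
toy <- data.frame(
  species   = c("Human", "Human", "Droid", "Wookiee", "Human"),
  homeworld = c("Tatooine", "Coruscant", "Naboo", "Kashyyyk", "Alderaan"),
  stringsAsFactors = FALSE
)

# Sampling without replacement at full length is a permutation
permute <- function(x) x[sample.int(length(x))]

# Assumption: each column is permuted separately
shuffled <- toy
shuffled[] <- lapply(toy, permute)

# Column summaries are unchanged by the permutation
same_species_counts <- identical(table(toy$species), table(shuffled$species))
same_homeworlds     <- identical(sort(toy$homeworld), sort(shuffled$homeworld))
```

Both checks evaluate to `TRUE` for any seed, because a permutation only reorders the values within a column.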

Grouped Shuffling

There will be situations in which inter-variable dependencies are key to our understanding of the data, and we may wish to preserve the column summaries within strata. For this case, shuffling can be performed within a grouped data set as easily as on the whole data:

grouped_shuffle_pipe <- starwars |> 
  add_group(gender) |> 
  add_shuffle(species, homeworld, hair_color) |>
  add_ungroup()

apply_deident(starwars, grouped_shuffle_pipe)
#> # A tibble: 87 × 14
#>    name     height  mass hair_color skin_color eye_color birth_year sex   gender
#>    <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#>  1 Luke Sk…    172    77 <NA>       fair       blue            19   male  mascu…
#>  2 C-3PO       167    75 blond      gold       yellow         112   none  mascu…
#>  3 R2-D2        96    32 auburn, g… white, bl… red             33   none  mascu…
#>  4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
#>  5 Leia Or…    150    49 none       light      brown           19   fema… femin…
#>  6 Owen La…    178   120 none       light      blue            52   male  mascu…
#>  7 Beru Wh…    165    75 white      light      blue            47   fema… femin…
#>  8 R5-D4        97    32 none       white, red red             NA   none  mascu…
#>  9 Biggs D…    183    84 none       light      brown           24   male  mascu…
#> 10 Obi-Wan…    182    77 brown      fair       blue-gray       57   male  mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
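The effect of grouping can be illustrated with a minimal base R sketch on made-up data. This is not the package's implementation; `toy` and `permute()` are hypothetical names. Values are permuted only among rows that share the same group, so the cross-tabulation of the shuffled column against the grouping column is unchanged:

```r
set.seed(1)  # for a reproducible illustration only

# A small made-up data set with a grouping column
toy <- data.frame(
  gender = c("feminine", "feminine", "feminine",
             "masculine", "masculine", "masculine"),
  hair   = c("brown", "blond", "none", "black", "black", "grey"),
  stringsAsFactors = FALSE
)

# Sampling without replacement at full length is a permutation
permute <- function(x) x[sample.int(length(x))]

# ave() applies permute() separately within each level of gender
shuffled <- toy
shuffled$hair <- ave(toy$hair, toy$gender, FUN = permute)

# The gender-by-hair counts are preserved, which an ungrouped
# shuffle would not guarantee
same_within_groups <- identical(
  table(toy$gender, toy$hair),
  table(shuffled$gender, shuffled$hair)
)
```

Here `same_within_groups` is `TRUE` for any seed, since no value ever leaves its group.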