A single variable can be enough to make data personally identifiable, but such variables are usually easy to spot (e.g. names, social security numbers, etc). The less readily considered situation is when a combination of variables renders individuals identifiable.
As an example, consider the starwars data set (borrowed from dplyr):
head(starwars)
#> # A tibble: 6 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
#> 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
#> 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
#> 5 Leia Org… 150 49 brown light brown 19 fema… femin…
#> 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Inspection of the data shows that species can be a unique identifier (e.g. ‘Admiral Ackbar’ is the only ‘Mon Calamari’), so we may consider aggregating species:
starwars |>
  dplyr::filter(species == "Mon Calamari")
#> # A tibble: 1 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Ackbar 180 83 none brown mottle orange 41 male mascul…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
Knowing that someone is ‘Human’ does not have the same effect. However, if we also know that they are from ‘Coruscant’ and have ‘blond’ hair, the combination reduces the data to a single case, even though none of these values is uniquely identifying on its own:
starwars |>
  dplyr::filter(
    species == "Human",
    homeworld == "Coruscant",
    hair_color == "blond"
  )
#> # A tibble: 1 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Finis Va… 170 NA blond fair blue 91 male mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
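Rather than probing combinations one at a time, the same risk can be surveyed across the whole data set. The following is a minimal sketch using dplyr only (it is not part of the de-identification package); the object names combo_counts and unique_combos, and the column name n_rows, are chosen here purely for illustration.

```r
library(dplyr)

# Count the rows sharing each observed combination of the three variables
combo_counts <- starwars |>
  count(species, homeworld, hair_color, name = "n_rows")

# Combinations held by exactly one row single out an individual
unique_combos <- combo_counts |>
  filter(n_rows == 1)

unique_combos
```

Any combination listed in unique_combos, including ‘Human’, ‘Coruscant’, ‘blond’, identifies exactly one row of the original data.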
Hence, individual columns can contain useful information, but we may not wish to disclose the inter-variable correlations. To aid with this, we introduce the shuffling method, which performs column-wise sampling without replacement.
NB: we set a random seed using set.seed here for reproducibility. We recommend users avoid this step when using the package in production code.
set.seed(101)
shuffle_pipe <- starwars |>
  add_shuffle(species, homeworld, hair_color)
new_starwars <- apply_deident(starwars, shuffle_pipe)
head(new_starwars)
#> # A tibble: 6 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sky… 172 77 white fair blue 19 male mascu…
#> 2 C-3PO 167 75 none gold yellow 112 none mascu…
#> 3 R2-D2 96 32 none white, bl… red 33 none mascu…
#> 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
#> 5 Leia Org… 150 49 none light brown 19 fema… femin…
#> 6 Owen Lars 178 120 brown light blue 52 male mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
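A Shuffle hence preserves the column summaries, e.g. modal values and distributions, but breaks the inter-column relationships which might lead to identification. For example, filtering the shuffled data on species and homeworld no longer returns the original individual: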
new_starwars |>
  dplyr::filter(
    species == "Human",
    homeworld == "Coruscant"
  )
#> # A tibble: 1 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Kit Fisto 196 87 blond green black NA male mascu…
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
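The claim that shuffling preserves column summaries can be checked with a small base R sketch. This stands in for the package's shuffle and is not its implementation: it assumes only that a shuffle amounts to a random permutation of a column, which is what sampling without replacement at full length produces. The names hair, hair_shuffled and same_values are illustrative.

```r
set.seed(101)  # for a reproducible illustration only

hair <- dplyr::starwars$hair_color

# sample() with default arguments draws every element once, without
# replacement, i.e. it returns a random permutation of the column
hair_shuffled <- sample(hair)

# The permuted column holds exactly the same values (including NAs),
# so counts, modal values and the overall distribution are unchanged
same_values <- identical(
  sort(hair, na.last = TRUE),
  sort(hair_shuffled, na.last = TRUE)
)
same_values
```

Because each column is permuted independently, the values in any one row no longer belong together, which is what breaks the identifying combinations.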
Grouped Shuffling
There will be situations in which inter-variable dependencies are key to our understanding of the data, and we may wish to preserve the column metrics within strata. For such cases, shuffling can be performed within a grouped data set as easily as on the whole data:
grouped_shuffle_pipe <- starwars |>
  add_group(gender) |>
  add_shuffle(species, homeworld, hair_color) |>
  add_ungroup()
apply_deident(starwars, grouped_shuffle_pipe)
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Luke Sk… 172 77 <NA> fair blue 19 male mascu…
#> 2 C-3PO 167 75 blond gold yellow 112 none mascu…
#> 3 R2-D2 96 32 auburn, g… white, bl… red 33 none mascu…
#> 4 Darth V… 202 136 none white yellow 41.9 male mascu…
#> 5 Leia Or… 150 49 none light brown 19 fema… femin…
#> 6 Owen La… 178 120 none light blue 52 male mascu…
#> 7 Beru Wh… 165 75 white light blue 47 fema… femin…
#> 8 R5-D4 97 32 none white, red red NA none mascu…
#> 9 Biggs D… 183 84 none light brown 24 male mascu…
#> 10 Obi-Wan… 182 77 brown fair blue-gray 57 male mascu…
#> # ℹ 77 more rows
#> # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> # vehicles <list>, starships <list>
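To see what "within strata" means in practice, the grouped shuffle can be approximated with plain dplyr. This is a sketch under the assumption that a grouped shuffle permutes each selected column independently inside each group; it is not the package's own code, and the names grouped_sketch, before and after are illustrative.

```r
library(dplyr)

set.seed(101)  # for a reproducible illustration only

grouped_sketch <- starwars |>
  group_by(gender) |>
  # permute each selected column independently within each gender group
  mutate(across(c(species, homeworld, hair_color), sample)) |>
  ungroup()

# Per-group frequencies of a shuffled column, before and after
before <- count(starwars, gender, hair_color)
after <- count(grouped_sketch, gender, hair_color)

# Check that the within-group distribution has been preserved
isTRUE(all.equal(before, after))
```

Values only move between rows of the same gender group, so the relationship between gender and each shuffled column survives while the relationships among the shuffled columns themselves do not.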