Visualizing and Analyzing Distributions of Nominal Variables

Visualizing Nominal Distributions with `nomiShape`

Data can be measured on different scales, which fundamentally affects how they can be analyzed and visualized (Table 1). Four commonly recognized measurement scales are nominal, ordinal, interval, and ratio. Variables measured on continuous scales can take any value within a range and are often modeled using continuous probability distributions, whereas variables with a finite or countable set of possible values follow discrete distributions.

Among discrete and qualitative variables, nominal variables are unique in that they classify observations into categories without any inherent order, ranking, or numerical meaning. Nominal categories indicate membership only: an observation either belongs to a category or it does not. No information about magnitude, distance, or direction is implied. Common examples of nominal variables include species identities in an ecological community, political attitudes or party affiliation in social surveys, behavioral categories in ethological or psychological studies (e.g. play, aggression, vigilance), word types in a linguistic corpus, or thematic codes in qualitative research.

Although nominal variables lack intrinsic numeric structure, the frequency with which categories occur provides rich information about the organization of the system under study. Count data derived from nominal variables can reveal patterns of dominance, rarity, symmetry, and tail structure—features that are rarely formalized but are often visually apparent. The nomiShape package is designed to make these distributional properties explicit by combining centered visualizations with quantitative indices and model-based comparisons tailored specifically to nominal data.

Table 1. Summary of Nominal Data Characteristics and Visualization and Analysis Tools in the nomiShape package

| Concept | Description |
|---|---|
| Variable Type | Nominal (categorical, unordered) |
| Core Properties | Discrete categories with no intrinsic order or numeric meaning |
| Typical Examples | Species in a biological community; political attitudes (e.g. conservative, liberal, undecided); behavioral categories (e.g. play, aggression, grooming); word types in a text corpus; qualitative themes or codes |
| What Can Be Counted | Frequencies, proportions, dominance, rarity |
| What Cannot Be Computed | Means, medians, variances, distances, or ranks derived from numeric magnitude |
| Common Visualizations | Standard bar plots (unordered or frequency-ranked) |
| Often-Ignored Distributional Structure | Dominance, symmetry, central concentration, tail heaviness |
| Main Analytical Challenge | Distributional “shape” exists but is difficult to formalize for nominal data |
| Visual Tools in `nomiShape` | Centered Bar Plot, Centered Dot Plot, Ranked Bar Plot, Ranked Dot Plot, Pareto Chart |
| Analytical Tools in `nomiShape` | Pielou’s evenness, Dominance index, Central concentration, Tail index |
| Model-Based Shape Comparison | AIC-based comparison of uniform, triangular, normal-like, and exponential (Pareto-like) shapes |
| Design Philosophy | Reveal latent distributional structure visually (via centering and ranking), then formalize it analytically |

Handling nominal (categorical) data is an essential part of data analysis. Almost every data science project involves such variables, and students and practitioners alike should know how to store, summarize, visualize, and manipulate them. Traditional visualizations of nominal variables typically use unordered or frequency-sorted (high to low) bar plots, which emphasize category counts but rarely provide insight into distributional structure. As a result, concepts such as symmetry, skewness, dominance, or tail behaviour, which are routinely discussed for numerical variables, are seldom considered for nominal data. Notable exceptions are Pareto charts and other ranked visualizations. Pareto charts highlight the “vital few” categories in the spirit of the 80:20 rule, and rank-abundance plots in ecology reveal long-tailed distributions in which a few species are common and most are relatively rare. These visualizations show that dominance and rarity patterns can be studied even for nominal variables.

The nomiShape package is designed to further explore the shape of nominal distributions. It offers multiple plotting functions, including classic visualizations such as Pareto charts and ranked bar plots, as well as novel centered bar and dot plots. These functions help users understand frequency structures, dominance patterns, and distributional characteristics of nominal variables, facilitating more nuanced analysis of categorical data.

This vignette demonstrates how to visualize and analyze the distributions of nominal variables using the functions provided by the nomiShape package. It covers ranked and centered bar and dot plots, Zipf rank-frequency plots, Pareto charts, comparisons with theoretical shapes, shape indices, and rarefaction curves.

Plotting Shapes of Nominal Distributions

Ranked Bar Plots

Ranked bar plots order categories from the most frequent to the least frequent, providing a clear view of category dominance and distribution.

# Load the package before running the examples
library(nomiShape)

# Example usage of ranked_barplot
ranked_barplot(categories, "animal")

# Example usage of ranked_barplot
ranked_barplot(categories2, "animal")

# Example usage of ranked_barplot
ranked_barplot(categories3, "animal")

# Example usage of ranked_barplot
ranked_barplot(categories4, "animal")

Ranked Dot Plots

Ranked dot plots display categories as points ordered from the most frequent to the least frequent, allowing for easy comparison of category frequencies.

# Example usage of ranked_dotplot
ranked_dotplot(categories, "animal", connect = TRUE)

# Example usage of ranked_dotplot
ranked_dotplot(categories2, "animal", connect = TRUE, shade = TRUE)

# Example usage of ranked_dotplot
ranked_dotplot(categories3, "animal", connect = FALSE, shade = TRUE)

# Example usage of ranked_dotplot
ranked_dotplot(categories4, "animal", connect = TRUE)

Zipf Ranked Plots

The zipf_rank_plot function generates a rank-frequency plot for a nominal variable. It compares the observed frequency of each category with the frequency expected under Zipf’s Law, under which frequency is inversely proportional to rank. Zipf-like patterns are often observed in natural language (a few words are very common and most are rare), in city populations, and in other systems where a few categories dominate while most are rare. The function can also restrict the plot to the top ranks or to the categories that make up a given proportion of cumulative observations, and can optionally draw the plot on a log-log scale.

zipf_rank_plot(alice, "word")

zipf_rank_plot(kafka, "word", loglog = TRUE)
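The expected frequencies under Zipf’s Law can be sketched in base R. The helper below is hypothetical (it is not part of nomiShape) and assumes an exponent of 1 and expected counts rescaled to sum to the observed total; the package may normalize the reference curve differently.

```r
# Hypothetical helper, not part of nomiShape: expected counts under Zipf's Law.
# Assumption: frequency is proportional to 1 / rank^s, rescaled to the observed total.
zipf_expected <- function(counts, s = 1) {
  ranked <- sort(counts, decreasing = TRUE)   # rank 1 = most frequent category
  ranks <- seq_along(ranked)
  weights <- 1 / ranks^s                      # Zipf weights: 1, 1/2, 1/3, ...
  data.frame(
    rank = ranks,
    observed = as.vector(ranked),
    expected = sum(ranked) * weights / sum(weights)
  )
}

# Toy word counts (illustrative only, not taken from the alice or kafka data)
word_counts <- c(the = 60, of = 30, and = 20)
zipf_expected(word_counts)
```

The toy counts were chosen to follow Zipf’s Law exactly, so the observed and expected columns coincide; real corpora usually deviate, most visibly at the highest and lowest ranks.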

Pareto Charts

Pareto charts combine bar plots and line graphs to highlight the most significant categories in a nominal variable. They help identify the “vital few” categories that contribute most to the overall distribution.

# Example usage of pareto
pareto(categories3, "animal")
#>      Category Freq cumulative cumulative_percentage
#> 1  Sea sponge  110        110                  44.0
#> 2    Starfish   75        185                  74.0
#> 3     Octopus   20        205                  82.0
#> 4        Crab   12        217                  86.8
#> 5    Squirrel    9        226                  90.4
#> 6     Copepod    7        233                  93.2
#> 7       Snail    6        239                  95.6
#> 8  Pufferfish    5        244                  97.6
#> 9       Whale    3        247                  98.8
#> 10    Lobster    2        249                  99.6
#> 11    Sea god    1        250                 100.0
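The cumulative columns of this table can be reproduced with a few lines of base R, which may help when reading the chart. The sketch below re-enters the counts printed above by hand; it illustrates the arithmetic only and is not the package’s implementation.

```r
# Counts copied from the pareto() output above
counts <- c("Sea sponge" = 110, Starfish = 75, Octopus = 20, Crab = 12,
            Squirrel = 9, Copepod = 7, Snail = 6, Pufferfish = 5,
            Whale = 3, Lobster = 2, "Sea god" = 1)

ranked <- sort(counts, decreasing = TRUE)      # most frequent category first
pareto_table <- data.frame(
  Category = names(ranked),
  Freq = as.vector(ranked),
  cumulative = cumsum(as.vector(ranked))       # running total of observations
)
# Running total expressed as a percentage of all observations
pareto_table$cumulative_percentage <-
  round(100 * pareto_table$cumulative / sum(ranked), 1)

pareto_table
```

In this example the first three categories already account for 82% of the observations, which is the kind of “vital few” pattern the chart is meant to expose.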

Centered Bar Plots

Centered bar plots arrange categories symmetrically around the center, with the most frequent categories in the middle and less frequent ones towards the edges. This layout turns the frequency profile into a single peaked silhouette, so that peakedness and tail length can be read directly from the plot.

# Example usage of centered_barplot
centered_barplot(categories, "animal")

# Example usage of centered_barplot
centered_barplot(categories2, "animal", scale = "percent")

# Example usage of centered_barplot
centered_barplot(categories3, "animal")

# Example usage of centered_barplot
centered_barplot(categories4, "animal")
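One way to obtain such a centered ordering is to rank the categories and then place them alternately to the right and left of the most frequent one. The helper below is a hypothetical sketch, not the package’s code; in particular, which side receives the second-ranked category is an assumption.

```r
# Hypothetical helper, not part of nomiShape: reorder counts so the largest
# sits in the middle and frequencies fall off towards both edges.
center_order <- function(counts) {
  ranked <- sort(counts, decreasing = TRUE)
  idx <- seq_along(ranked)
  left <- rev(ranked[idx %% 2 == 0])   # ranks 2, 4, 6, ... reversed, placed on the left
  right <- ranked[idx %% 2 == 1]       # ranks 1, 3, 5, ... placed from the center rightwards
  c(left, right)
}

# Counts of categories3 as printed in the Pareto table
counts <- c("Sea sponge" = 110, Starfish = 75, Octopus = 20, Crab = 12,
            Squirrel = 9, Copepod = 7, Snail = 6, Pufferfish = 5,
            Whale = 3, Lobster = 2, "Sea god" = 1)

centered <- center_order(counts)
centered
```

Passing the reordered vector to base R’s barplot() would give a rough equivalent of the centered layout, with "Sea sponge" in the sixth of eleven positions.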

Centered Dot Plots

Centered dot plots display categories as points arranged symmetrically around the center, with the most frequent categories in the middle. Optionally, points can be connected with lines to highlight trends.

# Example usage of centered_dotplot
centered_dotplot(categories, "animal", connect = TRUE, shade = TRUE)

# Example usage of centered_dotplot
centered_dotplot(categories2, "animal", connect = TRUE, shade = TRUE)

# Example usage of centered_dotplot
centered_dotplot(categories3, "animal", connect = TRUE, shade = TRUE)

# Example usage of centered_dotplot
centered_dotplot(categories4, "animal", connect = TRUE, shade = TRUE)

Detecting Theoretical Distributions in Nominal Variables

Visualizing Theoretical Shapes

The shape_comp_plot function allows users to visualize common theoretical distribution shapes (uniform, triangular, normal-like, and exponential/Pareto-like) for nominal variables in comparison with the observed distribution. This helps in understanding how different distributions appear when plotted.

# Example usage of shape_comp_plot
shape_comp_plot(categories, "animal")

# Example usage of shape_comp_plot
shape_comp_plot(categories2, "animal")

# Example usage of shape_comp_plot
shape_comp_plot(categories3, "animal")

# Example usage of shape_comp_plot
shape_comp_plot(starwars, "species")

AIC Comparison of Theoretical Shapes

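The shape_aic function computes the Akaike Information Criterion (AIC) for different theoretical shape models fitted to the distribution of a nominal variable, so that users can compare quantitatively how well each shape fits the observed data. A lower AIC indicates a better-supported shape. DeltaAIC is the difference between each shape’s AIC and the lowest AIC in the table, so the best-fitting shape has DeltaAIC = 0.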

# Example usage of shape_aic
shape_aic(categories, "animal")
#>         Shape      AIC  DeltaAIC
#> 1     Uniform 1198.948   0.00000
#> 2  Triangular 1248.482  49.53447
#> 3      Normal 1446.235 247.28709
#> 4 Exponential 1992.144 793.19640

# Example usage of shape_aic
shape_aic(categories2, "animal")
#>         Shape      AIC  DeltaAIC
#> 1  Triangular 1139.059   0.00000
#> 2     Uniform 1198.948  59.88857
#> 3      Normal 1204.053  64.99384
#> 4 Exponential 1551.144 412.08497

# Example usage of shape_aic
shape_aic(categories3, "animal")
#>         Shape       AIC  DeltaAIC
#> 1 Exponential  825.9440   0.00000
#> 2      Normal  909.8050  83.86093
#> 3  Triangular  993.9005 167.95651
#> 4     Uniform 1198.9476 373.00360

# Example usage of shape_aic
shape_aic(categories4, "animal")
#>         Shape       AIC  DeltaAIC
#> 1      Normal  998.1686   0.00000
#> 2  Triangular 1046.7675  48.59892
#> 3 Exponential 1181.5440 183.37543
#> 4     Uniform 1198.9476 200.77903
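The Uniform rows can be reproduced by hand, which may help clarify what the AIC values measure. The sketch below assumes a multinomial log-likelihood and zero free parameters for the uniform shape; this is inferred from the printed output, not taken from the package source. The counts are those of categories3 as printed in the Pareto table, and `shape_loglik` is a hypothetical helper.

```r
# Counts of categories3 as printed in the Pareto table (250 observations, 11 categories)
counts <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)

# Hypothetical helper: log-likelihood of the counts given category probabilities.
# The multinomial coefficient is omitted because it is the same for every shape.
shape_loglik <- function(counts, probs) sum(counts * log(probs))

k <- length(counts)
uniform_probs <- rep(1 / k, k)   # every category equally likely
n_params <- 0                    # assumption: the uniform shape has no free parameters

aic_uniform <- -2 * shape_loglik(counts, uniform_probs) + 2 * n_params
aic_uniform
```

Under these assumptions the result is 2 × 250 × log(11), roughly 1198.9476, which agrees with the Uniform row above. Because this quantity depends only on the number of observations and the number of categories, it is the same for all four example datasets, as the tables show.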

Measuring Shapes of Nominal Distributions

Evenness

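Pielou’s evenness quantifies how evenly observations are distributed across the categories of a nominal variable. It ranges from 0 to 1, with values near 1 indicating that all categories are almost equally frequent.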

# Example usage of pielou_evenness
pielou_evenness(categories, "animal")
#> [1] 0.9981314

# Example usage of pielou_evenness
pielou_evenness(categories2, "animal")
#> [1] 0.9462875

# Example usage of pielou_evenness
pielou_evenness(categories3, "animal")
#> [1] 0.6553931
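Pielou’s evenness is conventionally defined as the Shannon entropy of the category proportions divided by its maximum, the logarithm of the number of categories. The hypothetical helper below applies that standard definition to the categories3 counts printed in the Pareto table; it is a sketch for illustration rather than the package’s code.

```r
# Hypothetical helper, not part of nomiShape: Pielou's J = H / log(K)
pielou_sketch <- function(counts) {
  p <- counts[counts > 0] / sum(counts)   # proportions of the observed categories
  shannon <- -sum(p * log(p))             # Shannon entropy H
  shannon / log(length(p))                # divide by the maximum possible entropy
}

# Counts of categories3 as printed in the Pareto table
counts <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)
pielou_sketch(counts)
```

For these counts the sketch agrees with the value reported for categories3 above (about 0.655). Note that the ratio is undefined when only one category is present, because log(1) is zero.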

Dominance Index

The dominance index quantifies the degree to which a few categories dominate the distribution of a nominal variable.

# Example usage of dominance_index
dominance_index(categories, "animal")
#> [1] 0.091712

# Example usage of dominance_index
dominance_index(categories2, "animal")
#> [1] 0.113056

# Example usage of dominance_index
dominance_index(categories3, "animal")
#> [1] 0.295584
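The value reported for categories3 coincides with Simpson’s dominance, the sum of squared category proportions. The sketch below makes that calculation explicit; that the package uses exactly this formula is an assumption based on the matching output, and `dominance_sketch` is a hypothetical name.

```r
# Hypothetical helper, not part of nomiShape: Simpson's dominance, sum of p_i^2
dominance_sketch <- function(counts) {
  p <- counts / sum(counts)   # category proportions
  sum(p^2)                    # chance that two random draws (with replacement) share a category
}

# Counts of categories3 as printed in the Pareto table
counts <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)
dominance_sketch(counts)
```

For K equally frequent categories this quantity equals 1/K (about 0.091 for 11 categories), which is close to the value reported for the near-uniform categories dataset; it approaches 1 as a single category takes over.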

Central Concentration

The central concentration quantifies how concentrated the distribution of a nominal variable is around its most frequent categories.

# Example usage of central_concentration
central_concentration(categories, "animal")
#> [1] 0.3

# Example usage of central_concentration
central_concentration(categories2, "animal")
#> [1] 0.448

# Example usage of central_concentration
central_concentration(categories3, "animal")
#> [1] 0.82
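The value of 0.82 for categories3 equals the combined share of its three most frequent categories (205 of 250 observations), which correspond to the three central bars of a centered plot. The sketch below computes that share; both the helper name and the choice of three categories are assumptions inferred from the output, not documented behaviour.

```r
# Hypothetical helper, not part of nomiShape: share of the k most frequent categories.
# Assumption: k = 3, chosen because it reproduces the value printed for categories3.
top_share_sketch <- function(counts, k = 3) {
  ranked <- sort(counts, decreasing = TRUE)
  sum(head(ranked, k)) / sum(counts)
}

# Counts of categories3 as printed in the Pareto table
counts <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)
top_share_sketch(counts)
```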

Tail Index

The tail index quantifies the proportion of categories contributing to the lower part of the distribution, useful for identifying long-tail structures in nominal data. By default, it uses a threshold of 0.8, following the Pareto principle, but this can be adjusted as needed.

# Example usage of tail_index
tail_index(categories, "animal")
#> [1] 0.1818182

# Example usage of tail_index
tail_index(categories2, "animal", threshold = 0.9)
#> [1] 0.1818182

# Example usage of tail_index
tail_index(categories3, "animal", threshold = 0.75)
#> [1] 0.7272727
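The printed values are consistent with the following reading: rank the categories, find how many are needed for the cumulative share to reach the threshold, and report the remaining categories as a fraction of all categories. The sketch below implements that reading; the helper is hypothetical, and the treatment of a cumulative share exactly equal to the threshold is an assumption.

```r
# Hypothetical helper, not part of nomiShape: fraction of categories in the tail
tail_index_sketch <- function(counts, threshold = 0.8) {
  cum_share <- cumsum(sort(counts, decreasing = TRUE)) / sum(counts)
  n_head <- which(cum_share >= threshold)[1]     # categories needed to reach the threshold
  (length(counts) - n_head) / length(counts)     # everything after them forms the tail
}

# Counts of categories3 as printed in the Pareto table
counts <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)
tail_index_sketch(counts, threshold = 0.75)
```

With a threshold of 0.75, the first three categories (82% of observations) form the head and the remaining 8 of 11 categories form the tail, giving 8/11, about 0.727, as in the categories3 example above.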

Rarefaction of Nominal Distributions

Rarefaction curves are used to evaluate how the number of observed categories grows as sampling effort increases. Originally developed in ecology to estimate species richness, the same logic can be applied to any nominal variable: as more observations are collected, additional categories may appear until the diversity of the system is fully represented.

In the context of nominal data, rarefaction helps answer questions such as: Have most categories already been observed? Is the sample size sufficient to capture category diversity? How quickly do new categories emerge as observations accumulate?

The rare_plot() function implements this idea using Monte Carlo permutations. The order of observations is randomly permuted multiple times (reps), and for each permutation the cumulative number of unique categories is calculated as sampling effort increases. The resulting rarefaction curve represents the expected number of categories discovered at each sampling effort.

In the resulting plot, the solid line represents the expected number of categories and the grey shaded band represents the approximate confidence interval (±2 standard errors) across permutations. The x-axis shows sampling effort (number of observations) and the y-axis shows the expected number of distinct categories.

When the curve begins to level off, it indicates that additional observations are unlikely to reveal many new categories, suggesting that the sample is approaching saturation.
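The permutation procedure described above can be sketched in base R as follows. `rarefy_sketch` is a hypothetical helper written for illustration, not the implementation behind rare_plot(); it returns the mean and standard deviation of the accumulation curves and leaves the construction of the shaded band aside.

```r
# Hypothetical helper, not part of nomiShape: Monte Carlo rarefaction curve
rarefy_sketch <- function(x, reps = 100, max_effort = length(x)) {
  curves <- replicate(reps, {
    shuffled <- sample(x)                                 # one random ordering of the observations
    cumsum(!duplicated(shuffled))[seq_len(max_effort)]    # distinct categories seen so far
  })
  curves <- matrix(curves, nrow = max_effort)             # rows = effort, columns = permutations
  data.frame(
    effort = seq_len(max_effort),
    expected = rowMeans(curves),                          # mean number of categories per effort
    sd = apply(curves, 1, sd)                             # spread across permutations (NA if reps = 1)
  )
}

# Toy data (illustrative only): four categories with unequal frequencies
set.seed(42)
obs <- rep(c("Sea sponge", "Starfish", "Octopus", "Crab"), times = c(110, 75, 20, 12))
curve <- rarefy_sketch(obs, reps = 50, max_effort = 40)
head(curve)
```

The expected curve always starts at one category for the first observation and can only rise or stay flat, because each individual permutation curve is non-decreasing.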

Examples

rare_plot(categories3, "animal")

rare_plot(starwars, "eye_color")

Rarefaction from a single ordering. The curve can also be computed without averaging over permutations by setting reps = 1. In this case, it reflects the accumulation of categories for one random ordering of the data.

rare_plot(categories3, "animal", reps = 1)

When nominal categories are relatively evenly distributed, the rarefaction curve stabilizes early. This means that a relatively small number of observations and permutations may already be sufficient to determine whether sampling has captured most of the category diversity.

rare_plot(categories, "animal", reps = 100, max_effort = 40)

For very large datasets, computing the full rarefaction curve may be unnecessary. The max_effort argument allows the user to limit the maximum sampling effort, which can substantially reduce computation time while still revealing the shape of the accumulation curve.

rare_plot(ufo, "shape", reps = 200, max_effort = 250)

This latter option is particularly useful when exploring large datasets or when the goal is simply to estimate how quickly the category diversity approaches saturation.

The rarefaction approach complements the structural perspective developed in nomiShape. Whereas the other functions describe the geometry of nominal distributions once observed, rarefaction focuses on the sampling process that generates them, allowing users to evaluate how category diversity emerges and whether the observed structure likely reflects the full diversity of the system.
