nomiShape

Data can be measured on different scales, which fundamentally affects how they can be analyzed and visualized (Table 1). Four commonly recognized measurement scales are nominal, ordinal, interval, and ratio. Variables measured on continuous scales can take any value within a range and are often modeled using continuous probability distributions, whereas variables with a finite set of possible values follow discrete distributions.
Among discrete and qualitative variables, nominal variables are unique in that they classify observations into categories without any inherent order, ranking, or numerical meaning. Nominal categories indicate membership only: an observation either belongs to a category or it does not. No information about magnitude, distance, or direction is implied. Common examples of nominal variables include species identities in an ecological community, political attitudes or party affiliation in social surveys, behavioral categories in ethological or psychological studies (e.g. play, aggression, vigilance), word types in a linguistic corpus, or thematic codes in qualitative research.
Although nominal variables lack intrinsic numeric structure, the
frequency with which categories occur provides rich
information about the organization of the system under study. Count data
derived from nominal variables can reveal patterns of dominance, rarity,
symmetry, and tail structure—features that are rarely formalized but are
often visually apparent. The nomiShape package is designed
to make these distributional properties explicit by combining centered
visualizations with quantitative indices and model-based comparisons
tailored specifically to nominal data.
Table 1. Summary of Nominal Data Characteristics and
Visualization and Analysis Tools in the nomiShape
package
| Concept | Description |
|---|---|
| Variable Type | Nominal (categorical, unordered) |
| Core Properties | Discrete categories with no intrinsic order or numeric meaning |
| Typical Examples | Species in a biological community; political attitudes (e.g. conservative, liberal, undecided); behavioral categories (e.g. play, aggression, grooming); word types in a text corpus; qualitative themes or codes |
| What Can Be Counted | Frequencies, proportions, dominance, rarity |
| What Cannot Be Computed | Means, medians, variances, distances, or ranks derived from numeric magnitude |
| Common Visualizations | Standard bar plots (unordered or frequency-ranked) |
| Often-Ignored Distributional Structure | Dominance, symmetry, central concentration, tail heaviness |
| Main Analytical Challenge | Distributional “shape” exists but is difficult to formalize for nominal data |
| Visual Tools in nomiShape | Centered Bar Plot, Centered Dot Plot, Ranked Bar Plot, Ranked Dot Plot, Pareto Chart |
| Analytical Tools in nomiShape | Pielou’s evenness, Dominance index, Central concentration, Tail index |
| Model-Based Shape Comparison | AIC-based comparison of uniform, triangular, normal-like, and exponential (Pareto-like) shapes |
| Design Philosophy | Reveal latent distributional structure visually (via centering and ranking), then formalize it analytically |
Handling nominal (categorical) data is an essential part of data analysis. Almost every data science project involves working with such variables, and students and practitioners alike should know how to store, summarize, visualize, and manipulate them. Traditional visualizations of nominal variables often use unordered bar plots or frequency-sorted bar plots (from high to low), which emphasize category counts but rarely provide insight into distributional structure. As a result, concepts like symmetry, skewness, dominance, or tail behaviour—commonly discussed for numerical variables—are seldom considered for nominal data. However, exceptions include Pareto charts and other ranked visualizations, which can highlight the “vital few” categories following the 80:20 rule or reveal long-tailed distributions, such as rank-abundance plots in ecology where typically most species are relatively rare and a few are common. These visualizations allow insights into categorical dominance and rarity patterns even for nominal variables.
The nomiShape package is designed to further explore the
shape of nominal distributions. It offers multiple plotting functions,
including classic visualizations such as Pareto charts and ranked bar
plots, as well as novel centered bar and dot plots. These functions help
users understand frequency structures, dominance patterns, and
distributional characteristics of nominal variables, facilitating more
nuanced analysis of categorical data.
This vignette demonstrates how to visualize and analyze the
distributions of nominal variables using various plotting functions
provided by the nomiShape package. We will explore centered
bar plots, ranked bar plots, centered dot plots, and ranked dot
plots.
Ranked bar plots order categories from the most frequent to the least frequent, providing a clear view of category dominance and distribution.
Ranked dot plots display categories as points ordered from the most frequent to the least frequent, allowing for easy comparison of category frequencies.
# Example usage of ranked_dotplot
ranked_dotplot(categories2, "animal", connect = TRUE, shade = TRUE)

# Example usage of ranked_dotplot
ranked_dotplot(categories3, "animal", connect = FALSE, shade = TRUE)

A further function generates a rank-frequency plot for a nominal variable. It compares the observed frequency of each category with the frequency expected under Zipf’s Law, where frequency is inversely proportional to rank. Zipf’s Law is often observed in natural language (common words appear far more frequently than rare words), in city populations, and in other systems where a few categories dominate while most are rare. The function can also restrict the plot to the top ranks or to the top proportion of cumulative observations, and can optionally display the plot on a log-log scale.
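The Zipf comparison described above can be sketched in base R. This is only an illustration, not the package's implementation: it assumes a Zipf exponent of 1 and rescales the expected frequencies to the observed total, and the counts are copied from the Pareto chart output for categories3 in this vignette.

```r
# Counts copied from the Pareto chart output for categories3
counts <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)

# Rank categories from most to least frequent
observed <- sort(counts, decreasing = TRUE)
ranks <- seq_along(observed)

# Zipf's Law with exponent 1 (an assumption): expected frequency is
# proportional to 1 / rank, rescaled so the expected counts sum to the total
zipf_weights <- 1 / ranks
expected <- sum(observed) * zipf_weights / sum(zipf_weights)

zipf_table <- data.frame(
  rank = ranks,
  observed = observed,
  expected = round(expected, 1)
)
print(zipf_table)
```

Plotting `observed` and `expected` against `ranks` with both axes on a log scale would turn the Zipf expectation into a straight line, which is why a log-log display is a common choice for this kind of comparison.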
Pareto charts combine bar plots and line graphs to highlight the most significant categories in a nominal variable. They help identify the “vital few” categories that contribute most to the overall distribution.
# Example usage of pareto
pareto(categories3, "animal")
#>      Category Freq cumulative cumulative_percentage
#> 1  Sea sponge  110        110                  44.0
#> 2    Starfish   75        185                  74.0
#> 3     Octopus   20        205                  82.0
#> 4        Crab   12        217                  86.8
#> 5    Squirrel    9        226                  90.4
#> 6     Copepod    7        233                  93.2
#> 7       Snail    6        239                  95.6
#> 8  Pufferfish    5        244                  97.6
#> 9       Whale    3        247                  98.8
#> 10    Lobster    2        249                  99.6
#> 11    Sea god    1        250                 100.0

Centered bar plots arrange categories symmetrically around the center, with the most frequent categories in the middle and less frequent ones towards the edges. This layout helps to visualize the distribution shape effectively.
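One way to produce such a centered arrangement is to rank the categories and then place them alternately to the left and right of the most frequent one. The helper below, `center_order_sketch()`, is hypothetical and only illustrates the idea. The package may order ties or sides differently.

```r
# Five of the category counts from the Pareto output, as a named vector
counts <- c("Sea sponge" = 110, Starfish = 75, Octopus = 20, Crab = 12, Squirrel = 9)

# Hypothetical helper: most frequent category in the middle,
# even ranks to the left, odd ranks (3, 5, ...) to the right
center_order_sketch <- function(x) {
  ranked <- sort(x, decreasing = TRUE)
  idx <- seq_along(ranked)
  left <- ranked[idx %% 2 == 0]
  right <- ranked[idx %% 2 == 1 & idx > 1]
  c(rev(left), ranked[1], right)
}

centered <- center_order_sketch(counts)
print(centered)
# Order of categories: Crab, Starfish, Sea sponge, Octopus, Squirrel

# A plain base R bar plot of the centered order
barplot(centered, las = 2)
```

With this ordering, a peaked distribution looks like a hill and an even distribution looks like a plateau, which is what makes shape terms such as symmetry and tail heaviness visually meaningful for nominal data.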
Centered dot plots display categories as points arranged symmetrically around the center, with the most frequent categories in the middle. Optionally, points can be connected with lines to highlight trends.
# Example usage of centered_dotplot
centered_dotplot(categories, "animal", connect = TRUE, shade = TRUE)

# Example usage of centered_dotplot
centered_dotplot(categories2, "animal", connect = TRUE, shade = TRUE)

# Example usage of centered_dotplot
centered_dotplot(categories3, "animal", connect = TRUE, shade = TRUE)

# Example usage of centered_dotplot
centered_dotplot(categories4, "animal", connect = TRUE, shade = TRUE)

The shape_comp_plot function allows users to visualize
common theoretical distribution shapes (uniform, triangular,
normal-like, and exponential/Pareto-like) for nominal variables in
comparison with the observed distribution. This helps in understanding
how different distributions appear when plotted.
The shape_aic function computes the Akaike Information
Criterion (AIC) for different theoretical shape models fitted to the
distribution of a nominal variable. This allows users to quantitatively
compare how well each model fits the observed data.
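To make the AIC comparison concrete, the sketch below computes AIC = 2k - 2 log L for fixed shape probabilities under a categorical likelihood. The helper names, the geometric shape, and the choice of k = 0 free parameters are assumptions for illustration. The package's actual shape definitions and parameter counts are not described here. The counts are copied from the Pareto output for categories3.

```r
# Counts copied from the Pareto chart output for categories3
counts <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)
K <- length(counts)

# Categorical log-likelihood of the counts under given shape probabilities
shape_loglik <- function(counts, probs) sum(counts * log(probs))

# AIC = 2k - 2 logLik, where k is the number of free parameters
aic <- function(loglik, k) 2 * k - 2 * loglik

# Uniform shape: every category has probability 1 / K, no free parameters
uniform_probs <- rep(1 / K, K)
aic_uniform <- aic(shape_loglik(counts, uniform_probs), k = 0)
round(aic_uniform, 3)
# 1198.948

# Hypothetical decaying shape with a fixed halving rate (illustration only)
geo_weights <- 0.5^(seq_len(K) - 1)
geo_probs <- geo_weights / sum(geo_weights)
aic_geo <- aic(shape_loglik(sort(counts, decreasing = TRUE), geo_probs), k = 0)

comparison <- data.frame(Shape = c("Uniform", "Geometric (hypothetical)"),
                         AIC = c(aic_uniform, aic_geo))
comparison$DeltaAIC <- comparison$AIC - min(comparison$AIC)
print(comparison[order(comparison$AIC), ])
```

For 250 observations in 11 categories the uniform value is 500 × ln(11) ≈ 1198.948, which coincides with the Uniform AIC reported in the shape_aic examples in this vignette. That agreement suggests, though the source does not state it, that the uniform model is scored in this way. DeltaAIC is simply each AIC minus the smallest AIC, so the best-fitting shape has a DeltaAIC of 0.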
# Example usage of shape_aic
shape_aic(categories, "animal")
#>         Shape      AIC  DeltaAIC
#> 1     Uniform 1198.948   0.00000
#> 2  Triangular 1248.482  49.53447
#> 3      Normal 1446.235 247.28709
#> 4 Exponential 1992.144 793.19640

# Example usage of shape_aic
shape_aic(categories2, "animal")
#>         Shape      AIC  DeltaAIC
#> 1  Triangular 1139.059   0.00000
#> 2     Uniform 1198.948  59.88857
#> 3      Normal 1204.053  64.99384
#> 4 Exponential 1551.144 412.08497

Pielou’s evenness quantifies how evenly individuals are distributed across categories in a nominal variable.
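Pielou's evenness is conventionally defined as J = H / ln(S), where H is the Shannon entropy of the category proportions and S is the number of observed categories. The base R sketch below follows that textbook definition. The function name `pielou_sketch()` is hypothetical, and dropping zero counts is an assumption rather than documented package behavior.

```r
# Hypothetical helper implementing the textbook definition J = H / ln(S)
pielou_sketch <- function(counts) {
  counts <- counts[counts > 0]         # assumption: ignore empty categories
  p <- counts / sum(counts)            # category proportions
  shannon <- -sum(p * log(p))          # Shannon entropy H
  shannon / log(length(counts))        # divide by the maximum entropy ln(S)
}

# Counts copied from the Pareto chart output for categories3
skewed <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)
even <- rep(10, 5)

j_skewed <- pielou_sketch(skewed)
j_even <- pielou_sketch(even)
print(c(skewed = j_skewed, even = j_even))
```

A perfectly even distribution gives J = 1, and values closer to 0 indicate that a few categories hold most of the observations. With a single category, ln(S) is 0 and the ratio is undefined.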
The dominance index quantifies the degree to which a few categories dominate the distribution of a nominal variable.
The central concentration quantifies how concentrated the distribution of a nominal variable is around its most frequent categories.
The tail index quantifies the proportion of categories contributing to the lower part of the distribution, useful for identifying long-tail structures in nominal data. By default, it uses a threshold of 0.8, following the Pareto principle, but this can be adjusted as needed.
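One possible reading of this description is sketched below: rank the categories, find the smallest set of top categories whose cumulative share reaches the threshold, and report the proportion of categories left in the tail. The function `tail_share_sketch()` is hypothetical and this definition is an assumption. The package's tail index may be computed differently.

```r
# Hypothetical helper: proportion of categories lying beyond the
# top categories that together reach the cumulative threshold
tail_share_sketch <- function(counts, threshold = 0.8) {
  ranked <- sort(counts, decreasing = TRUE)
  cum_prop <- cumsum(ranked) / sum(ranked)
  n_head <- which(cum_prop >= threshold)[1]   # categories needed to reach the threshold
  (length(ranked) - n_head) / length(ranked)
}

# Counts copied from the Pareto chart output for categories3
counts <- c(110, 75, 20, 12, 9, 7, 6, 5, 3, 2, 1)

tail_default <- tail_share_sketch(counts)                  # threshold 0.8
tail_half <- tail_share_sketch(counts, threshold = 0.5)    # adjusted threshold
print(c(tail_default = tail_default, tail_half = tail_half))
```

In the Pareto output, the first three categories reach 82% of the observations, so under this reading 8 of the 11 categories fall in the tail. Lowering the threshold moves more categories into the tail.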
Rarefaction curves are used to evaluate how the number of observed categories grows as sampling effort increases. Originally developed in ecology to estimate species richness, the same logic can be applied to any nominal variable: as more observations are collected, additional categories may appear until the diversity of the system is fully represented.
In the context of nominal data, rarefaction helps answer questions such as: Have most categories already been observed? Is the sample size sufficient to capture category diversity? How quickly do new categories emerge as observations accumulate?
The rare_plot() function implements this idea using
Monte Carlo permutations. The order of observations is randomly permuted
multiple times (reps), and for each permutation the
cumulative number of unique categories is calculated as sampling effort
increases. The resulting rarefaction curve represents the expected
number of categories discovered at each sampling effort.
In the resulting plot, the solid line represents the expected number of categories; the grey shaded band represents the approximate confidence interval (±2 standard errors) across permutations; the x-axis represents sampling effort (number of observations); and the y-axis represents the expected number of distinct categories.
When the curve begins to level off, it indicates that additional observations are unlikely to reveal many new categories, suggesting that the sample is approaching saturation.
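The permutation procedure described above can be sketched in base R as follows. This is not the code behind rare_plot(): the function `rarefy_sketch()` and the toy data are hypothetical, and the standard error is assumed here to be the standard deviation across permutations divided by the square root of the number of permutations.

```r
set.seed(1)  # reproducible permutations

# Toy nominal variable: one entry per observation (hypothetical data)
obs <- rep(c("Sea sponge", "Starfish", "Octopus", "Crab", "Squirrel"),
           times = c(110, 75, 20, 12, 9))

# Hypothetical helper: permutation-based category accumulation curve
rarefy_sketch <- function(x, reps = 100) {
  # One row per permutation, one column per sampling effort
  acc <- t(replicate(reps, {
    shuffled <- sample(x)               # random order of observations
    cumsum(!duplicated(shuffled))       # running count of distinct categories
  }))
  mean_cat <- colMeans(acc)
  se_cat <- apply(acc, 2, sd) / sqrt(reps)   # assumed definition of the SE
  data.frame(effort = seq_along(x),
             expected = mean_cat,
             lower = mean_cat - 2 * se_cat,
             upper = mean_cat + 2 * se_cat)
}

curve <- rarefy_sketch(obs, reps = 50)
print(head(curve))
print(tail(curve, 1))
```

The first row always has an expected value of 1, because a single observation reveals exactly one category, and the last row equals the total number of categories in the data. With `reps = 1` the band is undefined in this sketch, because a standard deviation cannot be computed from a single permutation.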
Examples
Rarefaction without permutations. The rarefaction curve can also be computed without permutations by setting reps = 1. In this case, the curve reflects the accumulation of categories for a single random ordering of the data.
When nominal categories are relatively evenly distributed, the rarefaction curve stabilizes early. This means that a relatively small number of observations and permutations may already be sufficient to determine whether sampling has captured most of the category diversity.
For very large datasets, computing the full rarefaction curve may be unnecessary. The max_effort argument allows the user to limit the maximum sampling effort, which can substantially reduce computation time while still revealing the shape of the accumulation curve.
This latter option is particularly useful when exploring large datasets or when the goal is simply to estimate how quickly the category diversity approaches saturation.
The rarefaction approach complements the structural perspective developed in nomiShape. Whereas the other functions describe the geometry of nominal distributions once observed, rarefaction focuses on the sampling process that generates them, allowing users to evaluate how category diversity emerges and whether the observed structure likely reflects the full diversity of the system.