MorphSim is an R package for simulating discrete morphological character data along phylogenetic trees. It is designed to integrate with existing R packages such as TreeSim and FossilSim, enabling simulation of datasets that include both extant and extinct taxa. This vignette walks through the main features of the package.
To simulate morphological data you first need a phylogenetic tree. MorphSim accepts either a tree with branch lengths in evolutionary distance or a time-scaled tree. Here we simulate a birth-death tree using TreeSim.
The core function is sim.morpho. At minimum you need a
tree, the number of character states (k), the total number
of traits (trait.num), and if using a time tree, the branch
rates (br.rates).
morpho_data <- sim.morpho(
time.tree = tree,
k = 2,
trait.num = 10,
br.rates = 0.1
)
morpho_data
#> [,1] [,2] [,3] [,4] [,5]
#> t5 0 1 0 1 1
#> t6 0 0 1 1 1
#> t1 1 1 0 0 1
#> t9 0 0 1 1 1
#> t3 1 1 1 1 0
#> t8 0 1 1 1 0
#> t10 0 0 0 1 1
#> t4 1 1 1 1 0
#> t7 1 1 1 1 0
#> t2 1 1 0 1 1
#> Morphological data for 10 taxa with 10 traits per taxon and 0 1 as character states
#> Showing maximum of 5 traits here for nowAll simulation output is stored in a morpho object. This
is a list with the following components:
sequences — a list with up to three
elements:
tips: character states at the tips of the tree. Stored
as a named list where each element is a vector of trait values for one
taxon.nodes: character states at internal nodes (when
ancestral = TRUE).SA: character states for sampled ancestors (when a
fossil object is provided). Named using the specimen number and the
branch number along which the fossil was sampled.trees — a list containing:
EvolTree: the tree with branch lengths in evolutionary
distance.TimeTree: the time-scaled tree (if provided).BrRates: the branch rates used for simulation.model — a list describing the model
used:
Specified: a string summarising the model per
partition, including the number of traits and character states.RateVar: the discrete rate categories drawn from the
specified distribution (if ACRV was used).RateVarTrait: which rate category was assigned to each
trait. Values are listed from lowest rate (1) to highest.transition_history — a list of data
frames, one per trait. Each data frame records every state transition
that occurred during simulation, including the branch number
(edge), the new state (state), and the
position along the branch where the transition occurred
(hmin).
root.states — a vector of root states,
one per trait.
fossil — the FossilSim fossil object
(if provided). The naming scheme matches that of FossilSim.
You can check whether an object is a morpho object using
is.morpho().
MorphSim provides helper functions for extracting data from the morpho object.
# Tip data as a matrix
get.matrix(morpho_data, seq = "tips")
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> t5 0 1 0 1 1 0 1 1 1 1
#> t6 0 0 1 1 1 1 1 1 1 1
#> t1 1 1 0 0 1 1 1 0 1 0
#> t9 0 0 1 1 1 1 1 1 1 1
#> t3 1 1 1 1 0 0 1 1 1 0
#> t8 0 1 1 1 0 1 1 1 1 1
#> t10 0 0 0 1 1 1 1 1 1 1
#> t4 1 1 1 1 0 1 0 1 1 1
#> t7 1 1 1 1 0 0 1 1 0 0
#> t2 1 1 0 1 1 0 1 1 1 0
# Transition history for a specific trait (including root state)
get.transitions(morpho_data, trait = 1)
#> edge state hmin
#> 1 0 1 0.00000000
#> 2 1 0 0.02027768
#> 3 1 1 0.09703761
#> 4 4 0 0.19770685
#> 5 7 0 0.23067076
#> 6 9 0 0.04633432A single value for br.rates applies a strict clock. A
vector of rates (one per branch) applies a relaxed clock. You can use
the SimClock package to generate branch rates under different clock
models.
# Strict clock
strict_data <- sim.morpho(time.tree = tree, k = 2, trait.num = 10,
br.rates = 0.1)
# Relaxed clock (random rates per branch for illustration)
relaxed_rates <- runif(nrow(tree$edge), min = 0.01, max = 0.2)
relaxed_data <- sim.morpho(time.tree = tree, k = 2, trait.num = 10,
br.rates = relaxed_rates)Setting variable = TRUE ensures all simulated characters
are variable across taxa, matching the MkV model of Lewis (2001).
Not all morphological characters evolve at the same rate. MorphSim can model this using among-character rate variation (ACRV), where each trait is assigned a rate drawn from a discretized distribution. Two distributions are available.
Gamma distribution — controlled by two parameters:
alpha.gamma, the shape parameter of the gamma distribution
(smaller values produce greater rate heterogeneity, with more characters
evolving very slowly or very quickly; larger values concentrate rates
closer to the mean), and ACRV.ncats, the number of discrete
rate categories used to approximate the continuous distribution
(commonly set to 4, following standard practice in phylogenetic
inference).
gamma_data <- sim.morpho(
time.tree = tree, k = 2, trait.num = 10, br.rates = 0.1,
ACRV = "gamma", alpha.gamma = 1, ACRV.ncats = 4
)
# The discrete rate categories
gamma_data$model$RateVar
#> [1] 0.1369538 0.4767519 1.0000000 2.3862944
# Which category was assigned to each trait
gamma_data$model$RateVarTrait
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 4 1 2 4 3 4 3 4 3 2Lognormal distribution — controlled by
meanlog and sdlog, the mean and standard
deviation on the log scale, and ACRV.ncats as above. The
lognormal can produce heavier tails than the gamma, meaning a greater
spread between the slowest and fastest evolving characters.
lgn_data <- sim.morpho(
time.tree = tree, k = 2, trait.num = 10, br.rates = 0.1,
ACRV = "lgn", meanlog = 0, sdlog = 1, ACRV.ncats = 4
)
# The discrete rate categories
lgn_data$model$RateVar
#> [1] 0.2269731 0.5214124 0.9861614 2.2654531User-defined rates — you can also provide your own
rate categories directly using ACRV = "user" and passing a
vector of rates to define.ACRV.rates.
These can be combined to simulate under an MkV + Gamma model, consistent with models implemented in RevBayes and BEAST.
You can provide your own transition rate matrix. For example, an ordered model where transitions can only occur between adjacent states (0 ↔︎ 1 ↔︎ 2):
ord_Q <- matrix(c(
-0.5, 0.5, 0.0,
0.3333333, -0.6666667, 0.3333333,
0.0, 0.5, -0.5
), nrow = 3, byrow = TRUE)
ordered_data <- sim.morpho(
time.tree = tree, k = 3, trait.num = 10, br.rates = 0.1,
define.Q = ord_Q
)Note that when using a custom Q matrix, all traits in that simulation share the same matrix. If you want to simulate different partitions under different Q matrices, for example, some ordered and some unordered characters, you can simulate them separately and combine the results. See the section on combining morpho objects below.
By default the root state is sampled uniformly. You can fix it using
root.state:
Setting full.states = TRUE ensures that all character
states specified by k are present in at least one tip for
each trait. This is useful when you want to guarantee that the full
range of states specified in the Q matrix is observed at the tips.
Without this, it is possible for a character with k = 3 to
only show states 0 and 1 at the tips if state 2 was never reached or was
lost before the present.
Different groups of characters can be simulated with different
numbers of states. The partition argument specifies how
many traits belong to each partition, and k gives the
number of states per partition.
part_data <- sim.morpho(
time.tree = tree,
k = c(2, 3, 4),
trait.num = 20,
partition = c(10, 5, 5),
br.rates = 0.1
)
# Model specification per partition
part_data$model$Specified
#> [1] "Mk_Part:10states:2" "Mk_Part:5states:3" "Mk_Part:5states:4"The total number of traits defined by trait.num and partition must match or the function will return an error.
It is not possible to simulate multiple partitions under different
models in a single call to sim.morpho. For example, you
might want some characters to evolve under an ordered model with a
custom Q matrix and others under a standard Mk model with gamma rate
variation. To do this, simulate each partition separately and combine
them using combine.morpho.
# Partition 1: ordered characters with a custom Q matrix
ord_Q <- matrix(c(
-0.5, 0.5, 0.0,
0.3333333, -0.6666667, 0.3333333,
0.0, 0.5, -0.5
), nrow = 3, byrow = TRUE)
partition_1 <- sim.morpho(
time.tree = tree, k = 3, trait.num = 10,
br.rates = 0.1, define.Q = ord_Q
)
# Partition 2: unordered binary characters with MkV + Gamma
partition_2 <- sim.morpho(
time.tree = tree, k = 2, trait.num = 10,
br.rates = 0.1, variable = TRUE,
ACRV = "gamma", alpha.gamma = 1, ACRV.ncats = 4
)
# Combine into a single morpho object
combined <- combine.morpho(partition_1, partition_2)
# The combined object has 20 traits total
length(combined$sequences$tips[[1]])
#> [1] 20
# Model specifications from both partitions are preserved
combined$model$Specified
#> [1] "Mk_Part:10states:3" "MkgammaV_Part:10states:2"Both morpho objects must share the same tree, branch lengths, and fossil object (if present). The combined object merges the sequences, transition histories, root states, and model information from both inputs.
MorphSim integrates with FossilSim to simulate character data for sampled ancestors, fossils that represent direct ancestors of other lineages. Because MorphSim tracks where transitions occur along branches, it generates trait values appropriate to the time and lineage at which each fossil was sampled.
# Simulate fossil and extant sampling
f <- FossilSim::sim.fossils.poisson(rate = 0.1, tree = tree,
root.edge = FALSE)
f2 <- FossilSim::sim.extant.samples(fossils = f, tree = tree, rho = 0.5)
# Simulate morphological data with fossils
fossil_data <- sim.morpho(
time.tree = tree, k = c(2,3), trait.num = 10, partition = c(5,5),
br.rates = 0.1, fossil = f2
)
# Sampled ancestor data
names(fossil_data$sequences$SA)
#> [1] "1_7" "2_7" "3_4" "4_17" "5_1" "6_1" "7_9" "8_7" "9_12"
#> [10] "10_4" "11_10" "12_15"Fossil morphological data are often incomplete.
sim.missing.data allows you to introduce missing characters
under different scenarios, replacing character values with
"?".
Characters are removed uniformly at random across the entire matrix. A single probability value controls the proportion of data that is removed.
Different partitions can have different probabilities of missing data. This reflects how different anatomical regions may have different preservation potential, for example, robust skeletal elements are more likely to be preserved than soft tissue features. The probability vector must match the number of partitions.
part_fossil <- sim.morpho(
time.tree = tree, k = c(2, 3), trait.num = 10,
partition = c(5, 5), br.rates = 0.1, fossil = f2
)
# Hard tissues (partition 1): 10% missing
# Soft tissues (partition 2): 70% missing
missing_part <- sim.missing.data(
data = part_fossil, method = "partition",
seq = "tips", probability = c(0.1, 0.7)
)When data is simulated with ACRV, characters can be removed according to the rate category they were simulated under. The probability vector must match the number of rate categories. This can be used to explore what happens when faster or slower evolving characters are preferentially lost.
gamma_fossil <- sim.morpho(
time.tree = tree, k = 2, trait.num = 20, br.rates = 0.1,
ACRV = "gamma", alpha.gamma = 1, ACRV.ncats = 4, fossil = f2
)
# Slow characters preserved, fast characters increasingly lost
missing_rate <- sim.missing.data(
data = gamma_fossil, method = "rate",
seq = "tips", probability = c(0, 0.2, 0.5, 0.8)
)Remove characters from specific traits. This is useful for exploring the impact of losing particular characters.
Remove characters from named taxa. This can be used to simulate scenarios where certain specimens are poorly preserved.
Remove characters only from extinct taxa while leaving extant taxa complete. This reflects the common scenario where fossil specimens have more missing data than living species.
Remove characters from specific traits and specific taxa. This reflects scenarios where certain anatomical regions are less likely to be preserved in particular lineages.
stats.morpho computes the consistency index, retention
index, identifies convergent traits, and summarises the tree structure.
The consistency index (CI; Kluge and Farris, 1969) measures the minimum
number of character changes required on a tree relative to the observed
number of changes, with values closer to 1 indicating less homoplasy.
The retention index (RI; Farris, 1989) quantifies how well
synapomorphies are retained on the tree, with higher values indicating
stronger phylogenetic signal. Both metrics range from 0 to 1.
get.convergent identifies traits where the same state
arose independently more than once. Called without a trait number, it
returns a summary across all traits. Called with a specific trait, it
shows which tips inherited their state from which evolutionary origin.
Tips sharing the same origin inherited their state from a single
transition event. Tips with the same state but different origins
represent convergent evolution.
# Which traits are convergent?
get.convergent(fossil_data)
#> trait num.convergent.states
#> 1 1 1
#> 2 2 2
#> 3 3 2
#> 4 4 2
#> 5 5 1
#> 6 6 1
#> 7 7 2
#> 8 8 1
#> 9 9 2
#> 10 10 2
# Detail for a specific trait
get.convergent(fossil_data, trait = 1)
#> state origin tip convergent
#> 1 0 edge_14_t_0.03 t3, t7, t2 TRUE
#> 2 0 edge_4_t_0.0825 t8 TRUE
#> 3 1 root t5, t6, t1, t9, t10, t4 FALSEplot.morpho displays the evolutionary history of a trait
along the tree. Transition boxes show where state changes occurred along
each branch. We start with the simplest version, just the trait history
on the distance tree, with all additional features turned off.
plot(fossil_data, trait = 6, timetree = FALSE,
show.fossil = FALSE, reconstructed = FALSE,
show.convergent = FALSE)When show.fossil = TRUE, fossils are plotted along the
branches of the tree at the time they were sampled. Extant taxa are
shown as green circles at the tips and fossil samples as black diamonds
along the branches. This requires a time tree, so we also set
timetree = TRUE.
plot(fossil_data, trait = 6, timetree = TRUE,
show.fossil = TRUE, reconstructed = FALSE,
show.convergent = FALSE)When reconstructed = TRUE, the plot distinguishes
between the full (true) tree and the reconstructed tree, the tree
containing only the sampled lineages. Branches that are part of the
reconstructed tree are drawn in black, while unsampled lineages are
shown in grey. This helps visualise how much of the true evolutionary
history is captured by the available samples.
plot(fossil_data, trait = 6, timetree = TRUE,
show.fossil = TRUE, reconstructed = TRUE,
show.convergent = FALSE)When show.convergent = TRUE, transition boxes for any
state that arose independently more than once are highlighted in blue.
This includes transitions that were subsequently reversed on descendant
branches, giving a complete picture of where convergence occurred during
the simulation.
plot(fossil_data, trait = 6, timetree = TRUE,
show.fossil = TRUE, reconstructed = TRUE,
show.convergent = TRUE)By default, plot.morpho shows all available features,
the time tree, fossils, reconstructed tree, and convergence
highlighting. If any of these are not available in the morpho object
(e.g., no fossil data was provided), the plot falls back gracefully and
informs the user. So the plot above is equivalent to the default:
write.morpho provides a single interface for exporting
trees, character matrices, and fossil ages. The type
argument specifies what to export, and reconstructed
controls whether the full or reconstructed version is written.
# Full tree
write.morpho(fossil_data, file = "tree.tre", type = "tree")
# Reconstructed tree
write.morpho(fossil_data, file = "recon_tree.tre", type = "tree",
reconstructed = TRUE)
# Character matrix
write.morpho(fossil_data, file = "matrix.nex", type = "matrix")
# Reconstructed matrix
write.morpho(fossil_data, file = "recon_matrix.nex", type = "matrix",
reconstructed = TRUE)
# Fossil ages (compatible with RevBayes)
write.morpho(fossil_data, file = "ages.tsv", type = "ages")
# With age uncertainty
write.morpho(fossil_data, file = "ages.tsv", type = "ages",
uncertainty = 2)Standard ape export functions also work since the morpho object
stores trees as phylo objects:
The morpho object can be converted to a treedata object
for use with ggtree. This requires the treeio, ggtree, tibble, and
ggplot2 packages.
library(treeio)
library(ggtree)
library(ggplot2)
library(tibble)
# Convert tip data to a data frame
tip_df <- as.data.frame(t(as.data.frame(fossil_data$sequences$tips)))
colnames(tip_df) <- paste0("trait_", seq_len(ncol(tip_df)))
tip_df$node <- match(rownames(tip_df),
fossil_data$trees$EvolTree$tip.label)
# Create treedata object
td <- treedata(phylo = fossil_data$trees$EvolTree,
data = as_tibble(tip_df))
# Plot tree with coloured tip points for one trait
ggtree(td) +
geom_tiplab(offset = 0.01) +
geom_tippoint(aes(color = factor(trait_1)), size = 3) +
scale_color_brewer(palette = "Set2", name = "State") +
theme(legend.position = "right")
# Plot tree with heatmap of all traits
heat_df <- tip_df[, grep("trait_", colnames(tip_df))]
heat_df[] <- lapply(heat_df, factor)
gheatmap(ggtree(td) + geom_tiplab(offset = 0.05),
heat_df, offset = 0.02, width = 0.5,
colnames_angle = 90, hjust = 1) +
scale_fill_brewer(palette = "Set2", name = "State")