** mppR** is an add-on package for the
statistical software

Quantitative trait locus (QTL) analysis essentially consists in finding a relationship between DNA polymorphisms (e.g., SNPs) and phenotypic variation (Doerge 2002). QTL detection methods greatly depends on the genetic properties of the population that is used. Historically, QTL detection has been performed in designed experimental populations involving two parental lines (bi-parental crosses). Several methods and software packages have been developed for QTL analysis for such populations, for a review see Varshney et al. (2015). Multi-parent populations (MPPs) are an alternative type of population that can improve the chances of QTL detection while broadening the range of research questions that can be answered (Cavanagh et al. 2008). MPPs can be seen as a compromise between bi-parental crosses and association panels (Myles et al. 2009). Different types of MPPs have been developed including nested association mapping (NAM) populations (McMullen et al. 2009), diallels (Blanc et al. 2006) and factorial designs (Bardol et al. 2013). More complicated MPPs can be created by intercrossing multiple founders followed by inbreeding, like in multi-parent advanced generation inter-cross (MAGIC) populations (Cavanagh et al. 2008). Here, we consider MPPs as a collection of genotypes that are derived by crosses between at least three different parents. In this paper, a MPP QTL analysis is akin to the joint analysis of such a population using a common marker map.

The development of an appropriate statistical methodology taking MPP
properties into consideration is a *sine qua non* condition to
fully exploit the potential of MPP genetic resources. The most critical
question is how to account for genetic relatedness between the genotypes
and how to integrate this information into the statistical model. A
first simple option is to treat MPPs as an association mapping panel and
to apply genome-wide association study (GWAS) QTL detection methods.

The use of a GWAS type of method presents several advantages. First,
GWAS methods apply to almost any type of MPP design because the
knowledge of population structure is not a prerequisite. The GWAS QTL
detection is marker-based, that is, uses identity by state (IBS)
information, and so generally only allows for two alleles at each tested
position. A second advantage is the existence of powerful algorithms
(e.g. EMMA - Kang et al. (2008))
that allow to scan in a reasonable amount of time large marker datasets.
Finally, GWAS analyses are also based on sets of well-developed mixed
models (Yu et al. 2006). These models allow to
account for the population structure and the polygenic effect using a
kinship matrix covering the relations between all genotypes from all
populations Rincent et al. (2014).
MPP GWAS type of QTL detection can be performed using packages like
** TASSEL** (Bradbury et al.
2007) the

However, a major limitation of GWAS methods is that they generally use bi-allelic marker models assuming two classes of effects at the QTL position. The bi-allelic assumption represents therefore a risk of failing to reflect the allele diversity potentially present at the QTLs within MPPs (Garin et al. 2017). Indeed, several factors like multiple alleles, cross-specific linkage phase between marker and QTL alleles, or interaction effects between the QTL and genetic background may cause complex allelic series.

Other approaches use available pedigree information to model or infer
DNA transmission to the final lines starting from a set of parents or
ancestors. This strategy make use of identity by descent (IBD)
information and gives model with more than two alleles, which can be
more appropriate to model complex allelic series Xavier et al. (2015). For example, the
** R** package

** mppR** contains functions to support data
analysis from data processing to visualisation, staying within the

This vignette is organized as follows. Section 2 describes the statistical methodology for the proposed MPP QTL detection procedures. Section 3 illustrates in detail the QTL detection procedure describing the different functions using a subset of the maize US-NAM population as example McMullen et al. (2009).

** mppR** allows to analyse any type of MPP
design with minimally two crosses between at least three different
parents. In such designs, the possibility to estimate QTL parameters
(identifiability) is linked to the notion of connectivity of the design.
It is always possible to estimate one effect per cross. Therefore, the
number crosses (\(n_{c}\)) constitutes
the largest number of QTL effects that can be estimated. The estimation
of parental and ancestral effects is linked to the connectivity of the
MPP design (Rebaï and Goffinet 2000). Taking for
example the parental alleles, it is possible to estimate \(n_{p} - 1 \leq n_{c}\) parental alleles per
connected part of the design. Design connectivity can be defined using
graph theory (Weeks and Williams 1964). Following
graph theory, an MPP design can be represented by a graph where parents
(alleles) are vertices or nodes and crosses are edges or lines (Figure
1). A connected graph, is a graph where there exists a walk from any
node \(i\) to any other node \(j\). For example, the MPP design in Figure
1 is composed of two connected parts. Ideally, MPP QTL analysis should
be run using only connected populations. The joint analysis of an MPP
composed of several disjunct but internally connected parts is still
possible. In that case, connectivity could follow from the sharing of a
common ancestor by two parents of the design. For example if parents
\(P_{B}\) and \(P_{E}\) of the MPP design of Figure 1 would
receive their allele from the same ancestor, the MPP design would
consist of a single connected part. For a bi-allelic model we assume
that the design is fully connected. In any case, even if disconnected
parts are analysed jointly, a minimal level of connection will be
assumed by assuming that cofactors are shared in the whole
population.

We propose to describe the QTL detection model with a focus on the following two main components: a) the assumption made on the form of the QTL effect and allele origin, and b) the assumption about the variance covariance structure of the residual. Let us start by defining the following underlying single locus QTL detection model describing the relationship between the phenotypic values and genotypes coming from several crosses (Rebaï and Goffinet 1993):

\[y_{ijk} = \mu_{ij} + \alpha_{i} + \alpha_{j} + g_{ij} + e_{ijk} \quad (1)\]

where \(y_{ijk}\) represents the phenotypic value for the \(k^{th}\) individual from the cross between parents \(i\) and \(j\). \(\mu_{ij}\) is the cross mean and \(\alpha_{i}\) and \(\alpha_{j}\) represent the effects associated with the QTL alleles coming from parent \(i\) and \(j\) respectively. The QTL effects are assumed to be strictly additive (no dominance, no epistasis). \(g_{ij}\) is the random polygenic effect due to QTLs elsewhere in the genome with distribution \(N(0,\sigma_{g}^{2})\). Finally, \(e_{ijk}\) represents the random micro-environmental effect (plot error) having distribution \(N(0,\sigma_{e}^{2})\). In this model, \(\sigma_{g}^{2}\) and \(\sigma_{e}^{2}\) are unique meaning that the level of polygenic effect and environmental error is considered to be the same in each cross.

Model 1 can be rewritten in matrix notation:

\[ \mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{r} \quad (2) \] where, \(\mathbf{y}\) is the \([N \times 1]\) vector of phenotypic values. \(N = \sum_{c=1}^{n_{c}}N_{c}\) where \(N_{c}\) is the number of genotypes coming from cross \(i\). \(\mathbf{X=[X_{c}|X_{Q}]}\) is the fixed effect incidence matrix and \(\mathbf{\beta}' = [\mathbf{\beta_{c}}'|\mathbf{\beta_{Q}}']\) the vectors of cross intercepts and QTL effects. \(\mathbf{X}\) is composed of a part that links observations to the particular cross it belongs to (\(\mathbf{X_{c}}\) an \([N \times n_{c}]\) matrix with \(n_{c}\) representing the number of crosses) and \(\mathbf{X_{Q}}\) the part related with the QTL effect attached to the particular observation. \(\mathbf{X_{Q}}\) is a matrix of dimensions \([N \times n_{al}]\) with \(n_{al}\) the number of QTL alleles that are assumed to segregate for the particular QTL locus. Several assumptions are possible concerning \(n_{al}\). They correspond to different statistical models presented in the next section. The form of \(\mathbf{X_{Q}}\) varies according to the type of QTL effect assumed. Finally, \(\mathbf{r}\) represents the vector of random residual terms with distribution \(N(0,\mathbf{R})\). We propose different models for the residual terms in section on the variance covariance structure.

The QTL effect incidence matrix \(\mathbf{X_{Q}}\) is the central term of the model. Assuming a diploid organism, the individual elements of \(\mathbf{X_{Q}}\), \(x_{nl}\) take values between \(0\) and \(2\) and represent the expected number of copies of allele \(l\) with \(l = 1,...,n_{al}\) received by genotype \(n\) at the QTL position. The column number of \(\mathbf{X_{Q}}\) (\(n_{al}\)) varies with the number of alleles assumed at the QTL position. We propose four models: cross-specific, parental, ancestral, and bi-allelic. These models correspond to different assumptions concerning the type of QTL effects and the allele origin. They are characterized by different ways to model genetic relatedness between genotypes using either IBD estimates, IBS information, or a combination of both.

The first model assumes that the QTL alleles that segregate within a
particular cross are different from those that segregate in another
cross. Cross-specific QTL effects can be seen as parental alleles
interacting with the cross genetic background. Under this assumption,
QTL alleles are nested within crosses and so QTL effects are estimated
per cross. In the cross-specific model, \(x_{nl} \in [0, 2]\) represents the expected
number of allele copies received from one of the cross parents given the
flanking markers. The expected number of parental allele copies is
estimated using IBD probabilities computed by the package
*R*** qtl** (Broman
et al. 2003). These probabilities are estimated with respect
to the parents of each cross. For illustration purpose, let us take the
following example of a MPP analysis combining material coming from two
crosses: cross 1 (\(P_{A} \times
P_{B}\)) and cross 2 (\(P_{A} \times
P_{C}\)). In that case, we ignore the fact that the two crosses
are connected since they share a common parent \(P_{A}\). Therefore \(\mathbf{X_{Q}}\) is a diagonal block
structure with diagonal elements specifying the within cross allele
origin. The first two columns represent cross 1 (\(P_{A} \times P_{B}\)) and the rest cross 2
(\(P_{A} \times P_{C}\)). Model 2 can
be re-written like that

\[ \mathbf{y} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ \end{pmatrix} \mathbf{\beta_{c}} + \begin{pmatrix} 2 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 2 \\ \end{pmatrix} \mathbf{\beta_{Q}} + \mathbf{r} \quad (3)\]

It is not possible to estimate two effects per cross since the parental scores are linearly dependent. The design matrix for the QTL effect in a cross-specific model is therefore constrained by redefining the parental information of a cross as half the difference between parent \(i\) and parent \(j\). Therefore, for the cross-specific model, \(\mathbf{X_{Q}}\) is of dimension \([N \times n_{c}]\) where \(n_{c}\) is the number of crosses and the vector \(\mathbf{\beta_{Q}}\) is of dimension \([n_{c} \times 1]\). The cross-specific model contains the upper limit for the number of QTL effects that can be estimated. Indeed, in connected MPPs, the maximum number of effects that can be estimated is \(n_{c} \geq n_{p} -1\) where \(n_{p}\) is the number of parents Jansen, Jannink, and Beavis (2003). This model corresponds to the disconnected model described in Blanc et al. (2006).

In the cross-specific model, all crosses are considered unrelated. A second option is the parental model that adds the connection between crosses via the parents shared between crosses. In that case, the parental QTL incidence matrix is simply obtained by re-arranging the columns of model 3 taking into consideration the connections created by the use of common parents. This model estimates one allele effect per parental line, which is considered to be independent of the genetic background. The QTL effect of parent \(p\) is assumed to be constant in all crosses where this parent has been used (Blanc et al. 2006).

In a connected MPP, if \((n_{p} -1) < n_{c}\), one expects the parental model to be more powerful than the cross-specific model because the number of parameters to estimate is reduced (Blanc et al. 2006). The reduction in the number of parameters to estimate should also help to get better estimates of the QTL effects because the sample size used to estimate these effect increases (Li et al. 2005). Full half diallels, with at least four parents, represent the most connected system where the number of crosses \(n_{c} = (n_{p}*(1-n_{p}))/2\) is maximised with respect to the number of parents (Jansen, Jannink, and Beavis 2003). Coming back to the previous model 3, we integrate in the QTL incidence matrix the fact that the two crosses are connected via the common parent \(P_{A}\). Therefore column one and two are merged in a single column.

\[ \mathbf{y} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ \end{pmatrix} \mathbf{\beta_{c}} + \begin{pmatrix} 2 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & 2 & 0 \\ 2 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 0 & 2 \\ \end{pmatrix} \mathbf{\beta_{Q}} + \mathbf{r} \quad (4)\]

In the parental effect model, the matrix \(\mathbf{X_{Q}}\) is of dimension \([N \times n_{p}]\) and \(\mathbf{\beta_{Q}}\) is of dimension \([n_{p} \times 1]\). The parental model corresponds to the connected model described in Blanc et al. (2006).

The third option, called ancestral model, goes one level up in the
pedigree and uses relatedness between parents to cluster them into a
reduced number of ancestral groups. We assume that parents belonging to
the same cluster transmit the same allele Leroux
et al. (2014). Different options can be used to
cluster parental lines. One of them is the ** R**
package

The ancestral QTL incidence matrix \(\mathbf{X_{Q}^{*}}\) can be obtained by
modifying the parental IBD QTL incidence matrix \(\mathbf{X_{Q}}\) using
** clusthaplo** results. The ancestral model uses
therefore both IBD and parental relatedness IBS information. We can
modify model 4 assuming that at the considered QTL position parents
\(A\) (first column) and \(C\) (third column) belong to the same
ancestral group \(A_{1}\), and parent
\(B\) (second column) falls apart in
group \(A_2\).

\[ \mathbf{X_{Q}^{*}} = \mathbf{X_{Q}} \times \mathbf{A} = \begin{pmatrix} 2 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & 2 & 0 \\ 2 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 0 & 2 \\ \end{pmatrix} \times \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 1 & 1 \\ 0 & 2 \\ 2 & 0 \\ 2 & 0 \\ 2 & 0 \\ \end{pmatrix} \quad (5) \]

The matrix \(\mathbf{X_{Q}^{*}}\) of the ancestral model is of dimension \([N \times n_{a}]\) where \(n_{a}\) is the number of ancestral alleles. The corresponding vector \(\mathbf{\beta_{Q}^{*}}\) is of dimension \([n_{a} \times 1]\). The elements of \(\mathbf{\beta_{Q}^{*}}\) represent the estimates of the ancestral additive effects. The ancestral-effect model corresponds to the LDLA models used by Bardol et al. (2013) and Giraud et al. (2014).

The estimation of parental (ancestral) QTL effect also requires the application of a constraint to the QTL incidence matrix. From a theoretical point of view, it is possible to estimate maximally \(n_{p} - 1\) (\(n_{a} - 1\)) QTL effects per connected part of the design Weeks and Williams (1964). Therefore, the QTL effects are estimated setting to zero the most frequent parental (ancestral) allele within each connected part. For example in the example of Figure 1, we could have \(P_{A}\) set as reference of the first connected part and \(P_{E}\) set as reference of the second connected part. An alternative is to force the QTL effects to sum to zero. The sum to zero constraint will also take place within each connected part.

The last possibility is the bi-allelic model. If the marker is at the QTL position, the bi-allelic model assumes that genotypes with the same SNP score transmit the same allele. Therefore, we assume that the same allele segregates in the whole population which connects all parts of the design that were not connected before. Genetic relatedness is therefore defined based on marker IBS information only. In this model, using the most frequent allele set as reference, \(\mathbf{X_{Q}}\) become a vector \([N \times 1]\) with values 0, 1 or 2 corresponding to the number of copies of the minor allele.

\[ \mathbf{y} = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ \end{pmatrix} \mathbf{\beta_{c}} + \begin{pmatrix} 2 \\ 1 \\ 0 \\ 1 \\ 2 \\ 0 \\ \end{pmatrix} \mathbf{\beta_{Q}} + \mathbf{r} \quad (6) \]

This model corresponds to association mapping models (e.g., model B in Würschum et al. (2012). In a connected MPP, if \((n_{p} - 1) < n_{c}\), the models can be ordered from less to more parsimonious models (cross-specific, parental, ancestral, bi-allelic).

A second important part of the QTL detection problem is the data variance covariance structure (VCOV). Several assumptions are possible concerning the VCOV of model 2 \(\mathbf{V} =Var(\mathbf{y}) = Var(\mathbf{r}) = \mathbf{R}\)

The simplest form that can be assumed for \(\mathbf{V}\) is a homogeneous residual term (HRT) variance. In this case, \(\mathbf{R}=\mathbf{I_{N}}\sigma_{r}^{2}\). This corresponds to a linear model where residual terms are considered to be independent and to belong to the same distribution. In the HRT model, the variance of the polygenic term (\(\sigma_{g}^{2}\)) and the error variance (\(\sigma_{e}^{2}\)) of model 1 are both pooled in the unique variance residual term \(\sigma_{r}^{2}\).

Another possibility is to consider cross-specific variance residual terms (CSRT). The homogeneous random term assumption could not reflect the potential heterogeneity of variance between the different crosses. Indeed, genetic heterogeneity of the crosses and/or various levels of genetic distances between the parents could result in different levels of variance for the polygenic effects \((\exists \quad \sigma_{gc.i}^{2} \neq \sigma_{gc.j}^{2} \quad \forall i \neq j \quad i, j = 1,..., n_{c})\). In such a situation, the heterogeneity of variance existing between crosses will require cross-specific residual terms (\(\mathbf{R} = \bigoplus_{c=1}^{n_{c}}\sigma_{r_{c}}^{2}\)) and \(\mathbf{V}\) will take the form:

\[ \mathbf{V} = \bigoplus_{c=1}^{n_{c}}\sigma_{r_{c}}^{2} = \begin{pmatrix} \mathbf{I_{N_{1}}} \sigma_{r_{1}}^{2} & & & \mathbf{0}\\ & \mathbf{I_{N_{2}}} \sigma_{r_{2}}^{2} & & \\ & & \ddots & \\ \mathbf{0} & & & \mathbf{I_{N_{n_{c}}}} \sigma_{r_{n_{c}}}^{2} \end{pmatrix} \quad (7) \]

A third option models directly the polygenic effect \(g_{ij}\) of model 1 by splitting the residual term into a part that follows a relationship structure determined by the kinship among individuals, and a nugget or independent residual, each of them with a separate variance component. Therefore, model 2 becomes

\[ \mathbf{y} = \mathbf{X\beta} + \mathbf{g} + \mathbf{r} \quad with \quad \mathbf{g} \sim N(0, \mathbf{G}\sigma_{g}^{2}) \quad (8) \]

The polygenic term \(\mathbf{g}\) and the residual term \(\mathbf{r}\) are assumed to be independent (\(cov(\mathbf{g}, \mathbf{r}) = 0\)). \(\mathbf{G}\) is a genetic relationship matrix based on pedigree information computed with the method of Luo et al. (1992). The elements of \(\mathbf{G}\) represent the inbreeding coefficient which for example takes values \(0, 0.25\) or \(0.5\) for unrelated individual, half-sibs and full sibs respectively. In such case, \(\mathbf{V} = \mathbf{G} \sigma_{g}^{2} + \mathbf{R}\)

Once again we can have two assumptions for \(\mathbf{R}\). In the first case, \(\mathbf{R}=\mathbf{I}\sigma_{r}^{2}\) and \(\mathbf{V}\) is defined as

\[ \mathbf{V} = \mathbf{G} \sigma_{g}^{2} + \mathbf{I_{N}}\sigma_{r}^{2} \quad (9) \]

In the second case, \(\mathbf{R} = \bigoplus_{c=1}^{n_{c}}\sigma_{r_{c}}^{2}\) which gives

\[ \mathbf{V} = \mathbf{G} \sigma_{g}^{2} + \bigoplus_{c=1}^{n_{c}}\sigma_{r_{c}}^{2} \quad (10) \]

The computation of pedigree models can sometime be problematic. The correlation between the genetic background matrix (\(\mathbf{G}\)) and the QTL term could be a possible cause. The combination of four types of QTL effects (cross-specific, parental, ancestral, bi-allelic) and four VCOV (HRT, CSRT, pedigree + HRT, pedigree + CSRT) gives a grid of 16 different QTL detection models.

The significance of the QTL effects \(\mathbf{\hat{\beta}_{Q}}\) can be estimated using the Wald test (Wald 1943). In the case of a HRT or a CSRT model, after simplification, the Wald test can be rewritten like that:

\[ W(\mathbf{\hat{\beta}}) = \mathbf{y}'\mathbf{\hat{V}}^{-1} \mathbf{\hat{y}} \quad (11) \]

\(W(\mathbf{\hat{\beta}})\) is distributed as a chi-squared distribution with the degrees of freedom being equal to the number of QTL alleles. Expression 11 shows that the test statistic depends on the correlation between the observed phenotypic values (\(\mathbf{y}\)) and the model fitted values (\(\mathbf{\hat{y}}\)) and on the estimated VCOV (\(\mathbf{\hat{V}}\)). For that reason, the choice of the QTL incidence matrix (\(\mathbf{X_{Q}}\)) should be such that the phenotypic variations is captured as accurately as possible. If these variations are due to parental or cross-specific effects, corresponding QTL effects should perform better at the price of a larger number of parameters to estimate. On the other hand, if the effects are similar through the MPP, a reduced number of parameters will capture this variability and allows gains in power. The VCOV structure should also be selected to reflect local patterns of variability. If heterogeneity is present between crosses, the CSRT model will give test statistics considering this heterogeneity.

The QTL detection procedure proposed in
** mppR** is based on the following steps: a)
Optional significance threshold determination by permutation test (Churchill and Doerge 1994); b) cofactors
selection by simple interval mapping (SIM); c) multi-QTL model search
using composite interval mapping (CIM) (Zeng 1993,
zeng_1994); d) simultaneous evaluation of the selected candidate
QTL positions after backward elimination.

The QTL significance threshold can be determined by permutation. The use of permutation aims at reproducing the conditions of the null hypothesis (no QTL present or no association between the marker and the QTL) by breaking the link between the phenotype and the genetic markers (Churchill and Doerge 1994). Permutations allow to build a null hypothesis for the test statistic that reflects the characteristics of the experiment and should be valid for any distribution of the quantitative trait (Churchill and Doerge 1994). The number of permutations should be at least 1000. Alternatively, the user can also specify the significance threshold value.

The determination of a multi-QTL model is done using CIM. Such a strategy is based on fitting the model using cofactors representing other QTLs than the tested QTL (Zeng 1993, zeng_1994). The selection of cofactors is based on a SIM scan using the following model where \(X_{c}\) is a cross-specific intercept and \(X_{Q}\) model the QTL effect.

\[ \mathbf{y} = \mathbf{X_{c}}\mathbf{\beta_{c}} + \mathbf{X_{Q}}\mathbf{\beta_{Q}} + \mathbf{r} \quad (12) \]

Cofactors can be selected with minimum distance in between based on the \(-log10(p-value)\) SIM profile. CIM profile is computed based on the following model

\[ \mathbf{y} = \mathbf{X_{c}}\mathbf{\beta_{c}} + \mathbf{X_{q}}\mathbf{\beta_{q}} + \mathbf{X_{Q}}\mathbf{\beta_{Q}} + \mathbf{r} \quad (13) \]

where $\mathbf{X_{q}}$ represents the selected cofactors. During the CIM scan, an exclusion window can be set around the tested QTL position to remove cofactors and avoid too strong collinearity between the cofactors and the tested position. Finally, the selected candidate QTLs can be simultaneously tested after a backward elimination. An optional confidence interval for each QTL position can be obtained using a -log10($p$)~value drop-off interval taking from a CIM profile where cofactors on the scanned chromosome were excluded.

A variation on the common QTL model with a single type of effect is the multi-QTL effect (MQE) model. In the MQE model, the QTLs present in the final model can have different types of effects (cross-specific, parental, ancestral or bi-allelic). To build an MQE model we use a forward selection procedure. For each QTL to be added, genome wide profiles with a consistent QTL effect are calculated for each types of QTL effects that have been specified by the user. The model that is fitted at each position within a profile is:

\[ \mathbf{y} = \mathbf{X_{c}}\mathbf{\beta_{c}} + \mathbf{X_{Q1}}\mathbf{\beta_{Q1}} + \mathbf{r} \quad (14) \]

Where the (first) QTL (\(\mathbf{X_{Q1}}\mathbf{\beta_{Q}}\)) has an effect that is either cross-specific, parental, ancestral, or bi-allelic along the genome. From each of these profiles, the most significant position based on the -log10(\(p\))~value statistic is selected (e.g., \(\mathbf{X_{Q1.cr}}\), \(\mathbf{X_{Q1.par}}\), \(\mathbf{X_{Q1.anc}}\), \(\mathbf{X_{Q1.biall}}\)). Note that the selected QTL positions for the different types of effects can be different. The QTL that increases most the \(R_{adj}^{2}\) (18) is selected, with it type of effect, and added to the model as a cofactor for the next set of genome wide scans. If at step 1 we selected a bi-allelic QTL, then at step 2 the QTL profiles will be based on the following models:

\[ \mathbf{y} = \mathbf{X_{c}}\mathbf{\beta_{c}} + \mathbf{X_{q1.biall}}\mathbf{\beta_{q1}} + \mathbf{X_{Q2}}\mathbf{\beta_{Q2}} + \mathbf{r} \quad (15) \] For the set of QTL effects specified by the user, genome wide scans are performed via a test for the QTL effect in the term \(\mathbf{X_{Q2}}\mathbf{\beta_{Q2}}\). The forward selection process stops when no further significant QTLs can be identified. At this point, a final list of QTLs is compiled by a backward elimination. A final model with \(t\) QTLs could look like:

\[ \mathbf{y} = \mathbf{X_{c}}\mathbf{\beta_{c}} + \mathbf{X_{Q1.biall}}\mathbf{\beta_{Q1}} + ... + \mathbf{X_{Q(t-1).par}}\mathbf{\beta_{Q(t-1)}} + \mathbf{X_{Qt.anc}}\mathbf{\beta_{Qt}} + \mathbf{r} \quad (16) \]\

Once a final list of QTLs is determined, the estimates for the regression coefficients in the corresponding multi-QTL model provide the QTL effects. For this model a global goodness of fit can also be calculated using \(R^{2}\). Partial \(R^{2}\) statistics are indicators of the contributions of each individual QTLs.

In the cross-specific model, the genetic predictors for the additive effects represent half the difference between the second parent and the first parent for their conditional QTL genotype probabilities. The elements of \(\mathbf{\beta_{Q}}\) represent the within cross allele substitution effect.

For the parental and the ancestral models, from a theoretical point of view, it is possible to estimate a maximum of \(n_{p} - 1\) (\(n_{a} - 1\)) QTL effects per connected part of the design (Rebaï and Goffinet 2000). Within each connected part, the most frequent parental (ancestral) allele is used as reference. The estimated parental (ancestral) QTL effects must be interpreted as a deviation with respect to the connected part reference. Referring again to the example in Figure 1, if \(P_{A}\) is set as reference of the first connected part, its value will be zero and the other parent alleles effects (\(P_{B}\), \(P_{C}\) and \(P_{D}\)) will represent deviations with respect to \(P_{A}\) allele. If the second part (\(P_{E}\), \(P_{F}\), and \(P_{G}\)) was analysed jointly with the first part, the effect of alleles \(P_{F}\) and \(P_{G}\) will represent deviation with respect to \(P_{E}\) and will be independent of the first part parental effects.

An alternative is to use a sum to zero constraint. In that case the parental (ancestral) QTL effects are forced to sum to zero. The individual QTL effects represent a deviation with respect to the central tendency. Here also, the constraints are defined within each connected part. For the bi-allelic model, the estimated genetic effect represents the additive effect of one allele copy of the minor allele with respect to the most frequent allele set as reference.

To compute the \(R^{2}\) statistics, we compare the residual sums of squares of a full model with QTL(s), to the one of a reduced model without QTLs.

\[ R^{2} = 1 - \frac{\sum_{n=1}^{N}{r_{n(full)}^{2}}}{\sum_{n=1}^{N}{r_{n(red)}^{2}}} \quad (17) \]

Independent of the choice for the VCOV, \(R^{2}\) is always computed from a linear model (HRT). The \(R^{2}\) measurement can be adjusted to take into consideration the number of parameters used.

\[ R_{adj}^{2} = 1 - \frac{\sum_{n=1}^{N}{r_{n(full)}^{2}}/df_{full}}{\sum_{n=1}^{N}{r_{n(red)}^{2}}/df_{red}} \quad (18) \]

Where \(df_{full}\) and \(df_{red}\) represent the degrees of freedom of the full and reduced model, respectively.

For each single QTL, a partial \(R^{2}\) can be calculated by the difference
between the \(R^{2}\) of the full model
(all QTL positions) and the \(R^{2}\)
of the model that drops that particular QTL (difference \(R^{2}\)). ** mppR**
also calculates partial \(R^{2}\) by
comparing a model without QTLs with single locus QTL models (single
\(R^{2}\)). These estimates can also be
adjusted using formula 18. The partial \(R^{2}\) is an estimation of the individual
QTL contribution to the phenotypic variation. The difference and single
\(R^{2}\) give estimates of the lower
and upper bound explained variance by individual QTLs.

A cross-validation (CV) approach can be used to evaluate the performance of the QTL detection process and to assess the QTL effect in a pseudo-independent population (Utz, Melchinger, and Schön 2000). CV allows to assess predictability of QTL effects in data not used for model training. The proposed CV procedure is an adaptation of Utz, Melchinger, and Schön (2000)} procedure to the MPP context. A single run of CV is composed of the following steps:

**Partitioning of the dataset**. The full dataset \((\mathbf{y_{DS}}, \mathbf{X_{DS}})\) is partitioned*within cross*into \(k\) subsets. Then for the k repetitions each \(k\) subset is successively used as validation set (VS) \((\mathbf{y_{VS}}, \mathbf{X_{VS}})\), the other (\(k-1\)) subsets go into the training set (TS) \((\mathbf{y_{TS}}, \mathbf{X_{TS}})\).**Explained genetic variance in the TS**. The training set is used to detect QTLs. These QTLs allow to evaluate the proportion of explained genetic variance in the TS using \(\hat{p}_{TS} = \frac{R_{adj.TS}^{2}}{h^{2}}\) where \(R_{adj.TS}^{2}\) is the adjusted \(R^{2}\) (18) and \(h^{2}\) the heritability to be specified by the user.**Predicted genetic variance in the VS**. We can now use the estimated QTL effects in the TS (\(\mathbf{\hat{\beta}_{TS}}\)) to predict phenotypic values in the VS (\(\mathbf{\hat{y}_{VS}} = \mathbf{X_{VS}} \mathbf{\hat{\beta}_{TS}}\)). The proportion of predicted genetic variance in the VS is \(\hat{p}_{VS} = \frac{R_{VS}^{2}}{h^{2}}\), where \(R_{VS}^{2}\) is the squared Pearson correlation between the observed and predicted values. \(\hat{p}_{VS}\) is computed within cross. A measurement at the whole MPP level is obtained by calculating the weighted average of the within cross values (\(\bar{p}_{VS}\)) accounting for the cross sizes. The relative bias between \(\hat{p}_{TS}\) and \(\bar{p}_{VS}\) is \(1-(\frac{\bar{p}_{VS}}{\hat{p}_{TS}})\).

** mppR** contains functions to perform all
steps of an MPP QTL analysis, starting from the data processing to the
visualisation of results. Figure 2 shows a schematic representation of
the

`mppData`

object. The `mppData`

object can be used
in the functions: `mpp_proc()`

, `mpp_CV()`

and
`MQE_proc()`

which are generic functions to perform QTL
analyses, cross-validation, and multi-QTL effect model computations,
respectively.We included a subset of the maize US-NAM population (McMullen et al. 2009) within the
** mppR** package as example dataset to illustrate
the different functions. We introduce these data and the required format
to be used in the workflow presented in Figure 2.

`library(mppR)`

The data consist of three parts obtained from <www.panzea.org>.
`USNAM_geno`

is a random sample of the US-NAM population
including the marker information of 506 genotypes and 102 markers. The
entries include the parents and 500 recombinant inbred lines (RIL)
coming from 5 crosses between the central line B73 and the peripheral
parents CML103, CML322, CML52, Hp301, and M37W.

```
data("USNAM_geno", package = "mppR")
dim(USNAM_geno)
#> [1] 506 102
rownames(USNAM_geno)[1:6]
#> [1] "B73" "CML103" "CML322" "CML52" "Hp301" "M37W"
table(substr(rownames(USNAM_geno)[-c(1:6)], 1, 4))
#>
#> Z002 Z006 Z008 Z010 Z016
#> 100 100 100 100 100
```

Raw genotypic data used in ** mppR** must be
bi-allelic markers. The genotypic data are expected to be formatted as a

`character matrix`

, with one letter for each allele. So, when
using the ACTG coding all possible genotypes are AA, CC, AC, etc.
Missing values must always be coded `NA`

. We impose a strict
data format to make mppR functioning smoothly. Any deviation from this
format will produce an informative error message.To start the data processing, the genotypic data must be split into
offspring and parent genotype scores. These matrices represent the
`geno.off`

and `geno.par`

arguments in the
`create.mppData()`

function. For these two arguments, the
genotypes define the rows with genotype identifiers as row names and the
markers are in columns with the marker identifiers as column names. The
order of the markers must be the same as the one in the `map`

argument. The `geno.off`

genotypes list must also be in the
same order as the one of the `pheno`

argument.

```
<- USNAM_geno[7:506, ]
geno.off <- USNAM_geno[1:6, ] geno.par
```

`USNAM_map`

is a three columns genetic map with marker
indicator, chromosomes and map positions given in centi-Morgan (cM). It
has the required format for the argument `map`

in function
`create.mppData()`

. The marker identifiers must be
`character`

. The chromosomes and genetic positions must be
`numeric`

. The list of markers must be column names of the
`geno.off`

and `geno.par`

arguments.

```
data("USNAM_map", package = "mppR")
head(USNAM_map)
#> mk.names chr pos.cM
#> 1 L00411 1 0.0
#> 2 L00569 1 3.7
#> 3 L00068 1 9.7
#> 4 L01003 1 13.4
#> 5 L00196 1 15.6
#> 6 L00609 1 17.9
<- USNAM_map map
```

The file `USNAM_pheno`

is a `numeric matrix`

containing the phenotypic measurements of 500 offspring genotypes for
the trait upper leaf angle (ULA). The row names represent the genotype
identifiers. They must be identical to the row names of
`geno.off`

. `USNAM_pheno`

can be used as
`pheno`

argument for the `create.mppData()`

function. The `pheno`

argument can contain several traits. A
cross indicator `character vector`

can be formed by
subsetting the genotype names. It specifies to which cross each genotype
belongs and can be used for the `cross.ind`

argument in
`create.mppData()`

.

```
data("USNAM_pheno", package = "mppR")
head(USNAM_pheno)
#> ULA
#> Z002E0001 75
#> Z002E0002 55
#> Z002E0005 60
#> Z002E0010 70
#> Z002E0011 75
#> Z002E0012 70
<- USNAM_pheno
pheno <- substr(rownames(pheno), 1, 4) cross.ind
```

The last raw data source is provided via the
`par.per.cross`

argument. It is a three column
`character matrix`

with one row per cross specifying: 1) the
cross indicator that must be identical and appear in the same order with
the one used in `cross.ind`

; 2-3) the parents 1 and 2 of the
crosses. The parents’ identifiers must be identical to the row names of
`geno.par`

.

```
<- cbind(unique(cross.ind), rep("B73", 5),
par.per.cross rownames(geno.par)[2:6])
par.per.cross#> [,1] [,2] [,3]
#> [1,] "Z002" "B73" "CML103"
#> [2,] "Z006" "B73" "CML322"
#> [3,] "Z008" "B73" "CML52"
#> [4,] "Z010" "B73" "Hp301"
#> [5,] "Z016" "B73" "M37W"
```

The `par.per.cross`

matrix can be used in the
`design_connectivity()`

function to obtain and visualize the
connected parts. For example, using the illustration of Figure 1, we
have

```
<- cbind(paste0("c", 1:7),
ppc_ex c("PA", "PA", "PB", "PA", "PE", "PE", "PG"),
c("PB", "PC", "PC", "PD", "PF", "PG", "PF"))
design_connectivity(ppc_ex)
```

```
#> $`1`
#> [1] "PA" "PB" "PC" "PD"
#>
#> $`2`
#> [1] "PE" "PF" "PG"
```

For the next part of the illustration, we assume that the
`geno.off`

, `geno.par`

, `map`

,
`pheno`

, `cross.ind`

, and
`par.per.cross`

objects are loaded in the global
environment.

The initial step is the processing of the raw data to gather all
required data in a single `mppData`

object. The functions
`create.mppData()`

, `QC.mppData()`

,
`IBS.mppData()`

, `IBD.mppData()`

, and
`parent_cluster.mppData()`

must be called in the defined
sequence to form a complete `mppData`

object. Any deviation
from this sequence will be signaled by an error message.

`mppData`

objectThe function `create.mppData()`

creates a unified
`mppData`

object containing all raw data sources.

```
<- create.mppData(geno.off = geno.off, geno.par = geno.par,
mppData map = map, pheno = pheno, cross.ind = cross.ind,
par.per.cross = par.per.cross)
#>
#> mppData object created!
#>
#> 500 genotypes
#> 5 crosses
#> 6 parents
#> 1 phenotype(s)
#> 1 connected part
```

Before QTL analysis, the raw data should go through a quality control
(QC). This procedure will ensure that marker format is correct and that
markers are informative, meaning that their segregation rate is
sufficient to provide a reliable basis to investigate and estimate the
QTL effects. The function `QC.mppData()`

performs a default
QC.

The user should be aware that it is difficult to propose general
settings for QC that will be suitable for all MPPs. Moreover, the type
of model fitted should also guide the QC. For example, if the user wants
to give more emphasis to the cross-specific model, the QC should ensure
that there is enough within cross segregation. On the other hand, the
bi-allelic model assumes that the QTLs segregate in the whole
population. Therefore for this model, minimum segregation can be
evaluated at the whole population level. The procedure implemented in
`QC.mppData()`

is the following:

Remove markers with more than two alleles.

Remove markers that are monomorphic or fully missing in the parents.

Remove markers with a missing rate higher than

`mk.miss`

across the entire MPP.Remove genotypes with more missing markers than

`gen.miss`

.Remove crosses with less than

`n.lim`

genotypes.Keep only the most polymorphic marker when multiple markers map at the same position.

Filter markers based on minor allele frequency (MAF). Different options are possible:

The first one filter marker based on MAF at the whole population level, and/or on MAF within crosses. The markers with a MAF below a threshold given by

`MAF.pop.lim`

at the whole population level will be discarded. The user can specify the critical values for MAF within cross using`MAF.cr.lim`

. By default, the within cross MAF values are defined by the following function of the cross-size \(N_{c}\): \(MAF(N_{c}) = 0.5\) if \(N_{c} \in [0, 10]\) and \(MAF(N_{c}) = (4.5/N_{c}) + 0.05\) if \(N_{c} > 10\). This means that up to \(10\) genotypes, the critical within cross MAF is set to \(50\%\). Then it decreases when the number of genotype increases until \(5\%\) set as a lower bound. If the within cross MAF is below the limit in at least one cross, then marker scores of the problematic cross are either put as missing (`MAF.cr.miss = TRUE`

) or the whole marker is discarded (`MAF.cr.miss = FALSE`

). By default,`MAF.cr.miss = TRUE`

, which allows to include a larger number of markers and to cover a wider genetic diversity.An alternative is to select only markers that segregate in at least one cross at the

`MAF.cr.lim2`

rate.

```
<- QC.mppData(mppData = mppData, n.lim = 15, MAF.pop.lim = 0.05,
mppData MAF.cr.miss = TRUE, mk.miss = 0.1,
gen.miss = 0.25, verbose = TRUE)
#>
#> Check genotyping error : 0 markers removed
#> Remove monomorphic/missing marker in parents : 2 markers removed
#> Remove marker with missing rate > 0.1 : 2 markers removed
#> Remove genotype with missing rate > 0.25 : 2 genotypes removed
#> Remove crosses with less than 15 observations : 0 genotypes removed
#> Remove markers at the same position : 0 markers removed
#> Remove markers with MAF < 0.05 : 0 markers removed
#>
#> End : 98 marker(s) remain after the check
#> 498 genotypes(s) remain after the check
```

To compute a bi-allelic model, the user needs to convert genotype
data into IBS format using `IBS.mppData()`

. This function
transforms genotype scores into 0, 1, 2 coding where the score
represents the number of minor allele copies.

`<- IBS.mppData(mppData = mppData) mppData `

The other models (cross-specific, parental, and ancestral) use IBD
probabilities. The function `IBD.mppData()`

estimates IBD
probabilities after converting the marker genotype data into within
cross ABH format. For each cross, the maker scores of the two cross
parents are used as reference. Homozygous offspring genotype scores
similar to parent 1 get score “A” and “B” for parent 2. Heterozygous
genotypes are scored “H”. If at least one of the parents is missing or
the parents are monomorphic, the offspring will receive `NA`

.
The regular ABH conversion assumes that the reference parent scores are
fully homozygous or missing. However, if some parent marker scores are
heterozygous, the ABH conversion can still be performed setting argument
`het.miss.par = TRUE`

. In that case, when a parent score is
heterozygous or missing and the other parent is homozygous, the function
will try to infer the score of the allele that was transmitted by the
heterozygous or the missing parent looking at the segregation pattern.
Then the computation of the IBD probabilities is done by calling the
function `calc.genoprob()`

of the
** R**/

The type of population must be specified in argument
`type`

. Different population types are possible: F-type
(“F”), back-cross (“BC”), backcross followed by selfing (“BCsFt”),
double haploid (“DH”), and recombinant imbred lines (“RIL”). The number
of F and BC generations can be specified using `F.gen`

and
`BC.gen`

. The agument `type.mating`

specifies if F
and RIL populations were obtained by selfing or by sibling mating. DH
and RIL populations are read as back-cross by R/qtl. For these two
population types, heterozygous scores will be treated as missing
values.

```
<- IBD.mppData(mppData = mppData, het.miss.par = TRUE, type = 'RIL',
mppData type.mating = 'selfing')
#> --Read the following data:
#> 498 individuals
#> 98 markers
#> 1 phenotypes
#> --Cross type: bc
```

The final optional step of data processing is to provide information
about parents genetic similarity using the function
`parent_cluster.mppData()`

. Parental clustering information
is necessary to calculate the ancestral model. If the parent clustering
is skipped, the other models (cross-specific, parental, bi-allelic) can
still be computed. The parent clustering information is added to the
data via the argument `par.clu`

. `par.clu`

must be
an `interger matrix`

where the columns represent the parental
lines and the rows the markers. The columns names must be the same as
the parents list of the `mppData`

object. The rownames must
be the same as the map marker list of the `mppData`

object.
At a particular position, parents with the same value are assumed to
inherit from the same ancestor.

```
data("par_clu", package = "mppR")
<- parent_cluster.mppData(mppData = mppData, par.clu = par_clu) mppData
```

The parental clustering operation can be realized calling the
** R** package

`parent_cluster.mppData()`

(`method = "clusthaplo"`

). Since
`parent_cluster.mppData()`

performs the clustering using the
`"threshold"`

method. `window`

specifies the size
of the window around the marker. The marker falling into this window
will be used for the local clustering. The value specified in argument
`K`

is the minimum number of markers that should be present
in the computation window. Below `K`

,
`plot = TRUE`

. In that case, the plots will be saved in
a folder at the location specified in `plot.loc`

. The funcion
store the average number of ancetral group along the genome in
`mppData$n.anc`

. The user also needs to install
```
library("clusthaplo")
set.seed(68769)
<- parent_cluster.mppData(mppData = mppData, method = "clusthaplo",
mppData K = 10, window = 25, plot = FALSE)
$n.anc mppData
```

A summary of the `mppData`

objects can be obtained calling
the generic function `summary()`

.

```
summary(mppData)
#> object of class 'mppData'
#>
#> Type of population: Recombinant inbred line by selfing
#>
#> No. Genotypes: 498
#>
#> Crosses Z002 Z006 Z008 Z010 Z016
#> Parent 1 B73 B73 B73 B73 B73
#> Parent 2 CML103 CML322 CML52 Hp301 M37W
#> N 100 100 98 100 100
#>
#> Phenotype(s): ULA
#> Percent phenotyped: 100
#>
#> Total marker: 98
#> No. markers: 56 42
```

Three functions allow to manipulate and modify the
`mppData objects`

: `subset.mppData()`

,
`mppData_mdf_pheno()`

and
`mppData_add_pheno()`

.

Subsets from `mppData`

objects can be obtained using the
generic fuction `subset()`

. The `mppData`

objects
can be subsetted by markers (`mk.list`

) and/or by genotypes
(`gen.list`

).

```
<- subset(x = mppData, mk.list = mppData$map[, 2] == 1,
mppData_sub gen.list = sample(mppData$geno.id, 200))
```

It is possible to change the phenotypic values of an
`mppData`

objects using the `mppData_mdf_pheno()`

and the argument `pheno`

. `pheno`

must have the
same columns (names) as the phenotype data of the modified
`mppData`

object. It must also have the same genotype list.
Only the trait values change. To add a new phenotype you can use the
function `mppData_add_pheno()`

with the extra traits
contained in the `pheno`

argument.

When all data elements are ready, the user can start the QTL
analysis. In all functions involving the computation of a QTL model
(`mpp_CIM()`

, `mpp_CV()`

, `mpp_perm()`

,
`mpp_proc()`

, `mpp_SIM()`

,
`MQE_proc()`

, `QTL_gen_effects()`

, etc.), the
arguments `Q.eff`

and `VCOV`

describe the type of
QTL effect and the form of the VCOV. `Q.eff`

takes the values
“cr”, “par”, “anc”, “biall” for the cross-specific, parental, ancestral,
and bi-allelic model, respectively. `VCOV`

takes the values
“h.err” or “h.err.as” for the HRT model, “cr.err” for the CSRT model,
“pedigree” for the pedigree model with HRT, and “ped_cr.err” for the
pedigree model with CSRT. A complete QTL analysis can be performed using
the generic function `mpp_proc()`

. This is a wrapping
function for individual functions, indicated in parenthesis, performing
each a part of the following QTL detection procedure:

SIM scan to select cofactors (

`mpp_SIM()`

). In that case, we fit an ancestral model (`Q.eff = "anc"`

) with HRT residual term (`VCOV = "h.err"`

).Cofactor selection with SIM \(-log10(p-value)\) above the

`threshold`

value with a minimum distance (`win.cof`

) between selected positions (`QTL_select()`

). The cofactor selection procedure is done per chromosome. It first selects the most significant position and then removes the positions in the neighbourhood of the selected position from the candidate list of QTLs/cofactors. The process continues until no position is significant anymore. The threshold (`thre.cof`

or`thre.QTL`

) values can be determined by permutation using`mpp_perm()`

.CIM scan (

`mpp_CIM()`

) using the selected cofactors except within an exclusion window (`window`

) around the selected cofactors where they are excluded from the model. It is possible to perform several consecutive run of CIM using`N.cim`

.Selection of QTL candidates with CIM \(-log10(p-value)\) above

`thre.QTL`

and a minimum distance (`win.QTL`

) between the selected positions (`QTL_select()`

). The selection procedure is the same as for the cofactors.If

`backward = TRUE`

, backward elimination on the list of selected QTL positions (`mpp_back_elim()`

).Estimation of the QTLs genetic effects (

`QTL_gen_effects()`

), the global and partial QTLs \(R^{2}\) (`QTL_R2()`

).If

`CI = TRUE`

, computation of QTL confidence intervals from a CIM- profile (excluding cofactors on the scanned chromosome). The confidence interval is based on a \(-log10(p-value)\) drop-off value (`drop`

).plot of the \(-log10(p-value)\) CIM QTL profile (

`plot.QTLprof()`

) and if`plot.gen.eff = TRUE`

, visualisation of the genome-wide significance of the QTL effect per cross or per parent.

```
<- mpp_proc(pop.name = "USNAM", trait.name = "ULA", trait = "ULA",
QTL_proc mppData = mppData, Q.eff = "anc",
plot.gen.eff = TRUE, N.cim = 2, thre.cof = 3,
win.cof = 20, window = 20, thre.QTL = 3,
win.QTL = 20, CI = TRUE, drop = 1.5,
verbose = FALSE, output.loc = tempdir())
```

The results of `mpp_proc()`

are returned as a list of
** R** objects. These results are also saved in
different files at the location specified in argument

`output.loc`

. The created folder contains a report
(QTL_REPORT.txt) with a summary of results such as the number of
detected QTL, the global \(R^{2}\), and
for each QTL the estimated genetic effects per cross or parent.A multi-QTL effects model can be determined using
`MQE_proc()`

. The user has to specify the types of tested QTL
effects in the argument `Q.eff`

. The `window`

argument specifies the distance on both sides of an already detected QTL
position where the search will be forbidden. A backward elimination on
the final list of detected QTLs can be performed using
(`backward = TRUE`

). The results of the last MQE CIM run can
be plotted using the function `MQE_plot()`

. This
`MQE_plot()`

will colour the QTL positions corresponding to
the type of QTL effect assumed at the position. This will be
automatically done by `MQE_proc()`

if
`plot.MQE = TRUE`

. The plot (plot_MQE.pdf) will be saved with
the other results at `output.loc`

.

```
<- MQE_proc(pop.name = "USNAM", trait.name = "ULA", mppData = mppData,
MQE Q.eff = c("par", "anc", "biall"), window = 20,
plot.MQE = TRUE, verbose = FALSE, output.loc = tempdir())
```

Once a list of QTL candidates has been determined, it is possible to
estimate the QTL effect per cross or per parents (parental, ancestral,
and bi-allelic model) using the function `QTL_gen_effects()`

.
For the cross-specific model (`Q.eff = "cr"`

), the effects
are given in absolute value and represent the substitution effect of one
allele copy from the parent increasing the trait.

For the parental and ancestral models (`Q.eff = "par"`

or
`Q.eff = "anc"`

, the QTL effects are given per parents and
must be interpreted as deviation with respect to most frequent allele
within the connected part set as reference. For the parental and
ancestral models, the parental alleles are listed from the most (top) to
the least frequent (bottom). For the ancestral model, parents with the
same score correspond to the same ancestral allele according to
** clusthaplo** results. In a NAM population, all
crosses are connected via the central parent (B73), which is the most
frequent allele, and was therefore set as reference.

```
<- mpp_SIM(mppData = mppData, Q.eff = "anc")
SIM <- QTL_select(Qprof = SIM)
cofactors <- mpp_CIM(mppData = mppData, Q.eff = "anc", cofactors = cofactors,
CIM plot.gen.eff = TRUE)
<- QTL_select(Qprof = CIM)
QTL <- QTL_gen_effects(mppData = mppData, QTL = QTL, Q.eff = "anc")
gen.eff
summary(gen.eff, QTL = 1)
#> QTL effects
#> ***********
#>
#> Number of QTL(s): 1
#>
#>
#> QTL 1
#> -----
#>
#> mk.names chr pos.cM
#> L00929 1 43.2
#>
#>
#> QTL effect per cross or parent:
#>
#> Effect Std.Err t-test p-value Sign Con.part Par.all
#> B73 0.000000 0.0000000 0.000000 1.000000e+00 c1 AA
#> CML103 -1.636412 0.7505413 -2.180310 2.971851e-02 * c1 CC
#> CML52 -1.636412 0.7505413 -2.180310 2.971851e-02 * c1 CC
#> Hp301 -2.688274 0.7321971 -3.671517 2.681344e-04 *** c1 CC
#> M37W -2.688274 0.7321971 -3.671517 2.681344e-04 *** c1 CC
#> CML322 -4.864590 1.0477228 -4.643012 4.435899e-06 *** c1 CC
```

For the bi-allelic model, the estimated genetic effect represents the additive effect of the minor allele with respect to the most frequent one, the latter set as reference. When parental genotype information is given, the results are given for each parent by multiplying the allele additive effect by the number of parent allele copies.

A QTL profile can be obtained by passing the `mpp_SIM()`

or `mpp_CIM()`

results to the `x`

argument of the
function `plot.QTLprof()`

. QTL or cofactors positions can
also be plotted on the graph (dashed lines) using the argument
`QTL`

.

`plot(x = CIM, QTL = QTL, type = "l")`

`Mpp_SIM()`

or `mpp_CIM()`

results obtained
with `plot.gen.eff = TRUE`

can also be passed to the function
`plot.QTLprof()`

with argument `gen.eff = TRUE`

to
obtain a visualisation of the genetic effect distribution along the
genome. Once again, the positions passed to the `QTL`

argument will be drawn on the graph.

`plot(x = CIM, gen.eff = TRUE, mppData = mppData, QTL = QTL, Q.eff = "anc")`

The interpretation of the genetic effect plot depends on the type of QTL effects. For a cross-specific model, blue colour means that the allele coming from parent 1(A) increases the trait value, red means that allele coming from parent 2(B) increases the trait.

For the parental and ancestral models, the effect must be interpreted as deviation with respect to the reference within a connected part. The reference allele is always defined as the most frequent allele. It can change at different positions. Therefore, it is not possible to establish a unique reference allele for the whole genome. The parental alleles significances are assessed per connected part. The blue (red) colour means that the parental allele decrease (increase) the trait value. Parents are ordered from the top to the bottom given the number of times their allele was set as reference in the whole genome. For example the upper parent was the one for which its allele was the highest number of times set as reference in the whole genome. These plots should be interpreted as a rough indication of signal distribution.

The cross-validation procedure can be performed by the function
`mpp_CV()`

. The arguments `Rep`

and `k`

represent the number of repetitions of the k-fold procedure. The
heritability, allowing to express \(R^{2}\) in terms of percentage of the
explained genotypic variance, can be specified in `her`

. By
default `her = 1`

, therefore the results are expressed in
terms of phenotypic variation. The results of the CV procedure will be
saved in a folder at `output.loc`

.

```
set.seed(89341)
<- mpp_CV(pop.name = "USNAM", trait.name = "ULA", mppData = mppData,
CV Q.eff = "cr", her = 0.4, Rep = 2, k = 5, verbose = FALSE,
output.loc = tempdir())
```

All functions involving genome scan(s) (`mpp_perm()`

,
`mpp_SIM()`

, `mpp_CIM()`

, `mpp_proc()`

,
`mpp_CV()`

or `MQE_proc()`

) can be executed in
parallel for the HRT model only (`VCOV = h.err`

). The number
of cores can be specified using `n.cores`

. Parallelization is
done using functions from the ** parallel**
library.

The QTL analyses performed in ** mppR** are all
based on “exact” (complete REML computation) mixed model computation at
the tested position. We do not use approximation. This choice makes
operations requiring the computation of multiple genome scans by mixed
models very long. The realisation of permutation tests
(

`mpp_perm()`

), cross-validation (`mpp_CV()`

) or
the determination of an MQE model (`MQE_proc()`

) by mixed
models (all models with `VCOV`

different than
`"h.err"`

) is technically possible but can become cumbersone
for modestly sized datasets (e.g., 500 genotypes and 1000 markers). From
our estimation, mixed model calculations can take 20 to 50 times longer
than those for comparable linear models.** mppR** is a package for QTL analysis in
multi-parent populations working in the

The development of ** mppR** is part of a
project financed by KWS SAAT SE.

Aulchenko, Yurii S, Stephan Ripke, Aaron Isaacs, and Cornelia M Van
Duijn. 2007. “GenABEL: An r Library for Genome-Wide Association
Analysis.” *Bioinformatics* 23 (10): 1294–96.

Bardol, N, M Ventelon, B Mangin, S Jasson, V Loywick, F Couton, C Derue,
P Blanchard, A Charcosset, and Laurence Moreau. 2013. “Combined
Linkage and Linkage Disequilibrium QTL Mapping in Multiple Families of
Maize (Zea Mays l.) Line Crosses Highlights Complementarities Between
Models Based on Parental Haplotype and Single Locus
Polymorphism.” *Theoretical and Applied Genetics* 126
(11): 2717–36.

Blanc, G, A Charcosset, B Mangin, A Gallais, and L Moreau. 2006.
“Connected Populations for Detecting Quantitative Trait Loci and
Testing for Epistasis: An Application in Maize.” *Theoretical
and Applied Genetics* 113 (2): 206–24.

Bradbury, Peter J, Zhiwu Zhang, Dallas E Kroon, Terry M Casstevens,
Yogesh Ramdoss, and Edward S Buckler. 2007. “TASSEL: Software for
Association Mapping of Complex Traits in Diverse Samples.”
*Bioinformatics* 23 (19): 2633–35. https://www.maizegenetics.net/tassel.

Broman, Karl W, Hao Wu, Śaunak Sen, and Gary A Churchill. 2003.
“R/Qtl: QTL Mapping in Experimental Crosses.”
*Bioinformatics* 19 (7): 889–90. https://rqtl.org/.

Butler, DG, Brian R Cullis, AR Gilmour, and BJ Gogel. 2009.
“ASReml-r Reference Manual.” *The State of Queensland,
Department of Primary Industries and Fisheries, Brisbane*. https://vsni.co.uk/software/asreml/.

Cavanagh, Colin, Matthew Morell, Ian Mackay, and Wayne Powell. 2008.
“From Mutations to MAGIC: Resources for Gene Discovery, Validation
and Delivery in Crop Plants.” *Current Opinion in Plant
Biology* 11 (2): 215–21.

Churchill, Gary A, and Rebecca W Doerge. 1994. “Empirical
Threshold Values for Quantitative Trait Mapping.”
*Genetics* 138 (3): 963–71.

Covarrubias-Pazaran, G. 2016. “Genome Assisted Prediction of
Quantitative Traits Using the r Package Sommer.” *PLoS
ONE* 11: 1–15. https://CRAN.R-project.org/package=sommer.

Doerge, Rebecca W. 2002. “Mapping and Analysis of Quantitative
Trait Loci in Experimental Populations.” *Nature Reviews
Genetics* 3 (1): 43–52.

Eeuwijk, Fred A van, Marco CAM Bink, Karine Chenu, and Scott C Chapman.
2010. “Detection and Use of QTL for Complex Traits in Multiple
Environments.” *Current Opinion in Plant Biology* 13 (2):
193–205.

Garin, Vincent, Valentin Wimmer, Sofiane Mezmouk, Marcos Malosetti, and
Fred van Eeuwijk. 2017. “How Do the Type of QTL Effect and the
Form of the Residual Term Influence QTL Detection in Multi-Parent
Populations? A Case Study in the Maize EU-NAM Population.”
*Theoretical and Applied Genetics* 130 (8): 1753–64.

Giraud, Heloise, Christina Lehermeier, Eva Bauer, Matthieu Falque,
Vincent Segura, Cyril Bauland, Christian Camisan, et al. 2014.
“Linkage Disequilibrium with Linkage Analysis of Multiline Crosses
Reveals Different Multiallelic QTL for Hybrid Performance in the Flint
and Dent Heterotic Groups of Maize.” *Genetics* 198 (4):
1717–34.

Jansen, Ritsert C, Jean-Luc Jannink, and William D Beavis. 2003.
“Mapping Quantitative Trait Loci in Plant Breeding
Populations.” *Crop Science* 43 (3): 829–34.

Jourjon, Marie-Françoise, Sylvain Jasson, Jacques Marcel, Baba Ngom, and
Brigitte Mangin. 2005. “MCQTL: Multi-Allelic QTL Mapping in
Multi-Cross Design.” *Bioinformatics* 21 (1): 128–30. https://carlit.toulouse.inra.fr/MCQTL/.

Kang, Hyun Min, Noah A Zaitlen, Claire M Wade, Andrew Kirby, David
Heckerman, Mark J Daly, and Eleazar Eskin. 2008. “Efficient
Control of Population Structure in Model Organism Association
Mapping.” *Genetics* 178 (3): 1709–23.

Leroux, Damien, Abdelaziz Rahmani, Sylvain Jasson, Marjolaine Ventelon,
Florence Louis, Laurence Moreau, and Brigitte Mangin. 2014.
“Clusthaplo: A Plug-in for MCQTL to Enhance QTL Detection Using
Ancestral Alleles in Multi-Cross Design.” *Theoretical and
Applied Genetics* 127 (4): 921–33. https://cran.r-project.org/src/contrib/Archive/clusthaplo/.

Li, Renhua, Malcolm A Lyons, Henning Wittenburg, Beverly Paigen, and
Gary A Churchill. 2005. “Combining Data from Multiple Inbred Line
Crosses Improves the Power and Resolution of Quantitative Trait Loci
Mapping.” *Genetics* 169 (3): 1699–709.

Lipka, Alexander E, Feng Tian, Qishan Wang, Jason Peiffer, Meng Li,
Peter J Bradbury, Michael A Gore, Edward S Buckler, and Zhiwu Zhang.
2012. “GAPIT: Genome Association and Prediction Integrated
Tool.” *Bioinformatics* 28 (18): 2397–99. https://www.maizegenetics.net/gapit.

Luo, Z et al. 1992. “Computing Inbreeding Coefficients in Large
Populations.” *Genetics Selection Evolution* 24 (4):
305–13.

McMullen, Michael D, Stephen Kresovich, Hector Sanchez Villeda, Peter
Bradbury, Huihui Li, Qi Sun, Sherry Flint-Garcia, et al. 2009.
“Genetic Properties of the Maize Nested Association Mapping
Population.” *Science* 325 (5941): 737–40.

Myles, Sean, Jason Peiffer, Patrick J Brown, Elhan S Ersoz, Zhiwu Zhang,
Denise E Costich, and Edward S Buckler. 2009. “Association
Mapping: Critical Considerations Shift from Genotyping to Experimental
Design.” *The Plant Cell* 21 (8): 2194–2202.

R Core Team. 2016. *R: A Language and Environment for Statistical
Computing*. Vienna, Austria: R Foundation for Statistical Computing.
https://www.R-project.org/.

Rebaï, Ahmed, and Bruno Goffinet. 1993. “Power of Tests for QTL
Detection Using Replicated Progenies Derived from a Diallel
Cross.” *Theoretical and Applied Genetics* 86 (8):
1014–22.

———. 2000. “More about Quantitative Trait Locus Mapping with
Diallel Designs.” *Genetical Research* 75 (02): 243–47.

Rincent, Renaud, Laurence Moreau, Hervé Monod, Estelle Kuhn, Albrecht E
Melchinger, Rosa A Malvar, Jesus Moreno-Gonzalez, et al. 2014.
“Recovering Power in Association Mapping Panels with Variable
Levels of Linkage Disequilibrium.” *Genetics* 197 (1):
375–87.

Utz, H Friedrich, Albrecht E Melchinger, and Chris C Schön. 2000.
“Bias and Sampling Error of the Estimated Proportion of Genotypic
Variance Explained by Quantitative Trait Toci Determined from
Experimental Data in Maize Using Cross Validation and Validation with
Independent Samples.” *Genetics* 154 (4): 1839–49.

Varshney, Rajeev K, Vikas K Singh, John M Hickey, Xu Xun, David F
Marshall, Jun Wang, David Edwards, and Jean-Marcel Ribaut. 2015.
“Analytical and Decision Support Tools for Genomics-Assisted
Breeding.” *Trends in Plant Science*.

Wald, Abraham. 1943. “Tests of Statistical Hypotheses Concerning
Several Parameters When the Number of Observations Is Large.”
*Transactions of the American Mathematical Society* 54 (3):
426–82.

Weeks, David L, and Donald R Williams. 1964. “A Note on the
Determination of Connectedness in an n-Way Cross Classification.”
*Technometrics* 6 (3): 319–24.

Würschum, T, W Liu, M Gowda, HP Maurer, S Fischer, A Schechert, and JC
Reif. 2012. “Comparison of Biometrical Models for Joint Linkage
Association Mapping.” *Heredity* 108 (3): 332–40.

Xavier, Alencar, Shizhong Xu, William Muir, and Katy Rainey. 2015.
“NAM: Association Studies in Multiple Populations.”
*Bioinformatics*, btv448. https://CRAN.R-project.org/package=NAM.

Yu, Jianming, James B Holland, Michael D McMullen, and Edward S Buckler.
2008. “Genetic Design and Statistical Power of Nested Association
Mapping in Maize.” *Genetics* 178 (1): 539–51.

Yu, Jianming, Gael Pressoir, William H Briggs, Irie Vroh Bi, Masanori
Yamasaki, John F Doebley, Michael D McMullen, et al. 2006. “A
Unified Mixed-Model Method for Association Mapping That Accounts for
Multiple Levels of Relatedness.” *Nature Genetics* 38 (2):
203–8.

Zeng, Zhao-Bang. 1993. “Theoretical Basis for Separation of
Multiple Linked Gene Effects in Mapping Quantitative Trait Loci.”
*Proceedings of the National Academy of Sciences* 90 (23):
10972–76.