The goal of protHMM is to help integrate profile hidden Markov model (HMM) representations of proteins into machine learning and bioinformatics workflows. protHMM ports a number of features from use with Position Specific Scoring Matrices (PSSMs) to HMMs, along with implementing features used specifically with HMMs, which to our knowledge has not been done before. The adoption of HMM representations of proteins derived from HHblits and HMMER also presents an opportunity for innovation; it has been shown that HMMs benefit from better multiple sequence alignment than PSSMs and can thus achieve better results than corresponding PSSMs under similar feature extraction techniques (Lyons et al. 2015). protHMM implements 20 different feature extraction techniques to provide a comprehensive set of features for use in bioinformatics tasks ranging from protein fold classification to protein-protein interaction prediction.
library(protHMM)
protHMM implements normalized Moreau-Broto, Moran and Geary autocorrelation descriptors, each of which measures the correlation between two amino acid residues separated by a lag value (\(lg\)) along the sequence. Each of these features returns a vector of length \(20 \times lg\). Liang, Liu, and Zhang (2015) provide a mathematical representation for each of these features, having used them to predict protein structural class:
\({N_{lg}^j = \frac{1} {L-lg} \sum_{i = 1} ^ {L-lg} H_{i, j} \times H_{i+lg, j}, \space (j = 1,2...20, \space 0 < lg < L)}\)
In which \({N_{lg}^j}\) is the Moreau-Broto correlation factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence and \(H\) represents the HMM matrix.
\({N_{lg}^j = \frac{\frac{1} {L-lg} \sum_{i = 1} ^ {L-lg} (H_{i, j} - \bar{H_j}) \times (H_{i+lg, j} - \bar{H_j})}{\frac {1} {L} \sum_{i = 1} ^ {L} (H_{i, j} - \bar{H_j})^2}, \space (j = 1,2...20, \space 0 < lg < L)}\)
In which \({N_{lg}^j}\) is the Moran correlation factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence, \(\bar{H_j}\) represents the average of column \(j\) in matrix \(H\) and \(H\) represents the HMM matrix.
\({N_{lg}^j = \frac{\frac{1} {2(L-lg)} \sum_{i = 1} ^ {L-lg} (H_{i, j} - H_{i+lg, j})^2}{\frac {1} {L-1} \sum_{i = 1} ^ {L} (H_{i, j} - \bar{H_j})^2}, \space (j = 1,2...20, \space 0 < lg < L)}\)
In which \({N_{lg}^j}\) is the Geary correlation factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence and \(H\) represents the HMM matrix.
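As a concrete illustration, the Moreau-Broto factor at a single lag can be computed directly from the formula above. This is a minimal sketch using the probability matrix from hmm_read() (documented at the end of this page) and an arbitrary lag of 4; hmm_MB() itself computes values over every lag up to \(lg\).

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
L <- nrow(H)
lg <- 4 # arbitrary lag, for illustration only
# Moreau-Broto factor for each of the 20 columns at lag lg, per the formula above
N_mb <- sapply(1:20, function(j) sum(H[1:(L - lg), j] * H[(1 + lg):L, j]) / (L - lg))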
library(protHMM)
## basic example code
h_MB <- hmm_MB(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_MA <- hmm_MA(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_GA <- hmm_GA(system.file("extdata", "1DLHA2-7", package="protHMM"))
head(h_MB, 20)
#> [1] 0.0027399767 0.0028318487 0.0030151714 0.0027021756 0.0029396909
#> [6] 0.0028727712 0.0030186241 0.0028945067 0.0029009803 0.0006026105
#> [11] 0.0003812977 0.0004823361 0.0003340628 0.0005073259 0.0003092559
#> [16] 0.0002997833 0.0010281865 0.0004959552 0.0023601628 0.0021576905
head(h_MA, 20)
#> [1] -0.10731521 -0.08432884 0.03573254 -0.15878664 0.03660993 -0.02897298
#> [7] 0.05583864 -0.02260616 -0.01659614 -0.00317749 -0.02603402 -0.01744969
#> [13] -0.03281501 -0.01579394 -0.03647018 -0.03876510 0.03093880 -0.02196930
#> [19] 0.31531271 0.17506839
head(h_GA, 20)
#> [1] 1.0866849 1.0541761 0.9300136 1.1177010 0.9217161 0.9864679 0.9015713
#> [8] 0.9811766 0.9780934 1.0025380 1.0347499 1.0360329 1.0615057 1.0546791
#> [15] 1.0855622 1.0984717 1.0403336 1.1041239 0.6773725 0.8157369
protHMM implements two covariance-based features, autocovariance (AC) and cross covariance (CC). These features calculate the covariance between two residues separated by a lag value \(lg\) along the sequence; AC calculates this for positions in the same column of the HMM, and CC for positions in different columns. It should also be noted that for these features, the HMM matrix is not converted to probabilities, unlike for most other features. Both features were used for protein fold recognition.
Dong, Zhou, and Guan (2009) provide a mathematical representation of autocovariance:
\(AC(j, lg) = \sum_{i = 1}^{L-lg} \frac{(H_{i, j} - \bar{H_j}) \times (H_{i+lg, j} - \bar{H_j})}{L-lg}, \space j = 1,2..20\)
In which \(AC(j,\space lg)\) is the autocovariance factor for column \(j\), \(lg\) is the lag value, \(L\) is the length of the protein sequence, \(\bar{H_j}\) represents the average of column \(j\) in matrix \(H\) and \(H\) represents the HMM matrix.
Dong, Zhou, and Guan (2009) also provide a mathematical representation of cross covariance:
\(CC(j_1, j_2, lg) = \sum_{i = 1}^{L-lg} \frac{(H_{i, j_1} - \bar{H_{j_1}}) \times (H_{i+lg, j_2} - \bar{H_{j_2}})}{L-lg}, \space j_1,j_2 = 1, 2, 3...20, \space j_1 \neq j_2\)
In which \(CC(j_1,\space j_2,\space lg)\) is the cross covariance factor for columns \(j_1, \space j_2\), \(lg\) is the lag value, \(L\) is the length of the protein sequence, \(\bar{H_{j_i}}\) represents the average of column \(j_i\) in matrix \(H\) and \(H\) represents the HMM matrix.
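A minimal sketch of both covariances at a single lag, computed directly from the formulas above. Note that hmm_ac() and hmm_cc() work on the raw HMM scores rather than probabilities, so using hmm_read() here is purely illustrative and will not reproduce their output.

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
L <- nrow(H)
lg <- 2 # arbitrary lag, for illustration only
# autocovariance of each column j at lag lg
ac <- sapply(1:20, function(j) {
  m <- mean(H[, j])
  sum((H[1:(L - lg), j] - m) * (H[(1 + lg):L, j] - m)) / (L - lg)
})
# cross covariance for one pair of distinct columns, j1 = 1 and j2 = 2
cc_12 <- sum((H[1:(L - lg), 1] - mean(H[, 1])) *
             (H[(1 + lg):L, 2] - mean(H[, 2]))) / (L - lg)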
library(protHMM)
## basic example code
h_AC <- hmm_ac(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_CC <- hmm_cc(system.file("extdata", "1DLHA2-7", package="protHMM"))
head(h_AC, 20)
#> [1] 2027588 2048277 2069394 2090950 8161072 8244348 8329341 8416105 3815863
#> [10] 3854800 3894540 3935108 2055989 2076968 2098380 2120238 4671700 4719371
#> [19] 4768024 4817691
head(h_CC, 20)
#> [1] 432613.90 607445.71 679287.41 527806.56 446405.28 989099.80 188119.73
#> [8] 479781.58 547044.82 29788.07 892089.22 946319.61 520382.72 820507.43
#> [15] 297221.83 194817.35 309185.70 286636.41 873835.20 -23142.80
The bigrams and trigrams features are outlined by Lyons et al. (2015); they capture the likelihood of 2 (bi) or 3 (tri) amino acids occurring consecutively in the sequence. They thus take shape as a 20 x 20 matrix for bigrams and a 20 x 20 x 20 array for trigrams, which are then flattened to create vectors of length 400 and 8000 respectively. These features were used by Lyons et al. (2015) for protein fold recognition.
Lyons et al. (2015) outlines bigrams mathematically as \(B[i, j]\), where:
\({B[i, j] = \sum_{a = 1}^{L-1} H_{a, i}H_{a+1, j}}\)
In which \({H}\) corresponds to the original HMM matrix and \({L}\) is the number of rows in \({H}\).
Lyons et al. (2015) outlines trigrams mathematically as \(B[i, j, k]\), where:
\({B[i, j, k] = \sum_{a = 1}^{L-2} H_{a, i}H_{a+1, j}H_{a+2, k}}\)
In which \({H}\) corresponds to the original HMM matrix and \({L}\) is the number of rows in \({H}\).
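The bigram sum collapses to a single matrix product, which the sketch below exploits; it assumes the probability matrix from hmm_read() as input.

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
L <- nrow(H)
# B[i, j] = sum over a of H[a, i] * H[a+1, j], written as a 20 x 20 matrix product
B <- t(H[1:(L - 1), ]) %*% H[2:L, ]
bigram_vec <- as.vector(B) # flattened to length 400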
library(protHMM)
## basic example code
h_tri <- hmm_trigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
h_bi <- hmm_bigrams(system.file("extdata", "1DLHA2-7", package="protHMM"))
head(h_tri, 20)
#> [1] 0.005352191 0.011975355 0.019514992 0.008968399 0.018890547 0.008317675
#> [7] 0.013992176 0.016166307 0.022135295 0.004365804 0.013618584 0.020090544
#> [13] 0.008747969 0.012091065 0.024533920 0.019144562 0.017654663 0.006009293
#> [19] 0.006042909 0.011424255
head(h_bi, 20)
#> [1] 0.27125769 0.11864240 0.22349757 0.35406729 0.17926196 0.31284113
#> [7] 0.15635251 0.24893749 0.31070932 0.43189681 0.08388694 0.26433238
#> [13] 0.39767028 0.19981994 0.24877376 0.51622764 0.38255083 0.36514895
#> [19] 0.12440151 0.13735055
The CHMM feature begins by constructing a CHMM from 4 matrices, \({A, B, C, D}\), derived from the original HMM \({H}\). \({A}\) contains the first 75 percent of the original matrix \({H}\) row-wise, \({B}\) the last 75 percent, \({C}\) the middle 75 percent and \({D}\) the entire original matrix (An et al. 2019). These are then merged to create the new CHMM \({Z}\). From there, the bigrams feature is calculated as a flattened 20 x 20 matrix \({B}\), in which
\({B[i, j] = \sum_{a = 1}^{L-1} Z_{a, i} \times Z_{a+1, j}}\)
\({Z}\) corresponds to the CHMM matrix, and \({L}\) is the number of rows in \({Z}\) (An et al. 2019). Local Average Group, or LAG, is calculated by first splitting the CHMM into \(j\) groups and then computing the following:
\(LAG(k) = \frac{20}{L}\sum_{p = 1}^{N/20} Mt[p+(i-1)\times(N/20),\space j]\)
Where \({L}\) is the number of rows in \({Z}\) and \(Mt[p+(i-1)\times(N/20),\space j]\) represents the row vector of the CHMM at the \(i\)th position in the \(j\)th group (An et al. 2019). These features are then fused, and a vector of length 800, along with the original length-400 bigrams and LAG vectors, is returned. The CHMM features were used to predict protein-protein interactions.
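A rough sketch of the CHMM construction step; the exact row boundaries used by chmm() for the three 75 percent blocks are an assumption here.

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
n <- nrow(H)
k <- round(0.75 * n)
A <- H[1:k, ] # first 75 percent of rows
B <- H[(n - k + 1):n, ] # last 75 percent
C <- H[(round(0.125 * n) + 1):round(0.875 * n), ] # middle 75 percent (assumed bounds)
Z <- rbind(A, B, C, H) # merged CHMM, with D = H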
library(protHMM)
h_fused <- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))[[1]]
h_lag <- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))[[2]]
h_bigrams <- chmm(system.file("extdata", "1DLHA2-7", package="protHMM"))[[3]]
head(h_fused, 20)
#> [1] 0.050431366 0.008073699 0.025377739 0.080125307 0.037055099 0.034279129
#> [7] 0.022591078 0.088067417 0.056547767 0.088639559 0.020870630 0.016466719
#> [13] 0.114605135 0.031219286 0.040658673 0.079269757 0.066999681 0.136328812
#> [19] 0.009850417 0.024292606
head(h_lag, 20)
#> [1] 0.050431366 0.008073699 0.025377739 0.080125307 0.037055099 0.034279129
#> [7] 0.022591078 0.088067417 0.056547767 0.088639559 0.020870630 0.016466719
#> [13] 0.114605135 0.031219286 0.040658673 0.079269757 0.066999681 0.136328812
#> [19] 0.009850417 0.024292606
head(h_bigrams, 20)
#> [1] 0.6644992 0.2932670 0.5686021 0.8736529 0.4484777 0.7703030 0.3734919
#> [8] 0.6135072 0.7469687 1.0671193 0.2097486 0.6524548 0.9846256 0.5010583
#> [15] 0.6275425 1.2841556 0.9755920 0.8980086 0.3038199 0.3414741
The distance feature calculates the cosine distance matrix between two HMMs \({A}, \space{B}\) before dynamic time warp is applied to the distance matrix to calculate the cumulative distance between the HMMs, which acts as a measure of similarity (Lyons et al. 2016). The cosine distance matrix \({D}\) is calculated with:
\({D[a_i, b_j] = 1 - \frac{a_ib_j^{T}}{\sqrt{a_ia_i^T}\sqrt{b_jb_j^T}}}\)
in which \({a_i}\) and \({b_j}\) refer to row vectors of \({A}\) and \({B}\) respectively (Lyons et al. 2016). This in turn means that \(D\) is of dimensions \({nrow(A), nrow(B)}\). Dynamic time warp then calculates the cumulative distance by calculating the matrix:
\(C[i, j] = min(C[i-1, j], C[i, j-1], C[i-1, j-1]) + D[i, j]\)
where \({C_{i,j}}\) is 0 when \({i}\) or \({j}\) are less than 1 (Lyons et al. 2016). The lower rightmost point of the matrix \({C}\) is then returned as the cumulative distance between proteins. The distance feature was used by Lyons et al. (2016) for protein fold recognition.
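A minimal sketch of both steps, following the recurrence exactly as written above (boundary cells treated as 0); hmm_distance() wraps this whole computation.

library(protHMM)
# cosine distance matrix between the row vectors of matrices A and B
cosine_D <- function(A, B) {
  1 - (A %*% t(B)) / (sqrt(rowSums(A^2)) %o% sqrt(rowSums(B^2)))
}
# cumulative distance via dynamic time warp; C[i, j] = 0 when i or j < 1
dtw_cum <- function(D) {
  C <- matrix(0, nrow(D) + 1, ncol(D) + 1)
  for (i in 1:nrow(D)) for (j in 1:ncol(D))
    C[i + 1, j + 1] <- min(C[i, j + 1], C[i + 1, j], C[i, j]) + D[i, j]
  C[nrow(D) + 1, ncol(D) + 1] # the lower rightmost point
}
A <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
B <- hmm_read(system.file("extdata", "1TEN-7", package="protHMM"))
dtw_cum(cosine_D(A, B))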
library(protHMM)
# basic example code
h <- hmm_distance(system.file("extdata", "1DLHA2-7", package="protHMM"),
                  system.file("extdata", "1TEN-7", package="protHMM"))
h
#> [1] 99.90334
FP_HMM consists of two vectors, \({d, s}\). Vector \({d}\) corresponds to the sums across the sequence for each of the 20 amino acid columns (Zahiri et al. 2013). Vector \({s}\) corresponds to a flattened version of matrix \(S\), where:
\({S[i, j] = \sum_{k = 1}^{L} H[k, j] \times \delta[k, i]}\)
in which \({\delta[k, i] = 1}\) when the amino acid at sequence position \(k\) is \(A_i\), and 0 otherwise (Zahiri et al. 2013). \({A}\) refers to a list of all 20 possible amino acids, and \({i, j}\) span \({1:20}\). This feature has been used for the prediction of protein-protein interactions.
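A sketch of both vectors, under the interpretation that \(\delta\) matches the protein's sequence against the amino acid list \(A\); the sequence below is randomly generated for illustration, since the real one comes from the HMM file.

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
aa <- c("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y")
seq_aa <- sample(aa, nrow(H), replace = TRUE) # hypothetical sequence, for illustration
d <- colSums(H) # sums across the sequence for each amino acid column
# S[i, j]: column-j sum restricted to positions whose residue is A_i
S <- sapply(1:20, function(j) sapply(1:20, function(i) sum(H[seq_aa == aa[i], j])))
s <- as.vector(S) # flattened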
library(protHMM)
# basic example code
h <- fp_hmm(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[[1]]
#> [1] 5.327650 2.495163 4.226204 6.557727 3.430360 6.392750 2.638661 5.230106
#> [9] 5.887156 7.882747 1.510770 4.522352 6.408192 3.615053 4.486816 9.137523
#> [17] 7.110800 7.647033 1.971206 2.521788
head(h[[2]], 20)
#> [1] 0.00000000 0.06906144 0.25848195 0.61607168 0.21594299 0.21617140
#> [7] 0.24383966 0.22567252 0.18595546 0.51564480 0.00000000 0.15273093
#> [13] 0.63770076 0.00000000 0.24438102 0.21709957 0.46775998 0.89909902
#> [19] 0.10940189 0.05263446
This feature was created by Jin and Zhu (2021) and begins by creating a grouping matrix \(G\), assigning each position a number \(1, 2, 3\) based on the value at that position of the HMM matrix \(H\); \(1\) represents the low probability group, \(2\) the medium and \(3\) the high probability group. The number of total points in each group for each column is then calculated, and the sequence is then split based upon the positions of the 1st, 25th, 50th, 75th and 100th percentile (last) points for each of the three groups, in each of the 20 columns of the grouping matrix. Thus for column \(j\):
\({S(k, \space j, \space z) = \sum_{i = 1}^{z \times 0.25 \times N} |G[i,\space j] = k|}\)
where \({k}\) is the group number, \({z = 1,2,3,4}\) and \({N}\) corresponds to number of rows in matrix \({G}\) (Jin and Zhu 2021).
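A sketch of the counting step only, with an assumed rank-based tertile rule for the low/medium/high grouping; hmm_GSD()'s actual grouping rule and any final standardization step may differ.

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
N <- nrow(H)
# assumed grouping: rank tertiles of each column map to groups 1 (low) to 3 (high)
G <- apply(H, 2, function(col) ceiling(3 * rank(col, ties.method = "first") / N))
# S(k, j, z): count of group-k points in column j among the first z * 25 percent of rows
S <- function(k, j, z) sum(G[1:ceiling(z * 0.25 * N), j] == k)
S(1, 5, 2) # group-1 count in column 5 up to the 50th percentile position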
library(protHMM)
# basic example code
h <- hmm_GSD(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[1:40]
#> [1] 1.11117557 0.83022094 -0.43407491 -1.50170251 -1.67027529 0.91450733
#> [7] 0.54926631 0.26831167 -0.01264296 -1.58598890 1.02688918 0.94260279
#> [13] 0.71783909 0.43688445 -1.61408437 1.11117557 0.32450260 -0.34978852
#> [19] -0.99598417 -1.67027529 0.83022094 0.71783909 0.52117084 0.24021621
#> [25] -1.38932066 0.63355270 0.63355270 0.63355270 0.63355270 -1.55789344
#> [31] 1.11117557 0.46497992 -0.34978852 -1.13646149 -1.67027529 1.05498465
#> [37] 0.85831640 0.52117084 0.18402528 -1.50170251
Both of the pseudo HMM features, pse_hmm and IM_psehmm, are based off of Chou and Shen (2007)'s pseudo PSSM concept. This was engineered in order to create fixed-length representations of proteins while keeping sequence order information. IM_psehmm stands for improved pseudo HMM, and is based off of Ruan et al. (2020)'s improvements to Chou and Shen (2007)'s initial offering; Ruan et al. (2020)'s version is still based on the PSSM matrix, and as such this feature is a port for use with HMMs.
Mathematically, pseudo_hmm is made of the fusion of 2 vectors. The first is of the means across the 20 amino acid emission frequency columns of the HMM. The second can be found in the following equation (Chou and Shen 2007):
\(V_{j}^{i} = \frac{1}{L-i}\sum_{k = 1}^{L-i}H[k, j] \times H[k+i, j], \space j = 1,2..20, \space i < L\)
\(H\) represents the HMM matrix, \(L\) represents the number of rows of \(H\), \(i\) represents the number of contiguous amino acids across which correlation is measured and \(V_{j}^{i}\) represents the value of the vector for the \(j\)th column and \(i\) most contiguous amino acids. This results in a vector of length \(20 + g \times 20\), where \(g\) is the maximum value taken by \(i\).
IM_pse_hmm is also the fusion of two vectors, one being the same means vector described above. The second can be found as (Ruan et al. 2020):
\(V_{j}^{d} = \frac{1}{20-d}\sum_{k = 1}^{L}H[k, j] \times H[k, j+d], \space j = 1,2..20, \space d < 20\)
Where \(H\) is the HMM matrix, \(L\) represents the number of rows in \(H\) and \(V_{j}^{d}\) represents the value in the feature vector for column \(j\) and columnal distance parameter \(d\). This results in a vector of length \({20+20\times d-d\times\frac{d+1}{2}}\).
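A sketch of both pseudo HMM vectors straight from the formulas above, with hypothetical parameter values of 3; the defaults used by pse_hmm and IM_psehmm may differ.

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
L <- nrow(H)
means <- colMeans(H) # the shared means vector
g <- 3 # hypothetical maximum lag for pse_hmm
V1 <- unlist(lapply(1:g, function(i)
  sapply(1:20, function(j) sum(H[1:(L - i), j] * H[(1 + i):L, j]) / (L - i))))
pse <- c(means, V1) # length 20 + 20 * g
d <- 3 # hypothetical maximum columnal distance for IM_psehmm
V2 <- unlist(lapply(1:d, function(dd)
  sapply(1:(20 - dd), function(j) sum(H[, j] * H[, j + dd]) / (20 - dd))))
im_pse <- c(means, V2) # length 20 + 20 * d - d * (d + 1) / 2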
This feature is based off of Li et al. (2019)'s work using local binary pattern (LBP) to extract features from PSSMs for use in protein-protein interaction prediction. The local binary pattern extraction used here can be defined as \(B = b(s(H_{m, n+1}-H_{m, n}), s(H_{m+1, n+1}-H_{m, n}), s(H_{m+1, n}-H_{m, n})...s(H_{m-1, n+1}-H_{m, n}))\), where \(H[m,n]\) signifies the center of a local binary pattern neighborhood (\(2 \le m \le L-1, \space 2 \le n \le 19, \space L = nrow(H)\)), \(H\) refers to the HMM matrix, \(s(x) = 1\) if \(x \ge 0\), \(s(x) = 0\) otherwise and \(b\) refers to a function that converts a binary number to a decimal number. The local binary pattern features are then put into a histogram with 256 bins, one for each possible binary value. This is returned as a vector of length 256.
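A sketch of the extraction loop; the clockwise neighbor order starting at \(H[m, n+1]\) follows the definition above, but treating the first comparison as the most significant bit is an assumption.

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
L <- nrow(H)
s <- function(x) as.integer(x >= 0) # the thresholding function s()
lbp_hist <- integer(256) # one bin per possible 8-bit pattern
for (m in 2:(L - 1)) for (n in 2:19) {
  nb <- c(H[m, n + 1], H[m + 1, n + 1], H[m + 1, n], H[m + 1, n - 1],
          H[m, n - 1], H[m - 1, n - 1], H[m - 1, n], H[m - 1, n + 1])
  val <- sum(s(nb - H[m, n]) * 2^(7:0)) # b(): binary to decimal
  lbp_hist[val + 1] <- lbp_hist[val + 1] + 1
}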
library(protHMM)
## basic example code
h <- hmm_LBP(system.file("extdata", "1TEN-7", package="protHMM"))
h[1:20]
#> [1] 166 31 16 18 35 8 4 7 12 6 3 1 7 0 2 0 19 4 5
#> [20] 2
In this feature, linear predictive coding (LPC) is used to extract a vector \(d\) of length \(g\) from each of the 20 amino acid emission frequency columns of the HMM matrix. Linear predictive coding models position \(H[i, j]\) in the HMM matrix as approximately a linear combination of the \(g\) preceding values in its column, \(\sum_{k = 1}^{g} d_{k, j}H[i-k, j]\). The feature vector extracted is thus the vector of coefficients \(d\); this is done through the phonTools package (Barreda 2015). This assumes a \(g\) of 14, and thus the feature vector generated is of length \(14\times20 = 280\). This feature is based off of Qin et al. (2015)'s work, in which they used an LPC feature set to predict protein structural class.
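A stand-in sketch using stats::ar.yw on a toy matrix instead of phonTools, so the coefficients will not match hmm_LPC() exactly; it only illustrates the per-column extraction of \(g\) coefficients.

set.seed(1)
H <- matrix(runif(60 * 20), 60, 20) # toy stand-in for an HMM matrix
g <- 14 # LPC order, matching the g described above
# fit an order-g autoregressive model to each column; its g coefficients
# stand in for that column's LPC vector d
lpc_vec <- as.vector(apply(H, 2, function(col)
  ar.yw(col, aic = FALSE, order.max = g)$ar))
length(lpc_vec) # 14 * 20 = 280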
library(protHMM)
## basic example code
h <- hmm_LPC(system.file("extdata", "1TEN-7", package="protHMM"))
h[1:20]
#> [1] 1.00000000 0.74635054 0.59126153 0.36331602 0.36010191 0.30277057
#> [7] 0.47978532 0.41120113 0.29511047 0.28022645 0.20143373 0.24002696
#> [13] 0.15302503 0.04457104 1.00000000 0.71845887 0.84727202 0.92128401
#> [19] 1.02748498 0.83287042
This feature is documented by Mohammadi et al. (2022) and was used for protein-protein interaction prediction. The SCSH feature returns the k-mer composition between a protein's consensus sequence given by the HMM and the protein's actual sequence, for k = 2, 3. First, all possible k-mers of the 20 possible amino acids are enumerated (\({20^2}\) and \({20^3}\) permutations for 2 and 3-mers respectively). From those permutations, vectors of length 400 and 8000 are created, \(v_2\) and \(v_3\). Each position in these vectors corresponds to a specific k-mer, i.e. \(v_2[1] = AA\) and \(v_3[1] = AAA\). Then, the protein sequence that corresponds to the HMM scores is extracted and put into a bipartite graph with the actual protein sequence. Each path of length 1 or 2 is found on the graph, and the corresponding vertices are noted as possible 2 and 3-mers. For each 2 or 3-mer found from these paths, 1 is added at the position that corresponds to that k-mer in \(v_2\) or \(v_3\). The vectors are then returned.
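A sketch of the 2-mer indexing and counting for a single toy sequence; the k-mer-to-position mapping is an assumption, and hmm_SCSH() additionally builds the bipartite graph between the consensus and actual sequences before counting.

aa <- c("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y")
kmer2 <- as.vector(outer(aa, aa, paste0)) # assumed ordering of the 400 possible 2-mers
v2 <- setNames(integer(400), kmer2)
seq_aa <- c("A", "C", "D", "A", "C", "D") # toy sequence
for (i in 1:(length(seq_aa) - 1)) {
  mer <- paste0(seq_aa[i], seq_aa[i + 1])
  v2[mer] <- v2[mer] + 1 # add 1 at the position for this 2-mer
}
v2[v2 > 0]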
library(protHMM)
## basic example code
h <- hmm_SCSH(system.file("extdata", "1TEN-7", package="protHMM"))
## 2-mers
h[[1]][1:20]
#> [1] 0 0 0 1 0 2 0 0 1 1 1 0 2 0 1 0 2 1 0 0
## 3-mers: specific indexes as most of the vector is 0
h[[2]][313:333]
#> [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 2 0 0 0
Separated Dimers refers to the probability that an amino acid dimer will occur between residues separated by a distance of \(l\). Saini et al. (2015) conceived of this feature and applied it to protein fold recognition; mathematically, separated dimers calculates the matrix \(F\):
\({F[m, n] = \sum_{i = 1}^{L-l} H_{i, m}H_{i+l, n}}\)
\({H}\) corresponds to the original HMM matrix, and \(L\) is the number of rows in \(H\). Matrix \(F\) is then flattened to a feature vector of length 400, and returned.
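As with bigrams, the separated dimer matrix reduces to a matrix product; a minimal sketch with an assumed separation distance \(l\) of 2.

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
L <- nrow(H)
l <- 2 # assumed separation distance
# F[m, n] = sum over i of H[i, m] * H[i+l, n]
F_mat <- t(H[1:(L - l), ]) %*% H[(1 + l):L, ]
sd_vec <- as.vector(F_mat) # flattened to length 400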
library(protHMM)
## basic example code
h <- hmm_SepDim(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[1:40]
#> [1] 0.28073204 0.13127205 0.22465162 0.30993504 0.17705352 0.30286268
#> [7] 0.13611039 0.27110569 0.26843594 0.42275865 0.08623105 0.23523189
#> [13] 0.28262058 0.17449219 0.23466631 0.45744566 0.38400417 0.41489217
#> [19] 0.06656876 0.12644926 0.12468473 0.02787985 0.17961578 0.17639461
#> [25] 0.04536077 0.33523934 0.07878719 0.08306272 0.13034049 0.14246328
#> [31] 0.04005923 0.13693382 0.09324879 0.10231668 0.09851687 0.29441110
#> [37] 0.16511460 0.15335638 0.03221440 0.03884258
Nanni, Lumini, and Brahnam (2014) pioneer the Single Average Group feature and use it for protein classification. This feature groups together rows that are related to the same amino acid, using a vector \({SA(k)}\), in which \({k}\) spans \({1:400}\) and:
\({SA(k) = avg_{i \space = \space 1, 2... L}\space H[i, j] \times \delta(P(i), A(z))}\)
in which \({H}\) is the HMM matrix, \({P}\) is the protein sequence, \({A}\) is an ordered set of amino acids, the variables \({j, \space z = 1, 2, 3...20}\) and \({k = j + 20 \times (z-1)}\) when creating the vector. \({\delta()}\) represents Kronecker's delta.
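A sketch under the reading that the average runs over all \(L\) rows, with a randomly generated stand-in for the protein sequence \(P\) (the real sequence comes from the HMM file).

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
aa <- c("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y")
P <- sample(aa, nrow(H), replace = TRUE) # hypothetical sequence, for illustration
# SA[k] with k = j + 20 * (z - 1): column-major flattening over j, then z
SA <- as.vector(sapply(1:20, function(z)
  sapply(1:20, function(j) mean(H[, j] * (P == aa[z])))))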
library(protHMM)
## basic example code
h <- hmm_Single_Average(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[1:40]
#> [1] 5.327649515 2.495163167 4.226203618 6.557727068 3.430359915 6.392750239
#> [7] 2.638660734 5.230106250 5.887155542 7.882747181 1.510770387 4.522351690
#> [13] 6.408192096 3.615052651 4.486815933 9.137522926 7.110799532 7.647033087
#> [19] 1.971206424 2.521787664 0.069061438 1.448076555 0.000000000 0.000000000
#> [25] 0.040982786 0.037528025 0.034250848 0.033810905 0.000000000 0.033966298
#> [31] 0.018141963 0.012174447 0.061500560 0.009964409 0.031380236 0.049529900
#> [37] 0.045531222 0.055012065 0.008144264 0.011430244
This feature extraction technique is found in Fang, Noguchi, and Yamana (2013), where it is used in conjunction with another technique for ligand binding site prediction. This feature smooths the HMM matrix \(H\) by using a sliding window of length \(sw\) to incorporate information from up and downstream residues into each row of the HMM matrix. For each HMM row \(r_i\):
\(r_i = \sum_{x \space = \space i-\frac{sw}{2}}^{i+\frac{sw}{2}}{r_x}\)
for \({i = 1, 2, 3...L}\), where \({L}\) is the number of rows in \({H}\). To handle the beginning and ending rows, zero matrices of dimensions \(sw/2 \times 20\) are appended to both ends of \({H}\).
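A sketch with an assumed window of \(sw = 7\) (hmm_smooth()'s default is not stated above), padding both ends with zero rows and summing each window:

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
L <- nrow(H)
sw <- 7 # assumed window size
pad <- matrix(0, sw %/% 2, 20) # zero matrix for each end
Hp <- rbind(pad, H, pad)
# each smoothed row sums the sw rows centered on the original position
Hs <- t(sapply(1:L, function(i) colSums(Hp[i:(i + sw - 1), , drop = FALSE])))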
library(protHMM)
## basic example code
h <- hmm_smooth(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[1,]
#> [1] 0.11776641 0.01631112 0.05565213 0.11811762 0.02074633 0.18715081
#> [7] 0.11375485 0.18747835 0.25055389 0.12211269 0.05342971 0.12007436
#> [13] 0.07683993 0.16693724 0.01869764 0.24692156 0.30136984 0.67903030
#> [19] 0.04331468 0.10385260
This feature extraction method is found in Song et al. (2018), and uses singular value decomposition (SVD) to extract a feature vector from an HMM matrix. SVD factorizes the matrix \(H\) of dimensions \({i, j}\) into:
\({U[i, r] \times \Sigma[r, r] \times V[r, j]}\)
The diagonal values of \({\Sigma}\) are known as the singular values of matrix \(H\), and are what this function returns. This feature was used to predict protein-protein interactions.
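The same quantity can be computed with base R's svd(); a one-line sketch on the probability matrix from hmm_read(), though hmm_svd() may scale its input differently:

library(protHMM)
H <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
svd(H)$d # the diagonal of Sigma: 20 singular values for an L x 20 matrix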
library(protHMM)
# basic example code
h <- hmm_svd(system.file("extdata", "1DLHA2-7", package="protHMM"))
h
#> [1] 2.4328414 1.1318859 1.0331369 0.9046638 0.8595964 0.6696056 0.6425755
#> [8] 0.5865227 0.5189927 0.5155567 0.4701584 0.4118091 0.3627740 0.3249188
#> [15] 0.2858293 0.2615811 0.2576213 0.2251676 0.2074293 0.1242763
This function reads in the 20 amino acid emission frequency columns used in the feature extraction methods discussed previously, and converts the columns into probabilities.
library(protHMM)
## basic example code
h <- hmm_read(system.file("extdata", "1DLHA2-7", package="protHMM"))
h[1,]
#> [1] 0.0000000 0.0000000 0.0000000 0.3410370 0.0000000 0.0000000 0.0000000
#> [8] 0.3382121 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [15] 0.0000000 0.0000000 0.0000000 0.3208565 0.0000000 0.0000000