AOBA SEQSCORE evaluates the propensity of an amino acid sequence to form stable three-dimensional structure.
Method
The score is derived from the propensity of two amino acids to occur at a given separation along a protein sequence.
Co-occurrence propensity score of query sequence
For a given amino acid sequence s = aa(1)aa(2)...aa(N), where aa(i) is the ith amino acid and N is the length of s, the score of the ith residue is calculated by
$$\mathrm{score}(i) = \sum_{j=1}^{10} \mathrm{CoP}\big(aa(i-j),\, aa(i),\, j\big) + \sum_{j=1}^{10} \mathrm{CoP}\big(aa(i),\, aa(i+j),\, j\big)$$
The first term evaluates the co-occurrence with the preceding ten residues (N-terminal side) and the second term with the following ten residues (C-terminal side).
If i-j is smaller than 1 or i+j is larger than N, the CoP score for that pair is set to zero.
The result figure shows the score of each residue in the sequence as a broken line.
The smoothed score for each residue is then calculated by averaging the score over the 11 residues centered on that residue (red solid line).
Calculation of co-occurrence propensity score from sequence database
CoP(a,b,k) is the co-occurrence propensity of amino acids a and b separated by k residues along the sequence. It is derived from the statistics of protein sequences that form stable three-dimensional structures in solvent, as confirmed by X-ray crystallography.
The co-occurrence propensity is calculated from F(a,b,k) and E(a,b,k), the observed and expected occurrences of the residue pair a and b separated by k residues in the training amino acid sequences. Amino acid pairs in opposite order, i.e. (a, b) and (b, a), are treated separately.
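Given a table of CoP values, the per-residue score and the 11-residue smoothing described for the query sequence could be sketched roughly as follows. The function names, the dictionary layout of the table, the choice to score missing table entries as zero, and the shortened averaging window at the sequence ends are all assumptions made for illustration; the text does not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical table layout: (first residue, second residue, separation) -> CoP
CoPTable = Dict[Tuple[str, str, int], float]


def residue_scores(seq: str, cop: CoPTable, window: int = 10) -> List[float]:
    """Sum CoP over the `window` preceding and following residues of each position."""
    n = len(seq)
    scores = []
    for i in range(n):  # 0-based index of the residue being scored
        total = 0.0
        for j in range(1, window + 1):
            if i - j >= 0:  # N-terminal partner exists
                total += cop.get((seq[i - j], seq[i], j), 0.0)
            if i + j < n:  # C-terminal partner exists
                total += cop.get((seq[i], seq[i + j], j), 0.0)
            # Pairs that would fall outside the sequence contribute zero.
        scores.append(total)
    return scores


def smooth(scores: List[float], width: int = 11) -> List[float]:
    """Average over `width` residues centered on each position.

    Assumption: near the termini the window is truncated and the average
    is taken over the residues that exist.
    """
    half = width // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed
```

With a toy table such as `{("A", "C", 1): 1.0, ("C", "A", 1): 0.5}`, calling `residue_scores("ACA", table)` gives the raw (broken-line) profile and `smooth(...)` gives the counterpart of the red solid line.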
Calculation of the observed and expected numbers of each amino acid pair at each separation
Given a protein sequence p in the dataset, the occurrence of amino acid pair a and b at separation k, f(a,b,k|p), is observed directly from the sequence. The observed occurrence in the whole dataset is then calculated by
$$F(a,b,k) = \sum_{p} f(a,b,k \mid p)$$
where the summation is taken over all the proteins in the dataset.
The expected occurrence, on the other hand, is calculated from the occurrence of each amino acid in each protein p,
$$e(a,b,k \mid p) = n(k \mid p)\,\frac{n(a \mid p)}{n(p)}\,\frac{n(b \mid p)}{n(p)}$$
where n(a|p) and n(b|p) are the numbers of amino acids a and b in protein p, n(k|p) is the number of residue pairs separated by k residues in p, and n(p) is the number of residues in p.
The expected occurrence in the whole dataset is then calculated by
$$E(a,b,k) = \sum_{p} e(a,b,k \mid p)$$
where again the summation is taken over all the proteins in the dataset.
Dataset of protein sequences with stable structures
A non-redundant set of protein domains, in which the sequence identity between any two sequences is below 40%, was obtained from the ASTRAL database. From this set, the structures determined by X-ray crystallography were selected. In addition, membrane proteins, identified by having the MeSH term "MEMBRANE", were excluded. For the remaining 7575 domains, the amino acid sequences were obtained by reading the ATOM records in the PDB-style files, and His-tags were removed. The resulting sequences can be obtained from here.
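Given sequences such as those in this dataset, the observed and expected pair counts and a propensity table could be computed roughly as in the sketch below. Several points are assumptions, not statements from the text: "separated by k residues" is taken to mean positions i and i+k, so that n(k|p) = n(p) - k; the propensity is taken to be the log-odds ratio log(F/E), which is one common choice that makes zero a neutral value; and pairs never observed are simply left out of the table. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, int]  # (first residue, second residue, separation k)


def count_observed_expected(
    seqs: Iterable[str], max_sep: int = 10
) -> Tuple[Dict[Key, float], Dict[Key, float]]:
    """Accumulate observed and expected pair counts over all sequences."""
    observed: Dict[Key, float] = defaultdict(float)
    expected: Dict[Key, float] = defaultdict(float)
    for p in seqs:
        n_p = len(p)
        composition = Counter(p)  # n(a|p) for every residue type a in p
        for k in range(1, max_sep + 1):
            n_k = n_p - k  # assumed number of pairs at separation k
            if n_k <= 0:
                break
            # Observed: ordered pairs (p[i], p[i+k]) read directly from p.
            for i in range(n_k):
                observed[(p[i], p[i + k], k)] += 1
            # Expected: n(k|p) * n(a|p)/n(p) * n(b|p)/n(p).
            for a, n_a in composition.items():
                for b, n_b in composition.items():
                    expected[(a, b, k)] += n_k * n_a * n_b / (n_p * n_p)
    return observed, expected


def cop_table(
    observed: Dict[Key, float], expected: Dict[Key, float]
) -> Dict[Key, float]:
    """ASSUMPTION: propensity as log(observed / expected); not given in the text."""
    return {
        key: math.log(obs / expected[key])
        for key, obs in observed.items()
        if obs > 0 and expected.get(key, 0.0) > 0
    }
```

Because (a, b, k) and (b, a, k) are separate keys, the two orderings of a pair are counted separately, in line with the description of the propensity.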