
| Sequence Retrieval |
The PlanTAPDB interfaces allow sequence retrieval in three ways:
|
|||||||||||||||||||||
| algorithm | Multiple sequence alignment algorithm used for this cluster. | |||||||||||||||||||||
| avg_ident | The average %identity of all pairwise distances of the cluster_members. | |||||||||||||||||||||
| comment | Comments from the manual annotation phase | |||||||||||||||||||||
| description | The manual annotation inferred for the cluster. | |||||||||||||||||||||
| f_quantile_ident | The first quantile of the summary statistics of all pairwise distances of the cluster_members. | |||||||||||||||||||||
| fiala_stemminess | Tree measure as described in Fiala, K.L. and R.R. Sokal, 1985. Factors determining the accuracy of cladogram estimation: evaluation using computer simulation. Evolution, 39: 609-622 | |||||||||||||||||||||
| from_step |
Our pipeline filters PSI-BLAST hits according to a six-step filtering scheme. "from_step" tells you which filter step was applied when the sequences resulting in this cluster were initially filtered. Sequences passing a specific filter step have to fulfill at least the alignment length and fraction identical criteria of the step:
|
|||||||||||||||||||||
| homology filtering | Searching both a large multi-species database such as UniProt and the individual full-genome protein predictions greatly improves taxon sampling, but it also detects identical protein sequences in these overlapping databases. In addition, the same locus is often represented by more than one protein sequence because of divergent predicted gene models, splice variants, and sequencing and annotation errors. To cope with this problem, redundant copies of genes were eliminated prior to all functional analyses, using an identity cutoff of 99% for sequences of the same species. Phylogenetic inference on large clusters is computationally costly, and huge trees are difficult to interpret. Clusters with more than 150 members were therefore condensed via stepwise homology reduction until the threshold of 150 members was reached. To let you examine this process for every cluster in detail, we offer the distance matrix as a text file, a graphic of distribution plots of the cluster_member distances, and the initial MAFFT fftns2 alignment of the cluster used for the pairwise distances. | |||||||||||||||||||||
| last_cutoff | The last %identity threshold applied in the homology reduction of the cluster_members. | |||||||||||||||||||||
| longest_internal_branch_length | Length of the longest internal branch used to (midpoint-)root the phylogenetic tree of the cluster. | |||||||||||||||||||||
| max_ident | The maximal %identity between two cluster members observed in the redundancy removal and homology reduction phase of our pipeline. | |||||||||||||||||||||
| max_iteration | The maximal PSI-BLAST iteration the cluster_members are from. | |||||||||||||||||||||
| median_ident | The median %identity of all pairwise distances of the cluster_members. | |||||||||||||||||||||
| members | Total number of cluster_members before redundancy removal and homology reduction. | |||||||||||||||||||||
| min_ident | The minimal %identity between two cluster members observed in the redundancy removal and homology reduction phase of our pipeline. | |||||||||||||||||||||
| ml | The maximum likelihood of the consensus tree topology calculated with TREE-PUZZLE | |||||||||||||||||||||
| multiple alignment | Due to errors introduced by the alignment algorithm, a certain fraction of the columns in a multiple sequence alignment (MSA) contributes noise that disturbs the correct inference of phylogenetic relationships. Such positions are usually removed manually in the course of a phylogenetic analysis. While current approaches to automated phylogenies mostly rely on unprocessed ClustalW alignments, we placed more emphasis on alignment quality to increase the reliability of the resulting phylogenies. Thus, we used a measure that describes evolutionarily informative sites. We implemented a best-of-two approach: in the first step, two alignments were calculated using different state-of-the-art algorithms and then filtered using the sum-of-pairs score. In the second step, the alignment with the maximal number of remaining columns was chosen. In version 1.0, the alignments consisted on average of 65% gaps and were reduced to 28% of the original alignment length by this procedure. In 71% of the cases the MAFFT G-INSI alignment was selected to represent the cluster, whereas ProbCons or Muscle was chosen for 29% of the clusters. The "best" alignment can be downloaded and viewed with the Jalview alignment editor applet. To make the MSA column filtering process traceable for each cluster, we also provide an overview graphic. | |||||||||||||||||||||
| nleafs | Number of leaves (taxa) in this cluster's phylogenetic tree. | |||||||||||||||||||||
| nr_distances | Total number of pairwise distances. | |||||||||||||||||||||
| nr_members | The number of members after redundancy removal | |||||||||||||||||||||
| number_of_internals | Number of internal nodes of the phylogenetic tree. | |||||||||||||||||||||
| number_of_nodes | Total number of nodes (internal + leaves) of the tree. | |||||||||||||||||||||
| number_of_terminals | Number of nodes without children. | |||||||||||||||||||||
| phylogenetic trees | Many approaches to phylogenomics rely solely on a distance approach using Neighbor-Joining (NJ) (Saitou and Nei, 1987). However, NJ is known to be susceptible to noisy data, provides no confidence measures and makes it hard to compute reliable distances for strongly divergent sequences. Probabilistic approaches, like maximum likelihood (ML) and Bayesian methods, are known to overcome most of these problems, but both are very time consuming and thus usually not applied in large-scale phylogenomics approaches. We followed a combined approach by calculating ML consensus branch lengths using gamma distributed rates from bootstrapped NJ topologies. The ML consensus topology of the phylogenetic tree can be downloaded in NHX and explored using the ATV Tree Viewer applet. | |||||||||||||||||||||
| queries | Total number of queries present in the cluster | |||||||||||||||||||||
| redundancy |
The amount of shared (redundant) history on the total tree. Formula: 1 / ( treelength - height / ( ntax * height - height ) ) |
|||||||||||||||||||||
| remaining | The number of members after homology reduction | |||||||||||||||||||||
| removed | Number of sequences removed in the homology reduction | |||||||||||||||||||||
| resolution | The total number of internal nodes over the total number of internal nodes on a fully bifurcating tree of the same size. | |||||||||||||||||||||
| sd_ident | The standard deviation of the %identities of all pairwise distances of the cluster_members. | |||||||||||||||||||||
| t_quantile_ident | The third quantile of the summary statistics of all pairwise distances of the cluster_members. | |||||||||||||||||||||
| total_paths | The sum of all root-to-tip path lengths of the phylogenetic tree of this cluster. | |||||||||||||||||||||
| tree_height | The average of all root-to-tip path lengths. For ultrametric trees (consistent with the molecular clock hypothesis) this value equals the height of the tree; for additive trees it is only an average over paths of different lengths and should be interpreted accordingly. | |||||||||||||||||||||
| tree_length | The sum of all branch lengths of the phylogenetic tree. |
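
The tree measures in the table (number_of_nodes, number_of_terminals, number_of_internals, tree_length, total_paths, tree_height, resolution) can be illustrated on a toy tree. The following Python sketch is not the pipeline's code; the `Node` class and the function names are hypothetical. It assumes a rooted tree, counts the root as an internal node, ignores any branch length on the root itself, and takes a fully bifurcating rooted tree with n leaves to have n - 1 internal nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; branch_length is the branch leading to it."""
    name: str
    branch_length: float = 0.0
    children: list = field(default_factory=list)


def iter_nodes(node):
    """Yield the node and all of its descendants."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def root_to_tip_paths(node, dist=0.0):
    """Return the path length from the root to every leaf."""
    if not node.children:
        return [dist]
    paths = []
    for child in node.children:
        paths.extend(root_to_tip_paths(child, dist + child.branch_length))
    return paths


def tree_metrics(root):
    nodes = list(iter_nodes(root))
    terminals = [n for n in nodes if not n.children]
    internals = [n for n in nodes if n.children]  # assumption: root included
    paths = root_to_tip_paths(root)
    return {
        "number_of_nodes": len(nodes),
        "number_of_terminals": len(terminals),
        "number_of_internals": len(internals),
        # sum of all branch lengths; the root's own branch is ignored
        "tree_length": sum(n.branch_length for n in nodes if n is not root),
        "total_paths": sum(paths),
        # average root-to-tip path length
        "tree_height": sum(paths) / len(paths),
        # assumption: a fully bifurcating rooted tree has (leaves - 1) internals
        "resolution": len(internals) / (len(terminals) - 1),
    }


# Toy tree ((A:1,B:1):1,C:2)
tree = Node("root", 0.0, [
    Node("I", 1.0, [Node("A", 1.0), Node("B", 1.0)]),
    Node("C", 2.0),
])
metrics = tree_metrics(tree)
print(metrics)
```

The toy tree is ultrametric, so the averaged tree_height coincides with the actual height of the tree; on an additive tree with unequal root-to-tip paths the same function returns only their mean.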
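
The identity statistics in the table (nr_distances, avg_ident, median_ident, sd_ident, f_quantile_ident, t_quantile_ident) summarize the list of pairwise %identities of a cluster. The Python sketch below shows one way to compute such a summary. The function name is hypothetical, and the table does not state which quantile definition or which standard deviation (sample or population) the pipeline uses; the choices here (inclusive quantiles, sample standard deviation) are assumptions. The min and max are taken over the same list for illustration only, whereas the table defines min_ident and max_ident over the redundancy removal and homology reduction phase.

```python
import statistics


def identity_summary(identities):
    """Summarize a list of pairwise %identity values.

    Assumptions: quartiles use the 'inclusive' method and sd is the
    sample standard deviation; the pipeline may use other definitions.
    """
    q1, median, q3 = statistics.quantiles(identities, n=4, method="inclusive")
    return {
        "nr_distances": len(identities),
        "min_ident": min(identities),
        "f_quantile_ident": q1,
        "median_ident": median,
        "avg_ident": statistics.mean(identities),
        "t_quantile_ident": q3,
        "max_ident": max(identities),
        "sd_ident": statistics.stdev(identities),
    }


# Four cluster members give 4 * 3 / 2 = 6 pairwise values (made-up numbers).
pairwise = [40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
summary = identity_summary(pairwise)
print(summary)
```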
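
The stepwise homology reduction described under "homology filtering" can be pictured as lowering an identity cutoff until the cluster is small enough. The Python sketch below is only an illustration under stated assumptions, not the pipeline's actual procedure: the function name, the starting cutoff, the step size of one percentage point, and the greedy rule for choosing which member of a similar pair to keep are all hypothetical. It returns the surviving members together with the last cutoff applied, which corresponds loosely to the remaining, removed and last_cutoff fields.

```python
def homology_reduce(members, identity, max_members=150,
                    start_cutoff=99.0, step=1.0):
    """Greedy sketch of a stepwise homology reduction.

    members:  list of sequence identifiers
    identity: function (a, b) -> %identity of the pair
    Lowers the cutoff by `step` (an assumed value) until at most
    `max_members` remain. Returns (kept, last_cutoff).
    """
    kept = list(members)
    cutoff = start_cutoff
    last_cutoff = None
    while len(kept) > max_members and cutoff > 0:
        survivors = []
        for candidate in kept:
            # keep a member only if it is below the cutoff
            # against everything already kept
            if all(identity(candidate, s) < cutoff for s in survivors):
                survivors.append(candidate)
        kept = survivors
        last_cutoff = cutoff
        cutoff -= step
    return kept, last_cutoff


# Toy example with made-up identities and a threshold of 2 instead of 150.
toy_identities = {
    frozenset(("a", "b")): 98.0, frozenset(("a", "c")): 60.0,
    frozenset(("a", "d")): 55.0, frozenset(("b", "c")): 62.0,
    frozenset(("b", "d")): 58.0, frozenset(("c", "d")): 97.0,
}


def toy_identity(x, y):
    return toy_identities[frozenset((x, y))]


kept, last_cutoff = homology_reduce(["a", "b", "c", "d"], toy_identity,
                                    max_members=2)
removed = 4 - len(kept)
print(kept, last_cutoff, removed)
```

A single pass at a given cutoff may remove more sequences than strictly needed to reach the threshold; a finer step size trades extra passes for a result closer to the target size.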
