| PlanTAPDB Retrieval documentation |
| PlanTAP Family Viewer documentation |
| PlanTAP Family Members Viewer documentation |
| PlanTAP Family Cluster Viewer documentation |


| accession_number | Each PlanTAP entry has a distinct accession number: a five-character string composed of a leading two-letter category prefix and a trailing unique three-digit number. Pattern: (TF|TR|PT)([0-9]{3}) |
| category |
Each PlanTAP family entry belongs to one of the following categories:
|
| main_family | Each PlanTAP entry was annotated to belong to an existing or new family of TAPs. |
| search |
Using the search function, PlanTAPDB can be queried by searching the following database fields:
|
| sub_family | Some of the PlanTAP families can be further divided into subfamilies. |
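
The accession number pattern quoted above, (TF|TR|PT)([0-9]{3}), can be checked programmatically before a query is sent. The following is a minimal sketch in Python, not part of PlanTAPDB itself; the function name `parse_accession` and the sample strings are hypothetical and chosen only for illustration.

```python
import re

# Pattern as quoted in the accession_number entry:
# a two-letter category prefix followed by exactly three digits.
ACCESSION_RE = re.compile(r"(TF|TR|PT)([0-9]{3})")


def parse_accession(accession):
    """Return (prefix, number) for a well-formed accession, otherwise None."""
    match = ACCESSION_RE.fullmatch(accession)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Hypothetical sample strings, not taken from the database.
for candidate in ["TF001", "PT123", "XX001", "TF01", "TR0456"]:
    print(candidate, parse_accession(candidate))
```

`fullmatch` is used instead of `match` so that strings with extra trailing characters are rejected rather than matched on their first five characters.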


| Filters | The family member list can be filtered using the following criteria: In addition, the information displayed for every family member can be switched between full textual output and a graphical domain structure by using the "graphical view of the family members' domain structures" checkbox. Selection of multiple filters is additive: if you modify multiple filters, all of them are applied to the family members list after you click "Filter member list". Use the "Reset" button to reset the filters to their defaults and avoid unexpected results after modifying filter parameters. The "Reset" button only resets the selected filter parameters, not the family members displayed below. To display the full list of family members after a reset, click "Filter member list" again. |
| accession_number | Each PlanTAP entry has a distinct accession number: a five-character string composed of a leading two-letter category prefix and a trailing unique three-digit number. Pattern: (TF|TR|PT)([0-9]{3}) |
| category |
Each PlanTAP family entry belongs to one of the following categories:
|
| citation(s) | Related literature references describing the PlanTAP entry. Follow the hyperlink to view the corresponding PubMed entry. |
| consensus domain(s) | During manual annotation, the matching InterPro domains were condensed to a set of consensus domains common to the majority of members. Each entry is hyperlinked directly to the corresponding InterPro entry. If your browser supports mouse-over information, use it to display additional details. |
| domain structure | Display a graphical view of the members' domain structures instead of the default full textual view. The images are scaled relative to the longest displayed member sequence. CAUTION: depending on the number of members to be displayed, this may take a while! |
| homology reduction | Phylogenetic inference of large clusters is computationally costly, and results from huge trees are difficult to interpret. A total of 102 clusters had more than 150 members; these were condensed via stepwise homology reduction until the threshold of 150 members was reached. Homology reduction was implemented in the same program as redundancy removal, but follows a different strategy. Beginning with 1 substitution per 100 aa and heuristically increasing this distance threshold, the distance matrix is iteratively scanned for sequence pairs with the respective distance, regardless of their species. The iteration stops when the number of remaining representative cluster members reaches a given limit (150 sequences). |
| in family clusters | Display only entries from specific PlanTAP family clusters. |
| in_tree? | Filter the member list using the "in tree" property described under "in tree" in the family members viewer documentation section. |
| is a? | Filter the member list using the "is a" property described under "is a" in the family members viewer documentation section. |
| last modified | Timestamp of the last modification of the entry. |
| main_family | Each PlanTAP entry was annotated to belong to an existing or new family of TAPs. |
| member species | Filter the member list to show only entries with exactly the same taxonomy string (the NCBI taxonomy full lineage represented by a single NCBI taxon ID). Each binomial species name stands for a full lineage taxonomy string, e.g. Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; commelinids; Poales; Poaceae; BEP clade; Ehrhartoideae; Oryzeae; Oryza; Oryza sativa (japonica cultivar-group). Owing to the way sequences are submitted to databases such as GenBank, another entry may carry the same binomial name but a slightly divergent lineage. These appear as two different entries in the filter list. Since multiple selections are possible, this should not be a problem. |
| number of clusters | Total number of clusters describing the PlanTAP family (= number of trees for the family). Multiple clusters depict the particular TAP family either from a different taxonomic perspective (e.g. restricted to the plant lineage vs. covering all kingdoms), or comprise different subfamilies. Because large TAP gene families are substantially divergent beyond their conserved domains, it appears more reasonable to deduce phylogenies from subgroups in order to use as much homologous sequence information as possible. |
| number of members | Total number of family member sequences. |
| number of members in trees | Total number of member sequences after homology reduction. |
| number of non-redundant members | Total number of family member sequences after the redundancy removal. See redundancy removal for more details. |
| number of queries | Total number of query sequences in the PlanTAP family. |
| redundancy removal | Searching both a large multi-species database like UniProt and the individual full-genome protein predictions greatly improves taxon sampling, but it also results in the detection of identical protein sequences in these overlapping databases. In addition, the same locus is often represented by more than one protein sequence due to divergent predicted gene models, splice variants, and sequencing and annotation errors. To cope with this problem, redundant copies of genes were eliminated prior to all functional analyses using an identity cutoff of 99% for sequences of the same species. For the removal of redundant sequences, a multiple sequence alignment was performed using MAFFT FFT-NS-2 and pairwise distances were calculated using the EMBOSS distmat program. The resulting matrix was scanned for sequence pairs from the same species with a distance within 1 substitution per 100 aa. For each pair, one representative was selected based on the originating database (UniProt sequences were preferred), sequence length and lexical sort order of the accession number. The procedure was implemented in Perl using several BioPerl modules, including a modified version of the Bio::Tools::Run::Alignment::MAFFT module. For parsing the distmat distance matrices, an object-oriented BioPerl module (Bio::Matrix::IO::distmat) was written. |
| redundant? | Filter the member list using the "redundant" property described under "redundant" in the family members viewer documentation section. |
| sub_family | Some of the PlanTAP families can be further divided into subfamilies. |
| taxonomic profile | For visualization of the distribution of TAP family members across all taxonomic lineages a taxonomic profile was created and is presented as a heat map. Initial tests using taxonomic resolution fixed at the kingdom or order level, respectively, were not able to resolve the expected phylogeny of the contributing taxa using columnwise clustering (data not shown). Therefore, those taxonomic groups which contributed significantly to the overall distribution were selected as columns, the remainder of the Eubacteria, protists, plants and animals was gathered into respective other columns. Thus, a non-redundant representation of the taxonomic distribution was created which is able to resolve the expected phylogeny using columnwise clustering. To overcome the sampling bias presented by fully sequenced genomes, the columns were normalized. Subsequent clustering yielded the significantly correlated groups. The filter "taxonomic profile" gives the opportunity to specifically select all member entries belonging to an individual taxonomic group. |
| user_contributed_trees | You can extend PlanTAPDB. If you want to contribute a manually curated or extended phylogeny of a PlanTAP family, send us an NHX-formatted tree with support values and species annotation, together with a short text describing the method used. |
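
The stepwise procedure described under "homology reduction" can be illustrated with a small greedy sketch. It assumes the pairwise distances are available as a dictionary keyed by sequence pairs, in substitutions per 100 aa. The step size of the threshold increase, the reading of "pairs with the respective distance" as "pairs at or below the current threshold", and the rule for which member of a pair is kept are assumptions made for this example; the documentation only states that the threshold is increased heuristically. The function name `homology_reduce` is hypothetical.

```python
from itertools import combinations


def homology_reduce(ids, dist, limit=150, start=1.0, step=1.0, max_threshold=100.0):
    """Greedy sketch of stepwise homology reduction.

    ids   -- iterable of sequence identifiers
    dist  -- dict mapping frozenset({a, b}) to a distance (substitutions per 100 aa)
    limit -- stop once at most this many representatives remain

    The step size and the "keep the lexically smaller ID" rule are
    illustrative assumptions, not the documented PlanTAPDB heuristics.
    """
    remaining = sorted(ids)
    threshold = start
    last_threshold = None  # last threshold that was actually applied
    while len(remaining) > limit and threshold <= max_threshold:
        kept = set(remaining)
        for a, b in combinations(remaining, 2):
            if len(kept) <= limit:
                break
            if a in kept and b in kept and dist[frozenset((a, b))] <= threshold:
                kept.discard(b)  # a sorts before b, so a stays as representative
        remaining = sorted(kept)
        last_threshold = threshold
        threshold += step
    return remaining, last_threshold


# Toy data: four hypothetical sequences, reduced to at most two representatives.
toy_dist = {
    frozenset(("s1", "s2")): 0.5,
    frozenset(("s1", "s3")): 3.0,
    frozenset(("s1", "s4")): 10.0,
    frozenset(("s2", "s3")): 3.2,
    frozenset(("s2", "s4")): 10.5,
    frozenset(("s3", "s4")): 9.0,
}
print(homology_reduce(["s1", "s2", "s3", "s4"], toy_dist, limit=2))
```

Unlike redundancy removal, the sketch ignores the species of the two sequences, in line with the "regardless of their species" wording above.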


| Sequence Retrieval |
The PlanTAPDB interfaces allow sequence retrieval in three ways:
|
| description | The member sequence's description line, i.e. the textual annotation provided by the originating database. |
| domains | Matching InterPro domains in order of occurrence along the sequence. If your browser supports mouse-over information, use it to display additional details, e.g. description, E-value, and start and stop positions of the match. |
| in #clusters | The number of clusters belonging to this family in which the member sequence occurs. Following the hyperlink opens an additional window that lists the PlanTAP family cluster(s) the entry is part of and provides hyperlinks to these clusters' ClusterView pages. |
| in tree | Is the member sequence part of any of the family MSAs and trees, or was it removed in the homology reduction? This is also a filter property. |
| is a | Was the member sequence a hit, a query, or both in the initial PSI-BLAST search? This is also a filter property. |
| length | Length of the member's amino acid sequence. |
| member_name | The unique accession number of a sequence, which can be a member of multiple PlanTAP families. The accession numbers of the member sequences are the identifiers used by their originating databases, e.g. UniProt, GenPept, TAIR, Cosmoss ... By following the hyperlink you can retrieve the individual sequence via the Cosmoss Sequence Retrieval System. |
| redundant | Was the member sequence tagged as redundant in the homology reduction in any of the member clusters? Sequences marked as redundant were excluded from the taxonomic profiling of the PlanTAP families. This is also a filter property. |
| repr. species | Scientific name of the organism the representative member sequence is derived from. For small clusters, only redundant sequences of the same organism are considered, whereas this is not the case for huge clusters where iterative homology reduction was performed. SYNTAX: Genus species (subspecies or variety...). This corresponds to the last two words of the corresponding NCBI Taxonomy full lineage string. Follow the hyperlink to access the corresponding NCBI taxonomy entry. |
| representative | Fellow member sequence which represents a sequence in at least one of the family MSAs and trees. By following the hyperlink you can retrieve the individual sequence via the Cosmoss Sequence Retrieval System. |
| species | The scientific name of the organism the member sequence is derived from. SYNTAX: Genus species (subspecies or variety...). This corresponds to the last two words of the corresponding NCBI Taxonomy full lineage string. Follow the hyperlink to access the corresponding NCBI taxonomy entry. |
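
Several of the properties above ("in tree", "is a", "redundant") double as filter properties, and the Filters documentation states that multiple filters are additive, i.e. all selected filters are applied together. A minimal sketch of that behaviour follows. The record layout and the key names `in_tree`, `is_a` and `redundant` are hypothetical stand-ins for illustration and do not reflect the actual PlanTAPDB schema.

```python
def filter_members(members, **criteria):
    """Keep members that satisfy every given criterion (filters combine with AND).

    A criterion whose value is None counts as "not set" and is ignored,
    which mimics a filter left at its default.
    """
    active = {key: value for key, value in criteria.items() if value is not None}
    return [
        member
        for member in members
        if all(member.get(key) == value for key, value in active.items())
    ]


# Hypothetical member records.
members = [
    {"member_name": "seqA", "is_a": "query", "in_tree": True, "redundant": False},
    {"member_name": "seqB", "is_a": "hit", "in_tree": True, "redundant": False},
    {"member_name": "seqC", "is_a": "hit", "in_tree": False, "redundant": True},
]

hits_in_tree = filter_members(members, is_a="hit", in_tree=True)
print([m["member_name"] for m in hits_in_tree])
```

Calling `filter_members(members)` without criteria returns the full list, which corresponds to filtering again after a reset.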


| Sequence Retrieval |
The PlanTAPDB interfaces allow sequence retrieval in three ways:
|
| algorithm | Multiple sequence alignment algorithm used for this cluster. |
| avg_ident | The average %identity of all pairwise distances of the cluster_members. |
| comment | Comments from the manual annotation phase. |
| description | The manual annotation inferred for the cluster. |
| f_quantile_ident | The first quartile of the summary statistics of all pairwise distances of the cluster_members. |
| fiala_stemminess | Tree measure as described in Fiala, K.L. and R.R. Sokal, 1985. Factors determining the accuracy of cladogram estimation: evaluation using computer simulation. Evolution, 39: 609-622. |
| from_step |
Our pipeline filters PSI-BLAST hits according to a six-step filtering scheme. "from_step" tells you which filter step was applied when the sequences resulting in this cluster were initially filtered. Sequences passing a specific filter step have to fulfill at least the alignment length and fraction identical criteria of the step:
|
| homology filtering | Searching both a large multi-species database like UniProt and the individual full-genome protein predictions greatly improves taxon sampling, but it also results in the detection of identical protein sequences in these overlapping databases. In addition, the same locus is often represented by more than one protein sequence due to divergent predicted gene models, splice variants, and sequencing and annotation errors. To cope with this problem, redundant copies of genes were eliminated prior to all functional analyses using an identity cutoff of 99% for sequences of the same species. Phylogenetic inference of large clusters is computationally costly, and results from huge trees are difficult to interpret. Clusters with more than 150 members were condensed via stepwise homology reduction until the threshold of 150 members was reached. To allow this process to be examined in detail for every cluster, we offer the distance matrix as a text file, a graphic of distribution plots of the cluster_member distances, and the initial MAFFT FFT-NS-2 alignment of the cluster used for the pairwise distances. |
| last_cutoff | The last %identity threshold applied in the homology reduction of the cluster_members. |
| longest_internal_branch_length | Length of the longest internal branch used to (midpoint-)root the phylogenetic tree of the cluster. |
| max_ident | The maximal %identity between two cluster members observed in the redundancy removal and homology reduction phase of our pipeline. |
| max_iteration | The maximal PSI-BLAST iteration the cluster_members are from. |
| median_ident | The median %identity of all pairwise distances of the cluster_members. |
| members | Total number of cluster_members before redundancy removal and homology reduction. |
| min_ident | The minimal %identity between two cluster members observed in the redundancy removal and homology reduction phase of our pipeline. |
| ml | The maximum likelihood of the consensus tree topology calculated with TREE-PUZZLE. |
| multiple alignment | Due to errors introduced by the alignment algorithm, a certain fraction of the columns in a multiple sequence alignment (MSA) generates noise that disturbs the correct inference of phylogenetic relationships. Such positions are usually removed manually in the course of a phylogenetic analysis. While current approaches to automated phylogenies mostly rely on unprocessed ClustalW alignments, we placed more emphasis on alignment quality to increase the reliability of the resulting phylogenies. Thus, we used a measure that describes evolutionarily informative sites. We implemented a best-of-two approach: in the first step, two alignments were calculated using different state-of-the-art algorithms and then filtered using the sum-of-pairs score; in the second step, the alignment with the maximal number of remaining columns was chosen. In version 1.0, the alignments consisted on average of 65% gaps and were reduced to 28% of the original alignment length by applying this procedure. In 71% of the cases the MAFFT G-INSI alignment was selected to represent the cluster, whereas ProbCons or Muscle were chosen for 29% of the clusters. The "best" alignment can be downloaded and viewed with the Jalview alignment editor applet. To make the MSA column filtering process comprehensible for each cluster, we also provide an overview graphic. |
| nleafs | Number of leaves (taxa) in this cluster's phylogenetic tree. |
| nr_distances | Total number of pairwise distances. |
| nr_members | The number of members after redundancy removal. |
| number_of_internals | Number of internal nodes of the phylogenetic tree. |
| number_of_nodes | Total number of nodes (internal nodes + leaves) of the tree. |
| number_of_terminals | Number of nodes without children. |
| phylogenetic trees | Many approaches to phylogenomics rely solely on a distance approach using Neighbor-Joining (NJ) (Saitou and Nei, 1987). However, NJ is known to be susceptible to noisy data, provides no confidence measures and makes it hard to compute reliable distances for strongly divergent sequences. Probabilistic approaches, like maximum likelihood (ML) and Bayesian methods, are known to overcome most of these problems, but both are very time-consuming and thus usually not applied in large-scale phylogenomics approaches. We followed a combined approach by calculating ML consensus branch lengths using gamma-distributed rates from bootstrapped NJ topologies. The ML consensus topology of the phylogenetic tree can be downloaded in NHX format and explored using the ATV Tree Viewer applet. |
| queries | Total number of queries present in the cluster. |
| redundancy |
The amount of shared (redundant) history on the total tree. Formula: 1 / ( treelength - height / ( ntax * height - height ) ) |
| remaining | The number of members after homology reduction. |
| removed | Number of sequences removed in the homology reduction. |
| resolution | The total number of internal nodes over the total number of internal nodes on a fully bifurcating tree of the same size. |
| sd_ident | The standard deviation of the %identities of all pairwise distances of the cluster_members. |
| t_quantile_ident | The third quartile of the summary statistics of all pairwise distances of the cluster_members. |
| total_paths | The sum of all root-to-tip path lengths of the phylogenetic tree of this cluster. |
| tree_height | For ultrametric trees (consistent with the molecular clock hypothesis) this value is the height of the tree. It is computed by averaging over all root-to-tip path lengths, so for additive trees the result should be interpreted accordingly. |
| tree_length | The sum of all branch lengths of the phylogenetic tree. |
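
Several of the tree measures listed above follow directly from their definitions. The sketch below computes them for a small rooted tree. It is an illustration only, not the code used by the PlanTAPDB pipeline: the `Node` class and `tree_stats` function are hypothetical, the root is counted as an internal node, and "a fully bifurcating tree of the same size" is taken to mean a rooted binary tree with the same number of leaves, which has (leaves - 1) internal nodes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    branch_length: float = 0.0  # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)


def tree_stats(root):
    """Compute some of the listed tree measures for a rooted tree (>= 2 leaves)."""
    internals = terminals = 0
    tree_length = 0.0
    path_lengths = []  # one root-to-tip path length per leaf
    stack = [(root, 0.0)]
    while stack:
        node, dist_from_root = stack.pop()
        tree_length += node.branch_length
        here = dist_from_root + node.branch_length
        if node.children:
            internals += 1
            stack.extend((child, here) for child in node.children)
        else:
            terminals += 1
            path_lengths.append(here)
    total_paths = sum(path_lengths)
    return {
        "number_of_internals": internals,
        "number_of_terminals": terminals,
        "number_of_nodes": internals + terminals,
        "tree_length": tree_length,
        "total_paths": total_paths,
        # tree_height as described above: average root-to-tip path length
        "tree_height": total_paths / terminals,
        # assumption: a rooted binary tree with n leaves has n - 1 internal nodes
        "resolution": internals / (terminals - 1),
    }


# Toy tree: one leaf directly below the root plus an unresolved three-leaf clade.
toy_tree = Node(children=[
    Node(1.0),
    Node(0.5, [Node(0.5), Node(0.5), Node(0.5)]),
])
stats = tree_stats(toy_tree)
for name, value in stats.items():
    print(name, value)
```

The toy tree contains a trifurcation, so its resolution is below 1; a fully bifurcating tree would yield exactly 1 under the stated assumption.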
