Gene prediction pipeline/JGI

From PhyscomeProjectWiki


JGI gene prediction pipeline


Several genome analysis, gene prediction, and annotation methods were integrated into the JGI annotation pipeline to annotate the genome of P. patens. First, predicted transposable elements were masked in the P. patens genome assembly using RepeatMasker (5) and a repeat library composed of a non-redundant set of (i) overrepresented oligonucleotides identified during the assembly process, (ii) fragments of draft ab initio gene models homologous to known transposable elements, and (iii) manually curated repeats.

Second, gene models were built using several approaches. Initially, 3,154 putative full-length genes with ORFs of 150 bp or longer were derived from 31,951 clusters of P. patens ESTs and mapped to the genomic sequence. Next, protein sequences from GenBank and IPI (6, 7) were aligned against the scaffolds using BLASTX (8) and post-processed to co-linearize high-scoring hits and to select the best non-overlapping set of BLAST alignments. These alignments were used primarily as seeds for the gene prediction tools Genewise (9) and Fgenesh+ (10). All resulting Genewise models were then extended to include the nearest 5' methionine and 3' stop codons. Subsequently, ab initio gene models were predicted using Fgenesh (10) with parameters derived from training on known P. patens genes. In addition, 220,055 ESTs and the consensus sequences of their clusters were aligned with the scaffolds using BLAT (11) and used to extend and correct predicted gene models where exons in the ESTs/cDNAs overlapped the gene model and extended it into flanking UTR.

Over 225,000 putative gene models were generated using the above-mentioned gene predictors. Their translated amino acid sequences were aligned against known proteins from the NCBI non-redundant set and other databases such as KEGG (12).
In addition, each predicted model was analyzed for domain content/structure using InterProScan (13) with a suite of tools such as Blast/HMM/ScanRegEx against the domain libraries Prints, Prosite, PFAM, ProDom and SMART. Finally, to produce a non-redundant set of 35,938 gene models, the "best" model at every locus with overlapping models was selected according to homology with known proteins and EST support. Annotations for this set of genes were summarized in terms of Gene Ontology (14), eukaryotic clusters of orthologs (KOGs) (15), and KEGG pathways (12). Predicted gene models and their annotations were further manually curated and submitted to GenBank.
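
The extension of Genewise models to the nearest 5' methionine and 3' stop codon can be illustrated with a short sketch. This is not the JGI implementation: it assumes a single-exon model on the forward strand, given as a half-open, in-frame interval, and the function name extend_orf is hypothetical. It also assumes that the upstream search stops at an in-frame stop codon, a detail the description above does not specify.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_orf(seq, start, end):
    """Extend the in-frame interval [start, end) on the forward strand.

    Walks upstream codon by codon to the nearest in-frame ATG and
    downstream to the nearest in-frame stop codon. Returns the new
    (start, end); a side is left unchanged if no extension is found.
    Assumption: single exon, no introns, forward strand only.
    """
    seq = seq.upper()
    if end <= start or (end - start) % 3 != 0:
        raise ValueError("interval must be non-empty and a multiple of 3 long")

    # 5' extension: search upstream unless the model already starts with ATG.
    new_start = start
    if seq[start:start + 3] != "ATG":
        pos = start - 3
        while pos >= 0:
            codon = seq[pos:pos + 3]
            if codon in STOP_CODONS:
                break  # frame is closed upstream; keep the original start
            if codon == "ATG":
                new_start = pos
                break
            pos -= 3

    # 3' extension: search downstream unless the model already ends in a stop.
    new_end = end
    if seq[end - 3:end] not in STOP_CODONS:
        pos = end
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                new_end = pos + 3  # include the stop codon itself
                break
            pos += 3

    return new_start, new_end


if __name__ == "__main__":
    genome = "CCATGAAACCCGGGTTTTAGCC"
    new_start, new_end = extend_orf(genome, 8, 14)  # partial model: CCCGGG
    print(genome[new_start:new_end])
```

A real pipeline would also have to handle reverse-strand models and multi-exon structures, where the walk proceeds through spliced coordinates and not through the raw scaffold.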

Selection of the "best model" - Filtered models

Several gene prediction methods were used (various methods utilizing Fgenesh and Genewise). A representative, non-redundant set of gene models was then derived by choosing among each set of overlapping models according to:

  • modified alignment score [S' = S * CVR1 * CVR2] of the best BLAST hit,
  • EST coverage,
  • model completeness,
  • length of the protein/transcript
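
The selection step can be sketched as follows. The criteria above are listed without weights, so this sketch makes assumptions that are not taken from the source: the criteria are applied in the listed order as a lexicographic tie-break, overlapping models are grouped by simple interval overlap on one strand, and CVR1 and CVR2 are treated as opaque multiplicative factors because they are not defined here. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    name: str
    start: int             # half-open genomic interval [start, end)
    end: int
    blast_score: float     # S: alignment score of the best BLAST hit
    cvr1: float            # CVR1 and CVR2: factors from the formula above,
    cvr2: float            # not defined in the text (treated as opaque)
    est_coverage: float
    complete: bool
    protein_length: int

    @property
    def modified_score(self):
        # S' = S * CVR1 * CVR2
        return self.blast_score * self.cvr1 * self.cvr2


def group_into_loci(models):
    """Group models whose intervals overlap, directly or transitively."""
    loci, current, current_end = [], [], None
    for m in sorted(models, key=lambda m: (m.start, m.end)):
        if current and m.start < current_end:
            current.append(m)
            current_end = max(current_end, m.end)
        else:
            if current:
                loci.append(current)
            current, current_end = [m], m.end
    if current:
        loci.append(current)
    return loci


def rank_key(m):
    # Assumed priority order: S', EST coverage, completeness, protein length.
    return (m.modified_score, m.est_coverage, m.complete, m.protein_length)


def select_best_models(models):
    """Return one representative model per locus."""
    return [max(locus, key=rank_key) for locus in group_into_loci(models)]


if __name__ == "__main__":
    candidates = [
        GeneModel("fgenesh_1", 100, 500, 200.0, 0.9, 0.8, 0.5, True, 120),
        GeneModel("genewise_1", 150, 600, 180.0, 1.0, 0.9, 0.7, True, 140),
        GeneModel("fgenesh_2", 1000, 1500, 50.0, 0.5, 0.5, 0.0, False, 90),
    ]
    for best in select_best_models(candidates):
        print(best.name, best.modified_score)
```

A weighted sum of the four criteria would be an equally plausible reading of the list; the lexicographic key is used here only because it needs no extra parameters.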