Genome Annotation/V1.2

From PhyscomeProjectWiki

Based on assembly: Genome_Assembly/V1.2
Gene prediction pipeline: JGI

Filtering steps

Starting from the Physcomitrella gene models v1.1 (protein-coding genes), a filtered set of gene models was created by excluding models that meet any of the following criteria:

1. Overlap with LTR_retrotransposons and fragments (MIPS angela pipeline)
2. Overlap with TE related PFAM domain
3. Overlap with helitron transposable elements
4. Overlap with tRNA genes
5. Overlap with miRNA precursors
6. Corresponding scaffold is a bacterial contamination

1. Overlap with LTR_retrotransposons and fragments

Physcomitrella gene models v1.1 were excluded if more than 5% of their CDS is covered by an LTR transposable element (complete elements and fragments). Gene models with TE overlap that are supported by ESTs unique to the locus were "rescued", i.e. kept. In total, 7,685 gene models were excluded from v1.1 because of LTR retrotransposons.
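The 5% coverage criterion and the EST rescue can be sketched as follows. This is only an illustration, not the pipeline that was actually run: the half-open (start, end) interval representation, the function names and the has_unique_est flag are assumptions introduced here, and all intervals are assumed to lie on the same scaffold.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(cds_exons, te_intervals):
    """Fraction of CDS bases covered by at least one TE interval."""
    cds = merge_intervals(cds_exons)
    tes = merge_intervals(te_intervals)  # merging avoids double counting
    total = sum(end - start for start, end in cds)
    if total == 0:
        return 0.0
    covered = 0
    for c_start, c_end in cds:
        for t_start, t_end in tes:
            covered += max(0, min(c_end, t_end) - max(c_start, t_start))
    return covered / total


def exclude_for_ltr(cds_exons, te_intervals, has_unique_est, threshold=0.05):
    """True if the model should be dropped under the LTR criterion.

    has_unique_est is a hypothetical flag standing in for the EST rescue.
    """
    if has_unique_est:
        return False  # rescued: unique EST support overrides TE overlap
    return covered_fraction(cds_exons, te_intervals) > threshold
```

Note that the comparison is strict, so a model with exactly 5% coverage is kept, matching the "more than 5%" wording.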

2. Overlap with TE related PFAM domain

Gene models that do not overlap with LTR transposons but overlap with, or consist of, a TE-related PFAM domain were excluded. This removed 5 gene models from the set.

3. Overlap with helitron transposable elements

Gene models overlapping helitrons were also excluded. This removed 10 gene models.

4. Overlap with tRNA genes

Gene models associated with rRNA genes were also removed. 69 tRNA genes were removed from the protein-coding gene set.

5. Overlap with miRNA precursors

In addition, gene models overlapping miRNA precursors were removed. This removed 91 gene models.
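Steps 3 to 5 all reduce to the same operation: drop any gene model that overlaps a feature of a given type (helitron, tRNA gene, miRNA precursor). A minimal sketch of such an overlap test is shown below. The (scaffold, start, end) tuple layout with half-open coordinates, the function names and the identifiers in the usage example are hypothetical, and strand is ignored here because the source does not say whether it was considered.

```python
def overlaps(a, b):
    """True if two (scaffold, start, end) half-open features share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def models_overlapping(gene_models, features):
    """Return sorted IDs of gene models that overlap any of the features.

    gene_models: dict mapping model ID -> (scaffold, start, end)
    features:    list of (scaffold, start, end)
    """
    return sorted(
        model_id
        for model_id, region in gene_models.items()
        if any(overlaps(region, feature) for feature in features)
    )


# Hypothetical example data, not taken from the real annotation.
example_models = {
    "model_A": ("scaffold_1", 100, 500),
    "model_B": ("scaffold_1", 600, 900),
    "model_C": ("scaffold_2", 100, 500),
}
example_trnas = [("scaffold_1", 450, 530)]
print(models_overlapping(example_models, example_trnas))
```

For genome-scale data a sorted sweep or an interval tree would replace the nested scan, but the decision rule stays the same.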

6. Corresponding scaffold is a bacterial contamination

Contamination was, and still is, present in the Physcomitrella genome assembly. BLAST searches against the bacterial genomes sequenced at the JGI revealed an overlap in sequence between the Physcomitrella genome and the proteobacterium Delftia acidovorans.

BLASTN searches with the whole Delftia genome sequence yielded hits on 117 Physcomitrella scaffolds (e-value < 1e-4). These scaffold sequences were further analyzed using BLASTN and BLASTP searches against GenBank and GenPept, respectively.
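Collecting the scaffolds with hits below the e-value cutoff could look like the sketch below. It assumes the BLASTN results were written in the standard 12-column tabular format (BLAST+ `-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) with Delftia as query and the scaffolds as subject; the source does not state which output format or search direction was used, and the function name and sample rows are hypothetical.

```python
import io


def scaffolds_with_hits(handle, max_evalue=1e-4):
    """Return the set of subject IDs having at least one hit with e-value < max_evalue."""
    hits = set()
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]      # sseqid column
        evalue = float(fields[10])  # evalue column
        if evalue < max_evalue:
            hits.add(subject_id)
    return hits


# Made-up rows for illustration only.
sample = io.StringIO(
    "delftia_chr\tscaffold_A\t98.5\t1200\t10\t2\t1\t1200\t50\t1249\t0.0\t2100\n"
    "delftia_chr\tscaffold_B\t80.0\t60\t12\t0\t1\t60\t5\t64\t0.01\t40.1\n"
    "delftia_chr\tscaffold_C\t91.2\t300\t20\t1\t1\t300\t9\t308\t2e-30\t410\n"
)
print(sorted(scaffolds_with_hits(sample)))
```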

Scaffolds yielding a significant hit against GenBank (>100 nt and >75% identity) and a hit against GenPept (excluding Physcomitrella/Delftia hits; e-value 1e-4, alignment length of the best HSP >80 aa, and >50% identity) were identified as bacterial contamination. Furthermore, these 102 scaffolds have no Physcomitrella EST/cDNA support, no miRNAs, no small RNAs and no transposable elements.
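The thresholds above can be read as a simple predicate over the hits of a scaffold. The sketch below is one possible reading, not the original implementation: the record types and field names are hypothetical, the e-value of 1e-4 is treated as a strict upper bound (the text gives no comparison operator), and both a nucleotide and a protein hit are required because the text joins them with "and".

```python
from dataclasses import dataclass


@dataclass
class NucleotideHit:
    length_nt: int
    identity_pct: float


@dataclass
class ProteinHit:
    organism: str            # e.g. "Genus species" of the subject sequence
    evalue: float
    best_hsp_length_aa: int
    identity_pct: float


EXCLUDED_GENERA = ("Physcomitrella", "Delftia")


def significant_nt(hit):
    """GenBank criterion: >100 nt and >75% identity."""
    return hit.length_nt > 100 and hit.identity_pct > 75


def significant_protein(hit):
    """GenPept criterion, ignoring Physcomitrella and Delftia hits."""
    genus = hit.organism.split()[0] if hit.organism.strip() else ""
    if genus in EXCLUDED_GENERA:
        return False
    return (
        hit.evalue < 1e-4  # assumed to be an upper bound
        and hit.best_hsp_length_aa > 80
        and hit.identity_pct > 50
    )


def is_bacterial_contamination(nt_hits, protein_hits):
    """True if the scaffold has a qualifying hit in both searches."""
    return any(significant_nt(h) for h in nt_hits) and any(
        significant_protein(h) for h in protein_hits
    )
```

The absence of EST/cDNA, miRNA, small RNA and TE evidence is reported as a property of the flagged scaffolds, so it is left out of the predicate here.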

According to these results the corresponding gene models were removed from the protein coding gene set.

117 gene models were removed due to bacterial contamination. In addition, 12 gene models on 9 scaffolds were removed (internal wiki).
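The per-step counts reported on this page can be added up as a quick cross-check. The total below is only meaningful under the assumption, not stated on this page, that no gene model is counted in more than one step; the dictionary keys are labels chosen here.

```python
# Counts as reported in the filtering steps above.
removed = {
    "LTR retrotransposons": 7685,
    "TE-related PFAM domain": 5,
    "helitrons": 10,
    "tRNA genes": 69,
    "miRNA precursors": 91,
    "bacterial contamination": 117,
    "additional models on 9 scaffolds": 12,
}

total_removed = sum(removed.values())
print(total_removed)  # 7989, assuming the categories do not overlap
```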



Image:Phypa1_1_filtered.png

Non-protein coding genes still in the catalog

Phypa_81206: overlaps with tRNA Phypa_tRNA_Leu_93_3
Phypa_81199: overlaps with tRNA Phypa_tRNA_Leu_93_1

Protein coding genes with manual annotations

We are trying to improve the annotation in the description lines of the protein and transcript sequences.

For this we are currently establishing a web interface to allow manual (functional) annotation of genes.

We have incorporated data from the JGI database as far as possible; so far, 985 manual annotations could be included in the genes' description lines. --Lang 12:49, 10 October 2008 (UTC)
