Genome Annotation/V3.1

From PhyscomeProjectWiki



This page is under construction...

It is meant to document both the annotation process and the resulting data accessible via the interfaces.

You can browse V3.1 using gbrowse:


The Physcomitrella patens V3.1 genome annotation

Improved Locus Definition - gene ids (CGIs)

CGIs provide the unique address of a gene (protein-coding or non-protein-coding) on a given assembly. CGIs also function as primary IDs/accession numbers to access genes and gene products in V3.1. On the V3 assembly, CGIs are located either on pseudo-chromosomes or on scaffolds. Locus ids are ordered from the 5' to the 3' end of the reference sequence.

CGI Syntax

Like a German car licence plate, the CGI carries a lot of information about the gene product you're looking at. It comprises the fields species, assembly version, reference sequence, locus index, type and (sequence) index.


Inference of locus indices

With V3.1 we further improved the cosmoss gene id concept, adopting the locus id incrementation scheme employed by TAIR [1].

In order to allow both the subsequent addition of novel loci and stable accession numbers, V3.1 CGIs increment by a dynamic window (+10 minimum), i.e. allowing the addition of at least 10 genes between two previously annotated loci. The window size is determined independently for each subsequent locus to accommodate gene-rich as well as gene-sparse genomic regions. The dynamic component is determined by the median gene density of the V3.1 assembly (see below) and a fixed gap bonus (200 genes for gaps > 5000 bp).

The initial CGIs were derived for all models, including those filtered as transposon-derived in subsequent steps, using the following formula:

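$$
s = \begin{cases}
w_{ij} \times D & w_{ij} \times D > 10 \\
10 & w_{ij} \times D \leq 10
\end{cases}
$$

$$
C_j = \begin{cases}
\lVert C_i + s \rVert_{10}       & g_{ij} \leq 5000 \\
\lVert C_i + s \rVert_{10} + 200 & g_{ij} > 5000
\end{cases}
$$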

The locus index C_j of the next gene j was calculated from the intergenic distance w_ij between the two genes i and j, the locus index C_i of gene i, the median gene density D, the total gap length g_ij between genes i and j, a constant minimal increment of 10, and a constant increment of 200 to account for large gaps (> 5000 bp). The resulting locus indices C_j were rounded to the next multiple of 10.
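The incrementation scheme described above can be sketched in Python. This is a minimal illustration, not the pipeline's actual code: the function name and parameters are hypothetical, and it assumes that D is given in genes per Mbp (so that w_ij × D is converted to an expected number of genes) and that "rounded to the next multiple of 10" means rounding up.

```python
import math

MIN_INCREMENT = 10    # constant minimal CGI increment (from the text)
GAP_BONUS = 200       # constant increment for large gaps (from the text)
LARGE_GAP_BP = 5000   # gaps longer than this count as large (from the text)


def next_locus_index(c_i, w_ij_bp, g_ij_bp, density_per_mbp):
    """Return the locus index C_j of the gene following gene i.

    ASSUMPTION: w_ij is in bp and D in genes per Mbp, so their product
    is divided by 1e6 to obtain an expected number of genes.
    """
    expected_genes = w_ij_bp * density_per_mbp / 1e6
    # s is the dynamic window, but never smaller than the minimum of 10.
    s = expected_genes if expected_genes > MIN_INCREMENT else MIN_INCREMENT
    # ASSUMPTION: rounding to the next multiple of 10 is a ceiling.
    c_j = math.ceil((c_i + s) / 10) * 10
    if g_ij_bp > LARGE_GAP_BP:
        c_j += GAP_BONUS
    return c_j


# Hypothetical inputs; 320 is the median gene density reported below.
print(next_locus_index(100, 1618, 0, 320))       # short distance
print(next_locus_index(100, 100000, 0, 320))     # long distance
print(next_locus_index(100, 1618, 6000, 320))    # large gap in between
```

Under these assumptions, overlapping genes (negative w_ij) still receive the minimum increment of 10, so locus indices always grow along the reference sequence.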

V3.1 Gene distances, density and gap sizes

distances [bp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-26680     248    1618    4771    5867  129900 
chromosomal distances [bp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-26680     274    1716    4908    6130  129900 

Distribution plots of gene distances
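A summary like the one above can be reproduced from gene coordinates. The sketch below is illustrative only: it assumes that the distance between two neighbouring genes is the start of the downstream gene minus the stop of the upstream gene, which would explain negative values for overlapping genes. The function names and the example coordinates are hypothetical.

```python
import statistics


def gene_distances(genes):
    """Distances between consecutive genes on one reference sequence.

    genes: list of (start, stop) tuples.
    ASSUMPTION: distance = start of next gene - stop of current gene,
    so overlapping genes yield negative distances.
    """
    ordered = sorted(genes)
    return [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]


def six_number_summary(values):
    """Min, quartiles, mean and max, similar to R's summary()."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(values),
        "q3": q3,
        "max": max(values),
    }


# Hypothetical gene coordinates; the second and third genes overlap.
genes = [(100, 200), (300, 450), (400, 600), (1600, 1700)]
distances = gene_distances(genes)
print(distances)
print(six_number_summary(distances))
```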

gene density [Mbp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 33.44  214.80  320.00  389.80  499.30 5051.00 
gene density on chromosomes [Mbp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 137.1   145.5   148.7   149.3   152.6   166.8 

Distribution plots of gene densities

gap size [bp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   100     100     260    2264    2018   43280 
size of gaps larger than 5 kb [bp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5001   10000   10000   10580   10000   43280

Distribution plots of gap sizes

CGI Examples

Normal distances
cgi | reference | start | stop | predictor tag | gene_offsets | gaps>100bp | nmodels | total_gap_length (gij) | is_selected | predicted

Larger distances
cgi | reference | start | stop | predictor tag | gene_offsets | gaps>100bp | nmodels | total_gap_length (gij) | is_selected | predicted

Large gaps
cgi | reference | start | stop | predictor tag | gene_offsets | gaps>100bp | nmodels | total_gap_length (gij) | is_selected | predicted

Multi-level Annotation Pipeline behind V3.1

Overall Aim

  • Guarantee continuity to existing cosmoss annotation
    • Quality of structural annotations
      • including manual curations (more than 600 high-quality curations)
      • including more experimental evidence
    • Being able to transfer substantial existing functional annotation
      • e.g. GO and gene names
      • Ongoing projects extending annotation (e.g. literature curation, MossCyc)
  • Cosmoss Gene IDs
    • ID-Lookup: ensure continuity of Physcomitrella annotation

Structural annotation

  • Incorporate all data
    • Also in-house data which couldn't be released yet
      • RNASeq
      • Proteomics (N-termini and total protein coverage)
      • microarrays
  • Use updated transcript data:
    • train ab initio models
    • evidence for gene calling
  • Gene prediction for all scaffold partitions
  • Use combiner
  • Incorporate/Detect non-protein-coding genes early on
  • Optimally handle pseudoalleles (identical (tandem-) paralogs)

Used data - short reads

  • 84 libraries mapped with tophat-2.0.8b
    • In addition to what was used by JGI
    • 36 libraries SRA
    • 48 internal libraries (Reski or collaborators)
  • Not all these libraries are suitable for gene body prediction
    • SAGE tags, CAPSeq, sRNA

Training gene finders (EuGene/Augustus)

  • Using splice sites from short reads and EST alignments
  • Manual curation of “manually curated gene models”
    • Reducing >3000 models → 603 models
    • Only 2 models from the user_model track of the old JGI browser (most were just promoted ab initio models)
  • Augustus was excluded due to poor performance on the training set (30% sensitivity at transcript level)
  • All EuGene sensors retrained
  • Multiple rounds of training/parameter settings, both with and without alternative splicing
  • Original V1.6 model (BMC 2013) was used as well

Evaluation of gene predictions on the training set

  • Using eval software
  • All gene predictions were reduced to a representative splice variant
  • Performance measures (sensitivity and specificity)
    • Nucleotide
    • Exon
    • Transcript
    • Gene
    • Combined score (based on Gene+Exon/sensitivity+specificity) for ranking


predictor tag | source | Gene Sensitivity | Gene Specificity | Transcript Sensitivity | Transcript Specificity | Exon Sensitivity | Exon Specificity | Nucleotide Sensitivity | Nucleotide Specificity | score
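The performance measures listed above can be illustrated with a short sketch. It assumes the convention common in gene prediction benchmarks, where sensitivity is TP/(TP+FN) and specificity is TP/(TP+FP). The exact formula of the combined score is not given on this page; the unweighted mean of the gene-level and exon-level values used here is only a placeholder assumption, and all function names are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of annotated features that were predicted correctly."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def specificity(true_pos, false_pos):
    """Fraction of predicted features that are correct.

    ASSUMPTION: gene-prediction convention, TP / (TP + FP).
    """
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def combined_score(gene_sn, gene_sp, exon_sn, exon_sp):
    """Placeholder ranking score.

    ASSUMPTION: unweighted mean of gene and exon sensitivity and
    specificity; the real formula is not documented here.
    """
    return (gene_sn + gene_sp + exon_sn + exon_sp) / 4


# Hypothetical counts for one predictor evaluated on the training set.
gene_sn = sensitivity(true_pos=300, false_neg=303)
gene_sp = specificity(true_pos=300, false_pos=150)
exon_sn = sensitivity(true_pos=2400, false_neg=600)
exon_sp = specificity(true_pos=2400, false_pos=400)
print(round(combined_score(gene_sn, gene_sp, exon_sn, exon_sp), 3))
```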

Assembly of short read and EST evidences

EST and short read data cannot be used directly as evidence for gene prediction. Additionally, if coverage is optimal, full-length mature transcripts and thus gene structures can be inferred directly by read assembly. This can be achieved either de novo (e.g. with Trinity) or using a genomic sequence as reference (e.g. with PASA or Cufflinks). Both approaches have their drawbacks. De novo assembly can result in fragmentary assemblies:


The three features highlighted in yellow represent three independent Trinity short read assemblies from the same experiment. From their overlap with the V3.1 gene model below we can clearly deduce that they are fragments of a single transcript and not full-length transcripts of three distinct genes.

Genome-guided assembly also has its problems as demonstrated in the figure below:


The different colors in the track "mapping test" represent short read assemblies from distinct experimental samples generated by Cufflinks, which was initially developed for vertebrate genomes. Many of these assemblies represent fusions of at least two neighboring genes. On the moss genome, PASA deals with this problem somewhat better, so we used PASA to assemble the data sets.

The cufflinks data were used to filter libraries to be used based on manual inspection to ensure:

  • Good coverage
  • Splice sites

De-novo assembly of short reads and spliced alignments

  • 21 filtered short read libraries assembled with Trinity (r2013_08_14)
  • Seqclean
  • Stats:
    • 1,702,106 transcripts total
    • Mean: 77,368.45 transcripts
    • Mean length: 1,219.96bp
  • Adding Sanger and 454 ESTs
  • Mapped with GenomeThreader
Data set | Total sequences | Mapped | Mapped [%]
Sanger reads combined | 518,256 | 476,307 | 91.91%
454 reads | 631,313 | 576,759 | 91.36%
Trinity assemblies combined | 1,702,106 | 1,702,082 | 100.00%
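The mapping rates above can be checked with a few lines of Python. This sketch assumes that the first number in each row is the total number of sequences and the second the number mapped by GenomeThreader; the variable and function names are hypothetical.

```python
# (total sequences, mapped sequences) as reported in the rows above
MAPPING_COUNTS = {
    "Sanger reads combined": (518256, 476307),
    "454 reads": (631313, 576759),
    "Trinity assemblies combined": (1702106, 1702082),
}


def mapped_percent(total, mapped):
    """Percentage of sequences with a spliced alignment, two decimals."""
    return round(100.0 * mapped / total, 2)


for name, (total, mapped) in MAPPING_COUNTS.items():
    print(f"{name}: {mapped_percent(total, mapped):.2f}%")
```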

Genome guided assembly with PASA

The resulting spliced alignments were assembled using PASA:

Transcripts or Assemblies | Count
Total transcript seqs | 2755148
Fli cDNAs | 0
partial cDNAs (ESTs) | 2755148
Number transcripts with any alignment | 2438714
Valid custom alignments | 2260452
Total Valid alignments | 2260452
Valid FL-cDNA alignments | 0
Valid EST alignments | 2260452
Number of assemblies | 266051
Number of subclusters (genes) | 68382
Number of fli-containing assemblies | 0
Number of non-fli-containing assemblies | 266051

All Models

Predictor tags - What's E1?

predictor tag | name | algorithm | source | predicted as
P1 | v3_real.gene_structures_post_PASA_updates.31614 | EVM/PASA 5th iteration | cosmoss | protein-coding_gene
P2 | v3_real.gene_structures_post_PASA_updates.31978 | EVM/PASA 6th iteration | cosmoss | protein-coding_gene
N1 | combined ncRNA predictions | Infernal, aragorn, tRNAScan, RNAammer, mirbase, snoscan, Eugene | cosmoss | ncRNA_gene
P0 | putative long ncRNA | PASA without transdecoder support and unselected ab initio models | cosmoss | ncRNA_gene

Locus Clustering

Filtering of Transposable Element-derived Gene Models

Selected Models for the V3.1 Gene Catalog


To avoid confusion and improve database cross-referencing, V3.1 no longer offers Phypa_ids. Phypa_ids were used as secondary accession numbers in previous versions such as Genome Annotation/V1.6. They were originally based on JGI protein/transcript IDs.

If you only know the Phypa_id of your favorite gene, you can map it to a V1.6 CGI (e.g. using this Excel table). V1.6 models were mapped to V3.1 and can thus be looked up in the cosmoss genome browser simply by searching with the CGI.

Splice Variants

mRNAs derived from the same locus, i.e. splice variants, are children of the same gene feature (the variant is indicated by the number after the V6 in V1.6 CGIs):

gene    ID=Pp1s1_5V6;Name=Pp1s1_5V6
mRNA    ID=Pp1s1_5V6.1;Parent=Pp1s1_5V6;Alias=Phypa_422107
mRNA    ID=Pp1s1_5V6.2;Parent=Pp1s1_5V6;Alias=Phypa_422004
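The parent-child relationship shown above can be read programmatically. The following sketch groups mRNA features by their Parent attribute and keeps the Phypa_id alias of each variant. It works on the abbreviated two-column excerpt from this page, not on a full nine-column GFF3 file, and the function names are hypothetical.

```python
from collections import defaultdict

# Abbreviated excerpt from this page: feature type and attribute column only.
GFF_EXCERPT = """\
gene    ID=Pp1s1_5V6;Name=Pp1s1_5V6
mRNA    ID=Pp1s1_5V6.1;Parent=Pp1s1_5V6;Alias=Phypa_422107
mRNA    ID=Pp1s1_5V6.2;Parent=Pp1s1_5V6;Alias=Phypa_422004
"""


def parse_attributes(field):
    """Turn 'key=value;key=value' into a dictionary."""
    return dict(pair.split("=", 1) for pair in field.split(";") if pair)


def splice_variants_by_gene(text):
    """Map each gene ID to a list of (mRNA ID, Phypa alias) tuples."""
    variants = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        feature_type, attributes = line.split(None, 1)
        attrs = parse_attributes(attributes.strip())
        if feature_type == "mRNA":
            variants[attrs["Parent"]].append((attrs["ID"], attrs.get("Alias")))
    return dict(variants)


for gene_id, mrnas in splice_variants_by_gene(GFF_EXCERPT).items():
    print(gene_id, mrnas)
```

The same dictionary can be inverted to look up the V1.6 CGI for a known Phypa_id.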

V3.1 does not yet include splice variants. Although some of the predictors generated splice variants (e.g. P2, T1, ...), we decided not to incorporate them into the final release, because the data are still too noisy and more analysis is needed to guarantee reliable quality. Instead, the selected V3.1 models should always provide the major, functional isoform encoding the evolutionarily conserved gene product. This was achieved by a multiple-evidence machine learning process trained to select the optimal gene model per locus. For details see the section on locus clustering.

Functional Annotation, Gene Names and Gene Families


The flat file distribution of V3.1 (FASTA, GFF3, GAF2 etc.) will be made available via the Downloads section as soon as the paper describing the analysis of the V3 assembly and V3.1 annotation is accepted for publication.


This document and the V3.1 genome annotation were created by Daniel Lang with contributions from:

Andreas Zimmer
Initial user models, SpliceMachine/Augustus/Eugene training and prediction
Shu Shenqiang and David Goodstein (JGI)
Gene models of the V3.0 [J1, J2, J3]
Nico van Gessel
non-protein coding gene annotation

None of this would be possible without the tremendous work and investments done by the JGI. The V3 assembly/scaffolding was created by Jeremy Schmutz and Jerry Jenkins. The underlying genetic map was inferred by Wellington Muchero based on the mapping populations from Andrew Cuming, Stuart McDaniel and the Reski lab with Stefan Rensing as head of the consortium and external PI at the JGI.
