Genome Annotation/V3.1

From PhyscomeProjectWiki

This page is under construction...

It is meant to document both the annotation process and the resulting data accessible via the cosmoss.org interfaces.

You can browse V3.1 using gbrowse: http://www.cosmoss.org/fgb2/gbrowse/V3.1/

The Physcomitrella patens V3.1 genome annotation

Improved Locus Definition - cosmoss.org gene ids (CGIs)

CGIs provide the unique address of a gene (protein-coding or non-protein-coding) on a given assembly. CGIs also function as primary IDs/accession numbers to access genes and gene products in V3.1. On the V3 assembly, CGIs can be localized either on pseudo-chromosomes or on scaffolds. Locus ids are ordered from the 5' to the 3' end of the reference sequence.


CGI Syntax

Like a German car licence plate, the CGI carries a lot of information about the gene product you're looking at. It comprises the fields species, assembly version, reference sequence, locus index, type and (sequence) index.

Image:CGI_V3.1.png
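
As a rough illustration of the field layout, the locus part of a CGI such as Pp3c1_15680 (V3.1) or Pp1s1_5 (V1.6) can be split with a regular expression. This is a minimal sketch, not the cosmoss.org implementation: the names `CgiLocus` and `parse_cgi_locus` are hypothetical, reading `c` as pseudo-chromosome and `s` as scaffold is an assumption, and the type and (sequence) index fields are left out because their V3.1 form is not spelled out in the text.

```python
import re
from typing import NamedTuple


class CgiLocus(NamedTuple):
    species: str      # e.g. "Pp"
    assembly: int     # assembly version, e.g. 3
    ref_type: str     # "c" or "s" (assumed: pseudo-chromosome / scaffold)
    ref_index: int    # number of the reference sequence
    locus_index: int  # locus index on that reference sequence


# Matches only the locus prefix; anything after the locus index is ignored.
_CGI_LOCUS = re.compile(r"^([A-Z][a-z])(\d+)([cs])(\d+)_(\d+)")


def parse_cgi_locus(cgi: str) -> CgiLocus:
    """Split the locus part of a CGI into its fields."""
    match = _CGI_LOCUS.match(cgi)
    if match is None:
        raise ValueError(f"not a recognisable CGI: {cgi!r}")
    species, assembly, ref_type, ref_index, locus_index = match.groups()
    return CgiLocus(species, int(assembly), ref_type,
                    int(ref_index), int(locus_index))


print(parse_cgi_locus("Pp3c1_15680"))
print(parse_cgi_locus("Pp1s1_5V6.1"))
```

Because the pattern is anchored only at the start, the same function accepts V1.6 identifiers that carry a version and splice variant suffix.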

Inference of locus indices

With V3.1 we further improved the cosmoss gene id concept, adopting the locus id incrementation scheme employed by TAIR [1].

In order to allow both the subsequent addition of novel loci and stable accession numbers, V3.1 CGIs increment by a dynamic window (+10 minimum), i.e. allowing the addition of a minimum of 10 genes between two previously annotated loci. The window size is determined for each subsequent locus independently to accommodate gene-rich as well as gene-sparse genomic regions. The dynamic component is determined by the median gene density of the V3.1 assembly (see below) and a fixed gap bonus (200 genes for gaps > 5000bp).

The initial CGIs were derived for all models, including those filtered as transposon-derived in subsequent steps, using the following formula:

s=\begin{cases}
w_{ij}\times D & w_{ij}\times D > 10\\
10 & \text{otherwise}
\end{cases}

C_j = 
\begin{cases}
\lVert C_i + s\rVert_{10}       & g_{ij} < 5000 \\
\lVert C_i + s\rVert_{10} + 200 & g_{ij} > 5000
\end{cases}

The locus index Cj of the next CGI was calculated from the intergenic distance wij between two genes i and j, the CGI locus index Ci of gene i, the median gene density D, the total gap length gij between genes i and j, a constant minimal CGI increment of 10, a constant increment of 200 to account for large gaps, and a constant defining large gaps [>=5000bp]. The resulting locus indices Cj were rounded to the next multiple of 10, written as ‖·‖10 in the formula.
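
The calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used for V3.1: the function and constant names are hypothetical, D is taken to be the median chromosomal gene density from the statistics below (148.7 genes per Mbp) converted to genes per bp, and the large-gap test uses the strict `>` of the formula.

```python
import math

MIN_INCREMENT = 10    # constant minimal CGI increment
GAP_BONUS = 200       # constant increment for large gaps
LARGE_GAP_BP = 5000   # total gap length that counts as a large gap


def next_locus_index(c_i, w_ij, g_ij, density_per_bp=148.7e-6):
    """Return the locus index C_j that follows locus index c_i.

    w_ij: intergenic distance in bp (negative for overlapping genes)
    g_ij: total gap length in bp between the two genes
    density_per_bp: gene density D in genes per bp (assumed unit)
    """
    # Dynamic window, but never less than the minimal increment.
    s = max(w_ij * density_per_bp, MIN_INCREMENT)
    # Round up to the next multiple of 10.
    c_j = math.ceil((c_i + s) / 10) * 10
    if g_ij > LARGE_GAP_BP:
        c_j += GAP_BONUS
    return c_j


print(next_locus_index(10, 3304, 0))         # 20
print(next_locus_index(15690, 104486, 0))    # 15710
print(next_locus_index(17470, 16014, 6000))  # 17680
```

Under these assumptions the second call reproduces the step from Pp3c1_15690 to Pp3c1_15710 in the "Larger distances" example, where the two loci are 104,486 bp apart. The gap lengths passed in the calls are placeholders, not values read from the tables.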

V3.1 Gene distances, density and gap sizes

distances [bp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-26680     248    1618    4771    5867  129900 
chromosomal distances [bp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-26680     274    1716    4908    6130  129900 

Distribution plots of gene distances

gene density [genes per Mbp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 33.44  214.80  320.00  389.80  499.30 5051.00 
gene density on chromosomes [genes per Mbp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 137.1   145.5   148.7   149.3   152.6   166.8 

Distribution plots of gene densities

gap size [bp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   100     100     260    2264    2018   43280 
size of gaps larger than 5kb [bp]
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5001   10000   10000   10580   10000   43280

Distribution plots of gap sizes

CGI Examples

Normal distances
cgireferencestartstoppredictor taggene_offsetsgaps>100bpnmodelstotal_gap_length (gij)is_selectedpredicted
Pp3c1_10Chr152165627T100010TRUEsupported_by_EST_or_cDNA
Pp3c1_20Chr1893113135P2330400500TRUEsupported_by_EST_or_cDNA
Pp3c1_30Chr11678218062J3364700510FALSEsupported_by_EST_or_cDNA
Pp3c1_40Chr11679621548T1-126600510TRUEsupported_by_EST_or_cDNA
Pp3c1_50Chr12019821414J2-135000510TRUEsupported_by_EST_or_cDNA
Pp3c1_60Chr12928636549P27872102100TRUEsupported_by_EST_or_cDNA
Pp3c1_70Chr13954647674E22997001690TRUEsupported_by_EST_or_cDNA
Pp3c1_80Chr15039753156P2272300130TRUEsupported_by_EST_or_cDNA
Pp3c1_90Chr15054950710C1-260700130FALSEsupported_by_EST_or_cDNA
Pp3c1_100Chr16306266241P21235220200TRUEsupported_by_EST_or_cDNA
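
Reading the start and stop columns of the rows above, the gene_offsets value appears to be the start of a gene minus the stop of the preceding gene, so that overlapping genes get a negative offset. The following minimal sketch checks this for the first four loci; the function name `gene_offsets` is hypothetical and the coordinates are the start/stop pairs of Pp3c1_10 to Pp3c1_40.

```python
def gene_offsets(genes):
    """Return start_j - stop_i for each pair of neighbouring genes.

    genes: list of (start, stop) tuples in locus order.
    """
    return [nxt[0] - prev[1] for prev, nxt in zip(genes, genes[1:])]


# (start, stop) of Pp3c1_10, Pp3c1_20, Pp3c1_30 and Pp3c1_40 on Chr1
genes = [(5216, 5627), (8931, 13135), (16782, 18062), (16796, 21548)]
print(gene_offsets(genes))  # [3304, 3647, -1266]
```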


Larger distances
cgireferencestartstoppredictor taggene_offsetsgaps>100bpnmodelstotal_gap_length (gij)is_selectedpredicted
Pp3c1_15680Chr11169054411697036E2-147500400TRUEsupported_by_EST_or_cDNA
Pp3c1_15690Chr11169710511699028E26900400TRUEsupported_by_EST_or_cDNA
Pp3c1_15710Chr11180351411804120J3104486130230FALSEpredicted_by_ab_initio_computation
Pp3c1_15720Chr11180458711805231J246700230FALSEpredicted_by_ab_initio_computation
Pp3c1_15730Chr11180459911809753P2-63200230TRUEsupported_by_EST_or_cDNA


Large gaps
cgireferencestartstoppredictor taggene_offsetsgaps>100bpnmodelstotal_gap_length (gij)is_selectedpredicted
Pp3c1_17460Chr11323459813235039C110120040FALSEpredicted_by_ab_initio_computation
Pp3c1_17470Chr11323544713236029E14080040TRUEpredicted_by_ab_initio_computation
Pp3c1_17680Chr11325204313262701N1160142148120TRUEsupported_by_sequence_similarity
Pp3c1_17690Chr11326283213263058P01310010TRUEsupported_by_EST_or_cDNA
Pp3c1_17700Chr11327538213275856P0123242010TRUEsupported_by_EST_or_cDNA


Multi-level Annotation Pipeline behind V3.1

Overall Aim

  • Guarantee continuity to existing cosmoss annotation
    • Quality of structural annotations
      • including manual curations (more than 600 high-quality curations)
      • including more experimental evidence
    • Being able to transfer substantial existing functional annotation
      • e.g. GO and gene names
      • Ongoing projects extending annotation (e.g. literature curation, MossCyc)
  • Cosmoss Gene IDs
    • ID-Lookup: ensure continuity of Physcomitrella annotation

Structural annotation

  • Incorporate all data
    • Also in-house data which couldn't be released yet
      • RNASeq
      • Proteomics (N-termini and total protein coverage)
      • microarrays
  • Use updated transcript data:
    • train ab initio models
    • evidence for gene calling
  • Gene prediction for all scaffold partitions
  • Use combiner
  • Incorporate/Detect non-protein-coding genes early on
  • Optimally handle pseudoalleles (identical (tandem-) paralogs)

Used data - short reads

  • 84 libraries mapped with tophat-2.0.8b
    • In addition to what was used by JGI
    • 36 libraries SRA
    • 48 internal libraries (Reski or collaborators)
  • Not all these libraries are suitable for gene body prediction
    • SAGE tags, CAPSeq, sRNA

Training gene finders (EuGene/Augustus)

  • Using splice sites from short reads and EST alignments
  • Manual curation of “manually curated gene models”
    • Reducing >3000 models → 603 models
    • Only 2 models from the user_model track of the old JGI browser (most were just promoted ab initio models)
  • Augustus was excluded due to bad performance on the training set (30% sensitivity at the transcript level)
  • All EuGene sensors retrained
  • Multiple rounds of training/parameter settings, both with and without alternative splicing
  • Original V1.6 model (BMC 2013) was used as well

Evaluation of gene predictions on the training set

  • Using eval software
  • All gene predictions were reduced to a representative splice variant
  • Performance measures (sensitivity and specificity)
    • Nucleotide
    • Exon
    • Transcript
    • Gene
    • Combined score (based on Gene+Exon/sensitivity+specificity) for ranking
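
The exact formula of the combined score is not given here. The scores in the results table agree, to within rounding, with the plain sum of gene and exon sensitivity and specificity (for E6: 0.757 + 0.758 + 0.933 + 0.94 = 3.388 against a listed score of 3.3884), so the sketch below assumes an unweighted sum. The function name `combined_score` is hypothetical, and the input values are the rounded figures from the table.

```python
def combined_score(gene_sn, gene_sp, exon_sn, exon_sp):
    """Assumed ranking score: unweighted sum of the four measures."""
    return gene_sn + gene_sp + exon_sn + exon_sp


# (gene Sn, gene Sp, exon Sn, exon Sp) for three predictor tags
predictions = {
    "J2": (0.622, 0.551, 0.861, 0.887),
    "E6": (0.757, 0.758, 0.933, 0.94),
    "C1": (0.722, 0.589, 0.894, 0.898),
}

# Rank predictors from best to worst combined score.
ranked = sorted(predictions,
                key=lambda tag: combined_score(*predictions[tag]),
                reverse=True)
print(ranked)  # ['E6', 'C1', 'J2']
```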

Results

| predictor tag | source | Gene Sensitivity | Gene Specificity | Transcript Sensitivity | Transcript Specificity | Exon Sensitivity | Exon Specificity | Nucleotide Sensitivity | Nucleotide Specificity | score |
| E6 | final_Wminmax5 | 0.757 | 0.758 | 0.666 | 0.758 | 0.933 | 0.94 | 0.958 | 0.981 | 3.3884 |
|  | EVM_prefinal2_Wminmax5 | 0.757 | 0.741 | 0.666 | 0.741 | 0.934 | 0.937 | 0.958 | 0.978 | 3.369 |
|  | EVM_prefinal1_Wminmax5 | 0.759 | 0.736 | 0.667 | 0.736 | 0.933 | 0.934 | 0.958 | 0.975 | 3.3612 |
| E2 | EuGene_more_data_weights1 | 0.731 | 0.74 | 0.644 | 0.74 | 0.918 | 0.948 | 0.958 | 0.985 | 3.3363 |
| E3 | EuGene_more_data_weights1_AS | 0.734 | 0.71 | 0.647 | 0.71 | 0.918 | 0.939 | 0.959 | 0.983 | 3.3007 |
| E4 | EuGene_more_data_weights2 | 0.707 | 0.711 | 0.625 | 0.711 | 0.91 | 0.948 | 0.958 | 0.984 | 3.2755 |
| E5 | EuGene_more_data_weights2_AS | 0.707 | 0.678 | 0.625 | 0.678 | 0.91 | 0.938 | 0.959 | 0.981 | 3.2333 |
|  | EVM.first | 0.699 | 0.639 | 0.625 | 0.639 | 0.883 | 0.928 | 0.956 | 0.977 | 3.1483 |
| C1 | cosmoss_V1.6 | 0.722 | 0.589 | 0.68 | 0.605 | 0.894 | 0.898 | 0.947 | 0.953 | 3.1028 |
| J2 | JGI_gene | 0.622 | 0.551 | 0.542 | 0.551 | 0.861 | 0.887 | 0.929 | 0.956 | 2.9208 |
| E1 | EuGene_BMC_Genomics2013 | 0.582 | 0.502 | 0.518 | 0.502 | 0.87 | 0.88 | 0.945 | 0.963 | 2.8335 |
| T1 | transdecoder | 0.054 | 0.027 | 0.05 | 0.027 | 0.007 | 0.012 | 0.935 | 0.517 | 0.0997 |


Assembly of short read and EST evidence

EST and short read data cannot be used directly as evidence for gene prediction. Additionally, if coverage is optimal, full-length mature transcripts and thus gene structures can be inferred directly by read assembly. This can be achieved either de novo (e.g. Trinity) or using a genomic sequence as reference (e.g. PASA or cufflinks). Both methods have their drawbacks. De novo assembly can result in fragmentary assemblies:

Image:Chr01-8400..15100.png

The three yellow highlighted features represent three independent Trinity short read assemblies from the same experiment. From their overlap with the V3.1 gene model below we can clearly deduce that they are transcript fragments and not full-length transcripts of three distinct genes.

Genome-guided assembly also has its problems as demonstrated in the figure below:

Image:Cufflinks.png

The different colors in the track "mapping test" represent short read assemblies from distinct experimental samples generated by cufflinks, which was initially developed for vertebrate genomes. Many of these assemblies represent fusions of at least two neighboring genes. On the moss genome, PASA deals with this problem somewhat better, so we used PASA to assemble the data sets.

The cufflinks data were used to filter the libraries by manual inspection, keeping those that ensure:

  • Good coverage
  • Splice sites

De-novo assembly of short reads and spliced alignments

  • 21 filtered short read libraries assembled with Trinity (r2013_08_14)
  • Seqclean
  • Stats:
    • 1,702,106 transcripts total
    • Mean: 77,368.45 transcripts
    • Mean length: 1,219.96bp
  • Adding Sanger and 454 ESTs
  • Mapped with GenomeThreader

| source | raw | seqclean | seqclean% | GenomeThreader | GenomeThreader% |
| Sanger reads combined | 518,256 | 476,307 | 91.91% |  |  |
| 454 reads | 631,313 | 576,759 | 91.36% |  |  |
| Trinity assemblies combined | 1,702,106 | 1,702,082 | 100.00% |  |  |
| Total | 2,851,675 | 2,755,148 | 96.62% | 2,640,714 | 95.85% |


Genome guided assembly with PASA

The resulting spliced alignments were assembled using PASA:

| Transcripts or Assemblies | Count |
| Total transcript seqs | 2755148 |
| Fli cDNAs | 0 |
| partial cDNAs (ESTs) | 2755148 |
| Number transcripts with any alignment | 2438714 |
| Valid custom alignments | 2260452 |
| Total Valid alignments | 2260452 |
| Valid FL-cDNA alignments | 0 |
| Valid EST alignments | 2260452 |
| Number of assemblies | 266051 |
| Number of subclusters (genes) | 68382 |
| Number of fli-containing assemblies | 0 |
| Number of non-fli-containing assemblies | 266051 |


All Models

Predictor tags - What's E1?

| predictor tag | name | algorithm | source | predicted as |
| C1 | cosmoss_V1.6 | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E1 | EuGene_BMC_Genomics2013 | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E2 | EuGene_more_data_weights1 | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E3 | EuGene_more_data_weights1_AS | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E4 | EuGene_more_data_weights2 | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E5 | EuGene_more_data_weights2_AS | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E6 | EVM_final_Wminmax5 | EvidenceModeler | cosmoss | protein-coding_gene |
| J1 | JGI_gene_alt | JGI | JGI | protein-coding_gene |
| J2 | JGI_gene | JGI | JGI | protein-coding_gene |
| J3 | JGI_pasa_gene | PASA | JGI | protein-coding_gene |
| P1 | v3_real.gene_structures_post_PASA_updates.31614 | EVM/PASA 5th iteration | cosmoss | protein-coding_gene |
| P2 | v3_real.gene_structures_post_PASA_updates.31978 | EVM/PASA 6th iteration | cosmoss | protein-coding_gene |
| T1 | transdecoder | PASA/transdecoder | cosmoss | protein-coding_gene |
| N1 | combined ncRNA predictions | Infernal, aragorn, tRNAScan, RNAammer, mirbase, snoscan, Eugene | cosmoss | ncRNA_gene |
| P0 | putative long ncRNA | PASA without transdecoder support and unselected ab initio models | cosmoss | ncRNA_gene |


Locus Clustering

Filtering of Transposable Element-derived Gene Models

Selected Models for the V3.1 Gene Catalog

Phypa_ids

To avoid confusion and improve database cross-referencing, V3.1 no longer offers Phypa_ids. Phypa_ids were used as secondary accession numbers in previous versions such as Genome Annotation/V1.6. They were originally based on JGI protein/transcript IDs.

If you only know the Phypa_id of your favorite gene, you can map it to a V1.6 CGI (e.g. using this excel table). V1.6 models were mapped to V3.1 and thus can be looked up using the cosmoss genome browser simply by searching with the CGI.

Splice Variants

mRNAs derived from the same locus, i.e. splice variants, are children of the same gene feature (indicated by the number after the V6 in V1.6 CGIs):

gene    ID=Pp1s1_5V6;Name=Pp1s1_5V6
mRNA    ID=Pp1s1_5V6.1;Parent=Pp1s1_5V6;Alias=Phypa_422107
mRNA    ID=Pp1s1_5V6.2;Parent=Pp1s1_5V6;Alias=Phypa_422004
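
The parent/child relationship in the records above can be followed programmatically. The sketch below is a minimal illustration, not a full GFF3 parser: it assumes the simplified two-column form shown here (feature type followed by the attribute string), and the function names are hypothetical.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn 'ID=a;Parent=b' into {'ID': 'a', 'Parent': 'b'}."""
    return dict(pair.split("=", 1) for pair in field.split(";") if pair)


def group_splice_variants(lines):
    """Map each gene ID to the IDs of its mRNA children."""
    variants = defaultdict(list)
    for line in lines:
        feature_type, attributes = line.split(None, 1)
        attrs = parse_attributes(attributes.strip())
        if feature_type == "mRNA":
            variants[attrs["Parent"]].append(attrs["ID"])
    return dict(variants)


records = [
    "gene    ID=Pp1s1_5V6;Name=Pp1s1_5V6",
    "mRNA    ID=Pp1s1_5V6.1;Parent=Pp1s1_5V6;Alias=Phypa_422107",
    "mRNA    ID=Pp1s1_5V6.2;Parent=Pp1s1_5V6;Alias=Phypa_422004",
]
print(group_splice_variants(records))
# {'Pp1s1_5V6': ['Pp1s1_5V6.1', 'Pp1s1_5V6.2']}
```

The same attribute dictionary also exposes the Alias field, which is where the Phypa_id of a V1.6 model is stored in these records.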

V3.1 does not yet include splice variants. Although some of the predictors generated splice variants (e.g. P2, T1, ...), we have decided not to incorporate them into the final release, because the data are still too noisy and more analysis is needed to guarantee reliable quality. Instead, the selected V3.1 models should always provide the major, functional isoform encoding the evolutionarily conserved gene product. This was achieved by a multiple-evidence machine learning process trained to select the optimal gene model per locus. For details see the section on locus clustering.

Functional Annotation, Gene Names and Gene Families

Download

The flat file distribution of V3.1 (FASTA, GFF3, GAF2 etc.) will be made available via the Downloads section as soon as the paper describing the analysis of the V3 assembly and V3.1 annotation is accepted for publication.

Contributions

This document and the V3.1 genome annotation were created by Daniel Lang with contributions from:

Andreas Zimmer
Initial user models, SpliceMachine/Augustus/Eugene training and prediction
Shu Shenqiang and David Goodstein (JGI)
Gene models of the V3.0 [J1, J2, J3]
Nico van Gessel
non-protein coding gene annotation

None of this would be possible without the tremendous work and investments done by the JGI. The V3 assembly/scaffolding was created by Jeremy Schmutz and Jerry Jenkins. The underlying genetic map was inferred by Wellington Muchero based on the mapping populations from Andrew Cuming, Stuart McDaniel and the Reski lab with Stefan Rensing as head of the consortium and external PI at the JGI.
