Genome Annotation/V3.1
This page is under construction...
It is meant to document both the annotation process and the resulting data accessible via the cosmoss.org interfaces.
You can browse V3.1 using gbrowse: http://www.cosmoss.org/fgb2/gbrowse/V3.1/
The Physcomitrella patens V3.1 genome annotation
Improved Locus Definition - cosmoss.org gene ids (CGIs)
CGIs provide the unique address of a gene (protein-coding or non-protein-coding) on a given assembly. CGIs also function as primary IDs/accession numbers to access genes and gene products in V3.1. On the V3 assembly, CGIs can be located either on pseudo-chromosomes or on scaffolds. Locus ids are ordered from the 5' to the 3' end of the reference sequence.
CGI Syntax
Like a German car licence plate, the CGI carries a lot of information about the gene product you're looking at. It comprises the fields species, assembly version, reference sequence, locus index, type and (sequence) index.
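As a rough illustration of how these fields can be read off an ID, here is a minimal Python sketch. The regular expression is an assumption inferred only from the IDs quoted on this page (`Pp3c1_10`, `Pp1s1_5V6.1`); in particular, reading `c` as pseudo-chromosome and `s` as scaffold, and treating the optional `V6`-style suffix as one opaque field, are guesses rather than the official cosmoss.org definition.

```python
import re

# Hypothetical pattern, inferred from IDs quoted on this page only.
CGI_PATTERN = re.compile(
    r"^(?P<species>[A-Z][a-z])"       # e.g. "Pp"
    r"(?P<assembly>\d+)"              # assembly version, e.g. 3
    r"(?P<ref_type>[cs])"             # assumed: c = pseudo-chromosome, s = scaffold
    r"(?P<ref_number>\d+)"            # number of the reference sequence
    r"_(?P<locus_index>\d+)"          # locus index
    r"(?:(?P<suffix>[A-Z]\d+)"        # optional suffix such as "V6" (meaning assumed)
    r"(?:\.(?P<sequence_index>\d+))?)?$"  # optional sequence index, e.g. ".1"
)

def parse_cgi(cgi: str) -> dict:
    """Split a CGI string into its fields; raise ValueError if it does not match."""
    match = CGI_PATTERN.match(cgi)
    if match is None:
        raise ValueError(f"not a recognised CGI: {cgi!r}")
    fields = match.groupdict()
    # Convert the numeric fields to integers where they are present.
    for key in ("assembly", "ref_number", "locus_index", "sequence_index"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields

v3_example = parse_cgi("Pp3c1_10")
v16_example = parse_cgi("Pp1s1_5V6.1")
```

With this sketch, `Pp3c1_10` is read as species `Pp`, assembly 3, reference `c` 1, locus index 10, and `Pp1s1_5V6.1` additionally yields the suffix `V6` and sequence index 1.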
Inference of locus indices
With V3.1 we further improved the cosmoss gene id concept, adopting the locus id incrementation scheme employed by TAIR [1].
In order to allow both the subsequent addition of novel loci and stable accession numbers, V3.1 CGIs increment by a dynamic window (+10 minimum), i.e. allowing the addition of a minimum of 10 genes between two previously annotated loci. The added window size is determined for each subsequent locus independently to accommodate gene-rich as well as gene-sparse genomic regions. The dynamic component is determined by the median gene density of the V3.1 assembly (see below) and a fixed gap bonus (200 genes for gaps > 5000bp).
The initial CGIs were derived for all models, including those filtered as transposon-derived in subsequent steps, using the following formula:
The next locus index Cj was calculated from the intergenic distance wij between two genes i and j, the CGI locus index Ci of gene i, the median gene density D, the total gap length gij between genes i and j, a constant minimal CGI incrementor of 10, a constant incrementor of 200 to account for large gaps and a constant defining large gaps [>=5000bp]. Resulting locus indices Cj were rounded to the next multiple of 10.
V3.1 Gene distances, density and gap sizes
| measure | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| distances [bp] | -26680 | 248 | 1618 | 4771 | 5867 | 129900 |
| chromosomal distances [bp] | -26680 | 274 | 1716 | 4908 | 6130 | 129900 |
Distribution plots of gene distances
| measure | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| gene density [genes per Mbp] | 33.44 | 214.80 | 320.00 | 389.80 | 499.30 | 5051.00 |
| gene density on chromosomes [genes per Mbp] | 137.1 | 145.5 | 148.7 | 149.3 | 152.6 | 166.8 |
Distribution plots of gene densities
| measure | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| gap size [bp] | 100 | 100 | 260 | 2264 | 2018 | 43280 |
| size of gaps larger than 5kb [bp] | 5001 | 10000 | 10000 | 10580 | 10000 | 43280 |
Distribution plots of gap sizes
CGI Examples
- Normal distances
| cgi | reference | start | stop | predictor tag | gene_offset | s | gaps>100bp | nmodels | total_gap_length (gij) | is_selected | predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pp3c1_10 | Chr1 | 5216 | 5627 | T1 | 0 | 0 | 0 | 1 | 0 | TRUE | supported_by_EST_or_cDNA |
| Pp3c1_20 | Chr1 | 8931 | 13135 | P2 | 3304 | 0 | 0 | 50 | 0 | TRUE | supported_by_EST_or_cDNA |
| Pp3c1_30 | Chr1 | 16782 | 18062 | J3 | 3647 | 0 | 0 | 51 | 0 | FALSE | supported_by_EST_or_cDNA |
| Pp3c1_40 | Chr1 | 16796 | 21548 | T1 | -1266 | 0 | 0 | 51 | 0 | TRUE | supported_by_EST_or_cDNA |
| Pp3c1_50 | Chr1 | 20198 | 21414 | J2 | -1350 | 0 | 0 | 51 | 0 | TRUE | supported_by_EST_or_cDNA |
| Pp3c1_60 | Chr1 | 29286 | 36549 | P2 | 7872 | 1 | 0 | 210 | 0 | TRUE | supported_by_EST_or_cDNA |
| Pp3c1_70 | Chr1 | 39546 | 47674 | E2 | 2997 | 0 | 0 | 169 | 0 | TRUE | supported_by_EST_or_cDNA |
| Pp3c1_80 | Chr1 | 50397 | 53156 | P2 | 2723 | 0 | 0 | 13 | 0 | TRUE | supported_by_EST_or_cDNA |
| Pp3c1_90 | Chr1 | 50549 | 50710 | C1 | -2607 | 0 | 0 | 13 | 0 | FALSE | supported_by_EST_or_cDNA |
| Pp3c1_100 | Chr1 | 63062 | 66241 | P2 | 12352 | 2 | 0 | 20 | 0 | TRUE | supported_by_EST_or_cDNA |
- Larger distances
| cgi | reference | start | stop | predictor tag | gene_offset | s | gaps>100bp | nmodels | total_gap_length (gij) | is_selected | predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pp3c1_15680 | Chr1 | 11690544 | 11697036 | E2 | -1475 | 0 | 0 | 40 | 0 | TRUE | supported_by_EST_or_cDNA |
| Pp3c1_15690 | Chr1 | 11697105 | 11699028 | E2 | 69 | 0 | 0 | 40 | 0 | TRUE | supported_by_EST_or_cDNA |
| Pp3c1_15710 | Chr1 | 11803514 | 11804120 | J3 | 104486 | 13 | 0 | 23 | 0 | FALSE | predicted_by_ab_initio_computation |
| Pp3c1_15720 | Chr1 | 11804587 | 11805231 | J2 | 467 | 0 | 0 | 23 | 0 | FALSE | predicted_by_ab_initio_computation |
| Pp3c1_15730 | Chr1 | 11804599 | 11809753 | P2 | -632 | 0 | 0 | 23 | 0 | TRUE | supported_by_EST_or_cDNA |
- Large gaps
| cgi | reference | start | stop | predictor tag | gene_offset | s | gaps>100bp | nmodels | total_gap_length (gij) | is_selected | predicted |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pp3c1_17460 | Chr1 | 13234598 | 13235039 | C1 | 1012 | 0 | 0 | 4 | 0 | FALSE | predicted_by_ab_initio_computation |
| Pp3c1_17470 | Chr1 | 13235447 | 13236029 | E1 | 408 | 0 | 0 | 4 | 0 | TRUE | predicted_by_ab_initio_computation |
| Pp3c1_17680 | Chr1 | 13252043 | 13262701 | N1 | 16014 | 2 | 1 | 4 | 8120 | TRUE | supported_by_sequence_similarity |
| Pp3c1_17690 | Chr1 | 13262832 | 13263058 | P0 | 131 | 0 | 0 | 1 | 0 | TRUE | supported_by_EST_or_cDNA |
| Pp3c1_17700 | Chr1 | 13275382 | 13275856 | P0 | 12324 | 2 | 0 | 1 | 0 | TRUE | supported_by_EST_or_cDNA |
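The formula itself is not reproduced on this page, so the following Python sketch is only a reading of the example rows above, not the original implementation. It assumes that the column `s` holds the distance- and density-dependent part of the increment, that the gap bonus applies when the total gap length gij reaches 5000 bp, and that the result is rounded to the nearest multiple of 10 (which is what the listed rows match, e.g. 15690 + 10 + 13 = 15713 gives 15710). The function and constant names are hypothetical.

```python
MIN_INCREMENT = 10    # constant minimal CGI incrementor (from the text)
GAP_BONUS = 200       # constant incrementor for large gaps (from the text)
LARGE_GAP_BP = 5000   # threshold defining large gaps (">=" is an assumption)

def next_locus_index(prev_index: int, s: int, total_gap_length: int) -> int:
    """Sketch of the locus index increment, consistent with the example rows.

    prev_index       -- locus index Ci of the preceding gene
    s                -- value of the "s" column (assumed dynamic component)
    total_gap_length -- total gap length gij between the two genes in bp
    """
    bonus = GAP_BONUS if total_gap_length >= LARGE_GAP_BP else 0
    raw = prev_index + MIN_INCREMENT + s + bonus
    # Round to the nearest multiple of 10 using integer arithmetic.
    return (raw + 5) // 10 * 10

# Rows taken from the example tables: (previous index, s, gij) -> listed index
normal = next_locus_index(50, 1, 0)            # row Pp3c1_60
larger = next_locus_index(15690, 13, 0)        # row Pp3c1_15710
large_gap = next_locus_index(17470, 2, 8120)   # row Pp3c1_17680
```

Under these assumptions the sketch reproduces every locus index in the three example tables; how `s` is derived from wij and D is not covered here.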
Multi-level Annotation Pipeline behind V3.1
Overall Aim
- Guarantee continuity to existing cosmoss annotation
- Quality of structural annotations
- including manual curations (more than 600 high-quality curations)
- including more experimental evidence
- Being able to transfer substantial existing functional annotation
- e.g. GO and gene names
- Ongoing projects extending annotation (e.g. literature curation, MossCyc)
- Cosmoss Gene IDs
- ID-Lookup: ensure continuity of Physcomitrella annotation
Structural annotation
- Incorporate all data
- Also in-house data which couldn't be released yet
- RNASeq
- Proteomics (N-termini and total protein coverage)
- microarrays
- Use updated transcript data:
- train ab initio models
- evidence for gene calling
- Gene prediction for all scaffold partitions
- Use combiner
- Incorporate/Detect non-protein-coding genes early on
- Optimally handle pseudoalleles (identical (tandem-) paralogs)
Used data - short reads
- 84 libraries mapped with tophat-2.0.8b
- In addition to what was used by JGI
- 36 libraries SRA
- 48 internal libraries (Reski or collaborators)
- Not all these libraries are suitable for gene body prediction
- SAGE tags, CAPSeq, sRNA
Training gene finders (EuGene/Augustus)
- Using splice sites from short reads and EST alignments
- Manual curation of “manually curated gene models”
- Reducing >3000 models → 603 models
- Only 2 from the user_model track of the old JGI browser (most were just promoted ab initio models)
- Augustus was excluded due to bad performance on the training set (30% sensitivity at transcript level)
- All EuGene sensors retrained
- Multiple rounds of training/parameter settings, both with and without alternative splicing
- Original V1.6 model (BMC 2013) was used as well
Evaluation of gene predictions on the training set
- Using eval software
- All gene predictions were reduced to a representative splice variant
- Performance measures (sensitivity and specificity)
- Nucleotide
- Exon
- Transcript
- Gene
- Combined score (based on Gene+Exon/sensitivity+specificity) for ranking
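The combined score can be illustrated with a short Python sketch. It assumes the score is the plain sum of gene-level and exon-level sensitivity and specificity; this reading matches the scores listed in the results table to within rounding (for E6: 0.757 + 0.758 + 0.933 + 0.94 = 3.388 versus the listed 3.3884), but the exact computation used by the authors is not stated. The function name and the small selection of rows are illustrative.

```python
def combined_score(gene_sn: float, gene_sp: float,
                   exon_sn: float, exon_sp: float) -> float:
    """Assumed combined score: sum of gene and exon sensitivity and specificity."""
    return gene_sn + gene_sp + exon_sn + exon_sp

# (gene Sn, gene Sp, exon Sn, exon Sp) copied from the results table
predictions = {
    "E6": (0.757, 0.758, 0.933, 0.940),
    "E2": (0.731, 0.740, 0.918, 0.948),
    "C1": (0.722, 0.589, 0.894, 0.898),
    "J2": (0.622, 0.551, 0.861, 0.887),
    "T1": (0.054, 0.027, 0.007, 0.012),
}

# Rank predictors from best to worst by the assumed combined score.
scores = {tag: combined_score(*values) for tag, values in predictions.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for tag in ranking:
    print(f"{tag}\t{scores[tag]:.3f}")
```

Because the table values are rounded to three decimals, the recomputed sums can differ from the listed scores in the last digit.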
Results
| predictor tag | source | Gene Sensitivity | Gene Specificity | Transcript Sensitivity | Transcript Specificity | Exon Sensitivity | Exon Specificity | Nucleotide Sensitivity | Nucleotide Specificity | score |
|---|---|---|---|---|---|---|---|---|---|---|
| E6 | EVM_final_Wminmax5 | 0.757 | 0.758 | 0.666 | 0.758 | 0.933 | 0.94 | 0.958 | 0.981 | 3.3884 |
| | EVM_prefinal2_Wminmax5 | 0.757 | 0.741 | 0.666 | 0.741 | 0.934 | 0.937 | 0.958 | 0.978 | 3.369 |
| | EVM_prefinal1_Wminmax5 | 0.759 | 0.736 | 0.667 | 0.736 | 0.933 | 0.934 | 0.958 | 0.975 | 3.3612 |
| E2 | EuGene_more_data_weights1 | 0.731 | 0.74 | 0.644 | 0.74 | 0.918 | 0.948 | 0.958 | 0.985 | 3.3363 |
| E3 | EuGene_more_data_weights1_AS | 0.734 | 0.71 | 0.647 | 0.71 | 0.918 | 0.939 | 0.959 | 0.983 | 3.3007 |
| E4 | EuGene_more_data_weights2 | 0.707 | 0.711 | 0.625 | 0.711 | 0.91 | 0.948 | 0.958 | 0.984 | 3.2755 |
| E5 | EuGene_more_data_weights2_AS | 0.707 | 0.678 | 0.625 | 0.678 | 0.91 | 0.938 | 0.959 | 0.981 | 3.2333 |
| | EVM.first | 0.699 | 0.639 | 0.625 | 0.639 | 0.883 | 0.928 | 0.956 | 0.977 | 3.1483 |
| C1 | cosmoss_V1.6 | 0.722 | 0.589 | 0.68 | 0.605 | 0.894 | 0.898 | 0.947 | 0.953 | 3.1028 |
| J2 | JGI_gene | 0.622 | 0.551 | 0.542 | 0.551 | 0.861 | 0.887 | 0.929 | 0.956 | 2.9208 |
| E1 | EuGene_BMC_Genomics2013 | 0.582 | 0.502 | 0.518 | 0.502 | 0.87 | 0.88 | 0.945 | 0.963 | 2.8335 |
| T1 | transdecoder | 0.054 | 0.027 | 0.05 | 0.027 | 0.007 | 0.012 | 0.935 | 0.517 | 0.0997 |
Assembly of short read and EST evidences
EST and short read data cannot be used directly as evidence for gene prediction. Additionally, if coverage is optimal, full-length mature transcripts and thus gene structures can be inferred directly by read assembly. This can be achieved either de novo (e.g. Trinity) or using a genomic sequence as reference (e.g. PASA or cufflinks). Both methods have their drawbacks. De novo assembly can result in fragmentary assemblies:
The three yellow highlighted features represent three independent Trinity short read assemblies from the same experiment. From their overlap with the V3.1 gene model below we can clearly deduce that they are transcript fragments and not full-length transcripts of three distinct genes.
Genome-guided assembly also has its problems as demonstrated in the figure below:
The different colors in the track "mapping test" represent short read assemblies from distinct experimental samples generated by cufflinks, which was initially developed for vertebrate genomes. We can see that many assemblies represent fusions of at least two neighboring genes. On the moss genome, PASA deals with this problem a little better, so we used PASA to assemble the data sets.
The cufflinks data were used to select libraries by manual inspection, ensuring:
- Good coverage
- Splice sites
De-novo assembly of short reads and spliced alignments
- 21 filtered short read libraries assembled with Trinity (r2013_08_14)
- Seqclean
- Stats:
- 1,702,106 transcripts total
- Mean: 77,368.45 transcripts
- Mean length: 1,219.96bp
- Adding Sanger and 454 ESTs
- Mapped with GenomeThreader
| source | raw | seqclean | seqclean% | GenomeThreader | GenomeThreader% |
|---|---|---|---|---|---|
| Sanger reads combined | 518,256 | 476,307 | 91.91% | | |
| 454 reads | 631,313 | 576,759 | 91.36% | | |
| Trinity assemblies combined | 1,702,106 | 1,702,082 | 100.00% | | |
| Total | 2,851,675 | 2,755,148 | 96.62% | 2,640,714 | 95.85% |
Genome guided assembly with PASA
The resulting spliced alignments were assembled using PASA:
| Transcripts or Assemblies | Count |
|---|---|
| Total transcript seqs | 2755148 |
| Fli cDNAs | 0 |
| partial cDNAs (ESTs) | 2755148 |
| Number transcripts with any alignment | 2438714 |
| Valid custom alignments | 2260452 |
| Total Valid alignments | 2260452 |
| Valid FL-cDNA alignments | 0 |
| Valid EST alignments | 2260452 |
| Number of assemblies | 266051 |
| Number of subclusters (genes) | 68382 |
| Number of fli-containing assemblies | 0 |
| Number of non-fli-containing assemblies | 266051 |
All Models
Predictor tags - What's E1?
| predictor tag | name | algorithm | source | predicted as |
|---|---|---|---|---|
| C1 | cosmoss_V1.6 | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E1 | EuGene_BMC_Genomics2013 | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E2 | EuGene_more_data_weights1 | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E3 | EuGene_more_data_weights1_AS | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E4 | EuGene_more_data_weights2 | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E5 | EuGene_more_data_weights2_AS | SpliceMachine/EuGene | cosmoss | protein-coding_gene |
| E6 | EVM_final_Wminmax5 | EvidenceModeler | cosmoss | protein-coding_gene |
| J1 | JGI_gene_alt | JGI | JGI | protein-coding_gene |
| J2 | JGI_gene | JGI | JGI | protein-coding_gene |
| J3 | JGI_pasa_gene | PASA | JGI | protein-coding_gene |
| P1 | v3_real.gene_structures_post_PASA_updates.31614 | EVM/PASA 5th iteration | cosmoss | protein-coding_gene |
| P2 | v3_real.gene_structures_post_PASA_updates.31978 | EVM/PASA 6th iteration | cosmoss | protein-coding_gene |
| T1 | transdecoder | PASA/transdecoder | cosmoss | protein-coding_gene |
| N1 | combined ncRNA predictions | Infernal, aragorn, tRNAScan, RNAmmer, mirbase, snoscan, EuGene | cosmoss | ncRNA_gene |
| P0 | putative long ncRNA | PASA without transdecoder support and unselected ab initio models | cosmoss | ncRNA_gene |
Locus Clustering
Filtering of Transposable Element-derived Gene Models
Selected Models for the V3.1 Gene Catalog
Phypa_ids
To avoid confusion and improve database cross-referencing, V3.1 no longer offers Phypa_ids. Phypa_ids were used as secondary accession numbers in previous versions such as Genome Annotation/V1.6. They were originally based on JGI protein/transcript IDs.
If you only know the Phypa_id of your favorite gene, you can map it to a V1.6 CGI (e.g. using this excel table). V1.6 models were mapped to V3.1 and thus can be looked up using the cosmoss genome browser simply by searching with the CGI.
Splice Variants
mRNAs derived from the same locus, i.e. splice variants, are children of the same gene feature (indicated by the number after the V6 in V1.6 CGIs):
gene ID=Pp1s1_5V6;Name=Pp1s1_5V6
mRNA ID=Pp1s1_5V6.1;Parent=Pp1s1_5V6;Alias=Phypa_422107
mRNA ID=Pp1s1_5V6.2;Parent=Pp1s1_5V6;Alias=Phypa_422004
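The parent/child relationship in the records above can be explored with a small Python sketch. It works only on the feature type and attribute string shown here, not on full nine-column GFF3 lines, and the helper names are hypothetical. It groups mRNAs under their gene via `Parent` and builds a lookup from `Alias` (the Phypa_id) to the V1.6 CGI.

```python
from collections import defaultdict

# Feature type and attribute string, copied from the V1.6 example above.
RECORDS = [
    ("gene", "ID=Pp1s1_5V6;Name=Pp1s1_5V6"),
    ("mRNA", "ID=Pp1s1_5V6.1;Parent=Pp1s1_5V6;Alias=Phypa_422107"),
    ("mRNA", "ID=Pp1s1_5V6.2;Parent=Pp1s1_5V6;Alias=Phypa_422004"),
]

def parse_attributes(attributes: str) -> dict:
    """Turn 'key=value;key=value' into a dictionary."""
    return dict(pair.split("=", 1) for pair in attributes.split(";") if pair)

def variants_by_gene(records) -> dict:
    """Map each gene ID to the IDs of the mRNAs that name it as Parent."""
    children = defaultdict(list)
    for feature_type, attributes in records:
        attrs = parse_attributes(attributes)
        if feature_type == "mRNA" and "Parent" in attrs:
            children[attrs["Parent"]].append(attrs["ID"])
    return dict(children)

def alias_to_cgi(records) -> dict:
    """Map each Alias (Phypa_id) to the ID of the record carrying it."""
    lookup = {}
    for _, attributes in records:
        attrs = parse_attributes(attributes)
        if "Alias" in attrs:
            lookup[attrs["Alias"]] = attrs["ID"]
    return lookup

splice_variants = variants_by_gene(RECORDS)
phypa_lookup = alias_to_cgi(RECORDS)
```

Here `splice_variants` maps `Pp1s1_5V6` to its two mRNAs, and `phypa_lookup["Phypa_422107"]` gives `Pp1s1_5V6.1`.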
V3.1 does not yet include splice variants. Although some of the predictors generated splice variants (e.g. P2, T1, ...), we have decided not to incorporate them into the final release, because the data are still too noisy and more analysis is needed to guarantee reliable quality. Instead, the selected V3.1 models should always provide the major, functional isoform encoding the evolutionarily conserved gene product. This was achieved by a multiple-evidence machine learning process trained to select the optimal gene model per locus. For details see the section on locus clustering.
Functional Annotation, Gene Names and Gene Families
Download
The flat file distribution of V3.1 (FASTA, GFF3, GAF2 etc.) will be made available via the Downloads section as soon as the paper describing the analysis of the V3 assembly and V3.1 annotation is accepted for publication.
Contributions
This document and the V3.1 genome annotation were created by Daniel Lang with contributions from:
- Andreas Zimmer
- Initial user models, SpliceMachine/Augustus/EuGene training and prediction
- Shu Shenqiang and David Goodstein (JGI)
- Gene models of the V3.0 [J1, J2, J3]
- Nico van Gessel
- non-protein coding gene annotation
None of this would be possible without the tremendous work and investments made by the JGI. The V3 assembly/scaffolding was created by Jeremy Schmutz and Jerry Jenkins. The underlying genetic map was inferred by Wellington Muchero based on the mapping populations from Andrew Cuming, Stuart McDaniel and the Reski lab, with Stefan Rensing as head of the consortium and external PI at the JGI.




