Contaminations/What is the closest sequenced taxon

Based on the most highly conserved hits using the FrameD ORFs

--Lang 14:52, 26 October 2006 (CEST)

By selecting the most highly conserved Bacillus ORF hits on the scaffolds of cluster 2, we hope to gain insight into the taxon of the contamination:

~/scripts/genome/contamination/get_fracIdent_tax_groups.pl 0.9 kmeans_clusters/cluster2.txt orf_hits.fil.dbm BLASTP/*.parsed      
scaffold_1535   scaffold_1535_orf_7     dbj|BAD74518.1| Firmicutes      hypothetical conserved protein [Geobacillus kaustophilus HTA426] ref|YP_146086.1| hypothetical protein GK0233 [Geobacillus kaustophilus HTA426]   0.940000        Bacteria,Firmicutes,Bacillales,Bacillaceae,Geobacillus,Geobacillus kaustophilus HTA426
scaffold_2309   scaffold_2309_orf_1     dbj|BAD83614.1| Firmicutes      thioredoxin reductase [Brevibacillus choshinensis]      0.944000        Bacteria,Firmicutes,Bacillales,Paenibacillaceae,Brevibacillus,Brevibacillus choshinensis
scaffold_3819   scaffold_3819_orf_1     dbj|BAA75270.1| Firmicutes      rpsJ [Bacillus halodurans] dbj|BAB03852.1| 30S ribosomal protein S10 [Bacillus halodurans C-125] sp|Q9Z9L5|RS10_BACHD 30S ribosomal protein S10 ref|NP_240999.1| 30S ribosomal protein S10 [Bacillus halodurans C-125]    0.947000        Bacteria,Firmicutes,Bacillales,Bacillaceae,Bacillus,Bacillus halodurans
scaffold_453    scaffold_453_orf_86     dbj|BAB06232.1| Firmicutes      BH2513 [Bacillus halodurans C-125] ref|NP_243379.1| hypothetical protein BH2513 [Bacillus halodurans C-125] sp|Q9K9Y1|Y2513_BACHD Hypothetical UPF0296 protein BH2513     0.907000        Bacteria,Firmicutes,Bacillales,Bacillaceae,Bacillus,Bacillus halodurans C-125
scaffold_484    scaffold_484_orf_90     dbj|BAA00737.1| Firmicutes      lon protease [Brevibacillus brevis] sp|P36772|LON_BRECH ATP-dependent protease La pir||B42375 endopeptidase La (EC 3.4.21.53) [validated] - Bacillus brevis       0.924000        Bacteria,Firmicutes,Bacillales,Paenibacillaceae,Brevibacillus,Brevibacillus brevis
scaffold_498    scaffold_498_orf_20     gb|AAO66293.1|  Firmicutes      b-alanine synthase [Brevibacillus agri] 0.912000        Bacteria,Firmicutes,Bacillales,Paenibacillaceae,Brevibacillus,Brevibacillus agri
scaffold_498    scaffold_498_orf_22     gb|AAO66291.1|  Firmicutes      dihydropyrimidine dehydrogenase [Brevibacillus agri]    0.936000        Bacteria,Firmicutes,Bacillales,Paenibacillaceae,Brevibacillus,Brevibacillus agri
scaffold_5114   scaffold_5114_orf_2     sp|Q7M0Z3|ARGI_BREBE    Firmicutes      Arginase        0.909000        Bacteria,Firmicutes,Bacillales,Paenibacillaceae,Brevibacillus,Brevibacillus brevis
scaffold_645    scaffold_645_orf_20     gb|AAA18872.1|  Firmicutes      Spo0A [Brevibacillus brevis] sp|P52929|SP0A_BREPA Stage 0 sporulation protein A pir||S60869 phosphorylation-activated transcription factor Spo0A - Bacillus brevis (fragment)     0.962000        Bacteria,Firmicutes,Bacillales,Paenibacillaceae,Brevibacillus,Brevibacillus brevis

The columns printed above are:

  1. scaffold
  2. ORFname
  3. hit_accession
  4. tax_group
  5. description
  6. frac_identical
  7. full lineage
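
To see at a glance which genus dominates the hits, the rows can be tallied by the genus field of the lineage column. The following is only a minimal sketch, not part of the original script: it assumes the output is tab-delimited, uses four rows copied from the listing above, and takes the second-to-last lineage entry as the genus. The names parse_row and genus_counts are made up for this illustration.

```python
from collections import Counter

# Four rows copied from the output above, written with explicit tab
# separators (assumption: the script emits tab-delimited columns).
ROWS = (
    "scaffold_1535\tscaffold_1535_orf_7\tdbj|BAD74518.1|\tFirmicutes\t"
    "hypothetical conserved protein [Geobacillus kaustophilus HTA426] "
    "ref|YP_146086.1| hypothetical protein GK0233 [Geobacillus kaustophilus HTA426]\t"
    "0.940000\t"
    "Bacteria,Firmicutes,Bacillales,Bacillaceae,Geobacillus,Geobacillus kaustophilus HTA426\n"
    "scaffold_2309\tscaffold_2309_orf_1\tdbj|BAD83614.1|\tFirmicutes\t"
    "thioredoxin reductase [Brevibacillus choshinensis]\t0.944000\t"
    "Bacteria,Firmicutes,Bacillales,Paenibacillaceae,Brevibacillus,Brevibacillus choshinensis\n"
    "scaffold_498\tscaffold_498_orf_20\tgb|AAO66293.1|\tFirmicutes\t"
    "b-alanine synthase [Brevibacillus agri]\t0.912000\t"
    "Bacteria,Firmicutes,Bacillales,Paenibacillaceae,Brevibacillus,Brevibacillus agri\n"
    "scaffold_5114\tscaffold_5114_orf_2\tsp|Q7M0Z3|ARGI_BREBE\tFirmicutes\t"
    "Arginase\t0.909000\t"
    "Bacteria,Firmicutes,Bacillales,Paenibacillaceae,Brevibacillus,Brevibacillus brevis\n"
)


def parse_row(line):
    """Split one output row into its seven columns."""
    scaffold, orf, accession, tax_group, description, frac, lineage = line.split("\t")
    return {
        "scaffold": scaffold,
        "orf": orf,
        "hit_accession": accession,
        "tax_group": tax_group,
        "description": description,
        "frac_identical": float(frac),
        "lineage": lineage.split(","),
    }


def genus_counts(text):
    """Count rows per genus (second-to-last entry of the lineage)."""
    counts = Counter()
    for line in text.splitlines():
        if line.strip():
            counts[parse_row(line)["lineage"][-2]] += 1
    return counts


if __name__ == "__main__":
    for genus, n in genus_counts(ROWS).most_common():
        print(genus, n)
```

Applied to all nine rows of the listing, the same tally gives six Brevibacillus hits, two Bacillus hits and one Geobacillus hit.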

Sadly, the full genome sequence of Brevibacillus brevis is not yet available. See here for more details. The only other genomic sequences available for Paenibacillaceae are:

   Paenibacillus larvae subsp. larvae BRL-230010, unfinished sequence, whole genome shotgun sequencing project
   DNA; other; Length: 4,016,553 nt
   Replicon Type: chromosome
   Created: 2006/07/20

   Brevibacillus borstelensis plasmid pHT926, complete sequence
   DNA; circular; Length: 1,786 nt
   Replicon Type: plasmid
   Replicon Name: pHT926
   Created: 2003/07/07

Both were tested via megablast, but neither showed much sequence with more than 90% identity.

Based on the 16S ribosomal RNA gene

--Zimmer, 15 November 2006

The sequence of 16S rRNA is highly conserved among prokaryotic organisms. Thus ribosomal RNA can be used to identify the origin of the contamination in the Physcomitrella patens assembly. As already mentioned at the JGI Jamboree (2006), the most prevalent genera were (Brevi/Paeni/Geo)bacillus, Thermus, and Pseudomonas.


Therefore we used these two 16S gene sequences as a starting point:

Genbank:AY319301 Brevibacillus agri strain NCHU1002 16S ribosomal RNA gene, complete sequence

Genbank:AY530296 Paenibacillus larvae subsp. pulvifaciens strain DSM 8442 16S ribosomal RNA gene, complete sequence.


BLAST results of searching with the 16S ribosomal RNA genes (AY319301 and AY530296) against scaffold partition main_genome

There are no significant hits (only hits to an 18S and 16S conserved gene region).

BLAST results of searching with the 16S ribosomal RNA genes (AY319301 and AY530296) against scaffold partition prokaryotic

3 scaffold hits covering nearly the whole AY319301 16S rRNA gene

BLAST results for partition prokaryotic: AY319301


hit            Score (bits)  E-value
scaffold_437   2918          0
scaffold_1818  1017          0
scaffold_576   347           7e-96


3 scaffold hits (the same hits as above) covering nearly the whole AY530296 16S rRNA gene

BLAST results for partition prokaryotic: AY530296
hit            Score (bits)  E-value
scaffold_437   2372          0
scaffold_1818  1025          0
scaffold_576   339           1e-93


Analysis of the corresponding scaffolds

BLAST search results against full-length 16S gene sequences (NCBI Entrez: 16S ribosomal RNA gene NOT partial[all] NOT complete genome[title])

  • scaffold_1818 position 34201-35677 --> best hit: AY427832.1 --> Paenibacillus/Firmicutes
  • scaffold_437 position 154753-156247 --> best hit: AJ313027 --> Brevibacillus/Firmicutes
  • scaffold_576 position 35104-36597 --> best hit: AJ002803 --> Comamonas/beta-Proteobacteria

Alignment of the corresponding scaffold regions, the corresponding Physcomitrella chloroplast and mitochondrial genes, and the 16S rRNA genes from Paenibacillus, Brevibacillus and Delftia

16S gene MAFFT alignment

16S phylogeny-based identification of prokaryotes - leBIBI

Using leBIBI, a software environment for sequence-based identification of prokaryotes, yields the following organisms and 16S rRNA trees:

  • scaffold_1818 position 34201-35677 --> Paenibacillus pabuli / Firmicutes


  • scaffold_437 position 154753-156247 --> Brevibacillus choshinensis / Firmicutes


  • scaffold_576 position 35104-36597 --> Delftia acidovorans / beta-Proteobacteria

Available sequence data for the species indicated by the 16S analysis

--Lang 10:44, 16 November 2006 (CET)

genus Paenibacillus
383 proteins and 2592 nt sequences; 2 genome projects; 1 unfinished genome sequence
Paenibacillus pabuli
2 proteins and 18 nt sequences; no genome project; txid1472[Organism:exp]
genus Brevibacillus
212 proteins and 440 nt sequences; 2 genome projects; 1 plasmid
Brevibacillus choshinensis
21 proteins and 19 nt sequences; no genome project; txid54911[Organism:exp]
genus Delftia
336 proteins and 153 nt sequences;
Delftia acidovorans
280 proteins and 88 nt sequences; 1 JGI genome project; plasmid sequence; txid80866[Organism:exp]

Bacterial 16S, 23S, and 5S rRNA genes are typically organized as a co-transcribed operon

  • 16S rRNA gene --> scaffold_1818 position 34201-35677
    • 5S gene --> AY242847 Paenibacillus popilliae 5S ribosomal RNA, tRNA-Ile and tRNA-Ala genes, complete sequence --> scaffold_1818 position 35872-36217
      • 23S gene --> 109940716:308->499 Paenibacillus pabuli partial 16S rRNA gene, partial 23S --> scaffold_1818 position 36746-36937


  • 16S rRNA gene --> scaffold_437 position 154753-156247
    • 5S gene --> AY242847 Paenibacillus popilliae 5S ribosomal RNA, tRNA-Ile and tRNA-Ala genes, complete sequence --> scaffold_437 position 154473-154557
      • 23S gene --> 109940716:308->499 Paenibacillus pabuli partial 16S rRNA gene, partial 23S --> scaffold_437 position 154159-154288


  • 16S rRNA gene --> scaffold_576 position 35104-36597
    • 5S gene --> AY242847 Paenibacillus popilliae 5S ribosomal RNA, tRNA-Ile and tRNA-Ala genes, complete sequence --> scaffold_576 position 34961-34872
      • 23S gene --> 109940716:308->499 Paenibacillus pabuli partial 16S rRNA gene, partial 23S --> scaffold_576 position 34280-34315


Gene order of 16S, 5S and 23S on scaffolds 1818, 437 and 576:
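
The order can also be read directly off the hit coordinates listed above. The snippet below is a small sketch for that purpose only: it sorts the three hits per scaffold by their leftmost coordinate. The gene labels are the ones used in the list, the coordinates are copied as written (including the 5S hit on scaffold_576, whose start is larger than its end), and the helper name gene_order is made up here.

```python
# Hit coordinates per scaffold, copied from the list above as (start, end).
HITS = {
    "scaffold_1818": {"16S": (34201, 35677), "5S": (35872, 36217), "23S": (36746, 36937)},
    "scaffold_437": {"16S": (154753, 156247), "5S": (154473, 154557), "23S": (154159, 154288)},
    "scaffold_576": {"16S": (35104, 36597), "5S": (34961, 34872), "23S": (34280, 34315)},
}


def gene_order(hits):
    """Return the gene labels sorted by their leftmost scaffold coordinate."""
    ordered = sorted(hits.items(), key=lambda item: min(item[1]))
    return [gene for gene, _coords in ordered]


if __name__ == "__main__":
    for scaffold, hits in HITS.items():
        print(scaffold, " - ".join(gene_order(hits)))
```

With these coordinates the left-to-right order is 16S, 5S, 23S on scaffold_1818 and 23S, 5S, 16S on scaffold_437 and scaffold_576, that is, the same arrangement read from the opposite direction.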

MEGABLASTN screen against all available microbial DNA sequences downloadable from the JGI FTP server

--Lang 16:07, 23 November 2006 (CET)

Because the results of the 16S analysis indicated the JGI production pipeline as a possible additional source of contamination, I downloaded all available FASTA files from the following JGI resources to cover all released microbial sequences:

  1. ftp://ftp.jgi-psf.org/pub/JGI_data/Microbial
  2. ftp://ftp.jgi-psf.org/pub/JGI_data/Microbiomes
wget -r  -nc -A*.fsa,*.fasta ftp://ftp.jgi-psf.org/pub/JGI_data/Microbial/
grep -c '>' jgi.microbial.fas
149212
du -hs jgi.microbial.fas 
1.5G    jgi.microbial.fas
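
For reference, the record count obtained with grep can be reproduced in a few lines of Python. This is just a sketch with a made-up two-record example; unlike grep -c '>', it counts only lines that start with '>', which gives the same number for well-formed FASTA files. The function name count_fasta_records is hypothetical.

```python
import io

# Tiny made-up FASTA text, only to exercise the counter.
EXAMPLE_FASTA = ">seq1 demo\nACGT\n>seq2 demo\nGGCC\nTTAA\n"


def count_fasta_records(handle):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


if __name__ == "__main__":
    print(count_fasta_records(io.StringIO(EXAMPLE_FASTA)))
```

For the real file, the handle would be opened with open("jgi.microbial.fas") instead of the in-memory example.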

Then all main_genome scaffolds were searched against these 149,212 sequences using megablast (-e 0.001 -F D -D 2 -p 0.5 -a 4), and the results were filtered (HSP::length>=200; HSP::frac_identical>=0.95).
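
The filter step can be pictured as follows. This is a minimal sketch and not the script that was actually used: the Hsp record type, the helper names and the example records are invented for illustration, and only the two thresholds (length >= 200, fraction identical >= 0.95) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One high-scoring segment pair (hypothetical record layout)."""
    query: str
    hit: str
    length: int
    frac_identical: float


# Thresholds as stated in the text.
MIN_LENGTH = 200
MIN_FRAC_IDENTICAL = 0.95


def passes(hsp):
    """True if the HSP is long enough and similar enough to be kept."""
    return hsp.length >= MIN_LENGTH and hsp.frac_identical >= MIN_FRAC_IDENTICAL


def filter_hsps(hsps):
    """Keep only the HSPs that satisfy both thresholds."""
    return [hsp for hsp in hsps if passes(hsp)]


# Made-up records, chosen to hit each branch of the filter.
EXAMPLE_HSPS = [
    Hsp("query_A", "hit_1", 1453, 0.997),  # kept
    Hsp("query_B", "hit_2", 150, 0.990),   # too short
    Hsp("query_C", "hit_3", 800, 0.930),   # too dissimilar
    Hsp("query_D", "hit_4", 200, 0.95),    # kept, exactly at both thresholds
]

if __name__ == "__main__":
    for hsp in filter_hsps(EXAMPLE_HSPS):
        print(hsp.query, hsp.hit, hsp.length, hsp.frac_identical)
```

Both comparisons are inclusive, matching the ">=" in the filter expression.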

Filtered results

query_name    name    species    main genome cluster    num_hsps    length_aln('query')    length_aln('hit')    frac_aligned_query()    frac_identical('query')    start('query')    end('query')    hsp_frac_identical('query')    hsp_start('query')    hsp_end('query')
scaffold_608    4000380_Cont479    mycobacterium_sp_MCS_050921_4000380    cluster3a    1    1453    1452    0.04    0.997    1    1453    0.997247075    1    1453
scaffold_749    ctg246    pelobacter_propionicus_040701_3436113    cluster3a    3    294    735    0.02    0.939    10679    10972    0.958715596    10679    10896
scaffold_2613    4002763.nofos.C65    comamonas_testosteroni_KF-1_060717_4002763    cluster3b    2    439    439    0.01    0.957    1539    1987    0.957386364    1539    1890
scaffold_4484    ctg195    pelobacter_propionicus_040701_3436113    cluster3a    1    1196    1200    0.65    0.984    494    1689    0.984113712    494    1689
scaffold_4986    ctg352    trichodesmium_erythraeum_041026_2662189    cluster3a    1    851    851    0.02    0.999    1    851    0.998824912    1    851
scaffold_6019    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1402    1402    1    1    1    1402    1    1    1402
scaffold_6019    4000336_Cont21    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1402    1402    1    1    1    1402    1    1    1402
scaffold_7295    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1246    1246    1    1    1    1246    1    1    1246
scaffold_7295    4000336_Cont23    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1246    1246    1    1    1    1246    1    1    1246
scaffold_7524    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1228    1231    1    1    1    1228    1    1    1228
scaffold_7524    4000336_Cont23    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1228    1231    1    1    1    1228    1    1    1228
scaffold_8171    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1184    1184    1    1    1    1184    1    1    1184
scaffold_8171    4000336_Cont23    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1184    1184    1    1    1    1184    1    1    1184
scaffold_8290    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1174    1175    1    1    1    1174    1    1    1174
scaffold_8290    4000336_Cont23    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1174    1175    1    1    1    1174    1    1    1174
scaffold_8418    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1166    1165    1    0.998    1    1166    0.998284734    1    1166
scaffold_8418    4000336_Cont23    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1166    1165    1    0.998    1    1166    0.998284734    1    1166
scaffold_8750    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1144    1145    1    0.999    1    1144    0.999125874    1    1144
scaffold_8750    4000336_Cont20    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1144    1145    1    0.999    1    1144    0.999125874    1    1144
scaffold_8868    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1136    1137    1    1    1    1136    1    1    1136
scaffold_8868    4000336_Cont21    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1136    1137    1    1    1    1136    1    1    1136
scaffold_8936    ctg352    trichodesmium_erythraeum_041026_2662189    cluster3a    1    1133    1133    1    0.997    1    1133    0.997352162    1    1133
scaffold_9238    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1116    1117    1    1    1    1116    1    1    1116
scaffold_9238    4000336_Cont23    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1116    1117    1    1    1    1116    1    1    1116
scaffold_9372    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1108    1110    1    1    1    1108    1    1    1108
scaffold_9372    4000336_Cont23    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1108    1110    1    1    1    1108    1    1    1108
scaffold_10070    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    2    1073    1151    1    1    1    1073    1    1    1073
scaffold_10070    4000336_Cont22    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1073    1075    1    1    1    1073    1    1    1073
scaffold_10096    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1072    1073    1    1    1    1072    1    1    1072
scaffold_10096    4000336_Cont23    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1072    1073    1    1    1    1072    1    1    1072
scaffold_10111    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    2    973    975    0.91    1    1    1071    1    1    942
scaffold_10111    4000336_Cont23    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    2    973    975    0.91    1    1    1071    1    1    942
scaffold_10638    Contig124    ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome    cluster3a    1    1046    1048    1    1    1    1046    1    1    1046
scaffold_10638    4000336_Cont23    ignicoccus_sp_Kin4-I_050328_4000336    cluster3a    1    1046    1048    1    1    1    1046    1    1    1046

These results indicate that there are at least 4 additional species present:

main genome cluster    species
cluster3b    comamonas_testosteroni_KF-1_060717_4002763
cluster3a    ignicoccus_sp_Kin4-I_Finished2006_4000336(.finished.chromosome)
cluster3a    mycobacterium_sp_MCS_050921_4000380
cluster3a    pelobacter_propionicus_040701_3436113
cluster3a    trichodesmium_erythraeum_041026_2662189

The hit against Comamonas testosteroni might in fact be early sequence information from Delftia acidovorans, for which no sequence information is available yet.
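
Because the same organism appears under more than one JGI label (for example the two Ignicoccus releases), it can help to collapse the labels before counting species per cluster. The sketch below does this with a regular expression. It assumes, purely from the look of the labels, that each one ends in a release stamp (six digits or "FinishedYYYY") followed by a numeric project identifier and an optional dotted suffix; this naming scheme is inferred, not documented here, and the function name organisms_by_cluster is made up.

```python
import re
from collections import defaultdict

# Assumed label layout: <organism>_<release stamp>_<project id>[.suffix]
LABEL = re.compile(
    r"^(?P<organism>.+?)_(?:\d{6}|Finished\d{4})_(?P<project>\d+)(?:\..*)?$"
)

# (cluster, species label) pairs taken from the filtered results above.
PAIRS = [
    ("cluster3b", "comamonas_testosteroni_KF-1_060717_4002763"),
    ("cluster3a", "ignicoccus_sp_Kin4-I_Finished2006_4000336.finished.chromosome"),
    ("cluster3a", "ignicoccus_sp_Kin4-I_050328_4000336"),
    ("cluster3a", "mycobacterium_sp_MCS_050921_4000380"),
    ("cluster3a", "pelobacter_propionicus_040701_3436113"),
    ("cluster3a", "trichodesmium_erythraeum_041026_2662189"),
]


def organisms_by_cluster(pairs):
    """Group distinct organism names by main genome cluster."""
    grouped = defaultdict(set)
    for cluster, label in pairs:
        match = LABEL.match(label)
        # Fall back to the raw label if it does not fit the assumed layout.
        organism = match.group("organism") if match else label
        grouped[cluster].add(organism)
    return {cluster: sorted(names) for cluster, names in grouped.items()}


if __name__ == "__main__":
    for cluster, names in sorted(organisms_by_cluster(PAIRS).items()):
        print(cluster, len(names), ", ".join(names))
```

Under this assumption, cluster3a collapses to four distinct organism names and cluster3b to one.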
