Contaminations

From PhyscomeProjectWiki

At the first jamboree in June 2006, the JGI presented evidence that the main_genome_scaffolds might contain prokaryotic contaminations. This was based on BLAST hits and a conspicuous secondary G/C peak. Details: Contaminations/JGIs_contamination_analysis

To detect, confirm and finally eliminate these contaminations, computational analyses were carried out by Freiburg/Gent.


As a first step, BLAST-based analyses revealed that the contaminations seem to be eubacterial, probably most closely related to the genus Bacillus. However, the contaminant organism has apparently not been sequenced, as the level of sequence identity is lower than would be expected if its genome were present in the database.


Afterwards, we did some more G/C plots of the scaffolds and had a look at those scaffolds which predominantly yield hits against Archaea/Eubacteria, Eubacteria and Bacillus, respectively. While it is evident from this analysis that the secondary G/C peak represents the contamination, G/C and best BLAST hit filtering alone are not adequate to cleanly separate the populations.
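
The G/C profile described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the plots on this page: the function names, the 5% bin width and the toy sequences are assumptions.

```python
def gc_counts(seq):
    """Return (number of G/C bases, number of unambiguous bases); N and others are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc, gc + at


def gc_content(seq):
    """G/C fraction of a sequence, 0.0 if it has no unambiguous bases."""
    gc, total = gc_counts(seq)
    return gc / total if total else 0.0


def gc_histogram(scaffolds, bin_width_pct=5):
    """Count scaffolds per G/C bin (keyed by the bin's lower bound in percent).

    A contamination with a different base composition shows up as a
    second local maximum in these counts.
    """
    counts = {}
    for seq in scaffolds.values():
        gc, total = gc_counts(seq)
        if total == 0:
            continue
        pct = 100 * gc // total  # integer percent avoids float binning issues
        left = min(pct // bin_width_pct * bin_width_pct, 100 - bin_width_pct)
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Toy input: scaffold names and sequences are made up for illustration.
toy_scaffolds = {
    "scaffold_a": "ATATATGCAT",
    "scaffold_b": "GCGCGCATNN",
    "scaffold_c": "AATTAATTGC",
}
histogram = gc_histogram(toy_scaffolds)
```

As noted above, such a histogram reveals the secondary peak but does not by itself separate the two populations, because their G/C distributions overlap.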


By mapping the P.p. EST data against the scaffolds, we found that the secondary G/C peak is devoid of EST evidence. This analysis also showed that the contaminated scaffolds are shorter than the scaffolds with EST evidence. Again, these data alone do not suffice to separate the populations.
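
The length comparison between scaffolds with and without EST evidence could be summarised as follows. This is a minimal sketch: the input layout (scaffold name mapped to length and number of mapped ESTs) and all values are hypothetical.

```python
from statistics import median


def median_length_by_est_evidence(scaffolds):
    """Median scaffold length for scaffolds with and without mapped ESTs.

    `scaffolds` maps a scaffold name to (length_in_bp, number_of_mapped_ESTs).
    """
    with_est = [length for length, est_hits in scaffolds.values() if est_hits > 0]
    without_est = [length for length, est_hits in scaffolds.values() if est_hits == 0]
    return {
        "with_est": median(with_est) if with_est else None,
        "without_est": median(without_est) if without_est else None,
    }


# Made-up numbers, for illustration only.
toy = {
    "scaffold_a": (120000, 14),
    "scaffold_b": (80000, 3),
    "scaffold_c": (4000, 0),
    "scaffold_d": (6000, 0),
    "scaffold_e": (100000, 7),
}
summary = median_length_by_est_evidence(toy)
```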


We used FrameD to predict ORFs on those scaffolds that exhibit the most Bacillus hits. Afterwards we did BLAST/taxonomic analysis and graphical representation of the data. A scaffold with P.p. EST evidence was used as a cross-check. As it turned out, the contaminated scaffolds differ from the real P.p. scaffolds in that they contain operon-like ORFs without introns/intergenic regions and yield hits more or less exclusively against prokaryotic sequences.
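
One simple way to quantify "operon-like ORFs without introns/intergenic regions" is the fraction of a scaffold covered by predicted ORFs. The sketch below assumes the ORF coordinates have already been extracted as 1-based, inclusive (start, end) pairs; it does not parse FrameD output, and the function name and coordinates are hypothetical.

```python
def coding_density(orfs, scaffold_length):
    """Fraction of scaffold bases covered by at least one ORF.

    `orfs` is a list of (start, end) pairs, 1-based and inclusive.
    Overlapping ORFs are merged so that no base is counted twice.
    """
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(orfs):
        if cur_end is None or start > cur_end:
            # Close the previous merged interval and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / scaffold_length


# Densely packed, overlapping ORFs (bacterial-like) versus a sparse scaffold.
dense = coding_density([(1, 300), (290, 900), (950, 1000)], 1000)
sparse = coding_density([(100, 199)], 1000)
```

A density close to 1 is what one would expect from tightly packed prokaryotic genes, whereas intron-rich plant scaffolds should score much lower.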


Finally, we combined all available parameters in a multivariate analysis, using first a principal component analysis and then k-means clustering. Using this method, we are able to clearly define four different fractions in the main genome scaffolds. We then ran several checks, described below, to confirm the accuracy of the results.
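
The combination of principal component analysis and k-means clustering can be illustrated with a small, dependency-free sketch. It is not the pipeline used for the results on this page: the feature columns, the toy values, the power-iteration PCA and the farthest-point initialisation of k-means are all assumptions chosen to keep the example short and deterministic.

```python
import math
import random


def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def principal_components(rows, n_components=2, iterations=200, seed=0):
    """Leading eigenvectors of the covariance matrix (power iteration with deflation)."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]
    components = []
    for _ in range(n_components):
        vec = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iterations):
            nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        eigval = sum(vec[i] * cov[i][j] * vec[j] for i in range(d) for j in range(d))
        components.append(vec)
        # Deflate: remove the component just found so the next pass finds the following one.
        cov = [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(d)] for i in range(d)]
    return components


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [list(points[0])]
    while len(centers) < k:
        centers.append(list(max(points, key=lambda p: min(sq_dist(p, c) for c in centers))))
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_scaffolds(features, k, n_components=2):
    """Standardize, project onto the leading principal components, then cluster."""
    scaled = standardize(features)
    comps = principal_components(scaled, n_components)
    scores = [[sum(v * w for v, w in zip(row, comp)) for comp in comps] for row in scaled]
    return kmeans(scores, k)


# Toy rows (made up): G/C fraction, log10 length, EST hits, prokaryotic hit fraction.
toy_features = [
    [0.34, 5.2, 12, 0.02], [0.35, 5.5, 20, 0.00], [0.33, 5.0, 9, 0.05], [0.36, 5.3, 15, 0.01],
    [0.52, 3.9, 0, 0.95], [0.55, 4.1, 0, 0.90], [0.53, 3.8, 0, 0.97], [0.54, 4.0, 0, 0.92],
]
labels = cluster_scaffolds(toy_features, k=2)
```

The toy data contain only two obvious groups, so k=2 is used here; the analysis described above distinguishes four fractions. Standardizing first matters because the features have very different scales, and without it the feature with the largest numeric range would dominate the principal components.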


We tested the ORFs from the (most probably) bacterial cluster for close relationship to sequenced Bacteria. There are some sequences that are > 90% identical to sequences from the genus Bacillus. Yet the analysis seems to confirm again that the contaminant species in question has (or have) not been sequenced yet.
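
Selecting hits above the 90% identity threshold mentioned above could look like the sketch below. It assumes the BLAST results are available in the standard 12-column tabular format (query, subject, percent identity, alignment length, and so on); the minimum alignment length is an extra assumption that does not come from this page, and all IDs in the toy input are made up.

```python
def high_identity_hits(lines, min_identity=90.0, min_length=100):
    """Return (query, subject, percent identity, alignment length) for strong hits.

    `lines` is an iterable of tab-separated BLAST tabular lines.
    `min_length` is an assumed extra filter against short spurious matches.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        pident, length = float(fields[2]), int(fields[3])
        if pident > min_identity and length >= min_length:
            hits.append((fields[0], fields[1], pident, length))
    return hits


# Hypothetical tabular lines for illustration.
toy_lines = [
    "# comment line",
    "orf_1\tsubject_A\t93.5\t250\t16\t0\t1\t250\t10\t259\t1e-100\t400",
    "orf_2\tsubject_B\t71.2\t300\t86\t2\t1\t300\t5\t304\t1e-40\t160",
    "orf_3\tsubject_C\t98.0\t40\t1\t0\t1\t40\t7\t46\t1e-10\t70",
]
strong = high_identity_hits(toy_lines)
```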


A mapping of all published P. patens CDS fortunately resulted in all but one being present in cluster 1. However, there is one chimeric scaffold present in the bacterial cluster. This needs to be dealt with in the next assembly.


We wanted to confirm with wet-lab data that i) our separation is correct and ii) sequences from the bacterial contaminant are not present in genomic DNA prepared from wildtype Physcomitrella. We tested 7 scaffold regions, all of which showed the expected results. Seven scaffolds from the bacterial cluster could not be amplified in our lab. Another interesting example was presented to Ralph Quatrano's lab for checking.


Some of the examples from the various checks we've performed to measure the accuracy of our method indicate that we still have some problems separating those true Physcomitrella scaffolds that mainly consist of non-coding regions, as well as bacterial or chimeric scaffolds with little EST or ORF evidence. To be able to resolve this we need more wet-lab analyses, e.g. of those scaffolds we have analysed in detail in silico.


For a second round of wet-lab confirmations, primers for the Arginyl-tRNA synthetase testcase were designed by Daniel and analyzed by Pierre-Francois Perroud from Ralph Quatrano's lab. The results clearly support the initial hypothesis that three of the four loci are not part of the Physcomitrella genome.


For the third round of wet-lab confirmations, primers for 13 scaffolds, representing extreme samples from the 4 clusters, were designed by Daniel and analyzed by Pierre-Francois Perroud from Ralph Quatrano's lab. The wet-lab data confirmed the separation between cluster 1 and 2 and between 1 and 3a/b. The data also demonstrate that the DNA that was used to create the libraries was contaminated. Our current understanding is that cluster 2 represents this contamination. Because cluster 2 is not properly separated from 3a/b and there is in silico and wet-lab evidence for contaminations within cluster 3a and 3b, we need to further analyze these clusters (see below).


In search of the source of the contamination, we also looked for 16S rDNA as an indicator of the identity of the contaminant(s). We found none in the main_genome, but 3 different 16S rRNA genes in the prokaryotic scaffold partition. Using the phylogeny-based Bacteria identification service leBIBI, the 3 different 16S rRNAs are predicted to be Paenibacillus pabuli, Brevibacillus choshinensis and the beta-proteobacterium Delftia acidovorans. The latter is currently being sequenced at the JGI. The question whether these three species contribute to the contamination in main_genome is yet to be answered. When the searches were extended to other bacterial rRNA genes, we found that the three scaffolds mentioned above all code for rDNA loci organized as co-transcribed operons, which is typical for bacterial genomes. None of them are significantly related to the Physcomitrella mt/pt genome.


As mentioned above, there is some contamination present in clusters 3a/3b. Initial evidence suggested that these sequences may originate from mislabelled or switched plates, i.e. that organisms sequenced at the JGI at the same time as Physcomitrella pollute the data to some extent. We therefore carried out megaBLAST searches with the main_genome scaffolds against the publicly available microbial genomes sequenced by the JGI. As it turns out, there is evidence that several bacterial species (Comamonas testosteroni / Delftia acidovorans, Ignicoccus sp., Mycobacterium sp., Pelobacter propionicus, Trichodesmium erythraeum) contribute to scaffolds within cluster 3a/b. We have asked the JGI to assist with these analyses, because we do not have access to all their data.


Please contribute to the discussion section of this page!


This is our current understanding of the composition of the main genome scaffolds:

cluster 1: long Pp scaffolds with EST evidence

cluster 2: represents a eubacterial contamination present in the gDNA used for sequencing

cluster 3a: EST evidence present, scaffolds longer than 3b, contains some contaminants

cluster 3b: no EST evidence, short scaffolds, contains some contaminants

Image:All_data_k4_zoom.annotated.png

A complete list of the clusters is available in csv format.

Results in more detail
