Contaminations/BLASTP based taxonomic analysis

From PhyscomeProjectWiki


BLASTP vs nr

Our collaborators in Gent, Stephane Rombauts and Jeffrey Fawcett, joined our effort on this subject. Jeffrey ran a BLASTP of all filtered model proteins against genpept/nr.

Jeffrey, 19.09.2006 

We've got some results on the Bacteria-like genes, which we are sending as attached files. One ("scaffold_details") is a list of the number of genes in each scaffold with best hits (from a BlastP of all the Physcomitrella proteins against all the non-redundant proteins) against genes from Eukaryota, Bacteria, Archaea, etc. The initial BlastP was done without a threshold (default settings), so I've put the number of Bacteria-like genes with a threshold of <0.001 in brackets. As you can see, most of the big scaffolds include some Bacteria-like genes too (the majority have low scores, but there are still a few), and there are some scaffolds that consist almost entirely of Bacteria-like genes (most with good scores). Roughly 1000 of the 5000+ Bacteria-like genes seem to be sequences from LTR retrotransposons, but I don't think these are responsible for some of the scaffolds being rich in Bacteria-like genes.
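The per-scaffold tally described in this message can be sketched roughly as follows. This is a minimal illustration, not the script that was actually used: it assumes the best hit per gene has already been reduced to a (scaffold, taxonomic group, E-value) triple, and it assumes the <0.001 threshold refers to the BLAST E-value. The function name and input layout are hypothetical.

```python
from collections import defaultdict

# Assumption: the "<0.001" threshold in the message is an E-value cutoff.
E_VALUE_CUTOFF = 1e-3


def tally_best_hits(best_hits, cutoff=E_VALUE_CUTOFF):
    """Count best hits per scaffold and taxonomic group.

    best_hits: iterable of (scaffold, group, evalue), one entry per gene
    (hypothetical layout; the real attachment format is not known here).
    Returns (counts, bacteria_sig):
      counts[scaffold][group] -> number of genes, no threshold applied
      bacteria_sig[scaffold]  -> number of Bacteria-like genes below cutoff
    """
    counts = defaultdict(lambda: defaultdict(int))
    bacteria_sig = defaultdict(int)
    for scaffold, group, evalue in best_hits:
        counts[scaffold][group] += 1
        if group == "Bacteria" and evalue < cutoff:
            bacteria_sig[scaffold] += 1
    # Convert to plain dicts so that missing keys are not created on lookup.
    plain_counts = {s: dict(groups) for s, groups in counts.items()}
    return plain_counts, dict(bacteria_sig)


# Invented example data, for illustration only.
example_hits = [
    ("scaffold_1", "Eukaryota", 1e-50),
    ("scaffold_1", "Bacteria", 0.5),
    ("scaffold_1", "Bacteria", 1e-20),
    ("scaffold_2", "Bacteria", 1e-30),
]
example_counts, example_sig = tally_best_hits(example_hits)
```

In this sketch the unthresholded count corresponds to the plain number in the list, and the thresholded Bacteria count corresponds to the number given in brackets.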

I've also attached a file ("bacteria_genes.masked") with the descriptions of the Blast hits for all the Bacteria-like genes, with those of the LTR-like genes removed.

Please let us know if you can come up with anything exciting from this data.

Taxonomic analysis

Based on Jeffrey's data, Stefan had a closer look at the taxonomic distribution of the BLASTP results.

Stefan, 20.09.2006

For the scaffold details I calculated the percentage values for each taxonomic group and used those to determine scaffolds that have more non-eukaryotic than eukaryotic hits (scaffold_tax_groups.xls). No scaffolds fulfill this criterion for "other", and only some do for "virus" (might be due to retrotransposons). The rest of the virus and archaea hits coincide nicely with Bacteria. I made a list of all scaffolds that contain more Bacteria hits at the 1E-03 threshold than eukaryotic hits (Bacterial_scaffolds_1E-03.csv). Interestingly, the higher the scaffold number, the greater the chance of being considered a bacterial contaminant by this method?! IMO we should try to correlate these potentially contaminated scaffolds with some intrinsic data, e.g. scaffold length, scaffold G/C. Maybe also repeat this after removal of potential retrotransposon-derived hits?

As for the other data, I made a taxonomic analysis (bacterial_genes_masked.xls). The first sheet contains the filterable list and some taxon counts. What catches the eye most is the contribution of Bacillus spec. sequences. More than 2/3 of the hits are due to Firmicutes and Proteobacteria. I made histograms for all hits, Bacillus, Proteo- and Actinobacteria. Those scaffolds that contain >= 50 hits in total, >= 20 from Bacillus and >= 10 from Proteobacteria are listed in the sheet "strange scaffolds". As you can see, most do overlap. I am somewhat cautious with the Proteobacteria scaffolds, as those might be genes of mitochondrial origin. So the scaffolds to focus on, currently, are 320, 393, 394, 413, 436, 451, 453, 484, 498 and 501. The last two show up only for Bacillus, the remainder for "all" and Bacillus. IMO we should look closely at these scaffolds. Length and G/C data could be included, but also av. intron/exon length and no. of introns per gene.

We might also run individual BLAST searches for these scaffolds (na and aa gene models), stringently filter the results and have a look at the taxon strings and descriptions of the respective best 5-10 hits.

Jeffrey, how did you remove the potential transposon-based sequences from the Bacteria hits?
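The percentage calculation and the hit-count criteria from this message could look roughly like the sketch below. The function names and input layout are hypothetical, and the reading of the three thresholds as three separate criteria (rather than one combined filter) is an assumption based on the remark that some scaffolds "show up only for Bacillus".

```python
def group_percentages(counts):
    """Convert per-group hit counts for one scaffold into percentages."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: 100.0 * n / total for group, n in counts.items()}


def more_bacterial_than_eukaryotic(counts, bacteria_below_threshold):
    """True if thresholded Bacteria hits outnumber eukaryotic hits.

    bacteria_below_threshold: Bacteria hits passing the 1E-03 threshold.
    """
    return bacteria_below_threshold > counts.get("Eukaryota", 0)


def criteria_met(total_hits, bacillus_hits, proteobacteria_hits):
    """Return which of the three hit-count criteria a scaffold meets.

    Assumption: each threshold is tested independently.
    """
    met = set()
    if total_hits >= 50:
        met.add("all")
    if bacillus_hits >= 20:
        met.add("Bacillus")
    if proteobacteria_hits >= 10:
        met.add("Proteobacteria")
    return met


# Invented numbers, for illustration only.
example_pct = group_percentages({"Eukaryota": 3, "Bacteria": 1})
example_flag = more_bacterial_than_eukaryotic({"Eukaryota": 3, "Bacteria": 9}, 5)
example_criteria = criteria_met(60, 25, 3)
```

Keeping the criteria separate makes it easy to see which scaffolds overlap between the lists, which is the comparison the message draws on.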

Jeffrey, 21.09.2006

I also just checked the taxonomy of each hit for all the Bacteria-like genes, as you did, and compared the results for the scaffolds with over 50% Bacteria-like genes and the scaffolds with under 50% Bacteria-like genes (all low-score hits and LTR-related genes removed).

One striking difference is in the "Firmicutes" hits (especially "Bacillus"): most of the "Firmicutes" hits are from the Bacteria-over50% scaffolds, which accounts for most of the "Bacillus" hits you mentioned, whereas the Bacteria-under50% scaffolds have very few "Firmicutes" hits. The Bacteria-under50% scaffolds, on the other hand, have many more "Cyanobacteria" hits (chloroplast??).
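The over50%/under50% comparison might be sketched as follows. This is only an illustration under stated assumptions: the fraction is taken as Bacteria-like genes divided by all genes on the scaffold, a scaffold at exactly 50% is placed in the "under" group, and the names and input layout are hypothetical.

```python
from collections import Counter


def split_by_bacterial_fraction(genes_per_scaffold, bacteria_like_per_scaffold,
                                cutoff=0.5):
    """Split scaffolds into those above and those at or below the cutoff.

    genes_per_scaffold: {scaffold: total number of genes}
    bacteria_like_per_scaffold: {scaffold: number of Bacteria-like genes}
    """
    over, under = set(), set()
    for scaffold, n_genes in genes_per_scaffold.items():
        if n_genes == 0:
            continue  # no genes, no fraction to compute
        fraction = bacteria_like_per_scaffold.get(scaffold, 0) / n_genes
        if fraction > cutoff:
            over.add(scaffold)
        else:
            under.add(scaffold)
    return over, under


def phylum_counts(hits, scaffolds):
    """Count hits per phylum, restricted to the given set of scaffolds.

    hits: iterable of (scaffold, phylum) pairs for Bacteria-like genes.
    """
    return Counter(phylum for scaffold, phylum in hits if scaffold in scaffolds)


# Invented example data, for illustration only.
genes = {"scaffold_a": 10, "scaffold_b": 40}
bacteria_like = {"scaffold_a": 8, "scaffold_b": 4}
hits = [
    ("scaffold_a", "Firmicutes"),
    ("scaffold_a", "Firmicutes"),
    ("scaffold_b", "Cyanobacteria"),
]
over50, under50 = split_by_bacterial_fraction(genes, bacteria_like)
over_counts = phylum_counts(hits, over50)
under_counts = phylum_counts(hits, under50)
```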

As for the removal of the LTR-derived hits, 6 of the big 'gene families' from the gene cluster analysis I did based on BLAST hits appear to be sequences related to LTR retrotransposons, so those are the genes that I removed.
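A filter of this kind could be sketched as below, assuming each gene has already been assigned to a family by the cluster analysis. The family identifiers, gene names and function name are hypothetical; the clustering itself is not shown.

```python
def remove_ltr_genes(bacteria_like_genes, gene_to_family, ltr_families):
    """Drop genes that belong to a family judged to be LTR-related.

    bacteria_like_genes: list of gene identifiers
    gene_to_family: {gene: family id} from the cluster analysis
    ltr_families: set of family ids flagged as LTR retrotransposon-related
    Genes without a family assignment are kept.
    """
    return [
        gene for gene in bacteria_like_genes
        if gene_to_family.get(gene) not in ltr_families
    ]


# Invented example data, for illustration only.
example_genes = ["gene_1", "gene_2", "gene_3"]
example_families = {"gene_1": "fam_A", "gene_2": "fam_B"}
example_kept = remove_ltr_genes(example_genes, example_families, {"fam_A"})
```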

Image:Bacteria over50.png Image:Bacteria under50.png
