Contaminations/Detailed analysis of possible contaminant scaffolds

From PhyscomeProjectWiki


Detailed analysis of 10 scaffolds with more than 20 Bacillus hits

--Lang 16:43, 2 October 2006 (CEST)

As proposed by Stefan in Contaminations/BLASTP_based_taxonomic_analysis#Taxonomic_analyses, I've had a look at the following main_genome scaffolds:

  1. scaffold_320
  2. scaffold_393
  3. scaffold_394
  4. scaffold_413
  5. scaffold_436
  6. scaffold_451
  7. scaffold_453
  8. scaffold_484
  9. scaffold_498
  10. scaffold_501

ORF prediction using FrameD

The idea behind this analysis was that true bacterial scaffolds would yield better ORFs than real Physcomitrella scaffolds when analyzed with a bacterial ORF prediction tool like FrameD. Here, "better" is meant in terms of ORF length and the ability to find homologs among bacterial genes via BLASTP.

To collect BLASTX evidence to be used by FrameD for ORF prediction, the scaffolds were BLASTed against genpept/nr (-e 0.01 -v 50 -b 50).
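Since FrameD is given the BLASTX evidence as m8 (tabular) results per scaffold, a combined result file has to be split by query. The following is only a minimal Python sketch of that step, not the script actually used: the function names and the choice to re-apply the 0.01 E-value cutoff are assumptions, and the column names follow the standard 12-column tabular BLAST layout.

```python
from collections import defaultdict

# Standard 12 columns of tabular (m8) BLAST output.
M8_FIELDS = (
    "query", "subject", "pct_identity", "aln_length", "mismatches",
    "gap_opens", "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score",
)


def parse_evalue(text):
    """Convert an E-value string to float.

    Defensive assumption: tolerate an abbreviated exponent such as 'e-100'.
    """
    text = text.strip()
    if text.lower().startswith("e"):
        text = "1" + text
    return float(text)


def split_m8_by_query(lines, max_evalue=0.01):
    """Group m8 hit lines by query (scaffold) name, keeping hits <= max_evalue."""
    per_query = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != len(M8_FIELDS):
            raise ValueError("expected 12 tab-separated columns: %r" % line)
        hit = dict(zip(M8_FIELDS, fields))
        if parse_evalue(hit["evalue"]) <= max_evalue:
            per_query[hit["query"]].append(hit)
    return dict(per_query)
```

Each value in the returned dictionary can then be written to its own file, one per scaffold.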

ORFs were predicted using FrameD with the following parameters:

  • Short: the start, end and phase of each predicted gene is written. Partial genes (on the border of the sequence) are indicated by the use of > or <. Frameshifts are indicated by a vertical bar (|) that separates the sequence on the left and on the right of the frameshift. When a frameshift is detected in a gene, FrameD also reports as a detected ORF the sequence that begins on the same START as the gene but ignores the frameshift (the ORF stops at the first STOP in frame with the start). This can be controlled using the -R flag.
  • -B: the m8 BLASTX results per scaffold
  • the Bacillus model
  • -e : compute expectations of frameshifts and predicted states.
  • -C : corrects frameshifts. A new file with suffix .cor is generated that tries to correct frameshifts by inserting two 'N' when an insertion is detected and one 'N' when a deletion is detected. If stdin was used for input, stdout is used for output.
  • -t : translates to amino acid. A new file with suffix .tra is generated that contains the amino acid translation of all ORFS identified. The sequence must be frameshift free.
  • -g : ask for graphical PNG output. This may be followed by a base filename which will be completed by the number of the figure + .png extension. If no filename is given, the sequence filename (trimmed from its suffix) is used instead.

To get all ORFs, including those with frameshifts, the prediction is run twice: once with the original scaffold sequence and again with the frameshift-corrected version produced in the first run.

Annotation for the predicted FrameD ORFs

The predicted ORFs were then annotated via BLASTP vs genpept/nr (-v 5 -b 5 -e 0.001). In the subsequent filtering, hits had to fulfill the following criteria:


For the filtered hits, taxonomic information was collected from GenBank and used to group the hits into the following 7 taxonomic groups:

  • Eukaryota
  • Archaea
  • Eubacteria
    • Firmicutes
    • Cyanobacteria
    • Proteobacteria
    • other_bacteria
  • other (viral or synthetic, ...)
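The grouping above can be sketched as a small classifier over a GenBank-style lineage string (taxa separated by semicolons, root first). This is only an illustration under assumptions: the function name `tax_group` is hypothetical, the lineage format is assumed, and the 'unknown tax' fallback for empty lineages is borrowed from the color code further down rather than from the actual script.

```python
# Bacterial phyla that get their own group; all other bacteria are pooled.
BACTERIAL_PHYLA = ("Firmicutes", "Cyanobacteria", "Proteobacteria")


def tax_group(lineage):
    """Map a lineage like 'Bacteria; Firmicutes; Bacillales.' to one group."""
    taxa = [t.strip().rstrip(".") for t in lineage.split(";") if t.strip()]
    if not taxa:
        return "unknown tax"  # no taxonomic information available
    top = taxa[0]
    if top in ("Eukaryota", "Archaea"):
        return top
    if top in ("Bacteria", "Eubacteria"):
        for phylum in BACTERIAL_PHYLA:
            if phylum in taxa:
                return phylum
        return "other_bacteria"
    return "other"  # e.g. viral or synthetic sequences
```

Matching on the first lineage element keeps the test cheap; a lookup against NCBI taxonomy IDs would be more robust but needs the taxonomy dump.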

Megablast against all bacterial genomes

NCBI Genomes also has a repository for bacterial genomes. These were downloaded on 25 September 2006 and used for a MegaBLAST search with the 10 scaffolds (frameshift-corrected versions).

Database: Bacterial_genomes.fas 
          656 sequences; 1,290,652,656 total letters

The results were parsed using the following criteria:


The sequences of the remaining scaffolds were retrieved and represent the following genomes:


Combining all collected information graphically

I wrote a BioPerl-based script that integrates all harvested information for each analyzed scaffold into a single overview graphic, where the data (ORF and MegaBLAST results) are drawn as features along the full length of the scaffold.

Based on the taxonomic groups, the features have the following color code:

  • 'Firmicutes' => 'red',
  • 'Cyanobacteria' => 'lightgreen',
  • 'Proteobacteria'=> 'purple',
  • 'other Bacteria'=> 'orange',
  • 'Eukaryota' => 'yellow',
  • 'Archaea' => 'lightblue',
  • 'other' => 'black'
  • 'unknown tax' => 'grey'

MegaBLAST features (bacterial genomes) are drawn first in 'green'.
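The color assignment can be written as a simple lookup. The sketch below is in Python for illustration only (the original script is BioPerl-based). The function name `feature_color` is hypothetical, and normalizing 'other_bacteria' to 'other Bacteria' is an assumption made to reconcile the two spellings used on this page.

```python
# Color per taxonomic group, keyed by lower-cased group name.
TAX_COLORS = {
    "firmicutes": "red",
    "cyanobacteria": "lightgreen",
    "proteobacteria": "purple",
    "other bacteria": "orange",
    "eukaryota": "yellow",
    "archaea": "lightblue",
    "other": "black",
    "unknown tax": "grey",
}
MEGABLAST_COLOR = "green"


def feature_color(tax_group=None, is_megablast=False):
    """Return the drawing color for a feature."""
    if is_megablast:
        return MEGABLAST_COLOR
    if tax_group is None:
        return TAX_COLORS["unknown tax"]
    # Assumption: treat 'other_bacteria' and 'other Bacteria' as the same group.
    key = tax_group.replace("_", " ").strip().lower()
    return TAX_COLORS.get(key, TAX_COLORS["unknown tax"])
```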

The FrameD-ORF features have the following additional data:

orf_number partial description hit sequence (above the feature)

%identity ORFlength %coverage_ORF %coverage_hit tax_group Accession_hit (below the feature)
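The two coverage values are not defined on this page. A plausible reading, given here purely as an assumption, is the aligned span divided by the full length of the ORF or of the hit sequence, both in amino acids for BLASTP. Under that assumption a minimal Python sketch looks like this (the function name is hypothetical):

```python
def percent_coverage(start, end, total_length):
    """Percentage of a sequence covered by an alignment spanning start..end.

    Coordinates are 1-based and inclusive, as in tabular BLAST output.
    Assumed definition: aligned span / total sequence length * 100.
    """
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    lo, hi = sorted((start, end))
    return 100.0 * (hi - lo + 1) / total_length
```

For example, %coverage_ORF would be `percent_coverage(q_start, q_end, orf_length)` and %coverage_hit would be `percent_coverage(s_start, s_end, hit_length)`.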

The results for each of the 10 scaffolds:

(True) Negative control

--Lang 18:38, 6 October 2006 (CEST)

To check how a real Physcomitrella scaffold would behave in such an analysis, a scaffold clearly belonging to the 0.3 GC peak with annotated Physcomitrella genes was selected.
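Assigning a scaffold to a GC peak relies on its GC fraction. A minimal Python sketch of that calculation follows. Ignoring ambiguous bases such as N in the denominator is an assumption, and the function name is hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Sequences without any unambiguous base get 0.0 by convention here.
    return gc / total if total else 0.0
```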

GC:0.33 length: 825251bp annotated gene:
LOCUS       DQ438971                8416 bp    DNA     linear   PLN 02-APR-2006
DEFINITION  Physcomitrella patens actin-related protein complex 4 (ARPC4) gene,
           complete cds.
VERSION     DQ438971.1  GI:90655047

This gene was chosen because it turned out to be one of our reference genes, actually with 2 genes on it (and a confirmed intergenic region). The corresponding view in the genome browser can be found here.

The MegaBLAST search against the bacterial genomes yielded no hits.

In the ORF prediction, 4 sequences still had frameshifts even after the second iteration. 307 predicted peptides were then annotated as described for the contamination candidates.

Overview graphic of scaffold_204.


--Lang 18:41, 6 October 2006 (CEST)

When comparing the overview graphics of the candidate contamination scaffolds with the negative control, it's obvious that a true Physcomitrella scaffold has fewer predicted ORFs using the FrameD Bacillus model. The few ORFs with significant hits are all from the Eukaryota group. It seems like this is a means to separate the contaminations from the "real ones"...

The majority of hits to genpept are 30-60% identical, about the same as when you compare Phypa to Arath. But there are a few 90% hits - maybe these point us to the "bacterium" the scaffolds are from?

Because the signal is too low, the MegaBLAST searches against the bacterial genomes are not well suited as a discriminator. For a good nt match, the genome of an organism very close to our contamination has to be available (which is sadly not the case).
