Contaminations/Combining all evidences via multivariate clustering

From PhyscomeProjectWiki

Jump to: navigation, search

--Lang 18:37, 25 October 2006 (CEST)

As indicated by the results from overview graphics approach, ORF predictions using a bacterial ORF prediction tool can be used as a means to separate the populations. But its obvious that a manual inspection of overview graphics for all scaffolds is not feasible.

Hence, the idea, wether a scaffold is a contaminant or not, which is immanent in the manual inspection has to be transformed to numbers, that can be analyzed and clustered via data mining tools.

Contents

Combining all collected data

All collected data used in this analysis was integrated into the following database:

Image:contamination_schema.png

The table scaffold_info contains all the information used in the analysis the other 3 tables are merely used for data import.

Running the detailed ORF-based analysis for all main_genome scaffolds

After we've checked that we could use FrameD even without a BLASTx lead (on scaffolds 320 vs 204; data not shown), we ran predictions for all 2,536 main_genome scaffolds. The peptides were annotated and filtered as described in Contaminations/Detailed_analysis_of_possible_contaminant_scaffolds#Annotation_for_the_predicted_FrameD_ORFs.

To avoid introduction of noise due to preference of longer scaffolds, a possible discriminator between real scaffolds and contaminations could be the following ratio:

\frac{total\_FrameD\_Bacillus\_model\_ORF\_length\_aa \cdot 3}{total\_scaffold\_length\_na}

scaffold_320:
124040 aa *3 / 475038 bp ~ 0.78
scaffold_204:
26417 * 3 / 825251 bp ~ 0.096

From the filtered results the following data was extracted and ratios were calculated:

scaffold_accessionlengthglengthnorfsorf_length_aaorf_length_bpgorf_length_aagorf_length_bpratiogratio
scaffold_89301133113332798372798370.7390.739
scaffold_3070308018433988296457517250.9620.936
scaffold_10641088844295373111937311190.1030.253
scaffold_4688358431754000000.0000.000
scaffold_107011044104421644921644920.4710.471
scaffold_7711254382471125517653111733510.6100.406
scaffold_90541127112712918732918730.7750.775
scaffold_327035953298711107133213992970.9240.099
scaffold_6053553326423697209169720910.0590.791
...

To correct for the possible impact of PolyN stretches or gap regions, I've also collected gap-less lengths for scaffolds and ORFs (e.g. glength = total scaffold length without internal polyN stretches).

GC content

Here I've used the data which was produced for the prior analysis of the Contaminations/GC_and_length_distributions. The GC was calculated with the EMBOSS tool geecee. (I've double checked that the GC content as calculated by geecee is independent of polyN.)

EST evidence

As indicated by Andreas' results on the EST evidence, the number of ESTs which could be reliably aligned in the EST_mapping, might also be an putative separator. Again, this data was scaled using the total gap-less scaffold_length and included as the variable nests_ratio.

Principal component analysis

The data collected so far was used to perform a principal component analysis (PCA) using R:

  • "ratio"
  • "gratio"
  • "gc"
  • "nests_ratio"
  • "norfs_ratio"

Image:Contamination PCA zoom.png

Looking at the biplot, its clear that the separation is improved but not yet complete. The only vectors which seem to separate the two populations are the two ratios and the GC content.

Adding more variables to improve the segregation of the different scaffold populations

By now scaffold lengths were always used to normalize the variables, either in their gap-free form or including the clone_gap spanning poly-N stretches.

Up to now, two hypothesis were not yet represented by the variables:

  1. True Physcomitrella scaffolds should be longer due to the theoretical 8x coverage, i.e. the signal "length" including the polyN gaps representing non-sequenced part of clones of known length has a meaning. The length also has an impact on the separation via the variable "GC", short bacterial scaffolds without much gene content will cluster with "non-coding" Physco scaffolds.
  2. ORFs predicted with the Bacillus model of FrameD of a true bacterial contaminant should yield not only more, but better hits in genpept (Bacteria) than those from Physco scaffolds. (That should be partial to the wrong model and intronic regions - e.g. due to their GC or norfs_ratio... shorter true Physco scaffolds with big fraction of coding regions will be falsely clustered with bacterial)

Hence, the following variables were included:

  • scaffold length including gaps
  • avg_fraction_identity of the BLASTP hits of the Bacillus model ORFs vs genpept
  • avg_hit_coverage of the BLASTP hits of the Bacillus model ORFs vs genpept
  • avg_query_coverage of the BLASTP hits of the Bacillus model ORFs vs genpept

Furthermore, to include more of the information which is directly available in the overview graphics, i.e. the color coded tax_group features, the following variables were added:

number of hits from tax_group X in a BLASTP with the FrameD ORFs against genpept/nr
  1. eukaryota
  2. archaea
  3. firmicutes
  4. cyanobacteria
  5. proteobacteria
  6. other_bacteria
  7. other

In addition, to include the data from the initial BLASTP analysis carried out by Jeffrey and Stefan, I've added the factorial variable (0|1) mainly_bacterial_hits , which indicates whether a scaffold has more bacterial than eukaryotic hits (1E-3).

Image:All data.pca.annotated.png

We can see that the data is now nicely separated into 3 primary clusters. Now, nearly all variables contribute to the variability of the dataset and thus to the segregation into clusters. 4 boxed extreme examples are highligthed to give an impression what the three partition represent.

To decide which cluster_sizes to use for kmeans clustering within-cluster sums of squares for each cluster (wss) were calculated for cluster_sizes 2-7:

Image:All data wss select clusters.png

Between a cluster_size of 2 to 3 there is a significant drop-off, as well as between cluster_sizes 3 and 4.

Hence, the kmeans based coloring of the first two principal components of the scaled data was performed with cluster_sizes 3 and 4. Because the visual result of the PCA suggests rather 3 than 4 clusters, let's focus on the results with cluster_size=3:

Image:All data k3.png

In combination with the eigenvectors from PCA biplot (see above), we now can nicely interpret the results:

length and eukaryota seem to contribute strongly to the segregation of cluster 1 from the remainder, whose members have the following value distribution among the variables:

cluster 1
 summary(subset(scaffolds, row.names(scaffolds) %in% names(subset(k3$cluster, k3$cluster == 1))))
     length            ratio             gratio              gc           avg_ident      avg_q_coverage  
 Min.   :   1142   Min.   :0.01600   Min.   :0.04300   Min.   :0.2700   Min.   :0.3580   Min.   :0.8000  
 1st Qu.: 426388   1st Qu.:0.08000   1st Qu.:0.07825   1st Qu.:0.3400   1st Qu.:0.5243   1st Qu.:0.8900  
 Median : 801113   Median :0.09100   Median :0.08800   Median :0.3400   Median :0.6100   Median :0.9200  
 Mean   :1022472   Mean   :0.09621   Mean   :0.09570   Mean   :0.3453   Mean   :0.6191   Mean   :0.9183  
 3rd Qu.:1380287   3rd Qu.:0.10100   3rd Qu.:0.09800   3rd Qu.:0.3500   3rd Qu.:0.7047   3rd Qu.:0.9500  
 Max.   :5387533   Max.   :0.81500   Max.   :0.70200   Max.   :0.5700   Max.   :1.0000   Max.   :1.0000  
 avg_h_coverage    nests_ratio         norfs_ratio        nnorf_hits_fil_ratio   eukaryota         archaea 
 Min.   :0.2800   Min.   :0.0000000   Min.   :0.0001500   Min.   :0.000e+00    Min.   :0.0000   Min.   :0  
 1st Qu.:0.8600   1st Qu.:0.0002600   1st Qu.:0.0002800   1st Qu.:0.000e+00    1st Qu.:1.0000   1st Qu.:0  
 Median :0.9163   Median :0.0004100   Median :0.0003200   Median :0.000e+00    Median :1.0000   Median :0  
 Mean   :0.8932   Mean   :0.0004567   Mean   :0.0003394   Mean   :1.283e-05    Mean   :0.9609   Mean   :0  
 3rd Qu.:0.9600   3rd Qu.:0.0005700   3rd Qu.:0.0003700   3rd Qu.:1.000e-05    3rd Qu.:1.0000   3rd Qu.:0  
 Max.   :1.0000   Max.   :0.0042900   Max.   :0.0013400   Max.   :8.800e-04    Max.   :1.0000   Max.   :0  
   firmicutes       cyanobacteria      proteobacteria   
 Min.   :0.000000   Min.   :0.000000   Min.   :0.00000  
 1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.00000  
 Median :0.000000   Median :0.000000   Median :0.00000  
 Mean   :0.003688   Mean   :0.003519   Mean   :0.01622  
 3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.00000  
 Max.   :0.333330   Max.   :0.500000   Max.   :1.00000

Whereas avg_ident and the two coverage-variables (and cyanobacteria but not so strong) seem to be on the same level as the ORF hits vs genpept of the second cluster:

cluster 2
summary(subset(scaffolds, row.names(scaffolds) %in% names(subset(k3$cluster, k3$cluster == 2))))
     length           ratio            gratio             gc           avg_ident      avg_q_coverage 
 Min.   :  1003   Min.   :0.0220   Min.   :0.1140   Min.   :0.2600   Min.   :0.3500   Min.   :0.800  
 1st Qu.:  1331   1st Qu.:0.7478   1st Qu.:0.7530   1st Qu.:0.4900   1st Qu.:0.4730   1st Qu.:0.900  
 Median :  3488   Median :0.8590   Median :0.8405   Median :0.5100   Median :0.5311   Median :0.940  
 Mean   : 16350   Mean   :0.7997   Mean   :0.8084   Mean   :0.5031   Mean   :0.5431   Mean   :0.928  
 3rd Qu.:  9478   3rd Qu.:0.9380   3rd Qu.:0.9050   3rd Qu.:0.5200   3rd Qu.:0.5999   3rd Qu.:0.967  
 Max.   :475038   Max.   :1.2480   Max.   :1.0070   Max.   :0.6100   Max.   :0.9090   Max.   :1.000  
 avg_h_coverage    nests_ratio         norfs_ratio        nnorf_hits_fil_ratio   eukaryota        
 Min.   :0.0200   Min.   :0.000e+00   Min.   :0.0004300   Min.   :0.0001300    Min.   :0.0000000  
 1st Qu.:0.9200   1st Qu.:0.000e+00   1st Qu.:0.0009675   1st Qu.:0.0004675    1st Qu.:0.0000000  
 Median :0.9586   Median :0.000e+00   Median :0.0012500   Median :0.0006400    Median :0.0000000  
 Mean   :0.8974   Mean   :4.412e-07   Mean   :0.0013791   Mean   :0.0007148    Mean   :0.0001127  
 3rd Qu.:0.9827   3rd Qu.:0.000e+00   3rd Qu.:0.0017025   3rd Qu.:0.0008400    3rd Qu.:0.0000000  
 Max.   :1.0000   Max.   :1.800e-04   Max.   :0.0037300   Max.   :0.0184600    Max.   :0.0303000  
    archaea          firmicutes     cyanobacteria     proteobacteria   
 Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:0.00000   1st Qu.:0.8000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :0.00000   Median :1.0000   Median :0.00000   Median :0.00000  
 Mean   :0.01449   Mean   :0.8133   Mean   :0.00518   Mean   :0.09827  
 3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000

But this cluster could be easily separated using the variables firmicutes and norf_hits_fil_ratio, as well as proteobacteria and archaea but not as strongly as the other two.

The cluster shares eigenvectors of the norfs_ratio and the two total_ORF_length ratios with the lower part of cluster 3 (which can be separated when using a cluster_size of 4, as shown below). The both parts of the cluster are composed as follows:

cluster 3a (upper region)
summary(subset(scaffolds, row.names(scaffolds) %in% names(subset(k3$cluster, k4$cluster == 3))))
     length            ratio             gratio             gc          avg_ident         avg_q_coverage      avg_h_coverage     
 Min.   :   1000   Min.   :0.00000   Min.   :0.0000   Min.   :0.100   Min.   :0.0000000   Min.   :0.0000000   Min.   :0.0000000  
 1st Qu.:   1860   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.330   1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.0000000  
 Median :   9627   Median :0.04600   Median :0.0690   Median :0.400   Median :0.0000000   Median :0.0000000   Median :0.0000000  
 Mean   :  44160   Mean   :0.08925   Mean   :0.1014   Mean   :0.404   Mean   :0.0003246   Mean   :0.0006721   Mean   :0.0004672  
 3rd Qu.:  18808   3rd Qu.:0.11000   3rd Qu.:0.1510   3rd Qu.:0.460   3rd Qu.:0.0000000   3rd Qu.:0.0000000   3rd Qu.:0.0000000  
 Max.   :1988570   Max.   :0.93400   Max.   :0.6870   Max.   :0.720   Max.   :0.3960000   Max.   :0.8200000   Max.   :0.5700000  
  nests_ratio         norfs_ratio        nnorf_hits_fil_ratio   eukaryota    archaea    firmicutes cyanobacteria proteobacteria
 Min.   :0.0000000   Min.   :0.0000000   Min.   :0.000e+00    Min.   :0   Min.   :0   Min.   :0    Min.   :0     Min.   :0     
 1st Qu.:0.0000000   1st Qu.:0.0000000   1st Qu.:0.000e+00    1st Qu.:0   1st Qu.:0   1st Qu.:0    1st Qu.:0     1st Qu.:0     
 Median :0.0000000   Median :0.0003400   Median :0.000e+00    Median :0   Median :0   Median :0    Median :0     Median :0     
 Mean   :0.0002911   Mean   :0.0003956   Mean   :8.197e-09    Mean   :0   Mean   :0   Mean   :0    Mean   :0     Mean   :0     
 3rd Qu.:0.0000000   3rd Qu.:0.0006300   3rd Qu.:0.000e+00    3rd Qu.:0   3rd Qu.:0   3rd Qu.:0    3rd Qu.:0     3rd Qu.:0     
 Max.   :0.0368800   Max.   :0.0016800   Max.   :1.000e-05    Max.   :0   Max.   :0   Max.   :0    Max.   :0     Max.   :0
cluster 3b (lower region)
summary(subset(scaffolds, row.names(scaffolds) %in% names(subset(k3$cluster, k4$cluster == 1))))
     length          ratio            gratio             gc           avg_ident         avg_q_coverage     avg_h_coverage     
 Min.   : 1000   Min.   :0.0210   Min.   :0.0850   Min.   :0.2200   Min.   :0.0000000   Min.   :0.000000   Min.   :0.0000000  
 1st Qu.: 1122   1st Qu.:0.6220   1st Qu.:0.5670   1st Qu.:0.4700   1st Qu.:0.0000000   1st Qu.:0.000000   1st Qu.:0.0000000  
 Median : 1333   Median :0.8590   Median :0.8090   Median :0.5100   Median :0.0000000   Median :0.000000   Median :0.0000000  
 Mean   : 5994   Mean   :0.7762   Mean   :0.7424   Mean   :0.5019   Mean   :0.0008485   Mean   :0.001879   Mean   :0.0003232  
 3rd Qu.: 8802   3rd Qu.:0.9675   3rd Qu.:0.9285   3rd Qu.:0.5400   3rd Qu.:0.0000000   3rd Qu.:0.000000   3rd Qu.:0.0000000  
 Max.   :53220   Max.   :1.9200   Max.   :1.0500   Max.   :0.6700   Max.   :0.4200000   Max.   :0.930000   Max.   :0.1600000  
  nests_ratio         norfs_ratio       nnorf_hits_fil_ratio   eukaryota    archaea    firmicutes cyanobacteria proteobacteria
 Min.   :0.000e+00   Min.   :0.000370   Min.   :0.000e+00    Min.   :0   Min.   :0   Min.   :0    Min.   :0     Min.   :0     
 1st Qu.:0.000e+00   1st Qu.:0.000970   1st Qu.:0.000e+00    1st Qu.:0   1st Qu.:0   1st Qu.:0    1st Qu.:0     1st Qu.:0     
 Median :0.000e+00   Median :0.001430   Median :0.000e+00    Median :0   Median :0   Median :0    Median :0     Median :0     
 Mean   :2.162e-06   Mean   :0.001425   Mean   :8.081e-07    Mean   :0   Mean   :0   Mean   :0    Mean   :0     Mean   :0     
 3rd Qu.:0.000e+00   3rd Qu.:0.001795   3rd Qu.:0.000e+00    3rd Qu.:0   3rd Qu.:0   3rd Qu.:0    3rd Qu.:0     3rd Qu.:0     
 Max.   :7.900e-04   Max.   :0.003840   Max.   :4.000e-04    Max.   :0   Max.   :0   Max.   :0    Max.   :0     Max.   :0

Image:All data k3 zoom.annotated.png Image:All data k4 zoom.annotated.png

The two sub clusters 3a and 3b can are subdived because:

  1. norfs,ratio,gratio --> 3b has more Bacillus model FrameD ORFs
  2. length --> on average, 3a scaffolds are significantly longer than those of 3b
  3. nests --> 3a has more EST alignments
  4. gc --> regarding GC content, 3b groups closer with cluster 2, while 3a is more close to cluster 1

But both's FrameD Bacillus ORFs do not yield significant hits among any of the taxonomic groups.

Displaying the FrameD ORF annotation for all clusters

In order to be able to manually inspect all scaffolds like the candiate samples described earlier, the visualziation tool was ran for all scaffolds.

The resulting images for all scaffolds from all clusters described above, can be accessed as follows:

A closer look at some (extreme) samples of the 4 clusters

As indicated with the small black arrows in figure Image:All_data_k4_zoom.annotated.png, we will have a look at 11 scaffolds which are localized on the borders of the 4 clusters.

Cluster 2: scaffold_11072

First, we will look at the most obvious outlier - scaffold_11072 which is localized on the far left of cluster 2. For close-up shown in figure Image:All_data_k4_zoom.annotated.png it had to cut off.

SELECT * FROM scaffold_info WHERE scaffold_accession = 'scaffold_11072';
scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_11072 1029 1029 2 19 0 1 727 1 1 291 873 291 873 0.848 0.848 0.48 0.483 0.9 1 0 0 0 0 19 0 0 t

(1 row)

As the plot of the FrameD-ORF hits vs genpept reveals, only one of the two predicted ORFs yields a significant hit and that is :

>gb|ABA71904.1| Glycosyl hydrolase, BNR repeat [Pseudomonas fluorescens PfO-1]
ref|YP_345893.1| Glycosyl hydrolase, BNR repeat [Pseudomonas fluorescens PfO-1]
         Length = 2156

Its a proteobacterial species so we could speculate that the scaffold is part of the mitochondrial genome - But both the scaffold itself and the ORF do not yield any hits against the Physcomitrella mitochondrial genome.

As indicated by the description the ORF represents a repetitive sequence that is found on the Pseudomonas fluorescens genome 19 times(!) consecutively. The next hits from the BLASTP against genpept are the same repetitive sequence e.g. from (55 hits in total):

  • ABB29167 hypothetical protein Cag_1919 [Chlorobium chlorochromatii CaD3]. 1526
  • EAT71323 glycosyl hydrolase, BNR repeat [Verminephrobacter eiseniae EF01-2]. 918
  • EAT75490 glycosyl hydrolase, BNR repeat [Verminephrobacter eiseniae EF01-2]. 2031
  • EAT75492 glycosyl hydrolase, BNR repeat [Verminephrobacter eiseniae EF01-2]. 1969
  • ZP_00881195 Parallel beta-helix repeat [Shewanella sp. MR-4]. 2302
  • ZP_00881196 RTX toxin, putative [Shewanella sp. MR-4]. 446
  • NP_440434 Fat protein [Synechocystis sp. PCC 6803].
Conclusion
The scaffold is a true bacterial contamination. As indicated by the taxonomic occurence of the hits it could also be part of the yet unsequenced Firmicutes bacterium.

Cluster 1: scaffold_1

SELECT * FROM scaffold_info WHERE scaffold_accession = 'scaffold_1';
scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_1 5387533 5108762 1413 19 2744 479 974483 2096 4.37578288100209 138172 414516 129184 387552 0.077 0.076 0.35 0.582789473684211 0.906315789473684 0.938421052631579 19 0 0 0 0 0 0 f

(1 row)

This is an example how a true Physcomitrella scaffold behaves in the analysis, .i.e. a true negative located at the upper peak of cluster 1. As shown in the plot, only Eukarytic ORFs were found with an average hit coverage of ~ 94%. Only 19 of the 1,413 ORFs predicted by FrameD did survive the filtering.

Cluster 1: scaffold_1134

SELECT * FROM scaffold_info WHERE scaffold_accession = 'scaffold_1134';
scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_1134 23368 9381 10 1 0 0 0 0 0 1100 3300 1100 3300 0.141 0.352 0.44 0.45 0.86 0.87 1 0 0 0 0 0 0 f

(1 row)

As shown in the plot, this scaffold only has one ORF with a significant hit and its Eukaryotic (Medicargo):

>gb|ABE83579.1| pol polyprotein-related [Medicago truncatula]
gb|ABE83466.1| pol polyprotein-related [Medicago truncatula]
         Length = 246
Score =  182 bits (461), Expect = 1e-44
Identities = 95/211 (45%), Positives = 129/211 (61%)
Conclusion
a true Physcomitrella with putative a transposable element - Would be interesting whether all scaffold located at the lower flank of cluster 1 look like that.

Cluster 1: scaffold_1657

SELECT * FROM scaffold_info WHERE scaffold_accession = 'scaffold_1657';
scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_1657 12177 7552 7 1 0 0 0 0 0 3309 9927 1766 5298 0.815 0.702 0.44 0.945 1 0.54 1 0 0 0 0 0 0 f

(1 row)

Cluster 1: scaffold_8782

SELECT * FROM scaffold_info WHERE scaffold_accession = 'scaffold_8782';
scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_8782 1142 1142 1 1 0 2 948 2 1 156 468 156 468 0.41 0.41 0.57 0.464 0.86 0.97 1 0 0 0 0 0 0 f

(1 row)

Cluster 2: scaffold_4354

scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_4354 35270 1894 1 1 0 1 35199 4 4 263 789 263 789 0.022 0.417 0.51 0.74 0.96 0.98 0 0 0 0 1 0 0 t

(1 row)

  • 1 trailing ORF hit: >gb|AAY91924.1| citrate lyase, beta subunit [Pseudomonas fluorescens Pf-5]
ref|YP_259758.1| citrate lyase, beta subunit [Pseudomonas fluorescens Pf-5]
         Length = 270
Score =  382 bits (980), Expect = e-104
Identities = 191/258 (74%), Positives = 214/258 (82%)
  • The other hits range, like the ones of ORF on scaffold_11072, a wide variety of Bacterial species
  • No hits against the plastome and mitochondrial genome of Physcomitrella (also not with the whole scaffold sequences)
  • the FM gene model is basically the same as the ORF. Despite a two (nonsense) exons before the large internal gap. The hits collected with the gene model are the same species as the ORF
  • A search with the whole scaffold reveals that the beginning of the scaffolds is not an exon of the same gene, but a fragment of different bacterial gene! The best hit is also from Pseudomonas fluorescens. BLASTX result and overview graphic
Conclusion
Most likely a contamination. Interesting candidate to clarify via PCR analysis! Maybe something closer to Pseudomonas fluorescens which also isn't yet characterized. Maybe more of it can be found in JGI's prokaryotic partition of the assembly? Answer: No, not so mutch to indicate the occurence of fragments of an additional bacterial genome close to Pseudomonas - when the prokaryotic partition of the JGI 8X assembly was screened by BLASTX vs genpept/nr.

Cluster 2: scaffold_5427

scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_5427 1519 1518 1 1 0 1 1129 2 2 181 543 181 543 0.357 0.358 0.35 0.511 0.98 0.98 0 0 1 0 0 0 0 t

(1 row)

  • 1 ORF covering 98% of a Firmicutes sequence

Cluster 2: scaffold_6805

scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_6805 1293 1293 3 2 0 1 712 1 1 434 1302 434 1302 1.007 1.007 0.61 0.6455 0.9 0.98 0 0 2 0 0 0 0 f

(1 row)

  • leading ORFfragment covering 82% of a Firmicutes sequence - the beginning of the ORF is simply missing

Cluster 2: scaffold_5765

scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_5765 8815 1444 4 1 0 1 8548 2 2 2881 8643 422 1266 0.98 0.877 0.51 0.497 0.8 0.07 0 0 1 0 0 0 0 t

(1 row)

  • In this case FrameD incomprehensibly included a large part of the internal gap in the ORF peptide sequence! That's why the qcoverage is so low! Otherwise its a perfect match to a Firmicutes protein.
  • This is a case where maybe one bacterial clone was sequenced from two ends and the unsequenced part is not as big as the gap, because the alignment with the 219aa long Carboxydothermus hydrogenoformans protein spans the whole gap.

Cluster 2 or 3b: scaffold_3591

scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_3591 9008 2531 4 1 0 1 8589 2 2 2970 8910 810 2430 0.989 0.96 0.51 0.42 0.93 0.16 0 0 0 0 0 1 0 t

(1 row)

  • also a short sequence with a big internal gap and a gene model spanning it
  • the filtered ORF (which also includes the gap) hits a Flavobacteria sequence
  • a BLASTX with the whole scaffold reveals that there is a fragmentary bacterial fuosidase at the beginning and a galactosidase directly after the polyN-gap (like scaffold_4354). BLASTX result and overview graphic

Cluster 3b: scaffold_4415

scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_4415 9001 1867 4 0 0 1 8898 4 4 5388 16164 630 1890 1.796 1.012 0.51 0 0 0 0 0 0 0 0 0 0 t

(1 row)

  • No filtered ORF - BUT a look at the original BLASTP and BLASTX with the scaffold reveals that in this case filtering was maybe too stringent and this also might be a contaminant! BLASTX Result and overview graphic

To further investigate which fraction of cluster 3b might fall into the same category we preformed an addional PCA and kmeans clustering using only scaffolds from cluster 3b:

Image:All data cluster 3b PCA.png Image:All data cluster 3b k3 annotated.png

Cluster 3a: scaffold_449

scaffold_accession length glength norfs norf_hits_fil nests ngene_models gm_length nexons avg_nexons orf_length_aa orf_length_bp gorf_length_aa gorf_length_bp ratio gratio gc avg_ident avg_q_coverage avg_h_coverage eukaryota archaea firmicutes cyanobacteria proteobacteria other_bacteria other mainly_bacterial_hits
scaffold_449 203083 193922 51 1 29 11 20516 54 4.90909090909091 3967 11901 3966 11898 0.059 0.061 0.33 0.396 0.82 0.57 0 0 0 0 0 0 0 f

(1 row)

  • clearly a Physcomitrella scaffold

Composition of "Bacterial" cluster 2

Image:All data cluster 2 subset pca.png Image:All data cluster 2 subset k5 annotated.png

Mapping of known Physcomitrella nt sequences

Furthermore, the localization of published Physcomitrella nt sequences was checked via BLASTN (-e 0.00001 -v 10 -b 10) and subsequent filtering (HIT::length_aln('query')>=300; HIT::frac_identical>=0.95).

In total 265 distinct scaffolds were hit. Only one of the was not as expected from cluster1:

When I further analyzed the cluster, it became obvious, that this scaffold is the result of chimaric assembly: the first half is Physcomitrella (here is also the localization of the gene which yielded the hit), while the second half is bacterial as indicated by the vast occurence of ORFs with homologs to Firmicutes.

Hence, this is a extreme case which cannot be dealt with by excluding it entirely. This can possibly be solved by exclusion of the contigs from the second half prior to the next assembly.

Further checking of the separation performance of this approach

As already indicated by the results from the further subclustering of cluster3b, the cluster3b might still contain putative contaminations.

Test_case 1: arginyl-tRNA synthetase

Stefan found additional interesting contamination candidates, when investigating arginyl-tRNA synthetase from Canavalia ensiformis, that he communicated already to Ralph Quatrano:

Another (related) story to check. If you BLAST with the sequence of the arginyl-tRNA synthetase from Canavalia ensiformis (sequence below) against the 8x main genome scaffolds, you will receive an intronless hit against pos. 123649-125376 in scaffold 451 (one of the above mentioned scaffolds). There are three more significant hits (BLAST result). The last hit (scaffold 39) almost certainly represents the actual eukaryotic plant gene that you would expect. Hit #one and #two are intronless hits against scaffold 637 and scaffold_9342. It would certainly be interesting to see which of the four sequences are actually present in the (nuclear) genome. The expectation would be that the first three are contaminants and the last one is for real.

When we have a look in what clusters those 4 scaffolds end up we get the following picture:

Conclusion

scaffold_9342 is definitely an interesting candidate to check int the wet lab!!!

Test_case 2

A diploma student in our lab has checked 7 putative Physcomitrella homologs to bacterial proteins in the lab, which he discovered in the traces:

All but one, could NOT be amplified from Physcomitrella genomic DNA / cDNA using several PCR approaches.

When I've checked these contamination candidates in the 8X_main_genome via TBLASTN (1E-4), I found the following scaffolds:

1. scaffold_1061 from cluster2 short, clearly bacterial, candidate couldn't be amplified
2. scaffold_484 from cluster2 longer, clearly bacterial, candidate couldn't be amplified
3. scaffold_583 from cluster2 longer, clearly bacterial, candidate couldn't be amplified
4. scaffold_451 from cluster2 longer, clearly bacterial, candidate couldn't be amplified, already known as contaminant from Contaminations/Detailed_analysis_of_possible_contaminant_scaffolds
5. scaffold_501 from cluster2 longer, clearly bacterial, 2 candidates couldn't be amplified, already known as contaminant from Contaminations/Detailed_analysis_of_possible_contaminant_scaffolds
6. scaffold_5296 from cluster2 short, clearly bacterial, candidate couldn't be amplified
Conclusion
For these 7 cases, our method's accuracy is 100%! I.e., we already have 7 wet-lab confirmations for the bacterial cluster2.

More variables?

To improve the separation of cluster3, I've tried to include data from the JGI's filtered gene models:

  1. ngene_models => number of filtered gene models
  2. gm_length => length of the filtered gene models
  3. nexons => total number of exons
  4. avg_nexons => avg_number of exons per gene model

Sadly, this analysis doesn't lead to a substantially better separation of the clusters:

Image:With_FM_data.PCA.png

I also checked, whether the scaling using scaffold_length ratios has a negative influence - with the result that the separation is even worse than the initial analysis with fewer variable:

Image:Without_length_scaling.PCA.png

Conclusions

It seems as we have detected the four following partitions among our main_genome scaffolds:

cluster 1
real Physcomitrella scaffolds making up 86.95% of the sequence information in the main_genome partition (423,303,332bp ~ 83% of the expected Physcomitrella genome (FCM analysis 511Mbp) [average GC ~35%]
cluster 2
likely contamination with an/several? yet unsequenced bacterium/bacteria (1.37% of the sequence information) [average GC ~50%]
cluster 3a
most likely real Physcomitrella sequences with EST evidence of average length of 44,160bp (53,875,777bp ~ 11% of the expected Physco genome size). [average GC ~40%]
cluster 3b
possibly contains both, real Physcomitrella sequences with some predicted, but non-sense or fragmentary ORFs without homology to known sequences and bacterial scaffolds with ORFs which are either not so well predicted by FrameD or (more likely) with ORFs that yield hits below our (very stringent) thresholds. The somewhat "bacterial" GC of cluster 3b is the result of that fact. After we have more wet-lab confirmation, especially for the subpartitions of cluster3b, we can calibrate the thresholds using the results. [average GC ~50%]
Personal tools