cosmoss.org documentation
Available databases
| ppp_nr The nr (non-redundant) dataset is based on the ppp dataset (see below). However, each cluster is represented only by the longest sequence it contains. This is similar to NCBI's unique UniGene approach. In theory, the nr set therefore represents all genes present in the ppp dataset. | |
| ppp These are all the sequences that are left after clustering and assembly. They consist of contigs as well as singlets. Singlets can be part of clusters or independent. In addition, the dataset includes so-called problem sequences, which were detected as possible chimeras (cloning artefacts). Please note that each cluster can contain more than one sequence! Ideally, each cluster contains one gene and its transcripts (with splice variants). However, multiple sequences in one cluster can also be due to close paralogs and/or cloning or sequencing artefacts. | |
| ppp_orf Nucleic acid database. This dataset is based on the ppp_nr dataset. It contains the predicted ORFs that are longer than 150 nt. ORF prediction was carried out using ESTScan and FrameD with species-specific models. | |
| ppp_fil This database is a cleaned-up version of the raw EST input data, i.e. the raw data after filtering. During filtering, sequences that significantly match the E. coli genome or Physcomitrella rRNA, mitochondrial or plastid genes are removed. All sequence stretches that match vector sequence are excised. Sequences that contain fewer than 150 meaningful bases are then removed. Before the sequences are clustered, low-complexity, A-tail and repetitive regions are masked so that they do not disturb clustering and assembly. In addition, known plant UTRs from UTR-DB and repeats from Repbase are masked. | |
| ppp_raw The unfiltered EST data. | |
| ppp_icm The "iterative contig members". For very large clusters, contigs are built iteratively. The information on the original member sequences is not present in the visualization of the contig, but can be recovered by a BLAST search against this dataset. | |
| ppp_seeds The full length, annotated CDS that have been used for seed clustering. | |
| ppp_orfpep Peptide sequence database. This dataset is based on the ppp_nr dataset. It contains the predicted ORFs that are longer than 50 aa. ORF prediction was carried out using ESTScan and FrameD with species-specific models. | |
| pp_fosmids High quality, full length sequences of genomic clones that have been produced by JGI as a quality control for the WGS data. | |
| pp_traces The whole genome shotgun (WGS) reads or traces, produced by JGI. For historical reasons, the data are split into two databases. To search them simultaneously, use the advanced option -d database | |
| Ceratodon The equivalent to the Physcomitrella ppp_nr data, produced for the available Ceratodon purpureus ESTs | |
| Tortula The equivalent to the Physcomitrella ppp_nr data, produced for the available Tortula ruralis ESTs |
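
Two of the rules described above, choosing the longest sequence per cluster for ppp_nr and the 150-base cutoff used for ppp_fil, can be illustrated with a minimal sketch. This is not the pipeline's actual code: the tuple layout of the input, the tie-breaking rule and the interpretation of "meaningful bases" as bases that are not masked with N or X are all assumptions made for the example.

```python
def longest_per_cluster(records):
    """Pick one representative per cluster: the longest sequence.

    `records` is an iterable of (cluster_id, name, sequence) tuples.
    This tuple layout is an assumption made for the example.
    """
    best = {}
    for cluster_id, name, seq in records:
        current = best.get(cluster_id)
        # On ties the first sequence seen is kept (tie-breaking is an assumption).
        if current is None or len(seq) > len(current[1]):
            best[cluster_id] = (name, seq)
    return best


def meaningful_bases(seq, masked="NXnx"):
    """Count bases that are not masked (treating N/X as masked is an assumption)."""
    return sum(1 for base in seq if base not in masked)


def passes_length_filter(seq, minimum=150):
    """Keep a sequence only if it has at least `minimum` meaningful bases."""
    return meaningful_bases(seq) >= minimum
```

For example, given two contigs from cluster 1001, `longest_per_cluster` keeps only the longer one, and `passes_length_filter` rejects a read whose unmasked part is shorter than 150 bases.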
Nomenclature
| A sequence name of the form PPP_1001_C1 denotes a contig from a cluster: in this case, contig 1 (C1) from cluster number 1001. | |
| If the sequence name contains an additional sd (as in PPP_sd_112_C2), it is a seed cluster sequence. Seed clusters are produced prior to the main steps of clustering and assembly. As seeds, we take all the publicly available Physcomitrella CDS (coding sequences). Because of this, you will find previously characterised genes in seed clusters. | |
| Singlets generally use the GenBank accession number they are derived from as their name. | |
| Singlets that are part of clusters (clustered singlets) additionally contain the name of the cluster after the accession number. | |
| Problem sequences are easily distinguished by the PR- in front of the sequence name. Such sequences have a high probability of being chimeras (cloning artefacts). | |
| For a more detailed description of the information contained in the sequence headers, please read this PDF. |
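
The naming rules above can be turned into a small classifier. The following is a sketch under stated assumptions, not an official parser: the regular expression is derived only from the two examples given (PPP_1001_C1 and PPP_sd_112_C2), every other name is treated as a singlet without attempting to split off a cluster suffix (the exact layout of clustered singlet names is not specified here), and the accession used in the usage note is hypothetical.

```python
import re

# Pattern inferred from the examples PPP_1001_C1 and PPP_sd_112_C2 (an assumption).
CONTIG_RE = re.compile(r"^PPP_(?P<seed>sd_)?(?P<cluster>\d+)_C(?P<contig>\d+)$")


def classify_name(name):
    """Classify a sequence name as contig, seed contig or singlet."""
    # Problem sequences carry a leading "PR-".
    problem = name.startswith("PR-")
    core = name[3:] if problem else name

    match = CONTIG_RE.match(core)
    if match:
        return {
            "kind": "seed contig" if match.group("seed") else "contig",
            "cluster": int(match.group("cluster")),
            "contig": int(match.group("contig")),
            "problem": problem,
        }
    # Anything else is assumed to be a singlet named after its accession.
    return {"kind": "singlet", "name": core, "problem": problem}
```

Calling `classify_name("PPP_sd_112_C2")` reports a seed contig from cluster 112, while a name such as `PR-AB123456` (a made-up accession) is reported as a singlet flagged as a problem sequence.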
Some definitions concerning EST clustering
| contig: a contig is a consensus sequence built from at least two sequences with local sequence similarity. | |
| singlet: a singlet is a sequence that did not find a matching partner during the initial pairwise comparisons between the input sequences (clustering) or during assembly of a specific cluster. The latter type of singlet is referred to as a clustered singlet. | |
| cluster: a cluster is a pool of sequences that were found to have local similarities. During assembly, the pool of sequences is reduced to a single contig (ideally) or several contigs and/or singlets. |
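
The difference between a cluster and an independent singlet can be sketched with a simple single-linkage grouping. This is only an illustration of the definitions, assuming that pairwise similarity has already been decided elsewhere; it is not the clustering software used for these datasets, and it does not model the assembly step in which clustered singlets arise.

```python
def cluster_pairs(names, similar_pairs):
    """Group sequence names into clusters and independent singlets.

    `similar_pairs` lists pairs of names found to share local similarity.
    Sequences linked directly or indirectly end up in the same cluster.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)

    clusters = sorted(sorted(g) for g in groups.values() if len(g) > 1)
    singlets = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, singlets
```

With four sequences a, b, c and d, where a matches b and b matches c, the function returns one cluster containing a, b and c, and the singlet d.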
The Sequence Retrieval User Manual
BLAST interface - alphabetical list of terms
| alignments | the maximum number of detailed alignments to be shown (default: 50) |
| database | |
| e-mail | activate this button to receive your results via e-mail (plain text recommended) |
| e-value threshold | the e-value threshold (cutoff); decimal (e.g. 0.01) and exponent (e.g. 10e-2) formats are accepted (default: 10e-4 for peptide comparisons and 10e-2 for BLASTN) |
| gap extension | the gap extension penalty (default: 2 for BLASTN, otherwise 1 [for BLOSUM62]) |
| gap open | the gap opening penalty (default: 5 for BLASTN, otherwise 11 [for BLOSUM62]) |
| html | activate this button to retrieve your results in html format (with hyperlinks) |
| list size | the maximum number of hits that will be listed (default: 50) |
| matrix | the substitution matrix (default: BLOSUM62) |
| molecule | the molecule type of your input (query) sequence: nucleic acid or peptide |
| nofilter | activate to turn OFF low complexity filtering (default: filtering ON) |
| other | additional BLAST command line parameters (experts only) |
| output | determines whether you would like plain text or html output (default: html) and whether you want it displayed on a webpage (default) or sent by e-mail |
| query name | the name of your query sequence (an identifier for your search) |
| sequence | your sequence in plain text or in FastA format |
| sequence from file | instead of pasting your sequence into the form, you can upload it from disk (FastA format accepted) |
| subsequence | only the given range of your sequence will be used |
| text | activate this button to receive your results as plain text |
| type | specify the kind of BLAST search you want to execute (the default is automatically determined by input sequence type vs. database type!) |
| ungapped | activate if you don't want to allow gaps (default: allow gaps) |
| wordsize | the word size k (default: 11 for BLASTN, otherwise 3) |
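
The defaults listed above depend on whether the search is a BLASTN search or a peptide comparison. The sketch below collects them in one place and shows one plausible way the search type could follow from query and database molecule types. The mapping in `default_program` uses the standard BLAST program names and is an assumption about how the interface chooses; the function and key names are hypothetical, and the e-value defaults are kept as strings exactly as written in the list.

```python
def default_program(query_molecule, database_molecule):
    """Suggest a BLAST program from query and database molecule types.

    The mapping is the conventional one and is assumed, not confirmed,
    to match the automatic choice made by the web interface.
    """
    table = {
        ("nucleic acid", "nucleic acid"): "blastn",
        ("peptide", "peptide"): "blastp",
        ("nucleic acid", "peptide"): "blastx",
        ("peptide", "nucleic acid"): "tblastn",
    }
    return table[(query_molecule, database_molecule)]


def default_parameters(program):
    """Return the documented defaults for BLASTN or for peptide comparisons."""
    is_blastn = program == "blastn"
    return {
        "alignments": 50,
        "list size": 50,
        # E-values are kept as strings in the notation used in the list above.
        "e-value threshold": "10e-2" if is_blastn else "10e-4",
        "gap open": 5 if is_blastn else 11,
        "gap extension": 2 if is_blastn else 1,
        "wordsize": 11 if is_blastn else 3,
        # Assumption: the substitution matrix only applies to peptide comparisons.
        "matrix": None if is_blastn else "BLOSUM62",
        "filter": True,   # low complexity filtering is ON unless nofilter is set
        "gapped": True,   # gaps are allowed unless ungapped is set
    }
```

For a nucleic acid query against ppp_orfpep, `default_program("nucleic acid", "peptide")` suggests `blastx`, and `default_parameters("blastx")` then yields the peptide defaults (word size 3, gap open 11, gap extension 1).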