Definition
In the FASTA format, the first line starts with a greater-than sign (">") and contains a name or other identifier for the sequence. This is the sequence header and must be on a single line. The remaining lines contain the sequence data. The sequence can be in upper- or lower-case letters. Anything other than letters (numbers, for example) is ignored. Multiple sequences can be present in the same file as long as each sequence has its own header.
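The rules above can be illustrated with a minimal parsing sketch. This is not the parser used by CRISPRFinder; the function name `parse_fasta` is hypothetical, and restricting "letters" to ASCII letters is an assumption.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Keep letters only; digits, spaces and other symbols are ignored.
            chunks.append("".join(ch for ch in line if ch.isascii() and ch.isalpha()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Sequence lines that appear before any header are skipped in this sketch, since every sequence is expected to have its own header.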
Supported Nucleic acid code
A --> adenosine
C --> cytidine
G --> guanosine
T --> thymidine
U --> uridine
R --> G A (purine)
Y --> T C (pyrimidine)
K --> G T (keto)
M --> A C (amino)
S --> G C (strong)
W --> A T (weak)
B --> G T C
D --> G A T
H --> A C T
V --> G C A
N --> A G C T (any)
- --> gap of indeterminate length
Ns are accepted; IUB/GCG letters (MRWSYKVHDBX) will be converted to Ns.
Any other characters will be deleted.
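As a rough illustration of this conversion rule, the sketch below keeps A, C, G, T, U and N, turns the IUB/GCG letters into N, and drops everything else. The function name `normalize_sequence` is hypothetical, and converting the result to upper case and keeping U unchanged are assumptions not stated above.

```python
IUB_AMBIGUOUS = set("MRWSYKVHDBX")  # letters converted to N
ACCEPTED = set("ACGTUN")            # letters kept as they are (assumption for U)


def normalize_sequence(seq):
    """Apply the conversion rule to one raw sequence string."""
    out = []
    for ch in seq.upper():  # upper-casing is an assumption of this sketch
        if ch in ACCEPTED:
            out.append(ch)
        elif ch in IUB_AMBIGUOUS:
            out.append("N")
        # Any other character (digits, gaps, punctuation) is deleted.
    return "".join(out)
```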
Example
The FASTA format is a plain text format which looks something like this:
>Escherichia coli UTI89|886538|887045
GTTCACTGCCGTACAGGCAGCTTAGAAA TGACGCCATATGCAGATCATTGAGGCGAAACC
GTTCACTGCCGTACAGGCAGCTTAGAAA ACGTTCGCACCGGTCAGGGTACTGCGCAGCGT
GTTCACTGCCGTACAGGCAGCTTAGAAA GAAACCAGAGCGCCCGCATAAAACAGGCACAA
GTTCACTGCCGTACAGGCAGCTTAGAAA GCCAGCATAAAACCGCCTTTGATATTTTATTG
GTTCACTGCCGTACAGGCAGCTTAGAAA TCAGCCGGAGGCTCTCAATTTCAGCCGCGCGG
GTTCACTGCCGTACAGGCAGCTTAGAAA AGCACGGCTGCGGGGAATGGCTCAATCTCTGC
GTTCACTGCCGTACAGGCAGCTTAGAAA TGATGGCGCAGCAGTCCTCCCTCCTGCCGCCA
GTTCACTGCCGTACAGGCAGCTTAGAAA CTGAACGTTGAAGAGTGCGACCGTCTCTCCTT
GTTCACTGCCGTACAGGCAGTATTCACA
CRISPR definition
CRISPR structure
Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) are a curious repeat structure found in many prokaryotic genomes. They show characteristics of both tandem and interspaced repeats. They have been described in a wide range of prokaryotes, including the majority of Archaea and many Eubacteria (Jansen 2002 [4]).
A CRISPR locus is mainly characterized by:
- Direct repeats (DRs) and spacers: a CRISPR is a succession of 21-47 bp sequences called direct repeats (DRs), separated by unique sequences of a similar length (spacers). Sometimes the DR at one end of the CRISPR is not fully conserved; it is then called a degenerate DR.
- A leader sequence: the CRISPR locus is generally flanked on one side by a common leader sequence of 200-350 bp.
- A family of cas genes: CRISPR-associated (cas) genes are genes that are always found closely linked to the repetitive sequences.
DRs and Spacers
In a given strain, several CRISPRs can be found, with a single or different DR sequences, but only one of each kind is associated with the cas genes. The spacers differ from one CRISPR to another.
The nature of the unique sequences is still largely unknown, but several recent studies identified some of them as fragments of foreign DNA, mostly of viral origin (Bolotin 2005 [1]); (Mojica 2005 [6]); (Pourcel 2005 [7]).
It is proposed that these spacers derive from phages and subsequently help protect the cell from infection (Barrangou 2007 [12]).
Cas Genes
Some genes, called cas (for CRISPR-associated), are found in the vicinity of CRISPRs (Jansen 2002 [4]). Their exact number is not known and apparently varies from one strain to another. However, a core of 4 genes is regularly identified. Phylogenetic studies performed on the Cas proteins suggest that CRISPRs are acquired by horizontal transfer (Godde 2006 [2]); (Haft 2005 [3]). This is further supported by their presence on megaplasmids.
Leader sequence
Different observations suggest that CRISPR loci are transcribed into small RNAs, possibly from the leader acting as a promoter, and that these might play a role similar to siRNA in blocking the entry of foreign sequences (Tang 2002 [8]); (Makarova 2006 [5]).
CRISPRFinder description
Maximal repeats
A maximal repeat is a repeat with no possible extension to the right or the left without incurring a mismatch.
Maximal repeats have interesting computational properties, since they can be computed in linear time using a suffix-tree-based algorithm and their number is linear (at most equal to the sequence length).
A CRISPR structure is a succession of maximal repeats (the direct repeats) separated by the spacers. CRISPRFinder uses this property to find possible localizations of CRISPRs. Finding the maximal repeats is done with VMatch, which is the upgrade of REPuter (Kurtz 1999 [4]), based on an efficient implementation of enhanced suffix arrays (Abouelhoda 2004 [4]).
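To make the definition of a maximal repeat concrete, here is a deliberately naive sketch that enumerates exact maximal repeat pairs. It is quadratic in the sequence length and is not the enhanced-suffix-array method used by VMatch; the function name `maximal_repeat_pairs` is hypothetical, and mismatches are not handled.

```python
def maximal_repeat_pairs(seq, min_len):
    """Return (start_a, start_b, length) for each exact maximal repeat pair.

    A pair is maximal when it cannot be extended to the left or to the
    right without introducing a mismatch.
    """
    n = len(seq)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Left-maximality: skip pairs that could be extended to the left.
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of the sequence,
            # which makes the pair right-maximal by construction.
            length = 0
            while j + length < n and seq[i + length] == seq[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i, j, length))
    return pairs
```

For example, in `"xACGTyACGTz"` the two copies of `ACGT` form a single maximal pair, because the flanking characters differ on both sides.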
Program structure
The main idea of the CRISPRFinder program is to find possible CRISPR localizations and then check if these regions contain a cluster that meets CRISPR structure definition.
- 1) Possible localizations
Finding possible CRISPR localizations is achieved by detecting maximal repeats (see the Maximal repeats section above). This step is performed by the VMatch package. The default parameters are the following:
a repeat length of 23 to 55 bp, a gap size between repeats of 25 to 60 bp, and one nucleotide mismatch between repeats.
- 2) CRISPR features
The criteria a CRISPR should fit are the following:
- The spacer size compared to the DR size:
This filter is mainly added to eliminate structures having for example a 45 bp DR and a 20 bp spacer.
By default, the spacer size should be from 0.6 to 2.5 times the DR size.
- The spacers similarity:
This filter is set to eliminate tandem repeats. The spacers are compared by aligning them (using the default parameters of the ClustalW program). The spacer similarity percentage is calculated with the percentage_identity() function of the BioPerl interface (AlignIO methods, ClustalW interface).
By default, this parameter is set to 60%.
- The DR conservation:
The direct repeat should be well conserved.
The DR scan is done using the fuzznuc program of the EMBOSS package (Rice 2000 [13]). The allowed mismatch is equal to a third of the DR size (default parameter) to take into account the degenerate DR (one of the flanking DRs). Then a global mismatch score is computed as the average number of mismatches (not including the degenerate DR); it should not exceed a threshold of 20% of the DR size (by default).
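The numeric filters described in the two steps above can be sketched as follows, using the default values quoted in the text. This is only an illustration: all function names are hypothetical, the gap is assumed to be measured from the end of one repeat to the start of the next, the bounds are assumed to be inclusive, and the DR check uses a plain Hamming distance on equal-length repeats instead of fuzznuc. The spacer similarity filter, which relies on a ClustalW alignment, is not covered.

```python
def is_candidate_pair(start_a, start_b, length,
                      min_len=23, max_len=55, min_gap=25, max_gap=60):
    """Step 1: keep a repeat pair whose length and gap are in the default ranges."""
    # Assumption: gap = distance from the end of the first copy to the
    # start of the second copy.
    gap = start_b - (start_a + length)
    return min_len <= length <= max_len and min_gap <= gap <= max_gap


def spacer_size_ok(dr_len, spacer_len, min_ratio=0.6, max_ratio=2.5):
    """Step 2: the spacer size must be 0.6 to 2.5 times the DR size."""
    return min_ratio * dr_len <= spacer_len <= max_ratio * dr_len


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def dr_conservation_ok(consensus, repeats, max_fraction=0.20):
    """Step 2: the average mismatch count must not exceed 20% of the DR size.

    The caller is expected to leave the degenerate DR out of `repeats`.
    """
    if not repeats:
        raise ValueError("at least one repeat is required")
    mean_mismatch = sum(hamming(consensus, r) for r in repeats) / len(repeats)
    return mean_mismatch <= max_fraction * len(consensus)
```

With these defaults, the structure mentioned above (a 45 bp DR with a 20 bp spacer) is rejected by `spacer_size_ok(45, 20)`, since the spacer is shorter than 0.6 times the DR size.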
Parameter description
The advanced version of CRISPRFinder allows users to interactively modify the program parameters.
Two types of parameters may be altered:
- 1) Vmatch parameters (first possible CRISPR localizations):
The range of the repeat length may be modified within the range [18 bp, 60 bp]. The default interval for repeats is [23 bp, 55 bp]. The gap size between repeats may also be modified and it is set by default to an interval of [25 bp, 60 bp]. The user may choose to allow one or zero nucleotide mismatch between the repeats. By default, this parameter is set to one.
Attention: long sequences combined with a small minimum repeat size result in a long computation time.
- 2) CRISPR properties parameters:
The CRISPR properties parameters are divided into spacer parameters and DR parameters.
- i) Spacer parameters
Spacer size: specifies the allowed spacer size relative to the DR size. By default it is within the range of 0.6 to 2.5 times the DR size.
Spacer similarity: specifies the maximal tolerated similarity percentage between spacers, used to eliminate tandem repeats. By default this parameter is set to 60%. Increasing this value results in more background, especially tandem repeats.
- ii) DR parameters
Allowed mismatch between DRs: specifies the maximal percentage of non-similarity between DRs.
Allowed mismatch for the degenerate DR: shows one of the flanking DRs with a mismatch up to the given percentage. This parameter should meet or exceed 40% and should be larger than the allowed mismatch between DRs.
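The Vmatch and spacer parameters above, with their defaults and allowed ranges, could be grouped and validated as in the following sketch. The class and field names are hypothetical and do not correspond to the actual CRISPRFinder interface; the DR mismatch parameters are left out because their default percentages are not all given here.

```python
from dataclasses import dataclass


@dataclass
class CrisprFinderParams:
    # Vmatch parameters (defaults quoted in the text)
    min_repeat_len: int = 23
    max_repeat_len: int = 55
    min_gap: int = 25
    max_gap: int = 60
    repeat_mismatches: int = 1
    # Spacer parameters (defaults quoted in the text)
    min_spacer_ratio: float = 0.6
    max_spacer_ratio: float = 2.5
    max_spacer_similarity: float = 60.0  # percent

    def validate(self):
        """Return a list of human-readable problems (empty if none)."""
        errors = []
        if not (18 <= self.min_repeat_len <= self.max_repeat_len <= 60):
            errors.append("repeat length must stay within [18 bp, 60 bp]")
        if self.min_gap > self.max_gap:
            errors.append("minimum gap must not exceed maximum gap")
        if self.repeat_mismatches not in (0, 1):
            errors.append("only zero or one mismatch between repeats is allowed")
        if self.min_spacer_ratio > self.max_spacer_ratio:
            errors.append("minimum spacer ratio must not exceed maximum ratio")
        if not (0.0 <= self.max_spacer_similarity <= 100.0):
            errors.append("spacer similarity must be a percentage")
        return errors
```

Calling `CrisprFinderParams().validate()` on the defaults returns an empty list, while a minimum repeat length below 18 bp is reported as a problem.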
Questionable CRISPRs
There are two kinds of "questionable" CRISPRs:
- Small CRISPRs, i.e. structures having only two or three DRs
- Structures where the repeated motifs (DR in CRISPR) are not 100% identical.
Many of these structures are not true CRISPRs, and they need to be critically investigated.
One way to "critically investigate" is to see whether the questionable CRISPR seems to lie within a coding sequence. CRISPRs are usually non-coding and do not belong to genes. Another way is to check the internal conservation of the candidate DRs and the divergence of the candidate spacers. More definitive evidence might be provided by typing a collection of strains from the species in question, which requires some bench work.
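The two criteria that make a CRISPR "questionable" can be expressed as a short check. This is a sketch under assumptions: the function name `questionable_reasons` is hypothetical, DRs are compared as upper-case strings, and the real program may apply the criteria differently.

```python
def questionable_reasons(repeats):
    """Return the reasons why a candidate is questionable (empty if none).

    `repeats` is the list of DR sequences found in the candidate structure.
    """
    reasons = []
    # Criterion 1: small structures with only two or three DRs.
    if len(repeats) <= 3:
        reasons.append("small CRISPR")
    # Criterion 2: the repeated motifs are not 100% identical.
    if len({r.upper() for r in repeats}) > 1:
        reasons.append("DRs not identical")
    return reasons
```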
References
- Abouelhoda, M. I., S. Kurtz, and E. Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2:53-86.
- Bolotin, A., B. Quinquis, A. Sorokin, and S. D. Ehrlich. 2005. Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology 151:2551-61.
- Godde, J. S., and A. Bickerton. 2006. The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J Mol Evol 62:718-29.
- Kurtz, S., and C. Schleiermacher. 1999. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 15:426-427.
- Haft, D. H., J. Selengut, E. F. Mongodin, and K. E. Nelson. 2005. A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol 1:e60.
- Jansen, R., J. D. van Embden, W. Gaastra, and L. M. Schouls. 2002. Identification of a novel family of sequence repeats among prokaryotes. Omics 6:23-33.
- Makarova, K. S., N. V. Grishin, S. A. Shabalina, Y. I. Wolf, and E. V. Koonin. 2006. A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct 1:7.
- Mojica, F. J., C. Diez-Villasenor, J. Garcia-Martinez, and E. Soria. 2005. Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 60:174-82.
- Pourcel, C., G. Salvignol, and G. Vergnaud. 2005. CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 151:653-63.
- Tang, T. H., J. P. Bachellerie, T. Rozhdestvensky, M. L. Bortolin, H. Huber, M. Drungowski, T. Elge, J. Brosius, and A. Huttenhofer. 2002. Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc Natl Acad Sci U S A 99:7536-41.
- Barrangou, R., C. Fremaux, H. Deveau, M. Richards, P. Boyaval, S. Moineau, D. A. Romero, and P. Horvath. 2007. CRISPR provides acquired resistance against viruses in prokaryotes. Science 315:1709-1712.
- Rice, P., I. Longden, and A. Bleasby. 2000. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16:276-277.