Institut de Génétique et Microbiologie

CRISPRfinder program online


Université Paris Sud

Data Summary

            Genomes analysed    CRISPRs found (*)
Archaea     231                 890 (202)
Bacteria    6600                8732 (3293)
Total       6831                9622 (3495)

(*) Number of convincing CRISPR structures (number of genomes with such CRISPRs)

Database status:
Last update : 2017-01-02

Contact:
Christine POURCEL


FASTA Format

Definition

The first line starts with a greater-than sign ">" and contains a name or other identifier for the sequence. This is the sequence header, and it must fit on a single line. The remaining lines contain the sequence data, in upper or lower case letters. Anything other than letters (numbers, for example) is ignored. Multiple sequences can be present in the same file as long as each sequence has its own header.
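The rules above can be sketched as a small parser. This is only an illustration of the format definition, not the code used by the server; the function name `parse_fasta` and the demo records are assumptions made for the example.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence} (hypothetical helper)."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header occupies exactly one line; the ">" is not part of the name.
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Anything other than letters (digits, spaces, ...) is ignored.
            records[header] += "".join(ch for ch in line if ch.isalpha())
    return records


# Made-up demo input with two records.
example = """>seq1 demo
ACGT 1234
acgt
>seq2
GGCC
"""
records = parse_fasta(example.splitlines())
```

In this sketch a repeated header overwrites the earlier record, and sequence lines that appear before any header are skipped; a stricter parser might reject both cases.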

Supported Nucleic acid code

A --> adenosine
C --> cytidine
G --> guanosine
T --> thymidine
U --> uridine
R --> G A (purine)
Y --> T C (pyrimidine)
K --> G T (keto)
M --> A C (amino)
S --> G C (strong)
W --> A T (weak)
B --> G T C
D --> G A T
H --> A C T
V --> G C A
N --> A G C T (any)
- --> gap of indeterminate length
Ns are accepted; the other IUB/GCG ambiguity letters (MRWSYKVHDBX) are converted to Ns.
Any other characters are deleted.
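As an illustration, the conversion rule could look like the following sketch. The function name `normalize_sequence` is hypothetical. Two further choices are assumptions of this example rather than documented behaviour: the output is forced to upper case, and the gap character "-" is dropped along with the other non-letter characters.

```python
# Letters kept as they are (assumption: U is kept, since it is a supported code).
PLAIN = set("ACGTUN")
# IUB/GCG ambiguity letters that are converted to N.
AMBIGUOUS = set("MRWSYKVHDBX")


def normalize_sequence(seq: str) -> str:
    """Apply the conversion rule to a raw sequence string (illustrative only)."""
    out = []
    for ch in seq.upper():
        if ch in PLAIN:
            out.append(ch)
        elif ch in AMBIGUOUS:
            out.append("N")
        # Every other character (digits, spaces, gaps, other letters) is deleted.
    return "".join(out)


cleaned = normalize_sequence("acgtRYn-12 xZ")
```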

Example

The FASTA format is a plain text format which looks something like this:

>Escherichia coli UTI89|886538|887045
GTTCACTGCCGTACAGGCAGCTTAGAAA TGACGCCATATGCAGATCATTGAGGCGAAACC
GTTCACTGCCGTACAGGCAGCTTAGAAA ACGTTCGCACCGGTCAGGGTACTGCGCAGCGT
GTTCACTGCCGTACAGGCAGCTTAGAAA GAAACCAGAGCGCCCGCATAAAACAGGCACAA
GTTCACTGCCGTACAGGCAGCTTAGAAA GCCAGCATAAAACCGCCTTTGATATTTTATTG
GTTCACTGCCGTACAGGCAGCTTAGAAA TCAGCCGGAGGCTCTCAATTTCAGCCGCGCGG
GTTCACTGCCGTACAGGCAGCTTAGAAA AGCACGGCTGCGGGGAATGGCTCAATCTCTGC
GTTCACTGCCGTACAGGCAGCTTAGAAA TGATGGCGCAGCAGTCCTCCCTCCTGCCGCCA
GTTCACTGCCGTACAGGCAGCTTAGAAA CTGAACGTTGAAGAGTGCGACCGTCTCTCCTT
GTTCACTGCCGTACAGGCAGTATTCACA

CRISPR definition

CRISPR structure

Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) form a distinctive repeat structure found in many prokaryotic genomes. They show characteristics of both tandem and interspaced repeats. They have been described in a wide range of prokaryotes, including the majority of Archaea and many Eubacteria (Jansen 2002 [6]). A CRISPR locus is mainly characterized by the following features:

DRs and Spacers

In a given strain, several CRISPRs can be found, sharing a single direct repeat (DR) sequence or carrying different ones, but only one CRISPR of each kind is associated with the cas genes. The spacers differ from one CRISPR to another.
The nature of these unique spacer sequences is still largely unknown, but several studies have identified some of them as fragments of foreign DNA, mostly of viral origin (Bolotin 2005 [2]; Mojica 2005 [8]; Pourcel 2005 [9]).
It is proposed that these spacers derive from phages and subsequently help protect the cell from infection (Barrangou 2007 [11]).

Cas Genes

Genes called cas (for CRISPR-associated) are found in the vicinity of CRISPRs (Jansen 2002 [6]). Their exact number is not known and apparently varies from one strain to another; however, a core of 4 genes is regularly identified. Phylogenetic studies performed on the Cas proteins suggest that CRISPRs are acquired by horizontal transfer (Godde 2006 [3]; Haft 2005 [5]). Their presence on megaplasmids further supports this.

Leader sequence

Several observations suggest that CRISPR loci are transcribed into small RNAs, possibly from the leader acting as a promoter, and that these RNAs might act like siRNAs to block the entry of foreign sequences (Tang 2002 [10]; Makarova 2006 [7]).

CRISPRFinder description

Maximal repeats

A maximal repeat is a repeat with no possible extension to the right or the left without incurring a mismatch.

Maximal repeats have interesting computational properties: they can be computed in linear time using a suffix-tree-based algorithm, and their number is linear (at most equal to the sequence length). A CRISPR structure is a succession of maximal repeats (the direct repeats) separated by the spacers. CRISPRFinder uses this property to find possible locations of CRISPRs. The maximal repeats are found with VMatch, the successor of REPuter (Kurtz 1999 [4]), which is based on an efficient implementation of enhanced suffix arrays (Abouelhoda 2004 [1]).
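To make the definition concrete, here is a deliberately naive sketch that enumerates maximal repeated pairs by brute force. It runs in quadratic time or worse and is not the suffix-array approach described above; the function name `maximal_repeat_pairs`, the `min_len` parameter and the toy sequence are assumptions made for the illustration.

```python
from typing import List, Tuple


def maximal_repeat_pairs(seq: str, min_len: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start_1, start_2, length) for each exact repeat that cannot be
    extended to the left or to the right without a mismatch (brute force)."""
    n = len(seq)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Not left-maximal: both copies are preceded by the same base.
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of the sequence.
            length = 0
            while j + length < n and seq[i + length] == seq[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i, j, length))
    return pairs


# Toy sequence: the 6-mer GTTCAC occurs at positions 0 and 9.
toy = "GTTCACAAAGTTCACTTT"
found = maximal_repeat_pairs(toy, min_len=4)
```

On this toy input the only pair reported is the repeat GTTCAC at positions 0 and 9. Note that the sketch does not forbid overlapping copies, which a CRISPR-oriented search would probably exclude.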

Program structure

The main idea of the CRISPRFinder program is to find possible CRISPR locations and then check whether these regions contain a cluster that meets the CRISPR structure definition.
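The second step, checking a candidate region, might be sketched as follows. This is a simplified illustration and not the program's actual procedure: the helper names `extract_spacers` and `looks_like_cluster`, the exact-match search for the DR, and the size thresholds `min_ratio` and `max_ratio` are all assumptions chosen for the example.

```python
from typing import List


def extract_spacers(region: str, dr: str) -> List[str]:
    """Return the sequences lying between consecutive exact copies of dr."""
    starts = []
    pos = region.find(dr)
    while pos != -1:
        starts.append(pos)
        pos = region.find(dr, pos + len(dr))
    return [region[a + len(dr):b] for a, b in zip(starts, starts[1:])]


def looks_like_cluster(spacers: List[str], dr_len: int,
                       min_ratio: float = 0.6, max_ratio: float = 2.5) -> bool:
    """Hypothetical test: spacers are comparable in size to the DR and all differ."""
    if not spacers:
        return False
    sizes_ok = all(min_ratio * dr_len <= len(s) <= max_ratio * dr_len
                   for s in spacers)
    all_distinct = len(set(spacers)) == len(spacers)
    return sizes_ok and all_distinct


dr = "GTTCAC"
region = dr + "AAATTT" + dr + "CCCGGG" + dr   # DR, spacer, DR, spacer, DR
spacers = extract_spacers(region, dr)
is_cluster = looks_like_cluster(spacers, len(dr))
tandem_only = looks_like_cluster(extract_spacers(dr * 3, dr), len(dr))
```

Exact matching would miss a degenerated final repeat such as the last line of the FASTA example above, so a realistic check would need to tolerate mismatches in the DR.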

Parameter description

The advanced version of CRISPRfinder allows users to interactively modify the program parameters.
Two types of parameters may be altered :

Questionable CRISPRs

There are two kinds of "questionable" CRISPRs:

Many of these structures are not true CRISPRs and need to be critically investigated.
One way to do so is to check whether the questionable CRISPR lies within a coding sequence: CRISPRs are usually non-coding and do not belong to genes. Another way is to check the internal conservation of the candidate DRs and the divergence of the candidate spacers. More definitive evidence may come from typing a collection of strains of the species, which requires some bench work.
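The conservation and divergence check can be approximated with a simple pairwise identity score, as in the sketch below. It compares sequences position by position without any alignment, which is a simplification; the helper names `identity` and `mean_pairwise_identity` are assumptions, and no particular cut-off is implied.

```python
from itertools import combinations
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions, compared position by position (no alignment)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def mean_pairwise_identity(seqs: List[str]) -> float:
    """Average identity over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(identity(a, b) for a, b in pairs) / len(pairs)


# Candidate DRs should score high; candidate spacers should score low.
candidate_drs = ["GTTCAC", "GTTCAC", "GTTCAT"]
candidate_spacers = ["AAATTT", "CCCGGG", "TGCATG"]
dr_score = mean_pairwise_identity(candidate_drs)
spacer_score = mean_pairwise_identity(candidate_spacers)
```

A high score for the DRs together with a low score for the spacers is consistent with a genuine CRISPR, but it does not replace the check against coding sequences or the strain typing mentioned above.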

References

  1. Abouelhoda, M. I., S. Kurtz, and E. Ohlebusch. 2004. Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms 2:53-86.
  2. Bolotin, A., B. Quinquis, A. Sorokin, and S. D. Ehrlich. 2005. Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin. Microbiology 151:2551-61.
  3. Godde, J. S., and A. Bickerton. 2006. The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J Mol Evol 62:718-29.
  4. Kurtz, S., and C. Schleiermacher. 1999. REPuter: Fast Computation of Maximal Repeats in Complete Genomes. Bioinformatics 15(5):426-427.
  5. Haft, D. H., J. Selengut, E. F. Mongodin, and K. E. Nelson. 2005. A Guild of 45 CRISPR-Associated (Cas) Protein Families and Multiple CRISPR/Cas Subtypes Exist in Prokaryotic Genomes. PLoS Comput Biol 1:e60.
  6. Jansen, R., J. D. van Embden, W. Gaastra, and L. M. Schouls. 2002. Identification of a novel family of sequence repeats among prokaryotes. Omics 6:23-33.
  7. Makarova, K. S., N. V. Grishin, S. A. Shabalina, Y. I. Wolf, and E. V. Koonin. 2006. A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct 1:7.
  8. Mojica, F. J., C. Diez-Villasenor, J. Garcia-Martinez, and E. Soria. 2005. Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 60:174-82.
  9. Pourcel, C., G. Salvignol, and G. Vergnaud. 2005. CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 151:653-63.
  10. Tang, T. H., J. P. Bachellerie, T. Rozhdestvensky, M. L. Bortolin, H. Huber, M. Drungowski, T. Elge, J. Brosius, and A. Huttenhofer. 2002. Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc Natl Acad Sci U S A 99:7536-41.
  11. Barrangou, R., C. Fremaux, H. Deveau, M. Richards, P. Boyaval, S. Moineau, D. A. Romero, and P. Horvath. 2007. CRISPR Provides Acquired Resistance Against Viruses in Prokaryotes. Science 315(5819):1709-1712.
  12. Rice, P., I. Longden, and A. Bleasby. 2000. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16:276-277.