SWAN Parameters, Options and Output

The SWAN software is written in C, using some C++ features. The input of the program is a sequence file and the following parameters:

  1. degeneracy level;
  2. the minimal and the maximal period sizes;
  3. the mode of statistical significance calculation (i.e. the "motif" mode or the "mask" mode);
  4. the data format (plain, FASTA, EMBL or GenBank).
The program returns a single result file. It is a table containing the following information:
  1. the name of the sequence;
  2. the start, the end, and the length of the repeat;
  3. the period size;
  4. the number of copies;
  5. the IUPAC consensus;
  6. the number of words in the motif that satisfy the consensus;
  7. the "motif" probability;
  8. the "mask" probability;
  9. the "motif" P-value;
  10. the "mask" P-value;
  11. the "motif" statistical significance;
  12. the "mask" statistical significance;
  13. the tandem repeat itself.

Parameters

Degeneracy Level

One of the user-defined parameters of SWAN is the degeneracy level D. Degeneracy is measured by the number of substitutions between any three neighboring units. Only words that have fewer than (P-D) substitutions in the neighboring words are included in the pattern.

The Minimal And The Maximal Period Sizes

The minimal and the maximal period sizes are user-defined parameters in the range from 3 up to half the sequence length. The default values are 3 for the minimal period size and 100 for the maximal period size.
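
The range rule above can be sketched as a small check. This is only an illustration: the function name `resolve_period_range` is hypothetical, and both the integer rounding of "half the sequence length" and the clamping of an oversized maximum are assumptions, since the text does not say how SWAN treats out-of-range values.

```python
def resolve_period_range(seq_len, min_period=3, max_period=100):
    """Return a (min, max) period range that respects the documented limits.

    Assumptions (not stated in the SWAN text): half the sequence length is
    rounded down, and a maximum above that limit is clamped, not rejected.
    """
    upper_limit = seq_len // 2
    if upper_limit < 3:
        raise ValueError("sequence is too short for the smallest period (3)")
    if min_period < 3:
        raise ValueError("the minimal period size must be at least 3")
    max_period = min(max_period, upper_limit)
    if min_period > max_period:
        raise ValueError("the minimal period size exceeds the maximal one")
    return min_period, max_period
```

With the defaults, a sequence of length 1000 keeps the range (3, 100), while a sequence of length 50 is limited to (3, 25).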

Mode of Statistical Significance Calculation: "Motif" or "Mask"

To evaluate the statistical significance, one can use either of two modes. The first is the "Motif" mode. A motif is the set of words satisfying the IUPAC consensus composed for the repeated pattern. The "Motif" statistical significance is based on the probability of finding a motif repeated contiguously at least n times (where n is the number of copies) in a random sequence of length N with independent letters, given that the motif has occurred at least once. This conditional probability directly reflects the search algorithm: “for each word in the sequence one checks whether it is repeated n times”.
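
To make the notion of a motif concrete, here is a minimal sketch of composing an IUPAC consensus from the copies of a repeat and counting the words that satisfy it. It assumes that each consensus position is the standard IUPAC code for the set of letters observed at that position across the copies. The text does not specify how SWAN composes its consensus, and the function names are hypothetical.

```python
from math import prod

# Standard IUPAC nucleotide codes, keyed by the set of letters they stand for.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}
LETTERS_BY_CODE = {code: letters for letters, code in IUPAC_CODES.items()}


def iupac_consensus(copies):
    """Assumed rule: one IUPAC code per position, covering all observed letters."""
    return "".join(IUPAC_CODES[frozenset(column)] for column in zip(*copies))


def motif_size(consensus):
    """Number of distinct words that satisfy the consensus."""
    return prod(len(LETTERS_BY_CODE[code]) for code in consensus)


def matches(word, consensus):
    """True if every letter of the word is allowed by the consensus."""
    return len(word) == len(consensus) and all(
        letter in LETTERS_BY_CODE[code] for letter, code in zip(word, consensus)
    )
```

Under this assumption, the copies ATC, ACG, AGC give the consensus ABS, which is satisfied by 1 x 3 x 2 = 6 words.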

In the second mode, the significance is calculated from the probability of finding a structure similar to that of the repeated pattern. Here the structure means a set of words complying with some "Mask". For example, in the repeat R=ATC|ACG|AGC the same letter occurs three times at positions 1, 4, 7, three different letters occupy positions 2, 5, 8, and two identical letters occur at positions 3, 6, 9. We say that a word of length 9 satisfies the "Mask" of repeat R if the same letter occurs three times at positions 1, 4, 7, any three letters occupy positions 2, 5, 8, and at least two identical letters occur at positions 3, 6, 9. In general, for a repeat with period T and exponent n, the mask is defined by the values k1,...,kT, where each ki, 1≤i≤T, is the maximal number of identical letters at positions i, i+T,…,i+(n-1)T. In the example above T=3, n=3, k1=3, k2=1, k3=2. Other repeats satisfying this "Mask" are TTC|TCC|TGG, ATC|ATC|ATC, CAA|CTA|CTC, etc.
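
The mask definition can be checked with a short sketch. It computes the values k1,...,kT for a repeat and tests whether another word of the same length satisfies them. The function names are hypothetical, and this is an illustration of the definition, not SWAN's own code.

```python
from collections import Counter


def mask_values(repeat, period):
    """Return [k1, ..., kT]: for each position within the period, the maximal
    number of identical letters among the copies of the repeat."""
    if len(repeat) % period != 0:
        raise ValueError("repeat length must be a multiple of the period")
    copies = [repeat[i:i + period] for i in range(0, len(repeat), period)]
    return [
        max(Counter(copy[i] for copy in copies).values())
        for i in range(period)
    ]


def satisfies_mask(word, mask, n_copies):
    """True if the word has n_copies units and, at every position within the
    period, at least ki identical letters."""
    period = len(mask)
    if len(word) != period * n_copies:
        return False
    return all(
        found >= required
        for found, required in zip(mask_values(word, period), mask)
    )
```

For R=ATCACGAGC with period 3, `mask_values` returns [3, 1, 2], and the words TTCTCCTGG, ATCATCATC and CAACTACTC all satisfy that mask.
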
The probabilities on which the statistical significance is based are called the "Motif" and "Mask" P-values. The statistical significance is the negative logarithm of the P-value.
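
As a small worked illustration of this conversion: the base of the logarithm is not stated in the text, so base 10 is used below purely as an assumption, and the function name is hypothetical.

```python
import math


def significance(p_value, base=10.0):
    """Negative logarithm of a P-value.

    The base is an assumption; the SWAN text only says "minus logarithm".
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P-value must lie between 0 and 1")
    if p_value == 0.0:
        return math.inf  # the smaller the P-value, the higher the significance
    return -math.log(p_value, base)
```

With base 10, a P-value of 0.00001 corresponds to a significance of about 5, and smaller P-values give larger significance values.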

Data Format

One can choose one of the following data formats:

Plain:
Fasta:
EMBL:
GenBank:


Moscow State University, Moscow, Russia
GosNII "Genetika", Moscow, Russia
INRIA Rocquencourt, Le Chesnay, France
Send any questions or comments to: valeyo@yandex.ru
Last revised April 27, 2005