SWAN Parameters, Options and Output
The SWAN software is written in C, with some C++ features.
The input of the program is a sequence file and the following parameters:
- degeneracy level;
- the minimal and the maximal period sizes;
- the mode of statistical significance calculation (i.e. the "motif" mode or the "mask" mode);
- plain, fasta, EMBL or GenBank data format.
The program returns a single result file. It is a table containing the following information:
- the name of the sequence;
- the start, the end, and the length of the repeat;
- the period size;
- the number of copies;
- the IUPAC consensus;
- the number of words in the motif that satisfy the consensus;
- the "motif" probability;
- the "mask" probability;
- the "motif" P-value;
- the "mask" P-value;
- the "motif" statistical significance;
- the "mask" statistical significance;
- the tandem repeat itself.
One of the user-defined parameters of SWAN is the Degeneracy Level D.
Degeneracy is measured by the number of substitutions between any three neighboring units. Only words that have fewer than (P-D) substitutions relative to the neighboring words are included in the pattern.
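As a rough illustration of what counting substitutions between neighboring units involves, the sketch below splits a repeat into units of a given period and counts the mismatching positions between each adjacent pair. It is not taken from the SWAN source: the function names are hypothetical, the comparison is done pairwise for simplicity, and the acceptance threshold is left to the caller.

```python
def substitutions(a, b):
    """Number of positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have equal length")
    return sum(x != y for x, y in zip(a, b))


def neighbor_substitutions(repeat, period):
    """Substitution counts between each pair of adjacent units.

    A trailing incomplete unit, if any, is ignored (an assumption of
    this sketch, not documented SWAN behaviour).
    """
    units = [repeat[i:i + period]
             for i in range(0, len(repeat) - period + 1, period)]
    return [substitutions(u, v) for u, v in zip(units, units[1:])]


# Units of ATC|ACG|AGC compared pairwise.
print(neighbor_substitutions("ATCACGAGC", 3))
```

A caller could then compare each count with whatever limit the chosen degeneracy level implies.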
The Minimal And The Maximal Period Sizes
The minimal and the maximal period sizes are user-defined
parameters within the range from 3 up to half of the sequence length. The default values are 3 for
the minimal period size and 100 for the maximal period size.
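The range constraint above can be expressed as a small check. This is only a sketch: the function name is hypothetical, "half of the sequence length" is taken as integer division, and how SWAN itself reacts to an out-of-range value is not described here.

```python
DEFAULT_MIN_PERIOD = 3    # default minimal period size stated above
DEFAULT_MAX_PERIOD = 100  # default maximal period size stated above


def is_valid_period_range(seq_len,
                          min_period=DEFAULT_MIN_PERIOD,
                          max_period=DEFAULT_MAX_PERIOD):
    """True if 3 <= min_period <= max_period <= half the sequence length."""
    return 3 <= min_period <= max_period <= seq_len // 2


print(is_valid_period_range(1000))         # defaults on a 1000-letter sequence
print(is_valid_period_range(1000, 2, 10))  # minimal period below 3
```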
Mode of Statistical Significance Calculation: "Motif" or "Mask"
To evaluate the statistical significance one can use either of two possible modes. The first one is the "Motif" mode.
Motif means a set of words satisfying an IUPAC consensus composed for the repeated pattern.
The "Motif" Statistical Significance is based on the probability of finding a motif repeated contiguously no fewer than n
times (where n is the number of copies) in an independent random sequence
of length N, given that the motif has occurred at least once. This conditional probability literally reflects our searching
algorithm: “for each word in the sequence one checks whether it is repeated n times”.
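To make the notion of a motif concrete, the sketch below builds an IUPAC consensus from the units of a repeat, tests whether a word satisfies it, and counts how many words the consensus admits. The IUPAC code table is the standard nucleotide one. Building the consensus as the set of letters observed in each column is an assumption of this sketch, and the function names are hypothetical rather than SWAN's own.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they denote.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}
BASES = {code: bases for bases, code in IUPAC.items()}


def consensus(units):
    """IUPAC consensus covering every letter seen in each column."""
    return "".join(IUPAC[frozenset(column)] for column in zip(*units))


def satisfies(word, cons):
    """True if every letter of word is allowed by the consensus."""
    return len(word) == len(cons) and all(
        base in BASES[code] for base, code in zip(word, cons))


def motif_size(cons):
    """Number of distinct words that satisfy the consensus."""
    size = 1
    for code in cons:
        size *= len(BASES[code])
    return size


cons = consensus(["ATC", "ACG", "AGC"])
print(cons, motif_size(cons), satisfies("ACG", cons))
```

For the units ATC, ACG and AGC this gives the consensus ABS, which admits 1 x 3 x 2 = 6 words.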
In the second mode the significance is calculated based on the probability of finding a structure similar to that of the repeated pattern. Here the structure means a set of words complying with some "Mask".
For example, for the repeat R=ATC|ACG|AGC we see that the same letter occurs three times at positions 1, 4, 7, three different letters occupy positions 2, 5, 8, and two identical letters occur at positions 3, 6, 9. We say that a word of length 9 satisfies the "Mask" of repeat R if the same letter occurs three times at positions 1, 4, 7, any three letters are at positions 2, 5, 8, and at least two identical letters occur at positions 3, 6, 9.
So, for each position i, 1≤i≤T (T is the period size), and the repeat exponent n, values k1,...,kT are defined, where each ki is the maximal number of identical letters at positions i, i+T,…,i+(n-1)T. In the example above T=3, n=3, k1=3, k2=1, k3=2. Other repeats satisfying this "Mask" are TTC|TCC|TGG, ATC|ATC|ATC, CAA|CTA|CTC, etc.
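The "Mask" definition can be checked mechanically. The sketch below computes the values k1,...,kT for a repeat and tests whether another word of the same length satisfies that mask, using the examples given above. The function names are hypothetical, and the code illustrates the definition only; it is not SWAN's implementation.

```python
from collections import Counter


def mask_of(repeat, period):
    """Return [k1, ..., kT]: for each offset i within the period, the
    largest number of identical letters at positions i, i+T, ..., i+(n-1)T."""
    n = len(repeat) // period          # repeat exponent (number of copies)
    return [max(Counter(repeat[i:n * period:period]).values())
            for i in range(period)]


def satisfies_mask(word, repeat, period):
    """True if word has, at every offset, at least as many identical
    letters as the repeat does."""
    if len(word) != len(repeat):
        return False
    return all(k_word >= k_rep
               for k_word, k_rep in zip(mask_of(word, period),
                                        mask_of(repeat, period)))


R = "ATCACGAGC"                        # ATC|ACG|AGC
print(mask_of(R, 3))
for candidate in ("TTCTCCTGG", "ATCATCATC", "CAACTACTC", "ATCGCGAGC"):
    print(candidate, satisfies_mask(candidate, R, 3))
```

The first three candidates are the examples listed in the text. The last one fails because only two identical letters occur at positions 1, 4, 7.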
The probabilities on which the Statistical Significance is based are called the "Motif" and "Mask" P-values. The Statistical Significance is taken
as the negative logarithm of the P-value.
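The conversion from a P-value to a significance score is a one-line computation. The text does not state the base of the logarithm, so the sketch below takes it as a parameter; the default of 10 and the function name are assumptions.

```python
import math


def significance(p_value, base=10.0):
    """Negative logarithm of a P-value (base 10 assumed by default)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("P-value must lie in (0, 1]")
    return -math.log(p_value, base)


print(significance(1e-6))
```

With this convention a smaller P-value gives a larger significance score.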
One can choose one of the following data formats: