Supplementary data from Bandelt et al. (2002)

On this page, you will find supplementary data from the following paper:

Bandelt, H.-J., Quintana-Murci, L., Salas, A. and Macaulay, V. (2002). The fingerprint of phantom mutations in mtDNA data. American Journal of Human Genetics, 71, 1150-1160. [PDF file from journal site]

(June 2010): the programs SPECTRA and NETMAT have been recompiled and should now work in Windows XP, Vista and 7. Let me know if they don't!


Program to perform the weighty filter

For given data, the program NETMAT (which you can download here) will filter away the speedy mutations and prepare a binary matrix of the filtered data in a form that can be used as input to the program SPECTRA (below), or to the phylogenetic network software package NETWORK.

Prepare your data file as follows. Each line of the file represents one individual's DNA sequence, with a list of positions that differ from some reference sequence. The distinct mutations at each site (transitions, transversions, indels) are given distinct labels. The convention we use is to label (i) a transition with respect to the reference sequence by its position in the reference sequence and (ii) a transversion or indel by its position, plus a suffix to indicate the base change. For example, if the reference sequence was
AAGGCCTTA
we might code the sequence
AGGGCAT-A
as
2 6A 8del
Sequences that match the reference sequence should be signified with a "0". Labels should be restricted to a maximum of 6 characters.
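
The labelling convention above can be sketched in a few lines of Python. This is only an illustration, not part of NETMAT: the function names are hypothetical, the two sequences are assumed to be already aligned to the same length, and insertions relative to the reference are deliberately left out because the text above does not spell out how they are labelled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref_base, base):
    """True for a purine<->purine or pyrimidine<->pyrimidine change."""
    pair = {ref_base, base}
    return pair <= PURINES or pair <= PYRIMIDINES


def encode(reference, sequence):
    """Return the label list for one sequence aligned to the reference."""
    if len(reference) != len(sequence):
        raise ValueError("sequences must be aligned to the same length")
    labels = []
    pairs = zip(reference.upper(), sequence.upper())
    for position, (ref_base, base) in enumerate(pairs, start=1):
        if base == ref_base:
            continue
        if ref_base == "-":
            # Assumption: insertions are not handled in this sketch.
            raise ValueError("insertions are not handled in this sketch")
        if base == "-":
            label = f"{position}del"        # deletion: position plus suffix
        elif is_transition(ref_base, base):
            label = str(position)           # transition: position only
        else:
            label = f"{position}{base}"     # transversion: position plus new base
        if len(label) > 6:
            raise ValueError(f"label {label!r} is longer than 6 characters")
        labels.append(label)
    # A sequence identical to the reference is written as "0".
    return " ".join(labels) if labels else "0"


print(encode("AAGGCCTTA", "AGGGCAT-A"))  # 2 6A 8del
print(encode("AAGGCCTTA", "AAGGCCTTA"))  # 0
```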

An example data file of the 50 Adygei mtDNA HVS-I sequences from Macaulay et al. (1999) Am. J. Hum. Genet. 64:232-249 is available here. In this file the mutations are labelled by their position in the reference sequence, minus 16,000. Three-character codes refer to transitions, and the (single) four-character code to a transversion.

To generate an RDF file for NETWORK, use:
netmat -s -o output.rdf < input.txt
where "input.txt" is the name of the data file and "output.rdf" is the name of the output file (for an example, see "adygei.rdf").

To generate an RDF file for NETWORK while filtering away some mutations, use:
netmat -s -o output.rdf -f filter.txt < input.txt
The file "filter.txt" should contain a list of the codes corresponding to the speedy mutations, one per line. In the example above, if the deletion at position 8 were speedy, you would prepare a file containing the single line "8del", and use that as "filter.txt". Our suggested filter files for HVS-I and HVS-II of human mtDNA can be found below.

To generate a matrix for the program SPECTRA while filtering away some mutations, use:
netmat -t -s -o output.txt -f filter.txt < input.txt
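
To make the effect of the filter concrete, here is a rough Python sketch of the idea: drop the speedy labels, then turn what is left into a 0/1 matrix with the header line that SPECTRA expects (described further down this page). It is not NETMAT's code. The function names, the ordering of columns by position, and the decision to give every remaining label its own column are all assumptions made for illustration, so the real program's output may differ.

```python
import re


def position_key(label):
    """Sort labels by their leading number, then alphabetically."""
    match = re.match(r"\d+", label)
    return (int(match.group()) if match else 0, label)


def filtered_matrix(lines, speedy):
    """Drop speedy labels and build one 0/1 column per remaining label."""
    speedy = set(speedy)
    rows = [
        {label for label in line.split()
         if label != "0" and label not in speedy}
        for line in lines
    ]
    # Assumption: columns are ordered by position in the reference sequence.
    columns = sorted(set().union(*rows), key=position_key)
    matrix = ["".join("1" if c in row else "0" for c in columns)
              for row in rows]
    return columns, matrix


def to_spectra_text(matrix, n_columns):
    """Header line (rows, columns) followed by one matrix row per line."""
    return "\n".join([f"{len(matrix)} {n_columns}"] + matrix)


data = ["2 6A 8del", "0", "2 8del"]   # three made-up individuals
columns, matrix = filtered_matrix(data, speedy=["8del"])
print(columns)                        # ['2', '6A']
print(to_spectra_text(matrix, len(columns)))
```

With the label "8del" filtered away, only the columns for "2" and "6A" remain, and the individual coded "0" becomes a row of zeros.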

A suggested weighty filter file for HVS-I

A file containing the list of speedy transitions in HVS-I (between 16051 and 16365, less 16000) that we used for the weighty filter examples in the paper can be downloaded here. To use this file with NETMAT, a possible command line would be
netmat -s -o matrix.rdf -f speedy.hvsi < input.txt
which would prepare a file "matrix.rdf" for NETWORK.

A suggested weighty filter file for HVS-II

A file containing the list of fast sites in HVS-II from the paper of Malyarchuk et al. (2002) Hum. Genet., 111:46-53 is available here. These we equate with speedy transitions.

A list of mutational counts in HVS-I

A list of the mutational counts that we used, for example, to construct our list of speedy HVS-I transitions is available here.

Program to compute spectra

For a given binary matrix, the program SPECTRA computes the cube and incompatibility spectra, and can also perform the permutation described in the paper.

In the context we describe, the matrix is the result of a mapping from aligned DNA sequence data. Each row represents the sequence of a (haploid) individual and each column the nucleotide at a particular position in the DNA. If a position is polymorphic in the sample, only two nucleotides are assumed to be segregating, and these are coded as '0' and '1'. A non-segregating position can be represented as a column of '0's or of '1's.

On the first line of the data file, two positive integers should be present which are i) the number of sequences (rows in the matrix) and ii) the number of DNA positions (columns in the matrix). Then follows the matrix, with one row of the file per row of the matrix. An example would be:
5 4
0001
0011
1001
1111
0001
which represents 5 individuals and 4 DNA positions, the first and third of which are incompatible with each other, and the last of which is fixed.
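
For readers who want to check such statements by hand, the following Python sketch reads a matrix in this format and reports the fixed columns and the incompatible pairs. It is not the SPECTRA source: the function names are hypothetical, and incompatibility is taken here to mean the usual four-gamete criterion (two binary columns are incompatible when all four combinations 00, 01, 10 and 11 occur among the rows), which is assumed to match what the program does. Columns are numbered from zero.

```python
from itertools import combinations


def parse_matrix(text):
    """Read the header (rows, columns) and the rows of a 0/1 matrix."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    rows = tokens[2:]
    if len(rows) != n_rows or any(len(row) != n_cols for row in rows):
        raise ValueError("matrix does not match the header line")
    return rows


def fixed_columns(rows):
    """Columns in which every individual carries the same state."""
    n_cols = len(rows[0]) if rows else 0
    return [j for j in range(n_cols) if len({row[j] for row in rows}) == 1]


def incompatible_pairs(rows):
    """Pairs of columns showing all four combinations of 0 and 1."""
    n_cols = len(rows[0]) if rows else 0
    return [
        (i, j)
        for i, j in combinations(range(n_cols), 2)
        if len({(row[i], row[j]) for row in rows}) == 4
    ]


example = """5 4
0001
0011
1001
1111
0001
"""
rows = parse_matrix(example)
print(fixed_columns(rows))        # [3]
print(incompatible_pairs(rows))   # [(0, 2)]
```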

The program, which you can download here, runs under the DOS prompt in the various versions of Windows. (If you would like the C code in order to compile the program for your favorite operating system, please contact me at the email address below.) It reads the data from the standard input and sends the results to the standard output, so you will probably want to use redirection to make things easy. So, for example, you might type:
spectra < data.txt > results.txt
at the DOS prompt to run the program on the data file "data.txt" and to put the output in "results.txt".

Applied to the above example, the program should put the following in "results.txt":
The raw data:
No. of haplotypes: 5
No. of characters: 4
000 : 0001
001 : 0011
002 : 1001
003 : 1111
004 : 0001
------------
Non-pruned characters: 0 1 2
The cooked data:
No. of haplotypes: 5
No. of characters: 3
000 : 000
001 : 001
002 : 100
003 : 111
004 : 000
------------
The incompatibility matrix of the cooked data:
000 : 001
001 : 000
002 : 100
------------
The incompatibility and cube spectra:
s = ( 1 3 1)
f = ( 5 5 1)
------------

This contains the data that the program read from the input file (sequences numbered consecutively, starting from zero); the "cooked" data, in which fixed positions are pruned and positions that split the sequences into the same two sets are merged (positions are numbered consecutively, starting from zero); the incompatibility matrix of the cooked data; and the incompatibility (s) and cube (f) spectra.
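
The "cooking" step can be imitated with a short Python sketch. Again, this is not the SPECTRA source and the function name is hypothetical. It assumes that a column and its complement count as the same split, and that when several columns give the same split the first one is the one kept; the program itself may choose differently.

```python
def cook(rows):
    """Drop fixed columns and keep one column per distinct split."""
    n_cols = len(rows[0]) if rows else 0
    flip = str.maketrans("01", "10")
    kept = []          # zero-based indices of the original columns retained
    kept_columns = []  # the retained columns, as strings of 0s and 1s
    seen_splits = set()
    for j in range(n_cols):
        column = "".join(row[j] for row in rows)
        if len(set(column)) < 2:
            continue   # fixed position: pruned
        # A column and its complement divide the rows into the same two sets.
        split = min(column, column.translate(flip))
        if split in seen_splits:
            continue   # same split as an earlier column: merged
        seen_splits.add(split)
        kept.append(j)
        kept_columns.append(column)
    cooked = ["".join(col[i] for col in kept_columns)
              for i in range(len(rows))]
    return kept, cooked


kept, cooked = cook(["0001", "0011", "1001", "1111", "0001"])
print(kept)    # [0, 1, 2]
print(cooked)  # ['000', '001', '100', '111', '000']
```

For the example matrix this agrees with the "Non-pruned characters" and "cooked data" lines of the output shown above.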

You can specify the option "-n perms" on the command line to perform the permutation described in the paper. Here "perms" is the number of permutations to perform. Crudely speaking, the 0s and 1s in each column are jumbled up in each permutation. The output file then contains the same output as above, for the original (unpermuted) data, and then for each permuted data set; and finally the mean cube and incompatibility spectra across the permutations (with standard deviations).
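
The "jumbling" can be pictured with the Python sketch below, which shuffles the entries of each column independently. It follows only the crude description given here, not the precise procedure in the paper or the SPECTRA source, and the function name is hypothetical. The number of 0s and 1s in each column is unchanged by the shuffle.

```python
import random


def permute_columns(rows, rng):
    """Shuffle the 0s and 1s within each column, independently per column."""
    columns = [list(column) for column in zip(*rows)]
    for column in columns:
        rng.shuffle(column)
    # Transpose back to one string per individual.
    return ["".join(states) for states in zip(*columns)]


rng = random.Random(2002)   # seeded only so that a run can be repeated
data = ["0001", "0011", "1001", "1111", "0001"]
for _ in range(3):
    print(permute_columns(data, rng))
```

Each permuted data set would then be analysed in the same way as the original, and the spectra averaged across permutations.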


If you download the software and want to be told of any bugs or new versions, please email me your email address (at the address below).

We would be very grateful to hear of any problems with the software, errors in the paper or of any problems you have with the web site. A list of any errata will appear here.

Vincent Macaulay
11th November 2002 (corrected 26th June 2010)
v.macaulay@stats.gla.ac.uk