Determines lineages with given ingroup and outgroup frequencies in coalescent genealogies

View the Project on GitHub mpcox/hapmatch


This C++ program calculates haplotypes with given ingroup and outgroup frequencies in a set of coalescent genealogies. The code was used to calculate lineage frequencies for a bottlenecked Madagascar ingroup versus its source Indonesian outgroup in:

Cox MP, MG Nelson, MK Tumonggor, FX Ricaut and H Sudoyo. 2012. A small cohort of Island Southeast Asian women founded Madagascar. Proceedings of the Royal Society B 279: 2761-2768.

hapmatch reads input from Richard Hudson’s ms, like the following worked example file, test.ms:

ms 10 1 –t 4 –I 2 4 6 ...
1    111001100000110110000011110000000011101101110001110011 ⎤
2    110001100000110110000011110000000011101101110001110011   Malagasy
3    000001000000100110000011000000010000001010001100111000 
4    000001000000100110000011000000010000001010001100111000 ⎦
5    000001000000100110000011000000010000000000001100111000 ⎤
6    000001000000100110000011000000010000001010001100111000 
7    000000000010100110000011000110010000001110001100000000   Indonesians
8    000000000010100110000011000110010000001110001100000000 
9    000000000010100110000011000110000000001110001000000000 
10   000000000010100110000011000110011000001110011100000000 ⎦

Where sequences 1–4 are simulated data representing the Madagascar ingroup and 5–10 represent the Indonesian outgroup.

Running hapmatch with an ingroup frequency cutoff of ≥0.2, an outgroup frequency cutoff of ≤0.8 and requiring exact matches:

cat test.ms | hapmatch 4 0.2 6 0.8 0


S       Sin     Sout    hapmatch
54      31      20      1

This output occurs because:


hapmatch requires a working installation of Kevin Thornton’s libsequence library. The current code has been updated to require libsequence v2.

The easiest way to install libsequence is via bioconda:

conda install -c bioconda libsequence

The hapmatch source code can then be compiled with:

g++ -o hapmatch hapmatch.cc -lsequence -std=c++11 -Wall


Usage information can be found by running the command:


The program expects five input values: the sample size and frequency cutoff of the ingroup, the sample size and frequency cutoff of the outgroup, and a threshold for the number of allowed mismatches. A threshold of 0 returns exact matches only.

hapmatch 4 0.444 6 0.948 0


The following command simulates 3 datasets, each containing 10 chromosome copies, with the first 4 copies deriving from population 1 (the ingroup) and the remaining 6 copies from population 2 (the outgroup). hapmatch is set to find exact matches only, where haplotypes also meet the frequency cutoffs of ≥44.4% for population 1 and ≤94.8% for population 2.

ms 10 3 -t 4 -I 2 4 6 1 | hapmatch 4 0.444 6 0.948 0

The coalescent genealogies generated by ms are random, but the output formatting looks like the following, with each line containing summary values for a single input dataset:

S    Sin    Sout    hapmatch
27    26    24    0
17    12    15    1
14    8    9    0

where S is the total number of segregating sites, Sin is the number of segregating sites in the ingroup, Sout is the number of segregating sites in the outgroup, and hapmatch is the number of haplotypes that meet the required criteria.