Various notes

  • There is a step in which variants are re-discovered in the BAM file. This may fail when the variant caller has used some type of re-alignment (as freebayes does). Would be better to integrate this into the variant caller or to get the information out of it. This applies only to indels, which are not supported right now anyway.

  • Input format for HapCompass: http://www.brown.edu/Research/Istrail_Lab/resources/hapcompass_manual.html#sec11

Allele detection with re-alignment

WhatsHap can detect which allele a read contains at a variant position by aligining a part of the read to the two possible haplotypes. The haplotype for which the alignment is better wins.

Allele detection through re-alignment is enabled when the --reference parameter is used on the command-line.

Re-alignment in this version detects slightly fewer alleles than the old algorithm, but this is typically justified because the old algorithm gave wrong results. Re-alignment however correctly detects that both haplotypes are equally good and then refuses to choose.

The alignment algorithm uses edit distance at the moment, which allows us to detect alleles correctly most of the time, but does not allow us to make use of base qualities (in fact, the weighted algorithm degenerates into an unweighted one). To fix this, we need a better alignment algorithm.

Here are some examples for how re-alignment works.

Insertion next to a SNP

Haplotypes:

ref:   CCTTAGT
alt:   CCTCAGT

Alignment as reported in BAM file:

ref:   CCT-TAGT
query: CCTCAAGT

The second T is aligned to an A, which is not one of the expected bases. Thus, no variant would be detected here.

Re-aligning the query to the “alt” haplotype, we get:

alt:   CCTCA-GT
query: CCTCAAGT

This alignment has lower cost and we therefore detect that the allele in this read is probably the alternative one.

Ambiguous

This was previously detected incorrectly:

ref:   TGCTTTAAGG
alt:   TGCTTTCAGG
query: TGCCTTCAAGG

Two possible alignments are

ref:   TGC-TTTAAGG
query: TGCCTTCAAGG

and

alt:   TGCTTTCA-GG
query: TGCCTTCAAGG

Both have cost two and therefore the correct allele is unclear.