Providing a reference FASTA (with -r) is now mandatory even for whatshap haplotag. It was already mandatory for whatshap phase. In both cases, this is to prevent accidentally getting bad results because allele detection through realignment (which usually performs better) is only possible if a reference is provided. Use --no-reference explicitly to fall back to the less accurate algorithm.
whatshap split crashed when attempting to split reads in a FASTQ file by haplotype.
#377: Speed-up of about 20-30% for whatshap polyphase via some optimizations in the read clustering algorithm.
Removed the deprecated
whatshap polyphase received extensive algorithmic updates. The compatibility with different data sets (species and sequencing technology) has been improved. The wall-clock time has been reduced by about 20-30%, depending on the input data.
#353: Fix incorrect HS tags in
#356: Fixed crash when reading VCF variants without GT fields (happens in GVCFs).
whatshap haplotag has gained option --output-threads for setting the number of compression threads, significantly reducing wall-clock time. Also, if output is sent to a pipe, uncompressed BAM is written. Thanks to @cjw85.
phase --merge-reads. This option has never worked correctly and just led to whatshap phase taking a very long time and in some cases even crashing. With the fix, the option should work as intended, but we have not evaluated how much it improves phasing results.
#335: Add option
whatshap compare (thanks to Pontus Höjer)
whatshap compare crashing on VCFs with genotypes with an unknown allele (where
whatshap stats now reads the chromosome lengths (for N50 computation) from the VCF header, no need to use
haplotag --ignore-linked-reads not working
#241: Fix some
#249: Fix crash in the haplotag command on reading a VCF with the PS tag set to
haplotag to correctly write to standard output.
#207: Allow multiple
The file created with --output-read-list was not correctly tab-separated.
phase --full-genotyping option. Instead, use whatshap genotype followed by
#289: Fix parsing of GVCFs (with dots in the ALT column)
polyphase can now work in parallel
WhatsHap has not seen a release in over a year although development has continued. To make up for it, we decided to leave ZeroVer behind and set the version number to 1.0.
WhatsHap has gained initial support for phasing polyploid samples! While this feature may not be quite ready for production use, we encourage you to test it by using the whatshap polyphase subcommand and to report any issues you find back to us. See also the pre-print at <https://doi.org/10.1101/2020.02.04.933523> for details.
#51: Reading and writing VCF files is now significantly faster because we switched to a different library for that task (
The switch to pysam.VariantFile also makes WhatsHap stricter in which VCF files it accepts. We have tried to give sensible error messages in these cases, but please report any remaining issues.
.bcf files can now be read and written.
.vcf.gz output files are now compressed with bgzip so that they can be indexed with tabix.
Providing an indexed reference FASTA is now mandatory (with --reference). It is possible to bypass this by using --no-reference, but that will disable realignment and therefore give worse phasing results on error-prone reads (PacBio, Nanopore).
#187: Implemented a --regions option for the
--discard-unknown-reads option for the split subcommand. Reads that are in the input reads file (BAM/FASTQ), but are not listed in the haplotag file will be discarded (by default, they are part of the “untagged” output).
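The effect of --discard-unknown-reads described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name split_reads, the labels H1/H2/none, and the in-memory dictionary standing in for the haplotag file are hypothetical and do not reflect WhatsHap's actual implementation or file formats.

```python
def split_reads(read_names, haplotag_list, discard_unknown_reads=False):
    """Partition read names by haplotype (illustrative sketch only).

    haplotag_list maps read name -> "H1", "H2" or "none" (hypothetical labels).
    """
    bins = {"H1": [], "H2": [], "untagged": []}
    for name in read_names:
        if name not in haplotag_list:
            # Read is in the input file but not listed in the haplotag file.
            if not discard_unknown_reads:
                bins["untagged"].append(name)  # default behaviour
            continue
        tag = haplotag_list[name]
        # Listed reads without a haplotype stay in the untagged bin.
        bins[tag if tag in ("H1", "H2") else "untagged"].append(name)
    return bins


reads = ["r1", "r2", "r3", "r4"]
tags = {"r1": "H1", "r2": "H2", "r3": "none"}  # r4 is unknown

default = split_reads(reads, tags)
strict = split_reads(reads, tags, discard_unknown_reads=True)
```

With the default, the unknown read r4 lands in the untagged bin; with discard_unknown_reads=True it is dropped, while r3 (listed, but without a haplotype) is kept as untagged in both cases.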
split subcommand can now process .bam files lacking the sequence field for some/all reads.
The minimum required Python version for WhatsHap is now 3.6.
whatshap stats: sometimes returned wrong N50 values if the end position of the last block of a chromosome was larger than the starting position of the first block of the next chromosome.
haplotag command should now be able to properly write CRAM files.
--ignore-read-groups did not work when phased blocks (VCF) were provided as input.
Integration of the HapChat algorithm as an alternative MEC solver, available through whatshap phase --algorithm=hapchat. Contributed by the HapChat team, see https://doi.org/10.1186/s12859-018-2253-8.
This is the last release of WhatsHap to support Python 3.4.
#140: Haplotagging now works when chromosomes are missing in the VCF.
--merge-reads, which is helpful for high coverage data.
When phasing pedigrees, ensure that haplotypes are ordered as paternal_allele|maternal_allele in the output VCF. This seems to be a common convention and also used by 1000G.
Test cases now use pytest instead of nose (which is discontinued).
#167: Fix the haplotag command. It would tag reads incorrectly.
#154: Use barcode information in BX tags when running haplotag on 10x Genomics linked read data.
#153: Allow combination of
--samples to only work on a subset of samples in a pedigree. Added --use-ped-samples to only phase samples mentioned in PED file (while ignoring other samples in input VCF).
genotype for haplotype-aware genotyping (see https://doi.org/10.1101/293944 for details on the method).
Support CRAM files in addition to BAM.
#133: No longer create BAM/CRAM index if it does not exist. This is safer when running multiple WhatsHap instances in parallel. From now on, you need to create the index yourself (for example with samtools index) before running WhatsHap.
#152: Reads marked as “duplicate” in the input BAM/CRAM file are now ignored.
#157: Adapt to changed interface in Pysam 0.14.
#158: Handle read groups with missing sample (SM) tag correctly.
Fix compilation problem by distinguishing gcc and clang.
--full-genotyping to (re-)genotype the given variants based on the reads
whatshap compare --switch-error-bed to write BED file with switch error positions
whatshap compare --plot-blocksizes to plot histograms of block sizes
--longest-block-tsv to output position-wise stats on longest joint haplotype block
whatshap compare --tsv-multiway to write results of multi-way comparison to tab-separated file
Added option --chromosome to whatshap stats
whatshap compare can now compute the block-wise Hamming distance
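A block-wise Hamming distance between two phasings can be sketched as follows. This is a minimal illustration, assuming each phasing of a block is given as a string of 0/1 alleles over the same heterozygous variants; the function names are hypothetical, and the exact definition used by whatshap compare may differ. Because the labelling of the two haplotypes is arbitrary, the sketch compares against both the second phasing and its complement and keeps the smaller value.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must cover the same variants")
    return sum(x != y for x, y in zip(a, b))


def complement(haplotype):
    """Flip every allele: the other haplotype of a heterozygous block."""
    return "".join("1" if allele == "0" else "0" for allele in haplotype)


def block_hamming(hap_a, hap_b):
    """Hamming distance between two phasings, ignoring haplotype labelling."""
    return min(hamming(hap_a, hap_b), hamming(hap_a, complement(hap_b)))


truth = "010110"
predicted = "101011"  # complement is 010100, one allele away from truth
distance = block_hamming(truth, predicted)
```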
whatshap stats can now compute an N50 for the phased blocks
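For readers unfamiliar with the statistic, the textbook N50 over a list of block lengths can be sketched as below. This is an assumption-laden illustration, not the code used by whatshap stats: the function name n50 is hypothetical, and WhatsHap's own definition may differ in detail (for example in how chromosome lengths enter the computation).

```python
def n50(block_lengths):
    """Textbook N50: the length L such that blocks of length >= L
    together cover at least half of the total block length."""
    lengths = sorted(block_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # no blocks at all


blocks = [80, 70, 50, 30, 20]  # total 250, half is 125
value = n50(blocks)            # 80 + 70 = 150 >= 125, so N50 is 70
```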
Fixed compilation issues on OS X (clang)
Detect unsorted VCFs and chromosome name mismatches between BAM and VCF
Fix crash when whatshap compare encounters unphased VCFs
PS tag instead of HP tag by default to store phasing information. This applies to the
PS is also used by other tools and standard according to the VCF specification.
Incorporated genotype likelihoods into our phasing framework. On request (by using option --distrust-genotypes), genotypes can now be changed at a cost corresponding to their input genotype likelihoods. The changed genotypes are written to the output VCF. The behavior of --distrust-genotypes can be fine-tuned by the added options
Correctly handle cases when processing VCFs with two or more disjoint families.
Speed up allele detection
unphase subcommand which removes all phasing from a VCF file (PS tags, pipe notation).
phase subcommand, which allows choosing whether ReadBackedPhasing-compatible HP tags or standard PS tags are used to describe phasing in the output VCF.
Manage versions with versioneer. This means that whatshap --version and the program version in the VCF header will include the Git commit hash, such as
Add subcommand “haplotag” to tag reads in a BAM file with their haplotype.
Fix a bug where re-alignment around variants at the very end of a chromosome would lead to an AssertionError.
When phasing a pedigree, blocks that are not connected by reads but can be phased based on genotypes will be connected per default. This behavior can be turned off using option
Implemented allele detection through re-alignment: To detect which allele of a variant is seen in a read, the query is aligned to the two haplotypes at that position. This results in better quality phasing, especially for low-quality reads (PacBio). Enabled if --reference is provided. Current limitation: No score for the allele is computed.
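The idea behind allele detection through re-alignment can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses plain edit distance as the alignment cost, and the helper names edit_distance and detect_allele as well as the flank-based haplotype construction are hypothetical. WhatsHap's actual alignment procedure is not described in this changelog and may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match or mismatch
        prev = cur
    return prev[-1]


def detect_allele(read_window, left_flank, ref_allele, alt_allele, right_flank):
    """Return 0 (reference), 1 (alternative) or None (undecided).

    The read window is compared against both candidate haplotypes built
    from the reference flanks; the closer one determines the allele.
    """
    d_ref = edit_distance(read_window, left_flank + ref_allele + right_flank)
    d_alt = edit_distance(read_window, left_flank + alt_allele + right_flank)
    if d_ref < d_alt:
        return 0
    if d_alt < d_ref:
        return 1
    return None  # tie: the read does not support either allele


# An insertion variant (A -> ACC) and a read carrying it with one sequencing error.
allele = detect_allele("ACGAACCTTGA", "ACGT", "A", "ACC", "TTGA")
```

Because whole haplotype sequences are compared rather than single bases, the same procedure applies to insertions, deletions and other multi-base variants, not only to SNVs.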
As a side-effect of the new allele detection, we can now also phase insertions, deletions, MNPs and “complex” variants.
--chromosome to only work on specified chromosomes.
Use constant recombination rate per default, allows to use
whatshap has become a command with subcommands. From now on, you need to run whatshap phase to phase VCFs.
stats subcommand that prints statistics about phased VCFs.
--ped to phase pedigrees with the PedMEC algorithm
Phase all samples in a multi-sample VCF
Drop support for Python 3.2 - we require at least Python 3.3 now
This is the first release available via PyPI (and that can therefore be installed via pip install whatshap)
Trio phasing implemented in a branch
pWhatsHap implemented (in a branch)
Create haplotype-specific BAM files
Smart read selection
Ability to read multiple BAM files and merge them on the fly
Cython wrapper for C++ code done
Ability to write a phased VCF (using HP tags).
Repository for WhatsHap refactoring created
The WhatsHap algorithm is introduced at RECOMB