Changes
v2.4 (2025-01-22)
#554: Added --exclude-chromosome option (can be used multiple times) to most subcommands (phase, haplotag, genotype, etc.).
#537: Fixed a crash when running haplotag on CRAM files.
#545: haplotagphase now supports multi-allelic variants.
#579: Fix --supplementary-distance option to phase not working.
Reduced processing time of BAM files by about 33% when using realignment.
v2.3 (2024-05-05)
#521: Added haplotagphase command. The command adds phase information to variants based on haplotagged reads. Contributed by Nikolai Karpov (@nkkarpov) and Mitchell Robert Vollger (@mrvollger).
#516: Added --use-supplementary option to phase. It enables the use of supplementary alignments for phasing (previously, supplementary alignments were ignored). Contributed by Nikolai Karpov (@nkkarpov).
v2.2 (2024-01-26)
#496: Fixed a segmentation fault in polyphase.
#498: Fixed a numeric overflow in the scoring phase of polyphase. It could occur for variants with extremely high coverages (i.e. >200X).
#472: Fixed various warnings and assertion violations when running polyphase.
#214: Added support for ploidies greater than two to whatshap split.
Added another algorithm for diploid phasing, which is a heuristic version of the default algorithm. Since it has not been tested extensively, we recommend the old algorithm for production use, especially for pedigree phasing. The main benefit is support for higher coverages and/or larger pedigrees, at the cost of no longer solving the underlying MEC model to optimality. The heuristic is accessible via the parameter --algorithm=heuristic.
v2.1 (2023-10-17)
We added k-merald, a new method for allele detection based on k-mer alignment. Instead of using a fixed cost value, k-merald derives k-mer mismatch penalties using the error profiles generated by whatshap learn. k-merald is available as an alternative to the edit-distance-based allele detection.
WhatsHap can now be used to generate sequencing error profiles for a specific technology using whatshap learn.
#470: Avoid ZeroDivisionError in whatshap stats when there are no heterozygous or no phased variants.
#485: Fixed a KeyError: 'parse_vcf' in whatshap polyphase when a full chromosome is skipped.
v2.0 (2023-06-30)
#346: Phasing of indels (and other non-SNVs) is now enabled by default. This previously required specifying the --indels option, which not all users knew about, so some were unnecessarily getting suboptimal phasing results. The option is now ignored and leads to a warning. An --only-snvs option was added that restores the old behavior. This change applies to the following subcommands: phase, haplotype, polyphase, polyphasegenetic. Since this is a backwards incompatible change (when not using --indels already), the major version has been increased.
#425: Haplotagging CRAM files should now work in more cases with haplotag.
#427: polyphase did not phase indels, even if explicitly told.
#432: polyphase can use existing phasing information in VCF when using the --use-prephasing flag. Still very experimental.
#439: polyphasegenetic now handles pedigree information more robustly and properly detects available ILP solvers.
#449: Fixed runtime issues for ploidies above 4, if no pre-phasing is used.
#450: polyphase now supports multi-allelic variants.
#457: haplotag now also tags alignments marked as duplicate.
#466: Inconsistent runtime measurements now lead to a warning and no longer to a crash.
This is the last WhatsHap release to support Python 3.7.
v1.7 (2022-12-01)
#379: Added the ability to do polyploid phasing with pedigree information. This is implemented in a new polyphasegenetic subcommand.
#143: whatshap stats now outputs the fraction of heterozygous variants that are phased.
#410: haplotag gained support for tagging data with ploidy greater than two (use option --ploidy).
#400: Fixed artificial inflation of block length stats in whatshap stats.
#418: Fixed problem in stats where NaN values caused ValueError.
#416: Clarified in the docs what stats considers as “phased”.
#207: Enable comma-separated chromosomes as argument to whatshap stats.
#412: Changed stats to compute all length statistics on split blocks.
#399: Formatted stats output so that long values are right-aligned with all other values.
v1.6 (2022-09-06)
#384: Fixed how interleaved phase blocks in whatshap stats are split when computing NG50 values. This allows NG50 values to be larger than before. Thanks to @pontushojer.
#385: Speed up whatshap stats when used with --chromosomes by not reading in the entire VCF. Thanks to @pontushojer.
#387: whatshap haplotag got some optimizations and is now about 20% faster. Thanks to @pontushojer.
#397: Fixed whatshap haplotag to include reads not assigned to a contig (unmapped) in the output (unless the --region option is used).
v1.5 (2022-08-23)
Providing a reference FASTA (with --reference or -r) is now mandatory even for whatshap haplotag. It was already mandatory for whatshap phase. In both cases, this is to prevent accidentally getting bad results because allele detection through realignment (which usually performs better) is only possible if a reference is provided. Use --no-reference explicitly to fall back to the less accurate algorithm.
#394: Fixed whatshap phase option --recombination-list not working.
#371: whatshap split crashed when attempting to split reads in a FASTQ file by haplotype.
#377: Speed-up of about 20-30% for whatshap polyphase via some optimizations in the read clustering algorithm.
Removed the deprecated --pigz option for whatshap split.
v1.4 (2022-04-07)
#362: whatshap polyphase received extensive algorithmic updates. The compatibility with different data sets (species and sequencing technology) has been improved. The wall-clock time has been reduced by about 20-30%, depending on the input data.
v1.3 (2022-03-11)
#353: Fix incorrect HS tags in whatshap polyphase.
#356: Fixed crash when reading VCF variants without GT fields (happens in GVCFs).
#352: whatshap haplotag has gained option --output-threads for setting the number of compression threads, significantly reducing wall-clock time. Also, if output is sent to a pipe, uncompressed BAM is written. Thanks to @cjw85.
v1.2 (2021-12-08)
#208: Fix phase --merge-reads. This option has never worked correctly and just led to whatshap phase taking a very long time and in some cases even crashing. With the fix, the option should work as intended, but we have not evaluated how much it improves phasing results.
#337: Add --skip-missing-contigs option to whatshap haplotag.
#335: Add option --ignore-sample-name to whatshap compare (thanks to Pontus Höjer).
#342: Fix whatshap compare crashing on VCFs with genotypes with an unknown allele (where GT is 1|. or similar).
#343: whatshap stats now reads the chromosome lengths (for N50 computation) from the VCF header, no need to use --chr-lengths.
v1.1 (2021-04-08)
#223: Fix haplotag --ignore-linked-reads not working.
#241: Fix some polyphase problems.
#249: Fix crash in the haplotag command on reading a VCF with the PS tag set to "." (a single dot).
#251: Allow haplotag to correctly write to standard output.
#207: Allow multiple --chromosome arguments to stats.
The file created with --output-read-list was not correctly tab-separated.
#248: Remove phase --full-genotyping option. Instead, use whatshap genotype followed by whatshap phase.
#289: Fix parsing of GVCFs (with dots in the ALT column).
#265: polyphase can now work in parallel.
v1.0 (2020-06-24)
WhatsHap has not seen a release in over a year although development has continued. To make up for it, we decided to leave ZeroVer behind and set the version number to 1.0.
WhatsHap has gained initial support for phasing polyploid samples! While this feature may not be quite ready for production use, we encourage you to test it by using the whatshap polyphase subcommand and to report any issues you find back to us. See also the pre-print at <https://doi.org/10.1101/2020.02.04.933523> for details.
#51: Reading and writing VCF files is now significantly faster because we switched to a different library for that task (pysam.VariantFile).
The switch to pysam.VariantFile also makes WhatsHap stricter in which VCF files it accepts. We have tried to give sensible error messages in these cases, but please report any remaining issues.
.bcf files can now be read and written.
#110: .vcf.gz output files are now compressed with bgzip so that they can be indexed with tabix.
Providing an indexed reference FASTA is now mandatory (with -r or --reference). It is possible to bypass this by using --no-reference, but that will disable realignment and therefore give worse phasing results on error-prone reads (PacBio, Nanopore).
#187: Implemented a --regions option for the haplotag subcommand.
Implemented a --discard-unknown-reads option for the split subcommand. Reads that are in the input reads file (BAM/FASTQ), but are not listed in the haplotag file will be discarded (by default, they are part of the “untagged” output).
Fixed #215: The split subcommand can now process .bam files lacking the sequence field for some/all reads.
The minimum required Python version for WhatsHap is now 3.6.
v0.18 (2019-02-15)
Add option --plot-sum-of-blocksizes to whatshap compare.
Fix in whatshap stats: sometimes returned wrong N50 values if the end position of the last block of a chromosome was larger than the starting position of the first block of the next chromosome.
#173: The haplotag command should now be able to properly write CRAM files.
#177: Option --ignore-read-groups did not work when phased blocks (VCF) were provided as input.
#122: Add --ignore-read-groups and --samples options to haplotag.
Integration of the HapChat algorithm as an alternative MEC solver, available through whatshap phase --algorithm=hapchat. Contributed by the HapChat team, see https://doi.org/10.1186/s12859-018-2253-8.
This is the last release of WhatsHap to support Python 3.4.
v0.17 (2018-07-20)
#140: Haplotagging now works when chromosomes are missing in the VCF.
Added option --merge-reads, which is helpful for high coverage data.
When phasing pedigrees, ensure that haplotypes are ordered as paternal_allele|maternal_allele in the output VCF. This seems to be a common convention and is also used by 1000G.
Test cases now use pytest instead of nose (which is discontinued).
v0.16 (2018-05-22)
#167: Fix the haplotag command. It would tag reads incorrectly.
#154: Use barcode information in BX tags when running haplotag on 10x Genomics linked read data.
#153: Allow combination of --ped and --samples to only work on a subset of samples in a pedigree. Added --use-ped-samples to only phase samples mentioned in PED file (while ignoring other samples in input VCF).
v0.15 (2018-04-07)
New subcommand genotype for haplotype-aware genotyping (see https://doi.org/10.1101/293944 for details on the method).
Support CRAM files in addition to BAM.
#133: No longer create BAM/CRAM index if it does not exist. This is safer when running multiple WhatsHap instances in parallel. From now on, you need to create the index yourself (for example with samtools index) before running WhatsHap.
#152: Reads marked as “duplicate” in the input BAM/CRAM file are now ignored.
#157: Adapt to changed interface in Pysam 0.14.
#158: Handle read groups with missing sample (SM) tag correctly.
v0.14.1 (2017-07-07)
Fix compilation problem by distinguishing gcc and clang.
v0.14 (2017-07-06)
Added --full-genotyping to (re-)genotype the given variants based on the reads.
Added option whatshap compare --switch-error-bed to write BED file with switch error positions.
Added whatshap compare --plot-blocksizes to plot histograms of block sizes.
Added option --longest-block-tsv to output position-wise stats on longest joint haplotype block.
Added option whatshap compare --tsv-multiway to write results of multi-way comparison to tab-separated file.
Added option --chromosome to whatshap stats.
whatshap compare can now compute the block-wise Hamming distance.
whatshap stats can now compute an N50 for the phased blocks.
Fixed compilation issues on OS X (clang).
Detect unsorted VCFs and chromosome name mismatches between BAM and VCF.
Fix crash when whatshap compare encounters unphased VCFs.
Expanded documentation.
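For readers unfamiliar with the N50 statistic mentioned above, the following is a minimal sketch of the textbook definition applied to phased block lengths. It is an illustration only: the function name `n50` and the example lengths are hypothetical, and the code is not taken from whatshap stats, whose exact block definition may differ.

```python
def n50(block_lengths):
    """Textbook N50: the length L such that blocks of length >= L
    cover at least half of the total length of all blocks.

    Hypothetical helper for illustration; not WhatsHap's implementation.
    """
    lengths = sorted(block_lengths, reverse=True)
    half_total = sum(lengths) / 2
    covered = 0
    for length in lengths:
        covered += length
        if covered >= half_total:
            return length
    return 0  # no blocks given


# Made-up block lengths in base pairs
example_blocks = [100, 80, 50, 30, 20]
result = n50(example_blocks)
print(result)
```

With these made-up lengths the total is 280, so the N50 is the length of the block at which the running sum first reaches 140.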
v0.13 (2016-10-27)
Use PS tag instead of HP tag by default to store phasing information. This applies to the phase and hapcut2vcf subcommands. PS is also used by other tools and is standard according to the VCF specification.
Incorporated genotype likelihoods into our phasing framework. On request (by using option --distrust-genotypes), genotypes can now be changed at a cost corresponding to their input genotype likelihoods. The changed genotypes are written to the output VCF. The behavior of --distrust-genotypes can be fine-tuned by the added options --include-homozygous, --default-gq, --gl-regularizer, and --changed-genotype-list.
Correctly handle cases when processing VCFs with two or more disjoint families.
v0.12 (2016-07-01)
Speed up allele detection
Add an unphase subcommand which removes all phasing from a VCF file (HP and PS tags, pipe notation).
Add option --tag= to the phase subcommand, which allows choosing whether ReadBackedPhasing-compatible HP tags or standard PS tags are used to describe phasing in the output VCF.
Manage versions with versioneer. This means that whatshap --version and the program version in the VCF header will include the Git commit hash, such as whatshap 0.11+50.g1b7af7a.
Add subcommand “haplotag” to tag reads in a BAM file with their haplotype.
Fix a bug where re-alignment around variants at the very end of a chromosome would lead to an AssertionError.
v0.11 (2016-06-09)
When phasing a pedigree, blocks that are not connected by reads but can be phased based on genotypes will be connected by default. This behavior can be turned off using option --no-genetic-haplotyping.
Implemented allele detection through re-alignment: To detect which allele of a variant is seen in a read, the query is aligned to the two haplotypes at that position. This results in better quality phasing, especially for low-quality reads (PacBio). Enabled if --reference is provided. Current limitation: No score for the allele is computed.
As a side-effect of the new allele detection, we can now also phase insertions, deletions, MNPs and “complex” variants.
Added option --chromosome to only work on specified chromosomes.
Use constant recombination rate by default, which allows using --ped without using --genmap.
whatshap has become a command with subcommands. From now on, you need to run whatshap phase to phase VCFs.
Add a stats subcommand that prints statistics about phased VCFs.
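The idea behind allele detection through re-alignment, as described in this release, can be illustrated with a minimal sketch. It assumes plain unit-cost edit distance as the alignment score and short hand-made sequences. The names `edit_distance` and `detect_allele` are hypothetical and do not reflect WhatsHap's actual implementation or API.

```python
def edit_distance(a, b):
    """Unit-cost edit (Levenshtein) distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion from a
                cur[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def detect_allele(read_segment, ref_haplotype, alt_haplotype):
    """Return 0 if the read fits the reference haplotype better,
    1 if it fits the alternative haplotype better, None on a tie.

    Hypothetical helper: both haplotypes are the local reference
    window with the respective allele inserted.
    """
    d_ref = edit_distance(read_segment, ref_haplotype)
    d_alt = edit_distance(read_segment, alt_haplotype)
    if d_ref < d_alt:
        return 0
    if d_alt < d_ref:
        return 1
    return None


# Made-up window around an A>T SNV at the fifth base
ref_hap = "ACGTACGT"
alt_hap = "ACGTTCGT"
noisy_read = "ACGTTCGA"  # carries the T allele plus one sequencing error
allele = detect_allele(noisy_read, ref_hap, alt_hap)
print(allele)
```

Because whole windows are compared rather than single bases, the same scheme also applies to insertions, deletions and MNPs, where the two haplotypes differ in length or in several positions.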
v0.10 (2016-04-27)
Use --ped to phase pedigrees with the PedMEC algorithm
Phase all samples in a multi-sample VCF
Drop support for Python 3.2 - we require at least Python 3.3 now
v0.9 (2016-01-05)
This is the first release available via PyPI (and that can therefore be installed via pip install whatshap).
January 2016
Trio phasing implemented in a branch
September 2015
pWhatsHap implemented (in a branch)
April 2015
Create haplotype-specific BAM files
February 2015
Smart read selection
January 2015
Ability to read multiple BAM files and merge them on the fly
December 2014
Logo
Unit tests
November 2014
Cython wrapper for C++ code done
Ability to write a phased VCF (using HP tags).
June 2014
Repository for WhatsHap refactoring created
April 2014
The WhatsHap algorithm is introduced at RECOMB