Changes

v2.2 (2024-01-26)

  • #496: Fixed a segmentation fault in polyphase.

  • #498: Fixed a numeric overflow in the scoring phase of polyphase. It could occur for variants with extremely high coverages (i.e. >200X).

  • #472: Fixed various warnings and assertion violations when running polyphase.

  • #214: Added support for ploidies greater than two to whatshap split.

  • Added another algorithm for diploid phasing, which is a heuristic version of the default algorithm. Since it has not been tested extensively, we recommend the old algorithm for production use, especially for pedigree phasing. The main benefit is support for higher coverages and/or larger pedigrees, at the cost of no longer solving the underlying MEC model to optimality. The heuristic is accessible via the parameter --algorithm=heuristic.

v2.1 (2023-10-17)

  • We added k-merald, a new method for allele detection based on k-mer alignment. Instead of using a fixed cost value, k-merald derives k-mer mismatch penalties using the error profiles generated by whatshap learn. k-merald is available as an alternative to the edit-distance-based allele detection.

  • WhatsHap can now be used to generate sequencing error profiles for a specific technology using whatshap learn.

  • #470: Avoid ZeroDivisionError in whatshap stats when there are no heterozygous or no phased variants.

  • #485: Fixed a KeyError: ‘parse_vcf’ in whatshap polyphase when a full chromosome is skipped.

v2.0 (2023-06-30)

  • #346: Phasing of indels (and other non-SNVs) is now enabled by default. This previously required specifying the --indels option, which not all users knew about, so they unnecessarily got suboptimal phasing results. The option is now ignored and leads to a warning. An --only-snvs option was added that restores the old behavior. This change applies to the following subcommands: phase, haplotype, polyphase, polyphasegenetic.

    Since this is a backwards incompatible change (when not using --indels already), the major version has been increased.

  • #425: Haplotagging CRAM files should now work in more cases with haplotag.

  • #427: polyphase did not phase indels, even if explicitly told.

  • #432: polyphase can use existing phasing information in VCF when using the --use-prephasing flag. Still very experimental.

  • #439: polyphasegenetic now handles pedigree information more robustly and properly detects available ILP solvers.

  • #449: Fixed runtime issues for ploidies above 4, if no pre-phasing is used.

  • #450: polyphase now supports multi-allelic variants.

  • #457: haplotag now also tags alignments marked as duplicate.

  • #466: Inconsistent runtime measurements now lead to a warning and no longer to a crash.

  • This is the last WhatsHap release to support Python 3.7.

v1.7 (2022-12-01)

  • #379: Added the ability to do polyploid phasing with pedigree information. This is implemented in a new polyphasegenetic subcommand.

  • #143: whatshap stats now outputs the fraction of heterozygous variants that are phased.

  • #410: haplotag gained support for tagging data with ploidy greater than two (use option --ploidy).

  • #400: Fixed artificial overinflation of block length stats in whatshap stats.

  • #418: Fixed a problem in stats where NaN values caused a ValueError.

  • #416: Clarified in the docs what stats considers as “phased”.

  • #207: Enable comma-separated chromosomes as argument to whatshap stats.

  • #412: Changed stats to compute all length statistics on split blocks

  • #399: Formatted stats output so that long values are right-aligned with all other values.

v1.6 (2022-09-06)

  • #384: Fixed how interleaved phase blocks in whatshap stats are split when computing NG50 values. This allows NG50 values to be larger than before. Thanks to @pontushojer.

  • #385: Speed up whatshap stats when used with --chromosomes by no longer reading in the entire VCF. Thanks to @pontushojer.

  • #387: whatshap haplotag got some optimizations and is now about 20% faster. Thanks to @pontushojer.

  • #397: Fixed whatshap haplotag to include reads not assigned to a contig (unmapped) in the output (unless the --region option is used).
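
For readers unfamiliar with the NG50 and N50 statistics mentioned in these notes, the following is a minimal sketch of the general textbook definitions. It is not WhatsHap's implementation: the function names are hypothetical, and the handling of interleaved phase blocks described in #384 is not modeled here.

```python
def ng50(block_lengths, genome_length):
    """Return the NG50 of the given block lengths.

    NG50 is the largest length L such that blocks of length >= L together
    cover at least half of genome_length. Returns None if the blocks cover
    less than half of the genome.
    """
    target = genome_length / 2
    covered = 0
    # Walk through the blocks from longest to shortest.
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return None


def n50(block_lengths):
    """Return the N50: like NG50, but relative to the total block length."""
    return ng50(block_lengths, sum(block_lengths))
```

Because NG50 is measured against the genome (or chromosome) length rather than the total phased length, it is undefined when the blocks cover less than half of that length; the sketch returns None in that case.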

v1.5 (2022-08-23)

  • Providing a reference FASTA (with --reference or -r) is now mandatory even for whatshap haplotag. It was already mandatory for whatshap phase. In both cases, this is to prevent accidentally getting bad results because allele detection through realignment (which usually performs better) is only possible if a reference is provided. Use --no-reference explicitly to fall back to the less accurate algorithm.

  • #394: Fixed whatshap phase option --recombination-list not working.

  • #371: whatshap split crashed when attempting to split reads in a FASTQ file by haplotype.

  • #377: Speed-up of about 20-30% for whatshap polyphase via some optimizations in the read clustering algorithm.

  • Removed the deprecated --pigz option for whatshap split

v1.4 (2022-04-07)

  • #362: whatshap polyphase received extensive algorithmic updates. The compatibility with different data sets (species and sequencing technology) has been improved. The wall-clock time has been reduced by about 20-30%, depending on the input data.

v1.3 (2022-03-11)

  • #353: Fix incorrect HS tags in whatshap polyphase

  • #356: Fixed crash when reading VCF variants without GT fields (happens in GVCFs).

  • #352: whatshap haplotag has gained option --output-threads for setting the number of compression threads, significantly reducing wall-clock time. Also, if output is sent to a pipe, uncompressed BAM is written. Thanks to @cjw85.

v1.2 (2021-12-08)

  • #208: Fix phase --merge-reads. This option has never worked correctly and just led to whatshap phase taking a very long time and in some cases even crashing. With the fix, the option should work as intended, but we have not evaluated how much it improves phasing results.

  • #337: Add --skip-missing-contigs option to whatshap haplotag

  • #335: Add option --ignore-sample-name to whatshap compare (thanks to Pontus Höjer)

  • #342: Fix whatshap compare crashing on VCFs with genotypes with an unknown allele (where GT is 1|. or similar).

  • #343: whatshap stats now reads the chromosome lengths (for N50 computation) from the VCF header, no need to use --chr-lengths.

v1.1 (2021-04-08)

  • #223: Fix haplotag --ignore-linked-reads not working

  • #241: Fix some polyphase problems.

  • #249: Fix crash in the haplotag command on reading a VCF with the PS tag set to “.”.

  • #251: Allow haplotag to correctly write to standard output.

  • #207: Allow multiple --chromosome arguments to stats.

  • The file created with --output-read-list was not correctly tab-separated.

  • #248: Remove phase --full-genotyping option. Instead, use whatshap genotype followed by whatshap phase.

  • #289: Fix parsing of GVCFs (with dots in the ALT column)

  • #265: polyphase can now work in parallel

v1.0 (2020-06-24)

WhatsHap has not seen a release in over a year although development has continued. To make up for it, we decided to leave ZeroVer behind and set the version number to 1.0.

  • WhatsHap has gained initial support for phasing polyploid samples! While this feature may not be quite ready for production use, we encourage you to test it by using the whatshap polyphase subcommand and to report any issues you find back to us. See also the pre-print at <https://doi.org/10.1101/2020.02.04.933523> for details.

  • #51: Reading and writing VCF files is now significantly faster because we switched to a different library for that task (pysam.VariantFile).

  • The switch to pysam.VariantFile also makes WhatsHap stricter in which VCF files it accepts. We have tried to give sensible error messages in these cases, but please report any remaining issues.

  • .bcf files can now be read and written.

  • #110: .vcf.gz output files are now compressed with bgzip so that they can be indexed with tabix.

  • Providing an indexed reference FASTA is now mandatory (with -r or --reference). It is possible to bypass this by using --no-reference, but that will disable realignment and therefore give worse phasing results on error-prone reads (PacBio, Nanopore).

  • #187: Implemented a --regions option for the haplotag subcommand.

  • Implemented a --discard-unknown-reads option for the split subcommand. Reads that are in the input reads file (BAM/FASTQ), but are not listed in the haplotag file will be discarded (by default, they are part of the “untagged” output).

  • #215: The split subcommand can now process .bam files lacking the sequence field for some/all reads.

  • The minimum required Python version for WhatsHap is now 3.6.

v0.18 (2019-02-15)

  • Add option --plot-sum-of-blocksizes to whatshap compare.

  • Fix in whatshap stats: sometimes returned wrong N50 values if the end position of the last block of a chromosome was larger than the starting position of the first block of the next chromosome.

  • #173: The haplotag command should now be able to properly write CRAM files.

  • #177: Option --ignore-read-groups did not work when phased blocks (VCF) were provided as input.

  • #122: Add --ignore-read-groups and --samples options to haplotag.

  • Integration of the HapChat algorithm as an alternative MEC solver, available through whatshap phase --algorithm=hapchat. Contributed by the HapChat team, see https://doi.org/10.1186/s12859-018-2253-8.

  • This is the last release of WhatsHap to support Python 3.4.

v0.17 (2018-07-20)

  • #140: Haplotagging now works when chromosomes are missing in the VCF.

  • Added option --merge-reads, which is helpful for high coverage data.

  • When phasing pedigrees, ensure that haplotypes are ordered as paternal_allele|maternal_allele in the output VCF. This seems to be a common convention and also used by 1000G.

  • Test cases now use pytest instead of nose (which is discontinued).

v0.16 (2018-05-22)

  • #167: Fix the haplotag command. It would tag reads incorrectly.

  • #154: Use barcode information in BX tags when running haplotag on 10x Genomics linked read data.

  • #153: Allow combination of --ped and --samples to only work on a subset of samples in a pedigree. Added --use-ped-samples to only phase samples mentioned in PED file (while ignoring other samples in input VCF).

v0.15 (2018-04-07)

  • New subcommand genotype for haplotype-aware genotyping (see https://doi.org/10.1101/293944 for details on the method).

  • Support CRAM files in addition to BAM.

  • #133: No longer create BAM/CRAM index if it does not exist. This is safer when running multiple WhatsHap instances in parallel. From now on, you need to create the index yourself (for example with samtools index) before running WhatsHap.

  • #152: Reads marked as “duplicate” in the input BAM/CRAM file are now ignored.

  • #157: Adapt to changed interface in Pysam 0.14.

  • #158: Handle read groups with missing sample (SM) tag correctly.

v0.14.1 (2017-07-07)

  • Fix compilation problem by distinguishing gcc and clang.

v0.14 (2017-07-06)

  • Added --full-genotyping to (re-)genotype the given variants based on the reads

  • Added option whatshap compare --switch-error-bed to write BED file with switch error positions

  • Added whatshap compare --plot-blocksizes to plot histograms of block sizes

  • Added option --longest-block-tsv to output position-wise stats on longest joint haplotype block

  • Added option whatshap compare --tsv-multiway to write results of multi-way comparison to tab-separated file

  • Added option --chromosome to whatshap stats

  • whatshap compare can now compute the block-wise Hamming distance

  • whatshap stats can now compute an N50 for the phased blocks

  • Fixed compilation issues on OS X (clang)

  • Detect unsorted VCFs and chromosome name mismatches between BAM and VCF

  • Fix crash when whatshap compare encounters unphased VCFs

  • Expanded documentation.
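
As an illustration of the block-wise Hamming distance mentioned above, here is a minimal sketch of the usual definition for a single diploid block. The function name and the string encoding are assumptions made for this example; this is not the code used by whatshap compare.

```python
def blockwise_hamming(phasing_a, phasing_b):
    """Hamming distance between two phasings of the same block.

    Each phasing is a string of '0'/'1' characters giving the allele on
    the first haplotype at every heterozygous variant of the block.
    Which haplotype is listed first is arbitrary, so the distance is
    taken as the minimum over both assignments.
    """
    if len(phasing_a) != len(phasing_b):
        raise ValueError("both phasings must cover the same variants")
    mismatches = sum(a != b for a, b in zip(phasing_a, phasing_b))
    # Swapping the two haplotypes of one phasing turns every match into
    # a mismatch and vice versa.
    return min(mismatches, len(phasing_a) - mismatches)
```

For example, "0101" and "1010" describe the same pair of haplotypes in opposite order, so their distance under this definition is zero.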

v0.13 (2016-10-27)

  • Use PS tag instead of HP tag by default to store phasing information. This applies to the phase and hapcut2vcf subcommands. PS is also used by other tools and standard according to the VCF specification.

  • Incorporated genotype likelihoods into our phasing framework. On request (by using option --distrust-genotypes), genotypes can now be changed at a cost corresponding to their input genotype likelihoods. The changed genotypes are written to the output VCF. The behavior of --distrust-genotypes can be fine-tuned by the added options --include-homozygous, --default-gq, --gl-regularizer, and --changed-genotype-list.

  • Correctly handle cases when processing VCFs with two or more disjoint families.

v0.12 (2016-07-01)

  • Speed up allele detection

  • Add an unphase subcommand which removes all phasing from a VCF file (HP and PS tags, pipe notation).

  • Add option --tag= to the phase subcommand, which allows choosing whether ReadBackedPhasing-compatible HP tags or standard PS tags are used to describe phasing in the output VCF.

  • Manage versions with versioneer. This means that whatshap --version and the program version in the VCF header will include the Git commit hash, such as whatshap 0.11+50.g1b7af7a.

  • Add subcommand “haplotag” to tag reads in a BAM file with their haplotype.

  • Fix a bug where re-alignment around variants at the very end of a chromosome would lead to an AssertionError.
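
To make concrete what "removing all phasing" (HP and PS tags, pipe notation) means at the level of a single VCF record, the following sketch rewrites the FORMAT column and the sample columns as plain strings. It is an illustration only, with a hypothetical function name, and not the implementation of the unphase subcommand; whether the real subcommand changes anything else, such as allele order, is not covered here.

```python
PHASING_KEYS = {"HP", "PS"}


def unphase_record_fields(format_field, sample_fields):
    """Drop HP/PS entries and turn phased genotypes into unphased ones.

    format_field is the FORMAT column (e.g. "GT:PS:DP") and sample_fields
    is a list with one string per sample (e.g. ["0|1:12345:30"]).
    Returns the new FORMAT string and the list of new sample strings.
    """
    keys = format_field.split(":")
    # Indices of the FORMAT entries that are kept.
    keep = [i for i, key in enumerate(keys) if key not in PHASING_KEYS]
    new_samples = []
    for sample in sample_fields:
        values = sample.split(":")
        new_values = []
        for i in keep:
            if i >= len(values):
                break  # VCF allows trailing sample fields to be omitted
            value = values[i]
            if keys[i] == "GT":
                value = value.replace("|", "/")  # pipe notation -> unphased
            new_values.append(value)
        new_samples.append(":".join(new_values))
    return ":".join(keys[i] for i in keep), new_samples
```

Applied to FORMAT "GT:PS:DP" and the sample "0|1:12345:30", the sketch yields FORMAT "GT:DP" and the sample "0/1:30".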

v0.11 (2016-06-09)

  • When phasing a pedigree, blocks that are not connected by reads but can be phased based on genotypes will be connected per default. This behavior can be turned off using option --no-genetic-haplotyping.

  • Implemented allele detection through re-alignment: To detect which allele of a variant is seen in a read, the query is aligned to the two haplotypes at that position. This results in better quality phasing, especially for low-quality reads (PacBio). Enabled if --reference is provided. Current limitation: No score for the allele is computed.

  • As a side-effect of the new allele detection, we can now also phase insertions, deletions, MNPs and “complex” variants.

  • Added option --chromosome to only work on specified chromosomes.

  • Use a constant recombination rate by default, which allows using --ped without --genmap.

  • whatshap has become a command with subcommands. From now on, you need to run whatshap phase to phase VCFs.

  • Add a stats subcommand that prints statistics about phased VCFs.
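
The idea behind allele detection through re-alignment, as described above, can be sketched as follows: build the two candidate haplotype sequences around the variant, align the read segment to both, and pick the closer one. This is a simplified illustration under stated assumptions, not WhatsHap's code: the function names are hypothetical, a plain unit-cost edit distance stands in for the actual alignment, and the read segment is assumed to span exactly the window covered by the flanks.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                     # deletion from a
                current[j - 1] + 1,                  # insertion into a
                previous[j - 1] + (char_a != char_b),  # match or substitution
            ))
        previous = current
    return previous[-1]


def detect_allele(read_segment, left_flank, ref_allele, alt_allele, right_flank):
    """Return 0 (reference), 1 (alternative) or None if undecidable.

    The two haplotypes are the reference window with either the reference
    or the alternative allele placed between the flanks.
    """
    ref_haplotype = left_flank + ref_allele + right_flank
    alt_haplotype = left_flank + alt_allele + right_flank
    ref_cost = edit_distance(read_segment, ref_haplotype)
    alt_cost = edit_distance(read_segment, alt_haplotype)
    if ref_cost < alt_cost:
        return 0
    if alt_cost < ref_cost:
        return 1
    return None  # both haplotypes fit equally well
```

Because whole haplotype sequences are compared rather than a single base, the same approach applies to insertions, deletions and other non-SNV variants. For instance, with flanks "GG" and "CC", reference allele "AT" and alternative allele "A", the read segment "GGACC" is assigned allele 1.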

v0.10 (2016-04-27)

  • Use --ped to phase pedigrees with the PedMEC algorithm

  • Phase all samples in a multi-sample VCF

  • Drop support for Python 3.2 - we require at least Python 3.3 now

v0.9 (2016-01-05)

  • This is the first release available via PyPI (and that can therefore be installed via pip install whatshap)

January 2016

  • Trio phasing implemented in a branch

September 2015

  • pWhatsHap implemented (in a branch)

April 2015

  • Create haplotype-specific BAM files

February 2015

  • Smart read selection

January 2015

  • Ability to read multiple BAM files and merge them on the fly

December 2014

  • Logo

  • Unit tests

November 2014

  • Cython wrapper for C++ code done

  • Ability to write a phased VCF (using HP tags).

June 2014

  • Repository for WhatsHap refactoring created

April 2014

  • The WhatsHap algorithm is introduced at RECOMB