However, all above algorithms are based on arrays, which do not support dynamic updates. We believe that L-PBWT-Query represents a missing piece of the PBWT algorithms. Arguably, L-PBWT-Query makes PBWT search more practical as it returns all long enough matches rather than merely the best matching ones. This algorithm offers efficient long matches, a more practical target for genealogical search. introduced Linked Equal/Alternating Positions (LEAP) arrays, an additional data structure allowing direct jumping to boundaries of matching blocks. presented a new algorithm, L-PBWT-Query, that reports all long matches between an out-of-panel query against a constructed PBWT panel in time complexity linear to the length of the haplotypes and constant to the size of the panel. The original PBWT paper described an array version of the PBWT, and a set of basic algorithms: Algorithms 1 and 2 for construction, Algorithms 3 and 4 for reporting all versus all long matches and set maximal matches, and Algorithm 5 for reporting set maximal matches between an out-of-panel query against a constructed PBWT panel. This has produced methods that scale to biobank scale datasets. Indeed, PBWT has been applied to important tasks such as genotype imputation, identification of identical by descent (IBD) segments, and genealogical search. It offers efficient algorithms for matching haplotypes that approach theoretically optimal complexity.
In doing so, we systematically investigated variations of set maximal match and long match query algorithms: while they all have average case time complexity independent of database size, they have different worst case complexities, linear time complexity with the size of the genome, and dependency on additional data structures.ĭurbin’s positional Burrows-Wheeler transform (PBWT) is a scalable foundational data structure for modeling population haplotype sequences.
In addition, we verified that d-PBWT can support all algorithms of PBWT. We developed efficient algorithms for insertion and deletion of individual haplotypes. Here, we generalize the static PBWT to a dynamic data structure, d-PBWT, where the reverse prefix sorting at each position is represented by linked lists. However, the standard PBWT is an array-based static data structure and does not support dynamic updates of the panel. Once the PBWT of a haplotype panel is constructed, it supports efficient retrieval of all shared long segments among all individuals (long matches) and efficient query between an external haplotype and the panel. Durbin’s PBWT, a scalable data structure for haplotype matching, has been successfully applied to identical by descent (IBD) segment identification and genotype imputation.