PhD Thesis: Parallel String Alignments #
In my Ph.D. thesis, I studied new approaches to using parallel computation in string comparison. These allow us to understand algorithms for computing the LCS on the word-RAM, the PRAM and the BSP models in a unified fashion.
Short Summary #
For my Ph.D. thesis I studied parallel algorithms for sequence comparison, specifically, for computing sequence alignments like this:
In biology, variants of such alignments are used for comparing DNA or protein sequences.
The following theoretical aspects of parallel string alignment are discussed:
- Dynamic programming and transposition network representation,
- sparsity, i.e. dealing with few or many character matches,
- parallel work-optimality and scalable communication.
I also developed an efficient method for computing local string alignments ("The Seaweed Code"), which is available at github.com/pkrusche/seaweeds.
My Ph.D. supervisor was Dr. Alexander Tiskin.
Full Text #
Here is the full thesis in PDF format.
Full Abstract #
String comparison is a fundamental component in many of today's applications of computing, including bioinformatics, signal processing, databases, internet search and many others. A specific type of practical string comparison problem is computing the longest common subsequence (LCS) of two input strings. The LCS problem is equivalent to computing edit distances and strongly connected to computing string alignments, which are of great importance for applications in computational biology. In this thesis, we show new approaches to using parallel computation in string comparison, which allow us to understand algorithms for computing the LCS on the word-RAM, the PRAM and the BSP models in a unified fashion. This approach is based on recent results on algorithms for semi-local string comparison. Semi-local string comparison is a straightforward extension of the standard LCS problem, in which we ask to compute the LCS for one string and all substrings of another string. We use a revised approach to analyzing BSP algorithms which includes the analysis of communication and I/O cost, as well as of the local memory requirements. We define asymptotic scalability in memory and communication. Scalable communication captures the input/output cost of partitioning a problem into subproblems, and achieving scalable memory allows to obtain small subproblems which can fit into processor caches. In this thesis, we show how to achieve scalable memory and communication for computing longest common subsequences, as well as for computing longest increasing subsequences. Furthermore, we present new algorithms for semi-local string comparison that are parameterized by the similarity of the input strings, and we discuss aspects of their practical implementation. Alongside with theoretical results, this thesis also comprises an algorithm engineering project, which has the goal of allowing practical applications to make use of the techniques of semi-local string comparison. For a particular problem of local string comparison in evolutionary biology, our methods have improved upon the fastest existing methods. We have obtained a speedup by a factor of more than 14 over the fastest existing implementation, running on a single processor, and achieved the potential to scale to hundreds of processors. This has greatly increased the feasible size of the input sequences that can be compared using our method, potentially allowing loss-free local comparison of entire genomes.