_MPII Home Page_ 

CMB

Up to: Research Units Building 46.1

Algorithms and Complexity Group
Programming Logics Group

Biochemical ALgorithms Library
Sequence Alignment
Molecular Dynamics
Protein-Protein Docking

(Structural) Sequence Alignment



Investigators: Hans Peter Lenhof, Kurt Mehlhorn, Petra Mutzel and Knut Reinert

Alignment algorithms are of central importance for the interpretation of protein and nucleic sequences. The comparison of a set of sequences yields important insights about parts that are functionally and/or structurally related. Below you see a picture of a multiple alignment of some acetylglucosamin binding proteins and the tertiary structure of one representative, the hevein. The yellow regions in the alignment most probably build a similar tertiary structure. It is eight cysteins that build four sulfur bonds. To know more about our approach click  here.


 

Traditional sequence alignment methods are important tools for detecting homologies in different molecules and work quite well for protein sequences. However, such primary sequence based methods are not well suited for analyzing RNA.
An RNA molecule, unlike DNA,is a generally single-stranded nucleic acid molecule that folds in space due to the formation of hydrogen bonds between its bases. Traditional sequence alignment methods can only account for the primary structure and thus ignore aspects of the secondary structure. Most functional RNAs appear to be selected more for conservation of a particular secondary structure than for conservation of primary structure. RNA sequence analysis therefore must work with this pattern of correlation in addition to primary sequence conservation.
The solution of this so called RNA Sequence Structure Alignment (RSA) problem is an alignment that optimizes simultaneously the structure and sequence consensus of the sequences under consideration. Below you the the primary (sequence), secondary and tertiary structure of a transfer-RNA.  To know more about our approach click  here


 
 

The Maximum Weight Trace problem

In our mwork we looked at a special version of the multiple sequence alignment problem, the so called Maximum Weight Trace (MWT) problem and present a formulation as an integer linear program. Based on that formulation we present a branch-and-cut algorithm, which can solve problem instances to optimality, the size of which was not tractable for exact algorithms.

In the Complete Maximum Weight Trace formulation (CMWT) the letters of the strings under consideration
are viewed as vertices V in a complete k-partite graph G. Every edge e in G has a non-negative weight representing
the gain of aligning the endpoints of the edge. We say that an alignment realizes an edge if it places the endpoints into the same column of the alignment array. The set of edges realized by an alignment  is called the trace  of  the alignment. and the weight
of an alignment is the sum of the weights of the edges that it realizes. The goal is to compute an alignment of maximum weight. The CMWT formulation is quite general, containing the well-studied Sum of Pairs Alignment problem with uniform insertion and deletion costs as a special case. It has the computational drawback that the number of edges in G is quadratic in the length of the sequences which becomes large quickly as the number of sequences or their length grows.

The (Sparse) Maximum Weight Trace formulation (MWT) is a sparsification of CMWT. It restricts consideration to a judiciously chosen subgraph of the complete k-partite graph. We use G = (V,E) to denote this graph and call it the alignment graph. In our computational experiments we put all edges into E that are used in some nearly optimal pairwise alignment. Alternatively, the edges of the alignment graph may be obtained from pairwise alignment under arbitrarily complex objective functions. They can encode suboptimal pairwise alignments, and may even represent correspondences from 3-d alignments of protein structures. The hope is that MWT is computationally simpler than CMWT (since the input graph is smaller) and yet has a biologically meaningful optimal solution (as E is carefully chosen). MWT was introduced in 1991 by Kececioglu originally as a model for the final alignment phase of DNA sequence assembly and is unique as a graph-theoretic formulation of multiple alignment.

Like most multiple alignment problems, Maximum Weight Trace can be solved by dynamic-programming, and is equivalent to finding a longest-path from a designated source to a designated sink in a k-dimensional acyclic mesh-shaped graph, the so-called dynamic programming graph. The set of paths from the source to the sink code all possible alignments. Each (directed) edge of the dynamic programming graph represents a possible column. The weight of such an edge is the sum of the weights of the edges e in E (of the input graph) that are realized by the corresponding column. Dynamic programming yields an algorithm with time complexity  O(k^2 2^k N) and space complexity O(N), where N is the product of the sequence lengths, which is feasible only for very small problem instances. While MWT is NP-complete, Kececioglu presented a branch-and-bound algorithm whose implementation could optimally align six sequences of length 250 in a few minutes.
Larger examples, however, required excessive space. In his approach, a heuristic alignment of the k sequences yields a lower bound for the branch-and-bound procedure. Upper bounds are calculated by adding up the weights of all MWT-optimal pairwise alignments over suffixes of the sequences. On instances that can be solved, the set of subproblems is kept manageable by an elaborate procedure for branching on a small set of columns, and by pruning subproblems via bounds.

In our work we formulate Maximum Weight Trace as an integer linear program. Define the trace polytope of G as the convex hull of all incidence vectors of traces in G. The polyhedral combinatorics to integer linear programming calls for a description of the trace polytope by linear inequalities (also called constraints) after which linear programming can be used to find the Maximum Weight Trace.

The most effective constraints are the ones that define facets of the trace polytope, since these are the constraints that appear in an irredundant description of it by linear inequalities. Due to the NP-hardness of our problem, we cannot expect to be able to find a full description of the facial structure of the trace polytope by linear inequalities. Nevertheless, a partial description of the facial structure of the trace polytope by linear inequalities is useful for the design of a branch-and-cut algorithm. We give a graph-theoretic characterization  of traces which we use to give a complete description of the trace polytope for the two-sequence case, i.e., we give the complete set of the linear inequalities and polynomial time separation algorithms for them (a separation algorithm for a class of inequalities takes a multi-dimensional point  and returns a violated inequality from the class if there is one).  Our complete description implies the existence of a polynomial-time, non-dynamic-programming algorithm for pairwise sequence alignment and thus gives another answer to a longstanding problem in stringology. The inequalities naturally fall into two classes which we call trivial and clique inequalities respectively. Furthermore we give a partial description of the trace polytope for the multiple-sequence case. More specifically, we show that all facet-defining inequalities of the two-sequence case lift to the multiple-sequence case and we derive three more classes of facet-defining inequalities (which we call mixed-cycle inequalities, chorded-mixed-cycle inequalities and ladder inequalities respectively  Furthermore we derive polynomial time separation algorithms for all but the last class of inequalities  and we outline a branch-and-cut algorithm.
 
 

The full extended abstract can be found:

  Knut Reinert, Hans-Peter Lenhof, Petra Mutzel, Kurt Mehlhorn, and John Kececioglu. A Branch-And-Cut algorithm  for multiple sequence alignment. Proc. of the First Annual International Conference on Computational Molecular Biology  RECOMB 97, 241--250, 1997.
 
 

The RNA Secondary Structure Problem

With the prediction of tRNA structure from a set of similar sequences, Levitt had strikingly demonstrated that sets of similar sequences can yield convincing evidence for how an RNA molecule folds. The computational problem of considering sequence and structure of an RNA molecule simultaneously was first addressed by Sankoff who proposed a dynamic programming algorithm that aligns a set of RNA sequences while at the same time predicting their common fold. Algorithms similar in spirit were proposed later on for the problem of comparing one RNA sequence to one or more of known structure. Corpet and Michot align simultaneously a
sequence with a number of other sequences using both primary and secondary structures. Their dynamic programming algorithm requires O(n^5) running time and O(n^4) space (n is the length of the sequences) and thus can handle only short sequences.
Corpet and Michot propose an anchor-point heuristic to divide large alignment problems by fixed alignment regions into small subproblems that the dynamic programming algorithm can then be applied to. Bafna et al. improved the dynamic programming algorithm to a running time of O(n^4) which still does not make it applicable to real-life problems.
Common motifs among several sequences are searched by Waterman. Eddy and Durbin describe probabilistic models for measuring the secondary structure and primary sequence consensus of RNA sequence families. They present algorithms for analyzing and comparing RNA sequences as well as database search techniques. Since the basic operation in their approach is an expensive dynamic programming algorithm, their algorithms cannot analyze sequences longer than 150--200 nucleotides.
Recently, Gorodkin et al. published a simplified version of Sankoff's original algorithm.

In our work we present a branch-and-cut algorithm for aligning an RNA molecule S1 of known sequence and known structure to an RNA molecule S2 of known sequence but unknown structure. The algorithm computes an (optimal) alignment that maximizes sequence and structure consensus simultaneously. To be more precise, the score that is optimized is a weighted sum of a sequence alignment score and a base pairing score which measures the quantity and quality of the base pairs of S1 and S2 that are ``matched'' by the sequence alignment. We call this problem the RNA Sequence Structure Alignment (RSA) problem.
Note that we do not restrict base pairs to the ones that can be realized in a secondary structure but allow for the inclusion of pseudoknots.

The RSA problem is the central problem of the different RNA similarity or structure prediction problems defined in the above cited papers. If the two molecules are functionally related and have a similar structure the RSA alignment allows, for example, to draw conclusions about the structure of molecule S2. The RSA techniques we will put forward can be easily modified in such a way that incremental or simultaneous computations of multiple structural sequence alignments are possible. Since the older approaches cannot solve middle sized or large instances of the RSA problem without using ad hoc heuristics, biologists still have to carry out structural alignments of sequences of such size by hand. Furthermore, the known algorithms are not able to integrate tertiary structure interactions like pseudoknots. Bafna et al. claim that ``their algorithm can be extended to structures that allow crossing edges'', but they do not mention how this can be done and whether the complexity of the algorithm changes.

We tested whether methods from polyhedral combinatorics can be used to solve large instances of the RSA problem. Such methods have recently been successfully applied to problems like physical mapping or multiple sequence alignment.  We rephrased the RSA problem as an integer linear program and studied the facets of the convex hull of all integral solutions. Based on these studies we implemented a first version of a branch-and-cut algorithm coded in C++. As an example we tested two 23S ribosomal RNA sequences. Each of  these sequences is more than 1400 bases long. A ``hand-made'' structural alignment for these two sequences was taken from the Antwerpen rRNA database. Our algorithm was able to solve this large instance of the problem in reasonable time to optimality. The produced structural alignment is a very good approximation of the ``hand-made'' optimal structural alignment and is, in particular, much better than the purely sequence-based optimal alignment. Although not documented in this example, the branch-and-cut algorithm can also handle pseudoknots without any change in the algorithm.
 

The full extended abstract can be found in:
 
   Hans-Peter Lenhof, Knut Reinert and Martin Vingron. A Polyhedral Approach to RNA Sequence Structure  Alignment. To appear in Proc. on the Second Annual International Conference on Computational Molecular Biology  RECOMB 98, 1998.