![]() ![]()
CMB |
![]() |
|
||
Alignment algorithms are of central importance for the interpretation of protein and nucleic sequences. The comparison of a set of sequences yields important insights about parts that are functionally and/or structurally related. Below you see a picture of a multiple alignment of some acetylglucosamin binding proteins and the tertiary structure of one representative, the hevein. The yellow regions in the alignment most probably build a similar tertiary structure. It is eight cysteins that build four sulfur bonds. To know more about our approach click here.
Traditional sequence alignment methods are important tools for detecting
homologies in different molecules and work quite well for protein sequences.
However, such primary sequence based methods are not well suited for analyzing
RNA.
An RNA molecule, unlike DNA,is a generally single-stranded nucleic
acid molecule that folds in space due to the formation of hydrogen bonds
between its bases. Traditional sequence alignment methods can only account
for the primary structure and thus ignore aspects of the secondary structure.
Most functional RNAs appear to be selected more for conservation of a particular
secondary structure than for conservation of primary structure. RNA sequence
analysis therefore must work with this pattern of correlation in addition
to primary sequence conservation.
The solution of this so called RNA Sequence Structure Alignment
(RSA) problem is an alignment that optimizes simultaneously the structure
and sequence consensus of the sequences under consideration. Below you
the the primary (sequence), secondary and tertiary structure of a transfer-RNA.
To know more about our approach click here
In the Complete Maximum Weight Trace formulation (CMWT) the letters
of the strings under consideration
are viewed as vertices V in a complete k-partite graph G.
Every edge e in G has a non-negative weight representing
the gain of aligning the endpoints of the edge. We say that an alignment
realizes an edge if it places the endpoints into the same column of the
alignment array. The set of edges realized by an alignment is called
the trace of the alignment. and the weight
of an alignment is the sum of the weights of the edges that it realizes.
The goal is to compute an alignment of maximum weight. The CMWT formulation
is quite general, containing the well-studied Sum of Pairs Alignment problem
with uniform insertion and deletion costs as a special case. It has the
computational drawback that the number of edges in G is quadratic in the
length of the sequences which becomes large quickly as the number of sequences
or their length grows.
The (Sparse) Maximum Weight Trace formulation (MWT) is a sparsification of CMWT. It restricts consideration to a judiciously chosen subgraph of the complete k-partite graph. We use G = (V,E) to denote this graph and call it the alignment graph. In our computational experiments we put all edges into E that are used in some nearly optimal pairwise alignment. Alternatively, the edges of the alignment graph may be obtained from pairwise alignment under arbitrarily complex objective functions. They can encode suboptimal pairwise alignments, and may even represent correspondences from 3-d alignments of protein structures. The hope is that MWT is computationally simpler than CMWT (since the input graph is smaller) and yet has a biologically meaningful optimal solution (as E is carefully chosen). MWT was introduced in 1991 by Kececioglu originally as a model for the final alignment phase of DNA sequence assembly and is unique as a graph-theoretic formulation of multiple alignment.
Like most multiple alignment problems, Maximum Weight Trace can be solved
by dynamic-programming, and is equivalent to finding a longest-path from
a designated source to a designated sink in a k-dimensional acyclic mesh-shaped
graph, the so-called dynamic programming graph. The set of paths
from the source to the sink code all possible alignments. Each (directed)
edge of the dynamic programming graph represents a possible column. The
weight of such an edge is the sum of the weights of the edges e in
E (of the input graph) that are realized by the corresponding column.
Dynamic programming yields an algorithm with time complexity O(k^2
2^k N) and space complexity O(N), where N is the product
of the sequence lengths, which is feasible only for very small problem
instances. While MWT is NP-complete, Kececioglu presented a branch-and-bound
algorithm whose implementation could optimally align six sequences of length
250 in a few minutes.
Larger examples, however, required excessive space. In his approach,
a heuristic alignment of the k sequences yields a lower bound for
the branch-and-bound procedure. Upper bounds are calculated by adding up
the weights of all MWT-optimal pairwise alignments over suffixes of the
sequences. On instances that can be solved, the set of subproblems is kept
manageable by an elaborate procedure for branching on a small set of columns,
and by pruning subproblems via bounds.
In our work we formulate Maximum Weight Trace as an integer linear program. Define the trace polytope of G as the convex hull of all incidence vectors of traces in G. The polyhedral combinatorics to integer linear programming calls for a description of the trace polytope by linear inequalities (also called constraints) after which linear programming can be used to find the Maximum Weight Trace.
The most effective constraints are the ones that define facets of the
trace polytope, since these are the constraints that appear in an irredundant
description of it by linear inequalities. Due to the NP-hardness of our
problem, we cannot expect to be able to find a full description of the
facial structure of the trace polytope by linear inequalities. Nevertheless,
a partial description of the facial structure of the trace polytope by
linear inequalities is useful for the design of a branch-and-cut algorithm.
We give a graph-theoretic characterization of traces which we use
to give a complete description of the trace polytope for the two-sequence
case, i.e., we give the complete set of the linear inequalities and polynomial
time separation algorithms for them (a separation algorithm for a class
of inequalities takes a multi-dimensional point and returns a violated
inequality from the class if there is one). Our complete description
implies the existence of a polynomial-time, non-dynamic-programming algorithm
for pairwise sequence alignment and thus gives another answer to a longstanding
problem in stringology. The inequalities naturally fall into two classes
which we call trivial and clique inequalities respectively.
Furthermore we give a partial description of the trace polytope for the
multiple-sequence case. More specifically, we show that all facet-defining
inequalities of the two-sequence case lift to the multiple-sequence case
and we derive three more classes of facet-defining inequalities (which
we call mixed-cycle inequalities, chorded-mixed-cycle inequalities
and ladder inequalities respectively Furthermore we derive
polynomial time separation algorithms for all but the last class
of inequalities and we outline a branch-and-cut algorithm.
The full extended abstract can be found:
Knut Reinert, Hans-Peter
Lenhof, Petra Mutzel, Kurt Mehlhorn, and John Kececioglu. A Branch-And-Cut
algorithm for multiple sequence alignment. Proc. of the First Annual
International Conference on Computational Molecular Biology RECOMB
97, 241--250, 1997.
In our work we present a branch-and-cut algorithm for aligning an RNA
molecule S1 of known sequence and known structure to an RNA molecule
S2 of known sequence but unknown structure. The algorithm computes
an (optimal) alignment that maximizes sequence and structure consensus
simultaneously. To be more precise, the score that is optimized is a weighted
sum of a sequence alignment score and a base pairing score which measures
the quantity and quality of the base pairs of S1 and S2 that
are ``matched'' by the sequence alignment. We call this problem the RNA
Sequence Structure Alignment (RSA) problem.
Note that we do not restrict base pairs to the ones that can be realized
in a secondary structure but allow for the inclusion of pseudoknots.
The RSA problem is the central problem of the different RNA similarity or structure prediction problems defined in the above cited papers. If the two molecules are functionally related and have a similar structure the RSA alignment allows, for example, to draw conclusions about the structure of molecule S2. The RSA techniques we will put forward can be easily modified in such a way that incremental or simultaneous computations of multiple structural sequence alignments are possible. Since the older approaches cannot solve middle sized or large instances of the RSA problem without using ad hoc heuristics, biologists still have to carry out structural alignments of sequences of such size by hand. Furthermore, the known algorithms are not able to integrate tertiary structure interactions like pseudoknots. Bafna et al. claim that ``their algorithm can be extended to structures that allow crossing edges'', but they do not mention how this can be done and whether the complexity of the algorithm changes.
We tested whether methods from polyhedral combinatorics can be used
to solve large instances of the RSA problem. Such methods have recently
been successfully applied to problems like physical mapping or multiple
sequence alignment. We rephrased the RSA problem as an integer linear
program and studied the facets of the convex hull of all integral solutions.
Based on these studies we implemented a first version of a branch-and-cut
algorithm coded in C++. As an example we tested two 23S ribosomal RNA sequences.
Each of these sequences is more than 1400 bases long. A ``hand-made''
structural alignment for these two sequences was taken from the Antwerpen
rRNA database. Our algorithm was able to solve this large instance of the
problem in reasonable time to optimality. The produced structural alignment
is a very good approximation of the ``hand-made'' optimal structural alignment
and is, in particular, much better than the purely sequence-based optimal
alignment. Although not documented in this example, the branch-and-cut
algorithm can also handle pseudoknots without any change in the algorithm.
The full extended abstract can be found in:
Hans-Peter
Lenhof, Knut Reinert and Martin Vingron. A Polyhedral Approach to RNA Sequence
Structure Alignment. To appear in Proc. on the Second Annual International
Conference on Computational Molecular Biology RECOMB 98, 1998.