
SHORT DESCRIPTION

INFORMATICS METHODS FOR THE ANALYSIS AND INTERPRETATION OF LARGE GENOMIC DATASETS



The extensive efforts to sequence the genomes of whole organisms are revolutionizing molecular biology and biotechnology. More than 10 microorganisms have already been completely sequenced (status as of March 1997). Seven genomes are openly accessible, including that of the eukaryote yeast with about 12 million base pairs. The sequencing of the human genome is planned to be completed by 2005 at the latest. The result of these sequencing efforts is a wealth of data that can no longer be handled by conventional methods of data analysis and modelling. Yeast, whose sequencing has just been completed, has about 6000 genes. Merely gaining an overview of this amount of data requires new methods of data analysis. It is no longer sufficient to concentrate on examining the sequence patterns, structures and functions of single genes, RNA molecules or proteins. Rather, new procedures are needed to search through and selectively assess large genomic data records. We call such procedures "screening methods". Here, the analysis of evolutionary, structural and functional similarities plays an important role. The methods for this analysis originate from computer science and mathematics. For this reason, in its recommendations on progress in biotechnology, the Technology Council of the Chancellor of the Federal Republic of Germany attributes high significance to Computational Biology and recommends advancing it with strong emphasis. We respond to this recommendation with the present proposal for a priority program of the DFG.

The priority program addresses the interdisciplinary expert community of computer scientists and mathematicians on the one side and molecular biologists and biochemists on the other, which has developed in Germany and established itself internationally. Thanks to the sequenced genomes, we have at our disposal data records containing all the relevant information for a growing number of species. However, a detailed classification of the functions of the genetic elements has so far been possible only in part. Well over half of all genes of the sequenced organisms have not yet been characterized, or only insufficiently. Therefore, the main task of the program is to explore large genomic datasets using methods from computer science. Systematic comparisons of sequence patterns, as well as models of molecular structures and interactions, make it possible to uncover relationships between structure and function and thus to place cellular building blocks into metabolic or regulatory networks. Through the identification of orthologous proteins in model genomes, human hereditary diseases can be functionally classified, the pathogenicity of microorganisms studied, or approaches for drug development found. These methods are of fundamental significance for basic biological research. Due to the rapidly growing number of available genomes, computational biology is increasingly challenged to reliably identify the genes that are to be chosen as candidates for costly experimental function analysis. These difficulties must be seen, in particular, in the context of the sequencing of unicellular organisms in the range of 20-40 megabases under consideration by the DFG.

On the methodological side, appropriate modeling of complex biological interactions and the development of efficient algorithms play a role in achieving the required data throughput, but data handling and access problems, as well as questions concerning the visual presentation of complex data, are also important. After all, screening systems consist of many software components whose consistency and usability must be assured.

The following problem areas pertain to this overall goal:

  • Analysis of molecular sequence data, especially sequence alignments and analysis of sequence variability, structuring of the genome as well as polymorphisms
  • Systematic genome comparison (e.g. analysis of the genomic topology of related pathogen/non-pathogen organisms)
  • Molecular structure determination with methods of computer science (proteins, RNA, complexes)
  • Determination of molecular function based on sequence and/or structure comparison
  • Molecular biology databases (organization, access, data validation, pattern recognition, classification)
  • Computer modelling of regulatory and metabolic networks
  • Visualization of molecular biology data

In all these fields, the priority program focuses on methods whose data throughput is high enough to subject even large genomic datasets to complex analysis.

Projects to be funded in the Priority Program must have both significant computer science and/or mathematics content and direct relevance to important molecular biological questions. To this end, it is important to secure the methodological side through qualified computer science and/or mathematics and, at the same time, to ensure that the goals of the project are relevant for genome research. Scientists from the fields of computational biology, computer science and mathematics can apply. Interdisciplinary cooperation is especially desirable.

The evaluation should be carried out by specialists who have interdisciplinary competence in both biology and computer science/mathematics. The results of the priority program should be made directly accessible to the biological research community. Therefore, the validation of tools on biological data is mandatory, and the software should be usable and accessible. Regular colloquia shall present the work of the priority program and foster direct exchange with the biosciences. A website of the priority program is an integral part of the program.