Current Year

PhD
[1]
A. Abujabal and K. Berberich, “Question Answering over Knowledge Bases with Continuous Learning,” Universität des Saarlandes, Saarbrücken, 2019.
Abstract
Answering complex natural language questions with crisp answers is crucial towards satisfying the information needs of advanced users. With the rapid growth of knowledge bases (KBs) such as Yago and Freebase, this goal has become attainable by translating questions into formal queries like SPARQL queries. Such queries can then be evaluated over knowledge bases to retrieve crisp answers. To this end, three research issues arise: (i) how to develop methods that are robust to lexical and syntactic variations in questions and can handle complex questions, (ii) how to design and curate datasets to advance research in question answering, and (iii) how to efficiently identify named entities in questions. In this dissertation, we make the following five contributions in the areas of question answering (QA) and named entity recognition (NER). For issue (i), we make the following contributions: We present QUINT, an approach for answering natural language questions over knowledge bases using automatically learned templates. Templates are an important asset for QA over KBs, simplifying the semantic parsing of input questions and generating formal queries for interpretable answers. QUINT is capable of answering both simple and compositional questions. We introduce NEQA, a framework for continuous learning for QA over KBs. NEQA starts with a small seed of training examples in the form of question-answer pairs, and improves its performance over time. NEQA combines both syntax, through template-based answering, and semantics, via a semantic similarity function. %when templates fail to do so. Moreover, it adapts to the language used after deployment by periodically retraining its underlying models. For issues (i) and (ii), we present TEQUILA, a framework for answering complex questions with explicit and implicit temporal conditions over KBs. TEQUILA is built on a rule-based framework that detects and decomposes temporal questions into simpler sub-questions that can be answered by standard KB-QA systems. TEQUILA reconciles the results of sub-questions into final answers. TEQUILA is accompanied with a dataset called TempQuestions, which consists of 1,271 temporal questions with gold-standard answers over Freebase. This collection is derived by judiciously selecting time-related questions from existing QA datasets. For issue (ii), we publish ComQA, a large-scale manually-curated dataset for QA. ComQA contains questions that represent real information needs and exhibit a wide range of difficulties such as the need for temporal reasoning, comparison, and compositionality. ComQA contains paraphrase clusters of semantically-equivalent questions that can be exploited by QA systems. We harness a combination of community question-answering platforms and crowdsourcing to construct the ComQA dataset. For issue (iii), we introduce a neural network model based on subword units for named entity recognition. The model learns word representations using a combination of characters, bytes and phonemes. While achieving comparable performance with word-level based models, our model has an order-of-magnitude smaller vocabulary size and lower memory requirements, and it handles out-of-vocabulary words.
Export
BibTeX
@phdthesis{Abujabalphd2013, TITLE = {Question Answering over Knowledge Bases with Continuous Learning}, AUTHOR = {Abujabal, Abdalghani and Berberich, Klaus}, LANGUAGE = {eng}, DOI = {10.22028/D291-27968}, SCHOOL = {Universit{\"a}t des Saarlandes}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2019}, MARGINALMARK = {$\bullet$}, DATE = {2019}, ABSTRACT = {Answering complex natural language questions with crisp answers is crucial towards satisfying the information needs of advanced users. With the rapid growth of knowledge bases (KBs) such as Yago and Freebase, this goal has become attainable by translating questions into formal queries like SPARQL queries. Such queries can then be evaluated over knowledge bases to retrieve crisp answers. To this end, three research issues arise: (i) how to develop methods that are robust to lexical and syntactic variations in questions and can handle complex questions, (ii) how to design and curate datasets to advance research in question answering, and (iii) how to efficiently identify named entities in questions. In this dissertation, we make the following five contributions in the areas of question answering (QA) and named entity recognition (NER). For issue (i), we make the following contributions: We present QUINT, an approach for answering natural language questions over knowledge bases using automatically learned templates. Templates are an important asset for QA over KBs, simplifying the semantic parsing of input questions and generating formal queries for interpretable answers. QUINT is capable of answering both simple and compositional questions. We introduce NEQA, a framework for continuous learning for QA over KBs. NEQA starts with a small seed of training examples in the form of question-answer pairs, and improves its performance over time. NEQA combines both syntax, through template-based answering, and semantics, via a semantic similarity function. %when templates fail to do so. Moreover, it adapts to the language used after deployment by periodically retraining its underlying models. For issues (i) and (ii), we present TEQUILA, a framework for answering complex questions with explicit and implicit temporal conditions over KBs. TEQUILA is built on a rule-based framework that detects and decomposes temporal questions into simpler sub-questions that can be answered by standard KB-QA systems. TEQUILA reconciles the results of sub-questions into final answers. TEQUILA is accompanied with a dataset called TempQuestions, which consists of 1,271 temporal questions with gold-standard answers over Freebase. This collection is derived by judiciously selecting time-related questions from existing QA datasets. For issue (ii), we publish ComQA, a large-scale manually-curated dataset for QA. ComQA contains questions that represent real information needs and exhibit a wide range of difficulties such as the need for temporal reasoning, comparison, and compositionality. ComQA contains paraphrase clusters of semantically-equivalent questions that can be exploited by QA systems. We harness a combination of community question-answering platforms and crowdsourcing to construct the ComQA dataset. For issue (iii), we introduce a neural network model based on subword units for named entity recognition. The model learns word representations using a combination of characters, bytes and phonemes. While achieving comparable performance with word-level based models, our model has an order-of-magnitude smaller vocabulary size and lower memory requirements, and it handles out-of-vocabulary words.}, }
Endnote
%0 Thesis %A Abujabal, Abdalghani %Y Weikum, Gerhard %A referee: Linn, Jimmy %A Berberich, Klaus %+ Databases and Information Systems, MPI for Informatics, Max Planck Society International Max Planck Research School, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society External Organizations Databases and Information Systems, MPI for Informatics, Max Planck Society %T Question Answering over Knowledge Bases with Continuous Learning : %G eng %U http://hdl.handle.net/21.11116/0000-0003-AEC0-0 %R 10.22028/D291-27968 %I Universität des Saarlandes %C Saarbrücken %D 2019 %P 141 p. %V phd %9 phd %X Answering complex natural language questions with crisp answers is crucial towards satisfying the information needs of advanced users. With the rapid growth of knowledge bases (KBs) such as Yago and Freebase, this goal has become attainable by translating questions into formal queries like SPARQL queries. Such queries can then be evaluated over knowledge bases to retrieve crisp answers. To this end, three research issues arise: (i) how to develop methods that are robust to lexical and syntactic variations in questions and can handle complex questions, (ii) how to design and curate datasets to advance research in question answering, and (iii) how to efficiently identify named entities in questions. In this dissertation, we make the following five contributions in the areas of question answering (QA) and named entity recognition (NER). For issue (i), we make the following contributions: We present QUINT, an approach for answering natural language questions over knowledge bases using automatically learned templates. Templates are an important asset for QA over KBs, simplifying the semantic parsing of input questions and generating formal queries for interpretable answers. QUINT is capable of answering both simple and compositional questions. We introduce NEQA, a framework for continuous learning for QA over KBs. NEQA starts with a small seed of training examples in the form of question-answer pairs, and improves its performance over time. NEQA combines both syntax, through template-based answering, and semantics, via a semantic similarity function. %when templates fail to do so. Moreover, it adapts to the language used after deployment by periodically retraining its underlying models. For issues (i) and (ii), we present TEQUILA, a framework for answering complex questions with explicit and implicit temporal conditions over KBs. TEQUILA is built on a rule-based framework that detects and decomposes temporal questions into simpler sub-questions that can be answered by standard KB-QA systems. TEQUILA reconciles the results of sub-questions into final answers. TEQUILA is accompanied with a dataset called TempQuestions, which consists of 1,271 temporal questions with gold-standard answers over Freebase. This collection is derived by judiciously selecting time-related questions from existing QA datasets. For issue (ii), we publish ComQA, a large-scale manually-curated dataset for QA. ComQA contains questions that represent real information needs and exhibit a wide range of difficulties such as the need for temporal reasoning, comparison, and compositionality. ComQA contains paraphrase clusters of semantically-equivalent questions that can be exploited by QA systems. We harness a combination of community question-answering platforms and crowdsourcing to construct the ComQA dataset. For issue (iii), we introduce a neural network model based on subword units for named entity recognition. The model learns word representations using a combination of characters, bytes and phonemes. While achieving comparable performance with word-level based models, our model has an order-of-magnitude smaller vocabulary size and lower memory requirements, and it handles out-of-vocabulary words. %U https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/27438
[2]
J. A. Biega, “Enhancing Privacy and Fairness in Search Systems,” Universität des Saarlandes, Saarbrücken, 2019.
Abstract
Following a period of expedited progress in the capabilities of digital systems, the society begins to realize that systems designed to assist people in various tasks can also harm individuals and society. Mediating access to information and explicitly or implicitly ranking people in increasingly many applications, search systems have a substantial potential to contribute to such unwanted outcomes. Since they collect vast amounts of data about both searchers and search subjects, they have the potential to violate the privacy of both of these groups of users. Moreover, in applications where rankings influence people's economic livelihood outside of the platform, such as sharing economy or hiring support websites, search engines have an immense economic power over their users in that they control user exposure in ranked results. This thesis develops new models and methods broadly covering different aspects of privacy and fairness in search systems for both searchers and search subjects. Specifically, it makes the following contributions: (1) We propose a model for computing individually fair rankings where search subjects get exposure proportional to their relevance. The exposure is amortized over time using constrained optimization to overcome searcher attention biases while preserving ranking utility. (2) We propose a model for computing sensitive search exposure where each subject gets to know the sensitive queries that lead to her profile in the top-k search results. The problem of finding exposing queries is technically modeled as reverse nearest neighbor search, followed by a weekly-supervised learning to rank model ordering the queries by privacy-sensitivity. (3) We propose a model for quantifying privacy risks from textual data in online communities. The method builds on a topic model where each topic is annotated by a crowdsourced sensitivity score, and privacy risks are associated with a user's relevance to sensitive topics. We propose relevance measures capturing different dimensions of user interest in a topic and show how they correlate with human risk perceptions. (4) We propose a model for privacy-preserving personalized search where search queries of different users are split and merged into synthetic profiles. The model mediates the privacy-utility trade-off by keeping semantically coherent fragments of search histories within individual profiles, while trying to minimize the similarity of any of the synthetic profiles to the original user profiles. The models are evaluated using information retrieval techniques and user studies over a variety of datasets, ranging from query logs, through social media and community question answering postings, to item listings from sharing economy platforms.
Export
BibTeX
@phdthesis{biegaphd2019, TITLE = {Enhancing Privacy and Fairness in Search Systems}, AUTHOR = {Biega, Joanna Asia}, LANGUAGE = {eng}, URL = {urn:nbn:de:bsz:291--ds-278861}, DOI = {10.22028/D291-27886}, SCHOOL = {Universit{\"a}t des Saarlandes}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2019}, MARGINALMARK = {$\bullet$}, DATE = {2019}, ABSTRACT = {Following a period of expedited progress in the capabilities of digital systems, the society begins to realize that systems designed to assist people in various tasks can also harm individuals and society. Mediating access to information and explicitly or implicitly ranking people in increasingly many applications, search systems have a substantial potential to contribute to such unwanted outcomes. Since they collect vast amounts of data about both searchers and search subjects, they have the potential to violate the privacy of both of these groups of users. Moreover, in applications where rankings influence people's economic livelihood outside of the platform, such as sharing economy or hiring support websites, search engines have an immense economic power over their users in that they control user exposure in ranked results. This thesis develops new models and methods broadly covering different aspects of privacy and fairness in search systems for both searchers and search subjects. Specifically, it makes the following contributions: (1) We propose a model for computing individually fair rankings where search subjects get exposure proportional to their relevance. The exposure is amortized over time using constrained optimization to overcome searcher attention biases while preserving ranking utility. (2) We propose a model for computing sensitive search exposure where each subject gets to know the sensitive queries that lead to her profile in the top-k search results. The problem of finding exposing queries is technically modeled as reverse nearest neighbor search, followed by a weekly-supervised learning to rank model ordering the queries by privacy-sensitivity. (3) We propose a model for quantifying privacy risks from textual data in online communities. The method builds on a topic model where each topic is annotated by a crowdsourced sensitivity score, and privacy risks are associated with a user's relevance to sensitive topics. We propose relevance measures capturing different dimensions of user interest in a topic and show how they correlate with human risk perceptions. (4) We propose a model for privacy-preserving personalized search where search queries of different users are split and merged into synthetic profiles. The model mediates the privacy-utility trade-off by keeping semantically coherent fragments of search histories within individual profiles, while trying to minimize the similarity of any of the synthetic profiles to the original user profiles. The models are evaluated using information retrieval techniques and user studies over a variety of datasets, ranging from query logs, through social media and community question answering postings, to item listings from sharing economy platforms.}, }
Endnote
%0 Thesis %A Biega, Joanna Asia %Y Weikum, Gerhard %A referee: Gummadi, Krishna %A referee: Nejdl, Wolfgang %+ Ontologies, MPI for Informatics, Max Planck Society International Max Planck Research School, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society Group K. Gummadi, Max Planck Institute for Software Systems, Max Planck Society External Organizations %T Enhancing Privacy and Fairness in Search Systems : %G eng %U http://hdl.handle.net/21.11116/0000-0003-9AED-5 %R 10.22028/D291-27886 %U urn:nbn:de:bsz:291--ds-278861 %I Universität des Saarlandes %C Saarbrücken %D 2019 %P 111 p. %V phd %9 phd %X Following a period of expedited progress in the capabilities of digital systems, the society begins to realize that systems designed to assist people in various tasks can also harm individuals and society. Mediating access to information and explicitly or implicitly ranking people in increasingly many applications, search systems have a substantial potential to contribute to such unwanted outcomes. Since they collect vast amounts of data about both searchers and search subjects, they have the potential to violate the privacy of both of these groups of users. Moreover, in applications where rankings influence people's economic livelihood outside of the platform, such as sharing economy or hiring support websites, search engines have an immense economic power over their users in that they control user exposure in ranked results. This thesis develops new models and methods broadly covering different aspects of privacy and fairness in search systems for both searchers and search subjects. Specifically, it makes the following contributions: (1) We propose a model for computing individually fair rankings where search subjects get exposure proportional to their relevance. The exposure is amortized over time using constrained optimization to overcome searcher attention biases while preserving ranking utility. (2) We propose a model for computing sensitive search exposure where each subject gets to know the sensitive queries that lead to her profile in the top-k search results. The problem of finding exposing queries is technically modeled as reverse nearest neighbor search, followed by a weekly-supervised learning to rank model ordering the queries by privacy-sensitivity. (3) We propose a model for quantifying privacy risks from textual data in online communities. The method builds on a topic model where each topic is annotated by a crowdsourced sensitivity score, and privacy risks are associated with a user's relevance to sensitive topics. We propose relevance measures capturing different dimensions of user interest in a topic and show how they correlate with human risk perceptions. (4) We propose a model for privacy-preserving personalized search where search queries of different users are split and merged into synthetic profiles. The model mediates the privacy-utility trade-off by keeping semantically coherent fragments of search histories within individual profiles, while trying to minimize the similarity of any of the synthetic profiles to the original user profiles. The models are evaluated using information retrieval techniques and user studies over a variety of datasets, ranging from query logs, through social media and community question answering postings, to item listings from sharing economy platforms. %U https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/27389
[3]
M. Döring, “Computational Approaches for Improving Treatment and Prevention of Viral Infections,” Universität des Saarlandes, Saarbrücken, 2019.
Abstract
The treatment of infections with HIV or HCV is challenging. Thus, novel drugs and new computational approaches that support the selection of therapies are required. This work presents methods that support therapy selection as well as methods that advance novel antiviral treatments. geno2pheno[ngs-freq] identifies drug resistance from HIV-1 or HCV samples that were subjected to next-generation sequencing by interpreting their sequences either via support vector machines or a rules-based approach. geno2pheno[coreceptor-hiv2] determines the coreceptor that is used for viral cell entry by analyzing a segment of the HIV-2 surface protein with a support vector machine. openPrimeR is capable of finding optimal combinations of primers for multiplex polymerase chain reaction by solving a set cover problem and accessing a new logistic regression model for determining amplification events arising from polymerase chain reaction. geno2pheno[ngs-freq] and geno2pheno[coreceptorhiv2] enable the personalization of antiviral treatments and support clinical decision making. The application of openPrimeR on human immunoglobulin sequences has resulted in novel primer sets that improve the isolation of broadly neutralizing antibodies against HIV-1. The methods that were developed in this work thus constitute important contributions towards improving the prevention and treatment of viral infectious diseases.
Export
BibTeX
@phdthesis{Doringphd2013, TITLE = {Computational Approaches for Improving Treatment and Prevention of Viral Infections}, AUTHOR = {D{\"o}ring, Matthias}, LANGUAGE = {eng}, DOI = {10.22028/D291-27946}, SCHOOL = {Universit{\"a}t des Saarlandes}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2019}, MARGINALMARK = {$\bullet$}, DATE = {2019}, ABSTRACT = {The treatment of infections with HIV or HCV is challenging. Thus, novel drugs and new computational approaches that support the selection of therapies are required. This work presents methods that support therapy selection as well as methods that advance novel antiviral treatments. geno2pheno[ngs-freq] identifies drug resistance from HIV-1 or HCV samples that were subjected to next-generation sequencing by interpreting their sequences either via support vector machines or a rules-based approach. geno2pheno[coreceptor-hiv2] determines the coreceptor that is used for viral cell entry by analyzing a segment of the HIV-2 surface protein with a support vector machine. openPrimeR is capable of finding optimal combinations of primers for multiplex polymerase chain reaction by solving a set cover problem and accessing a new logistic regression model for determining amplification events arising from polymerase chain reaction. geno2pheno[ngs-freq] and geno2pheno[coreceptorhiv2] enable the personalization of antiviral treatments and support clinical decision making. The application of openPrimeR on human immunoglobulin sequences has resulted in novel primer sets that improve the isolation of broadly neutralizing antibodies against HIV-1. The methods that were developed in this work thus constitute important contributions towards improving the prevention and treatment of viral infectious diseases.}, }
Endnote
%0 Thesis %A Döring, Matthias %Y Pfeifer, Nico %A referee: Lengauer, Thomas %A referee: Kalinina, Olga V. %+ Computational Biology and Applied Algorithmics, MPI for Informatics, Max Planck Society International Max Planck Research School, MPI for Informatics, Max Planck Society Computational Biology and Applied Algorithmics, MPI for Informatics, Max Planck Society Computational Biology and Applied Algorithmics, MPI for Informatics, Max Planck Society Computational Biology and Applied Algorithmics, MPI for Informatics, Max Planck Society %T Computational Approaches for Improving Treatment and Prevention of Viral Infections : %G eng %U http://hdl.handle.net/21.11116/0000-0003-AEBA-8 %R 10.22028/D291-27946 %I Universität des Saarlandes %C Saarbrücken %D 2019 %P 337 p. %V phd %9 phd %X The treatment of infections with HIV or HCV is challenging. Thus, novel drugs and new computational approaches that support the selection of therapies are required. This work presents methods that support therapy selection as well as methods that advance novel antiviral treatments. geno2pheno[ngs-freq] identifies drug resistance from HIV-1 or HCV samples that were subjected to next-generation sequencing by interpreting their sequences either via support vector machines or a rules-based approach. geno2pheno[coreceptor-hiv2] determines the coreceptor that is used for viral cell entry by analyzing a segment of the HIV-2 surface protein with a support vector machine. openPrimeR is capable of finding optimal combinations of primers for multiplex polymerase chain reaction by solving a set cover problem and accessing a new logistic regression model for determining amplification events arising from polymerase chain reaction. geno2pheno[ngs-freq] and geno2pheno[coreceptorhiv2] enable the personalization of antiviral treatments and support clinical decision making. The application of openPrimeR on human immunoglobulin sequences has resulted in novel primer sets that improve the isolation of broadly neutralizing antibodies against HIV-1. The methods that were developed in this work thus constitute important contributions towards improving the prevention and treatment of viral infectious diseases. %U https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/27443
[4]
P. Ebert, “What we leave behind : reproducibility in chromatin analysis within and across species,” Universität des Saarlandes, Saarbrücken, 2019.
Abstract
Epigenetics is the field of biology that investigates heritable factors regulating gene expression without being directly encoded in the genome of an organism. The human genome is densely packed inside a cell's nucleus in the form of chromatin. Certain constituents of chromatin play a vital role as epigenetic factors in the dynamic regulation of gene expression. Epigenetic changes on the chromatin level are thus an integral part of the mechanisms governing the development of the functionally diverse cell types in multicellular species such as human. Studying these mechanisms is not only important to understand the biology of healthy cells, but also necessary to comprehend the epigenetic component in the formation of many complex diseases. Modern wet lab technology enables scientists to probe the epigenome with high throughput and in extensive detail. The fast generation of epigenetic datasets burdens computational researchers with the challenge of rapidly performing elaborate analyses without compromising on the scientific reproducibility of the reported findings. To facilitate reproducible computational research in epigenomics, this thesis proposes a task-oriented metadata model, relying on web technology and supported by database engineering, that aims at consistent and human-readable documentation of standardized computational workflows. The suggested approach features, e.g., computational validation of metadata records, automatic error detection, and progress monitoring of multi-step analyses, and was successfully field-tested as part of a large epigenome research consortium. This work leaves aside theoretical considerations, and intentionally emphasizes the realistic need of providing scientists with tools that assist them in performing reproducible research. Irrespective of the technological progress, the dynamic and cell-type specific nature of the epigenome commonly requires restricting the number of analyzed samples due to resource limitations. The second project of this thesis introduces the software tool SCIDDO, which has been developed for the differential chromatin analysis of cellular samples with potentially limited availability. By combining statistics, algorithmics, and best practices for robust software development, SCIDDO can quickly identify biologically meaningful regions of differential chromatin marking between cell types. We demonstrate SCIDDO's usefulness in an exemplary study in which we identify regions that establish a link between chromatin and gene expression changes. SCIDDO's quantitative approach to differential chromatin analysis is user-customizable, providing the necessary flexibility to adapt SCIDDO to specific research tasks. Given the functional diversity of cell types and the dynamics of the epigenome in response to environmental changes, it is hardly realistic to map the complete epigenome even for a single organism like human or mouse. For non-model organisms, e.g., cow, pig, or dog, epigenome data is particularly scarce. The third project of this thesis investigates to what extent bioinformatics methods can compensate for the comparatively little effort that is invested in charting the epigenome of non-model species. This study implements a large integrative analysis pipeline, including state-of-the-art machine learning, to transfer chromatin data for predictive modeling between 13 species. The evidence presented here indicates that a partial regulatory epigenetic signal is stably retained even over millions of years of evolutionary distance between the considered species. This finding suggests complementary and cost-effective ways for bioinformatics to contribute to comparative epigenome analysis across species boundaries.
Export
BibTeX
@phdthesis{Ebertphd2019, TITLE = {What we leave behind : reproducibility in chromatin analysis within and across species}, AUTHOR = {Ebert, Peter}, LANGUAGE = {eng}, URL = {urn:nbn:de:bsz:291--ds-278311}, DOI = {doi.org/10.22028/D291-27831}, SCHOOL = {Universit{\"a}t des Saarlandes}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2019}, MARGINALMARK = {$\bullet$}, DATE = {2019}, ABSTRACT = {Epigenetics is the field of biology that investigates heritable factors regulating gene expression without being directly encoded in the genome of an organism. The human genome is densely packed inside a cell's nucleus in the form of chromatin. Certain constituents of chromatin play a vital role as epigenetic factors in the dynamic regulation of gene expression. Epigenetic changes on the chromatin level are thus an integral part of the mechanisms governing the development of the functionally diverse cell types in multicellular species such as human. Studying these mechanisms is not only important to understand the biology of healthy cells, but also necessary to comprehend the epigenetic component in the formation of many complex diseases. Modern wet lab technology enables scientists to probe the epigenome with high throughput and in extensive detail. The fast generation of epigenetic datasets burdens computational researchers with the challenge of rapidly performing elaborate analyses without compromising on the scientific reproducibility of the reported findings. To facilitate reproducible computational research in epigenomics, this thesis proposes a task-oriented metadata model, relying on web technology and supported by database engineering, that aims at consistent and human-readable documentation of standardized computational workflows. The suggested approach features, e.g., computational validation of metadata records, automatic error detection, and progress monitoring of multi-step analyses, and was successfully field-tested as part of a large epigenome research consortium. This work leaves aside theoretical considerations, and intentionally emphasizes the realistic need of providing scientists with tools that assist them in performing reproducible research. Irrespective of the technological progress, the dynamic and cell-type specific nature of the epigenome commonly requires restricting the number of analyzed samples due to resource limitations. The second project of this thesis introduces the software tool SCIDDO, which has been developed for the differential chromatin analysis of cellular samples with potentially limited availability. By combining statistics, algorithmics, and best practices for robust software development, SCIDDO can quickly identify biologically meaningful regions of differential chromatin marking between cell types. We demonstrate SCIDDO's usefulness in an exemplary study in which we identify regions that establish a link between chromatin and gene expression changes. SCIDDO's quantitative approach to differential chromatin analysis is user-customizable, providing the necessary flexibility to adapt SCIDDO to specific research tasks. Given the functional diversity of cell types and the dynamics of the epigenome in response to environmental changes, it is hardly realistic to map the complete epigenome even for a single organism like human or mouse. For non-model organisms, e.g., cow, pig, or dog, epigenome data is particularly scarce. The third project of this thesis investigates to what extent bioinformatics methods can compensate for the comparatively little effort that is invested in charting the epigenome of non-model species. This study implements a large integrative analysis pipeline, including state-of-the-art machine learning, to transfer chromatin data for predictive modeling between 13 species. The evidence presented here indicates that a partial regulatory epigenetic signal is stably retained even over millions of years of evolutionary distance between the considered species. This finding suggests complementary and cost-effective ways for bioinformatics to contribute to comparative epigenome analysis across species boundaries.}, }
Endnote
%0 Thesis %A Ebert, Peter %Y Lengauer, Thomas %A referee: Lenhof, Hans-Peter %A referee: Weikum, Gerhard %+ Computational Biology and Applied Algorithmics, MPI for Informatics, Max Planck Society International Max Planck Research School, MPI for Informatics, Max Planck Society Computational Biology and Applied Algorithmics, MPI for Informatics, Max Planck Society Algorithms and Complexity, MPI for Informatics, Max Planck Society Databases and Information Systems, MPI for Informatics, Max Planck Society %T What we leave behind : reproducibility in chromatin analysis within and across species : %G eng %U http://hdl.handle.net/21.11116/0000-0003-9ADF-5 %R doi.org/10.22028/D291-27831 %U urn:nbn:de:bsz:291--ds-278311 %I Universität des Saarlandes %C Saarbrücken %D 2019 %P 152 p. %V phd %9 phd %X Epigenetics is the field of biology that investigates heritable factors regulating gene expression without being directly encoded in the genome of an organism. The human genome is densely packed inside a cell's nucleus in the form of chromatin. Certain constituents of chromatin play a vital role as epigenetic factors in the dynamic regulation of gene expression. Epigenetic changes on the chromatin level are thus an integral part of the mechanisms governing the development of the functionally diverse cell types in multicellular species such as human. Studying these mechanisms is not only important to understand the biology of healthy cells, but also necessary to comprehend the epigenetic component in the formation of many complex diseases. Modern wet lab technology enables scientists to probe the epigenome with high throughput and in extensive detail. The fast generation of epigenetic datasets burdens computational researchers with the challenge of rapidly performing elaborate analyses without compromising on the scientific reproducibility of the reported findings. To facilitate reproducible computational research in epigenomics, this thesis proposes a task-oriented metadata model, relying on web technology and supported by database engineering, that aims at consistent and human-readable documentation of standardized computational workflows. The suggested approach features, e.g., computational validation of metadata records, automatic error detection, and progress monitoring of multi-step analyses, and was successfully field-tested as part of a large epigenome research consortium. This work leaves aside theoretical considerations, and intentionally emphasizes the realistic need of providing scientists with tools that assist them in performing reproducible research. Irrespective of the technological progress, the dynamic and cell-type specific nature of the epigenome commonly requires restricting the number of analyzed samples due to resource limitations. The second project of this thesis introduces the software tool SCIDDO, which has been developed for the differential chromatin analysis of cellular samples with potentially limited availability. By combining statistics, algorithmics, and best practices for robust software development, SCIDDO can quickly identify biologically meaningful regions of differential chromatin marking between cell types. We demonstrate SCIDDO's usefulness in an exemplary study in which we identify regions that establish a link between chromatin and gene expression changes. SCIDDO's quantitative approach to differential chromatin analysis is user-customizable, providing the necessary flexibility to adapt SCIDDO to specific research tasks. Given the functional diversity of cell types and the dynamics of the epigenome in response to environmental changes, it is hardly realistic to map the complete epigenome even for a single organism like human or mouse. For non-model organisms, e.g., cow, pig, or dog, epigenome data is particularly scarce. The third project of this thesis investigates to what extent bioinformatics methods can compensate for the comparatively little effort that is invested in charting the epigenome of non-model species. This study implements a large integrative analysis pipeline, including state-of-the-art machine learning, to transfer chromatin data for predictive modeling between 13 species. The evidence presented here indicates that a partial regulatory epigenetic signal is stably retained even over millions of years of evolutionary distance between the considered species. This finding suggests complementary and cost-effective ways for bioinformatics to contribute to comparative epigenome analysis across species boundaries. %U https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/27387
[5]
E. Levinkov, “Generalizations of the Multicut Problem for Computer Vision,” Universität des Saarlandes, Saarbrücken, 2019.
Abstract
Graph decomposition has always been a very important concept in machine learning and computer vision. Many tasks like image and mesh segmentation, community detection in social networks, as well as object tracking and human pose estimation can be formulated as a graph decomposition problem. The multicut problem in particular is a popular model to optimize for a decomposition of a given graph. Its main advantage is that no prior knowledge about the number of components or their sizes is required. However, it has several limitations, which we address in this thesis: Firstly, the multicut problem allows to specify only cost or reward for putting two direct neighbours into distinct components. This limits the expressibility of the cost function. We introduce special edges into the graph that allow to define cost or reward for putting any two vertices into distinct components, while preserving the original set of feasible solutions. We show that this considerably improves the quality of image and mesh segmentations. Second, multicut is notorious to be NP-hard for general graphs, that limits its applications to small super-pixel graphs. We define and implement two primal feasible heuristics to solve the problem. They do not provide any guarantees on the runtime or quality of solutions, but in practice show good convergence behaviour. We perform an extensive comparison on multiple graphs of different sizes and properties. Third, we extend the multicut framework by introducing node labels, so that we can jointly optimize for graph decomposition and nodes classification by means of exactly the same optimization algorithm, thus eliminating the need to hand-tune optimizers for a particular task. To prove its universality we applied it to diverse computer vision tasks, including human pose estimation, multiple object tracking, and instance-aware semantic segmentation. We show that we can improve the results over the prior art using exactly the same data as in the original works. Finally, we use employ multicuts in two applications: 1) a client-server tool for interactive video segmentation: After the pre-processing of the video a user draws strokes on several frames and a time-coherent segmentation of the entire video is performed on-the-fly. 2) we formulate a method for simultaneous segmentation and tracking of living cells in microscopy data. This task is challenging as cells split and our algorithm accounts for this, creating parental hierarchies. We also present results on multiple model fitting. We find models in data heavily corrupted by noise by finding components defining these models using higher order multicuts. We introduce an interesting extension that allows our optimization to pick better hyperparameters for each discovered model. In summary, this thesis extends the multicut problem in different directions, proposes algorithms for optimization, and applies it to novel data and settings.
Export
BibTeX
@phdthesis{Levinkovphd2013, TITLE = {Generalizations of the Multicut Problem for Computer Vision}, AUTHOR = {Levinkov, Evgeny}, LANGUAGE = {eng}, DOI = {10.22028/D291-27909}, SCHOOL = {Universit{\"a}t des Saarlandes}, ADDRESS = {Saarbr{\"u}cken}, YEAR = {2019}, MARGINALMARK = {$\bullet$}, DATE = {2019}, ABSTRACT = {Graph decomposition has always been a very important concept in machine learning and computer vision. Many tasks like image and mesh segmentation, community detection in social networks, as well as object tracking and human pose estimation can be formulated as a graph decomposition problem. The multicut problem in particular is a popular model to optimize for a decomposition of a given graph. Its main advantage is that no prior knowledge about the number of components or their sizes is required. However, it has several limitations, which we address in this thesis: Firstly, the multicut problem allows to specify only cost or reward for putting two direct neighbours into distinct components. This limits the expressibility of the cost function. We introduce special edges into the graph that allow to define cost or reward for putting any two vertices into distinct components, while preserving the original set of feasible solutions. We show that this considerably improves the quality of image and mesh segmentations. Second, multicut is notorious to be NP-hard for general graphs, that limits its applications to small super-pixel graphs. We define and implement two primal feasible heuristics to solve the problem. They do not provide any guarantees on the runtime or quality of solutions, but in practice show good convergence behaviour. We perform an extensive comparison on multiple graphs of different sizes and properties. Third, we extend the multicut framework by introducing node labels, so that we can jointly optimize for graph decomposition and nodes classification by means of exactly the same optimization algorithm, thus eliminating the need to hand-tune optimizers for a particular task. To prove its universality we applied it to diverse computer vision tasks, including human pose estimation, multiple object tracking, and instance-aware semantic segmentation. We show that we can improve the results over the prior art using exactly the same data as in the original works. Finally, we use employ multicuts in two applications: 1) a client-server tool for interactive video segmentation: After the pre-processing of the video a user draws strokes on several frames and a time-coherent segmentation of the entire video is performed on-the-fly. 2) we formulate a method for simultaneous segmentation and tracking of living cells in microscopy data. This task is challenging as cells split and our algorithm accounts for this, creating parental hierarchies. We also present results on multiple model fitting. We find models in data heavily corrupted by noise by finding components defining these models using higher order multicuts. We introduce an interesting extension that allows our optimization to pick better hyperparameters for each discovered model. In summary, this thesis extends the multicut problem in different directions, proposes algorithms for optimization, and applies it to novel data and settings.}, }
Endnote
%0 Thesis %A Levinkov, Evgeny %Y Andres, Bjoern %A referee: Lempitsky, Victor %A referee: Rother, Carsten %+ International Max Planck Research School, MPI for Informatics, Max Planck Society Computer Vision and Multimodal Computing, MPI for Informatics, Max Planck Society External Organizations External Organizations %T Generalizations of the Multicut Problem for Computer Vision : %G eng %U http://hdl.handle.net/21.11116/0000-0003-9B27-3 %R 10.22028/D291-27909 %I Universität des Saarlandes %C Saarbrücken %D 2019 %P 151 p. %V phd %9 phd %X Graph decomposition has always been a very important concept in machine learning and computer vision. Many tasks like image and mesh segmentation, community detection in social networks, as well as object tracking and human pose estimation can be formulated as a graph decomposition problem. The multicut problem in particular is a popular model to optimize for a decomposition of a given graph. Its main advantage is that no prior knowledge about the number of components or their sizes is required. However, it has several limitations, which we address in this thesis: Firstly, the multicut problem allows to specify only cost or reward for putting two direct neighbours into distinct components. This limits the expressibility of the cost function. We introduce special edges into the graph that allow to define cost or reward for putting any two vertices into distinct components, while preserving the original set of feasible solutions. We show that this considerably improves the quality of image and mesh segmentations. Second, multicut is notorious to be NP-hard for general graphs, that limits its applications to small super-pixel graphs. We define and implement two primal feasible heuristics to solve the problem. They do not provide any guarantees on the runtime or quality of solutions, but in practice show good convergence behaviour. We perform an extensive comparison on multiple graphs of different sizes and properties. Third, we extend the multicut framework by introducing node labels, so that we can jointly optimize for graph decomposition and nodes classification by means of exactly the same optimization algorithm, thus eliminating the need to hand-tune optimizers for a particular task. To prove its universality we applied it to diverse computer vision tasks, including human pose estimation, multiple object tracking, and instance-aware semantic segmentation. We show that we can improve the results over the prior art using exactly the same data as in the original works. Finally, we use employ multicuts in two applications: 1) a client-server tool for interactive video segmentation: After the pre-processing of the video a user draws strokes on several frames and a time-coherent segmentation of the entire video is performed on-the-fly. 2) we formulate a method for simultaneous segmentation and tracking of living cells in microscopy data. This task is challenging as cells split and our algorithm accounts for this, creating parental hierarchies. We also present results on multiple model fitting. We find models in data heavily corrupted by noise by finding components defining these models using higher order multicuts. We introduce an interesting extension that allows our optimization to pick better hyperparameters for each discovered model. In summary, this thesis extends the multicut problem in different directions, proposes algorithms for optimization, and applies it to novel data and settings. %U https://publikationen.sulb.uni-saarland.de/handle/20.500.11880/27415