Anna Rohrbach (PhD Student)

MSc Anna Rohrbach

Address
Max-Planck-Institut für Informatik
Saarland Informatics Campus
Phone
+49 681 9325 2000
Fax
+49 681 9325 2099

Personal Information

Research Interests

  • Computer Vision
  • Computational Linguistics
  • Machine Learning

Education

2008-2010 M.Sc. in Applied Mathematics, Odessa I. I. Mechnikov National University, Odessa, Ukraine

Other

See my Google Scholar page.

Publications

2017
A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-blank Question-answering
T. Maharaj, N. Ballas, A. Rohrbach, A. Courville and C. Pal
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
Generating Descriptions with Grounded and Co-Referenced People
A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
(Accepted/in press)
Movie Description
A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville and B. Schiele
International Journal of Computer Vision, Volume 123, Number 1, 2017
Abstract
Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset that contains transcribed ADs temporally aligned to full-length movies. In addition, we collected and aligned movie scripts used in prior work and compare the two sources of descriptions. In total, the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies. First, we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in a challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)" at ICCV 2015.
Generation and Grounding of Natural Language Descriptions for Visual Data
A. Rohrbach
PhD Thesis, Universität des Saarlandes, 2017
Abstract
Generating natural language descriptions for visual data links computer vision and computational linguistics. Being able to generate a concise and human-readable description of a video is a step towards visual understanding. At the same time, grounding natural language in visual data provides disambiguation for the linguistic concepts, which is necessary for many applications. This thesis focuses on both directions and tackles three specific problems. First, we develop recognition approaches to understand videos of complex cooking activities. We propose an approach to generate coherent multi-sentence descriptions for our videos. Furthermore, we tackle the new task of describing videos at a variable level of detail. Second, we present a large-scale dataset of movies and aligned professional descriptions. We propose an approach that learns from videos and sentences to describe movie clips, relying on robust recognition of visual semantic concepts. Third, we propose an approach to ground textual phrases in images with little or no localization supervision, which we further improve by introducing Multimodal Compact Bilinear Pooling for combining language and vision representations. Finally, we jointly address the task of describing videos and grounding the described people. To summarize, this thesis advances the state of the art in automatic video description and visual grounding and also contributes large datasets for studying the intersection of computer vision and computational linguistics.
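
The Multimodal Compact Bilinear (MCB) Pooling mentioned in the abstract combines a language and a vision feature vector by approximating their outer product with count sketches and a circular convolution computed via FFTs. The following is a minimal NumPy sketch of that idea under assumed dimensions and hashing; it is an illustration, not the released implementation.

    import numpy as np

    def count_sketch(x, h, s, d):
        # Project the 1-D feature vector x into a d-dimensional count sketch
        # using the index hash h and the random sign vector s.
        y = np.zeros(d)
        np.add.at(y, h, s * x)
        return y

    def mcb_pool(v, q, d=16000, seed=0):
        # Approximate the outer product of a visual vector v and a textual
        # vector q: count-sketch each modality, then combine the sketches by
        # circular convolution (element-wise product in the FFT domain).
        # The sketch dimension d and the fixed seed are assumptions.
        rng = np.random.RandomState(seed)
        h_v = rng.randint(d, size=v.size)
        s_v = rng.choice([-1.0, 1.0], size=v.size)
        h_q = rng.randint(d, size=q.size)
        s_q = rng.choice([-1.0, 1.0], size=q.size)
        sketch_v = count_sketch(v, h_v, s_v, d)
        sketch_q = count_sketch(q, h_q, s_q, d)
        return np.fft.irfft(np.fft.rfft(sketch_v) * np.fft.rfft(sketch_q), n=d)

In a grounding or question-answering model the pooled vector would feed into further layers; the hash functions are fixed once and reused for all examples, which the sketch above mimics by deriving them from a fixed seed.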
2016
Grounding of Textual Phrases in Images by Reconstruction
A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell and B. Schiele
Computer Vision -- ECCV 2016, 2016
Recognizing Fine-grained and Composite Activities Using Hand-centric Features and Script Data
M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal and B. Schiele
International Journal of Computer Vision, Volume 119, Number 3, 2016
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell and M. Rohrbach
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016
Commonsense in Parts: Mining Part-Whole Relations from the Web and Image Tags
N. Tandon, C. D. Hariman, J. Urbani, A. Rohrbach, M. Rohrbach and G. Weikum
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016
2015
A Dataset for Movie Description
A. Rohrbach, M. Rohrbach, N. Tandon and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
The Long-short Story of Movie Description
A. Rohrbach, M. Rohrbach and B. Schiele
Pattern Recognition (GCPR 2015), 2015
2014
Coherent Multi-sentence Video Description with Variable Level of Detail
A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal and B. Schiele
Pattern Recognition (GCPR 2014), 2014
Coherent Multi-sentence Video Description with Variable Level of Detail
A. Senina, M. Rohrbach, W. Qiu, A. Friedrich, S. Amin, M. Andriluka, M. Pinkal and B. Schiele
Technical Report, 2014
(arXiv: 1403.6173)
Abstract
Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description mainly focus on single-sentence generation and produce descriptions at a fixed level of detail. In this paper, we address both of these limitations: we produce coherent multi-sentence descriptions of complex videos at a variable level of detail. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from the SR. To produce consistent multi-sentence descriptions, we model across-sentence consistency at the level of the SR by enforcing a consistent topic. We also contribute to the visual recognition of objects by proposing a hand-centric approach, as well as to the robust generation of sentences using a word lattice. Human judges rate our multi-sentence descriptions as more readable, correct, and relevant than related work. To understand the difference between more detailed and shorter descriptions, we collect and analyze a video description corpus at three levels of detail.
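
The two-step approach summarized above (first predict a semantic representation from the video, then generate language from it) can be illustrated with a small, self-contained sketch. The SR fields, the placeholder predictor, and the template-based realization below are simplified assumptions for illustration, not the learned components used in the paper.

    from dataclasses import dataclass
    from typing import List, Sequence

    @dataclass
    class SemanticRepresentation:
        # A toy SR: the intermediate layer between video and language.
        # The fields are illustrative assumptions (cooking domain).
        activity: str
        ingredient: str
        tool: str

    def predict_sr(clip_features: Sequence[float]) -> SemanticRepresentation:
        # Step 1: video -> SR. Stands in for a trained visual predictor;
        # here it returns a fixed SR regardless of the input features.
        return SemanticRepresentation("cut", "carrot", "knife")

    def generate_sentence(sr: SemanticRepresentation) -> str:
        # Step 2: SR -> sentence. A fixed template instead of a learned generator.
        return f"The person {sr.activity}s the {sr.ingredient} with a {sr.tool}."

    def describe_video(clips: Sequence[Sequence[float]]) -> List[str]:
        # One sentence per clip; a full system would additionally enforce a
        # consistent topic across the predicted SRs for coherence.
        return [generate_sentence(predict_sr(clip)) for clip in clips]

The split makes the pipeline modular: the SR predictor can be improved (e.g., with hand-centric features) without touching the generation step, and vice versa.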