Anna Rohrbach (PhD Student)

MSc Anna Rohrbach

Max-Planck-Institut für Informatik
Saarland Informatics Campus
Campus E1 4
66123 Saarbrücken
E1 4 - Room 620
+49 681 9325 2111
+49 681 9325 2099

Personal Information

Research Interests

  • Computer Vision
  • Computational Linguistics
  • Machine Learning


Education

2008-2010  M.Sc. in Applied Mathematics, Odessa I. I. Mechnikov National University, Odessa, Ukraine

Research Projects


See my Google Scholar page.


News

  • Our paper "Grounding of Textual Phrases in Images by Reconstruction" was selected for an oral presentation at ECCV 2016; the video of the talk is available here.
  • MaxPlanckResearch magazine writes about our work on video description.
  • I am co-organizing the "Joint 2nd Workshop on Storytelling with Images and Videos (VisStory) and Large Scale Movie Description and Understanding Challenge (LSMDC 2016)" in conjunction with ECCV 2016 in Amsterdam, The Netherlands.


Publications

Movie Description
A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville and B. Schiele
International Journal of Computer Vision, 2017 (Online First)
Audio Description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs that are temporally aligned to full-length movies. In addition, we collected and aligned movie scripts used in prior work and compare the two sources of descriptions. In total, the Large Scale Movie Description Challenge (LSMDC) contains a parallel corpus of 118,114 sentences and video clips from 202 movies. First, we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are indeed more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in a challenge organized in the context of the workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)" at ICCV 2015.
Grounding of Textual Phrases in Images by Reconstruction
A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell and B. Schiele
Computer Vision -- ECCV 2016, 2016
Recognizing Fine-grained and Composite Activities Using Hand-centric Features and Script Data
M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal and B. Schiele
International Journal of Computer Vision, Volume 119, Number 3, 2016
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell and M. Rohrbach
Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), 2016
Commonsense in Parts: Mining Part-Whole Relations from the Web and Image Tags
N. Tandon, C. D. Hariman, J. Urbani, A. Rohrbach, M. Rohrbach and G. Weikum
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016
A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-blank Question-answering
T. Maharaj, N. Ballas, A. Rohrbach, A. Courville and C. Pal
Technical Report, 2016
(arXiv: 1611.07810)
While deep convolutional neural networks frequently approach or exceed human-level performance at benchmark tasks involving static images, extending this success to moving images is not straightforward. Models that can learn to understand video are of interest for many applications, including content recommendation, prediction, summarization, event/object detection, and understanding human visual perception, but many domains lack sufficient data to explore and perfect video models. To address the need for a simple, quantitative benchmark for developing and understanding video, we present MovieFIB, a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired. In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of five different models' predictions and compare these with human performance. We investigate the relative importance of language, static (2D) visual features, and moving (3D) visual features, as well as the effects of increasing dataset size, the number of frames sampled, and vocabulary size. We show that this task is not solvable by a language model alone, that our model combining 2D and 3D visual information indeed provides the best result, and that all models perform significantly worse than human level. We provide human evaluations of responses given by different models and find that accuracy on the MovieFIB evaluation corresponds well with human judgement. We suggest avenues for improving video models and hope that the proposed dataset can be useful for measuring and encouraging progress in this very interesting field.
A Dataset for Movie Description
A. Rohrbach, M. Rohrbach, N. Tandon and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
The Long-short Story of Movie Description
A. Rohrbach, M. Rohrbach and B. Schiele
Pattern Recognition (GCPR 2015), 2015
Coherent Multi-sentence Video Description with Variable Level of Detail
A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal and B. Schiele
Pattern Recognition (GCPR 2014), 2014
Coherent Multi-sentence Video Description with Variable Level of Detail
A. Senina, M. Rohrbach, W. Qiu, A. Friedrich, S. Amin, M. Andriluka, M. Pinkal and B. Schiele
Technical Report, 2014
(arXiv: 1403.6173)
Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description mainly focus on single-sentence generation and produce descriptions at a fixed level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from the SR. To produce consistent multi-sentence descriptions, we model across-sentence consistency at the level of the SR by enforcing a consistent topic. We also contribute to the visual recognition of objects by proposing a hand-centric approach, as well as to the robust generation of sentences using a word lattice. Human judges rate our multi-sentence descriptions as more readable, correct, and relevant than those of related work. To understand the difference between more detailed and shorter descriptions, we collect and analyze a video description corpus with three levels of detail.