Describing Videos with Natural Language

Anna Rohrbach & Marcus Rohrbach

An important aspect of automated systems is the ability to communicate to humans what they recognize or “see”. However, current computer vision approaches typically focus on generating isolated labels for a video (e.g., “slicing”, “cucumber”, “plate”), which are not well suited for communication with humans. We thus study the problem of generating natural language descriptions of videos. A corresponding sentence for the example above could be “Someone sliced the cucumber on the plate”. Describing videos with natural language is important, e.g., for the automatic captioning of web videos, human-robot interaction, or assisting visually impaired people. In our work, we study two scenarios. First, we generate descriptions of cooking videos. Second, we study the problem of generating audio descriptions for movies to enable blind people to follow a movie without seeing it.

Generating descriptions for cooking videos

In order to study the problem of automatic video description, we propose to learn how to “translate” a video snippet into a natural language sentence using techniques from statistical machine translation between two languages, e.g., French and English. To train our translation approach, we need pairs of videos and sentences depicting various cooking activities (e.g., open tin, stir pasta), which we have collected at a large scale.
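The translation idea can be illustrated with a minimal sketch: recognized video labels form an intermediate semantic representation, which is mapped to an English sentence in a way loosely analogous to a phrase table in statistical machine translation. All labels, roles, and phrase mappings below are made up for illustration and are not the actual system.

```python
# Hypothetical phrase table: (semantic role, label) -> English phrase.
# In a real SMT-based system these mappings would be learned from
# aligned video-sentence pairs, not hand-written.
PHRASES = {
    ("activity", "slice"): "sliced",
    ("object", "cucumber"): "the cucumber",
    ("location", "plate"): "on the plate",
}

def translate(labels):
    """Render (role, label) predictions as a simple sentence."""
    activity = PHRASES.get(("activity", labels["activity"]), labels["activity"])
    obj = PHRASES.get(("object", labels["object"]), labels["object"])
    loc = PHRASES.get(("location", labels["location"]), labels["location"])
    return f"Someone {activity} {obj} {loc}."

print(translate({"activity": "slice", "object": "cucumber", "location": "plate"}))
# -> Someone sliced the cucumber on the plate.
```

The learned system replaces the hand-written table with statistics estimated from training data, but the pipeline shape (labels in, sentence out) is the same.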

In contrast to related work, we do not only describe a short video snippet with a single sentence, but also a long video with multiple sentences. To ensure that we generate a consistent description, we propose to model the topic of the video shared by all short video snippets within a long video. In our kitchen scenario, the topic is a dish to be cooked, e.g. preparing pasta.
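The effect of a shared topic can be sketched as combining per-snippet visual scores with a video-level topic (dish) prior, so that all sentences of one video agree with the same dish. The scores and priors below are invented for illustration.

```python
# Hypothetical topic priors P(object | dish); a real system would
# estimate these from training data rather than hard-coding them.
TOPIC_PRIOR = {
    "preparing pasta": {"pasta": 0.6, "cucumber": 0.1},
    "preparing salad": {"pasta": 0.05, "cucumber": 0.5},
}

def best_label(visual_scores, topic):
    """Pick the object label maximizing visual score * topic prior."""
    prior = TOPIC_PRIOR[topic]
    return max(visual_scores, key=lambda o: visual_scores[o] * prior.get(o, 1e-3))

# The visual classifier alone slightly prefers "cucumber", but the
# video-level topic "preparing pasta" disambiguates in favor of "pasta".
scores = {"pasta": 0.45, "cucumber": 0.55}
print(best_label(scores, "preparing pasta"))
# -> pasta
```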

We also explore the novel task of describing videos at multiple levels of detail. Prior work has focused on describing videos at a fixed level of abstraction, whereas our system is able to produce detailed, short, and single-sentence descriptions of cooking videos [figure 1].

Figure 1: An example output of our system, which automatically generated a detailed, short and single-sentence description of a video.

Detailed: A man took a cutting board and knife from the drawer. He took out an orange from the refrigerator. Then, he took a knife from the drawer. He juiced one half of the orange. Next, he opened the refrigerator. He cut the orange with the knife. The man threw away the skin. He got a glass from the cabinet. Then, he poured the juice into the glass. Finally, he placed the orange in the sink.
Short: A man juiced the orange. Next, he cut the orange in half. Finally, he poured the juice into a glass.
One sentence: A man juiced the orange.
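One simple way to think about multiple levels of detail, sketched here under assumed relevance scores, is to attach a relevance score to each segment sentence and keep only the top-scoring segments for shorter descriptions. The sentences and scores below are illustrative, not system output.

```python
# Hypothetical (relevance, sentence) pairs for one cooking video;
# a real system would predict relevance from the video content.
SEGMENTS = [
    (0.9, "A man juiced the orange."),
    (0.7, "He cut the orange in half."),
    (0.6, "He poured the juice into a glass."),
    (0.2, "He took a knife from the drawer."),
]

def describe(segments, max_sentences):
    """Keep the most relevant segments, preserving temporal order."""
    top = sorted(segments, key=lambda s: -s[0])[:max_sentences]
    ordered = [s for s in segments if s in top]
    return " ".join(sent for _, sent in ordered)

print(describe(SEGMENTS, 1))  # single-sentence description
print(describe(SEGMENTS, 3))  # short description
```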

Movie description

Existing video description datasets focus on short video snippets, are limited in size, or restricted to the cooking scenario. In order to overcome these limitations, we propose a new dataset of movies and associated textual descriptions. We make use of two sources of text data, namely movie scripts (screenplays) and audio descriptions available for many DVDs and Blu-rays. Audio descriptions
provide linguistic descriptions of movies and allow visually impaired people to follow along with friends and family.

The collected dataset additionally opens new possibilities to understand stories and plots across multiple sentences in an open domain on a large scale [figure 2].

Figure 2: Example textual descriptions from our dataset, aligned to three movie snippets; the first description of each pair comes from the audio description (AD), the second from the movie script.

AD: Abby gets in the basket.
Script: After a moment a frazzled Abby pops up in his place.

AD: Mike leans over and sees how high they are.
Script: Mike looks down to see – they are now fifteen feet above the ground.

AD: Abby clasps her hands around his face and kisses him passionately.
Script: For the first time in her life, she stops thinking and grabs Mike and kisses the hell out of him.


Anna Rohrbach

DEPT. 2 Computer Vision and Multimodal Computing
Phone +49 681 9325-2111

Marcus Rohrbach

DEPT. 2 Computer Vision and Multimodal Computing
Phone +49 681 9325-2111