D2
Computer Vision and Machine Learning

Anna Khoreva (Post-Doc)

Publications

2023
Intra-Source Style Augmentation for Improved Domain Generalization
Y. Li, D. Zhang, M. Keuper and A. Khoreva
2023 IEEE Winter Conference on Applications of Computer Vision (WACV 2023), 2023
2022
OASIS: Only Adversarial Supervision for Semantic Image Synthesis
V. Sushko, E. Schönfeld, D. Zhang, J. Gall, B. Schiele and A. Khoreva
International Journal of Computer Vision, Volume 130, 2022
Discovering Class-Specific GAN Controls for Semantic Image Synthesis
E. Schönfeld, J. Borges, V. Sushko, B. Schiele and A. Khoreva
Technical Report, 2022
(arXiv: 2212.01455)
Abstract
Prior work has extensively studied the latent space structure of GANs for unconditional image synthesis, enabling global editing of generated images by the unsupervised discovery of interpretable latent directions. However, the discovery of latent directions for conditional GANs for semantic image synthesis (SIS) has remained unexplored. In this work, we specifically focus on addressing this gap. We propose a novel optimization method for finding spatially disentangled class-specific directions in the latent space of pretrained SIS models. We show that the latent directions found by our method can effectively control the local appearance of semantic classes, e.g., changing their internal structure, texture or color independently from each other. Visual inspection and quantitative evaluation of the discovered GAN controls on various datasets demonstrate that our method discovers a diverse set of unique and semantically meaningful latent directions for class-specific edits.
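To make the class-specific editing concrete, below is a minimal sketch (not the authors' code) of applying a discovered latent direction only at pixels belonging to one semantic class. The generator, the direction vector and the class id are hypothetical placeholders; the per-pixel 3D noise tensor mimics the OASIS-style conditioning used by the SIS models above.

```python
# Minimal sketch: shift the latent code only where the label map equals
# the target class, leaving all other classes untouched. A real model
# would then decode the edited noise via G(z, label_map).
import numpy as np

def edit_class(z, label_map, direction, class_id, strength=2.0):
    """Apply a class-specific latent direction.

    z:         (C, H, W) per-pixel latent noise
    label_map: (H, W) integer semantic labels
    direction: (C,) discovered latent direction for the class
    """
    mask = (label_map == class_id)                      # (H, W) boolean
    z_edit = z.copy()
    z_edit[:, mask] += strength * direction[:, None]    # spatially local edit
    return z_edit

# Toy usage with random data.
rng = np.random.default_rng(0)
C, H, W = 64, 32, 32
z = rng.standard_normal((C, H, W))
labels = rng.integers(0, 5, size=(H, W))
d = rng.standard_normal(C)
d /= np.linalg.norm(d)                                  # unit-norm direction
z_edited = edit_class(z, labels, d, class_id=3)
assert np.allclose(z_edited[:, labels != 3], z[:, labels != 3])  # other classes unchanged
```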
2021
You Only Need Adversarial Supervision for Semantic Image Synthesis
E. Schönfeld, V. Sushko, D. Zhang, J. Gall, B. Schiele and A. Khoreva
International Conference on Learning Representations (ICLR 2021), 2021
2020
A U-Net Based Discriminator for Generative Adversarial Networks
E. Schönfeld, B. Schiele and A. Khoreva
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
2019
Lucid Data Dreaming for Video Object Segmentation
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
International Journal of Computer Vision, Volume 127, Number 9, 2019
2018
Video Object Segmentation with Language Referring Expressions
A. Khoreva, A. Rohrbach and B. Schiele
Computer Vision - ACCV 2018, 2018
Learning to Refine Human Pose Estimation
M. Fieraru, A. Khoreva, L. Pishchulin and B. Schiele
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2018), 2018
2017
Learning Video Object Segmentation from Static Images
A. Khoreva, F. Perazzi, R. Benenson, B. Schiele and A. Sorkine-Hornung
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Simple Does It: Weakly Supervised Instance and Semantic Segmentation
A. Khoreva, R. Benenson, J. Hosang, M. Hein and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Exploiting Saliency for Object Segmentation from Image Level Labels
S. J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz and B. Schiele
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017
Lucid Data Dreaming for Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
DAVIS Challenge on Video Object Segmentation 2017, 2017
Learning to Segment in Images and Videos with Different Forms of Supervision
A. Khoreva
PhD Thesis, Universität des Saarlandes, 2017
Abstract
Much progress has been made in image and video segmentation over the last years. To a large extent, the success can be attributed to strong appearance models learned entirely from data, in particular using deep learning methods. However, to perform best, these methods require large representative training datasets with expensive pixel-level annotations, which in the case of videos are prohibitive to obtain. There is therefore a need to relax this constraint and to consider alternative forms of supervision that are easier and cheaper to collect. In this thesis, we aim to develop algorithms for learning to segment in images and videos with different levels of supervision.
First, we develop approaches for training convolutional networks with weaker forms of supervision, such as bounding boxes or image labels, for object boundary estimation and semantic/instance labelling tasks. We propose to generate pixel-level approximate ground truth from these weaker forms of annotation to train a network, which allows us to achieve high-quality results comparable to full supervision without any modifications of the network architecture or the training procedure.
Second, we address the problem of the excessive computational and memory costs inherent to solving video segmentation via graphs. We propose approaches that improve the runtime and memory efficiency, as well as the output segmentation quality, by learning the best representation of the graph from the available training data. In particular, we contribute methods for learning must-link constraints, the topology and edge weights of the graph, as well as for enhancing the graph nodes (superpixels) themselves.
Third, we tackle the task of pixel-level object tracking and address the problem of the limited amount of densely annotated video data for training convolutional networks. We introduce an architecture that allows training with static images only, and propose an elaborate data synthesis scheme that creates a large number of training examples close to the target domain from the given first-frame mask. With the proposed techniques we show that densely annotated consecutive video data is not necessary to achieve high-quality, temporally coherent video segmentation results.
In summary, this thesis advances the state of the art in weakly supervised image segmentation, graph-based video segmentation and pixel-level object tracking, and contributes new ways of training convolutional networks with a limited amount of pixel-level annotated training data.
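As an illustration of the first contribution, turning box annotations into approximate pixel-level ground truth, here is a minimal sketch using OpenCV's GrabCut. The thesis builds on refined variants of this idea; the file name and box coordinates are placeholders.

```python
# Minimal sketch: approximate a foreground mask from a bounding box with
# GrabCut, yielding pixel-level pseudo ground truth for network training.
import cv2
import numpy as np

def box_to_mask(image, box, iters=5):
    """image: HxWx3 uint8 BGR image; box: (x, y, w, h) around the object."""
    mask = np.zeros(image.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(image, mask, box, bgd, fgd, iters, cv2.GC_INIT_WITH_RECT)
    # Pixels marked (probably) foreground become the approximate ground truth.
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return fg.astype(np.uint8)

image = cv2.imread("frame.jpg")                       # placeholder path
approx_gt = box_to_mask(image, box=(50, 40, 120, 160))  # placeholder box
```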
Lucid Data Dreaming for Multiple Object Tracking
A. Khoreva, R. Benenson, E. Ilg, T. Brox and B. Schiele
Technical Report, 2017
(arXiv: 1703.09554)
Abstract
Convolutional networks reach top quality in pixel-level object tracking but require a large amount of training data (1k ~ 10k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20x ~ 100x less annotated data than competing methods. Instead of using large training sets in the hope of generalizing across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high-quality appearance- and motion-based models, as well as to tune the post-processing stage. This approach makes it possible to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the tracking task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and how much general "objectness" knowledge are required for the object tracking task.
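A minimal sketch of the lucid-dreaming idea follows: re-composite the first-frame object over an inpainted background under random transforms to synthesize in-domain training pairs. The real pipeline also models illumination changes, non-rigid deformation and two-frame motion; this toy version only does affine jitter, and the file names are placeholders.

```python
# Minimal sketch: synthesize (image, mask) training pairs from a single
# annotated frame by warping the object over a roughly inpainted background.
import cv2
import numpy as np

def dream_pair(frame, mask, rng, max_shift=20, max_angle=15):
    """frame: HxWx3 uint8; mask: HxW uint8 (nonzero = object)."""
    h, w = mask.shape
    background = cv2.inpaint(frame, mask, 5, cv2.INPAINT_TELEA)
    angle = rng.uniform(-max_angle, max_angle)
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    M[:, 2] += (tx, ty)
    warped_obj = cv2.warpAffine(frame, M, (w, h))
    warped_mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    out = np.where(warped_mask[..., None] > 0, warped_obj, background)
    return out, warped_mask

frame = cv2.imread("first_frame.jpg")                      # placeholder
mask = cv2.imread("first_mask.png", cv2.IMREAD_GRAYSCALE)  # placeholder, 0/255
rng = np.random.default_rng(0)
pairs = [dream_pair(frame, mask, rng) for _ in range(100)]  # in-domain training set
```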
2016
Weakly Supervised Object Boundaries
A. Khoreva, R. Benenson, M. Omran, M. Hein and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Abstract
State-of-the-art learning-based boundary detection methods require extensive training data. Since labelling object boundaries is one of the most expensive types of annotation, there is a need to relax the requirement for carefully annotated images, both to make training more affordable and to extend the amount of training data. In this paper we propose a technique to generate weakly supervised annotations and show that bounding box annotations alone suffice to reach high-quality object boundaries without using any object-specific boundary annotations. With the proposed weak supervision techniques we achieve top performance on the object boundary detection task, outperforming the current fully supervised state-of-the-art methods by a large margin.
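A minimal sketch of one way such weak boundary annotations can be derived: given approximate object masks (e.g., obtained from boxes as in the GrabCut example above), the boundary target is simply the morphological gradient of each mask. This illustrates the general recipe, not the paper's exact pipeline.

```python
# Minimal sketch: turn an approximate binary object mask into a binary
# boundary map via the morphological gradient (dilation minus erosion).
import cv2
import numpy as np

def mask_to_boundary(mask, thickness=1):
    """mask: HxW uint8 with values in {0, 1}; returns a boundary map."""
    k = 2 * thickness + 1
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (k, k))
    dilated = cv2.dilate(mask, kernel)
    eroded = cv2.erode(mask, kernel)
    return ((dilated - eroded) > 0).astype(np.uint8)  # 1 on the object contour

# Toy usage: a filled rectangle yields its outline as the boundary target.
toy = np.zeros((64, 64), np.uint8)
cv2.rectangle(toy, (16, 16), (48, 48), 1, -1)
boundary = mask_to_boundary(toy)
```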
Improved Image Boundaries for Better Video Segmentation
A. Khoreva, R. Benenson, F. Galasso, M. Hein and B. Schiele
Computer Vision -- ECCV 2016 Workshops, 2016
Abstract
Graph-based video segmentation methods rely on superpixels as a starting point. While most previous work has focused on the construction of the graph edges and weights, as well as on solving the graph partitioning problem, this paper focuses on better superpixels for video segmentation. We demonstrate through a comparative analysis that superpixels extracted from boundaries perform best, and show that boundary estimation can be significantly improved via image and time domain cues. With superpixels generated from our better boundaries we observe consistent improvements for two video segmentation methods on two different datasets.
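A minimal sketch of boundary-aligned superpixels: grow watershed regions on an edge-strength map so that superpixel borders snap to image boundaries. scikit-image's Sobel edges stand in for the learned boundary estimator discussed in the paper, and a built-in sample image is used.

```python
# Minimal sketch: seed a regular grid of markers, then flood a watershed
# on the edge map so region borders align with image boundaries.
import numpy as np
from skimage import color, data, filters, segmentation, util

image = util.img_as_float(data.astronaut())
gray = color.rgb2gray(image)
edges = filters.sobel(gray)               # stand-in for a learned boundary map

step = 20                                  # grid spacing controls superpixel size
markers = np.zeros_like(gray, dtype=np.int32)
rows, cols = np.mgrid[step // 2:gray.shape[0]:step, step // 2:gray.shape[1]:step]
markers[rows, cols] = np.arange(1, rows.size + 1).reshape(rows.shape)

superpixels = segmentation.watershed(edges, markers)
print("number of superpixels:", superpixels.max())
```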
2015
Classifier Based Graph Construction for Video Segmentation
A. Khoreva, F. Galasso, M. Hein and B. Schiele
IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015), 2015
2014
Learning Must-Link Constraints for Video Segmentation Based on Spectral Clustering
A. Khoreva, F. Galasso, M. Hein and B. Schiele
Pattern Recognition (GCPR 2014), 2014