Computer Vision and Machine Learning

Mateusz Malinowski (PhD Student)

Max-Planck-Institut für Informatik
Saarland Informatics Campus

Personal Information

Research Interests

  • Synergy of Machine Vision and Natural Language Understanding
    • Question Answering based on Images
    • Text-to-image Retrieval
  • Deep Learning
  • Optimization methods


Research Projects


Reviewing Activities

  • Neural Information Processing Systems (NIPS)
  • Conference on Computer Vision and Pattern Recognition (CVPR)
  • European Conference on Computer Vision (ECCV)
  • Asian Conference on Computer Vision (ACCV)
  • The European Chapter of the ACL (EACL)
  • International Conference on Pattern Recognition (ICPR)
  • Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
  • International Journal of Computer Vision (IJCV)
  • Journal of Mathematical Imaging and Vision (JMIV)
  • Information Processing and Management (IPM)
  • IEEE Transactions on Computational Intelligence and AI in Games
  • Language and Linguistics Compass



Long-Term Image Boundary Prediction
A. Bhattacharyya, M. Malinowski, B. Schiele and M. Fritz
Thirty-Second AAAI Conference on Artificial Intelligence, 2018
Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions
M. Wagner, H. Basevi, R. Shetty, W. Li, M. Malinowski, M. Fritz and A. Leonardis
Computer Vision - ECCV 2018 Workshops, 2018
Towards Holistic Machines: From Visual Recognition To Question Answering About Real-world Images
M. Malinowski
PhD Thesis, Universität des Saarlandes, 2017
Computer Vision has undergone major changes over the last five years, largely driven by deep architectures. Here, we investigate whether the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and on the foundations of a Visual Turing Test, where scene understanding is tested by a series of questions about its content. In our studies, we propose DAQUAR, the first 'question answering about real-world images' dataset, together with two methods that address the problem: a symbolic-based and a neural-based visual question answering architecture. The symbolic-based method relies on a semantic parser, a database of visual facts, and a Bayesian formulation that accounts for various interpretations of the visual scene. The neural-based method is an end-to-end architecture composed of a question encoder, an image encoder, a multimodal embedding, and an answer decoder. This architecture has proven effective in capturing language-based biases and has become a standard component of other visual question answering architectures. Along with the methods, we also investigate various evaluation metrics that embrace uncertainty in a word's meaning, as well as various interpretations of the scene and the question.
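The neural-based pipeline described above (question encoder, image encoder, multimodal embedding, answer decoder) can be sketched in a few lines. The sketch below is only illustrative: the weights are random, a mean of word embeddings stands in for the LSTM question encoder, the image feature is assumed precomputed by a CNN, and all sizes and names (`W_embed`, `answer`, ...) are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real model uses an LSTM question encoder
# and a CNN image encoder, both stubbed out here.
VOCAB, EMB, IMG_FEAT, N_ANSWERS = 100, 32, 32, 10

W_embed = rng.normal(size=(VOCAB, EMB))     # word embeddings (LSTM stand-in)
W_img   = rng.normal(size=(IMG_FEAT, EMB))  # projects CNN features into the joint space
W_ans   = rng.normal(size=(EMB, N_ANSWERS)) # answer decoder: linear + softmax

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer(question_word_ids, cnn_feature):
    q = W_embed[question_word_ids].mean(axis=0)  # question encoder: mean embedding
    v = cnn_feature @ W_img                      # image encoder projection
    joint = q * v                                # multimodal embedding (elementwise fusion)
    return softmax(joint @ W_ans)                # distribution over a fixed answer set

probs = answer([3, 17, 42], rng.normal(size=IMG_FEAT))
```

The elementwise-product fusion is one of several simple choices (concatenation and summation being the obvious alternatives); what matters for the sketch is that question and image are mapped into one joint space before decoding an answer.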
Ask Your Neurons: A Deep Learning Approach to Visual Question Answering
M. Malinowski, M. Rohrbach and M. Fritz
International Journal of Computer Vision, Volume 125, Number 1-3, 2017
Spatio-Temporal Image Boundary Extrapolation
A. Bhattacharyya, M. Malinowski and M. Fritz
Technical Report, 2016
(arXiv: 1605.07363)
Boundary prediction in images and video has been a very active topic of research, and organizing visual information into boundaries and segments is believed to be a cornerstone of visual perception. While prior work has focused on predicting boundaries for observed frames, our work aims at predicting boundaries of future, unobserved frames. This requires our model to learn about the fate of boundaries and to extrapolate motion patterns. We experiment on an established real-world video segmentation dataset, which provides a testbed for this new task. We show, for the first time, spatio-temporal boundary extrapolation in this challenging scenario. Furthermore, we show long-term prediction of boundaries in situations where the motion is governed by the laws of physics. We successfully predict boundaries in a billiard scenario without assuming a strong parametric model or any notion of objects. We argue that, with minimal model assumptions, our model has derived a notion of 'intuitive physics' that can be applied to novel scenes.
Tutorial on Answering Questions about Images with Deep Learning
M. Malinowski and M. Fritz
Technical Report, 2016
(arXiv: 1610.01076)
Together with the development of more accurate methods in Computer Vision and Natural Language Understanding, holistic architectures that answer questions about the content of real-world images have emerged. In this tutorial, we build a neural-based approach to answering questions about images. We base our tutorial on two datasets: (mostly) DAQUAR and (to a lesser extent) VQA. With small tweaks, the models that we present here can achieve competitive performance on both datasets; in fact, they are among the best methods that combine an LSTM with a global, full-frame CNN representation of an image. We hope that after reading this tutorial, the reader will be able to use Deep Learning frameworks, such as Keras and the introduced Kraino, to build various architectures that lead to further performance improvements on this challenging task.
Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries
S. Nag Chowdhury, M. Malinowski, A. Bulling and M. Fritz
ICMR’16, ACM International Conference on Multimedia Retrieval, 2016
Multi-Cue Zero-Shot Learning with Strong Supervision
Z. Akata, M. Malinowski, M. Fritz and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Ask Your Neurons Again: Analysis of Deep Methods with Global Image Representation
M. Malinowski, M. Rohrbach and M. Fritz
IEEE Conference on Computer Vision and Pattern Recognition Workshops (VQA 2016), 2016
(Accepted/in press)
We address an open-ended question answering task about real-world images. With the help of currently available methods developed in Computer Vision and Natural Language Processing, we would like to push an architecture with a global visual representation to its limits. In our contribution, we show how to achieve competitive performance on VQA with global visual features (Residual Net) together with a carefully designed architecture.
Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task
A. Mokarian Forooshani, M. Malinowski and M. Fritz
Proceedings of the British Machine Vision Conference (BMVC 2016), 2016
Long Term Boundary Extrapolation for Deterministic Motion
A. Bhattacharyya, M. Malinowski and M. Fritz
NIPS Workshop on Intuitive Physics, 2016
Hard to Cheat: A Turing Test based on Answering Questions about Images
M. Malinowski and M. Fritz
Twenty-Ninth AAAI Conference on Artificial Intelligence W6, Beyond the Turing Test (AAAI 2015 W6, Beyond the Turing Test), 2015
(arXiv: 1501.03302)
Progress in language and image understanding by machines has sparked the interest of the research community in more open-ended, holistic tasks, and refueled an old AI dream of building intelligent machines. We discuss a few prominent challenges that characterize such holistic tasks and argue for "question answering about images" as a particularly appealing instance of such a holistic task. In particular, we point out that it is a version of a Turing Test that is likely to be more robust to over-interpretations, and we contrast it with tasks like grounding and generation of descriptions. Finally, we discuss tools to measure progress in this field.
Ask Your Neurons: A Neural-based Approach to Answering Questions About Images
M. Malinowski, M. Rohrbach and M. Fritz
ICCV 2015, IEEE International Conference on Computer Vision, 2015
A Multi-world Approach to Question Answering about Real-world Scenes based on Uncertain Input
M. Malinowski and M. Fritz
Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014
Towards a Visual Turing Challenge
M. Malinowski and M. Fritz
NIPS 2014 Workshop on Learning Semantics, 2014
(arXiv: 1410.8027)
As language and visual understanding by machines progresses rapidly, we are observing increasing interest in holistic architectures that tightly interlink both modalities in a joint learning and inference process. This trend has allowed the community to move towards more challenging and open tasks and has refueled the hope of achieving the old AI dream of building machines that could pass a Turing Test in open domains. In order to make steady progress towards this goal, we realize that quantifying performance becomes increasingly difficult. Therefore, we ask how we can precisely define such challenges and how we can evaluate different algorithms on these open tasks. In this paper, we summarize and discuss such challenges, and try to give answers where appropriate options are available in the literature. We exemplify some of the solutions on a recently presented dataset for a question-answering task based on real-world indoor images that establishes a visual Turing challenge. Finally, we argue that, despite the success of unique ground-truth annotation, we likely have to step away from carefully curated datasets and instead rely on 'social consensus' as the main driving force to create suitable benchmarks. Providing coverage of this inherently ambiguous output space is an emerging challenge that we face in order to make quantifiable progress in this area.
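One concrete way to operationalize such a 'social consensus' evaluation is to grant credit in proportion to how many annotators agree with a predicted answer. The snippet below is only an illustrative sketch: the function name and the threshold k=3 are invented for this example (a related min-consensus rule was later popularized by the VQA benchmark).

```python
def consensus_accuracy(predicted, human_answers, k=3):
    """Credit grows with the number of annotators who gave the predicted
    answer; full credit once at least k of them agree with it."""
    matches = sum(a == predicted for a in human_answers)
    return min(matches / k, 1.0)

# Four annotators, three of whom said "table":
full = consensus_accuracy("table", ["table", "desk", "table", "table"])
part = consensus_accuracy("desk", ["table", "desk", "table", "table"])
```

Here `full` is 1.0 while `part` is 1/3: the metric neither demands a single unique ground truth nor treats every disagreement as a hard failure, which is exactly the property argued for above.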
A Pooling Approach to Modelling Spatial Relations for Image Retrieval and Annotation
M. Malinowski and M. Fritz
Technical Report, 2014
(arXiv: 1411.5190)
Over the last two decades we have witnessed strong progress on modeling visual object classes, scenes and attributes that has significantly contributed to automated image understanding. On the other hand, surprisingly little progress has been made on incorporating a spatial representation and reasoning in the inference process. In this work, we propose a pooling interpretation of spatial relations and show how it improves image retrieval and annotation tasks involving spatial language. Due to the complexity of spatial language, we argue for a learning-based approach that acquires a representation of spatial relations by learning the parameters of the pooling operator. We show improvements over previous work on two datasets and two different tasks, and provide additional insights on a new dataset with an explicit focus on spatial relations.
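The 'pooling interpretation of spatial relations' can be illustrated with a toy sketch: each relation corresponds to a weight map over a spatial grid of local activations, and applying the relation is a weighted sum under that map. In the paper the weights are learned; here they are hand-set, and all names (`relation_pooling`, `above`, `below`) are invented for this example.

```python
import numpy as np

def relation_pooling(feature_grid, weights):
    """Pool an H x W grid of local activations with a per-cell weight map.
    One learned weight map per spatial relation ('above', 'left of', ...)."""
    return (feature_grid * weights).sum()

grid = np.zeros((4, 4))
grid[0, 2] = 1.0                              # an object detected in the top rows

above = np.zeros((4, 4)); above[:2, :] = 0.5  # 'above' emphasizes upper cells
below = np.zeros((4, 4)); below[2:, :] = 0.5  # 'below' emphasizes lower cells

score_above = relation_pooling(grid, above)
score_below = relation_pooling(grid, below)
```

An object near the top of the image scores high under the `above` map and low under `below`; learning the maps instead of hand-setting them is what lets the model adapt to how spatial terms are actually used in the data.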
Learnable Pooling Regions for Image Classification
M. Malinowski and M. Fritz
International Conference on Learning Representations Workshop Proceedings (ICLR 2013), 2013
(arXiv: 1301.3516)
Biologically inspired, from the early HMAX model to Spatial Pyramid Matching, pooling has played an important role in visual recognition pipelines. Spatial pooling, by grouping local codes, equips these methods with a certain degree of robustness to translation and deformation while preserving important spatial information. Despite the predominance of this approach in current recognition systems, we have seen little progress in fully adapting the pooling strategy to the task at hand. This paper proposes a model for learning a task-dependent pooling scheme that includes previously proposed hand-crafted pooling schemes as particular instantiations. In our work, we investigate the role of different regularization terms, showing that the smooth regularization term is crucial to achieve strong performance with the presented architecture. Finally, we propose an efficient and parallel method to train the model. Our experiments show improved performance over hand-crafted pooling schemes on the CIFAR-10 and CIFAR-100 datasets -- in particular improving the state-of-the-art to 56.29% on the latter.
Learning Smooth Pooling Regions for Visual Recognition
M. Malinowski and M. Fritz
Electronic Proceedings of the British Machine Vision Conference 2013 (BMVC 2013), 2013
From the early HMAX model to Spatial Pyramid Matching, spatial pooling has played an important role in visual recognition pipelines. By aggregating local statistics, it equips the recognition pipelines with a certain degree of robustness to translation and deformation while preserving spatial information. Despite its predominance in current recognition systems, we have seen little progress in fully adapting the pooling strategy to the task at hand. In this paper, we propose a flexible parameterization of the spatial pooling step and learn the pooling regions together with the classifier. We investigate a smoothness regularization term that, in conjunction with an efficient learning scheme, makes learning scalable. Our framework can work with both popular pooling operators: sum-pooling and max-pooling. Finally, we show the benefits of our approach for object recognition tasks based on visual words and for higher-level event recognition tasks based on object-bank features. In both cases, we improve over the hand-crafted spatial pooling step, showing the importance of its adaptation to the task.
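A minimal sketch of the idea, assuming the sum-pooling variant: the pooling step becomes a matrix of per-cell weights, one soft region per output dimension. In the actual work these weights are learned jointly with the classifier under a smoothness regularizer; here they are random stand-ins, and all sizes and names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

H, W, K, D = 4, 4, 3, 8     # spatial grid of local codes, K pooling regions,
                            # D-dimensional local descriptors

# Learnable pooling weights: each row is a soft spatial region.
# A spatial pyramid is the special case of hard 0/1 region masks.
pool_w = rng.uniform(size=(K, H * W))
pool_w /= pool_w.sum(axis=1, keepdims=True)   # normalize each region

def smooth_sum_pool(local_codes):
    """local_codes: (H*W, D) grid of local descriptors -> (K, D) pooled."""
    return pool_w @ local_codes

codes = rng.normal(size=(H * W, D))
pooled = smooth_sum_pool(codes)
```

Because the pooled representation is linear in `pool_w`, gradients with respect to the region weights are cheap, which is what makes joint training with the classifier tractable.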