Mateusz Malinowski (PhD Student)

MSc Mateusz Malinowski
- Address: Max-Planck-Institut für Informatik, Saarland Informatics Campus
Personal Information
Research Interests
- Synergy of Machine Vision and Natural Language Understanding
- Question Answering based on Images
- Text-to-image Retrieval
- Deep Learning
- Optimization methods
Education
- Saarland University, Computer Science, Master's Degree (Honors Degree), Germany
- University of Wrocław, Computer Science, Poland
Research Projects
- Visual Turing Challenge
- Tutorial on Visual Turing Test
- Learning Spatial Relations
- Learning Smooth Pooling Regions for Visual Recognition
Students
- Ashkan Mokarian, 2016
- Master's Thesis co-advisor, main supervisor is Dr. Mario Fritz
- Title: "Deep Learning for Filling Blanks in Image Captions"
- Sreyasi Nag Chowdhury, 2015
- Master's Thesis co-advisor, main supervisors: Dr. Mario Fritz and Dr. Andreas Bulling
- Title: "Contextual Media Retrieval Using Natural Language Queries"
- Now PhD student at MPI D5: Databases and Information Systems
Teaching
- Deep Learning Seminar 2015, teaching assistant
- Probabilistic Graphical Models and their Applications 2013, teaching assistant
Reviewer
- Neural Information Processing Systems (NIPS)
- Conference on Computer Vision and Pattern Recognition (CVPR)
- European Conference on Computer Vision (ECCV)
- Asian Conference on Computer Vision (ACCV)
- The European Chapter of the ACL (EACL)
- International Conference on Pattern Recognition (ICPR)
- Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
- International Journal of Computer Vision (IJCV)
- Journal of Mathematical Imaging and Vision (JMIV)
- Information Processing and Management (IPM)
- IEEE Transactions on Computational Intelligence and AI in Games
- Language and Linguistics Compass
Publications
2017
Towards Holistic Machines: From Visual Recognition To Question Answering About Real-world Images
M. Malinowski
PhD Thesis, Universität des Saarlandes, 2017
Abstract
Computer Vision has undergone major changes over the past five years. Here, we investigate whether the performance of the architectures behind these changes generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and on the foundations of a Visual Turing Test, where scene understanding is tested by a series of questions about its content. In our studies, we propose DAQUAR, the first 'question answering about real-world images' dataset, together with two methods, a symbolic-based and a neural-based visual question answering architecture, that address the problem. The symbolic-based method relies on a semantic parser, a database of visual facts, and a Bayesian formulation that accounts for various interpretations of the visual scene. The neural-based method is an end-to-end architecture composed of a question encoder, image encoder, multimodal embedding, and answer decoder. This architecture has proven effective in capturing language-based biases and has become a standard component of other visual question answering architectures. Along with the methods, we also investigate evaluation metrics that embrace uncertainty in a word's meaning and various interpretations of the scene and the question.
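The neural-based pipeline described above (question encoder, image encoder, multimodal embedding, answer decoder) can be sketched in a few lines of NumPy. The dimensions, the bag-of-words question encoder, and the random weights below are illustrative assumptions, not the thesis's actual model, whose parameters are learned end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, ANSWERS, D_TXT, D_IMG, D_JOINT = 100, 10, 32, 64, 48

# Illustrative random parameters; a real model learns these end to end.
W_txt = rng.normal(0, 0.1, (VOCAB, D_TXT))      # question word embeddings
W_q   = rng.normal(0, 0.1, (D_TXT, D_JOINT))    # question projection
W_img = rng.normal(0, 0.1, (D_IMG, D_JOINT))    # image projection
W_ans = rng.normal(0, 0.1, (D_JOINT, ANSWERS))  # answer decoder

def answer(question_token_ids, image_features):
    """Encode question and image, fuse them, decode an answer-class index."""
    q = W_txt[question_token_ids].mean(axis=0)         # question encoder (bag of words)
    joint = np.tanh(q @ W_q + image_features @ W_img)  # multimodal embedding
    logits = joint @ W_ans                             # answer decoder
    return int(np.argmax(logits))

img = rng.normal(size=D_IMG)      # stands in for CNN image features
pred = answer([3, 17, 42], img)   # token ids stand in for a tokenized question
print(pred)                       # an answer-class index in [0, ANSWERS)
```

In the actual architectures the bag-of-words encoder is replaced by an LSTM and the image features come from a CNN, but the fuse-then-decode structure is the same.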
2016
Spatio-Temporal Image Boundary Extrapolation
A. Bhattacharyya, M. Malinowski and M. Fritz
Technical Report, 2016
(arXiv: 1605.07363)
Abstract
Boundary prediction in images and video has been a very active topic of research, and organizing visual information into boundaries and segments is believed to be a cornerstone of visual perception. While prior work has focused on predicting boundaries for observed frames, our work aims at predicting boundaries of future, unobserved frames. This requires our model to learn about the fate of boundaries and extrapolate motion patterns. We experiment on an established real-world video segmentation dataset, which provides a testbed for this new task. We show for the first time spatio-temporal boundary extrapolation in this challenging scenario. Furthermore, we show long-term prediction of boundaries in situations where the motion is governed by the laws of physics. We successfully predict boundaries in a billiard scenario without assuming a strong parametric model or any object notion. We argue that our model, under minimalistic model assumptions, has derived a notion of 'intuitive physics' that can be applied to novel scenes.
Tutorial on Answering Questions about Images with Deep Learning
M. Malinowski and M. Fritz
Technical Report, 2016
(arXiv: 1610.01076)
Abstract
Together with the development of more accurate methods in Computer Vision and Natural Language Understanding, holistic architectures that answer questions about the content of real-world images have emerged. In this tutorial, we build a neural-based approach to answer questions about images. We base our tutorial on two datasets: (mostly) DAQUAR and (briefly) VQA. With small tweaks, the models we present here can achieve competitive performance on both datasets; in fact, they are among the best methods that combine an LSTM with a global, full-frame CNN representation of an image. We hope that after reading this tutorial, the reader will be able to use Deep Learning frameworks, such as Keras and the introduced Kraino, to build various architectures that lead to further performance improvements on this challenging task.
Ask Your Neurons Again: Analysis of Deep Methods with Global Image Representation
M. Malinowski, M. Rohrbach and M. Fritz
IEEE Conference on Computer Vision and Pattern Recognition Workshops (VQA 2016), 2016
(Accepted/in press)
Abstract
We address an open-ended question answering task about real-world images. With the help of currently available methods developed in Computer Vision and Natural Language Processing, we push an architecture with a global visual representation to its limits. In our contribution, we show how to achieve competitive performance on VQA with global visual features (Residual Net) together with a carefully designed architecture.
Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task
A. Mokarian Forooshani, M. Malinowski and M. Fritz
Proceedings of the British Machine Vision Conference (BMVC 2016), 2016
Long Term Boundary Extrapolation for Deterministic Motion
A. Bhattacharyya, M. Malinowski and M. Fritz
NIPS Workshop on Intuitive Physics, 2016
2015
Hard to Cheat: A Turing Test based on Answering Questions about Images
M. Malinowski and M. Fritz
Twenty-Ninth AAAI Conference on Artificial Intelligence W6, Beyond the Turing Test (AAAI 2015 W6, Beyond the Turing Test), 2015
(arXiv: 1501.03302)
Abstract
Progress in language and image understanding by machines has sparked the interest of the research community in more open-ended, holistic tasks, and refueled an old AI dream of building intelligent machines. We discuss a few prominent challenges that characterize such holistic tasks and argue for "question answering about images" as a particularly appealing instance of such a holistic task. In particular, we point out that it is a version of a Turing Test that is likely to be more robust to over-interpretations, and contrast it with tasks like grounding and generation of descriptions. Finally, we discuss tools to measure progress in this field.
Ask Your Neurons: A Neural-based Approach to Answering Questions About Images
M. Malinowski, M. Rohrbach and M. Fritz
ICCV 2015, IEEE International Conference on Computer Vision, 2015
2014
A Multi-world Approach to Question Answering about Real-world Scenes based on Uncertain Input
M. Malinowski and M. Fritz
Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014
Towards a Visual Turing Challenge
M. Malinowski and M. Fritz
NIPS 2014 Workshop on Learning Semantics, 2014
(arXiv: 1410.8027)
Abstract
As language and visual understanding by machines progresses rapidly, we are observing an increasing interest in holistic architectures that tightly interlink both modalities in a joint learning and inference process. This trend has allowed the community to progress towards more challenging and open tasks and has refueled the hope of achieving the old AI dream of building machines that could pass a Turing test in open domains. In order to make steady progress towards this goal, we realize that quantifying performance becomes increasingly difficult. Therefore, we ask how we can precisely define such challenges and how we can evaluate different algorithms on these open tasks. In this paper, we summarize and discuss such challenges, and try to give answers where appropriate options are available in the literature. We exemplify some of the solutions on a recently presented dataset for a question-answering task based on real-world indoor images that establishes a visual Turing challenge. Finally, we argue that despite the success of unique ground-truth annotation, we likely have to step away from carefully curated datasets and rather rely on 'social consensus' as the main driving force to create suitable benchmarks. Providing coverage in this inherently ambiguous output space is an emerging challenge that we face in order to make quantifiable progress in this area.
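One concrete reading of the 'social consensus' idea above: instead of comparing a predicted answer against a single ground truth, score it against the answers of several annotators. The minimal sketch below is illustrative only and is not the exact metric proposed in the paper.

```python
def consensus_score(prediction, human_answers):
    """Fraction of annotators whose answer the prediction matches exactly."""
    matches = sum(prediction == a for a in human_answers)
    return matches / len(human_answers)

# "table" agrees with 2 of the 3 annotators.
print(consensus_score("table", ["table", "table", "desk"]))  # 0.666...
```

A benchmark built this way rewards answers that many humans would also give, instead of penalizing every deviation from one curated label.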
A Pooling Approach to Modelling Spatial Relations for Image Retrieval and Annotation
M. Malinowski and M. Fritz
Technical Report, 2014
(arXiv: 1411.5190)
Abstract
Over the last two decades we have witnessed strong progress on modeling visual object classes, scenes, and attributes that has significantly contributed to automated image understanding. On the other hand, surprisingly little progress has been made on incorporating spatial representation and reasoning in the inference process. In this work, we propose a pooling interpretation of spatial relations and show how it improves image retrieval and annotation tasks involving spatial language. Due to the complexity of spatial language, we argue for a learning-based approach that acquires a representation of spatial relations by learning the parameters of the pooling operator. We show improvements over previous work on two datasets and two different tasks, and provide additional insights on a new dataset with an explicit focus on spatial relations.
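The pooling interpretation of spatial relations can be illustrated with a toy example: treat a relation such as "above" as a weight map over image cells that is applied as a pooling operator. In the paper these weights are learned; the hand-coded masks and the 4x4 grid below are assumptions made purely for illustration.

```python
import numpy as np

def relation_pooling(feature_grid, weights):
    """Pool an HxW response grid with a per-cell weight map encoding one relation."""
    return float((feature_grid * weights).sum())

H = W = 4
# Hand-coded stand-ins for *learned* pooling weights of two relations.
above = np.zeros((H, W)); above[: H // 2, :] = 1.0   # mass on the upper half
below = np.zeros((H, W)); below[H // 2 :, :] = 1.0   # mass on the lower half

scene = np.zeros((H, W))
scene[0, 1] = 1.0   # detector response for an object near the top of the image

print(relation_pooling(scene, above))  # 1.0 -> the object scores as "above"
print(relation_pooling(scene, below))  # 0.0
```

Learning the weight maps instead of hand-coding them is what lets the model capture the fuzzier spatial relations found in natural language.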
2013
Learnable Pooling Regions for Image Classification
M. Malinowski and M. Fritz
International Conference on Learning Representations Workshop Proceedings (ICLR 2013), 2013
(arXiv: 1301.3516)
Abstract
From the biologically inspired early HMAX model to Spatial Pyramid Matching, pooling has played an important role in visual recognition pipelines. Spatial pooling, by grouping local codes, equips these methods with a certain degree of robustness to translation and deformation while preserving important spatial information. Despite the predominance of this approach in current recognition systems, we have seen little progress in fully adapting the pooling strategy to the task at hand. This paper proposes a model for learning a task-dependent pooling scheme -- including previously proposed hand-crafted pooling schemes as a particular instantiation. In our work, we investigate the role of different regularization terms, showing that the smoothness regularization term is crucial for achieving strong performance with the presented architecture. Finally, we propose an efficient and parallel method to train the model. Our experiments show improved performance over hand-crafted pooling schemes on the CIFAR-10 and CIFAR-100 datasets -- in particular, improving the state of the art to 56.29% on the latter.
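The smoothness regularization term can be sketched as a penalty on differences between pooling weights of neighbouring grid cells, so that learned pooling regions become spatially coherent. The penalty form and the shapes below are assumptions for illustration, not the paper's precise objective.

```python
import numpy as np

def smoothness_penalty(P, H, W):
    """Sum of squared differences between pooling weights of neighbouring cells.

    P has shape (regions, H*W); each row is one learnable pooling region.
    """
    G = P.reshape(-1, H, W)
    dh = np.diff(G, axis=1)  # differences between vertical neighbours
    dw = np.diff(G, axis=2)  # differences between horizontal neighbours
    return float((dh ** 2).sum() + (dw ** 2).sum())

H = W = 4
smooth = np.ones((1, H * W))                        # constant region: maximally smooth
rough = np.zeros((1, H * W)); rough[0, ::2] = 1.0   # alternating weights: rough

print(smoothness_penalty(smooth, H, W))  # 0.0
print(smoothness_penalty(rough, H, W))   # 12.0
```

Adding such a penalty to the classification loss pushes the learned pooling regions towards the contiguous, blob-like shapes that hand-crafted schemes assume by construction.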
Learning Smooth Pooling Regions for Visual Recognition
M. Malinowski and M. Fritz
Electronic Proceedings of the British Machine Vision Conference 2013 (BMVC 2013), 2013
Abstract
From the early HMAX model to Spatial Pyramid Matching, spatial pooling has played an important role in visual recognition pipelines. By aggregating local statistics, it equips recognition pipelines with a certain degree of robustness to translation and deformation while preserving spatial information. Despite its predominance in current recognition systems, we have seen little progress in fully adapting the pooling strategy to the task at hand. In this paper, we propose a flexible parameterization of the spatial pooling step and learn the pooling regions together with the classifier. We investigate a smoothness regularization term that, in conjunction with an efficient learning scheme, makes learning scalable. Our framework works with both popular pooling operators: sum-pooling and max-pooling. Finally, we show the benefits of our approach for object recognition tasks based on visual words and for higher-level event recognition tasks based on object-bank features. In both cases, we improve over the hand-crafted spatial pooling step, showing the importance of adapting it to the task.
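The two pooling operators the framework supports can be illustrated on a handful of local codes; the feature values and the boolean region mask below are made up for the example.

```python
import numpy as np

def pool(codes, region, op):
    """Pool local codes (cells x dims) over a boolean region mask."""
    selected = codes[region]
    return selected.sum(axis=0) if op == "sum" else selected.max(axis=0)

codes = np.array([[0.2, 0.0],
                  [0.5, 0.1],
                  [0.1, 0.9]])
region = np.array([True, True, False])   # pool only over the first two cells

print(pool(codes, region, "sum"))  # sum-pooling: [0.7 0.1]
print(pool(codes, region, "max"))  # max-pooling: [0.5 0.1]
```

Sum-pooling accumulates evidence across the region, while max-pooling keeps only the strongest response per dimension; the paper's parameterization learns which cells each region covers.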
Other
My GitHub profile.
My personal webpage.
See the publication list of the Scalable Learning and Perception group that I belong to.