Computer Vision and Machine Learning

Mateusz Malinowski (PhD Student)

Max-Planck-Institut für Informatik
Saarland Informatics Campus

Personal Information

Research Interests

  • Synergy of Machine Vision and Natural Language Understanding
    • Question Answering based on Images
    • Text-to-image Retrieval
  • Deep Learning
  • Optimization methods


Research Projects


Reviewing Activities

  • Neural Information Processing Systems (NIPS)
  • Conference on Computer Vision and Pattern Recognition (CVPR)
  • European Conference on Computer Vision (ECCV)
  • Asian Conference on Computer Vision (ACCV)
  • The European Chapter of the ACL (EACL)
  • International Conference on Pattern Recognition (ICPR)
  • Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
  • International Journal of Computer Vision (IJCV)
  • Journal of Mathematical Imaging and Vision (JMIV)
  • Information Processing and Management (IPM)
  • IEEE Transactions on Computational Intelligence and AI in Games
  • Language and Linguistics Compass



Long-Term Image Boundary Prediction
A. Bhattacharyya, M. Malinowski, B. Schiele and M. Fritz
Thirty-Second AAAI Conference on Artificial Intelligence, 2018
Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions
M. Wagner, H. Basevi, R. Shetty, W. Li, M. Malinowski, M. Fritz and A. Leonardis
Computer Vision - ECCV 2018 Workshops, 2018
Towards Holistic Machines: From Visual Recognition To Question Answering About Real-world Images
M. Malinowski
PhD Thesis, Universität des Saarlandes, 2017
Computer Vision has undergone major changes over the last five years, largely driven by deep architectures. Here, we investigate whether the performance of such architectures generalizes to more complex tasks that require a more holistic approach to scene comprehension. The presented work focuses on learning spatial and multi-modal representations, and on the foundations of a Visual Turing Test, where scene understanding is tested by a series of questions about its content. In our studies, we propose DAQUAR, the first 'question answering about real-world images' dataset, together with two methods that address the problem: a symbolic-based and a neural-based visual question answering architecture. The symbolic-based method relies on a semantic parser, a database of visual facts, and a Bayesian formulation that accounts for various interpretations of the visual scene. The neural-based method is an end-to-end architecture composed of a question encoder, an image encoder, a multimodal embedding, and an answer decoder. This architecture has proven effective in capturing language-based biases and has become a standard component of other visual question answering architectures. Along with the methods, we also investigate various evaluation metrics that embrace uncertainty in a word's meaning, as well as various interpretations of the scene and the question.
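The neural-based pipeline described above (question encoder, image encoder, multimodal embedding, answer decoder) can be sketched in a few lines. The sketch below is only illustrative: the weights are random, a mean of word embeddings stands in for the LSTM question encoder, the image feature is assumed precomputed by a CNN, and all sizes and names (`W_embed`, `answer`, ...) are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real model uses an LSTM question encoder
# and a CNN image encoder, both stubbed out here.
VOCAB, EMB, IMG_FEAT, N_ANSWERS = 100, 32, 32, 10

W_embed = rng.normal(size=(VOCAB, EMB))     # word embeddings (LSTM stand-in)
W_img   = rng.normal(size=(IMG_FEAT, EMB))  # projects CNN features into the joint space
W_ans   = rng.normal(size=(EMB, N_ANSWERS)) # answer decoder: linear + softmax

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def answer(question_word_ids, cnn_feature):
    q = W_embed[question_word_ids].mean(axis=0)  # question encoder: mean embedding
    v = cnn_feature @ W_img                      # image encoder projection
    joint = q * v                                # multimodal embedding (elementwise fusion)
    return softmax(joint @ W_ans)                # distribution over a fixed answer set

probs = answer([3, 17, 42], rng.normal(size=IMG_FEAT))
```

The elementwise-product fusion is one of several simple choices (concatenation and summation being the obvious alternatives); what matters for the sketch is that question and image are mapped into one joint space before decoding an answer.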
Ask Your Neurons: A Deep Learning Approach to Visual Question Answering
M. Malinowski, M. Rohrbach and M. Fritz
International Journal of Computer Vision, Volume 125, Number 1-3, 2017
Spatio-Temporal Image Boundary Extrapolation
A. Bhattacharyya, M. Malinowski and M. Fritz
Technical Report, 2016
(arXiv: 1605.07363)
Boundary prediction in images and video has been a very active topic of research, and organizing visual information into boundaries and segments is believed to be a cornerstone of visual perception. While prior work has focused on predicting boundaries for observed frames, our work aims at predicting boundaries of future, unobserved frames. This requires our model to learn about the fate of boundaries and to extrapolate motion patterns. We experiment on an established real-world video segmentation dataset, which provides a testbed for this new task. We show, for the first time, spatio-temporal boundary extrapolation in this challenging scenario. Furthermore, we show long-term prediction of boundaries in situations where the motion is governed by the laws of physics. We successfully predict boundaries in a billiard scenario without assuming a strong parametric model or any notion of objects. We argue that, with minimal model assumptions, our model has derived a notion of 'intuitive physics' that can be applied to novel scenes.
Tutorial on Answering Questions about Images with Deep Learning
M. Malinowski and M. Fritz
Technical Report, 2016
(arXiv: 1610.01076)
Together with the development of more accurate methods in Computer Vision and Natural Language Understanding, holistic architectures that answer questions about the content of real-world images have emerged. In this tutorial, we build a neural-based approach to answering questions about images. We base our tutorial on two datasets: (mostly) DAQUAR and (to a lesser extent) VQA. With small tweaks, the models that we present here can achieve competitive performance on both datasets; in fact, they are among the best methods that combine an LSTM with a global, full-frame CNN representation of an image. We hope that after reading this tutorial, the reader will be able to use Deep Learning frameworks, such as Keras and the introduced Kraino, to build various architectures that lead to further performance improvements on this challenging task.
Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries
S. Nag Chowdhury, M. Malinowski, A. Bulling and M. Fritz
ICMR’16, ACM International Conference on Multimedia Retrieval, 2016
Multi-Cue Zero-Shot Learning with Strong Supervision
Z. Akata, M. Malinowski, M. Fritz and B. Schiele
29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), 2016
Ask Your Neurons Again: Analysis of Deep Methods with Global Image Representation
M. Malinowski, M. Rohrbach and M. Fritz
IEEE Conference on Computer Vision and Pattern Recognition Workshops (VQA 2016), 2016
(Accepted/in press)
We address an open-ended question answering task about real-world images. With the help of currently available methods developed in Computer Vision and Natural Language Processing, we would like to push an architecture with a global visual representation to its limits. In our contribution, we show how to achieve competitive performance on VQA with global visual features (Residual Net) together with a carefully designed architecture.
Mean Box Pooling: A Rich Image Representation and Output Embedding for the Visual Madlibs Task
A. Mokarian Forooshani, M. Malinowski and M. Fritz
Proceedings of the British Machine Vision Conference (BMVC 2016), 2016
Long Term Boundary Extrapolation for Deterministic Motion
A. Bhattacharyya, M. Malinowski and M. Fritz
NIPS Workshop on Intuitive Physics, 2016
Hard to Cheat: A Turing Test based on Answering Questions about Images
M. Malinowski and M. Fritz
Twenty-Ninth AAAI Conference on Artificial Intelligence W6, Beyond the Turing Test (AAAI 2015 W6, Beyond the Turing Test), 2015
(arXiv: 1501.03302)
Progress in language and image understanding by machines has sparked the interest of the research community in more open-ended, holistic tasks, and refueled an old AI dream of building intelligent machines. We discuss a few prominent challenges that characterize such holistic tasks and argue for "question answering about images" as a particularly appealing instance of such a holistic task. In particular, we point out that it is a version of a Turing Test that is likely to be more robust to over-interpretations, and we contrast it with tasks like grounding and generation of descriptions. Finally, we discuss tools to measure progress in this field.
Ask Your Neurons: A Neural-based Approach to Answering Questions About Images
M. Malinowski, M. Rohrbach and M. Fritz
ICCV 2015, IEEE International Conference on Computer Vision, 2015
A Multi-world Approach to Question Answering about Real-world Scenes based on Uncertain Input
M. Malinowski and M. Fritz
Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014
Towards a Visual Turing Challenge
M. Malinowski and M. Fritz
NIPS 2014 Workshop on Learning Semantics, 2014
(arXiv: 1410.8027)
As language and visual understanding by machines progresses rapidly, we are observing increasing interest in holistic architectures that tightly interlink both modalities in a joint learning and inference process. This trend has allowed the community to move towards more challenging and open tasks and has refueled the hope of achieving the old AI dream of building machines that could pass a Turing Test in open domains. In order to make steady progress towards this goal, we realize that quantifying performance becomes increasingly difficult. Therefore, we ask how we can precisely define such challenges and how we can evaluate different algorithms on these open tasks. In this paper, we summarize and discuss such challenges, and try to give answers where appropriate options are available in the literature. We exemplify some of the solutions on a recently presented dataset for a question-answering task based on real-world indoor images that establishes a visual Turing challenge. Finally, we argue that, despite the success of unique ground-truth annotation, we likely have to step away from carefully curated datasets and instead rely on 'social consensus' as the main driving force to create suitable benchmarks. Providing coverage of this inherently ambiguous output space is an emerging challenge that we face in order to make quantifiable progress in this area.
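One concrete way to operationalize such a 'social consensus' evaluation is to grant credit in proportion to how many annotators agree with a predicted answer. The snippet below is only an illustrative sketch: the function name and the threshold k=3 are invented for this example (a related min-consensus rule was later popularized by the VQA benchmark).

```python
def consensus_accuracy(predicted, human_answers, k=3):
    """Credit grows with the number of annotators who gave the predicted
    answer; full credit once at least k of them agree with it."""
    matches = sum(a == predicted for a in human_answers)
    return min(matches / k, 1.0)

# Four annotators, three of whom said "table":
full = consensus_accuracy("table", ["table", "desk", "table", "table"])
part = consensus_accuracy("desk", ["table", "desk", "table", "table"])
```

Here `full` is 1.0 while `part` is 1/3: the metric neither demands a single unique ground truth nor treats every disagreement as a hard failure, which is exactly the property argued for above.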
A Pooling Approach to Modelling Spatial Relations for Image Retrieval and Annotation
M. Malinowski and M. Fritz
Technical Report, 2014
(arXiv: 1411.5190)
Over the last two decades we have witnessed strong progress on modeling visual object classes, scenes and attributes that has significantly contributed to automated image understanding. On the other hand, surprisingly little progress has been made on incorporating a spatial representation and reasoning in the inference process. In this work, we propose a pooling interpretation of spatial relations and show how it improves image retrieval and annotation tasks involving spatial language. Due to the complexity of spatial language, we argue for a learning-based approach that acquires a representation of spatial relations by learning the parameters of the pooling operator. We show improvements over previous work on two datasets and two different tasks, and provide additional insights on a new dataset with an explicit focus on spatial relations.
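The 'pooling interpretation of spatial relations' can be illustrated with a toy sketch: each relation corresponds to a weight map over a spatial grid of local activations, and applying the relation is a weighted sum under that map. In the paper the weights are learned; here they are hand-set, and all names (`relation_pooling`, `above`, `below`) are invented for this example.

```python
import numpy as np

def relation_pooling(feature_grid, weights):
    """Pool an H x W grid of local activations with a per-cell weight map.
    One learned weight map per spatial relation ('above', 'left of', ...)."""
    return (feature_grid * weights).sum()

grid = np.zeros((4, 4))
grid[0, 2] = 1.0                              # an object detected in the top rows

above = np.zeros((4, 4)); above[:2, :] = 0.5  # 'above' emphasizes upper cells
below = np.zeros((4, 4)); below[2:, :] = 0.5  # 'below' emphasizes lower cells

score_above = relation_pooling(grid, above)
score_below = relation_pooling(grid, below)
```

An object near the top of the image scores high under the `above` map and low under `below`; learning the maps instead of hand-setting them is what lets the model adapt to how spatial terms are actually used in the data.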
Learnable Pooling Regions for Image Classification
M. Malinowski and M. Fritz
International Conference on Learning Representations Workshop Proceedings (ICLR 2013), 2013
(arXiv: 1301.3516)
Biologically inspired, from the early HMAX model to Spatial Pyramid Matching, pooling has played an important role in visual recognition pipelines. Spatial pooling, by grouping local codes, equips these methods with a certain degree of robustness to translation and deformation while preserving important spatial information. Despite the predominance of this approach in current recognition systems, we have seen little progress in fully adapting the pooling strategy to the task at hand. This paper proposes a model for learning a task-dependent pooling scheme that includes previously proposed hand-crafted pooling schemes as particular instantiations. In our work, we investigate the role of different regularization terms, showing that the smooth regularization term is crucial to achieve strong performance with the presented architecture. Finally, we propose an efficient and parallel method to train the model. Our experiments show improved performance over hand-crafted pooling schemes on the CIFAR-10 and CIFAR-100 datasets -- in particular improving the state-of-the-art to 56.29% on the latter.
Learning Smooth Pooling Regions for Visual Recognition
M. Malinowski and M. Fritz
Electronic Proceedings of the British Machine Vision Conference 2013 (BMVC 2013), 2013
From the early HMAX model to Spatial Pyramid Matching, spatial pooling has played an important role in visual recognition pipelines. By aggregating local statistics, it equips the recognition pipelines with a certain degree of robustness to translation and deformation while preserving spatial information. Despite its predominance in current recognition systems, we have seen little progress in fully adapting the pooling strategy to the task at hand. In this paper, we propose a flexible parameterization of the spatial pooling step and learn the pooling regions together with the classifier. We investigate a smoothness regularization term that, in conjunction with an efficient learning scheme, makes learning scalable. Our framework can work with both popular pooling operators: sum-pooling and max-pooling. Finally, we show the benefits of our approach for object recognition tasks based on visual words and for higher-level event recognition tasks based on object-bank features. In both cases, we improve over the hand-crafted spatial pooling step, showing the importance of its adaptation to the task.
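A minimal sketch of the idea, assuming the sum-pooling variant: the pooling step becomes a matrix of per-cell weights, one soft region per output dimension. In the actual work these weights are learned jointly with the classifier under a smoothness regularizer; here they are random stand-ins, and all sizes and names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

H, W, K, D = 4, 4, 3, 8     # spatial grid of local codes, K pooling regions,
                            # D-dimensional local descriptors

# Learnable pooling weights: each row is a soft spatial region.
# A spatial pyramid is the special case of hard 0/1 region masks.
pool_w = rng.uniform(size=(K, H * W))
pool_w /= pool_w.sum(axis=1, keepdims=True)   # normalize each region

def smooth_sum_pool(local_codes):
    """local_codes: (H*W, D) grid of local descriptors -> (K, D) pooled."""
    return pool_w @ local_codes

codes = rng.normal(size=(H * W, D))
pooled = smooth_sum_pool(codes)
```

Because the pooled representation is linear in `pool_w`, gradients with respect to the region weights are cheap, which is what makes joint training with the classifier tractable.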