Tracking multiple persons in a monocular video of a crowded scene is a challenging task. Humans can master it even if they loose track of a person locally by re-identifying the same person based on their appearance. Care must be taken across long distances, as similar-looking persons need not be identical. In this work, we propose a novel graph based formulation that links and clusters person hypotheses over time by solving an instance of a minimum cost lifted multicut problem. Our model generalizes previous works by introducing a mechanism for adding long-range attractive connections between nodes in the graph without modifying the original set of feasible solutions. This allows us to reward tracks that assign detections of similar appearance to the same person in a way that does not introduce implausible solutions. To effectively match hypotheses over longer temporal gaps we develop new deep architectures for re-identification of people. They combine holistic representations extracted with deep networks and body pose layout obtained with a state-of-the-art pose estimation model. We demonstrate the effectiveness of our formulation by reporting a new state-of-the-art for the MOT16 benchmark
The paper is currently under patent application. Please send an email to firstname.lastname@example.org, if you are interested in the source code and the pre-trained models.