CoTracker: It is Better to Track Together

1Meta AI 2Visual Geometry Group, University of Oxford

Abstract

Methods for video motion prediction either jointly estimate the instantaneous motion of all points in a given video frame using optical flow, or track the motion of individual points throughout the video, but independently of one another. The latter is true even for powerful deep-learning methods that can track points through occlusions. Tracking points individually ignores the strong correlation that can exist between the points, for instance when they arise from the same physical object, potentially harming performance.

In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video. The architecture draws on several ideas from the optical-flow and tracking literature and combines them in a new, flexible and powerful design. It is based on a transformer network that models the correlation of different points in time via specialised attention layers.
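The core idea of attending both along each track and across tracks can be sketched as factorised attention. The following is a minimal numpy illustration, not the actual CoTracker implementation: the function names, shapes and the single-head attention are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product self-attention over the second-to-last axis.
    d = q.shape[-1]
    w = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d))
    return w @ v

def factorised_update(tokens):
    """One illustrative block over a (num_tracks, num_frames, dim) array of
    track tokens: each track first attends over time, then each frame's
    tokens attend across tracks -- the 'joint' step that lets points share
    information."""
    t = tokens + attend(tokens, tokens, tokens)   # time attention, per track
    s = np.swapaxes(t, 0, 1)                      # -> (frames, tracks, dim)
    s = s + attend(s, s, s)                       # cross-track attention, per frame
    return np.swapaxes(s, 0, 1)                   # -> (tracks, frames, dim)

tokens = np.random.default_rng(0).normal(size=(4, 8, 16))  # 4 tracks, 8 frames
out = factorised_update(tokens)
print(out.shape)  # (4, 8, 16)
```

Factorising the attention this way keeps the cost linear in the product of tracks and frames per axis, rather than attending over all track-frame tokens at once.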

The transformer is designed to iteratively update an estimate of several trajectories. It can be applied in a sliding-window manner to very long videos, for which we engineer an unrolled training loop. CoTracker compares favourably against state-of-the-art point-tracking methods in both efficiency and accuracy.
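The sliding-window mechanics can be illustrated with a toy sketch: overlapping windows are processed in order, and each window is initialised from the estimates of the previous one, so tracks propagate through arbitrarily long videos. The window size, overlap and the trivial "propagate the last position" stand-in below are assumptions for illustration; the real model refines positions with the transformer inside each window.

```python
import numpy as np

def track_sliding(video_len, window=8, overlap=4, num_points=3):
    """Toy sliding-window tracker: windows of `window` frames advance by
    `window - overlap`, and each window starts from the previous window's
    last estimates (hand-off through the overlap)."""
    stride = window - overlap
    tracks = np.zeros((num_points, video_len, 2))   # (x, y) per point per frame
    prev_tail = np.zeros((num_points, 2))           # last known positions
    start = 0
    while start < video_len:
        end = min(start + window, video_len)
        # Stand-in for the per-window transformer refinement: simply carry
        # the previous estimate forward across the window's frames.
        tracks[:, start:end] = prev_tail[:, None, :]
        prev_tail = tracks[:, end - 1]
        start += stride
    return tracks

tracks = track_sliding(video_len=20)
print(tracks.shape)  # (3, 20, 2)
```

Training with the loop unrolled over several such windows lets gradients flow through the hand-off, which is what allows the model to track beyond a single window at test time.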

Points on a uniform grid

We track points sampled on a regular grid starting from the initial video frame. The colors represent the object (magenta) and the background (cyan).
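Sampling queries on a regular grid in the first frame is straightforward; a small sketch follows, where the function name, the `(frame, x, y)` row layout and the default grid size are illustrative assumptions, not CoTracker's actual API.

```python
import numpy as np

def grid_queries(height, width, grid_size=10, frame=0):
    """Return (grid_size**2, 3) query rows of (frame_index, x, y),
    evenly spaced over a height x width image."""
    ys = np.linspace(0, height - 1, grid_size)
    xs = np.linspace(0, width - 1, grid_size)
    xx, yy = np.meshgrid(xs, ys)
    return np.stack([np.full(xx.size, frame), xx.ravel(), yy.ravel()], axis=1)

q = grid_queries(240, 320, grid_size=5)
print(q.shape)  # (25, 3)
```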

PIPs
RAFT
TAPIR
CoTracker (Ours)

For PIPs, many points are incorrectly tracked and end up 'stuck' on the front of the object or the side of the image when they become occluded. RAFT predictions are less noisy, but the model fails to handle occlusions, so points are lost or remain stuck on the object. TAPIR predictions are fairly accurate for non-occluded points, but when a point becomes occluded the model struggles to estimate its position. CoTracker produces cleaner and more 'linear' tracks, which is correct here, since the primary motion is a homography (the observer does not translate).

Individual points

We track the same queried point with each method and visualise its trajectory, colour-coded by time. The red cross (❌) marks the ground-truth point coordinates.

TAP-Net
PIPs
RAFT
CoTracker (Ours)

BibTeX

@article{karaev2023cotracker,
  author  = {Nikita Karaev and Ignacio Rocco and Benjamin Graham and Natalia Neverova and Andrea Vedaldi and Christian Rupprecht},
  title   = {{CoTracker}: It is Better to Track Together},
  journal = {arXiv},
  year    = {2023}
}