CoTracker: It is Better to Track Together

1Meta AI 2Visual Geometry Group, University of Oxford

Abstract

Methods for video motion prediction either jointly estimate the instantaneous motion of all points in a given video frame using optical flow, or track the motion of individual points throughout the video, but independently of one another. The latter is true even for powerful deep-learning methods that can track points through occlusions. Tracking points individually ignores the strong correlation that can exist between the points, for instance when they arise from the same physical object, potentially harming performance.

In this paper, we thus propose CoTracker, an architecture that jointly tracks multiple points throughout an entire video. The architecture draws on several ideas from the optical-flow and tracking literature and combines them in a new, flexible and powerful design. It is based on a transformer network that models the correlation of different points in time via specialised attention layers.
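The core idea of attending both along each track and across tracks can be sketched as factorised attention. The following is a minimal numpy illustration, not the actual CoTracker implementation: the function names, shapes and the single-head attention are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product self-attention over the second-to-last axis.
    d = q.shape[-1]
    w = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d))
    return w @ v

def factorised_update(tokens):
    """One illustrative block over a (num_tracks, num_frames, dim) array of
    track tokens: each track first attends over time, then each frame's
    tokens attend across tracks -- the 'joint' step that lets points share
    information."""
    t = tokens + attend(tokens, tokens, tokens)   # time attention, per track
    s = np.swapaxes(t, 0, 1)                      # -> (frames, tracks, dim)
    s = s + attend(s, s, s)                       # cross-track attention, per frame
    return np.swapaxes(s, 0, 1)                   # -> (tracks, frames, dim)

tokens = np.random.default_rng(0).normal(size=(4, 8, 16))  # 4 tracks, 8 frames
out = factorised_update(tokens)
print(out.shape)  # (4, 8, 16)
```

Factorising the attention this way keeps the cost linear in the product of tracks and frames per axis, rather than attending over all track-frame tokens at once.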

The transformer is designed to iteratively update an estimate of several trajectories. It can be applied in a sliding-window manner to very long videos, for which we engineer an unrolled training loop. CoTracker compares favourably against state-of-the-art point-tracking methods in both efficiency and accuracy.
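The sliding-window mechanics can be illustrated with a toy sketch: overlapping windows are processed in order, and each window is initialised from the estimates of the previous one, so tracks propagate through arbitrarily long videos. The window size, overlap and the trivial "propagate the last position" stand-in below are assumptions for illustration; the real model refines positions with the transformer inside each window.

```python
import numpy as np

def track_sliding(video_len, window=8, overlap=4, num_points=3):
    """Toy sliding-window tracker: windows of `window` frames advance by
    `window - overlap`, and each window starts from the previous window's
    last estimates (hand-off through the overlap)."""
    stride = window - overlap
    tracks = np.zeros((num_points, video_len, 2))   # (x, y) per point per frame
    prev_tail = np.zeros((num_points, 2))           # last known positions
    start = 0
    while start < video_len:
        end = min(start + window, video_len)
        # Stand-in for the per-window transformer refinement: simply carry
        # the previous estimate forward across the window's frames.
        tracks[:, start:end] = prev_tail[:, None, :]
        prev_tail = tracks[:, end - 1]
        start += stride
    return tracks

tracks = track_sliding(video_len=20)
print(tracks.shape)  # (3, 20, 2)
```

Training with the loop unrolled over several such windows lets gradients flow through the hand-off, which is what allows the model to track beyond a single window at test time.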

Points on a uniform grid

We track points sampled on a regular grid starting from the initial video frame. The colors represent the object (magenta) and the background (cyan).
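Sampling queries on a regular grid in the first frame is straightforward; a small sketch follows, where the function name, the `(frame, x, y)` row layout and the default grid size are illustrative assumptions, not CoTracker's actual API.

```python
import numpy as np

def grid_queries(height, width, grid_size=10, frame=0):
    """Return (grid_size**2, 3) query rows of (frame_index, x, y),
    evenly spaced over a height x width image."""
    ys = np.linspace(0, height - 1, grid_size)
    xs = np.linspace(0, width - 1, grid_size)
    xx, yy = np.meshgrid(xs, ys)
    return np.stack([np.full(xx.size, frame), xx.ravel(), yy.ravel()], axis=1)

q = grid_queries(240, 320, grid_size=5)
print(q.shape)  # (25, 3)
```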

PIPs
RAFT
TAPIR
CoTracker (Ours)

For PIPs, many points are incorrectly tracked and end up 'stuck' on the front of the object or the side of the image when they become occluded. RAFT predictions are less noisy, but the model fails to handle occlusions, so points are lost or remain stuck on the object. TAPIR predictions are fairly accurate for non-occluded points, but when a point becomes occluded the model struggles to estimate its position. CoTracker produces cleaner and more 'linear' tracks, which is correct here, since the primary motion is a homography (the observer does not translate).

Individual points

We track the same queried point with each method and visualise its trajectory, colour-coded by time. The red cross (❌) marks the ground-truth point coordinates.

TAP-Net
PIPs
RAFT
CoTracker (Ours)

BibTeX

@article{karaev2023cotracker,
  author  = {Nikita Karaev and Ignacio Rocco and Benjamin Graham and Natalia Neverova and Andrea Vedaldi and Christian Rupprecht},
  title   = {{CoTracker}: It is Better to Track Together},
  journal = {arXiv},
  year    = {2023}
}