Track4World

Feedforward World-centric Dense 3D Tracking of All Pixels
Jiahao Lu1 Jiayi Xu1 Wenbo Hu2† Ruijie Zhu2 Chengfeng Zhao1 Sai-Kit Yeung1 Ying Shan2 Yuan Liu1†
1The Hong Kong University of Science and Technology · 2Tencent ARC Lab
Links: Paper (PDF) · arXiv · Code · Demo

Abstract

Estimating the 3D trajectory of every pixel in a monocular video is key to a comprehensive understanding of the video's 3D dynamics. Recent monocular 3D tracking methods demonstrate impressive performance, but are limited to either tracking sparse points from the first frame or slow optimization-based frameworks for dense tracking.

In this paper, we propose a feedforward model, called Track4World, that enables efficient, holistic 3D tracking of every pixel in a world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate pixel-wise dense 2D and 3D flow between arbitrary frame pairs.

The estimated scene flow, together with the reconstructed 3D geometry, then enables efficient 3D tracking of every pixel in the video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.

- Feedforward model: efficient holistic tracking
- World-centric: tracking in a global coordinate system
- Dense tracking: a trajectory for every pixel
- 2D-to-3D correlation: simultaneous 2D & 3D flow

Camera-Centric 3D Tracking

Our method lifts 2D correspondences into 3D space, enabling robust tracking of points relative to the moving camera coordinate system.
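The lifting step above is, at its core, a standard pinhole unprojection; the following is a minimal sketch assuming known intrinsics and per-pixel depth (the function name and interface are illustrative, not the paper's code):

```python
import numpy as np

def lift_to_camera_frame(uv, depth, K):
    """Unproject 2D pixel correspondences into camera-frame 3D points.

    uv    : (N, 2) pixel coordinates (u, v)
    depth : (N,)   per-pixel depth values
    K     : (3, 3) pinhole intrinsics
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (uv[:, 0] - cx) / fx * depth
    y = (uv[:, 1] - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)  # (N, 3), camera coordinates
```

Applying this to every tracked pixel, in every frame, yields trajectories expressed relative to the moving camera.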


World-Centric 3D Tracking

By decoupling camera ego-motion, our method effectively establishes consistent and dense point tracking in the globally aligned world coordinate system.
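Concretely, decoupling ego-motion amounts to mapping camera-frame points through the estimated camera-to-world pose; a minimal sketch, assuming a 4×4 cam-to-world convention:

```python
import numpy as np

def camera_to_world(points_cam, T_wc):
    """Map camera-frame 3D points into the world frame.

    points_cam : (N, 3) points in camera coordinates
    T_wc       : (4, 4) camera-to-world pose (rotation + translation)
    """
    R, t = T_wc[:3, :3], T_wc[:3, 3]
    return points_cam @ R.T + t  # (N, 3), world coordinates
```

Once every frame's points live in this shared world frame, static points stay put and only genuinely moving points trace trajectories.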


Dense 2D Tracking

Track4World enables consistent long-term dense tracking across frames, including challenging sequences with rapid motion, occlusion, and deformation.



Methodology

Track4World Framework

Given (a) the input video frames, Track4World first extracts (b) global scene representations (geometric embeddings, point clouds, and camera poses). (c) A sparse-to-dense scene flow decoder then predicts joint 2D-3D flows between arbitrary timesteps, using a novel 2D-to-3D correlation scheme that improves efficiency and enables joint 2D-3D supervision. (d) The pairwise flows are finally fused to establish holistic world-centric 3D tracking.
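As a loose illustration of step (d), pairwise dense flows can be composed into long-range per-pixel trajectories. The sketch below chains consecutive 2D flow fields with nearest-neighbour lookup; the actual method fuses flows between arbitrary frame pairs and operates in 3D, so this is only an assumed simplification:

```python
import numpy as np

def chain_flows(flows):
    """Compose consecutive dense 2D flow fields into trajectories for
    every pixel of the first frame.

    flows : list of (H, W, 2) arrays; flows[t] maps frame t -> t+1
    returns (T+1, H, W, 2) pixel positions over time
    """
    H, W, _ = flows[0].shape
    ys, xs = np.mgrid[0:H, 0:W]
    pos = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2)
    traj = [pos]
    for flow in flows:
        # Nearest-neighbour lookup of the flow at the current (possibly
        # sub-pixel) positions; bilinear sampling would be smoother.
        xi = np.clip(np.round(pos[..., 0]).astype(int), 0, W - 1)
        yi = np.clip(np.round(pos[..., 1]).astype(int), 0, H - 1)
        pos = pos + flow[yi, xi]
        traj.append(pos)
    return np.stack(traj)
```

Chaining only adjacent frames accumulates drift, which is precisely why predicting flow between arbitrary frame pairs, as the framework does, is attractive.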

Performance Metrics

Comprehensive evaluation on scene & optical flow estimation, 3D tracking, 2D tracking, and camera pose estimation. In the tables below, "/" marks metrics a method does not predict.
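For reference, the flow metrics below (end-point error and threshold accuracies) can be computed as in this sketch; the thresholds are assumptions (5 cm / 10 cm are common in the scene-flow literature) and may differ from the paper's exact protocol:

```python
import numpy as np

def scene_flow_metrics(pred, gt, acc_s=0.05, acc_r=0.10):
    """End-point error and threshold accuracies for 3D scene flow.

    pred, gt : (N, 3) predicted / ground-truth 3D flow vectors (metres)
    acc_s, acc_r : strict / relaxed accuracy thresholds (assumed values)
    """
    err = np.linalg.norm(pred - gt, axis=-1)  # per-point 3D error
    return {
        "EPE3D": float(err.mean()),           # mean end-point error
        "AccS": float((err < acc_s).mean()),  # fraction within strict threshold
        "AccR": float((err < acc_r).mean()),  # fraction within relaxed threshold
    }
```

EPE2D, AccS2D, and AccR2D follow the same pattern on 2D flow vectors, with thresholds in pixels.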

1. Scene & Optical Flow Estimation

In-Domain (Kubric-3D)

Kubric-3D val (short):

| Method | Abs Rel ↓ | δ<1.25 ↑ | EPE3D ↓ | AccS ↑ | AccR ↑ | EPE2D ↓ | AccS2D ↑ | AccR2D ↑ |
|---|---|---|---|---|---|---|---|---|
| RAFT | / | / | / | / | / | 6.7974 | 0.7442 | 0.9018 |
| GMFlowNet | / | / | / | / | / | 7.0390 | 0.7619 | 0.9103 |
| SEA-RAFT | / | / | / | / | / | 9.5794 | 0.7720 | 0.8947 |
| RAFT-3D | 0.0649 | 0.9344 | 0.6170 | 0.0015 | 0.0078 | 40.4480 | 0.0002 | 0.0015 |
| OpticalExpansion | 0.2170 | 0.6266 | 0.2093 | 0.2890 | 0.4760 | 19.6471 | 0.1183 | 0.3316 |
| POMATO | 0.1525 | 0.8329 | 0.9672 | 0.0566 | 0.1696 | / | / | / |
| ZeroMSF | 0.0860 | 0.9196 | 0.3528 | 0.1867 | 0.3413 | / | / | / |
| Any4D | 0.0585 | 0.9547 | 0.3908 | 0.1610 | 0.2893 | / | / | / |
| V-DPM | 0.0716 | 0.9010 | 0.4087 | 0.1442 | 0.2491 | / | / | / |
| Track4World (Ours) | 0.0344 | 0.9719 | 0.1537 | 0.5494 | 0.7460 | 1.8685 | 0.8086 | 0.9309 |

Kubric-3D val (long):

| Method | Abs Rel ↓ | δ<1.25 ↑ | EPE3D ↓ | AccS ↑ | AccR ↑ | EPE2D ↓ | AccS2D ↑ | AccR2D ↑ |
|---|---|---|---|---|---|---|---|---|
| RAFT | / | / | / | / | / | 51.9034 | 0.5183 | 0.7042 |
| GMFlowNet | / | / | / | / | / | 51.9049 | 0.5309 | 0.7201 |
| SEA-RAFT | / | / | / | / | / | 58.7148 | 0.5596 | 0.6825 |
| RAFT-3D | 0.1245 | 0.8422 | 1.5652 | 0.0001 | 0.0010 | 83.6966 | 0.0004 | 0.0040 |
| OpticalExpansion | 0.2177 | 0.6316 | 0.7037 | 0.1062 | 0.1903 | 68.6562 | 0.0255 | 0.0874 |
| POMATO | 0.1761 | 0.7760 | 1.6925 | 0.0148 | 0.0564 | / | / | / |
| ZeroMSF | 0.1208 | 0.8609 | 1.2182 | 0.0475 | 0.0895 | / | / | / |
| Any4D | 0.1017 | 0.8770 | 1.2442 | 0.0429 | 0.0855 | / | / | / |
| V-DPM | 0.1155 | 0.8205 | 1.2620 | 0.0407 | 0.0803 | / | / | / |
| Track4World (Ours) | 0.0472 | 0.9371 | 0.4808 | 0.3247 | 0.5491 | 15.0906 | 0.6134 | 0.7711 |

Out-of-Domain (KITTI & BlinkVision)

KITTI:

| Method | Abs Rel ↓ | δ<1.25 ↑ | EPE3D ↓ | AccS ↑ | AccR ↑ | EPE2D ↓ | AccS2D ↑ | AccR2D ↑ |
|---|---|---|---|---|---|---|---|---|
| RAFT | / | / | / | / | / | 5.4150 | 0.6271 | 0.8068 |
| GMFlowNet | / | / | / | / | / | 4.6977 | 0.6432 | 0.8241 |
| SEA-RAFT | / | / | / | / | / | 4.8863 | 0.6654 | 0.8297 |
| RAFT-3D | 0.1619 | 0.8413 | 0.3837 | 0.0118 | 0.0678 | 54.0938 | 0.0001 | 0.0007 |
| OpticalExpansion | 0.2764 | 0.4302 | 0.2419 | 0.1553 | 0.2612 | 8.8808 | 0.5446 | 0.7326 |
| POMATO | 0.2752 | 0.4359 | 0.2602 | 0.1127 | 0.2156 | / | / | / |
| ZeroMSF | 0.2064 | 0.5913 | 0.1823 | 0.1695 | 0.3481 | / | / | / |
| Any4D | 0.2398 | 0.4974 | 0.1856 | 0.1429 | 0.2931 | / | / | / |
| V-DPM | 0.1469 | 0.7981 | 0.4462 | 0.1180 | 0.1608 | / | / | / |
| Track4World (Ours) | 0.0707 | 0.9570 | 0.0742 | 0.6929 | 0.8238 | 2.5722 | 0.6849 | 0.8769 |

BlinkVision:

| Method | Abs Rel ↓ | δ<1.25 ↑ | EPE3D ↓ | AccS ↑ | AccR ↑ | EPE2D ↓ | AccS2D ↑ | AccR2D ↑ |
|---|---|---|---|---|---|---|---|---|
| RAFT | / | / | / | / | / | 14.1255 | 0.5037 | 0.6953 |
| GMFlowNet | / | / | / | / | / | 12.0176 | 0.5281 | 0.7170 |
| SEA-RAFT | / | / | / | / | / | 20.9160 | 0.5697 | 0.7186 |
| RAFT-3D | 0.1426 | 0.8455 | 0.6690 | 0.0454 | 0.1280 | 85.4975 | 0.0018 | 0.0121 |
| OpticalExpansion | 0.3372 | 0.4099 | 0.4406 | 0.2091 | 0.3116 | 20.2384 | 0.4122 | 0.6139 |
| POMATO | 0.2089 | 0.6569 | 0.4038 | 0.1522 | 0.2870 | / | / | / |
| ZeroMSF | 0.1934 | 0.6620 | 0.3937 | 0.1913 | 0.2991 | / | / | / |
| Any4D | 0.2218 | 0.6125 | 0.9238 | 0.1242 | 0.1818 | / | / | / |
| V-DPM | 0.2117 | 0.6449 | 1.1476 | 0.1079 | 0.1547 | / | / | / |
| Track4World (Ours) | 0.0371 | 0.9768 | 0.1135 | 0.5091 | 0.7144 | 7.5632 | 0.5131 | 0.7424 |
2. 3D Tracking Estimation (APD Metric)
Camera coordinate 3D tracking:

| Method | PointOdyssey L-16 | PointOdyssey L-50 | ADT L-16 | ADT L-50 | PStudio L-16 | PStudio L-50 | DriveTrack L-16 | DriveTrack L-50 | Avg. L-16 | Avg. L-50 |
|---|---|---|---|---|---|---|---|---|---|---|
| SpatialTracker* | 0.3116 | 0.2977 | 0.4962 | 0.4692 | 0.5390 | 0.4991 | 0.2529 | 0.2502 | 0.3999 | 0.3791 |
| DELTA* | 0.3529 | 0.3412 | 0.5116 | 0.4952 | 0.5922 | 0.5533 | 0.2704 | 0.2701 | 0.4317 | 0.4150 |
| STV2† | 0.1864 | 0.1785 | 0.2400 | 0.2330 | 0.3784 | 0.3690 | 0.1711 | 0.1725 | 0.2400 | 0.2383 |
| MASt3R | 0.3546 | 0.3253 | 0.3368 | 0.3029 | 0.3293 | 0.2956 | 0.2767 | 0.2559 | 0.3244 | 0.2949 |
| MonST3R | 0.3912 | 0.3860 | 0.3694 | 0.3429 | 0.3511 | 0.3381 | 0.3056 | 0.2787 | 0.3543 | 0.3364 |
| POMATO | 0.4816 | 0.4623 | 0.5338 | 0.5299 | 0.5163 | 0.4726 | 0.4237 | 0.4329 | 0.4888 | 0.4744 |
| ZeroMSF | 0.4214 | 0.3887 | 0.5382 | 0.4635 | 0.5083 | 0.4524 | 0.4448 | 0.4513 | 0.4782 | 0.4390 |
| Track4World (Ours) | 0.5397 | 0.5268 | 0.6501 | 0.6091 | 0.5948 | 0.5423 | 0.5003 | 0.5092 | 0.5712 | 0.5469 |

World coordinate 3D tracking:

| Method | PointOdyssey L-16 | PointOdyssey L-50 | ADT L-16 | ADT L-50 | PStudio L-16 | PStudio L-50 | DriveTrack L-16 | DriveTrack L-50 | Avg. L-16 | Avg. L-50 |
|---|---|---|---|---|---|---|---|---|---|---|
| STV2† | 0.1925 | 0.1763 | 0.2456 | 0.2163 | 0.3790 | 0.3689 | 0.1711 | 0.1725 | 0.2470 | 0.2335 |
| POMATO‡ | 0.4425 | 0.3905 | 0.3611 | 0.3548 | 0.5166 | 0.4713 | 0.4227 | 0.4210 | 0.4357 | 0.4094 |
| ZeroMSF‡ | 0.4053 | 0.3505 | 0.4530 | 0.3563 | 0.4828 | 0.4386 | 0.4474 | 0.4382 | 0.4471 | 0.3959 |
| Any4D | 0.4769 | 0.4174 | 0.4460 | 0.3717 | 0.5707 | 0.5066 | 0.5235 | 0.5079 | 0.5043 | 0.4509 |
| V-DPM | 0.4848 | 0.4233 | 0.4783 | 0.3759 | 0.6084 | 0.5795 | 0.4854 | 0.4817 | 0.5142 | 0.4668 |
| Track4World (Ours) | 0.5345 | 0.5162 | 0.6250 | 0.5622 | 0.5946 | 0.5422 | 0.5003 | 0.5087 | 0.5636 | 0.5323 |

* Ground-truth intrinsics used. † No bundle adjustment. ‡ Using camera poses estimated by VGGT.

3. 2D Tracking Estimation
| Method | Kinetics AJ ↑ | Kinetics δvis ↑ | Kinetics OA ↑ | RoboTAP AJ ↑ | RoboTAP δvis ↑ | RoboTAP OA ↑ | RGB-S AJ ↑ | RGB-S δvis ↑ | RGB-S OA ↑ |
|---|---|---|---|---|---|---|---|---|---|
| PIPs++ | / | 63.5 | / | / | 63.0 | / | / | 58.5 | / |
| TAPIR | 49.6 | 64.2 | 85.0 | 59.6 | 73.4 | 87.0 | 55.5 | 69.7 | 88.0 |
| CoTracker | 49.6 | 64.3 | 83.3 | 58.6 | 70.6 | 87.0 | 67.4 | 78.9 | 85.2 |
| TAPTR | 49.0 | 64.4 | 85.2 | 60.1 | 75.3 | 86.9 | 60.8 | 76.2 | 87.0 |
| LocoTrack | 52.9 | 66.8 | 85.3 | 62.3 | 76.2 | 87.1 | 69.7 | 83.2 | 89.5 |
| BootsTAPIR | 54.6 | 68.4 | 86.5 | 64.9 | 80.1 | 86.3 | 70.8 | 83.0 | 89.9 |
| CoTracker3 | 55.8 | 68.5 | 88.3 | 66.4 | 78.8 | 90.8 | 71.7 | 83.6 | 91.1 |
| Track4World (Ours) | 59.1 | 71.3 | 90.6 | 70.9 | 81.8 | 93.3 | 78.2 | 88.5 | 92.3 |
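AJ, δvis, and OA are the standard TAP-Vid metrics (average Jaccard, average position accuracy over visible points, and occlusion accuracy). A minimal sketch of the δ component, assuming the usual {1, 2, 4, 8, 16}-pixel thresholds; the benchmark's official evaluation code may differ in detail:

```python
import numpy as np

def delta_avg_vis(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average fraction of visible points within pixel-error thresholds.

    pred, gt : (N, T, 2) predicted / ground-truth 2D tracks (pixels)
    visible  : (N, T) boolean ground-truth visibility
    """
    err = np.linalg.norm(pred - gt, axis=-1)  # (N, T) per-point error
    fracs = [(err[visible] < t).mean() for t in thresholds]
    return float(np.mean(fracs))
```

AJ additionally folds in occlusion prediction (a point counts only if its visibility is also predicted correctly), and OA is plain accuracy of the visible/occluded classification.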
4. Camera Pose Estimation

| Method | Sintel ATE ↓ | Sintel RTE ↓ | Sintel RRE ↓ | Bonn ATE ↓ | Bonn RTE ↓ | Bonn RRE ↓ |
|---|---|---|---|---|---|---|
| Align3R | 0.128 | 0.042 | 0.432 | 0.023 | 0.007 | 0.620 |
| CUT3R | 0.217 | 0.070 | 0.636 | 0.035 | 0.014 | 1.212 |
| VGGT | 0.167 | 0.062 | 0.490 | 0.051 | 0.011 | 1.038 |
| MapAnything | 0.227 | 0.111 | 2.047 | 0.026 | 0.014 | 0.668 |
| Pi3 | 0.088 | 0.043 | 0.299 | 0.012 | 0.011 | 0.612 |
| DA3 | 0.124 | 0.061 | 0.331 | 0.010 | 0.011 | 0.638 |
| POMATO | 0.209 | 0.064 | 0.694 | 0.041 | 0.017 | 0.832 |
| STV2 | 0.133 | 0.057 | 0.641 | 0.019 | 0.015 | 0.701 |
| Track4World (Ours) | 0.119 | 0.054 | 0.309 | 0.009 | 0.009 | 0.604 |
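ATE, RTE, and RRE are the standard SLAM pose metrics (absolute trajectory, relative translation, and relative rotation error). A minimal ATE sketch, assuming RMSE of camera centres after a least-squares similarity (Umeyama) alignment; the benchmark's exact protocol may differ:

```python
import numpy as np

def ate_rmse(pred, gt):
    """Absolute trajectory error: RMSE of camera positions after a
    least-squares similarity (Umeyama) alignment of pred onto gt.

    pred, gt : (N, 3) estimated / ground-truth camera centres
    """
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    # Cross-covariance between the two centred trajectories
    U, S, Vt = np.linalg.svd(G.T @ P / len(pred))
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:  # guard against reflections
        D[2, 2] = -1.0
    R = U @ D @ Vt                              # optimal rotation
    s = np.trace(np.diag(S) @ D) / P.var(0).sum()  # optimal scale
    aligned = s * pred @ R.T + (mu_g - s * R @ mu_p)
    return float(np.sqrt(((aligned - gt) ** 2).sum(-1).mean()))
```

RTE and RRE instead compare relative poses between frame pairs, which makes them insensitive to global drift.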

Citation

BibTeX:
@article{lu2026track4world,
    title   = {Track4World: Feedforward World-Centric Dense 3D Tracking of All Pixels},
    author  = {Jiahao Lu and Jiayi Xu and Wenbo Hu and Ruijie Zhu and Chengfeng Zhao and Sai-Kit Yeung and Ying Shan and Yuan Liu},
    journal = {arXiv preprint arXiv:2603.02573},
    year    = {2026}
}