Estimating the 3D trajectory of every pixel from a monocular video is crucial for a comprehensive understanding of the 3D dynamics of videos. Recent monocular 3D tracking works demonstrate impressive performance, but they either track only sparse points from the first frame or rely on a slow optimization-based framework for dense tracking.
In this paper, we propose Track4World, a feedforward model that enables efficient, holistic 3D tracking of every pixel in a world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate pixel-wise dense 2D and 3D flow between arbitrary frame pairs.
The estimated scene flow, together with the reconstructed 3D geometry, enables efficient 3D tracking of every pixel in the video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.
Our method lifts 2D correspondences into 3D space, enabling robust tracking of points relative to the moving camera coordinate system.
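
As a rough illustration of what lifting a 2D point into camera-space 3D means, the sketch below unprojects pixels with a pinhole camera model. This is an assumption made for illustration only: Track4World predicts its geometry with a feedforward network, and the function name `unproject`, the intrinsics `K`, and the sample values are all hypothetical.

```python
import numpy as np

def unproject(pixels, depth, K):
    """Lift (N, 2) pixel coordinates with (N,) depths to (N, 3) camera-space points.

    Assumes a pinhole camera with 3x3 intrinsics K (illustrative assumption).
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.concatenate([pixels, ones], axis=1)  # homogeneous pixels, (N, 3)
    rays = homog @ np.linalg.inv(K).T               # back-projected rays with z = 1
    return rays * depth[:, None]                    # scale each ray by its depth

# Hypothetical intrinsics and two sample pixels.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[320.0, 240.0], [420.0, 240.0]])
depth = np.array([2.0, 4.0])
points_cam = unproject(pixels, depth, K)
```

The pixel at the principal point maps to a point straight ahead of the camera, while the second pixel is offset along the camera's x axis in proportion to its depth.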
By decoupling camera ego-motion, our method effectively establishes consistent and dense point tracking in the globally aligned world coordinate system.
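
To make the idea of decoupling ego-motion concrete, the following sketch moves per-frame camera-space tracks into a shared world frame using camera-to-world poses. It is a minimal sketch under stated assumptions, not the model's implementation: the pose convention (4x4 camera-to-world matrices), the function name `camera_to_world`, and the toy data are hypothetical.

```python
import numpy as np

def camera_to_world(points_cam, T_wc):
    """Map camera-space tracks to world space.

    points_cam: (T, N, 3) points expressed in each frame's camera coordinates.
    T_wc:       (T, 4, 4) camera-to-world poses (assumed convention).
    Returns (T, N, 3) points in the world coordinate system.
    """
    R = T_wc[:, :3, :3]  # per-frame rotations
    t = T_wc[:, :3, 3]   # per-frame camera centers in world coordinates
    return np.einsum("tij,tnj->tni", R, points_cam) + t[:, None, :]

# Toy example: a static world point at (0, 0, 5); the camera moves 1 unit
# along +x between frame 0 and frame 1 without rotating.
T_wc = np.stack([np.eye(4), np.eye(4)])
T_wc[1, 0, 3] = 1.0
points_cam = np.array([[[0.0, 0.0, 5.0]],    # frame 0: point seen straight ahead
                       [[-1.0, 0.0, 5.0]]])  # frame 1: apparent shift from ego-motion
points_world = camera_to_world(points_cam, T_wc)
```

In camera coordinates the static point appears to move, but after applying the poses both frames agree on the same world position, so only true object motion remains in the world-space track.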
Track4World enables consistent long-term dense tracking across frames, including on challenging sequences involving rapid motion, occlusions, and deformations.
Reconstructed point clouds and trajectories in the camera coordinate system.
Reconstructed motion in the global world coordinate system.
Given (a) the input video frames, Track4World first extracts (b) global scene representations (geometric embeddings, point clouds, and camera poses). (c) A sparse-to-dense scene flow decoder then predicts joint 2D-3D flows between arbitrary timesteps; it applies a novel 2D-to-3D correlation scheme that improves efficiency and allows joint 2D-3D supervision. (d) Finally, the pairwise flows are fused to establish holistic world-centric 3D tracking.
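
One simple way to picture step (d) is to treat the first frame as a reference and add the predicted reference-to-frame scene flow to its point cloud. The sketch below shows only that simplified view; the actual fusion used by Track4World is not specified here, and the function name `tracks_from_reference_flows` and the array layouts are assumptions.

```python
import numpy as np

def tracks_from_reference_flows(points_ref, flows_ref_to_t):
    """Assemble dense 3D trajectories from pairwise scene flow.

    points_ref:     (N, 3) world-space points of the reference frame.
    flows_ref_to_t: (T, N, 3) displacement of each point from the reference
                    frame to frame t (assumed layout; flow to itself is zero).
    Returns (T, N, 3) trajectories, one 3D position per point per frame.
    """
    return points_ref[None, :, :] + flows_ref_to_t

# Toy example: two points, three frames; the first point is static and the
# second moves 0.5 units along +x per frame.
points_ref = np.array([[0.0, 0.0, 5.0],
                       [1.0, 0.0, 4.0]])
flows = np.zeros((3, 2, 3))
flows[1, 1, 0] = 0.5
flows[2, 1, 0] = 1.0
tracks = tracks_from_reference_flows(points_ref, flows)
```

Because each position comes from a direct reference-to-frame flow rather than a chain of frame-to-frame flows, errors in one pair do not accumulate along the trajectory in this simplified setup.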
Comprehensive evaluation on Scene & Optical Flow, 3D Tracking, and 2D Tracking. Underlined values indicate the best results.
| Method | Kubric-3D val (short) | | | | | | | | Kubric-3D val (long) | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Abs Rel ↓ | δ<1.25 ↑ | EPE3D ↓ | AccS ↑ | AccR ↑ | EPE2D ↓ | AccS2D ↑ | AccR2D ↑ | Abs Rel ↓ | δ<1.25 ↑ | EPE3D ↓ | AccS ↑ | AccR ↑ | EPE2D ↓ | AccS2D ↑ | AccR2D ↑ |
| RAFT | / | / | / | / | / | 6.7974 | 0.7442 | 0.9018 | / | / | / | / | / | 51.9034 | 0.5183 | 0.7042 |
| GMFlowNet | / | / | / | / | / | 7.0390 | 0.7619 | 0.9103 | / | / | / | / | / | 51.9049 | 0.5309 | 0.7201 |
| SEA-RAFT | / | / | / | / | / | 9.5794 | 0.7720 | 0.8947 | / | / | / | / | / | 58.7148 | 0.5596 | 0.6825 |
| RAFT-3D | 0.0649 | 0.9344 | 0.6170 | 0.0015 | 0.0078 | 40.4480 | 0.0002 | 0.0015 | 0.1245 | 0.8422 | 1.5652 | 0.0001 | 0.0010 | 83.6966 | 0.0004 | 0.0040 |
| OpticalExpansion | 0.2170 | 0.6266 | 0.2093 | 0.2890 | 0.4760 | 19.6471 | 0.1183 | 0.3316 | 0.2177 | 0.6316 | 0.7037 | 0.1062 | 0.1903 | 68.6562 | 0.0255 | 0.0874 |
| POMATO | 0.1525 | 0.8329 | 0.9672 | 0.0566 | 0.1696 | / | / | / | 0.1761 | 0.7760 | 1.6925 | 0.0148 | 0.0564 | / | / | / |
| ZeroMSF | 0.0860 | 0.9196 | 0.3528 | 0.1867 | 0.3413 | / | / | / | 0.1208 | 0.8609 | 1.2182 | 0.0475 | 0.0895 | / | / | / |
| Any4D | 0.0585 | 0.9547 | 0.3908 | 0.1610 | 0.2893 | / | / | / | 0.1017 | 0.8770 | 1.2442 | 0.0429 | 0.0855 | / | / | / |
| V-DPM | 0.0716 | 0.9010 | 0.4087 | 0.1442 | 0.2491 | / | / | / | 0.1155 | 0.8205 | 1.2620 | 0.0407 | 0.0803 | / | / | / |
| Track4World (Ours) | 0.0344 | 0.9719 | 0.1537 | 0.5494 | 0.7460 | 1.8685 | 0.8086 | 0.9309 | 0.0472 | 0.9371 | 0.4808 | 0.3247 | 0.5491 | 15.0906 | 0.6134 | 0.7711 |
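
For readers unfamiliar with the Abs Rel and δ<1.25 columns, the sketch below computes them in their commonly used form. It is an illustration under assumptions: the valid-pixel mask, any scale alignment applied before evaluation, and the function name `depth_metrics` are not taken from the tables above.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Absolute relative error and delta < 1.25 accuracy over valid pixels.

    Pixels with non-positive ground-truth depth are ignored (assumption).
    """
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)       # mean |pred - gt| / gt
    ratio = np.maximum(pred / gt, gt / pred)        # symmetric depth ratio
    delta = np.mean(ratio < 1.25)                   # fraction within 25%
    return abs_rel, delta

# Toy depths; the last entry has no ground truth and is masked out.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([1.1, 2.0, 6.0, 3.0])
abs_rel, delta = depth_metrics(pred, gt)
```

Lower Abs Rel and higher δ<1.25 both indicate depth predictions closer to the ground truth.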
| Method | KITTI | | | | | | | | BlinkVision | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Abs Rel ↓ | δ<1.25 ↑ | EPE3D ↓ | AccS ↑ | AccR ↑ | EPE2D ↓ | AccS2D ↑ | AccR2D ↑ | Abs Rel ↓ | δ<1.25 ↑ | EPE3D ↓ | AccS ↑ | AccR ↑ | EPE2D ↓ | AccS2D ↑ | AccR2D ↑ |
| RAFT | / | / | / | / | / | 5.4150 | 0.6271 | 0.8068 | / | / | / | / | / | 14.1255 | 0.5037 | 0.6953 |
| GMFlowNet | / | / | / | / | / | 4.6977 | 0.6432 | 0.8241 | / | / | / | / | / | 12.0176 | 0.5281 | 0.7170 |
| SEA-RAFT | / | / | / | / | / | 4.8863 | 0.6654 | 0.8297 | / | / | / | / | / | 20.9160 | 0.5697 | 0.7186 |
| RAFT-3D | 0.1619 | 0.8413 | 0.3837 | 0.0118 | 0.0678 | 54.0938 | 0.0001 | 0.0007 | 0.1426 | 0.8455 | 0.6690 | 0.0454 | 0.1280 | 85.4975 | 0.0018 | 0.0121 |
| OpticalExpansion | 0.2764 | 0.4302 | 0.2419 | 0.1553 | 0.2612 | 8.8808 | 0.5446 | 0.7326 | 0.3372 | 0.4099 | 0.4406 | 0.2091 | 0.3116 | 20.2384 | 0.4122 | 0.6139 |
| POMATO | 0.2752 | 0.4359 | 0.2602 | 0.1127 | 0.2156 | / | / | / | 0.2089 | 0.6569 | 0.4038 | 0.1522 | 0.2870 | / | / | / |
| ZeroMSF | 0.2064 | 0.5913 | 0.1823 | 0.1695 | 0.3481 | / | / | / | 0.1934 | 0.6620 | 0.3937 | 0.1913 | 0.2991 | / | / | / |
| Any4D | 0.2398 | 0.4974 | 0.1856 | 0.1429 | 0.2931 | / | / | / | 0.2218 | 0.6125 | 0.9238 | 0.1242 | 0.1818 | / | / | / |
| V-DPM | 0.1469 | 0.7981 | 0.4462 | 0.1180 | 0.1608 | / | / | / | 0.2117 | 0.6449 | 1.1476 | 0.1079 | 0.1547 | / | / | / |
| Track4World (Ours) | 0.0707 | 0.9570 | 0.0742 | 0.6929 | 0.8238 | 2.5722 | 0.6849 | 0.8769 | 0.0371 | 0.9768 | 0.1135 | 0.5091 | 0.7144 | 7.5632 | 0.5131 | 0.7424 |
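
The EPE3D, AccS, and AccR columns follow the usual scene flow evaluation vocabulary. The sketch below uses the thresholds common in the scene flow literature (0.05 or 5% for the strict accuracy, 0.1 or 10% for the relaxed accuracy); whether the tables above use exactly these thresholds is an assumption, as is the function name `scene_flow_metrics`.

```python
import numpy as np

def scene_flow_metrics(pred, gt):
    """End-point error plus strict/relaxed accuracy for (N, 3) flow vectors.

    Thresholds follow a common convention and are assumed, not confirmed.
    """
    err = np.linalg.norm(pred - gt, axis=-1)                     # per-point end-point error
    rel = err / np.maximum(np.linalg.norm(gt, axis=-1), 1e-8)    # error relative to GT magnitude
    epe = err.mean()
    acc_strict = np.mean((err < 0.05) | (rel < 0.05))
    acc_relax = np.mean((err < 0.10) | (rel < 0.10))
    return epe, acc_strict, acc_relax

# Toy flows: one accurate, one moderately off, one far off.
gt_flow = np.array([[1.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0],
                    [0.0, 0.0, 1.0]])
pred_flow = np.array([[1.02, 0.0, 0.0],
                      [0.0, 2.15, 0.0],
                      [0.0, 0.0, 1.5]])
epe, acc_strict, acc_relax = scene_flow_metrics(pred_flow, gt_flow)
```

The second vector shows why the relative criterion matters: its absolute error exceeds 0.1, yet it counts as correct under the relaxed accuracy because the error is below 10% of the ground-truth magnitude.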
| Method | PointOdyssey | | ADT | | PStudio | | DriveTrack | | Avg. | |
|---|---|---|---|---|---|---|---|---|---|---|
| | L-16 | L-50 | L-16 | L-50 | L-16 | L-50 | L-16 | L-50 | L-16 | L-50 |
| Camera coordinate 3D tracking | ||||||||||
| SpatialTracker* | 0.3116 | 0.2977 | 0.4962 | 0.4692 | 0.5390 | 0.4991 | 0.2529 | 0.2502 | 0.3999 | 0.3791 |
| DELTA* | 0.3529 | 0.3412 | 0.5116 | 0.4952 | 0.5922 | 0.5533 | 0.2704 | 0.2701 | 0.4317 | 0.4150 |
| STV2† | 0.1864 | 0.1785 | 0.2400 | 0.2330 | 0.3784 | 0.3690 | 0.1711 | 0.1725 | 0.2400 | 0.2383 |
| MASt3R | 0.3546 | 0.3253 | 0.3368 | 0.3029 | 0.3293 | 0.2956 | 0.2767 | 0.2559 | 0.3244 | 0.2949 |
| MonST3R | 0.3912 | 0.3860 | 0.3694 | 0.3429 | 0.3511 | 0.3381 | 0.3056 | 0.2787 | 0.3543 | 0.3364 |
| POMATO | 0.4816 | 0.4623 | 0.5338 | 0.5299 | 0.5163 | 0.4726 | 0.4237 | 0.4329 | 0.4888 | 0.4744 |
| ZeroMSF | 0.4214 | 0.3887 | 0.5382 | 0.4635 | 0.5083 | 0.4524 | 0.4448 | 0.4513 | 0.4782 | 0.4390 |
| Track4World (Ours) | 0.5397 | 0.5268 | 0.6501 | 0.6091 | 0.5948 | 0.5423 | 0.5003 | 0.5092 | 0.5712 | 0.5469 |
| World coordinate 3D tracking | ||||||||||
| STV2† | 0.1925 | 0.1763 | 0.2456 | 0.2163 | 0.3790 | 0.3689 | 0.1711 | 0.1725 | 0.2470 | 0.2335 |
| POMATO‡ | 0.4425 | 0.3905 | 0.3611 | 0.3548 | 0.5166 | 0.4713 | 0.4227 | 0.4210 | 0.4357 | 0.4094 |
| ZeroMSF‡ | 0.4053 | 0.3505 | 0.4530 | 0.3563 | 0.4828 | 0.4386 | 0.4474 | 0.4382 | 0.4471 | 0.3959 |
| Any4D | 0.4769 | 0.4174 | 0.4460 | 0.3717 | 0.5707 | 0.5066 | 0.5235 | 0.5079 | 0.5043 | 0.4509 |
| V-DPM | 0.4848 | 0.4233 | 0.4783 | 0.3759 | 0.6084 | 0.5795 | 0.4854 | 0.4817 | 0.5142 | 0.4668 |
| Track4World (Ours) | 0.5345 | 0.5162 | 0.6250 | 0.5622 | 0.5946 | 0.5422 | 0.5003 | 0.5087 | 0.5636 | 0.5323 |
* Ground-truth intrinsics used. † No bundle adjustment. ‡ Using camera poses estimated by VGGT.
| Method | Kinetics | | | RoboTAP | | | RGB-S | | |
|---|---|---|---|---|---|---|---|---|---|
| | AJ ↑ | δvis ↑ | OA ↑ | AJ ↑ | δvis ↑ | OA ↑ | AJ ↑ | δvis ↑ | OA ↑ |
| PIPs++ | / | 63.5 | / | / | 63.0 | / | / | 58.5 | / |
| TAPIR | 49.6 | 64.2 | 85.0 | 59.6 | 73.4 | 87.0 | 55.5 | 69.7 | 88.0 |
| CoTracker | 49.6 | 64.3 | 83.3 | 58.6 | 70.6 | 87.0 | 67.4 | 78.9 | 85.2 |
| TAPTR | 49.0 | 64.4 | 85.2 | 60.1 | 75.3 | 86.9 | 60.8 | 76.2 | 87.0 |
| LocoTrack | 52.9 | 66.8 | 85.3 | 62.3 | 76.2 | 87.1 | 69.7 | 83.2 | 89.5 |
| BootsTAPIR | 54.6 | 68.4 | 86.5 | 64.9 | 80.1 | 86.3 | 70.8 | 83.0 | 89.9 |
| CoTracker3 | 55.8 | 68.5 | 88.3 | 66.4 | 78.8 | 90.8 | 71.7 | 83.6 | 91.1 |
| Track4World (Ours) | 59.1 | 71.3 | 90.6 | 70.9 | 81.8 | 93.3 | 78.2 | 88.5 | 92.3 |
| Method | Sintel | | | Bonn | | |
|---|---|---|---|---|---|---|
| | ATE ↓ | RTE ↓ | RRE ↓ | ATE ↓ | RTE ↓ | RRE ↓ |
| Align3R | 0.128 | 0.042 | 0.432 | 0.023 | 0.007 | 0.620 |
| CUT3R | 0.217 | 0.070 | 0.636 | 0.035 | 0.014 | 1.212 |
| VGGT | 0.167 | 0.062 | 0.490 | 0.051 | 0.011 | 1.038 |
| MapAnything | 0.227 | 0.111 | 2.047 | 0.026 | 0.014 | 0.668 |
| Pi3 | 0.088 | 0.043 | 0.299 | 0.012 | 0.011 | 0.612 |
| DA3 | 0.124 | 0.061 | 0.331 | 0.010 | 0.011 | 0.638 |
| POMATO | 0.209 | 0.064 | 0.694 | 0.041 | 0.017 | 0.832 |
| STV2 | 0.133 | 0.057 | 0.641 | 0.019 | 0.015 | 0.701 |
| Track4World (Ours) | 0.119 | 0.054 | 0.309 | 0.009 | 0.009 | 0.604 |
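
The ATE column measures how far an estimated camera trajectory deviates from the ground truth after the two are aligned. The sketch below computes an ATE RMSE after a similarity (scale, rotation, translation) alignment via the Umeyama method. The alignment protocol behind the table above is not stated, so the Sim(3) alignment, the function names, and the synthetic trajectory are all assumptions.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Least-squares similarity transform (scale, R, t) mapping src onto dst.

    src, dst: (N, 3) corresponding camera positions.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)              # cross-covariance, (3, 3)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid returning a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / src_c.var(axis=0).sum()
    t = mu_dst - scale * R @ mu_src
    return scale, R, t

def ate_rmse(est_positions, gt_positions):
    """RMSE of position error after aligning the estimate to the ground truth."""
    scale, R, t = umeyama_alignment(est_positions, gt_positions)
    aligned = scale * est_positions @ R.T + t
    return np.sqrt(np.mean(np.sum((aligned - gt_positions) ** 2, axis=1)))

# Synthetic check: the ground truth is an exact similarity transform of the estimate.
rng = np.random.default_rng(0)
est = rng.normal(size=(20, 3))
angle = np.deg2rad(30.0)
R0 = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle), np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
gt = 2.0 * est @ R0.T + np.array([1.0, -2.0, 0.5])
ate_exact = ate_rmse(est, gt)

# Perturbing a single pose yields a non-zero error.
gt_noisy = gt.copy()
gt_noisy[0] += 0.5
ate_noisy = ate_rmse(est, gt_noisy)
```

Aligning before measuring error matters for monocular methods, whose trajectories are typically recovered only up to an unknown global scale and rigid transform.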
@article{lu2026track4world,
title = {Track4World: Feedforward World-Centric Dense 3D Tracking of All Pixels},
author = {Jiahao Lu and Jiayi Xu and Wenbo Hu and Ruijie Zhu and Chengfeng Zhao and Sai-Kit Yeung and Ying Shan and Yuan Liu},
journal = {arXiv preprint arXiv:2603.02573},
year = {2026}
}