Track4World: Future of Motion Reconstruction

Abstract

Estimating the 3D trajectory of every pixel from a monocular video is key to a comprehensive understanding of a video's 3D dynamics. Recent monocular 3D tracking methods demonstrate impressive performance, but they either track only sparse points from the first frame or rely on slow optimization-based frameworks for dense tracking.

In this paper, we propose Track4World, a feedforward model that enables efficient, holistic 3D tracking of every pixel in a world-centric coordinate system. Built on the global 3D scene representation encoded by a VGGT-style ViT, Track4World applies a novel 3D correlation scheme to simultaneously estimate pixel-wise dense 2D and 3D flow between arbitrary frame pairs.

The estimated scene flow, together with the reconstructed 3D geometry, enables efficient 3D tracking of every pixel in the video. Extensive experiments on multiple benchmarks demonstrate that our approach consistently outperforms existing methods in 2D/3D flow estimation and 3D tracking, highlighting its robustness and scalability for real-world 4D reconstruction tasks.

Feedforward Model: Efficient Holistic Tracking

World-Centric: Global Coordinate System

Dense Tracking: Trajectory for Every Pixel

2D-to-3D Correlation: Simultaneous 2D & 3D Flow

Camera-Centric 3D Tracking

Our method lifts 2D correspondences into 3D space, enabling robust tracking of points relative to the moving camera coordinate system.
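As a rough illustration of what lifting a 2D point into camera-centric 3D space involves, the sketch below back-projects pixels with a pinhole model. The function name `backproject`, the intrinsics `K`, and the sample depths are assumptions made for this example; they are not taken from the Track4World code.

```python
import numpy as np

def backproject(pixels, depths, K):
    """Lift (N, 2) pixel coordinates with (N,) depths to (N, 3) camera-frame points.

    Assumes a pinhole camera with a 3x3 intrinsics matrix K (illustrative only).
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.hstack([pixels, ones])      # (N, 3) homogeneous pixel coordinates
    rays = homog @ np.linalg.inv(K).T      # (N, 3) rays with z = 1
    return rays * depths[:, None]          # scale each ray by its depth

# Hypothetical intrinsics and observations.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[320.0, 240.0], [420.0, 240.0]])
depths = np.array([2.0, 4.0])
points_cam = backproject(pixels, depths, K)
```

Applying the same lifting to a pixel's correspondences in later frames yields a trajectory expressed relative to the moving camera.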


World-Centric 3D Tracking

By decoupling camera ego-motion, our method effectively establishes consistent and dense point tracking in the globally aligned world coordinate system.
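To make the effect of removing ego-motion concrete, the following sketch maps camera-frame points into a shared world frame using camera-to-world poses. The helper `camera_to_world` and the 4x4 pose convention are assumptions for illustration, not the project's actual interface.

```python
import numpy as np

def camera_to_world(points_cam, T_wc):
    """Map (N, 3) camera-frame points to the world frame.

    T_wc is assumed to be a 4x4 camera-to-world rigid transform.
    """
    R, t = T_wc[:3, :3], T_wc[:3, 3]
    return points_cam @ R.T + t

# A static point observed from two camera poses (hypothetical values).
T0 = np.eye(4)
T1 = np.eye(4)
T1[:3, 3] = [1.0, 0.0, 0.0]               # camera moved 1 unit along world x

p_cam0 = np.array([[0.0, 0.0, 2.0]])      # point as seen at time 0
p_cam1 = np.array([[-1.0, 0.0, 2.0]])     # same point as seen from the moved camera

w0 = camera_to_world(p_cam0, T0)
w1 = camera_to_world(p_cam1, T1)
```

In the camera frame the static point appears to move, while `w0` and `w1` coincide in the world frame, so any remaining displacement there reflects genuine scene motion.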


Dense 2D Tracking

Track4World enables consistent long-term dense tracking across frames, including on challenging sequences involving rapid motion, occlusions, and deformations.


Methodology

Given (a) the input video frames, Track4World first extracts (b) global scene representations (geometric embeddings, point clouds, and camera poses). (c) A sparse-to-dense scene flow decoder then predicts joint 2D-3D flows between arbitrary timesteps; it applies a novel 2D-to-3D correlation scheme that improves efficiency and allows joint 2D-3D supervision. (d) The pairwise flows are finally fused to establish holistic world-centric 3D tracking.
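One simple way to picture step (d) is to assemble dense trajectories from a reference-frame point map and the scene flows predicted from that reference frame to every other frame. The sketch below does only that; the array layout and the name `tracks_from_reference_flows` are assumptions, and the actual fusion used by Track4World may differ.

```python
import numpy as np

def tracks_from_reference_flows(points_ref, flows_ref_to_t):
    """Build per-pixel 3D trajectories from a reference point map and pairwise flows.

    points_ref:     (H, W, 3) world-frame 3D point for every reference pixel.
    flows_ref_to_t: (T, H, W, 3) 3D displacement from the reference frame to frame t.
    Returns:        (T, H, W, 3) 3D position of every reference pixel at each frame.
    """
    return points_ref[None] + flows_ref_to_t

# Toy example: a 2x2 image whose points drift 0.1 units per frame along x.
H, W, T = 2, 2, 3
points_ref = np.zeros((H, W, 3))
points_ref[..., 2] = 1.0                       # all points at depth 1
velocity = np.array([0.1, 0.0, 0.0])
flows = np.stack([np.broadcast_to(t * velocity, (H, W, 3)) for t in range(T)])

tracks = tracks_from_reference_flows(points_ref, flows)
```

Because each flow is predicted directly between the reference frame and frame t, this formulation avoids accumulating drift from chaining frame-to-frame estimates.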

Performance Metrics

Comprehensive evaluation on scene and optical flow, 3D tracking, 2D tracking, and camera pose estimation. Best results are underlined.

1. Scene & Optical Flow Estimation

In-Domain (Kubric-3D)

Method	Kubric-3D val (short)								Kubric-3D val (long)
Method	Abs Rel ↓	δ<1.25 ↑	EPE3D ↓	AccS ↑	AccR ↑	EPE2D ↓	AccS_2D ↑	AccR_2D ↑	Abs Rel ↓	δ<1.25 ↑	EPE3D ↓	AccS ↑	AccR ↑	EPE2D ↓	AccS_2D ↑	AccR_2D ↑
RAFT	/	/	/	/	/	6.7974	0.7442	0.9018	/	/	/	/	/	51.9034	0.5183	0.7042
GMFlowNet	/	/	/	/	/	7.0390	0.7619	0.9103	/	/	/	/	/	51.9049	0.5309	0.7201
SEA-RAFT	/	/	/	/	/	9.5794	0.7720	0.8947	/	/	/	/	/	58.7148	0.5596	0.6825
RAFT-3D	0.0649	0.9344	0.6170	0.0015	0.0078	40.4480	0.0002	0.0015	0.1245	0.8422	1.5652	0.0001	0.0010	83.6966	0.0004	0.0040
OpticalExpansion	0.2170	0.6266	0.2093	0.2890	0.4760	19.6471	0.1183	0.3316	0.2177	0.6316	0.7037	0.1062	0.1903	68.6562	0.0255	0.0874
POMATO	0.1525	0.8329	0.9672	0.0566	0.1696	/	/	/	0.1761	0.7760	1.6925	0.0148	0.0564	/	/	/
ZeroMSF	0.0860	0.9196	0.3528	0.1867	0.3413	/	/	/	0.1208	0.8609	1.2182	0.0475	0.0895	/	/	/
Any4D	0.0585	0.9547	0.3908	0.1610	0.2893	/	/	/	0.1017	0.8770	1.2442	0.0429	0.0855	/	/	/
V-DPM	0.0716	0.9010	0.4087	0.1442	0.2491	/	/	/	0.1155	0.8205	1.2620	0.0407	0.0803	/	/	/
Track4World (Ours)	0.0344	0.9719	0.1537	0.5494	0.7460	1.8685	0.8086	0.9309	0.0472	0.9371	0.4808	0.3247	0.5491	15.0906	0.6134	0.7711

Out-of-Domain (KITTI & BlinkVision)

Method	KITTI								BlinkVision
Method	Abs Rel ↓	δ<1.25 ↑	EPE3D ↓	AccS ↑	AccR ↑	EPE2D ↓	AccS_2D ↑	AccR_2D ↑	Abs Rel ↓	δ<1.25 ↑	EPE3D ↓	AccS ↑	AccR ↑	EPE2D ↓	AccS_2D ↑	AccR_2D ↑
RAFT	/	/	/	/	/	5.4150	0.6271	0.8068	/	/	/	/	/	14.1255	0.5037	0.6953
GMFlowNet	/	/	/	/	/	4.6977	0.6432	0.8241	/	/	/	/	/	12.0176	0.5281	0.7170
SEA-RAFT	/	/	/	/	/	4.8863	0.6654	0.8297	/	/	/	/	/	20.9160	0.5697	0.7186
RAFT-3D	0.1619	0.8413	0.3837	0.0118	0.0678	54.0938	0.0001	0.0007	0.1426	0.8455	0.6690	0.0454	0.1280	85.4975	0.0018	0.0121
OpticalExpansion	0.2764	0.4302	0.2419	0.1553	0.2612	8.8808	0.5446	0.7326	0.3372	0.4099	0.4406	0.2091	0.3116	20.2384	0.4122	0.6139
POMATO	0.2752	0.4359	0.2602	0.1127	0.2156	/	/	/	0.2089	0.6569	0.4038	0.1522	0.2870	/	/	/
ZeroMSF	0.2064	0.5913	0.1823	0.1695	0.3481	/	/	/	0.1934	0.6620	0.3937	0.1913	0.2991	/	/	/
Any4D	0.2398	0.4974	0.1856	0.1429	0.2931	/	/	/	0.2218	0.6125	0.9238	0.1242	0.1818	/	/	/
V-DPM	0.1469	0.7981	0.4462	0.1180	0.1608	/	/	/	0.2117	0.6449	1.1476	0.1079	0.1547	/	/	/
Track4World (Ours)	0.0707	0.9570	0.0742	0.6929	0.8238	2.5722	0.6849	0.8769	0.0371	0.9768	0.1135	0.5091	0.7144	7.5632	0.5131	0.7424
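The page does not define the column headers in the flow tables above. The sketch below follows definitions that are common in the scene flow and depth estimation literature: end-point error, strict and relaxed accuracy, absolute relative depth error, and the δ<1.25 inlier ratio. The 0.05 and 0.10 thresholds are assumptions, and the exact evaluation protocol behind the reported numbers may differ.

```python
import numpy as np

def scene_flow_metrics(pred, gt, strict=0.05, relaxed=0.10):
    """EPE3D, AccS, and AccR for (N, 3) predicted and ground-truth flows.

    Thresholds are assumed: a point counts as accurate if its absolute or
    relative error falls below the threshold.
    """
    err = np.linalg.norm(pred - gt, axis=-1)
    rel = err / np.maximum(np.linalg.norm(gt, axis=-1), 1e-8)
    return {
        "EPE3D": float(err.mean()),
        "AccS": float(np.mean((err < strict) | (rel < strict))),
        "AccR": float(np.mean((err < relaxed) | (rel < relaxed))),
    }

def depth_metrics(pred, gt):
    """Abs Rel and delta < 1.25 for positive (N,) depth arrays."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    return {"AbsRel": float(abs_rel), "delta_1.25": float(np.mean(ratio < 1.25))}

# Toy data: unit-length ground-truth flows with errors of 0, 0.04, 0.08, and 0.5.
gt_flow = np.tile([1.0, 0.0, 0.0], (4, 1))
pred_flow = gt_flow + np.array([[0, 0.0, 0], [0, 0.04, 0], [0, 0.08, 0], [0, 0.5, 0]])
flow_scores = scene_flow_metrics(pred_flow, gt_flow)

depth_scores = depth_metrics(np.array([1.0, 2.2, 3.0, 8.0]),
                             np.array([1.0, 2.0, 4.0, 4.0]))
```

The 2D columns (EPE2D, AccS_2D, AccR_2D) can be read analogously for pixel-space flow, with thresholds appropriate to pixel units.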

2. 3D Tracking Estimation (APD Metric)

Method	PointOdyssey		ADT		PStudio		DriveTrack		Avg.
Method	L-16	L-50	L-16	L-50	L-16	L-50	L-16	L-50	L-16	L-50
Camera coordinate 3D tracking
SpatialTracker*	0.3116	0.2977	0.4962	0.4692	0.5390	0.4991	0.2529	0.2502	0.3999	0.3791
DELTA*	0.3529	0.3412	0.5116	0.4952	0.5922	0.5533	0.2704	0.2701	0.4317	0.4150
STV2†	0.1864	0.1785	0.2400	0.2330	0.3784	0.3690	0.1711	0.1725	0.2400	0.2383
MASt3R	0.3546	0.3253	0.3368	0.3029	0.3293	0.2956	0.2767	0.2559	0.3244	0.2949
MonST3R	0.3912	0.3860	0.3694	0.3429	0.3511	0.3381	0.3056	0.2787	0.3543	0.3364
POMATO	0.4816	0.4623	0.5338	0.5299	0.5163	0.4726	0.4237	0.4329	0.4888	0.4744
ZeroMSF	0.4214	0.3887	0.5382	0.4635	0.5083	0.4524	0.4448	0.4513	0.4782	0.4390
Track4World (Ours)	0.5397	0.5268	0.6501	0.6091	0.5948	0.5423	0.5003	0.5092	0.5712	0.5469
World coordinate 3D tracking
STV2†	0.1925	0.1763	0.2456	0.2163	0.3790	0.3689	0.1711	0.1725	0.2470	0.2335
POMATO‡	0.4425	0.3905	0.3611	0.3548	0.5166	0.4713	0.4227	0.4210	0.4357	0.4094
ZeroMSF‡	0.4053	0.3505	0.4530	0.3563	0.4828	0.4386	0.4474	0.4382	0.4471	0.3959
Any4D	0.4769	0.4174	0.4460	0.3717	0.5707	0.5066	0.5235	0.5079	0.5043	0.4509
V-DPM	0.4848	0.4233	0.4783	0.3759	0.6084	0.5795	0.4854	0.4817	0.5142	0.4668
Track4World (Ours)	0.5345	0.5162	0.6250	0.5622	0.5946	0.5422	0.5003	0.5087	0.5636	0.5323

* Ground-truth intrinsics used. † No bundle adjustment. ‡ Using camera poses estimated by VGGT.

3. 2D Tracking Estimation

Method	Kinetics			RoboTAP			RGB-S
Method	AJ ↑	δ_vis ↑	OA ↑	AJ ↑	δ_vis ↑	OA ↑	AJ ↑	δ_vis ↑	OA ↑
PIPs++	/	63.5	/	/	63.0	/	/	58.5	/
TAPIR	49.6	64.2	85.0	59.6	73.4	87.0	55.5	69.7	88.0
CoTracker	49.6	64.3	83.3	58.6	70.6	87.0	67.4	78.9	85.2
TAPTR	49.0	64.4	85.2	60.1	75.3	86.9	60.8	76.2	87.0
LocoTrack	52.9	66.8	85.3	62.3	76.2	87.1	69.7	83.2	89.5
BootsTAPIR	54.6	68.4	86.5	64.9	80.1	86.3	70.8	83.0	89.9
CoTracker3	55.8	68.5	88.3	66.4	78.8	90.8	71.7	83.6	91.1
Track4World (Ours)	59.1	71.3	90.6	70.9	81.8	93.3	78.2	88.5	92.3
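For readers unfamiliar with the columns above, the sketch below computes two of them under the assumption that they follow the TAP-Vid benchmark conventions: position accuracy on visible points averaged over pixel thresholds of 1, 2, 4, 8, and 16, and occlusion accuracy (OA). Average Jaccard (AJ) is omitted for brevity. The function name and toy data are illustrative only.

```python
import numpy as np

def tapvid_style_metrics(pred_xy, gt_xy, pred_visible, gt_visible,
                         thresholds=(1, 2, 4, 8, 16)):
    """Position accuracy on visible points and occlusion accuracy, in percent.

    pred_xy, gt_xy:           (N, 2) pixel positions.
    pred_visible, gt_visible: (N,) boolean visibility flags.
    """
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)
    # Fraction of ground-truth-visible points within each pixel threshold.
    within = [np.mean(dist[gt_visible] < t) for t in thresholds]
    return {
        "delta_avg": 100.0 * float(np.mean(within)),
        "OA": 100.0 * float(np.mean(pred_visible == gt_visible)),
    }

# Toy data: four points with position errors of 0.5, 3, 10, and 20 pixels.
gt_xy = np.zeros((4, 2))
pred_xy = np.array([[0.5, 0.0], [3.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
gt_visible = np.array([True, True, True, False])
pred_visible = np.array([True, True, False, False])

scores_2d = tapvid_style_metrics(pred_xy, gt_xy, pred_visible, gt_visible)
```

In this toy case the three visible points give per-threshold accuracies of 1/3, 1/3, 2/3, 2/3, and 1, and three of the four visibility flags agree.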

4. Camera Pose Estimation

Method	Sintel			Bonn
Method	ATE ↓	RTE ↓	RRE ↓	ATE ↓	RTE ↓	RRE ↓
Align3R	0.128	0.042	0.432	0.023	0.007	0.620
CUT3R	0.217	0.070	0.636	0.035	0.014	1.212
VGGT	0.167	0.062	0.490	0.051	0.011	1.038
MapAnything	0.227	0.111	2.047	0.026	0.014	0.668
Pi3	0.088	0.043	0.299	0.012	0.011	0.612
DA3	0.124	0.061	0.331	0.010	0.011	0.638
POMATO	0.209	0.064	0.694	0.041	0.017	0.832
STV2	0.133	0.057	0.641	0.019	0.015	0.701
Track4World (Ours)	0.119	0.054	0.309	0.009	0.009	0.604
