Geometry-Aware Monocular-to-Stereo Video Generation
The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency.
| Dataset | Domain | IPD-aligned | Available | Frames |
|---|---|---|---|---|
| Spring | Optical Flow | ✗ | ✓ | 5K |
| Sintel | Optical Flow | ✗ | ✓ | 1K |
| VKITTI2 | Driving | ✗ | ✓ | 21K |
| PLT-D3 | Driving | ✗ | ✓ | 3K |
| IRS | Robotics | ✗ | ✓ | 103K |
| TartanAir | Robotics | ✗ | ✓ | 306K |
| 3D Movies | Movies | ✓ | ✗ | 75K |
| StereoWorld-11M | Movies | ✓ | ✓ | 11M |
We curated a new dataset tailored for stereo video generation, with the camera baseline (the distance between the two cameras) aligned to the natural human IPD. We collected and cleaned over a hundred high-definition Blu-ray side-by-side (SBS) stereo movies spanning the animation, realism, war, sci-fi, historical, and drama genres, ensuring visual diversity and richness.
All videos are unified into the SBS format; the left and right views are obtained by stretching and horizontal cropping, each at 1080p, 16:9, and 24 fps. To match the base model's input requirements (480p resolution, 81 frames), we uniformly downscale each view to 480p. To enhance motion diversity and temporal density, we uniformly sample 81 frames per clip at fixed intervals.
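A minimal preprocessing sketch, assuming the SBS frames are already decoded into a NumPy array; `preprocess_sbs_clip`, its defaults, and the use of OpenCV are illustrative choices, not the exact pipeline:

```python
import numpy as np
import cv2

def preprocess_sbs_clip(frames: np.ndarray, num_frames: int = 81,
                        out_hw: tuple = (480, 854)) -> tuple:
    """Split an SBS clip into left/right views and sample 81 frames.

    frames: (T, H, 2W, 3) uint8 array decoded from an SBS movie.
    Returns (left, right), each (num_frames, out_h, out_w, 3).
    """
    # Uniformly sample `num_frames` indices at a fixed stride.
    idx = np.linspace(0, frames.shape[0] - 1, num_frames).round().astype(int)
    frames = frames[idx]
    # Horizontal split; resizing back to 16:9 undoes the anamorphic squeeze.
    half = frames.shape[2] // 2
    left, right = frames[:, :, :half], frames[:, :, half:]
    resize = lambda clip: np.stack(
        [cv2.resize(f, out_hw[::-1]) for f in clip])  # cv2 expects (w, h)
    return resize(left), resize(right)
```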
Before training, we use Video Depth Anything and Stereo Any Video to obtain the depth maps D_r and the ground-truth disparity maps Disp_gt. The left-view videos are then concatenated with the right-view videos and the corresponding depth maps along the frame dimension in the latent space to form the conditioning inputs. During training, a lightweight differentiable stereo projector estimates the disparity between the input left view and the generated right view; this estimate is supervised by Disp_gt via a disparity loss to enforce accurate geometric correspondence, as sketched below.
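The projector and loss below are a minimal sketch; the paper does not specify the projector's architecture, so the small convolutional regressor, `StereoProjector`, and `disparity_loss` are assumptions (video frames are folded into the batch dimension):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StereoProjector(nn.Module):
    """Lightweight differentiable projector (sketch): regresses per-pixel
    disparity from the input left view and the generated right view."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        # left, right: (B, 3, H, W) per-frame views -> (B, 1, H, W) disparity
        return self.net(torch.cat([left, right], dim=1))

def disparity_loss(projector: StereoProjector, left, right_gen, disp_gt):
    """L1 supervision of the predicted disparity against Disp_gt."""
    return F.l1_loss(projector(left, right_gen), disp_gt)
```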
Additionally, the last few DiT blocks are duplicated to form dual branches, allowing the model to learn the RGB and depth distributions separately and thereby capture additional geometric information. During inference, only the shared and RGB DiT blocks are used, taking the monocular video as the sole input (see the sketch below).
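A sketch of the duplication, assuming the base model exposes its DiT blocks as an `nn.ModuleList`; `DualBranchTail`, `k`, and the simplified block signature are illustrative:

```python
import copy
import torch.nn as nn

class DualBranchTail(nn.Module):
    """Duplicate the last k DiT blocks into parallel RGB and depth branches
    (sketch; block internals and token routing follow the base video DiT)."""
    def __init__(self, blocks: nn.ModuleList, k: int = 2):
        super().__init__()
        self.shared = blocks[:-k]
        self.rgb_branch = blocks[-k:]
        # The depth branch starts as a copy of the RGB tail and then
        # specializes on the depth distribution during training.
        self.depth_branch = copy.deepcopy(blocks[-k:])

    def forward(self, x, with_depth: bool = True):
        for blk in self.shared:
            x = blk(x)
        rgb = x
        for blk in self.rgb_branch:
            rgb = blk(rgb)
        if not with_depth:            # inference: shared + RGB blocks only
            return rgb, None
        depth = x
        for blk in self.depth_branch:
            depth = blk(depth)
        return rgb, depth
```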
During training, the first few frames of the noisy latents are replaced with ground-truth frames with probability p. During inference, long videos are split into overlapping segments, and the last frames of each segment guide the next one, ensuring temporal consistency.
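A minimal sketch of the segment-wise scheme, where `model.sample(cond, prefix=...)` is a hypothetical denoising entry point that replaces the first frames of the noisy latents with the given prefix, mirroring the training-time replacement:

```python
import torch

def generate_long_video(model, mono_latents, seg_len=81, overlap=8):
    """Split a long latent video (B, C, T, H, W) into overlapping segments;
    the last `overlap` frames of each generated segment guide the next."""
    t = mono_latents.shape[2]
    outputs, prev_tail, start = [], None, 0
    while start < t:
        end = min(start + seg_len, t)
        cond = mono_latents[:, :, start:end]      # monocular conditioning
        seg = model.sample(cond, prefix=prev_tail)
        # Keep only the newly generated frames after the first segment.
        outputs.append(seg if prev_tail is None else seg[:, :, overlap:])
        prev_tail = seg[:, :, -overlap:]          # guides the next segment
        start = end - overlap if end < t else end
    return torch.cat(outputs, dim=2)
```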
During inference, high-resolution videos are encoded into latents, which are split into overlapping tiles. Each tile is denoised independently, and the tiles are then stitched back to the original size, with overlapping regions fused before decoding.
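Below is a sketch of the spatial half of the tiling (the temporal axis is tiled analogously); `denoise_fn`, the tile size, and the linear-ramp fusion weights are assumptions, and the latent height/width are assumed to be at least one tile:

```python
import torch

def tiled_denoise(denoise_fn, latents, tile=64, overlap=16):
    """Denoise overlapping spatial tiles independently and fuse the
    overlaps with linear ramps before stitching back to full size."""
    b, c, t, h, w = latents.shape
    stride = tile - overlap
    ys = list(range(0, h - tile + 1, stride))
    xs = list(range(0, w - tile + 1, stride))
    if ys[-1] != h - tile: ys.append(h - tile)    # cover the bottom edge
    if xs[-1] != w - tile: xs.append(w - tile)    # cover the right edge
    # 1D ramps fade in/out over the overlap and never reach zero.
    ramp = torch.ones(tile)
    fade = torch.arange(1, overlap + 1, dtype=torch.float32) / (overlap + 1)
    ramp[:overlap], ramp[-overlap:] = fade, fade.flip(0)
    mask = (ramp[:, None] * ramp[None, :]).to(latents.device, latents.dtype)
    out = torch.zeros_like(latents)
    weight = torch.zeros(1, 1, 1, h, w, device=latents.device,
                         dtype=latents.dtype)
    for y in ys:
        for x in xs:
            patch = latents[..., y:y + tile, x:x + tile]
            out[..., y:y + tile, x:x + tile] += denoise_fn(patch) * mask
            weight[..., y:y + tile, x:x + tile] += mask
    return out / weight                           # weighted average per pixel
```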
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | IQ-Score ↑ | TF-Score ↑ | EPE ↓ | D1-all ↓ |
|---|---|---|---|---|---|---|---|
| GenStereo | 19.4486 | 0.6803 | 0.3008 | 0.4047 | 0.9642 | 35.0022 | 0.8954 |
| SVG | 18.0256 | 0.5881 | 0.3467 | 0.4714 | 0.9706 | 33.2508 | 0.9630 |
| StereoCrafter | 23.0372 | 0.6561 | 0.1869 | 0.4370 | 0.9685 | 24.7784 | 0.5271 |
| Ours | 25.9794 | 0.7964 | 0.0952 | 0.5019 | 0.9704 | 17.4527 | 0.4213 |