StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation

1Beijing Jiaotong University 2Dzine AI 3University of Toronto

TL;DR

  • We propose StereoWorld, the first fully end-to-end diffusion framework that adapts a pretrained monocular video generative model into a stereo generator with high visual fidelity and geometric accuracy.
  • We build a large-scale, high-definition stereo video dataset aligned to natural human interpupillary distance (IPD), featuring over 11M curated Blu-ray SBS video frames across diverse genres, paired with comprehensive evaluation metrics.
  • Extensive experiments demonstrate that StereoWorld substantially outperforms prior works in visual quality, geometric consistency, and temporal stability, with clear advantages in both objective metrics and subjective perception.

Abstract

The growing adoption of XR devices has fueled strong demand for high-quality stereo video, yet its production remains costly and artifact-prone. To address this challenge, we present StereoWorld, an end-to-end framework that repurposes a pretrained video generator for high-fidelity monocular-to-stereo video generation. Our framework jointly conditions the model on the monocular video input while explicitly supervising the generation with a geometry-aware regularization to ensure 3D structural fidelity. A spatio-temporal tiling scheme is further integrated to enable efficient, high-resolution synthesis. To enable large-scale training and evaluation, we curate a high-definition stereo video dataset containing over 11M frames aligned to natural human interpupillary distance (IPD). Extensive experiments demonstrate that StereoWorld substantially outperforms prior methods, generating stereo videos with superior visual fidelity and geometric consistency.

Method

StereoWorld-11M Dataset

Dataset          Domain        Frames
Spring           Optical Flow      5K
Sintel           Optical Flow      1K
VKITTI2          Driving          21K
PLT-D3           Driving           3K
IRS              Robotics        103K
TartanAir        Robotics        306K
3D Movies        Movies           75K
StereoWorld-11M  Movies           11M

We curated a new dataset tailored for stereo video generation, with the stereo baseline (the distance between the two cameras) aligned to natural human perception. We collected and cleaned over one hundred high-definition Blu-ray side-by-side (SBS) stereo movies spanning animation, realistic, war, sci-fi, historical, and drama genres, ensuring visual diversity and richness.

All videos are unified into the SBS format by stretching and horizontal cropping to obtain the left and right views, each at 1080p, 16:9, 24 fps. To match the base model's input requirements (480p resolution, 81-frame inputs), we uniformly downscale the videos to 480p.
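Extracting the two views from a unified SBS frame is a simple horizontal halving. The helper below is a hypothetical sketch of this preprocessing step, not the paper's released code:

```python
import numpy as np

def split_sbs(frame):
    """Split a side-by-side (SBS) stereo frame of shape (H, W, C)
    into left and right views, each of shape (H, W // 2, C)."""
    h, w = frame.shape[:2]
    assert w % 2 == 0, "SBS frame width must be even"
    return frame[:, : w // 2], frame[:, w // 2 :]
```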

To enhance motion diversity and temporal density, we uniformly sample 81 frames per clip at fixed intervals.
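The fixed-interval sampling can be sketched as follows; `sample_indices` is a hypothetical helper, and the actual interval used in the paper is not specified (here it is derived from the clip length):

```python
def sample_indices(total_frames, k=81):
    """Pick k frame indices at a fixed interval spanning the clip.
    Assumes total_frames >= k; the stride is the largest integer
    interval that keeps all k samples inside the clip."""
    stride = max(total_frames // k, 1)
    return list(range(0, stride * k, stride))
```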

Framework

StereoWorld training and inference framework

Before training, we use Video Depth Anything and Stereo Any Video to obtain depth maps D_r and ground-truth disparity maps Disp_gt. The left-view videos are then concatenated with the right-view videos and the corresponding depth maps along the frame dimension in latent space to form the conditioning input. During training, a lightweight differentiable stereo projector estimates the disparity between the input left view and the generated right view; this estimate is supervised against Disp_gt via a disparity loss to enforce accurate geometric correspondence. In addition, the last few DiT blocks are duplicated into dual branches, allowing the model to learn the RGB and depth distributions separately and further supplement geometric information. During inference, only the shared and RGB DiT blocks are used, taking the monocular video as the sole input.
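The paper does not detail the projector's implementation. As a rough sketch of how a differentiable disparity estimate and its supervision could look, here is a soft-argmin over a simple photometric cost volume (NumPy, single grayscale frame; the actual module presumably operates on latents or features, and the cost and temperature choices here are assumptions):

```python
import numpy as np

def soft_argmin_disparity(left, right, max_disp=8, tau=0.01):
    """Soft-argmin disparity estimate from a photometric cost volume.
    left, right: (H, W) grayscale views; returns per-pixel disparity of
    the left view. Soft-argmin (a temperature-weighted expectation over
    candidate disparities) stays differentiable when ported to an
    autodiff framework, unlike a hard argmin."""
    h, w = left.shape
    cost = np.full((max_disp, h, w), 1e3)  # large cost where the shift is invalid
    cost[0] = np.abs(left - right)
    for d in range(1, max_disp):
        # left pixel x is compared against right pixel x - d
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, : w - d])
    weights = np.exp(-cost / tau)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * np.arange(max_disp)[:, None, None]).sum(axis=0)

def disparity_loss(pred, gt):
    """L1 supervision against the ground-truth disparity Disp_gt."""
    return float(np.abs(pred - gt).mean())
```

A usage check: shifting an image by a known disparity and verifying the estimate recovers it is a quick sanity test for any such projector.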

Practical and Scalable Optimization

Temporal Tiling Strategy

StereoWorld long video tiling

During training, the first few frames of noisy latents are replaced with ground-truth frames with a probability p. During inference, long videos are split into overlapping segments, with the last frames of the previous segment used to guide the next, ensuring temporal consistency.
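The inference-time segmentation can be sketched as below. The segment length matches the base model's 81-frame input; the overlap of 8 frames is an assumed value, since the paper only says "the last frames" seed the next segment:

```python
def temporal_segments(num_frames, seg_len=81, overlap=8):
    """Split a long video into overlapping segments for sequential
    generation: the last `overlap` frames of each segment are reused
    to guide the next one, so consecutive segments share exactly
    `overlap` frames. Requires 0 < overlap < seg_len."""
    segments, start = [], 0
    while True:
        end = min(start + seg_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            return segments
        start = end - overlap
```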

Spatial Tiling Strategy

StereoWorld high-resolution tiling

During inference, high-resolution videos are encoded into latents, which are split into overlapping tiles. Each tile is denoised independently, and then the tiles are stitched back to the original size with overlapping regions fused before decoding.
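A minimal sketch of overlapped tiling with weighted fusion is shown below. The helpers are hypothetical: the real pipeline denoises latents across diffusion steps, and the paper does not specify its fusion weights, so a center-peaked pyramid weight is assumed here:

```python
import numpy as np

def tile_starts(size, tile, stride):
    """Top-left offsets so overlapping tiles fully cover [0, size)."""
    starts = list(range(0, size - tile + 1, stride))
    if starts[-1] + tile < size:
        starts.append(size - tile)  # extra tile flush with the boundary
    return starts

def spatial_tiled_apply(x, fn, tile=16, overlap=4):
    """Apply `fn` to overlapping (tile, tile) patches of an (H, W, C)
    array and blend the overlapping regions with center-peaked
    pyramid weights, normalizing by the accumulated weight."""
    h, w, c = x.shape
    stride = tile - overlap
    out = np.zeros((h, w, c))
    wsum = np.zeros((h, w, 1))
    ramp = np.minimum(np.arange(1, tile + 1), np.arange(tile, 0, -1)).astype(float)
    wt = np.minimum.outer(ramp, ramp)[..., None]  # peak at the tile center
    for ys in tile_starts(h, tile, stride):
        for xs in tile_starts(w, tile, stride):
            out[ys:ys + tile, xs:xs + tile] += fn(x[ys:ys + tile, xs:xs + tile]) * wt
            wsum[ys:ys + tile, xs:xs + tile] += wt
    return out / wsum
```

Center-peaked weights downweight tile borders, where per-tile denoising is least consistent, which suppresses visible seams after stitching.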

Comparison (Generated Right-View 480p)

Method         PSNR ↑   SSIM ↑   LPIPS ↓  IQ-Score ↑  TF-Score ↑  EPE ↓    D1-all ↓
GenStereo      19.4486  0.6803   0.3008   0.4047      0.9642      35.0022  0.8954
SVG            18.0256  0.5881   0.3467   0.4714      0.9706      33.2508  0.9630
StereoCrafter  23.0372  0.6561   0.1869   0.4370      0.9685      24.7784  0.5271
Ours           25.9794  0.7964   0.0952   0.5019      0.9704      17.4527  0.4213
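The geometric metrics EPE and D1-all follow standard stereo-matching definitions; the sketch below uses the KITTI-style D1-all thresholds (3 px and 5%), which is an assumption since the paper's exact masking details are not given here:

```python
import numpy as np

def epe(pred, gt):
    """End-point error: mean absolute disparity error in pixels (lower is better)."""
    return float(np.abs(pred - gt).mean())

def d1_all(pred, gt):
    """Fraction of outlier pixels whose disparity error exceeds both
    3 px and 5% of the ground-truth magnitude (KITTI convention;
    lower is better)."""
    err = np.abs(pred - gt)
    return float(((err > 3.0) & (err > 0.05 * np.abs(gt))).mean())
```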
Qualitative comparison videos (GenStereo, SVG, StereoCrafter, Ours, GT) are shown for segments 00001_segment_0137, 00036_segment_0019, 00093_segment_0393, 00093_segment_0455, 00095_segment_0156, 00127_segment_0191, and 00134_segment_1314.

More Examples (Side-by-Side Format)


Citation

@misc{xing2025stereoworldgeometryawaremonoculartostereovideo,
  title={StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation},
  author={Ke Xing and Longfei Li and Yuyang Yin and Hanwen Liang and Guixun Luo and Chen Fang and Jue Wang and Konstantinos N. Plataniotis and Xiaojie Jin and Yao Zhao and Yunchao Wei},
  year={2025},
  eprint={2512.09363},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.09363},
}