TiP4GEN: Text to Immersive Panorama 4D Scene Generation

¹Institute of Information Science, Beijing Jiaotong University
²University of Toronto
³The University of Texas at Austin
⁴Visual Intelligence + X International Joint Laboratory
Proceedings of the 33rd ACM International Conference on Multimedia (MM '25)
*Both authors contributed equally to this research.
†Corresponding author.

Abstract

With the rapid advancement and widespread adoption of VR/AR technologies, there is a growing demand for high-quality, immersive dynamic scenes. However, existing generation methods focus mainly on static scenes or narrow perspective-view dynamic scenes, and fall short of delivering a fully immersive 360-degree experience from any viewpoint. In this paper, we introduce TiP4GEN, a text-to-dynamic panorama scene generation framework that enables fine-grained content control and synthesizes motion-rich, geometry-consistent panoramic 4D scenes. TiP4GEN integrates panorama video generation and dynamic scene reconstruction to create 360-degree immersive virtual environments. For video generation, we introduce a Dual-branch Generation Model consisting of a panorama branch and a perspective branch, responsible for global and local view generation, respectively. A bidirectional cross-attention mechanism exchanges information between the two branches. For scene reconstruction, we propose a Geometry-aligned Reconstruction Model based on 3D Gaussian Splatting. By aligning spatial-temporal point clouds using metric depth maps and initializing scene cameras with estimated poses, our method ensures geometric consistency and temporal coherence in the reconstructed scenes. Extensive experiments demonstrate the effectiveness of the proposed designs and the superiority of TiP4GEN in generating visually compelling and motion-coherent dynamic panoramic scenes.
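The abstract mentions aligning point clouds with metric depth maps but does not spell out the procedure. As a rough illustration only, one common way to bring a per-frame depth estimate onto a metric scale is a least-squares scale-and-shift fit. The function names `fit_scale_shift` and `align_depth` and the sample values below are hypothetical and are not taken from the TiP4GEN code.

```python
def fit_scale_shift(relative, metric):
    """Return (scale, shift) minimizing sum((scale * r + shift - m) ** 2).

    Assumption: a global affine fit per frame; the paper may align differently.
    """
    n = len(relative)
    if n != len(metric) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 values")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0.0:
        raise ValueError("relative depth is constant; scale is undetermined")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def align_depth(relative, scale, shift):
    """Map relative depth values onto the metric scale."""
    return [scale * r + shift for r in relative]


# Hypothetical flattened depth samples for one frame (metric = 2 * relative + 1).
relative_depth = [0.1, 0.2, 0.4, 0.8]
metric_depth = [1.2, 1.4, 1.8, 2.6]

scale, shift = fit_scale_shift(relative_depth, metric_depth)
aligned = align_depth(relative_depth, scale, shift)
```

Applying the same fit to every frame would put the per-frame point clouds in one shared metric scale, which is the precondition for comparing them across time.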

Method

Overview of the TiP4GEN framework. In the dual-branch panorama video generation model (upper), a panorama branch provides global guidance, ensuring the overall consistency of the generated content. To leverage the strong priors that diffusion models hold for perspective content generation, a perspective branch is added to enhance the diversity and authenticity of the generated content. The bidirectional cross-attention module exchanges information between the two branches. In the reconstruction model (lower), we use MonST3R to estimate camera parameters and depth maps, which initialize and align the scene geometry, followed by 4D scene optimization.
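The information exchange between the two branches can be pictured with a minimal sketch of bidirectional cross-attention. This is an illustration under simplifying assumptions, not the authors' implementation: it uses a single head, no learned query/key/value projections, and a plain residual update, and all function and variable names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token gathers a softmax-weighted average of context tokens."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score between this query and every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)])
    return out


def bidirectional_cross_attention(pano_tokens, persp_tokens):
    """Update both branches from the same pre-update features (residual add)."""
    pano_msg = cross_attend(pano_tokens, persp_tokens)   # panorama reads perspective
    persp_msg = cross_attend(persp_tokens, pano_tokens)  # perspective reads panorama
    new_pano = [[x + m for x, m in zip(t, msg)] for t, msg in zip(pano_tokens, pano_msg)]
    new_persp = [[x + m for x, m in zip(t, msg)] for t, msg in zip(persp_tokens, persp_msg)]
    return new_pano, new_persp


# Hypothetical toy features: 3 panorama tokens and 2 perspective tokens, 2 channels each.
pano = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
persp = [[0.5, 0.5], [1.0, -1.0]]
new_pano, new_persp = bidirectional_cross_attention(pano, persp)
```

Both messages are computed from the pre-update features, so neither branch sees the other's already-updated tokens within a single exchange. Each branch keeps its own token count and only its feature values change.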

BibTeX

@misc{xing2025tip4gentextimmersivepanorama,
  title={TiP4GEN: Text to Immersive Panorama 4D Scene Generation},
  author={Ke Xing and Hanwen Liang and Dejia Xu and Yuyang Yin and Konstantinos N. Plataniotis and Yao Zhao and Yunchao Wei},
  year={2025},
  eprint={2508.12415},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.12415},
}
