4DiM: Controlling Space and Time with Diffusion Models

Daniel Watson*, Saurabh Saxena*, Lala Li*, Andrea Tagliasacchi, David Fleet

Google DeepMind
*equal contribution




We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS) of scenes, conditioned on one or more images together with camera poses and timestamps for both the known and unknown views. To overcome the limited availability of 4D training data, we advocate joint training on 3D (poses only), 4D (pose + time), and video (time only) data, and propose a new architecture that makes this possible. We further propose calibrating SfM-posed data with monocular metric depth estimators to enable metric-scale camera control. We introduce new metrics that address shortcomings of current evaluation schemes, and achieve state-of-the-art results in both fidelity and pose control compared to existing diffusion models for 3D NVS, with the additional capability of handling temporal dynamics. We also showcase zero-shot applications, including improved panorama stitching and rendering space-time trajectories from novel viewpoints.
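To make the metric-scale calibration idea concrete, the sketch below aligns SfM depths (which are only defined up to an arbitrary scale) with predictions from a monocular metric depth estimator, then rescales the camera translations by the same factor. This is a minimal illustration under our own assumptions: the function names and the median-ratio estimator are not necessarily the exact procedure used in the paper.

```python
import numpy as np

def estimate_metric_scale(sfm_depth, metric_depth, valid_mask):
    """Estimate the scalar mapping SfM depth (arbitrary units) to metric depth.

    sfm_depth:    depths of SfM points projected into the image (arbitrary scale)
    metric_depth: per-pixel output of a monocular metric depth estimator (meters)
    valid_mask:   boolean mask of pixels where both depths are defined

    Uses the median of per-pixel ratios for robustness to outliers.
    """
    ratios = metric_depth[valid_mask] / np.maximum(sfm_depth[valid_mask], 1e-6)
    return float(np.median(ratios))

def rescale_poses(world_from_cam, scale):
    """Apply the estimated scale to the translation part of 4x4 camera-to-world
    poses, so that camera control operates in metric units."""
    scaled = world_from_cam.copy()
    scaled[:, :3, 3] *= scale
    return scaled
```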




4DiM takes one or more images as input and generates images at the specified timestamps and camera poses.


Given a single image, 4DiM generates images along different 3D camera trajectories. The input images below are from our evaluation set, other public datasets, the internet, or a text-to-image model.





To achieve the best results with arbitrary inputs, we co-train with large amounts of unposed video, which also extends 4DiM's capabilities to controlling time. Here we show videos generated from two input images.
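One simple way to co-train on posed 3D data, unposed video, and full 4D data is to zero-fill whichever conditioning signal is missing and pass a per-modality mask so the model learns when to ignore it. The sketch below shows this interface; the names and exact mechanism are our assumptions and may differ from 4DiM's architecture.

```python
import numpy as np

def build_conditioning(poses=None, timestamps=None):
    """Pack pose/time conditioning so three training regimes share one interface:

      3D data:    poses only      -> time is zero-filled, time_mask = 0
      video data: timestamps only -> pose is identity-filled, pose_mask = 0
      4D data:    both available  -> both masks = 1
    """
    n = len(poses) if poses is not None else len(timestamps)
    identity = np.tile(np.eye(4), (n, 1, 1))  # placeholder for missing poses
    return {
        "pose": identity if poses is None else np.asarray(poses),
        "pose_mask": np.zeros(n) if poses is None else np.ones(n),
        "time": np.zeros(n) if timestamps is None else np.asarray(timestamps),
        "time_mask": np.zeros(n) if timestamps is None else np.ones(n),
    }

# Unposed video and posed stills both fit the same conditioning interface.
video_cond = build_conditioning(timestamps=[0.0, 0.5, 1.0])
still_cond = build_conditioning(poses=np.tile(np.eye(4), (3, 1, 1)))
```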






More applications of 4DiM


4DiM can move time both forward and backward.





Video interpolation: 4DiM can create videos from discontinuous input images. Here we input the start and end images, and 4DiM generates the intermediate frames.






Video-to-video translation: below, we input 8-frame videos to 4DiM to re-create them with different camera trajectories.






4DiM can create 360° panoramas without the exposure artifacts found in traditional stitching methods. Here we compare 4DiM against homography-based stitching with gamma adjustment, given six input images.





4DiM can generate a full 360° camera rotation from a single image, an extremely challenging task. The model still suffers some loss of detail in this case.




4D generation from a single image is also possible, and it is the most challenging case we consider. Here we show driving examples that move increasingly far from the input image in both space and time:







Thank you!

Please see our paper for more details. We include a few more samples below.