Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models

Jason J. Yu¹,², Fereshteh Forghani¹, Konstantinos G. Derpanis¹,², Marcus A. Brubaker¹,²

¹York University, ²Vector Institute for AI

Abstract

Novel view synthesis from a single input image is a challenging task, where the goal is to generate a new view of a scene from a desired camera pose that may be separated by a large motion. The highly uncertain nature of this synthesis task due to unobserved elements within the scene (i.e., occlusion) and outside the field-of-view makes the use of generative models appealing to capture the variety of possible outputs. In this paper, we propose a novel generative model capable of producing a sequence of photorealistic images consistent with a specified camera trajectory and a single starting image. Our approach is centred on an autoregressive conditional diffusion-based model capable of interpolating visible scene elements and extrapolating unobserved regions in a view- and geometry-consistent manner. Conditioning is limited to an image capturing a single camera view and the (relative) pose of the new camera view. To measure consistency over a sequence of generated views, we introduce a new metric, the thresholded symmetric epipolar distance (TSED), which counts the number of consistent frame pairs in a sequence. While previous methods have been shown to produce high quality images and consistent semantics across pairs of views, we show empirically with our metric that they are often inconsistent with the desired camera poses. In contrast, we demonstrate that our method produces both photorealistic and view-consistent imagery.

RealEstate10K Qualitative Results: In-Distribution Trajectories

The following examples show generations conditioned on in-distribution trajectories, the standard protocol in previous work. Specifically, each generation is sampled using an initial image and trajectory from the same video in the test dataset. Select a scene from the menu below. You can also select different samples from the same scene and trajectory to see multiple plausible extrapolations for each scene. Then use the frame controls to view the generated views from all the models. We also include automated playback controls to cycle the frames forwards and backwards at various rates.

Scene Selection

Sample Instance

The stochastic nature of these models allows multiple plausible realities to be sampled. We provide three here.

Samples

Models compared: GeoGPT, LookOut, Ours

Frame Control (automated playback or manual slider)

Notice how generated scenes diverge more with frames further to the right of the slider.

RealEstate10K Qualitative Results: Out-of-Distribution Trajectories

The following examples show the ability of our model to generate novel views using a variety of trajectories not typically found in the training data. Select a scene and motion type for the camera trajectory. You can also select different samples from the same scene and trajectory to see multiple plausible extrapolations for each scene. Then use the frame controls to view the generated views from all the models. We also include automated playback controls to cycle the frames forwards and backwards at various rates.

Scene Selection

Motion Selection: Orbit, Spin, Hop

Sample Instance

The stochastic nature of these models allows multiple plausible realities to be sampled. We provide three here.

Samples

Models compared: GeoGPT, LookOut, Ours

Frame Control (automated playback or manual slider)

Notice how generated scenes diverge more with frames further to the right of the slider.

Matterport3D Qualitative Results: In-Distribution Trajectories

As with the RealEstate10K in-distribution results, the following examples use in-distribution trajectories, but with an initial image and trajectory taken from Matterport3D. Select a scene from the menu below. Then use the frame controls to view the generated views from all the models. We also include automated playback controls to cycle the frames forwards and backwards at various rates.

Scene Selection

Sample Instance

The stochastic nature of these models allows multiple plausible realities to be sampled. We provide three here.

Samples

Models compared: LookOut, Ours

Frame Control (automated playback or manual slider)

Notice how generated scenes diverge more with frames further to the right of the slider.

Matterport3D Qualitative Results: Out-of-Distribution Trajectories

As with the RealEstate10K out-of-distribution results, the following examples use out-of-distribution trajectories, but with an initial image from Matterport3D. They show the ability of our model to generate novel views using a variety of trajectories not typically found in the training data. Select a scene and motion type for the camera trajectory. Then use the frame controls to view the generated views from all the models. We also include automated playback controls to cycle the frames forwards and backwards at various rates.

Scene Selection

Motion Selection: Orbit, Spin, Hop

Sample Instance

The stochastic nature of these models allows multiple plausible realities to be sampled. We provide three here.

Samples

Models compared: LookOut, Ours

Frame Control (automated playback or manual slider)

Notice how generated scenes diverge more with frames further to the right of the slider.

Thresholded Symmetric Epipolar Distance (TSED)

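With the camera pose used to condition the view generation, we first compute the fundamental matrix $F$. Then, given a feature point $p$ on the first image, the fundamental matrix constrains the location of the corresponding point $p'$ to a position on the epipolar line, $Fp$, on the second image. The Symmetric Epipolar Distance (SED) given the points $p$, $p'$, and fundamental matrix $F$ is defined as:

$$\mathrm{SED}(p, p', F) = \frac{1}{2}\left[ d(p', Fp) + d(p, F^{\top}p') \right],$$

where $d(p', Fp)$ is the minimum Euclidean distance between point $p'$ and the epipolar line $Fp$. Given a set of feature correspondences $M = \{(p_i, p'_i)\}$, we define the pair of images to be consistent if there are a sufficient number of matching features, and the median SED over $M$ is less than a certain threshold:

$$\mathrm{TSED}(M, F \mid T_{\text{error}}, T_{\text{matches}}) = \left( |M| \ge T_{\text{matches}} \right) \wedge \left( \operatorname*{median}_{(p, p') \in M} \mathrm{SED}(p, p', F) < T_{\text{error}} \right).$$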

The use of epipolar geometry does not require knowledge of the scene geometry or scale. Using this metric, we can evaluate the consistency of the generated novel views by computing which fraction of neighbouring views are consistent.
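The metric described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function names, the pose convention (a point in camera 1 maps to camera 2 as `R @ x + t`), the averaging of the two point-to-line distances, and the threshold parameters `t_error` and `t_matches` are all choices made here for illustration. Feature matching is assumed to have been done already, so the inputs are arrays of matched pixel coordinates.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])


def fundamental_from_pose(K1, K2, R, t):
    """Fundamental matrix from intrinsics and relative pose.

    Assumed convention: x_cam2 = R @ x_cam1 + t.
    """
    E = skew(t) @ R  # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)


def point_line_distance(points, lines):
    """Row-wise distance from 2D points (N, 2) to homogeneous lines (N, 3)."""
    a, b, c = lines[:, 0], lines[:, 1], lines[:, 2]
    return np.abs(a * points[:, 0] + b * points[:, 1] + c) / np.sqrt(a**2 + b**2)


def sed(p1, p2, F):
    """Symmetric epipolar distance for each matched pair (p1[i], p2[i])."""
    ones = np.ones((len(p1), 1))
    h1 = np.hstack([p1, ones])
    h2 = np.hstack([p2, ones])
    lines_in_2 = h1 @ F.T  # row i is F @ p1[i]
    lines_in_1 = h2 @ F    # row i is F.T @ p2[i]
    # Assumption: the two distances are averaged.
    return 0.5 * (point_line_distance(p2, lines_in_2)
                  + point_line_distance(p1, lines_in_1))


def tsed(p1, p2, F, t_error, t_matches):
    """True if there are enough matches and the median SED is below t_error."""
    if len(p1) < t_matches:
        return False
    return bool(np.median(sed(p1, p2, F)) < t_error)


def fraction_consistent(pair_flags):
    """Fraction of neighbouring view pairs flagged as consistent."""
    return float(np.mean(pair_flags)) if len(pair_flags) else 0.0
```

For a generated sequence, one would call `tsed` on each pair of neighbouring frames with the fundamental matrix derived from the conditioning pose, then pass the resulting list of booleans to `fraction_consistent`.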

Visualization of SED

Select two corresponding points in each image below. The epipolar line from each point will be drawn in the images with the same colour as the point. When both points have been selected, the minimum length line will be drawn between each point and its epipolar line, and the computed SED will be displayed under the image. Notice that the SED is low when selecting the same points in the scene, while the SED is high when different points in the scene are selected for each image.
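The geometry behind this visualization can be sketched as follows. This is an illustrative reconstruction, not the code that drives the page: the helper names are hypothetical, and the SED is assumed to be the average of the two point-to-line distances. For a selected point in one image, the sketch computes its epipolar line in the other image, finds the closest point on that line to the other selected point (the endpoint of the minimum-length segment), and combines the two segment lengths.

```python
import numpy as np


def epipolar_line(F, p):
    """Homogeneous line (a, b, c) induced in the other image by pixel p = (x, y)."""
    return F @ np.array([p[0], p[1], 1.0])


def closest_point_on_line(p, line):
    """Foot of the perpendicular from pixel p onto the line a*x + b*y + c = 0."""
    a, b, c = line
    p = np.asarray(p, dtype=float)
    signed = (a * p[0] + b * p[1] + c) / (a * a + b * b)
    return p - signed * np.array([a, b])


def sed_for_selection(F, p1, p2):
    """SED for one selected pair, plus the segment endpoints to draw.

    p1 is the point selected in the first image, p2 in the second.
    Assumption: the SED averages the two point-to-line distances.
    """
    foot2 = closest_point_on_line(p2, epipolar_line(F, p1))    # in image 2
    foot1 = closest_point_on_line(p1, epipolar_line(F.T, p2))  # in image 1
    d2 = np.linalg.norm(np.asarray(p2, dtype=float) - foot2)
    d1 = np.linalg.norm(np.asarray(p1, dtype=float) - foot1)
    return 0.5 * (d1 + d2), foot1, foot2
```

When both selections correspond to the same scene point, each selected point lies close to the epipolar line of the other, so both segments are short and the SED is small.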

BibTeX

@inproceedings{Yu2023PhotoconsistentNVS,
  title={Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models},
  author={Jason J. Yu and Fereshteh Forghani and Konstantinos G. Derpanis and Marcus A. Brubaker},
  booktitle={{Proceedings of the International Conference on Computer Vision ({ICCV})}},
  year={2023},
}