PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis

1York University, 2Vector Institute for AI, 3Samsung AI Centre Toronto
{jjyu,forghani,kosta,marcus.brubaker}@yorku.ca, taumen@cs.toronto.edu

Abstract

This paper considers the problem of generative novel view synthesis (GNVS), generating novel, plausible views of a scene given a limited number of known views. Here, we propose a set-based generative model that can simultaneously generate multiple, self-consistent new views, conditioned on any number of views. Our approach is not limited to generating a single image at a time and can condition on a variable number of views. As a result, when generating a large number of views, our method is not restricted to a low-order autoregressive generation approach and is better able to maintain generated image quality over large sets of images. We evaluate our model on standard NVS datasets and show that it outperforms the state-of-the-art image-based GNVS baselines. Further, we show that the model is capable of generating sets of views that have no natural sequential ordering, like loops and binocular trajectories, and significantly outperforms other methods on such tasks.



RealEstate10K Qualitative Results: Ground-truth Trajectories

The following examples show generations that follow ground-truth trajectories: camera trajectories from sequences in the RealEstate10K test set. The initial image (first frame) is taken from the same sequence as the trajectory.

Select a scene from the menu below. Different plausible generations sampled for each scene can be selected under "Sample Instance". Use the frame controls to inspect each frame individually, or use the automated playback options, which cycle the sequence forward and backward.

Scene Selection



Sample Instance

These generative models allow multiple plausible realities to be sampled. We provide three here.

1
2
3

Samples

PhotoCon
Ours-Markov
Ours-1step
Ours-Keyframed

Frame Control (automated playback or manual slider)

Notice that the later frames of the keyframed generations are higher quality.

Stop animation
Play at 5fps
Play at 10fps


Matterport3D Qualitative Results: Ground-truth Trajectories

The following examples show generations that follow ground-truth trajectories: camera trajectories from sequences in the Matterport3D test set. The initial image (first frame) is taken from the same sequence as the trajectory.

Select a scene from the menu below. Different plausible generations sampled for each scene can be selected under "Sample Instance". Use the frame controls to inspect each frame individually, or use the automated playback options, which cycle the sequence forward and backward.

Scene Selection



Sample Instance

These generative models allow multiple plausible realities to be sampled. We provide three here.

1
2
3

Samples

PhotoCon
Ours-Markov
Ours-1step
Ours-Keyframed

Frame Control (automated playback or manual slider)

Notice that the later frames of the keyframed generations are higher quality.

Stop animation
Play at 5fps
Play at 10fps


DFM Qualitative Results: Ground-truth Trajectories

The following examples show generations that also follow ground-truth trajectories: camera trajectories from sequences in the RealEstate10K test set. The initial image (first frame) is taken from the same sequence as the trajectory.

These results specifically compare our method with DFM-1 (DFM using 1 target view) and DFM-2 (DFM using 2 target views). DFM is an image-to-NeRF method, in contrast to our method, which is image-based. All results are shown at the native resolution of the DFM outputs, 128x128.

Select a scene from the menu below. Different plausible generations sampled for each scene can be selected under "Sample Instance". Use the frame controls to inspect each frame individually, or use the automated playback options, which cycle the sequence forward and backward.

Scene Selection



Sample Instance

These generative models allow multiple plausible realities to be sampled. We provide three here.

1
2
3

Samples

DFM-1
DFM-2
Ours-Keyframed

Frame Control (automated playback or manual slider)

Notice that the later frames of the keyframed generations are higher quality.

Stop animation
Play at 5fps
Play at 10fps


RealEstate10K Qualitative Results: Cyclical Trajectory - Spin

The following examples show our model's ability to generate cycle-consistent views. This custom trajectory was introduced in previous work and defines a camera path that returns to its original position. The camera poses in this set of views have no single obvious ordering. Achieving cycle consistency, where scene content is consistent between the first and last frames, is challenging with standard autoregressive sampling methods, since the limited conditioning window causes the model to "forget" scene content that moves out of view.

Our keyframed method achieves cycle consistency by constraining the generation with keyframes, which jointly cover a large region of the scene. Generating the keyframes simultaneously avoids the inconsistencies caused by the standard ordering of views along a trajectory.
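To make the schedule concrete, here is a minimal, hypothetical sketch of keyframed sampling for a cyclic trajectory. The function `generate_set` is an illustrative stub standing in for one call to a set-based generative model (jointly sampling all target views conditioned on any number of known views); all names and the stride/window values are assumptions, not the authors' implementation.

```python
def generate_set(cond_views, target_poses):
    """Stub for the set-based model: jointly samples one image per target pose.
    Here it just returns labels so the sampling schedule can be inspected."""
    return [f"view@{p}" for p in target_poses]

def keyframed_sampling(initial_view, poses, keyframe_stride=4, window=2):
    """Generate views for all `poses`: sparse keyframes jointly first, then the rest."""
    views = {0: initial_view}  # pose index -> image

    # 1) Jointly generate a sparse set of keyframes covering the whole trajectory.
    key_idx = list(range(keyframe_stride, len(poses), keyframe_stride))
    keyframes = generate_set([views[0]], [poses[i] for i in key_idx])
    views.update(dict(zip(key_idx, keyframes)))

    # 2) Fill in the remaining views, conditioning each on its nearest generated
    #    views. Circular distance means keyframes on *both* sides of a gap are
    #    available, which is what enforces consistency around the loop.
    for i in [j for j in range(len(poses)) if j not in views]:
        nearest = sorted(views, key=lambda j: min(abs(i - j), len(poses) - abs(i - j)))[:window]
        views[i] = generate_set([views[j] for j in nearest], [poses[i]])[0]
    return [views[i] for i in range(len(poses))]
```

The key property is in step 2: unlike low-order autoregressive sampling, every in-between view can condition on keyframes that bracket it, so the first and last frames of a loop share conditioning.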

Scene Selection



Sample Instance

These generative models allow multiple plausible realities to be sampled. We provide three here.

1
2
3

Samples

Ours-Markov
Ours-1step
Ours-Keyframed

Frame Control (automated playback or manual slider)

The automated playback here loops the sequence, rather than playing it back and forth, to emphasise cycle inconsistency. Notice that the first and last frames of the non-keyframed methods have inconsistent scene content.

Stop animation
Play at 5fps
Play at 10fps
Loop First and Last


RealEstate10K Qualitative Results: Stereo Grouped View Generation

In the following example we seek to generate stereo views along a trajectory. For such a set of views, there is no single obvious ordering among the views. As a naive baseline, we generate the stereo pairs by first generating the right view, then the left, before repeating for the next pair in the trajectory. This results in a "zigzag" pattern in the view order for naive sampling.

Using our set-based model, we are able to generate groups of views with no ordering within them. For this set of views, the stereo pairs are generated simultaneously, conditioned on the previous stereo pair in the trajectory.
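The two orderings can be sketched as follows. This is an illustrative comparison only: `generate_set` is a stub for the set-based model, and the pose tuples and function names are assumptions for exposition, not the actual interface.

```python
def generate_set(cond_views, target_poses):
    """Stub for the set-based model: returns one image label per target pose."""
    return [f"img{p}" for p in target_poses]

def naive_zigzag(initial, n_steps):
    """One view at a time: right then left at each step (the "zigzag" ordering)."""
    order, prev = [], initial
    for t in range(n_steps):
        for side in ("R", "L"):
            prev = generate_set([prev], [(t, side)])[0]
            order.append((t, side))
    return order

def grouped_pairs(initial, n_steps):
    """Each stereo pair is generated jointly, conditioned on the previous pair."""
    order, prev_pair = [], [initial]
    for t in range(n_steps):
        prev_pair = generate_set(prev_pair, [(t, "L"), (t, "R")])
        order.append((t, "L+R together"))
    return order
```

In the zigzag schedule each view sees only the single previous view, alternating sides; in the grouped schedule the left and right views of a pair are sampled simultaneously, which is what keeps the scene scale consistent across the pair.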


Stereo Generations

Ours-Markov-LEFT
Ours-Markov-RIGHT
Ours-Keyframed-LEFT
Ours-Keyframed-RIGHT
Flip stereo pairs

Frame Control (automated playback or manual slider)

Click the button above to flip the left/right stereo pairs; this enables cross-eyed stereoscopic viewing for viewers familiar with the technique. Notice that the naive generation exhibits large lateral motions within each side of the stereo pairs: the naive sampling makes it difficult for the model to maintain a constant scene scale across stereo pairs.

Stop animation
Play at 5fps
Play at 10fps


RealEstate10K Qualitative Results: Alternative Sampling

Here we explore an alternative sampling strategy for generating arbitrary sets of views. An initial image is provided, which lies at the center of a grid of views (outlined in red). This set of views orbits the scene at a fixed radius. Generated views can be inspected by hovering your cursor over the grey squares at the bottom of the Viewer.

We also provide a visualizer to inspect the sampling order selected by our heuristic; see the supplemental document for more details.

Viewer


The outlined view is the initial image. Hover your cursor over the squares to see different views. Moving horizontally along the squares changes the camera's azimuth, while moving vertically changes its elevation.


Sampling order visualization

Move the slider to inspect how views are sampled when generating a set of views.

Notice that we first generate a sparse set of keyframes before generating the remaining views.
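A greedy heuristic of this keyframes-first flavor could be sketched as below. This is a hypothetical reconstruction, not the heuristic from the supplemental document: the stride, the Manhattan-distance rule, and all names are illustrative assumptions.

```python
def sampling_order(n_az, n_el, given=(0, 0), stride=3):
    """Return a generation order over an azimuth x elevation grid of views:
    a sparse set of keyframes first, then the remaining cells, closest to an
    already-known view first so conditioning views are always nearby."""
    cells = [(a, e) for a in range(n_az) for e in range(n_el)]

    # 1) Keyframes: a sparse sub-grid (every `stride`-th cell), excluding the given view.
    keyframes = [c for c in cells
                 if c != given and c[0] % stride == 0 and c[1] % stride == 0]

    # 2) Remaining views, ordered by Manhattan distance to the nearest view
    #    that is given or a keyframe (Python's sort is stable, so ties keep
    #    grid order).
    known = [given] + keyframes
    rest = [c for c in cells if c != given and c not in keyframes]
    rest.sort(key=lambda c: min(abs(c[0] - k[0]) + abs(c[1] - k[1]) for k in known))
    return keyframes + rest
```

Generating the sparse sub-grid jointly pins down the scene over the whole orbit before any fill-in view is sampled, mirroring the keyframed trajectory results above.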


Legend

Given View
Ungenerated View
Conditioning View
View being generated
Already generated view

View sampling step

BibTeX

@inproceedings{Yu2024PolyOculusNVS,
  title={PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis},
  author={Jason J. Yu and Tristan Aumentado-Armstrong and Fereshteh Forghani and Konstantinos G. Derpanis and Marcus A. Brubaker},
  booktitle={{Proceedings of the European Conference on Computer Vision ({ECCV})}},
  year={2024},
}