Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis

{forghani,jjyu,kosta,marcus.brubaker}@yorku.ca, tristan.a@samsung.com
1York University, 2Vector Institute for AI, 3Samsung AI Centre Toronto, 4Google DeepMind

Abstract

Conventional depth-free multi-view datasets are captured using a moving monocular camera without metric calibration. In this monocular setting, the scale of the camera positions is ambiguous. Previous methods have acknowledged scale ambiguity in multi-view data via various ad-hoc normalization pre-processing steps, but have not directly analyzed the effect of incorrect scene scales on their application. In this paper, we seek to understand and address the effect of scale ambiguity when such data are used to train generative novel view synthesis methods (GNVS). In GNVS, new views of a scene or object can be synthesized given, minimally, a single image; the task is therefore unconstrained and necessitates the use of generative methods. The generative nature of these models captures all aspects of uncertainty, including any uncertainty of scene scales, which act as nuisance variables for the task. We study the effect of scene scale ambiguity in GNVS when views are sampled from a single image, isolating its effect on the resulting models, and, based on these intuitions, define new metrics that measure the scale inconsistency of generated views. We then propose a framework to estimate scene scales jointly with the GNVS model in an end-to-end fashion. Empirically, we show that our method reduces the scale inconsistency of generated views without the complexity or downsides of previous scale normalization methods. Further, we show that removing this ambiguity improves the generated image quality of the resulting GNVS model.
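To make the ambiguity concrete, the minimal sketch below (not taken from the paper; the intrinsics, pose, and point are illustrative assumptions) shows that rescaling all camera translations and scene points by the same positive factor leaves image projections unchanged, so the global scale cannot be recovered from images alone.

import numpy as np

# Pinhole camera observing a single 3D point (all values are assumptions).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])          # camera intrinsics
R = np.eye(3)                            # camera rotation
t = np.array([0.2, 0.0, 1.0])            # camera translation
X = np.array([0.5, -0.3, 4.0])           # 3D scene point

def project(K, R, t, X):
    # Standard pinhole projection: x ~ K (R X + t).
    x = K @ (R @ X + t)
    return x[:2] / x[2]

s = 3.7                                  # arbitrary global scale factor
print(project(K, R, t, X))               # original projection
print(project(K, R, s * t, s * X))       # identical projection after rescaling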




RealEstate10K Qualitative Samples

This viewer shows individual novel-view samples generated from the same conditioning information, so that the variability between them can be inspected. Select a scene from the menu below, then use the frame controls to automatically loop through the samples and get a sense of the differences between them. Use the toggle button below the image to highlight regions of interest.

Notice how the samples from models that use scale learning (two right-most columns) "jitter" less.



Sample Flow Consistency (SFC)

SFC measures scale variability via the motion variation among generated images that share the same conditioning image and camera pose. We measure motion with optical flow and use the median absolute deviation (MAD) of the flows as a proxy for scale uncertainty. The lower the SFC, the more consistent the scales of the samples.
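The exact flow estimator and aggregation used by the paper are not spelled out here, so the following is a minimal sketch under assumptions: flows are estimated (e.g., with an off-the-shelf method such as RAFT) from the conditioning image to each of N samples, the MAD is taken per pixel across samples, and the resulting map is averaged into a single score.

import numpy as np

def sample_flow_consistency(flows):
    # flows: array of shape (N, H, W, 2), one optical-flow field per generated
    # sample, all computed against the same conditioning image (assumption).
    # Per-pixel median flow across the N samples.
    median_flow = np.median(flows, axis=0, keepdims=True)       # (1, H, W, 2)
    # Median absolute deviation (MAD) of the flows across samples.
    mad = np.median(np.abs(flows - median_flow), axis=0)        # (H, W, 2)
    # Collapse the two flow channels into a per-pixel deviation magnitude.
    mad_map = np.linalg.norm(mad, axis=-1)                      # (H, W)
    # Aggregate to one score: lower means more scale-consistent samples.
    return mad_map.mean(), mad_map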

In this viewer, we visualize the components of our proposed metric, SFC. As in the previous viewer, you can select a scene from the menu below and view novel-view samples with the same conditioning information, the computed optical flows, and the MAD map derived from those flows. The controls are the same as in the viewer above.


Notice how the MAD maps of the samples from models that use scale learning (two right-most columns) are darker and their flow maps "flicker" less.



Scale-Sensitive Thresholded Symmetric Epipolar Distance (SS-TSED)

1. Start from a camera pose as the conditioning view.
2. Translate it along one of the axes (e.g., the x-axis) and generate the corresponding frame.
3. Then translate it along another axis (e.g., the y-axis) and generate the corresponding frame.
4. The 3D position of a point observed in the conditioning view always lies on a ray originating from the conditioning view.
5. However, a different scene scale in each generated view places the point at a different depth along that ray.
6. These differing scales cause the 2D position of the point observed from one generated camera to lie some distance off the epipolar line formed by the generated views and the point observed by the other camera.
7. In contrast, consistent scales in both generated views cause the 2D position of the point observed from one camera to lie on the epipolar line formed by the generated views and the point observed by the other camera (the sketch below shows the underlying epipolar-distance computation).
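The sketch below illustrates the core computation the steps above describe: forming the fundamental matrix between the two generated views and thresholding a symmetric epipolar distance over correspondences. The matching procedure, threshold value, and the exact scale-sensitive aggregation used for SS-TSED are assumptions here, and x1, x2 denote hypothetical pixel correspondences between the two generated frames.

import numpy as np

def fundamental_from_pose(K, R, t):
    # Fundamental matrix from shared intrinsics K and relative pose (R, t).
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])   # skew-symmetric [t]_x
    E = tx @ R                            # essential matrix
    K_inv = np.linalg.inv(K)
    return K_inv.T @ E @ K_inv

def symmetric_epipolar_distance(F, x1, x2):
    # x1, x2: (N, 2) matched pixel coordinates in the two generated views.
    ones = np.ones((x1.shape[0], 1))
    x1h = np.hstack([x1, ones])           # homogeneous coordinates
    x2h = np.hstack([x2, ones])
    l2 = (F @ x1h.T).T                    # epipolar lines in view 2
    l1 = (F.T @ x2h.T).T                  # epipolar lines in view 1
    num = np.abs(np.sum(x2h * l2, axis=1))            # |x2^T F x1|
    d2 = num / np.linalg.norm(l2[:, :2], axis=1)      # point-line distance in view 2
    d1 = num / np.linalg.norm(l1[:, :2], axis=1)      # point-line distance in view 1
    return 0.5 * (d1 + d2)                # one common symmetric form

def thresholded_score(F, x1, x2, tau=2.0):
    # Fraction of matches whose symmetric epipolar distance is below the
    # threshold tau (in pixels); tau and the aggregation are assumptions.
    return float(np.mean(symmetric_epipolar_distance(F, x1, x2) < tau))

With consistent scales in the two generated views, the implied relative pose matches the epipolar geometry of the correspondences and the distances stay below the threshold; inconsistent scales shift points off their epipolar lines and lower the score.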

BibTeX

@article{forghani2025learnyourscales,
      title={Learn Your Scales: Towards Scale-Consistent Generative Novel View Synthesis},
      author={Forghani, Fereshteh and Yu, Jason J and Aumentado-Armstrong, Tristan and Derpanis, Konstantinos G and Brubaker, Marcus A},
      journal={arXiv preprint arXiv:2503.15412},
      year={2025}
    }