Positional Encoding Field
PE-Field achieves high-quality novel view synthesis simply by operating on DiT's positional encodings.
Abstract
Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field–augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.
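To make the core idea concrete, below is a minimal sketch of what a depth-aware axial RoPE could look like, extending the usual (y, x) token positions to (y, x, z). The even three-way channel split, base frequency, and all function names here are illustrative assumptions, not the paper's exact hierarchical formulation.

```python
import torch

def axial_rope_3d(pos, dim, base=10000.0):
    """Cos/sin tables for a depth-aware axial RoPE (a sketch, not the
    paper's exact scheme). pos: (N, 3) tensor of (y, x, z) coordinates;
    dim: channel count, assumed divisible by 6 so each axis gets dim/3
    channels, rotated in pairs."""
    per_axis = dim // 3                      # channels devoted to each axis
    freqs = base ** (-torch.arange(0, per_axis, 2, dtype=torch.float32)
                     / per_axis)             # (per_axis/2,) frequencies
    # angle = coordinate * frequency, computed independently per axis
    ang = torch.cat([pos[:, i:i+1] * freqs for i in range(3)], dim=-1)
    ang = torch.repeat_interleave(ang, 2, dim=-1)   # pair up channels
    return ang.cos(), ang.sin()              # each of shape (N, dim)

# Example: image tokens carry reconstructed depth z; noise tokens use z = 0.
pos = torch.tensor([[0.0, 0.0, 0.0],    # a noise token on the 2D grid
                    [3.0, 7.0, 2.5]])   # an image token with depth 2.5
cos, sin = axial_rope_3d(pos, dim=96)
print(cos.shape)  # torch.Size([2, 96])
```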
Analysis of PE in DiT
Figure 1: Illustration of DiT patch-level independence.
Figure 2: Illustration of direct novel view synthesis (NVS) results.
When the positional encodings (PEs) of image tokens or noise tokens are perturbed, the decoded or generated outputs remain semantically meaningful images. The resulting structures follow the warping imposed by the PE modification, while boundaries between patches remain visually distinct.
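As an illustration of this kind of probe, the sketch below perturbs the patch positions of a standard 2D axial RoPE with a horizontal shear before computing the encodings. The specific warp, dimensions, and names are illustrative assumptions rather than the paper's exact setup.

```python
import torch

def axial_rope_2d(pos, dim, base=10000.0):
    """Cos/sin tables for a standard 2D axial RoPE: half the channels
    encode the y coordinate, half encode x."""
    half = dim // 2
    freqs = base ** (-torch.arange(0, half, 2, dtype=torch.float32) / half)
    ang = torch.cat([pos[:, 0:1] * freqs, pos[:, 1:2] * freqs], dim=-1)
    ang = torch.repeat_interleave(ang, 2, dim=-1)
    return ang.cos(), ang.sin()

def apply_rope(q, cos, sin):
    """Rotate channel pairs of q by the positional angles."""
    q1, q2 = q[..., 0::2], q[..., 1::2]
    rot = torch.stack([-q2, q1], dim=-1).flatten(-2)
    return q * cos + rot * sin

H = W = 16                                   # a 16 x 16 patch grid
grid = torch.stack(torch.meshgrid(torch.arange(H), torch.arange(W),
                                  indexing="ij"), -1).reshape(-1, 2).float()

# Perturb the positions (a horizontal shear) before computing the PEs.
# Feeding these PEs to the DiT warps the generated layout accordingly,
# while each patch's content stays locally intact.
shear = torch.tensor([[1.0, 0.3], [0.0, 1.0]])
cos, sin = axial_rope_2d(grid @ shear.T, dim=64)
q = apply_rope(torch.randn(H * W, 64), cos, sin)
```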
We apply 2D positional encodings (PEs) derived from 3D reconstruction and view transformation directly to the source-view image tokens. Using these modified tokens as the image condition in the DiT enables direct generation of a reasonably accurate novel-view image.
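The sketch below illustrates how such warped token positions might be computed: lift each patch center to 3D using monocular depth, apply the relative camera transform, and re-project into the target view. The intrinsics K, transform T_rel, and all names are assumptions for illustration; the paper's reconstruction pipeline may differ.

```python
import torch

def warp_token_positions(yx, depth, K, T_rel):
    """Project source-view patch centers into the target view.
    yx: (N, 2) source (y, x) positions; depth: (N,) monocular depth;
    K: (3, 3) camera intrinsics; T_rel: (4, 4) source-to-target transform.
    Returns target (y, x) positions and per-token target-view depth."""
    ones = torch.ones(yx.shape[0], 1)
    pix = torch.cat([yx[:, 1:2], yx[:, 0:1], ones], dim=-1)   # (x, y, 1)
    rays = (torch.linalg.inv(K) @ pix.T).T                    # camera rays
    pts = torch.cat([rays * depth[:, None], ones], dim=-1)    # 3D, homogeneous
    pts_t = (T_rel @ pts.T).T[:, :3]                          # target camera frame
    proj = (K @ pts_t.T).T
    xy = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)           # perspective divide
    return torch.stack([xy[:, 1], xy[:, 0]], dim=-1), pts_t[:, 2]
```

The returned (y, x) positions replace the regular grid when encoding the source tokens, and the target-view depths can feed a depth-aware encoding.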
Method
The transformer takes both noise tokens and source-view image tokens. Noise tokens are placed on a 2D grid with depth set to zero, while image tokens are assigned hierarchical PEs according to their projected positions from monocular reconstruction and view transformation, with depth values taken from the reconstruction. Tokens projected outside the grid (e.g., index 6) are discarded, and empty grid locations without image tokens (e.g., index 0) are filled with noise tokens, which are refined during denoising into plausible content. A sketch of this placement logic follows below.
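Below is a minimal sketch of the placement step, assuming collisions between warped tokens are resolved by simple overwrite (a real system might instead keep the nearest-depth token); all names are illustrative.

```python
import torch

def place_warped_tokens(tokens, yx, H, W):
    """Scatter warped image tokens onto the target H x W patch grid.
    tokens: (N, D); yx: (N, 2) warped positions. Out-of-grid tokens are
    discarded; unoccupied cells are left for noise tokens to fill and
    refine into plausible content."""
    idx = yx.round().long()
    inside = ((idx[:, 0] >= 0) & (idx[:, 0] < H) &
              (idx[:, 1] >= 0) & (idx[:, 1] < W))
    tokens, idx = tokens[inside], idx[inside]      # drop out-of-grid tokens
    flat = idx[:, 0] * W + idx[:, 1]
    grid = torch.zeros(H * W, tokens.shape[-1])
    occupied = torch.zeros(H * W, dtype=torch.bool)
    grid[flat] = tokens                            # collisions: overwrite
    occupied[flat] = True
    return grid, occupied                          # ~occupied -> noise tokens
```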
Comparisons

Applications

The left example shows 3D object editing, while the right example shows object removal, highlighting the versatility of our model across spatial editing tasks.
Paper
"Positional Encoding Field",
Yunpeng Bai, Haoxiang Li, Qixing Huang.
arXiv 2025.
[PDF]
BibTeX
@article{bai2025positional,
  title={Positional Encoding Field},
  author={Bai, Yunpeng and Li, Haoxiang and Huang, Qixing},
  journal={arXiv preprint arXiv:2510.20385},
  year={2025}
}
Credit
This website template was borrowed from
Imagic.