VGGT: Visual Geometry Grounded Transformer

Venue
CVPR
Year
2025
Authors
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny
Topic
CV

🌟 Highlights

  • Overall, it's a solid paper with novel contributions.

  • The authors introduce a transformer neural network that can directly perform 3D reconstruction given a set of images.

📝 Summary

VGGT introduces a large-scale feed-forward transformer neural network for 3D scene reconstruction. Given a set of one or more images, VGGT predicts key attributes of the scene, including camera parameters, multi-view depth maps, and a dense point cloud. VGGT runs as a single forward pass and still achieves better performance than state-of-the-art baselines that explicitly require post-processing. The authors also show that VGGT can benefit downstream tasks such as dynamic point tracking.
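
As a rough illustration of the single-forward-pass interface described above, here is a minimal usage sketch. The module paths, the `VGGT` class, the `facebook/VGGT-1B` checkpoint name, and the `load_and_preprocess_images` helper are assumptions based on the linked GitHub repository and should be checked against its README; this is a sketch, not the authors' exact usage.

```python
import torch
# Assumed module paths from the linked GitHub repository; verify against its README.
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

# Checkpoint name is an assumption; the repo distributes pretrained weights via from_pretrained.
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# Any number of views of the same scene, from a single image upward.
images = load_and_preprocess_images(["scene/view1.png", "scene/view2.png"]).to(device)

with torch.no_grad():
    # One forward pass returns the predicted attributes (camera parameters,
    # depth maps, point maps, ...) without any iterative post-processing.
    predictions = model(images)

print(predictions.keys())
```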

🧩 Key Contributions

  • The authors introduce VGGT, a feed-forward transformer neural network that can perform 3D reconstruction given one or many images of a scene.

  • Instead of relying on a pipeline that uses post-processing to optimize 3D geometry, VGGT performs 3D reconstruction in a single forward pass.

  • Although the paper claims that post-processing is not needed, post-processing (i.e., bundle adjustment) is still shown to improve VGGT's output.

✅ Strengths

  • Very clean and well-structured paper; easy to jump across sections for reference, too.

  • The algorithmic/equation aspects were introduced clearly; there weren't any equations that required enormous mental effort to understand.

  • I like the architecture figure (Figure 2) as well as the formatting of the results tables.

⚠️ Weaknesses / Questions

  • The issue of tightly coupling the model and post-processing is demonstrated almost exclusively through discussions of VGGSfM, a prior work from the same research group. The other baselines are mentioned only in passing in the related-work section. A slightly broader survey in the first paragraph would strengthen the introduction.

  • The alternating-attention component, in which the transformer alternates between frame-wise and global attention, is presented without much substance (see the sketch after this list). Mixing local and global attention in this way is established in video transformers (e.g., Swin Transformer). Citing these works while establishing the novel way this paper uses alternating attention in a multi-view context would have landed better.

  • More of a personal note: it's not clear whether one explicitly needs a state-of-the-art H100 GPU to run this model. The reported memory usage and processing times suggest I could possibly run it on my local desktop.
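
To make the alternating-attention point above concrete, here is a minimal PyTorch sketch of the general pattern: one layer attends within each frame's patch tokens, the next attends globally across all frames. This is an illustration of the idea only; the layer sizes, normalization placement, and block ordering are assumptions and not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Sketch of alternating frame-wise and global self-attention.
    Hyperparameters and structure are illustrative assumptions."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames, patches, dim)
        b, f, p, d = tokens.shape

        # Frame-wise attention: each frame attends only to its own patch tokens.
        x = tokens.reshape(b * f, p, d)
        h = self.norm1(x)
        x = x + self.frame_attn(h, h, h)[0]

        # Global attention: tokens from all frames attend to each other.
        x = x.reshape(b, f * p, d)
        h = self.norm2(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(b, f, p, d)

# Example: a batch with 2 frames of 196 patch tokens each.
block = AlternatingAttentionBlock()
out = block(torch.randn(1, 2, 196, 256))
print(out.shape)  # torch.Size([1, 2, 196, 256])
```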

🔍 Related Work

  • Structure from Motion - A classical computer vision problem that reconstructs camera poses and sparse point clouds from multiple images of a static scene.

  • Multi-view Stereo - Reconstruction of scene geometry from overlapping images with known camera parameters.

  • Tracking any Point - Tracks points of interest across video sequences.

📄 Attachments

PDF
📄 View PDF
Code
🧑‍💻 GitHub Repository
Paper Link
🔗 External Page