VGGT: Visual Geometry Grounded Transformer

Venue
CVPR
Year
2025
Authors
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny
Topic
CV

🌟 Highlights

  • Overall, it's a solid paper with novel contributions.

  • The authors introduce a transformer neural network that can directly perform 3D reconstruction given a set of images.

📝 Summary

VGGT introduces a large-scale feed-forward transformer neural network for 3D scene reconstruction. Given a set of one or more images, VGGT predicts key attributes of the scene, including camera parameters, multi-view depth maps, and a dense point cloud. VGGT runs as a single forward pass and still achieves better performance than state-of-the-art baselines that explicitly require post-processing. The authors also show that VGGT can benefit downstream tasks such as dynamic point tracking.
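
As a rough illustration of the single-forward-pass interface described above, here is a minimal usage sketch. The module paths, the `VGGT` class, the `facebook/VGGT-1B` checkpoint name, and the `load_and_preprocess_images` helper are assumptions based on the linked GitHub repository and should be checked against its README; this is a sketch, not the authors' exact usage.

```python
import torch
# Assumed module paths from the linked GitHub repository; verify against its README.
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

# Checkpoint name is an assumption; the repo distributes pretrained weights via from_pretrained.
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# Any number of views of the same scene, from a single image upward.
images = load_and_preprocess_images(["scene/view1.png", "scene/view2.png"]).to(device)

with torch.no_grad():
    # One forward pass returns the predicted attributes (camera parameters,
    # depth maps, point maps, ...) without any iterative post-processing.
    predictions = model(images)

print(predictions.keys())
```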

🧩 Key Contributions

  • The authors introduce VGGT, a feed-forward transformer neural network that can perform 3D reconstruction given one or many images of a scene.

  • Instead of relying on a pipeline that uses post-processing to optimize 3D geometry, VGGT performs 3D reconstruction in a single forward pass.

  • Although the paper claims that post-processing is not needed, post-processing (i.e., bundle adjustment) is still shown to improve VGGT's output.

✅ Strengths

  • Very clean and well-structured paper; easy to jump across sections for reference, too.

  • The algorithmic/equation aspects were introduced clearly; there weren't any equations that required enormous mental effort to understand.

  • I like the architecture figure (Figure 2) as well as the formatting of the results tables.

⚠️ Weaknesses / Questions

  • The issue of tightly coupling the model and post-processing is demonstrated almost exclusively through discussions of VGGSfM, a prior work from the same research group. The other baselines are mentioned only in passing in the related-work section. A slightly broader survey in the first paragraph would strengthen the introduction.

  • The alternating-attention component, in which the transformer alternates between frame-wise and global attention, is presented without much substance (see the sketch after this list). Mixing local and global attention in this way is established in video transformers (e.g., Swin Transformer). Citing these works while establishing the novel way this paper uses alternating attention in a multi-view context would have landed better.

  • More of a personal note: it's not clear whether one explicitly needs a state-of-the-art H100 GPU to run this model. The reported memory usage and processing times suggest I could possibly run it on my local desktop.
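
To make the alternating-attention point above concrete, here is a minimal PyTorch sketch of the general pattern: one layer attends within each frame's patch tokens, the next attends globally across all frames. This is an illustration of the idea only; the layer sizes, normalization placement, and block ordering are assumptions and not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """Sketch of alternating frame-wise and global self-attention.
    Hyperparameters and structure are illustrative assumptions."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, frames, patches, dim)
        b, f, p, d = tokens.shape

        # Frame-wise attention: each frame attends only to its own patch tokens.
        x = tokens.reshape(b * f, p, d)
        h = self.norm1(x)
        x = x + self.frame_attn(h, h, h)[0]

        # Global attention: tokens from all frames attend to each other.
        x = x.reshape(b, f * p, d)
        h = self.norm2(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(b, f, p, d)

# Example: a batch with 2 frames of 196 patch tokens each.
block = AlternatingAttentionBlock()
out = block(torch.randn(1, 2, 196, 256))
print(out.shape)  # torch.Size([1, 2, 196, 256])
```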

🔍 Related Work

  • Structure from Motion - A classical computer vision problem that reconstructs camera poses and sparse point clouds from multiple images of a static scene.

  • Multi-view Stereo - Reconstruction of scene geometry from overlapping images with known camera parameters.

  • Tracking any Point - Tracks points of interest across video sequences.

📄 Attachments

PDF
📄 View PDF
Code
🧑‍💻 GitHub Repository
Paper Link
🔗 External Page