Predicting Ground-Level Scene Layout from Aerial Imagery
Highlights
- The dataset used, CVUSA, contains 1.5 million geotagged pairs of ground and aerial images from across the United States. The subset derived for this paper contains ~44k image pairs.
- The aerial-to-ground image synthesis is both novel and extremely interesting. As this paper is from 2017, and computer vision has advanced considerably since then, it would be interesting to find papers that build on the authors' work.
- The geocalibration section reminds me of an interesting version of GeoGuessr, where one could align a ground-level visualization against an aerial layout and predict the orientation (a rough sketch of such an alignment follows this list).
- Although novel, the paper is missing enough critical pieces that it should have earned a weak reject.
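Not from the paper: a minimal numpy sketch of how that alignment could be scored, by circularly shifting the predicted panorama layout against the observed ground-level labels. The function name, the pixel-agreement score, and the shift-to-heading conversion are my own assumptions, not the authors' procedure.

```python
import numpy as np

def estimate_orientation(predicted_layout, observed_labels):
    """Score every circular shift of the predicted panorama layout against the
    observed ground-level labels and return the best heading estimate.

    predicted_layout: (H, W) integer class map predicted from the aerial image.
    observed_labels:  (H, W) integer class map from the ground-level image.
    For a 360-degree panorama, a column shift maps linearly to a heading angle.
    """
    H, W = observed_labels.shape
    best_shift, best_score = 0, -1.0
    for shift in range(W):
        rolled = np.roll(predicted_layout, shift, axis=1)   # rotate the panorama horizontally
        score = float(np.mean(rolled == observed_labels))   # pixel-wise label agreement
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift * 360.0 / W, best_score
```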
Summary
Methods for per-pixel labeling of aerial imagery have historically relied on manually annotated data. Unfortunately, this data can be cost-prohibitive to create, and models trained on it can suffer decreased performance when applied to aerial imagery from a different source. The authors thus propose a novel technique for predicting the semantic layout of a ground image given only a corresponding aerial image of the same location. The authors show that their technique can be applied to tasks such as orientation prediction and synthesis of ground-level visualizations from aerial layouts.
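Since the supervision comes entirely from the ground side, the training signal can be pictured as below. This is a minimal sketch under my own assumptions, not the authors' code: `ground_segmenter` is a hypothetical frozen, pretrained ground-image segmenter producing per-pixel pseudo-labels, `aerial_to_ground` is the hypothetical network being trained, and both are assumed to output label maps at the same ground-panorama resolution.

```python
import torch
import torch.nn.functional as Fn

def training_step(aerial_img, ground_img, aerial_to_ground, ground_segmenter, optimizer):
    """One weakly supervised update: the ground image's per-pixel labels,
    produced by a frozen pretrained segmenter, supervise the layout predicted
    from the aerial image alone."""
    with torch.no_grad():
        target = ground_segmenter(ground_img).argmax(dim=1)   # (B, H, W) pseudo-labels
    logits = aerial_to_ground(aerial_img)                      # (B, C, H, W) predicted ground layout
    loss = Fn.cross_entropy(logits, target)                    # no manually labeled aerial pixels needed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```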
Key Contributions
- A novel CNN architecture that "relates the appearance of an aerial image to the semantic layout of a ground image of the same location". The architecture models this through three sub-networks: A(), which extracts the aerial semantic features; S(), which encodes the spatial contextual features of the aerial image; and F(), which learns the transformation across views, i.e., where each pixel in the ground view maps to in the aerial view (see the sketch after this list).
- An application of the technique to tasks such as orientation estimation and ground-image synthesis.
- The authors use semantic labels from ground images as a form of weak supervision for the aerial imagery.
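To make the A()/S()/F() decomposition above concrete, here is a minimal PyTorch sketch of one way it could be wired together: F produces a softmax-normalized matrix over aerial locations that transfers the per-pixel aerial semantics from A into the ground panorama. The layer sizes, grids, pooling in S, and the softmax normalization are my assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class AerialToGroundSketch(nn.Module):
    """Illustrative sketch of an A/S/F decomposition, not the paper's network."""

    def __init__(self, num_classes=4, aerial_grid=(16, 16), ground_grid=(8, 40)):
        super().__init__()
        self.aerial_grid, self.ground_grid = aerial_grid, ground_grid
        n_aerial = aerial_grid[0] * aerial_grid[1]
        n_ground = ground_grid[0] * ground_grid[1]
        self.A = nn.Conv2d(3, num_classes, kernel_size=3, padding=1)   # aerial semantics
        self.S = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                               nn.AdaptiveAvgPool2d(1), nn.Flatten())  # spatial context
        self.F = nn.Linear(32, n_ground * n_aerial)                    # cross-view mapping

    def forward(self, aerial):
        B = aerial.shape[0]
        a = Fn.interpolate(aerial, size=self.aerial_grid,
                           mode="bilinear", align_corners=False)
        sem = self.A(a).flatten(2)                     # (B, classes, n_aerial)
        ctx = self.S(a)                                # (B, 32)
        T = self.F(ctx).view(B, -1, sem.shape[-1])     # (B, n_ground, n_aerial)
        T = T.softmax(dim=-1)                          # each ground pixel attends to aerial pixels
        ground = torch.einsum("bga,bca->bcg", T, sem)  # transfer semantics across views
        return ground.view(B, -1, *self.ground_grid)   # (B, classes, Hg, Wg) ground layout
```

Calling `AerialToGroundSketch()(torch.randn(2, 3, 256, 256))` returns a (2, 4, 8, 40) tensor of per-class scores for the predicted ground layout.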
Strengths
- Very approachable paper, easy to read. Avoids getting too bogged down in derivations.
- It was easy to connect the figures to the text and follow what is going on.
- Good demonstration of weak supervision transfer.
Weaknesses / Questions
- There is a question of how well the method would work for dynamic objects such as cars or pedestrians. To the authors' credit, they acknowledge this as a limitation in Section 4.
- The synthesized images are extremely low in resolution and quality (64x320) and struggle with complex content (e.g., skies and man-made structures).
- Subjective, but I believe the abstract could have been better written.
- 10 training epochs is incredibly low and would have been an immediate red flag for me; the low epoch count is not justified in the implementation section.
- The authors do not explain, or even attempt to explain, why VGG16 achieves better precision than their solution at higher numbers of training samples.
- Why is only per-class precision measured? If recall, F1, or mIoU was measured but uninteresting, it should have gone in an appendix. This reads like metric cherry-picking (a sketch of the missing metrics follows this list).
- The method is trained on only 4 semantic classes (i.e., road, vegetation, building, and sky). There is no indication of how well it scales to more classes.
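For reference, the metrics asked about above are cheap to report together. This is a minimal sketch, not the authors' evaluation code, assuming flat integer label maps and the paper's 4 classes:

```python
import numpy as np

def per_class_metrics(pred, target, num_classes=4):
    """Per-class precision, recall, F1, and IoU from integer label arrays."""
    pred, target = pred.ravel(), target.ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (target, pred), 1)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    eps = 1e-9
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return {"precision": precision, "recall": recall, "f1": f1,
            "iou": iou, "mIoU": iou.mean()}
```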
Related Work
- CVUSA - A dataset of 1.5 million geotagged pairs of ground and aerial images from across the United States