|
Project developed at the LaSTIG lab at IGN in collaboration with Thales, within the MicMac photogrammetry software framework, and accepted at the CVPR EarthVision Workshop 2023.
We present three multi-scale similarity learning architectures, or DeepSim-Nets. These models learn pixel-level matching with a contrastive loss and are agnostic to the geometry of the considered scene. We establish a middle ground between hybrid and end-to-end approaches by learning to densely allocate all corresponding pixels of an epipolar pair at once. Our features are learnt on large image tiles so that they are expressive and capture the scene's wider context. We also demonstrate that curated sample mining improves the overall robustness of the predicted similarities and the performance on radiometrically homogeneous areas. We run experiments on aerial and satellite datasets. Our DeepSim-Nets outperform the baseline hybrid approaches and generalize better to unseen scene geometries than end-to-end methods. Our flexible architecture can be readily adopted in standard multi-resolution image matching pipelines.
|
Sampling for dense matching. Training follows the self-supervised contrastive learning paradigm. Conversely to patch-based training, our triplets are sets of features output by a multi-scale CNN backbone. Our sample mining scheme enforces the matching uniqueness constraint and the robustness of the learnt similarities.
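To illustrate the dense triplet paradigm described above, here is a minimal PyTorch sketch, not the authors' implementation: anchors are left-image features, positives are right-image features sampled at the ground-truth disparity, and negatives are taken a few pixels off the true match on the same epipolar line (a toy stand-in for the curated sample mining). The function name, the `margin`, and the fixed `neg_offset` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dense_triplet_loss(feat_left, feat_right, disparity,
                       margin=0.3, neg_offset=4.0):
    """Toy dense triplet loss over epipolar feature maps.

    feat_left, feat_right: (B, C, H, W) feature maps from the backbone.
    disparity: (B, H, W) ground-truth left-to-right disparity.
    Positives sit at the true disparity; negatives are shifted by
    `neg_offset` pixels along the epipolar line, mimicking the
    uniqueness constraint (illustrative choice, not the paper's miner).
    """
    B, C, H, W = feat_left.shape
    xs = torch.arange(W, device=feat_left.device, dtype=torch.float32)
    ys = torch.arange(H, device=feat_left.device, dtype=torch.float32)
    xs = xs.view(1, 1, W).expand(B, H, W)
    ys = ys.view(1, H, 1).expand(B, H, W)

    def sample(fmap, x, y):
        # Normalise pixel coordinates to [-1, 1] and sample bilinearly.
        gx = 2.0 * x / (W - 1) - 1.0
        gy = 2.0 * y / (H - 1) - 1.0
        grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
        return F.grid_sample(fmap, grid, align_corners=True)

    pos = sample(feat_right, xs - disparity, ys)               # true match
    neg = sample(feat_right, xs - disparity + neg_offset, ys)  # off-match

    d_pos = (feat_left - pos).pow(2).sum(dim=1)  # per-pixel squared distance
    d_neg = (feat_left - neg).pow(2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```

In practice the loss would be averaged over the backbone's multiple scales and the negatives chosen by the mining scheme rather than a fixed offset; the sketch only shows the per-pixel triplet structure.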
Acknowledgements |