D2-Net: A Trainable CNN for Joint Detection and Description of Local Features

Mihai Dusmanu 1, 2, 3, Ignacio Rocco 1, 2, Tomas Pajdla 4, Marc Pollefeys 3, 5, Josef Sivic 1, 2, 4, Akihiko Torii 6, Torsten Sattler 7

1DI, ENS, 2Inria, 3ETH Zürich, 4CIIRC, CTU in Prague, 5Microsoft, 6Tokyo Institute of Technology, 7Chalmers University of Technology

CVPR 2019

Proposed detect-and-describe (D2) approach. A convolutional neural network is used to extract feature maps that play a dual role: (i) local descriptors are obtained by simply traversing all the feature maps at a given spatial position; (ii) detections are obtained by performing non-maximum suppression spatially on a feature map, followed by a non-maximum suppression across descriptor channels. During training, keypoint detection scores are computed from a soft local-maximum score and a ratio-to-maximum score per descriptor.
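As a rough illustration of the training-time scoring described above, the sketch below combines a soft local-maximum score (exponentials normalized over a spatial neighbourhood) with a ratio-to-maximum score across descriptor channels, then normalizes the result into a score map. The function name, the 3x3 neighbourhood radius, and the NumPy formulation are illustrative assumptions, not the authors' implementation; it assumes non-negative (e.g. post-ReLU) feature maps.

```python
import numpy as np

def soft_detection_scores(D, radius=1):
    """Sketch of D2-style soft keypoint scores for a feature map D
    of shape (C, H, W). Assumes non-negative activations; names and
    the neighbourhood size are illustrative, not the paper's code."""
    C, H, W = D.shape
    E = np.exp(D - D.max())  # numerically stabilised exponentials

    # Soft local-maximum: exp(D) normalised over each pixel's
    # (2*radius+1)^2 spatial neighbourhood, per channel.
    pad = np.pad(E, ((0, 0), (radius, radius), (radius, radius)))
    neigh_sum = np.zeros_like(E)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            neigh_sum += pad[:, dy:dy + H, dx:dx + W]
    alpha = E / neigh_sum

    # Ratio-to-maximum across the C descriptor channels.
    beta = D / (D.max(axis=0, keepdims=True) + 1e-8)

    # Combine per channel, take the best channel per pixel,
    # and normalise into an image-wide score map.
    gamma = (alpha * beta).max(axis=0)
    return gamma / gamma.sum()
```

At test time the paper replaces these soft scores with hard non-maximum suppression; the soft version exists so that detection remains differentiable during training.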

Abstract

In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations. The proposed method obtains state-of-the-art performance on both the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.

Code

The code is available on GitHub at mihaidusmanu/d2-net.

Paper

Mihai Dusmanu 1, 2, 3, Ignacio Rocco 1, 2, Tomas Pajdla 4, Marc Pollefeys 3, 5, Josef Sivic 1, 2, 4, Akihiko Torii 6, Torsten Sattler 7
D2-Net: A Trainable CNN for Joint Detection and Description of Local Features
In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition
[Latest version on arXiv] [Poster]
Erratum
After the camera-ready deadline, we noticed that the average per-image memory usage of features in Table 2 is wrongly reported in megabits (Mb) instead of megabytes (MB) for both D2 and dense features. Nevertheless, our claims regarding improved memory efficiency stand. This has been corrected in the arXiv version.
BibTeX
@InProceedings{Dusmanu2019CVPR,
    author = "Dusmanu, Mihai and Rocco, Ignacio and Pajdla, Tomas and Pollefeys, Marc and Sivic, Josef and Torii, Akihiko and Sattler, Torsten",
    title = "{D2-Net}: {A} {T}rainable {CNN} for {J}oint {D}etection and {D}escription of {L}ocal {F}eatures",
    booktitle = "Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition",
    year = "2019"
}