
MESA: Effective Matching Redundancy Reduction by Semantic Area Segmentation

IEEE TPAMI 2025

Yesheng Zhang, Shuhan Shen, Xu Zhao

{preacher, zhaoxu}@sjtu.edu.cn, shuhan.shen@ia.ac.cn

Paper: arXiv | Paper: TPAMI | Code: GitHub

Graphical_abstract

Abstract

Matching redundancy, which refers to fine-grained feature comparison between irrelevant image areas, is a prevalent limitation in current feature matching approaches. It leads to unnecessary and error-prone computations, ultimately diminishing matching accuracy. To reduce matching redundancy, we propose MESA and DMESA, both leveraging the advanced image understanding of the Segment Anything Model (SAM) to establish semantic area matches prior to point matching. These informative area matches can then undergo effective internal feature comparison, facilitating precise intra-area point matching. Specifically, MESA adopts a sparse matching framework, while DMESA applies a dense one. Both first obtain candidate areas from SAM results through a novel Area Graph (AG). In MESA, matching the candidates is formulated as graph energy minimization and solved by graphical models derived from the AG. In contrast, DMESA performs area matching by generating dense matching distributions over the entire image, aiming at enhancing efficiency. The distributions are produced from off-the-shelf patch matching, modeled as a Gaussian Mixture Model, and refined via Expectation Maximization. With less repetitive computation, DMESA showcases an area matching speed improvement of nearly five times compared to MESA, while maintaining competitive accuracy. Our methods are extensively evaluated on four different tasks across six datasets, encompassing both indoor and outdoor scenes. The results show that our methods achieve notable accuracy improvements for nine point matching baselines in most cases. Furthermore, our methods exhibit promising generalization and improved robustness against image resolution variations. Code is available at github.com/Easonyesheng/A2PM-MESA.


Video


Motivation

Matching Redundancy Reduction by Semantic Area Matching

Recently, most mainstream advanced image matching methods adopt Transformer-based learning models. They perform dense cross-image attention, integrate cross-view information, and produce accurate matching points. Nevertheless, these methods suffer from matching redundancy: dense attention often performs elaborate yet unnecessary computation between irrelevant image areas. This wastes substantial computing power and is susceptible to interference (e.g., illumination changes and repeated textures), leading to incorrect matches and ultimately reducing overall accuracy.

However, this issue can be mitigated by first acquiring highly relevant cross-view image area pairs (area matching) via perceptual priors, then using these areas as input for subsequent dense matching. To this end, we turn to the recent Segment Anything Model (SAM). Its universal segmentation capability provides a valuable prior for area matching. Specifically, trained on massive data, SAM has developed robust semantic perception capabilities, allowing it to generate areas that correspond to sets of semantic entities (referred to as semantic areas). Notably, the compact nature of such semantic representations further boosts the accuracy of intra-area point matching.

Thus, our method leverages SAM's segmentation results to first conduct cross-image semantic area matching, followed by intra-area point matching, which effectively reduces matching redundancy while enhancing overall matching accuracy. We term this paradigm the Area-to-Point Matching (A2PM) framework.
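For concreteness, the sketch below (in Python) outlines the A2PM pipeline at a high level: segment semantic areas, match areas across images, then match points inside each matched area pair and lift them back to full-image coordinates. The callables segment_areas, match_areas, and match_points are hypothetical placeholders for SAM segmentation, MESA/DMESA area matching, and an off-the-shelf point matcher; this is a minimal illustration, not the actual API of the A2PM-MESA repository.

import numpy as np

def a2pm_match(img0, img1, segment_areas, match_areas, match_points):
    """Area-to-Point Matching sketch: match areas first, then points inside them.

    segment_areas, match_areas, and match_points are user-supplied callables
    standing in for SAM segmentation, MESA/DMESA area matching, and any
    off-the-shelf point matcher, respectively (hypothetical placeholders).
    """
    areas0 = segment_areas(img0)  # list of (x0, y0, x1, y1) area boxes
    areas1 = segment_areas(img1)
    area_pairs = match_areas(img0, areas0, img1, areas1)  # list of (box0, box1)

    matches = []
    for (xa0, ya0, xa1, ya1), (xb0, yb0, xb1, yb1) in area_pairs:
        crop0 = img0[ya0:ya1, xa0:xa1]
        crop1 = img1[yb0:yb1, xb0:xb1]
        pts0, pts1 = match_points(crop0, crop1)  # (N, 2) arrays in crop coordinates
        # Lift intra-area matches back to full-image coordinates.
        matches.append((pts0 + np.array([xa0, ya0]), pts1 + np.array([xb0, yb0])))
    return matches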

motivation


Overview

To achieve semantic area matching, we propose two algorithms under distinct frameworks: MESA and DMESA. Their shared premise is to integrate SAM segmentation results via a novel Area Graph structure. Based on this, MESA formulates area matching on the graph structure, introducing graphical models and graph cut algorithms for sparse area similarity computation, and achieves semantic area matching through energy-based optimization. However, the sparse matching framework involves considerable redundant computation, which impairs the computational efficiency of MESA. Therefore, we further propose DMESA, which leverages off-the-shelf matching models for dense similarity computation and performs optimization based on a Gaussian Mixture Model (GMM). This dense matching paradigm significantly reduces redundant computation, yielding approximately a 5x improvement in area matching speed over MESA while maintaining comparable accuracy.

overview

Highlights

Area Graph

Due to the nature of its training objectives, SAM's segmentation results tend to have a granularity that is challenging to finely control, rendering them unsuitable for direct use in area matching or intra-area point matching tasks. To overcome this limitation, we propose the Area Graph, which establishes hierarchical associations among segmented areas, enabling precise and simultaneous control over area granularity. Specifically, we first use SAM to perform segmentation with dense point prompts, obtaining the finest-grained segmentations as bottom-level graph nodes. Higher-level nodes with varying granularities are then generated through node filtering and merging. A hybrid edge structure is employed to construct the graph topology: directed edges represent inclusion relationships among area nodes, while undirected edges indicate adjacency. This results in an Area Graph structure that encapsulates both hierarchical features and spatial association information, providing a robust foundation for subsequent cross-view area matching.
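As a rough illustration (a minimal sketch, not the repository's implementation), the code below builds bottom-level nodes from SAM masks and adds directed inclusion edges and undirected adjacency edges. The containment threshold, the dilation used to test adjacency, and the omission of node filtering and merging into higher levels are all simplifying assumptions.

import numpy as np
from dataclasses import dataclass, field
from scipy.ndimage import binary_dilation

@dataclass
class AreaNode:
    mask: np.ndarray                               # binary mask of the area
    level: int = 0                                 # 0 = finest SAM segmentation; higher = merged, coarser
    children: list = field(default_factory=list)   # directed edges: indices of areas contained in this one
    neighbors: list = field(default_factory=list)  # undirected edges: indices of spatially adjacent areas

def build_area_graph(masks, contain_thr=0.9, adj_dilate=3):
    """Build a flat Area Graph from SAM masks (higher-level merging omitted)."""
    nodes = [AreaNode(mask=m) for m in masks]
    for i, a in enumerate(nodes):
        for j, b in enumerate(nodes):
            if i == j:
                continue
            inter = np.logical_and(a.mask, b.mask).sum()
            if inter / max(b.mask.sum(), 1) > contain_thr and a.mask.sum() > b.mask.sum():
                a.children.append(j)               # directed inclusion edge: b lies (almost) inside a
            elif np.logical_and(binary_dilation(a.mask, iterations=adj_dilate), b.mask).any():
                if j not in a.neighbors:
                    a.neighbors.append(j)          # undirected adjacency edge
                if i not in b.neighbors:
                    b.neighbors.append(i)
    return nodes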

method_1


MESA: Sparse Area Matching

For two images captured from different viewpoints, MESA facilitates area matching directly on their corresponding Area Graphs. This task can be framed as node matching between two graphs, enabling the application of established theoretical frameworks such as probabilistic graphical models. By modeling the Area Graph as a Markov Random Field (MRF), node matching can be effectively performed using graph cut algorithms, and optimized through energy models. Additionally, the hierarchical structure of the Area Graph can be represented as a Bayesian network to enhance the efficiency of area similarity calculations.
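To make the formulation concrete, the sketch below casts area matching as MRF energy minimization over the Area Graph: each source area is assigned a candidate target area so as to minimize a unary appearance term plus a pairwise structural-consistency term. MESA itself relies on graphical models and graph cut optimization; the simple iterated conditional modes (ICM) solver here, and the precomputed unary, edges, and pairwise inputs, are stand-ins for illustration only.

import numpy as np

def icm_area_matching(unary, edges, pairwise, n_iters=10):
    """Assign each source area a target candidate by greedy energy minimization.

    unary:    (n_nodes, n_labels) array; unary[i, l] is the appearance
              dissimilarity between source area i and target candidate l.
    edges:    list of (i, j) pairs of source areas connected in the Area Graph.
    pairwise: callable (l_i, l_j) -> cost penalizing assignments that break
              the inclusion/adjacency relation between areas i and j.
    """
    n_nodes, n_labels = unary.shape
    labels = unary.argmin(axis=1)                  # initialize with the best unary label per node
    for _ in range(n_iters):
        for i in range(n_nodes):
            costs = unary[i].astype(float)
            for (a, b) in edges:
                if a == i:
                    costs += np.array([pairwise(l, labels[b]) for l in range(n_labels)])
                elif b == i:
                    costs += np.array([pairwise(labels[a], l) for l in range(n_labels)])
            labels[i] = costs.argmin()             # update node i given its neighbors' labels
    return labels                                  # labels[i] = matched target area for source area i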

method_2


DMESA: Dense Area Matching

Although MESA achieves precise area matching, the substantial overlap between areas leads to considerable redundant computation within its sparse paradigm, resulting in significant computational overhead. To mitigate this inefficiency, we introduce DMESA, which utilizes existing matchers to generate an area matching probability distribution across the entire image, effectively avoiding the redundant calculations caused by overlapping areas. Additionally, the matching distribution is modeled as a Gaussian Mixture Model, allowing for iterative refinement using the Expectation Maximization algorithm, which further enhances the accuracy of area matching.
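The sketch below illustrates the dense idea on a single source area: coarse patch matches supply candidate target-image locations, a small Gaussian Mixture Model is fitted to them with a few EM iterations, and the dominant component is read out as the matched area's center and spread. The component count, initialization, and readout used here are assumptions for illustration and do not reproduce DMESA's exact formulation.

import numpy as np

def em_gmm_2d(points, K=3, n_iters=20):
    """Fit a 2-D GMM to candidate target locations and return the dominant mode.

    points: (N, 2) target-image locations predicted by coarse patch matching
            for one source area.
    """
    N = points.shape[0]
    rng = np.random.default_rng(0)
    means = points[rng.choice(N, K, replace=False)].astype(float)
    covs = np.stack([np.eye(2) * 100.0 for _ in range(K)])
    weights = np.full(K, 1.0 / K)

    for _ in range(n_iters):
        # E-step: responsibility of each component for each point.
        resp = np.zeros((N, K))
        for k in range(K):
            diff = points - means[k]
            inv = np.linalg.inv(covs[k])
            norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(covs[k])))
            resp[:, k] = weights[k] * norm * np.exp(-0.5 * np.sum((diff @ inv) * diff, axis=1))
        resp /= resp.sum(axis=1, keepdims=True) + 1e-12

        # M-step: update mixture weights, means, and covariances.
        Nk = resp.sum(axis=0) + 1e-12
        weights = Nk / N
        for k in range(K):
            means[k] = (resp[:, k:k + 1] * points).sum(axis=0) / Nk[k]
            diff = points - means[k]
            covs[k] = (resp[:, k:k + 1] * diff).T @ diff / Nk[k] + 1e-3 * np.eye(2)

    best = int(weights.argmax())
    return means[best], covs[best]   # center and spread of the matched area in the target image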

method_3


Results

Extensive experiments conducted across six datasets and four downstream tasks have demonstrated that our approaches yield substantial accuracy improvements for a total of nine baseline methods in the vast majority of cases. These results support the effectiveness and versatility of our methods.

exp_1

Point Matching Results

In the experiments on point matching, our method significantly improves performance across three different types of point matching baselines in most cases. For the exceptions, we attribute the cause to overfitting of the point matchers to their training resolution, which we explore further in subsequent experiments.

exp_2


Pose Estimation Results

We conducted experiments on camera pose estimation using a wide range of indoor and outdoor datasets. The results indicate that our MESA and DMESA methods provide relatively significant performance improvements across multiple point matching baseline methods, including sparse, semi-dense, and dense types, in the vast majority of cases.

exp_3 exp_4


Visual Odometry Results

Experiments on the KITTI360 dataset demonstrate that our method improves the performance of baseline methods in long-range pose estimation tasks.

exp_5


3D Reconstruction Results

The 3D reconstruction results on the DTU dataset indicate that our method significantly enhances accuracy for both traditional COLMAP-based reconstruction methods and current state-of-the-art feedforward reconstruction approaches.

exp_6


Resolution Overfitting Study

We initially hypothesized that Transformer-based point matching methods might suffer from resolution overfitting, and this hypothesis was validated through extensive experiments (cf. the figure above). Additionally, we investigated the performance of Transformer-based point matchers using different Positional Encoding methods within the A2PM matching framework. The findings revealed that Absolute Positional Encoding tends to overfit to resolution, ultimately reducing the matching performance of the A2PM framework. In contrast, state-of-the-art point matchers employing Relative Positional Encoding or omitting Positional Encoding performed well within the A2PM framework. Thus, the performance advantage of the A2PM framework is maintained when it is applied to the most advanced matching methods.

exp_7

exp_8


Citation

@article{zhang2025dmesa,
  title={MESA: Effective Matching Redundancy Reduction by Semantic Area Segmentation},
  author={Zhang, Yesheng and Shen, Shuhan and Zhao, Xu},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025}
}