1. Introduction
The growing trend of organizations opening their policies on satellite-based acquisition of multimodal data has significantly increased the availability of such data. The highly complementary nature of optical images and SAR data makes them invaluable for various applications, including land cover analysis, building damage assessment, earth resource surveys, crop identification, and more [1,2].
Optical remote sensing images, captured in various bands, offer distinct representations of feature information. Their imaging mechanism aligns with human visual perception, which makes optical remote sensing images a cornerstone of most current deep-learning-based semantic segmentation research. However, these images are limited by variable weather, seasons, illumination, and geographic factors; moreover, clouds and shadows can obscure feature information. In contrast, synthetic aperture radar (SAR) images provide a rich array of landscape information from diverse material and physical viewpoints. SAR, an active sensor, detects backscatter information and is particularly sensitive to geometric and physical surface properties, such as roughness, temperature, and the complex dielectric constant [3]. Although SAR images can penetrate clouds and are largely unaffected by weather, they present challenges such as foreshortening, shadowing, and speckle noise, making their imaging characteristics less intuitive for human visual interpretation. The fusion of data from various imaging sources offers a solution to the difficulties posed by single-modal data, especially in the land cover classification (LCC) task [4]. These multimodal data offer different perspectives on the same phenomena [1].
Fusing multimodal information for LCC tasks presents significant challenges because of the considerable differences between images captured by separate sensors; pre-processing the different modalities and determining effective fusion methods are both difficult [5]. For example, to effectively use fused multimodal data for remote sensing image interpretation, speckle reduction and registration of the multimodal data are necessary, because the backscattered signal undergoes coherent processing and SAR images are corrupted by multiplicative speckle noise [6]. Registering multimodal data is a complex task because of the significant geometric distortions between the datasets. Several methods have been proposed to accomplish multimodal registration, and they can be broadly categorized into area-based, feature-based, and learning-based pipelines [7]. For example, among feature-based pipelines, one study proposed the Harris-PIFD image registration framework with multi-scale features [8]. Among area-based methods, one approach uses two phases, coarse and fine registration, and employs first- and second-order gradients in the fine registration phase to handle geometric deformation [9]. Learning-based methods offer further possibilities for multimodal registration, such as integrating deep learning with traditional methods to form new registration pipelines. Learning-based pipelines can also convert one modality into another, transforming the complex multimodal registration problem into a simpler same-source registration problem, or directly regress the transformation parameters between multimodal data [7]. Overall, targeted research has improved multimodal data registration. However, once registration is complete, the question remains of how to exploit multimodal data with significant differences to interpret remote sensing images more effectively.
Multimodal data fusion methods can be categorized into pixel-level, feature-level, and decision-level approaches [2]. Pixel-level fusion applies rules such as intensity–hue–saturation, Gram–Schmidt orthogonalization, the Brovey transform, high-pass filtering, principal component analysis, the wavelet transform, and the generalized Laplacian pyramid to fuse the pixel values of multimodal data [10,11]. The fused images have richer content but are more computationally demanding and require strict registration. Feature-level fusion extracts features such as edges, shapes, and textures from the images and then fuses them; downstream tasks can be accomplished more accurately based on these representative fused features. Decision-level fusion combines information from classified images and refines uncertain information using decision rules, and its registration requirements are less stringent [6].
Considerable research has addressed these three fusion strategies using traditional approaches. To achieve effective LCC with multimodal data, the study in [10] proposed a stochastic-gradient-descent-based image fusion method leveraging three directional color components and a Sobel operator. Addressing the issue of noise in SAR images, the study in [12] introduced a bilateral filter based on a pixel-by-pixel similarity metric; this method successfully fuses multimodal data using a co-aligned optical image as a reference. In the realm of feature-level fusion, Zhang et al. [13] proposed a technique that extracts spectral, texture, and spatial features from both optical images and SAR data and additionally incorporates the normalized difference vegetation index as well as elevation and slope information. Despite the recognized importance of feature normalization in data processing, most existing normalization methods are not suitable for fusing multimodal data, primarily because of the differing imaging mechanisms of optical images and SAR data. To address this challenge, the study in [14] proposed a scale normalization algorithm designed specifically to combine multimodal data effectively, thereby facilitating the evaluation of LCC. To integrate contextual information into the multimodal fusion process for downstream tasks such as LCC, the Markov random field (MRF) approach has been employed. Decision-level fusion involves combining the various individual decisions to generate a common decision [15].
The commonly used deep learning methods for multimodal fusion can be broadly categorized into two types: those based on weight sharing and those in which the weights are not shared. Because of the massive differences between SAR data and optical images, many studies have shown that better inference results can be achieved with non-weight-sharing methods [16]. The prevalent deep learning methods are mainly based on feature-level fusion. For example, Zhang et al. [17] constructed a model based on a classical network structure, with a dual encoder and a shared decoder, to process both SAR and optical images and thereby leverage their complementary aspects. That study also proposed a triple-attention feature network model integrating a self-attention module, a spatial attention module, and a spectral information attention module, which synergistically enhance the utilization of multimodal features. Li et al. [18] analyzed distribution histograms of the deep semantic features of optical and SAR images, revealing the complementary nature of their feature information, and developed a model with a dual-line feature extractor, a multimodal attention module, and a gated heterogeneous data fusion module to improve the accuracy of multimodal fusion segmentation results. Additionally, Li et al. [16] designed a multimodal bilinear fusion network that accomplishes feature extraction using an encoder, a channel selection module for second-order attention, and a bilinear fusion module.
Traditional approaches typically rely on manually designed features, such as pixel colors in the image space and gradient histograms, and thus depend heavily on domain-specific knowledge [2]. Consequently, when using traditional methods, whether for feature-level or decision-level fusion, tuning the appropriate parameters is a complex task. Moreover, machine learning methods such as support vector machines are limited by the representational power of these manual features and therefore do not perform well in terms of robustness or accuracy [19]. The rapid development of deep learning provides a new approach to multimodal data fusion: owing to its nonlinear expressiveness and strong feature extraction capability, it can leverage the diverse characteristics of remote sensing images, including spectral, textural, and structural information [7,13]. Deep learning therefore shows great potential. In general, however, the choice of fusion strategy must be based on the specific downstream task, and no single fusion strategy works in all situations [19]. For example, underwater detection suffers from image noise, texture blur, low contrast, color distortion, impurity particles affecting optical imaging, and blurring and atomization. It is therefore often difficult to comprehensively express the characteristics of an object with single-modal data, and multiple features usually have to be combined to accomplish the task accurately. In this case, it is necessary to apply underwater histogram enhancement and Retinex-based enhancement, to simulate the generation of underwater images through generative adversarial networks, or to perform underwater restoration or other processing [20]. Meanwhile, synthetic aperture sonar (SAS), which originates from SAR but operates in a different environment, can perform underwater imaging, and multimodal data fusion based on this type of imaging data can support underwater observation and image interpretation [21,22].
In our research, we developed modules that can be integrated into popular encoding–decoding network architectures. These enhancements improve feature extraction and overcome the limitations of optical images, such as their susceptibility to weather conditions and inability to capture feature information in real time. By using multimodal data, our model overcomes these challenges. The proposed model consists of two primary components. The first processes SAR images to address the significant disparities between SAR and optical data. The second is a two-way network designed for multimodal semantic segmentation tasks. Images fed into this dual-input network model undergo image-level fusion and encoding, resulting in shallow feature maps rich in geometric features. We focus on stitching shallow feature maps because the geometric features they capture, such as shapes, are similar in optical images and SAR data. An attention module is incorporated to effectively extract the correlation and rich contextual information of the multimodal data. This module ensures that the fusion process goes beyond simple concatenation, preserving the complementary characteristics of the multimodal data [19]. The main contributions of this study are summarized as follows:
(1) We suggest implementing image-level fusion through the PCA transform prior to performing feature-level fusion of multimodal data.
(2) Our approach introduces a two-input network model designed explicitly for feature-level fusion of multimodal data. In this model, shallowly encoded feature maps are fused, and an attention model is incorporated to achieve the LCC task for multiclass targets effectively.
(3) The efficacy of our proposed method is validated through experimental results on various public multi-class LCC datasets.
The structure of the remainder of this paper is as follows: Section 2 provides an overview of the proposed image-level and feature-level fusion methods, as well as the loss function utilized. Section 3 details the experimental datasets and settings, offers qualitative and quantitative analyses of the experimental results for different models on public datasets, and includes a comparison with other prevalent multimodal fusion methods. Finally, Section 4 summarizes our study and outlines potential directions for future research in this field.
2. Overview
To improve the inference results of LCC, we propose a multimodal fusion approach that incorporates both image-level and feature-level fusion methods.
Figure 1 depicts the overall architecture of the proposed approach. The methodology is divided into two primary parts: image-level and feature-level fusion. We carried out experiments with several public datasets and established networks, aiming to develop a generalized strategy that effectively utilizes multimodal data for LCC. The detailed experimental procedure is described in the following sections.
As presented in Figure 2, the visualization of multimodal data intuitively showcases the optical image and SAR data of typical geographic elements, such as farmland, buildings, water, and forests. A comparative analysis of the figure reveals that distinguishing features such as farmland is challenging when relying solely on optical images, whereas SAR data provide a clearer classification. For elements such as buildings and water, features such as texture, color, and shape are more discernible in optical images but less apparent in SAR data. Therefore, finding an appropriate method to fuse optical images with SAR data is crucial for enhancing the effectiveness of LCC tasks.
In optical images, different classes of pixels can display similar spectral characteristics [23]. This phenomenon is evident in Figure 3, which shows spectral reflectance values across different bands. It can be observed that roads and buildings exhibit similar reflectance values in optical images. Similarly, the spectral reflectance values of farmland and forest are closely aligned in the blue and red bands. As a result, when relying solely on optical images for LCC, features with similar spectral reflectance values, manifesting as similar colors and other characteristics, become challenging to distinguish. However, these features exhibit distinctly different characteristics in SAR data. Consequently, fusing optical images with SAR data proves to be an effective strategy for accomplishing LCC tasks.
Figure 1 demonstrates the proposed process for fusing multimodal data. This process encompasses two key phases to achieve multimodal multi-target semantic segmentation based on remote sensing images: the use of the PCA transform for image-level fusion of optical images with SAR data, and feature-level fusion following the encoder stage. For processing optical images, a network model can be chosen from several widely recognized semantic segmentation structures, including U-Net [24], UNetPlusPlus [25], EfficientNet-UF [26], and Swin-Transformer [27]. The initial step in the process involves encoding the data using popular encoders designed for SAR data processing. Numerous available encoders come with suitable pre-training weights, enhancing their efficacy; notable examples include VGG [28] and ResNet [29], among other structures.
2.1. Image-Level Fusion
The PCA transformation is an effective and widely used method for reducing data dimensionality. It transforms correlated data into a set of uncorrelated features through an orthogonal transformation. The leading principal components (PCs) capture the greatest variance in the features. This method is notable for its ability to minimize information loss [30].
To achieve feature-level fusion of multimodal data, most current deep-learning-based fusion strategies develop specialized fusion modules. In this study, the multimodal fusion task is instead preceded by an image-level fusion step. The primary steps for processing the SAR data are outlined below. SAR data differ significantly from optical images, as they employ active microwave transmission for feature detection while simultaneously receiving ground-based echo information. To optimally harness the feature information from both optical and SAR data, we propose applying PCA fusion to these data types. This strategy reduces the number of parameters needed to develop dedicated deep learning fusion modules, thereby achieving a more efficient feature-level fusion approach. Simultaneously, PCA fusion generates three-band images as the SAR-branch input, which can then be initialized with pre-training weights that provide good starting values, enhancing the efficiency of subsequent training. The ultimate goal is to complete the semantic segmentation task using multimodal data while addressing two main challenges: reducing the semantic differences between the datasets and accommodating the fact that SAR data typically consist of single-band images. The information from the SAR and optical images is then fed into an encoder designed for processing SAR data. These procedural steps are detailed in Figure 4.
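As a concrete illustration, the following is a minimal sketch of one way such a PCA-based image-level fusion could be implemented. The band standardization, the choice of stacking all optical bands with the SAR band, and keeping the first three principal components are assumptions made here for illustration; the exact procedure used in this work is the one shown in Figure 4.

```python
import numpy as np

def pca_image_fusion(optical, sar, n_components=3):
    """Fuse a multi-band optical image with a single-band SAR image via PCA.

    Sketch only: optical bands and the SAR band are stacked, projected onto
    their principal components, and the first three components are kept as a
    3-band fused image.

    optical: (H, W, B) float array; sar: (H, W) float array.
    """
    h, w, b = optical.shape
    stack = np.concatenate([optical.reshape(-1, b),
                            sar.reshape(-1, 1)], axis=1)       # (H*W, B+1)
    stack = (stack - stack.mean(0)) / (stack.std(0) + 1e-8)    # standardize bands
    cov = np.cov(stack, rowvar=False)                          # (B+1, B+1) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                          # sort by descending variance
    pcs = stack @ eigvecs[:, order[:n_components]]             # leading principal components
    return pcs.reshape(h, w, n_components)
```

Keeping exactly three components is convenient because it matches the three-channel input expected by encoders shipped with ImageNet pre-trained weights, consistent with the motivation described above.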
2.2. Feature-Level Fusion
To achieve feature-level fusion, we performed separate encoding operations on the optical images and on the fused multimodal images that had previously undergone image-level fusion. During encoding, the feature maps rich in geometric features obtained from the shallow layers of the encoders were fused. Some existing linear fusion methods that perform element-wise summation assign identical weights to the different modalities, overlooking the fact that each modality contributes differently to the various land classes. In contrast, an attention module can assign distinct weights to features according to their respective contributions [2,18,31]. Accordingly, a channel attention module was added to allocate more weight to the channels containing important information, completing the encoding operation. Figure 5 provides a visual representation of this fusion method.
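Channel attention of this kind can be realized, for example, with a squeeze-and-excitation style block. The following is a minimal PyTorch sketch; the reduction ratio and layer sizes are illustrative assumptions rather than the exact configuration of our module.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative sketch).

    Channel-wise global average pooling followed by a two-layer bottleneck
    produces a per-channel weight in (0, 1) that re-scales the concatenated
    optical/SAR feature maps.
    """
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (N, C, H, W)
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c))    # per-channel weights
        return x * w.view(n, c, 1, 1)
```

A block of this form adds only a small number of parameters relative to the encoders, which keeps the fusion step lightweight.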
In semantic segmentation models, the encoder for processing optical images is typically predefined. For our research, we selected resnet18 [29] as the encoder for processing the SAR data. We also incorporate an attention module, specifically the channel attention module, whose primary function is to assign proportional weights to the different channels, based on the amount of information they contain, after the optical image and SAR data have been concatenated.
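To make the data flow concrete, the sketch below shows how the shallow feature maps of a two-branch ResNet-18 encoder could be concatenated and re-weighted by channel attention. The specific layers fused (the stride-4 output of layer1), the channel counts, and the reuse of the ChannelAttention class sketched above are assumptions for illustration; the deeper encoder stages and the decoder are omitted.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DualBranchShallowFusion(nn.Module):
    """Sketch of the two-input, feature-level fusion step (assumed layout)."""
    def __init__(self):
        super().__init__()
        def stem():
            # First stages of an ImageNet pre-trained ResNet-18 (stride 4, 64 channels).
            r = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            return nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool, r.layer1)
        self.opt_stem = stem()                      # optical branch
        self.sar_stem = stem()                      # PCA-fused SAR branch (3-band input)
        self.attn = ChannelAttention(channels=128)  # block sketched above

    def forward(self, optical, sar_fused):
        f_opt = self.opt_stem(optical)              # (N, 64, H/4, W/4)
        f_sar = self.sar_stem(sar_fused)            # (N, 64, H/4, W/4)
        fused = torch.cat([f_opt, f_sar], dim=1)    # (N, 128, H/4, W/4)
        return self.attn(fused)                     # channel-weighted fused features
```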
After completing the feature fusion step, the probability of each land cover class is obtained using softmax. The most commonly used loss function in LCC is the cross-entropy loss [32,33,34], and some studies have explored the influence on inference results of adding label smoothing to the loss function [35]. To keep the experimental inference results comparable, the standard cross-entropy loss is selected, defined as
\[
L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{ic}\,\log\left(p_{ic}\right),
\]
where N denotes the number of samples, M denotes the number of categories, y_{ic} takes the value 0 or 1 (1 if the true category of sample i is c and 0 otherwise), and p_{ic} denotes the probability that sample i belongs to category c.
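As a quick illustration of how this loss is computed in practice, the snippet below evaluates the per-pixel cross-entropy for a segmentation output; the tensor shapes and the number of classes (M = 6) are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

# Per-pixel multi-class cross-entropy for LCC (illustrative shapes).
logits = torch.randn(2, 6, 64, 64)               # (N, M, H, W) network output
labels = torch.randint(0, 6, (2, 64, 64))        # ground-truth class index per pixel

probs = F.softmax(logits, dim=1)                 # p_ic in the equation above
manual = -(probs.clamp_min(1e-12).log()
           .gather(1, labels.unsqueeze(1))       # pick log p_ic of the true class
           .mean())
loss = F.cross_entropy(logits, labels)           # built-in equivalent
print(manual.item(), loss.item())                # the two values agree
```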