1. Introduction
Water resources play an important role in the Earth's energy cycles and the development of human society [1]. Therefore, accurately mapping water bodies holds immense significance in various domains, including environmental protection [2,3], urban planning [4,5], flood control [6,7], and disaster mitigation [8]. Due to their ability to rapidly capture extensive surface information at minimal cost [9], remote sensing images (RSIs) have emerged as the predominant data source for water mapping. RSIs exhibit inherent complexity [10], encompassing various types of disturbance information such as man-made structures, forests, and snow, which makes water body extraction difficult and challenging [11]. In addition, the diversity of water distribution and variations in shape and size also limit extraction accuracy [12]. The purpose of this study is to achieve accurate water body extraction from RSIs.
Automatically mapping water bodies from RSIs is a significant and actively researched area within the field of remote sensing and pattern recognition. In the early days, the threshold method was primarily employed for water body extraction, aiming to distinguish water bodies from other objects within one or multiple spectral bands by selecting an appropriate threshold. Unfortunately, this method has proven unsuitable for extracting small water bodies, and challenges are frequently encountered in determining the optimal threshold value [13]. Subsequently, spectral water index methods emerged, taking into account inter-band correlation and offering improved mapping accuracy. Among these, the Normalized Difference Water Index (NDWI), initially proposed by McFeeters [14], served as the pioneering water index method. Since NDWI exhibited limitations in suppressing noise in built-up areas, the Modified NDWI (MNDWI) was proposed by Xu [15]. Many other water index methods [16,17] have been proposed over the past few decades. Nonetheless, these approaches necessitate the manual adjustment of thresholds and fall short of achieving satisfactory segmentation performance in complex geographical environments [18].
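For concreteness, the McFeeters NDWI is computed per pixel as (Green − NIR) / (Green + NIR) and thresholded (commonly at 0) to flag water, since water reflects green light strongly and absorbs near-infrared. The sketch below illustrates this with hypothetical reflectance values; the array contents are our own example data, not from any cited dataset.

```python
import numpy as np

def ndwi(green: np.ndarray, nir: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """McFeeters NDWI: (Green - NIR) / (Green + NIR), in [-1, 1]."""
    green = green.astype(np.float64)
    nir = nir.astype(np.float64)
    return (green - nir) / (green + nir + eps)

def water_mask(green: np.ndarray, nir: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Binary water mask from a manually chosen NDWI threshold."""
    return ndwi(green, nir) > threshold

# Hypothetical 2x2 reflectance patches: top row water-like, bottom row land-like.
green = np.array([[0.30, 0.28], [0.10, 0.12]])
nir   = np.array([[0.05, 0.06], [0.35, 0.40]])
mask = water_mask(green, nir)
```

The need to hand-pick `threshold` per scene is exactly the weakness of such index methods noted above.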
With the rapid progress of deep learning (DL) techniques and the emergence of massive remote sensing data, DL-based solutions have been widely implemented in remote sensing image interpretation. As an indispensable branch of DL, convolutional neural networks (CNNs) [19] have been widely employed in scene classification [20], semantic segmentation [21], and object detection [22]. The fully convolutional network (FCN) [23], an innovative breakthrough in the realm of semantic segmentation, enhances the performance of CNNs by replacing the final fully connected layers with convolutional layers, thereby removing the constraint imposed by the size of input images. However, the repeated pooling operations employed in an FCN tend to discard excessive detailed information, limiting overall performance. To tackle this issue, several optimizations have been proposed. UNet [24] involves shallow detailed information in the feature map recovery process via skip connections. Badrinarayanan et al. [25] proposed an encoder–decoder segmentation network (SegNet), which restores the resolution of feature maps through the max pooling indices during upsampling in the decoder. Zhao et al. [26] introduced the pyramid scene parsing network (PSPNet), which incorporates a pyramid pooling module to integrate contextual information into the segmentation process. Chen et al. [27] proposed the DeeplabV3+ model, which expands the receptive field via dilated convolutions and integrates multi-scale semantic information through the atrous spatial pyramid pooling (ASPP) module.
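As a brief aside on why dilation helps: a k × k convolution with dilation rate d has an effective kernel size of k + (k − 1)(d − 1), so stacked dilated convolutions enlarge the receptive field without extra parameters or pooling. The helper below (our own illustration, not code from the cited papers) computes this for stride-1 stacks.

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective kernel size of a k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

def receptive_field(layers) -> int:
    """Receptive field of stacked stride-1 (kernel, dilation) conv layers.
    Each layer extends the field by (effective_kernel - 1) pixels."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf

# Three stacked 3x3 convolutions with dilation rates 1, 2, 4
# (a common atrous pattern) cover a 15-pixel extent.
rf = receptive_field([(3, 1), (3, 2), (3, 4)])
```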
CNN-based models possess the inherent advantage of autonomously extracting discriminative and representative features. Consequently, considerable effort has been dedicated to water body extraction from RSIs. Miao et al. [28] proposed the RRFDeconvnet model, which integrates the advantages of deconvolution and residual units, along with a new loss function to mitigate boundary blurring; however, it cannot deal with noise interference. Based on an improved UNet, Feng et al. [29] adopted a fully connected conditional random field and regional restriction to retain the edge structures of water bodies and reduce salt-and-pepper noise. Wang et al. [30] achieved significant advancements in urban water body extraction by skillfully leveraging skip connections to aggregate lower-level information. Through a reasonable increase in network depth and the optimization of model training evaluation criteria, Qin et al. [31] proposed a novel framework specifically tailored to small water bodies.
Diverse distribution, shape and size variations, and complex scenarios significantly influence the extraction results of water bodies from RSIs. It is imperative to consider these factors comprehensively to achieve accurate and precise extraction. In this paper, a boundary-guided semantic context network (BGSNet) is proposed for water body extraction. BGSNet treats boundary and semantic context as two independent subtasks and then integrates them effectively. Three modules were embedded to emphasize boundaries and abstract semantics, namely, the boundary refinement (BR) module, semantic context fusion (SCF) module, and boundary-guided semantic context (BGS) module. The BR module integrates low-level detail features and highest-level semantic features to obtain semantic boundaries. Based on the channel attention mechanism, the SCF module gradually fuses high-level feature maps to capture semantic context. Finally, the BGS module leverages boundary information to guide the fusion of semantic context, promoting the dependence between the same semantic pixels to obtain more refined extraction results. In summary, the main contributions are as follows:
Based on the encoder–decoder architecture, BGSNet is proposed for extracting water bodies from RSIs. BGSNet first captures boundary features and abstract semantics, and then leverages the boundary features as a guide for semantic context aggregation.
To accurately locate water bodies, a boundary refinement (BR) module is proposed to preserve sufficient boundary distributions from shallow layer features. Additionally, a semantic context fusion (SCF) module is devised to capture semantic context for the generation of a coarse feature map.
To fully exploit the interdependence between the boundary and semantic context, a boundary-guided semantic context (BGS) module is designed. BGS aggregates context information along the boundaries to achieve the mutual enhancement of pixels belonging to the same class, thereby effectively improving intra-class consistency.
3. Method
This section first outlines the proposed framework of the boundary-guided semantic context network. The three modules are then introduced in detail.
3.1. Architecture of BGSNet
Boundary and semantic context information dominate the accuracy of mapping water bodies. However, focusing solely on one aspect without considering their interdependence may lead to suboptimal segmentation results. Therefore, a boundary-guided semantic context (BGS) network is proposed to process semantic context and boundary information separately in the decoder, thereby realizing their efficient integration. The overall architecture is illustrated in Figure 1.
Similar to most water body extraction models, our approach adopts the classic encoder–decoder structure. In the encoder stage, ResNet-50 is utilized as the backbone to capture features at different levels. The backbone produces five feature maps: the first two contain low-level detail features, while the remaining three contain high-level semantic information. In the decoder stage, three modules are designed to mine and leverage boundary and semantic context, namely, the boundary refinement (BR) module, semantic context fusion (SCF) module, and boundary-guided semantic context (BGS) module. These modules work together to fully exploit the boundary and semantic context of water bodies, yielding accurate segmentation results.
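For reference, a standard ResNet-50 backbone emits its five feature maps at strides 2, 4, 8, 16, and 32 relative to the input, with 64, 256, 512, 1024, and 2048 channels, respectively. The sketch below computes the resulting shapes for a given input size; it illustrates the standard backbone layout, not the exact tensor names used in BGSNet.

```python
# Standard ResNet-50 stage layout: (channels, stride relative to the input).
RESNET50_STAGES = [(64, 2), (256, 4), (512, 8), (1024, 16), (2048, 32)]

def backbone_shapes(height: int, width: int):
    """(C, H, W) shapes of the five ResNet-50 feature maps for one image."""
    return [(c, height // s, width // s) for c, s in RESNET50_STAGES]

# For a 256 x 256 input, the two shallow maps keep fine spatial detail
# (128x128 and 64x64), while the three deep maps are coarse but semantic.
shapes = backbone_shapes(256, 256)
```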
3.2. Boundary Refinement Module
The BR module is designed to preserve boundary distributions. With several convolutional layers, a CNN captures spatial details in its shallow layers while progressively increasing the receptive field to capture abstract semantic information. Due to their higher resolution, low-level feature maps preserve abundant spatial details, including intricate shape representations, distinct edge features, and fine-grained texture information. Therefore, the two low-level feature maps are leveraged for water body localization. However, RSIs often exhibit small inter-class variance, making shallow feature maps susceptible to noise interference; relying solely on the fusion of low-level feature maps may therefore fail to accurately segment boundaries against complex backgrounds. To address this challenge, the highest-level semantic features are also employed to generate semantic boundaries, thereby producing more differentiated feature maps.
Figure 2 illustrates the structure of the BR module. The BR receives two inputs and employs the multiplication operation to fuse them. The utilization of multiplication is advantageous as it facilitates the elimination of redundant information and the suppression of noise. Subsequently, the fused features pass through two 3 × 3 convolution layers with BN and ReLU to enhance their robustness and discriminative capability. The above process can be formulated as follows:
In this formulation, the operators involved are a 1 × 1 convolution, the ReLU function, and element-wise multiplication; the two operands are the input feature maps, and the result is the output of the BR module.
The semantic boundary can be obtained by fusing the two low-level feature maps with the highest-level semantic feature map.
In addition, since the resolution of the feature maps is inconsistent across network layers, they must be resampled to a uniform size to ensure compatibility before fusion. In the decoder stage, various upsampling methods can be employed, such as deconvolution, up-pooling, and interpolation algorithms. Among these alternatives, bilinear upsampling is effective and reduces computing requirements. Thus, in the decoder, bilinear upsampling is used to restore each high-level feature map to the required shape.
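A minimal numpy sketch of bilinear upsampling for a single-channel map is given below (align-corners convention; the actual decoder would operate on multi-channel tensors via a framework primitive):

```python
import numpy as np

def bilinear_upsample(x: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinear upsampling of a 2D map (align-corners convention),
    a lightweight stand-in for the decoder's resolution restoration."""
    in_h, in_w = x.shape
    # Source coordinates for every output pixel.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Interpolate horizontally on the two bracketing rows, then vertically.
    top = x[np.ix_(y0, x0)] * (1 - wx) + x[np.ix_(y0, x1)] * wx
    bot = x[np.ix_(y1, x0)] * (1 - wx) + x[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

coarse = np.array([[0.0, 1.0], [2.0, 3.0]])
fine = bilinear_upsample(coarse, 3, 3)  # 2x2 -> 3x3
```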
3.3. Semantic Context Fusion Module
Semantic context and global context are pivotal factors in achieving the accurate segmentation of water bodies. The three high-level feature maps are adept at capturing varied semantic information due to their different receptive fields, rendering them suitable for pixel classification. Hence, these feature maps are used to capture rich semantic context, and the SCF module is designed to fuse them. Since different channels correspond to different semantic information, the design of the SCF module is based on the channel attention mechanism.
As depicted in Figure 3, the fusion process involves two feature maps of different scales. These feature maps are first concatenated along the channel dimension. Subsequently, a 1 × 1 convolution with BN and ReLU is carried out to reduce the number of channels by half. The integration of channel attention facilitates the acquisition of crucial weights, so global average pooling followed by a 1 × 1 convolution and the Sigmoid function is performed to generate a weighted feature map, which can be calculated as follows:
In this formulation, the operators involved are a 1 × 1 convolution, concatenation (concat), global average pooling (avg), and the ReLU and Sigmoid functions; the two operands are the input feature maps.
In this way, the generated weight map can serve as a guide for the fusion of different semantic features through multiplication. This enables the automatic learning of semantic dependencies between feature map channels. This process can be expressed as follows:
where the operator involved is element-wise multiplication.
Finally, the SCF module hierarchically fuses the three high-level feature maps, leading to the generation of the final fused feature map.
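The channel-weighting mechanism described above can be sketched as follows: concatenate the two inputs along the channel axis, mix channels (here a fixed random projection stands in for the learned 1 × 1 convolution), squeeze the spatial dimensions with global average pooling, squash the pooled vector to per-channel weights with a Sigmoid, and rescale the fused features. This is an illustration of the mechanism only, not the trained SCF module.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def scf_fuse(f_a: np.ndarray, f_b: np.ndarray, seed: int = 0) -> np.ndarray:
    """Channel-attention fusion sketch for two (C, H, W) maps of equal size.
    A fixed random channel projection stands in for the learned 1x1 conv."""
    rng = np.random.default_rng(seed)
    fused = np.concatenate([f_a, f_b], axis=0)           # concat on channels
    w_proj = rng.standard_normal((fused.shape[0] // 2, fused.shape[0]))
    mixed = np.einsum('oc,chw->ohw', w_proj, fused)      # "1x1 conv": halve channels
    mixed = np.maximum(mixed, 0.0)                       # ReLU
    pooled = mixed.mean(axis=(1, 2))                     # global average pooling
    weights = sigmoid(pooled)[:, None, None]             # per-channel weights in (0, 1)
    return mixed * weights                               # re-weight channels

f_a = np.ones((4, 8, 8))
f_b = np.zeros((4, 8, 8))
out = scf_fuse(f_a, f_b)
```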
3.4. Boundary-Guided Semantic Context Module
The final output of the SCF module contains rich semantic context, from which an initial rough water feature map can be generated, while the final output of the BR module retains salient boundary information. These two outputs complement each other in describing water bodies; consequently, the key lies in aggregating them effectively.
The BGS module is specifically designed to fuse boundary and semantic context. By leveraging the intrinsic partitioning capability of boundaries, BGS employs the extracted semantic boundary as a guide to integrate the fused semantic features, thereby reinforcing intra-class consistency. The BGS module adopts a double-branch cross-fusion scheme, in which boundary details guide the feature response of the semantic context. Unlike simple compositions, this approach focuses on the hierarchical dependencies between the two branches. Global average pooling is also used in the semantic branch. As a result, pixels belonging to the same object exhibit a higher degree of activation in the corresponding attention areas, whereas pixels from different objects show fewer similarities in their activation patterns. The specific structure is shown in Figure 4.
In general, the multiplication operation serves to selectively emphasize boundary-related information, while the addition operation facilitates the complementary combination of two features. By cross-multiplying and adding, the fusion of two complementary features effectively captures the comprehensive information of an object. The process can be defined as:
In this formulation, the operators involved are a 3 × 3 convolution, bilinear upsampling (up), global average pooling (avg), the ReLU function, and element-wise multiplication; the two intermediate feature maps are produced by the two branches, and the result is the output of the BGS module.
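The cross-multiply-and-add pattern can be sketched in a few lines: each branch is scaled by the other branch's response, and the two gated maps are summed, so boundary activations emphasize the semantic context and vice versa. The sketch below is single-channel, omits the learned convolutions and upsampling, and uses hypothetical response maps; it illustrates only the fusion pattern, not the exact module.

```python
import numpy as np

def bgs_cross_fuse(boundary: np.ndarray, semantic: np.ndarray) -> np.ndarray:
    """Double-branch cross-fusion sketch: multiplication selectively emphasizes
    boundary-related responses, addition combines the complementary parts."""
    sem_ctx = semantic * semantic.mean()       # globally pooled context gate
    cross_a = boundary * sem_ctx               # boundary gated by semantic context
    cross_b = semantic * boundary              # semantics gated by boundary
    return np.maximum(cross_a + cross_b, 0.0)  # add the branches, ReLU

boundary = np.array([[0.0, 1.0], [1.0, 0.0]])  # hypothetical boundary response
semantic = np.array([[0.2, 0.9], [0.8, 0.1]])  # hypothetical water response
fused = bgs_cross_fuse(boundary, semantic)
```

Note how responses away from the boundary are suppressed, while responses on the boundary are mutually reinforced.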
5. Conclusions
In this paper, we propose a boundary-guided semantic context network (BGSNet) to accurately segment water bodies from RSIs. Striving to bridge boundary representations and semantic context, three specific modules are designed:
- (1)
The BR module is designed to obtain prominent boundary information, which is beneficial for localization.
- (2)
The SCF module is embedded to capture semantic context for generating a coarse feature map.
- (3)
The BGS module is devised to aggregate context information along the boundaries, facilitating the mutual enhancement of internal pixels belonging to the same class, thereby improving intra-class consistency.
Extensive experiments were conducted on the QTPL and the LoveDA datasets, demonstrating the superiority of the proposed method compared to existing mainstream methods.
With advances in aeronautics and space technology, a large number of detailed remote sensing images have been captured. Due to the influence of imaging conditions and water quality, the differences among water bodies have been progressively amplified. Furthermore, owing to silt along riverbanks and the blurred shadows cast by towering vegetation, water bodies increasingly resemble their surrounding environments. These factors pose further challenges for water body extraction. In the future, our endeavors will focus on model refinement to adapt to these demanding scenarios.