Article

DSA-SOLO: Double Split Attention SOLO for Side-Scan Sonar Target Segmentation

College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(18), 9365; https://doi.org/10.3390/app12189365
Submission received: 24 August 2022 / Revised: 9 September 2022 / Accepted: 13 September 2022 / Published: 19 September 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Side-scan sonar systems play an important role in tasks such as marine terrain exploration and underwater target identification. Target segmentation of side-scan sonar images is an effective method of underwater target detection. However, the operating principle of side-scan sonar leads to high noise interference, weak boundary information, and difficult target feature extraction in sonar images. To solve these problems, we propose Double Split Attention SOLO (DSA-SOLO). Specifically, we present an efficient attention module called DSA, which effectively fuses spatial attention and channel attention. DSA first splits the feature maps into two parts along the channel dimension and processes them in parallel. Next, DSA utilizes a C-S unit and an S-C unit to describe relevant features in the spatial and channel dimensions, respectively. The results of the two parts are then aggregated to improve feature representation. We embed the proposed DSA module between the backbone and the FPN of SOLOv2, which substantially improves instance segmentation accuracy. Experimental results show that our proposed DSA-SOLO achieves 78.4% mAP.5 on the SCTD dataset, which is 5.1% higher than SOLOv2.

1. Introduction

Side-scan sonar systems generate sonar images in different gray levels by recording the intensity of back-scattered acoustic waves from the seafloor. As the main sensor of underwater vehicles, side-scan sonar has wide coverage and high resolution, so it can not only map the topography of the seafloor but also image underwater objects such as shipwrecks and aircraft wrecks. However, due to ocean ambient noise, ship self-noise, and reverberation, sonar images contain a large amount of speckle noise (as shown in Figure 1). In addition, the center frequency of side-scan sonar is typically several hundred kilohertz or higher to ensure image resolution, but the absorption of acoustic energy by seawater increases roughly with the square of the frequency, so high-frequency acoustic waves lose a large amount of energy in seawater. As a result, sonar images often have uneven brightness and low contrast. All of these problems affect the accuracy of sonar target segmentation.
In recent decades, research on sonar image target segmentation has mainly involved clustering [1,2,3,4], level sets [5,6], and Markov random field (MRF) models [7,8]. Clustering-based segmentation builds a feature vector describing each pixel from its gray-level value and the statistics of its neighborhood, and clusters these feature vectors to obtain the segmentation result. Guo et al. [1] proposed a novel method combining quantum-inspired particle swarm optimization with fuzzy clustering. This method calculates the variance of the particle swarm fitness to update the particle positions, and then obtains the global optimum by initializing the K-means clustering centers with the updated positions. Huo et al. [2] used K-means clustering to obtain a pre-segmentation map to reduce complexity, and combined it with edge information and a region model to accelerate convergence. Steele et al. [3] proposed a spatially coherent K-means clustering method, which utilizes features extracted from the input images to cluster the information in feature space. The essence of level-set-based segmentation is to minimize an energy functional, derived with an active contour model, by solving the corresponding partial differential equation. Liu et al. [5] employed K-means clustering to obtain a pre-segmentation map, and then reinitialized the distance map of the initial contour to ensure parameter accuracy during the level-set evolution; finally, a robust variational level-set model is used to achieve segmentation. Imen et al. [6] proposed a novel region-based level-set method in which sonar textures are extracted by a series of filters and the similarity of these textures is calculated to obtain the final segmentation result. MRF-based segmentation models generate segmentation results by modeling the label and observation fields in a Bayesian framework according to criteria such as the maximum a posteriori probability. Wang et al. [7] introduced a method in which a sonar image is described by its gray values. In their method, C-means clustering first pre-segments the sonar image into three categories: target, shadow, and background. Next, a hierarchical MRF model segments the image into regions of interest (RoI), where targets and shadows are collectively referred to as RoI, and background regions. Finally, the RoI is re-segmented into two classes using gray-scale thresholding to achieve real-time segmentation of sonar images. Li et al. [8] proposed an active-contour-based model for sonar image segmentation in which pixel energy is first computed from each pixel and its neighbors to resist noise and object-boundary contamination, an MRF model within a Bayesian framework is then used to counteract intensity inhomogeneity, and finally a new energy function is applied to the level set. Although the above sonar segmentation methods can achieve good segmentation results, they have shortcomings: clustering-based methods need parameter tuning for different scenes to achieve good results, level-set methods are sensitive to noise, and MRF models are computationally expensive.
Nowadays, deep learning algorithms are widely employed in image segmentation, detection, and tracking, and a large number of excellent convolutional neural network (CNN) models, such as VGG-Net [9], GoogLeNet [10], and ResNet [11], have been proposed. Meanwhile, various image segmentation networks have also been proposed, such as FCN [12], U-Net [13], RefineNet [14], and DeepLabV3 [15]. U-Net is an excellent encoder–decoder framework that can achieve good segmentation accuracy with fewer training images. DeepLabV3 is built on ResNet and employs dilated convolutions with different rates to expand the receptive field; in addition, it adds atrous spatial pyramid pooling (ASPP) to the last block of ResNet, which improves the ability to extract multi-scale information. However, most of these networks are based on RGB images and are not well suited to other imaging systems. Considering the differences between sonar images and RGB images, many CNN-based methods have been proposed for sonar image segmentation. Sledge et al. [16] proposed an encoder–decoder network for sonar target detection and segmentation named MB-CEDN, in which an encoder extracts features of a sonar object and a dual decoder then performs pixel-level segmentation; each decoder learns the target features from a different perspective, and the obtained features are aggregated into a deep resolution network to refine the segmentation. Yu et al. [17] combined recurrent convolutional neural networks with residual convolutional neural networks and proposed a new module called R2CNN; the proposed R2CNN acquires sonar image features, and a self-guiding module improves the segmentation robustness and accuracy. Wang et al. [18] proposed a new convolutional model in which a depthwise-separable residual module extracts multi-scale features and an adaptive supervision model classifies the pixels; in addition, adaptive transfer learning is utilized in the training process, which enhances generalization ability and robustness. The above CNN-based sonar image segmentation methods are all semantic segmentation methods, which classify the image pixel by pixel to achieve segmentation.
In this paper, we formulate the sonar target segmentation task as instance segmentation. Instance segmentation focuses on individual targets and comprises two sub-tasks, semantic segmentation and object detection, which means that it obtains the object category as well as the object mask. Considering the weak boundary information, high noise interference, and poor texture features of sonar images (shown in Figure 1), segmentation networks need to focus on both spatial and channel features. To this end, drawing inspiration from ShuffleNetV2 [19] and CBAM [20], we designed a new attention mechanism named the Double Split Attention (DSA) module to enhance the performance of the network. DSA fuses the spatial and channel attention mechanisms and focuses the network on the boundary features and spatial location of the sonar target.
The main contributions of this paper are summarized as follows:
(1)
We proposed a novel model named DSA-SOLO for side-scan sonar image instance segmentation and experimentally demonstrated its effectiveness in segmenting targets in side-scan sonar images.
(2)
We proposed a DSA module which fuses spatial and channel attention to extract target features. This module improves segmentation accuracy with little impact on speed.
(3)
Experimental results comparing the proposed DSA-SOLO with existing instance segmentation methods on the SCTD [21] dataset show that DSA-SOLO achieves better performance.
This paper is organized as follows. In Section 2, we review the related works, including instance segmentation for sonar images and attention mechanisms. In Section 3, we introduce the detailed content of the proposed DSA-SOLO model. The experiments and results are reported in Section 4. Finally, we conclude our work in Section 5.

2. Literature Review

2.1. Instance Segmentation for Sonar Images

Instance segmentation is a crucial, intricate, and challenging task in machine vision research. It localizes object instances of different classes in diverse images and predicts their class labels and instance masks. Existing methods can be divided into two types: two-stage methods and one-stage methods. Two-stage methods are built on the two sub-tasks of detection and segmentation. Detection-based methods first generate a prior bounding box of the target region, within which semantic segmentation is then performed; for example, Mask R-CNN [22] is a classical detection-based method, which uses a fully convolutional network (FCN) to generate a mask for each prior bounding box. Segmentation-based methods first assign a label to each pixel and then cluster the pixels into different object instances; representative algorithms include SGN [23] and SSAP [24]. In recent years, some one-stage methods have been proposed, following one-stage target detection methods, such as YOLACT [25], Polar Mask [26], and SOLO [27]. Compared with other instance segmentation algorithms, SOLO adopts a fully convolutional, box-free, and grouping-free approach to directly output the instance mask and the corresponding class, which achieves a better balance between speed and accuracy.
All the above methods were proposed for RGB images. Considering the differences between RGB images and sonar images, many researchers have tried to adapt these methods to imaging sonar systems. Xu et al. [28] proposed a novel model named Active Mask-Box Scoring R-CNN, which uses a model header to balance the box IoU and the NMS score; in addition, a triplets-measure-based active learning (TBAL) method and a balanced-sampling method are used to improve the segmentation performance. Fan et al. [29] replaced the ResNet-50/101 backbone of Mask R-CNN with a 32-layer feature extraction network, which not only reduces the training parameters but also guarantees detection performance. Kessel [30] used speckle to automatically subdivide the image into regions of interest (ROI) and background regions. However, most recent sonar target segmentation methods based on instance segmentation adopt two-stage approaches, which are indirect and usually have low efficiency.

2.2. Attention Mechanisms

At present, attention mechanisms are an important strategy for improving network performance; their purpose is to enable the neural network to focus on salient regions. Spatial attention and channel attention are two basic attention mechanisms and have been widely employed in computer vision applications. Spatial attention aims to capture the relationships between pixels. The Spatial Transformer Network (STN) [31] is a typical spatial attention module; it uses nonlinear interpolation to learn an affine mapping between input and output, and optimizes the transformation parameters through network back-propagation so that the data are placed optimally in the spatial domain. Furthermore, the Dynamic Capacity Network (DCN) [32] adopts two sub-models, in which a coarse model processes the whole image and locates regions of interest (ROI) and a fine model refines the ROI; as a result, the DCN obtains excellent performance with low cost and high accuracy. In contrast, channel attention pays more attention to channel dependencies; for example, SENet [33] shows great performance and mainly contains two operations: squeeze, which establishes dependencies between channels, and excitation, which re-calibrates features.
Since spatial attention and channel attention improve CNN performance from different aspects, researchers usually fuse them to achieve better performance. Park et al. [34] designed the Bottleneck Attention Module (BAM). For an input feature map F, BAM calculates attention maps MS(F) and MC(F) separately by spatial attention and channel attention and resizes both maps to ℝ^{C×H×W}; the two attention maps are then combined element-wise, and the values of the resulting 3D attention map M(F) are normalized to the range 0 to 1 by a sigmoid function. Woo et al. [20] proposed the Convolutional Block Attention Module (CBAM), which connects channel attention and spatial attention in series. The channel attention utilizes a structure similar to SENet, except that a parallel max-pooling branch is added in CBAM. The spatial attention takes the output of the channel attention as input and utilizes global max pooling and global average pooling to obtain two feature maps of size H × W × 1; the two feature maps are then aggregated by a concatenation operation, and the final output is generated by a sigmoid function.
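As an illustration of the serial arrangement described above, the following is a minimal PyTorch sketch of a CBAM-style block. It is our own simplified rendering, not the code of [20]; the class names, the reduction ratio, and the initialization details are assumptions made for illustration only.
```python
import torch
import torch.nn as nn


class ChannelGate(nn.Module):
    """SENet-style channel attention with an added max-pooling branch, as described for CBAM."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                                 # x: (N, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))                # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))                 # global max pooling branch
        scale = torch.sigmoid(avg + mx)[..., None, None]  # (N, C, 1, 1) gate
        return x * scale


class SpatialGate(nn.Module):
    """Spatial attention: channel-wise avg/max maps are concatenated, convolved, and gated."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                 # (N, 1, H, W) average map
        mx = x.amax(dim=1, keepdim=True)                  # (N, 1, H, W) max map
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale


class CBAMBlock(nn.Module):
    """Channel attention followed by spatial attention, connected in series."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_gate = ChannelGate(channels, reduction)
        self.spatial_gate = SpatialGate()

    def forward(self, x):
        return self.spatial_gate(self.channel_gate(x))
```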

3. Methods

3.1. DSA-SOLO

As shown in Figure 2, DSA-SOLO takes a sonar image as input and passes it through ResNet, the double split attention (DSA) module, the feature pyramid network (FPN) [35], and the two prediction branches of SOLOv2. The proposed DSA-SOLO is built on SOLOv2 as the host network, and the proposed DSA module is embedded between ResNet and FPN to deal with the weak boundary features of sonar images. ResNet-18 is used as the backbone architecture for extracting the sonar image feature maps, and the DSA module improves its ability to extract the boundary information of the targets. To segment multi-scale targets more effectively, FPN is utilized. As shown in Table 1, layers whose output maps have the same size are grouped into the same stage. FPN obtains its feature maps by taking the last-layer output of each stage as a reference feature set. Specifically, the output of the last residual block in each stage of ResNet-18 acts as the input of FPN and is denoted as {C2, C3, C4, C5} for conv2, conv3, conv4, and conv5. Conv1 is not included because of its large memory footprint.
The outputs of the FPN are used as input to the two branches of SOLOv2 to generate the final predictions. As in SOLOv2, DSA-SOLO divides the input image into S × S grid cells. For each grid cell, the category branch outputs a C-dimensional vector indicating the probability of each semantic class the cell belongs to, where C is the number of classes; the output space of the category branch is therefore S × S × C. In parallel, the mask branch generates the corresponding instance mask of each grid cell. For an input sonar image I, there are at most S² predicted masks in total, so the output space of the mask branch is H × W × S², where H and W denote the height and width of the sonar image, respectively. The k-th channel corresponds to the instance located at grid cell (i, j), where k = i·S + j and i, j ∈ [0, S − 1]. The instance segmentation result of each grid cell is obtained by associating the category and mask predictions, as sketched below. Finally, non-maximum suppression (NMS) is applied to obtain the final segmentation results.
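The association between the two branches can be illustrated with a short sketch. The following Python routine is a simplified, hypothetical decoding step, not the SOLOv2 implementation; the function name and the score and mask thresholds are assumptions.
```python
import torch


def decode_solo_predictions(cate_pred, mask_pred, score_thr=0.3, mask_thr=0.5):
    """Associate category and mask predictions of an S x S grid.

    cate_pred: (S, S, C) tensor of class probabilities per grid cell.
    mask_pred: (S*S, H, W) tensor of soft masks; channel k = i*S + j belongs to cell (i, j).
    Returns a list of (class_id, score, binary_mask) candidates to be filtered by NMS.
    """
    S, _, C = cate_pred.shape
    candidates = []
    for i in range(S):
        for j in range(S):
            score, cls = cate_pred[i, j].max(dim=0)   # best class for this grid cell
            if score < score_thr:
                continue
            k = i * S + j                             # grid cell -> mask channel
            mask = mask_pred[k] > mask_thr            # binarize the soft mask
            if mask.any():
                candidates.append((int(cls), float(score), mask))
    return candidates                                 # NMS (Matrix NMS in SOLOv2) follows
```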

3.2. Double Split Attention (DSA) Module

As illustrated in Figure 3, for a given feature map X ∈ ℝ^{C×H×W}, where C, H, and W indicate the channel number, spatial height, and width, respectively, DSA first splits X into two parts, denoted {X1, X2} with X1, X2 ∈ ℝ^{C/2×H×W}, so that the two parts have an equal number of channels. The two parts then gradually extract the boundary information during training through the C-S unit and the S-C unit, respectively. The C-S and S-C units consist of the channel attention and spatial attention described in Section 3.2.1 and Section 3.2.2. Finally, the outputs of the C-S and S-C units are aggregated by a channel shuffle operator.

3.2.1. Channel Attention

Each feature map has n channels (generally n = 1024, 512, or 256), and the information at the same spatial index differs from channel to channel. Channel attention squeezes the spatial size of the input feature map to 1 × 1 in order to focus on 'what' is significant. In this paper, we first use global average pooling (GAP) to obtain channel-wise statistics. For an input I, the channel-wise statistic s can be calculated as follows:
$$ s = F_{gap}(I) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} I(i,j) \tag{1} $$
where s ∈ ℝ^{C/2×1×1}, I(i, j) represents the pixel of the input I at position (i, j), and H and W represent the height and width of I. Next, in order to enable accurate and adaptive selection, a compact feature is generated using a simple sigmoid activation. The final output of the channel attention can be calculated by:
$$ I' = \sigma(F_c(s)) \cdot I = \sigma(W_1 s + b_1) \cdot I \tag{2} $$
where σ and F_c represent the sigmoid activation and the scale transformation, respectively, and W_1 ∈ ℝ^{C/2×1×1} and b_1 ∈ ℝ^{C/2×1×1} are the parameters used to scale and shift s.
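A minimal PyTorch sketch of Equations (1) and (2) is given below; the module name ChannelUnit and the parameter initialization are our own assumptions rather than the authors' released code.
```python
import torch
import torch.nn as nn


class ChannelUnit(nn.Module):
    """Channel attention of Eqs. (1)-(2): GAP, per-channel scale/shift (F_c), sigmoid gate."""

    def __init__(self, channels: int):
        super().__init__()
        # W_1 and b_1 of Eq. (2): one scale and one shift per channel, shape (C/2, 1, 1).
        self.w1 = nn.Parameter(torch.ones(channels, 1, 1))
        self.b1 = nn.Parameter(torch.zeros(channels, 1, 1))

    def forward(self, x):                              # x: (N, C/2, H, W)
        s = x.mean(dim=(2, 3), keepdim=True)           # Eq. (1): global average pooling
        gate = torch.sigmoid(self.w1 * s + self.b1)    # Eq. (2): sigma(W_1 s + b_1)
        return x * gate
```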

3.2.2. Spatial Attention

In contrast to channel attention, spatial attention emphasizes 'where' the informative part is, which complements channel attention. In this paper, we utilize Group Norm (GN) [36] to obtain spatial-wise statistics. Feature normalization is a well-known method of accelerating training.
Unlike Batch Norm (BN) [37], Layer Norm (LN) [38], and Instance Norm (IN) [39], Group Norm considers that channels respond to factors such as frequency, shapes, illumination, and textures, and groups the channels accordingly. This family of feature normalization methods performs the following computation:
$$ \hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i) \tag{3} $$
where x_i is a feature computed by a layer and i is an index. For a 2D image, i = (i_N, i_C, i_H, i_W) indexes the features in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the spatial axes. Furthermore, μ_i and σ_i are the mean and standard deviation, which can be computed by:
$$ \mu_i = \frac{1}{m}\sum_{k \in S_i} x_k \tag{4} $$
$$ \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i} (x_k - \mu_i)^2 + \varepsilon} \tag{5} $$
where S_i is the set of pixels used to compute the mean and standard deviation, m is the size of this set, and ε is a small constant. Different feature normalization methods differ fundamentally in how the set S_i is defined (as shown in Figure 4). For GN, S_i is defined as:
$$ S_i = \{\, k \mid k_N = i_N,\; \lfloor k_C / (C/G) \rfloor = \lfloor i_C / (C/G) \rfloor \,\} \tag{6} $$
Here, G denotes the number of groups, C is the number of channels, and ⌊·⌋ is the floor operation. Equation (6) indicates that the indexes i and k belong to the same group of channels; in GN, μ and σ are computed along the (H, W) axes and along a group of C/G channels.
Similar to the channel attention, an F_c is used to enhance the result of GN, denoted X′. The final output of the spatial attention can be obtained by:
$$ X'' = \sigma(W_2 X' + b_2) \cdot X \tag{7} $$
where W_2, b_2 ∈ ℝ^{C/2×1×1} are the scaling and shifting parameters.
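A corresponding sketch of Equation (7), assuming PyTorch's nn.GroupNorm for GN and an assumed number of groups, is given below; it is an illustration rather than the authors' implementation.
```python
import torch
import torch.nn as nn


class SpatialUnit(nn.Module):
    """Spatial attention of Eq. (7): Group Norm, per-channel scale/shift (F_c), sigmoid gate."""

    def __init__(self, channels: int, groups: int = 16):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)        # Eqs. (3)-(6); groups must divide channels
        self.w2 = nn.Parameter(torch.ones(channels, 1, 1))
        self.b2 = nn.Parameter(torch.zeros(channels, 1, 1))

    def forward(self, x):                               # x: (N, C/2, H, W)
        gate = torch.sigmoid(self.w2 * self.gn(x) + self.b2)
        return x * gate                                 # Eq. (7)
```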
The channel attention module and the spatial attention module focus on 'what' and 'where' information, respectively. On this basis, the two modules can be fused in parallel or in series, as in [20]. Aiming at the weak boundary features of sonar images, we designed the DSA module, which combines the two attention modules sequentially in different orders in two parallel branches, whose results are aggregated by a channel shuffle operator. The DSA module is mainly realized by the C-S and S-C units. Specifically, for the two parts X_1, X_2 ∈ ℝ^{C/2×H×W} split from the feature map X, the C-S unit is a channel-first sequential arrangement, which puts channel attention in front of spatial attention. The output of the C-S unit can be computed by:
$$ X_1' = \sigma(F_c(F_{gap}(X_1))) \cdot X_1 \tag{8} $$
$$ X_1'' = \sigma(F_c(GN(X_1'))) \cdot X_1' \tag{9} $$
where F_c and F_gap are defined in Section 3.2.1. The S-C unit is a spatial-first sequential arrangement, and its output can be computed by:
$$ X_2' = \sigma(F_c(GN(X_2))) \cdot X_2 \tag{10} $$
$$ X_2'' = \sigma(F_c(F_{gap}(X_2'))) \cdot X_2' \tag{11} $$
where GN is Group Norm. The two outputs X_1'' and X_2'' are aggregated with a 'channel shuffle' operator to obtain cross-group information, similar to ShuffleNetV2 [19].
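Putting the pieces together, the following is a minimal PyTorch sketch of the DSA module as described above: channel split, a C-S branch, an S-C branch, and a channel shuffle. The class name, the number of GN groups, and the parameter initialization are assumptions; the published model may differ in detail.
```python
import torch
import torch.nn as nn


def channel_shuffle(x, groups: int = 2):
    """Interleave channels across groups to exchange information, as in ShuffleNetV2."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)


class DSA(nn.Module):
    """Split the input along channels, apply a C-S branch and an S-C branch, then shuffle."""

    def __init__(self, channels: int, gn_groups: int = 16):
        super().__init__()
        half = channels // 2                            # assumes an even channel count
        # Per-channel scale/shift parameters acting as F_c in the two branches.
        self.cw1 = nn.Parameter(torch.ones(half, 1, 1))
        self.cb1 = nn.Parameter(torch.zeros(half, 1, 1))
        self.sw1 = nn.Parameter(torch.ones(half, 1, 1))
        self.sb1 = nn.Parameter(torch.zeros(half, 1, 1))
        self.cw2 = nn.Parameter(torch.ones(half, 1, 1))
        self.cb2 = nn.Parameter(torch.zeros(half, 1, 1))
        self.sw2 = nn.Parameter(torch.ones(half, 1, 1))
        self.sb2 = nn.Parameter(torch.zeros(half, 1, 1))
        self.gn1 = nn.GroupNorm(gn_groups, half)        # gn_groups must divide half
        self.gn2 = nn.GroupNorm(gn_groups, half)

    def _channel(self, x, w, b):                        # Eqs. (1)-(2)
        s = x.mean(dim=(2, 3), keepdim=True)
        return x * torch.sigmoid(w * s + b)

    def _spatial(self, x, gn, w, b):                    # Eq. (7)
        return x * torch.sigmoid(w * gn(x) + b)

    def forward(self, x):                               # x: (N, C, H, W)
        x1, x2 = x.chunk(2, dim=1)                      # split along the channel dimension
        x1 = self._spatial(self._channel(x1, self.cw1, self.cb1), self.gn1, self.sw1, self.sb1)  # C-S unit, Eqs. (8)-(9)
        x2 = self._channel(self._spatial(x2, self.gn2, self.sw2, self.sb2), self.cw2, self.cb2)  # S-C unit, Eqs. (10)-(11)
        return channel_shuffle(torch.cat([x1, x2], dim=1), groups=2)
```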

3.3. Loss Function

The loss function used to train DSA-SOLO is similar to that of SOLOv2 and is defined as follows:
$$ L = L_{cate} + \lambda L_{mask} \tag{12} $$
where L_cate is the focal loss [40] for category classification and L_mask is the loss of mask prediction, which is calculated as follows:
$$ L_{mask} = \frac{1}{N_{pos}} \sum_{k} \mathbb{1}_{\{p^{*}_{i,j} > 0\}} \, d_{mask}(m_k, m^{*}_k) \tag{13} $$
where k, i, and j are related by k = i·S + j, N_pos is the number of positive samples, and p* and m* represent the category and mask of the targets, respectively. 1{p*_{i,j} > 0} is the indicator function, which equals 1 if p*_{i,j} > 0 and 0 otherwise. For d_mask, we adopt the Dice loss, defined as follows:
$$ L_{Dice} = 1 - \frac{2\sum_{x,y} \left( p_{x,y}\, q_{x,y} \right)}{\sum_{x,y} p_{x,y}^{2} + \sum_{x,y} q_{x,y}^{2}} \tag{14} $$
where p_{x,y} and q_{x,y} refer to the values of the predicted mask p and the ground-truth mask q at location (x, y).
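For reference, a minimal implementation of the Dice loss in Equation (14) might look as follows; the small constant eps added to the denominator for numerical stability is our own assumption.
```python
import torch


def dice_loss(pred_mask: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice loss of Eq. (14) between one predicted soft mask p and its ground-truth mask q."""
    p = pred_mask.flatten()
    q = gt_mask.flatten()
    intersection = (p * q).sum()
    denominator = (p * p).sum() + (q * q).sum() + eps   # eps guards against empty masks
    return 1.0 - 2.0 * intersection / denominator
```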

4. Experimental Results

4.1. Dataset

The dataset used in this paper was collected by Zhang et al. [21] and is named SCTD (sonar common target detection). It includes 497 side-scan sonar images containing a total of 514 targets. The sample numbers of SCTD are shown in Table 2.
Generally speaking, a dataset is split into three subsets: a training set, a validation set, and a test set. During training, the training set is used to train the model, the validation set is used to tune the training parameters, and the test set is used to evaluate the final model. However, SCTD is small compared with existing RGB datasets, so we only split it into two sets, training and validation, and used the entire dataset as the test set to evaluate the final model. In addition, we adopted a transfer learning strategy to cope with the small amount of data: the backbone of DSA-SOLO was pre-trained on an RGB dataset.
As SCTD was collected online, the images differ in size, so we cropped and scaled them. SOLOv2 requires the input image size to be a multiple of 32, so we set the sonar image size to 256 × 256. In addition, SCTD was originally a target detection dataset; to adapt it to instance segmentation in this paper, we annotated it with the Labelme application. All the algorithms mentioned in this paper were tested on SCTD.

4.2. Implementation Detail and Evaluation Indexes

The hardware environment for training and testing was as follows: an Intel Core i7-11800H CPU @ 2.30 GHz and an NVIDIA GeForce RTX 3060 Laptop GPU. All experiments were conducted in the same hardware environment. The software environment was Ubuntu 20.04 with Python 3.7 and the PyTorch 1.8.0 framework.
To analyze the performance of the proposed model fairly, we quantified the segmentation results using mAP (mean average precision) and FPS. The mAP is a quantitative measure that assesses the effectiveness of target segmentation by computing the area under the PR (precision–recall) curve. Precision is the ratio of correctly classified positive samples to all predicted positive samples, and recall is the ratio of correctly classified positive samples to all ground-truth positive samples. The two indexes are calculated as follows:
$$ P = \frac{TP}{TP + FP} \tag{15} $$
$$ R = \frac{TP}{TP + FN} \tag{16} $$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. The precision–recall curve is obtained with precision as the y-axis and recall as the x-axis. In this paper, we used the mAP calculation standard of COCO: the mAP averaged over 10 IoU thresholds from 0.5 to 0.95 in increments of 0.05 (mAP.5:.95), as well as the mAP at thresholds t = 0.5 (mAP.5) and t = 0.75 (mAP.75). Here, t is an IoU (Intersection over Union) threshold, where IoU is expressed as follows:
$$ IoU = \frac{S_{overlap}}{S_{union}} \tag{17} $$
where S_overlap is the overlapping area of the predicted mask and the ground-truth mask, and S_union is the area of their union. If the threshold is set to 0.5, a predicted mask is considered a positive sample when its IoU is greater than t.
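A minimal sketch of how mask IoU and the precision/recall of Equations (15)–(17) can be computed for binary masks is shown below; the function names and the handling of empty unions are assumptions for illustration.
```python
import numpy as np


def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU of Eq. (17) between two binary masks."""
    s_overlap = np.logical_and(pred_mask, gt_mask).sum()
    s_union = np.logical_or(pred_mask, gt_mask).sum()
    return float(s_overlap) / float(s_union) if s_union > 0 else 0.0


def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall of Eqs. (15)-(16) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```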
The FPS (frames per second) denotes the number of images processed per second and usually indicates the efficiency of the network.

4.3. Comparative Experiments

In order to demonstrate the superiority of DSA-SOLO, we performed quantitative and qualitative comparisons on the SCTD dataset against other instance segmentation methods, namely Mask R-CNN [22], YOLACT [25], Polar Mask [26], and SOLO [27].
Firstly, this paper compares the training losses and validation performance of DSA-SOLO, Mask R-CNN, YOLACT, Polar Mask, and SOLO. As shown in Figure 5, the training loss of DSA-SOLO is significantly lower than that of the other instance segmentation methods, and its training mAP is the highest. It can thus be seen that, on the SCTD dataset, our proposed DSA-SOLO has better feature extraction ability and better segmentation precision.
Secondly, to assess segmentation performance, this paper compares the above methods using the indexes mentioned in Section 4.2, i.e., mAP.5:.95, mAP.5, mAP.75, and FPS. The quantitative results are recorded in Table 3, from which we can see that DSA-SOLO achieves the best results in mAP.5:.95, mAP.5, and mAP.75. Specifically, for mAP.5, DSA-SOLO is 12.7% higher than YOLACT, 9.8% higher than Polar Mask, 6.6% higher than Mask R-CNN, and 5.1% higher than SOLOv2.
Figure 6 displays the segmentation results of the different methods. As shown in image (a), the results of Mask R-CNN, YOLACT, Polar Mask, and SOLOv2 are incorrect or incomplete, whereas our DSA-SOLO segments the target more accurately. In image (b), all methods except YOLACT segment the target accurately, and DSA-SOLO achieves the highest precision among the compared methods.

4.4. Ablation Experiments

In this subsection, we performed ablation experiments to demonstrate the contribution of the DSA module and of its two units, C-S and S-C. We compared the proposed DSA module with SENet, STN, CBAM, and DANet [41]. These attention modules were embedded between ResNet-18 and FPN, and the results are shown in Table 4. Figure 7 displays the segmentation results obtained by embedding the different attention modules.
In order to demonstrate the contribution of the C-S and S-C units, we constructed two variant structures: C-S Unit Only and S-C Unit Only. As shown in Figure 3, the C-S Unit Only structure uses the C-S unit to process X1, and the processed result is aggregated with the unprocessed X2 by the 'channel shuffle' operator. Conversely, the S-C Unit Only structure uses the S-C unit to process X2, and the result is aggregated with the unprocessed X1.
The results are shown in Table 5. Both the C-S and S-C units improve the accuracy of SOLOv2, which demonstrates the effectiveness of the two proposed units. Specifically, the C-S unit brings only a limited improvement in accuracy but increases the speed of SOLOv2, whereas the S-C unit improves the segmentation accuracy at some cost in speed. The segmentation accuracy improves most clearly when the two units are used together.

5. Conclusions

Aiming at the problems of high noise interference, weak boundary information, and difficult target feature extraction in sonar images, we proposed a lightweight network that effectively fuses channel attention and spatial attention for side-scan sonar target segmentation. The experimental results show that the proposed DSA-SOLO has strong side-scan sonar image instance segmentation capabilities. To deal with the weak boundary features of targets in side-scan sonar images, the network uses two attention units, C-S and S-C, to determine 'where' and 'what' is worthy of attention, thus improving the robustness of the network to noise. Although DSA-SOLO achieves better segmentation accuracy, the parameters introduced by the DSA module slightly reduce the segmentation speed. Further work will focus on reducing the number of model parameters and improving the segmentation speed while maintaining accuracy.

Author Contributions

Conceptualization, H.H. and Z.Z.; methodology, H.H.; software, H.H. and B.S.; validation, H.H., B.S. and J.Z.; formal analysis, H.H. and P.W.; data curation, H.H., J.Z. and P.W.; writing—original draft preparation, H.H.; writing—review and editing, Z.Z., P.W. and B.S.; project administration, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Hunan Province of China, grant number 2020JJ5672; National Natural Science Foundation of China, grant number 52101377; and Hunan Province Innovation Foundation for Postgraduate, grant number CX20210020.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available now due to deficient maintenance capacity.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Guo, Y.; Wei, L.; Xu, X. A sonar image segmentation algorithm based on quantum-inspired particle swarm optimization and fuzzy clustering. Neural Comput. Appl. 2018, 32, 16775–16782. [Google Scholar] [CrossRef]
  2. Huo, G.; Yang, S.; Li, Q.; Zhou, Y. A Robust and Fast Method for Sidescan Sonar Image Segmentation Using Nonlocal Despeckling and Active Contour Model. IEEE Trans. Cybern. 2017, 47, 855–872. [Google Scholar] [CrossRef] [PubMed]
  3. Steele, S.; Ejdrygiewicz, J.; Dillon, J. Automated Synthetic Aperture Sonar Image Segmentation using Spatially Coherent Clustering. In Proceedings of the OCEANS 2021: San Diego—Porto, San Diego, CA, USA, 20–23 September 2021. [Google Scholar]
  4. Chabane, A.N.; Islam, N.; Zerr, B. Incremental clustering of sonar images using self-organizing maps combined with fuzzy adaptive resonance theory. Ocean Eng. 2017, 142, 133–144. [Google Scholar] [CrossRef]
  5. Liu, Y.; Li, Q.; Huo, G. Robust and fast-converging level set method for side-scan sonar image segmentation. J. Electron. Imaging 2017, 26, 063021. [Google Scholar] [CrossRef]
  6. Imen, K.; Fablet, R.; Boucher, J.M.; Augustin, J.M. Region-based and incidence angle dependent segmentation of seabed sonar images using a level set approach combined to local texture statistics. In Proceedings of the OCEANS 2006—Asia Pacific, Singapore, 16–19 May 2006. [Google Scholar]
  7. Wang, L.; Ye, X.; Wang, G.; Wang, L. A Fast Hierarchical MRF Sonar Image Segmentation Algorithm. Int. J. Robot. Autom 2017, 32, 48–54. [Google Scholar] [CrossRef]
  8. Li, J.; Jiang, P.; Zhu, H. A Local Region-Based Level Set Method With Markov Random Field for Side-Scan Sonar Image Multi-Level Segmentation. IEEE Sens. J. 2021, 21, 510–519. [Google Scholar] [CrossRef]
  9. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  11. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. arXiv 2016, arXiv:1603.05027. [Google Scholar]
  23. Liu, S.; Jia, J.; Fidler, S.; Urtasun, R. SGN: Sequential Grouping Networks for Instance Segmentation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  24. Gao, N.; Shan, Y.; Wang, Y.; Zhao, X.; Huang, K. SSAP: Single-Shot Instance Segmentation With Affinity Pyramid. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 661–673. [Google Scholar] [CrossRef]
  25. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. arXiv 2019, arXiv:1904.02689v2. [Google Scholar]
  26. Xie, E.; Sun, P.; Song, X.; Wang, W.; Liang, D.; Shen, C.; Luo, P. PolarMask: Single Shot Instance Segmentation with Polar Representation. arXiv 2020, arXiv:1909.13226v4. [Google Scholar]
  27. Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. SOLO: A Simple Framework for Instance Segmentation. arXiv 2021, arXiv:2106.15974v1. [Google Scholar] [CrossRef]
  28. Xu, F.; Huang, H.; Wu, J.; Jiang, L. Active Mask-Box Scoring R-CNN for Sonar Image Instance Segmentation. Electronics 2022, 11, 2048. [Google Scholar] [CrossRef]
  29. Fan, Z.; Xia, W.; Liu, X.; Li, H. Detection and segmentation of underwater objects from forward-looking sonar based on a modified Mask RCNN. Signal Image Video Process. 2021, 15, 1135–1143. [Google Scholar] [CrossRef]
  30. Kessel, R.T. Using sonar speckle to identify regions of interest and for mine detection. Proc. Detect. Remediat. Technol. Mines Minelike Targets 2002, 4742, 440–451. [Google Scholar]
  31. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. arXiv 2015, arXiv:1506.02025. [Google Scholar]
  32. Almahairi, A.; Ballas, N.; Cooijmans, T.; Zheng, Y.; Larochelle, H.; Courville, A. Dynamic Capacity Networks. arXiv 2015, arXiv:1511.07838v7. [Google Scholar]
  33. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
  34. Park, J.; Woo, S.; Lee, J.; Kweon, I.S. BAM: Bottleneck Attention Module. arXiv 2018, arXiv:1807.06514v2. [Google Scholar]
  35. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. arXiv 2016, arXiv:1612.03144v2. [Google Scholar]
  36. Wu, Y.; He, K. Group Normalization. arXiv 2018, arXiv:1803.08494. [Google Scholar]
  37. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proc. Int. Conf. Mach. Learn. 2015, 37, 448–456. [Google Scholar]
  38. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  39. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance normalization: The missing ingredient for fast stylization. arXiv 2016, arXiv:1607.08022. [Google Scholar]
  40. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollar, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327. [Google Scholar]
  41. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. arXiv 2019, arXiv:1809.02983v4. [Google Scholar]
Figure 1. Several images in the dataset SCTD used in this paper. The principle of imaging sonar systems leads to high noise interference, weak boundary information, and difficult target feature extraction of sonar images.
Figure 2. The structure of DSA-SOLO. The proposed DSA module is utilized and embedded between the ResNet-18 and FPN. We show an example when S = 5.
Figure 3. The structure of the DSA module, where Fgap and GN denote global averaging pooling and group norm mentioned in the following sections. Fc refers to scale transformation, and σ(·) denotes the sigmoid activation.
Figure 4. Several types of normalization methods. Each subplot shows a feature map tensor, with N as the batch axis, C as the channel axis, and (H, W) as the spatial axes. The pixels in blue are normalized by the same mean and variance parameters.
Figure 5. Comparison of segmentation scores, i.e., training loss, and mAP.5 between Mask R-CNN, YOLACT, Polar Mask, SOLO, and DSA-SOLO during the training phase. We set the training epoch as 100.
Figure 6. The segmentation results of Mask R-CNN, YOLACT, Polar Mask, SOLOv2, and DSA-SOLO. The red box shows the instance category and the segmentation precision, which is enlarged in the upper left corner of the images. (a,b) shows the results of two different targets.
Figure 7. The segmentation results of embedding STN, DANet, SENet, CBAM, and DSA modules in SOLOV2. The red box shows the instance category and the segmentation precision which is enlarged in the upper left corner of the images.
Table 1. ResNet-18, the backbone of DSA-SOLO. An input image with the size of 224 × 224 is taken as an example.
Stage | Output Size | Backbone
Conv1 | 112 × 112 | 7 × 7 conv, 64
Conv2 | 56 × 56 | 3 × 3 max pooling; [3 × 3, 64; 3 × 3, 64] × 2
Conv3 | 28 × 28 | [3 × 3, 128; 3 × 3, 128] × 2
Conv4 | 14 × 14 | [3 × 3, 256; 3 × 3, 256] × 2
Conv5 | 7 × 7 | [3 × 3, 512; 3 × 3, 512] × 2
Table 2. Sample numbers of SCTD.
Label | Train | Val
ship | 295 | 83
plane | 72 | 18
body | 37 | 9
Table 3. Quantitative Results of Different Segmentation Models. The bold numbers represent the highest one of each indicator.
Model | mAP.5:.95 | mAP.5 | mAP.75 | FPS
Mask R-CNN | 37.8% | 71.8% | 31.6% | 7.9
YOLACT | 33.6% | 65.7% | 15.3% | 13.13
Polar Mask | 34.5% | 68.6% | 17.6% | 11.97
SOLOv2 | 40.0% | 73.3% | 35.8% | 18.64
DSA-SOLO | 42.7% | 78.4% | 43.2% | 18.14
Table 4. Quantitative Results of Embedding Different Attention Modules. The bold numbers represent the highest one of each indicator.
Attention Module | mAP.5:.95 | mAP.5 | mAP.75 | FPS
SENet | 37.8% | 72.6% | 31.8% | 18.89
STN | 36.9% | 73.5% | 32.8% | 18.35
CBAM | 39.2% | 74.3% | 39.8% | 19.04
DANet | 38.1% | 73.8% | 31.6% | 19.02
DSA | 42.7% | 78.4% | 43.2% | 18.14
Table 5. Quantitative Results of Ablation Experiments. The bold numbers represent the highest one of each indicator.
Model | mAP.5:.95 | mAP.5 | mAP.75 | FPS
SOLOv2 | 40.0% | 73.3% | 35.8% | 18.64
C-S Unit Only | 40.2% | 75.2% | 35.7% | 19.02
S-C Unit Only | 41.9% | 75.9% | 44.3% | 18.39
DSA | 42.7% | 78.4% | 43.2% | 18.14
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
