Article

GETNet: Group Normalization Shuffle and Enhanced Channel Self-Attention Network Based on VT-UNet for Brain Tumor Segmentation

by Bin Guo 1,2, Ning Cao 1,*, Ruihao Zhang 2 and Peng Yang 2

1 College of Information Science and Engineering, Hohai University, Nanjing 210098, China
2 College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China
* Author to whom correspondence should be addressed.

Diagnostics 2024, 14(12), 1257; https://doi.org/10.3390/diagnostics14121257
Submission received: 23 April 2024 / Revised: 8 June 2024 / Accepted: 11 June 2024 / Published: 14 June 2024

Abstract
Currently, brain tumors are extremely harmful and prevalent. Deep learning technologies, including CNNs, UNet, and Transformer, have been applied in brain tumor segmentation for many years and have achieved some success. However, traditional CNNs and UNet capture insufficient global information, and Transformer cannot provide sufficient local information. Fusing the global information from Transformer with the local information of convolutions is an important step toward improving brain tumor segmentation. We propose the Group Normalization Shuffle and Enhanced Channel Self-Attention Network (GETNet), a network combining the pure Transformer structure with convolution operations based on VT-UNet, which considers both global and local information. The network includes the proposed group normalization shuffle block (GNS) and enhanced channel self-attention block (ECSA). The GNS is used after the VT Encoder Block and before the downsampling block to improve information extraction. An ECSA module is added to the bottleneck layer to utilize the characteristics of the detailed features in the bottom layer effectively. We also conducted experiments on the BraTS2021 dataset to demonstrate the performance of our network. The Dice coefficient (Dice) score results show that the values for the regions of the whole tumor (WT), tumor core (TC), and enhancing tumor (ET) were 91.77, 86.03, and 83.64, respectively. The results show that the proposed model achieves state-of-the-art performance compared with more than eleven benchmarks.

1. Introduction

A brain tumor may cause symptoms such as headache, dizziness, nausea and vomiting, lethargy, weakness, and optic disc edema [1]. If the tumor is large and compresses the optic nerve, decreased and blurred vision may occur [2]. Generally, tumors in the brain are more serious than those in other parts of the body. Both benign and malignant tumors can continue to grow, increasing intracranial pressure [3]. Additionally, they can compress brain tissue and impair brain function, which amplifies their impact. Treatment plans for brain tumors are generally developed through medical imaging. Common medical imaging methods include X-ray imaging, computed tomography (CT), and magnetic resonance imaging (MRI) [4]. Compared with CT, MRI can obtain tomographic images in any direction, which helps display the anatomical relationships between tissue structures, clarify the origin and scope of lesions, and accurately diagnose the disease. MRI is also safer because it avoids the radiation damage that can be caused by traditional imaging methods such as CT and X-ray [5]. MRI is widely used in clinical practice because it does not use radiation and produces high-resolution soft tissue and multisequence imaging. Although MRI is very helpful in the treatment of brain tumors, segmentation by hand is subject to human error, and evaluations vary among radiologists, leading to inconsistent results.
Deep learning can automatically learn useful features from many medical images [6], such as brain tumor shape, size, and boundary information, which are important in brain tumor segmentation and disease analysis. Convolutional neural network (CNN) models based on deep learning have advantages in image processing [7]. Convolution and pooling layers are used in CNNs: the convolution layer is responsible for extracting image features, and the pooling layer greatly reduces the number of dimensions. Unlike CNNs, which are mainly used for feature extraction, classification, and the regression of input images, a fully convolutional network (FCN) adopts a full convolution structure and can adapt to variable-size input images. Thus, FCNs are widely used in image segmentation tasks. UNet is an excellent model that performs well in medical image segmentation. However, MRI images generally contain depth information, which is not fully utilized by the traditional 2D UNet; thus, 3D UNet was developed and has been popular for many years. Many researchers have developed excellent methods based on 3D UNet to complete segmentation tasks in the case of brain tumors. One weakness of 3D UNet is its limited ability to extract long-distance information, despite many attempts, such as atrous spatial pyramid pooling (ASPP), to expand the receptive field. Transformer deep learning methods have achieved good results in natural language processing (NLP) and have been introduced into image processing to address this challenge. Transformer-based models such as the Pyramid Vision Transformer (PVT), Swin Transformer, and Volumetric Transformer (VT-UNet) can capture the dependencies between different features through a self-attention mechanism, especially long-distance dependencies, which provides obvious advantages in image processing [8]. One of the challenges with the Transformer is its limited focus on local details. We propose GETNet to improve the segmentation of brain tumor images and address the challenge of fusing global features and local details.
In this paper, we focus on integrating modules that extract local features with blocks that capture long-distance relationships. The main contributions of our work are as follows:
  • We propose GETNet, a new network for brain tumor segmentation that combines 3D convolution with VT-UNet to comprehensively capture delicate local information and global semantic information and improve brain tumor segmentation performance.
  • We develop a GNS block between the VT Encoder Block and the downsampling module to enable the Transformer architecture to obtain local information effectively.
  • We design an ECSA block in the bottleneck layer to enhance the model's extraction of detailed features.
This paper is organized as follows: related work is described in Section 2. The materials and methods are presented in Section 3. In Section 4, comparison results and ablation experiments are presented and analyzed. Finally, Section 5 concludes the paper.

2. Related Work

2.1. Deep-Learning-Based Methods for Medical Image Segmentation

In recent years, research on image analysis and segmentation has made breakthroughs with the proposal of deep learning methods, as represented by CNNs [9]. CNNs can learn representative image features by continuously iterating model parameters and then constructing a model for subsequent segmentation tasks. The block-based CNN method uses images as the network input and adopts image classification to replace pixel classification in the image. Because traditional CNNs use a sliding window frame based on images for image segmentation, the overlap between adjacent image blocks leads to repeated convolution calculations during training, increasing calculation time and reducing efficiency. Long et al. [10] proposed using an FCN to classify images at the pixel level to solve the problem of image segmentation at the semantic level and achieved good results. An FCN can classify images at the pixel level and has no limit in terms of input image size; it can reduce the number of computations and improve the efficiency of segmentation compared with traditional CNNs. However, this approach lacks spatial consistency and does not fully use contextual information. In 2015, Ronneberger et al. [11] proposed UNet, a U-shaped CNN for medical image segmentation, to address this challenge. Özgün Çiçek et al. [12] proposed 3D UNet, which extends the previous UNet by replacing all 2D operations with 3D operations. Applying three-dimensional depth information is helpful for improving the performance of brain tumor segmentation. Recently, many network variants have been proposed due to the success of 3D UNet. Many of these networks attempt to expand the receptive field to extract global features. DeepLabv1 was proposed by Chen et al. [13] to ensure that feature resolution is not reduced and that the network has a larger receptive field. DeepLabv2, DeepLabv3, and DeepLabv3+ were subsequently developed by Chen et al. [14,15,16]. Chen et al. [17] designed DMFNet to construct multiscale feature representations via 3D dilated convolutions. Xu et al. [18] proposed a network to capture multiscale information using 3D atrous spatial pyramid pooling (ASPP). Jiang et al. [19] developed AIU-Net with the ASPP module to expand the receptive field and increase the width and depth of the network. Parvez Ahmad et al. [20] designed RD2A 3D UNet to preserve more contextual information at small scales. A multiscale feature extraction module was developed by Wang et al. [21] to extract more receptive fields and improve the ability to capture features with different scales. The E1D3 network was introduced by Syed Talha Bukhari et al. [22] to perform effective multiclass segmentation; it is a one-encoder, three-decoder fully convolutional architecture in which each decoder segments one of the hierarchical regions of interest (WT, TC, and ET). Parvez Ahmad et al. [23] suggested that multiscale features are very important in MS UNet. Wu et al. [24] proposed SDS-Net to enhance segmentation performance. Chen et al. [25] designed a local space with detailed feature information to increase the detailed feature awareness of voxels between adjacent dimensions. Mona Kharaji et al. [26] incorporated residual blocks and attention gates to capture emphasized informative regions. Regarding the BraTS dataset, the final segmentation results are divided into three parts: whole tumor (WT), tumor core (TC), and enhancing tumor (ET).
These regions have an inclusive relationship: WT encompasses both TC and ET, and ET is included within TC. Using multiscale receptive fields to extract features for the three regions has advantages over using only a single receptive field. Although considerable research has addressed capturing contextual information and expanding the receptive field, the effective fusion of local feature information with long-distance relationships between features is also crucial for the multi-subregion segmentation of brain tumors (WT, TC, and ET).

2.2. Attention-Based Module for Medical Image Segmentation

An attention mechanism is a weighted transformation of the target data. It is widely used in clustering learning, reinforcement learning, image processing, and speech recognition. Deep-learning-based attention mechanisms imitate the human visual system by automatically focusing on the visual areas that require attention, which can improve the effectiveness of related learning tasks. Both spatial attention [27] and channel attention mechanisms can be used to recalibrate the characteristic information of the input data. They generally utilize a global pooling operation to obtain richer global information. One of the differences is that the channel attention mechanism performs global pooling layer by layer along the channel direction, whereas the spatial attention mechanism focuses on the feature information at different locations. Recently, many researchers have focused on multiscale and contextual information. Zhou et al. [28] designed attention mechanisms for learning contextual and attentive information. Zhang et al. [29] constructed SMTFNet to aggregate global feature information. Zhao et al. [30] developed MSEF-Net, which adopts a multiscale fusion module. Liu et al. [31] proposed MSMV-Net while considering the strengths of multiscale feature extraction. Wang et al. [32] proposed a multiscale contextual block to focus on spatial information at different scales. Self-attention [33] can establish global dependencies and expand the receptive field of an image, which is the foundation of Transformer methods. The above approaches are all based on convolution. Local convolutional operations are limited by the size of the convolutional kernel, which results in a weaker perception of global features. Transformers, through self-attention, can capture dependencies at various positions; their receptive field for global features is relatively large, increasing the richness of global information and allowing more information to be captured from medium-to-large targets.

2.3. The Transformer-Based Module for Medical Image Segmentation

The excellent performance of the Transformer in natural language processing tasks fully demonstrates its effectiveness. The breakthrough of Transformer networks in NLP has stimulated interest in applying them to computer vision tasks. Alexey Dosovitskiy et al. [34] proposed the Vision Transformer (ViT) to capture long-range dependencies in images through a global attention mechanism, a milestone in the application of Transformers to computer vision. Wang et al. [35] developed the Pyramid Vision Transformer (PVT) to generate multiscale feature maps for intensive prediction tasks. Liu et al. [36] presented the Swin Transformer, which captures global feature information via self-attention. Many researchers have investigated pure Transformers, such as the Volumetric Transformer Net (VT-UNet) [37]. Their advantage is that the encoder benefits from the self-attention mechanism by encoding local and global features simultaneously, while the decoder uses parallel self-attention and cross-attention to capture fine details for boundary refinement. Ali Hatamizadeh et al. [38] utilized a Transformer as an encoder to learn sequence representations of the input volume and effectively capture global multiscale information. However, one of the disadvantages of pure Transformers is that they focus only on global contextual information and pay less attention to local details. As stated earlier, in the BraTS dataset, the final segmentation results are divided into three parts: whole tumor (WT), tumor core (TC), and enhancing tumor (ET). In brain tumor segmentation, the three regions, as well as boundary information, global information, and local information, must be used together to enhance the final segmentation results. Recently, considerable research on brain tumor segmentation based on the fusion of Transformers and CNNs has been conducted. Jia et al. [39] proposed BiTr-UNet, a combined CNN–Transformer network, which achieved good performance on the BraTS2021 validation dataset. TransBTS was developed by Wang et al. [40] to capture local 3D contextual information. Cai et al. [41] reported that Swin UNet3D can adequately learn both global and local dependency information in all layers of an image. Fu et al. [42] proposed HmsU-Net, a hybrid multiscale UNet based on the combination of a CNN and Transformer for medical image segmentation. Ao et al. [43] developed an effective combined Transformer–CNN network using multiscale feature learning. Ilyasse Aboussaleh et al. [44] designed 3DUV-NetR+ to capture more contextual information. Recently, hybrid architectures of CNNs and Transformers have become research hotspots. The further development of this research is very beneficial for improving performance in brain tumor segmentation.

3. Materials and Methods

3.1. Datasets and Preprocessing

The brain tumor segmentation challenge (BraTS) dataset [45,46] is a public medical image dataset used to research and develop brain tumor segmentation algorithms. The BraTS dataset integrates four MRI modalities: T1-weighted (T1), T2-weighted (T2), T1-enhanced contrast (T1ce), and fluid-attenuated inversion recovery (FLAIR). The BraTS2021 [47] dataset, consisting of data from 1251 patients for training and 219 patients for validation, is popular among researchers. All 1251 training cases contain ground truths labeled by board-certified neuroradiologists, while the ground truths for the 219 validation cases are hidden from the public; results for them can be obtained only via online validation. Our training strategy used 80% and 20% of the BraTS2021 training data for training and validation, respectively. In addition, we uploaded our prediction results to the official BraTS platform (https://www.synapse.org/#) (accessed on 12 June 2024) for model evaluation.
To enable our network to segment brain tumor images properly, we first read the BraTS2021 dataset into our program in the preprocessing stage. After processing with SimpleITK and MONAI, we used the Z-score method to standardize each image. Next, we reduced the background as much as possible while ensuring that the whole brain was included and randomly re-cropped the images to a fixed patch size of 128 × 128 × 128. All intensity values were clipped to the 1st and 99th percentiles of the non-zero voxel distribution of the volume. For data augmentation, we used rotations between −30° and 30°, additive Gaussian noise drawn from a centered normal distribution with a standard deviation of 0.1, blurring between 0.5 and 1, and gamma transformations with values between 0.7 and 1.5. The procedural flowchart of the proposed GETNet is depicted in Figure 1.
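As a concrete illustration of the clipping and standardization steps above, the following minimal NumPy sketch processes a single MRI modality; the function name and the exact ordering of operations are illustrative assumptions, since the actual pipeline relies on SimpleITK and MONAI.

```python
import numpy as np

def normalize_modality(volume: np.ndarray) -> np.ndarray:
    """Illustrative sketch: clip intensities to the 1st/99th percentiles of the
    non-zero voxels and apply Z-score standardization to one MRI modality
    of shape (D, H, W). Details may differ from the actual pipeline."""
    nonzero = volume[volume > 0]
    lo, hi = np.percentile(nonzero, [1, 99])   # 1st and 99th percentiles
    clipped = np.clip(volume, lo, hi)
    mean, std = nonzero.mean(), nonzero.std()
    return (clipped - mean) / (std + 1e-8)     # Z-score standardization
```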

3.2. Implementation Details

Our network was constructed using Python 3.8.10 and PyTorch 1.11.0. A single NVIDIA RTX A5000 with 24 GB of memory and an AMD EPYC 7551P CPU were used during training. Table 1 shows that the initial learning rate was 1.00 × 10−4, with a batch size of 1. CUDA version cu113 was used. During training, Adam [48] was used to optimize our network. Unlike approaches with hybrid losses, our network was trained with only the ordinary soft Dice loss [49]. The input and output sizes were both 128 × 128 × 128.
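For reference, a minimal PyTorch sketch of a soft Dice loss of the kind referred to above is given below; the smoothing constant and the assumption that `pred` already holds per-class probabilities are our own choices, not details taken from the paper.

```python
import torch

def soft_dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Soft Dice loss for 3D segmentation.
    pred, target: (N, C, D, H, W); pred holds probabilities (e.g., after softmax/sigmoid)."""
    dims = (2, 3, 4)                                   # sum over the spatial axes
    intersection = (pred * target).sum(dims)
    denom = pred.sum(dims) + target.sum(dims)
    dice = (2 * intersection + eps) / (denom + eps)    # per-sample, per-class Dice
    return 1 - dice.mean()                             # loss = 1 - mean Dice
```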

3.3. Evaluation Metrics

Quantitative and qualitative analyses were carried out using evaluation metrics, including the Dice similarity coefficient (Dice) score [50], the Hausdorff distance (HD) [51,52], sensitivity, and specificity.
Dice is a measure of the similarity between two sets. In image segmentation, it measures the similarity between the results predicted through network segmentation and the manual masks, and it can be represented as follows:
$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN}$$
where TP, FP, and FN represent true positive cases, false positive cases, and false negative cases, respectively.
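As a sketch, the Dice score can be computed for a pair of binary masks as follows; this is a straightforward NumPy translation of the formula above, not the evaluation code used by the challenge platform.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2*TP / (2*TP + FP + FN) for two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 1.0   # empty prediction and mask count as a match
```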
HD represents the maximum distance between the predicted and real region boundaries. The smaller the value is, the smaller the predicted boundary segmentation error and the better the quality. HD95 is similar to the maximum HD, but it is calculated based on the 95th percentile of distances between boundary points in t and p. The purpose of using this measure is to mitigate the impact of a very small subset of outliers. HD can be represented as follows:
$$\mathrm{HD}(P, T) = \max\left\{\, \sup_{t \in T} \inf_{p \in P} d(t, p),\ \sup_{p \in P} \inf_{t \in T} d(t, p) \,\right\}$$
where T and P represent the real region boundary and the predicted segmentation region boundary, respectively, with boundary points t and p. d(·) represents the distance between t and p. sup and inf denote the supremum and infimum, respectively.
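A simplistic sketch of HD95 over two sets of boundary-point coordinates is shown below; practical evaluation tools usually extract surfaces and use distance transforms rather than the brute-force pairwise distances assumed here.

```python
import numpy as np
from scipy.spatial.distance import cdist

def hd95(boundary_p: np.ndarray, boundary_t: np.ndarray) -> float:
    """95th-percentile Hausdorff distance between prediction boundary points P and
    ground-truth boundary points T, each given as an (n, 3) array of voxel coordinates."""
    d = cdist(boundary_p, boundary_t)      # pairwise distances between P and T
    d_p_to_t = d.min(axis=1)               # inf over T for each p
    d_t_to_p = d.min(axis=0)               # inf over P for each t
    return max(np.percentile(d_p_to_t, 95), np.percentile(d_t_to_p, 95))
```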
Sensitivity, also referred to as the true positive rate, quantifies the probability of correctly detecting all positive cases. It can be represented as follows:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
where TP and FN represent true positive cases and false negative cases, respectively. A higher sensitivity corresponds to a smaller discrepancy between glioma segmentation and the ground truth.
Specificity represents the true negative rate, which reflects the probability of correctly detecting negative cases. It can be represented as follows:
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
where TN and FP represent true negative cases and false positive cases, respectively. The higher the specificity is, the smaller the difference between the segmentation and ground truth for normal tissue.
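Both rates follow directly from the voxel-wise confusion counts, as in the small illustrative sketch below.

```python
import numpy as np

def sensitivity_specificity(pred: np.ndarray, gt: np.ndarray):
    """Voxel-wise sensitivity (true positive rate) and specificity (true negative rate)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 1.0
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 1.0
    return sensitivity, specificity
```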

3.4. Methodology

3.4.1. Network Architecture

The effective integration of local features and global relationships is very helpful for improving the performance of brain tumor segmentation tasks. As shown in Figure 2, our network is a U-shaped architecture based on a Transformer with convolution operations. The encoder branch is on the left, the bottleneck layer is located at the bottom, and the decoder is on the right of the architecture. The encoder incorporates a 3D Patch Partition Block, Linear Embedding Block, VT Encoder Block, GNS Block, and 3D Patch-Merging Block. The 3D Patch Partition Block cuts the brain tumor images into nonoverlapping patches, and the Linear Embedding block maps the tokens to a vector dimension equal to the number of channels. The 3D Patch-Merging Block reduces the size of the image by half and doubles the number of channels, similarly to the process of pooling or convolution with a stride of 2 in a CNN. This operation is akin to downsampling and increasing the feature depth, contributing to the overall efficiency and effectiveness of the network architecture. After the 3D Patch-Merging operation, VT-UNet only changes the height and width, while the depth remains unchanged. To reduce the image dimensions and floating-point operations per second (FLOPs) and to prevent overfitting, changes were made to the height, width, and depth of our model.
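To make the 3D Patch-Merging operation concrete, the sketch below merges 2 × 2 × 2 neighbouring tokens, halving D, H, and W while doubling the channel dimension. The (B, D, H, W, C) layout and the LayerNorm plus linear reduction are Swin-style assumptions on our part rather than the exact GETNet implementation.

```python
import torch
import torch.nn as nn

class PatchMerging3D(nn.Module):
    """Sketch of a 3D patch-merging step: 2x2x2 neighbours are concatenated and
    linearly reduced, halving each spatial size and doubling the channels."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(8 * dim)
        self.reduction = nn.Linear(8 * dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, D, H, W, C), D/H/W even
        B, D, H, W, C = x.shape
        x = x.view(B, D // 2, 2, H // 2, 2, W // 2, 2, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(B, D // 2, H // 2, W // 2, 8 * C)
        return self.reduction(self.norm(x))                # (B, D/2, H/2, W/2, 2C)
```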
The GNS Block addresses the issue of insufficient local features in feature extraction. A hierarchical representation is constructed by the VT Encoder Block from small patches, which are gradually merged with neighboring patches as the Transformer layers deepen to capture better features. The two modules in one decoder layer are the 3D Patch Expanding Block and the VT Decoder Block. Here, 3D Patch Expansion reshapes the image size along the spatial axis, doubling the image size and reducing the number of channels by half. The VT Decoder Block integrates high-resolution information from the encoder and low-resolution information from the decoder to recover features lost during downsampling and improve segmentation accuracy.
Notably, the VT Encoder Block, VT Decoder Block, and ECSA Block are used twice. The VT Decoder Block combines a self-attention block and cross-attention to improve the prediction quality. The SC Block is similar to the UNet skip connection, which establishes a bridge for information transmission between the encoding layer and the corresponding decoding layer. Specifically, the values of both K and V generated by the window-based multi-head self-attention (W-MSA) of the VT Encoder Block are passed to the W-MSA of the VT Decoder Block. Similarly, the shifted window-based multi-head self-attention (SW-MSA) of the VT Encoder Block delivers K′ and V′ to the SW-MSA of the VT Decoder Block in the same way. The bottleneck layer has two modules: the 3D Patch-Expanding Block, whose function is the same as that of the 3D Patch Expanding of the decoder, and the ECSA block, which can capture detailed features with long-distance relationships of the bottom layer.
The VT Encoder Block and VT Decoder Block employ window-based attention layers to capture important feature information and long-distance dependencies between tokens. The W-MSA and SW-MSA attention in the VT Encoder Block and VT Decoder Block utilize the tokens within each window to aid representation learning. In W-MSA, we uniformly divide the volume into smaller nonoverlapping windows. Tokens in adjacent windows of W-MSA cannot see each other. In contrast, they can see each other through the shifting window in SW-MSA, which facilitates the interaction of information between different windows, thereby guiding effective feature extraction. The VT Decoder Block can be divided into two parts: the left part is cross-attention (CA), and the right part is self-attention (SA). The fusion subblock, shown in Figure 3, merges the results of CA and SA and delivers them to the later layer. The fusion subblock comprises a convex combination, Fourier feature positional encoding (FPE), layer normalization (LN) [53], and a multi-layer perceptron (MLP) [54]. The dimensions of the input image are 4 × 128 × 128 × 128, and the classifier layer includes a 3D convolutional layer that maps the deep dimensional features to 3 × 128 × 128 × 128.
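The window mechanism can be summarized with the small sketch below: tokens are grouped into nonoverlapping windows for W-MSA, and a cyclic shift of half the window size is applied before SW-MSA so that tokens near window borders can interact. The helper names and tensor layout are illustrative assumptions.

```python
import torch

def window_partition_3d(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Split a token volume (B, D, H, W, C) into nonoverlapping ws x ws x ws windows,
    returning (num_windows * B, ws**3, C) for windowed self-attention (W-MSA)."""
    B, D, H, W, C = x.shape
    x = x.view(B, D // ws, ws, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, ws ** 3, C)

def cyclic_shift_3d(x: torch.Tensor, ws: int) -> torch.Tensor:
    """Cyclic shift by half the window size before SW-MSA, so that the shifted
    partition lets tokens from adjacent W-MSA windows attend to each other."""
    return torch.roll(x, shifts=(-ws // 2, -ws // 2, -ws // 2), dims=(1, 2, 3))
```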

3.4.2. Enhanced Channel Self-Attention Block (ECSA)

A diagram of the Enhanced Transformer and ECSA Block is shown in Figure 4. The bottom layer is the lowest in the network and has the smallest image size. However, it contains the richest semantic information. It is helpful to extract detailed features effectively from the bottleneck layer, which is important in terms of the brain tumor segmentation results. The Enhanced Transformer combines the advantages of global and local features, which is beneficial for extracting the details of image features and can be represented as follows:
$$Z = \mathrm{ECSA}\left(\mathrm{LN}(x)\right) + x + \mathrm{MLP}\left(\mathrm{LN}\left(\mathrm{ECSA}\left(\mathrm{LN}(x)\right) + x\right)\right)$$
where x denotes the input features. LN represents layer normalization. MLP is a multi-layer perceptron. Z is the result of the equation. The ECSA Block is an enhanced channel self-attention block.
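The residual structure of the equation above can be sketched as follows, treating the ECSA Block abstractly as any module that maps a token sequence to a token sequence (the volumetric ECSA sketch later in this section uses a different tensor layout); the MLP expansion ratio is an assumption.

```python
import torch
import torch.nn as nn

class EnhancedTransformerBlock(nn.Module):
    """Sketch of Z = ECSA(LN(x)) + x + MLP(LN(ECSA(LN(x)) + x))."""
    def __init__(self, dim: int, ecsa: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ecsa = ecsa
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, C) tokens
        y = self.ecsa(self.norm1(x)) + x                   # ECSA(LN(x)) + x
        return y + self.mlp(self.norm2(y))                 # + MLP(LN(...))
```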
The ECSA Block first extracts the channel weights, Kw, Qw, and Vw, of the image features. A weighted self-attention mechanism was developed to capture more effective global features. Depth-wise separable convolution [55] with large convolution kernels of 7 × 7 × 7 is used to ensure larger receptive fields, and it is then performed on each channel to obtain local features while minimizing information loss. Finally, all channels are aggregated using a 1 × 1 × 1 convolution before being output. The ECSA can be divided into three steps: calculating the weights, capturing the weighted global features, and fusing the local features.
First step: The calculation formula for the three weights can be described as follows:
$$Q_w = F_{SW}(F_L(x))$$
$$K_w = F_{SW}(F_L(x))$$
$$V_w = F_W(x)$$
where FL denotes a linear operation. FW and FSW can be calculated as follows:
$$F_W(x) = \mathrm{sigmoid}\left(F_L\left(\mathrm{ReLU}\left(F_L\left(\mathrm{AP}(x)\right)\right)\right)\right)$$
$$F_{SW}(x) = \mathrm{sigmoid}\left(F_L\left(\mathrm{ReLU}\left(F_L\left(\mathrm{AP}(x)\right)\right)\right)\right)$$
where AP denotes average pooling. FL represents a linear operation.
In the second step, the capture of the weighted global features is calculated as follows:
$$Q = W(Q_w, F_L(x))$$
$$K = W(K_w, F_L(x))$$
$$V = W(V_w, x)$$
where W(•), which is a multiplication operation using the input data, can be represented as follows:
$$Q = Q_w \times F_L(x)$$
$$K = K_w \times F_L(x)$$
$$V = V_w \times x$$
In the third step, the fusion of the local features is calculated as follows:
$$Y_{out} = F_L\left(\mathrm{softmax}\left(\mathrm{Conv}_{1\times1\times1}\left(\mathrm{DWC}_{7\times7\times7}(K \odot Q)\right)\right) \odot V\right)$$
where Yout denotes the final result. Conv1×1×1 represents convolution with a 1 × 1 × 1 kernel. DWC7×7×7 is a depth-wise separable convolution with a 7 × 7 × 7 kernel. FL denotes a linear operation. ⊙ denotes the Hadamard product [56], which converts second-order mappings into third-order mappings.
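Putting the three steps together, one possible reading of the ECSA Block is sketched below on a volumetric feature map of layout (B, C, D, H, W). The use of 1 × 1 × 1 convolutions for the linear maps FL, the squeeze-style gates for FW/FSW, the reduction ratio, and the channel-wise softmax are our assumptions; only the overall structure (channel weights, weighted Q/K/V, a 7 × 7 × 7 depth-wise convolution followed by 1 × 1 × 1 aggregation) follows the description above.

```python
import torch
import torch.nn as nn

class ECSA(nn.Module):
    """Illustrative sketch of the three ECSA steps: channel weights,
    weighted Q/K/V, and local fusion with a depth-wise separable convolution."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.linear_q = nn.Conv3d(dim, dim, 1)   # F_L as 1x1x1 convolutions (assumption)
        self.linear_k = nn.Conv3d(dim, dim, 1)

        def gate():   # AP -> Linear -> ReLU -> Linear -> sigmoid (F_W / F_SW)
            return nn.Sequential(nn.AdaptiveAvgPool3d(1), nn.Flatten(),
                                 nn.Linear(dim, dim // reduction), nn.ReLU(),
                                 nn.Linear(dim // reduction, dim), nn.Sigmoid())
        self.gate_q, self.gate_k, self.gate_v = gate(), gate(), gate()

        self.dwc = nn.Conv3d(dim, dim, 7, padding=3, groups=dim)  # depth-wise 7x7x7
        self.pw = nn.Conv3d(dim, dim, 1)                          # 1x1x1 channel aggregation
        self.linear_out = nn.Conv3d(dim, dim, 1)                  # final F_L

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: (B, C, D, H, W)
        q = self.gate_q(x)[..., None, None, None] * self.linear_q(x)   # Q = Qw * F_L(x)
        k = self.gate_k(x)[..., None, None, None] * self.linear_k(x)   # K = Kw * F_L(x)
        v = self.gate_v(x)[..., None, None, None] * x                  # V = Vw * x
        attn = torch.softmax(self.pw(self.dwc(k * q)), dim=1)          # softmax over channels
        return self.linear_out(attn * v)                               # fuse with V (Hadamard)
```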

3.4.3. Group Normalization Shuffle (GNS) Block

A diagram of the GNS Block is shown in Figure 5. Ma et al. [57] proposed ShuffleNetv2 to divide the input feature map into multiple subblocks and perform a shuffling operation on these subblocks. Shuffling operations typically involve rearranging the features between different subblocks to introduce more variation and diversity. This process helps the model better capture details and structures in images and improve the generalization ability.
Batch normalization (BN) has become an important component of many advanced deep learning models, especially in computer vision. BN normalizes layer inputs by calculating the average and variance in batch processing. The batch size must be sufficiently large for BN to perform well. However, only small batches are available in some cases. Group normalization (GN) [58] is suitable for tasks that require a large amount of memory, such as image segmentation. GN calculates the mean and variance in each group of channels and is not related to or constrained by batch size. As the batch size decreases, GN performance is largely unaffected.
The rectified linear unit (ReLU) and Gaussian error linear unit (GeLU) [59] are the most common activation functions. ReLU is a very simple function that returns 0 when the input is negative and returns the input value when the input is positive. Thus, it contains only one piecewise linear transformation. However, the ReLU output remains constant at 0 when the input is negative, which may lead to neuronal death and reduce the expressive power of the model. The GeLU function is a continuous S-shaped curve with a smoother shape than that of ReLU, and it can alleviate neuronal death to a certain extent. Inspired by the above, we utilized GN instead of BN and replaced ReLU with GeLU in our GNS Block to enable the communication of information between different channel groups and improve accuracy, which can be represented as follows:
$$d = \mathrm{shuffle}\left(\mathrm{concat}\left(F_S(x),\ \mathrm{Conv}_{1\times1\times1}\left(\mathrm{DWC}_{3\times3\times3}\left(\mathrm{Conv}_{1\times1\times1}\left(F_S(x)\right)\right)\right)\right)\right)$$
where FS(·) denotes the split operation, which evenly divides the feature channels into two groups. Conv1×1×1 represents a convolution with a 1 × 1 × 1 kernel. DWC3×3×3 denotes a depth-wise separable convolution with a 3 × 3 × 3 kernel. The concat operation represents the concatenation of two sets of features, and the shuffle operation is a channel shuffle operation.
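A compact sketch of this block is given below: the channels are split in half, one half passes through a 1 × 1 × 1 → depth-wise 3 × 3 × 3 → 1 × 1 × 1 branch with GN and GeLU replacing BN and ReLU, and the two halves are concatenated and channel-shuffled. The number of GN groups and the requirement that the channel count be divisible by it are assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """ShuffleNetV2-style channel shuffle: interleave channels across groups."""
    b, c, d, h, w = x.shape
    return x.view(b, groups, c // groups, d, h, w).transpose(1, 2).reshape(b, c, d, h, w)

class GNSBlock(nn.Module):
    """Sketch of the GNS Block: channel split, a 1x1x1 -> depth-wise 3x3x3 -> 1x1x1
    branch with GN/GeLU instead of BN/ReLU, then concatenation and channel shuffle."""
    def __init__(self, dim: int, gn_groups: int = 8):
        super().__init__()
        half = dim // 2          # dim assumed even and divisible by 2 * gn_groups
        self.branch = nn.Sequential(
            nn.Conv3d(half, half, 1), nn.GroupNorm(gn_groups, half), nn.GELU(),
            nn.Conv3d(half, half, 3, padding=1, groups=half),   # depth-wise 3x3x3
            nn.GroupNorm(gn_groups, half),
            nn.Conv3d(half, half, 1), nn.GroupNorm(gn_groups, half), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:         # x: (B, C, D, H, W)
        x1, x2 = x.chunk(2, dim=1)                               # FS: split channels in half
        return channel_shuffle(torch.cat([x1, self.branch(x2)], dim=1))
```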

4. Results and Discussion

4.1. Comparison with Other Methods

We compared the proposed model with eleven advanced models to evaluate its advantages: two networks from 2024, two from 2023, and two from 2022, in addition to five classic networks. The five classic networks were 3D UNet, Att-UNet, UNETR, TransBTS, and VT-UNet. Six of the compared models are architecture variants based on the basic UNet, and five are Transformer-based structures. In order to accurately validate the effectiveness of the proposed model, we sequentially compared it with different methods offline and online on BraTS2021. The offline results were obtained by running experiments on our servers, while the online results were obtained after uploading the predictions to the official BraTS platform and receiving the official results. We utilized five-fold cross-validation in our offline experiments. In Table 2, the offline results are presented for comparison with those of other methods. It can be observed that except for a slightly lower F1-score value, all other values are the highest. As shown in Table 3, we separately conducted a statistical significance analysis to compare different methods using the BraTS2021 dataset; the results were determined with a one-sided Wilcoxon signed rank test. Bold numbers indicate statistical significance (p < 0.05).
Inspired by the studies of Michael Rebsamen [61] and Snehal Prabhudesai [62], the GETNet method was validated separately on the HGG dataset (293 cases), the LGG dataset (76 cases), and a combination of the two. In Table 4, it can be seen that the Dice values of the HGG cases are relatively higher, followed by those of the mixed HGG and LGG cases, while the LGG cases are slightly lower. In the BraTS dataset, LGG samples are underrepresented, which leads to an inevitable performance decrease for the LGG data. LGG cases typically have no necrotic areas; hence, they exhibit significant differences in the appearance of the tumor core region compared to HGG cases. Additionally, the appearance and size of the enhancing tumor region are also distinct. These differences impact the network’s performance and can lead to suboptimal model performance in LGG segmentation.
Table 5 and Figure 6 and Figure 7 show that the Dice coefficient values of GETNet in the online validation for the whole tumor (WT), tumor core (TC), and enhancing tumor (ET) are 91.77, 86.03, and 83.64, respectively. The HD values, shown in Table 5, are 4.36, 11.35, and 14.58 for the three tumor subregions (WT, TC, and ET), respectively. We used VT-UNet as the baseline; our WT, TC, ET, and average Dice results increased by 0.11, 1.62, 2.89, and 1.55, respectively, while the HD95 values remained close to each other. From these results, it can be seen that incorporating convolution, which contributes local characteristics, into the pure Transformer improved its performance.
The results show that our network improved slightly in terms of TC and ET; that is, our network performs better than the baseline in small target segmentation. Compared to other networks, ours may not be the best on every single indicator, but our average results and ET values are the highest. Regarding the BraTS dataset, the final segmentation results are divided into three parts: whole tumor (WT), tumor core (TC), and enhancing tumor (ET). They have an inclusive relationship, meaning that WT encompasses both TC and ET, with ET being included within TC. If there is an emphasis on enhancing the focus on local detail features, the ET results are likely to improve. This indicates that our segmentation performance for small targets is the best among the compared networks, mainly due to the incorporation of local detail features. Figure 8 shows the visualization results of the GETNet model on the BraTS2021 dataset, in which five cases were randomly chosen. The medical cases, as shown in sequences A, B, C, D, and E of Figure 8, were segmented by GETNet. The images from left to right are, respectively, FLAIR, 3D UNet, Att-UNet, UNETR, TransBTS, VT-UNet, SwinUNet3D, the results segmented by GETNet, and the ground truth. Green, yellow, and red represent WT, TC, and ET, respectively. In general, the results of GETNet are close to the labeled ground truth. Compared to the networks based only on convolutions, the results of our model are the best. Our network also performs better than the Transformer-based networks. Overall, our architecture and modules achieved better results on BraTS2021, providing a good basis for subsequent research.

4.2. Ablation Experiments

4.2.1. Ablation Study of Each Module in GETNet

We conducted ablation experiments to verify the effects of different modules in this architecture. Table 6 and Figure 9 show the results of utilizing GNS and ECSA in GETNet, which improved the average Dice coefficient by 0.64 and 0.91, respectively. Empirically, we added both the GNS and ECSA, and all indicators improved. The results are 91.77, 86.03, 83.64, and 87.15 for WT, TC, ET, and the average Dice coefficient, respectively. The Hausdorff 95% (HD) values are 4.36, 11.35, 14.58, and 10.10 for WT, TC, ET, and the average HD, respectively.
The original plan was to use a pure Transformer, which does not include the local connectivity of convolution (capturing local features), shared weights (reducing the number of parameters), and sparse interactions (reducing the number of parameters and computational overhead). The module that we designed uses convolution within the pure Transformer architecture; thus, it captures local features. Table 6 shows that whether we add the GNS module alone, the ECSA module alone, or both, there is an improvement. The best results show that our network and all of the modules can be effectively applied to brain tumor segmentation tasks.

4.2.2. Ablation Study of GN and GeLU in the GNS Module

To verify the effectiveness of replacing BN and ReLU with GN and GeLU, we conducted five sets of experiments. The results and experimental plan are shown in Table 7. Experiment A uses the original shuffle block of ShuffleNetV2. Unit 1, Unit 2, and Unit 3 are shown in Figure 10, which illustrates BN, GN, BN + ReLU, and GN + GeLU placed at different positions.
In these experiments, the combination of GN + GeLU in Unit 1, GN in Unit 2, and GN + GeLU in Unit 3 improved the effect and gave the best result (Experiment E). Next, we attempted the case where Unit 1, Unit 2, and Unit 3 were all GN + GeLU; the results improved compared to those of Experiment A but were worse than those obtained in Experiment E. If all units were replaced with GN (Experiment C), the results improved compared to those of Experiment A but were similar to those of Experiment B. Naturally, we also replaced Unit 1, Unit 2, and Unit 3 of Experiment E with GN, GN + GeLU, and GN (Experiment D), and the results worsened. The experimental plans were as follows:
A: Unit 1 → BN+ReLU; Unit 2 → BN; Unit 3 → BN+ReLU;
B: Unit 1 → GN+GeLU; Unit 2 → GN+GeLU; Unit 3 → GN+GeLU;
C: Unit 1 → GN; Unit 2 → GN; Unit 3 → GN;
D: Unit 1 → GN; Unit 2 → GN+GeLU; Unit 3 → GN;
E: Unit 1 → GN+GeLU; Unit 2 → GN; Unit 3 → GN+GeLU.
The results are shown in Table 7 and Figure 11. We imitated the shuffle block and replaced the corresponding BN and ReLU with GN and GeLU, respectively, achieving good results. The results are 91.77, 86.03, 83.64, and 87.15 for WT, TC, ET, and the average Dice coefficient, respectively. The results of Experiment E are still the best. These results exceed those of the shuffle block, indicating that our improvement is effective.

4.2.3. Ablation Study of the Convex Combination in the ECSA Module

Table 8 and Figure 12 compare the coefficients of the convex combination in the ECSA module; 1 − λ and λ represent the proportions of information processed from cross-attention and self-attention in the VT Decoder Block, respectively. The different proportions of 1 − λ and λ determine which part plays a decisive role, and this has a certain impact on the processing of the later layer. In order to find the optimal λ value in the convex combination, values of λ from 0.1 to 0.9 were tested. The configuration with λ = 0.5 and 1 − λ = 0.5 performed best, with results of 91.77, 86.03, 83.64, and 87.15 for WT, TC, ET, and the average Dice coefficient, respectively. It can be seen from the average Dice coefficient that there is indeed a certain improvement when the proportion of cross-attention increases, but this has little impact on the overall results. The best effect is achieved when cross-attention and self-attention reach a balance.
We also considered the case where the convex combination is not used; that is, η and θ are either adaptive learning parameters (ω) or both fixed to 1. Neither configuration exceeded the results of λ = 0.5 and 1 − λ = 0.5, as illustrated in Table 9 and Figure 13.
The results are 91.56, 85.54, 82.83, and 86.57 for WT, TC, ET, and the average Dice coefficient, respectively, when η and θ were adaptive learning parameters (ω). The results were 91.50, 85.76, 82.47, and 86.58 when η and θ were both set to 1, and 91.77, 86.03, 83.64, and 87.15 when η and θ were both set to 0.5. The results indicate that further feature processing and extraction are needed to achieve better results when both cross-attention and self-attention contain more information. Comparing Table 8 and Table 9, it can be seen that the adaptive learning parameters played a certain role, but their outputs still require further processing.
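The variants compared in Table 8 and Table 9 can be summarized with a small sketch of the fusion step, where `lam` corresponds to λ in the convex combination and the adaptive mode is our rough analogue of the adaptive-parameter case rather than the exact implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Mix self-attention (SA) and cross-attention (CA) outputs in the fusion sub-block.
    Convex mode: lam * SA + (1 - lam) * CA.  Adaptive mode: eta * SA + theta * CA
    with eta and theta learned independently (no convexity constraint)."""
    def __init__(self, lam: float = 0.5, adaptive: bool = False):
        super().__init__()
        self.adaptive = adaptive
        if adaptive:
            self.eta = nn.Parameter(torch.tensor(1.0))
            self.theta = nn.Parameter(torch.tensor(1.0))
        else:
            self.lam = lam

    def forward(self, sa_out: torch.Tensor, ca_out: torch.Tensor) -> torch.Tensor:
        if self.adaptive:
            return self.eta * sa_out + self.theta * ca_out
        return self.lam * sa_out + (1 - self.lam) * ca_out
```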

4.2.4. Ablation Study of the Frequency Coefficient of FPE in the ECSA Module

Table 10 and Figure 14 show a comparison of the frequency coefficient of the FPE in the ECSA module. The frequency coefficient of the FPE is 10,000 in the case of VT-UNet. In Table 10, Test A represents a frequency coefficient of 5000, and Test B represents a frequency coefficient of 20,000. That is, the wavelengths form a geometric progression from 2π to 10,000·2π. We attempted to change this coefficient to achieve better results, conducting experiments that halved and doubled the default coefficient. The results indicate that 10,000 is still optimal.
In this part of the experiment, we only performed simple scaling by half or double, and many other values remain untested. However, from the results in Table 10, it can be seen that the frequency coefficient does not have a particularly significant impact on the final results. Better results might be obtained with further tuning, but this would require extensive experimentation.
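To illustrate where the frequency coefficient enters, a standard sinusoidal positional-encoding sketch is shown below with the coefficient exposed as `base`; this is the usual formulation rather than the exact FPE code of VT-UNet, and `dim` is assumed to be even.

```python
import math
import torch

def positional_encoding(num_positions: int, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Sinusoidal positional encoding; `base` is the frequency coefficient varied in
    Table 10 (5000 / 10,000 / 20,000). Returns a (num_positions, dim) tensor."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (N, 1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)                    # even channel indices
    freq = torch.exp(-math.log(base) * idx / dim)                         # geometric progression
    pe = torch.zeros(num_positions, dim)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe
```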

4.2.5. Comparative Experiment on the Depth-Wise Size of the 3D Patch-Merging Operation

In the original VT-UNet, the depth-wise size does not change as the network deepens, but in the proposed GETNet, it changes with depth. Table 11 shows that when the depth-wise size changes with the layers, the segmentation performance is not affected, but the floating-point operations per second (FLOPs) are reduced by 48.99 G. This indicates that changing the depth-wise size with the layers can reduce the FLOPs and improve segmentation efficiency.

5. Conclusions

In this paper, we propose GETNet based on VT-UNet, which integrates a GNS block and an ECSA block. It enhances the performance of brain tumor segmentation by effectively fusing local features with long-distance relationships. The GNS module is used between the VT Encoder Block and the 3D Patch-Merging Block; it improves on the shuffle block in ShuffleNetV2, enables the communication of information between different groups of channels, and improves accuracy. The proposed ECSA Block operates in the bottleneck layer and combines the advantages of global and local features, which is beneficial for extracting image feature details. In addition to comparing our results with those of the classic VT-UNet, we compared them with those of networks based on UNet or Transformer. Our results yield Dice coefficients of 91.77, 86.03, and 83.64 for the three tumor subregions (WT, TC, and ET), respectively. Our advantage over architectures based on UNet and Transformer lies in the more effective fusion of local and global features using the GNS and ECSA modules. We also conducted ablation experiments on the GNS module, the ECSA module, the convex combination, and the FPE, which proved the effectiveness of our modules. Table 8 shows that, for the average Dice coefficient, there is a certain improvement when the proportion of cross-attention increases, but this has little impact on the overall results; the best effect is achieved when cross-attention and self-attention reach a balance. From the results in Table 10, it can be seen that the frequency coefficient does not have a particularly significant impact on the final results. Furthermore, quantitative and qualitative experiments demonstrated the accuracy of GETNet. Our architecture and the proposed modules can provide effective ideas for subsequent research.

Author Contributions

Conceptualization, N.C.; Methodology, B.G.; Software, R.Z.; Data Curation, P.Y.; Writing—Original Draft, B.G.; Writing—Review and Editing, N.C. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant number 41830110).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Datasets released to the public were analyzed in this study. The BraTS2021 dataset can be found through the following link: https://www.med.upenn.edu/cbica/brats2021/#Data2 (accessed on 12 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nelson, S.; Taylor, L.P. Headaches in brain tumor patients: Primary or secondary? Headache J. Head Face Pain 2014, 54, 776–785. [Google Scholar] [CrossRef]
  2. Hoesin, F.R.; Adikusuma, W. Visual Disturbances as an Early Important Sign of Brain Tumor: A Case Report. J. Oftalmol. 2022, 4, 1–5. [Google Scholar] [CrossRef]
  3. Sorribes, I.C.; Moore, M.N.; Byrne, H.M.; Jain, H.V. A biomechanical model of tumor-induced intracranial pressure and edema in brain tissue. Biophys. J. 2019, 116, 1560–1574. [Google Scholar] [CrossRef]
  4. Siddiq, M. Ml-based medical image analysis for anomaly detection in CT scans, X-rays, and MRIs. Devot. J. Res. Community Serv. 2020, 2, 53–64. [Google Scholar] [CrossRef]
  5. Kwong, R.Y.; Yucel, E.K. Computed tomography scan and magnetic resonance imaging. Circulation 2003, 108, e104–e106. [Google Scholar] [CrossRef] [PubMed]
  6. Castiglioni, I.; Rundo, L.; Codari, M.; Di Leo, G.; Salvatore, C.; Interlenghi, M.; Gallivanone, F.; Cozzi, A.; D‘Amico, N.C.; Sardanelli, F. AI applications to medical images: From machine learning to deep learning. Phys. Medica 2021, 83, 9–24. [Google Scholar] [CrossRef]
  7. Yang, J.; An, P.; Shen, L.; Wang, Y. No-reference stereo image quality assessment by learning dictionaries and color visual characteristics. IEEE Access 2019, 7, 173657–173669. [Google Scholar] [CrossRef]
  8. Xin, W.; Liu, R.; Liu, Y.; Chen, Y.; Yu, W.; Miao, Q. Transformer for skeleton-based action recognition: A review of recent advances. Neurocomputing 2023, 537, 164–186. [Google Scholar] [CrossRef]
  9. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  10. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  11. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  12. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2016: 19th International Conference, Athens, Greece, 17–21 October 2016; pp. 424–432. [Google Scholar]
  13. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  14. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  15. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  16. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  17. Chen, C.; Liu, X.; Ding, M.; Zheng, J.; Li, J. 3D dilated multi-fiber network for real-time brain tumor segmentation in MRI. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2019: 22nd International Conference, Shenzhen, China, 13–17 October 2019; pp. 184–192. [Google Scholar]
  18. Xu, Y.; Gong, M.; Fu, H.; Tao, D.; Zhang, K.; Batmanghelich, K. Multi-scale masked 3-D U-net for brain tumor segmentation. In Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2019; pp. 222–233. [Google Scholar]
  19. Jiang, Y.; Ye, M.; Huang, D.; Lu, X. AIU-Net: An efficient deep convolutional neural network for brain tumor segmentation. Math. Probl. Eng. 2021, 2021, 7915706. [Google Scholar] [CrossRef]
  20. Ahmad, P.; Jin, H.; Qamar, S.; Zheng, R.; Saeed, A. RD2A: Densely connected residual networks using ASPP for brain tumor segmentation. Multimed. Tools Appl. 2021, 80, 27069–27094. [Google Scholar] [CrossRef]
  21. Wang, L.; Liu, M.; Wang, Y.; Bai, X.; Zhu, M.; Zhang, F. A multi-scale method based on U-Net for brain tumor segmentation. In Proceedings of the 2022 7th International Conference on Communication, Image and Signal Processing (CCISP), Chengdu, China, 18–20 November 2022; pp. 271–275. [Google Scholar]
  22. Bukhari, S.T.; Mohy-ud-Din, H. E1D3 U-Net for brain tumor segmentation: Submission to the RSNA-ASNR-MICCAI BraTS 2021 challenge. In Proceedings of the International MICCAI Brainlesion Workshop, Virtual Event, 27 September 2021; pp. 276–288. [Google Scholar]
  23. Ahmad, P.; Qamar, S.; Shen, L.; Rizvi, S.Q.A.; Ali, A.; Chetty, G. Ms unet: Multi-scale 3d unet for brain tumor segmentation. In Proceedings of the International MICCAI Brainlesion Workshop, Virtual Event, 27 September 2021; pp. 30–41. [Google Scholar]
  24. Wu, Q.; Pei, Y.; Cheng, Z.; Hu, X.; Wang, C. SDS-Net: A lightweight 3D convolutional neural network with multi-branch attention for multimodal brain tumor accurate segmentation. Math. Biosci. Eng. 2023, 20, 17384–17406. [Google Scholar] [CrossRef]
  25. Chen, R.; Lin, Y.; Ren, Y.; Deng, H.; Cui, W.; Liu, W. An efficient brain tumor segmentation model based on group normalization and 3D U-Net. Int. J. Imaging Syst. Technol. 2024, 34, e23072. [Google Scholar] [CrossRef]
  26. Kharaji, M.; Abbasi, H.; Orouskhani, Y.; Shomalzadeh, M.; Kazemi, F.; Orouskhani, M. Brain Tumor Segmentation with Advanced nnU-Net: Pediatrics and Adults Tumors. Neurosci. Inform. 2024, 4, 100156. [Google Scholar] [CrossRef]
  27. Liu, T.; Luo, R.; Xu, L.; Feng, D.; Cao, L.; Liu, S.; Guo, J. Spatial channel attention for deep convolutional neural networks. Mathematics 2022, 10, 1750. [Google Scholar] [CrossRef]
  28. Zhou, C.; Chen, S.; Ding, C.; Tao, D. Learning contextual and attentive information for brain tumor segmentation. In Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2019; pp. 497–507. [Google Scholar]
  29. Zhang, X.; Zhang, X.; Ouyang, L.; Qin, C.; Xiao, L.; Xiong, D. SMTF: Sparse transformer with multiscale contextual fusion for medical image segmentation. Biomed. Signal Process. Control 2024, 87, 105458. [Google Scholar] [CrossRef]
  30. Zhao, J.; Sun, L.; Sun, Z.; Zhou, X.; Si, H.; Zhang, D. MSEF-Net: Multi-scale edge fusion network for lumbosacral plexus segmentation with MR image. Artif. Intell. Med. 2024, 148, 102771. [Google Scholar] [CrossRef]
  31. Liu, C.; Liu, H.; Zhang, X.; Guo, J.; Lv, P. Multi-scale and multi-view network for lung tumor segmentation. Comput. Biol. Med. 2024, 172, 108250. [Google Scholar] [CrossRef] [PubMed]
  32. Wang, Z.; Zou, Y.; Chen, H.; Liu, P.X.; Chen, J. Multi-scale features and attention guided for brain tumor segmentation. J. Vis. Commun. Image Represent. 2024, 100, 104141. [Google Scholar] [CrossRef]
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  34. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  35. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  36. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  37. Peiris, H.; Hayat, M.; Chen, Z.; Egan, G.; Harandi, M. A robust volumetric transformer for accurate 3D tumor segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 162–172. [Google Scholar]
  38. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 574–584. [Google Scholar]
  39. Jia, Q.; Shu, H. Bitr-unet: A cnn-transformer combined network for mri brain tumor segmentation. In Proceedings of the International MICCAI Brainlesion Workshop, Virtual Event, 27 September 2021; pp. 3–14. [Google Scholar]
  40. Wang, W.; Chen, C.; Ding, M.; Yu, H.; Zha, S.; Li, J. TransBTS: Multimodal brain tumor segmentation using transformer. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Virtual Event, 27 September–1 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 109–119. [Google Scholar]
  41. Cai, Y.; Long, Y.; Han, Z.; Liu, M.; Zheng, Y.; Yang, W.; Chen, L. Swin Unet3D: A three-dimensional medical image segmentation network combining vision transformer and convolution. BMC Med. Inform. Decis. Mak. 2023, 23, 33. [Google Scholar] [CrossRef] [PubMed]
  42. Fu, B.; Peng, Y.; He, J.; Tian, C.; Sun, X.; Wang, R. HmsU-Net: A hybrid multi-scale U-net based on a CNN and transformer for medical image segmentation. Comput. Biol. Med. 2024, 170, 108013. [Google Scholar] [CrossRef] [PubMed]
  43. Ao, Y.; Shi, W.; Ji, B.; Miao, Y.; He, W.; Jiang, Z. MS-TCNet: An effective Transformer–CNN combined network using multi-scale feature learning for 3D medical image segmentation. Comput. Biol. Med. 2024, 170, 108057. [Google Scholar] [CrossRef] [PubMed]
  44. Aboussaleh, I.; Riffi, J.; el Fazazy, K.; Mahraz, A.M.; Tairi, H. 3DUV-NetR+: A 3D hybrid Semantic Architecture using Transformers for Brain Tumor Segmentation with MultiModal MR Images. Results Eng. 2024, 21, 101892. [Google Scholar] [CrossRef]
  45. Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 2014, 34, 1993–2024. [Google Scholar] [CrossRef] [PubMed]
  46. Bakas, S.; Akbari, H.; Sotiras, A.; Bilello, M.; Rozycki, M.; Kirby, J.S.; Freymann, J.B.; Farahani, K.; Davatzikos, C. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 2017, 4, 170117. [Google Scholar] [CrossRef]
  47. Baid, U.; Ghodasara, S.; Mohan, S.; Bilello, M.; Calabrese, E.; Colak, E.; Farahani, K.; Kalpathy-Cramer, J.; Kitamura, F.C.; Pati, S. The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv 2021, arXiv:2107.02314. [Google Scholar]
  48. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  49. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  50. Dice, L.R. Measures of the amount of ecologic association between species. Ecology 1945, 26, 297–302. [Google Scholar] [CrossRef]
  51. Kim, I.S.; McLean, W. Computing the Hausdorff distance between two sets of parametric curves. Commun. Korean Math. Soc. 2013, 28, 833–850. [Google Scholar] [CrossRef]
  52. Aydin, O.U.; Taha, A.A.; Hilbert, A.; Khalil, A.A.; Galinovic, I.; Fiebach, J.B.; Frey, D.; Madai, V.I. On the usage of average Hausdorff distance for segmentation performance assessment: Hidden error when used for ranking. Eur. Radiol. Exp. 2021, 5, 4. [Google Scholar] [CrossRef] [PubMed]
  53. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  54. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
  55. Sifre, L.; Mallat, S. Rigid-motion scattering for texture classification. arXiv 2014, arXiv:1403.1687. [Google Scholar]
  56. Horn, R.A. The hadamard product. In Proceedings of Symposia in Applied Mathematics; American Mathematical Society: Providence, RI, USA, 1990; pp. 87–169. [Google Scholar]
  57. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  58. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  59. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  60. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  61. Rebsamen, M.; Knecht, U.; Reyes, M.; Wiest, R.; Meier, R.; McKinley, R. Divide and conquer: Stratifying training data by tumor grade improves deep learning-based brain tumor segmentation. Front. Neurosci. 2019, 13, 469127. [Google Scholar] [CrossRef]
  62. Prabhudesai, S.; Wang, N.C.; Ahluwalia, V.; Huan, X.; Bapuraj, J.R.; Banovic, N.; Rao, A. Stratification by tumor grade groups in a holistic evaluation of machine learning for brain tumor segmentation. Front. Neurosci. 2021, 15, 740353. [Google Scholar] [CrossRef] [PubMed]
  63. Pawar, K.; Zhong, S.; Goonatillake, D.S.; Egan, G.; Chen, Z. Orthogonal-Nets: A Large Ensemble of 2D Neural Networks for 3D Brain Tumor Segmentation. In Proceedings of the International MICCAI Brainlesion Workshop, Virtual Event, 27 September 2021; pp. 54–67. [Google Scholar]
  64. Håversen, A.H.; Bavirisetti, D.P.; Kiss, G.H.; Lindseth, F. QT-UNet: A self-supervised self-querying all-Transformer U-Net for 3D segmentation. IEEE Access 2024, 12, 62664–62676. [Google Scholar] [CrossRef]
  65. Akbar, A.S.; Fatichah, C.; Suciati, N.; Za’in, C. Yaru3DFPN: A lightweight modified 3D UNet with feature pyramid network and combine thresholding for brain tumor segmentation. Neural Comput. Appl. 2024, 36, 7529–7544. [Google Scholar] [CrossRef]
Figure 1. Procedural flowchart of the proposed GETNet.
Figure 2. An illustration of the proposed GETNet for brain tumor image segmentation.
Figure 3. An illustration of the fusion sub-block.
Figure 4. An illustration of the building blocks of the ECSA Block.
Figure 5. An illustration of the building blocks from a channel-wise perspective. (a) The shuffle block in ShuffleNetV2; (b) the GNS Block presented in this paper.
Figure 6. Comparison of the Dice results of different segmentation methods.
Figure 7. Comparison of the HD results of different segmentation methods.
Figure 8. Visualization results for medical cases. From left to right: FLAIR, 3DUNet, Att-Unet, UNETR, TransBTS, VT-UNet, SwinUNet3D, the results segmented by GETNet, and the ground truth. (A–E) are five cases randomly chosen from the BraTS2021 dataset. Green, yellow, and red represent WT, TC, and ET, respectively.
Figure 9. The results of the ablation study of each module in GETNet.
Figure 10. An illustration of the building blocks of the GNS from a channel-wise perspective.
Figure 11. The results of the ablation study of GN and GeLU in the ECSA module.
Figure 12. The results of the ablation study of the convex combination in the ECSA module when λ < 1 and λ ≠ 1 − λ.
Figure 13. The results of the ablation study of the convex combination in the ECSA module when λ = 1 or λ = 1 − λ.
Figure 14. The results of the FEP frequency coefficient in the ECSA module.
Table 1. Model parameter configuration.
PyTorch Version: 1.11.0
Python: 3.8.10
GPU: NVIDIA RTX A5000 (24 GB)
CUDA: cu113
Learning Rate: 1.00 × 10−4
Optimizer: Adam
Epochs: 350
Batch Size: 1
Input Size: 128 × 128 × 128
Output Size: 128 × 128 × 128
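For readers reproducing the setup, the configuration in Table 1 corresponds to a standard PyTorch training loop. The sketch below is a minimal illustration of those settings only; the convolutional stand-in model, the dummy data, and the cross-entropy loss are placeholders and do not reproduce GETNet or the loss used in the paper.

```python
import torch
import torch.nn as nn

# Stand-in 3D model so the snippet runs end-to-end; in practice this would be
# the GETNet model, which is not reproduced here.
model = nn.Conv3d(in_channels=4, out_channels=4, kernel_size=3, padding=1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Table 1 settings: Adam optimizer, learning rate 1.00e-4, 350 epochs,
# batch size 1, 128 x 128 x 128 input patches (4 MRI modalities as channels).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()      # placeholder loss for this sketch only
epochs, batch_size = 350, 1

for epoch in range(1):                 # shortened from 350 epochs for illustration
    image = torch.randn(batch_size, 4, 128, 128, 128, device=device)        # dummy batch
    label = torch.randint(0, 4, (batch_size, 128, 128, 128), device=device)  # dummy labels
    optimizer.zero_grad()
    logits = model(image)
    loss = criterion(logits, label)
    loss.backward()
    optimizer.step()
```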
Table 2. The offline validation results for different methods on BraTS2021, with the best performance highlighted in bold. Values are given as WT/TC/ET.
3D U-Net [12]: Dice (%) 91.29/89.13/85.78; SD 7.14/15.49/15.93; Recall (%) 91.38/88.60/87.31; F1-score (%) 95.75/95.87/93.16
Att-Unet [60]: Dice 91.43/89.51/85.71; SD 9.23/15.15/17.11; Recall 90.73/88.68/86.35; F1-score 96.32/96.43/94.13
UNETR [38]: Dice 91.53/88.57/85.27; SD 8.92/15.94/18.48; Recall 92.54/88.71/85.92; F1-score 95.69/95.44/94.15
TransBTS [40]: Dice 90.61/88.78/84.29; SD 10.73/16.48/19.30; Recall 91.20/87.74/85.73; F1-score 95.69/96.38/93.35
VT-UNet [37]: Dice 92.39/90.12/86.07; SD 8.60/14.48/16.37; Recall 92.76/90.61/87.85; F1-score 96.43/96.06/93.63
Swin Unet3D (2023) [41]: Dice 92.85/90.69/86.26; SD 5.67/14.30/17.15; Recall 92.18/90.81/87.85; F1-score 96.94/96.42/93.63
GETNet (ours): Dice 93.04/91.70/87.41; SD 5.53/11.60/14.01; Recall 92.87/91.36/88.22; F1-score 96.78/96.75/94.32
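As a reference for the voxel-wise metrics reported in Table 2 and in the ablation tables below, the following sketch computes Dice, sensitivity (recall), and specificity from binarized per-region masks. It is a generic illustration, not the authors' evaluation script, and it would be applied separately to the WT, TC, and ET masks of each case before averaging over the validation set.

```python
import numpy as np

def overlap_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Voxel-wise Dice, sensitivity (recall), and specificity for one binary
    region mask (WT, TC, or ET). Inputs are 0/1 volumes of equal shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    dice = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    sensitivity = tp / (tp + fn + eps)
    specificity = tn / (tn + fp + eps)
    return dice, sensitivity, specificity

# Example on random masks; in practice one call per region per case.
rng = np.random.default_rng(0)
pred_wt = rng.random((128, 128, 128)) > 0.5
gt_wt = rng.random((128, 128, 128)) > 0.5
print(overlap_metrics(pred_wt, gt_wt))
```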
Table 3. Ratio (in %) of the improvement in the performance of GETNet compared to different methods. Bold numbers indicate statistical significance (p < 0.05). Each entry gives the percentage of subjects and the corresponding p-value per region.
GETNet (ours) vs. 3D U-Net: WT 76.4, p = 2.987 × 10−24; TC 82.8, p = 6.849 × 10−18; ET 73.7, p = 7.590 × 10−7
GETNet (ours) vs. Att-Unet: WT 76.1, p = 9.093 × 10−16; TC 81.2, p = 4.548 × 10−14; ET 73.7, p = 0.008
GETNet (ours) vs. UNETR: WT 76.1, p = 1.286 × 10−6; TC 83.6, p = 1.612 × 10−15; ET 77.2, p = 0.092
GETNet (ours) vs. TransBTS: WT 78.8, p = 4.564 × 10−24; TC 83.6, p = 3.762 × 10−17; ET 79.6, p = 2.579 × 10−12
GETNet (ours) vs. VT-UNet: WT 73.3, p = 2.724 × 10−6; TC 78.8, p = 1.421 × 10−10; ET 73.7, p = 1.576 × 10−9
GETNet (ours) vs. Swin Unet3D: WT 70.9, p = 0.0006; TC 76.4, p = 0.007; ET 72.9, p = 0.01
Table 4. The results of GETNet when using the LGG/HGG dataset of BraTS2020, with the best performance highlighted in bold. Values are given as WT/TC/ET.
GETNet in HGG cases: Dice (%) 92.74/92.53/87.24; SD 4.29/6.81/8.62; Recall (%) 92.73/91.66/89.65; F1-score (%) 96.41/96.87/92.55
GETNet in LGG cases: Dice 92.63/82.10/77.64; SD 3.89/15.86/29.32; Recall 91.77/81.87/81.33; F1-score 96.82/92.97/92.03
GETNet in all cases: Dice 92.72/90.38/85.26; SD 4.21/10.31/15.84; Recall 92.53/89.65/87.94; F1-score 96.49/96.09/92.44
Table 5. The online validation results for different methods on BraTS2021, with the best performance highlighted in bold. Values are given as WT/TC/ET/AVG.
3D U-Net [12]: Dice (%) 88.02/76.17/76.20/80.13; HD95 (mm) 9.97/21.57/25.48/19.00
Att-Unet [60]: Dice 89.74/81.59/79.60/83.64; HD95 8.09/14.68/19.37/14.05
UNETR [38]: Dice 90.89/83.73/80.93/85.18; HD95 4.71/13.38/21.39/13.16
TransBTS [40]: Dice 90.45/83.49/81.17/85.03; HD95 6.77/10.14/18.94/11.95
VT-UNet [37]: Dice 91.66/84.41/80.75/85.60; HD95 4.11/13.20/15.08/10.80
E1d3-UNet (2022) [22]: Dice 92.30/86.30/81.80/86.80; HD95 4.34/9.62/18.24/10.73
Orthogonal-Net (2022) [63]: Dice 91.40/85.00/83.20/86.53; HD95 5.43/9.81/20.97/12.07
SDS-Net (2023) [24]: Dice 91.80/86.80/82.50/87.00; HD95 21.07/11.99/13.13/15.40
Swin Unet3D (2023) [41]: Dice 90.50/86.60/83.40/86.83; HD95 not reported
QT-UNet-B (2024) [64]: Dice 91.24/83.20/79.99/84.81; HD95 4.44/12.95/17.19/11.53
Yaru3DFPN (2024) [65]: Dice 92.02/86.27/80.90/86.40; HD95 4.09/8.43/21.91/11.48
GETNet (ours): Dice 91.77/86.03/83.64/87.15; HD95 4.36/11.35/14.58/10.10
Table 6. The results of the ablation study of each module in GETNet, with the best performance highlighted in bold. Values are given as WT/TC/ET.
Base: Dice (%) 91.66/84.41/80.75; HD95 (mm) 4.11/13.20/15.08; Sensitivity and Specificity not reported
Base + GNS: Dice 91.50/85.30/81.92; HD95 5.36/13.68/20.24; Sensitivity (%) 91.67/84.08/82.51; Specificity (%) 99.93/99.96/99.98
Base + ECSA: Dice 91.62/85.82/82.11; HD95 4.79/11.44/18.04; Sensitivity 91.73/84.99/81.81; Specificity 99.92/99.97/99.98
Base + GNS + ECSA (GETNet): Dice 91.77/86.03/83.64; HD95 4.36/11.35/14.58; Sensitivity 92.83/85.12/83.91; Specificity 99.98/99.97/99.99
Table 7. The results of the ablation study of GN and GeLU in the GNS module, with the best performance highlighted in bold. Experiments A–E differ in the normalization and activation combination (GN, BN, GN + GeLU, BN + ReLU) applied at Units 1–3; metrics are reported per position (Unit 1/WT, Unit 2/TC, Unit 3/ET).
Expt A: Unit 1 (WT) Dice 91.32, HD95 5.72, Sen 91.64, Spe 99.98; Unit 2 (TC) Dice 84.26, HD95 17.18, Sen 83.27, Spe 99.97; Unit 3 (ET) Dice 83.01, HD95 15.02, Sen 83.56, Spe 99.93
Expt B: Unit 1 (WT) Dice 91.36, HD95 4.87, Sen 92.32, Spe 99.92; Unit 2 (TC) Dice 85.89, HD95 10.07, Sen 84.80, Spe 99.97; Unit 3 (ET) Dice 82.19, HD95 19.91, Sen 81.90, Spe 99.98
Expt C: Unit 1 (WT) Dice 91.65, HD95 4.67, Sen 92.30, Spe 99.92; Unit 2 (TC) Dice 85.67, HD95 13.69, Sen 84.51, Spe 99.98; Unit 3 (ET) Dice 82.88, HD95 16.71, Sen 83.01, Spe 99.98
Expt D: Unit 1 (WT) Dice 91.42, HD95 4.56, Sen 92.87, Spe 99.91; Unit 2 (TC) Dice 85.54, HD95 13.10, Sen 84.03, Spe 99.98; Unit 3 (ET) Dice 82.90, HD95 17.90, Sen 82.42, Spe 99.98
Expt E (GETNet): Unit 1 (WT) Dice 91.77, HD95 4.36, Sen 92.83, Spe 99.98; Unit 2 (TC) Dice 86.03, HD95 11.35, Sen 85.12, Spe 99.97; Unit 3 (ET) Dice 83.64, HD95 14.58, Sen 83.91, Spe 99.99
Table 8. The results of the ablation study of the convex combination in the ECSA module when λ < 1 and λ ≠ 1 − λ, with the best performance highlighted in bold. Values are given as WT/TC/ET.
Expt A (λ = 0.1, 1 − λ = 0.9): Dice (%) 91.56/85.22/82.94; HD95 (mm) 5.28/14.91/16.22; Sensitivity (%) 92.22/84.26/83.08; Specificity (%) 99.92/99.97/99.98
Expt B (λ = 0.2, 1 − λ = 0.8): Dice 91.46/86.41/81.86; HD95 4.56/9.69/18.52; Sensitivity 92.65/85.39/81.71; Specificity 99.91/99.97/99.98
Expt C (λ = 0.3, 1 − λ = 0.7): Dice 91.73/86.18/82.61; HD95 4.68/13.2/17.96; Sensitivity 91.65/85.65/82.85; Specificity 99.93/99.97/99.98
Expt D (λ = 0.4, 1 − λ = 0.6): Dice 91.63/85.23/82.47; HD95 4.86/16.44/18.01; Sensitivity 92.07/83.97/81.91; Specificity 99.93/99.97/99.98
Expt E (λ = 0.6, 1 − λ = 0.4): Dice 91.50/84.57/82.16; HD95 4.75/13.46/14.81; Sensitivity 91.56/83.82/82.66; Specificity 99.93/99.97/99.98
Expt F (λ = 0.7, 1 − λ = 0.3): Dice 91.89/85.08/82.84; HD95 4.33/13.26/14.83; Sensitivity 92.4/84.3/83.05; Specificity 99.93/99.97/99.98
Expt G (λ = 0.8, 1 − λ = 0.2): Dice 91.48/85.02/82.51; HD95 5.28/11.71/17.97; Sensitivity 93.43/81.65/81.48; Specificity 99.91/99.99/99.98
Expt H (λ = 0.9, 1 − λ = 0.1): Dice 91.58/85.13/82.72; HD95 4.69/13.21/14.7; Sensitivity 92.39/81.64/82.71; Specificity 99.92/99.97/99.98
GETNet (λ = 0.5, 1 − λ = 0.5): Dice 91.77/86.03/83.64; HD95 4.36/11.35/14.58; Sensitivity 92.83/85.12/83.91; Specificity 99.98/99.97/99.99
Table 9. The results of the ablation study of the convex combination in the ECSA module when λ = 1 or λ = 1 − λ, with the best performance highlighted in bold. Values are given as WT/TC/ET.
Expt A (η = ω, θ = ω): Dice (%) 91.56/85.54/82.83; HD95 (mm) 4.67/11.70/14.78; Sensitivity (%) 91.82/84.91/83.10; Specificity (%) 99.93/99.97/99.98
Expt B (η = 1, θ = 1): Dice 91.50/85.76/82.47; HD95 4.49/11.39/16.44; Sensitivity 92.28/85.72/83.08; Specificity 99.92/99.97/99.98
GETNet (η = 0.5, θ = 0.5): Dice 91.77/86.03/83.64; HD95 4.36/11.35/14.58; Sensitivity 92.83/85.12/83.91; Specificity 99.98/99.97/99.99
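Tables 8 and 9 vary the weights of the convex combination inside the ECSA block. The snippet below illustrates only the weighting step for two same-shaped feature tensors; how the two branches are produced within ECSA is not reproduced here, and `branch_a`/`branch_b` are hypothetical names.

```python
import torch

def convex_combine(branch_a: torch.Tensor, branch_b: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Fuse two same-shaped feature maps with weights lam and (1 - lam).
    With lam = 0.5, the GETNet setting, both branches contribute equally."""
    assert 0.0 <= lam <= 1.0, "a convex combination requires lam in [0, 1]"
    return lam * branch_a + (1.0 - lam) * branch_b

# Example: fuse two feature tensors of shape (B, C, D, H, W).
a = torch.randn(1, 96, 16, 16, 16)
b = torch.randn(1, 96, 16, 16, 16)
fused = convex_combine(a, b, lam=0.5)
```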
Table 10. The results of the FEP frequency coefficient in the ECSA module, with the best performance highlighted in bold. Values are given as WT/TC/ET.
Expt A (λ = 5000): Dice (%) 91.70/85.33/82.17; HD95 (mm) 4.47/13.23/19.87; Sensitivity (%) 92.63/83.96/82.07; Specificity (%) 99.92/99.97/99.98
Expt B (λ = 20,000): Dice 91.64/84.96/82.21; HD95 5.38/14.97/18.57; Sensitivity 93.79/84.95/83.18; Specificity 99.90/99.97/99.98
GETNet (λ = 10,000): Dice 91.77/86.03/83.64; HD95 4.36/11.35/14.58; Sensitivity 92.83/85.12/83.91; Specificity 99.98/99.97/99.99
Table 11. The results of a comparative experiment on the depth-wise size of the 3D Patch-Merging operation, with the best performance highlighted in bold. Values are given as WT/TC/ET.
Expt A: Dice (%) 91.50/86.83/82.88; HD95 (mm) 4.45/9.65/17.88; Sensitivity (%) 92.09/86.58/82.54; Specificity (%) 99.92/99.96/99.97; FLOPs 130.94 G
GETNet: Dice 91.77/86.03/83.64; HD95 4.36/11.35/14.58; Sensitivity 92.83/85.12/83.91; Specificity 99.98/99.97/99.99; FLOPs 81.95 G
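Table 11 shows that the depth-wise size used in the 3D Patch-Merging operation has a large effect on the computational cost (81.95 G versus 130.94 G FLOPs). As background only, a Swin-style 3D patch-merging layer with a configurable depth-wise merge factor might look like the sketch below; `PatchMerging3D` and `depth_factor` are assumptions for illustration, not the exact layer used in GETNet.

```python
import torch
import torch.nn as nn

class PatchMerging3D(nn.Module):
    """Sketch of a Swin-style 3D patch-merging layer. Neighbouring voxels are
    regrouped into the channel dimension and reduced with a linear projection.
    `depth_factor` controls whether the depth axis is merged (2) or kept (1);
    this is a generic illustration, not the exact layer compared in Table 11."""

    def __init__(self, dim: int, depth_factor: int = 2):
        super().__init__()
        self.depth_factor = depth_factor
        in_dim = dim * depth_factor * 2 * 2   # voxels gathered per output position
        self.norm = nn.LayerNorm(in_dim)
        self.reduction = nn.Linear(in_dim, 2 * dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, D, H, W, C); D must be divisible by depth_factor, H and W by 2
        b, d, h, w, c = x.shape
        df = self.depth_factor
        x = x.view(b, d // df, df, h // 2, 2, w // 2, 2, c)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7)        # move the merged voxels last
        x = x.reshape(b, d // df, h // 2, w // 2, df * 4 * c)
        return self.reduction(self.norm(x))

# Example: halve H and W while keeping full depth resolution (depth_factor=1).
merge = PatchMerging3D(dim=96, depth_factor=1)
out = merge(torch.randn(1, 32, 32, 32, 96))          # -> (1, 32, 16, 16, 192)
```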