3.3. Lightweight GEFA-Net
The spectral characteristics of a vibration signal can vary with fluctuations in operating parameters such as load, speed, and lubrication condition. This variability requires neural networks that can track the resulting changes in feature dynamics, yet conventional networks typically adapt poorly to them. Traditional CNN models strengthen fault feature extraction by stacking additional convolutional layers to cope with complex operating conditions; however, this not only raises the risk of overfitting but also inflates the parameter count, demanding more computational resources and longer processing time. To address the challenge of adapting traditional models for feature extraction across diverse operating conditions, our deep learning framework, as depicted in
Figure 3, integrates local feature extraction with global information synthesis and is designed to markedly improve accuracy in image classification tasks. By combining conventional convolutional mechanisms with enhanced attention strategies and an optimized Transformer encoder, the framework sharpens the model’s ability to capture the inherent complexity of images while substantially reducing its parameter count. Concurrently, we replace the AdamW optimizer with the Lion optimizer to improve both model performance and training efficiency. The enhancements in the revised GEFA-Net are described in the following subsections.
- (1)
Ghost module, Ghost Bottleneck, and EPSA
We refine the balance between depth, width, and computational efficiency using the Ghost module, Ghost Bottleneck, and EPSA mechanisms as integrated into the MobileViT model. The Ghost module enhances feature mapping through cost-effective operations, thus reducing computational overhead. As depicted in
Figure 4, this module splits the standard convolutional layer into two stages. First, conventional convolutional operations are performed and deliberately limited in number; they yield a primary set of feature maps that encapsulate the essential information of the input data. The effectiveness of these enhancements is examined systematically in the analyses that follow. In the second stage, the Ghost module applies a sequence of economical linear transformations to the feature maps produced by the initial convolutions, creating additional “ghost” feature maps that encapsulate and amplify the information inherent in the original features. Consequently, the Ghost module generates an augmented set of feature maps with a reduced parameter count and lower computational complexity, while preserving the dimensions of the output feature map [
54]. We chose the Ghost module as the core of our approach after extensive theoretical study and empirical comparison. Our objective was to enhance MobileViT by increasing inference speed, significantly reducing the number of model parameters, and simultaneously improving diagnostic accuracy. Among the modules evaluated, including ALBERT, TinyViT, MiniViT, and DynamicViT, the Ghost module performed best against these lightweight design goals. In particular, GhostNet generates new features through inexpensive linear operations with few parameters, which is especially advantageous for building efficient lightweight models.
The Ghost Bottleneck architecture is composed of two sequentially stacked Ghost modules. The initial Ghost module functions as an expansion layer, amplifying the channel count to enrich the representational capacity of the feature maps. Conversely, the subsequent Ghost module compresses the number of channels to align with the dimensions required by the shortcut paths. These shortcut paths are strategically deployed to integrate the inputs and outputs of both Ghost modules. By executing a series of cost-effective linear operations on each intrinsic feature map, the Ghost Bottleneck effectively multiplies the number of feature maps, thereby enhancing the model’s depth without proportionately escalating the computational burden or parameter count.
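For concreteness, the following is a minimal PyTorch sketch of the Ghost module and Ghost Bottleneck described above. The class names, channel ratio, and kernel sizes are illustrative assumptions rather than the exact GEFA-Net configuration.

```python
import math

import torch
import torch.nn as nn


class GhostModule(nn.Module):
    """A small primary convolution produces the intrinsic feature maps; cheap
    depthwise (linear) operations then generate the remaining "ghost" maps,
    and the two sets are concatenated."""

    def __init__(self, in_ch, out_ch, ratio=2, kernel_size=1, dw_size=3, stride=1):
        super().__init__()
        self.out_ch = out_ch
        init_ch = math.ceil(out_ch / ratio)      # intrinsic maps from the primary conv
        ghost_ch = init_ch * (ratio - 1)         # maps generated by cheap operations

        # Primary convolution, deliberately kept small to limit parameters
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, stride, kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True),
        )
        # Cheap linear operations: depthwise convolutions over the intrinsic maps
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, ghost_ch, dw_size, 1, dw_size // 2, groups=init_ch, bias=False),
            nn.BatchNorm2d(ghost_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)
        out = torch.cat([intrinsic, ghost], dim=1)   # spatial size is preserved
        return out[:, :self.out_ch]                  # trim to the requested channel count


class GhostBottleneck(nn.Module):
    """Two stacked Ghost modules: the first expands channels, the second
    compresses them so the shortcut path can be added element-wise."""

    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.ghost1 = GhostModule(in_ch, mid_ch)     # expansion layer
        self.ghost2 = GhostModule(mid_ch, out_ch)    # projection back down
        self.shortcut = (
            nn.Identity() if in_ch == out_ch
            else nn.Conv2d(in_ch, out_ch, 1, bias=False)
        )

    def forward(self, x):
        return self.ghost2(self.ghost1(x)) + self.shortcut(x)
```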
Specifically, within the MobileViTBlock, the conventional convolution operation is enhanced through the integration of the Ghost module. Furthermore, the Ghost module is employed as the initial convolution to enhance the model’s efficiency by facilitating feature extraction at a reduced computational cost during the early stages of the network. Concurrently, the Ghost Bottleneck primarily replaces the bottleneck layers within MobileViT, specifically those following the Inverted Residual layer. This substitution introduces more effective feature processing capabilities to the model, thereby optimizing overall performance. The computational expenditure of these operations is considerably lower compared to standard convolutional operations, thereby significantly enhancing the efficiency of the model. Despite the diminished computational demand and fewer parameters, experimental results demonstrate that the MobileViT model, incorporating both the Ghost module and Ghost Bottleneck structures, exhibits improved recognition performance. This approach further refines the features while preserving the spatial dimensionality of the feature map. Consequently, it circumvents the common issue of parameter proliferation associated with traditional convolutional layers and mitigates the risk of overfitting that often accompanies excessive convolutional layers.
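A minimal sketch of how these components might be wired into the backbone is given below. It reuses the GhostModule and GhostBottleneck classes from the previous sketch, borrows torchvision’s MobileNetV2 InvertedResidual as a stand-in for the corresponding MobileViT layer, and uses placeholder channel sizes; it is not the exact GEFA-Net stage layout.

```python
import torch.nn as nn
from torchvision.models.mobilenetv2 import InvertedResidual  # stand-in for the MV2 blocks in MobileViT


def ghost_stem_and_stage(in_ch=3):
    """Illustrative wiring: a GhostModule replaces the standard stem convolution
    for cheap early feature extraction, and a GhostBottleneck replaces the
    bottleneck that follows an inverted residual layer."""
    return nn.Sequential(
        GhostModule(in_ch, 16, kernel_size=3, stride=2),     # low-cost stem
        InvertedResidual(16, 32, stride=2, expand_ratio=4),   # unchanged MobileNetV2-style layer
        GhostBottleneck(32, 64, 32),                          # expand -> compress, with shortcut
    )
```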
To augment the model’s capacity for discerning and processing critical information, the EPSA has been integrated following the Ghost Bottleneck layer, as detailed in [
50]. This integration enhances the model’s representation of global features within the channel dimension by compressing and recalibrating the global content of the feature maps. The deployment process of the EPSA module is systematically outlined in
Figure 5, comprising four methodical steps. The Squeeze and Concat (SPC) module first generates a channel-specific multi-scale feature map, consolidating vital data to fortify the foundational analysis. The SE block module then extracts attention vectors from these feature maps at various scales, meticulously emphasizing essential features while discarding non-essential information, thereby improving the model’s computational efficiency. Following this, the vectors are recalibrated using the Softmax function, which aids in deriving channel-specific weights to optimize feature representation and enhance the focus on crucial attributes for achieving the stipulated objectives. This sequence culminates in an element-wise multiplication of the recalibrated weights with the corresponding feature maps, significantly refining the representation’s accuracy and clarity. The resultant feature map, rich in detailed and precise multi-scale feature information, considerably enhances the model’s ability to analyze diverse feature dimensions and elevates its performance and efficiency in complex data analysis scenarios.
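The four steps can be made concrete with the following PyTorch sketch, assuming four kernel scales and an equal channel split per scale; the reduction ratio and kernel sizes are illustrative, not the exact EPSA configuration of [50].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SEWeight(nn.Module):
    """Squeeze-and-Excitation style channel attention used inside the EPSA sketch."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fc(self.pool(x))     # (B, C, 1, 1) attention vector


class EPSA(nn.Module):
    """Sketch of the four steps described above: (1) SPC builds multi-scale maps,
    (2) SE extracts per-scale attention, (3) softmax recalibrates across scales,
    (4) element-wise reweighting of the feature maps."""

    def __init__(self, channels, scales=(3, 5, 7, 9)):
        super().__init__()
        assert channels % len(scales) == 0
        self.split = channels // len(scales)
        # Step 1: one convolution per kernel size over an equal channel split
        self.convs = nn.ModuleList(
            [nn.Conv2d(self.split, self.split, k, padding=k // 2) for k in scales]
        )
        # Step 2: per-scale SE attention
        self.se = nn.ModuleList([SEWeight(self.split) for _ in scales])

    def forward(self, x):
        b, c, h, w = x.shape
        chunks = torch.split(x, self.split, dim=1)
        feats = [conv(chunk) for conv, chunk in zip(self.convs, chunks)]   # step 1
        attns = [se(f) for se, f in zip(self.se, feats)]                   # step 2
        feats = torch.stack(feats, dim=1)        # (B, S, C/S, H, W)
        attns = torch.stack(attns, dim=1)        # (B, S, C/S, 1, 1)
        attns = F.softmax(attns, dim=1)          # step 3: recalibrate across scales
        out = feats * attns                      # step 4: element-wise reweighting
        return out.reshape(b, c, h, w)


# Example usage on a dummy feature map: y = EPSA(64)(torch.randn(2, 64, 32, 32))
```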
Drawing upon Zhang’s investigation [
50] and addressing the intricate challenge of bearing fault feature extraction under variable operating conditions, this study deduces that incorporating the EPSA module markedly enhances the model’s ability to discern critical features. This gain comes from the integration of multi-scale spatial information with a cross-channel attention mechanism and is particularly pronounced in scenarios with complex backgrounds and heterogeneous target scales. Consequently, we assessed where the EPSA module should be placed within our model and how effective each placement is at enhancing feature extraction and representational capacity. Our empirical modifications focused on three candidate integration points within the network’s architecture, aiming to optimize representational and diagnostic accuracy:
a. Post-MobileViTBlock integration: the EPSA module is introduced after each MobileViTBlock, immediately following the self-attention processing. This placement targets stronger multi-scale representation, which is crucial for capturing detailed intricacies within the data.
b. Pre-Feature Pyramid Network enhancement: the EPSA module is inserted before the connection to the Feature Pyramid Network (FPN) to strengthen detection capability and segmentation performance, refining how accurately the model delineates distinct segments in complex images.
c. Global Average Pooling optimization: the EPSA module is placed before the Global Average Pooling (GAP) stage so that it contributes to the final attention weighting and feature tuning, which is essential for extracting critical features and enhancing diagnostic precision.
Our experimental investigations indicate that the EPSA module’s incorporation post-MobileViTBlock significantly enhances the model’s diagnostic capabilities, especially in analyzing fault signals under variable operational conditions. These conditions require robust multi-scale processing to manage the complexity of feature representations effectively. The results demonstrate the EPSA module’s effectiveness in fortifying the model’s feature representation and underscore its potential in new diagnostic scenarios. This enhancement adaptively responds to diverse operational environments by customizing to specific input features, thus enriching the model’s feature extraction capabilities with greater depth and precision in complex scenarios. Concurrently, this approach substantially reduces computational resource consumption and operational latency. By integrating efficient parameter utilization with a blend of global and local information processing techniques, we have developed a robust framework suitable for complex tasks such as fault diagnosis in intricate settings. This design not only enhances the model’s analytical performance but also streamlines the computational process, facilitating more rapid and accurate diagnostics and classifications.
- (2)
CB module and CB Integration with Transformer
In the MobileViT framework, the employment of the Transformer architecture significantly enhances the model’s comprehension of intricate image content through the effective capturing of long-range dependencies among image features. This capability is paramount in vision-related tasks as it facilitates the integration of minute local details with expansive global information, thereby refining the overall interpretation of the image’s structure and content. While MobileViT demonstrates exceptional prowess in extracting global information pertinent to fault characteristics, effective troubleshooting under multifaceted operational conditions necessitates a targeted approach to context-specific feature learning. The conventional Transformer architecture, despite its proficiency in managing complex sequential dependencies, does not optimally leverage the global contextual information inherent in high-dimensional image features when applied directly. This limitation can adversely affect the diagnostic capabilities in complex scenarios, thereby diminishing the model’s effectiveness in accurately representing and identifying bearing fault features under challenging operational conditions.
To address this issue, we have refined the Transformer encoder within the MobileViT architecture by incorporating a CB module [
48], as illustrated in
Figure 6. This enhancement aims to augment the model’s capability to discern more profound global information pertinent to bearing fault characteristics under diverse operational conditions, thereby elevating diagnostic effectiveness. Specifically, a CB module is integrated into the Transformer to process and amplify global context information, ensuring that the model leverages a synthesized representation of both global and local data at each processing stage. Initially, the CB module computes the mean of the sequence (or feature map) in the spatial dimension to produce a context vector that encapsulates the global context of the entire image. This vector is subsequently redistributed to each position within the sequence and combined with the original features, thereby enriching the local features with global contextual data. The deployment of this mechanism ensures that each token is imbued with a comprehensive representation of global context prior to progressing to subsequent layers, which is vital for an accurate interpretation of both the global structure and intricate details of the image.
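In code, the CB operation described here reduces to a few lines. The sketch below assumes a token layout of (batch, tokens, dim), as used inside the Transformer encoder; the class name is ours for illustration.

```python
import torch
import torch.nn as nn


class ContextBroadcast(nn.Module):
    """CB sketch: average over the token (spatial) dimension to obtain a global
    context vector, then broadcast it back and add it to every token so each
    position carries global context before the next layer."""

    def forward(self, x):                        # x: (batch, tokens, dim)
        context = x.mean(dim=1, keepdim=True)    # global context vector, (batch, 1, dim)
        return x + context                       # redistribute to every position
```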
We integrate the CB module with the Transformer encoder in the Transformer section, as shown in
Figure 7, and the integrated module is named CB-TransformerEncoder. This module is strategically embedded within the feed-forward network (FFN) layer, subsequent to the self-attention mechanism and local information processing. The placement of CB-TransformerEncoder after the FFN is pivotal, as it introduces a global context subsequent to the processing of local features, ensuring that the integration of global context is predicated on a nuanced and enriched representation of local features. Such enhancement significantly fortifies the model’s ability to merge global and local information, thus augmenting the overall completeness and analytical depth of the model. Within the Transformer architecture, while the self-attention layer is primarily responsible for facilitating local-to-local relationship learning, the introduction of the FFN layer adds a layer of nonlinearity, which is crucial for capturing more complex feature representations. The subsequent integration of the CB module encourages the model to further amalgamate this locally processed information with a global context, leveraging its pre-established comprehension of local features and complex non-linear relationships. This strategic arrangement not only optimizes information flow within the model but also substantially enhances its capacity to process and interpret complex data, thereby extending the learning scope and augmenting diagnostic capabilities in intricate scenarios. Furthermore, positioning the CB module prior to or within the self-attention layer could prematurely merge global context with local features, potentially constraining the self-attention mechanism’s capacity to delineate complex relationships among local details. Conversely, situating the CB module subsequent to the FFN layer allows the model to first exhaustively exploit the self-attention mechanism for enhancing the interrelations among local features. The subsequent incorporation of global context through the CB module thus follows, strategically avoiding any premature interference of global information with the local feature delineation process.
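A sketch of the resulting CB-TransformerEncoder layer follows, with the CB step applied after the FFN as argued above. The pre-norm layout, head count, and MLP ratio are assumptions for illustration, and the layer reuses the ContextBroadcast module from the previous sketch.

```python
import torch.nn as nn


class CBTransformerEncoderLayer(nn.Module):
    """CB-TransformerEncoder sketch: standard pre-norm self-attention and FFN,
    with context broadcasting applied after the FFN so global context is
    injected only once local features have been fully processed."""

    def __init__(self, dim=144, heads=4, mlp_ratio=2.0, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(int(dim * mlp_ratio), dim),
            nn.Dropout(dropout),
        )
        self.cb = ContextBroadcast()   # defined in the previous sketch

    def forward(self, x):              # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # relations among local tokens
        x = x + self.ffn(self.norm2(x))                      # non-linear feature mixing
        return self.cb(x)                                    # inject global context last
```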
The experimental results demonstrate a significant enhancement in the performance of the improved model. Specifically, after 50 training iterations, the diagnostic accuracy exhibits a 4.1% increase relative to the MobileVit model. Moreover, this methodology enables the model not only to discern intricate details among local features but also to integrate global context at each processing stage. Consequently, this enhances the model’s proficiency in identifying bearing fault characteristics under complex operational conditions. This integrated approach of processing both local and global information offers a more holistic and effective strategy for the detailed analysis of image contents.