1. Introduction
Most mineral identification and textural characterization in coarse-grained clastic sediments (such as sandstones) requires microscopic assessment, which involves interpretation and the gathering of counting statistics using a light microscope [1,2,3,4,5,6]. The analysis of sandstone images is a vital part of geological exploration research, as these rocks are rich in hydrocarbon resources [7]. Sedimentary petrology is a specialized field that demands significant expertise to interpret and classify porosity, minerals, matrix materials, and mineral cements, along with their associated subclasses. A range of objective lenses, optical techniques, and light sources is employed to identify and quantify components. Certain components in sandstone can be small, dark, or opaque, posing challenges for interpretation even to an experienced petrologist. In such cases, methods beyond light microscopy, such as scanning electron microscopy or bulk measurement techniques like X-ray diffraction, are often used to assist in mineral identification [8]. Traditionally, the analysis of sandstone thin sections has been conducted primarily by professionals, a process that is labor-intensive, expensive, and subjective.
At present, classic mineral image segmentation technology primarily relies on the low-level visual information of image pixels and is mainly categorized into three types: (1) the threshold-based mineral image segmentation algorithm, which compares each pixel in the input images with a preset threshold value to segment the target areas [9,10,11,12]; (2) the region-based mineral image segmentation algorithm, which divides the original image into different pixel regions, separating the target areas from the background [13,14,15]; and (3) the specific theory-based mineral image segmentation algorithm, which employs more targeted computational methods such as cluster analysis to separate mineral grain images [16]. Although these classic mineral image segmentation methods have improved the efficiency of sandstone analysis, they cannot effectively address adhesion and overlap between adjacent grains, and their segmentation performance for small and irregular grains is relatively poor. Additionally, classic image segmentation algorithms require manual tuning for different types of mineral grains, which lowers the efficiency of mineral image segmentation and increases processing time.
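The threshold-based approach described above can be illustrated with a minimal sketch. The pixel values, the threshold of 128, and the function name `threshold_segment` are illustrative assumptions, not taken from the cited methods.

```python
import numpy as np

def threshold_segment(gray, threshold):
    """Return a boolean mask that is True where intensity exceeds the threshold."""
    gray = np.asarray(gray, dtype=float)
    return gray > threshold

# Toy 3x4 grayscale "image": bright grain pixels on a dark background
# (made-up values for illustration only).
image = np.array([
    [10, 12, 200, 210],
    [11, 190, 205, 15],
    [9, 13, 14, 12],
])
mask = threshold_segment(image, threshold=128)
print(mask.astype(int))
```

Because every bright pixel receives the same label, two touching grains end up in one connected region, which is the adhesion problem noted above.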
Owing to the rapid development of deep learning and the exceptional feature extraction capabilities of CNNs (convolutional neural networks), deep learning-based approaches have been increasingly employed in mineral image segmentation. These methods, diverging from traditional ones, adhere to an end-to-end paradigm and significantly surpass conventional methods in accuracy and efficiency. Furthermore, they are data-driven methods (i.e., the larger the dataset or the more refined the data analysis, the higher the achievable accuracy), allowing for enhanced performance with continuous data expansion. Some studies have explored and applied them in mineral image segmentation tasks. An RDU (R: residual connection; DU: DUNet) ore image segmentation model was proposed to estimate the grain size of ore fragments on conveyor belts; it adjusts the receptive field adaptively according to the size and shape of different ore fragments and achieves accurate segmentation [17]. Deep learning-based methods for mineral image segmentation also excel in segmenting adherent, overlapping, and multi-scale mineral grains, effectively addressing the typical challenges encountered in previous approaches [18]. However, existing image segmentation methods mainly focus on semantic segmentation, which cannot meet the requirement of computing the morphological data needed in mineral analysis, such as circularity, particle size, and contact relationship. Furthermore, these models are mostly designed for scenarios with clean image backgrounds, whereas in sandstone analysis the image backgrounds are complex and filled with indistinguishable fillers, which makes it challenging to segment the grains [19].
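To make the morphological quantities mentioned above concrete, the following minimal sketch computes circularity and an equivalent diameter for a single grain outline. It assumes the common definitions circularity = 4πA/P² and equivalent diameter = 2√(A/π), and it assumes the grain outline is available as a polygon; the function names are hypothetical and the paper does not specify its own formulas.

```python
import math

def polygon_area_perimeter(vertices):
    """Shoelace area and perimeter of a closed polygon given as (x, y) vertices."""
    twice_area = 0.0
    perimeter = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return abs(twice_area) / 2.0, perimeter

def grain_metrics(vertices):
    """Return area, perimeter, circularity, and equivalent diameter of one grain."""
    area, perimeter = polygon_area_perimeter(vertices)
    return {
        "area": area,
        "perimeter": perimeter,
        # Assumed definition: 1.0 for a perfect circle, smaller for irregular shapes.
        "circularity": 4.0 * math.pi * area / perimeter ** 2,
        # Diameter of the circle with the same area as the grain.
        "equivalent_diameter": 2.0 * math.sqrt(area / math.pi),
    }

# A 2 x 2 square stands in for one grain outline.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
metrics = grain_metrics(square)
print(metrics)
```

Such per-grain measurements require one outline per grain, which is why instance segmentation rather than semantic segmentation is needed.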
To address the above problems and further improve the application prospects of automatic sandstone analysis, we make the following contributions. First, we propose a high-accuracy instance segmentation model based on Mask R-CNN [20] that introduces a hybrid attention mechanism to better adapt to the complex shapes of sandstone particles. Second, we design a shape-aware training loss function for the improved model and conduct an ablation experiment to confirm its advantages. Third, the original convolution is replaced with dilated convolution to capture more global information. Finally, we establish a sandstone instance segmentation dataset with 40,122 grain labels.
2. Methodology
This section introduces the methods and ideas behind the improved model of the sandstone image segmentation system. In the model-building stage, backbone selection, module setting, and loss function design are the main aspects considered to improve segmentation performance and to resolve edge blurring and incomplete segmentation.
2.1. Mask R-CNN
The deep learning-based Mask R-CNN segmentation method has superior performance and can achieve good results in the sandstone microscopic image segmentation task [21]. We customize and optimize the Mask R-CNN network to form a task-specific model, based on the distribution characteristics of sandstone grains in sandstone microscopic images.
The Mask R-CNN network is divided into three steps: object localization, object category calculation, and segmentation mask prediction. The process is as follows: the image passes through the ResNet backbone network, and feature maps at different levels are obtained using the feature pyramid, which helps the model extract features at different scales. The feature maps then enter the region proposal network to generate candidate areas where grains might be present. On the one hand, a classifier determines whether each candidate area contains a sandstone grain or background. On the other hand, a box regressor corrects the boundaries of the sandstone grains. The combination of these two steps forms the candidate target areas. Finally, a fully convolutional network produces accurate segmentation results, completely extracting the sandstone grains. The processing procedure is shown in Figure 1.
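The box-regression step can be illustrated with a minimal sketch. It assumes the standard R-CNN box parameterization, in which predicted offsets (dx, dy, dw, dh) shift the box centre in proportion to the box size and scale the width and height exponentially; the function name `decode_box` and the sample numbers are hypothetical.

```python
import math

def decode_box(anchor, deltas):
    """Apply (dx, dy, dw, dh) offsets to a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the centre in proportion to the box size.
    cx, cy = cx + dx * w, cy + dy * h
    # Scale width and height exponentially so they stay positive.
    w, h = w * math.exp(dw), h * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

# A candidate box 20 px wide and 40 px tall, nudged right by 10% of its width.
anchor = (10.0, 10.0, 30.0, 50.0)
refined = decode_box(anchor, (0.1, 0.0, 0.0, 0.0))
print(refined)
```

With all offsets equal to zero, the function returns the candidate box unchanged, so the regressor only has to learn small corrections.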
2.2. SE-Net
The Squeeze-and-Excitation Network (SE-Net [22]) introduces the concept of channel attention. By modeling feature channels and assigning weights to them, it yields parameters that can be learned and updated, continuously increasing the weights of useful feature channels to improve the generalization capability of the model. In this paper, SE-Net is added to the backbone recognition network to improve the spatial information processing capability of the model and therefore achieve better results for sandstone grains at different scales. The specific structure of SE-Net is shown in Figure 2.
The first operation is the $F_{tr}$ transformation, which converts the input x with $C_1$ feature channels, through a series of general transformations such as convolution, into a feature with $C_2$ feature channels. The $F_{tr}$ operation is shown in Equation (1).
The Squeeze operation is shown in Equation (2): each two-dimensional feature channel is compressed into a single real number, and these numbers are concatenated into a one-dimensional vector that captures the global distribution and gives the model a global receptive field. The length of the vector equals the number of input feature channels.
Next is the Excitation operation, in which the parameter W is used to generate the weights for each feature channel. First, $W_1$ (of dimension C/r × C) is multiplied by z, which reduces the number of channels by the ratio r and thus lowers the computational complexity. The resulting $W_1 z$ has dimension 1 × 1 × C/r; it is then passed through the ReLU layer and multiplied by $W_2$ (of dimension C × C/r) to obtain a feature map of dimension 1 × 1 × C. Finally, the Sigmoid function is applied to generate the weights s of the feature maps (C in total). The parameter s incorporates the feature map information of each feature channel and is part of the neural network, so it can be learned and optimized.
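Finally, the weights calculated by the model are applied channel by channel to the original feature map through the scale operation, in which each channel $u_c$ of the feature map produced by $F_{tr}$ is multiplied by its weight $s_c$:
$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$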
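The squeeze, excitation, and scale steps described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the two fully connected layers are plain matrix products without bias terms, the weights are random rather than learned, and the function name `se_block` and the sizes C = 8, r = 4 are chosen only for the example.

```python
import numpy as np

def se_block(u, w1, w2):
    """Channel attention on u of shape (C, H, W).

    w1 has shape (C // r, C) and w2 has shape (C, C // r).
    Returns the reweighted feature map and the channel weights s.
    """
    z = u.mean(axis=(1, 2))                    # squeeze: one number per channel, (C,)
    hidden = np.maximum(w1 @ z, 0.0)           # ReLU(W1 z), shape (C // r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid(W2 ...), shape (C,)
    return u * s[:, None, None], s             # scale: multiply each channel by s_c

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 4
u = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1   # random stand-ins for learned weights
w2 = rng.standard_normal((C, C // r)) * 0.1
out, s = se_block(u, w1, w2)
print(out.shape, s.round(3))
```

The output has the same shape as the input, so the block can be dropped into a backbone stage without changing the surrounding layers.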
2.3. Coordinate Attention and Spatial Attention
The channel attention mechanism models global information through channels, allowing the model to effectively extract sandstone grains. During extraction, the location information of the sandstone grains directly affects the accuracy of the boundary fit. In this paper, a CA + SP hybrid attention mechanism is proposed to optimize the model by combining the coordinate attention mechanism [23] and the spatial attention mechanism [24].
Coordinate attention applies an operation similar to channel attention along the horizontal and vertical directions separately, producing relatively independent feature maps in the two directions. This effectively preserves one-dimensional location information and establishes long-range spatial dependence along each dimension. The mechanism is very sensitive to coordinate information and can effectively pinpoint spatial positions. Similar to the channel attention mechanism, the coordinate attention mechanism first performs coordinate position encoding and then generates coordinate attention. A coordinate attention module can be seen as a computational unit used to augment feature representation capabilities. It can take any intermediate tensor $X = [x_1, x_2, \ldots, x_C] \in \mathbb{R}^{C \times H \times W}$ as input and produces an output $Y = [y_1, y_2, \ldots, y_C]$ of the same size with enhanced representational power. The horizontal and vertical directions are treated separately and computed using pooling kernels of dimensions (H, 1) and (1, W), giving the following outputs for the c-th channel at height h and at width w:
$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$
The two feature maps generated by the previous step are concatenated and then transformed using a shared 1 × 1 convolution $F_1$, as expressed in Equation (5); the generated $f \in \mathbb{R}^{C/r \times (H + W)}$ is the intermediate feature map during the computation, where δ is a non-linear activation function and r denotes the down-sampling ratio, which is used to control the size of the module, as in the SE module.
$f = \delta(F_1([z^h, z^w]))$ (5)
Next, f is split along the spatial dimension into two direction-independent tensors, $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two 1 × 1 convolutions $F_h$ and $F_w$ transform them to the same number of channels as the input, as in the following equation, where σ is the Sigmoid function:
$g^h = \sigma(F_h(f^h)), \qquad g^w = \sigma(F_w(f^w))$
The two are then expanded and used as attention weights, and the final output of the CA module is
$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$
The spatial attention mechanism mimics the way the human eye focuses on regions of interest. It generates a mask over the spatial dimensions through a separate branch, which undergoes operations similar to those described above, so that relative spatial information is extracted effectively.
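The coordinate attention steps described in this section (directional pooling, shared transformation, splitting, and reweighting) can be sketched as follows. This is a simplified illustration, not the implementation used in this paper: each 1 × 1 convolution is written as a matrix product over the channel axis, ReLU stands in for the non-linear activation, normalization layers are omitted, and the function name `coordinate_attention` and all sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def coordinate_attention(x, f1, fh, fw):
    """Coordinate attention on x of shape (C, H, W).

    f1 has shape (C // r, C); fh and fw have shape (C, C // r).
    """
    C, H, W = x.shape
    z_h = x.mean(axis=2)                              # (C, H): average over width
    z_w = x.mean(axis=1)                              # (C, W): average over height
    cat = np.concatenate([z_h, z_w], axis=1)          # (C, H + W)
    f = np.maximum(f1 @ cat, 0.0)                     # shared transform, (C // r, H + W)
    f_h, f_w = f[:, :H], f[:, H:]                     # split back into the two directions
    g_h = sigmoid(fh @ f_h)                           # (C, H) attention along height
    g_w = sigmoid(fw @ f_w)                           # (C, W) attention along width
    return x * g_h[:, :, None] * g_w[None, :, :].transpose(1, 0, 2)

rng = np.random.default_rng(0)
C, H, W, r = 8, 5, 6, 4
x = rng.standard_normal((C, H, W))
f1 = rng.standard_normal((C // r, C)) * 0.1   # random stand-ins for learned weights
fh = rng.standard_normal((C, C // r)) * 0.1
fw = rng.standard_normal((C, C // r)) * 0.1
y = coordinate_attention(x, f1, fh, fw)
print(y.shape)
```

Each output value is the input value multiplied by one weight that depends on its row and one that depends on its column, which is how the module keeps positional information that global pooling would discard.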
2.4. Dilated Convolution
Up-sampling by bilinear interpolation introduces large errors, leading to problems such as distortion when generating sandstone grain contours; this paper refines the sandstone contours by introducing dilated convolution. The dilated convolution [25] has the following main functions:
(1) Expanding the receptive field of the Mask R-CNN network more efficiently while preserving image resolution;
(2) Changing the size of the convolution kernel and the receptive field of the model by adjusting the expansion rate (r) to obtain multi-scale global semantic information.
The dilated convolution is shown in Figure 3. The dilated convolution inserts zeros between the elements of an ordinary convolution kernel to increase the size of the receptive field, so that only the non-zero elements play a role in the calculation, as shown in Equation (9).
K is the size of the expanded convolution kernel, k is the size of the original convolution kernel, and r represents the expansion rate. As shown in the figure, when k = 5 and r = 2, compared with the ordinary 5 × 5 convolution on the left, the dilated convolution expands the convolution kernel to 11 × 11, and the receptive field is considerably enlarged.
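The zero-filling idea can be illustrated with a short sketch. It assumes the commonly used relation K = k + (k − 1)(r − 1), in which r − 1 zeros are inserted between neighbouring kernel elements; this convention and the function names are assumptions, since the exact form of the relation used in this paper is not reproduced here.

```python
import numpy as np

def effective_kernel_size(k, r):
    """Size of a k x k kernel after dilation with rate r (assumed convention)."""
    return k + (k - 1) * (r - 1)

def dilate_kernel(kernel, r):
    """Insert r - 1 zeros between neighbouring elements of a square kernel."""
    k = kernel.shape[0]
    size = effective_kernel_size(k, r)
    dilated = np.zeros((size, size), dtype=kernel.dtype)
    dilated[::r, ::r] = kernel  # original weights land on every r-th position
    return dilated

kernel = np.ones((3, 3))
dilated = dilate_kernel(kernel, r=2)
print(dilated)
print(effective_kernel_size(3, 2))
```

The number of non-zero weights stays at k × k, so the receptive field grows without extra parameters. Under this assumed convention, k = 5 and r = 2 give K = 9, so the 11 × 11 size quoted above would correspond to a different convention for r.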
5. Discussion
Channel attention can enhance the network’s ability to extract image information. Coordinate attention can improve the model’s ability to locate boundaries. Spatial attention can enhance the model’s receptive capability and improve its generalization ability, that is, its performance on images that have not been used for training. By introducing the hybrid attention mechanism, the model proposed in this paper surpasses other models on SMISD. Because the goal of the task is to make the boundaries of the grain segmentation results more precise, and this goal serves as the performance evaluation index, a shape-aware loss function can improve the model’s segmentation of grain contours, especially for irregular grains. Therefore, the model adopts Log-Cosh DiceLoss as its loss function. Finally, the dilated convolution expands the receptive field and segments grains by combining more regional information from their surroundings, which makes the model perform better. Additionally, the dataset with richly labeled grains helps the model fully learn the segmentation rules, which is vital for data-driven methods.
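For reference, the Log-Cosh Dice loss named above is commonly defined as log(cosh(DiceLoss)), where DiceLoss = 1 − Dice coefficient. The sketch below follows that common definition; the smoothing constant `eps`, the function names, and the toy masks are assumptions, and the exact variant used in this paper may differ.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient between a predicted and a ground-truth mask."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    intersection = np.sum(pred * target)
    dice = (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    return 1.0 - dice

def log_cosh_dice_loss(pred, target, eps=1e-6):
    """Smooth the Dice loss with log(cosh(.)) (assumed common definition)."""
    return float(np.log(np.cosh(dice_loss(pred, target, eps))))

# Toy 2x3 ground-truth mask; a perfect and a fully wrong prediction.
target = np.array([[0, 1, 1], [0, 1, 0]])
perfect = log_cosh_dice_loss(target, target)
poor = log_cosh_dice_loss(1 - target, target)
print(perfect, poor)
```

The loss is zero for a perfect overlap and grows smoothly as the overlap shrinks, which gives well-behaved gradients near the optimum.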