1. Introduction
The maturity of tobacco leaves is pivotal in determining the quality and characteristics of cigarette products [1], thereby becoming a key factor that affects the quality of the leaves. Not only does it correlate with the chemical composition, aroma, and flavor of tobacco leaves [2,3], but it also considerably influences the sustainability of the tobacco industry and the economic welfare of tobacco farmers [4,5]. Maturity serves as a critical indicator determining the harvesting period for tobacco leaves [6]. However, prevailing manual classification methods are inefficient, costly, and susceptible to bias. Overripe or underripe tobacco leaves may lead to decreased yield, reduced quality, and increased difficulty in curing [7], thereby impacting the earnings of tobacco farmers and the overall competitiveness of the tobacco industry. Therefore, accurately determining the optimal harvest time for tobacco leaves and establishing an objective tobacco leaf maturity evaluation system is vital for elevating the quality of tobacco leaf production, aiding tobacco farmers in maximizing profits, and maintaining the stable development of the tobacco industry while securing high-quality tobacco leaves [8].
In recent years, researchers have explored various technologies to measure the maturity of tobacco leaves. Yu et al. [9] employed electrochemical fingerprint technology to classify tobacco types and monitor the growth status of tobacco leaves. Chen et al. [10] combined near-infrared (NIR) spectroscopy with convolutional neural network (CNN) deep learning to recognize the maturity of the upper, middle, and lower parts of tobacco leaves, reaching accuracies of 96.18%, 95.2%, and 97.31%, respectively. Lu et al. [11] used hyperspectral imaging combined with an SNV-SPA-PLS-DA model, raising the validation-set and prediction accuracies of tobacco leaf maturity classification to 99.32% and 98.46%, respectively. These findings underscore the efficacy of visible/NIR hyperspectral imaging for detecting tobacco leaf maturity. However, these methods recognize maturity after harvesting, which destroys the leaves. Although spectral technology offers non-destructive identification of maturity, spectrometers are expensive, lack portability, and are susceptible to environmental interference. With the rapid evolution of machine vision and smart sensing technology, these methodologies show substantial potential in agricultural applications such as plant disease detection, fruit maturity evaluation, and yield prediction, offering more convenient and efficient solutions for agricultural operations [12,13]. Mallikarjuna et al. [14] proposed a selective tobacco leaf harvesting method based on texture features, validating the effectiveness of texture analysis in distinguishing mature from immature leaves. Color features also play an integral role in tobacco leaf maturity recognition [15]. Drawing on these studies, Mallikarjuna et al. [16] proposed a maturity evaluation method combining filters and color models, effectively evaluating tobacco leaf maturity and thereby assisting the development of efficient automatic harvesting systems. Nonetheless, research on in-field tobacco leaf maturity recognition under complex backgrounds and variable weather conditions is still limited, and traditional feature extraction methods may have limited accuracy and generalization capability under these conditions [17,18,19,20,21]. Compared with traditional machine learning methods, deep learning methods, particularly convolutional neural networks (CNNs), have marked advantages in image classification [22,23] and object detection, as they can automatically learn and extract image features [24]. Leveraging the ubiquity and convenience of mobile devices together with deep learning, Chen et al. [25] established a practical solution for evaluating tobacco leaf maturity. Although Li et al. [26] proposed an improved lightweight deep learning network for recognizing tobacco leaf maturity to overcome the limitations of traditional methods, their research was still conducted post-harvest and does not fully replicate the in situ working conditions of field tobacco leaves. Therefore, developing a high-accuracy tobacco leaf maturity recognition method suitable for complex environments still holds significant value.
High-definition lenses and optical sensors are capable of capturing the rich texture and spectral features of crops [27]. Hence, in response to the limited research on in situ tobacco leaf maturity recognition in the field, this study used a mobile phone as the image collection tool and proposes a method that integrates machine vision with the MobileNetV1 deep convolutional network for in situ tobacco leaf maturity recognition. The authenticity of field operation conditions was maintained during image collection. To accommodate the size differences between tobacco leaves, we introduced a Feature Pyramid Network (FPN) structure to capture and merge features at different scales, thereby enhancing feature expression. By integrating an attention mechanism, the model focuses on key information and maintains high recognition performance in complex backgrounds and weather conditions. The improved MobileNetV1 model was then trained and validated to ensure its accuracy and robustness under different environmental conditions, and its advantages and limitations were evaluated by comparison with other technologies. This study aimed to overcome the limitations of sensory evaluation of tobacco leaf maturity, provide accurate, reliable, and scientific aids for tobacco leaf harvesting, and establish a maturity discrimination model.
3. Results and Discussion
3.1. Experiment Environment and Parameters
The experiments were conducted in a Python 3.7 environment with TensorFlow-GPU 2.6.0. Training was performed in the PyCharm IDE on hardware with an 8-core Intel Core i5-12450H processor and 16 GB of RAM. The batch size was set to 32, and training was run for 30 epochs.
3.2. Optimizer and Learning Rate Selection
Choosing the appropriate optimizer and learning rate has a significant impact on model performance and accuracy. In deep learning, the optimizer is responsible for controlling the update of model parameters, while the learning rate determines the magnitude of each parameter update. A learning rate that is too small can cause the objective function to decrease slowly, leading to slow model convergence or inability to converge to the optimal solution; a learning rate that is too large may cause oscillation near the optimal solution or lead to the explosion of the objective function, resulting in the model failing to converge or overfitting. The choice of optimizer also affects model performance and accuracy, as different optimizers have different characteristics and strengths.
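The trade-off described above can be illustrated with a toy gradient-descent loop on f(x) = x² (a hypothetical example, unrelated to the study's training code):

```python
# Illustrative sketch (not from the paper): gradient descent on f(x) = x^2
# shows why the learning rate matters. Values are hypothetical, chosen only
# to mirror the behavior described in the text.

def gradient_descent(lr, steps=50, x0=5.0):
    """Minimize f(x) = x^2 (gradient 2x) and return the final distance |x| from the optimum."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x  # parameter update scaled by the learning rate
    return abs(x)

too_small = gradient_descent(lr=0.001)  # barely moves: slow convergence
balanced = gradient_descent(lr=0.1)     # converges close to the optimum x = 0
too_large = gradient_descent(lr=1.1)    # |1 - 2*lr| > 1: updates oscillate and diverge

print(too_small, balanced, too_large)
```

With the too-small rate the iterate has barely left the starting point after 50 steps; with the too-large rate each update overshoots the optimum and the magnitude grows without bound, mirroring the oscillation and divergence described above.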
To further investigate the impact of model parameters on network accuracy, we chose three optimizers—Stochastic Gradient Descent (SGD), RMSprop, and Adaptive Moment Estimation (Adam)—and applied them to the pre-trained MobileNetV1 tobacco leaf maturity recognition model integrated with an FPN and attention mechanism. Initial learning rates of 0.005, 0.0005, and 0.00005 were used for in situ tobacco leaf maturity recognition training. These three optimizers are among the most commonly used in deep learning and have achieved good performance across many types of neural networks. The Adam and RMSprop optimizers adaptively adjust the learning rate, effectively controlling the size of parameter updates, and offer faster convergence and better generalization. In contrast, the SGD optimizer may require more training steps to reach the optimal solution but can avoid overfitting to some extent.
The impact of the optimizer and learning rate on the improved MobileNetV1 network model is shown in Figure 6. The Adam optimizer outperformed the SGD and RMSprop optimizers at all learning rates. At a learning rate of 0.0005 in particular, inspection of the last five epochs shows that the Adam optimizer converged quickly and achieved the highest accuracy and lowest loss, 96.3% and 0.13, respectively, indicating that it struck a good balance between learning speed and model performance. In comparison, although the SGD optimizer showed reasonable accuracy at a learning rate of 0.005, its loss was higher than that of Adam and RMSprop, and its convergence was slower. The RMSprop optimizer, at a learning rate of 0.0005, reached accuracy and loss values similar to Adam's but converged at a moderate speed. Overall, the Adam optimizer maintained the best performance and convergence speed at a learning rate of 0.0005. Therefore, when training the tobacco leaf maturity recognition model, we used the Adam optimizer with a learning rate of 0.0005 to achieve high performance in the shortest time.
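For reference, the Adam update rule underlying the chosen optimizer can be sketched in a few lines (a minimal numpy illustration, not the paper's training code; the toy usage below uses a larger learning rate than the study's 0.0005 so the example converges in few steps):

```python
import numpy as np

# Minimal sketch of the Adam update rule: exponentially averaged first and
# second moments with bias correction. Hyperparameter defaults follow the
# common convention; this is an illustration, not the study's implementation.

def adam_step(theta, grad, m, v, t, lr=0.0005, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return theta, m, v

# Usage: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([3.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)  # approaches the optimum at the origin
```

The per-parameter scaling by the square root of the second moment is what lets Adam adapt the effective step size, matching the adaptive behavior credited to Adam and RMSprop above.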
3.3. Ablation Study
The aim of this study was to improve the accuracy of the MobileNetV1 model by introducing an attention mechanism and FPN layers. To verify the individual contributions of these improvements and their impact on model performance, we conducted an ablation study, the results of which are presented in Table 1.
From Table 1, we can see that the MobileNetV1 model with the FPN structure had improved accuracy on both the validation and test sets, increasing from 95.37% to 96.06% and from 94.47% to 95.39%, respectively. This indicates that the FPN structure can effectively implement multi-scale feature fusion, enhancing the model’s generalization capability. When we introduced the SE attention mechanism on this basis, although the accuracy on the validation set decreased compared to the base model, the accuracy on the test set improved. This suggests that while the SE mechanism may lead to overfitting of the training data, it does indeed enhance the model’s recognition accuracy in complex scenarios by improving its ability to adaptively calibrate channel features.
We can also see that when the SP mechanism was introduced, the model’s accuracy on the validation set improved further, but the accuracy on the test set decreased slightly, possibly due to the SP mechanism overfitting the training data. However, when we integrated both the SP and SE attention mechanisms into the model, the accuracy on the validation and test sets reached the highest values of 96.3% and 96.31%, respectively. This suggests that while each mechanism has shortcomings when used independently, together they complement each other, jointly enhancing tobacco leaf maturity recognition and exhibiting excellent generalization capability.
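As a point of reference, the channel-recalibration idea behind the SE mechanism can be sketched in numpy (the weights here are random stand-ins for the learned parameters; `r` is the usual SE channel-reduction ratio, an assumption for illustration):

```python
import numpy as np

# Minimal numpy sketch of a squeeze-and-excitation (SE) block as used in the
# ablation above. In the real model the two fully connected layers are learned;
# here their weights are random placeholders.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(feat, w1, w2):
    """feat: (H, W, C) feature map -> channel-recalibrated feature map."""
    z = feat.mean(axis=(0, 1))               # squeeze: global average pool -> (C,)
    s = sigmoid(np.maximum(z @ w1, 0) @ w2)  # excitation: FC-ReLU-FC-sigmoid -> (C,)
    return feat * s                          # scale each channel by its learned weight

rng = np.random.default_rng(0)
H, W, C, r = 4, 4, 8, 2
feat = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C, C // r))        # reduction weights (C -> C/r)
w2 = rng.standard_normal((C // r, C))        # expansion weights (C/r -> C)
out = se_block(feat, w1, w2)
print(out.shape)  # (4, 4, 8): same shape, channels reweighted
```

Because the sigmoid gate lies in (0, 1), each output channel is a damped copy of its input, which is the "adaptive channel calibration" the ablation credits to SE.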
To more intuitively show how these improvements affect the model’s decision-making process, we performed feature visualization analysis on the layer prior to the global average pooling layer. This layer, close to the network output, contains more discriminative features that are crucial for the classification decision; because it precedes the global average pooling operation, its feature maps still have spatial dimensions, so the distribution of each feature across the image can be seen. To understand how this layer responds to input images, we overlaid its output on the original input images. We took the maximum activation value across all channels at each position to generate a new two-dimensional map, resized it to match the original image, mapped its grayscale values to colors, and blended the original input image with the color-mapped feature map. In the resulting image, shown in Figure 7, blue areas indicate weaker responses from the model at those locations, while red areas indicate stronger responses. This blended display effectively reveals which areas of the image the model reacted strongly to, thereby helping us understand the model’s decision-making process.
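The overlay procedure just described can be sketched in a few lines (a simplified stand-in: nearest-neighbor resizing and a basic blue-to-red color map replace whatever interpolation and colormap were actually used, and all array sizes are illustrative):

```python
import numpy as np

# Sketch of the visualization: max activation across channels, resize to the
# image size, normalize to [0, 1], then alpha-blend a color map onto the input.

def max_activation_map(feat):
    """feat: (h, w, C) feature maps -> (h, w) maximum over channels."""
    return feat.max(axis=-1)

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor upsampling of a 2D map (simplified stand-in)."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows][:, cols]

def overlay(image, feat, alpha=0.5):
    """image: (H, W, 3) in [0, 1]; feat: (h, w, C). Returns the blended (H, W, 3) image."""
    amap = max_activation_map(feat)
    amap = resize_nearest(amap, image.shape[0], image.shape[1])
    amap = (amap - amap.min()) / (np.ptp(amap) + 1e-8)   # normalize to [0, 1]
    # toy blue->red color map: red channel = activation, blue = 1 - activation
    heat = np.stack([amap, np.zeros_like(amap), 1.0 - amap], axis=-1)
    return (1 - alpha) * image + alpha * heat

rng = np.random.default_rng(1)
image = rng.random((32, 32, 3))
feat = rng.standard_normal((8, 8, 16))
blended = overlay(image, feat)
print(blended.shape)  # (32, 32, 3)
```

High-activation positions come out red and low-activation ones blue, matching the color convention of Figure 7 as described in the text.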
Observations revealed significant differences in the feature visualization between the MobileNetV1 model with integrated FPN and the original model, particularly in the layer preceding the global average pooling. This disparity can be attributed to FPN serving as an efficient multiscale spatial feature extraction network capable of capturing image features across multiple scales. This characteristic enables FPN to extract more refined and diverse features from the image, such as clear edge and shape information. Consequently, the integrated FPN in the MobileNetV1 model may present a greater abundance of line information, resembling a heatmap created by line contours.
Subsequently, attempts were made to further incorporate SP mechanisms and SE attention mechanisms into the FPN-based model. However, these improvements, when applied to the recognition of tobacco leaf maturity, exhibited varying degrees of excessive focus on background elements, such as weeds. In contrast, the enhanced model that integrates FPN, SP mechanisms, and SE attention mechanisms primarily emphasized the features of the tobacco leaves themselves, providing a more favorable basis for tobacco leaf maturity recognition.
These experimental results highlight the significance of FPN, SP, and SE mechanisms in the improved model. FPNs, by integrating features at different scales, help the model capture richer contextual information. The SE enhances the model’s recognition accuracy in complex scenarios by modeling dependencies between channels. The SP enables the model to automatically focus on important parts of the input feature map. When these improvements are combined, the model demonstrates significant performance improvements in identifying the maturity of tobacco leaves under complex weather and background conditions.
3.4. Performance Evaluation of Different Models
To further verify the recognition performance of the improved model, we conducted recognition tests on the maturity of tobacco leaves and compared it with classic models such as MobileNetV1, MobileNetV2, MobileNetV3, VGG16, VGG19, ResNet50, EfficientNetB0, and EfficientNetB1. The comparison results are shown in Table 2.
Firstly, in terms of recognition accuracy, both MobileNetV3Large and the enhanced MobileNetV1 stood out, achieving a test accuracy of 96.3%, positioning them as leaders in the comparison. They were followed by EfficientNetB1 and MobileNetV3Small, at 95.85% and 94.93%, respectively. In precision and recall, the enhanced MobileNetV1 model again performed best, with scores of 96.47% and 96.31%, respectively, underscoring its capacity to correctly identify true positives and retrieve the majority of actual positive cases. The MobileNetV3Large model ranked second, with precision and recall of 96.31% and 96.29%, respectively. For the F1 score, a balanced measure of precision and recall, MobileNetV3Large led at 96.46%, with the enhanced MobileNetV1 close behind at 96.33%; EfficientNetB1 and MobileNetV1 followed, with F1 scores of 95.79% and 94.46%, respectively. Regarding mean average precision, the top performers were the enhanced MobileNetV1 and MobileNetV3Large, both at 96.31%, followed by MobileNetV3Small and MobileNetV1, at 94.93% and 94.47%, respectively. This metric reflects the model’s precision averaged across all maturity classes.
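For reference, the reported accuracy, macro-averaged precision, recall, and F1 score can all be computed from a confusion matrix, as sketched below (the 3×3 matrix is made-up toy data, not the study's results):

```python
import numpy as np

# Macro-averaged classification metrics from a confusion matrix.
# cm[i, j] = number of samples of true class i predicted as class j.

def macro_metrics(cm):
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)   # per class: TP / predicted positives
    recall = tp / cm.sum(axis=1)      # per class: TP / actual positives
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()

# Toy 3-class confusion matrix (illustrative counts only).
cm = np.array([[50,  2,  1],
               [ 3, 45,  2],
               [ 0,  4, 43]])
acc, p, r, f1 = macro_metrics(cm)
print(round(acc, 4))  # 0.92
```

Macro averaging weights every maturity class equally, which is the usual convention when class counts are roughly balanced, as in a stratified test set.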
Secondly, regarding model weight size, MobileNetV3Small stood out as the smallest model, at only 6.7 M. In terms of model size, MobileNetV2 closely follows with a size of 9.5 M, succeeded by MobileNetV1 and its enhanced version, sized at 13.2 M and 13.7 M, respectively. These compact models have distinct advantages in storage-constrained environments. Lastly, in terms of processing speed (FPS), MobileNetV1 achieved the highest speed, with MobileNetV2 and the enhanced MobileNetV1 following closely behind.
In conclusion, the enhanced MobileNetV1 model exhibits superior performance across several metrics, including test accuracy, precision, recall, F1 score, and mean average precision. While the enhanced MobileNetV1 and MobileNetV3Large attained identical recognition accuracy, the former lags slightly behind MobileNetV1 and MobileNetV2 in processing speed and model weight size, yet it still significantly outperforms larger models such as MobileNetV3Large, ResNet50, and VGG16 on these measures. The enhanced MobileNetV1 thus maintains high performance while requiring less storage and lower computational complexity, giving it significant advantages and broad application potential in this task. All in all, the enhanced MobileNetV1 proved to be an effective and efficient model, well suited for recognizing tobacco leaf maturity under complex in situ field conditions and backgrounds, especially in resource-constrained environments.
3.5. Model Robustness Evaluation against Real-World Application Challenges
Model robustness is an important metric to assess its sensitivity to small variations in input data. Research on model robustness primarily focuses on how to reduce the model’s sensitivity to input data without compromising model performance. In the practical application of on-site tobacco leaf maturity recognition, the complex field environment imposes high demands on the model’s generalizability. Common environmental disturbances include noise such as raindrops or mud spots on tobacco leaves, which add uncertainty and randomness to the environment. In addition, changes in ambient light can cause changes in image brightness, which is an important factor affecting image recognition. Furthermore, in practical applications, the target object may be partially or completely obscured by other objects. In our research, tobacco leaves might overlap, causing occlusion.
Therefore, we comprehensively evaluated the model’s robustness through three types of experiments (noise disturbance, brightness transformation, and occlusion) to cope with the complex and changing environmental conditions of practical applications. In the noise disturbance experiment, Gaussian noise was applied at intensities of 0.5 and 0.1. In the brightness transformation experiment, brightness factors of 0.8 and 1.2 were used. In the occlusion experiment, occlusion ratios of 0.3 and 0.5 were adopted. For each experiment, we applied the perturbation to 30%, 70%, and 100% of the dataset to ensure the comprehensiveness of the results. These experiments simulate various challenges that might be encountered in real-world situations, with the aim that the model can still maintain high performance when facing them.
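The three perturbations can be sketched as simple image transforms (a hypothetical implementation: the square occlusion shape and the clipping behavior are assumptions for illustration, with parameter values taken from the text):

```python
import numpy as np

# Sketches of the robustness perturbations: additive Gaussian noise,
# brightness scaling, and random-square occlusion. Images are float arrays
# in [0, 1].

rng = np.random.default_rng(42)

def add_gaussian_noise(img, sigma=0.5):
    """Add zero-mean Gaussian noise of standard deviation `sigma`."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def adjust_brightness(img, factor=0.8):
    """Scale pixel intensities: 0.8 darkens, 1.2 brightens."""
    return np.clip(img * factor, 0.0, 1.0)

def occlude(img, ratio=0.3):
    """Zero out a randomly placed square covering `ratio` of the image area."""
    h, w = img.shape[:2]
    side = int(np.sqrt(ratio * h * w))
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    out = img.copy()
    out[top:top + side, left:left + side] = 0.0
    return out

img = rng.random((64, 64, 3))
noisy = add_gaussian_noise(img, sigma=0.5)
dark = adjust_brightness(img, factor=0.8)
occluded = occlude(img, ratio=0.5)
```

Applying a transform to 30%, 70%, or 100% of the test images then corresponds to the dataset proportions listed above.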
As shown in Figure 8, Figure 9 and Figure 10, the experimental data indicate that the improved model demonstrated good robustness under different noise intensities, brightness changes, and occlusion ratios. Whether for lower, middle, or upper leaves, at various maturity stages, the model’s recognition accuracy remained high. In the noise disturbance experiment, even when the noise intensity increased to 0.5 and more than 70% of the dataset was perturbed, the model still maintained a relatively high accuracy, with only a slight decrease in the recognition of mature leaves. In the brightness transformation experiment, the model stably identified the maturity of tobacco leaves at brightness factors of both 1.2 and 0.8. At a brightness factor of 0.8, the recognition accuracy for mature upper leaves and overripe lower leaves decreased slightly, possibly due to information loss caused by low brightness. In the occlusion test, the model also performed well: even under extensive occlusion, recognition accuracy remained high in most cases. However, recognition of mature and overripe lower leaves declined when the occlusion ratio was 0.5 and 100% of the dataset was occluded, likely because excessive occlusion removed information important for maturity judgment. Overall, the improved model demonstrated good robustness against noise, occlusion, and brightness changes, and could accurately identify tobacco leaves of different positions and maturities.
3.6. Analysis and Visualization of Maturity Recognition Results
To analyze the performance of the improved model in recognizing different maturity levels of tobacco leaves in different parts of the plant, we used a confusion matrix and Score-CAM visualization methods to display and interpret the model’s prediction results.
From the confusion matrices of the model before and after improvement in Figure 11, we can observe that the improved model performed well in all categories, especially in recognizing lower immature, middle immature, middle overripe, upper immature, and upper overripe leaves, with only a few samples misclassified; only a small number of errors occurred when identifying lower mature, middle mature, and upper mature leaves. The original model also performed well on the lower immature, middle immature, middle overripe, upper immature, and upper overripe categories, but produced many misclassifications on the more difficult lower mature and middle mature categories. In general, the improved model made fewer misclassifications on these two complex categories, highlighting its effectiveness and applicability, especially for complex category recognition tasks.
The Score-CAM (Score-weighted Class Activation Mapping) [35] algorithm provides insight into what a deep learning model attends to when recognizing the maturity of fresh tobacco. Score-CAM is a visualization technique for interpreting the prediction behavior of Convolutional Neural Networks (CNNs) and is particularly suited to Softmax classification models. Compared with other methods, Score-CAM has reduced computational complexity, which is crucial for managing complex backgrounds and resource-constrained devices. Additionally, Score-CAM does not require gradient information, thereby circumventing the potential noise and instability issues that may arise with Grad-CAM. By generating Class Activation Maps (CAMs), it helps explicate the model’s key feature locations, localization accuracy, performance, and interpretability in recognizing the maturity of fresh tobacco, thereby assessing whether the model captures salient features pertaining to specific categories.
The fundamental principle of Score-CAM is to weight the activation maps by the scores of the target category, thereby generating an interpretable heatmap. Initially, during forward propagation, the model generates a series of activation maps for the input image at a particular convolutional layer. Each activation map is then used to mask the input, which is propagated forward to the Softmax layer to compute the corresponding category score. These scores are used to weight the activation maps, yielding the final class activation mapping. Throughout this procedure, Score-CAM can yield highly discriminative and precise localization results.
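The weighting scheme just described can be sketched on toy data (a tiny linear-softmax head stands in for the CNN, and the activation maps are random; only the Score-CAM masking-and-weighting logic is illustrated):

```python
import numpy as np

# Toy sketch of Score-CAM: normalize each activation map, mask the input
# with it, score the masked input for the target class, and accumulate a
# score-weighted sum of the maps.

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score_cam(image, acts, score_fn, target):
    """image: (H, W); acts: (K, H, W) activation maps upsampled to image size."""
    cam = np.zeros_like(image)
    for a in acts:
        a_norm = (a - a.min()) / (a.max() - a.min() + 1e-8)  # normalize map to [0, 1]
        masked = image * a_norm                              # mask the input image
        weight = score_fn(masked)[target]                    # class score of masked input
        cam += weight * a_norm                               # score-weighted accumulation
    return np.maximum(cam, 0)                                # keep positive contributions

rng = np.random.default_rng(7)
H = W = 8
image = rng.random((H, W))
acts = rng.random((4, H, W))
w_cls = rng.standard_normal((3, H * W))       # hypothetical 3-class linear head

def score_fn(x):
    return softmax(w_cls @ x.ravel())

cam = score_cam(image, acts, score_fn, target=0)
print(cam.shape)  # (8, 8)
```

Because the weights come from forward passes rather than gradients, this sketch also shows why Score-CAM sidesteps the gradient-noise issues attributed to Grad-CAM above.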
The pixels within the heatmap correspond to regions of the input image and indicate the model’s focus on each area. Typically, heatmaps use color-coding to signify weights or intensities: warmer colors (such as red) imply greater attention paid by the model during prediction, and cooler colors (such as blue) indicate less attention. The fresh tobacco maturity recognition results of the pre- and post-improvement models were visualized using Score-CAM. As illustrated in Figure 12, when the MobileNetV1 model was combined with the Feature Pyramid Network (FPN) and attention mechanisms, the model’s focal point shifted, and the improved model exhibited more precise localization. Compared to the original MobileNetV1 model, the highlighted regions in the heatmap focused more on the color and texture of the tobacco leaves relevant to maturity judgement, especially the leaf veins and areas of color change. This is particularly evident for mature and overripe tobacco leaves, which helps enhance classification accuracy.
3.7. Validation of the Enhanced MobileNetV1 on the V2 Plant Seedlings Dataset
To validate the performance of our enhanced model, we applied it to the widely available V2 Plant Seedlings Dataset. This dataset, composed of 5539 RGB images captured under a variety of weather conditions, closely mirrors the real-world scenarios encountered by our research subject: in-field tobacco leaves. It encompasses three species of plants and nine types of weeds.
The dataset was divided into training, validation, and testing subsets at ratios of 70%, 15%, and 15%, respectively. The training subset comprised 3877 images, including three plant species—common wheat, maize, and sugar beet—and nine weed species, namely, black-grass, common chickweed, cleavers, scentless mayweed, small-flowered cranesbill, shepherd’s purse, loose silky-bent, charlock (also known as wild mustard), and fat hen. The validation and testing subsets, each containing approximately 831 images, similarly included the same range of plant and weed species.
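The 70/15/15 split of the 5539 images can be sketched as follows (filenames are placeholders; the real dataset layout and any class-stratification details are not specified here):

```python
import random

# Shuffle-and-slice split of a dataset into training, validation, and test
# subsets. Floor-based slicing puts any remainder into the test subset.

def split_dataset(items, train=0.70, val=0.15, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)   # deterministic shuffle for reproducibility
    n = len(items)
    n_train = int(n * train)
    n_val = int(n * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

items = [f"img_{i:04d}.png" for i in range(5539)]
train_set, val_set, test_set = split_dataset(items)
print(len(train_set), len(val_set), len(test_set))  # 3877 830 832
```

With floor-based slicing the counts come out to 3877/830/832, consistent with the 3877 training images and the roughly 831-image validation and test subsets described above.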
Through comparative analysis, as illustrated in Table 3, our enhanced model demonstrated an accuracy of 96.63% in the plant seedling recognition task, outperforming the other models. This result validates the model’s effectiveness and superiority.
5. Conclusions
In addressing the challenge of distinguishing tobacco leaf maturity in complex field environments, we have developed a tobacco leaf maturity classification model. This model leverages an enhanced MobileNetV1 framework, an FPN, and an attention mechanism. It demonstrated resilience and precision in tackling tobacco leaf maturity identification challenges within intricate field settings, achieving an accuracy rate of 96.3%, which surpassed conventional models such as VGG16, VGG19, ResNet50, and EfficientNetB0. Furthermore, it attained a 96.63% accuracy rate on the V2 Plant Seedlings Dataset. Vein patterns and color transition zones in tobacco leaves emerged as critical features in maturity recognition.
Our refined MobileNetV1 model, while delivering superior performance, requires minimal storage and computational power, signifying its substantial potential for real-world application in tobacco leaf maturity recognition within the field. Future research could delve deeper into updated backbone network architectures like Mobile-Former, MixFormer, TopFormer, EfficientFormer, RepVGG, and LeVit, which could potentially enhance model performance and expedite the advancement of tobacco leaf maturity recognition technology.
In conclusion, our investigation presents a practical solution for the automatic discernment of tobacco leaf maturity, making a significant contribution to the burgeoning field of smart agriculture.