1. Introduction
A boiler is an essential component in thermal power plants that utilize various fuels, including coal, oil, nuclear fuel, or waste. Functioning as a heat exchanger, a boiler transforms purified water into high-pressure steam using heat radiated from hot flue gas. This steam subsequently drives turbine blades for electricity generation. Typically, a boiler comprises economizers, evaporators, superheaters, and a steam drum, although the specific configuration may vary depending on the design and function of the power plant [1,2,3,4]. Given the harsh operating conditions of elevated temperature, pressure, corrosive substances, and mechanical stress, boilers are prone to frequent failures. Boiler tube failures account for the majority of unplanned shutdowns in power plants [5]. These failures commonly manifest as tube ruptures, significantly compromising both the safety and revenue of a power plant. In the event of a tube rupture, the steam generation process can be halted, or worse, more serious accidents may follow, compelling a complete plant shutdown for necessary repairs [6,7,8,9]. Such unplanned downtime leads to substantial economic ramifications for the plant. Research indicates that the average cost of a single day of unscheduled power plant downtime in Europe is approximately EUR 100,000 [10].
Investigating the causes of boiler failures holds significant importance for the safety and profitability of power plants. Extensive research has been dedicated to probing the origins of boiler failures, with a predominant focus on chemical and physical mechanisms. These culprits can be generally classified into several categories, including short-term overheating, long-term overheating (high-temperature creep), caustic corrosion from the water/steam side, hydrogen attack from the water/steam side, high-temperature corrosion from the fireside, and dew point corrosion from the fireside [11,12,13,14,15,16]. These phenomena often occur concurrently and can be intricately interconnected. For example, caustic corrosion can set the stage for hydrogen attack. When substantial quantities of alkaline compounds deposit on the inner surface of a tube, they react with the protective oxide layer and deplete it. Consequently, hydroxide ions continue to interact with the underlying tube material, leading to caustic corrosion. Simultaneously, atomic hydrogen is generated. The atomic hydrogen diffuses into the tube wall, where it reacts with metal carbides, forming methane. The accumulation of methane can result in the formation of cracks in the tube wall, a phenomenon known as hydrogen attack [5]. However, if the oxide layer remains intact and accumulates gradually over time, it can diminish heat exchange between the water/steam and flue gas. This reduction in heat exchange fosters localized overheating, which can significantly contribute to tube creep or fatigue [17,18].
Inspecting failed tubes typically demands complex chemical treatments and expensive equipment, such as Scanning Electron Microscopes [12,19]. Furthermore, findings from one part of the boiler may not be relevant to another due to variations in design and operating conditions among different sections of the boiler. Even for the same boiler component, conclusions may not apply consistently over time, given the dynamic nature of the surrounding environment. For example, variations in fuel mixtures can introduce fluctuations in the environment around the boiler, a common occurrence in waste-to-energy (WtE) plants where the quality of municipal solid waste is uncontrollable [20,21]. Moreover, some studies indicate that prior corrosion experiences can influence the current rate of corrosion [22].
The ultimate objective of uncovering the root causes of failures is to leverage these insights to inform future operations and proactively prevent similar incidents. Unfortunately, conventional examination methods struggle to pinpoint the exact parameters, and their specific values, that contributed to the failure. Such examinations typically yield general recommendations on adjusting operating conditions, but these fall short of offering precise guidance to operators. From the perspective of operational guidance, an efficient approach to failure investigation should prioritize the connection between a failure and precise operating parameters without delving extensively into the intricacies of the failure mechanism, especially considering the complex and variable nature of the aforementioned boiler failure mechanisms. Therefore, it is advisable to harness historical operational monitoring data and apply suitable data science methodologies for failure analysis.
Only a few data science applications related to boilers in power plants have been documented in the literature. For instance, one study demonstrated the high effectiveness of a data-driven approach comprising Wavelet Packet Transform analysis and a Deep Neural Network in detecting boiler tube leakages [23]. Another developed two short-term forecasting models (a Convolutional Neural Network (CNN) and a Long Short-Term Memory network) for predicting three safety indicators of a supercharged boiler. Both models yielded excellent results, but the CNN was preferred due to its lower computational cost [24]. Additionally, an Extreme Gradient Boosting model, fine-tuned with a Particle Swarm Optimization algorithm, accurately predicted metal temperature time series, enabling the early detection of metal temperature anomalies in a coal-fired boiler [25]. Furthermore, an Extra-Trees classifier and a Minimum Redundancy Maximum Relevance model were found to be highly effective in selecting the most relevant sensors for detecting faults in turbines and boilers, respectively. The results indicated a substantial reduction in the number of sensors needed for fault detection and a significant increase in detection accuracy [26]. Moreover, three individual machine learning algorithms, Random Forest, Lasso, and Support Vector Regression, along with an ensemble model based on them, were employed to forecast boiler faults in a thermal power plant by predicting the key performance indicators of the boiler. The findings indicated that the ensemble model outperformed all three individual models, delivering a highly satisfactory outcome [27].
However, the literature presents two gaps. First, there is a lack of data science applications specifically focused on analyzing the root causes of boiler failures. Second, all the prior studies are based on supervised learning, which is not suitable for scenarios where operational data lack clear labels, a common occurrence in engineering settings, including the case study addressed in this research. Motivated by these gaps, this study introduces a novel and methodical framework that integrates engineering expertise with data science methods to investigate the causes of boiler failures and improve future operational practices. Beginning by formulating the boiler failure investigation as a data mining problem, the framework encompasses data preprocessing, model building and selection, and result evaluation and analysis, culminating in the provision of precise operational recommendations to prevent future boiler failures. The data science techniques employed predominantly include Discrete Wavelet Transform (DWT), Principal Component Analysis (PCA), K-means clustering, and Deep Embedded Clustering (DEC). This framework is designed and leveraged to pinpoint the exact operational parameters, and their specific values, that contributed to boiler failures so that similar failures can be proactively prevented in the future by adjusting process operations.
This paper is structured as follows. Following this introduction, Section 2 introduces the case study subject and the datasets used in this research. The case study was conducted on a WtE facility situated in Umeå, Sweden; its purpose was to demonstrate the details of the framework and validate the framework’s applicability in a real engineering context. Section 3 presents the framework, the chosen data science techniques, and the rationale behind their adoption. The results derived from the case study and the ensuing discussion are presented in Section 4. Finally, Section 5 summarizes the key findings of this research.
2. Overview of Umeå Waste-to-Energy Plant and Data Origin
The subject of the case study is the WtE plant located in Umeå, Sweden, operated by Umeå Energi. Umeå WtE plant is a 65 MW Combined Heat and Power (CHP) plant fueled by approximately 50% municipal solid waste and 50% industrial waste. Boasting a waste processing capacity of around 20 t/h, the plant operates roughly 8000 h per year and undergoes an annual scheduled maintenance shutdown.
Illustrated in Figure 1 is the boiler-related layout of the Umeå WtE plant. Waste is introduced through the hopper for incineration on the grate, and the resulting flue gas traverses four flue gas passages until it reaches the flue gas treatment modules. The initial three passages are vertically oriented, primarily relying on radiation for heat transfer, while the fourth passage is horizontal and characterized by convective heat exchange. In the initial three passages, numerous tubes containing water/steam are positioned along the inner walls. These tubes serve dual purposes: functioning as evaporators within the boiler system and acting as safeguards against overheating for the walls. In the fourth passage lies the central segment of the boiler arrangement, consisting of one evaporator unit, three superheater units, another evaporator unit, and three economizer units, arranged from left to right. Within this segment, water/steam typically flows counter to the flue gas to facilitate convective processes. Within the economizers, boiler feed water is raised to a temperature below the boiling point at a certain water pressure. Concurrently, the flue gas surrounding the economizers achieves the desired (lower) temperature for subsequent flue gas treatment. Following the economizers, the heated water ascends to the steam drum situated atop the flue gas passages. Subsequently, the water in the steam drum flows through the downcomers to reach the evaporators, where it undergoes a phase transition into wet steam before ascending back to the steam drum. Within the steam drum, a separator works to transform the wet steam into saturated steam. This saturated steam is extracted from the upper section of the steam drum and subsequently undergoes additional heating in the superheaters to attain the status of superheated steam. The superheating process is crucial for optimizing turbine efficiency and ensuring its continued optimal performance.
The entirety of the plant is monitored by numerous online sensors. With the assistance of the engineers at the Umeå WtE plant, 66 of them (presented in Table S1 in the Supplementary Material) were identified to possess potential associations with boiler failure occurrences. Consequently, there were 66 variables in the case study datasets. Throughout the case study, a total of three boiler failures were examined, each corresponding to a specific repair stoppage. The timeframes for these stoppages were derived from the log. Closely proximate failures were analyzed collectively, resulting in the investigation of two datasets (as outlined in Table 1). The time spans of the datasets were decided by setting the starting points three to five months (depending on the availability of data) before the initial stoppage. This approach ensured an adequate number of observations for evaluating distinctions between normal and abnormal operational conditions (further elaborated on in Section 3.1). The datasets were obtained at a 30 min resolution through averaging, despite the original data being of a higher resolution. Averaging was employed for two main purposes: noise reduction and, notably, mitigation of the time-lag impact caused by the movement of water, steam, and flue gas.
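For illustration, a minimal sketch of the 30 min averaging step is shown below, assuming the raw sensor data are available as a time-indexed pandas DataFrame; the file name and column handling are hypothetical placeholders, not the plant's actual export format.

```python
import pandas as pd

# Hypothetical raw export: one timestamp column plus the 66 selected sensor variables
raw = pd.read_csv("boiler_sensors.csv", parse_dates=["timestamp"], index_col="timestamp")

# Average to a 30 min resolution to reduce noise and damp the time lag
# introduced by the movement of water, steam, and flue gas
data_30min = raw.resample("30min").mean()
```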
3. Methodology
3.1. The Framework
The investigation into the causes of boiler failure in this study is primarily grounded in the inference that certain abnormal conditions give rise to the failure, and that these abnormal conditions persist until the operators detect the failure and halt the process line. Hence, the primary phase of abnormal conditions ceases around the commencement of the stoppage/repair. Preceding the occurrence of abnormal conditions, there exists a period characterized by normal operational conditions. Through a comparison of variable values under normal and abnormal conditions, we can identify which variables deviate from the expected behavior and consequently lead to failure.
However, identifying the normal and abnormal periods presents a two-fold challenge. First, the monitored data lack labels, aside from the logging of boiler repair events. Second, the criteria for classifying operational conditions as abnormal may differ from one tube to another and across various time periods, owing to variations in the functions of different tubes and the potential degradation of their properties over time. Thus, to the authors’ best knowledge, case-based unsupervised clustering stands as the sole fitting approach for identifying normal and abnormal periods in this study. The specific method of unsupervised clustering applied in this study is K-means [28].
Figure 2 shows the flowchart of the failure analysis framework in this study. Following the initial data cleansing process, the application of Discrete Wavelet Transform (DWT) served to effectively eliminate any noise stemming from the sensors. Next, embedding techniques were implemented to mitigate noise that may exist among different variables. Importantly, the utilization of embedding also aids in averting the curse of dimensionality [29], as it effectively reduces the dimensionality of the data. This study employed two distinct embedding techniques. The first approach was Principal Component Analysis (PCA), whereas the second approach was a Deep Neural Network (DNN) integrated within the structure of Deep Embedded Clustering (DEC). Following the embedding process, the transformed data were input into K-means, producing the final clustering results. PCA + K-means served as the baseline against which the performance of DEC was evaluated. For PCA + K-means, the developments of PCA and K-means are loosely combined, as the information flow is unidirectional from PCA to K-means. Conversely, in DEC, the DNN and K-means are seamlessly connected and trained simultaneously and iteratively. The information flow in DEC is bidirectional: from the DNN to K-means, further extending to the KL divergence, and reciprocally from the KL divergence back to K-means and the DNN.

Having obtained the initial clustering results that categorized all observations into three distinct clusters, the subsequent task was to determine the identity of each cluster. Initially, the repair cluster (period of stoppage) can be discerned by referencing the operational log, as the log indicates when the boiler underwent repair and subsequently resumed operation. Following this, the contiguous timeframe directly preceding the repair event can be recognized as the cluster indicative of abnormal operating conditions. Finally, the continuous timeframe preceding the cluster of abnormal conditions can be designated as the cluster of normal operating conditions. Once the clusters were identified, an assessment and comparison of the clustering outcomes between PCA + K-means and DEC were conducted from an operational perspective. This evaluation aimed to determine the optimal clustering result. Based on this optimal clustering result, histograms were constructed for each individual variable. The purpose was to scrutinize potential disparities in distribution patterns between clusters under normal and abnormal conditions. To quantify this distribution disparity, the Normalized Peak Shift (NPS) metric was employed. It assesses the normalized shift in peak values (the most frequent values) between two distinct distributions. Variables that displayed a noticeable shift, characterized by NPS values surpassing the threshold of 30%, were identified as contributors to failure occurrences. These identified variables require vigilant monitoring to proactively prevent the recurrence of similar failures. Furthermore, recommendations concerning their values during production were formulated based on observations of their distributions under both normal and abnormal conditions.
In addition to the utilization of data science techniques, the experiential insights contributed by WtE plant engineers held a substantial influence within the framework. The term ‘empirical engineering knowledge’ within the framework pertains to the experiential knowledge garnered by the engineers through their operational and maintenance experiences. This encompasses their specialized engineering expertise in the realms of chemistry and physics. This form of knowledge served as an important complement within this framework, ensuring the data science methodologies were effectively employed to align seamlessly with the study’s objectives. For example, as described in Section 2, the engineers helped to narrow down relevant variables significantly. Moreover, empirical engineering knowledge was sought when setting the noise threshold in the DWT process. More importantly, it was employed to evaluate and compare different clustering results, and, finally, to select the optimal one.
3.2. Discrete Wavelet Transform
Discrete Wavelet Transform (DWT) is a powerful tool for denoising signal data [30]. DWT-based denoising typically comprises three steps: Decomposition, Thresholding, and Reconstruction.
Decomposition: Solve the DWT coefficients from the decomposition expansion of the signal with noise. Given a signal $f(t)$, decompose it using DWT to obtain the approximation coefficients $cA_{j_0,k}$ and the detail coefficients $cD_{j,k}$. The decomposition expansion can be expressed as Equation (1):

$$f(t) = \sum_{k} cA_{j_0,k}\,\varphi_{j_0,k}(t) + \sum_{j \ge j_0} \sum_{k} cD_{j,k}\,\psi_{j,k}(t) \quad (1)$$

Here, $\psi_{j,k}(t)$ is the wavelet function, and $\varphi_{j_0,k}(t)$ is the scaling function associated with the wavelet function. $j$ is the level parameter, and $k$ is the translation parameter.
Thresholding: Keep the detail coefficients associated with the signal as they are, and replace the ones related to noise with zeros. Given the detail coefficients $cD_{j,k}$, apply a threshold $\lambda$ to them to suppress the noise. This study adopted the hard thresholding approach presented in Equation (2):

$$\hat{cD}_{j,k} = \begin{cases} cD_{j,k}, & \left| cD_{j,k} \right| \geq \lambda \\ 0, & \left| cD_{j,k} \right| < \lambda \end{cases} \quad (2)$$
Reconstruction: Reconstruct the signal with the modified coefficients. Equation (3) gives the reconstructed and denoised signal $\hat{f}(t)$ using the original approximation coefficients $cA_{j_0,k}$ and the modified detail coefficients $\hat{cD}_{j,k}$:

$$\hat{f}(t) = \sum_{k} cA_{j_0,k}\,\varphi_{j_0,k}(t) + \sum_{j \ge j_0} \sum_{k} \hat{cD}_{j,k}\,\psi_{j,k}(t) \quad (3)$$
For the DWT work in this study, we used the Python package PyWavelets (version 1.1.1) [31]. Specifically, we used the wavedec application programming interface (API) for multilevel decomposition, with the arguments ‘db6’ for wavelet and 5 for level.
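As an illustration, the following minimal sketch shows how the three denoising steps could be implemented with PyWavelets; the threshold value and the example signal are hypothetical placeholders, not the settings used in the case study.

```python
import numpy as np
import pywt

def dwt_denoise(signal, wavelet="db6", level=5, threshold=0.1):
    """Denoise a 1-D signal via multilevel DWT with hard thresholding."""
    # Decomposition: approximation coefficients plus detail coefficients per level
    coeffs = pywt.wavedec(signal, wavelet=wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    # Hard thresholding: zero out detail coefficients below the threshold
    details = [pywt.threshold(d, threshold, mode="hard") for d in details]
    # Reconstruction from the original approximation and modified detail coefficients
    return pywt.waverec([approx] + details, wavelet=wavelet)

# Hypothetical usage on a noisy sine wave
t = np.linspace(0, 1, 1024)
noisy = np.sin(2 * np.pi * 5 * t) + 0.1 * np.random.randn(t.size)
denoised = dwt_denoise(noisy)
```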
3.3. Principal Component Analysis
Principal Component Analysis (PCA) yields several principal components (PCs), which are the result of mapping the raw data’s variation space to a new space of lower dimensionality. All the PCs are linear combinations of the original variables, and the PCs are orthogonal to each other. The number of PCs is determined by maximizing the total variation explained by the PCs while minimizing the remaining noise. Typically, PCA is conducted by calculating the covariance matrix of the original data, followed by eigenvalue decomposition of the covariance matrix. The eigenvectors from the decomposition define the directions of the PCs. The eigenvectors are sorted according to their corresponding eigenvalues, and larger eigenvalues represent a greater capability of the corresponding PCs to explain variation [32]. For the PCA work in this study, we used the API sklearn.decomposition.PCA in the Python package scikit-learn (version 0.24.0) [33].
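For reference, a minimal sketch of the PCA embedding step is shown below; the choice of five components and the standard-scaling step are illustrative assumptions, not the settings selected in the case study.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical sensor matrix: rows are 30 min observations, columns are the 66 variables
X = np.random.rand(5000, 66)

# Scale variables so that differing sensor units do not dominate the covariance
X_scaled = StandardScaler().fit_transform(X)

# Project onto a lower-dimensional space; n_components is tuned per dataset
pca = PCA(n_components=5, random_state=5)
X_embedded = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.sum())  # share of variation retained
```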
3.4. K-Means
The idea of K-means clustering is quite straightforward: all the observations in the dataset are grouped into $k$ clusters based on their distances to each other, minimizing the distances among observations within each cluster while maximizing the distances among different clusters [28]. To be specific, the objective of K-means is to minimize $E$, as presented in Equation (4):

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \left\lVert x - \mu_i \right\rVert^2 \quad (4)$$

Here, $k$ is the set number of clusters, $C_i$ is the $i$th cluster, and $\mu_i$ is the mean vector (centroid) of $C_i$. Since the total variance is constant, minimizing $E$ is equivalent to maximizing the variance among different clusters.
However, minimizing E is an NP-hard problem. Thus, the following heuristic algorithm is used:
- (1) Randomly generate k initial centroids within the dataset.
- (2) Generate new clusters by assigning every observation to its nearest centroid.
- (3) Calculate the centroids of the new clusters.
- (4) Repeat Steps 2 and 3 until convergence is reached.
For the K-means work (for both PCA + K-means and DEC) in this study, we used the Python API sklearn.cluster.KMeans in the Python package scikit-learn (version 0.24.0) [33] with the parameters n_clusters = 3, tol = 0.001, and random_state = 5. The number of clusters for K-means was set to 3 because, for every case, there are three categories of operating conditions for the boiler: normal conditions, abnormal conditions, and repair/stoppage.
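A minimal sketch of the clustering step on the embedded data is shown below, using the parameters reported above; the input array is a hypothetical placeholder for the output of PCA or the DEC encoder.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical embedded observations (e.g., PCA scores or DEC encoder outputs)
X_embedded = np.random.rand(5000, 5)

# Three clusters: normal conditions, abnormal conditions, repair/stoppage
kmeans = KMeans(n_clusters=3, tol=0.001, random_state=5)
labels = kmeans.fit_predict(X_embedded)

# Cluster identities are assigned afterwards by aligning the labels
# with the repair timestamps recorded in the operational log.
```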
3.5. Deep Embedded Clustering
Deep Embedded Clustering (DEC) is a method that learns variable embedding and observation clustering simultaneously using a deep neural network (DNN) and K-means [34]. Instead of clustering the original data $X$ into $k$ clusters, DEC first maps $X$ nonlinearly onto a new space $Z$ with much lower dimensionality. The mapping is conducted through a DNN with the parameters $\theta$. Subsequently, DEC learns the centroid set $\{\mu_j\}_{j=1}^{k}$ and the parameters $\theta$ simultaneously. DEC consists of two stages:
- (1) Using a stacked autoencoder (SAE) to initialize the parameters $\theta$.
- (2) Iterating the process of generating an auxiliary target distribution $P$ and minimizing the Kullback–Leibler (KL) divergence between the soft assignment $Q$ and the auxiliary target distribution $P$. By doing this, the parameters $\theta$ are optimized.
SAE is applied because much research has demonstrated its capability of consistently yielding good representations (results of mapping) for real-world datasets [35,36,37]. As shown in Figure S1 in the Supplementary Material, SAE consists of an encoder and a decoder, and their structures are symmetric with respect to one another. The low-dimension layer in the middle is the embedded space. The activation function applied in the SAE (except for the embedded layer and the reconstruction layer) in this study is ReLU [38]. The training is performed by minimizing the least-squares loss between the input layer and the reconstruction layer. Once initialization is carried out, the encoder part is selected to concatenate with K-means in Stage (2) for further training.
In Stage (2), the loss function, KL divergence, is expressed in Equation (5):

$$L = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \quad (5)$$

The term $q_{ij}$ mentioned above is defined in Equation (6):

$$q_{ij} = \frac{\left(1 + \lVert z_i - \mu_j \rVert^2 / \alpha\right)^{-\frac{\alpha + 1}{2}}}{\sum_{j'} \left(1 + \lVert z_i - \mu_{j'} \rVert^2 / \alpha\right)^{-\frac{\alpha + 1}{2}}} \quad (6)$$

Here, $z_i \in Z$ corresponds to $x_i \in X$, and $\alpha$ is the degrees of freedom of the Student’s t distribution. $q_{ij}$ indicates the probability of assigning sample $i$ to cluster $j$.

The term $p_{ij}$ mentioned above is defined in Equation (7):

$$p_{ij} = \frac{q_{ij}^{2} / f_j}{\sum_{j'} q_{ij'}^{2} / f_{j'}} \quad (7)$$

Here, $f_j = \sum_{i} q_{ij}$ are the soft cluster frequencies.
The DEC work in this study was carried out based on the Keras script written by Xifeng Guo [39]. The original script was designed to perform image clustering, but we customized it to fit this study.
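To make Equations (5)–(7) concrete, the following self-contained sketch computes the soft assignment, the auxiliary target distribution, and the KL divergence with NumPy; the embedded points and centroids are hypothetical, and α is set to 1 as in the original DEC formulation.

```python
import numpy as np

def soft_assignment(Z, centroids, alpha=1.0):
    """Equation (6): Student's t-based probability of assigning sample i to cluster j."""
    dist_sq = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Equation (7): sharpen the soft assignment using soft cluster frequencies f_j."""
    f = q.sum(axis=0)
    p = (q ** 2) / f
    return p / p.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """Equation (5): KL(P || Q), the loss minimized in Stage (2) of DEC."""
    return np.sum(p * np.log(p / q))

# Hypothetical embedded observations and three cluster centroids
Z = np.random.rand(100, 5)
centroids = np.random.rand(3, 5)
q = soft_assignment(Z, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
```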
3.6. Key Hyperparameters of Models
Since the primary focus of this study is the optimization of the WtE process, the description and discussion of the data science methodology are presented concisely. Therefore, only the core hyperparameters of the models are discussed in this paper, while any API arguments not explicitly mentioned are retained at their default settings. We adopted the Grad Student Descent approach [40] for tuning all the model hyperparameters. In our analyses, two key hyperparameters took center stage: the number of PCs (npc) for PCA + K-means, and the number of neurons in the embedded layer (nn_el) for DEC. They both dictate the dimensionality of the embedded spaces. To facilitate optimization, we defined an identical range, specifically {2, 3, 4, 5, 6, 7, 8}, for tuning both of these hyperparameters. In cases where the optimal outcome for either approach emerged at 2 or 8, an exploration of 1 or 9 would be initiated to assess the potential for yielding a new optimal result. This iterative process would continue until the superior outcome was no longer derived from the boundary values of the specified range. For DEC, additional significant hyperparameters included the number of hidden layers within the encoder (nhl) and the number of neurons within these layers (nn_hl). Given the datasets’ moderate scales, the optimization range for nhl was designated as {1, 2, 3}, with 2 consistently identified as the optimal selection across all datasets. To enhance tuning efficiency, we maintained uniformity in nn_hl across all hidden layers for a specific dataset. Nevertheless, the optimization ranges and optimal values for nn_hl differed among datasets.
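As an illustration of the boundary-extension tuning described above, the sketch below searches over npc for PCA + K-means and widens the range whenever the best value sits on a boundary; the silhouette score used here is an illustrative stand-in for the operational, knowledge-based evaluation actually applied in the framework, and the dataset is a placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def score_npc(X, npc):
    """Embed with npc principal components, cluster, and score the result."""
    X_emb = PCA(n_components=npc, random_state=5).fit_transform(X)
    labels = KMeans(n_clusters=3, tol=0.001, random_state=5).fit_predict(X_emb)
    return silhouette_score(X_emb, labels)

def tune_npc(X, low=2, high=8):
    """Search {low, ..., high}; extend the range while the best npc is a boundary value."""
    while True:
        best = max(range(low, high + 1), key=lambda n: score_npc(X, n))
        if best == low and low > 1:
            low -= 1          # optimum on the lower boundary: explore one step further
        elif best == high and high < X.shape[1]:
            high += 1         # optimum on the upper boundary: explore one step further
        else:
            return best

# Hypothetical usage on a placeholder dataset
X = np.random.rand(1000, 66)
best_npc = tune_npc(X)
```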
3.7. Normalized Peak Shift
Normalized Peak Shift (NPS) was introduced as the metric to evaluate the state difference of each variable between normal conditions and abnormal conditions. It is based on the notion that the most frequently observed value (peak value of a distribution) under certain conditions can effectively encapsulate the variable’s state under those conditions. Thus, by estimating the peak values’ shift between two distributions, the state change of the variable of interest can be quantified. To enhance the clarity and utility of this metric, the range of variable values under normal conditions (excluding extreme values) is utilized to normalize the shift, resulting in NPS values presented as percentages.
NPS can be calculated by Equation (8):

$$\mathrm{NPS} = \frac{\left| \underset{x}{\operatorname{arg\,max}}\, P_{ab}(x) - \underset{x}{\operatorname{arg\,max}}\, P_{n}(x) \right|}{\max\!\left(x_{n,1}, \ldots, x_{n,N_n}\right) - \min\!\left(x_{n,1}, \ldots, x_{n,N_n}\right)} \times 100\% \quad (8)$$

Here, $P_n$ is the Probability Mass Function (PMF) of the variable values under normal conditions ($x_n$), while $P_{ab}$ is the PMF of the variable values under abnormal conditions ($x_{ab}$). $N_n$ is the number of observations under normal conditions after excluding the observations with extreme variable values.
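A minimal sketch of how NPS could be computed from the clustered observations is given below; the histogram bin count and the percentile-based exclusion of extreme values are illustrative assumptions, and the example readings are hypothetical.

```python
import numpy as np

def normalized_peak_shift(x_normal, x_abnormal, bins=50, trim_pct=1.0):
    """Shift of the most frequent value between two conditions, normalized
    by the (trimmed) range of the variable under normal conditions, in %."""
    # Exclude extreme values under normal conditions (illustrative percentile trim)
    lo, hi = np.percentile(x_normal, [trim_pct, 100.0 - trim_pct])
    x_norm_trim = x_normal[(x_normal >= lo) & (x_normal <= hi)]

    def peak(values):
        # Peak (most frequent) value of the empirical distribution
        counts, edges = np.histogram(values, bins=bins)
        centers = (edges[:-1] + edges[1:]) / 2.0
        return centers[np.argmax(counts)]

    shift = abs(peak(x_abnormal) - peak(x_norm_trim))
    return 100.0 * shift / (x_norm_trim.max() - x_norm_trim.min())

# Hypothetical superheater temperature readings under the two conditions
x_n = np.random.normal(527, 10, 2000)
x_ab = np.random.normal(560, 15, 500)
print(normalized_peak_shift(x_n, x_ab))  # flagged as a contributor if > 30%
```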
5. Conclusions
A novel and methodical data mining framework was introduced for conducting operational-level (focused on operating parameters) investigations into the attribution of boiler failures. The framework centered on two data mining approaches, PCA + K-means and DEC, with PCA + K-means serving as the baseline against which the performance of DEC was evaluated. To demonstrate the framework’s specifics, a case study was performed using datasets obtained from a WtE plant in Sweden. Within the case study, different operational conditions were clustered and identified, followed by the quantification of shifts in variable states between normal and abnormal conditions. Based on this quantification, we pinpointed the variables that played a substantial role in causing failures and recommended their safe operational values to forestall similar incidents in the future. The major findings of the case study are as follows:
- (1) The clustering outcomes of DEC consistently surpass those of PCA + K-means across nearly every dimension. This is attributed to DEC’s iterative refinement of the non-linearly embedded space and cluster centroids based on KL divergence feedback.
- (2) T-BSH3rm, T-BSH2l, T-BSH3r, T-BSH1l, T-SbSH3, and T-BSH1r emerged as the most significant contributors to the three failures recorded in the two datasets. This underscores the critical importance of vigilant monitoring and precise temperature control of the superheaters to ensure safe production.
- (3) It is advisable to maintain the operational levels of T-BSH3rm, T-BSH2l, T-BSH3r, T-BSH1l, T-SbSH3, and T-BSH1r around 527 °C, 432 °C, 482 °C, 338 °C, 313 °C, and 343 °C, respectively. Additionally, it is crucial to prevent these values from reaching or exceeding 594 °C, 471 °C, 537 °C, 355 °C, 340 °C, and 359 °C for prolonged durations.
The findings offer the opportunity to improve future operational conditions, thereby extending the overall service life of the boiler. Consequently, operators can address faulty tubes during scheduled annual maintenance without encountering failures and disrupting production. In future research, by examining a broader range of failures, we can develop a repository of diverse influential variables and their recommended operational values. This resource can facilitate more comprehensive, precise, and reliable production operation and management.