Article

Cross-Project Defect Prediction Method Based on Manifold Feature Transformation

1 School of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221116, China
2 Key Laboratory of Safety-Critical Software, Ministry of Industry and Information Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
* Author to whom correspondence should be addressed.
Future Internet 2021, 13(8), 216; https://doi.org/10.3390/fi13080216
Submission received: 10 August 2021 / Revised: 18 August 2021 / Accepted: 19 August 2021 / Published: 20 August 2021
(This article belongs to the Section Big Data and Augmented Intelligence)

Abstract

Traditional software defect prediction methods use part of the data in a project to train a defect prediction model and then predict the defect labels of the remaining data. However, in practical software development, the project to be predicted is usually a brand new project that does not have enough labeled data to build a defect prediction model, so traditional methods are no longer applicable. Cross-project defect prediction builds the defect prediction model from the labeled data of projects of the same type that are similar to the target project, thereby addressing the lack of training data in traditional methods. However, the difference in data distribution between such projects and the target project reduces defect prediction performance. To solve this problem, this paper proposes a cross-project defect prediction method based on manifold feature transformation. The method transforms the original feature space of the projects into a manifold space, then reduces the difference in data distribution between the transformed source project and the transformed target project in the manifold space, and finally uses the transformed source project to train a naive Bayes prediction model with better performance. A comparative experiment was carried out on the Relink and AEEEM datasets. The experimental results show that, compared with the benchmark method and several cross-project defect prediction methods, the proposed method effectively reduces the difference in data distribution between the source project and the target project and obtains a higher F1 value, an indicator commonly used to measure the performance of binary classification models.

1. Introduction

Currently, software complexity is increasing, which is mainly reflected in the continuous growth in the number of developers and the scale of software systems. This growth in personnel and scale inevitably leads to more hidden defects in software systems, while comprehensively testing a complex software system requires a considerable amount of resources. Therefore, we hope to know the defect tendency of each module of a software system in advance so as to allocate testing resources in a targeted manner. Software defect prediction technology [1,2,3,4,5] can help us achieve this goal: by predicting the defects of each module of a software project, the modules predicted to be defective can be tested with greater focus, so that testing resources are allocated to each software module in a targeted manner.
Software defect prediction builds a prediction model with the help of machine learning methods, where the training data are the values of measurement features extracted from software modules, such as the CK metrics [6]. Research on within-project defect prediction (WPDP) [7,8,9,10] is well established, but WPDP is often impractical because it requires labeled defect data from part of the same project or from a historical version of the project. However, the software developed in practice is generally a brand new project without historical version data. Since machine learning methods are data-driven, WPDP lacks practicality in the absence of training data.
In view of the lack of training data, researchers have proposed cross-project defect prediction (CPDP) [11,12,13]. CPDP trains the model on the labeled defect data of other similar software projects (i.e., source projects) and predicts the defects of the software project under development (i.e., the target project). However, due to the large difference in data distribution between the source project and the target project, a defect prediction model constructed from the source project data cannot achieve good predictive performance on the target project. Therefore, how to reduce the difference in data distribution between the source project and the target project has become the focus of CPDP research.

2. Related Work

Researchers have proposed a variety of methods and models to reduce the difference in data distribution between projects and thus improve the performance of CPDP. These efforts include the Burak instance filtering method [14] and the Peters instance filtering method [15] based on similarity-driven training data selection, the TNB model based on instance weighting proposed by Ma et al. [16], and the large-scale combination model HYDRA proposed by Xia et al. [17], which searches for the best instance weights; other work enhances separability and applies cost-sensitive techniques to ease the misclassification problem. Ni et al. [21] proposed a cross-project defect prediction method based on feature migration and instance migration. Specifically, in the feature migration stage, the method uses cluster analysis to select the features with high distribution similarity between the source project and the target project; in the instance migration stage, based on the TrAdaBoost method, it selects instances from the source project whose distribution is similar to that of a small number of labeled instances in the target project. In addition, Li et al. [32] proposed a cost-sensitive transfer kernel method for linear inseparability and class imbalance in CPDP. Additionally, Fan et al. [33] extracted and migrated training data from the source project from the perspective of instance filtering and instance migration, transforming the training data with high correlation with the target data.
Researchers have also developed multi-source defect prediction methods that use the data of multiple source projects. Zhang et al. [34] evaluated 7 composite algorithms on 10 open source projects. When predicting a target project, a collection of labeled instances from the other projects is used to iteratively train the composite algorithm. Experimental results show that bagging and boosting algorithms, combined with appropriate classification models, can improve the performance of CPDP. Xia et al. [17] proposed the large-scale combination model HYDRA for cross-project defect prediction, which considers the weights of instances in the training data and searches for the best weights for instance selection. Chen et al. [35] proposed collective transfer learning for defect prediction (CTDP), which includes two stages, the source data expansion stage and the adaptive weighting stage. CTDP expands the source project dataset using the TCA method, then builds multiple base classifiers for multiple source projects, and finally uses the PSO algorithm to adaptively weight the base classifiers into a collective classifier to obtain better prediction results.
From various angles, the above methods ultimately improve the performance of CPDP; in particular, feature transformation and feature migration methods can greatly improve its performance. However, few researchers have considered the feature distortion phenomenon that arises in datasets when the feature dimension is large. Therefore, we propose a manifold feature transformation method to address this issue.

3. Cross-Project Defect Prediction Method Based on Manifold Feature Transformation

This section first introduces the relevant symbol definitions involved in the MFTCPDP method, then describes the overall framework of the MFTCPDP method, introduces the manifold feature transformation process in detail, and finally, provides the model construction algorithm.

3.1. Definition of Related Symbols

$D_S$ represents the labeled source project data, and $D_T$ represents the unlabeled target project data; $D_S = \{X_{S_i}, Y_{S_i}\}_{i=1}^{n}$ denotes the $n$ modules with known labels in the source project, and $D_T = \{X_{T_j}\}_{j=1}^{m}$ denotes the $m$ modules with unknown labels in the target project. $D_S$ and $D_T$ have the same feature space and label space, that is, feature space $F_S = F_T$ and label space $Y_S = Y_T$. The same feature space means that the metrics of the source project and the target project are identical, and the same label space means that the modules of both projects are labeled in binary fashion; that is, each module belongs either to the non-defective class or to the defective class. The cross-project defect prediction problem is to train a defect prediction model from the data in $D_S$ and then predict the labels of the data in $D_T$.
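To make this setting concrete, the following is a minimal sketch (not the authors' code) of how the CPDP data described above can be represented; the array names and sizes are illustrative assumptions.
```python
import numpy as np

# Illustrative CPDP setup (hypothetical sizes): n labeled source modules,
# m unlabeled target modules, and a shared feature space of D metrics.
n, m, D = 300, 200, 26                 # e.g., Relink projects use 26 features

X_S = np.random.rand(n, D)             # source feature matrix, rows are modules X_Si
y_S = np.random.randint(0, 2, size=n)  # source labels Y_Si: 0 = non-defective, 1 = defective
X_T = np.random.rand(m, D)             # target feature matrix X_Tj, labels unknown

# CPDP: train a classifier on (X_S, y_S) and predict the unknown labels of X_T.
```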

3.2. Method Framework

The framework of the MFTCPDP method is shown in Figure 1. The input is a source project and a target project. After manifold feature transformation, a new source project and a new target project are obtained. A machine learning model is then trained on the transformed source project and used to predict defects on the transformed target project, producing predicted labels. Finally, the predicted labels are compared with the real labels to evaluate the performance of the MFTCPDP method. The core of the MFTCPDP method lies in the manifold feature transformation process, whose concept and implementation details are introduced in the next section.

3.3. Manifold Feature Transformation

Manifold feature transformation transforms the original feature space into a manifold space and then reduces the difference in data distribution between the transformed source project and the transformed target project in the manifold space. The data in the manifold space retain the low-dimensional manifold structure of the high-dimensional space, so the data features have good geometric properties, which avoids the phenomenon of feature distortion [36]. Therefore, we first transform the features in the original space into the manifold space and then perform the transfer in the manifold space.
In the Grassmann manifold, the feature transformation has an effective numerical form, which can be expressed and solved efficiently [37]. Therefore, this paper transformed the original feature space into the Grassmann manifold space. There are many ways to accomplish this process [38,39]. We chose the geodesic flow kernel method (i.e., GFK) [40,41] to complete the manifold feature transformation process because GFK has higher computational efficiency.
GFK models the data domains with $d$-dimensional subspaces and then embeds these subspaces into the manifold $\mathbb{G}$. Let $P_S$ and $P_T$, respectively, denote the subspaces obtained from the original data spaces by principal component analysis; then $\mathbb{G}$ can be regarded as the collection of all $d$-dimensional subspaces, and each $d$-dimensional original subspace can be regarded as a point on $\mathbb{G}$. Therefore, the geodesic $\{\Phi(t), 0 \le t \le 1\}$ between two points forms a path between the two subspaces. If we let $P_S = \Phi(0)$ and $P_T = \Phi(1)$, then finding a geodesic from $\Phi(0)$ to $\Phi(1)$ is equivalent to transforming the original features into an infinite-dimensional space, and the integral along this path is the process of migrating the source project to the target project in the manifold space so as to reduce the data distribution difference between the two projects [36]. This can be viewed as an incremental migration from $\Phi(0)$ to $\Phi(1)$. Below, we introduce the construction and solution processes in detail.
$P_S \in \mathbb{R}^{D \times d}$ and $P_T \in \mathbb{R}^{D \times d}$ respectively represent the subspaces of the source project dataset and the target project dataset after principal component analysis, $d$ represents the feature dimension of the subspaces, and $R_S \in \mathbb{R}^{D \times (D-d)}$ represents the orthogonal complement to $P_S$. Using the canonical Euclidean metric of the Riemannian manifold, the geodesic flow is parameterized [38] as follows:

$$\Phi(t) \in \mathbb{G}(d, D), \quad t \in [0, 1] \qquad (1)$$

Additionally, $P_S = \Phi(0)$ and $P_T = \Phi(1)$; for the other values of $t$:

$$\Phi(t) = P_S U_1 \Gamma(t) - R_S U_2 \Sigma(t) \qquad (2)$$

where $U_1 \in \mathbb{R}^{d \times d}$ and $U_2 \in \mathbb{R}^{(D-d) \times d}$ are orthogonal matrices given by the following SVDs:

$$P_S^{T} P_T = U_1 \Gamma V^{T}, \quad R_S^{T} P_T = -U_2 \Sigma V^{T} \qquad (3)$$

where $\Gamma$ and $\Sigma$ are diagonal matrices whose diagonal elements are $\cos\theta_i$ and $\sin\theta_i$, $i = 1, 2, \ldots, d$, and $\theta_i$ are the principal angles between $P_S$ and $P_T$:

$$0 \le \theta_1 \le \theta_2 \le \cdots \le \theta_d \le \pi/2 \qquad (4)$$
For two original feature vectors $F_i$ and $F_j$, we compute their projections onto $\Phi(t)$ for continuous $t$ from 0 to 1 and concatenate all the projections into infinite-dimensional feature vectors $Z_i$ and $Z_j$. The inner product [40] between them defines the geodesic flow kernel as follows:

$$\langle Z_i, Z_j \rangle = \int_0^1 \left(\Phi(t)^{T} F_i\right)^{T} \left(\Phi(t)^{T} F_j\right) dt = F_i^{T} G F_j \qquad (5)$$

Therefore, through $Z = \sqrt{G}\,F$, the features in the original space can be transformed into the Grassmann manifold space, and the migration from the source project to the target project can be completed.
$G$ is a positive semidefinite matrix, which can be computed [41] in closed form from the previously defined matrices as follows:

$$G = \begin{bmatrix} P_S U_1 & R_S U_2 \end{bmatrix} \begin{bmatrix} \Lambda_1 & \Lambda_2 \\ \Lambda_2 & \Lambda_3 \end{bmatrix} \begin{bmatrix} U_1^{T} P_S^{T} \\ U_2^{T} R_S^{T} \end{bmatrix} \qquad (6)$$

where $\Lambda_1$, $\Lambda_2$, and $\Lambda_3$ are diagonal matrices whose diagonal elements are

$$\lambda_{1i} = 1 + \frac{\sin(2\theta_i)}{2\theta_i}, \quad \lambda_{2i} = \frac{\cos(2\theta_i) - 1}{2\theta_i}, \quad \lambda_{3i} = 1 - \frac{\sin(2\theta_i)}{2\theta_i} \qquad (7)$$
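For concreteness, the following NumPy/SciPy sketch computes $G$ following Equations (2)-(7). It is not the authors' implementation; the function name gfk_kernel, the construction of the PCA subspaces from the data matrices, and the numerical safeguards are our own assumptions.
```python
import numpy as np
from scipy.linalg import null_space

def gfk_kernel(X_S, X_T, d):
    """Sketch of the geodesic flow kernel G (Equations (3)-(7)) for feature
    matrices X_S (n x D) and X_T (m x D) and subspace dimension d."""
    # d-dimensional PCA subspaces P_S, P_T (D x d) of the two projects.
    P_S = np.linalg.svd(X_S - X_S.mean(0), full_matrices=False)[2][:d].T
    P_T = np.linalg.svd(X_T - X_T.mean(0), full_matrices=False)[2][:d].T
    R_S = null_space(P_S.T)                         # D x (D-d) orthogonal complement of P_S

    # Equation (3): P_S^T P_T = U1 Gamma V^T and R_S^T P_T = -U2 Sigma V^T.
    U1, cos_t, Vt = np.linalg.svd(P_S.T @ P_T)
    sin_t = np.sqrt(np.clip(1.0 - cos_t**2, 1e-12, None))
    U2 = -(R_S.T @ P_T) @ Vt.T / sin_t              # divide each column by sin(theta_i)

    # Equations (4) and (7): principal angles and the diagonal elements of Lambda_1..3.
    theta = np.maximum(np.arccos(np.clip(cos_t, -1.0, 1.0)), 1e-12)
    lam1 = 1 + np.sin(2 * theta) / (2 * theta)
    lam2 = (np.cos(2 * theta) - 1) / (2 * theta)
    lam3 = 1 - np.sin(2 * theta) / (2 * theta)

    # Equation (6): G = [P_S U1  R_S U2] [[L1, L2], [L2, L3]] [U1^T P_S^T; U2^T R_S^T].
    A = np.hstack([P_S @ U1, R_S @ U2])             # D x 2d
    M = np.block([[np.diag(lam1), np.diag(lam2)],
                  [np.diag(lam2), np.diag(lam3)]])  # 2d x 2d
    return A @ M @ A.T                              # D x D, positive semidefinite
```
Features can then be mapped into the manifold space via $Z = \sqrt{G}\,F$, for example by multiplying the feature matrices with the real part of scipy.linalg.sqrtm(G), as used in the Algorithm 1 sketch below.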

3.4. Model Building Algorithm

According to the above process, the model construction algorithm is shown as follows (Algorithm 1):
Algorithm 1 MFTCPDP
Input: Labeled Source Project S, Unlabeled Target Project T
Output: Predicted labels L
1: Preprocess the project data to get SP and TP;
2: Construct the manifold space according to Equations (1)–(3);
3: Calculate G according to Equations (5)–(7);
4: Use G to perform a manifold transformation on SP and TP and get SG, TG;
5: Use SG to train naive Bayes classifier and predict the labels of TG;
6: Repeat steps 1-5 for all source and target project pairs.
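As an illustration only, the following sketch strings the steps of Algorithm 1 together, reusing the gfk_kernel sketch from Section 3.3; the z-score preprocessing and the GaussianNB classifier are assumptions made for the example.
```python
from scipy.linalg import sqrtm
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

def mftcpdp(X_S, y_S, X_T, d):
    """Sketch of Algorithm 1: preprocess, manifold-transform, train, predict."""
    # Step 1: preprocess the project data to obtain SP and TP (z-score assumed here).
    scaler = StandardScaler().fit(X_S)
    SP, TP = scaler.transform(X_S), scaler.transform(X_T)

    # Steps 2-3: construct the manifold representation and compute G (Eqs. (1)-(7)).
    G = gfk_kernel(SP, TP, d)

    # Step 4: manifold feature transformation Z = sqrt(G) F, giving SG and TG.
    sq_G = sqrtm(G).real
    SG, TG = SP @ sq_G, TP @ sq_G

    # Step 5: train a naive Bayes classifier on SG and predict the labels of TG.
    return GaussianNB().fit(SG, y_S).predict(TG)

# Step 6 of Algorithm 1 loops this call over all source-target project pairs.
```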

4. Experimental Research

This section describes the experimental process used to explore the performance of the MFTCPDP method. We first introduce the experimental datasets and the performance evaluation indicator, then set the parameters involved in the experiment, and finally analyze the experimental results.

4.1. Experimental Datasets

The Relink dataset [42] and the AEEEM dataset [43] are two open source datasets that are frequently used by researchers in the field of software defect prediction, and we also used them in our experiments. Table 1 shows the project name, the number of modules, the number of features, the number of defects, and the defect ratio for the two datasets [44]. Among them, Apache, Safe, and Zxing belong to the Relink dataset, and the rest are projects in the AEEEM dataset.
From Table 1, we can observe that the projects in the Relink dataset have 26 features and the projects in the AEEEM dataset have 61 features. This paper aims to experimentally explore the feature distortion phenomenon that occurs when there are many features and to put forward manifold feature transformation as a solution.
The features of the two datasets mainly describe code complexity and abstract syntax tree information. The specific meaning of each feature can be queried on the Understand website (https://www.scitools.com, accessed on 18 August 2021).

4.2. Evaluation Indicator

This paper treats software defect prediction as a binary classification problem; that is, the classification result is either defective or non-defective. Machine learning research offers several binary classification evaluation indicators, such as accuracy, precision, and recall. The F1 value is defined as follows:
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (8)$$
Among these indicators, researchers in the field of software defect prediction most often use the F1 value. In a real software project, only a few modules may have defects; if accuracy alone is used for evaluation, then even a model that judges every module to be defect-free will obtain a very high accuracy, which does not mean the model is effective. As Equation (8) shows, the F1 value is the harmonic mean of precision and recall, so both indicators are taken into account at the same time. Therefore, we also used the F1 value as the main evaluation indicator.
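The following snippet is a small numerical check of Equation (8) using scikit-learn; the label vectors are invented purely for illustration.
```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical predictions: 1 = defective, 0 = non-defective.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

p = precision_score(y_true, y_pred)        # 3 / 4 = 0.75
r = recall_score(y_true, y_pred)           # 3 / 4 = 0.75
f1 = 2 * p * r / (p + r)                   # Equation (8): harmonic mean = 0.75
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```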

4.3. Experimental Parameter Setting

The MFTCPDP method includes two main steps: manifold feature transformation and defect prediction model construction. In the manifold feature transformation step, the feature dimension of the subspace must be set; we set it to 2 for the Relink dataset and to 10 for the AEEEM dataset. When building the model, we chose the naive Bayes classifier as the default classifier. The influence of different feature dimensions and different classifiers on the MFTCPDP method is discussed below.
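The sensitivity to these two choices can be explored with a simple sweep, sketched below in the spirit of Figures 2-5; the dimension range, the helper name dimension_sweep, and the use of the real target labels purely for scoring are our assumptions.
```python
from sklearn.metrics import f1_score

def dimension_sweep(X_S, y_S, X_T, y_T, dims=(2, 4, 6, 8, 10)):
    """Hypothetical sweep over subspace dimensions using the mftcpdp sketch above;
    y_T (the real target labels) is used only to score the predictions."""
    return {d: f1_score(y_T, mftcpdp(X_S, y_S, X_T, d)) for d in dims}

# Swapping GaussianNB for another classifier inside mftcpdp gives the
# corresponding classifier comparison (cf. Figures 4 and 5).
```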
All experiments were run on a computer with the following configuration: Windows 10, 64-bit; CPU: Intel(R) Core(TM) i5-6300HQ; RAM: 12 GB.

4.4. Experimental Results and Analysis

In order to verify the effectiveness of the MFTCPDP method, we first conducted an experimental study of MFTCPDP on two datasets frequently used by researchers, and then compared its performance with the benchmark method, several currently popular CPDP methods, and the WPDP method. Finally, we considered the internal factors that affect the performance of the MFTCPDP method, including the influence of the method parameters and the choice of classifier. Accordingly, we designed three research questions and analyzed the MFTCPDP method comprehensively through the answers to these questions.
Question 1: Compared with the benchmark method, can the MFTCPDP method improve the performance of CPDP? Additionally, is it better than several popular CPDP methods? How does the performance compare to the WPDP method?
Herbold et al. [45] and Zhou et al. [46] conducted empirical studies of the currently reproducible CPDP methods and found that most of them struggle to surpass the performance of some existing classical methods, such as the distribution difference reduction method proposed by Watanabe et al. [47] and evaluated by Herbold et al., the Burak filtering method based on instance selection proposed by Turhan et al. [14], and the TCA+ method based on feature transformation and migration proposed by Nam et al. [19]. These three methods rank high in the comprehensive performance of all CPDP methods; therefore, we compared the MFTCPDP method with them. Moreover, in CPDP research, Zimmermann et al. [26] proposed directly using source project data to train the model and then predicting the target project, which is recorded as the DCPDP (i.e., direct cross-project defect prediction) method and is often used as a benchmark for comparison. In addition, we also conducted an experimental comparison with the 10-fold cross-validated WPDP method, a reasonable reference setting that shows the gap between the MFTCPDP method and WPDP and helps analyze the limitations of the current method.
Before comparing the methods, we first conducted experiments to explore the feature distortion phenomenon in the original feature space. To this end, we ran the DCPDP method with a data standardization operation, recorded as DCPDP-Norm, and compared it with the DCPDP method. Generally speaking, a simple data normalization operation can improve experimental performance, but when the value distribution in the feature space is distorted, it may instead reduce performance.
We conducted experiments on the two datasets and obtained the following results. Table 2 shows the performance comparison between DCPDP-Norm and DCPDP. Table 3 shows the performance comparison between the MFTCPDP method and the benchmark method DCPDP, several CPDP methods, and the WPDP method.
First, we compared DCPDP with DCPDP-Norm, which adds a data standardization operation. As shown in Table 2, the overall performance difference between DCPDP-Norm and DCPDP on the Relink dataset (i.e., rows 2 to 7) is considerable. Specifically, in individual CPDP experiments the data standardization operation sometimes fails to improve performance and can even degrade it. On the Relink dataset, the performance of four pairs of experiments decreased with DCPDP-Norm, in some cases significantly; for example, in the Safe→Apache pair, the F1 value of DCPDP was 0.714, but the F1 value of DCPDP-Norm was 0.583. Among the six experiments on the Relink dataset, only two showed improved performance after data standardization. The same phenomenon appears on the AEEEM dataset (i.e., rows 8 to 27); for example, JDT→LC, JDT→PDE, and PDE→ML all show reduced performance after data standardization. Overall, performance degradation is not rare on the two datasets: of the 26 groups of CPDP experiments, 12 show performance degradation, so almost half of the experiments were not improved. However, under normal circumstances, training the model on standardized data usually yields better performance. Therefore, we believe that there is a feature distortion phenomenon in the data distribution of the original feature space, which causes the performance to decrease rather than increase.
Table 3 shows the performance comparison between the MFTCPDP method and the benchmark method DCPDP, several CPDP methods, and the WPDP method. First, compared with DCPDP, the MFTCPDP method achieves better prediction performance in most of the 26 cross-project defect prediction experiments on the two datasets: the F1 value of the MFTCPDP method is higher in 20 groups of experiments, and although DCPDP obtains a higher F1 value in the remaining 6 groups, the gap in these 6 groups is very small. For example, in the Safe→Zxing pair, the F1 value of MFTCPDP is 0.652, only 0.002 lower than the 0.654 obtained by DCPDP.
In addition, machine learning algorithms are currently developing very rapidly, such as online learning algorithms [48], multi-objective optimization algorithms [49], and related heuristic algorithms [50]. Therefore, CPDP research need not stop at the feature transformation or feature migration methods discussed in this article; the above algorithms are also worth introducing into different scenarios of CPDP research. For example, online learning algorithms can be combined with just-in-time software defect prediction [51], training models can be obtained through dynamic incremental methods, multi-objective optimization and heuristic algorithms can be used for feature selection and instance selection [52], and some data classification algorithms can be used to enhance the effectiveness of defect prediction [53].
There are still several directions for future work. First, the feature dimension used when transforming the original feature space into the manifold space was not selected automatically; later work can focus on the automatic selection of optimal parameters. Second, the distribution difference between the source project and the target project in the manifold space can be adapted from two aspects, marginal distribution and conditional distribution, which were not considered in this paper. Third, in the experiments we chose two software project datasets, and our method can be further evaluated on other suitable datasets in the future. Finally, different classifiers have a certain impact on the method, and related research shows that ensemble learning can significantly improve performance; in the future, we can study the impact of ensemble learning on the performance of the cross-project defect prediction method.

Author Contributions

Conceptualization, Y.Z. (Yu Zhao) and Y.Z. (Yi Zhu); methodology, Y.Z. (Yu Zhao); software, Y.Z. (Yu Zhao) and Q.Y.; validation, Y.Z. (Yu Zhao), Y.Z. (Yi Zhu) and X.C.; formal analysis, Y.Z. (Yu Zhao); investigation, Y.Z. (Yu Zhao); resources, Y.Z. (Yu Zhao); data curation, Y.Z. (Yu Zhao); writing—original draft preparation, Y.Z. (Yu Zhao); writing—review and editing, Y.Z. (Yu Zhao) and Y.Z. (Yi Zhu); visualization, Y.Z. (Yu Zhao); supervision, Q.Y.; project administration, X.C.; funding acquisition, Y.Z. (Yu Zhao), Y.Z. (Yi Zhu), Q.Y., and X.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (62077029, 61902161); the Open Project Fund of Key Laboratory of Safety-Critical Software Ministry of Industry and Information Technology (NJ2020022); the Applied Basic Research Program of Xuzhou (KC19004); the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (18KJB520016); the Graduate Science Research Innovation Program of Jiangsu Province (KYCX20_2384, KYCX20_2380).

Data Availability Statement

The datasets presented in this study are available on https://bug.inf.usi.ch/download.php (accessed on 18 August 2021); For any other questions, please contact the corresponding author or first author of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gong, L.N.; Jiang, S.J.; Jiang, L. Research progress of software defect prediction. J. Softw. 2019, 30, 3090–3114. [Google Scholar]
  2. Chen, X.; Gu, Q.; Liu, W.S.; Liu, S.; Ni, C. Survey of static software defect prediction. J. Softw. 2016, 27, 1–25. [Google Scholar]
  3. Hall, T.; Beecham, S.; Bowes, D.; Gray, D.; Counsell, S. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. 2012, 38, 1276–1304. [Google Scholar]
  4. Li, Z.; Jing, X.Y.; Zhu, X. Progress on approaches to software defect prediction. IET Softw. 2018, 12, 161–175. [Google Scholar] [CrossRef]
  5. Li, Y.; Huang, Z.Q.; Wang, Y.; Fang, B.W. Survey on data driven software defects prediction. Acta Electron. Sin. 2017, 45, 982–988. [Google Scholar]
  6. Chidamber, S.R.; Kemerer, C.F. A metrics suite for object-oriented design. IEEE Trans. Softw. Eng. 1994, 20, 476–493. [Google Scholar] [CrossRef] [Green Version]
  7. Jin, C. Software defect prediction model based on distance metric learning. Soft Comput. 2021, 25, 447–461. [Google Scholar] [CrossRef]
  8. Hosseini, S.; Turhan, B.; Gunarathna, D. A systematic literature review and meta-analysis on cross project defect prediction. IEEE Trans. Softw. Eng. 2019, 45, 111–147. [Google Scholar] [CrossRef] [Green Version]
  9. Bowes, D.; Hall, T.; Petric, J. Software defect prediction: Do different classifiers find the same defects. Softw. Qual. J. 2018, 26, 525–552. [Google Scholar] [CrossRef] [Green Version]
  10. Manjula, C.; Florence, L. Deep neural network-based hybrid approach for software defect prediction using software metrics. Clust. Comput. 2019, 22, 9847–9863. [Google Scholar] [CrossRef]
  11. Chen, S.; Ye, J.M.; Liu, T. Domain adaptation approach for cross-project software defect prediction. J. Softw. 2020, 31, 266–281. [Google Scholar]
  12. Chen, X.; Wang, L.P.; Gu, Q.; Wang, Z.; Ni, C.; Liu, W.S.; Wang, Q. A survey on cross-project software defect prediction methods. Chin. J. Comput. 2018, 41, 254–274. [Google Scholar]
  13. Herbold, S.; Trautsch, A.; Grabowski, J. Global vs. local models for cross-project defect prediction. Empir. Softw. Eng. 2017, 22, 1866–1902. [Google Scholar] [CrossRef]
  14. Turhan, B.; Menzies, T.; Bener, A.; Di Stefano, J. On the relative value of cross-company and within-company data for defect prediction. Empir. Softw. Eng. 2009, 14, 540–578. [Google Scholar] [CrossRef] [Green Version]
  15. Peters, F.; Menzies, T.; Marcus, A. Better cross company defect prediction. In Proceedings of the 2013 10th Working Conference on Mining Software Repositories, San Francisco, CA, USA, 18–19 May 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 409–418. [Google Scholar]
  16. Ma, Y.; Luo, G.; Zeng, X.; Chen, A. Transfer learning for cross-company software defect prediction. Inf. Softw. Technol. 2012, 54, 248–256. [Google Scholar]
  17. Xia, X.; Lo, D.; Pan, S.J.; Nagappan, N.; Wang, X. Hydra: Massively compositional model for cross-project defect prediction. IEEE Trans. Softw. Eng. 2016, 42, 977–998. [Google Scholar] [CrossRef]
  18. Sun, Z.; Li, J.; Sun, H.; He, L. CFPS: Collaborative filtering based source projects selection for cross-project defect prediction. Appl. Soft Comput. 2020, 99, 106940. [Google Scholar]
  19. Nam, J.; Pan, S.J.; Kim, S. Transfer defect learning. In Proceedings of the 2013 35th International Conference on Software Engineering, San Francisco, CA, USA, 18–26 May 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 382–391. [Google Scholar]
  20. Yu, Q.; Jiang, S.; Zhang, Y. A feature matching and transfer approach for cross-company defect prediction. J. Syst. Softw. 2017, 132, 366–378. [Google Scholar] [CrossRef]
  21. Ni, C.; Chen, X.; Liu, W.S. Cross-project defect prediction method based on feature transfer and instance transfer. J. Softw. 2019, 30, 1308–1329. [Google Scholar]
  22. Wang, J.; Chen, Y.; Hao, S.; Feng, W.; Shen, Z. Balanced distribution adaptation for transfer learning. In Proceedings of the 2017 IEEE International Conference on Data Mining, New Orleans, LA, USA, 18–21 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1129–1134. [Google Scholar]
  23. Wang, J.; Feng, W.; Chen, Y.; Yu, H.; Huang, M.; Yu, P.S. Visual domain adaptation with manifold embedded distribution alignment. In Proceedings of the 26th ACM International Conference on Multimedia, New York, NY, USA, 22–26 October 2018; ACM: New York, NY, USA, 2018; pp. 402–410. [Google Scholar]
  24. Baktashmotlagh, M.; Harandi, M.T.; Lovell, B.C.; Salzmann, M. Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision. Sydney Convention and Exhibition Centre, Sydney, Australia, 1–8 December 2013; ACM: New York, NY, USA, 2013; pp. 769–776. [Google Scholar]
  25. Briand, L.C.; Melo, W.L.; Wust, J. Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Trans. Softw. Eng. 2002, 28, 706–720. [Google Scholar] [CrossRef] [Green Version]
  26. Zimmermann, T.; Nagappan, N.; Gall, H.; Giger, E.; Murphy, B. Cross-project defect prediction: A large scale experiment on data vs. domain vs. process. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM Sigsoft Symposium on the Foundations of Software Engineering, New York, NY, USA, 24–28 August 2009; pp. 91–100. [Google Scholar]
  27. He, Z.; Shu, F.; Yang, Y.; Li, M.; Wang, Q. An investigation on the feasibility of cross-project defect prediction. Autom. Softw. Eng. 2012, 19, 167–199. [Google Scholar] [CrossRef]
  28. Herbold, S. Training data selection for cross-project defect prediction. In Proceedings of the 9th International Conference on Predictive Models in Software Engineering, New York, NY, USA, 9 October 2013; pp. 61–69. [Google Scholar]
  29. Yu, Q.; Jiang, S.; Qian, J. Which is more important for cross-project defect prediction: Instance or feature. In Proceedings of the 2016 International Conference on Software Analysis, Testing and Evolution, Kunming, China, 3–4 November 2016; Volume 11, pp. 90–95. [Google Scholar]
  30. Yu, Q.; Qian, J.; Jiang, S.; Wu, Z.; Zhang, G. An empirical study on the effectiveness of feature selection for cross-project defect prediction. IEEE Access 2019, 7, 35710–35718. [Google Scholar] [CrossRef]
  31. Wu, F.; Jing, X.Y.; Sun, Y.; Sun, J.; Huang, L.; Cui, F. Cross-project and within-project semi-supervised software defect prediction: A unified approach. IEEE Trans. Reliab. 2018, 67, 581–597. [Google Scholar] [CrossRef]
  32. Li, Z.; Jing, X.Y.; Wu, F.; Zhu, X.; Xu, B.; Ying, S. Cost-sensitive transfer kernel canonical correlation analysis for heterogeneous defect prediction. Autom. Softw. Eng. 2018, 25, 201–245. [Google Scholar] [CrossRef]
  33. Fan, G.; Diao, X.; Yu, H.; Chen, L. Cross-project defect prediction method based on instance filtering and transfer. Comput. Eng. 2020, 46, 197–202+209. [Google Scholar]
  34. Zhang, Y.; Lo, D.; Xia, X.; Sun, J. An empirical study of classifier combination for cross-project defect prediction. In Proceedings of the IEEE Computer Software & Applications Conference, Taichung, Taiwan, 1–5 July 2015; Volume 7, pp. 264–269. [Google Scholar]
  35. Chen, J.; Hu, K.; Yang, Y.; Liu, Y.; Xuan, Q. Collective transfer learning for defect prediction. Neurocomputing 2020, 416, 103–116. [Google Scholar] [CrossRef]
  36. Balasubramanian, M.; Schwartz, E.L.; Tenenbaum, J.B.; de Silva, V.; Langford, J.C. The isomap algorithm and topological stability. Science 2002, 295, 7. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  37. Hamm, J.; Lee, D.D. Grassmann discriminant analysis: A unifying view on subspace-based learning. In Proceedings of the 25th International Conference on Machine Learning, New York, NY, USA, 5–9 July 2008; pp. 376–383. [Google Scholar]
  38. Gopalan, R.; Li, R.; Chellappa, R. Domain adaptation for object recognition: An unsupervised approach. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 999–1006. [Google Scholar]
  39. Baktashmotlagh, M.; Harandi, M.T.; Lovell, B.C.; Salzmann, M. Domain adaptation on the statistical manifold. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 2481–2488. [Google Scholar]
  40. Gong, B.; Shi, Y.; Sha, F.; Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 2066–2073. [Google Scholar]
  41. Wang, J.; Chen, Y.; Feng, W.; Yu, H.; Huang, M.; Yang, Q. Transfer learning with dynamic distribution adaptation. ACM Trans. Intell. Syst. Technol. 2020, 11, 1–25. [Google Scholar] [CrossRef] [Green Version]
  42. Wu, R.; Zhang, H.; Kim, S.; Cheung, S.C. ReLink: Recovering links between bugs and changes. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, Szeged, Hungary, 5–9 September 2011; ACM: New York, NY, USA, 2011; pp. 15–25. [Google Scholar]
  43. D'Ambros, M.; Lanza, M.; Robbes, R. Evaluating defect prediction approaches: A benchmark and an extensive comparison. Empir. Softw. Eng. 2012, 17, 531–577. [Google Scholar] [CrossRef]
  44. Yang, Y.; Yang, J.; Qian, H. Defect prediction by using cluster ensembles. In Proceedings of the 2018 Tenth International Conference on Advanced Computational Intelligence, Xiamen, China, 29–31 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 631–636. [Google Scholar]
  45. Herbold, S.; Trautsch, A.; Grabowski, J. A comparative study to benchmark cross-project defect prediction approaches. IEEE Trans. Softw. Eng. 2017, 44, 811–833. [Google Scholar]
  46. Zhou, Y.; Yang, Y.; Lu, H.; Chen, L.; Li, Y.; Zhao, Y.; Qian, J.; Xu, B. How far we have progressed in the journey? an examination of cross-project defect prediction. ACM Trans. Softw. Eng. Methodol. 2018, 27, 1–51. [Google Scholar] [CrossRef]
  47. Watanabe, S.; Kaiya, H.; Kaijiri, K. Adapting a fault prediction model to allow inter language reuse. In Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, Leipzig, Germany, 12–13 May 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 19–24. [Google Scholar]
  48. Zhao, H.; Zhang, C. An online-learning-based evolutionary many-objective algorithm. Inf. Sci. 2020, 509, 1–21. [Google Scholar] [CrossRef]
  49. Liu, Z.Z.; Wang, Y.; Huang, P.Q. A many-objective evolutionary algorithm with angle-based selection and shift-based density estimation. Inf. Sci. 2020, 509, 400–419. [Google Scholar] [CrossRef] [Green Version]
  50. Dulebenets, M.A. A novel memetic algorithm with a deterministic parameter control for efficient berth scheduling at marine container terminals. Marit. Bus. Rev. 2017, 2, 302–330. [Google Scholar] [CrossRef] [Green Version]
  51. Pasha, J.; Dulebenets, M.A.; Kavoosi, M.; Abioye, O.F.; Wang, H.; Guo, W. An optimization model and solution algorithms for the vehicle routing problem with a “factory-in-a-box”. IEEE Access 2020, 8, 134743–134763. [Google Scholar] [CrossRef]
  52. D’angelo, G.; Pilla, R.; Tascini, C.; Rampone, S. A proposal for distinguishing between bacterial and viral meningitis using genetic programming and decision trees. Soft Comput. 2019, 23, 11775–11791. [Google Scholar] [CrossRef]
  53. Panda, N.; Majhi, S.K. How effective is the salp swarm algorithm in data classification. In Computational Intelligence in Pattern Recognition; Springer: Singapore, 2020; pp. 579–588. [Google Scholar]
Figure 1. MFTCPDP method framework.
Figure 2. F1 values obtained from different subspace dimensions on the Relink dataset.
Figure 3. F1 values obtained from different subspace dimensions on the AEEEM dataset.
Figure 4. F1 values obtained by different classifiers on the Relink dataset.
Figure 5. F1 values obtained by different classifiers on the AEEEM dataset.
Table 1. Experimental datasets.

Project | Modules | Features | Defects | Defect Ratio
Apache | 194 | 26 | 98 | 50%
Safe | 56 | 26 | 22 | 39%
Zxing | 399 | 26 | 118 | 30%
EQ | 325 | 61 | 129 | 40%
JDT | 997 | 61 | 206 | 21%
LC | 399 | 61 | 64 | 9%
ML | 1862 | 61 | 245 | 13%
PDE | 1492 | 61 | 209 | 14%
Table 2. Comparison of F1 values between DCPDP-Norm and DCPDP.

Source→Target | DCPDP-Norm | DCPDP
Apache→Safe | 0.787 | 0.459
Apache→Zxing | 0.606 | 0.582
Safe→Apache | 0.583 | 0.714
Safe→Zxing | 0.651 | 0.654
Zxing→Apache | 0.437 | 0.645
Zxing→Safe | 0.645 | 0.699
EQ→JDT | 0.596 | 0.422
EQ→LC | 0.843 | 0.813
EQ→ML | 0.367 | 0.344
EQ→PDE | 0.775 | 0.687
JDT→EQ | 0.716 | 0.633
JDT→LC | 0.755 | 0.884
JDT→ML | 0.788 | 0.769
JDT→PDE | 0.702 | 0.826
LC→EQ | 0.709 | 0.663
LC→JDT | 0.804 | 0.545
LC→ML | 0.770 | 0.462
LC→PDE | 0.771 | 0.789
ML→EQ | 0.711 | 0.688
ML→JDT | 0.732 | 0.578
ML→LC | 0.837 | 0.875
ML→PDE | 0.804 | 0.809
PDE→EQ | 0.722 | 0.674
PDE→JDT | 0.519 | 0.571
PDE→LC | 0.523 | 0.871
PDE→ML | 0.384 | 0.437
Table 3. Comparison of F1 values between MFTCPDP, several CPDP methods, and WPDP.

Source→Target | MFTCPDP | TCA+ | Watanabe | Burak | DCPDP | WPDP
Safe→Apache | 0.711 | 0.670 | 0.716 | 0.565 | 0.714 | 0.625
Zxing→Apache | 0.653 | 0.671 | 0.705 | 0.358 | 0.645 |
Zxing→Safe | 0.769 | 0.512 | 0.717 | 0.735 | 0.699 | 0.703
Apache→Safe | 0.460 | 0.569 | 0.717 | 0.234 | 0.459 |
Apache→Zxing | 0.587 | 0.595 | 0.653 | 0.155 | 0.582 | 0.666
Safe→Zxing | 0.652 | 0.628 | 0.636 | 0.596 | 0.654 |
MEAN | 0.638 | 0.607 | 0.691 | 0.441 | 0.625 | 0.665
JDT→EQ | 0.556 | 0.606 | 0.688 | 0.452 | 0.633 | 0.723
LC→EQ | 0.667 | 0.549 | 0.683 | 0.473 | 0.663 |
ML→EQ | 0.623 | 0.637 | 0.679 | 0.452 | 0.688 |
PDE→EQ | 0.628 | 0.608 | 0.687 | 0.453 | 0.674 |
LC→JDT | 0.572 | 0.731 | 0.818 | 0.700 | 0.545 | 0.829
PDE→JDT | 0.724 | 0.743 | 0.827 | 0.702 | 0.571 |
ML→JDT | 0.608 | 0.726 | 0.828 | 0.701 | 0.578 |
EQ→JDT | 0.527 | 0.455 | 0.736 | 0.432 | 0.422 |
JDT→LC | 0.889 | 0.798 | 0.825 | 0.861 | 0.884 | 0.865
ML→LC | 0.882 | 0.861 | 0.860 | 0.863 | 0.875 |
PDE→LC | 0.876 | 0.786 | 0.811 | 0.860 | 0.871 |
EQ→LC | 0.842 | 0.479 | 0.015 | 0.808 | 0.813 |
JDT→ML | 0.836 | 0.772 | 0.777 | 0.805 | 0.769 | 0.837
LC→ML | 0.781 | 0.470 | 0.806 | 0.807 | 0.462 |
PDE→ML | 0.833 | 0.788 | 0.807 | 0.807 | 0.437 |
EQ→ML | 0.762 | 0.385 | 0.349 | 0.590 | 0.344 |
EQ→PDE | 0.718 | 0.619 | 0.760 | 0.525 | 0.687 | 0.831
JDT→PDE | 0.828 | 0.797 | 0.781 | 0.790 | 0.826 |
LC→PDE | 0.780 | 0.804 | 0.812 | 0.796 | 0.789 |
ML→PDE | 0.822 | 0.819 | 0.822 | 0.794 | 0.809 |
MEAN | 0.738 | 0.671 | 0.718 | 0.684 | 0.667 | 0.817
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
