1. Introduction
In recent years, with the advancement of information technology, many new applications, such as social networks and e-commerce, have emerged as hubs for information gathering and dissemination. The majority of the information generated and shared on the Internet is in the form of text data, for example, reports, scientific articles, tweets and product reviews.
Text classification plays an important role in handling and organizing text data in real-world scenarios, such as webpage or document classification, user sentiment analysis for social network multimedia, spam email filtering, information dissemination, document genre identification and recommendation systems.
Many different machine learning methods are used for text classification, including support vector machines (SVM), logistic regression, boosting, Naive Bayes, k-nearest neighbors (kNN) and neural-network-based systems.
For the classification of scientific texts, these methods are generally applied only to titles and abstracts. However, users searching full texts are more likely to find relevant articles than those searching only titles and abstracts. This finding affirms the value of full text collections for text retrieval and provides a starting point for future work on algorithms that take advantage of rapidly growing digital archives.
In this paper, we show the advantages of applying ensemble learning techniques and multi-view models to full text classification. The primary aim is to exploit the organization of structured documents to build different views that represent each instance. In this way, an ensemble of classifiers can be created by training a specific classifier per view. To obtain the views of the data, we propose using the text sections.
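As a minimal sketch of this idea, the Python fragment below treats each section as one view and combines one classifier per view. The majority-vote combination rule, the `MultiViewEnsemble` class and the keyword-based stand-in classifiers are illustrative assumptions, not the scheme presented in Section 4.

```python
from collections import Counter


class MultiViewEnsemble:
    """One classifier per view (section); majority voting is an assumed combiner."""

    def __init__(self, classifiers):
        # classifiers: dict mapping a section name to a callable text -> label
        self.classifiers = classifiers

    def predict(self, document):
        # document: dict mapping section names to section text
        votes = [
            classify(document.get(section, ""))
            for section, classify in self.classifiers.items()
        ]
        # The most frequent label across the views wins.
        return Counter(votes).most_common(1)[0][0]


def keyword_classifier(keyword):
    """Hypothetical stand-in for a classifier trained on one view."""
    def classify(text):
        return "relevant" if keyword in text.lower() else "non-relevant"
    return classify


ensemble = MultiViewEnsemble({
    "Title": keyword_classifier("cardiac"),
    "Abstract": keyword_classifier("heart"),
    "Results": keyword_classifier("infarction"),
})

# Toy document: two of the three views vote "relevant".
document = {
    "Title": "Cardiac outcomes in a cohort study",
    "Abstract": "We follow patients with heart failure.",
    "Results": "No significant differences were found.",
}
print(ensemble.predict(document))
```

In practice, each stand-in would be replaced by a model trained only on the text of its own section, and an odd number of views avoids ties under majority voting.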
A second focus of this research is to compare classification based on full text with classification based only on titles and abstracts, and to evaluate the use of specific sections in order to determine their significance in the scientific article classification process.
The rest of the paper is organized as follows. Section 2 reviews the concepts of ensemble classifiers and multi-view ensemble learning. Section 3 introduces work related to the research scope of this paper. In Section 4, we present a novel multi-view ensemble learning scheme for structured text classification. Section 5 demonstrates the use of the technique to classify a biomedical full text corpus. Section 6 and Section 7 show and discuss the results, respectively. Finally, Section 8 concludes the paper by summarizing the main contributions and presenting future work to improve the model.
5. Experiments
The proposed architecture can classify any data that can be transformed into structured text instances. To test its performance, the system is used to classify MEDLINE full text scientific documents.
Figure 3 shows the complete configuration of the proposed architecture used for the experiments. It has two phases that need to be defined before training: the view generation phase and the ensemble training phase.
Beforehand, a reliable, pre-classified full text corpus must be obtained, as explained below.
5.1. Dataset Construction
For the purpose of this study, we use a corpus based on OHSUMED [15]. OHSUMED is composed of MEDLINE documents that contain Title, Abstract, MeSH terms, author, source and publication type of biomedical articles published between 1988 and 1991.
Each document of OHSUMED has one or more associated categories (from 26 disease categories). To carry out a binary classification, we select one of these categories as relevant and consider the others as non-relevant. If a document is assigned two or more categories and one of them is the relevant one, the document is considered relevant and is excluded from the set of non-relevant documents.
For example, to build a corpus for the C14 Cardiovascular Diseases category, we select the documents that belong to the C14 category as relevant. Then, from the common pool of documents in the remaining categories, every document also categorized as “Cardiovascular Diseases” is removed. The resulting set is taken as the non-relevant set of documents. The number of relevant and non-relevant documents in each corpus is shown in Table 1.
Note that C21 and C24 categories were discarded because they have only 1 and 17 relevant documents, respectively.
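The labeling rule described above can be sketched in a few lines of Python. The function name `build_binary_corpus`, the document identifiers and the category assignments are hypothetical and serve only to illustrate the rule.

```python
def build_binary_corpus(doc_categories, relevant_category):
    """Split document ids into relevant and non-relevant sets for one category."""
    relevant, non_relevant = set(), set()
    for doc_id, categories in doc_categories.items():
        if relevant_category in categories:
            # Any document tagged with the target category is relevant,
            # even if it also carries other categories.
            relevant.add(doc_id)
        else:
            non_relevant.add(doc_id)
    return relevant, non_relevant


# Hypothetical document ids and category assignments.
doc_categories = {
    "doc1": {"C14"},
    "doc2": {"C04", "C14"},
    "doc3": {"C04"},
    "doc4": {"C23"},
}
relevant, non_relevant = build_binary_corpus(doc_categories, "C14")
print(sorted(relevant), sorted(non_relevant))
```

Here `doc2` carries both C04 and C14, so it is placed in the relevant set and never appears among the non-relevant documents.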
As OHSUMED only contains the title and abstract of the documents, we downloaded a full text corpus available at PubMed/NCBI (459,009 documents in total). The PubMed tool provides access to references and abstracts on life sciences and biomedical topics. Most of these documents are manually annotated by health experts with the MeSH Heading descriptors under 16 major categories, which facilitates the search for specific biomedical-related topics.
To obtain the MeSH terms, we downloaded the 2017 MeSH trees from NCBI. MEDLINE MeSH Headings are mapped to OHSUMED categories through their associated MeSH terms.
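One plausible way to implement such a mapping, sketched below, relies on the fact that a MeSH tree number encodes its top-level branch as a prefix (for example, tree numbers beginning with `C14` belong to Cardiovascular Diseases). The `heading;tree number` line layout, the function names and the sample entries are assumptions made for illustration; the exact procedure used to build the corpus is not detailed here.

```python
from collections import defaultdict


def load_heading_categories(lines):
    """Map each MeSH heading to the disease categories (C..) of its tree numbers."""
    heading_to_categories = defaultdict(set)
    for line in lines:
        heading, _, tree_number = line.strip().partition(";")
        top_level = tree_number.split(".")[0]  # 'C14.280.647' -> 'C14'
        if top_level.startswith("C"):  # keep only the Diseases branch
            heading_to_categories[heading].add(top_level)
    return dict(heading_to_categories)


def document_categories(mesh_headings, heading_to_categories):
    """Union of the categories of all headings annotated on a document."""
    categories = set()
    for heading in mesh_headings:
        categories |= heading_to_categories.get(heading, set())
    return categories


# Illustrative entries in the assumed 'heading;tree number' layout.
sample_lines = [
    "Myocardial Infarction;C14.280.647.500",
    "Myocardial Infarction;C14.907.585.500",
    "Neoplasms;C04",
    "Body Regions;A01",
]
heading_to_categories = load_heading_categories(sample_lines)
doc_cats = document_categories(
    ["Myocardial Infarction", "Neoplasms", "Body Regions"],
    heading_to_categories,
)
print(sorted(doc_cats))
```

A heading may appear under several tree numbers, which is why each heading maps to a set of categories rather than to a single one.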
The documents were filtered by their MeSH (Medical Subject Headings) classes; MeSH is the National Library of Medicine (NLM) controlled vocabulary thesaurus used for indexing PubMed articles. The corresponding full text documents were obtained from the NCBI PubMed Central (PMC) repositories.
All the MEDLINE scientific full text documents contained in the corpus share a common structure, which we aggregated into the following sections: Title, Abstract, Introduction, Methods (Materials and Methods, Methods, Experimental Procedures), Results (Results, Discussion, Results and Discussion) and Conclusions.
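The aggregation of section headings can be sketched as a lookup table followed by a merge. The alias table below follows the grouping listed above; the case-insensitive matching, the decision to skip unlisted headings and the names `SECTION_ALIASES` and `aggregate_sections` are assumptions of this sketch.

```python
# Canonical section for each raw heading, following the grouping in the text.
SECTION_ALIASES = {
    "title": "Title",
    "abstract": "Abstract",
    "introduction": "Introduction",
    "materials and methods": "Methods",
    "methods": "Methods",
    "experimental procedures": "Methods",
    "results": "Results",
    "discussion": "Results",
    "results and discussion": "Results",
    "conclusions": "Conclusions",
}


def aggregate_sections(raw_sections):
    """Merge (heading, text) pairs, given in document order, into canonical sections."""
    merged = {}
    for heading, text in raw_sections:
        canonical = SECTION_ALIASES.get(heading.strip().lower())
        if canonical is None:
            continue  # assumption: headings outside the table are skipped
        merged.setdefault(canonical, []).append(text)
    # Join the pieces that fall under the same canonical section.
    return {name: " ".join(parts) for name, parts in merged.items()}


# Hypothetical raw sections extracted from one article.
raw = [
    ("Title", "A study of X"),
    ("Materials and Methods", "We measured X."),
    ("Results", "X increased."),
    ("Discussion", "This suggests Y."),
    ("Acknowledgements", "We thank Z."),
]
sections = aggregate_sections(raw)
print(sections)
```

Each canonical section produced this way can then serve as one view of the document.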
Finally, we obtained an OHSUMED-based full text corpus. A more detailed description of the full text document corpus creation process is available at [16,17].