1. Introduction
Information and knowledge are the basis for the development of human society. Text records 80 percent (https://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/, accessed on 5 May 2022) of the information of human civilization. The core task of information extraction (IE) is to obtain structured triples from unstructured text. It relies on two fundamental tasks: entity recognition and relation extraction. Li et al. [
1] proposed an entity recognition method that performs well. For relation extraction, new relation prediction is a challenge. Traditional relation extraction mainly adopts supervised learning methods for predefined relations. Its essence is to transform relation extraction into relation classification. There are two paradigms: pipeline relation extraction [
2] and joint relation extraction [
3]. Traditional RE performs well but faces two challenges. The first challenge is that predefined relation classifications do not work well on new relation extraction tasks. The second challenge is that relational data relies too much on manual cleaning and labeling, which is costly. In addition, for large-scale knowledge bases such as Wikidata, manual annotation would be challenging to accomplish.
To solve this problem, Banko [
4] first proposed the concept of open information extraction, that is, extracting structured relational facts from open and growing unstructured text. Information extraction should not be limited to a small set of known relations; RE should be able to extract a wide variety of relations from text. In this setting, the entity pair of a relation is known, while the relation type between the entities is unrestricted. Open-domain relation extraction should meet three requirements: automation, non-homologous corpora, and high efficiency.
Automation
The open relation extraction system can execute automatically, and the algorithm only needs to go through the corpus once for triple extraction. It should be based on an unsupervised extraction strategy and should not rely on predefined relations. In addition, the cost of manually constructing training samples should be small: only a small number of initialization seeds need to be labeled, or a small number of extraction templates defined.
Non-homologous corpus
The goal of open-domain relation extraction is domain independence; it should not be limited to a particular domain. In addition, it should avoid relying on domain-dependent deep syntactic analysis tools, such as full syntactic parsers.
Efficiency
Open information extraction systems handle large-scale corpora, and high efficiency is required to ensure rapid response. Prioritizing shallow syntactic features is therefore also necessary.
The open-domain relation extraction methodology mainly includes four types of methods: learning-based, clause-based, rule-based, and tuple-association-based. OLLIE [5] is an open information extraction system built on REVERB [6]. Its idea is to learn patterns to extract relations, which requires the patterns to have high generalization and robustness. ReNoun [7] is an open information extraction system that focuses entirely on noun relation extraction. In this system, high-precision lexical templates are first formulated manually. Then, based on the templates, relation tuples are mined and annotated with confidence scores. This work introduces a distant supervision method that reduces the reliance on labeled data. However, many complex sentences express relations spanning multiple clauses, and the model performs poorly on such sentences.
To improve accuracy, Mausam [8] proposed a clause-based system, ClausIE, which converts the dependencies of input sentences into simple coherent clauses based on linguistic knowledge; the argument information is then extracted according to the clause type. The Stanford open information extractor was proposed by Angeli et al. [9]. In addition, rule-based methods are more effective at improving accuracy. Gabriel [10] proposed non-lexicalized rules to convert the dependency syntax tree into a relational directed graph for processing, which works well on multilingual tasks. Rule-based methods are more accurate for specific tasks but rely on hand-crafted rules and are more complex. In order to improve the accuracy of extracting triples, tuple-association-based methods have also been proposed to solve the problem of sentence association.
The method reconstructs sentences, transforming a complex non-normalized sentence segment into a set of clauses with dependencies and easy-to-extract triples. Graphene [11] is a method for the hierarchical simplification of complex sentences from the perspective of obtaining tuple associations; it uses syntactic and lexical patterns to predict the modification relationships between sentences. The process consists of sentence reconstruction, clause splitting, information reduction, and clause triple extraction. This method is beneficial for improving accuracy.
Rule-based, clause-based, and tuple-association-based methods use templates to extract relational tuples from a syntactic and lexical point of view, whereas learning-based methods learn a pattern from semantic information and use the learned pattern to extract tuples. At present, rule-based methods can achieve high accuracy. Furthermore, learning-based methods such as Neural Snowball are effective and form the baseline for new relation extraction tasks. This paper uses Neural Snowball [12] as the open-domain RE baseline model for subsequent experiments.
Most current open relation extraction methods only utilize a small amount of supervised data and do not exploit newly growing relational data. However, these new relational data are valuable for predicting unseen relations. Based on the Bootstrap [13] algorithm, this paper comprehensively utilizes small-scale labeled data, large-scale unsupervised data, and few-sample new relational data. We propose ORES, a system that realizes data self-labeling and new relation extraction.
Figure 1 shows how ORES utilizes different kinds of data to learn new relations. After distant supervision obtains sentences, a model based on the self-sampling algorithm screens instances. A new-relation classifier is then trained on the filtered high-confidence instances and large-scale supervised data. The classifier can iteratively discover more reliable instances with new relation facts and adapt to open-growth scenarios. In more detail, we design a sample selector (SS) based on a Siamese network (SN) [14] to select high-confidence instances. The SN is used to classify whether existing instances and new ones express the same relation. The experimental results show that ORES can select high-quality instances and significantly improve new relation extraction from few-shot samples. In summary, our main contributions are as follows:
We propose a new open relation extraction system (ORES), a novel structure that trains a neural relation classifier with few initial new relation instances by iteratively accumulating new instances and facts from unlabeled data, using prior knowledge of historical relations. We design three functional components of ORES to realize the mining of Web text information, automatic labeling, and the extraction and output of new relation data.
We design and combine two new encoders based on the tBERT and K-BERT language models to better express features in textual information. Experiments show that their combination is beneficial for improving model performance. Specifically, for sample selector 1 of the data mining component, we introduce a topic model and design a new encoder based on tBERT, which improves the recall rate. For sample selector 2 of the relation extraction component, we design a new encoder based on K-BERT to inject external knowledge for a further improvement in accuracy. We also conducted a fusion experiment to demonstrate the effectiveness of the combination.
The rest of the paper is structured as follows: Related Work reviews important methods in relation extraction and ORES. Methodology describes the principles and operation process of ORES. The Experiments and Results section provides detailed information on the design, setup, implementation, and results of the experiments. Discussion analyzes the experimental results in detail. Conclusion and Future Work summarizes the work of this paper and proposes research directions for the future.
3. Methodology
BERT [28] can learn the semantic representation of most entities in the pre-training stage, but the representation of specific entities is not accurate enough. Therefore, we introduce the tBERT and K-BERT language models to improve component performance. Chen et al. [29] proved that topic awareness can yield improvements in machine translation. Inspired by these works, we integrate a topic model into BERT to compute topic information for each sentence, so that topic-relatedness can be considered in the similarity detection stage. Although there is no standard method for combining topic models with pre-trained models, Peinelt et al. [30] proposed the tBERT model, which combines BERT and topic models for semantic similarity. It performs better than BERT in specific domains, and we draw on the ideas and methods of this work. This paper introduces the structure and principle of each component while describing the working process of ORES and explains the corresponding formulas and symbols.
3.1. Few-Shot Relation Extraction Learning Framework
As shown in
Figure 3, ORES includes three parts: a data mining component, an open relation extraction component, and a real-time data management component. The data mining component mines instances from Web text with the same relation as the seed set. The open relation extraction component trains a classifier based on few-shot learning to predict a new relation, r. The real-time data management component stores and manages the output of new relational data. ORES works through multiple iterations, and each iteration has two phases:
Phase 1
(a) Add sentences in the initial seed set to the coarse selection set. As shown in
Figure 4, the sentence “Einstein founded the theory of relativity” is moved from the seed set into the coarse selection set.
(b) Using distant supervision, mine sentences from Web text and add them to the coarse selection set and the selection set. First, perform named-entity recognition on each sentence X in the initial seed set and extract its head and tail entities as an entity pair. Then, use the entity-pair matching function to mine all sentences in the Web text library that contain this entity pair, and add the matched sentences to the coarse selection set (see the sketch at the end of Phase 1).
(c) Load the sentences in the coarse selection set into instance selector 1. The tBERT encoder in instance selector 1 outputs the representation vectors of the embedding features of the sentence pair, denoted as em. Instance selector 1 then uses the Siamese network to calculate the similarity of the two sentences and score them: the distance between the two representation vectors is passed through a fully connected layer with learnable weight and bias parameters, and a sigmoid produces the similarity score.
(d) After instance selector 1 filters out the sentences whose similarity meets the threshold (both instance selectors are denoted as f, and the threshold is set to 0.5), add them to selection 1 and the seed set. Selection 1 is the subset of the Web text that passes this filter (see the sketch at the end of Phase 1).
(e) Train a relation classifier with a small amount of labeled data and instances in selection 1 as input to identify sentences with a new relation r.
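To make Phase 1 concrete, the following is a minimal Python sketch of steps (b)–(d). The NER pipeline (spaCy’s en_core_web_sm), the absolute-difference distance inside the Siamese scorer, and all data structures are illustrative assumptions rather than the exact ORES implementation.

```python
import spacy
import torch
import torch.nn as nn

nlp = spacy.load("en_core_web_sm")  # assumed NER pipeline; any NER model works

def entity_pair(sentence: str):
    """Step (b): extract a (head, tail) entity pair from a seed sentence via NER."""
    ents = [e.text for e in nlp(sentence).ents]
    return (ents[0], ents[-1]) if len(ents) >= 2 else None

def mine_coarse_selection(seed_sentences, web_sentences):
    """Steps (a)-(b): seeds plus distantly supervised entity-pair matches."""
    coarse = list(seed_sentences)
    pairs = [p for s in seed_sentences if (p := entity_pair(s)) is not None]
    for sent in web_sentences:
        if any(h in sent and t in sent for h, t in pairs):
            coarse.append(sent)
    return coarse

class SiameseScorer(nn.Module):
    """Steps (c)-(d): fully connected layer + sigmoid over the distance of two embeddings."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 1)          # learnable weight and bias

    def forward(self, emb_x: torch.Tensor, emb_y: torch.Tensor) -> torch.Tensor:
        # element-wise distance -> scalar -> similarity score in [0, 1]
        return torch.sigmoid(self.fc(torch.abs(emb_x - emb_y))).squeeze(-1)

def filter_selection(scorer, seed_embs, cand_embs, threshold: float = 0.5):
    """Step (d): keep candidates whose best similarity to any seed meets the threshold."""
    kept = []
    with torch.no_grad():
        for i, cand in enumerate(cand_embs):
            scores = scorer(seed_embs, cand.unsqueeze(0).expand_as(seed_embs))
            if scores.max().item() >= threshold:
                kept.append(i)
    return kept
```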
Phase 2
(a) Classifier g mines sentences in the Web text that may belong to relation r. The confidence threshold of g is set to 0.9. When an input instance meets the threshold condition, it is selected and added to the selection set. The selection set collects these candidate sentences, and the large-scale annotated relation set provides the supervised relations used during training.
(b) Filter the selection set again to enhance the performance of the classifier. Then, load the sentences in the selection set into instance selector 2.
(c) After the filtered sentences are loaded into the coarse selection set, one iteration is complete. To further improve accuracy, instance selector 2 adopts the K-BERT encoder, a language model that can inject external knowledge.
At the end of the first phase, the relation classifier can extract a new relation r. When mining new relational sentences based on relation r, the entity pairs can change. As shown in Figure 3, when the classifier learns the new relation “of_founder”, the following sentence pair is matched: “Newton established the classical mechanics system” and “Tesla invented the alternating current system”. The first instance reflects the founder relation, and the second reflects the new relation, inventor. As shown in Figure 4, such instances, called query instances, are expanded into the rough selection set in subsequent iterations. Similarity detection is performed between query instances and seed sentences; when the similarity is less than or equal to the threshold, the instance is regarded as expressing a new relation. In subsequent iterations, it is annotated and expanded as new relational data. The two phases above form an iterative process, and multiple iterations can continuously learn and expand new relation types.
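The two phases can be summarized as a single bootstrap iteration. The sketch below is a high-level outline only: mine_coarse_selection, the score, fit, and confidence methods, and the data containers are hypothetical placeholders standing in for the components described above.

```python
def ores_iteration(seeds, web_text, selector1, selector2, classifier,
                   sim_threshold=0.5, conf_threshold=0.9):
    """One ORES iteration over Phase 1 and Phase 2 (illustrative outline)."""
    # Phase 1 (a)-(b): seeds plus distant-supervision matches form the coarse selection
    coarse = mine_coarse_selection(seeds, web_text)
    # Phase 1 (c)-(d): instance selector 1 keeps sentences similar enough to the seeds
    selection1 = [s for s in coarse if selector1.score(seeds, s) >= sim_threshold]
    seeds = seeds + selection1
    # Phase 1 (e): train the new-relation classifier on labeled data plus selection 1
    classifier.fit(selection1)
    # Phase 2 (a): the classifier mines candidates above the 0.9 confidence threshold
    candidates = [s for s in web_text if classifier.confidence(s) >= conf_threshold]
    # Phase 2 (b)-(c): instance selector 2 re-filters; results seed the next iteration
    next_coarse = [s for s in candidates if selector2.score(seeds, s) >= sim_threshold]
    return seeds, next_coarse
```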
3.2. Instance Selector
The instance selector is pre-trained, and it performs similarity detection between sentences obtained from Web text and initial seed sentences for filtering out instances with high confidence. The instance selector adopts the relation Siamese network structure. As shown in
Figure 5, a sentence pair X and Y is input, and a similarity score is output, representing the likelihood, in the range [0, 1], that X and Y share the same relation. After the sentence pair is input to the encoder, the word embedding process is performed. The encoder learns an embedding matrix to produce the representation vectors of the instances’ features and outputs the extracted word vector. Then, the vector Z is processed by a fully connected layer, which outputs a scalar. Finally, a sigmoid activation function is used to obtain a real value between 0 and 1. The cross-entropy of the label and prediction is used as the loss function. Backpropagation is used to compute gradients and update the model parameters by gradient descent. The instance selector has two parts. One is the encoder neural network, which extracts instance embedding features to generate a word vector. The other is a fully connected layer used to predict similarity. The training process updates the parameters of these two parts. Specifically, the gradient is passed from the loss function back to the fully connected layer and the parameters of the vector Z. Then, gradients are further propagated from the vector Z to the encoder neural network, where they are used to update the parameters of the encoder neural network. The above is one iteration of the sample selector training process.
Training the encoder through few-shot learning requires the same number of positive and negative samples. The labels of negative samples are set to 0, and the goal is for the network’s prediction to be close to 0 as well. Likewise, the model updates the fully connected layer parameters and the encoder parameters through backpropagation on the negative samples. In particular, the training data for the Siamese network do not contain the new relation. The encoders are pre-trained and embedded into the ORES instance selectors. A trained instance selector (IS) can predict whether an instance represents a new relation and which instance in the seed set has the highest similarity to a distantly supervised instance.
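A minimal training-loop sketch for the instance selector is given below, assuming pre-tokenized sentence pairs, an encoder that returns a fixed-size embedding, and binary same-relation labels; the optimizer, learning rate, and batch size of one are illustrative choices, not the settings used in the experiments.

```python
import torch
import torch.nn as nn

def train_selector(encoder, scorer, pairs, labels, epochs: int = 3, lr: float = 2e-5):
    """Train the encoder and the fully connected scorer with binary cross-entropy.
    `pairs` is a list of (x, y) inputs; `labels` is 1 for same relation, 0 otherwise."""
    params = list(encoder.parameters()) + list(scorer.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.BCELoss()                       # cross-entropy over the 0/1 labels
    for _ in range(epochs):
        for (x, y), label in zip(pairs, labels):
            emb_x, emb_y = encoder(x), encoder(y)    # embedding features of the pair
            score = scorer(emb_x, emb_y)             # similarity in [0, 1]
            loss = loss_fn(score, torch.tensor(float(label)))
            optimizer.zero_grad()
            loss.backward()        # gradients flow back to the fc layer and the encoder
            optimizer.step()       # gradient-descent update of both parts
```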
3.3. Encoder
Instance selector 1 and instance selector 2 adopt the same structure, namely, the Siamese network. The difference lies in the encoder setting, which is determined by the functionality of the components. In the data mining component, the instance selector filters the examples obtained through distant supervision. If instances are screened based only on semantic similarity, the selector easily falls into a comfort zone of semantically similar but insufficiently diverse instances; the instances processed by the data mining component are therefore topic-specific. In the open relation extraction component, we adopt K-BERT as the encoder of sample selector 2 in order to improve the similarity detection effect.
3.3.1. tBERT Encoder
Neural Snowball filters instances based on BERT-encoded semantic information. Experiments show that the model suffers from a low recall rate: in the process of screening instances, it stays in a comfort zone of semantically similar but insufficiently diverse instances. Further analysis shows that filtering instances purely on semantic similarity easily leads to this problem. Adding topic information to the model increases the diversity of the selected instances and helps the model escape this “comfort zone”. tBERT performs better than BERT in specific domains, and this paper draws on the ideas and methods of that work. In addition, Topic Snowball [31] is our previous work, and this paper further extends it. The tBERT encoder consists of BERT and a topic model. After a sentence pair is input, the specific process is as follows:
Firstly, encode the sentence pair and feed all tokens into a topic model, denoted tModel. The topic model assigns a topic distribution to each word in the sentence:

t_{w_i} = \mathrm{tModel}(w_i), \quad t_{w_i} \in \mathbb{R}^{t},

where t is the number of topic types. Then, mean pooling yields the sentence-level topics T_X and T_Y:

T_X = \frac{1}{n_X} \sum_{i=1}^{n_X} t_{w_i}, \qquad T_Y = \frac{1}{n_Y} \sum_{j=1}^{n_Y} t_{w_j},

where n_X and n_Y represent the number of words in sentence X and sentence Y, respectively. Meanwhile, all tokens are pushed into BERT for encoding sentence X and sentence Y, and the C vector is used as the sentence-pair representation:

C = \mathrm{BERT}(X, Y) \in \mathbb{R}^{h},

where h represents the hidden layer dimension inside BERT. The joint real-valued vector of C with the sentence topic vectors T_X and T_Y is:

Z = [C; T_X; T_Y].

Finally, cross-entropy is used as the loss function for training:

\mathcal{L} = -\sum_{i} y_i \log g(x_i),

where y is the true outcome distribution and g is the output distribution predicted by the relation classifier. The parameters are adjusted according to the loss, and the fine-tuning stage uses a grid search to select the parameter settings that yield the best experimental results. Adding a topic model to the BERT encoder essentially treats the topic as a similarity weight: during semantics-based similarity detection, the model slightly reduces the weight of semantic similarity while noticing that the topics are related. This is conducive to screening more diverse instances and improving recall.
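The sketch below shows one way to wire this up with the Hugging Face transformers library. The topic model is passed in as a callable that maps a word to a topic distribution (for example, a wrapper around a trained LDA model); that callable, the model name, and the whitespace tokenization of topic words are simplifying assumptions.

```python
import torch
from transformers import BertModel, BertTokenizer

class TBertEncoder(torch.nn.Module):
    """Concatenate BERT's [CLS] vector with mean-pooled word-topic vectors."""
    def __init__(self, topic_model, num_topics: int, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        self.topic_model = topic_model        # callable: word -> topic distribution (size t)
        self.num_topics = num_topics

    def sentence_topic(self, sentence: str) -> torch.Tensor:
        # mean-pool per-word topic distributions into a sentence-level topic vector
        vecs = [self.topic_model(w) for w in sentence.split()]
        return torch.stack(vecs).mean(dim=0) if vecs else torch.zeros(self.num_topics)

    def forward(self, sent_x: str, sent_y: str) -> torch.Tensor:
        inputs = self.tokenizer(sent_x, sent_y, return_tensors="pt", truncation=True)
        c = self.bert(**inputs).last_hidden_state[:, 0, :].squeeze(0)  # [CLS] vector, size h
        t_x, t_y = self.sentence_topic(sent_x), self.sentence_topic(sent_y)
        return torch.cat([c, t_x, t_y], dim=-1)    # joint vector Z = [C; T_X; T_Y]
```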
3.3.2. K-BERT Encoder
BERT performs well at representing common named entities, but not domain-specific ones. This is related to BERT’s pre-training mechanism: it captures entity representations from a large-scale common corpus but lacks domain-specific knowledge, because such knowledge is rarely present in the pre-training stage. Although expertise can be absorbed during the fine-tuning stage, the actual effect is not good, and directly pre-training on a domain-specific corpus requires a significant cost and has low transferability. Therefore, Liu proposed directly injecting the domain knowledge annotations that would otherwise only appear in the fine-tuning stage into the pre-trained model: the knowledge graph is injected into the instance as external domain knowledge. This not only improves performance but also effectively reduces costs.
There are two reasons for choosing a knowledge graph [32] as the external knowledge source. The first is that a knowledge graph contains annotated and verified structured knowledge, and is thus rich, reliable, and easy to query. The other is that the construction of the knowledge graph is controllable, which gives the model good interpretability. Knowledge injection needs to solve two problems. The first is entity alignment: generally speaking, the word embedding vector spaces of an entity in the text and in the external knowledge are different, so the entities must be aligned to share the same word vector representation. The second is knowledge noise: improperly injected external knowledge can adversely affect the original semantics, whereas a correct and reasonable injection method can effectively reduce or even avoid noise. Liu proposed K-BERT [33], which is compatible with any pre-trained BERT model and handles the embedding space well. In this paper, we design sample selector 2 based on K-BERT. As shown in
Figure 6, the K-BERT working process can be divided into three stages:
Stage 1: The knowledge layer. This stage injects domain knowledge into the original sentence sequence and outputs a knowledge-rich, graph-structured sentence tree. It can be further subdivided into two processes: knowledge injection and sentence tree transformation. The external professional knowledge base (KG) equips the model with knowledge triples, from which domain knowledge is injected, and the knowledge layer transforms the plain sentence into a richer sentence tree.
First, perform a knowledge query (K_Query): take the entities in the original sentence as the query and traverse the KG. K_Query(·) is the knowledge query function, S represents the query sentence, and K represents the external knowledge graph. The external knowledge E can be expressed as:

E = K\_Query(S, K).

Its specific form is:

E = \{(w_i, r_{i0}, w_{i0}), \ldots, (w_i, r_{ik}, w_{ik})\},

where E is the collection of the corresponding triples. The specific form of the sentence S is:

S = \{w_0, w_1, \ldots, w_n\}.

The specific form of the sentence tree St is:

St = \{w_0, w_1, \ldots, w_i\{(r_{i0}, w_{i0}), \ldots, (r_{ik}, w_{ik})\}, \ldots, w_n\}.

Injecting the external knowledge according to the knowledge graph can be formulated as:

St = K\_Inject(S, E),

where K_Inject(·) is the function that injects knowledge into the sentence tree and keeps the structure index information.
The queried knowledge is placed directly after the entity, and the retrieval process is based on the visible matrix and soft positions. In this paper, a sentence tree may have many branches, but its depth is limited to 1 for subsequent processing.
As shown in
Figure 6, the input sentence X is: “Einstein created the theory of relativity”. “Einstein” is associated with the tuple “is born in Germany” in the knowledge graph, and the theory of relativity is related to “a non-inertial-system theory”. After tuple injection, a sentence tree with a rich knowledge background is generated.
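A simplified sketch of the knowledge layer on this example follows. The knowledge graph is represented as a plain dictionary from entity string to (relation, object) triples; multi-word entity matching, entity linking, and the exact tree data structure used by K-BERT are all simplified away.

```python
# Toy knowledge graph: entity -> list of (relation, object) pairs about that entity
KG = {
    "Einstein": [("is_born_in", "Germany")],
    "the_theory_of_relativity": [("is_a", "non-inertial-system theory")],
}

def k_query(tokens, kg):
    """K_Query: collect the triples of every entity mentioned in the sentence."""
    return {w: kg[w] for w in tokens if w in kg}

def k_inject(tokens, knowledge):
    """K_Inject: attach each entity's triples as a depth-1 branch, keeping token order."""
    return [(w, knowledge.get(w, [])) for w in tokens]

sentence = ["Einstein", "created", "the_theory_of_relativity"]
sentence_tree = k_inject(sentence, k_query(sentence, KG))
# [('Einstein', [('is_born_in', 'Germany')]),
#  ('created', []),
#  ('the_theory_of_relativity', [('is_a', 'non-inertial-system theory')])]
```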
Stage 2: The embedding layer and seeing layer. BERT cannot handle graph-structured sentences. The role of the embedding layer is to solve this problem. The role of the seeing layer is to control the reference domain of external knowledge and avoid knowledge noise.
Process 1:
BERT can only process sentence inputs with a sequence structure, and the structural information would be lost if the sentence tree were simply tiled into a sequence. The soft-position encoding and the visible matrix convert the sentence tree into a sequence while preserving the structural information; the principle is shown in
Figure 7.
The embedding representation includes three parts: token embedding, position embedding, and segment embedding. In the token embedding step, K-BERT flattens the tokens of the sentence tree into an embedding sequence according to the hard-position index: “[CLS](0) Einstein(1) is_born_in(2) German(3) founded(4) the_theory_of_relativity(5) is_proposed_by(6) Einstein(7)”. Obviously, this tiling loses sentence structure and readability. For example, in the sentence tree both “is born in” and “founded” directly follow “Einstein(1)”, but this is no longer visible in the flat sequence. This information can be recovered by soft-position coding.
As shown in
Figure 7, the soft-position sequence is “[CLS](0) Einstein(1) is born in(2) German(3) founded(2) the(3) theory of relativity(4) a(5) non-inertial-system(6) theory(7)”. In the soft-position embedding step, K-BERT restores the graph structure and readability of the sentence tree through soft-position encoding. Nevertheless, using soft positions alone is not enough, because the model may misunderstand that founded(2) follows German(3), which also leads to knowledge noise. To solve this problem, the soft-position coding is complemented by a visible matrix produced by the seeing layer in Process 2. The last part, segment embedding, identifies sentence 1 and sentence 2: for a sentence pair, the tokens are marked with a sequence of segment tags.
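The following sketch flattens a depth-1 sentence tree (as produced in the knowledge-layer sketch above) into a token sequence while recording soft positions and which trunk token each injected token belongs to; the bookkeeping is a simplification of K-BERT’s actual implementation and is reused below to build the visible matrix.

```python
def flatten_tree(sentence_tree):
    """Flatten a depth-1 sentence tree into tokens with soft positions and branch info.
    Returns tokens, soft positions, the flattened index of each token's trunk token,
    and a branch id (None for tokens of the original sentence)."""
    tokens, soft_pos, trunk_of, branch_id = [], [], [], []
    soft, next_branch = 0, 0
    for word, branches in sentence_tree:
        trunk = len(tokens)                      # hard position of this trunk token
        tokens.append(word); soft_pos.append(soft); trunk_of.append(trunk); branch_id.append(None)
        for rel, obj in branches:                # injected triple hangs off its entity
            for offset, tok in enumerate((rel, obj), start=1):
                tokens.append(tok); soft_pos.append(soft + offset)
                trunk_of.append(trunk); branch_id.append(next_branch)
            next_branch += 1
        soft += 1                                # soft positions follow the original sentence
    return tokens, soft_pos, trunk_of, branch_id

# Example (using sentence_tree from the previous sketch, no [CLS] token):
# tokens   = ['Einstein', 'is_born_in', 'Germany', 'created',
#             'the_theory_of_relativity', 'is_a', 'non-inertial-system theory']
# soft_pos = [0, 1, 2, 1, 2, 3, 4]   # "is_born_in" and "created" share soft position 1
```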
Process 2:
The seeing layer’s role is to limit the field of view of the knowledge referenced by the entities. It can avoid false references to knowledge and the noise caused by mutual interference between knowledge. The visible matrix
M is defined as:

M_{ij} = \begin{cases} 0, & w_i \ominus w_j \\ -\infty, & w_i \oslash w_j \end{cases}

where i and j are hard-position indices, w_i \ominus w_j means that the two tokens are in the same branch of the sentence tree, in which case their distance is 0, and w_i \oslash w_j means that they are not, in which case their distance is negative infinity.
The seeing layer’s core idea is to ensure that the original sentence entities and the injected background knowledge do not interfere. As shown in
Figure 7, the sentence tree has nine tokens, which are mapped into a 9 × 9 visible matrix. White means that the two tokens at the corresponding positions are visible to each other, and black means that they are invisible to each other. The black cell in column 3, row 9 means that German(3) cannot see “theory(9)”. Similarly, [German] is invisible to [CLS], but it acts on [CLS] indirectly through [Einstein], thereby reducing knowledge noise.
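Building on the trunk_of/branch_id bookkeeping from the flattening sketch above, a visible matrix can be constructed as follows. A large negative constant stands in for negative infinity, and different triples of the same entity are kept in separate branches; this is a simplified illustration rather than K-BERT’s exact code.

```python
import numpy as np

NEG_INF = -1e9  # stand-in for negative infinity in the visible matrix

def visible_matrix(trunk_of, branch_id):
    """M[i, j] = 0 if tokens i and j may attend to each other, NEG_INF otherwise."""
    n = len(trunk_of)
    M = np.full((n, n), NEG_INF)
    for i in range(n):
        for j in range(n):
            both_trunk = branch_id[i] is None and branch_id[j] is None
            same_branch = branch_id[i] is not None and branch_id[i] == branch_id[j]
            branch_sees_its_entity = (branch_id[i] is not None and trunk_of[i] == j) or \
                                     (branch_id[j] is not None and trunk_of[j] == i)
            if both_trunk or same_branch or branch_sees_its_entity:
                M[i, j] = 0.0   # e.g. "Germany" sees "Einstein", but not "theory"
    return M
```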
Stage 3: The mask-transformer encoder processing. Traditional transformer encoders cannot accept a visible matrix as input. The role of the mask-transformer encoder is to introduce the structural information in the sentence tree into the model. It is a stack of multiple mask self-attention blocks. The formula is as follows:
Q^{i+1}, K^{i+1}, V^{i+1} = h^{i} W_q, \; h^{i} W_k, \; h^{i} W_v

S^{i+1} = \mathrm{softmax}\!\left(\frac{Q^{i+1} {K^{i+1}}^{\top} + M}{\sqrt{d_k}}\right)

h^{i+1} = S^{i+1} V^{i+1}

where W_q, W_k, and W_v are the parameters that the model needs to learn, h^{i} is the hidden state of the i-th mask self-attention block, M is the visible matrix, and \sqrt{d_k} is the scaling factor.
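A single-head sketch of one mask self-attention block is shown below, in which the visible matrix M is added to the attention logits before the scaled softmax; the multi-head attention, residual connections, and layer normalization of the full mask-transformer are omitted.

```python
import math
import torch
import torch.nn as nn

class MaskSelfAttention(nn.Module):
    """One mask self-attention block: self-attention whose scores are masked by
    the visible matrix M before the softmax (single-head, simplified)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w_q = nn.Linear(hidden_dim, hidden_dim)   # learnable W_q
        self.w_k = nn.Linear(hidden_dim, hidden_dim)   # learnable W_k
        self.w_v = nn.Linear(hidden_dim, hidden_dim)   # learnable W_v

    def forward(self, h: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # h: (seq_len, hidden_dim) hidden states of block i; m: (seq_len, seq_len) visible matrix
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        scores = (q @ k.transpose(-2, -1) + m) / math.sqrt(h.size(-1))  # masked, scaled logits
        return torch.softmax(scores, dim=-1) @ v   # invisible token pairs get ~zero weight
```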
5. Conclusions and Future Work
This paper proposes a new relation extraction system, ORES, for Web text information. It can automatically extract and label new relational data and is expected to contribute great value to the field of intelligent education. We designed new sample selectors based on tBERT and K-BERT. When the number of initial seed sentences was low, ORES performed significantly better than Neural Snowball; when the number of seed sentences increased, ORES’s improvement over Neural Snowball was less pronounced. In the comparison experiment with Topic Snowball, we found that the K-BERT setting of sample selector 2 could improve the accuracy of entity representation in the word embedding process and reduce the weight of topic information. The experiments show that injecting external knowledge effectively improves classifier accuracy and overall performance. The few-shot relational learning approach proposed in this paper is well suited for open-domain scenarios such as Web text. Amid today’s exponential growth of knowledge, it can continuously mine data from Web texts and iteratively discover new relation types, which has broad application prospects.
In the future, there are two directions worth exploring deeply. (1) We can explore how to reasonably combine topic similarity and semantic similarity. The sample selector adds topic weight when encoding sentence information; in essence, this broadens the similarity threshold. However, extending the similarity domain also affects accuracy. Thus, how to balance the weights of the topic vector and the semantic representation vector is a valuable research direction.
(2) Although obtaining training samples through distant supervision is very effective, it is inevitably accompanied by data noise. As the number of seed sentences increases, distantly supervised matches will contain more wrongly labeled instances, which may lead to overfitting. Therefore, reducing the data noise of distant supervision is a valuable research direction. Negative training methods have been shown to perform well; in the future, we will build a negative indicator based on BTOD to improve ORES’s performance.