GEAR: Graph-based Evidence Aggregating and Reasoning
for Fact Verification
Jie Zhou^{1,2,3}, Xu Han^{1,2,3}, Cheng Yang^{1,2,3}, Zhiyuan Liu^{1,2,3},
Lifeng Wang^{4}, Changcheng Li^{4}, Maosong Sun^{1,2,3}
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Institute for Artificial Intelligence, Tsinghua University, Beijing, China
3 State Key Lab on Intelligent Technology and Systems, Tsinghua University, Beijing, China
4 Tencent Marketing Solution, Tencent, Shenzhen, China
{zhoujie18, hanxu17, cheng-ya14}@mails.tsinghua.edu.cn
{fandywang, harrychli}@tencent.com, {liuzy, sms}@tsinghua.edu.cn
Abstract
Fact verification (FV) is a challenging task
which requires retrieving relevant evidence
from plain text and using the evidence to verify
given claims. Many claims require simultaneously
integrating and reasoning over several
pieces of evidence for verification. However,
previous work employs simple models to ex-
tract information from evidence without let-
ting evidence communicate with each other,
e.g., by merely concatenating the evidence for
processing. Therefore, these methods are unable
to grasp sufficient relational and logical infor-
mation among the evidence. To alleviate this
issue, we propose a graph-based evidence ag-
gregating and reasoning (GEAR) framework
which enables information to transfer on a
fully-connected evidence graph and then uti-
lizes different aggregators to collect multi-
evidence information. We further employ
BERT, an effective pre-trained language repre-
sentation model, to improve the performance.
Experimental results on a large-scale bench-
mark dataset FEVER have demonstrated that
GEAR can leverage multi-evidence infor-
mation for FV and thus achieves a promis-
ing result with a test FEVER score of 67.10%.
Our code is available at https://github.
com/thunlp/GEAR.
1 Introduction
Due to the rapid development of information ex-
traction (IE), huge volumes of data have been
extracted. How to automatically verify the
data becomes a vital problem for various data-
driven applications, e.g., knowledge graph com-
pletion (Wang et al., 2017) and open domain
question answering (Chen et al., 2017a). Hence,
many recent research efforts have been devoted to
fact verification (FV), which aims to verify given
claims with the evidence retrieved from plain text.
Corresponding author: Z.Liu(liuzy@tsinghua.edu.cn)
“SUPPORTED” Example
Claim The Rodney King riots took place in the most populous
county in the USA.
Evidence
(1) The 1992 Los Angeles riots, also known as the
Rodney King riots were a series of riots, lootings, ar-
sons, and civil disturbances that occurred in Los An-
geles County, California in April and May 1992.
(2) Los Angeles County, officially the County of Los
Angeles, is the most populous county in the USA.
“REFUTED” Example
Claim Giada at Home was only available on DVD.
Evidence
(1) Giada at Home is a television show and first aired
on October 18, 2008, on the Food Network.
(2) Food Network is an American basic cable and
satellite television channel.
Table 1: Some examples of reasoning over several
pieces of evidence together for verification. The italic
words are the key information to verify the claim. Both
of the claims require reasoning over and aggregating multiple
evidence sentences for verification.
More specifically, given a claim, an FV system is
asked to label it as “SUPPORTED”, “REFUTED”,
or “NOT ENOUGH INFO”, indicating that the
evidence supports, refutes, or is insufficient to
verify the claim, respectively.
Existing FV methods formulate FV as a natural
language inference (NLI) (Angeli and Manning,
2014) task. However, they utilize simple evidence
combination methods such as concatenating the
evidence or just dealing with each evidence-claim
pair. These methods are unable to grasp sufficient
relational and logical information among the ev-
idence. In fact, many claims require simultaneously
integrating and reasoning over several pieces
of evidence for verification. As shown in Ta-
ble 1, for both of the “SUPPORTED” example and
“REFUTED” example, we cannot verify the given
claims via checking any evidence in isolation. The
claims can be verified only by understanding and
reasoning over multiple pieces of evidence.
To integrate and reason over information from
multiple pieces of evidence, we propose a
graph-based evidence aggregating and reasoning
(GEAR) framework. Specifically, we first build a
fully-connected evidence graph and encourage in-
formation propagation among the evidence. Then,
we aggregate the pieces of evidence and adopt a
classifier to decide whether the evidence can sup-
port, refute, or is not sufficient for the claim. In-
tuitively, by sufficiently exchanging and reason-
ing over evidence information on the evidence
graph, the proposed model can make the best of
the information for verifying claims. For exam-
ple, by delivering the information “Los Angeles
County is the most populous county in the USA”
to “the Rodney King riots occurred in Los Ange-
les County” through the evidence graph, the syn-
thetic information can support “The Rodney King
riots took place in the most populous county in
the USA”. Furthermore, we adopt an effective pre-
trained language representation model BERT (De-
vlin et al., 2019) to better grasp both evidence and
claim semantics.
We conduct experiments on the large-scale
benchmark dataset for Fact Extraction and VER-
ification (FEVER) (Thorne et al., 2018a). Ex-
perimental results show that the proposed frame-
work outperforms recent state-of-the-art baseline
systems. The further case study indicates that our
framework could better leverage multi-evidence
information and reason over the evidence for FV.
2 Related Work
2.1 FEVER Shared Task
The FEVER shared task (Thorne et al., 2018b)
challenges participants to develop automatic fact
verification systems to check the veracity of
human-generated claims by extracting evidence
from Wikipedia. The shared task is hosted as
a competition on Codalab¹ with a blind test
set. Nie et al. (2019); Yoneda et al. (2018) and
Hanselowski et al. (2018) have achieved the top
three results among 23 teams.
Existing methods mainly formulate FV as an
NLI task. Thorne et al. (2018a) simply concate-
nate all evidence together, and then feed the con-
catenated evidence and the given claim into the
NLI model. Luken et al. (2018) adopt the de-
composable attention model (DAM) (Parikh et al.,
2016) to generate NLI predictions for each claim-
evidence pair individually and then aggregate all
¹ https://competitions.codalab.org/competitions/18814
NLI predictions for final verification. Then,
Hanselowski et al. (2018); Yoneda et al. (2018);
Hidey and Diab (2018) adopt the enhanced se-
quential inference model (ESIM) (Chen et al.,
2017b), a more effective NLI model, to infer the
relevance between evidence and claims instead
of DAM. As pre-trained language models have
achieved great results on various NLP applica-
tions, Malon (2018) fine-tunes the generative pre-
training transformer (GPT) (Radford et al., 2018)
for FV. Based on the methods mentioned above,
Nie et al. (2019) specially design the neural se-
mantic matching network (NSMN), which is a
modification of ESIM and achieves the best results
in the competition. Unlike these methods, Yin
and Roth (2018) propose the TWOWINGOS sys-
tem which trains the evidence identification and
claim verification modules jointly.
2.2 Natural Language Inference
The natural language inference (NLI) task requires
a system to label the relationship between a pair of
premise and hypothesis as entailment, contradic-
tion or neutral. Several large-scale datasets have
been proposed to promote the research in this di-
rection, such as SNLI (Bowman et al., 2015) and
Multi-NLI (Williams et al., 2018). These datasets
have made it feasible to train complicated neural
models which have achieved the state-of-the-art
results (Bowman et al., 2015; Parikh et al., 2016;
Sha et al., 2016; Chen et al., 2017b,c; Munkhdalai
and Yu, 2017; Nie and Bansal, 2017; Conneau
et al., 2017; Gong et al., 2018; Tay et al., 2018;
Ghaeini et al., 2018). It is intuitive to transfer
NLI models into the claim verification stage of
the FEVER task and several teams from the shared
task have achieved promising results in this way.
2.3 Pre-trained Language Models
Pre-trained language representation models such
as ELMo (Peters et al., 2018) and OpenAI
GPT (Radford et al., 2018) are proven to be ef-
fective on many NLP tasks. BERT (Devlin et al.,
2019) employs bidirectional transformer and well-
designed pre-training tasks to fuse bidirectional
context information and obtains the state-of-the-
art results on the NLI task. In our experiments, we
find the fine-tuned BERT model outperforms other
NLI-based models on the claim verification sub-
task of FEVER. Hence, we use BERT as the sen-
tence encoder in our framework to better encode
the semantic information of evidence and claims.
[Figure 1: pipeline overview — Document Retrieval → Sentence Selection (evidence filter over relevance scores with threshold 0.001) → Claim Verification (GEAR), where GEAR comprises BERT sentence encoding of each evidence-claim pair, ERNet evidence reasoning layers (Layer 0 ... Layer T), and an attention/max/mean aggregator.]
Figure 1: The pipeline of our method. The GEAR framework is illustrated in the claim verification section.
3 Method
We employ a three-step pipeline with compo-
nents for document retrieval, sentence selection
and claim verification to solve the task. In the doc-
ument retrieval and sentence selection stages, we
simply follow the method from Hanselowski et al.
(2018) since their method has the highest score on
evidence recall in the former FEVER shared task.
We then apply our Graph-based Evidence Ag-
gregating and Reasoning (GEAR) framework in
the final claim verification stage. The full pipeline
of our method is illustrated in Figure 1.
3.1 Document Retrieval and Sentence
Selection
In this section, we describe our document retrieval
and sentence selection components. Additionally,
we add a threshold filter after the sentence selec-
tion component to filter out noisy evidence.
In the document retrieval step, we adopt the
entity linking approach from Hanselowski et al.
(2018), which extracts potential entity mentions
from the claim and uses them to query the MediaWiki
API². The seven highest-ranked results for each
query are stored to form a candidate article set.
Finally, the method drops the articles which are
not in the offline Wikipedia dump and filters the
remaining articles by the word overlap between
their titles and the claim.
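The last filtering step can be pictured with a short sketch; this is an illustrative reimplementation (the wiki_titles set stands in for the offline dump), not the released code:

```python
def filter_candidates(candidate_titles, claim, wiki_titles):
    """Keep candidate articles that exist in the offline Wikipedia dump
    and whose titles share at least one word with the claim."""
    claim_tokens = {tok.lower() for tok in claim.split()}
    kept = []
    for title in candidate_titles:
        if title not in wiki_titles:  # drop articles missing from the dump
            continue
        title_tokens = {tok.lower() for tok in title.replace("_", " ").split()}
        if title_tokens & claim_tokens:  # word overlap between title and claim
            kept.append(title)
    return kept
```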
The sentence selection component selects the
most relevant evidence for the claim from all sen-
tences in the retrieved documents.
² https://www.mediawiki.org/wiki/API:Main_page
In the sentence selection step, the ranking model
from Hanselowski et al. (2018) assigns a relevance
score to each candidate sentence; it is trained with
a hinge loss max(0, 1 + s_n − s_p), where s_n and
s_p denote the scores of negative and positive samples.
In addition to the original model (Hanselowski
et al., 2018), we add a relevance score filter with
a threshold τ . Sentences with relevance scores
lower than τ are filtered out to alleviate noise.
Thus the final size of the retrieved evidence set is
equal to or less than 5. We choose different val-
ues of τ and select the value based on the dev
set result. The evaluation results of the document
retrieval and sentence selection components are
shown in Section 5.1.
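The filter itself is straightforward; below is a minimal sketch, with the relevance scores assumed to come from the upstream ranking model and the function name chosen for illustration:

```python
def select_evidence(scored_sentences, tau, max_evidence=5):
    """scored_sentences: list of (sentence, relevance_score) pairs from the
    sentence selection model. Keep at most `max_evidence` sentences whose
    relevance score is at least `tau`."""
    ranked = sorted(scored_sentences, key=lambda x: x[1], reverse=True)
    return [(sent, score) for sent, score in ranked[:max_evidence] if score >= tau]
```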
3.2 Claim Verification with GEAR
In this section, we describe our GEAR framework
for claim verification. As shown in Figure 1, given
a claim and the retrieved evidence,
Sentence Encoder
Given an input sentence, we employ BERT (De-
vlin et al., 2019) as our sentence encoder by ex-
tracting the final hidden state of the [CLS] token
as the representation, where [CLS] is the special
classification embedding in BERT.
Note that we concatenate the evidence and the
claim to extract the evidence representation be-
cause the evidence nodes in the reasoning graph
need the information from the claim to guide the
message passing process among them.
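As a concrete illustration, extracting the [CLS] representation for an evidence-claim pair might look as follows with the HuggingFace transformers library; tokenization and truncation details of the original implementation may differ:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

def encode_pair(evidence: str, claim: str) -> torch.Tensor:
    """Return the final hidden state of the [CLS] token for an
    evidence-claim pair (shape: [hidden_size])."""
    inputs = tokenizer(evidence, claim, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0].squeeze(0)  # [CLS] representation
```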
Evidence Reasoning Network
To encourage the information propagation among
evidence, we build a fully-connected evidence
graph where each node indicates a piece of evi-
dence. We also add self-loop to every node be-
cause each node needs the information from it-
self in the message propagation process. We use
h^t = {h^t_1, h^t_2, ..., h^t_N} to represent the hidden
states of the nodes at layer t, where h^t_i ∈ R^{F×1} and F
is the number of features in each node. The initial
hidden state of each evidence node h^0_i is initialized
by the evidence representation: h^0_i = e_i.
Inspired by recent work on semi-supervised
graph learning and relational reasoning (Kipf and
Welling, 2017; Velickovic et al., 2018; Palm et al.,
2018), we propose an evidence reasoning network
(ERNet) to propagate information among the ev-
idence nodes.
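The ERNet update can be pictured as attention-based message passing over this graph. The following PyTorch layer is a minimal sketch of one such propagation step, assuming an MLP-scored attention over all node pairs; the class name, the scoring network, and the attn_dim default (chosen to match the weight-matrix dimension H = 64 reported in Section 4.3) are illustrative assumptions rather than the authors' exact ERNet formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidencePropagationLayer(nn.Module):
    """One propagation step on a fully-connected evidence graph with
    self-loops: each node attends to every node (including itself) and
    takes an attention-weighted sum of their hidden states."""
    def __init__(self, num_features: int, attn_dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * num_features, attn_dim),
            nn.ReLU(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [N, F] hidden states of the N evidence nodes at layer t-1
        n = h.size(0)
        h_i = h.unsqueeze(1).expand(n, n, -1)   # receiver states, repeated per row
        h_j = h.unsqueeze(0).expand(n, n, -1)   # sender states, repeated per column
        scores = self.score(torch.cat([h_i, h_j], dim=-1)).squeeze(-1)  # [N, N]
        alpha = F.softmax(scores, dim=-1)       # attention over senders j
        return alpha @ h                        # new hidden states, [N, F]
```

Stacking T such layers yields the node states h^T_1, ..., h^T_N that are passed to the aggregator below.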
Evidence Aggregator
We employ an evidence aggregator to gather infor-
mation from different evidence nodes and obtain
the final hidden state o ∈ R^{F×1}. The aggregator
may utilize different aggregating strategies and we
suggest three aggregators in our framework:
Attention Aggregator. The attention aggregator uses the claim
representation to attend over the hidden states of the evidence
nodes and takes the attention-weighted sum as the aggregated state.
Max Aggregator. The max aggregator performs the element-wise
Max operation among hidden states:
o = Max(h^T_1, h^T_2, ..., h^T_N). (6)
Mean Aggregator. The mean aggregator performs the element-wise
Mean operation among hidden states:
o = Mean(h^T_1, h^T_2, ..., h^T_N). (7)
Once the final state o is obtained, we employ a
one-layer MLP to get the final prediction l:
l = softmax(ReLU(Wo + b)), (8)
where W ∈ R^{C×F} and b ∈ R^{C×1} are parameters,
and C is the number of prediction labels.
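A compact sketch of the three aggregators and the prediction layer follows. The max, mean, and MLP branches mirror Equations (6)-(8); the attention branch is an assumed simplified scorer, not the paper's exact attention aggregator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidenceAggregator(nn.Module):
    """Collapse the N final node states h_T (shape [N, F]) into a single
    vector o and predict one of C labels with a one-layer MLP."""
    def __init__(self, num_features: int, num_labels: int, mode: str = "mean"):
        super().__init__()
        self.mode = mode
        self.attn = nn.Linear(num_features, 1)            # illustrative attention scorer
        self.classifier = nn.Linear(num_features, num_labels)  # W in R^{C x F}, b in R^C

    def forward(self, h_T: torch.Tensor) -> torch.Tensor:
        if self.mode == "max":
            o = h_T.max(dim=0).values            # element-wise Max (Eq. 6)
        elif self.mode == "mean":
            o = h_T.mean(dim=0)                  # element-wise Mean (Eq. 7)
        else:
            w = F.softmax(self.attn(h_T), dim=0) # assumed attention weights, [N, 1]
            o = (w * h_T).sum(dim=0)             # weighted sum of node states
        return F.softmax(F.relu(self.classifier(o)), dim=-1)  # l = softmax(ReLU(Wo + b)) (Eq. 8)
```

In training, one would typically take the logarithm of this output to compute the negative log-likelihood loss mentioned in Section 4.3.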
4 Experimental Settings
4.1 Dataset
We conduct our experiments on the large-scale
dataset FEVER (Thorne et al., 2018a). The dataset
consists of 185,455 annotated claims with a set
of 5,416,537 Wikipedia documents from the June
2017 Wikipedia dump. We follow the dataset par-
tition from the FEVER Shared Task (Thorne et al.,
2018b). Table 2 shows the statistics of the dataset.
Split SUPPORTED REFUTED NEI
Train 80,035 29,775 35,639
Dev 6,666 6,666 6,666
Test 6,666 6,666 6,666
Table 2: Statistics of FEVER dataset.
4.2 Baselines
In this section, we describe the baseline systems in
our experiments. We first introduce the top-3 sys-
tems from the FEVER shared task. As BERT (De-
vlin et al., 2019) has achieved promising perfor-
mance on several NLP tasks, we also implement
two baseline systems via fine-tuning BERT in the
claim verification task.
Shared Task Systems
We choose the top-3 models from the FEVER
shared task as our baselines.
The Athene UKP TU Darmstadt team (Athene)
(Hanselowski et al., 2018) combines five inference
vectors from the ESIM model via attention mech-
anism to make the final prediction.
The UCL Machine Reading Group (UCL
MRG) (Yoneda et al., 2018) predicts the label of
each evidence-claim pair and aggregates the re-
sults via a label aggregation component.
The UNC NLP team (Nie et al., 2019) proposes
the neural semantic matching network and uses the
model jointly to solve all three subtasks. They
also incorporate additional information such as
pageview frequency and WordNet features. They
have achieved the best results in the competition.
BERT Fine-tuning Systems
We implement two BERT fine-tuning systems
with different evidence combination approaches.
The BERT-Concat system concatenates all evi-
dence into a single string while the BERT-Pair
system encodes each evidence-claim pair indepen-
dently and then aggregates the results. Both sys-
tems share the same document retrieval and sen-
tence selection components proposed by us.
BERT-Concat. In the BERT-Concat system,
we simply concatenate all evidence into a single
sentence and utilize BERT to predict the relation
between the concatenated evidence and the claim.
In the training phase, we add the ground truth ev-
idence into the retrieved evidence set with rele-
vance score 1 and select five pieces of evidence
with the highest scores. In the test phase, we con-
catenate the retrieved evidence for prediction.
BERT-Pair. In the BERT-Pair system, we uti-
lize BERT to predict the label for each evidence-
claim pair. Concretely, we use each evidence-
claim pair as the input and the label of the claim
as the prediction target. In the training phase, we
select the ground truth evidence for SUPPORTED
and REFUTED claims and the retrieved evidence
for NEI claims. In the test phase, we predict labels
for all retrieved evidence-claim pairs. Because dif-
ferent evidence-claim pairs may have inconsistent
predicted labels, we then utilize an aggregator to
obtain the final claim label. We find that the aggregator
that simply returns the predicted label of the
most relevant evidence performs best.
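This aggregation rule amounts to taking the label predicted for the highest-scoring evidence; a tiny sketch with illustrative names:

```python
def aggregate_pair_predictions(predictions):
    """predictions: list of (relevance_score, predicted_label) pairs for all
    retrieved evidence-claim pairs of one claim. The claim label is the
    prediction attached to the most relevant evidence."""
    _, best_label = max(predictions, key=lambda x: x[0])
    return best_label
```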
4.3 Hyperparameter Settings
We utilize BERT_BASE (Devlin et al., 2019) in all
of the BERT fine-tuning baselines and our GEAR
framework. The learning rate is 2e-5.
For BERT-Concat, the maximum sequence
length is 256 and the batch size is 16. We limit
the max length for concatenated evidence to 240
and the max length for claims to 16. We train this
model for two epochs based on dev results. For
BERT-Pair, we set the maximum sequence length
to 128 and batch size to 32. We train this model for
one epoch. As for the GEAR framework, we use
the fine-tuned BERT-Pair model to extract features
and the batch size is 512.
In our ERNet, we set the batch size to 256,
the number of features F to 768, and the dimension
of the weight matrices H to 64. The model is
trained to minimize the negative log likelihood
loss on the predicted label using the Adam opti-
mizer (Kingma and Ba, 2015) with an initial learn-
ing rate of 5e-3 and L2 weight decay of 5e-4. We
use an early stopping strategy on the label accu-
racy of the validation set, with a patience of 20
epochs. We attempt to stack 0-3 ERNet layers and
analyze the effect of different layer numbers.
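For reference, these settings can be collected into one configuration sketch; the dictionary layout and key names are illustrative and not taken from the released code:

```python
GEAR_CONFIG = {
    "encoder": "bert-base-uncased",        # BERT_BASE sentence encoder
    "bert_learning_rate": 2e-5,
    "bert_concat": {"max_seq_length": 256, "batch_size": 16,
                    "max_evidence_length": 240, "max_claim_length": 16, "epochs": 2},
    "bert_pair": {"max_seq_length": 128, "batch_size": 32, "epochs": 1},
    "feature_extraction_batch_size": 512,
    "ernet": {"batch_size": 256, "num_features": 768, "attn_dim": 64,
              "layers_explored": [0, 1, 2, 3],
              "optimizer": "adam", "learning_rate": 5e-3,
              "weight_decay": 5e-4, "early_stopping_patience": 20},
}
```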
4.4 Evaluation Metrics
Besides traditional evaluation metrics such as la-
bel accuracy and F1, we use two other metrics to
evaluate our model.
FEVER score. The FEVER score is the la-
bel accuracy conditioned on providing at least one
complete set of evidence. Claims labeled as “NEI”
do not require evidence.
Model OFEVER
Athene 93.55
UCL MRG -
UNC NLP 92.82
Our Model 93.33
Table 3: Document retrieval evaluation on dev set (%).
(’-’ denotes a missing value)
τ        OFEVER  Precision  Recall  F1     GEAR LA
0        91.10   24.08      86.72   37.69  74.84
10^{-4}  91.04   30.88      86.63   45.53  74.86
10^{-3}  90.86   40.60      86.36   55.23  74.91
10^{-2}  90.27   53.12      85.47   65.52  74.89
10^{-1}  87.70   70.61      81.64   75.72  74.81
Table 4: Sentence selection evaluation and average la-
bel accuracy of GEAR with different thresholds on dev
set (%).
OFEVER score. The document retrieval and
sentence selection components are usually eval-
uated by the oracle FEVER (OFEVER) score,
which is the upper bound of the FEVER score by
assuming perfect downstream systems.
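The two metrics can be illustrated with a simplified scorer; the field names and the (page, line) evidence representation are assumptions, and this is not the official FEVER scorer:

```python
def fever_score(instances):
    """Each instance carries a gold label, a predicted label, the gold
    evidence sets (lists of (page, line) pairs), and the retrieved evidence.
    A claim counts only if the label is correct and, for non-NEI claims,
    at least one complete gold evidence set was retrieved."""
    correct = 0
    for inst in instances:
        label_ok = inst["predicted_label"] == inst["label"]
        if inst["label"] == "NOT ENOUGH INFO":
            evidence_ok = True                      # NEI claims need no evidence
        else:
            retrieved = set(inst["retrieved_evidence"])
            evidence_ok = any(set(es) <= retrieved for es in inst["evidence_sets"])
        correct += label_ok and evidence_ok
    return correct / len(instances)
```

Treating label_ok as always true for every claim yields the oracle FEVER (OFEVER) score described above.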
For all of the experiments with GEAR, the
scores (label accuracy, FEVER score) we report
on the dev set are mean values over 10 runs
initialized with different random seeds.
5 Experimental Results and Analysis
In this section, we first present the evaluations
of the document retrieval and sentence selection
components. Then we evaluate our GEAR frame-
work in several different aspects. Finally, we
present a case study to demonstrate the effective-
ness of our framework.
5.1 Document Retrieval and Sentence
Selection
We use the OFEVER metric to evaluate the doc-
ument retrieval component. Table 3 shows the
OFEVER scores of our model and models from
other teams. After running the same model pro-
posed by Hanselowski et al. (2018), we find our
OFEVER score is slightly lower, which may be due
to random factors.
Then we compare our sentence selection com-
ponent with different thresholds, as shown in Ta-
ble 4. We find the model with threshold 0 achieves
the highest recall and OFEVER score. When
the threshold increases, the recall value and the
OFEVER score drop gradually while the precision
and F1 score increase. The results are consistent
with our intuition. If we do not filter out evidence,
more claims could be provided with the full evi-
dence set. If we increase the value of the thresh-
old, more pieces of noisy evidence are filtered out,
which contributes to the increase of precision and
F1.
5.2 Claim Verification with GEAR
In this section, we evaluate our GEAR framework
in different aspects. We first compare the label ac-
curacy scores between our framework and base-
line systems. Then we explore the effect of differ-
ent thresholds from the upstream sentence filter.
We also conduct additional experiments to check
the effect of sentence embedding. As nearly
39% of the claims require reasoning over mul-
tiple pieces of evidence, we construct a difficult
dev subset and check the effectiveness of our ER-
Net for evidence reasoning. Finally, we make an
error analysis and provide the theoretical upper-
bound label accuracy of our framework.
Model Evaluation
We use the label accuracy metric to evaluate the
effectiveness of different claim verification mod-
els. The second column of Table 7 shows the la-
bel accuracy of different models on the dev set.
We find the BERT fine-tuning models outperform
all of the models from the shared task, which
shows the strong capacity of BERT in represen-
tation learning and semantic understanding. The
BERT-Concat model has a slight improvement
over BERT-Pair, which is 0.37%.
Our final model outperforms the best BERT-
Concat baseline by 1.17%. As our framework pro-
vides a better way for evidence aggregating and
reasoning, the improvement demonstrates that our
framework has a better ability to integrate features
from different evidence by propagating, analyzing
and aggregating the features.
Effect of Sentence Thresholds
The rightmost column of Table 4 shows the results
of our GEAR frameworks with different sentence
selection thresholds. We choose the model with
threshold τ = 10^{-3}, which has the highest label
accuracy, as our final model. When the threshold
increases from 0 to 10^{-3}, the label accuracy
increases due to less noisy information.
ERNet Layers  Attention  Max    Mean
0             66.17      65.36  65.03
1             67.13      66.63  66.76
2             67.44      67.24  67.56
3             66.53      66.72  66.89
Table 5: Label accuracy on the difficult dev set with
different ERNet layers and evidence aggregators (%).
ERNet Layers  Attention  Max    Mean
0             77.12      76.95  76.30
1             77.74      77.66  77.62
2             77.82      77.66  77.73
3             77.70      77.55  77.60
Table 6: Label accuracy on the evidence-enhanced dev
set with different ERNet layers and evidence aggrega-
tors (%).
However, when the threshold increases from 10^{-3}
to 10^{-1}, the label accuracy decreases because
informative evidence is filtered out and the model
cannot obtain sufficient evidence to make the right
inference.
Effect of Sentence Embedding
The BERT model we used in the sentence encod-
ing step is fine-tuned on the FEVER dataset for
one epoch. We need to find out whether the fine-
tuning process or simply incorporating the sen-
tence embeddings from BERT makes the major
contribution to the final result. We conduct an ex-
periment using a BERT model without the fine-
tuning process and we find the final dev label ac-
curacy is close to the result from a random guess.
Therefore, the fine-tuning process rather than sen-
tence embeddings plays an important role in this
task. We need the fine-tuning process to capture
the semantic and logical relations between evi-
dence and the claim. Sentence embeddings are
more general and cannot perform well in this spe-
cific task. Hence, we cannot simply replace the
sentence embeddings used here with those from
other methods (e.g., ELMo, CNN).
Effectiveness of ERNet
In our observation, more than half of the claims in
the dev dataset only need one piece of evidence to
make the right inference. To verify the effective-
ness of our framework on reasoning over multiple
pieces of evidence, we build a difficult dev subset
by selecting samples from the original dev set.
Model         Dev LA  Dev FEVER  Test LA  Test FEVER
Athene        68.49   64.74      65.46    61.58
UCL MRG       69.66   65.41      67.62    62.52
UNC NLP       69.72   66.49      68.21    64.21
BERT Pair     73.30   68.90      69.75    65.18
BERT Concat   73.67   68.89      71.01    65.64
Our pipeline  74.84   70.69      71.60    67.10
Table 7: Evaluations of the full pipeline. The results of
our pipeline are chosen from the model which has the
highest dev FEVER score (%).
For SUPPORTED and REFUTED classes, claims
which can be fully supported by only one piece of
evidence are filtered out. All of the NEI claims
are selected because the model needs all of the re-
trieved evidence to conclude that there is “NOT
ENOUGH INFO”. The difficult subset contains
7,870 samples, accounting for more than 39% of
the dev set.
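A sketch of this filtering rule, assuming the FEVER annotation format in which each verifiable claim carries one or more gold evidence sets:

```python
def build_difficult_subset(dev_instances):
    """Keep NEI claims, and keep SUPPORTED/REFUTED claims only if every
    annotated evidence set contains more than one sentence, i.e. no single
    sentence fully verifies the claim."""
    difficult = []
    for inst in dev_instances:
        if inst["label"] == "NOT ENOUGH INFO":
            difficult.append(inst)
        elif all(len(evidence_set) > 1 for evidence_set in inst["evidence_sets"]):
            difficult.append(inst)
    return difficult
```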
We test our final model on the difficult sub-
set and present the results in Table 5. We find
our models with ERNet perform better than mod-
els without ERNet and the minimal improvement
between them is 1.27%. We can also discover
from the table that models with 2 ERNet lay-
ers achieve the best results, which indicates that
claims from the difficult subset require multi-step
evidence propagation. This result demonstrates
the ability of our framework to deal with claims
which need multiple pieces of evidence.
Error Analysis
In this section, we examine the effect of errors
propagating from upstream components. We uti-
lize an evidence-enhanced dev subset, which as-
sumes all pieces of ground truth evidence are re-
trieved, to test the theoretical upper-bound score
of our GEAR framework.
In our analysis, the main errors of our frame-
work come from the upstream document retrieval
and sentence selection components, which cannot
extract sufficient evidence for inference. For exam-
ple, to verify the claim “Giada at Home was only
available on DVD”, we need the evidence “Giada
at Home is a television show and first aired on Oc-
tober 18, 2008, on the Food Network.” and “Food
Network is an American basic cable and satellite
television channel.”. However, the entity linking
Claim:
Al Jardine is an American rhythm guitarist.
Truth evidence:
{Al Jardine, 0}, {Al Jardine, 1}
Retrieved evidence:
{Al Jardine, 1}, {Al Jardine, 0}, {Al Jardine, 2}, {Al Jardine, 5}, {Jardine, 42}
Evidence:
(1) He is best known as the band’s rhythm guitarist, and for
occasionally singing lead vocals on singles such as “Help Me,
Rhonda” (1965), “Then I Kissed Her” (1965) and “Come Go
with Me” (1978).
(2) Alan Charles Jardine (born September 3, 1942) is an
American musician, singer and songwriter who co-founded
the Beach Boys.
(3) In 2010, Jardine released his debut solo studio album, A
Postcard from California.
(4) In 1988, Jardine was inducted into the Rock and Roll Hall
of Fame as a member of the Beach Boys.
(5) Ray Jardine American rock climber, lightweight back-
packer, inventor, author and global adventurer.
Label: SUPPORTED
Table 8: A case of the claim that requires integrating
multiple evidence to verify. The representation for ev-
idence {DocName, LineNum} means the evidence is
extracted from the document DocName and of which
the line number is LineNum.
method used in our document retrieval component
could not retrieve the “Food Network” document
by parsing the content of the claim alone. Thus
the claim verification component cannot make the
right inference with insufficient evidence.
To explore the effect of this issue, we test our
models on an evidence-enhanced dev set, in which
we add the ground truth evidence with relevance
score 1 into the evidence set before the sentence
threshold filter. It ensures that each claim in the
evidence-enhanced set is provided with the ground
truth evidence as well as the retrieved evidence.
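Concretely, this amounts to prepending the gold evidence with a relevance score of 1 before the threshold filter runs; a small sketch reusing the select_evidence helper assumed in Section 3.1:

```python
def enhance_with_gold(retrieved, gold_evidence, tau):
    """Give each gold evidence sentence a relevance score of 1 and add it to
    the retrieved set before applying the sentence threshold filter."""
    enhanced = [(sent, 1.0) for sent in gold_evidence] + list(retrieved)
    return select_evidence(enhanced, tau)  # helper sketched in Section 3.1
```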
The experimental results are shown in Table 6.
We can find that all scores in the table increase by
more than 1.4% compared to the original dev set
label accuracy in Table 7 because of the addition of
the ground truth evidence. Because of the assump-
tion of oracle upstream components, the results in
Table 6 indicate the theoretical upper bound label
accuracy of our framework.
The results show the challenges in the previous
evidence retrieval task, which could not be solved
by current models. Nie et al. (2019) propose a
two-hop evidence enhancement method which im-
proves their final FEVER score by 0.08%. As the
addition of the ground truth evidence leads to a
Figure 2: Attention map for the case in Table 8. The
first five rows indicate the attention weights from nodes
1 to 5 in the first ERNet layer and the last row shows
the attention weights from the attention aggregator.
more than 1.4% increase in our experiment, it is
worthwhile to design a better evidence retrieval
pipeline, which we leave for future work.
5.3 Full Pipeline
We present the evaluation of our full pipeline in
this section. Note that there is a gap between the
label accuracy and the final FEVER score due to
the completeness of the evidence set. We find that
a model which is good at predicting NEI instances
tends to obtain a higher FEVER score. Therefore, we
choose our final model based on the dev FEVER
score among all of our experiments. This model
contains one layer of ERNet and uses the attention
aggregator. The threshold of the sentence filter is
10^{-3}.
Table 7 presents the evaluations of the full
pipeline. We find that the test FEVER scores of the
BERT fine-tuning systems outperform those of the
other shared task models by nearly 1%. Furthermore, our full
pipeline outperforms the BERT-Concat baseline
by 1.46% and achieves significant improvements.
5.4 Case study
Table 8 shows an example in our experiments
which needs multiple pieces of evidence to make
the right inference. The ground truth evidence set
contains the sentences from the article “Al Jar-
dine” with line numbers 0 and 1. These two pieces
of evidence are also ranked at top two in our re-
trieved evidence set. To verify whether “Al Jar-
dine is an American rhythm guitarist”, our model
needs the evidence “He is best known as the band's
rhythm guitarist” as well as the evidence “Alan
Charles Jardine ... is an American musician”. We
plot the attention map from our final model with
one layer of ERNet and the attention aggregator
in Figure 2. We can find that all evidence nodes
tend to attend to the first and second evidence
nodes, which provide the most useful information
in this case. The attention weights on the other evi-
dence nodes are fairly low, which indicates that
our model has the ability to select useful informa-
tion from multiple pieces of evidence.
6 Conclusion
We propose a novel Graph-based Evidence Aggre-
gating and Reasoning (GEAR) framework on the
claim verification subtask of FEVER. The frame-
work utilizes the BERT sentence encoder, the evi-
dence reasoning network (ERNet) and an evidence
aggregator to encode, propagate and aggregate in-
formation from multiple pieces of evidence. The
framework is proven to be effective and our final
pipeline achieves significant improvements. In the
future, we would like to design a multi-step ev-
idence extractor and incorporate external knowl-
edge into our framework.
7 Acknowledgements
This work is supported by the National Key Re-
search and Development Program of China (No.
2018YFB1004503), the National Natural Sci-
ence Foundation of China (NSFC No.61772302,
61572273). This work is also supported by 2018
Tencent Marketing Solution Rhino-Bird Focused
Research Program No. 201808. Han is also sup-
ported by 2018 Tencent Rhino-Bird Elite Training
Program.
References
Gabor Angeli and Christopher D Manning. 2014. Nat-
uralli: Natural logic inference for common sense
reasoning. In Proceedings of EMNLP, pages 534–
545.
Samuel R Bowman, Gabor Angeli, Christopher Potts,
and Christopher D Manning. 2015. A large anno-
tated corpus for learning natural language inference.
In Proceedings of EMNLP, pages 632–642.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine
Bordes. 2017a. Reading Wikipedia to answer open-
domain questions. In Proceedings of ACL, pages
1870–1879.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui
Jiang, and Diana Inkpen. 2017b. Enhanced lstm for
natural language inference. In Proceedings of ACL,
pages 1657–1668.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui
Jiang, and Diana Inkpen. 2017c. Recurrent neural
network-based sentence encoder with gated atten-
tion for natural language inference. In Proceedings
of ACL, Workshop on Evaluating Vector Space Rep-
resentations for NLP, pages 36–40.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc
Barrault, and Antoine Bordes. 2017. Supervised
learning of universal sentence representations from
natural language inference data. In Proceedings of
EMNLP, pages 670–680.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. Bert: Pre-training of deep
bidirectional transformers for language understand-
ing. In Proceedings of NAACL-HLT.
Matt Gardner, Joel Grus, Mark Neumann, Oyvind
Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Pe-
ters, Michael Schmitz, and Luke Zettlemoyer. 2018.
Allennlp: A deep semantic natural language pro-
cessing platform. In Proceedings of ACL, Workshop
for NLP Open Source Software, pages 1–6.
Reza Ghaeini, Sadid A Hasan, Vivek Datla, Joey
Liu, Kathy Lee, Ashequl Qadir, Yuan Ling, Aa-
ditya Prakash, Xiaoli Fern, and Oladimeji Farri.
2018. Dr-bilstm: Dependent reading bidirectional
lstm for natural language inference. In Proceedings
of NAACL-HLT, pages 1460–1469.
Yichen Gong, Heng Luo, and Jian Zhang. 2018. Nat-
ural language inference over interaction space. In
Proceedings of ICLR.
Andreas Hanselowski, Hao Zhang, Zile Li, Daniil
Sorokin, Benjamin Schiller, Claudia Schulz, and
Iryna Gurevych. 2018. Ukp-athene: Multi-sentence
textual entailment for claim verification. Proceed-
ings of EMNLP, pages 103–108.
Christopher Hidey and Mona Diab. 2018. Team
sweeper: Joint sentence extraction and fact check-
ing with pointer networks. In Proceedings of the
First Workshop on FEVER, pages 150–155.
Diederik P Kingma and Jimmy Ba. 2015. Adam: A
method for stochastic optimization. In Proceedings
of ICLR.
Thomas N Kipf and Max Welling. 2017. Semi-
supervised classification with graph convolutional
networks. In Proceedings of ICLR.
Jackson Luken, Nanjiang Jiang, and Marie-Catherine
de Marneffe. 2018. Qed: A fact verification system
for the fever shared task. In Proceedings of the First
Workshop on FEVER, pages 156–160.
Christopher Malon. 2018. Team papelo: Transformer
networks at fever. In Proceedings of the First Work-
shop on FEVER, pages 109–113.
Tsendsuren Munkhdalai and Hong Yu. 2017. Neural
tree indexers for text understanding. In Proceedings
of EACL, pages 11–21.
Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked
sentence encoders for multi-domain inference. In
Proceedings of ACL, Workshop on Evaluating Vec-
tor Space Representations for NLP, pages 41–45.
Yixin Nie, Haonan Chen, and Mohit Bansal. 2019.
Combining fact extraction and verification with neu-
ral semantic matching networks. In Proceedings of
AAAI.
Rasmus Palm, Ulrich Paquet, and Ole Winther. 2018.
Recurrent relational networks. In Proceedings of
NIPS, pages 3368–3378.
Ankur Parikh, Oscar Täckström, Dipanjan Das, and
Jakob Uszkoreit. 2016. A decomposable attention
model for natural language inference. In Proceed-
ings of EMNLP, pages 2249–2255.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word rep-
resentations. In Proceedings of NAACL-HLT, pages
2227–2237.
Alec Radford, Karthik Narasimhan, Tim Salimans, and
Ilya Sutskever. 2018. Improving language under-
standing by generative pre-training. In Proceedings
of Technical report, OpenAI.
Lei Sha, Baobao Chang, Zhifang Sui, and Sujian Li.
2016. Reading and thinking: Re-read lstm unit for
textual entailment recognition. In Proceedings of
COLING, pages 2870–2879.
Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018.
Compare, compress and propagate: Enhancing neu-
ral architectures with alignment factorization for
natural language inference. In Proceedings of
EMNLP, pages 1565–1575.
James Thorne, Andreas Vlachos, Christos
Christodoulopoulos, and Arpit Mittal. 2018a.
Fever: a large-scale dataset for fact extraction and
verification. In Proceedings of NAACL-HLT, pages
809–819.
James Thorne, Andreas Vlachos, Oana Cocarascu,
Christos Christodoulopoulos, and Arpit Mittal.
2018b. The Fact Extraction and VERification
(FEVER) shared task. In Proceedings of the First
Workshop on FEVER, pages 1–9.
Petar Velickovic, Guillem Cucurull, Arantxa Casanova,
Adriana Romero, Pietro Lio, and Yoshua Bengio.
2018. Graph attention networks. In Proceedings of
ICLR.
Quan Wang, Zhendong Mao, Bin Wang, and Li Guo.
2017. Knowledge graph embedding: A survey
of approaches and applications. IEEE TKDE,
29(12):2724–2743.
Adina Williams, Nikita Nangia, and Samuel Bowman.
2018. A broad-coverage challenge corpus for sen-
tence understanding through inference. In Proceed-
ings of NAACL-HLT, pages 1112–1122.
Wenpeng Yin and Dan Roth. 2018. Twowingos: A
two-wing optimization strategy for evidential claim
verification. In Proceedings of EMNLP, pages 105–
114.
Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus
Stenetorp, and Sebastian Riedel. 2018. Ucl machine
reading group: Four factor framework for fact find-
ing (hexaf). In Proceedings of the First Workshop
on FEVER, pages 97–102.
| 1/10

Preview text:

GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification Jie Zhou1 2 , ,3, Xu Han1 2 , ,3, Cheng Yang1 2 , ,3, Zhiyuan Liu1 2 3 , , †
Lifeng Wang4, Changcheng Li4, Maosong Sun1 2 , ,3
1Department of Computer Science and Technology, Tsinghua University, Beijing, China
2Institute for Artificial Intelligence, Tsinghua University, Beijing, China
3State Key Lab on Intelligent Technology and Systems, Tsinghua University, Beijing, China
4Tencent Marketing Solution, Tencent, Shenzhen, China
{zhoujie18, hanxu17, cheng-ya14}@mails.tsinghua.edu.cn
{fandywang, harrychli}@tencent.com, {liuzy, sms}@tsinghua.edu.cn Abstract “SUPPORTED” Example Claim
The Rodney King riots took place in the most populous
Fact verification (FV) is a challenging task county in the USA.
which requires to retrieve relevant evidence
(1) The 1992 Los Angeles riots, also known as the Evidence
Rodney King riots were a series of riots, lootings, ar-
from plain text and use the evidence to ver-
sons, and civil disturbances that occurred in Los An-
ify given claims. Many claims require to si-
geles County, California in April and May 1992.
(2) Los Angeles County, officially the County of Los
multaneously integrate and reason over several
Angeles, is the most populous county in the USA.
pieces of evidence for verification. However, “REFUTED” Example
previous work employs simple models to ex- Claim
Giada at Home was only available on DVD.
tract information from evidence without let-
(1) Giada at Home is a television show and first aired
ting evidence communicate with each other, Evidence
on October 18, 2008, on the Food Network.
e.g., merely concatenate the evidence for pro-
(2) Food Network is an American basic cable and satellite television channel.
cessing. Therefore, these methods are unable
to grasp sufficient relational and logical infor-
Table 1: Some examples of reasoning over several
mation among the evidence. To alleviate this
pieces of evidence together for verification. The italic
issue, we propose a graph-based evidence ag-
words are the key information to verify the claim. Both
gregating and reasoning (GEAR) framework
of the claims require to reason and aggregate multiple
which enables information to transfer on a
evidence sentences for verification.
fully-connected evidence graph and then uti-
lizes different aggregators to collect multi- evidence information. We further employ
More specifically, given a claim, an FV system is
BERT, an effective pre-trained language repre-
asked to label it as “SUPPORTED”, “REFUTED”,
sentation model, to improve the performance.
Experimental results on a large-scale bench-
or “NOT ENOUGH INFO”, which indicate that
mark dataset FEVER have demonstrated that
the evidence can support, refute, or is not suffi-
GEAR could leverage multi-evidence infor- cient for the claim.
mation for FV and thus achieves the promis-
Existing FV methods formulate FV as a natural
ing result with a test FEVER score of 67.10%.
language inference (NLI) (Angeli and Manning,
Our code is available at https://github.
2014) task. However, they utilize simple evidence com/thunlp/GEAR.
combination methods such as concatenating the 1 Introduction
evidence or just dealing with each evidence-claim
pair. These methods are unable to grasp sufficient
Due to the rapid development of information ex-
relational and logical information among the ev-
traction (IE), huge volumes of data have been
idence. In fact, many claims require to simulta- extracted.
How to automatically verify the
neously integrate and reason over several pieces
data becomes a vital problem for various data- of evidence for verification. As shown in Ta-
driven applications, e.g., knowledge graph com-
ble 1, for both of the “SUPPORTED” example and
pletion (Wang et al., 2017) and open domain
“REFUTED” example, we cannot verify the given
question answering (Chen et al., 2017a). Hence,
claims via checking any evidence in isolation. The
many recent research efforts have been devoted to
claims can be verified only by understanding and
fact verification (FV), which aims to verify given
reasoning over the multiple evidence.
claims with the evidence retrieved from plain text.
To integrate and reason over information from
† Corresponding author: Z.Liu(liuzy@tsinghua.edu.cn)
multiple pieces of evidence, we propose a
graph-based evidence aggregating and reasoning
NLI predictions for final verification. Then,
(GEAR) framework. Specifically, we first build a
Hanselowski et al. (2018); Yoneda et al. (2018);
fully-connected evidence graph and encourage in-
Hidey and Diab (2018) adopt the enhanced se-
formation propagation among the evidence. Then,
quential inference model (ESIM) (Chen et al.,
we aggregate the pieces of evidence and adopt a
2017b), a more effective NLI model, to infer the
classifier to decide whether the evidence can sup-
relevance between evidence and claims instead
port, refute, or is not sufficient for the claim. In-
of DAM. As pre-trained language models have
tuitively, by sufficiently exchanging and reason-
achieved great results on various NLP applica-
ing over evidence information on the evidence
tions, Malon (2018) fine-tunes the generative pre-
graph, the proposed model can make the best of
training transformer (GPT) (Radford et al., 2018)
the information for verifying claims. For exam-
for FV. Based on the methods mentioned above,
ple, by delivering the information “Los Angeles
Nie et al. (2019) specially design the neural se-
County is the most populous county in the USA”
mantic matching network (NSMN), which is a
to “the Rodney King riots occurred in Los Ange-
modification of ESIM and achieves the best results
les County” through the evidence graph, the syn-
in the competition. Unlike these methods, Yin
thetic information can support “The Rodney King
and Roth (2018) propose the TWOWINGOS sys-
riots took place in the most populous county in
tem which trains the evidence identification and
the USA”. Furthermore, we adopt an effective pre-
claim verification modules jointly.
trained language representation model BERT (De-
vlin et al., 2019) to better grasp both evidence and 2.2 Natural Language Inference claim semantics.
The natural language inference (NLI) task requires
We conduct experiments on the large-scale
a system to label the relationship between a pair of
benchmark dataset for Fact Extraction and VER-
premise and hypothesis as entailment, contradic-
ification (FEVER) (Thorne et al., 2018a). Ex-
tion or neutral. Several large-scale datasets have
perimental results show that the proposed frame-
been proposed to promote the research in this di-
work outperforms recent state-of-the-art baseline
rection, such as SNLI (Bowman et al., 2015) and
systems. The further case study indicates that our
Multi-NLI (Williams et al., 2018). These datasets
framework could better leverage multi-evidence
have made it feasible to train complicated neural
information and reason over the evidence for FV.
models which have achieved the state-of-the-art
results (Bowman et al., 2015; Parikh et al., 2016; 2 Related Work
Sha et al., 2016; Chen et al., 2017b,c; Munkhdalai
and Yu, 2017; Nie and Bansal, 2017; Conneau 2.1 FEVER Shared Task
et al., 2017; Gong et al., 2018; Tay et al., 2018;
The FEVER shared task (Thorne et al., 2018b)
Ghaeini et al., 2018). It is intuitive to transfer
challenges participants to develop automatic fact
NLI models into the claim verification stage of
verification systems to check the veracity of
the FEVER task and several teams from the shared
human-generated claims by extracting evidence
task have achieved promising results by this way. from Wikipedia. The shared task is hosted as
a competition on Codalab1 with a blind test 2.3 Pre-trained Language Models
set. Nie et al. (2019); Yoneda et al. (2018) and
Pre-trained language representation models such
Hanselowski et al. (2018) have achieved the top
as ELMo (Peters et al., 2018) and OpenAI three results among 23 teams.
GPT (Radford et al., 2018) are proven to be ef-
Existing methods mainly formulate FV as an
fective on many NLP tasks. BERT (Devlin et al.,
NLI task. Thorne et al. (2018a) simply concate-
2019) employs bidirectional transformer and well-
nate all evidence together, and then feed the con-
designed pre-training tasks to fuse bidirectional
catenated evidence and the given claim into the
context information and obtains the state-of-the-
NLI model. Luken et al. (2018) adopt the de-
art results on the NLI task. In our experiments, we
composable attention model (DAM) (Parikh et al.,
find the fine-tuned BERT model outperforms other
2016) to generate NLI predictions for each claim-
NLI-based models on the claim verification sub-
evidence pair individually and then aggregate all
task of FEVER. Hence, we use BERT as the sen- 1
tence encoder in our framework to better encoding
https://competitions.codalab.org/ competitions/18814
semantic information of evidence and claims. Evidence Filter Sentence Encoding Evidence Reasoning Aggregation Claim Evidence Score Claim 0.3620 Evi 1 Claim Attention 0.1241 0.0237 Evi 2 Claim BERT 0.0014 Max Evi 3 Claim Threshold 0.001 Evi 4 Claim Mean … 0.0009 Layer 0 É Layer T 0.0003 ERNet Document Retrieval Sentence Selection Claim Verification (GEAR)
Figure 1: The pipeline of our method. The GEAR framework is illustrated in the claim verification section. 3 Method In the
We employ a three-step pipeline with compo-
nents for document retrieval, sentence selection max(0, 1+ sn−sp)
and claim verification to solve the task. In the doc- s
ument retrieval and sentence selection stages, we
simply follow the method from Hanselowski et al.
(2018) since their method has the highest score on
evidence recall in the former FEVER shared task.
And we propose our Graph-based Evidence Ag-
gregating and Reasoning (GEAR) framework in
In addition to the original model (Hanselowski
the final claim verification stage. The full pipeline
et al., 2018), we add a relevance score filter with
of our method is illustrated in Figure 1.
a threshold τ . Sentences with relevance scores 3.1
Document Retrieval and Sentence
lower than τ are filtered out to alleviate the noises.
Thus the final size of the retrieved evidence set is Selection
equal to or less than 5. We choose different val-
In this section, we describe our document retrieval
ues of τ and select the value based on the dev
and sentence selection components. Additionally,
set result. The evaluation results of the document
we add a threshold filter after the sentence selec-
retrieval and sentence selection components are
tion component to filter out those noisy evidence. shown in Section 5.1.
In the document retrieval step, we adopt the
entity linking approach from Hanselowski et al. 3.2 Claim Verification with GEAR (2018).
In this section, we describe our GEAR framework
for claim verification. As shown in Figure 1, given
a claim and the retrieved evidence, The seven highest-ranked
results for each query are stored to form a candi-
date article set. Finally, the method drops the ar-
ticles which are not in the offline Wikipedia dump
and filters the articles by the word overlap between their titles and the claim.
The sentence selection component selects the Sentence Encoder
most relevant evidence for the claim from all sen-
Given an input sentence, we employ BERT (De-
tences in the retrieved documents.
vlin et al., 2019) as our sentence encoder by ex-
tracting the final hidden state of the [CLS] token 2
as the representation, where [CLS] is the special
https://www.mediawiki.org/wiki/API: Main_page
classification embedding in BERT. Evidence Aggregator
We employ an evidence aggregator to gather infor-
mation from different evidence nodes and obtain
Note that we concatenate the evidence and the
the final hidden state o ∈ RF ×1. The aggregator
claim to extract the evidence representation be-
may utilize different aggregating strategies and we
cause the evidence nodes in the reasoning graph
suggest three aggregators in our framework:
need the information from the claim to guide the
message passing process among them. Evidence Reasoning Network
To encourage the information propagation among
evidence, we build a fully-connected evidence
graph where each node indicates a piece of evi-
dence. We also add self-loop to every node be-
cause each node needs the information from it-
self in the message propagation process. We use =
ht = {ht1, ht2, ..., ht } to represent the hidden N
states of nodes at layer t, where ht ∈ RF ×1 and F i
is the number of features in each node. The initial The max aggregator per-
hidden state of each evidence node h0i is initialized
forms the element-wise Max operation among hid-
by the evidence presentation: h0 = e i i. den states.
Inspired by recent work on semi-supervised
graph learning and relational reasoning (Kipf and o = Max(hT1 , hT2 , ..., hT). N (6)
Welling, 2017; Velickovic et al., 2018; Palm et al.,
2018), we propose an evidence reasoning network The mean aggregator per-
(ERNet) to propagate information among the ev-
forms the element-wise Mean operation among idence nodes. hidden states.
o = Mean(hT1 , hT2 , ..., hT ). N (7)
Once the final state o is obtained, we employ a
one-layer MLP to get the final prediction l. l = softmax(ReLU(Wo+ b)), (8)
where W ∈ RC×F and b ∈ RC×1 are parame-
ters, and C is the number of prediction labels. 4 Experimental Settings 4.1 Dataset
We conduct our experiments on the large-scale
dataset FEVER (Thorne et al., 2018a). The dataset
consists of 185,455 annotated claims with a set
of 5,416,537 Wikipedia documents from the June
2017 Wikipedia dump. We follow the dataset par- X
tition from the FEVER Shared Task (Thorne et al.,
2018b). Table 2 shows the statistics of the dataset. Split SUPPORTED REFUTED NEI
with the highest scores. In the test phase, we con- Train 80,035 29,775 35,639
catenate the retrieved evidence for predicting. Dev 6,666 6,666 6,666
BERT-Pair. In the BERT-Pair system, we uti- Test 6,666 6,666 6,666
lize BERT to predict the label for each evidence- claim pair.
Concretely, we use each evidence-
Table 2: Statistics of FEVER dataset.
claim pair as the input and the label of the claim
as the prediction target. In the training phase, we
select the ground truth evidence for SUPPORTED 4.2 Baselines
and REFUTED claims and the retrieved evidence
In this section, we describe the baseline systems in
for NEI claims. In the test phase, we predict labels
our experiments. We first introduce the top-3 sys-
for all retrieved evidence-claim pairs. Because dif-
tems from the FEVER shared task. As BERT (De-
ferent evidence-claim pairs may have inconsistent
vlin et al., 2019) has achieved promising perfor-
predicted labels, we then utilize an aggregator to
mance on several NLP tasks, we also implement
obtain the final claim label. We find the aggre-
two baseline systems via fine-tuning BERT in the
gator only returning the predicted label from the claim verification task.
most relevant evidence has the best performance. Shared Task Systems 4.3 Hyperparameter Settings
We choose the top-3 models from the FEVER
We utilize BERTBASE (Devlin et al., 2019) in all shared task as our baselines.
of the BERT fine-tuning baselines and our GEAR
The Athene UKP TU Darmstadt team (Athene)
framework. The learning rate is 2e-5.
(Hanselowski et al., 2018) combines five inference
For BERT-Concat, the maximum sequence
vectors from the ESIM model via attention mech-
length is 256 and the batch size is 16. We limit
anism to make the final prediction.
the max length for concatenated evidence to 240
The UCL Machine Reading Group (UCL
and the max length for claims to 16. We train this
MRG) (Yoneda et al., 2018) predicts the label of
model for two epochs based on dev results. For
each evidence-claim pair and aggregates the re-
BERT-Pair, we set the maximum sequence length
sults via a label aggregation component.
to 128 and batch size to 32. We train this model for
The UNC NLP team (Nie et al., 2019) proposes
one epoch. As for the GEAR framework, we use
the neural semantic matching network and uses the
the fine-tuned BERT-Pair model to extract features
model jointly to solve all three subtasks. They and the batch size is 512.
also incorporate additional information such as
In our ERNet, we set the batch size to 256,
pageview frequency and WordNet features. They
the number of features F to 768 and the dimen-
have achieved best results in the competition.
sion of weight matrices H to 64. The model is
trained to minimize the negative log likelihood BERT Fine-tuning Systems
loss on the predicted label using the Adam opti-
We implement two BERT fine-tuning systems
mizer (Kingma and Ba, 2015) with an initial learn-
with different evidence combination approaches.
ing rate of 5e-3 and L2 weight decay of 5e-4. We
The BERT-Concat system concatenates all evi-
use an early stopping strategy on the label accu-
dence into a single string while the BERT-Pair
racy of the validation set, with a patience of 20
system encodes each evidence-claim pair indepen-
epochs. We attempt to stack 0-3 ERNet layers and
dently and then aggregates the results. Both sys-
analyze the effect of different layer numbers.
tems share the same document retrieval and sen- 4.4 Evaluation Metrics
tence selection components proposed by us.
BERT-Concat. In the BERT-Concat system,
Besides traditional evaluation metrics such as la-
we simply concatenate all evidence into a single
bel accuracy and F1, we use other two metrics to
sentence and utilize BERT to predict the relation evaluate our model.
between the concatenated evidence and the claim.
FEVER score. The FEVER score is the la-
In the training phase, we add the ground truth ev-
bel accuracy conditioned on providing at least one
idence into the retrieved evidence set with rele-
complete set of evidence. Claims labeled as “NEI”
vance score 1 and select five pieces of evidence do not need the evidence. Model OFEVER
OFEVER score drop gradually while the precision Athene 93.55
and F1 score increase. The results are consistent UCL MRG -
with our intuition. If we do not filter out evidence, UNC NLP 92.82
more claims could be provided with the full evi- Our Model 93.33
dence set. If we increase the value of the thresh-
old, more pieces of noisy evidence are filtered out,
Table 3: Document retrieval evaluation on dev set (%).
which contributes to the increase of precision and
(’-’ denotes a missing value) F1. τ OFEVER Precision Recall F1 GEAR LA 5.2 Claim Verification with GEAR 0 91.10 24.08 86.72 37.69 74.84 10−4
5.2 Claim Verification with GEAR

In this section, we evaluate our GEAR framework in different aspects. We first compare the label accuracy scores between our framework and the baseline systems. Then we explore the effect of different thresholds from the upstream sentence filter. We also conduct additional experiments to check the effect of sentence embeddings. As nearly 39% of the claims require reasoning over multiple pieces of evidence, we construct a difficult dev subset and check the effectiveness of our ERNet for evidence reasoning. Finally, we make an error analysis and provide the theoretical upper-bound label accuracy of our framework.

For all of the experiments with GEAR, the scores (label accuracy, FEVER score) we report on the dev set are mean values over 10 runs initialized with different random seeds.

Model Evaluation

We use the label accuracy metric to evaluate the effectiveness of different claim verification models. The second column of Table 7 shows the label accuracy of different models on the dev set. We find the BERT fine-tuning models outperform all of the models from the shared task, which shows the strong capacity of BERT in representation learning and semantic understanding. The BERT-Concat model has a slight improvement of 0.37% over BERT-Pair.

Our final model outperforms the best BERT-Concat baseline by 1.17%. As our framework provides a better way of aggregating and reasoning over evidence, the improvement demonstrates that our framework has a better ability to integrate features from different pieces of evidence by propagating, analyzing and aggregating them.

Effect of Sentence Thresholds

The rightmost column of Table 4 shows the results of our GEAR framework with different sentence selection thresholds. We choose the model with threshold τ = 10^-3, which has the highest label accuracy, as our final model. When the threshold increases from 0 to 10^-3, the label accuracy increases due to less noisy information. However, when the threshold increases from 10^-3 to 10^-1, the label accuracy decreases because informative evidence is filtered out and the model cannot obtain sufficient evidence to make the right inference.

Effect of Sentence Embedding

The BERT model we use in the sentence encoding step is fine-tuned on the FEVER dataset for one epoch. We need to find out whether the fine-tuning process or simply incorporating the sentence embeddings from BERT makes the major contribution to the final result. We conduct an experiment using a BERT model without the fine-tuning process and find that the final dev label accuracy is close to that of a random guess. Therefore, the fine-tuning process, rather than the sentence embeddings themselves, plays the important role in this task: we need fine-tuning to capture the semantic and logical relations between the evidence and the claim. Sentence embeddings are more general and cannot perform well in this specific task, so we cannot simply replace the sentence embeddings used here with those from other methods (e.g., ELMo, CNN).

Effectiveness of ERNet

In our observation, more than half of the claims in the dev dataset only need one piece of evidence to make the right inference. To verify the effectiveness of our framework in reasoning over multiple pieces of evidence, we build a difficult dev subset by selecting samples from the original dev set. For the SUPPORTED and REFUTED classes, claims which can be fully supported by only one piece of evidence are filtered out. All of the NEI claims are selected because the model needs all of the retrieved evidence to conclude that there is "NOT ENOUGH INFO". The difficult subset contains 7870 samples, which covers more than 39% of the dev set.
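The construction of the difficult subset follows directly from the description above. The following is a minimal sketch under the assumption that each dev example carries its gold label and its gold evidence groups (sets of sentences that each fully verify the claim); it is our own illustration, not the script used for the paper.

```python
def build_difficult_subset(dev_examples):
    """Keep claims that cannot be verified by a single evidence sentence.

    SUPPORTED/REFUTED claims are kept only if every complete gold evidence
    group contains more than one sentence; all NEI claims are kept, since
    deciding "NOT ENOUGH INFO" requires looking at all retrieved evidence.
    """
    difficult = []
    for ex in dev_examples:
        if ex["label"] == "NOT ENOUGH INFO":
            difficult.append(ex)
        elif all(len(group) > 1 for group in ex["evidence_groups"]):
            difficult.append(ex)
    return difficult
```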
We test our final model on the difficult subset and present the results in Table 5. We find that our models with ERNet perform better than the models without ERNet, and the minimal improvement between them is 1.27%. We can also see from the table that models with 2 ERNet layers achieve the best results, which indicates that claims from the difficult subset require multi-step evidence propagation. This result demonstrates the ability of our framework to deal with claims that need multiple pieces of evidence.

Table 5: Label accuracy on the difficult dev set with different ERNet layers and evidence aggregators (%).

ERNet Layers | Attention | Max | Mean
0 | 66.17 | 65.36 | 65.03
1 | 67.13 | 66.63 | 66.76
2 | 67.44 | 67.24 | 67.56
3 | 66.53 | 66.72 | 66.89
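Tables 5 and 6 compare three ways of collapsing the per-node evidence features into a single vector. The sketch below is a generic reading of what attention, max and mean aggregation over evidence nodes can look like; it is not taken from the paper's method section, and the exact parameterization used in GEAR may differ (for instance, its attention aggregator may also condition on the claim representation).

```python
import torch
from torch import nn


class EvidenceAggregator(nn.Module):
    """Collapse per-evidence-node features (N x F) into one vector (F)."""

    def __init__(self, num_features=768, mode="attention"):
        super().__init__()
        self.mode = mode
        if mode == "attention":
            # One-layer scorer producing an attention weight per node.
            self.scorer = nn.Sequential(
                nn.Linear(num_features, num_features), nn.Tanh(),
                nn.Linear(num_features, 1))

    def forward(self, node_states):          # node_states: (N, F)
        if self.mode == "max":
            return node_states.max(dim=0).values
        if self.mode == "mean":
            return node_states.mean(dim=0)
        weights = torch.softmax(self.scorer(node_states), dim=0)  # (N, 1)
        return (weights * node_states).sum(dim=0)


states = torch.randn(5, 768)                  # five evidence nodes
print(EvidenceAggregator(mode="max")(states).shape)
```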
Error Analysis

In this section, we examine the effect of errors propagating from the upstream components. We utilize an evidence-enhanced dev subset, which assumes all pieces of ground truth evidence are retrieved, to test the theoretical upper-bound score of our GEAR framework.

In our analysis, the main errors of our framework come from the upstream document retrieval and sentence selection components, which cannot always extract sufficient evidence for inference. For example, to verify the claim "Giada at Home was only available on DVD", we need the evidence "Giada at Home is a television show and first aired on October 18, 2008, on the Food Network." and "Food Network is an American basic cable and satellite television channel.". However, the entity linking method used in our document retrieval component could not retrieve the "Food Network" document from parsing the content of the claim alone. Thus the claim verification component cannot make the right inference with insufficient evidence.

To explore the effect of this issue, we test our models on an evidence-enhanced dev set, in which we add the ground truth evidence with relevance score 1 into the evidence set before the sentence threshold filter. This ensures that each claim in the evidence-enhanced set is provided with the ground truth evidence as well as the retrieved evidence.
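Concretely, the evidence-enhanced set only changes the input to the sentence filter; a minimal sketch, using the same hypothetical evidence-tuple format as in the earlier filtering sketch, could look like the following. The threshold filter is then applied to the enhanced list exactly as before.

```python
def build_evidence_enhanced_set(retrieved, gold_evidence):
    """Insert every gold evidence sentence with relevance score 1 before
    the threshold filter, so no claim lacks its ground truth evidence."""
    enhanced = list(retrieved)
    already = {(doc, line) for doc, line, _, _ in retrieved}
    for doc, line, sentence in gold_evidence:
        if (doc, line) not in already:
            enhanced.append((doc, line, sentence, 1.0))
    return enhanced
```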
The experimental results are shown in Table 6. We find that all scores in the table increase by more than 1.4% compared to the corresponding dev label accuracy in Table 7, owing to the addition of the ground truth evidence. Because of the assumption of oracle upstream components, the results in Table 6 indicate the theoretical upper-bound label accuracy of our framework.

Table 6: Label accuracy on the evidence-enhanced dev set with different ERNet layers and evidence aggregators (%).

ERNet Layers | Attention | Max | Mean
0 | 77.12 | 76.95 | 76.30
1 | 77.74 | 77.66 | 77.62
2 | 77.82 | 77.66 | 77.73
3 | 77.70 | 77.55 | 77.60

The results show the remaining challenges in the evidence retrieval task, which current models cannot solve. Nie et al. (2019) propose a two-hop evidence enhancement method which improves their final FEVER score by 0.08%. As the addition of the ground truth evidence leads to an increase of more than 1.4% in our experiments, it is worthwhile to design a better evidence retrieval pipeline, which remains part of our future work.
5.3 Full Pipeline

We present the evaluation of our full pipeline in this section. Note that there is a gap between the label accuracy and the final FEVER score due to the completeness of the evidence set. We find that a model which is good at predicting NEI instances tends to obtain a higher FEVER score, so we choose our final model based on the dev FEVER score among all of our experiments. This model contains one layer of ERNet and uses the attention aggregator; the threshold of the sentence filter is 10^-3.

Table 7 presents the evaluations of the full pipeline. We find that the test FEVER scores of the BERT fine-tuning systems outperform the other shared task models by nearly 1%. Furthermore, our full pipeline outperforms the BERT-Concat baseline by 1.46% and achieves significant improvements.

Table 7: Evaluations of the full pipeline. The results of our pipeline are chosen from the model with the highest dev FEVER score (%).

Model | Dev LA | Dev FEVER | Test LA | Test FEVER
Athene | 68.49 | 64.74 | 65.46 | 61.58
UCL MRG | 69.66 | 65.41 | 67.62 | 62.52
UNC NLP | 69.72 | 66.49 | 68.21 | 64.21
BERT Pair | 73.30 | 68.90 | 69.75 | 65.18
BERT Concat | 73.67 | 68.89 | 71.01 | 65.64
Our pipeline | 74.84 | 70.69 | 71.60 | 67.10
5.4 Case Study

Table 8 shows an example from our experiments which needs multiple pieces of evidence to make the right inference. The ground truth evidence set contains the sentences from the article "Al Jardine" with line numbers 0 and 1, and these two pieces of evidence are also ranked in the top two of our retrieved evidence set. To verify whether "Al Jardine is an American rhythm guitarist", our model needs the evidence "He is best known as the band's rhythm guitarist" as well as the evidence "Alan Charles Jardine ... is an American musician". We plot the attention map from our final model, with one layer of ERNet and the attention aggregator, in Figure 2. We find that all evidence nodes tend to attend to the first and the second evidence nodes, which provide the most useful information in this case. The attention weights on the other evidence nodes are quite low, which indicates that our model is able to select useful information from multiple pieces of evidence.

Table 8: A case of a claim that requires integrating multiple pieces of evidence to verify. The representation "{DocName, LineNum}" means the evidence is extracted from the document "DocName" at line number LineNum.

Claim: Al Jardine is an American rhythm guitarist.
Truth evidence: {Al Jardine, 0}, {Al Jardine, 1}
Retrieved evidence: {Al Jardine, 1}, {Al Jardine, 0}, {Al Jardine, 2}, {Al Jardine, 5}, {Jardine, 42}
Evidence:
(1) He is best known as the band's rhythm guitarist, and for occasionally singing lead vocals on singles such as "Help Me, Rhonda" (1965), "Then I Kissed Her" (1965) and "Come Go with Me" (1978).
(2) Alan Charles Jardine (born September 3, 1942) is an American musician, singer and songwriter who co-founded the Beach Boys.
(3) In 2010, Jardine released his debut solo studio album, A Postcard from California.
(4) In 1988, Jardine was inducted into the Rock and Roll Hall of Fame as a member of the Beach Boys.
(5) Ray Jardine, American rock climber, lightweight backpacker, inventor, author and global adventurer.
Label: SUPPORTED

Figure 2: Attention map for the case in Table 8. The first five rows indicate the attention weights from nodes 1 to 5 in the first ERNet layer and the last row shows the attention weights from the attention aggregator.
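Figure 2 is a 6 x 5 heat map: one row of attention weights per evidence node in the first ERNet layer, plus a final row for the attention aggregator. Assuming the weights have been collected into NumPy arrays, a plot of this kind can be reproduced with a few lines of matplotlib; the sketch below uses random placeholder weights and is our own plotting example, not the paper's script.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder weights: rows 1-5 are per-node attention in the first
# ERNet layer, the last row is the attention aggregator's weights.
layer_attention = np.random.dirichlet(np.ones(5), size=5)   # shape (5, 5)
aggregator_attention = np.random.dirichlet(np.ones(5))      # shape (5,)
attention_map = np.vstack([layer_attention, aggregator_attention])

fig, ax = plt.subplots()
ax.imshow(attention_map, cmap="Blues")
ax.set_xticks(range(5))
ax.set_xticklabels([f"node {i + 1}" for i in range(5)])
ax.set_yticks(range(6))
ax.set_yticklabels([f"node {i + 1}" for i in range(5)] + ["aggregator"])
ax.set_xlabel("attended evidence node")
plt.show()
```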
6 Conclusion

We propose a novel Graph-based Evidence Aggregating and Reasoning (GEAR) framework for the claim verification subtask of FEVER. The framework utilizes the BERT sentence encoder, the evidence reasoning network (ERNet) and an evidence aggregator to encode, propagate and aggregate information from multiple pieces of evidence. The framework is proven to be effective and our final pipeline achieves significant improvements. In the future, we would like to design a multi-step evidence extractor and incorporate external knowledge into our framework.

7 Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004503) and the National Natural Science Foundation of China (NSFC No. 61772302, 61572273). This work is also supported by the 2018 Tencent Marketing Solution Rhino-Bird Focused Research Program (No. 201808). Han is also supported by the 2018 Tencent Rhino-Bird Elite Training Program.

References

Gabor Angeli and Christopher D. Manning. 2014. NaturalLI: Natural logic inference for common sense reasoning. In Proceedings of EMNLP, pages 534–545.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In Proceedings of ACL, pages 1870–1879.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Enhanced LSTM for natural language inference. In Proceedings of ACL, pages 1657–1668.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017c. Recurrent neural network-based sentence encoder with gated attention for natural language inference. In Proceedings of ACL, Workshop on Evaluating Vector Space Representations for NLP, pages 36–40.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of EMNLP, pages 670–680.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of ACL, Workshop for NLP Open Source Software, pages 1–6.

Reza Ghaeini, Sadid A. Hasan, Vivek Datla, Joey Liu, Kathy Lee, Ashequl Qadir, Yuan Ling, Aaditya Prakash, Xiaoli Fern, and Oladimeji Farri. 2018. DR-BiLSTM: Dependent reading bidirectional LSTM for natural language inference. In Proceedings of NAACL-HLT, pages 1460–1469.

Yichen Gong, Heng Luo, and Jian Zhang. 2018. Natural language inference over interaction space. In Proceedings of ICLR.

Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018. UKP-Athene: Multi-sentence textual entailment for claim verification. In Proceedings of EMNLP, pages 103–108.

Christopher Hidey and Mona Diab. 2018. Team SWEEPer: Joint sentence extraction and fact checking with pointer networks. In Proceedings of the First Workshop on FEVER, pages 150–155.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of ICLR.
Jackson Luken, Nanjiang Jiang, and Marie-Catherine de Marneffe. 2018. QED: A fact verification system for the FEVER shared task. In Proceedings of the First Workshop on FEVER, pages 156–160.

Christopher Malon. 2018. Team Papelo: Transformer networks at FEVER. In Proceedings of the First Workshop on FEVER, pages 109–113.

Tsendsuren Munkhdalai and Hong Yu. 2017. Neural tree indexers for text understanding. In Proceedings of EACL, pages 11–21.

Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked sentence encoders for multi-domain inference. In Proceedings of ACL, Workshop on Evaluating Vector Space Representations for NLP, pages 41–45.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of AAAI.

Rasmus Palm, Ulrich Paquet, and Ole Winther. 2018. Recurrent relational networks. In Proceedings of NIPS, pages 3368–3378.
Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of EMNLP, pages 2249–2255.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Lei Sha, Baobao Chang, Zhifang Sui, and Sujian Li. 2016. Reading and thinking: Re-read LSTM unit for textual entailment recognition. In Proceedings of COLING, pages 2870–2879.

Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Compare, compress and propagate: Enhancing neural architectures with alignment factorization for natural language inference. In Proceedings of EMNLP, pages 1565–1575.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of NAACL-HLT, pages 809–819.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018b. The Fact Extraction and VERification (FEVER) shared task. In Proceedings of the First Workshop on FEVER, pages 1–9.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of ICLR.

Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE TKDE, 29(12):2724–2743.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122.

Wenpeng Yin and Dan Roth. 2018. TwoWingOS: A two-wing optimization strategy for evidential claim verification. In Proceedings of EMNLP, pages 105–114.

Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. UCL Machine Reading Group: Four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on FEVER, pages 97–102.