GEAR: Graph-based Evidence Aggregating and Reasoning
for Fact Verification
Jie Zhou^{1,2,3}, Xu Han^{1,2,3}, Cheng Yang^{1,2,3}, Zhiyuan Liu^{1,2,3},
Lifeng Wang^{4}, Changcheng Li^{4}, Maosong Sun^{1,2,3}
1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
2 Institute for Artificial Intelligence, Tsinghua University, Beijing, China
3 State Key Lab on Intelligent Technology and Systems, Tsinghua University, Beijing, China
4 Tencent Marketing Solution, Tencent, Shenzhen, China
{zhoujie18, hanxu17, cheng-ya14}@mails.tsinghua.edu.cn
{fandywang, harrychli}@tencent.com, {liuzy, sms}@tsinghua.edu.cn
Abstract
Fact verification (FV) is a challenging task
which requires retrieving relevant evidence
from plain text and using the evidence to verify
given claims. Many claims require simultaneously
integrating and reasoning over several
pieces of evidence for verification. However,
previous work employs simple models to ex-
tract information from evidence without let-
ting evidence communicate with each other,
e.g., by merely concatenating the evidence for
processing. Therefore, these methods are unable
to grasp sufficient relational and logical infor-
mation among the evidence. To alleviate this
issue, we propose a graph-based evidence ag-
gregating and reasoning (GEAR) framework
which enables information to transfer on a
fully-connected evidence graph and then uti-
lizes different aggregators to collect multi-
evidence information. We further employ
BERT, an effective pre-trained language repre-
sentation model, to improve the performance.
Experimental results on a large-scale bench-
mark dataset FEVER have demonstrated that
GEAR can leverage multi-evidence infor-
mation for FV and thus achieves a promis-
ing result with a test FEVER score of 67.10%.
Our code is available at https://github.
com/thunlp/GEAR.
1 Introduction
Due to the rapid development of information ex-
traction (IE), huge volumes of data have been
extracted. How to automatically verify the
data becomes a vital problem for various data-
driven applications, e.g., knowledge graph com-
pletion (Wang et al., 2017) and open domain
question answering (Chen et al., 2017a). Hence,
many recent research efforts have been devoted to
fact verification (FV), which aims to verify given
claims with the evidence retrieved from plain text.
Corresponding author: Z.Liu(liuzy@tsinghua.edu.cn)
“SUPPORTED” Example
Claim The Rodney King riots took place in the most populous
county in the USA.
Evidence
(1) The 1992 Los Angeles riots, also known as the
Rodney King riots were a series of riots, lootings, ar-
sons, and civil disturbances that occurred in Los An-
geles County, California in April and May 1992.
(2) Los Angeles County, officially the County of Los
Angeles, is the most populous county in the USA.
“REFUTED” Example
Claim Giada at Home was only available on DVD.
Evidence
(1) Giada at Home is a television show and first aired
on October 18, 2008, on the Food Network.
(2) Food Network is an American basic cable and
satellite television channel.
Table 1: Some examples of reasoning over several
pieces of evidence together for verification. The italic
words are the key information to verify the claim. Both
of the claims require reasoning over and aggregating multiple
evidence sentences for verification.
More specifically, given a claim, an FV system is
asked to label it as “SUPPORTED”, “REFUTED”,
or “NOT ENOUGH INFO”, indicating that the
evidence supports, refutes, or is insufficient to
verify the claim, respectively.
Existing FV methods formulate FV as a natural
language inference (NLI) (Angeli and Manning,
2014) task. However, they utilize simple evidence
combination methods such as concatenating the
evidence or just dealing with each evidence-claim
pair. These methods are unable to grasp sufficient
relational and logical information among the ev-
idence. In fact, many claims require simultaneously
integrating and reasoning over several pieces
of evidence for verification. As shown in Ta-
ble 1, for both of the “SUPPORTED” example and
“REFUTED” example, we cannot verify the given
claims via checking any evidence in isolation. The
claims can be verified only by understanding and
reasoning over multiple pieces of evidence.
To integrate and reason over information from
multiple pieces of evidence, we propose a
graph-based evidence aggregating and reasoning
(GEAR) framework. Specifically, we first build a
fully-connected evidence graph and encourage in-
formation propagation among the evidence. Then,
we aggregate the pieces of evidence and adopt a
classifier to decide whether the evidence can sup-
port, refute, or is not sufficient for the claim. In-
tuitively, by sufficiently exchanging and reason-
ing over evidence information on the evidence
graph, the proposed model can make the best of
the information for verifying claims. For exam-
ple, by delivering the information “Los Angeles
County is the most populous county in the USA”
to “the Rodney King riots occurred in Los Ange-
les County” through the evidence graph, the syn-
thetic information can support “The Rodney King
riots took place in the most populous county in
the USA”. Furthermore, we adopt an effective pre-
trained language representation model BERT (De-
vlin et al., 2019) to better grasp both evidence and
claim semantics.
We conduct experiments on the large-scale
benchmark dataset for Fact Extraction and VER-
ification (FEVER) (Thorne et al., 2018a). Ex-
perimental results show that the proposed frame-
work outperforms recent state-of-the-art baseline
systems. The further case study indicates that our
framework could better leverage multi-evidence
information and reason over the evidence for FV.
2 Related Work
2.1 FEVER Shared Task
The FEVER shared task (Thorne et al., 2018b)
challenges participants to develop automatic fact
verification systems to check the veracity of
human-generated claims by extracting evidence
from Wikipedia. The shared task is hosted as
a competition on Codalab¹ with a blind test
set. Nie et al. (2019); Yoneda et al. (2018) and
Hanselowski et al. (2018) have achieved the top
three results among 23 teams.
Existing methods mainly formulate FV as an
NLI task. Thorne et al. (2018a) simply concate-
nate all evidence together, and then feed the con-
catenated evidence and the given claim into the
NLI model. Luken et al. (2018) adopt the de-
composable attention model (DAM) (Parikh et al.,
2016) to generate NLI predictions for each claim-
evidence pair individually and then aggregate all
¹ https://competitions.codalab.org/competitions/18814
NLI predictions for final verification. Then,
Hanselowski et al. (2018); Yoneda et al. (2018);
Hidey and Diab (2018) adopt the enhanced se-
quential inference model (ESIM) (Chen et al.,
2017b), a more effective NLI model, to infer the
relevance between evidence and claims instead
of DAM. As pre-trained language models have
achieved great results on various NLP applica-
tions, Malon (2018) fine-tunes the generative pre-
training transformer (GPT) (Radford et al., 2018)
for FV. Based on the methods mentioned above,
Nie et al. (2019) specially design the neural se-
mantic matching network (NSMN), which is a
modification of ESIM and achieves the best results
in the competition. Unlike these methods, Yin
and Roth (2018) propose the TWOWINGOS sys-
tem which trains the evidence identification and
claim verification modules jointly.
2.2 Natural Language Inference
The natural language inference (NLI) task requires
a system to label the relationship between a pair of
premise and hypothesis as entailment, contradic-
tion or neutral. Several large-scale datasets have
been proposed to promote the research in this di-
rection, such as SNLI (Bowman et al., 2015) and
Multi-NLI (Williams et al., 2018). These datasets
have made it feasible to train complicated neural
models which have achieved the state-of-the-art
results (Bowman et al., 2015; Parikh et al., 2016;
Sha et al., 2016; Chen et al., 2017b,c; Munkhdalai
and Yu, 2017; Nie and Bansal, 2017; Conneau
et al., 2017; Gong et al., 2018; Tay et al., 2018;
Ghaeini et al., 2018). It is intuitive to transfer
NLI models into the claim verification stage of
the FEVER task and several teams from the shared
task have achieved promising results in this way.
2.3 Pre-trained Language Models
Pre-trained language representation models such
as ELMo (Peters et al., 2018) and OpenAI
GPT (Radford et al., 2018) are proven to be ef-
fective on many NLP tasks. BERT (Devlin et al.,
2019) employs bidirectional transformer and well-
designed pre-training tasks to fuse bidirectional
context information and obtains the state-of-the-
art results on the NLI task. In our experiments, we
find the fine-tuned BERT model outperforms other
NLI-based models on the claim verification sub-
task of FEVER. Hence, we use BERT as the sen-
tence encoder in our framework to better encode
the semantic information of evidence and claims.
[Figure 1: pipeline overview — Document Retrieval → Sentence Selection (evidence filter over relevance scores with threshold 0.001) → Claim Verification (GEAR), where GEAR comprises BERT sentence encoding of each evidence-claim pair, ERNet evidence reasoning layers (Layer 0 ... Layer T), and an attention/max/mean aggregator.]
Figure 1: The pipeline of our method. The GEAR framework is illustrated in the claim verification section.
3 Method
We employ a three-step pipeline with compo-
nents for document retrieval, sentence selection
and claim verification to solve the task. In the doc-
ument retrieval and sentence selection stages, we
simply follow the method from Hanselowski et al.
(2018) since their method has the highest score on
evidence recall in the former FEVER shared task.
We then apply our Graph-based Evidence Ag-
gregating and Reasoning (GEAR) framework in
the final claim verification stage. The full pipeline
of our method is illustrated in Figure 1.
3.1 Document Retrieval and Sentence
Selection
In this section, we describe our document retrieval
and sentence selection components. Additionally,
we add a threshold filter after the sentence selec-
tion component to filter out noisy evidence.
In the document retrieval step, we adopt the
entity linking approach from Hanselowski et al.
(2018), which extracts potential entity mentions
from the claim and uses them to query the MediaWiki
API². The seven highest-ranked results for each
query are stored to form a candidate article set.
Finally, the method drops the articles which are
not in the offline Wikipedia dump and filters the
remaining articles by the word overlap between
their titles and the claim.
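The last filtering step can be pictured with a short sketch; this is an illustrative reimplementation (the wiki_titles set stands in for the offline dump), not the released code:

```python
def filter_candidates(candidate_titles, claim, wiki_titles):
    """Keep candidate articles that exist in the offline Wikipedia dump
    and whose titles share at least one word with the claim."""
    claim_tokens = {tok.lower() for tok in claim.split()}
    kept = []
    for title in candidate_titles:
        if title not in wiki_titles:  # drop articles missing from the dump
            continue
        title_tokens = {tok.lower() for tok in title.replace("_", " ").split()}
        if title_tokens & claim_tokens:  # word overlap between title and claim
            kept.append(title)
    return kept
```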
The sentence selection component selects the
most relevant evidence for the claim from all sen-
tences in the retrieved documents.
² https://www.mediawiki.org/wiki/API:Main_page
In the sentence selection step, the ranking model
from Hanselowski et al. (2018) assigns a relevance
score to each candidate sentence; it is trained with
a hinge loss max(0, 1 + s_n − s_p), where s_n and
s_p denote the scores of negative and positive samples.
In addition to the original model (Hanselowski
et al., 2018), we add a relevance score filter with
a threshold τ . Sentences with relevance scores
lower than τ are filtered out to alleviate noise.
Thus the final size of the retrieved evidence set is
equal to or less than 5. We choose different val-
ues of τ and select the value based on the dev
set result. The evaluation results of the document
retrieval and sentence selection components are
shown in Section 5.1.
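The filter itself is straightforward; below is a minimal sketch, with the relevance scores assumed to come from the upstream ranking model and the function name chosen for illustration:

```python
def select_evidence(scored_sentences, tau, max_evidence=5):
    """scored_sentences: list of (sentence, relevance_score) pairs from the
    sentence selection model. Keep at most `max_evidence` sentences whose
    relevance score is at least `tau`."""
    ranked = sorted(scored_sentences, key=lambda x: x[1], reverse=True)
    return [(sent, score) for sent, score in ranked[:max_evidence] if score >= tau]
```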
3.2 Claim Verification with GEAR
In this section, we describe our GEAR framework
for claim verification. As shown in Figure 1, given
a claim and the retrieved evidence,
Sentence Encoder
Given an input sentence, we employ BERT (De-
vlin et al., 2019) as our sentence encoder by ex-
tracting the final hidden state of the [CLS] token
as the representation, where [CLS] is the special
classification embedding in BERT.
Note that we concatenate the evidence and the
claim to extract the evidence representation be-
cause the evidence nodes in the reasoning graph
need the information from the claim to guide the
message passing process among them.
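As a concrete illustration, extracting the [CLS] representation for an evidence-claim pair might look as follows with the HuggingFace transformers library; tokenization and truncation details of the original implementation may differ:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

def encode_pair(evidence: str, claim: str) -> torch.Tensor:
    """Return the final hidden state of the [CLS] token for an
    evidence-claim pair (shape: [hidden_size])."""
    inputs = tokenizer(evidence, claim, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0].squeeze(0)  # [CLS] representation
```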
Evidence Reasoning Network
To encourage the information propagation among
evidence, we build a fully-connected evidence
graph where each node indicates a piece of evi-
dence. We also add self-loop to every node be-
cause each node needs the information from it-
self in the message propagation process. We use
h^t = {h^t_1, h^t_2, ..., h^t_N} to represent the hidden
states of the nodes at layer t, where h^t_i ∈ R^{F×1} and F
is the number of features in each node. The initial
hidden state of each evidence node h^0_i is initialized
by the evidence representation: h^0_i = e_i.
Inspired by recent work on semi-supervised
graph learning and relational reasoning (Kipf and
Welling, 2017; Velickovic et al., 2018; Palm et al.,
2018), we propose an evidence reasoning network
(ERNet) to propagate information among the ev-
idence nodes.
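The ERNet update can be pictured as attention-based message passing over this graph. The following PyTorch layer is a minimal sketch of one such propagation step, assuming an MLP-scored attention over all node pairs; the class name, the scoring network, and the attn_dim default (chosen to match the weight-matrix dimension H = 64 reported in Section 4.3) are illustrative assumptions rather than the authors' exact ERNet formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidencePropagationLayer(nn.Module):
    """One propagation step on a fully-connected evidence graph with
    self-loops: each node attends to every node (including itself) and
    takes an attention-weighted sum of their hidden states."""
    def __init__(self, num_features: int, attn_dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * num_features, attn_dim),
            nn.ReLU(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [N, F] hidden states of the N evidence nodes at layer t-1
        n = h.size(0)
        h_i = h.unsqueeze(1).expand(n, n, -1)   # receiver states, repeated per row
        h_j = h.unsqueeze(0).expand(n, n, -1)   # sender states, repeated per column
        scores = self.score(torch.cat([h_i, h_j], dim=-1)).squeeze(-1)  # [N, N]
        alpha = F.softmax(scores, dim=-1)       # attention over senders j
        return alpha @ h                        # new hidden states, [N, F]
```

Stacking T such layers yields the node states h^T_1, ..., h^T_N that are passed to the aggregator below.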
Evidence Aggregator
We employ an evidence aggregator to gather infor-
mation from different evidence nodes and obtain
the final hidden state o ∈ R^{F×1}. The aggregator
may utilize different aggregating strategies and we
suggest three aggregators in our framework:
Attention Aggregator. The attention aggregator uses the claim
representation to attend over the hidden states of the evidence
nodes and takes the attention-weighted sum as the aggregated state.
Max Aggregator. The max aggregator performs the element-wise
Max operation among hidden states:
o = Max(h^T_1, h^T_2, ..., h^T_N). (6)
Mean Aggregator. The mean aggregator performs the element-wise
Mean operation among hidden states:
o = Mean(h^T_1, h^T_2, ..., h^T_N). (7)
Once the final state o is obtained, we employ a
one-layer MLP to get the final prediction l:
l = softmax(ReLU(Wo + b)), (8)
where W ∈ R^{C×F} and b ∈ R^{C×1} are parameters,
and C is the number of prediction labels.
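A compact sketch of the three aggregators and the prediction layer follows. The max, mean, and MLP branches mirror Equations (6)-(8); the attention branch is an assumed simplified scorer, not the paper's exact attention aggregator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EvidenceAggregator(nn.Module):
    """Collapse the N final node states h_T (shape [N, F]) into a single
    vector o and predict one of C labels with a one-layer MLP."""
    def __init__(self, num_features: int, num_labels: int, mode: str = "mean"):
        super().__init__()
        self.mode = mode
        self.attn = nn.Linear(num_features, 1)            # illustrative attention scorer
        self.classifier = nn.Linear(num_features, num_labels)  # W in R^{C x F}, b in R^C

    def forward(self, h_T: torch.Tensor) -> torch.Tensor:
        if self.mode == "max":
            o = h_T.max(dim=0).values            # element-wise Max (Eq. 6)
        elif self.mode == "mean":
            o = h_T.mean(dim=0)                  # element-wise Mean (Eq. 7)
        else:
            w = F.softmax(self.attn(h_T), dim=0) # assumed attention weights, [N, 1]
            o = (w * h_T).sum(dim=0)             # weighted sum of node states
        return F.softmax(F.relu(self.classifier(o)), dim=-1)  # l = softmax(ReLU(Wo + b)) (Eq. 8)
```

In training, one would typically take the logarithm of this output to compute the negative log-likelihood loss mentioned in Section 4.3.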
4 Experimental Settings
4.1 Dataset
We conduct our experiments on the large-scale
dataset FEVER (Thorne et al., 2018a). The dataset
consists of 185,455 annotated claims with a set
of 5,416,537 Wikipedia documents from the June
2017 Wikipedia dump. We follow the dataset par-
tition from the FEVER Shared Task (Thorne et al.,
2018b). Table 2 shows the statistics of the dataset.
Split SUPPORTED REFUTED NEI
Train 80,035 29,775 35,639
Dev 6,666 6,666 6,666
Test 6,666 6,666 6,666
Table 2: Statistics of FEVER dataset.
4.2 Baselines
In this section, we describe the baseline systems in
our experiments. We first introduce the top-3 sys-
tems from the FEVER shared task. As BERT (De-
vlin et al., 2019) has achieved promising perfor-
mance on several NLP tasks, we also implement
two baseline systems via fine-tuning BERT in the
claim verification task.
Shared Task Systems
We choose the top-3 models from the FEVER
shared task as our baselines.
The Athene UKP TU Darmstadt team (Athene)
(Hanselowski et al., 2018) combines five inference
vectors from the ESIM model via attention mech-
anism to make the final prediction.
The UCL Machine Reading Group (UCL
MRG) (Yoneda et al., 2018) predicts the label of
each evidence-claim pair and aggregates the re-
sults via a label aggregation component.
The UNC NLP team (Nie et al., 2019) proposes
the neural semantic matching network and uses the
model jointly to solve all three subtasks. They
also incorporate additional information such as
pageview frequency and WordNet features. They
have achieved the best results in the competition.
BERT Fine-tuning Systems
We implement two BERT fine-tuning systems
with different evidence combination approaches.
The BERT-Concat system concatenates all evi-
dence into a single string while the BERT-Pair
system encodes each evidence-claim pair indepen-
dently and then aggregates the results. Both sys-
tems share the same document retrieval and sen-
tence selection components proposed by us.
BERT-Concat. In the BERT-Concat system,
we simply concatenate all evidence into a single
sentence and utilize BERT to predict the relation
between the concatenated evidence and the claim.
In the training phase, we add the ground truth ev-
idence into the retrieved evidence set with rele-
vance score 1 and select five pieces of evidence
with the highest scores. In the test phase, we con-
catenate the retrieved evidence for prediction.
BERT-Pair. In the BERT-Pair system, we uti-
lize BERT to predict the label for each evidence-
claim pair. Concretely, we use each evidence-
claim pair as the input and the label of the claim
as the prediction target. In the training phase, we
select the ground truth evidence for SUPPORTED
and REFUTED claims and the retrieved evidence
for NEI claims. In the test phase, we predict labels
for all retrieved evidence-claim pairs. Because dif-
ferent evidence-claim pairs may have inconsistent
predicted labels, we then utilize an aggregator to
obtain the final claim label. We find that the aggregator
that simply returns the predicted label of the
most relevant evidence performs best.
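This aggregation rule amounts to taking the label predicted for the highest-scoring evidence; a tiny sketch with illustrative names:

```python
def aggregate_pair_predictions(predictions):
    """predictions: list of (relevance_score, predicted_label) pairs for all
    retrieved evidence-claim pairs of one claim. The claim label is the
    prediction attached to the most relevant evidence."""
    _, best_label = max(predictions, key=lambda x: x[0])
    return best_label
```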
4.3 Hyperparameter Settings
We utilize BERT_BASE (Devlin et al., 2019) in all
of the BERT fine-tuning baselines and our GEAR
framework. The learning rate is 2e-5.
For BERT-Concat, the maximum sequence
length is 256 and the batch size is 16. We limit
the max length for concatenated evidence to 240
and the max length for claims to 16. We train this
model for two epochs based on dev results. For
BERT-Pair, we set the maximum sequence length
to 128 and batch size to 32. We train this model for
one epoch. As for the GEAR framework, we use
the fine-tuned BERT-Pair model to extract features
and the batch size is 512.
In our ERNet, we set the batch size to 256,
the number of features F to 768, and the dimension
of the weight matrices H to 64. The model is
trained to minimize the negative log likelihood
loss on the predicted label using the Adam opti-
mizer (Kingma and Ba, 2015) with an initial learn-
ing rate of 5e-3 and L2 weight decay of 5e-4. We
use an early stopping strategy on the label accu-
racy of the validation set, with a patience of 20
epochs. We attempt to stack 0-3 ERNet layers and
analyze the effect of different layer numbers.
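For reference, these settings can be collected into one configuration sketch; the dictionary layout and key names are illustrative and not taken from the released code:

```python
GEAR_CONFIG = {
    "encoder": "bert-base-uncased",        # BERT_BASE sentence encoder
    "bert_learning_rate": 2e-5,
    "bert_concat": {"max_seq_length": 256, "batch_size": 16,
                    "max_evidence_length": 240, "max_claim_length": 16, "epochs": 2},
    "bert_pair": {"max_seq_length": 128, "batch_size": 32, "epochs": 1},
    "feature_extraction_batch_size": 512,
    "ernet": {"batch_size": 256, "num_features": 768, "attn_dim": 64,
              "layers_explored": [0, 1, 2, 3],
              "optimizer": "adam", "learning_rate": 5e-3,
              "weight_decay": 5e-4, "early_stopping_patience": 20},
}
```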
4.4 Evaluation Metrics
Besides traditional evaluation metrics such as la-
bel accuracy and F1, we use two other metrics to
evaluate our model.
FEVER score. The FEVER score is the la-
bel accuracy conditioned on providing at least one
complete set of evidence. Claims labeled as “NEI”
do not require evidence.
Model OFEVER
Athene 93.55
UCL MRG -
UNC NLP 92.82
Our Model 93.33
Table 3: Document retrieval evaluation on dev set (%).
(’-’ denotes a missing value)
τ        OFEVER  Precision  Recall  F1     GEAR LA
0        91.10   24.08      86.72   37.69  74.84
10^{-4}  91.04   30.88      86.63   45.53  74.86
10^{-3}  90.86   40.60      86.36   55.23  74.91
10^{-2}  90.27   53.12      85.47   65.52  74.89
10^{-1}  87.70   70.61      81.64   75.72  74.81
Table 4: Sentence selection evaluation and average la-
bel accuracy of GEAR with different thresholds on dev
set (%).
OFEVER score. The document retrieval and
sentence selection components are usually eval-
uated by the oracle FEVER (OFEVER) score,
which is the upper bound of the FEVER score by
assuming perfect downstream systems.
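The two metrics can be illustrated with a simplified scorer; the field names and the (page, line) evidence representation are assumptions, and this is not the official FEVER scorer:

```python
def fever_score(instances):
    """Each instance carries a gold label, a predicted label, the gold
    evidence sets (lists of (page, line) pairs), and the retrieved evidence.
    A claim counts only if the label is correct and, for non-NEI claims,
    at least one complete gold evidence set was retrieved."""
    correct = 0
    for inst in instances:
        label_ok = inst["predicted_label"] == inst["label"]
        if inst["label"] == "NOT ENOUGH INFO":
            evidence_ok = True                      # NEI claims need no evidence
        else:
            retrieved = set(inst["retrieved_evidence"])
            evidence_ok = any(set(es) <= retrieved for es in inst["evidence_sets"])
        correct += label_ok and evidence_ok
    return correct / len(instances)
```

Treating label_ok as always true for every claim yields the oracle FEVER (OFEVER) score described above.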
For all of the experiments with GEAR, the
scores (label accuracy, FEVER score) we report
on the dev set are mean values over 10 runs
initialized with different random seeds.
5 Experimental Results and Analysis
In this section, we first present the evaluations
of the document retrieval and sentence selection
components. Then we evaluate our GEAR frame-
work in several different aspects. Finally, we
present a case study to demonstrate the effective-
ness of our framework.
5.1 Document Retrieval and Sentence
Selection
We use the OFEVER metric to evaluate the doc-
ument retrieval component. Table 3 shows the
OFEVER scores of our model and models from
other teams. After running the same model pro-
posed by Hanselowski et al. (2018), we find our
OFEVER score is slightly lower, which may be due
to random factors.
Then we compare our sentence selection com-
ponent with different thresholds, as shown in Ta-
ble 4. We find the model with threshold 0 achieves
the highest recall and OFEVER score. When
the threshold increases, the recall value and the
OFEVER score drop gradually while the precision
and F1 score increase. The results are consistent
with our intuition. If we do not filter out evidence,
more claims could be provided with the full evi-
dence set. If we increase the value of the thresh-
old, more pieces of noisy evidence are filtered out,
which contributes to the increase of precision and
F1.
5.2 Claim Verification with GEAR
In this section, we evaluate our GEAR framework
in different aspects. We first compare the label ac-
curacy scores between our framework and base-
line systems. Then we explore the effect of differ-
ent thresholds from the upstream sentence filter.
We also conduct additional experiments to check
the effect of sentence embedding. As nearly
39% of the claims require reasoning over mul-
tiple pieces of evidence, we construct a difficult
dev subset and check the effectiveness of our ER-
Net for evidence reasoning. Finally, we make an
error analysis and provide the theoretical upper-
bound label accuracy of our framework.
Model Evaluation
We use the label accuracy metric to evaluate the
effectiveness of different claim verification mod-
els. The second column of Table 7 shows the la-
bel accuracy of different models on the dev set.
We find the BERT fine-tuning models outperform
all of the models from the shared task, which
shows the strong capacity of BERT in represen-
tation learning and semantic understanding. The
BERT-Concat model has a slight improvement
over BERT-Pair, which is 0.37%.
Our final model outperforms the best BERT-
Concat baseline by 1.17%. As our framework pro-
vides a better way for evidence aggregating and
reasoning, the improvement demonstrates that our
framework has a better ability to integrate features
from different evidence by propagating, analyzing
and aggregating the features.
Effect of Sentence Thresholds
The rightmost column of Table 4 shows the results
of our GEAR frameworks with different sentence
selection thresholds. We choose the model with
threshold τ = 10^{-3}, which has the highest label
accuracy, as our final model. When the threshold
increases from 0 to 10^{-3}, the label accuracy
increases due to less noisy information.
ERNet Layers  Attention  Max    Mean
0             66.17      65.36  65.03
1             67.13      66.63  66.76
2             67.44      67.24  67.56
3             66.53      66.72  66.89
Table 5: Label accuracy on the difficult dev set with
different ERNet layers and evidence aggregators (%).
ERNet Layers  Attention  Max    Mean
0             77.12      76.95  76.30
1             77.74      77.66  77.62
2             77.82      77.66  77.73
3             77.70      77.55  77.60
Table 6: Label accuracy on the evidence-enhanced dev
set with different ERNet layers and evidence aggrega-
tors (%).
However, when the threshold increases from 10^{-3}
to 10^{-1}, the label accuracy decreases because
informative evidence is filtered out and the model
cannot obtain sufficient evidence to make the right
inference.
Effect of Sentence Embedding
The BERT model we used in the sentence encod-
ing step is fine-tuned on the FEVER dataset for
one epoch. We need to find out whether the fine-
tuning process or simply incorporating the sen-
tence embeddings from BERT makes the major
contribution to the final result. We conduct an ex-
periment using a BERT model without the fine-
tuning process and we find the final dev label ac-
curacy is close to the result from a random guess.
Therefore, the fine-tuning process rather than sen-
tence embeddings plays an important role in this
task. We need the fine-tuning process to capture
the semantic and logical relations between evi-
dence and the claim. Sentence embeddings are
more general and cannot perform well in this spe-
cific task. Hence, we cannot simply replace the
sentence embeddings used here with those from
other methods (e.g., ELMo, CNN).
Effectiveness of ERNet
In our observation, more than half of the claims in
the dev dataset only need one piece of evidence to
make the right inference. To verify the effective-
ness of our framework on reasoning over multiple
pieces of evidence, we build a difficult dev subset
by selecting samples from the original dev set.
Model         Dev LA  Dev FEVER  Test LA  Test FEVER
Athene        68.49   64.74      65.46    61.58
UCL MRG       69.66   65.41      67.62    62.52
UNC NLP       69.72   66.49      68.21    64.21
BERT Pair     73.30   68.90      69.75    65.18
BERT Concat   73.67   68.89      71.01    65.64
Our pipeline  74.84   70.69      71.60    67.10
Table 7: Evaluations of the full pipeline. The results of
our pipeline are chosen from the model which has the
highest dev FEVER score (%).
For SUPPORTED and REFUTED classes, claims
which can be fully supported by only one piece of
evidence are filtered out. All of the NEI claims
are selected because the model needs all of the re-
trieved evidence to conclude that there is “NOT
ENOUGH INFO”. The difficult subset contains
7,870 samples, accounting for more than 39% of
the dev set.
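A sketch of this filtering rule, assuming the FEVER annotation format in which each verifiable claim carries one or more gold evidence sets:

```python
def build_difficult_subset(dev_instances):
    """Keep NEI claims, and keep SUPPORTED/REFUTED claims only if every
    annotated evidence set contains more than one sentence, i.e. no single
    sentence fully verifies the claim."""
    difficult = []
    for inst in dev_instances:
        if inst["label"] == "NOT ENOUGH INFO":
            difficult.append(inst)
        elif all(len(evidence_set) > 1 for evidence_set in inst["evidence_sets"]):
            difficult.append(inst)
    return difficult
```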
We test our final model on the difficult sub-
set and present the results in Table 5. We find
our models with ERNet perform better than mod-
els without ERNet and the minimal improvement
between them is 1.27%. We can also discover
from the table that models with 2 ERNet lay-
ers achieve the best results, which indicates that
claims from the difficult subset require multi-step
evidence propagation. This result demonstrates
the ability of our framework to deal with claims
which need multiple pieces of evidence.
Error Analysis
In this section, we examine the effect of errors
propagating from upstream components. We uti-
lize an evidence-enhanced dev subset, which as-
sumes all pieces of ground truth evidence are re-
trieved, to test the theoretical upper-bound score
of our GEAR framework.
In our analysis, the main errors of our frame-
work come from the upstream document retrieval
and sentence selection components, which cannot
extract sufficient evidence for inference. For exam-
ple, to verify the claim “Giada at Home was only
available on DVD”, we need the evidence “Giada
at Home is a television show and first aired on Oc-
tober 18, 2008, on the Food Network.” and “Food
Network is an American basic cable and satellite
television channel.”. However, the entity linking
Claim:
Al Jardine is an American rhythm guitarist.
Truth evidence:
{Al Jardine, 0}, {Al Jardine, 1}
Retrieved evidence:
{Al Jardine, 1}, {Al Jardine, 0}, {Al Jardine, 2}, {Al Jardine, 5}, {Jardine, 42}
Evidence:
(1) He is best known as the band’s rhythm guitarist, and for
occasionally singing lead vocals on singles such as “Help Me,
Rhonda” (1965), “Then I Kissed Her” (1965) and “Come Go
with Me” (1978).
(2) Alan Charles Jardine (born September 3, 1942) is an
American musician, singer and songwriter who co-founded
the Beach Boys.
(3) In 2010, Jardine released his debut solo studio album, A
Postcard from California.
(4) In 1988, Jardine was inducted into the Rock and Roll Hall
of Fame as a member of the Beach Boys.
(5) Ray Jardine American rock climber, lightweight back-
packer, inventor, author and global adventurer.
Label: SUPPORTED
Table 8: A case of the claim that requires integrating
multiple evidence to verify. The representation for ev-
idence {DocName, LineNum} means the evidence is
extracted from the document DocName and of which
the line number is LineNum.
method used in our document retrieval component
could not retrieve the “Food Network” document
by parsing the content of the claim alone. Thus
the claim verification component cannot make the
right inference with insufficient evidence.
To explore the effect of this issue, we test our
models on an evidence-enhanced dev set, in which
we add the ground truth evidence with relevance
score 1 into the evidence set before the sentence
threshold filter. It ensures that each claim in the
evidence-enhanced set is provided with the ground
truth evidence as well as the retrieved evidence.
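Concretely, this amounts to prepending the gold evidence with a relevance score of 1 before the threshold filter runs; a small sketch reusing the select_evidence helper assumed in Section 3.1:

```python
def enhance_with_gold(retrieved, gold_evidence, tau):
    """Give each gold evidence sentence a relevance score of 1 and add it to
    the retrieved set before applying the sentence threshold filter."""
    enhanced = [(sent, 1.0) for sent in gold_evidence] + list(retrieved)
    return select_evidence(enhanced, tau)  # helper sketched in Section 3.1
```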
The experimental results are shown in Table 6.
We can find that all scores in the table increase by
more than 1.4% compared to the original dev set
label accuracy in Table 7 because of the addition of
the ground truth evidence. Because of the assump-
tion of oracle upstream components, the results in
Table 6 indicate the theoretical upper bound label
accuracy of our framework.
The results show the challenges in the previous
evidence retrieval task, which could not be solved
by current models. Nie et al. (2019) propose a
two-hop evidence enhancement method which im-
proves their final FEVER score by 0.08%. As the
addition of the ground truth evidence leads to a
Figure 2: Attention map for the case in Table 8. The
first five rows indicate the attention weights from nodes
1 to 5 in the first ERNet layer and the last row shows
the attention weights from the attention aggregator.
more than 1.4% increase in our experiment, it is
worthwhile to design a better evidence retrieval
pipeline, which we leave for future work.
5.3 Full Pipeline
We present the evaluation of our full pipeline in
this section. Note that there is a gap between the
label accuracy and the final FEVER score due to
the completeness of the evidence set. We find that
a model which is good at predicting NEI instances
tends to obtain a higher FEVER score. Therefore, we
choose our final model based on the dev FEVER
score among all of our experiments. This model
contains one layer of ERNet and uses the attention
aggregator. The threshold of the sentence filter is
10^{-3}.
Table 7 presents the evaluations of the full
pipeline. We find that the test FEVER scores of the
BERT fine-tuning systems outperform those of the
other shared task models by nearly 1%. Furthermore, our full
pipeline outperforms the BERT-Concat baseline
by 1.46% and achieves significant improvements.
5.4 Case study
Table 8 shows an example in our experiments
which needs multiple pieces of evidence to make
the right inference. The ground truth evidence set
contains the sentences from the article “Al Jar-
dine” with line numbers 0 and 1. These two pieces
of evidence are also ranked at top two in our re-
trieved evidence set. To verify whether “Al Jar-
dine is an American rhythm guitarist”, our model
needs the evidence “He is best known as the band's
rhythm guitarist” as well as the evidence “Alan
Charles Jardine ... is an American musician”. We
plot the attention map from our final model with
one layer of ERNet and the attention aggregator
in Figure 2. We can find that all evidence nodes
tend to attend to the first and second evidence
nodes, which provide the most useful information
in this case. The attention weights on the other evi-
dence nodes are fairly low, which indicates that
our model has the ability to select useful informa-
tion from multiple pieces of evidence.
6 Conclusion
We propose a novel Graph-based Evidence Aggre-
gating and Reasoning (GEAR) framework on the
claim verification subtask of FEVER. The frame-
work utilizes the BERT sentence encoder, the evi-
dence reasoning network (ERNet) and an evidence
aggregator to encode, propagate and aggregate in-
formation from multiple pieces of evidence. The
framework is proven to be effective and our final
pipeline achieves significant improvements. In the
future, we would like to design a multi-step ev-
idence extractor and incorporate external knowl-
edge into our framework.
7 Acknowledgements
This work is supported by the National Key Re-
search and Development Program of China (No.
2018YFB1004503), the National Natural Sci-
ence Foundation of China (NSFC No.61772302,
61572273). This work is also supported by 2018
Tencent Marketing Solution Rhino-Bird Focused
Research Program No. 201808. Han is also sup-
ported by 2018 Tencent Rhino-Bird Elite Training
Program.
References
Gabor Angeli and Christopher D Manning. 2014. Nat-
uralli: Natural logic inference for common sense
reasoning. In Proceedings of EMNLP, pages 534–
545.
Samuel R Bowman, Gabor Angeli, Christopher Potts,
and Christopher D Manning. 2015. A large anno-
tated corpus for learning natural language inference.
In Proceedings of EMNLP, pages 632–642.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine
Bordes. 2017a. Reading Wikipedia to answer open-
domain questions. In Proceedings of ACL, pages
1870–1879.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui
Jiang, and Diana Inkpen. 2017b. Enhanced lstm for
natural language inference. In Proceedings of ACL,
pages 1657–1668.
Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui
Jiang, and Diana Inkpen. 2017c. Recurrent neural
network-based sentence encoder with gated atten-
tion for natural language inference. In Proceedings
of ACL, Workshop on Evaluating Vector Space Rep-
resentations for NLP, pages 36–40.
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc
Barrault, and Antoine Bordes. 2017. Supervised
learning of universal sentence representations from
natural language inference data. In Proceedings of
EMNLP, pages 670–680.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. Bert: Pre-training of deep
bidirectional transformers for language understand-
ing. In Proceedings of NAACL-HLT.
Matt Gardner, Joel Grus, Mark Neumann, Oyvind
Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Pe-
ters, Michael Schmitz, and Luke Zettlemoyer. 2018.
Allennlp: A deep semantic natural language pro-
cessing platform. In Proceedings of ACL, Workshop
for NLP Open Source Software, pages 1–6.
Reza Ghaeini, Sadid A Hasan, Vivek Datla, Joey
Liu, Kathy Lee, Ashequl Qadir, Yuan Ling, Aa-
ditya Prakash, Xiaoli Fern, and Oladimeji Farri.
2018. Dr-bilstm: Dependent reading bidirectional
lstm for natural language inference. In Proceedings
of NAACL-HLT, pages 1460–1469.
Yichen Gong, Heng Luo, and Jian Zhang. 2018. Nat-
ural language inference over interaction space. In
Proceedings of ICLR.
Andreas Hanselowski, Hao Zhang, Zile Li, Daniil
Sorokin, Benjamin Schiller, Claudia Schulz, and
Iryna Gurevych. 2018. Ukp-athene: Multi-sentence
textual entailment for claim verification. Proceed-
ings of EMNLP, pages 103–108.
Christopher Hidey and Mona Diab. 2018. Team
sweeper: Joint sentence extraction and fact check-
ing with pointer networks. In Proceedings of the
First Workshop on FEVER, pages 150–155.
Diederik P Kingma and Jimmy Ba. 2015. Adam: A
method for stochastic optimization. In Proceedings
of ICLR.
Thomas N Kipf and Max Welling. 2017. Semi-
supervised classification with graph convolutional
networks. In Proceedings of ICLR.
Jackson Luken, Nanjiang Jiang, and Marie-Catherine
de Marneffe. 2018. Qed: A fact verification system
for the fever shared task. In Proceedings of the First
Workshop on FEVER, pages 156–160.
Christopher Malon. 2018. Team papelo: Transformer
networks at fever. In Proceedings of the First Work-
shop on FEVER, pages 109–113.
Tsendsuren Munkhdalai and Hong Yu. 2017. Neural
tree indexers for text understanding. In Proceedings
of EACL, pages 11–21.
Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked
sentence encoders for multi-domain inference. In
Proceedings of ACL, Workshop on Evaluating Vec-
tor Space Representations for NLP, pages 41–45.
Yixin Nie, Haonan Chen, and Mohit Bansal. 2019.
Combining fact extraction and verification with neu-
ral semantic matching networks. In Proceedings of
AAAI.
Rasmus Palm, Ulrich Paquet, and Ole Winther. 2018.
Recurrent relational networks. In Proceedings of
NIPS, pages 3368–3378.
Ankur Parikh, Oscar Täckström, Dipanjan Das, and
Jakob Uszkoreit. 2016. A decomposable attention
model for natural language inference. In Proceed-
ings of EMNLP, pages 2249–2255.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word rep-
resentations. In Proceedings of NAACL-HLT, pages
2227–2237.
Alec Radford, Karthik Narasimhan, Tim Salimans, and
Ilya Sutskever. 2018. Improving language under-
standing by generative pre-training. In Proceedings
of Technical report, OpenAI.
Lei Sha, Baobao Chang, Zhifang Sui, and Sujian Li.
2016. Reading and thinking: Re-read lstm unit for
textual entailment recognition. In Proceedings of
COLING, pages 2870–2879.
Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018.
Compare, compress and propagate: Enhancing neu-
ral architectures with alignment factorization for
natural language inference. In Proceedings of
EMNLP, pages 1565–1575.
James Thorne, Andreas Vlachos, Christos
Christodoulopoulos, and Arpit Mittal. 2018a.
Fever: a large-scale dataset for fact extraction and
verification. In Proceedings of NAACL-HLT, pages
809–819.
James Thorne, Andreas Vlachos, Oana Cocarascu,
Christos Christodoulopoulos, and Arpit Mittal.
2018b. The Fact Extraction and VERification
(FEVER) shared task. In Proceedings of the First
Workshop on FEVER, pages 1–9.
Petar Velickovic, Guillem Cucurull, Arantxa Casanova,
Adriana Romero, Pietro Lio, and Yoshua Bengio.
2018. Graph attention networks. In Proceedings of
ICLR.
Quan Wang, Zhendong Mao, Bin Wang, and Li Guo.
2017. Knowledge graph embedding: A survey
of approaches and applications. IEEE TKDE,
29(12):2724–2743.
Adina Williams, Nikita Nangia, and Samuel Bowman.
2018. A broad-coverage challenge corpus for sen-
tence understanding through inference. In Proceed-
ings of NAACL-HLT, pages 1112–1122.
Wenpeng Yin and Dan Roth. 2018. Twowingos: A
two-wing optimization strategy for evidential claim
verification. In Proceedings of EMNLP, pages 105–
114.
Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus
Stenetorp, and Sebastian Riedel. 2018. Ucl machine
reading group: Four factor framework for fact find-
ing (hexaf). In Proceedings of the First Workshop
on FEVER, pages 97–102.
| 1/10

Preview text:

GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification Jie Zhou1 2 , ,3, Xu Han1 2 , ,3, Cheng Yang1 2 , ,3, Zhiyuan Liu1 2 3 , , †
Lifeng Wang4, Changcheng Li4, Maosong Sun1 2 , ,3
1Department of Computer Science and Technology, Tsinghua University, Beijing, China
2Institute for Artificial Intelligence, Tsinghua University, Beijing, China
3State Key Lab on Intelligent Technology and Systems, Tsinghua University, Beijing, China
4Tencent Marketing Solution, Tencent, Shenzhen, China
{zhoujie18, hanxu17, cheng-ya14}@mails.tsinghua.edu.cn
{fandywang, harrychli}@tencent.com, {liuzy, sms}@tsinghua.edu.cn Abstract “SUPPORTED” Example Claim
The Rodney King riots took place in the most populous
Fact verification (FV) is a challenging task county in the USA.
which requires to retrieve relevant evidence
(1) The 1992 Los Angeles riots, also known as the Evidence
Rodney King riots were a series of riots, lootings, ar-
from plain text and use the evidence to ver-
sons, and civil disturbances that occurred in Los An-
ify given claims. Many claims require to si-
geles County, California in April and May 1992.
(2) Los Angeles County, officially the County of Los
multaneously integrate and reason over several
Angeles, is the most populous county in the USA.
pieces of evidence for verification. However, “REFUTED” Example
previous work employs simple models to ex- Claim
Giada at Home was only available on DVD.
tract information from evidence without let-
(1) Giada at Home is a television show and first aired
ting evidence communicate with each other, Evidence
on October 18, 2008, on the Food Network.
e.g., merely concatenate the evidence for pro-
(2) Food Network is an American basic cable and satellite television channel.
cessing. Therefore, these methods are unable
to grasp sufficient relational and logical infor-
Table 1: Some examples of reasoning over several
mation among the evidence. To alleviate this
pieces of evidence together for verification. The italic
issue, we propose a graph-based evidence ag-
words are the key information to verify the claim. Both
gregating and reasoning (GEAR) framework
of the claims require to reason and aggregate multiple
which enables information to transfer on a
evidence sentences for verification.
fully-connected evidence graph and then uti-
lizes different aggregators to collect multi- evidence information. We further employ
More specifically, given a claim, an FV system is
BERT, an effective pre-trained language repre-
asked to label it as “SUPPORTED”, “REFUTED”,
sentation model, to improve the performance.
Experimental results on a large-scale bench-
or “NOT ENOUGH INFO”, which indicate that
mark dataset FEVER have demonstrated that
the evidence can support, refute, or is not suffi-
GEAR could leverage multi-evidence infor- cient for the claim.
mation for FV and thus achieves the promis-
Existing FV methods formulate FV as a natural
ing result with a test FEVER score of 67.10%.
language inference (NLI) (Angeli and Manning,
Our code is available at https://github.
2014) task. However, they utilize simple evidence com/thunlp/GEAR.
combination methods such as concatenating the 1 Introduction
evidence or just dealing with each evidence-claim
pair. These methods are unable to grasp sufficient
Due to the rapid development of information ex-
relational and logical information among the ev-
traction (IE), huge volumes of data have been
idence. In fact, many claims require to simulta- extracted.
How to automatically verify the
neously integrate and reason over several pieces
data becomes a vital problem for various data- of evidence for verification. As shown in Ta-
driven applications, e.g., knowledge graph com-
ble 1, for both of the “SUPPORTED” example and
pletion (Wang et al., 2017) and open domain
“REFUTED” example, we cannot verify the given
question answering (Chen et al., 2017a). Hence,
claims via checking any evidence in isolation. The
many recent research efforts have been devoted to
claims can be verified only by understanding and
fact verification (FV), which aims to verify given
reasoning over the multiple evidence.
claims with the evidence retrieved from plain text.
To integrate and reason over information from
† Corresponding author: Z.Liu(liuzy@tsinghua.edu.cn)
multiple pieces of evidence, we propose a
graph-based evidence aggregating and reasoning
NLI predictions for final verification. Then,
(GEAR) framework. Specifically, we first build a
Hanselowski et al. (2018); Yoneda et al. (2018);
fully-connected evidence graph and encourage in-
Hidey and Diab (2018) adopt the enhanced se-
formation propagation among the evidence. Then,
quential inference model (ESIM) (Chen et al.,
we aggregate the pieces of evidence and adopt a
2017b), a more effective NLI model, to infer the
classifier to decide whether the evidence can sup-
relevance between evidence and claims instead
port, refute, or is not sufficient for the claim. In-
of DAM. As pre-trained language models have
tuitively, by sufficiently exchanging and reason-
achieved great results on various NLP applica-
ing over evidence information on the evidence
tions, Malon (2018) fine-tunes the generative pre-
graph, the proposed model can make the best of
training transformer (GPT) (Radford et al., 2018)
the information for verifying claims. For exam-
for FV. Based on the methods mentioned above,
ple, by delivering the information “Los Angeles
Nie et al. (2019) specially design the neural se-
County is the most populous county in the USA”
mantic matching network (NSMN), which is a
to “the Rodney King riots occurred in Los Ange-
modification of ESIM and achieves the best results
les County” through the evidence graph, the syn-
in the competition. Unlike these methods, Yin
thetic information can support “The Rodney King
and Roth (2018) propose the TWOWINGOS sys-
riots took place in the most populous county in
tem which trains the evidence identification and
the USA”. Furthermore, we adopt an effective pre-
claim verification modules jointly.
trained language representation model BERT (De-
vlin et al., 2019) to better grasp both evidence and 2.2 Natural Language Inference claim semantics.
The natural language inference (NLI) task requires
We conduct experiments on the large-scale
a system to label the relationship between a pair of
benchmark dataset for Fact Extraction and VER-
premise and hypothesis as entailment, contradic-
ification (FEVER) (Thorne et al., 2018a). Ex-
tion or neutral. Several large-scale datasets have
perimental results show that the proposed frame-
been proposed to promote the research in this di-
work outperforms recent state-of-the-art baseline
rection, such as SNLI (Bowman et al., 2015) and
systems. The further case study indicates that our
Multi-NLI (Williams et al., 2018). These datasets
framework could better leverage multi-evidence
have made it feasible to train complicated neural
information and reason over the evidence for FV.
models which have achieved the state-of-the-art
results (Bowman et al., 2015; Parikh et al., 2016; 2 Related Work
Sha et al., 2016; Chen et al., 2017b,c; Munkhdalai
and Yu, 2017; Nie and Bansal, 2017; Conneau 2.1 FEVER Shared Task
et al., 2017; Gong et al., 2018; Tay et al., 2018;
The FEVER shared task (Thorne et al., 2018b)
Ghaeini et al., 2018). It is intuitive to transfer
challenges participants to develop automatic fact
NLI models into the claim verification stage of
verification systems to check the veracity of
the FEVER task and several teams from the shared
human-generated claims by extracting evidence
task have achieved promising results by this way. from Wikipedia. The shared task is hosted as
a competition on Codalab1 with a blind test 2.3 Pre-trained Language Models
set. Nie et al. (2019); Yoneda et al. (2018) and
Pre-trained language representation models such
Hanselowski et al. (2018) have achieved the top
as ELMo (Peters et al., 2018) and OpenAI three results among 23 teams.
GPT (Radford et al., 2018) are proven to be ef-
Existing methods mainly formulate FV as an
fective on many NLP tasks. BERT (Devlin et al.,
NLI task. Thorne et al. (2018a) simply concate-
2019) employs bidirectional transformer and well-
nate all evidence together, and then feed the con-
designed pre-training tasks to fuse bidirectional
catenated evidence and the given claim into the
context information and obtains the state-of-the-
NLI model. Luken et al. (2018) adopt the de-
art results on the NLI task. In our experiments, we
composable attention model (DAM) (Parikh et al.,
find the fine-tuned BERT model outperforms other
2016) to generate NLI predictions for each claim-
NLI-based models on the claim verification sub-
evidence pair individually and then aggregate all
task of FEVER. Hence, we use BERT as the sen- 1
tence encoder in our framework to better encoding
https://competitions.codalab.org/ competitions/18814
semantic information of evidence and claims. Evidence Filter Sentence Encoding Evidence Reasoning Aggregation Claim Evidence Score Claim 0.3620 Evi 1 Claim Attention 0.1241 0.0237 Evi 2 Claim BERT 0.0014 Max Evi 3 Claim Threshold 0.001 Evi 4 Claim Mean … 0.0009 Layer 0 É Layer T 0.0003 ERNet Document Retrieval Sentence Selection Claim Verification (GEAR)
Figure 1: The pipeline of our method. The GEAR framework is illustrated in the claim verification section. 3 Method In the
We employ a three-step pipeline with compo-
nents for document retrieval, sentence selection max(0, 1+ sn−sp)
and claim verification to solve the task. In the doc- s
ument retrieval and sentence selection stages, we
simply follow the method from Hanselowski et al.
(2018) since their method has the highest score on
evidence recall in the former FEVER shared task.
And we propose our Graph-based Evidence Ag-
gregating and Reasoning (GEAR) framework in
In addition to the original model (Hanselowski
the final claim verification stage. The full pipeline
et al., 2018), we add a relevance score filter with
of our method is illustrated in Figure 1.
a threshold τ . Sentences with relevance scores 3.1
Document Retrieval and Sentence
lower than τ are filtered out to alleviate the noises.
Thus the final size of the retrieved evidence set is Selection
equal to or less than 5. We choose different val-
In this section, we describe our document retrieval
ues of τ and select the value based on the dev
and sentence selection components. Additionally,
set result. The evaluation results of the document
we add a threshold filter after the sentence selec-
retrieval and sentence selection components are
tion component to filter out those noisy evidence. shown in Section 5.1.
In the document retrieval step, we adopt the
entity linking approach from Hanselowski et al. 3.2 Claim Verification with GEAR (2018).
In this section, we describe our GEAR framework
for claim verification. As shown in Figure 1, given
a claim and the retrieved evidence, The seven highest-ranked
results for each query are stored to form a candi-
date article set. Finally, the method drops the ar-
ticles which are not in the offline Wikipedia dump
and filters the articles by the word overlap between their titles and the claim.
The sentence selection component selects the Sentence Encoder
most relevant evidence for the claim from all sen-
Given an input sentence, we employ BERT (De-
tences in the retrieved documents.
vlin et al., 2019) as our sentence encoder by ex-
tracting the final hidden state of the [CLS] token 2
as the representation, where [CLS] is the special
https://www.mediawiki.org/wiki/API: Main_page
classification embedding in BERT. Evidence Aggregator
We employ an evidence aggregator to gather infor-
mation from different evidence nodes and obtain
Note that we concatenate the evidence and the
the final hidden state o ∈ RF ×1. The aggregator
claim to extract the evidence representation be-
may utilize different aggregating strategies and we
cause the evidence nodes in the reasoning graph
suggest three aggregators in our framework:
need the information from the claim to guide the
message passing process among them. Evidence Reasoning Network
To encourage the information propagation among
evidence, we build a fully-connected evidence
graph where each node indicates a piece of evi-
dence. We also add self-loop to every node be-
cause each node needs the information from it-
self in the message propagation process. We use =
ht = {ht1, ht2, ..., ht } to represent the hidden N
states of nodes at layer t, where ht ∈ RF ×1 and F i
is the number of features in each node. The initial The max aggregator per-
hidden state of each evidence node h0i is initialized
forms the element-wise Max operation among hid-
by the evidence presentation: h0 = e i i. den states.
Inspired by recent work on semi-supervised
graph learning and relational reasoning (Kipf and o = Max(hT1 , hT2 , ..., hT). N (6)
Welling, 2017; Velickovic et al., 2018; Palm et al.,
2018), we propose an evidence reasoning network The mean aggregator per-
(ERNet) to propagate information among the ev-
forms the element-wise Mean operation among idence nodes. hidden states.
o = Mean(hT1 , hT2 , ..., hT ). N (7)
Once the final state o is obtained, we employ a
one-layer MLP to get the final prediction l. l = softmax(ReLU(Wo+ b)), (8)
where W ∈ RC×F and b ∈ RC×1 are parame-
ters, and C is the number of prediction labels. 4 Experimental Settings 4.1 Dataset
We conduct our experiments on the large-scale
dataset FEVER (Thorne et al., 2018a). The dataset
consists of 185,455 annotated claims with a set
of 5,416,537 Wikipedia documents from the June
2017 Wikipedia dump. We follow the dataset par- X
tition from the FEVER Shared Task (Thorne et al.,
2018b). Table 2 shows the statistics of the dataset. Split SUPPORTED REFUTED NEI
with the highest scores. In the test phase, we con- Train 80,035 29,775 35,639
catenate the retrieved evidence for predicting. Dev 6,666 6,666 6,666
BERT-Pair. In the BERT-Pair system, we uti- Test 6,666 6,666 6,666
lize BERT to predict the label for each evidence- claim pair.
Concretely, we use each evidence-
Table 2: Statistics of FEVER dataset.
claim pair as the input and the label of the claim
as the prediction target. In the training phase, we
select the ground truth evidence for SUPPORTED 4.2 Baselines
and REFUTED claims and the retrieved evidence
In this section, we describe the baseline systems in
for NEI claims. In the test phase, we predict labels
our experiments. We first introduce the top-3 sys-
for all retrieved evidence-claim pairs. Because dif-
tems from the FEVER shared task. As BERT (De-
ferent evidence-claim pairs may have inconsistent
vlin et al., 2019) has achieved promising perfor-
predicted labels, we then utilize an aggregator to
mance on several NLP tasks, we also implement
obtain the final claim label. We find the aggre-
two baseline systems via fine-tuning BERT in the
gator only returning the predicted label from the claim verification task.
most relevant evidence has the best performance. Shared Task Systems 4.3 Hyperparameter Settings
We choose the top-3 models from the FEVER
We utilize BERTBASE (Devlin et al., 2019) in all shared task as our baselines.
of the BERT fine-tuning baselines and our GEAR
The Athene UKP TU Darmstadt team (Athene)
framework. The learning rate is 2e-5.
(Hanselowski et al., 2018) combines five inference
For BERT-Concat, the maximum sequence
vectors from the ESIM model via attention mech-
length is 256 and the batch size is 16. We limit
anism to make the final prediction.
the max length for concatenated evidence to 240
The UCL Machine Reading Group (UCL
and the max length for claims to 16. We train this
MRG) (Yoneda et al., 2018) predicts the label of
model for two epochs based on dev results. For
each evidence-claim pair and aggregates the re-
BERT-Pair, we set the maximum sequence length
sults via a label aggregation component.
to 128 and batch size to 32. We train this model for
The UNC NLP team (Nie et al., 2019) proposes
one epoch. As for the GEAR framework, we use
the neural semantic matching network and uses the
the fine-tuned BERT-Pair model to extract features
model jointly to solve all three subtasks. They and the batch size is 512.
also incorporate additional information such as
In our ERNet, we set the batch size to 256,
pageview frequency and WordNet features. They
the number of features F to 768 and the dimen-
have achieved best results in the competition.
sion of weight matrices H to 64. The model is
trained to minimize the negative log likelihood BERT Fine-tuning Systems
loss on the predicted label using the Adam opti-
We implement two BERT fine-tuning systems
mizer (Kingma and Ba, 2015) with an initial learn-
with different evidence combination approaches.
ing rate of 5e-3 and L2 weight decay of 5e-4. We
The BERT-Concat system concatenates all evi-
use an early stopping strategy on the label accu-
dence into a single string while the BERT-Pair
racy of the validation set, with a patience of 20
system encodes each evidence-claim pair indepen-
epochs. We attempt to stack 0-3 ERNet layers and
dently and then aggregates the results. Both sys-
analyze the effect of different layer numbers.
tems share the same document retrieval and sen- 4.4 Evaluation Metrics
tence selection components proposed by us.
BERT-Concat. In the BERT-Concat system,
Besides traditional evaluation metrics such as la-
we simply concatenate all evidence into a single
bel accuracy and F1, we use other two metrics to
sentence and utilize BERT to predict the relation evaluate our model.
between the concatenated evidence and the claim.
FEVER score. The FEVER score is the la-
In the training phase, we add the ground truth ev-
bel accuracy conditioned on providing at least one
idence into the retrieved evidence set with rele-
complete set of evidence. Claims labeled as “NEI”
vance score 1 and select five pieces of evidence do not need the evidence. Model OFEVER
OFEVER score drop gradually while the precision Athene 93.55
and F1 score increase. The results are consistent UCL MRG -
with our intuition. If we do not filter out evidence, UNC NLP 92.82
more claims could be provided with the full evi- Our Model 93.33
dence set. If we increase the value of the thresh-
old, more pieces of noisy evidence are filtered out,
Table 3: Document retrieval evaluation on dev set (%).
which contributes to the increase of precision and
(’-’ denotes a missing value) F1. τ OFEVER Precision Recall F1 GEAR LA 5.2 Claim Verification with GEAR 0 91.10 24.08 86.72 37.69 74.84 10−4
5.2 Claim Verification with GEAR

In this section, we evaluate our GEAR framework in different aspects. We first compare the label accuracy scores between our framework and the baseline systems. Then we explore the effect of different thresholds from the upstream sentence filter. We also conduct additional experiments to check the effect of sentence embeddings. As nearly 39% of the claims require reasoning over multiple pieces of evidence, we construct a difficult dev subset and check the effectiveness of our ERNet for evidence reasoning. Finally, we make an error analysis and provide the theoretical upper-bound label accuracy of our framework.

For all of the experiments with GEAR, the scores (label accuracy, FEVER score) we report on the dev set are mean values over 10 runs initialized with different random seeds.

Model Evaluation

We use the label accuracy metric to evaluate the effectiveness of different claim verification models. The second column of Table 7 shows the label accuracy of different models on the dev set. We find the BERT fine-tuning models outperform all of the models from the shared task, which shows the strong capacity of BERT in representation learning and semantic understanding. The BERT-Concat model has a slight improvement of 0.37% over BERT-Pair.

Our final model outperforms the best BERT-Concat baseline by 1.17%. As our framework provides a better way of aggregating and reasoning over evidence, the improvement demonstrates that our framework has a better ability to integrate features from different pieces of evidence by propagating, analyzing and aggregating them.

Effect of Sentence Thresholds

The rightmost column of Table 4 shows the results of our GEAR framework with different sentence selection thresholds. We choose the model with threshold τ = 10^-3, which has the highest label accuracy, as our final model. When the threshold increases from 0 to 10^-3, the label accuracy increases due to less noisy information. However, when the threshold increases from 10^-3 to 10^-1, the label accuracy decreases because informative evidence is filtered out and the model cannot obtain sufficient evidence to make the right inference.

Effect of Sentence Embedding

The BERT model we use in the sentence encoding step is fine-tuned on the FEVER dataset for one epoch. We need to find out whether the fine-tuning process or simply incorporating the sentence embeddings from BERT makes the major contribution to the final result. We conduct an experiment using a BERT model without the fine-tuning process and find that the final dev label accuracy is close to that of a random guess. Therefore, the fine-tuning process, rather than the sentence embeddings themselves, plays the important role in this task: we need fine-tuning to capture the semantic and logical relations between the evidence and the claim. Sentence embeddings are more general and cannot perform well in this specific task, so we cannot simply replace the sentence embeddings used here with those from other methods (e.g., ELMo, CNN).

Effectiveness of ERNet

In our observation, more than half of the claims in the dev dataset only need one piece of evidence to make the right inference. To verify the effectiveness of our framework in reasoning over multiple pieces of evidence, we build a difficult dev subset by selecting samples from the original dev set. For the SUPPORTED and REFUTED classes, claims which can be fully supported by only one piece of evidence are filtered out. All of the NEI claims are selected because the model needs all of the retrieved evidence to conclude that there is "NOT ENOUGH INFO". The difficult subset contains 7870 samples, which covers more than 39% of the dev set.
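The construction of the difficult subset follows directly from the description above. The following is a minimal sketch under the assumption that each dev example carries its gold label and its gold evidence groups (sets of sentences that each fully verify the claim); it is our own illustration, not the script used for the paper.

```python
def build_difficult_subset(dev_examples):
    """Keep claims that cannot be verified by a single evidence sentence.

    SUPPORTED/REFUTED claims are kept only if every complete gold evidence
    group contains more than one sentence; all NEI claims are kept, since
    deciding "NOT ENOUGH INFO" requires looking at all retrieved evidence.
    """
    difficult = []
    for ex in dev_examples:
        if ex["label"] == "NOT ENOUGH INFO":
            difficult.append(ex)
        elif all(len(group) > 1 for group in ex["evidence_groups"]):
            difficult.append(ex)
    return difficult
```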
We test our final model on the difficult subset and present the results in Table 5. We find that our models with ERNet perform better than the models without ERNet, and the minimal improvement between them is 1.27%. We can also see from the table that models with 2 ERNet layers achieve the best results, which indicates that claims from the difficult subset require multi-step evidence propagation. This result demonstrates the ability of our framework to deal with claims that need multiple pieces of evidence.

Table 5: Label accuracy on the difficult dev set with different ERNet layers and evidence aggregators (%).

ERNet Layers | Attention | Max | Mean
0 | 66.17 | 65.36 | 65.03
1 | 67.13 | 66.63 | 66.76
2 | 67.44 | 67.24 | 67.56
3 | 66.53 | 66.72 | 66.89
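Tables 5 and 6 compare three ways of collapsing the per-node evidence features into a single vector. The sketch below is a generic reading of what attention, max and mean aggregation over evidence nodes can look like; it is not taken from the paper's method section, and the exact parameterization used in GEAR may differ (for instance, its attention aggregator may also condition on the claim representation).

```python
import torch
from torch import nn


class EvidenceAggregator(nn.Module):
    """Collapse per-evidence-node features (N x F) into one vector (F)."""

    def __init__(self, num_features=768, mode="attention"):
        super().__init__()
        self.mode = mode
        if mode == "attention":
            # One-layer scorer producing an attention weight per node.
            self.scorer = nn.Sequential(
                nn.Linear(num_features, num_features), nn.Tanh(),
                nn.Linear(num_features, 1))

    def forward(self, node_states):          # node_states: (N, F)
        if self.mode == "max":
            return node_states.max(dim=0).values
        if self.mode == "mean":
            return node_states.mean(dim=0)
        weights = torch.softmax(self.scorer(node_states), dim=0)  # (N, 1)
        return (weights * node_states).sum(dim=0)


states = torch.randn(5, 768)                  # five evidence nodes
print(EvidenceAggregator(mode="max")(states).shape)
```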
Error Analysis

In this section, we examine the effect of errors propagating from the upstream components. We utilize an evidence-enhanced dev subset, which assumes all pieces of ground truth evidence are retrieved, to test the theoretical upper-bound score of our GEAR framework.

In our analysis, the main errors of our framework come from the upstream document retrieval and sentence selection components, which cannot always extract sufficient evidence for inference. For example, to verify the claim "Giada at Home was only available on DVD", we need the evidence "Giada at Home is a television show and first aired on October 18, 2008, on the Food Network." and "Food Network is an American basic cable and satellite television channel.". However, the entity linking method used in our document retrieval component could not retrieve the "Food Network" document from parsing the content of the claim alone. Thus the claim verification component cannot make the right inference with insufficient evidence.

To explore the effect of this issue, we test our models on an evidence-enhanced dev set, in which we add the ground truth evidence with relevance score 1 into the evidence set before the sentence threshold filter. This ensures that each claim in the evidence-enhanced set is provided with the ground truth evidence as well as the retrieved evidence.
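Concretely, the evidence-enhanced set only changes the input to the sentence filter; a minimal sketch, using the same hypothetical evidence-tuple format as in the earlier filtering sketch, could look like the following. The threshold filter is then applied to the enhanced list exactly as before.

```python
def build_evidence_enhanced_set(retrieved, gold_evidence):
    """Insert every gold evidence sentence with relevance score 1 before
    the threshold filter, so no claim lacks its ground truth evidence."""
    enhanced = list(retrieved)
    already = {(doc, line) for doc, line, _, _ in retrieved}
    for doc, line, sentence in gold_evidence:
        if (doc, line) not in already:
            enhanced.append((doc, line, sentence, 1.0))
    return enhanced
```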
The experimental results are shown in Table 6. We find that all scores in the table increase by more than 1.4% compared to the corresponding dev label accuracy in Table 7, owing to the addition of the ground truth evidence. Because of the assumption of oracle upstream components, the results in Table 6 indicate the theoretical upper-bound label accuracy of our framework.

Table 6: Label accuracy on the evidence-enhanced dev set with different ERNet layers and evidence aggregators (%).

ERNet Layers | Attention | Max | Mean
0 | 77.12 | 76.95 | 76.30
1 | 77.74 | 77.66 | 77.62
2 | 77.82 | 77.66 | 77.73
3 | 77.70 | 77.55 | 77.60

The results show the remaining challenges in the evidence retrieval task, which current models cannot solve. Nie et al. (2019) propose a two-hop evidence enhancement method which improves their final FEVER score by 0.08%. As the addition of the ground truth evidence leads to an increase of more than 1.4% in our experiments, it is worthwhile to design a better evidence retrieval pipeline, which remains part of our future work.
5.3 Full Pipeline

We present the evaluation of our full pipeline in this section. Note that there is a gap between the label accuracy and the final FEVER score due to the completeness of the evidence set. We find that a model which is good at predicting NEI instances tends to obtain a higher FEVER score, so we choose our final model based on the dev FEVER score among all of our experiments. This model contains one layer of ERNet and uses the attention aggregator; the threshold of the sentence filter is 10^-3.

Table 7 presents the evaluations of the full pipeline. We find that the test FEVER scores of the BERT fine-tuning systems outperform the other shared task models by nearly 1%. Furthermore, our full pipeline outperforms the BERT-Concat baseline by 1.46% and achieves significant improvements.

Table 7: Evaluations of the full pipeline. The results of our pipeline are chosen from the model with the highest dev FEVER score (%).

Model | Dev LA | Dev FEVER | Test LA | Test FEVER
Athene | 68.49 | 64.74 | 65.46 | 61.58
UCL MRG | 69.66 | 65.41 | 67.62 | 62.52
UNC NLP | 69.72 | 66.49 | 68.21 | 64.21
BERT Pair | 73.30 | 68.90 | 69.75 | 65.18
BERT Concat | 73.67 | 68.89 | 71.01 | 65.64
Our pipeline | 74.84 | 70.69 | 71.60 | 67.10
5.4 Case Study

Table 8 shows an example from our experiments which needs multiple pieces of evidence to make the right inference. The ground truth evidence set contains the sentences from the article "Al Jardine" with line numbers 0 and 1, and these two pieces of evidence are also ranked in the top two of our retrieved evidence set. To verify whether "Al Jardine is an American rhythm guitarist", our model needs the evidence "He is best known as the band's rhythm guitarist" as well as the evidence "Alan Charles Jardine ... is an American musician". We plot the attention map from our final model, with one layer of ERNet and the attention aggregator, in Figure 2. We find that all evidence nodes tend to attend to the first and the second evidence nodes, which provide the most useful information in this case. The attention weights on the other evidence nodes are quite low, which indicates that our model is able to select useful information from multiple pieces of evidence.

Table 8: A case of a claim that requires integrating multiple pieces of evidence to verify. The representation "{DocName, LineNum}" means the evidence is extracted from the document "DocName" at line number LineNum.

Claim: Al Jardine is an American rhythm guitarist.
Truth evidence: {Al Jardine, 0}, {Al Jardine, 1}
Retrieved evidence: {Al Jardine, 1}, {Al Jardine, 0}, {Al Jardine, 2}, {Al Jardine, 5}, {Jardine, 42}
Evidence:
(1) He is best known as the band's rhythm guitarist, and for occasionally singing lead vocals on singles such as "Help Me, Rhonda" (1965), "Then I Kissed Her" (1965) and "Come Go with Me" (1978).
(2) Alan Charles Jardine (born September 3, 1942) is an American musician, singer and songwriter who co-founded the Beach Boys.
(3) In 2010, Jardine released his debut solo studio album, A Postcard from California.
(4) In 1988, Jardine was inducted into the Rock and Roll Hall of Fame as a member of the Beach Boys.
(5) Ray Jardine, American rock climber, lightweight backpacker, inventor, author and global adventurer.
Label: SUPPORTED

Figure 2: Attention map for the case in Table 8. The first five rows indicate the attention weights from nodes 1 to 5 in the first ERNet layer and the last row shows the attention weights from the attention aggregator.
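Figure 2 is a 6 x 5 heat map: one row of attention weights per evidence node in the first ERNet layer, plus a final row for the attention aggregator. Assuming the weights have been collected into NumPy arrays, a plot of this kind can be reproduced with a few lines of matplotlib; the sketch below uses random placeholder weights and is our own plotting example, not the paper's script.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder weights: rows 1-5 are per-node attention in the first
# ERNet layer, the last row is the attention aggregator's weights.
layer_attention = np.random.dirichlet(np.ones(5), size=5)   # shape (5, 5)
aggregator_attention = np.random.dirichlet(np.ones(5))      # shape (5,)
attention_map = np.vstack([layer_attention, aggregator_attention])

fig, ax = plt.subplots()
ax.imshow(attention_map, cmap="Blues")
ax.set_xticks(range(5))
ax.set_xticklabels([f"node {i + 1}" for i in range(5)])
ax.set_yticks(range(6))
ax.set_yticklabels([f"node {i + 1}" for i in range(5)] + ["aggregator"])
ax.set_xlabel("attended evidence node")
plt.show()
```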
6 Conclusion

We propose a novel Graph-based Evidence Aggregating and Reasoning (GEAR) framework for the claim verification subtask of FEVER. The framework utilizes the BERT sentence encoder, the evidence reasoning network (ERNet) and an evidence aggregator to encode, propagate and aggregate information from multiple pieces of evidence. The framework is proven to be effective and our final pipeline achieves significant improvements. In the future, we would like to design a multi-step evidence extractor and incorporate external knowledge into our framework.

7 Acknowledgements

This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004503) and the National Natural Science Foundation of China (NSFC No. 61772302, 61572273). This work is also supported by the 2018 Tencent Marketing Solution Rhino-Bird Focused Research Program (No. 201808). Han is also supported by the 2018 Tencent Rhino-Bird Elite Training Program.

References

Gabor Angeli and Christopher D. Manning. 2014. NaturalLI: Natural logic inference for common sense reasoning. In Proceedings of EMNLP, pages 534–545.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of EMNLP, pages 632–642.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017a. Reading Wikipedia to answer open-domain questions. In Proceedings of ACL, pages 1870–1879.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Enhanced LSTM for natural language inference. In Proceedings of ACL, pages 1657–1668.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017c. Recurrent neural network-based sentence encoder with gated attention for natural language inference. In Proceedings of ACL, Workshop on Evaluating Vector Space Representations for NLP, pages 36–40.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of EMNLP, pages 670–680.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of ACL, Workshop for NLP Open Source Software, pages 1–6.

Reza Ghaeini, Sadid A. Hasan, Vivek Datla, Joey Liu, Kathy Lee, Ashequl Qadir, Yuan Ling, Aaditya Prakash, Xiaoli Fern, and Oladimeji Farri. 2018. DR-BiLSTM: Dependent reading bidirectional LSTM for natural language inference. In Proceedings of NAACL-HLT, pages 1460–1469.

Yichen Gong, Heng Luo, and Jian Zhang. 2018. Natural language inference over interaction space. In Proceedings of ICLR.

Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018. UKP-Athene: Multi-sentence textual entailment for claim verification. In Proceedings of EMNLP, pages 103–108.

Christopher Hidey and Mona Diab. 2018. Team SWEEPer: Joint sentence extraction and fact checking with pointer networks. In Proceedings of the First Workshop on FEVER, pages 150–155.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of ICLR.
Jackson Luken, Nanjiang Jiang, and Marie-Catherine de Marneffe. 2018. QED: A fact verification system for the FEVER shared task. In Proceedings of the First Workshop on FEVER, pages 156–160.

Christopher Malon. 2018. Team Papelo: Transformer networks at FEVER. In Proceedings of the First Workshop on FEVER, pages 109–113.

Tsendsuren Munkhdalai and Hong Yu. 2017. Neural tree indexers for text understanding. In Proceedings of EACL, pages 11–21.

Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked sentence encoders for multi-domain inference. In Proceedings of ACL, Workshop on Evaluating Vector Space Representations for NLP, pages 41–45.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of AAAI.

Rasmus Palm, Ulrich Paquet, and Ole Winther. 2018. Recurrent relational networks. In Proceedings of NIPS, pages 3368–3378.
Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Proceedings of EMNLP, pages 2249–2255.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Lei Sha, Baobao Chang, Zhifang Sui, and Sujian Li. 2016. Reading and thinking: Re-read LSTM unit for textual entailment recognition. In Proceedings of COLING, pages 2870–2879.

Yi Tay, Anh Tuan Luu, and Siu Cheung Hui. 2018. Compare, compress and propagate: Enhancing neural architectures with alignment factorization for natural language inference. In Proceedings of EMNLP, pages 1565–1575.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of NAACL-HLT, pages 809–819.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018b. The Fact Extraction and VERification (FEVER) shared task. In Proceedings of the First Workshop on FEVER, pages 1–9.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of ICLR.

Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE TKDE, 29(12):2724–2743.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122.

Wenpeng Yin and Dan Roth. 2018. TwoWingOS: A two-wing optimization strategy for evidential claim verification. In Proceedings of EMNLP, pages 105–114.

Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. UCL Machine Reading Group: Four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on FEVER, pages 97–102.