Comprehensive Guide to Syntactic Parsing, Machine Translation, Text Summarization, and Question Answering
Parsing Problem and Grammar
Parsing involves analyzing the syntactic structure of a sentence based on a formal set
of rules called a grammar. A grammar specifies the permissible structures within a
language, typically through rewrite rules that define how symbols (terminals and
nonterminals) can be combined. The primary goal of parsing is to determine a parse
tree that accurately reflects the hierarchical syntactic relationships among sentence
components, such as noun phrases and verb phrases. Parsing is fundamental in
natural language processing applications like translation, question answering, and
speech recognition, as it provides the structural foundation necessary for
understanding and manipulating language data.
Syntactic Structure and Parse Trees
A parse tree visually and hierarchically represents the syntactic structure of a
sentence. It shows how words and phrases are combined according to grammatical
rules. For example, in the sentence "The plan to swallow Wanda has been thrilling Otto," a parse tree illustrates the relationships among the subject noun phrase "The plan to swallow Wanda," the verb phrase "has been thrilling Otto," and the object "Otto." These trees help in understanding sentence meaning, disambiguation, and further
processing like semantic analysis or translation. The structure typically starts from
the sentence (S) node and branches down into smaller constituents like NP (noun
phrase), VP (verb phrase), and individual words.
Applications of Parsing
Parsing plays a critical role across numerous NLP applications:
Text Summarization: Identifying key sentences or phrases by understanding sentence structure.
Information Extraction: Extracting specific data such as names, dates, or relationships from text.
Database Querying: Interpreting natural language queries to generate formal database commands.
Grammar Checking: Detecting grammatical errors by analyzing syntactic correctness.
Question Answering and Chatbots: Understanding question structure to generate accurate responses.
Machine Translation: Converting syntactic structures from source to target language.
Speech Recognition: Using syntactic cues to improve transcription accuracy.
Grammar and Context-Free Grammar (CFG)
A grammar defines the rules for constructing valid sentences. In particular, Context-
Free Grammar (CFG) is a widely used formalism characterized by rewrite rules where
a single nonterminal symbol rewrites into a sequence of terminals and nonterminals.
CFGs are represented as a 4-tuple G = (T, N, P, S), where:
T: Set of terminal symbols (actual words or tokens).
N: Set of nonterminal symbols (syntactic categories).
P: Set of production rules.
S: Start symbol (usually representing a complete sentence).
CFGs are powerful enough to generate many natural language structures and are
foundational in parsing algorithms like CKY and Earley's.
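As a concrete illustration (a minimal sketch using the NLTK library's CFG class; the toy rules are my own and not from this guide), the four components can be written down and inspected directly:

```python
# A toy CFG written with NLTK, making the G = (T, N, P, S) components explicit.
# The rules are illustrative only.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | 'John'
VP -> V NP
Det -> 'the'
N -> 'campus' | 'snow'
V -> 'saw'
""")

print(grammar.start())        # S, the start symbol
print(grammar.productions())  # P, the production rules
# T is the set of quoted terminals; N is {S, NP, VP, Det, N, V}.
```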
Ambiguity in Natural Language Grammars
Natural language grammars are inherently ambiguous; a single sentence can have
multiple valid parse trees. For example, the sentence "John saw snow on the
campus" can be parsed with different attachment points for the prepositional phrase
"on the campus." This ambiguity complicates parsing because multiple
interpretations must be considered, requiring probabilistic models or disambiguation
strategies to select the most plausible parse.
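To make the ambiguity concrete, the sketch below (assuming NLTK and a toy grammar of my own construction) enumerates both parses of the example sentence, one attaching the prepositional phrase to "snow" and one to the verb phrase:

```python
# Hypothetical toy grammar; both PP attachments are licensed, so the
# chart parser returns two distinct parse trees for the same sentence.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'John' | 'snow' | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N -> 'campus'
V -> 'saw'
P -> 'on'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("John saw snow on the campus".split()):
    print(tree)  # one tree attaches "on the campus" to "snow", the other to "saw"
```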
Top-Down Parsing
Top-down parsing begins with the start symbol (e.g., S) and recursively expands
nonterminals to match the input sentence. It aims to predict the structure that could
generate the sentence by rewriting goals into subgoals. This approach is goal-driven
and searches for a derivation that matches the input. However, it faces challenges such as:
Left recursion causing infinite loops.
Inefficiency due to exploring many unproductive paths.
Difficulty in handling lexical items directly, since lexical lookup is typically bottom-up.
Repeated work when common substructures are expanded multiple times.
Bottom-Up Parsing
Bottom-up parsing starts with the input tokens and attempts to build larger
constituents until reaching the start symbol. It works by matching sequences of
words to right-hand sides of grammar rules and replacing them with their left-hand
nonterminal. This data-driven approach is efficient in some contexts but can be
inefficient with lexical ambiguity and may perform exponential work with complex sentences.
CKY Algorithm
The CKY (Cocke-Younger-Kasami) algorithm is a dynamic programming bottom-up
parser applicable to CFGs in Chomsky Normal Form (CNF). It constructs a chart table
of size n×n for an input sentence of length n. Each cell [i,j] contains all nonterminals
that can generate the substring from position i to j. The algorithm systematically
combines smaller constituents to form larger ones, ensuring polynomial time
complexity. CKY is effective for parsing ambiguous sentences and guarantees
completeness for grammars in CNF.
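A compact recognizer version of the algorithm is sketched below (plain Python over a hypothetical CNF grammar; a full parser would also store back-pointers to recover parse trees):

```python
# CKY recognition for a CNF grammar, sketched in plain Python.
# chart[(i, j)] holds the nonterminals spanning words[i:j]; the grammar is hypothetical.
from collections import defaultdict

binary = {                      # A -> B C rules, indexed by (B, C)
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexical = {                     # A -> a rules, indexed by the terminal a
    "the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"},
}

def cky_recognize(words, start="S"):
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):                 # length-1 spans from lexical rules
        chart[(i, i + 1)] |= lexical.get(w, set())
    for span in range(2, n + 1):                  # longer spans, shortest first
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # try every split point
                for B in chart[(i, k)]:
                    for C in chart[(k, j)]:
                        chart[(i, j)] |= binary.get((B, C), set())
    return start in chart[(0, n)]

print(cky_recognize("the dog saw the cat".split()))  # True
```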
Chomsky Normal Form
Chomsky Normal Form restricts CFG rules to two types:
A → BC (binary rules where A, B, C are nonterminals).
A → a (terminal rules where A is a nonterminal and a is a terminal).
This form simplifies parsing algorithms like CKY by ensuring rules are binary; for example, a ternary rule such as VP → V NP PP is binarized into VP → V X and X → NP PP, which facilitates efficient chart-based parsing.
Earley's Algorithm
Earley’s algorithm combines top-down prediction and bottom-up recognition,
handling ambiguity and left recursion efficiently. It maintains a parse table with
states representing partial parses, and processes input tokens incrementally from left
to right. It predicts possible constituents, scans input tokens, and completes
constituents by attaching recognized parts. It is flexible, capable of parsing all CFGs,
and particularly effective with ambiguous and left-recursive grammars.
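The predict, scan, and complete operations can be sketched as a recognizer in plain Python (a simplified sketch over an epsilon-free toy grammar of my own; a production Earley parser would also keep back-pointers to build trees):

```python
# Earley recognition sketch: states are (lhs, rhs, dot, origin) tuples and
# chart[k] holds the states ending at input position k. The grammar is a
# hypothetical, epsilon-free toy, which keeps the completion step simple.
GRAMMAR = {
    "S":  [("NP", "VP")],
    "NP": [("Det", "N"), ("NP", "PP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "PP": [("P", "NP")],
    "Det": [("the",)], "N": [("dog",), ("park",)],
    "V": [("saw",)], "P": [("in",)],
}
NONTERMINALS = set(GRAMMAR)

def earley_recognize(words, start="S"):
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    chart[0].add(("GAMMA", (start,), 0, 0))          # dummy start state
    for k in range(n + 1):
        agenda = list(chart[k])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in NONTERMINALS:              # predict
                    for production in GRAMMAR[sym]:
                        state = (sym, production, 0, k)
                        if state not in chart[k]:
                            chart[k].add(state)
                            agenda.append(state)
                elif k < n and words[k] == sym:      # scan
                    chart[k + 1].add((lhs, rhs, dot + 1, origin))
            else:                                    # complete
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        state = (l2, r2, d2 + 1, o2)
                        if state not in chart[k]:
                            chart[k].add(state)
                            agenda.append(state)
    return ("GAMMA", (start,), 1, 0) in chart[n]

print(earley_recognize("the dog saw the park".split()))  # True
```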
Recursive Descent Parsing and Left Recursion
Recursive descent parsing employs recursive functions for each nonterminal rule,
attempting to match the input sequence. It is simple but fails with left-recursive
rules (e.g., A → A α), leading to infinite recursion. To fix this, grammars are often transformed to eliminate left recursion (for example, A → A α | β becomes A → β A′ with A′ → α A′ | ε), or alternative parsing strategies like Earley's are used.
Parsing as Search
Parsing can be viewed as a search problem over phrase space, represented as an and-or tree:
Disjuncts (or): alternative parse paths (e.g., different rules for the same nonterminal).
Conjuncts (and): conjunctions of sub-constituents (e.g., rule expansions).
This perspective guides search algorithms like CKY and Earley's, which explore possible derivations to find valid parse trees.
Left-Corner Parsing
Left-corner parsing combines bottom-up and top-down strategies by first identifying
the leftmost (left-corner) symbol of a phrase from the input and then verifying the
rest of the structure top-down. It is head-driven, efficient for head-initial languages
like English, and can handle certain ambiguities better than pure top-down or bottom-up methods.
Challenges in Machine Translation
Machine translation faces several core challenges:
Morphological variation: Languages differ in morpheme density; isolating
languages like Vietnamese have one morpheme per word, while polysynthetic
languages may encode entire sentences in a single word.
Syntactic differences: Word order varies across languages (e.g., English vs.
Vietnamese), complicating direct translation.
Conceptual gaps: Some words or concepts lack direct equivalents; for example, English has no single word for the Japanese notion of filial piety, while Japanese adopted a loanword for "privacy."
Structural and lexical differences require models to capture nuanced language-
specific features for accurate translation.
Statistical Machine Translation (SMT)
SMT models translation as a noisy channel, where the goal is to find the target
sentence that maximizes the probability P(E|F) (English given French, for example). It decomposes into:
Language model (P(E)): probability of the target sentence.
Translation model (P(F|E)): probability of the source given the target.
Alignment models: map words or phrases across languages.
The decoding process searches for the most probable target sentence based on these models, often using phrase tables and probabilistic models.
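Written out in the standard noisy-channel notation (added here for reference), decoding solves

\hat{E} = \arg\max_{E} P(E \mid F) = \arg\max_{E} P(F \mid E)\, P(E)

so the language model P(E) and the translation model P(F|E) are combined during the search for the best translation.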
Word Alignment and EM Algorithm
Word alignment links words in source and target sentences. The Expectation-
Maximization (EM) algorithm iteratively estimates translation probabilities:
E-step: computes expected counts of alignments based on current probabilities.
M-step: updates translation probabilities to maximize likelihood.
This process improves alignment quality over multiple iterations, enabling better phrase extraction and translation accuracy.
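These two steps can be illustrated with an IBM Model 1 style estimator on a tiny invented parallel corpus (a sketch in plain Python; the NULL word and the refinements of later IBM models are omitted):

```python
from collections import defaultdict

# Toy parallel corpus of (foreign, english) token lists; the data are invented.
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}  # uniform start

for _ in range(10):
    count, total = defaultdict(float), defaultdict(float)
    for fs, es in corpus:
        for f in fs:                                  # E-step: expected alignment counts
            z = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():                   # M-step: re-estimate t(f|e)
        t[(f, e)] = c / total[e]

# After a few iterations, "Haus" should strongly prefer "house", "Buch" -> "book", etc.
for pair, prob in sorted(t.items(), key=lambda kv: -kv[1])[:6]:
    print(pair, round(prob, 2))
```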
Neural Machine Translation (NMT) and Sequence-to-Sequence Models
NMT employs a single neural network with an encoder-decoder architecture:
Encoder: encodes the source sentence into a fixed-length or contextual representation.
Decoder: generates the target sentence conditioned on the encoder’s output.
Recurrent neural networks (RNNs), especially LSTMs or GRUs, are common,
allowing the model to learn complex language patterns end-to-end. Attention
mechanisms help the decoder focus on relevant parts of the source, greatly
improving translation quality for long sentences.
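A minimal encoder-decoder skeleton in PyTorch is shown below (an illustrative sketch with invented sizes, not a model described in this guide; attention is omitted here and discussed in the section below):

```python
# Minimal seq2seq skeleton with teacher forcing; sizes and names are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence into a final hidden/cell state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode conditioned on that state, feeding gold target tokens (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)            # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))        # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1000, (2, 5))        # shifted target tokens, length 5
print(model(src, tgt).shape)                # torch.Size([2, 5, 1000])
```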
Beam Search Decoding in NMT
Instead of greedy decoding (choosing the most probable word at each step), beam
search explores multiple hypotheses simultaneously, maintaining a fixed number
(beam size) of top candidates. This reduces errors caused by early incorrect choices,
leading to more accurate and fluent translations.
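A generic sketch of the procedure (plain Python; step_fn is a hypothetical function returning next-token log-probabilities for a prefix, standing in for a real decoder):

```python
import math

def beam_search(step_fn, bos, eos, beam_size=4, max_len=20):
    # Each hypothesis is (log_prob, token_list); start from the BOS symbol.
    beams = [(0.0, [bos])]
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == eos:                 # finished hypotheses carry over
                candidates.append((score, tokens))
                continue
            for token, logp in step_fn(tokens):   # expand with next-token log-probs
                candidates.append((score + logp, tokens + [token]))
        # keep only the top `beam_size` hypotheses by total log-probability
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(t[-1] == eos for _, t in beams):
            break
    return beams[0]

# Toy step function: a fake bigram model over three symbols, for illustration only.
def step_fn(tokens):
    table = {"<s>": {"a": 0.6, "b": 0.4}, "a": {"b": 0.7, "</s>": 0.3},
             "b": {"a": 0.2, "</s>": 0.8}}
    return [(tok, math.log(p)) for tok, p in table[tokens[-1]].items()]

print(beam_search(step_fn, "<s>", "</s>"))
```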
Attention Mechanism in NMT
Attention addresses the bottleneck problem of encoding entire source sequences
into a single vector by dynamically focusing on relevant source parts during
decoding. At each decoding step, attention computes scores (via dot product or
learned functions) between the decoder state and encoder states, producing a
weighted context vector. This allows the model to handle long sentences and
provides interpretability through learned alignments, which visually correspond to
word correspondences in translation.
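The score-and-weight computation can be written in a few lines (a NumPy sketch with invented dimensions; real systems use learned projections and batched tensors):

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    # decoder_state: (hidden,)   encoder_states: (src_len, hidden)
    scores = encoder_states @ decoder_state          # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over source positions
    context = weights @ encoder_states               # weighted context vector
    return context, weights

enc = np.random.randn(6, 128)    # hypothetical encoder states for 6 source tokens
dec = np.random.randn(128)       # current decoder hidden state
context, weights = dot_product_attention(dec, enc)
print(context.shape, weights.round(2))
```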
Evaluation of Machine Translation - BLEU Score
BLEU (Bilingual Evaluation Understudy) measures translation quality by comparing
system output to one or more human references:
Calculates n-gram precision (up to 4-grams).
Applies a brevity penalty to discourage overly short outputs.
The BLEU score ranges from 0 to 1 (or 0 to 100), with higher scores indicating
closer matches to references. While widely used, BLEU has limitations, such as not
capturing semantic adequacy or fluency comprehensively.
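For a quick sentence-level check, NLTK's BLEU implementation can be used (a sketch; reported scores are normally corpus-level, and smoothing is needed for short segments):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat is on the mat".split()]        # list of reference token lists
hypothesis = "the cat sat on the mat".split()
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))   # value in [0, 1]; higher means closer n-gram overlap
```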
Advantages and Disadvantages of NMT
Advantages:
Produces more fluent, natural translations.
Better at capturing context and phrase similarities.
End-to-end training reduces human engineering.
Uniform approach across languages.
Disadvantages:
Less interpretable and harder to debug.
Difficult to incorporate explicit rules or constraints.
Safety and controllability concerns, such as hallucinating facts.
Requires large amounts of data and computational resources.
Development and Impact of NMT
Since its inception around 2014, NMT rapidly overtook traditional SMT methods,
with major tech companies like Google adopting it by 2016. Its ability to produce
high-quality translations with minimal engineering has revolutionized machine
translation, enabling real-time, high-quality multilingual communication.
Text Summarization
Text summarization condenses long documents into shorter summaries that preserve
key information. It is applicable in news headlines, meeting minutes, reviews,
biographies, weather bulletins, and historical chronologies, enabling quick comprehension of large texts.
Extractive vs Abstractive Summarization
Extractive summarization selects important sentences or phrases directly from the
original text, maintaining fidelity but potentially lacking coherence.
Abstractive summarization generates new sentences that paraphrase and
synthesize the main ideas, often using sequence-to-sequence models with attention mechanisms.
Techniques for Extractive Summarization
Methods include:
Topic Representation: scoring sentences based on relevance to main topics
identified through keywords, lexical chains, LSA, or Bayesian models like LDA.
Indicator Features: using sentence features such as position, length, keyword
presence, and cue phrases, often combined with machine learning classifiers.
Centroid-based Summarization: computing a centroid vector representing the main content and selecting sentences most similar to it (see the sketch after this list).
Graph-based Methods: applying algorithms like TextRank and LexRank, where
sentences are nodes connected by similarity, and ranking scores determine importance.
Supervised Learning: training classifiers to predict sentence importance based on
features, then selecting top-ranked sentences.
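As one concrete instance of these techniques, a centroid-based extractive scorer can be sketched with scikit-learn's TF-IDF utilities (an illustrative sketch on a toy document; the preprocessing choices are assumptions):

```python
# Centroid-based extraction sketch: score each sentence by cosine similarity
# to the TF-IDF centroid of the document and keep the top-k sentences.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def centroid_summary(sentences, k=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)     # (num_sentences, vocab)
    centroid = np.asarray(tfidf.mean(axis=0))               # 1 x vocab centroid vector
    scores = cosine_similarity(tfidf, centroid).ravel()     # similarity of each sentence
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]               # keep original order

doc = [
    "The storm closed several roads in the region.",
    "Officials said road repairs would begin on Monday.",
    "A local bakery released a new pastry this week.",
    "Road closures are expected to last until the weekend.",
]
print(centroid_summary(doc, k=2))
```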
Evaluation Metrics - ROUGE
ROUGE measures the overlap between generated and reference summaries:
ROUGE-N: overlaps of n-grams (e.g., unigrams, bigrams).
ROUGE-L: longest common subsequence, capturing sentence-level fluency and structure.
High ROUGE scores indicate content similarity but do not fully assess coherence or readability.
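A bare-bones ROUGE-N computation (unigram overlap in this sketch) can be written directly; real toolkits additionally handle stemming and multiple references:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum((cand & ref).values())           # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=1))
```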
Deep Learning for Extractive Summarization
Models like SummaRuNNer utilize recurrent neural networks to encode sentences
and assign importance scores based on relevance, novelty, and position. Sentences
with high scores are selected to form summaries, enabling more nuanced and context-aware extraction.
Sequence-to-Sequence Models for Abstractive Summarization
Encoder-decoder architectures with attention mechanisms generate summaries by
translating input sequences into condensed outputs. These models can produce
more fluent and coherent summaries but sometimes hallucinate or repeat content.
Pointer-Generator Networks and Coverage Mechanism
To improve abstractive summaries:
Pointer-Generator Networks: allow models to copy words directly from the
source text, addressing out-of-vocabulary issues.
Coverage Mechanism: tracks past attention to prevent repetitive content,
encouraging the model to cover all important information without redundancy.
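In the widely used pointer-generator formulation (standard notation from that line of work, stated here for reference), the output distribution mixes generation with copying, and coverage accumulates past attention:

P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i^t
\qquad
c^t = \sum_{t' < t} a^{t'}, \quad \text{covloss}_t = \sum_i \min(a_i^t, c_i^t)

Here a^t is the attention distribution at decoding step t, p_gen is a learned switch between generating and copying, and the coverage loss penalizes attending again to already-covered source positions.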
Pre-trained Language Models for Summarization
Models like PEGASUS and PRIMERA leverage pretraining with objectives like Gap
Sentence Generation or masked sentence reconstruction to enhance summarization
quality, especially in multi-document settings. They learn inter-sentence coherence
and content selection without requiring large amounts of annotated summaries.
Longformer for Handling Long Documents
Longformer employs sliding window attention and global attention tokens to process
long inputs (up to 4096 tokens) efficiently, enabling summarization and
comprehension of lengthy texts that exceed typical transformer input limits.
Question Answering (QA) Systems
QA systems automatically respond to human questions using various sources such as
text passages, web documents, knowledge bases, and databases. They serve
applications like FAQs, chatbots, education advising, and information retrieval.
Types of Questions and Answers
Questions are classified by:
Type: fact-based ("Who", "Which") or explanation ("Why", "How").
Answer format: short words, phrases, paragraphs, lists, yes/no.
QA systems must detect intent, identify key entities, and generate appropriate responses, often involving classification, slot filling, and retrieval.
Machine Reading Comprehension (MRC) and SQuAD
MRC tasks involve systems understanding a passage to answer questions. The SQuAD
dataset contains over 100,000 annotated (passage, question, answer) triples, with
answers typically as short spans within passages. Models like BERT and BiDAF
achieve near-human performance by predicting answer span boundaries.
BiDAF: Bidirectional Attention Flow
BiDAF encodes context and query with word and character embeddings, applies
bidirectional LSTMs, and computes two types of attention:
Context-to-query: focusing on relevant query words for each context word.
Query-to-context: identifying context words most relevant to the query.
This bidirectional attention facilitates accurate span prediction for answers.
BERT for Reading Comprehension
BERT is pre-trained on large corpora with masked language modeling and next sentence prediction, producing deep contextual embeddings. Fine-tuned for QA,
BERT predicts answer spans by classifying start and end positions, achieving
performance close to or surpassing human levels on datasets like SQuAD.
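For a quick span-prediction demo, the Hugging Face transformers pipeline can be used (a sketch assuming that library and a public SQuAD-tuned checkpoint such as distilbert-base-cased-distilled-squad):

```python
from transformers import pipeline

# Assumes the transformers library and a public SQuAD-fine-tuned checkpoint.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="Where was the treaty signed?",
    context="The treaty was signed in Geneva in 1864 after lengthy negotiations.",
)
print(result["answer"], round(result["score"], 3))   # predicted span and confidence
```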
Architectures in QA: Bi-Encoder vs Cross-Encoder
Bi-Encoder: independently encodes question and passage into embeddings,
enabling fast retrieval but with less nuanced relevance.
Cross-Encoder: jointly encodes question and passage, considering full interaction,
yielding higher accuracy but at higher computational cost.
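The contrast can be sketched with the sentence-transformers package (assuming that library and public checkpoints such as all-MiniLM-L6-v2 and cross-encoder/ms-marco-MiniLM-L-6-v2):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

question = "How do I reset my password?"
passages = ["Click 'Forgot password' on the login page to reset it.",
            "Our office is closed on public holidays."]

# Bi-encoder: encode question and passages independently, then compare embeddings.
bi = SentenceTransformer("all-MiniLM-L6-v2")
q_emb, p_emb = bi.encode(question), bi.encode(passages)
print(util.cos_sim(q_emb, p_emb))          # fast; passage embeddings are precomputable

# Cross-encoder: score each (question, passage) pair jointly; slower, more accurate.
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross.predict([(question, p) for p in passages]))
```

In practice the bi-encoder embeddings are precomputed for the whole collection, and the cross-encoder is used only to re-rank the top retrieved passages.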
Semantic Search with Sentence-BERT (SBERT)
SBERT modifies BERT with Siamese networks to generate semantically meaningful
sentence embeddings, enabling efficient similarity search and clustering. It drastically
reduces inference time compared to full BERT models, making large-scale retrieval feasible.
Inverted File Index for Semantic Retrieval
Semantic retrieval employs inverted indexes mapping quantized embedding clusters
to documents. When a query is embedded, it is mapped to relevant clusters, and
documents within these clusters are retrieved and ranked by similarity, enabling scalable semantic search.
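A sketch of this retrieval pattern using the faiss library (assuming faiss is installed; the random vectors stand in for real sentence embeddings):

```python
import numpy as np
import faiss

d, n_docs, nlist = 128, 10000, 64                  # embedding dim, corpus size, clusters
docs = np.random.random((n_docs, d)).astype("float32")   # stand-in document embeddings

quantizer = faiss.IndexFlatL2(d)                   # coarse quantizer defining the clusters
index = faiss.IndexIVFFlat(quantizer, d, nlist)    # inverted file over those clusters
index.train(docs)                                  # learn the cluster centroids
index.add(docs)                                    # assign documents to inverted lists

index.nprobe = 8                                   # how many clusters to visit per query
query = np.random.random((1, d)).astype("float32")
distances, ids = index.search(query, 5)            # top-5 nearest documents
print(ids)
```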
Contrastive Learning for Sentence Embeddings
Contrastive learning trains models to bring similar sentence pairs closer and push
dissimilar pairs apart using loss functions like contrastive loss, triplet loss, or InfoNCE.
This enhances the semantic quality of embeddings, improving retrieval and clustering tasks.
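A compact in-batch InfoNCE loss in PyTorch (a sketch; the embeddings are random placeholders and the temperature value is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.05):
    # Row i of `anchors` should match row i of `positives`; every other row is a negative.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                  # pairwise cosine similarities
    labels = torch.arange(a.size(0))                # the diagonal holds the positives
    return F.cross_entropy(logits, labels)

anchors = torch.randn(8, 256)      # e.g., two dropout views or paraphrase pairs
positives = torch.randn(8, 256)
print(info_nce(anchors, positives))
```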
SimCSE: Simple Contrastive Learning of Sentence Embeddings
SimCSE improves sentence representations by:
Unsupervised: using dropout to generate different embeddings of the same sentence as positive pairs.
Supervised: leveraging labeled paraphrase data. It employs contrastive loss to
produce high-quality, semantically meaningful embeddings suitable for retrieval
and semantic similarity tasks.
This comprehensive overview spans the core concepts, algorithms, models, and
applications across syntactic parsing, machine translation, text summarization, and
question answering, providing a solid foundation for advanced NLP studies and
practical implementation.