Comprehensive Guide to Syntactic Parsing, Machine Translation, Text Summarization, and Question Answering
Parsing Problem and Grammar
Parsing involves analyzing the syntactic structure of a sentence based on a formal set
of rules called a grammar. A grammar specifies the permissible structures within a
language, typically through rewrite rules that define how symbols (terminals and
nonterminals) can be combined. The primary goal of parsing is to determine a parse
tree that accurately reflects the hierarchical syntactic relationships among sentence
components, such as noun phrases and verb phrases. Parsing is fundamental in
natural language processing applications like translation, question answering, and
speech recognition, as it provides the structural foundation necessary for
understanding and manipulating language data.
Syntactic Structure and Parse Trees
A parse tree visually and hierarchically represents the syntactic structure of a
sentence. It shows how words and phrases are combined according to grammatical
rules. For example, in the sentence "The plan to swallow Wanda has been thrilling Otto," a parse tree illustrates the relationships among the subject noun phrase "The plan to swallow Wanda," the verb phrase "has been thrilling Otto," and the object "Otto." These trees help in understanding sentence meaning, disambiguation, and further
processing like semantic analysis or translation. The structure typically starts from
the sentence (S) node and branches down into smaller constituents like NP (noun
phrase), VP (verb phrase), and individual words.
Applications of Parsing
Parsing plays a critical role across numerous NLP applications:
Text Summarization: Identifying key sentences or phrases by understanding sentence structure.
Information Extraction: Extracting specific data such as names, dates, or relationships from text.
Database Querying: Interpreting natural language queries to generate formal database commands.
Grammar Checking: Detecting grammatical errors by analyzing syntactic correctness.
Question Answering and Chatbots: Understanding question structure to generate accurate responses.
Machine Translation: Converting syntactic structures from source to target language.
Speech Recognition: Using syntactic cues to improve transcription accuracy.
Grammar and Context-Free Grammar (CFG)
A grammar defines the rules for constructing valid sentences. In particular, Context-
Free Grammar (CFG) is a widely used formalism characterized by rewrite rules where
a single nonterminal symbol rewrites into a sequence of terminals and nonterminals.
CFGs are represented as a 4-tuple G = (T, N, P, S), where:
T: Set of terminal symbols (actual words or tokens).
N: Set of nonterminal symbols (syntactic categories).
P: Set of production rules.
S: Start symbol (usually representing a complete sentence).
CFGs are powerful enough to generate many natural language structures and are
foundational in parsing algorithms like CKY and Earley's.
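As a concrete illustration (a minimal sketch using the NLTK library's CFG class; the toy rules are my own and not from this guide), the four components can be written down and inspected directly:

```python
# A toy CFG written with NLTK, making the G = (T, N, P, S) components explicit.
# The rules are illustrative only.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | 'John'
VP -> V NP
Det -> 'the'
N -> 'campus' | 'snow'
V -> 'saw'
""")

print(grammar.start())        # S, the start symbol
print(grammar.productions())  # P, the production rules
# T is the set of quoted terminals; N is {S, NP, VP, Det, N, V}.
```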
Ambiguity in Natural Language Grammars
Natural language grammars are inherently ambiguous; a single sentence can have
multiple valid parse trees. For example, the sentence "John saw snow on the
campus" can be parsed with different attachment points for the prepositional phrase
"on the campus." This ambiguity complicates parsing because multiple
interpretations must be considered, requiring probabilistic models or disambiguation
strategies to select the most plausible parse.
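To make the ambiguity concrete, the sketch below (assuming NLTK and a toy grammar of my own construction) enumerates both parses of the example sentence, one attaching the prepositional phrase to "snow" and one to the verb phrase:

```python
# Hypothetical toy grammar; both PP attachments are licensed, so the
# chart parser returns two distinct parse trees for the same sentence.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'John' | 'snow' | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N -> 'campus'
V -> 'saw'
P -> 'on'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("John saw snow on the campus".split()):
    print(tree)  # one tree attaches "on the campus" to "snow", the other to "saw"
```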
Top-Down Parsing
Top-down parsing begins with the start symbol (e.g., S) and recursively expands
nonterminals to match the input sentence. It aims to predict the structure that could
generate the sentence by rewriting goals into subgoals. This approach is goal-driven
and searches for a derivation that matches the input. However, it faces challenges such as:
Left recursion causing infinite loops.
Inefficiency due to exploring many unproductive paths.
Difficulty in handling lexical items directly, since lexical lookup is typically bottom-up.
Repeated work when common substructures are expanded multiple times.
Bottom-Up Parsing
Bottom-up parsing starts with the input tokens and attempts to build larger
constituents until reaching the start symbol. It works by matching sequences of
words to right-hand sides of grammar rules and replacing them with their left-hand
nonterminal. This data-driven approach is efficient in some contexts but can be
inefficient with lexical ambiguity and may perform exponential work with complex sentences.
CKY Algorithm
The CKY (Cocke-Younger-Kasami) algorithm is a dynamic programming bottom-up
parser applicable to CFGs in Chomsky Normal Form (CNF). It constructs a chart table
of size n×n for an input sentence of length n. Each cell [i,j] contains all nonterminals
that can generate the substring from position i to j. The algorithm systematically
combines smaller constituents to form larger ones, ensuring polynomial time
complexity. CKY is effective for parsing ambiguous sentences and guarantees
completeness for grammars in CNF.
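A compact recognizer version of the algorithm is sketched below (plain Python over a hypothetical CNF grammar; a full parser would also store back-pointers to recover parse trees):

```python
# CKY recognition for a CNF grammar, sketched in plain Python.
# chart[(i, j)] holds the nonterminals spanning words[i:j]; the grammar is hypothetical.
from collections import defaultdict

binary = {                      # A -> B C rules, indexed by (B, C)
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
lexical = {                     # A -> a rules, indexed by the terminal a
    "the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"},
}

def cky_recognize(words, start="S"):
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):                 # length-1 spans from lexical rules
        chart[(i, i + 1)] |= lexical.get(w, set())
    for span in range(2, n + 1):                  # longer spans, shortest first
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # try every split point
                for B in chart[(i, k)]:
                    for C in chart[(k, j)]:
                        chart[(i, j)] |= binary.get((B, C), set())
    return start in chart[(0, n)]

print(cky_recognize("the dog saw the cat".split()))  # True
```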
Chomsky Normal Form
Chomsky Normal Form restricts CFG rules to two types:
A → BC (binary rules where A, B, C are nonterminals).
A → a (terminal rules where A is a nonterminal and a is a terminal).
This form simplifies parsing algorithms like CKY by ensuring rules are binary; for example, a ternary rule such as VP → V NP PP is binarized into VP → V X and X → NP PP, which facilitates efficient chart-based parsing.
Earley's Algorithm
Earley’s algorithm combines top-down prediction and bottom-up recognition,
handling ambiguity and left recursion efficiently. It maintains a parse table with
states representing partial parses, and processes input tokens incrementally from left
to right. It predicts possible constituents, scans input tokens, and completes
constituents by attaching recognized parts. It is flexible, capable of parsing all CFGs,
and particularly effective with ambiguous and left-recursive grammars.
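The predict, scan, and complete operations can be sketched as a recognizer in plain Python (a simplified sketch over an epsilon-free toy grammar of my own; a production Earley parser would also keep back-pointers to build trees):

```python
# Earley recognition sketch: states are (lhs, rhs, dot, origin) tuples and
# chart[k] holds the states ending at input position k. The grammar is a
# hypothetical, epsilon-free toy, which keeps the completion step simple.
GRAMMAR = {
    "S":  [("NP", "VP")],
    "NP": [("Det", "N"), ("NP", "PP")],
    "VP": [("V", "NP"), ("VP", "PP")],
    "PP": [("P", "NP")],
    "Det": [("the",)], "N": [("dog",), ("park",)],
    "V": [("saw",)], "P": [("in",)],
}
NONTERMINALS = set(GRAMMAR)

def earley_recognize(words, start="S"):
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    chart[0].add(("GAMMA", (start,), 0, 0))          # dummy start state
    for k in range(n + 1):
        agenda = list(chart[k])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in NONTERMINALS:              # predict
                    for production in GRAMMAR[sym]:
                        state = (sym, production, 0, k)
                        if state not in chart[k]:
                            chart[k].add(state)
                            agenda.append(state)
                elif k < n and words[k] == sym:      # scan
                    chart[k + 1].add((lhs, rhs, dot + 1, origin))
            else:                                    # complete
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        state = (l2, r2, d2 + 1, o2)
                        if state not in chart[k]:
                            chart[k].add(state)
                            agenda.append(state)
    return ("GAMMA", (start,), 1, 0) in chart[n]

print(earley_recognize("the dog saw the park".split()))  # True
```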
Recursive Descent Parsing and Left Recursion
Recursive descent parsing employs recursive functions for each nonterminal rule,
attempting to match the input sequence. It is simple but fails with left-recursive
rules (e.g., A → A α), leading to infinite recursion. To fix this, grammars are often transformed to eliminate left recursion (for example, A → A α | β becomes A → β A′ with A′ → α A′ | ε), or alternative parsing strategies like Earley's are used.
Parsing as Search
Parsing can be viewed as a search problem over phrase space, represented as an and-or tree:
Disjuncts (or): alternative parse paths (e.g., different rules for the same nonterminal).
Conjuncts (and): conjunctions of sub-constituents (e.g., rule expansions).
This perspective guides search algorithms like CKY and Earley's, which explore possible derivations to find valid parse trees.
Left-Corner Parsing
Left-corner parsing combines bottom-up and top-down strategies by first identifying
the leftmost (left-corner) symbol of a phrase from the input and then verifying the
rest of the structure top-down. It is head-driven, efficient for head-initial languages
like English, and can handle certain ambiguities better than pure top-down or bottom-up methods.
Challenges in Machine Translation
Machine translation faces several core challenges:
Morphological variation: Languages differ in morpheme density; isolating
languages like Vietnamese have one morpheme per word, while polysynthetic
languages may encode entire sentences in a single word.
Syntactic differences: Word order varies across languages (e.g., English vs.
Vietnamese), complicating direct translation.
Conceptual gaps: Some words or concepts lack direct equivalents; for example, English has no single word for the Japanese notion of filial piety, while Japanese adopted a loanword for "privacy."
Structural and lexical differences require models to capture nuanced language-
specific features for accurate translation.
Statistical Machine Translation (SMT)
SMT models translation as a noisy channel, where the goal is to find the target
sentence that maximizes the probability P(E|F) (English given French, for example). It decomposes into:
Language model (P(E)): probability of the target sentence.
Translation model (P(F|E)): probability of the source given the target.
Alignment models: map words or phrases across languages.
The decoding process searches for the most probable target sentence based on these models, often using phrase tables and probabilistic models.
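Written out in the standard noisy-channel notation (added here for reference), decoding solves

\hat{E} = \arg\max_{E} P(E \mid F) = \arg\max_{E} P(F \mid E)\, P(E)

so the language model P(E) and the translation model P(F|E) are combined during the search for the best translation.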
Word Alignment and EM Algorithm
Word alignment links words in source and target sentences. The Expectation-
Maximization (EM) algorithm iteratively estimates translation probabilities:
E-step: computes expected counts of alignments based on current probabilities.
M-step: updates translation probabilities to maximize likelihood.
This process improves alignment quality over multiple iterations, enabling better phrase extraction and translation accuracy.
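These two steps can be illustrated with an IBM Model 1 style estimator on a tiny invented parallel corpus (a sketch in plain Python; the NULL word and the refinements of later IBM models are omitted):

```python
from collections import defaultdict

# Toy parallel corpus of (foreign, english) token lists; the data are invented.
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]

f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es}
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}  # uniform start

for _ in range(10):
    count, total = defaultdict(float), defaultdict(float)
    for fs, es in corpus:
        for f in fs:                                  # E-step: expected alignment counts
            z = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():                   # M-step: re-estimate t(f|e)
        t[(f, e)] = c / total[e]

# After a few iterations, "Haus" should strongly prefer "house", "Buch" -> "book", etc.
for pair, prob in sorted(t.items(), key=lambda kv: -kv[1])[:6]:
    print(pair, round(prob, 2))
```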
Neural Machine Translation (NMT) and Sequence-to-Sequence Models
NMT employs a single neural network with an encoder-decoder architecture:
Encoder: encodes the source sentence into a fixed-length or contextual representation.
Decoder: generates the target sentence conditioned on the encoder’s output.
Recurrent neural networks (RNNs), especially LSTMs or GRUs, are common,
allowing the model to learn complex language patterns end-to-end. Attention
mechanisms help the decoder focus on relevant parts of the source, greatly
improving translation quality for long sentences.
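A minimal encoder-decoder skeleton in PyTorch is shown below (an illustrative sketch with invented sizes, not a model described in this guide; attention is omitted here and discussed in the section below):

```python
# Minimal seq2seq skeleton with teacher forcing; sizes and names are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence into a final hidden/cell state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decode conditioned on that state, feeding gold target tokens (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)            # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))        # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1000, (2, 5))        # shifted target tokens, length 5
print(model(src, tgt).shape)                # torch.Size([2, 5, 1000])
```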
Beam Search Decoding in NMT
Instead of greedy decoding (choosing the most probable word at each step), beam
search explores multiple hypotheses simultaneously, maintaining a fixed number
(beam size) of top candidates. This reduces errors caused by early incorrect choices,
leading to more accurate and fluent translations.
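A generic sketch of the procedure (plain Python; step_fn is a hypothetical function returning next-token log-probabilities for a prefix, standing in for a real decoder):

```python
import math

def beam_search(step_fn, bos, eos, beam_size=4, max_len=20):
    # Each hypothesis is (log_prob, token_list); start from the BOS symbol.
    beams = [(0.0, [bos])]
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == eos:                 # finished hypotheses carry over
                candidates.append((score, tokens))
                continue
            for token, logp in step_fn(tokens):   # expand with next-token log-probs
                candidates.append((score + logp, tokens + [token]))
        # keep only the top `beam_size` hypotheses by total log-probability
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(t[-1] == eos for _, t in beams):
            break
    return beams[0]

# Toy step function: a fake bigram model over three symbols, for illustration only.
def step_fn(tokens):
    table = {"<s>": {"a": 0.6, "b": 0.4}, "a": {"b": 0.7, "</s>": 0.3},
             "b": {"a": 0.2, "</s>": 0.8}}
    return [(tok, math.log(p)) for tok, p in table[tokens[-1]].items()]

print(beam_search(step_fn, "<s>", "</s>"))
```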
Attention Mechanism in NMT
Attention addresses the bottleneck problem of encoding entire source sequences
into a single vector by dynamically focusing on relevant source parts during
decoding. At each decoding step, attention computes scores (via dot product or
learned functions) between the decoder state and encoder states, producing a
weighted context vector. This allows the model to handle long sentences and
provides interpretability through learned alignments, which visually correspond to
word correspondences in translation.
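The score-and-weight computation can be written in a few lines (a NumPy sketch with invented dimensions; real systems use learned projections and batched tensors):

```python
import numpy as np

def dot_product_attention(decoder_state, encoder_states):
    # decoder_state: (hidden,)   encoder_states: (src_len, hidden)
    scores = encoder_states @ decoder_state          # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over source positions
    context = weights @ encoder_states               # weighted context vector
    return context, weights

enc = np.random.randn(6, 128)    # hypothetical encoder states for 6 source tokens
dec = np.random.randn(128)       # current decoder hidden state
context, weights = dot_product_attention(dec, enc)
print(context.shape, weights.round(2))
```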
Evaluation of Machine Translation - BLEU Score
BLEU (Bilingual Evaluation Understudy) measures translation quality by comparing
system output to one or more human references:
Calculates n-gram precision (up to 4-grams).
Applies a brevity penalty to discourage overly short outputs.
The BLEU score ranges from 0 to 1 (or 0 to 100), with higher scores indicating
closer matches to references. While widely used, BLEU has limitations, such as not
capturing semantic adequacy or fluency comprehensively.
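For a quick sentence-level check, NLTK's BLEU implementation can be used (a sketch; reported scores are normally corpus-level, and smoothing is needed for short segments):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the cat is on the mat".split()]        # list of reference token lists
hypothesis = "the cat sat on the mat".split()
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))   # value in [0, 1]; higher means closer n-gram overlap
```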
Advantages and Disadvantages of NMT
Advantages:
Produces more fluent, natural translations.
Better at capturing context and phrase similarities.
End-to-end training reduces human engineering.
Uniform approach across languages.
Disadvantages:
Less interpretable and harder to debug.
Difficult to incorporate explicit rules or constraints.
Safety and controllability concerns, such as hallucinating facts.
Requires large amounts of data and computational resources.
Development and Impact of NMT
Since its inception around 2014, NMT rapidly overtook traditional SMT methods,
with major tech companies like Google adopting it by 2016. Its ability to produce
high-quality translations with minimal engineering has revolutionized machine
translation, enabling real-time, high-quality multilingual communication.
Text Summarization
Text summarization condenses long documents into shorter summaries that preserve
key information. It is applicable in news headlines, meeting minutes, reviews,
biographies, weather bulletins, and historical chronologies, enabling quick comprehension of large texts.
Extractive vs Abstractive Summarization
Extractive summarization selects important sentences or phrases directly from the
original text, maintaining fidelity but potentially lacking coherence.
Abstractive summarization generates new sentences that paraphrase and
synthesize the main ideas, often using sequence-to-sequence models with attention mechanisms.
Techniques for Extractive Summarization
Methods include:
Topic Representation: scoring sentences based on relevance to main topics
identified through keywords, lexical chains, LSA, or Bayesian models like LDA.
Indicator Features: using sentence features such as position, length, keyword
presence, and cue phrases, often combined with machine learning classifiers.
Centroid-based Summarization: computing a centroid vector representing the main content and selecting sentences most similar to it (see the sketch after this list).
Graph-based Methods: applying algorithms like TextRank and LexRank, where
sentences are nodes connected by similarity, and ranking scores determine importance.
Supervised Learning: training classifiers to predict sentence importance based on
features, then selecting top-ranked sentences.
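As one concrete instance of these techniques, a centroid-based extractive scorer can be sketched with scikit-learn's TF-IDF utilities (an illustrative sketch on a toy document; the preprocessing choices are assumptions):

```python
# Centroid-based extraction sketch: score each sentence by cosine similarity
# to the TF-IDF centroid of the document and keep the top-k sentences.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def centroid_summary(sentences, k=2):
    tfidf = TfidfVectorizer().fit_transform(sentences)     # (num_sentences, vocab)
    centroid = np.asarray(tfidf.mean(axis=0))               # 1 x vocab centroid vector
    scores = cosine_similarity(tfidf, centroid).ravel()     # similarity of each sentence
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]               # keep original order

doc = [
    "The storm closed several roads in the region.",
    "Officials said road repairs would begin on Monday.",
    "A local bakery released a new pastry this week.",
    "Road closures are expected to last until the weekend.",
]
print(centroid_summary(doc, k=2))
```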
Evaluation Metrics - ROUGE
ROUGE measures the overlap between generated and reference summaries:
ROUGE-N: overlaps of n-grams (e.g., unigrams, bigrams).
ROUGE-L: longest common subsequence, capturing sentence-level fluency and structure.
High ROUGE scores indicate content similarity but do not fully assess coherence or readability.
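A bare-bones ROUGE-N computation (unigram overlap in this sketch) can be written directly; real toolkits additionally handle stemming and multiple references:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum((cand & ref).values())           # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=1))
```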
Deep Learning for Extractive Summarization
Models like SummaRuNNer utilize recurrent neural networks to encode sentences
and assign importance scores based on relevance, novelty, and position. Sentences
with high scores are selected to form summaries, enabling more nuanced and context-aware extraction.
Sequence-to-Sequence Models for Abstractive Summarization
Encoder-decoder architectures with attention mechanisms generate summaries by
translating input sequences into condensed outputs. These models can produce
more fluent and coherent summaries but sometimes hallucinate or repeat content.
Pointer-Generator Networks and Coverage Mechanism
To improve abstractive summaries:
Pointer-Generator Networks: allow models to copy words directly from the
source text, addressing out-of-vocabulary issues.
Coverage Mechanism: tracks past attention to prevent repetitive content,
encouraging the model to cover all important information without redundancy.
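In the widely used pointer-generator formulation (standard notation from that line of work, stated here for reference), the output distribution mixes generation with copying, and coverage accumulates past attention:

P(w) = p_{gen} \, P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i^t
\qquad
c^t = \sum_{t' < t} a^{t'}, \quad \text{covloss}_t = \sum_i \min(a_i^t, c_i^t)

Here a^t is the attention distribution at decoding step t, p_gen is a learned switch between generating and copying, and the coverage loss penalizes attending again to already-covered source positions.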
Pre-trained Language Models for Summarization
Models like PEGASUS and PRIMERA leverage pretraining with objectives like Gap
Sentence Generation or masked sentence reconstruction to enhance summarization
quality, especially in multi-document settings. They learn inter-sentence coherence
and content selection without requiring large amounts of annotated summaries.
Longformer for Handling Long Documents
Longformer employs sliding window attention and global attention tokens to process
long inputs (up to 4096 tokens) efficiently, enabling summarization and
comprehension of lengthy texts that exceed typical transformer input limits.
Question Answering (QA) Systems
QA systems automatically respond to human questions using various sources such as
text passages, web documents, knowledge bases, and databases. They serve
applications like FAQs, chatbots, education advising, and information retrieval.
Types of Questions and Answers
Questions are classified by:
Type: fact-based ("Who", "Which") or explanation ("Why", "How").
Answer format: short words, phrases, paragraphs, lists, yes/no.
QA systems must detect intent, identify key entities, and generate appropriate responses, often involving classification, slot filling, and retrieval.
Machine Reading Comprehension (MRC) and SQuAD
MRC tasks involve systems understanding a passage to answer questions. The SQuAD
dataset contains over 100,000 annotated (passage, question, answer) triples, with
answers typically as short spans within passages. Models like BERT and BiDAF
achieve near-human performance by predicting answer span boundaries.
BiDAF: Bidirectional Attention Flow
BiDAF encodes context and query with word and character embeddings, applies
bidirectional LSTMs, and computes two types of attention:
Context-to-query: focusing on relevant query words for each context word.
Query-to-context: identifying context words most relevant to the query.
This bidirectional attention facilitates accurate span prediction for answers.
BERT for Reading Comprehension
BERT is pre-trained on large corpora with masked language modeling and next sentence prediction, producing deep contextual embeddings. Fine-tuned for QA,
BERT predicts answer spans by classifying start and end positions, achieving
performance close to or surpassing human levels on datasets like SQuAD.
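For a quick span-prediction demo, the Hugging Face transformers pipeline can be used (a sketch assuming that library and a public SQuAD-tuned checkpoint such as distilbert-base-cased-distilled-squad):

```python
from transformers import pipeline

# Assumes the transformers library and a public SQuAD-fine-tuned checkpoint.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
result = qa(
    question="Where was the treaty signed?",
    context="The treaty was signed in Geneva in 1864 after lengthy negotiations.",
)
print(result["answer"], round(result["score"], 3))   # predicted span and confidence
```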
Architectures in QA: Bi-Encoder vs Cross-Encoder
Bi-Encoder: independently encodes question and passage into embeddings,
enabling fast retrieval but with less nuanced relevance.
Cross-Encoder: jointly encodes question and passage, considering full interaction,
yielding higher accuracy but at higher computational cost.
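The contrast can be sketched with the sentence-transformers package (assuming that library and public checkpoints such as all-MiniLM-L6-v2 and cross-encoder/ms-marco-MiniLM-L-6-v2):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

question = "How do I reset my password?"
passages = ["Click 'Forgot password' on the login page to reset it.",
            "Our office is closed on public holidays."]

# Bi-encoder: encode question and passages independently, then compare embeddings.
bi = SentenceTransformer("all-MiniLM-L6-v2")
q_emb, p_emb = bi.encode(question), bi.encode(passages)
print(util.cos_sim(q_emb, p_emb))          # fast; passage embeddings are precomputable

# Cross-encoder: score each (question, passage) pair jointly; slower, more accurate.
cross = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross.predict([(question, p) for p in passages]))
```

In practice the bi-encoder embeddings are precomputed for the whole collection, and the cross-encoder is used only to re-rank the top retrieved passages.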
Semantic Search with Sentence-BERT (SBERT)
SBERT modifies BERT with Siamese networks to generate semantically meaningful
sentence embeddings, enabling efficient similarity search and clustering. It drastically
reduces inference time compared to full BERT models, making large-scale retrieval feasible.
Inverted File Index for Semantic Retrieval
Semantic retrieval employs inverted indexes mapping quantized embedding clusters
to documents. When a query is embedded, it is mapped to relevant clusters, and
documents within these clusters are retrieved and ranked by similarity, enabling scalable semantic search.
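A sketch of this retrieval pattern using the faiss library (assuming faiss is installed; the random vectors stand in for real sentence embeddings):

```python
import numpy as np
import faiss

d, n_docs, nlist = 128, 10000, 64                  # embedding dim, corpus size, clusters
docs = np.random.random((n_docs, d)).astype("float32")   # stand-in document embeddings

quantizer = faiss.IndexFlatL2(d)                   # coarse quantizer defining the clusters
index = faiss.IndexIVFFlat(quantizer, d, nlist)    # inverted file over those clusters
index.train(docs)                                  # learn the cluster centroids
index.add(docs)                                    # assign documents to inverted lists

index.nprobe = 8                                   # how many clusters to visit per query
query = np.random.random((1, d)).astype("float32")
distances, ids = index.search(query, 5)            # top-5 nearest documents
print(ids)
```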
Contrastive Learning for Sentence Embeddings
Contrastive learning trains models to bring similar sentence pairs closer and push
dissimilar pairs apart using loss functions like contrastive loss, triplet loss, or InfoNCE.
This enhances the semantic quality of embeddings, improving retrieval and clustering tasks.
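A compact in-batch InfoNCE loss in PyTorch (a sketch; the embeddings are random placeholders and the temperature value is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.05):
    # Row i of `anchors` should match row i of `positives`; every other row is a negative.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature                  # pairwise cosine similarities
    labels = torch.arange(a.size(0))                # the diagonal holds the positives
    return F.cross_entropy(logits, labels)

anchors = torch.randn(8, 256)      # e.g., two dropout views or paraphrase pairs
positives = torch.randn(8, 256)
print(info_nce(anchors, positives))
```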
SimCSE: Simple Contrastive Learning of Sentence Embeddings
SimCSE improves sentence representations by:
Unsupervised: using dropout to generate different embeddings of the same sentence as positive pairs.
Supervised: leveraging labeled paraphrase data. It employs contrastive loss to
produce high-quality, semantically meaningful embeddings suitable for retrieval
and semantic similarity tasks.
This comprehensive overview spans the core concepts, algorithms, models, and
applications across syntactic parsing, machine translation, text summarization, and
question answering, providing a solid foundation for advanced NLP studies and
practical implementation.