Natural Language Processing
AC3110E
Lecturer: PhD. DO Thi Ngoc Diep
Word
Every human language is composed of words
“Fundamental building block of language”
A word carries meaning (objective or practical meaning) and can be used on its own
Every task of NLP requires extensive knowledge about words
How to model words (or to get the base form of a word)?
• Morphological parsing
• Normalization
• Stemming, lemmatization
How to estimate the distance between words?
• Edit distance calculation
Long English words: a 45-letter word, and a 21-letter word that is the longest word “in common usage.”
Source: https://www.grammarly.com/blog/14-of-the-longest-words-in-english/
2 main classes of morphemes:
• Stems: the “main” morpheme of the word, supplying the main meaning
• Affixes: prefixes, suffixes, infixes, and circumfixes
Ex.: (to say) → (said)
Ex. unbelievably: 1 stem (believe) + 3 affixes (un-, -able, -ly)
4 ways to combine morphemes to create words:
• Inflection
• Derivation
• Compounding
• Cliticization
Inflection
A word stem + a grammatical morpheme → a word of the same class
Usually fills some syntactic function, like agreement
Plural on nouns: -s, -es
Verbal inflection: -s, -ing, -ed
Ex. in French: https://french.kwiziq.com/french-grammar-cefr-A0
Derivation
A word stem + a grammatical morpheme → a word of a different class
Irregular meaning change
Formation of new nouns from verbs or adjectives: -ation, -ee, -er, -ness
verb + suffix → noun
Formation of new adjectives from verbs or nouns: -al, -able, -less
Many paths are possible…
Compounding
Combination of multiple word stems together.
A word stem + a clicizaon
clicizaon: a morpheme that acts syntaccally like a word, but reduced in form and
aached to another word
Ex in Arabic
Some words refuse to follow the rules
• Nouns:
  Irregulars: mouse/mice, goose/geese, ox/oxen
• Verbs:
  Regulars: walk, walks, walking, walked, walked
  Irregulars:
    eat, eats, eating, ate, eaten
    catch, catches, catching, caught, caught
    cut, cuts, cutting, cut, cut
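One simple way to cope with words that "refuse to follow the rules" is to consult an exception table before applying the regular rule. The sketch below is illustrative only: the table `IRREGULAR_VERBS` and the function `past_forms` are hypothetical names, and the regular case is plain concatenation of -ed with no spelling rules applied.

```python
# Hypothetical exception table: stem -> (past, past participle).
IRREGULAR_VERBS = {
    "eat": ("ate", "eaten"),
    "catch": ("caught", "caught"),
    "cut": ("cut", "cut"),
}


def past_forms(stem: str) -> tuple[str, str]:
    """Return (past, past participle) for a verb stem."""
    # Irregular verbs are looked up; they cannot be derived by rule.
    if stem in IRREGULAR_VERBS:
        return IRREGULAR_VERBS[stem]
    # Regular verbs: append -ed (simplification: no spelling changes).
    regular = stem + "ed"
    return (regular, regular)


for verb in ("walk", "eat", "catch", "cut"):
    print(verb, past_forms(verb))
```

Stems that need a spelling change in the regular case (for example a final e or a doubled consonant) are outside what this sketch handles.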
Morphological parsing
Convert a surface form to its stem + morphological features
Need to prepare:
• Surface form: the input words
• Lexicon: list of stems + affixes + class (N, V, etc.)
• Morphotactics: model of morpheme ordering (N+PL, V+Past, etc.)
• Orthographic rules: model the spelling changes that occur in a word when morphemes combine (city + -s → cities)
A surface form can be ambiguous!
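The components listed above (lexicon, morphotactics, orthographic rules) can be sketched in a few lines for English nouns. This is a toy illustration under stated assumptions, not a finite-state implementation: the lexicon contents, the names `LEXICON`, `IRREGULAR_PLURALS`, and `parse_noun`, and the three spelling patterns are all hypothetical.

```python
# Hypothetical toy lexicon: stem -> class.
LEXICON = {"city": "N", "fox": "N", "cat": "N", "goose": "N", "mouse": "N"}
# Irregular plural surface forms are listed explicitly.
IRREGULAR_PLURALS = {"geese": "goose", "mice": "mouse"}


def parse_noun(surface: str) -> list[str]:
    """Return every analysis 'stem+N+SG' or 'stem+N+PL' for a surface form."""
    analyses = []
    if LEXICON.get(surface) == "N":
        analyses.append(f"{surface}+N+SG")
    if surface in IRREGULAR_PLURALS:
        analyses.append(f"{IRREGULAR_PLURALS[surface]}+N+PL")

    # Orthographic rules, undone in reverse to propose candidate stems.
    candidate_stems = []
    if surface.endswith("ies"):
        candidate_stems.append(surface[:-3] + "y")  # cities -> city
    if surface.endswith("es") and surface[:-2].endswith(("x", "s", "z")):
        candidate_stems.append(surface[:-2])  # foxes -> fox
    if surface.endswith("s") and not surface[:-1].endswith(("x", "s", "z", "y")):
        candidate_stems.append(surface[:-1])  # cats -> cat

    # Morphotactics: a plural suffix may only follow a stem of class N.
    for stem in candidate_stems:
        if LEXICON.get(stem) == "N":
            analyses.append(f"{stem}+N+PL")
    return analyses


for word in ("cities", "foxes", "cats", "geese", "goose"):
    print(word, parse_noun(word))
```

The function returns a list because one surface form may have several analyses; an empty list means the form is not covered by the toy lexicon.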
Morphotaccs Modeling: represent the morphotacc structure
FSA
Nominal inecon
Verbal inecon
Morphotaccs Modeling
FSA
Derivaon morphology:
more complex
A fragment of English adjecve morphology
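An FSA for nominal inflection can be written as a transition table over morpheme classes. The slide's figure is not recoverable from the text, so the states and labels below follow a common textbook layout and should be read as an assumption; `TRANSITIONS`, `ACCEPTING`, and `accepts` are illustrative names.

```python
# Assumed FSA: q0 is the start state; q1 and q2 are accepting.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
ACCEPTING = {"q1", "q2"}


def accepts(morpheme_classes: list[str]) -> bool:
    """Check whether a sequence of morpheme classes is a valid noun."""
    state = "q0"
    for label in morpheme_classes:
        state = TRANSITIONS.get((state, label))
        if state is None:  # no arc for this label: reject
            return False
    return state in ACCEPTING


print(accepts(["reg-noun", "plural-s"]))       # fox + -s
print(accepts(["irreg-pl-noun", "plural-s"]))  # geese + -s
```

The machine only encodes morpheme ordering; mapping actual letters to these classes is the job of the lexicon.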
Lexical modeling:
Use a finite-state transducer (FST) to do parsing
An FST takes as input a string of letters (surface) and produces as output a string of morphemes (lexical)
Morphotactic FSA + regular and irregular noun stems
above: morphological element
below: surface form
ˆ : morpheme boundary
# : word boundary
Won’t work for cases where there is a spelling change
Spelling rules (orthographic rules)
To handle irregular spelling changes, add intermediate tapes with intermediate symbols
E-insertion rule: insert an e after a morpheme-final x, s, or z, and before the morpheme -s
Transducer for the E-insertion rule
Combine the lexicon with the spelling rules
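The cascade from lexical tape to intermediate tape to surface tape can be imitated with two small functions. This sketch uses a regular expression in place of a real transducer, writes the morpheme boundary as ASCII `^`, and assumes a hypothetical stem list `NOUN_STEMS`; it covers only the E-insertion rule as stated above.

```python
import re

# Hypothetical regular noun stems.
NOUN_STEMS = {"fox", "cat", "bus"}


def lexical_to_intermediate(lexical: str) -> str:
    """'fox +N +PL' -> 'fox^s#', 'fox +N +SG' -> 'fox#'."""
    stem, category, number = lexical.split()
    if stem not in NOUN_STEMS or category != "+N" or number not in ("+SG", "+PL"):
        raise ValueError(f"cannot generate from {lexical!r}")
    return stem + ("^s" if number == "+PL" else "") + "#"


def intermediate_to_surface(intermediate: str) -> str:
    """Apply E-insertion, then erase the boundary symbols ^ and #."""
    # Insert e after a morpheme-final x, s, or z when the next morpheme is s.
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\g<1>e^", intermediate)
    return with_e.replace("^", "").replace("#", "")


def generate(lexical: str) -> str:
    """Compose the two steps: lexical -> intermediate -> surface."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))


for form in ("fox +N +PL", "cat +N +PL", "bus +N +PL", "fox +N +SG"):
    print(form, "->", lexical_to_intermediate(form), "->", generate(form))
```

Keeping the intermediate form explicit is what lets the spelling rule see the morpheme boundary, which the surface string alone does not show.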
Recognize strings in the language: Accept or Reject
Parse a string (morphological analysis): find the morphological structure in it
Produce/generate a surface form from a morphological structure
Parsing and generating
Ambiguity: unionizable
• union-ize-able
• un-ion-ize-able
Ways to deal with this problem:
• Simply take the first output found
• Find all the possible outputs (all paths) and return them all (without choosing)
• Bias the search so that only one or a few likely paths are explored
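The three strategies above can be compared on the unionizable example with a small exhaustive segmenter. The morpheme inventory `SURFACE_TO_MORPHEME` is hypothetical (it maps the surface piece "iz" to the morpheme -ize to account for the dropped e), and the "fewest morphemes" preference is just one assumed way to bias the choice, not a method given in the slides.

```python
# Hypothetical inventory: surface piece -> morpheme it realizes.
SURFACE_TO_MORPHEME = {
    "un": "un", "ion": "ion", "union": "union", "iz": "ize", "able": "able",
}


def segmentations(word: str) -> list[list[str]]:
    """Return every way to split word into known morphemes (all paths)."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in SURFACE_TO_MORPHEME:
            for rest in segmentations(word[end:]):
                results.append([SURFACE_TO_MORPHEME[piece]] + rest)
    return results


all_parses = segmentations("unionizable")
first_parse = all_parses[0]           # strategy 1: take the first output found
preferred = min(all_parses, key=len)  # strategy 3: an assumed bias

print(["-".join(parse) for parse in all_parses])  # strategy 2: return them all
print("-".join(first_parse), "|", "-".join(preferred))
```

In this sketch the bias is applied after enumerating every path; a real system that biases the search would prune unlikely paths during exploration.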
What counts as a word?
• Punctuation marks?
• Dates (12th July), percentages (55%)?
• Capitalized tokens vs. uncapitalized tokens?
• Abbreviations?
• Filler words?
Ex.: “At around 9:25 a.m., VN-Index lost the 1,150 point mark - an important support … increasing in price by 2% or more such as SSI, SHS, VCI, HCM, MBS and FTS in …”
Text normalization: done before almost any natural language processing of a text
1. Tokenizing (segmenting) words
2. Normalizing word formats
3. Segmenting sentences
Depends on the NLP task
Tokenization: the problem of segmenting text into tokens (characters, words)
Word boundary: is whitespace sufficient?
Punctuation: “job, …/02/06, http://www.google.com
Cliticization vs. apostrophe: … vs. …
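Whitespace alone mishandles the cases raised above (times, numbers with separators, abbreviations, URLs, hyphenated names). A rule-based tokenizer is one common workaround; the pattern below is a hypothetical sketch built on Python's standard `re` module, and its alternatives and their order are assumptions chosen for these examples.

```python
import re

# Alternatives are tried left to right, so more specific ones come first.
TOKEN_PATTERN = re.compile(r"""
    https?://\S+                # URLs
  | \d+(?:st|nd|rd|th)\b        # ordinals such as 12th
  | \d+(?:[.,:/]\d+)*%?         # numbers, times, dates, percentages
  | (?:[A-Za-z]\.){2,}          # abbreviations such as a.m.
  | \w+(?:[-']\w+)*             # words with internal hyphens or apostrophes
  | [^\w\s]                     # any other single punctuation character
""", re.VERBOSE)


def tokenize(text: str) -> list[str]:
    """Split text into tokens according to TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("At around 9:25 a.m., VN-Index lost the 1,150 point mark"))
print(tokenize("increasing in price by 2% or more"))
```

The pattern keeps "9:25", "a.m.", "VN-Index", and "1,150" as single tokens while splitting off the comma. A URL followed directly by punctuation would absorb that punctuation, which shows why the right rules depend on the task.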
