
Chapter 4: Lexical representations | Natural Language Processing course | Đại học Bách Khoa Hà Nội (Hanoi University of Science and Technology)

Every task of NLP requires extensive knowledge about words. How to model words (or to get the base form of a word)?


Author: My Lữ


Natural Language Processing

AC3110E

Lecturer: PhD. DO Thi Ngoc Diep

Word

Every human language is composed of words

“Fundamental building block of language” –

of language

A word carries meaning (objective or practical meaning) and can be used on its own

Every task of NLP requires extensive knowledge about words

How to model words (or to get the base form of a word)?

Morphological parsing

Normalization

Stemming, Lemmatization

How to estimate the distance between words?

Edit distance calculation

(45 letters)

(21 letters) - the longest word “in common usage.”

https://www.grammarly.com/blog/14-of-the-longest-words-in-english/


2 main classes of morphemes:

Stems: the “main” morpheme of the word, supplying the main meaning

Affixes: prefixes, suffixes, infixes, and circumfixes

to say)

said

Ex. unbelievably

axes

un-, -able, -ly

4 ways to combine morphemes to create words

Inflection

Derivation

Compounding

Cliticization


A word stem + grammacal morpheme

a word of the same class

Usually lling some syntacc funcon like

agreement

Plural on nouns:

Verbal inecon:

s, -ing, -ed

Ex: in French

hps://french.kwiziq.com/french-grammar-cefr-A0


A word stem + grammacal morpheme

a word of the dierent class

Irregular meaning change

Formaon of new nouns from verbs or adjecves: -aon, -ee, -er, -ness

verb + -

noun

Formaon of new adjecves from verbs or noun: -al, -able, -less

Many paths are possible…

Combinaon of mulple word stems together.


A word stem + a clicizaon

clicizaon: a morpheme that acts syntaccally like a word, but reduced in form and

aached to another word

Ex in Arabic


Some words refuse to follow the rules

• Nouns:

• Irregulars: mouse/mice, goose/geese, ox/oxen

• Verbs:

• Regulars: Walk, walks, walking, walked, walked

• Irregulars:

• Eat, eats, eating, ate, eaten

• Catch, catches, catching, caught, caught

• Cut, cuts, cutting, cut, cut
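One common way to cope with words that "refuse to follow the rules" is to check an exception table before applying the regular rule. The sketch below assumes that approach; the tables hold only the examples from this slide, the function names are hypothetical, and the regular rules are deliberately naive (no spelling changes such as consonant doubling).

```python
# Exception tables, filled only with the irregular examples listed above.
IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese", "ox": "oxen"}
IRREGULAR_PASTS = {"eat": "ate", "catch": "caught", "cut": "cut"}


def plural(noun: str) -> str:
    """Look up irregular plurals first, otherwise append the regular -s."""
    return IRREGULAR_PLURALS.get(noun, noun + "s")


def past_tense(verb: str) -> str:
    """Look up irregular pasts first, otherwise append the regular -ed."""
    if verb in IRREGULAR_PASTS:
        return IRREGULAR_PASTS[verb]
    # Naive regular rule: avoid a doubled "e" for verbs such as "like".
    return verb + "d" if verb.endswith("e") else verb + "ed"


print(plural("goose"), plural("cat"))         # geese cats
print(past_tense("eat"), past_tense("walk"))  # ate walked
```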


Morphological parsing: convert a word to the stem + morphological features

Need to prepare

Surface form: input words

Lexicon: list of stems + affixes + class (N, V, etc.)

Morphotactics: model of morpheme ordering

N+PL, V+Past, etc.

Orthographic rules: model the changes that occur in a word

Ex: city + -s → cities

ambiguous!

Morphotaccs Modeling: represent the morphotacc structure

FSA

Nominal inecon

Verbal inecon

Morphotaccs Modeling

FSA

Derivaon morphology:

more complex

A fragment of English adjecve morphology
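As a rough illustration of morphotactics modeled with an FSA (finite-state automaton), the sketch below encodes nominal inflection as a transition table over morpheme classes. The state names, the class labels, and the tiny lexicon are assumptions made for this example; the automaton drawn on the original slide is not recoverable from the text.

```python
# Hypothetical mini-lexicon: morpheme -> class label used on the FSA arcs.
LEXICON = {
    "cat": "reg-noun", "dog": "reg-noun", "fox": "reg-noun",
    "goose": "irreg-sg-noun", "mouse": "irreg-sg-noun",
    "geese": "irreg-pl-noun", "mice": "irreg-pl-noun",
    "-s": "plural",
}

# (state, morpheme class) -> next state
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",
}
FINAL_STATES = {"q1", "q2"}


def accepts(morphemes: list[str]) -> bool:
    """Return True if the morpheme sequence is a legal noun form."""
    state = "q0"
    for morpheme in morphemes:
        label = LEXICON.get(morpheme)
        state = TRANSITIONS.get((state, label))
        if state is None:  # no arc for this class from the current state
            return False
    return state in FINAL_STATES


print(accepts(["cat", "-s"]))    # True
print(accepts(["goose", "-s"]))  # False: irregular nouns take no plural -s
```

The automaton works on morphemes, not letters, so it accepts `["fox", "-s"]` without knowing that the surface form is "foxes"; spelling is left to the orthographic rules.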

Lexical Modeling:

Use a finite-state transducer (FST) to do parsing

An FST takes as input a string of letters (surface)

and produces as output a string of morphemes (lexical)

Morphotactic FSA

regular and irregular noun stems

above: morphological element

below: surface form

ˆ : morpheme boundary

# : word boundary

Won’t work for cases where there is a spelling change
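The surface-to-lexical mapping can be sketched as a small letter-level transducer. This is a toy construction under stated assumptions: it covers regular noun stems only, the tags `+N`, `+SG`, `+PL` and the helper names are illustrative, and irregular stems are left out for brevity.

```python
from collections import defaultdict

EPS = ""  # epsilon arc: consumes no input symbol


def build_noun_fst(regular_stems):
    """Build arcs: state -> [(input symbol, output string, next state)]."""
    arcs = defaultdict(list)
    next_id = 1  # state 0 is the start state
    for stem in sorted(set(regular_stems)):
        state = 0
        for letter in stem:
            # Reuse an existing letter arc so stems share common prefixes.
            for inp, out, target in arcs[state]:
                if inp == letter and out == letter:
                    state = target
                    break
            else:
                arcs[state].append((letter, letter, next_id))
                state, next_id = next_id, next_id + 1
        arcs[state].append((EPS, " +N +SG", "final"))  # bare stem
        arcs[state].append(("s", " +N +PL", "final"))  # stem + plural s
    return arcs


def transduce(arcs, surface):
    """Return every lexical string produced for the surface string."""
    results = []
    stack = [(0, 0, "")]  # (state, input position, output so far)
    while stack:
        state, pos, out = stack.pop()
        if state == "final":
            if pos == len(surface):  # accept only if all input was consumed
                results.append(out)
            continue
        for inp, emitted, target in arcs.get(state, []):
            if inp == EPS:
                stack.append((target, pos, out + emitted))
            elif pos < len(surface) and surface[pos] == inp:
                stack.append((target, pos + 1, out + emitted))
    return results


fst = build_noun_fst(["cat", "dog", "fox"])
print(transduce(fst, "cats"))   # ['cat +N +PL']
print(transduce(fst, "foxes"))  # []
```

The empty result for "foxes" shows the limitation noted above: without a spelling rule, the transducer only knows "fox" + "s".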

Spelling rules (orthographic rules)

To handle irregular spelling changes, add intermediate tapes with intermediate symbols

E-insertion: insert e after a morpheme-final x, s, or z, and before the morpheme -s

Transducer for the E-insertion rule

surface: add

pairs “

ˆ:
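The E-insertion rule can be approximated with a regular-expression rewrite from the intermediate tape to the surface tape. This sketch is a stand-in for the transducer, not a reconstruction of it; it assumes the ASCII caret `^` for the morpheme boundary and `#` for the word boundary, and the function name is hypothetical.

```python
import re


def intermediate_to_surface(intermediate: str) -> str:
    """Apply E-insertion, then erase the boundary symbols."""
    # Insert "e" after a morpheme-final x, s, or z when the next morpheme
    # is the word-final -s: fox^s# -> foxe^s#
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1e^", intermediate)
    # Boundary symbols never appear on the surface tape.
    return with_e.replace("^", "").replace("#", "")


print(intermediate_to_surface("fox^s#"))  # foxes
print(intermediate_to_surface("cat^s#"))  # cats
```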

Combine

Recognize strings in the language: Accept or Reject

Parse a string (morphological analysis) to find the morphological structure in it

Produce/Generate a surface form from a morphological structure

Parsing and Generating

• Unionizable

• Union-ize-able

• Un-ion-ize-able

Ways to deal with this problem:

• Simply take the first output found

• Find all the possible outputs (all paths) and return them all (without choosing)

• Bias the search so that only one or a few likely paths are explored
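The three strategies above can be compared on "unionizable" with a small exhaustive segmenter. The morph inventory is hypothetical and maps surface pieces to lexical morphemes ("iz" stands for "ize" with its final e dropped before "-able"); preferring the analysis with the fewest morphemes is only one possible bias, not one taken from the slides.

```python
# Hypothetical inventory: surface piece -> lexical morpheme.
MORPHS = {"un": "un", "ion": "ion", "union": "union", "iz": "ize", "able": "able"}


def segmentations(word: str, morphs: dict[str, str]) -> list[list[str]]:
    """Return every way to split the word into known morphs (all paths)."""
    if not word:
        return [[]]  # one way to segment the empty string: no morphs
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in morphs:
            for rest in segmentations(word[end:], morphs):
                results.append([morphs[piece]] + rest)
    return results


all_paths = segmentations("unionizable", MORPHS)
first_found = all_paths[0]          # strategy 1: take the first output
preferred = min(all_paths, key=len)  # strategy 3: a simple bias (fewest morphemes)

print(all_paths)  # [['un', 'ion', 'ize', 'able'], ['union', 'ize', 'able']]
print(preferred)  # ['union', 'ize', 'able']
```

Returning `all_paths` unchanged corresponds to the second strategy, which leaves the choice to a later component.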

What counts as a word?

Punctuation marks?

Dates (12th July), percentages (55%)?

Capitalized tokens vs. uncapitalized tokens?

Abbreviations?

Filler words:

At around 9:25 a.m., VN-Index lost the 1,150 point mark - an important support

increasing in price by 2% or more such as SSI, SHS, VCI, HCM, MBS and FTS in

Done before almost any natural language processing of a text

1. Tokenizing (segmenting) words

2. Normalizing word formats

3. Segmenting sentences

Depends on the NLP tasks

Problem of segmenng text into tokens (characters, words)

Word boundary:

Whitespace is sucient ?

Punctuaon:

“job,

/02/06, hp://www.google.com

Clicizaon vs apostrophe:

vs.
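The cases above show why splitting on whitespace alone is not enough. A rule-based tokenizer can be sketched with one regular expression whose alternatives are tried in order; the pattern below is an assumption made for illustration, not the course's tokenizer, and it keeps apostrophe forms such as "don't" as a single token rather than splitting off the clitic.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    https?://\S+               # URLs are kept whole
  | \d{1,2}/\d{1,2}/\d{2,4}    # dates written with slashes
  | \d+(?:[.,:]\d+)*%?         # numbers, times, percentages: 1,150  9:25  2%
  | (?:[A-Za-z]\.){2,}         # dotted abbreviations such as a.m.
  | \w+(?:[-']\w+)*            # words, with internal hyphens or apostrophes
  | [^\w\s]                    # any other punctuation mark, one per token
""", re.VERBOSE)


def tokenize(text: str) -> list[str]:
    """Split text into tokens using the ordered alternatives above."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("At around 9:25 a.m., VN-Index lost the 1,150 point mark"))
# ['At', 'around', '9:25', 'a.m.', ',', 'VN-Index', 'lost', 'the', '1,150', 'point', 'mark']
```

The order of the alternatives matters: the URL and date patterns come first so that their slashes and dots are not split off as punctuation.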


