lOMoARcPSD| 59703641
Natural Language Processing
AC3110E
1
2
Lecturer: PhD. DO Thi Ngoc Diep
3
collecon of rules, complex sets of hand-wrien rules.
computer emulates NLP tasks by applying those rules to the data it confronts.
hand-coding of a set of rules for manipulang symbols, coupled with a diconary
lookup
introducon of
machine learning
algorithms for language processing
increase in computaonal power + enormous amount of data available
In the 2010s, deep neural network-style (featuring many hidden layers) machine
learning methods
achieve state-of-the-art results in many natural language tasks
lOMoARcPSD| 59703641
Part 1: Introducon of Machine Learning in NLP
Document representaon in machine learning
Part 3: Sequence Labeling task
lOMoARcPSD| 59703641
Part 1: Introducon of Machine Learning
in NLP
lOMoARcPSD| 59703641
Machine Learning (ML) is a research eld of Arcial Intelligence
A computer has the ability to learn based on
v
past
Training examples will create a space of samples/instances
Computer will build a general model/funcon about this space
lOMoARcPSD| 59703641
1
A target funcon
v
The objecve in machine learning is to nd an approximaon
funcon close to the target funcon f
For new samples, enable to produce suciently accurate predicons
lOMoARcPSD| 59703641
Regression: if
is a numerical value (a real number value)
Predicng age of a person from image, ...
belongs to a discrete set (a x label set)
Spam emails classicaon (Spam or Not Spam), Image classicaon (cat or dog), ...
Y can be clusters of data: clustering
Y can be hidden structures: trend detecon, community detecon
Discover customer segments for markeng purposes, Arcle clustering based on topics and abstract text,
Reinforcement learning
No training data
Make decisions and receive feedback from the environment to reinforce behavior
Game applicaon, Robot navigaon, etc.
Acve learning
Deep Learning
Machine learning with Mul-layer Neuron Networks
lOMoARcPSD| 59703641
Examples
hps://www.datasciencecentral.com/machine-learning-in-one-picture/
lOMoARcPSD| 59703641
lOMoARcPSD| 59703641
Machine Learning Algorithms
Classicaon :
Linear classiers, K-Nearest Neighbours, Support vector machines, Decision Tree
Classicaon, Random Forest Classicaon, Logisc regression, etc.
Regression :
Linear Regression, Mulvariate Regression, Mulple Regression, etc.
Bayesian : apply the Bayes theorem for classicaon and regression problems.
Naive Bayes, Gaussian Naive Bayes, Mulnomial Naive Bayes, Bayesian Belief Network,
Bayesian Network, etc.
Clustering :
k-Means, k-Medians, Expectaon Maximisaon, Hierarchical Clustering, etc.
Arcial Neural Network : complex paern matching and predicon processes in
classicaon and regression problems based from the biological neurons in the
human brain
Perceptron, Mullayer Perceptrons, Stochasc Gradient Descent, Back-Propagaon, etc.
Deep Learning Algorithms: modernized versions of arcial neural network, uses
self-taught learning constructs with many hidden layers, to handle big data and
provides more powerful computaonal resources
Convoluonal Neural Network, Recurrent Neural Networks, Long Short-Term Memory
Networks, Encoder-decoder sequence-to-sequence, etc.
lOMoARcPSD| 59703641
2. Machine learning for natural language processing
NLP models: nding relaonships between the constuent parts of
language (the leers, words, and sentences, and other aspects of text)
found in a text dataset
• NLP architectures use various methods for data preprocessing, feature
extracon, and modeling.
The process in natural language processing tasks
Modeling
Mathemacal representaon of the tasks • Tasks of classicaon, regression, clustering, etc.
Data collecon: in the form of text documents, or speech recordings
Data preprocessing:
Stemming and lemmazaon, Sentence segmentaon, Stop word removal, Tokenizaon, POS,
normalizaon
Feature extracon
To describe a text/document
Features:
Non-linguisc features: Document length, n-gram sequence counts, etc.
Linguisc features: morphological/lexical/syntacc/semanc representaons, Bag-of-Words, TF-IDF, etc.
Model Training, Opmizing, Evaluaon
Model deployment
lOMoARcPSD| 59703641
lOMoARcPSD| 59703641
lOMoARcPSD| 59703641
lOMoARcPSD| 59703641
lOMoARcPSD| 59703641
lOMoARcPSD| 59703641
lOMoARcPSD| 59703641
4
Stanford CoreNLP
spaCy
supports more than 66 languages.
provides pre-trained word vectors
implements many popular models like BERT.
Gensim
Train large-scale semanc NLP models
OpenNLP
lOMoARcPSD| 59703641
Numpy
N-dimensional array object
Linear Algebra, Fourier Transform
Complex computaon funcons
Random number generator
Draw graphs, histograms, charts, etc.
Data mining, Data analysis

Preview text:

lOMoAR cPSD| 59703641 Natural Language Processing AC3110E 1
Lecturer: PhD. DO Thi Ngoc Diep 2
collection of rules, complex sets of hand-written rules.
computer emulates NLP tasks by applying those rules to the data it confronts.
hand-coding of a set of rules for manipulating symbols, coupled with a dictionary lookup
introduction of machine learning algorithms for language processing
increase in computational power + enormous amount of data available
In the 2010s, deep neural network-style (featuring many hidden layers) machine learning methods
achieve state-of-the-art results in many natural language tasks 3 lOMoAR cPSD| 59703641
Part 1: Introduction of Machine Learning in NLP
Document representation in machine learning
Part 3: Sequence Labeling task lOMoAR cPSD| 59703641
Part 1: Introduction of Machine Learning in NLP lOMoAR cPSD| 59703641
Machine Learning (ML) is a research field of Artificial Intelligence
A computer has the ability to learn based on v past
Training examples will create a space of samples/instances
Computer will build a general model/function about this space lOMoAR cPSD| 59703641 1 A target function v
The objective in machine learning is to find an approximation
function close to the target function f
For new samples, enable to produce sufficiently accurate predictions lOMoAR cPSD| 59703641
Regression: if is a numerical value (a real number value)
Predicting age of a person from image, ...
belongs to a discrete set (a fix label set)
Spam emails classification (Spam or Not Spam), Image classification (cat or dog), ...
Y can be clusters of data: clustering
Y can be hidden structures: trend detection, community detection
Discover customer segments for marketing purposes, Article clustering based on topics and abstract text, Reinforcement learning No training data
Make decisions and receive feedback from the environment to reinforce behavior
Game application, Robot navigation, etc. Active learning Deep Learning
Machine learning with Multi-layer Neuron Networks lOMoAR cPSD| 59703641 Examples
https://www.datasciencecentral.com/machine-learning-in-one-picture/ lOMoAR cPSD| 59703641 lOMoAR cPSD| 59703641 Machine Learning Algorithms • Classification :
• Linear classifiers, K-Nearest Neighbours, Support vector machines, Decision Tree
Classification, Random Forest Classification, Logistic regression, etc. • Regression :
• Linear Regression, Multivariate Regression, Multiple Regression, etc.
• Bayesian : apply the Bayes theorem for classification and regression problems.
• Naive Bayes, Gaussian Naive Bayes, Multinomial Naive Bayes, Bayesian Belief Network, Bayesian Network, etc. • Clustering :
• k-Means, k-Medians, Expectation Maximisation, Hierarchical Clustering, etc.
• Artificial Neural Network : complex pattern matching and prediction processes in
classification and regression problems based from the biological neurons in the human brain
• Perceptron, Multilayer Perceptrons, Stochastic Gradient Descent, Back-Propagation, etc.
• Deep Learning Algorithms: modernized versions of artificial neural network, uses
self-taught learning constructs with many hidden layers, to handle big data and
provides more powerful computational resources
• Convolutional Neural Network, Recurrent Neural Networks, Long Short-Term Memory
Networks, Encoder-decoder sequence-to-sequence, etc. lOMoAR cPSD| 59703641
2. Machine learning for natural language processing
• NLP models: finding relationships between the constituent parts of
language (the letters, words, and sentences, and other aspects of text) found in a text dataset
• NLP architectures use various methods for data preprocessing, feature extraction, and modeling.
• The process in natural language processing tasks • Modeling
• Mathematical representation of the tasks • Tasks of classification, regression, clustering, etc.
• Data collection: in the form of text documents, or speech recordings • Data preprocessing:
• Stemming and lemmatization, Sentence segmentation, Stop word removal, Tokenization, POS, normalization • Feature extraction
• To describe a text/document • Features:
• Non-linguistic features: Document length, n-gram sequence counts, etc.
• Linguistic features: morphological/lexical/syntactic/semantic representations, Bag-of-Words, TF-IDF, etc.
• Model Training, Optimizing, Evaluation • Model deployment lOMoAR cPSD| 59703641 lOMoAR cPSD| 59703641 lOMoAR cPSD| 59703641 lOMoAR cPSD| 59703641 lOMoAR cPSD| 59703641 lOMoAR cPSD| 59703641 lOMoAR cPSD| 59703641 4 Stanford CoreNLP spaCy
supports more than 66 languages.
provides pre-trained word vectors
implements many popular models like BERT. Gensim
Train large-scale semantic NLP models OpenNLP lOMoAR cPSD| 59703641 Numpy N-dimensional array object
Linear Algebra, Fourier Transform
Complex computation functions Random number generator
Draw graphs, histograms, charts, etc. Data mining, Data analysis