



















Natural Language Processing AC3110E
Lecturer: PhD. DO Thi Ngoc Diep
A collection of rules: complex sets of hand-written rules.
The computer emulates NLP tasks by applying those rules to the data it confronts.
Hand-coding of a set of rules for manipulating symbols, coupled with a dictionary lookup.
Introduction of machine learning algorithms for language processing.
Increase in computational power and the enormous amount of data available.
In the 2010s, deep neural network-style machine learning methods (featuring many hidden layers)
achieved state-of-the-art results in many natural language tasks.
Part 1: Introduction to Machine Learning in NLP
Part 2: Document representation in machine learning
Part 3: Sequence labeling task
Part 1: Introduction to Machine Learning in NLP
Machine Learning (ML) is a research field of Artificial Intelligence.
A computer has the ability to learn from past data.
Training examples create a space of samples/instances.
The computer builds a general model/function over this space.
A target function f maps each sample to its output Y.
The objective in machine learning is to find an approximation
function close to the target function f.
For new samples, this approximation should produce sufficiently accurate predictions.
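As a minimal sketch of finding an approximation function close to a target function f, the example below fits a straight line to noisy training examples by ordinary least squares and then predicts on a new sample. The target function `target_f`, the noise level, and the sample values are hypothetical choices made for illustration; in practice f is unknown and only the training examples are available.

```python
import random

def target_f(x):
    """Hypothetical target function; unknown to the learner in practice."""
    return 2.0 * x + 1.0

def fit_line(xs, ys):
    """Closed-form ordinary least squares for y ~ a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Training examples: inputs with noisy observations of the target function.
random.seed(0)
train_x = [i / 10 for i in range(50)]
train_y = [target_f(x) + random.gauss(0, 0.1) for x in train_x]

a, b = fit_line(train_x, train_y)

def f_hat(x):
    """Learned approximation of the target function."""
    return a * x + b

# Prediction for a new sample that was not in the training set.
new_x = 7.5
print(f"learned a={a:.3f}, b={b:.3f}")
print(f"prediction={f_hat(new_x):.3f}, target={target_f(new_x):.3f}")
```

The learned coefficients should land close to the slope and intercept of `target_f`, which is what "sufficiently accurate predictions" means in this toy setting.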
Regression: if Y is a numerical value (a real number)
Predicting the age of a person from an image, ...
Classification: if Y belongs to a discrete set (a fixed label set)
Spam email classification (Spam or Not Spam), Image classification (cat or dog), ...
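To illustrate the discrete-label case, here is a small sketch of a k-nearest-neighbours classifier for the spam example. The two features (number of links, number of exclamation marks) and all data points are hypothetical; a real spam filter would use far richer features.

```python
import math
from collections import Counter

# Hypothetical training set: (number of links, number of "!") -> label
train = [
    ((8, 5), "spam"),
    ((6, 7), "spam"),
    ((7, 3), "spam"),
    ((0, 1), "not spam"),
    ((1, 0), "not spam"),
    ((2, 1), "not spam"),
]

def knn_predict(x, examples, k=3):
    """Return the majority label among the k training points closest to x."""
    neighbours = sorted(examples, key=lambda ex: math.dist(x, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict((9, 4), train))  # spam
print(knn_predict((1, 1), train))  # not spam
```

The output is always one of the labels seen in training, which is the difference from regression, where the output is a real number.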
Unsupervised learning
Y can be clusters of data: clustering
Y can be hidden structures: trend detection, community detection
Discover customer segments for marketing purposes, Article clustering based on topics and abstract text, ...
Reinforcement learning
No training data
Make decisions and receive feedback from the environment to reinforce behavior
Game application, Robot navigation, etc.
Active learning
Deep Learning
Machine learning with multi-layer neural networks
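The clustering case, where no labels are given and the groups have to be discovered, can be sketched with a one-dimensional k-means. The spending values and the initial centroids below are hypothetical, and the fixed iteration count is a simplification (a fuller implementation would stop once assignments no longer change).

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on scalar values, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend of eight customers, with no labels attached.
spend = [12, 15, 14, 10, 95, 102, 99, 110]
centroids, clusters = kmeans_1d(spend, centroids=[0, 50])
print(centroids)  # [12.75, 101.5]
print(clusters)   # [[12, 15, 14, 10], [95, 102, 99, 110]]
```

The two resulting groups play the role of customer segments: nothing in the input said which customer belongs where.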
Examples: https://www.datasciencecentral.com/machine-learning-in-one-picture/
Machine Learning Algorithms
• Classification:
  • Linear classifiers, K-Nearest Neighbours, Support Vector Machines, Decision Tree Classification, Random Forest Classification, Logistic Regression, etc.
• Regression:
  • Linear Regression, Multivariate Regression, Multiple Regression, etc.
• Bayesian: apply Bayes' theorem to classification and regression problems.
  • Naive Bayes, Gaussian Naive Bayes, Multinomial Naive Bayes, Bayesian Belief Network, Bayesian Network, etc.
• Clustering:
  • k-Means, k-Medians, Expectation Maximisation, Hierarchical Clustering, etc.
• Artificial Neural Network: complex pattern matching and prediction processes in classification and regression problems, modeled on the biological neurons in the human brain
  • Perceptron, Multilayer Perceptrons, Stochastic Gradient Descent, Back-Propagation, etc.
• Deep Learning Algorithms: modernized versions of artificial neural networks that use self-taught learning constructs with many hidden layers to handle big data and exploit more powerful computational resources
  • Convolutional Neural Networks, Recurrent Neural Networks, Long Short-Term Memory Networks, Encoder-decoder sequence-to-sequence, etc.
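As one concrete instance from the Bayesian family, the sketch below applies Bayes' theorem to text classification with a Multinomial Naive Bayes model and add-one (Laplace) smoothing. The four training documents and the labels are hypothetical, and skipping words outside the training vocabulary is one simple design choice among several.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(tokens, model):
    """Pick the label maximising log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training documents, already tokenized.
training_docs = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting at noon".split(), "not spam"),
    ("project meeting notes".split(), "not spam"),
]
model = train_nb(training_docs)
print(predict_nb("free money".split(), model))     # spam
print(predict_nb("meeting notes".split(), model))  # not spam
```

Working in log space keeps the product of many small probabilities from underflowing on longer documents.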
2. Machine learning for natural language processing
• NLP models: finding relationships between the constituent parts of
language (the letters, words, and sentences, and other aspects of text) found in a text dataset
• NLP architectures use various methods for data preprocessing, feature extraction, and modeling.
• The process in natural language processing tasks
  • Modeling
    • Mathematical representation of the tasks
    • Tasks of classification, regression, clustering, etc.
  • Data collection: in the form of text documents, or speech recordings
  • Data preprocessing:
    • Stemming and lemmatization, Sentence segmentation, Stop word removal, Tokenization, POS, normalization
  • Feature extraction
    • To describe a text/document
    • Features:
      • Non-linguistic features: Document length, n-gram sequence counts, etc.
      • Linguistic features: morphological/lexical/syntactic/semantic representations, Bag-of-Words, TF-IDF, etc.
  • Model Training, Optimizing, Evaluation
  • Model deployment
Stanford CoreNLP
spaCy
  • supports more than 66 languages.
  • provides pre-trained word vectors
  • implements many popular models like BERT.
Gensim
  • Train large-scale semantic NLP models
OpenNLP
Numpy
  • N-dimensional array object
  • Linear Algebra, Fourier Transform
  • Complex computation functions
  • Random number generator
Draw graphs, histograms, charts, etc.
Data mining, Data analysis
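The libraries listed here provide these steps ready-made. As a dependency-free sketch of the preprocessing and feature-extraction stages of the process (tokenization, stop word removal, Bag-of-Words, TF-IDF), the example below uses only the Python standard library. The stop word list, the three documents, and the particular TF-IDF variant (term frequency normalised by document length, idf = log(N / df)) are assumptions made for illustration; libraries often use smoothed variants.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "on", "and", "of"}

def preprocess(text):
    """Lowercase, tokenize on alphabetic runs, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(tokens):
    """Bag-of-Words: term -> raw count, ignoring word order."""
    return Counter(tokens)

def tf_idf(corpus_tokens):
    """One {term: weight} dict per document (documents must be non-empty)."""
    n_docs = len(corpus_tokens)
    doc_freq = Counter()
    for tokens in corpus_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in corpus_tokens:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical mini-corpus.
docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs",
]
tokens = [preprocess(d) for d in docs]
bows = [bag_of_words(t) for t in tokens]
weights = tf_idf(tokens)

print(tokens[0])   # ['cat', 'sat', 'mat']
print(bows[0])
print(weights[0])
```

In the first document, "sat" gets a lower weight than "cat" because it also occurs in the second document. Note that "cat" and "cats" stay separate terms here, which is the gap that stemming or lemmatization would close.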