Chapter 1: Data Mining - reference material | Đại học Hoa Sen

Chapter 1: Data Mining
1. What is data mining:
Data mining is a set of techniques for automatically discovering relationships among data
items in huge, complex datasets, and for uncovering the patterns hidden in that data.
It is the process of distilling, or extracting, knowledge from large amounts of data.
Data mining is an important part of data analytics overall and one of the core disciplines of
data science, applying advanced analytics techniques to find useful information in a dataset. At a
more detailed level, data mining is one step in the process of discovering knowledge in
databases, a data science method for collecting, processing, and analyzing data. It is a
complex process that draws on intensive data warehousing as well as computing technologies.
Moreover, data mining is not limited to data extraction: it also covers data transformation,
cleaning, integration, and pattern analysis.
Some applications of Data Mining
There are many applications of Data Mining such as:
Market and securities analysis
Fraud detection
Risk management and business analysis
Customer lifetime value analysis
Data mining process:
Data mining is usually carried out by data scientists or analysts, though business analysts and
business executives who are knowledgeable about data can also do it.
The central elements of data mining are statistical analysis, artificial intelligence (AI), and the
data management tasks performed to prepare data for analysis. Machine learning algorithms and
AI tools automate the process and make it practical to mine huge datasets, such as customer
data, transaction records, and mobile application logs.
There are four main stages of the data mining process.
Collect data.
The data relevant to an analytics application is identified and assembled. It may reside
in different source systems, data warehouses, or data lakes (an increasingly common
repository in big data environments, holding a mixture of structured and
unstructured data). External data sources can also be used. Wherever the data comes from,
a data scientist usually moves it into a data lake for the remaining steps in the process.
Data preparation
This stage consists of a series of steps that ready the data for mining. It starts with
data exploration, profiling, and pre-processing, followed by cleaning to correct errors
and other data quality issues. Data conversions are also applied to make the dataset
consistent, unless the data scientist intends to analyze unfiltered raw data
for a particular application.
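The cleaning and conversion work described above can be sketched in a few lines of Python. The record fields ("age", "city") and the specific fixes are illustrative assumptions, not taken from the text.

```python
# Minimal sketch of the data-preparation stage: drop duplicates,
# repair a data-quality issue (inconsistent casing), and convert
# a field to a consistent type, treating bad values as missing.

def prepare(records):
    cleaned = []
    seen = set()
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:            # drop exact duplicate records
            continue
        seen.add(key)
        rec = dict(rec)
        # cleaning: normalise inconsistent city-name casing
        if rec.get("city") is not None:
            rec["city"] = rec["city"].strip().title()
        # conversion: coerce age to int; unparseable values become None
        try:
            rec["age"] = int(rec["age"])
        except (KeyError, TypeError, ValueError):
            rec["age"] = None
        cleaned.append(rec)
    return cleaned

rows = [
    {"age": "34", "city": " hanoi "},
    {"age": "34", "city": " hanoi "},   # duplicate record
    {"age": "n/a", "city": "HUE"},      # bad age value
]
print(prepare(rows))
```

Real pipelines would typically do this with a library such as pandas, but the steps are the same: profile, clean, convert.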
Data mining.
Once the data is prepared, a data scientist chooses an appropriate data mining technique
and then runs one or more algorithms to perform the extraction. In machine learning
applications, the algorithms usually must be trained on sample datasets to recognize the
information of interest before they are run across the entire dataset.
Analyze and interpret the data.
Data mining results are used to create analytical models that can help drive decision-
making and other business actions. The data scientist or another member of the data
science team must also communicate the results to business executives and users, often
through data visualization and the use of data storytelling techniques.
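The four stages above can be sketched as a toy Python pipeline. The function bodies and the transaction data are invented for illustration; a real pipeline would pull from actual source systems and apply real mining algorithms.

```python
# The four stages of the data mining process as a toy pipeline:
# collect -> prepare (clean) -> mine -> analyze/interpret.

def collect():
    # stand-in for pulling records from source systems into a data lake
    return [("bread", "milk"), ("bread", "butter"), ("milk", "butter")]

def clean(transactions):
    # preparation: deduplicate items within each transaction, sort for consistency
    return [tuple(sorted(set(t))) for t in transactions]

def mine(transactions):
    # toy "mining": count how often each item occurs
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return counts

def interpret(counts):
    # turn raw results into a statement a business user can act on
    top = max(counts, key=counts.get)
    return f"most frequent item: {top}"

print(interpret(mine(clean(collect()))))
```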
Key methods in data mining
Association rule method
One of the most common themes in data mining is the discovery of association rules. The
goal of association rule mining is to identify relationships and co-occurrences among
data items in a large database.
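Association rules are usually scored by support (how often the items occur together) and confidence (how often the right-hand side follows given the left-hand side). A minimal sketch, with an invented toy transaction database:

```python
# Support and confidence for the toy rule {bread} -> {milk}.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    # of the transactions containing lhs, the fraction that also contain rhs
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "milk"}))
print(confidence({"bread"}, {"milk"}))
```

Practical miners such as Apriori or FP-Growth find all rules above chosen support and confidence thresholds rather than scoring one rule at a time.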
Decision tree method
This is a simple form of knowledge representation that assigns data objects to a fixed set
of classes. The internal nodes of the tree are labelled with the names of data items, the
edges carry the possible values of those items, and the leaves represent the different
classes. An object is classified by following a path down the tree, taking at each node the
edge that matches the object's value for that data item, until a leaf is reached.
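The structure just described can be written down directly: nodes carry attribute names, edges carry values, and leaves carry class labels. The attributes and values below are invented for illustration.

```python
# A hand-built decision tree: an internal node is (attribute, branches),
# where branches maps each possible value to a subtree; a leaf is a
# plain string naming the class.

tree = ("outlook", {
    "sunny":    ("humidity", {
        "high":   "stay-in",
        "normal": "play",
    }),
    "overcast": "play",
    "rainy":    "stay-in",
})

def classify(tree, obj):
    # follow the path from the root to a leaf along the object's values
    while not isinstance(tree, str):
        attr, branches = tree
        tree = branches[obj[attr]]
    return tree

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))
```

In practice the tree itself is learned from labelled training data (e.g. with the ID3 or CART algorithms) rather than written by hand.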
K-Means method
Many methods are used for clustering; K-Means is considered one of the most basic. It
partitions a set of n objects into k clusters so that objects within the same cluster are
similar to one another, while objects in different clusters are dissimilar.
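K-Means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its cluster. A bare-bones sketch on 1-D points (the data and k = 2 are illustrative):

```python
# Bare-bones K-Means on 1-D points: n objects, k clusters.

def kmeans(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # update step: move each centroid to its cluster's mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.2, 0.8, 8.0, 8.2, 7.8], [0.0, 10.0])
print(sorted(centroids))
```

Real implementations work in many dimensions with Euclidean distance and stop when the assignments no longer change; the two-step structure is the same.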
Sequential pattern method
This method mines sequential, time-ordered patterns. Technically it resembles association
rule mining, but it adds ordering and timing. A rule describing a sequential pattern has
the form X -> Y, meaning that an occurrence of event X tends to lead to a subsequent
occurrence of event Y. The approach is widely applied in finance and the stock market
because such rules are highly predictive.
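The strength of a sequential rule X -> Y can be estimated as the fraction of sequences containing X in which Y occurs later. A small sketch over invented user-session data:

```python
# Confidence of a sequential rule X -> Y over time-ordered event sequences:
# among sequences containing X, how often does Y occur after it?

def rule_confidence(sequences, x, y):
    with_x = 0
    x_then_y = 0
    for seq in sequences:
        if x in seq:
            with_x += 1
            if y in seq[seq.index(x) + 1:]:   # y strictly after x
                x_then_y += 1
    return x_then_y / with_x if with_x else 0.0

sessions = [
    ["login", "search", "buy"],
    ["login", "browse"],
    ["search", "buy"],
    ["login", "search", "logout"],
]
# how strongly does "search" lead to a later "buy"?
print(rule_confidence(sessions, "search", "buy"))
```

Dedicated algorithms such as GSP or PrefixSpan enumerate all frequent sequential patterns rather than testing one rule at a time.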
2. How does a business obtain data?
As data collection and analytics technology advances, so does the ability of businesses to
contextualize data and draw new insights from it. Artificial intelligence is an important tool for
collecting and analyzing data and gathering information, and many businesses use it for a variety
of purposes: better understanding day-to-day operations, making more informed
business decisions, and learning about their customers.
Customer data can be collected in three ways: by asking customers directly, by tracking
customers indirectly, and by appending customer data from other sources.
Businesses are very good at collecting data of all kinds from almost every corner. The most
obvious sources are consumer activity on their websites and social media pages, along with
customer phone calls and face-to-face conversations, but there are also some more interesting
methods at work.
Types of consumer data that businesses collect
The consumer data that the business collects can be divided into four categories:
Personal data.
This category includes personally identifiable information, such as Social Security
numbers and gender, as well as nonpersonally identifiable information, including your
IP address, web browser cookies, and device IDs (which both your laptop and mobile
device have).
Engagement data.
This type of data details how consumers interact with a business's website, mobile apps,
text messages, social media pages, emails, paid ads, and customer service channels.
Behavioral data.
This category includes transactional details such as purchase histories, product usage
information (e.g., repeated actions), and qualitative data (e.g., mouse movement
information).
Attitudinal data.
This type of data covers measures of consumer satisfaction, purchase criteria,
product desirability, and the like.
3. The legal and ethical implications of data mining.
Privacy is part of the 1950 European Convention on Human Rights, which stipulates that
everyone has the right to respect for their private and family life, their home, and their
correspondence. On that basis, the countries of the European Union (EU) sought to secure
this right through a common legal instrument, especially once the Internet appeared. In 1995,
the EU adopted the European Data Protection Directive (95/46/EC), which set
minimum data security and privacy standards for member states to implement in their
own laws. However, the 1995 Directive was drafted at a time when only about 1% of the
world's population used the Internet. As Internet use exploded, it became clear that new laws
were needed to address the protection of personal data in the face of large-scale use of the
Internet and smart devices. This led to the General Data Protection Regulation (GDPR),
developed by the European Commission with the aim of reforming the protection of personal
data across the European Union.
Along with the Fourth Industrial Revolution, personal data has today become a
commodity: sought after by organizations and individuals, used for commercial purposes, and
at the same time used by states to manage their populations. As information technology
develops, ever more programs, systems, and measures collect information about and monitor
individuals at large scale, at the national and even the global level. The problem is that many
such programs and systems, built and operated by state, economic, commercial, technological,
and other entities, seriously violate the privacy of individuals.
In Vietnam, privacy is protected by various legal instruments, such as the Law on Electronic
Transactions, the Law on Information Technology, the Law on Protection of Consumer Rights,
the Law on Cyber Information Security, the Law on Cybersecurity, Decree No. 52/2013/ND-CP
on e-commerce, and Decree No. 72/2013/ND-CP on the management, provision, and use of
Internet services and online information.
In an effort to strengthen the legal framework for information privacy, Vietnam enacted the
Law on Cyber Information Security in 2015. The law defines personal information and the
principles of data privacy protection, and sets out provisions on the collection, use,
modification, and deletion of personal information, along with the government's responsibility
to protect personal data.
The 2018 Law on Cybersecurity requires businesses providing services in Vietnamese
cyberspace to notify users directly if their data is breached, damaged, or lost. This requirement
is similar to those set out in the EU's GDPR.
Privacy International, a non-profit organization, studied 34 popular Android apps (each with
between 10 and 500 million installs), all of which could transfer user data to Facebook through
a software development kit (SDK). Using its testing tool, Privacy International found that at
least 20 of the apps (61%) automatically transferred data to Facebook as soon as the user
opened the app, without the user's consent.
Meanwhile, the world around us is full of sensors collecting information without our
knowledge, notably software that can read data directly from the sensors built into our phones
and turn it into products. There are many opportunities for technology to capture human data,
and the intent is not always bad. While some worry that individual rights will suffer, the useful
practical applications of this technology are beyond doubt. Health care providers can analyze
voice data to detect disease (for example, changes in speech can be a sign of Alzheimer's
dementia), and teachers can learn how students respond to their lectures. Recognizing people's
inner states, and making subjective intangibles such as emotion measurable, are goals that
scientists pursue while also seeking to keep the technology in the service of humans.
We classify the entities that receive user data into parties: the first party, when the app
transmits user data to its developer or parent company (the user being the second party);
third parties, when the app transmits user data directly to external entities; and fourth
parties, companies with which those third parties share the user data onward. In the next
section, we present an application we have implemented that collects user data (only with
the user's consent) in order to improve the quality of the software.
