Big Data Storage and Processing (Lưu trữ và xử lý dữ liệu lớn) – Lecture Notes
Dr. Dao Thanh Chung, Hanoi University of Science and Technology

Lecture 1: Introduction to Apache Spark
IT4931 – Storage and Analysis of Big Data (Lưu trữ và phân tích dữ liệu lớn)
11/2021
Thanh-Chung Dao, Ph.D.
Lecture Agenda
• W1: Spark introduction + Lab
• W2: Spark RDD + Lab
• W3: Spark Machine Learning + Lab
• W4: Spark on Blockchain Storage + Lab
Today's Agenda
• History of Spark
• Introduction
• Components of Stack
• Resilient Distributed Dataset (RDD)
HISTORY OF SPARK
History of Spark
• 2002: MapReduce @ Google
• 2004: MapReduce paper
• 2006: Hadoop @ Yahoo!
• 2008: Hadoop Summit
• 2010: Spark paper
• 2014: Apache Spark becomes a top-level Apache project
History of Spark
circa 1979 Stanford, MIT, CMU, etc.
set/list operations in LISP, Prolog, etc., for parallel processing
www-formal.stanford.edu/ jmc/ history/lisp/ lisp.htm
circa 2004 Google
MapReduce: Simplified Data Processing on Large
Clusters
Jeffrey Dean and Sanjay Ghemawat
research.google.com/ archive/ mapreduce.html
circa 2006 Apache
Hadoop
, originating from the Nutch Project Doug Cutting
research.yahoo.com/ files/ cutting.pdf
circa 2008 Yahoo
web scale search indexing Hadoop Submit, HUG, etc.
developer.yahoo.com/hadoop/
circa 2009 Amazon AWS
Elastic MapReduce
Hadoop modified for EC2/S3, plus support for Hive, Pig, Cascading, etc.
aws.amazon.com/ elasticmapreduce/
MapReduce
Most current cluster programming models are based on acyclic data flow: from stable storage, through map and reduce stages, back to stable storage.
[Diagram: Input → Map tasks → Reduce tasks → Output]
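
To make the model concrete, here is a minimal MapReduce-style word count, sketched in PySpark (an assumption; the course labs may use another language, and the HDFS paths are hypothetical). Data flows acyclically from stable storage, through map and reduce stages, back to stable storage:

    # Minimal MapReduce-style word count (illustrative sketch).
    # Assumptions: PySpark is available; input/output paths are hypothetical.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCount")

    lines = sc.textFile("hdfs:///input/books")           # read from stable storage
    counts = (lines
              .flatMap(lambda line: line.split())        # Map: split lines into words
              .map(lambda word: (word, 1))               # Map: emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))          # Reduce: sum counts per word
    counts.saveAsTextFile("hdfs:///output/wordcounts")   # write back to stable storage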
MapReduce
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
• Iterative algorithms (machine learning, graphs)
• Interactive data mining tools (R, Excel, Python)
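
A short sketch of how caching the working set avoids this inefficiency, again assuming PySpark (the dataset path and loop body are hypothetical). Without cache(), every iteration would re-read and re-parse the input from storage:

    # Iterative workload over a cached working set (illustrative sketch).
    from pyspark import SparkContext

    sc = SparkContext(appName="IterativeDemo")

    points = (sc.textFile("hdfs:///input/points")        # hypothetical CSV of numbers
                .map(lambda line: [float(x) for x in line.split(",")])
                .cache())                                # keep the working set in memory

    n = points.count()                                   # first action materializes the cache
    for i in range(10):                                  # later iterations read from memory
        total = points.map(lambda p: sum(p)).reduce(lambda a, b: a + b)
        print("iteration", i, "mean coordinate sum:", total / n)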
Data Processing Goals
• Low-latency (interactive) queries on historical data: enable faster decisions
  E.g., identify why a site is slow and fix it
• Low-latency queries on live data (streaming): enable decisions on real-time data
  E.g., detect & block worms in real time (a worm may infect 1 million hosts in 1.3 seconds)
• Sophisticated data processing: enable "better" decisions
  E.g., anomaly detection, trend analysis
Therefore, people built specialized systems as workarounds…
Specialized Systems
• General batch processing: MapReduce
• Specialized systems (iterative, interactive, streaming, graph, etc.): Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4
"The State of Spark, and Where We're Going Next", Matei Zaharia, Spark Summit (2013)
youtu.be/nU6vO2EJAb4
Storage vs Processing Wars
NoSQL battles:
• Relational vs NoSQL
• HBase vs Cassandra
• Redis vs Memcached vs Riak
• MongoDB vs CouchDB vs Couchbase
• Neo4j vs Titan vs Giraph vs OrientDB
• Solr vs Elasticsearch
Compute battles:
• MapReduce vs Spark
• Spark Streaming vs Storm
• Hive vs Spark SQL vs Impala
• Mahout vs MLlib vs H2O
Specialized Systems vs General Unified Engine
• General batch processing (2004 – 2013): MapReduce, Mahout
• Specialized systems (2007 – 2015?): Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4 (iterative, interactive, ML, streaming, graph, SQL, etc.)
• General unified engine (2014 – ?): Spark, with SQL, MLlib, and Streaming built in, running on YARN, Mesos, and Tachyon; 10x – 100x faster
Support Interactive and Streaming Computation
Aggressive use of memory. Why?
1. Memory transfer rates >> disk or SSDs
2. Many datasets already fit into memory: inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory (e.g., 1 TB = 1 billion records @ 1 KB each)
3. Memory density (still) grows with Moore's law; RAM/SSD hybrid memories are on the horizon
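
As an illustrative sketch of this memory-first design (assuming the PySpark API; the path is hypothetical), Spark lets you choose how aggressively to keep a dataset in RAM, with disk/SSD as a fallback, mirroring the RAM/SSD hybrid idea above:

    # Choosing a storage level (illustrative sketch; path is hypothetical).
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="MemoryDemo")

    records = sc.textFile("hdfs:///input/records")
    records.persist(StorageLevel.MEMORY_ONLY)            # keep entirely in RAM if it fits
    # StorageLevel.MEMORY_AND_DISK would spill partitions to disk/SSD instead
    print(records.count())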
[Figure: High-end datacenter node – 16 cores; 128-512 GB RAM at 40-60 GB/s; 1-4 TB SSD (x4 disks) at 1-4 GB/s; 10-30 TB hard disk (x10 disks) at 0.2-1 GB/s; 10 Gbps network]
Support Interactive and Streaming Computation
Increase parallelism. Why? Reducing work per node → improved latency.
Techniques (sketched below):
• Low-latency parallel scheduler that achieves high locality
• Optimized parallel communication patterns (e.g., shuffle, broadcast)
• Efficient recovery from failures and straggler mitigation
[Diagram: a job of duration T is split across more nodes, each finishing in T_new, where T_new < T]
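
One of these communication patterns, broadcast, can be sketched in PySpark (an assumption; the lookup table and data below are made up for illustration). The small table is shipped to each worker node once, instead of with every task:

    # Broadcast a small lookup table to all workers (illustrative sketch).
    from pyspark import SparkContext

    sc = SparkContext(appName="BroadcastDemo")

    country_names = {"VN": "Vietnam", "JP": "Japan", "DE": "Germany"}
    bc = sc.broadcast(country_names)                     # sent to each node once

    events = sc.parallelize([("VN", 3), ("JP", 5), ("VN", 2)])
    named = events.map(lambda kv: (bc.value.get(kv[0], "unknown"), kv[1]))
    print(named.reduceByKey(lambda a, b: a + b).collect())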
Berkeley AMPLab
• "Launched" January 2011: 6-year plan
  • 8 CS faculty, ~40 students, 3 software engineers
• Organized for collaboration
• Funding: XData, CISE Expedition Grant; industrial founding sponsors; 18 other sponsors
• Goal: next generation of analytics data stack for industry & research
  • Berkeley Data Analytics Stack (BDAS)
  • Released as open source
Databricks
"Making big data simple"
• Founded in late 2013 by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 million in two rounds
• Databricks Cloud: "A unified platform for building Big Data pipelines, from ETL to Exploration and Dashboards, to Advanced Analytics and Data Products."