Spark Introduction - Operating Systems | Trường Đại Học Tài Nguyên và Môi Trường TP HCM

Big Data (Spark)
Instructor: Trong-Hop Do
November 24th, 2020
S3Lab - Smart Software System Laboratory

"Big data is at the foundation of all the megatrends that are happening today, from social to mobile to cloud to gaming."
– Chris Lynch, Vertica Systems

Introduction

Apache Spark is an open-source cluster computing framework for real-time processing.
Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.

[Figure slides: Spark introduction overview and applications]

Components

Spark Core and Resilient Distributed Datasets or RDDs
Spark SQL
Spark Streaming
Machine Learning Library or MLlib
GraphX

Spark Core

The base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Additional libraries built atop the core enable diverse workloads such as streaming, SQL, and machine learning. Spark Core is responsible for:
Memory management and fault recovery
Scheduling, distributing, and monitoring jobs on a cluster
Interacting with storage systems
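
The example below is a minimal PySpark sketch of the Spark Core idea: data is distributed as an RDD, transformations stay lazy, and an action triggers the parallel job. It assumes a local Spark installation with the pyspark package; the names and data are illustrative, not the lecture's code.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    # Distribute a small dataset across the workers as an RDD.
    numbers = sc.parallelize(range(1, 101))

    # map() is a lazy transformation; reduce() is the action that runs the job.
    total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
    print(total)  # 338350

    sc.stop()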

Spark Streaming

Used to process real-time streaming data.
It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process the real-time data.
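
As a concrete illustration, here is a hedged sketch of the classic DStream word count, close to the example in the Spark Streaming programming guide. It assumes text lines arriving on localhost:9999 (e.g. from a netcat session: nc -lk 9999); the host, port, and batch interval are assumptions for the demo.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "dstream-demo")
    ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)  # DStream of raw lines
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()  # each micro-batch is one RDD in the DStream

    ssc.start()
    ssc.awaitTermination()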

Spark Streaming - Workflow

[Figure slide: Spark Streaming workflow]

Spark Streaming - Fundamentals

Streaming Context
DStream (Discretized Stream)
Caching
Accumulators, Broadcast Variables and Checkpoints
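
Two of the fundamentals above, broadcast variables and accumulators, are sketched below: a broadcast variable ships read-only data to the executors once, while an accumulator aggregates counts back on the driver. The stop-word scenario is invented for illustration.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "shared-vars-demo")

    stopwords = sc.broadcast({"a", "an", "the"})  # read-only, cached per executor
    skipped = sc.accumulator(0)                   # counter written by tasks

    def keep(word):
        if word in stopwords.value:
            skipped.add(1)
            return False
        return True

    words = sc.parallelize(["the", "quick", "brown", "fox", "a", "dog"])
    print(words.filter(keep).collect())  # ['quick', 'brown', 'fox', 'dog']
    print(skipped.value)                 # 2, reliable only after the action ran

    sc.stop()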

[Figure slides: illustrations of the Spark Streaming fundamentals above]

Spark SQL

Spark SQL integrates relational processing with Spark's functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool. Spark SQL consists of four libraries (a short sketch follows the list):
Data Source API
DataFrame API
Interpreter & Optimizer
SQL Service
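
The sketch below shows the weaving described above: a SQL query over a temporary view, followed by a functional DataFrame transformation on its result. The table, data, and column names are invented for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Carol", 29)], ["name", "age"])
    df.createOrReplaceTempView("people")

    # Mix declarative SQL with a programmatic column transformation.
    adults = spark.sql("SELECT name, age FROM people WHERE age > 30")
    adults.withColumn("age_next_year", adults.age + 1).show()

    spark.stop()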

[Figure slide: Spark SQL overview]

Spark SQL - Data Source API

Universal API for loading and storing structured data.
Built-in support for Hive, JSON, Avro, JDBC, Parquet, etc.
Supports third-party integration through Spark packages.
Support for smart sources.
Data abstraction and a Domain Specific Language (DSL) applicable to structured and semi-structured data.
Supports different data formats (Avro, CSV, Elasticsearch, and Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.).
Can be easily integrated with all Big Data tools and frameworks via Spark Core.
Processes data ranging in size from kilobytes to petabytes, on anything from a single-node cluster to multi-node clusters.
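
A minimal sketch of that uniform load/store interface follows; the file paths are placeholders. The same read/write calls work across formats, so switching between JSON and Parquet does not change the surrounding code.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("datasource-demo").getOrCreate()

    # Load structured data; the format is just a parameter of the same API.
    people = spark.read.json("people.json")

    # Store it in another format without touching the processing logic.
    people.write.mode("overwrite").parquet("people.parquet")

    spark.stop()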

Spark SQL - DataFrame API

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a relational table in SQL.
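
To make the named-column idea concrete, here is a small sketch that builds a DataFrame and runs a SQL-style aggregation over it; the data and column names are made up.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

    sales = spark.createDataFrame(
        [("north", 120), ("south", 80), ("north", 60)], ["region", "amount"])

    # Equivalent to: SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    sales.groupBy("region").agg(F.sum("amount").alias("total")).show()

    spark.stop()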