Data Lake Overview | Reference material for the Data Management and Visualization course | Hanoi University of Science and Technology

Agenda
- Big Data Architectures
- Why data lakes?
- Top-down vs Bottom-up
- Data lake defined
- Creating ADLS Gen2
- Data Lake Use Cases

Data Lake Overview
James Serra
Data & AI Architect
Microsoft, NYC MTC
JamesSerra3@gmail.com
Blog: JamesSerra.com
About Me
- Microsoft, Big Data Evangelist
- In IT for 30 years, worked on many BI and DW projects
- Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
- Been perm employee, contractor, consultant, business owner
- Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
- Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
- Blog at JamesSerra.com
- Former SQL Server MVP
- Author of book “Reporting with Microsoft SQL Server 2012”
Agenda
Big Data Architectures
Why data lakes?
Top-down vs Bottom-up
Data lake defined
Creating ADLS Gen2
Data Lake Use Cases
Big Data Architectures
Enterprise data warehouse augmentation
- Seen when the EDW has been in existence a while and can’t handle new data
- Data hub, not data lake
- Cons: not offloading EDW work, can’t use existing tools, difficulty joining data in the data hub with the EDW

Data hub plus EDW
- Data hub is used as temporary staging and refining, no reporting
- Cons: data hub is temporary, no reporting/analyzing done with the data hub

All-in-one
- Data hub is the total solution, no EDW
- Cons: queries are slower, new training for reporting tools, difficulty understanding data, security limitations
- Is the traditional data warehouse dead? https://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/

Modern Data Warehouse
- Evolution of the three previous scenarios
- Ultimate goal
- Supports future data needs
- Data is harmonized and analyzed in the data lake or moved to the EDW for more quality and performance
[Diagram: Modern Data Warehouse]
- Data sources: logs (unstructured), media (unstructured), files (unstructured), business/custom apps (structured)
- Ingest: Azure Data Factory
- Store: Azure Data Lake Store Gen2
- Prep & train: Azure Databricks
- Model & serve: PolyBase, Azure SQL Data Warehouse, Azure Analysis Services, Power BI
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs.

[Diagram: Modern Data Warehouse, product options per stage]
- Data sources: Social, LOB, Graph, IoT, Image, CRM
- Ingest (data orchestration and monitoring): Azure Data Factory, SSIS
- Store (big data store): Azure Data Lake Storage Gen2, Blob Storage, SQL Server 2019 Big Data Cluster
- Prep (transform & clean): Azure Databricks, Azure HDInsight, PolyBase & Stored Procedures, Power BI Dataflows
- Model & serve (& store) (data warehouse): Azure SQL Data Warehouse, Azure Analysis Services, SQL Database (Single, MI, HyperScale, Serverless), SQL Server in a VM, Cosmos DB, Power BI Aggregations
- Outputs: Advanced Analytics, AI, BI + Reporting
Why data lakes?
Traditional business analytics process
1. Start with end-user requirements to identify desired reports and analysis
2. Define the corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract the required data (curation) and transform it to the target schema (‘schema-on-write’)
5. Create reports. Analyze data
[Diagram: LOB applications, ETL pipeline using dedicated ETL tools (e.g. SSIS), relational store with defined schema, queries, results]
All data not immediately required is discarded or archived
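The five steps above can be sketched in a few lines. This is only a minimal illustration of schema-on-write, assuming an in-memory SQLite database as the relational target; the `RAW_ORDERS` records, the `orders` table, and the function names are hypothetical and do not come from the deck.

```python
import sqlite3

# Hypothetical raw records from a line-of-business application.
# Fields such as "notes" and "page_views" are not needed by the report.
RAW_ORDERS = [
    {"order_id": "1", "customer": "Contoso", "amount": "120.50", "notes": "rush"},
    {"order_id": "2", "customer": "Fabrikam", "amount": "80.00", "page_views": "17"},
]


def run_etl(raw_records, conn):
    """Extract the required fields, cast them to the target schema, and load them."""
    # The schema is fixed before any data is loaded (schema-on-write).
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    rows = [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in raw_records
    ]
    # Anything outside the schema ("notes", "page_views") is dropped here.
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def revenue_report(conn):
    """The report that the requirements asked for: revenue per customer."""
    return conn.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(RAW_ORDERS, conn)
report = revenue_report(conn)
print(report)
```

Note that the fields the report did not need never reach the database, which is the "discarded or archived" limitation the slide points out.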
Need to collect any data
- Harness the growing and changing nature of data: structured, unstructured, streaming
- Challenge is combining transactional data stored in relational databases with less structured data
- Big Data = All Data
- Get the right information to the right people at the right time in the right format
- The three V’s

New big data thinking: All data has value
- Gather data from all sources, store indefinitely, analyze, see results, iterate
Use a data lake:
- All data has potential value
- Data hoarding
- No defined schema; data is stored in native format
- Schema is imposed and transformations are done at query time (schema-on-read)
- Apps and users interpret the data as they see fit
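As a rough illustration of schema-on-read, the sketch below keeps raw JSON lines exactly as they arrived and lets each consumer impose its own schema when it reads them. The sample records, field names, and reader functions are assumptions made up for this example, not part of the deck.

```python
import json

# Raw events kept in their native format (hypothetical sample data).
# The records do not share a common schema.
RAW_LINES = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2019-01-01T00:00:00"}',
    '{"device": "sensor-2", "temperature": 70.7, "unit": "F"}',
    '{"user": "alice", "action": "login"}',
]


def read_temperatures(lines):
    """Impose a (device, celsius) schema at query time; skip records that do not fit."""
    for line in lines:
        record = json.loads(line)
        if "temp_c" in record:
            yield record["device"], float(record["temp_c"])
        elif record.get("unit") == "F":
            # Transformation happens on read, not on ingest.
            celsius = (float(record["temperature"]) - 32) * 5 / 9
            yield record["device"], round(celsius, 1)


def read_logins(lines):
    """A different consumer interprets the same raw data as login events."""
    for line in lines:
        record = json.loads(line)
        if record.get("action") == "login":
            yield record["user"]


temps = list(read_temperatures(RAW_LINES))
logins = list(read_logins(RAW_LINES))
print(temps)
print(logins)
```

Both readers work over the same untouched data, so a new question later only needs a new reader, not a new load.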
Top-down vs Bottom-up
Two Approaches to getting value out of data: Top-Down + Bottoms-Up
- Top-down: Theory, Hypothesis, Observation, Confirmation
- Bottoms-up: Observation, Pattern, Hypothesis, Theory
- Descriptive Analytics: What happened?
- Diagnostic Analytics: Why did it happen?
- Predictive Analytics: What will happen?
- Prescriptive Analytics: How can we make it happen?
Data Warehousing Uses a Top-Down Approach
- Understand corporate strategy
- Gather requirements: business requirements, technical requirements
- Implement data warehouse
  - Design: dimension modelling, ETL design, reporting & analytics design, setup infrastructure
  - Build: physical design, ETL development, reporting & analytics development, install and tune
- Data sources
The Data Lake Uses a Bottoms-Up Approach
- Ingest all data regardless of requirements
- Store all data in native format without schema definition
- Do analysis using analytic engines like Hadoop
- Workloads: interactive queries, batch queries, machine learning, data warehouse, real-time analytics
- Data sources include devices
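The ingest, store, and analyze steps of the bottoms-up approach can be sketched with a local folder standing in for the lake. This is a toy under stated assumptions: the `raw/<source>/<day>/<name>` layout, the function names, and the sample payloads are hypothetical, and a real lake would use a service such as ADLS Gen2 rather than a temporary directory.

```python
import tempfile
from pathlib import Path


def ingest(lake_root, source, day, name, payload):
    """Store the payload unchanged under raw/<source>/<day>/<name>."""
    target = Path(lake_root) / "raw" / source / day / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, no schema, native format
    return target


def count_error_lines(lake_root, source):
    """An analysis decided on later: count ERROR lines in one source's logs."""
    total = 0
    for path in sorted((Path(lake_root) / "raw" / source).rglob("*.log")):
        lines = path.read_text().splitlines()
        total += sum(1 for line in lines if "ERROR" in line)
    return total


lake = tempfile.mkdtemp()  # temporary directory standing in for the lake
ingest(lake, "webserver", "2019-01-01", "app.log", b"INFO start\nERROR disk full\n")
ingest(lake, "webserver", "2019-01-02", "app.log", b"ERROR timeout\nERROR timeout\n")
ingest(lake, "camera", "2019-01-01", "frame.jpg", b"\xff\xd8\xff\xe0")  # binary kept as-is

errors = count_error_lines(lake, "webserver")
stored = sorted(
    p.relative_to(lake).as_posix() for p in Path(lake).rglob("*") if p.is_file()
)
print(errors)
print(stored)
```

Everything is ingested whether or not a requirement exists for it; the error count is just one analysis that can be added after the fact without re-ingesting anything.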
Data Lake + Data Warehouse Better Together
- Data sources feed both the data lake and the data warehouse
- Descriptive Analytics: What happened?
- Diagnostic Analytics: Why did it happen?
- Predictive Analytics: What will happen?
- Prescriptive Analytics: How can we make it happen?