53 trang 89 lượt tải

Data Lake Overview| Tài liệu tham khảo môn quản trị dữ liệu và trực quan hóa| Trường Đại học Bách Khoa Hà Nội

177

Agenda
- Big Data Architectures
- Why data lakes?
- Top-down vs Bottom-up
- Data lake defined
- Creating ADLS Gen2
- Data Lake Use Cases

Môn: Quản trị dữ liệu và trực quan hóa 51 tài liệu

Trường: Đại học Bách Khoa Hà Nội 5.8 K tài liệu

Tác giả:

VietJack

1 năm trước

Tải xuống Báo cáo

Danh sách Quiz

Data Lake Overview

James Serra

Data & AI Architect

Microsoft, NYC MTC

JamesSerra3@gmail.com

Blog: JamesSerra.com

About Me

 Microsoft, Big Data Evangelist

 In IT for 30 years, worked on many BI and DW projects

 Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM

architect, PDW/APS developer

 Been perm employee, contractor, consultant, business owner

 Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference

 Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure

Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data

Platform Solutions

 Blog at JamesSerra.com

 Former SQL Server MVP

 Author of book “Reporting with Microsoft SQL Server 2012”

Agenda



Big Data Architectures



Why data lakes?



Top-down vs Bottom-up



Data lake defined



Creating ADLS Gen2



Data Lake Use Cases

Big Data Architectures

Enterprise data warehouse augmentation

• Seen when EDW has been in

existence a while and EDW can’t

handle new data

• Data hub, not data lake

• Cons: not offloading EDW work,

can’t use existing tools, difficulty

joining data in data hub with EDW

Data hub plus EDW

• Data hub is used as temporary

staging and refining, no reporting

• Cons: data hub is temporary, no

reporting/analyzing done with the

data hub

(temporary)

All-in-one

Is the traditional data warehouse dead? https://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/

• Data hub is total solution, no EDW

• Cons: queries are slower, new

training for reporting tools, difficulty

understanding data, security

limitations

Modern Data Warehouse

• Evolution of three previous scenarios

• Ultimate goal

• Supports future data needs

• Data harmonized and analyzed in the

data lake or moved to EDW for more

quality and performance

INGEST STORE PREP & TRAIN MODEL & SERVE

M O D E R N D A T A W A R E H O U S E

Azure Data Lake Store Gen2

Logs (unstructured)

Azure Data Factory

Azure Databricks

Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs.

Media (unstructured)

Files (unstructured)

PolyBase

Business/custom apps

(structured)

Azure SQL Data

Warehouse

Azure Analysis

Services

Power BI

Advanced Analytics

Social

LOB

Graph

IoT

Image

CRM

INGEST STORE PREP MODEL & SERVE

(& store)

Data orchestration

and monitoring

Big data store Transform & Clean Data warehouse

BI + Reporting

Azure Data Factory

SSIS

Azure Data Lake

Storage Gen2

Blob Storage

SQL Server 2019 Big

Data Cluster

Azure Databricks

Azure HDInsight

PolyBase & Stored

Procedures

Power BI Dataflows

Azure SQL Data Warehouse

Azure Analysis Services

SQL Database (Single, MI,

HyperScale, Serverless)

SQL Server in a VM

Cosmos DB

Power BI Aggregations

Why data lakes?

ETL pipeline

Dedicated ETL tools (e.g. SSIS)

Defined schema

Queries

Results

Relational

LOB

Applications

Traditional business analytics process

1. Start with end-user requirements to identify desired reports

and analysis

2. Define corresponding database schema and queries

3. Identify the required data sources

4. Create a Extract-Transform-Load (ETL) pipeline to extract

required data (curation) and transform it to target schema

(‘schema-on-write’)

5. Create reports. Analyze data

All data not immediately required is discarded or archived

Harness the growing and changing nature of data

Need to collect any data

StreamingStructured

Challenge is combining transactional data stored in relational databases with less structured data

Big Data = All Data

Get the right information to the right people at the right time in the right format

Unstructured

“ ”

The three V’s

Store indefinitely Analyze See results

Gather data

from all sources

Iterate

New big data thinking: All data has value

Use a data lake:

All data has potential value

Data hoarding

No defined schema—stored in native format

Schema is imposed and transformations are done at query time (schema-on-read).

Apps and users interpret the data as they see fit

Top-down vs Bottom-up

Observation

Pattern

Theory

Hypothesis

What will

happen?

How can we

make it happen?

Predictive

Analytics

Prescriptive

Analytics

What

happened?

Why did

it happen?

Descriptive

Analytics

Diagnostic

Analytics

Confirmation

Theory

Hypothesis

Observation

Two Approaches to getting value out of data: Top-Down +

Bottoms-Up

Implement Data Warehouse

Physical Design

ETL

Development

Reporting &

Analytics

Development

Install and Tune

Reporting &

Analytics Design

Dimension Modelling

ETL Design

Setup Infrastructure

Understand

Corporate

Strategy

Data Warehousing Uses A Top-Down Approach

Data sources

Gather

Requirements

Business

Requirements

Technical

Requirements

The “data lake” Uses A Bottoms-Up Approach

Ingest all data

regardless of requirements

Store all data

in native format without

schema definition

Do analysis

Using analytic engines

like Hadoop

Interactive queries

Batch queries

Machine Learning

Data warehouse

Real-time analytics

Devices

Data Lake + Data Warehouse Better Together

Data sources

What happened?

Descriptive

Analytics

Diagnostic

Analytics

Why did it happen?

What will happen?

Predictive

Analytics

Prescriptive

Analytics

How can we make it happen?

Bấm Tải xuống để xem toàn bộ.

Preview text:

Data Lake Overview James Serra Data & AI Architect Microsoft, NYC MTC JamesSerra3@gmail.com Blog: JamesSerra.com About Me
 Microsoft, Big Data Evangelist
 In IT for 30 years, worked on many BI and DW projects
 Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
 Been perm employee, contractor, consultant, business owner
 Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference
 Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure
Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions  Blog at JamesSerra.com  Former SQL Server MVP
 Author of book “Reporting with Microsoft SQL Server 2012” Agenda  Big Data Architectures  Why data lakes?  Top-down vs Bottom-up  Data lake defined  Creating ADLS Gen2  Data Lake Use Cases ? ? Big Data Architectures ? ?
Enterprise data warehouse augmentation • Seen when EDW has been in
existence a while and EDW can’t handle new data • Data hub, not data lake
• Cons: not offloading EDW work,
can’t use existing tools, difficulty
joining data in data hub with EDW Data hub plus EDW
• Data hub is used as temporary
staging and refining, no reporting
• Cons: data hub is temporary, no
reporting/analyzing done with the data hub (temporary) All-in-one
• Data hub is total solution, no EDW
• Cons: queries are slower, new
training for reporting tools, difficulty understanding data, security limitations
Is the traditional data warehouse dead? https://www.jamesserra.com/archive/2017/12/is-the-traditional-data-warehouse-dead/ Modern Data Warehouse
• Evolution of three previous scenarios • Ultimate goal • Supports future data needs
• Data harmonized and analyzed in the
data lake or moved to EDW for more quality and performance
M O D E R N D A T A W A R E H O U S E INGEST STORE PREP & TRAIN MODEL & SERVE Logs (unstructured) Azure Databricks Media (unstructured) PolyBase Files (unstructured) Azure Data Factory
Azure Data Lake Store Gen2 Azure SQL Data Azure Analysis Power BI Warehouse Services Business/custom apps (structured)
Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs. LOB CRM BI + Reporting INGEST STORE PREP MODEL & SERVE (& store) Graph Advanced Analytics Image Social Data orchestration Big data store Transform & Clean Data warehouse and monitoring AI IoT Azure Data Factory Azure Data Lake Azure Databricks Azure SQL Data Warehouse Storage Gen2 SSIS Azure HDInsight Azure Analysis Services Blob Storage PolyBase & Stored SQL Database (Single, MI, SQL Server 2019 Big Procedures HyperScale, Serverless) Data Cluster Power BI Dataflows SQL Server in a VM Cosmos DB Power BI Aggregations ? ? Why data lakes? ? ?
Traditional business analytics process
1. Start with end-user requirements to identify desired reports and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create a Extract-Transform-Load (ETL) pipeline to extract
required data (curation) and transform it to target schema (‘schema-on-write’)
5. Create reports. Analyze data
Dedicated ETL tools (e.g. SSIS) Relational ETL pipeline Queries LOB Applications Defined schema Results
All data not immediately required is discarded or archived 12 Need to col ect any data
Harness the growing and changing nature of data Structured Unstructured Streaming “ ”
Challenge is combining transactional data stored in relational databases with less structured data Big Data = Al Data
Get the right information to the right people at the right time in the right format The three V’s
New big data thinking: Al data has value Use a data lake: All data has potential value Data hoarding
No defined schema—stored in native format
Schema is imposed and transformations are done at query time (schema-on-read).
Apps and users interpret the data as they see fit Iterate Gather data Store indefinitely Analyze See results from all sources ? ? Top-down vs Bottom-up ? ?
Two Approaches to getting value out of data: Top-Down + Bottoms-Up How can we make it happen? Prescriptive What will Analytics happen? Theory Predictive Theory Analytics Why did Hypothesis it happen? Hypothesis Diagnostic Pattern What Observation happened? Analytics Observation Descriptive Confirmation Analytics
Data Warehousing Uses A Top-Down Approach Understand Implement Data Warehouse Gather Corporate Requirements Reporting & Strategy Reporting & Analytics Analytics Design Business Development Requirements Dimension Modelling Physical Design ETL ETL Design Development Technical Requirements Data sources Setup Infrastructure Install and Tune
The “data lake” Uses A Bottoms-Up Approach Ingest all data Store all data Do analysis
regardless of requirements
in native format without Using analytic engines schema definition like Hadoop Batch queries Devices Interactive queries Real-time analytics Machine Learning Data warehouse
Data Lake + Data Warehouse Better Together What happened? What will happen? Descriptive Predictive Analytics Analytics Why did it happen? How can we make it happen? Diagnostic Prescriptive Data sources Analytics Analytics

Data Lake Overview| Tài liệu tham khảo môn quản trị dữ liệu và trực quan hóa| Trường Đại học Bách Khoa Hà Nội

Tài liệu liên quan:

Bài giảng Bài 01: Các dịch vụ mạng Windows 2000 môn Quản trị mạng Window | Đai học Bách Khoa Hà Nội

Perception in Visualization| Tài liệu tham khảo môn quản trị dữ liệu và trực quan hóa| Trường Đại học Bách Khoa Hà Nội

39 studies about human perception in 30 minutes| Tài liệu tham khảo môn quản trị dữ liệu và trực quan hóa| Trường Đại học Bách Khoa Hà Nội

My steps to learn about Apache NiFi| Tài liệu tham khảo môn quản trị dữ liệu và trực quan hóa| Trường Đại học Bách Khoa Hà Nội

Text Visualization Browser| Tài liệu tham khảo môn quản trị dữ liệu và trực quan hóa| Trường Đại học Bách Khoa Hà Nội