Go Big (with Data Lake Architecture) or Go Home | Reference material for the Data Governance and Visualization course | Trường Đại học Bách Khoa Hà Nội (Hanoi University of Science and Technology)

Objective:
Understand how the traditional data landscape is changing,
what opportunities big data presents and what architectures
allow you to maximize the benefits to your organization.

Information: 35 pages · uploaded 3 months ago


BR008
Microsoft Machine Learning & Data Science Summit
September 26–27 | Atlanta, GA
Go Big (with Data Lake Architecture) or Go Home!
Omid Afnan
Session objectives and key takeaways
Objective:
Understand how the traditional data landscape is changing,
what opportunities big data presents and what architectures
allow you to maximize the benefits to your organization.
Key Takeaways:
Data lake architectures can be additive to your data
warehouse
Azure Data Lake makes building big data environments easy
The traditional data warehouse
“… data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing.”
– Gartner, “The State of Data Warehousing”*
[Diagram: Data sources (OLTP, ERP, CRM, LOB) → ETL → Data warehouse → BI and analytics (Dashboards, Reporting)]
* Donald Feinberg, Mark Beyer, Merv Adrian, Roxane Edjlali (Gartner), The State of Data Warehousing in 2012 (Stamford, CT.: Gartner, 2012)
Big Data is driving transformative changes
“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”
Gartner, Big Data Definition*
* Gartner, Big Data (Stamford, CT.: Gartner, 2016), URL: http://www.gartner.com/it-glossary/big-data/
Big Data is driving transformative changes along three dimensions: data characteristics, cost, and culture.
Traditional vs. Big Data:
- Data characteristics: Relational (with highly modeled schema) → All data (with schema agility)
- Cost: Expensive (storage and compute capacity) → Commodity (storage and compute capacity)
- Culture: Rear-view reporting (using relational algebra) → Intelligent action (using relational algebra AND ML, graph, streaming, image processing)
Example: Culture of experimentation
Tangerine instantly adapts to customer feedback to offer customers what they want, when they want it.
Scenario:
- Lack of insight for targeted campaigns
- Inability to support data growth
Solution:
Azure HDInsight (Hadoop-as-a-service) with the Analytics Platform System enables instant analysis of social sentiment and customer feedback across digital, face-to-face, and phone interactions.
Result:
- Reduced time to customer insight
- Ability to make changes to campaigns or adjust product rollouts based on real-time customer reactions
- Ability to offer incentives and new services to retain, and grow, its customer base
“I can see us…creating predictive, context-aware financial services applications that give information based on the time and where the customer is.”
– Billy Lo, Head of Enterprise Architecture
Why Data Lakes?
Traditional business analytics process
[Diagram: LOB Applications → dedicated ETL tools (e.g. SSIS) → ETL pipeline → relational store with defined schema → queries → results]
[Cycle: New requirements → Identify data schema and queries → Identify data sources → Create ETL pipeline → Do analytics → Create reports]
1. Start with end-user requirements to identify desired reports and analysis
2. Define corresponding database schema and queries
3. Identify the required data sources
4. Create an Extract-Transform-Load (ETL) pipeline to extract required data (curation) and transform it to the target schema (‘schema-on-write’)
5. Create reports, analyze data
All data not immediately required is discarded or archived.
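The schema-on-write step above can be sketched in a few lines: the target schema is fixed before any data is loaded, and records that do not fit are discarded at write time. The table name and fields here are hypothetical, not from the deck.

```python
import sqlite3

# Target schema is fixed up front (schema-on-write): anything that
# does not fit is transformed or discarded before loading.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")

def etl(rows):
    """Extract raw rows, transform to the target schema, and load them."""
    loaded = 0
    for row in rows:
        try:
            record = (int(row["order_id"]),
                      row["region"].strip().upper(),
                      float(row["amount"]))
        except (KeyError, ValueError):
            continue  # data that does not match the schema is discarded
        conn.execute("INSERT INTO sales VALUES (?, ?, ?)", record)
        loaded += 1
    return loaded

raw = [
    {"order_id": "1", "region": "emea", "amount": "19.99"},
    {"order_id": "2", "region": "amer", "amount": "5.00"},
    {"order_id": "x", "region": "apac", "amount": "oops"},  # discarded
]
n = etl(raw)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Note how the third record is lost forever, which is exactly the limitation the data lake approach addresses.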
New big data thinking: All data has value
- All data has potential value
- Data hoarding
- No defined schema: stored in native format
- Schema is imposed and transformations are done at query time (schema-on-read); apps and users interpret the data as they see fit
[Cycle: Gather data from all sources → Store indefinitely → Analyze → See results → Iterate]
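By contrast, schema-on-read keeps raw records in their native format and imposes structure only at query time. A minimal sketch, assuming JSON-line events with hypothetical field names:

```python
import json

# Raw events are stored as-is, in their native (here JSON-line) format.
raw_store = [
    '{"user": "a", "action": "buy", "amount": 12.5}',
    '{"user": "b", "action": "view"}',
    '{"user": "a", "action": "buy", "amount": 7.5}',
]

def query_total_spend(user):
    """Impose a schema at read time: interpret only the fields this query needs."""
    total = 0.0
    for line in raw_store:
        event = json.loads(line)
        if event.get("user") == user and event.get("action") == "buy":
            total += event.get("amount", 0.0)
    return total
```

A different query could interpret the very same records with a different schema, which is what lets apps and users read the data as they see fit.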
The data lake and warehouse
[Diagram: Devices, sensors, video, web, social, and clickstream data land in the data lake, which supports interactive queries, batch queries, machine learning, and real-time analytics. Cooked data and metadata joins flow between the lake and the relational warehouse, which is fed by LOB applications through an ETL pipeline with a defined schema and serves queries, dashboards, reports, and exploration.]
However, Big Data is not easy…
The top adoption challenges*: obtaining skills and capabilities, determining how to get value, and integrating with existing IT investments.
* Gartner, Survey Analysis: Hadoop Adoption Drivers and Challenges (Stamford, CT.: Gartner, 2015)
Microsoft made the transition to Big Data
We wanted to build better products based on real usage and experimentation, so we built:
- A data lake for everyone to put their data
- Tools approachable by any developer
- Machine learning tools for collaborating across large experiment models
Result:
- Used at Microsoft across Office, Xbox Live, Azure, Windows, Bing and Skype
- 10K+ developers running diverse workloads and scenarios
- Exabytes of data under management
[Chart: Data Stored over time, across Windows, SMSG, Live, Bing, CRM/Dynamics, Xbox Live, Office365, Malware Protection, Microsoft Stores, Commerce Risk, Skype, LCA, Exchange, and Yammer]
Patterns for Big Data
Big Data Analytics: Data Flow
[Diagram: DATA → INTELLIGENCE → ACTION. Business apps, custom apps, and sensors and devices feed ingestion (bulk ingestion and event ingestion) into Azure Data Lake Store; preparation, analytics, and machine learning run over the store; Azure Data Catalog provides discovery, and Power BI provides visualization for people.]
Event ingestion patterns
[Diagram: Events from business apps, custom apps, and sensors and devices flow into event collection (Azure Event Hubs, Kafka) and on to stream processing (Azure Stream Analytics, Spark Streaming). Raw events land in Azure Data Lake Store; transformed data feeds real-time dashboards in Power BI.]
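The collection-then-processing split above can be mimicked in miniature: a buffered list stands in for Event Hubs/Kafka, and a tumbling-window aggregate stands in for a Stream Analytics or Spark Streaming job. The device and temperature fields are hypothetical.

```python
from collections import defaultdict

# Event collection: a buffered queue stands in for Event Hubs / Kafka.
event_queue = [
    {"device": "d1", "ts": 0,  "temp": 20.0},
    {"device": "d1", "ts": 30, "temp": 22.0},
    {"device": "d2", "ts": 45, "temp": 30.0},
    {"device": "d1", "ts": 70, "temp": 24.0},
]

def tumbling_window_avg(events, window_seconds=60):
    """Stream processing: average temperature per device per tumbling window."""
    sums = defaultdict(lambda: [0.0, 0])  # (device, window) -> [sum, count]
    for e in events:
        key = (e["device"], e["ts"] // window_seconds)
        sums[key][0] += e["temp"]
        sums[key][1] += 1
    return {k: s / n for k, (s, n) in sums.items()}

averages = tumbling_window_avg(event_queue)
```

In the real pattern the transformed aggregates would be written to the dashboard sink while the raw events land untouched in the data lake store.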
Lambda architecture
[Diagram: INGEST → PREPARE → ANALYZE → PUBLISH → CONSUME. Data sources (web/LOB applications, sensors (IoT, devices, mobile), logs (CSV, JSON, XML), and reference data) are ingested through Event Hubs with ASA job rules. The hot path performs real-time scoring with Machine Learning and feeds Power BI dashboards and Cortana. The cold path archives raw data in Data Lake Store, where Data Lake Analytics performs flatten and metadata joins, hourly/daily/monthly roll-ups, offline training, and batch scoring, publishing aggregated data to Azure SQL Data Warehouse. Data Factory moves data and orchestrates, schedules, and monitors the whole pipeline.]
Leading Computer Manufacturer / Retailer
How They Did It: Analyzing Clickstream to Provide Real-time Recommendations Online
How They Did It
Collect clickstream data
In tab separated text files
Adding 22 new files per hour ~5-18
MB/file
Currently 1TB and growing
Spin up Hadoop
Use Hive scripts because of SQL-like
syntax
Extracts click behavior like buys,
additions to carts, reviews etc. and
assigns scores
Jobs run hourly
Currently 8-nodes with plans to 16
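The hourly Hive job described above extracts click behavior and assigns scores. A Python sketch of the same scoring idea; the weights and event names are hypothetical, not from the case study:

```python
# Hypothetical weights: a purchase signals more intent than a cart add or review.
SCORE_WEIGHTS = {"buy": 5, "add_to_cart": 3, "review": 2, "view": 1}

def score_clickstream(events):
    """Aggregate a per-(user, product) interest score from raw click events."""
    scores = {}
    for e in events:
        key = (e["user"], e["product"])
        scores[key] = scores.get(key, 0) + SCORE_WEIGHTS.get(e["action"], 0)
    return scores

events = [
    {"user": "u1", "product": "laptop", "action": "view"},
    {"user": "u1", "product": "laptop", "action": "add_to_cart"},
    {"user": "u1", "product": "laptop", "action": "buy"},
    {"user": "u2", "product": "mouse",  "action": "review"},
]
scores = score_clickstream(events)
```

In the actual pipeline this aggregation runs as an hourly Hive query over the tab-separated log files, with the scored output feeding the recommendation service.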
[Diagram: Azure services for targeted email and recommendations. Web logs and Omniture user-segment data from Website.com are captured into Blob Storage; an HDInsight cluster processes clickstream and recommendation data, with AzureML producing training/validation and scored data. Scored data and template data drive targeted email sent from an email server (IaaS VM) to users, whose clicks feed web logs back into the pipeline. For recommendations, an Event Hub feeds near-real-time (NRT) AzureML scoring; deterministic and non-deterministic recommendations and cached data are held in Azure SQL DB, NoSQL storage, and persisted storage, and a visitor information service joins the product catalog with click-feedback data.]
Making Big Data Easy
